35. Navigate with XPath

DRAFT, 4/28/03

35. Navigate with XPath

Whether you're writing code with DOM, JDOM,dom4j, Sparta, or some other tree-based API, one of the primary tasks is navigating the document to locate particular nodes, whether for query or update. XPath is a very powerful, very robust language for navigating XML documents. It often works better than explicitly specified navigation using methods such as getChildNode() or getNextSibling().

Navigation in a tree-based API is a surprisingly complex and error-prone operation. For example, consider a simple contacts file that includes names and phone numbers that claims to follow the DTD in Example 35-1:

Example 35-1: A DTD for a simple contacts file

<!ELEMENT Contacts (Contact+)>
<!ELEMENT Contact (Name, Phone)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Phone (#PCDATA)>

A typical document might look like Example 35-2:

Example 35-2: A simple contacts document

<?xml version="1.0"?>
<!DOCTYPE Contacts SYSTEM "contacts.dtd">
<Contacts>
  <Contact>
    <Name>John Doe</Name>
    <Phone>626-555-3456</Phone>
  </Contact>
  <Contact>
    <Name>Joe Smith</Name>
    <Phone>212-555-3456</Phone>
  </Contact>
  <Contact>
    <Name>Jane and Ben Reilly</Name>
    <Phone>212-555-2341</Phone>
  </Contact>
  <Contact>
    <Name>Mohammed Jones</Name>
    <Phone>718-555-2349</Phone>
  </Contact>
  <Contact>
    <Name>David Smith</Name>
    <Phone>914-555-3145</Phone>
  </Contact>
  <Contact>
    <Name>Xao Li</Name>
    <Phone>212-555-0998</Phone>
  </Contact>
</Contacts>

This is very straight-forward, very simple record-like data. What can go wrong when processing files like this? Actually, quite a lot, especially when you realize that not all instance documents are likely to be so simple or so valid. In fact, all too often you'll find yourself working with documents like Example 35-3 instead, even if you're promised data that precisely adheres to the DTD in Example 35-1. Invalid data is a fact of life. It's not acceptable to just throw up your hands in disgust because someone sent you a document that's slightly different from what you expected.

Example 35-3: A slightly more realistic contacts document

<?xml version="1.0"?>
<Contacts>
  <Contact>
    <Name>John Doe</Name>
    <Phone>626-555-3456</Phone>
  </Contact>
  <Contact>
    <Name>Joe 
        <!--I need to check whether he prefers Joe or Joseph -->
        Smith
    </Name>
    <Phone>212-555-3456</Phone>
  </Contact>
  <Contact>
    <Name>Jane <![CDATA[& Ben]]> Reilly</Name>
    <Phone>212-555-2341</Phone>
  </Contact>
  <Contact>
    <Name>Mohammed Jones</Name>
    <Phone>718-555-2349</Phone>
    <Phone>914-555-7698</Phone>
  </Contact>
  <Contact>
    <Phone>914-555-3145</Phone>
    <Name>David Smith</Name>
  </Contact>
  <Contact>
    <Name>Jason Smith</Name>
    <Phone></Phone>
  </Contact>
  <Contact>
    <Name>Xao Li</Name>
    <Name>Jonathan Li</Name>
    <Phone>212-555-0998</Phone>
  </Contact>
</Contacts>

Potential problems include:

Runs of text that are divided across multiple text nodes in the API though not in the serialized document
Runs of text that are interrupted by comments or processing instructions:

  <Name>Joe 
        <!--I need to check whether he prefers Joe or Joseph -->
        Smith
  </Name>

Runs of text that are interrupted by child elements:

  <Name>Joseph <Nickname>Bud</Nickname> Smith </Name>

Runs of text that are interrupted by CDATA sections that result in multiple nodes:

<Name>Jane <![CDATA[& Ben]]> Reilly</Name>

Elements that occur two or more times, where normally only one would be expected:

  <Contact>
    <Name>Joe Smith</Name>
    <Phone>718-555-1234</Phone>
    <Phone>914-555-3145</Phone>
  </Contact>
  <Contact>
    <Name>Xao Li</Name>
    <Name>Jonathan Li</Name>
    <Phone>212-555-0998</Phone>
  </Contact>

White space only text nodes that change the number of nodes in a sequence. How many child nodes does the Contact element have? It has three child elements, but most APIs will report six total child nodes because they're counting the white space.
Elements with an extra, unexpected parent:

<Contact>
  <Personal>
    <Name>Joe Smith</Name>
    <Phone>718-555-1234</Phone>
    <Phone>914-555-3145</Phone>
  </Personal>
</Contact>

Nodes that are hiding in an entity reference. For instance, in DOM a node loaded from an external entity reference may be a grandchild of its parent element rather than a child:

<Contact>
  &JoeSmith;
</Contact>

Elements that aren't precisely where you expect them; for instance a Contact element where the Name child is the second child element rather than the first:

<Contact>
    <Phone>914-555-3145</Phone>
    <Name>Joe Smith</Name>
</Contact>

Missing elements such as this Contact that doesn't have a phone number:

<Contact>
    <Name>Joe Smith</Name>
</Contact>

Unexpectedly empty elements:

  <Contact>
    <Name>Jason Smith</Name>
    <Phone></Phone>
  </Contact>

Some of these problems are mitigated by particular APIs. XOM resolves all entity references. DOM requires parsers to place the maximum possible contiguous run of text in each text node. MSXML throws away white space only nodes by default. Nonetheless all APIs suffer from at least several of these problems.

Validation would catch some of these problems, though arguably they're not mistakes. For instance, if people do have multiple phone numbers, but the schema only allows one, then it's the schema that's mistaken, not the document. The instance document more correctly reflects reality than the schema does. The same can be said of a program that assumes every person has exactly one phone number. The software should be changed, not the document.

Is it possible to write programs that behave correctly even in the face of unexpected changes in document structure? Surprisingly the answer is yes, it is. Well-written programs can handle a lot of cases their designers did not explicitly plan for. The trick is to write programs that fit document structures loosely rather than tightly. The program should ask only for the information it needs without tying that to information it doesn't really care about. For example, if you want to find Joe Smith's phone numbers, then you need to ask for the Phone siblings of the Name element whose value is "Joe Smith". You don't want to ask for the Phone sibling of the Name element whose first text node is Joe Smith that's a child of a Contact element which is a child of the Contacts element which is the root element of the document. That's too specific. There are too many extra conditions such as "The root element is Contacts" that don't have anything to do with what you really want to know.

It is not possible, of course, to write perfectly robust software. For instance, there's no way to find Joe Smith's phone number if it's not in the document somewhere. However, you can write software that behaves reliably on a much broader class of documents, and fails gracefully when the information truly isn't there.

XPath is a fourth generation declarative language that allows you to specify which nodes you want to process without specifying exactly how the processor is supposed to navigate to those nodes. XPath's data model is very well designed to support exactly what almost all developers want from XML. For instance, it merges all adjacent text including that in CDATA sections, allows values to be calculated that skip over comments and processing instructions` and include text from child and descendant elements, and requires all external entity references to be resolved. In practice, XPath expressions tend to be much more robust against unexpected but perhaps insignificant changes in the input document.

For example, consider a very simple DOM code fragment that prints all the names in the contacts file.

Element contactsElement = doc.getDocumentElement();
NodeList contacts = contactsElement.getChildNodes();
      
for (int i = 0; i < contacts.getLength(); i++) {
  Node current = contacts.item(i);
  if (current.getNodeType() == Node.ELEMENT_NODE) {
     Element contact = (Element) current;
     NodeList children = contact.getChildNodes();
     for (int j = 0; j < contacts.getLength(); j++) {
       Node candidate = children.item(j);
       if (candidate.getNodeType() == Node.ELEMENT_NODE) {
          // Assuming name is the first child element 
          String name = candidate.getFirstChild().getNodeValue();
          System.out.println(name);
          // Assuming there's only one name per contact
          break;
       }
     }
  }
}

This code is very closely tied to the structure of the document. When actually run on Example 35-3 here's the output:

John Doe
Joe 
        
Jane 
Mohammed Jones
914-555-3145
Jason Smith
Xao Li

You can see that it lost half of Joe Smith's name, as well as missing Ben Reilly completely. It also lost track of Jonathan Li and replaced David Smith with his phone number.

With a little effort, we can improve the DOM code to fix most of the problems. The most important improvement we can make is relying on element names rather than positions. The second most important improvement is using the getElementsByTagName() method to search the entire tree rather than just some expected parts of it. The third most important improvement (and the most difficult to implement correctly) is not assuming all text nodes are adjacent. The final fix is removing the assumption that there's only one name per contact. The improved code follows:

      NodeList names = doc.getElementsByTagName("Name");
      
      for (int i = 0; i < names.getLength(); i++) {
          Node name = names.item(i);
          String value = getFullText(name);
          System.out.println(value);
      }
  
   // ...
  
    public static String getFullText(Node node) {
        StringBuffer result = new StringBuffer(12);
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          Node child = children.item(i);
          int type = child.getNodeType();
          if (type == Node.TEXT_NODE) {
            result.append(child.getNodeValue());
          }
          else if (type == Node.CDATA_SECTION_NODE) {
            result.append(child.getNodeValue());
          }
          else if (type == Node.ELEMENT_NODE) {
            result.append(getFullText(child));
          }
          else if (type == Node.ENTITY_REFERENCE_NODE) {
            result.append(getFullText(child));
          }
        }
        return result.toString();
    }

As you can see, this is somewhat more complex. Note especially the recursion that's necessary to accumulate the complete text of the node. Complexity would increase further in an API that doesn't have an equivalent to getElementsByTagName(), a fairly unique feature of DOM.

Is there perhaps an easier way? Yes, there is; and that way is XPath. Now consider this XPath expression:

/Contacts/Contact/Name

It's about ten times simpler than the DOM version, but accomplishes almost as much. This returns the desired content even if the Name elements contain CDATA sections and child elements, and no matter how many of them appear in any given Contact element. This particular, XPath expression would still fail if the hierarchy were not just what was expected; for instance, if the root element were something other than Contacts or if a Name element were not an immediate child of a Contact element. However, an XPath expression that uses the descendant axis would be even more robust. This XPath expression still succeeds:

//Name

There are numerous ways to integrate XPath processing into your programs. Most XSLT processors such as Xalan, libxslt, and MSXML include APIs you can use to merge XPath expressions with your own code. Unfortunately, there is as yet no standard API for XPath queries. For example, using Xalan, the above query might look like this:

NodeList results = XPathAPI.selectNodeList(doc, "//Name");
for (int i = 0; i < results.getLength(); i++) {
  Node result = results.item(i);
  XObject value = XPathAPI.eval(result, "string()");
  System.out.println(value.str());
}

In MSXML XPath evaluation is built-into the DOM implementation classes, which makes for slightly more concise if less portable code. For example, here's the same XPath search implemented in JavaScript such as you might use from Internet Explorer:

var results = doc.selectNodes("//Name");
for (i=0; i < results.length; i++) {
  var node = results.item(i);
  alert(node.text);
}

Many Java APIs are supported by the Jaxen XPath library (http://jaxen.sourceforge.net/) including DOM, JDOM, and dom4j. Using Jaxen on top of DOM, the above query would look something like this:

XPath expression = new org.jaxen.dom.DOMXPath("//Name");
Navigator navigator = expression.getNavigator();
List results = expression.selectNodes(doc);
Iterator iterator = results.iterator();
while (iterator.hasNext()) {
  Node result = (Node) iterator.next();
  String value = StringFunction.evaluate(result, navigator);
  System.out.println(value);
}

There are a few things you need to be wary of when integrating XPath with your more traditional code. First of all, XPath queries return node-sets, not single nodes. You should be prepared to iterate through the result, even if you only expect a single node. This handles cases both where the expected node is missing and where there's more than you expected. Avoid methods like selectSingleNode() in MSXML and Xalan that hide such results from the library client. They significantly reduce robustness. If you absolutely insist that there can be only one result, then by all means validate the document against a schema which requires this (Item 38) before processing the document, and reject it if it fails the schema. Do not simply assume that all documents passed to the program meet the constraints. Such assumptions fail far more often than most developers initially expect, and over a time period of several years almost all such assumptions fail.

Secondly, be aware that you may take a performance hit for choosing XPath, especially if you use the tree-walking descendant and descendant-or-self axes. The following axis also tends to be slow. Some (but not all) XPath engines have major performance problems with the ancestor and ancestor-or-self axes. You can use the child axis and more explicit descent instead. However, as noted above this does reduce your program's robustness. Like all other performance issues, this should be evaluated within the context of a particular program, system, library, and use-case. Some systems have CPU power to spare. Others may be so CPU or memory-limited that the cost of even the most basic XPath processing is prohibitive. Sometimes switching to a different XPath engine can produce an order of magnitude improvement in processing particular XPath expressions. I recommend designing processing for maximum robustness, and only optimizing the XPath expressions when and if profiling proves this is a significant bottleneck.

Personally, I'm willing to trade performance for robustness, and I find the opposite opinion (a willingness to trade correctness for performance) completely incomprehensible. XPath based navigation is both easier to write and easier to write correctly than explicit navigation. It produces much more robust software. If you'd rather process more documents faster, at the cost of processing some of them incorrectly, you can do so; but please at least test the performance of an XPath based alternative before you rule it out.