Chapter 12. The DOM Traversal Module

Table of Contents

NodeIterator
Constructing NodeIterators with DocumentTraversal
Liveness
Filtering by Node Type
NodeFilter
TreeWalker
Summary

The examples in the last few chapters have duplicated quite a bit of tree walking code. Some of them searched for particular information. Others modified documents in memory. What they all had in common was that they navigated a tree from the root to the deepest leaf element in document order. This is an extremely common pattern in DOM programs.

The org.w3c.dom.traversal package is a collection of utility interfaces that implement most of the logic needed to traverse a DOM tree. These include NodeIterator, NodeFilter, TreeWalker, and DocumentTraversal. DOM implementations are not required to support these interfaces, but many do, including Oracle and Xerces. (Crimson does not. GNU JAXP supports NodeIterator but not TreeWalker.) Reusing these interfaces can simplify your programs a great deal and save you a significant amount of work.

NodeIterator

The NodeIterator utility interface extracts a subset of the nodes in a DOM document and presents them as a list arranged in document order. In other words, the nodes appear in the order you’d find them in a depth-first, pre-order traversal of the tree. That is,

  • The document node comes first.

  • Parents come before their children. Ancestors come before their descendants.

  • Sibling nodes appear in the same order as their start-tags appear in the text representation of the document.

In other words, this is pretty much the order you’d expect by just reading an XML document from beginning to end. As soon as you see the first character of text from a node, that node is counted.
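For instance, consider this small document (invented here purely for illustration):

```xml
<?xml version="1.0"?>
<book><title>XML</title><!--draft--></book>
```

An iterator over all the nodes in this document would visit the document node, the book element, the title element, the text node XML, and finally the comment, in that order. (The XML declaration is not a node and is never visited.)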

You can iterate through this list without concerning yourself with the tree structure of the XML document. For many operations this flatter view is more convenient than the hierarchical tree view. For example, a spell checker can check all text nodes one at a time. An outline program can extract the headings in an XHTML document while ignoring everything else. You can do all this by iterating through a list without having to write recursive methods.

Example 12.1 summarizes the NodeIterator interface. The first four getter methods simply tell you how the iterator is choosing from all the available nodes in the document. The nextNode() and previousNode() methods move forwards and backwards in the list and return the requested node. Finally, the detach() method cleans up after the iterator when you’re done with it. It’s analogous to closing a stream.

Example 12.1. The NodeIterator interface

package org.w3c.dom.traversal;

public interface NodeIterator {
  
    public Node       getRoot();
    public int        getWhatToShow();
    public NodeFilter getFilter();
    public boolean    getExpandEntityReferences();
    
    public Node       nextNode() throws DOMException;
    public Node       previousNode() throws DOMException;
    
    public void       detach();

}

As you see, the NodeIterator interface provides only the most basic methods for an iterator. Each iterator can be thought of as having a cursor which is initially positioned before the first node in the list. The nextNode() method returns the node immediately following the cursor and advances the cursor one space. The previousNode() method returns the node immediately before the cursor and backs the cursor up one space. If the iterator is positioned at the end of the list, nextNode() returns null. If the iterator is positioned at the beginning of the list, previousNode() returns null. For example, given a NodeIterator variable named iterator positioned at the beginning of its list, this code fragment prints the names of all the nodes:

Node node;
while ((node = iterator.nextNode()) != null) {
  System.out.println(node.getNodeName());
}

Note

Design pattern aficionados will have recognized this as an instance of the iterator pattern (as if the name didn’t already give it away). More precisely, it’s a robust, external iterator. Robust means that the iterator still works even if its backing data structure (the Document object) changes underneath it. External means that client code is responsible for moving the iterator from one node to the next, rather than having the iterator move itself.

Constructing NodeIterators with DocumentTraversal

Not all DOM implementations are guaranteed to support the traversal module, though most do. You can check this with hasFeature("traversal", "2.0") in the DOMImplementation class. For example,

if (!impl.hasFeature("traversal", "2.0")) {
  System.err.println( 
   "A DOM implementation that supports traversal is required.");  
  return;
}

Assuming the implementation does support traversal, the Document implementation class also implements the DocumentTraversal interface. This factory interface, shown in Example 12.2, allows you to create new NodeIterator and TreeWalker objects which traverse the nodes in that document.

Example 12.2. The DocumentTraversal factory interface

package org.w3c.dom.traversal;

public interface DocumentTraversal {
  
  public NodeIterator createNodeIterator(Node root, 
   int whatToShow, NodeFilter filter, 
   boolean entityReferenceExpansion) throws DOMException;

  public TreeWalker createTreeWalker(Node root,
   int whatToShow, NodeFilter filter,
   boolean entityReferenceExpansion) throws DOMException;

}

Thus, to create a NodeIterator you cast the Document object you want to iterate over to DocumentTraversal, and then invoke its createNodeIterator() method. This method takes four arguments:

root

The Node in the document which the iterator starts from. Only this node and its descendants are traversed by the iterator. This means you can easily design iterators that iterate over a subtree of the entire document. For instance, by passing in the root element, you could skip everything in the document’s prolog and epilog.

whatToShow

An int bitfield constant specifying the node types the iterator will include. These constants are:

  • NodeFilter.SHOW_ELEMENT = 1

  • NodeFilter.SHOW_ATTRIBUTE = 2

  • NodeFilter.SHOW_TEXT = 4

  • NodeFilter.SHOW_CDATA_SECTION = 8

  • NodeFilter.SHOW_ENTITY_REFERENCE = 16

  • NodeFilter.SHOW_ENTITY = 32

  • NodeFilter.SHOW_PROCESSING_INSTRUCTION = 64

  • NodeFilter.SHOW_COMMENT = 128

  • NodeFilter.SHOW_DOCUMENT = 256

  • NodeFilter.SHOW_DOCUMENT_TYPE = 512

  • NodeFilter.SHOW_DOCUMENT_FRAGMENT = 1024

  • NodeFilter.SHOW_NOTATION = 2048

  • NodeFilter.SHOW_ALL = 0xFFFFFFFF

These are all powers of two so that they can be combined with the bitwise operators. For example, if you wanted to iterate over text nodes and CDATA section nodes, you would pass NodeFilter.SHOW_TEXT | NodeFilter.SHOW_CDATA_SECTION as whatToShow.

filter

A NodeFilter against which all nodes in the subtree will be compared. Only nodes that pass the filter will be let through. By implementing this interface, you can define more specific filters such as “all elements that have xlink:type="simple" attributes” or “all text nodes that contain the word fnord”. You can pass null to indicate no custom filtering.

entityReferenceExpansion

Pass true if you want the iterator to descend through the children of entity reference nodes, false otherwise. Generally, this should be set to true.
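As a sketch of the custom filtering idea mentioned above, here is what a filter that accepts only nodes whose values contain the word fnord might look like. (The class name FnordFilter is my own invention; the NodeFilter interface itself is covered later in this chapter.)

```java
import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical example: a filter that accepts only nodes whose
// value contains the word "fnord" and skips everything else
public class FnordFilter implements NodeFilter {

  public short acceptNode(Node node) {
    String value = node.getNodeValue();
    if (value != null && value.indexOf("fnord") >= 0) {
      return NodeFilter.FILTER_ACCEPT;
    }
    return NodeFilter.FILTER_SKIP;
  }

}
```

You would pass new FnordFilter() as the filter argument, typically with whatToShow set to NodeFilter.SHOW_TEXT so that only text nodes are considered in the first place.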

For example, the last chapter demonstrated a comment reader program that recursively descended an XML tree, printing out all the comment nodes that were found. A NodeIterator makes it possible to write the program non-recursively. When creating the iterator, the root argument is the document node, whatToShow is NodeFilter.SHOW_COMMENT, the node filter is null, and entityReferenceExpansion is true. Example 12.3 demonstrates.

Example 12.3. Using a NodeIterator to extract all the comments from a document

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class CommentIterator {

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java CommentIterator URL");
      return;
    }
    
    String url = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      
      // Check for the traversal module
      DOMImplementation impl = parser.getDOMImplementation();
      if (!impl.hasFeature("traversal", "2.0")) {
        System.out.println(
         "A DOM implementation that supports traversal is required."
        );  
        return;
      }
      
      // Read the document
      Document doc = parser.parse(url); 
      
      // Create the NodeIterator
      DocumentTraversal traversable = (DocumentTraversal) doc;
      NodeIterator iterator = traversable.createNodeIterator(
       doc, NodeFilter.SHOW_COMMENT, null, true);

      // Iterate over the comments
      Node node;
      while ((node = iterator.nextNode()) != null) {
        System.out.println(node.getNodeValue());
      }
       
    }
    catch (SAXException e) {
      System.out.println(e);
      System.out.println(url + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " + url
      ); 
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    }
     
  } // end main  
  
}

You can decide for yourself whether you prefer the explicit recursion and tree walking of Example 11.20 in the last chapter or the hidden recursion of CommentIterator here. With a decent implementation there shouldn’t be any noticeable performance difference, so use whichever feels more natural to you.
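None of the examples so far has exercised previousNode(). The following minimal, self-contained sketch (the class name and the three-element document are invented here) builds a document in memory, walks it forward to the end, and then walks back to the beginning:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;

public class IteratorDemo {

  // Builds <a><b/><c/></a> in memory, walks it forward and then
  // backward, and returns the element names visited, concatenated
  public static String walk() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().newDocument();
    Element a = doc.createElement("a");
    doc.appendChild(a);
    a.appendChild(doc.createElement("b"));
    a.appendChild(doc.createElement("c"));

    NodeIterator iterator = ((DocumentTraversal) doc)
     .createNodeIterator(doc, NodeFilter.SHOW_ELEMENT, null, true);

    StringBuffer result = new StringBuffer();
    Node node;
    // Forward pass visits a, b, c
    while ((node = iterator.nextNode()) != null) {
      result.append(node.getNodeName());
    }
    // Backward pass visits c, b, a
    while ((node = iterator.previousNode()) != null) {
      result.append(node.getNodeName());
    }
    return result.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(walk());
  }

}
```

The document node itself is skipped in both directions because whatToShow is NodeFilter.SHOW_ELEMENT.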

Liveness

Node iterators are live. That is, if the document changes while the program is walking the tree, the iterator retains its state. For instance, let’s suppose the program is at node C of a node iterator that’s walking through nodes A, B, C, D, and E in that order. If you delete node D, and then call nextNode(), you’ll get node E. If you add node Z in between B and C and then call previousNode(), you’ll get node Z. The iterator’s current position is always between two nodes (or before the first node or after the last node) but never on a node. Thus it is not invalidated by deleting the current node.

For example, this method deletes all the comments from its Document argument:

  public static void deleteComments(Document doc) {
  
    // Create the NodeIterator
    DocumentTraversal traversable = (DocumentTraversal) doc;
    NodeIterator iterator = traversable.createNodeIterator(
     doc, NodeFilter.SHOW_COMMENT, null, true);

    // Iterate over the comments
    Node comment;
    while ((comment = iterator.nextNode()) != null) {
      Node parent = comment.getParentNode();
      parent.removeChild(comment);
    }
    
  }

This method changes the original Document object. It does not change the XML file from which the Document object was created unless you specifically write the changed document back out into the original file after the comments have been deleted.

Filtering by Node Type

You can combine the various flags for whatToShow with the bitwise OR operator (|). For example, the previous chapter used a rather convoluted recursive getText() method in the ExampleExtractor program to accumulate all the text from both text and CDATA section nodes within an element. Example 12.4 shows how a NodeIterator accomplishes the same task in a much more straightforward fashion.

Example 12.4. Using a NodeIterator to retrieve the complete text content of an element

import org.w3c.dom.*;
import org.w3c.dom.traversal.*;


public class TextExtractor {

  public static String getText(Node node) {
    
    if (node == null) return "";
    
    // Set up the iterator
    Document doc = node.getOwnerDocument();
    DocumentTraversal traversable = (DocumentTraversal) doc;
    int whatToShow 
     = NodeFilter.SHOW_TEXT | NodeFilter.SHOW_CDATA_SECTION;
    NodeIterator iterator = traversable.createNodeIterator(node, 
     whatToShow, null, true);
     
    // Extract the text
    StringBuffer result = new StringBuffer();
    Node current;
    while ((current = iterator.nextNode()) != null) {
      result.append(current.getNodeValue());
    }
    return result.toString();
    
  }

}

I’ll be reusing this class a little later on. Something like this should definitely be in your toolbox for whenever you need to extract the text content of an element.

Note

DOM Level 3 is going to add an almost equivalent getTextContent() method to the Node interface:

public String getTextContent()
    throws DOMException;

The only difference is that this method will not operate on Document objects whereas TextExtractor.getText() will.
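Once an implementation supports DOM Level 3, the call would look something like this sketch (the class and element names here are invented; the behavior shown assumes a conforming Level 3 implementation):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class TextContentDemo {

  // Builds <title>Hello <![CDATA[World]]></title> in memory and
  // returns the element's combined text and CDATA content
  public static String demo() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().newDocument();
    Element title = doc.createElement("title");
    doc.appendChild(title);
    title.appendChild(doc.createTextNode("Hello "));
    title.appendChild(doc.createCDATASection("World"));
    return title.getTextContent();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(demo());
  }

}
```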


Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last modified January 19, 2002.