NodeFilter

The whatToShow argument allows you to iterate over only certain node types in a subtree. However, suppose you want to go beyond that. For example, you may have a program that reads XHTML documents and extracts all heading elements but ignores everything else. Or perhaps you want to find all SVG content in a document, or all the GIFT elements whose price attribute has a value greater than $10.00. Or perhaps you want to find those SKU elements containing an ID of a product that needs to be reordered, as determined by consulting an external database. All of these tasks and many more besides can be implemented through node filters on top of a NodeIterator or a TreeWalker.

Example 12.5 summarizes the NodeFilter interface. You implement this interface in a class of your own devising. The acceptNode() method contains the custom logic that decides whether any given node passes the filter or not. This method can return one of the three named constants NodeFilter.FILTER_ACCEPT, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_SKIP to indicate what it wants to do with that node.

Example 12.5. The NodeFilter interface

package org.w3c.dom.traversal;

import org.w3c.dom.Node;

public interface NodeFilter {
  
  // Constants returned by acceptNode
  public static final short FILTER_ACCEPT = 1;
  public static final short FILTER_REJECT = 2;
  public static final short FILTER_SKIP   = 3;

  // Constants for whatToShow
  public static final int SHOW_ALL               = 0xFFFFFFFF;
  public static final int SHOW_ELEMENT           = 0x00000001;
  public static final int SHOW_ATTRIBUTE         = 0x00000002;
  public static final int SHOW_TEXT              = 0x00000004;
  public static final int SHOW_CDATA_SECTION     = 0x00000008;
  public static final int SHOW_ENTITY_REFERENCE  = 0x00000010;
  public static final int SHOW_ENTITY            = 0x00000020;
  public static final int SHOW_PROCESSING_INSTRUCTION 
   = 0x00000040;
  public static final int SHOW_COMMENT           = 0x00000080;
  public static final int SHOW_DOCUMENT          = 0x00000100;
  public static final int SHOW_DOCUMENT_TYPE     = 0x00000200;
  public static final int SHOW_DOCUMENT_FRAGMENT = 0x00000400;
  public static final int SHOW_NOTATION          = 0x00000800;

  public short acceptNode(Node n);

}

For iterators, there are really only two meaningful return values for acceptNode(): FILTER_ACCEPT and FILTER_SKIP. A NodeIterator treats FILTER_REJECT the same as FILTER_SKIP. (Tree walkers do make a distinction between the two.) Skipping a node prevents it from appearing in the list, but does not prevent its children and descendants from appearing; they will be tested separately.
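
To make the distinction concrete, here is a minimal sketch of a filter that hides hypothetical private elements (the private element name is invented purely for illustration). Used with a TreeWalker, returning FILTER_REJECT prunes a private element and its entire subtree, while returning FILTER_SKIP would hide only the element itself and still let the walker visit its children. A NodeIterator treats the two return values identically.

import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;


public class PrivateFilter implements NodeFilter {

  public short acceptNode(Node node) {
    // With a TreeWalker, FILTER_REJECT prunes the whole subtree;
    // FILTER_SKIP would hide only this one node. With a
    // NodeIterator the two behave exactly the same.
    if (node.getNodeName().equals("private")) {
      return FILTER_REJECT;
    }
    return FILTER_ACCEPT;
  }

}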

The NodeFilter does not override whatToShow; the two work in concert. For example, whatToShow can limit the iterator to elements only. The acceptNode() method can then confidently cast every node that’s passed to it to Element without first checking its node type.
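
Because the whatToShow constants are bit flags, more than one node type can be exposed to the filter by combining the constants with the bitwise OR operator. (Of course, a filter that can see several node types can no longer cast blindly.) For instance, assuming a DocumentTraversal object and a filter like those in the spider() method later in this section, an iterator over both elements and text nodes could be created like this:

NodeIterator iterator = traversal.createNodeIterator(
  document.getDocumentElement(),
  NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT,
  filter,
  true
);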

To configure an iterator with a filter, pass the NodeFilter object to the createNodeIterator() method. The NodeIterator then passes each potential candidate node to the acceptNode() method to decide whether or not to include it in the iterator.

For an example, let’s revisit last chapter’s DOMSpider program. That program needed to recurse through the entire document, looking at each and every node to see whether or not it was an element and, if it was, whether or not it had an xlink:type attribute with the value simple. We can write that program much more simply using a NodeFilter to find the simple XLinks and a NodeIterator to walk through them. Example 12.6 demonstrates the necessary filter.

Example 12.6. An implementation of the NodeFilter interface

import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.*;


public class XLinkFilter implements NodeFilter {

  public final static String XLINK_NAMESPACE 
   = "http://www.w3.org/1999/xlink";
  
  public short acceptNode(Node node) {
     
    // This cast is safe only because the iterator that uses
    // this filter sets whatToShow to NodeFilter.SHOW_ELEMENT,
    // so only element nodes are ever passed to this method.
    Element candidate = (Element) node;
    String type 
     = candidate.getAttributeNS(XLINK_NAMESPACE, "type");
    if (type.equals("simple")) return FILTER_ACCEPT;
    return FILTER_SKIP;
     
  }

}

Here’s a spider() method that has been revised to take advantage of NodeIterator and this filter. This can replace both the spider() and findLinks() methods of the previous version. The filter replaces the isSimpleLink() method. The code is quite a bit simpler than the version in the last chapter.

  public void spider(String systemID) {
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemID);
        process(document, systemID);
        
        Vector uris = new Vector();
        // Search the document for the URIs of simple XLinks
        // and store them in the vector
        DocumentTraversal traversal 
         = (DocumentTraversal) document;
        NodeIterator xlinks = traversal.createNodeIterator(
          document.getDocumentElement(),// start at root element
          NodeFilter.SHOW_ELEMENT,      // only see elements
          new XLinkFilter(),            // only see simple XLinks
          true                          // expand entities
        );
        
        Element xlink;
        while ((xlink = (Element) xlinks.nextNode()) != null) {
          String uri = xlink.getAttributeNS(XLINK_NAMESPACE, 
           "href");
          if (!uri.equals("")) {
            try {
              String wholePage = absolutize(systemID, uri);
              if (!visited.contains(wholePage) 
               && !uris.contains(wholePage)) {
                uris.add(wholePage);
              }        
            }
            catch (MalformedURLException e) {
              // If it's not a good URL, then we can't spider it 
              // anyway, so just drop it on the floor.
            }
          } // end if
        } // end while
        xlinks.detach();
    
        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.add(uri);
          spider(uri); 
        }
      
      }
    
    }
    catch (SAXException e) {
      // Couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // Couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }

There is, however, one feature the earlier version had that this NodeIterator-based variant doesn’t have. Last chapter’s DOMSpider tracked xml:base attributes. Since the xml:base attributes may appear on ancestors of the XLinks rather than on the XLinks themselves, a NodeIterator really isn’t appropriate for tracking them. The key problem is that xml:base has hierarchical scope. That is, an xml:base attribute only applies to the element on which it appears and its descendants. While the filter could easily be adjusted to notice elements that have xml:base attributes as well as those that have xlink:type="simple" attributes, an iterator really can’t tell which other elements any given xml:base attribute applies to.
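
A TreeWalker, by contrast, does expose the hierarchy, so base URIs can be tracked by carrying the current base down through a recursive walk. The following method is only a sketch of the idea, not part of the actual DOMSpider: the handleSimpleLink() call is a placeholder, the walker is assumed to come from DocumentTraversal’s createTreeWalker() method, and a complete implementation would also resolve relative xml:base values against the inherited base.

  private void walk(TreeWalker walker, String base) {
    
    Node node = walker.getCurrentNode();
    if (node instanceof Element) {
      Element element = (Element) node;
      String xmlBase = element.getAttributeNS(
       "http://www.w3.org/XML/1998/namespace", "base");
      // An xml:base attribute applies to this element and its
      // descendants, but not to its ancestors or siblings; a
      // real implementation would resolve a relative value
      // against the base inherited from above
      if (!xmlBase.equals("")) base = xmlBase;
      // handleSimpleLink(element, base); // placeholder
    }
    for (Node child = walker.firstChild(); child != null; 
         child = walker.nextSibling()) {
      walk(walker, base);
    }
    // Restore the walker's position before returning so the
    // caller's nextSibling() continues from the right place
    walker.setCurrentNode(node);
    
  }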

DOM Level 3 will add a getBaseURI() method to the Node interface that will eliminate the need to track xml:base attributes manually. In fact, this will be even more effective than the manual tracking of last chapter’s example, because it will also notice different base URIs that arise from external entities. Revising the spider() method to take advantage of this requires changing only one line of code, as follows:

String wholePage = absolutize(xlink.getBaseURI(), uri);

Unfortunately, this method is not yet supported by any common parsers. However, it should be implemented in the not-too-distant future.

