The NamedNodeMap Interface

If, for some reason, you want all the attributes of an element, or you don’t know their names, you can use the getAttributes() method that Element inherits from Node to retrieve a NamedNodeMap. (Why getAttributes() is in Node instead of Element I have no idea. Elements are the only kind of node that can have attributes. For all other types of node, getAttributes() returns null.) The NamedNodeMap interface, summarized in Example 11.5, has methods to get and set the various named nodes as well as to iterate through the nodes like a list. Here it’s used for attributes, but soon you’ll see it used for notations and entities as well.

Example 11.5. The NamedNodeMap interface

package org.w3c.dom;

public interface NamedNodeMap {

  // for iterating through the map as a list
  public Node item(int index);
  public int  getLength();

  // For working with particular items in the list
  public Node getNamedItem(String name);
  public Node setNamedItem(Node arg) throws DOMException;
  public Node removeNamedItem(String name)
   throws DOMException;
  public Node getNamedItemNS(String namespaceURI, 
   String localName);
  public Node setNamedItemNS(Node arg) throws DOMException;
  public Node removeNamedItemNS(String namespaceURI, 
   String localName) throws DOMException;

}
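
Before moving on to a larger program, here is a minimal sketch of how the list-like and name-based halves of this interface might be used on an element's attributes. The class name AttributeLister and its helper methods are my own inventions for illustration; only the org.w3c.dom and JAXP calls are standard. Note that DOM does not promise any particular order for the items in a NamedNodeMap, and that namespace declarations such as xmlns:xlink show up in the map as attributes too.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the book's examples
public class AttributeLister {

  // Walk the map like a list using getLength() and item().
  // Returns one "name=value" string per attribute.
  public static List<String> listAttributes(Element element) {
    List<String> result = new ArrayList<String>();
    NamedNodeMap attributes = element.getAttributes();
    for (int i = 0; i < attributes.getLength(); i++) {
      // Every item in an element's attribute map is an Attr
      Attr attribute = (Attr) attributes.item(i);
      result.add(attribute.getName() + "=" + attribute.getValue());
    }
    return result;
  }

  // Parse a string with a namespace-aware parser and 
  // return the root element
  public static Element parseRoot(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)))
       .getDocumentElement();
    }
    catch (ParserConfigurationException | SAXException
     | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element book = parseRoot(
      "<book xlink:type='simple'"
      + " xlink:href='http://www.cafeconleche.org/books/xmljava/'"
      + " xmlns:xlink='http://www.w3.org/1999/xlink'>"
      + "Processing XML with Java</book>");

    // List-style access
    for (String attribute : listAttributes(book)) {
      System.out.println(attribute);
    }

    // Name-based access; returns null if there's no such item
    NamedNodeMap attributes = book.getAttributes();
    Attr href = (Attr) attributes.getNamedItemNS(
     "http://www.w3.org/1999/xlink", "href");
    if (href != null) System.out.println(href.getValue());
  }

}
```

When you already know the name of the attribute you want, the getAttribute() and getAttributeNS() methods of Element are usually more convenient than going through the map.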

I’ll demonstrate with an XLink spider program like the one you saw in Chapter 6. However, this time I’ll implement the program on top of DOM rather than SAX. You can judge for yourself which one is more natural.

Recall that XLink is an attribute-based syntax for denoting connections between documents. The element that is the link has an xlink:type attribute with the value simple and an xlink:href attribute whose value is the URL of the remote document. For example, this book element points to this book’s home page:

<book xlink:type="simple" 
      xlink:href="http://www.cafeconleche.org/books/xmljava/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  Processing XML with Java
</book>

The customary prefix xlink is bound to the namespace URI http://www.w3.org/1999/xlink. Most of the time you should depend on the specific URI and not the prefix, which may change.
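
To make the point about prefixes concrete, here is a small sketch that looks up the href attribute by namespace URI and local name, so it works the same whether the document author chose xlink, xl, or some other prefix. The class name PrefixIndependence and the hrefOf() method are hypothetical names of my own; the parser must be namespace aware for getAttributeNS() to behave this way.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demonstration class, not part of the book's examples
public class PrefixIndependence {

  public static final String XLINK_NAMESPACE
   = "http://www.w3.org/1999/xlink";

  // Returns the XLink href of the root element, 
  // or the empty string if it doesn't have one
  public static String hrefOf(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      // Without this, attributes have no namespace URIs
      factory.setNamespaceAware(true);
      Element root = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)))
       .getDocumentElement();
      // Match on namespace URI and local name, never the prefix
      return root.getAttributeNS(XLINK_NAMESPACE, "href");
    }
    catch (ParserConfigurationException | SAXException
     | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String customary = "<book xlink:type='simple'"
     + " xlink:href='http://www.cafeconleche.org/books/xmljava/'"
     + " xmlns:xlink='http://www.w3.org/1999/xlink'/>";
    String unusual = "<book xl:type='simple'"
     + " xl:href='http://www.cafeconleche.org/books/xmljava/'"
     + " xmlns:xl='http://www.w3.org/1999/xlink'/>";
    // Both documents yield the same URL
    System.out.println(hrefOf(customary));
    System.out.println(hrefOf(unusual));
  }

}
```

The reverse also holds: an attribute named xlink:href whose prefix is bound to some other namespace URI is not an XLink, and getAttributeNS() returns the empty string for it.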

Relative URLs are resolved against the nearest ancestor xml:base attribute if one is present, or against the location of the document otherwise. For example, the book element in this library element also points to http://www.cafeconleche.org/books/xmljava/:

<library xml:base="http://www.cafeconleche.org/"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <book xlink:type="simple" xlink:href="books/xmljava/">
    Processing XML with Java
  </book>
</library>

The prefix xml is bound to the namespace URI http://www.w3.org/XML/1998/namespace. This is a special case, however. The xml prefix cannot be changed, and does not need to be declared.
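
The resolution rule can be sketched in a few lines with java.net.URI, which requires Java 1.4 or later. This is only an illustration under my own assumptions: the class name XmlBaseResolver, the baseOf() and parse() methods, and the document location http://www.example.com/library.xml are all hypothetical. The sketch walks up from an element to the root, so that a relative xml:base value is itself resolved against the base inherited from its ancestors.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the book's examples
public class XmlBaseResolver {

  public static final String XML_NAMESPACE
   = "http://www.w3.org/XML/1998/namespace";

  // Returns the base URI in effect for a node. Anything that
  // isn't an element (the Document node, or null) ends the 
  // recursion and yields the location of the document itself.
  public static URI baseOf(Node node, URI documentURI) {
    if (!(node instanceof Element)) return documentURI;
    Element element = (Element) node;
    URI inherited = baseOf(element.getParentNode(), documentURI);
    String baseAtt = element.getAttributeNS(XML_NAMESPACE, "base");
    if (baseAtt.equals("")) return inherited;
    // An absolute xml:base replaces the inherited base;
    // a relative one is resolved against it
    return inherited.resolve(baseAtt);
  }

  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException
     | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = parse(
      "<library xml:base='http://www.cafeconleche.org/'"
      + " xmlns:xlink='http://www.w3.org/1999/xlink'>"
      + "<book xlink:type='simple' xlink:href='books/xmljava/'>"
      + "Processing XML with Java</book></library>");
    Element book
     = (Element) doc.getElementsByTagName("book").item(0);
    // Assumed location of the document, for illustration only
    URI location = URI.create("http://www.example.com/library.xml");
    String href = book.getAttributeNS(
     "http://www.w3.org/1999/xlink", "href");
    System.out.println(baseOf(book, location).resolve(href));
  }

}
```

Because the xml prefix never needs to be declared, the library element above has no xmlns:xml attribute, yet getAttributeNS() still finds xml:base in the XML namespace.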

Attributes provide all the information needed to process the link. Consequently, a spider can follow XLinks without knowing any details about the rest of the markup in the document. Example 11.6 is such a program. Currently this spider does nothing more than follow the links and print their URLs. However, it would not be hard to add code to load the discovered documents into a database or perform some other useful operation. You’d just subclass DOMSpider while overriding the process() method.

Example 11.6. An XLink spider that uses DOM

import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import java.io.*;
import java.util.*;
import java.net.*;
import org.w3c.dom.*;


public class DOMSpider {

  public static final String XLINK_NAMESPACE 
   = "http://www.w3.org/1999/xlink";
  
  // This will be used to read all the documents. We could use
  // multiple parsers in parallel. However, it's a lot easier
  // to work in a single thread, and doing so puts some real 
  // limits on how much bandwidth this program will eat.
  private DocumentBuilder parser; 
  
  // Builds the parser
  public DOMSpider() throws ParserConfigurationException {
  
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      parser = factory.newDocumentBuilder();
    }
    catch (FactoryConfigurationError e) { 
      // I don't absolutely need to catch this, but I hate to 
      // throw an Error for no good reason.
      throw new ParserConfigurationException(
       "Could not locate a factory class"); 
    }

  }
  
  // store the URLs already visited
  private Vector visited = new Vector();
  
  // Limit the amount of bandwidth this program uses
  private int maxDepth = 5;
  private int currentDepth = 0; 
  
  public void spider(String systemID) {
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemID);
        process(document, systemID);
        
        Vector toBeVisited = new Vector();
        // search the document for uris, 
        // store them in vector, and print them
        findLinks(document.getDocumentElement(), 
         toBeVisited, systemID);
    
        Enumeration e = toBeVisited.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.add(uri);
          spider(uri); 
        }
      
      }
    
    }
    catch (SAXException e) {
      // Couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // Couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  public void process(Document document, String uri) {
    System.out.println(uri);
  }
  
  // Recursively descend the tree of one document
  private void findLinks(Element element, List uris, 
   String base) {
    
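    // Check for an xml:base attribute. A relative value is
    // resolved against the base inherited from the ancestors.
    String baseAtt = element.getAttribute("xml:base");
    if (!baseAtt.equals("")) {
      try {
        base = new URL(new URL(base), baseAtt).toExternalForm();
      }
      catch (MalformedURLException e) {
        base = baseAtt; // fall back to the raw attribute value
      }
    }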
    
    // look for XLinks in this element
    if (isSimpleLink(element)) {
      String uri 
       = element.getAttributeNS(XLINK_NAMESPACE, "href");
      if (!uri.equals("")) {
        try {
          String wholePage = absolutize(base, uri);
          if (!visited.contains(wholePage) 
           && !uris.contains(wholePage)) {
            uris.add(wholePage);
          }        
        }
        catch (MalformedURLException e) {
          // If it's not a good URL, then we can't spider it 
          // anyway, so just drop it on the floor.
        }
      } // end if 
    } // end if 
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node node = children.item(i);
      int type = node.getNodeType();
      if (type == Node.ELEMENT_NODE) {
        findLinks((Element) node, uris, base);
      } 
    } // end for
    
  }

  // If you're willing to require Java 1.4, you can do better 
  // than this with the new java.net.URI class
  private static String absolutize(String context, String uri) 
   throws MalformedURLException {
  
    URL contextURL = new URL(context);
    URL url = new URL(contextURL, uri);
    // Remove fragment identifier if any
    String wholePage = url.toExternalForm();
    int fragmentSeparator = wholePage.indexOf('#');
    if (fragmentSeparator != -1) { 
      // There is a fragment identifier
      wholePage = wholePage.substring(0, fragmentSeparator);
    }  
    return wholePage;
    
  }
  
  private static boolean isSimpleLink(Element element) {
  
    String type 
     = element.getAttributeNS(XLINK_NAMESPACE, "type");
    return type.equals("simple");
    
  }
  
  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider topURL"); 
      return;
    } 
    
    // start parsing... 
    try {
      DOMSpider spider = new DOMSpider();
      spider.spider(args[0]);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace(); 
    }
  
  } // end main

} // end DOMSpider

There are two levels of recursion here. The spider() method recursively spiders documents. The findLinks() method recursively searches through the elements in a document looking for XLinks. It adds the URLs found in these links to a list of unvisited pages. Once findLinks() has finished with a document, spider() takes each URL from that list in turn and spiders it. If the URL points to a well-formed XML document, then that document is parsed and passed to the process() method. Non-XML documents found at the end of XLinks are ignored.

I tested this program by pointing it at the Resource Directory Description Language specification, which is one of the few real-world documents I know of that uses XLinks. I was surprised to find out just how much XLinked XML there is out there in the world, though as of yet most of it is just more XML specifications. This must be what the Web felt like circa 1991. Here’s a sample of the more interesting output:

D:\books\XMLJAVA>java DOMSpider http://www.rddl.org/
http://www.rddl.org/
http://www.rddl.org/purposes
http://www.rddl.org/purposes/software
http://www.rddl.org/rddl.rdfs
http://www.rddl.org/rddl-integration.rxg
http://www.rddl.org/modules/rddl-1.rxm
…
http://www.w3.org/2001/XMLSchema
http://www.w3.org/2001/XMLSchema.xsd
http://www.examplotron.org
http://www.examplotron.org/compile.xsl
http://www.examplotron.org/examplotron.xsd
http://www.examplotron.org/0/1/
http://www.examplotron.org/0/2/
http://www.examplotron.org/0/3/
http://webns.net/rdfs/
http://www.w3.org/2000/01/rdf-schema
http://webns.net/rdfs/?format=rdf
http://webns.net/foaf/
http://xmlns.com/foaf/0.1/
http://webns.net/foaf/?format=rdf
http://webns.net/dc/
http://purl.org/dc/elements/1.1/
http://webns.net/dc/?format=rdf
http://openhealth.org/XSet
http://xsltunit.org/0/1/
http://xsltunit.org/0/1/xsltunit.xsl
http://xsltunit.org/0/1/tst_library.xsl
http://xsltunit.org/0/1/library.xml
http://xsltunit.org/0/1/library.xsl
http://venetica.com/venicebridgecontent/
http://www.venetica.com/VeniceBridgeContent
http://www.venetica.com/VeniceBridgeContent/VeniceBridgeContent40.xsd
http://www.venetica.com/VeniceBridgeContent/VeniceBridgeContent.biz
http://www.venetica.com/VeniceBridgeContent/rddl30.html
http://www.w3.org/TR/xhtml-basic
http://www.w3.org/TR/xml-infoset/
http://www.w3.org/TR/xhtml-modularization/

Copyright 2001, 2002 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified February 26, 2002