Handling Attributes

Attributes are not reported through separate callbacks. Instead an Attributes object containing all the attributes of an element is passed to the startElement() method for the start-tag or empty-element tag of the element that possesses the attributes. Example 6.8 summarizes the Attributes interface.

Example 6.8. The SAX Attributes interface

package org.xml.sax;


public interface Attributes {

  public int    getLength ();
  
  public String getQName(int index);
  public String getURI(int index);
  public String getLocalName(int index);
  public int    getIndex(String uri, String localPart);
  public int    getIndex(String qualifiedName);
  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getType(int index);
  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);
  public String getValue(int index);

}

If you know the qualified name or namespace URI and local name of the attribute you want, Attributes can look up its value and type. If you don’t know the names of the attributes at compile-time, you can iterate through all the attributes of an element instead. Attributes are unordered. However, for programmer convenience the Attributes interface is designed as a list. You can ask for the value, local name, qualified name, type, and namespace URI of an attribute by giving its index into the list. Just don’t assume that the order of the attributes in this list is necessarily the same order they had in the original document. More often than not, it isn’t.
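As a minimal sketch of attribute iteration, the following program parses a document held in a string with the JDK's bundled parser, obtained through javax.xml.parsers.SAXParserFactory, and records each attribute it sees. The class name and the sample recipe element are invented for this example:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeLister extends DefaultHandler {

  // One "qName=value (TYPE)" string per attribute encountered
  List<String> seen = new ArrayList<String>();

  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    for (int i = 0; i < atts.getLength(); i++) {
      seen.add(atts.getQName(i) + "=" + atts.getValue(i)
          + " (" + atts.getType(i) + ")");
    }
  }

  // Parse a document held in a string and return the attributes found
  public static List<String> list(String document) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader parser = factory.newSAXParser().getXMLReader();
    AttributeLister handler = new AttributeLister();
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(document)));
    return handler.seen;
  }

  public static void main(String[] args) throws Exception {
    for (String s : list("<recipe name=\"pancakes\" serves=\"4\"/>")) {
      System.out.println(s);
    }
  }
}
```

Since the sample document has no DTD, both attributes are undeclared and are therefore reported with type CDATA.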

The type of the attribute is reported as one of these nine constant strings, exactly as types are indicated in an ATTLIST declaration in a DTD: CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, and NOTATION.

Enumerated types are reported as having type NMTOKEN. Undeclared attributes are reported as having type CDATA. SAX does not yet support schema types such as int or gYear. Maybe in SAX 3.0.

Caution

A few parsers are not 100% compliant with the SAX specification here. In particular, Crimson and Xerces 2.0.x use the string ENUMERATION for enumerated types instead of NMTOKEN. Xerces 1.4.x reports an enumerated type as a string containing the actual enumeration, for example, ( yes | no | maybe).

If a declared attribute has any type other than CDATA, then the parser normalizes its value. This means that each tab, carriage return, and line feed is converted to a single space, runs of spaces are collapsed to a single space, and leading and trailing white space is stripped. Only normalized values are reported by the getValue() methods. However, in order to determine an attribute's type, the parser must read the DTD. If an attribute is declared only in the external DTD subset, then a non-validating parser that does not read the external subset will assume the attribute has type CDATA and will not normalize it.
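To see the difference, here is a small demonstration, with invented element and attribute names, that declares one attribute as NMTOKEN in the internal DTD subset, which even non-validating parsers read, and leaves a second attribute undeclared:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NormalizationDemo extends DefaultHandler {

  String declared;   // value of code, declared NMTOKEN
  String undeclared; // value of note, undeclared, so treated as CDATA

  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    declared = atts.getValue("code");
    undeclared = atts.getValue("note");
  }

  public static NormalizationDemo parse(String document) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader parser = factory.newSAXParser().getXMLReader();
    NormalizationDemo handler = new NormalizationDemo();
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(document)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    String document
        = "<!DOCTYPE item [<!ATTLIST item code NMTOKEN #IMPLIED>]>"
        + "<item code=\"  A7  \" note=\"  A7  \"/>";
    NormalizationDemo demo = parse(document);
    // The declared NMTOKEN value is stripped and collapsed;
    // the undeclared CDATA value keeps its spaces
    System.out.println("[" + demo.declared + "]");   // [A7]
    System.out.println("[" + demo.undeclared + "]"); // [  A7  ]
  }
}
```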

If you ask an Attributes object for information (type, name, value, etc.) about an attribute that is not in that particular list, then all the methods that normally return a String return null instead. The getIndex() methods return -1. None of these methods throw any exceptions. However, if you try to use the return values without checking for null or -1 first, you’re asking for a NullPointerException or an ArrayIndexOutOfBoundsException. SAX 2.0 does not distinguish between attributes that were present in the instance document and those that were defaulted in from the DTD or schema. This may be added in SAX 2.1.
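The org.xml.sax.helpers package includes AttributesImpl, a ready-made implementation of Attributes, which makes it easy to observe these out-of-band return values without running a parse. The attribute names here are invented:

```java
import org.xml.sax.helpers.AttributesImpl;

public class MissingAttributeDemo {

  public static void main(String[] args) {
    // Build an attribute list by hand instead of parsing a document
    AttributesImpl atts = new AttributesImpl();
    atts.addAttribute("", "width", "width", "CDATA", "300");

    // Lookups for attributes that aren't in the list return
    // null or -1 rather than throwing an exception
    System.out.println(atts.getValue("width"));  // 300
    System.out.println(atts.getValue("height")); // null
    System.out.println(atts.getIndex("height")); // -1
    System.out.println(atts.getType(42));        // null
  }
}
```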

For an example, I’m going to develop a web spider that follows simple XLinks. XLink is an attribute-based syntax for embedding hypertext links in arbitrary XML documents. Elements are identified as XLinks by an xlink:type attribute with the value simple. (There’s also a more powerful and more complex extended XLink, which I’m going to ignore for the purposes of this example.) The URL the link points to is contained in an xlink:href attribute. The xlink prefix is mapped to the namespace URI http://www.w3.org/1999/xlink. As always, the prefix can change as long as the URI stays the same. For example, this is an XLink that points to The Nation’s home page:

<magazine xmlns:xlink="http://www.w3.org/1999/xlink" 
 xlink:type="simple" xlink:href="http://www.thenation.com/">
  The Nation
</magazine>

Note especially that the element name and content are irrelevant to the link, which is encoded purely in attributes. The link could be written like this and still indicate the same link:

<foo xmlns:xlink="http://www.w3.org/1999/xlink" 
 xlink:type="simple" xlink:href="http://www.thenation.com/">
  Foo
</foo>

All the information required to process the link is included in the attributes. Consequently, we can use the Attributes interface and the startElement() method to design a spider that follows XLinks. Example 6.9 is such a program. Currently this spider does nothing more than follow the links and print their URLs. However, it would not be hard to add code to load the discovered documents into a database or perform some other useful operation.

Example 6.9. A ContentHandler class that spiders XLinks

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;


public class SAXSpider extends DefaultHandler {

  // Need to keep track of where we've been 
  // so we don't get stuck in an infinite loop
  private List spideredURLs = new Vector();

  // This linked list keeps track of where we're going.
  // Although the LinkedList class does not guarantee queue like
  // access, I always access it in a first-in/first-out fashion.
  private LinkedList queue = new LinkedList();
  
  private String    currentURL;
  private XMLReader parser;
  
  public SAXSpider(XMLReader parser, String url) {
    this.parser = parser;
    this.currentURL = url;
  }
  
  public void endDocument() { 
    
    spideredURLs.add(currentURL);
    System.out.println("Visited " + currentURL);
    String url;
    try {
      url = (String) queue.removeLast();
    }
    catch (NoSuchElementException e) {
      // The queue is empty; we're finished.
      return;
    }
    this.currentURL = url;
    try {
      parser.parse(url);
    }
    catch (Exception ex) { 
      // just skip this one and move on to the next
      this.endDocument();
    }
    
  }

  public void startElement(String namespaceURI, String localName, 
   String qualifiedName, Attributes atts) {
    
    String type 
     = atts.getValue("http://www.w3.org/1999/xlink", "type");
    if ("simple".equals(type)) {
      String href 
       = atts.getValue("http://www.w3.org/1999/xlink", "href");
      if (href != null) {
        if (!spideredURLs.contains(href)) {
          queue.addFirst(href);
        }
      }
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java SAXSpider URL"); 
      return;
    } 
    String url = args[0];
    
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader(
        "org.apache.xerces.parsers.SAXParser"
      );
      
      // Install the ContentHandler   
      ContentHandler spider = new SAXSpider(parser, url);   
      parser.setContentHandler(spider);
      parser.parse(url);

    }
    catch (Exception e) {
      System.err.println(e);
    }
        
  } // end main

} // end SAXSpider

The startElement() method simply inspects each tag for the two relevant XLink attributes, looking them up by namespace URI and local name. If it finds them, and the spider has not yet visited that URL, it adds the URL to the queue of URLs waiting to be visited.

The endDocument() method prints the URL of the document it has just finished parsing. Then it retrieves the next URL from the queue and parses it. This program is a little unusual in that not only does the XMLReader call back to the ContentHandler, but the ContentHandler also calls back to its XMLReader.

The main() method reads the starting URL from the command line, constructs an XMLReader and a SAXSpider, and parses the initial URL. The program runs automatically from there. There’s no limit to the depth or number of documents this spider will search, though currently the paucity of XLinked documents on the Web makes it unlikely that this program will run forever. Furthermore, since it isn’t designed to run in parallel, there’s little chance of it overwhelming anybody’s server. Nonetheless, limiting its search depth would be a good feature to add.
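One way to add that depth limit is to store a depth alongside each queued URL and skip anything past the cutoff. The following sketch factors the queue discipline out of SAXSpider into a plain class, using a map of fake links in place of the network so the traversal logic can be exercised directly; all names here are invented:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedCrawler {

  private final Map<String, List<String>> links; // stands in for the Web
  private final int maxDepth;

  public DepthLimitedCrawler(Map<String, List<String>> links, int maxDepth) {
    this.links = links;
    this.maxDepth = maxDepth;
  }

  public List<String> crawl(String start) {
    Set<String> visited = new HashSet<String>();
    List<String> order = new ArrayList<String>();
    // Each entry pairs a URL with the depth at which it was found
    LinkedList<Object[]> queue = new LinkedList<Object[]>();
    queue.addFirst(new Object[] {start, Integer.valueOf(0)});
    while (!queue.isEmpty()) {
      // addFirst/removeLast gives first-in/first-out order,
      // the same discipline SAXSpider uses
      Object[] entry = queue.removeLast();
      String url = (String) entry[0];
      int depth = (Integer) entry[1];
      if (visited.contains(url) || depth > maxDepth) continue;
      visited.add(url);
      order.add(url);
      List<String> outgoing = links.get(url);
      if (outgoing == null) outgoing = Collections.emptyList();
      for (String href : outgoing) {
        queue.addFirst(new Object[] {href, Integer.valueOf(depth + 1)});
      }
    }
    return order;
  }
}
```

With maxDepth set to 1 and a start page that links to two others, one of which links to a fourth, the crawler visits the first three pages but never fetches the fourth, which sits at depth 2.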


Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last modified October 16, 2001.