SAX


SAX, the Simple API for XML, was the first standard API shared across different XML parsers. SAX is unique among XML APIs in that it models the parser rather than the document. In particular the parser is represented as an instance of the XMLReader interface. The specific class that implements this interface varies from parser to parser. Most of the time you only access it through the common methods of the XMLReader interface.

A parser reads a document from beginning to end. As it does so it encounters start-tags, end-tags, text, comments, processing instructions, and more. In SAX, the parser tells the client application what it sees as it sees it by invoking methods in a ContentHandler object. ContentHandler is an interface the client application implements to receive notification of document content. The client application instantiates its own implementation of the ContentHandler interface and registers it with the XMLReader that’s going to parse the document. As the reader reads the document, it calls back to the methods in the registered ContentHandler object. The general pattern is very similar to how events are handled in the AWT and Swing.
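The callback pattern is easier to see in a small, self-contained sketch. The class names CallbackSketch and RecordingHandler are hypothetical, and the sketch assumes the parser bundled with the JDK, obtained through JAXP's SAXParserFactory, rather than the Xerces class used in the example that follows. It registers a handler with an XMLReader and records each callback in the order the parser makes it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CallbackSketch {

  // Hypothetical handler that records each callback it receives
  static class RecordingHandler extends DefaultHandler {

    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) {
      events.add("start:" + localName);
    }

    @Override
    public void endElement(String namespaceURI, String localName,
     String qualifiedName) {
      events.add("end:" + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      // Skip chunks that contain only white space
      String text = new String(ch, start, length).trim();
      if (!text.isEmpty()) events.add("text:" + text);
    }
  }

  // Parses the string and returns the callbacks in the order received
  static List<String> trace(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // so localName is populated
      XMLReader reader = factory.newSAXParser().getXMLReader();
      RecordingHandler handler = new RecordingHandler();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
      return handler.events;
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    for (String event : trace("<value><int>42</int></value>")) {
      System.out.println(event);
    }
  }
}
```

Running it prints one line per callback in document order, beginning with start:value and ending with end:value. The client code never asks the parser for the next item; the parser decides when each method is called.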

Example 5.3 is a simple SAX program that communicates with the XML-RPC service introduced in Chapter 3. It sends the request document using basic output stream techniques and then receives the response through SAX.

Example 5.3. A SAX-based client for the Fibonacci XML-RPC server

import java.net.*;
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;


public class FibonacciSAXClient {

  public final static String DEFAULT_SERVER 
   = "http://www.elharo.com/fibonacci/XML-RPC";  
  
  public static void main(String[] args) {
      
    if (args.length <= 0) {
      System.out.println(
       "Usage: java FibonacciSAXClient number url"
      );
      return;
    }
    
    String server = DEFAULT_SERVER;
    if (args.length >= 2) server = args[1];
      
    try {
      // Connect to the server
      URL u = new URL(server);
      URLConnection uc = u.openConnection();
      HttpURLConnection connection = (HttpURLConnection) uc;
      connection.setDoOutput(true);
      connection.setDoInput(true); 
      connection.setRequestMethod("POST");
      OutputStream out = connection.getOutputStream();
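      // The XML declaration names no encoding, so write UTF-8
      // rather than the platform's default encoding
      Writer wout = new OutputStreamWriter(out, "UTF-8");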
      
      // Transmit the request XML document
      wout.write("<?xml version=\"1.0\"?>\r\n");  
      wout.write("<methodCall>\r\n"); 
      wout.write(
       "  <methodName>calculateFibonacci</methodName>\r\n");
      wout.write("  <params>\r\n"); 
      wout.write("    <param>\r\n"); 
      wout.write("      <value><int>" + args[0] 
       + "</int></value>\r\n"); 
      wout.write("    </param>\r\n"); 
      wout.write("  </params>\r\n"); 
      wout.write("</methodCall>\r\n"); 
      
      wout.flush();
      wout.close();      

      // Read the response XML document
      XMLReader parser = XMLReaderFactory.createXMLReader(
        "org.apache.xerces.parsers.SAXParser"
      );
      // There's a name conflict with java.net.ContentHandler
      // so we have to use the fully package qualified name.
      org.xml.sax.ContentHandler handler 
       = new FibonacciHandler();
      parser.setContentHandler(handler);
    
      InputStream in = connection.getInputStream();
      InputSource source = new InputSource(in);
      parser.parse(source);
      System.out.println();

      in.close();
      connection.disconnect();
    }
    catch (Exception e) {
      System.err.println(e); 
    }
  
  } 

}

Since SAX is a read-only API, I used the same code as before to write the request sent to the server. The code for reading the response, however, is quite different. Rather than reading directly from the stream, the program bundles the InputStream in an InputSource, a generic wrapper for the different places an XML document might come from: an InputStream, a Reader, or a system ID such as an http or file URL. This InputSource object is then passed to the parse() method of an XMLReader.
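As a rough illustration of that wrapper role, the following sketch builds an InputSource from a byte stream, from a character stream, and from a system ID. The class name InputSourceSketch, its helper methods, and the example.com URL are hypothetical, and the JDK's bundled parser is assumed. The two in-memory sources are parsed with the same code; the system ID form is only constructed, not fetched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class InputSourceSketch {

  // Wraps raw bytes, as Example 5.3 does with the HTTP response
  static InputSource fromBytes(String xml) {
    byte[] data = xml.getBytes(StandardCharsets.UTF_8);
    return new InputSource(new ByteArrayInputStream(data));
  }

  // Wraps characters that have already been decoded
  static InputSource fromChars(String xml) {
    return new InputSource(new StringReader(xml));
  }

  // Wraps a system ID; the parser opens the URL itself when it parses
  static InputSource fromSystemId(String url) {
    return new InputSource(url);
  }

  // Parses any InputSource and returns the root element's local name
  static String rootName(InputSource source) {
    final String[] root = new String[1];
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String namespaceURI, String localName,
       String qualifiedName, Attributes atts) {
        if (root[0] == null) root[0] = localName;
      }
    };
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(source);
      return root[0];
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<methodResponse/>";
    System.out.println(rootName(fromBytes(xml)));
    System.out.println(rootName(fromChars(xml)));
    // Hypothetical URL; constructed here but never fetched
    InputSource remote = fromSystemId("http://www.example.com/doc.xml");
    System.out.println(remote.getSystemId());
  }
}
```

The parsing code in rootName() does not care which kind of source it was handed, which is the point of the wrapper.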

Several exceptions can be thrown at various points in this process. For instance, an IOException will be thrown if the socket connecting the client to the server is broken. A SAXException will be thrown if the org.apache.xerces.parsers.SAXParser class can’t be found in the class path. A SAXParseException will be thrown if the server returns malformed XML. For now, Example 5.3 lumps all these together in one generic catch block. Later chapters go into the different exceptions in more detail.
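For readers who want to tell these failures apart now, here is a minimal sketch of separate catch blocks. The class name ExceptionSketch and its check() method are hypothetical. It assumes the JDK's bundled parser obtained through JAXP, which adds a ParserConfigurationException that Example 5.3 does not encounter. Because SAXParseException is a subclass of SAXException, it has to be caught first.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ExceptionSketch {

  // Parses the string and reports which kind of problem, if any, occurred
  static String check(String xml) {
    try {
      XMLReader reader
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      DefaultHandler handler = new DefaultHandler();
      reader.setContentHandler(handler);
      // DefaultHandler rethrows fatal errors instead of logging them
      reader.setErrorHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
      return "well-formed";
    }
    catch (SAXParseException e) {
      // The document itself is malformed
      return "malformed at line " + e.getLineNumber();
    }
    catch (SAXException e) {
      // Some other SAX problem, such as a parser that cannot be loaded
      return "SAX problem: " + e.getMessage();
    }
    catch (ParserConfigurationException e) {
      // JAXP could not build a parser with the requested settings
      return "configuration problem: " + e.getMessage();
    }
    catch (IOException e) {
      // The underlying stream failed, for example a broken socket
      return "I/O problem: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(check("<value><int>42</int></value>"));
    System.out.println(check("<value><int>42</value>"));
  }
}
```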

There’s no code in this class to actually find the double element in the response and print its content on the console. Yet when run, it produces the expected response:

C:\XMLJAVA>java FibonacciSAXClient 42
267914296

The real work of understanding and processing documents in this particular format is happening inside the ContentHandler object. The specific implementation of the ContentHandler interface used here is FibonacciHandler, shown in Example 5.4. In this case I chose to extend the DefaultHandler adapter class rather than implement the ContentHandler interface directly. The pattern is similar to using WindowAdapter instead of WindowListener in the AWT. It avoids having to implement a lot of do-nothing methods that don’t matter in this particular program.

Example 5.4. The ContentHandler for the SAX client for the Fibonacci XML-RPC server

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;


public class FibonacciHandler extends DefaultHandler {

  private boolean inDouble = false;
  
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {
    
    if (localName.equals("double")) inDouble = true;
    
  }

  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException {
    
    if (localName.equals("double")) inDouble = false;
    
  }

  public void characters(char[] ch, int start, int length)
  throws SAXException {

    if (inDouble) {
      for (int i = start; i < start+length; i++) {
        System.out.print(ch[i]); 
      }
    }   
   
  }
  
}

What this ContentHandler needs to do is recognize and print the contents of the single double element in the response while ignoring everything else. Thus, when the startElement() method, which is invoked by the parser every time it encounters a start-tag or an empty-element tag, sees a start-tag with the name double, it sets a private boolean field named inDouble to true. When the endElement() method sees an end-tag with the name double, it sets the same field back to false. The characters() method prints whatever it sees on System.out, but only when inDouble is true.
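One detail worth knowing is that a parser is allowed to deliver the text of an element in more than one call to characters(). Printing each chunk as it arrives, as Example 5.4 does, copes with that naturally. If you would rather return the value than print it, one option is to accumulate the chunks in a StringBuilder. The sketch below is a hypothetical variant, not the book's handler; the names BufferingHandlerSketch and BufferingHandler and the sample response string are assumptions, and the JDK's bundled parser is used so the example runs without a server.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class BufferingHandlerSketch {

  // Hypothetical variant of FibonacciHandler that stores the text
  // of the double element instead of printing it
  static class BufferingHandler extends DefaultHandler {

    private boolean inDouble = false;
    private final StringBuilder value = new StringBuilder();

    @Override
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) {
      if (localName.equals("double")) inDouble = true;
    }

    @Override
    public void endElement(String namespaceURI, String localName,
     String qualifiedName) {
      if (localName.equals("double")) inDouble = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      // May be called several times for one element; append each chunk
      if (inDouble) value.append(ch, start, length);
    }

    String getValue() {
      return value.toString();
    }
  }

  // Assumed shape of the server's response, held in memory
  static final String SAMPLE_RESPONSE
   = "<?xml version=\"1.0\"?>\r\n"
   + "<methodResponse>\r\n"
   + "  <params>\r\n"
   + "    <param>\r\n"
   + "      <value><double>267914296</double></value>\r\n"
   + "    </param>\r\n"
   + "  </params>\r\n"
   + "</methodResponse>\r\n";

  static String extractDouble(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // so localName is populated
      XMLReader reader = factory.newSAXParser().getXMLReader();
      BufferingHandler handler = new BufferingHandler();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
      return handler.getValue();
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(extractDouble(SAMPLE_RESPONSE));
  }
}
```

The white space between the other tags never reaches the buffer because inDouble is false whenever the parser reports it.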

Unlike the earlier stream- and string-based solution, this program will detect any well-formedness errors in the document. It will not be tripped up by the unexpected appearance of <double> tags in comments or processing instructions, or by ignorable white space between tags. This program does not detect problems that occur as a result of multiple double elements or other invalid markup. However, in later chapters I’ll show you how to use a schema to add this capability. The parser-based client is much more robust than Example 5.2 and it’s almost as simple. As the markup becomes more complex and the amount of information you need to extract from the document grows, parser-based solutions become far easier and cheaper to implement than any alternative.
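To make the robustness claim concrete, the sketch below runs a naive string search and a SAX handler over the same document, one in which a comment happens to contain a double tag. The naive() method is a hypothetical stand-in for a string-based approach and is not the code of Example 5.2; the class name, the sample document, and the use of the JDK's bundled parser are likewise assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class RobustnessSketch {

  // Hypothetical response whose comment contains a misleading tag
  static final String TRICKY
   = "<methodResponse>"
   + "<!-- <double>0</double> was the old default -->"
   + "<params><param><value><double>267914296</double></value>"
   + "</param></params></methodResponse>";

  // Naive string search: takes whatever follows the first "<double>"
  static String naive(String xml) {
    int start = xml.indexOf("<double>") + "<double>".length();
    int end = xml.indexOf("</double>");
    return xml.substring(start, end);
  }

  // SAX-based extraction: comments never reach startElement()
  static String withSax(String xml) {
    final StringBuilder value = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
      private boolean inDouble = false;

      @Override
      public void startElement(String namespaceURI, String localName,
       String qualifiedName, Attributes atts) {
        if (localName.equals("double")) inDouble = true;
      }

      @Override
      public void endElement(String namespaceURI, String localName,
       String qualifiedName) {
        if (localName.equals("double")) inDouble = false;
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        if (inDouble) value.append(ch, start, length);
      }
    };
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
      return value.toString();
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("naive: " + naive(TRICKY));
    System.out.println("SAX:   " + withSax(TRICKY));
  }
}
```

The string search picks up the 0 inside the comment, while the SAX handler reports only the content of the real element.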

The big advantage to SAX compared to other parser APIs is that it’s quite fast and extremely memory-efficient. You only need to store in memory those parts of the document you actually care about. You can ignore the rest. DOM, by contrast, must keep the entire document in memory at once. ^[3] Furthermore, the DOM data structures tend to be substantially less efficient than the serialized XML itself. A DOM Document object can easily take up ten times as much memory as would be required to just hold the characters of the document in an array. This severely limits the size of documents that can be processed with DOM and other tree-based APIs. SAX, by contrast, can handle documents that vastly exceed the amount of available memory. If your documents cross the gigabyte threshold, there is really no alternative to SAX.

Furthermore, SAX works very well in streaming applications. A SAX program can begin working with the start of a document before the parser has reached the middle. This is particularly important in low-bandwidth, high-latency environments like most network applications. For example, if a client sent a brokerage an XML document containing a list of stocks to buy, the brokerage could execute the first trade before the entire document had been received or parsed. Multi-threading can be especially useful here.
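One way to observe this behavior without a network is to hand the parser a document that breaks off partway through. In the sketch below the order format, the buy element, the stock symbols, and the class names are all hypothetical, and the JDK's bundled parser is assumed. The handler acts on each buy element as soon as its end-tag arrives, so the trades that arrived intact are handled before the parser discovers that the document is incomplete.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingSketch {

  // Hypothetical handler that "executes" a trade at each </buy>
  static class TradeHandler extends DefaultHandler {

    final List<String> executed = new ArrayList<>();
    private final StringBuilder symbol = new StringBuilder();
    private boolean inBuy = false;

    @Override
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) {
      if (localName.equals("buy")) {
        inBuy = true;
        symbol.setLength(0); // start a fresh symbol
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      if (inBuy) symbol.append(ch, start, length);
    }

    @Override
    public void endElement(String namespaceURI, String localName,
     String qualifiedName) {
      if (localName.equals("buy")) {
        inBuy = false;
        // Act immediately; the rest of the document is still unread
        executed.add(symbol.toString().trim());
      }
    }
  }

  static List<String> process(String xml) {
    TradeHandler handler = new TradeHandler();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.setErrorHandler(handler); // rethrow rather than log
      reader.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXParseException e) {
      // The document broke off, but earlier callbacks already ran
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return handler.executed;
  }

  public static void main(String[] args) {
    // Truncated on purpose: the third buy element never finishes
    String partial = "<order><buy>AAPL</buy><buy>IBM</buy><buy>SU";
    System.out.println(process(partial));
  }
}
```

A real application would have to decide whether acting on part of a document that later turns out to be malformed is acceptable. That is a design question the sketch does not answer.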

The downside to SAX is that most programs are more concerned with XML documents than with XML parsers. In other words, a class hierarchy that models the XML document is a lot more natural and closer to what you’re likely to need than a class hierarchy that models parsers. SAX programs tend to be more than a little opaque. It’s rare that SAX gives you all the information you need at the same time. Most of the time you find yourself building data structures in memory to store the parts of the document you’re interested in until you’re ready to use them. In the worst case, you can end up inventing your own tree model for the entire document, in which case you’re probably better off just using DOM or one of the other tree models in the first place and saving yourself the work.
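The following sketch gives a feel for that bookkeeping. Each trade needs both a symbol and a share count, but SAX reports them in separate callbacks, so the handler has to hold on to each piece until the enclosing element ends. The order format, the Trade class, and the other names are hypothetical, and the JDK's bundled parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CollectingSketch {

  // Hypothetical record assembled from several callbacks
  static class Trade {
    final String symbol;
    final int shares;

    Trade(String symbol, int shares) {
      this.symbol = symbol;
      this.shares = shares;
    }

    @Override
    public String toString() {
      return symbol + " x " + shares;
    }
  }

  static class TradeCollector extends DefaultHandler {

    final List<Trade> trades = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String symbol;
    private int shares;

    @Override
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) {
      text.setLength(0); // discard text left over from earlier elements
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }

    @Override
    public void endElement(String namespaceURI, String localName,
     String qualifiedName) {
      if (localName.equals("symbol")) {
        symbol = text.toString().trim();
      }
      else if (localName.equals("shares")) {
        shares = Integer.parseInt(text.toString().trim());
      }
      else if (localName.equals("buy")) {
        // Only now are both pieces available at the same time
        trades.add(new Trade(symbol, shares));
      }
    }
  }

  static List<Trade> collect(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      TradeCollector collector = new TradeCollector();
      reader.setContentHandler(collector);
      reader.parse(new InputSource(new StringReader(xml)));
      return collector.trades;
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<order>"
     + "<buy><symbol>AAPL</symbol><shares>100</shares></buy>"
     + "<buy><symbol>IBM</symbol><shares>50</shares></buy>"
     + "</order>";
    System.out.println(collect(xml));
  }
}
```

Even for a format this small, the handler carries three fields of state. With deeper nesting the state grows quickly, which is where a tree model starts to look attractive.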

^[3] Xerces does give you the option of using a lazy DOM that only parses parts of the document as they’re needed to help reduce memory usage.

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified December 14, 2002