Receiving Characters

When the parser reads #PCDATA, it passes this text to the characters() method as an array of chars. While it would be simpler if characters() just took a String as an argument, using a char[] array allows certain performance optimizations. In particular, parsers often store a large chunk of the original document in a single array, and repeatedly pass that same array to the characters() method, while updating the values of start and length.
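To make the role of start and length concrete, here is a minimal sketch. It assumes the JAXP parser bundled with the JDK (obtained through SAXParserFactory) rather than any particular parser class. The names SliceDemo and textOf() are hypothetical, invented for this illustration, and StringBuilder plays the role a StringBuffer would play elsewhere in this chapter.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SliceDemo {

  // Hypothetical helper: returns all character data in a small document
  static String textOf(String xml) throws Exception {
    final StringBuilder result = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] text, int start, int length) {
        // Only text[start] through text[start+length-1] belong to
        // this call. new String(text) would copy the whole array,
        // which may hold unrelated parts of the document.
        result.append(text, start, length);
      }
    };
    SAXParserFactory.newInstance().newSAXParser().parse(
     new InputSource(new StringReader(xml)), handler);
    return result.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(textOf("<Name>Birdsong Clock</Name>"));
  }

}
```

The important point is that the array itself belongs to the parser. Anything you want to keep must be copied out of the indicated slice before characters() returns.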

On the flip side, when there’s a large amount of text between two tags with no intervening markup, the parser may choose to call characters() multiple times even though it doesn’t need to. Xerces generally won’t pass more than 16K of text in one call. Crimson is limited to about 8K of text per call. At the extreme, I have even seen a parser pass a single character at a time to the characters() method. You must not assume that the parser will pass you the maximum contiguous run of text in a single call to characters().
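One way to see this behavior for yourself is to record how much text arrives in each call. The following is a rough sketch, again assuming the JDK's bundled JAXP parser. ChunkCounter and chunkLengths() are hypothetical names. Many parsers report the text on either side of an entity reference in separate calls, but the exact number of calls is parser dependent, so no particular output is promised here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ChunkCounter {

  // Hypothetical helper: returns the length of each chunk of text
  // the parser passed to characters(), in the order received
  static List<Integer> chunkLengths(String xml) throws Exception {
    final List<Integer> lengths = new ArrayList<Integer>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] text, int start, int length) {
        lengths.add(length);
      }
    };
    SAXParserFactory.newInstance().newSAXParser().parse(
     new InputSource(new StringReader(xml)), handler);
    return lengths;
  }

  public static void main(String[] args) throws Exception {
    // One run of text with no tags inside it, only an entity reference
    List<Integer> lengths
     = chunkLengths("<Name>Birdsong &amp; Clock</Name>");
    System.out.println(lengths.size() + " call(s): " + lengths);
  }

}
```

Whatever the individual lengths turn out to be, they always add up to the full length of the text, which is all a handler is entitled to rely on.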

This can lead to some uncomfortable contortions when processing many documents. Given an element such as <Name>Birdsong Clock</Name>, you typically want to process the entire content as a unit. This requires you to set a boolean flag at the start-tag for the element in startElement(), accumulate the data into a buffer of some kind, often a StringBuffer, and only act on the data when you reach the end-tag for the element as signaled by the endElement() method.
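In outline, the flag-and-buffer pattern just described might look like the following sketch for the Name element. NameHandler and parseName() are hypothetical names used only for illustration, and the JDK's bundled JAXP parser is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NameHandler extends DefaultHandler {

  private boolean inName = false;
  private final StringBuilder buffer = new StringBuilder();
  String name = null;  // set only after the end-tag has been seen

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    if (qualifiedName.equals("Name")) {
      inName = true;        // raise the flag at the start-tag
      buffer.setLength(0);  // discard text from any earlier Name
    }
  }

  @Override
  public void characters(char[] text, int start, int length) {
    // May be called several times for one element; just accumulate
    if (inName) buffer.append(text, start, length);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    if (qualifiedName.equals("Name")) {
      name = buffer.toString();  // act on the complete text
      inName = false;
    }
  }

  // Hypothetical convenience method for trying the handler out
  static String parseName(String xml) throws Exception {
    NameHandler handler = new NameHandler();
    SAXParserFactory.newInstance().newSAXParser().parse(
     new InputSource(new StringReader(xml)), handler);
    return handler.name;
  }

}
```

Text belonging to other elements is ignored because the flag is false whenever the parser is outside a Name element.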

For an example, I’m going to revisit the Fibonacci XML-RPC client program from the last chapter. However, this time rather than printing the result on System.out, I’m going to collect the result and make it available as a BigInteger. Once again, this will require the ContentHandler to recognize the contents of the single double element in the response while ignoring everything else. Example 6.10 demonstrates.

Example 6.10. A SAX client for the Fibonacci XML-RPC server

import java.net.*;
import java.io.*;
import java.math.BigInteger;
import org.xml.sax.*;
import org.xml.sax.helpers.*;


public class NewFibonacciClient {

  public final static String DEFAULT_SERVER 
   = "http://www.elharo.com/fibonacci/XML-RPC";  

  public static BigInteger calculateFibonacci(int index, 
   String server) throws IOException, SAXException {

      // Connect to the server
      URL u = new URL(server);
      URLConnection uc = u.openConnection();
      HttpURLConnection connection = (HttpURLConnection) uc;
      connection.setDoOutput(true);
      connection.setDoInput(true); 
      connection.setRequestMethod("POST");
      OutputStream out = connection.getOutputStream();
      Writer wout = new OutputStreamWriter(out, "UTF-8");
      
      // Transmit the request XML document
      wout.write("<?xml version=\"1.0\"?>\r\n");  
      wout.write("<methodCall>\r\n"); 
      wout.write(
       "  <methodName>calculateFibonacci</methodName>\r\n");
      wout.write("  <params>\r\n"); 
      wout.write("    <param>\r\n"); 
      wout.write("      <value><int>" + index 
       + "</int></value>\r\n"); 
      wout.write("    </param>\r\n"); 
      wout.write("  </params>\r\n"); 
      wout.write("</methodCall>\r\n"); 
      
      wout.flush();
      wout.close();      

      // Read the response XML document
      XMLReader parser = XMLReaderFactory.createXMLReader(
        "org.apache.xerces.parsers.SAXParser"
      );
      FibonacciHandler handler = new FibonacciHandler();
      parser.setContentHandler(handler);
    
      InputStream in = connection.getInputStream();
      InputSource source = new InputSource(in);
      parser.parse(source);

      in.close();
      connection.disconnect();
      return handler.result;    
    
  }
   
  static class FibonacciHandler extends DefaultHandler {

    StringBuffer buffer = null;
    BigInteger result = null;
  
    public void startElement(String namespaceURI, 
     String localName, String qualifiedName, Attributes atts) {
    
      if (qualifiedName.equals("double")) {
        buffer = new StringBuffer();
      }
      
    }

    public void endElement(String namespaceURI, String localName,
     String qualifiedName) {
    
      if (qualifiedName.equals("double")) {
        String accumulatedText = buffer.toString();
        result = new BigInteger(accumulatedText);
        buffer = null;
      }
    
    }

    public void characters(char[] text, int start, int length)
     throws SAXException {

      if (buffer != null) {
        buffer.append(text, start, length); 
      }
   
    }
    
  }
    
  public static void main(String[] args) {
      
    int index;
    try {
      index = Integer.parseInt(args[0]);
    }
    catch (Exception e) {
      System.out.println(
       "Usage: java NewFibonacciClient number url"
      );
      return;
    }

    String server = DEFAULT_SERVER;
    if (args.length >= 2) server = args[1];
    
    try {
      BigInteger result = calculateFibonacci(index, server);
      System.out.println(result);
    }
    catch (Exception e) {
      e.printStackTrace(); 
    }
  
  } 

}

The return value is stored in a non-public BigInteger field named result. The value of this field only makes sense after the response has been received and parsed, so I hide the ContentHandler in a static nested class which is accessed through the static calculateFibonacci() method. Because ContentHandler methods often need to be called in a specific order from a certain context, the strategy of hiding them inside a non-public, possibly nested class is quite common. It’s not absolutely required, but it does make the class safer and the public interface much simpler.

What’s really new here is how the characters() method operates. Fibonacci numbers grow exponentially, without bound. For any given parser, there is some Fibonacci number large enough that it will not be delivered completely in a single call to characters(). Consequently, rather than simply storing a boolean that tells us whether we’re in the double element, we use a StringBuffer field. This is null outside the double element and non-null inside it. When it is non-null, the characters() method appends data to the buffer. That data is acted on (in this case, converted to a BigInteger) only when the end-tag is spotted and the endElement() method is invoked.

This general approach of accumulating data into a buffer and only acting on it after the last character of data has been seen is very common in SAX programs. Elements that contain mixed content are handled similarly. Elements that can recursively contain other elements with the same name (e.g. in XHTML a div can contain another div) are trickier, but can normally be handled by using a stack of element name flags rather than a single boolean flag. Indeed, stacks are often very convenient data structures when processing XML with SAX, as has been seen in earlier examples and as will be seen again before this chapter is done.
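As a rough sketch of the recursive case, the handler below keeps a stack with one entry per open div. This is a variation on the stack of flags: each entry is a buffer rather than a flag, so each div collects only the text that appears directly inside it. DivHandler and divTexts() are hypothetical names, and the JDK's bundled JAXP parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DivHandler extends DefaultHandler {

  // One buffer per div that is currently open; innermost on top
  private final Deque<StringBuilder> open
   = new ArrayDeque<StringBuilder>();
  // Text of each div, in the order the end-tags were seen
  final List<String> finished = new ArrayList<String>();

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    if (qualifiedName.equals("div")) open.push(new StringBuilder());
  }

  @Override
  public void characters(char[] text, int start, int length) {
    // Text goes to the innermost open div, if there is one
    if (!open.isEmpty()) open.peek().append(text, start, length);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    if (qualifiedName.equals("div")) {
      finished.add(open.pop().toString());
    }
  }

  // Hypothetical convenience method for trying the handler out
  static List<String> divTexts(String xml) throws Exception {
    DivHandler handler = new DivHandler();
    SAXParserFactory.newInstance().newSAXParser().parse(
     new InputSource(new StringReader(xml)), handler);
    return handler.finished;
  }

  public static void main(String[] args) throws Exception {
    // Prints [b, ac]: the inner div finishes first
    System.out.println(divTexts("<div>a<div>b</div>c</div>"));
  }

}
```

A single boolean flag would fail here because the end-tag of the inner div would switch it off while the outer div was still open. The depth of the stack records how many div elements are open at any moment.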


Copyright 2001, 2002 Elliotte Rusty Harold, elharo@metalab.unc.edu. Last Modified February 07, 2002