The XMLFilter interface


Example 8.1 shows the actual code for the XMLFilter interface. Besides the methods it inherits from the XMLReader superinterface, XMLFilter has just two new methods, getParent() and setParent(). The parent of a filter is the XMLReader to which the filter delegates most of its work. (In the context of SAX filters, the parent is not normally understood to be the superclass of the filter class.)

Example 8.1. The XMLFilter interface

package org.xml.sax;

public interface XMLFilter extends XMLReader {

  public void      setParent(XMLReader parent);
  public XMLReader getParent();

}
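
The listing declares only two methods, but an implementer also has to supply everything inherited from XMLReader. As a quick sanity check, the following sketch asks the two interfaces themselves through reflection. The class name CountFilterMethods and its helper method are my own inventions for illustration, not part of SAX.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;

public class CountFilterMethods {

  // Counts the abstract methods an implementer of the given
  // interface must supply, including inherited ones.
  // (Hypothetical helper, not part of the SAX API.)
  static int abstractMethods(Class<?> type) {
    int count = 0;
    for (Method m : type.getMethods()) {
      if (Modifier.isAbstract(m.getModifiers())) count++;
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("Declared by XMLFilter itself: "
     + XMLFilter.class.getDeclaredMethods().length);
    System.out.println("Abstract methods in XMLReader: "
     + abstractMethods(XMLReader.class));
    System.out.println("Abstract methods in XMLFilter: "
     + abstractMethods(XMLFilter.class));
  }

}
```

Counting only abstract methods keeps the check meaningful even if a later version of an interface were to gain default methods that implementers do not have to write.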

A class that implements this interface must provide a minimum of 16 methods: the getParent() and setParent() methods declared here, plus the 14 methods of the XMLReader superinterface. Example 8.2 is a minimal XML filter that implements all of these methods but doesn’t actually do anything.

Example 8.2. A filter that blocks all events

import org.xml.sax.*;


public class OpaqueFilter implements XMLFilter {

  private XMLReader parent;
  
  public void setParent(XMLReader parent) {
    this.parent = parent;
  }
  
  public XMLReader getParent() {
    return this.parent; 
  }

  public boolean getFeature(String name)
   throws SAXNotRecognizedException {
    throw new SAXNotRecognizedException(name);
  } 

  public void setFeature(String name, boolean value) 
   throws SAXNotRecognizedException { 
    throw new SAXNotRecognizedException(name);
  }

  public Object getProperty(String name) 
   throws SAXNotRecognizedException {
    throw new SAXNotRecognizedException(name);
  }


  public void setProperty(String name, Object value)
   throws SAXNotRecognizedException {
    throw new SAXNotRecognizedException(name);
  }

  public void setEntityResolver(EntityResolver resolver) {}
  public EntityResolver getEntityResolver() {
    return null; 
  }
  
  public void setDTDHandler(DTDHandler handler) {}
  public DTDHandler getDTDHandler() {
    return null; 
  }

  public void setContentHandler(ContentHandler handler) {}
  public ContentHandler getContentHandler() {
    return null; 
  }

  public void setErrorHandler(ErrorHandler handler) {}
  public ErrorHandler getErrorHandler() {
    return null; 
  }

  public void parse(InputSource input) {}
  public void parse(String systemID) {} 
  
}

The effect of attaching this filter to a parser is to totally block events. It’s like a brick wall between the application and the data in the XML document. A client application that wants to use this filter (though I really can’t imagine why one would) would construct an instance of it and an instance of a real parser, then pass the real parser to the filter’s setParent() method.

  XMLReader parser = XMLReaderFactory.createXMLReader();
  OpaqueFilter filter = new OpaqueFilter();
  filter.setParent(parser);
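
To make the brick wall concrete, here is a minimal sketch that counts the startElement() calls a client receives, first from a bare parser and then through a blocking filter. Blocker is a condensed stand-in for Example 8.2 so that the sketch compiles on its own; the names OpaqueDemo, Blocker, and countElements() are hypothetical, and the test document is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class OpaqueDemo {

  // Condensed stand-in for Example 8.2. It remembers its parent
  // but never calls it, so no event can reach the client.
  static class Blocker implements XMLFilter {
    private XMLReader parent;
    public void setParent(XMLReader parent) {
      this.parent = parent;
    }
    public XMLReader getParent() { return parent; }
    public boolean getFeature(String name)
     throws SAXNotRecognizedException {
      throw new SAXNotRecognizedException(name);
    }
    public void setFeature(String name, boolean value)
     throws SAXNotRecognizedException {
      throw new SAXNotRecognizedException(name);
    }
    public Object getProperty(String name)
     throws SAXNotRecognizedException {
      throw new SAXNotRecognizedException(name);
    }
    public void setProperty(String name, Object value)
     throws SAXNotRecognizedException {
      throw new SAXNotRecognizedException(name);
    }
    public void setEntityResolver(EntityResolver r) {}
    public EntityResolver getEntityResolver() { return null; }
    public void setDTDHandler(DTDHandler h) {}
    public DTDHandler getDTDHandler() { return null; }
    public void setContentHandler(ContentHandler h) {}
    public ContentHandler getContentHandler() { return null; }
    public void setErrorHandler(ErrorHandler h) {}
    public ErrorHandler getErrorHandler() { return null; }
    public void parse(InputSource input) {}
    public void parse(String systemID) {}
  }

  // Counts the startElement() calls delivered by reader
  static int countElements(XMLReader reader, String xml)
   throws SAXException, IOException {
    final int[] count = {0};
    reader.setContentHandler(new DefaultHandler() {
      public void startElement(String uri, String localName,
       String qName, Attributes atts) {
        count[0]++;
      }
    });
    reader.parse(new InputSource(new StringReader(xml)));
    return count[0];
  }

  public static void main(String[] args)
   throws SAXException, IOException {
    String xml = "<root><child/><child/></root>";
    XMLReader parser = XMLReaderFactory.createXMLReader();
    System.out.println("Parser alone:   "
     + countElements(parser, xml));
    XMLFilter filter = new Blocker();
    filter.setParent(parser);
    System.out.println("Through filter: "
     + countElements(filter, xml));
  }

}
```

The handler is installed on whichever XMLReader the client holds. Because the blocking filter discards the handler and ignores parse(), the parent parser is never even asked to read the document.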

From this point forward, the client application should only interact with the filter. It should forget that the original parser exists. Going behind the back of the filter, for instance, by calling setContentHandler() on parser instead of on filter, runs the risk of confusing the filter by violating constraints it expects to be true. In fact, if at all possible, you should eliminate any references to the original parser so that you can’t accidentally access it later. For example,

  XMLReader parser = XMLReaderFactory.createXMLReader();
  OpaqueFilter filter = new OpaqueFilter();
  filter.setParent(parser);
  parser = filter;

In some cases the filter may set up its own parent parser, typically in its constructor. This avoids the need for the client application to provide an XMLReader to the filter. For example,

  public OpaqueFilter(XMLReader parent) {
    this.parent = parent;
  }
  
  public OpaqueFilter() throws SAXException {
    this(XMLReaderFactory.createXMLReader());
  }

You might even design the setParent() method so it’s impossible to change the parent parser once it’s initially been set in the constructor. For example,

  public void setParent(XMLReader parent) {
    throw new UnsupportedOperationException( 
     "Can’t change this filter’s parent"
    );
  }

However, this does tend to limit the flexibility of the filter. In particular, it prevents you from putting it in the middle of a long chain of filters.

Example 8.3 is a marginally more interesting implementation of the XMLFilter interface. It delegates to the parent XMLReader by forwarding all method calls from the client application. It does not change or filter anything.

Example 8.3. A filter that filters nothing

import org.xml.sax.*;
import java.io.IOException;


public class TransparentFilter implements XMLFilter {

  private XMLReader parent;
  
  public void setParent(XMLReader parent) {
    this.parent = parent;
  }
  
  public XMLReader getParent() {
    return this.parent; 
  }

  public boolean getFeature(String name)
   throws SAXNotRecognizedException, SAXNotSupportedException {
    return parent.getFeature(name);
  } 

  public void setFeature(String name, boolean value) 
   throws SAXNotRecognizedException, SAXNotSupportedException { 
    parent.setFeature(name, value);
  }

  public Object getProperty(String name) 
   throws SAXNotRecognizedException, SAXNotSupportedException {
    return parent.getProperty(name);
  }


  public void setProperty(String name, Object value)
   throws SAXNotRecognizedException, SAXNotSupportedException {
    parent.setProperty(name, value);
  }

  public void setEntityResolver(EntityResolver resolver) {
    parent.setEntityResolver(resolver);
  }
  
  public EntityResolver getEntityResolver() {
    return parent.getEntityResolver();
  }
  
  public void setDTDHandler(DTDHandler handler) {
    parent.setDTDHandler(handler);
  }
  
  public DTDHandler getDTDHandler() {
    return parent.getDTDHandler();
  }

  public void setContentHandler(ContentHandler handler) {
    parent.setContentHandler(handler);  
  }
  
  public ContentHandler getContentHandler() {
    return parent.getContentHandler();
  }

  public void setErrorHandler(ErrorHandler handler) {
    parent.setErrorHandler(handler);
  }
  
  public ErrorHandler getErrorHandler() {
    return parent.getErrorHandler();
  }

  public void parse(InputSource input)
   throws SAXException, IOException {
    parent.parse(input);
  }
  
  public void parse(String systemId) 
   throws SAXException, IOException {
    parent.parse(systemId);
  } 
  
}
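
Because a forwarding filter is itself an XMLReader, it can serve as the parent of another filter, which is how chains are built. The following is a minimal sketch of that idea, assuming a condensed copy of Example 8.3 named PassThrough so the code stands alone. ChainDemo, buildChain(), and countElements() are hypothetical names, and the test document is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ChainDemo {

  // Condensed version of Example 8.3: forwards every call
  static class PassThrough implements XMLFilter {
    private XMLReader parent;
    public void setParent(XMLReader p) { parent = p; }
    public XMLReader getParent() { return parent; }
    public boolean getFeature(String name)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      return parent.getFeature(name);
    }
    public void setFeature(String name, boolean value)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      parent.setFeature(name, value);
    }
    public Object getProperty(String name)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      return parent.getProperty(name);
    }
    public void setProperty(String name, Object value)
     throws SAXNotRecognizedException, SAXNotSupportedException {
      parent.setProperty(name, value);
    }
    public void setEntityResolver(EntityResolver r) {
      parent.setEntityResolver(r);
    }
    public EntityResolver getEntityResolver() {
      return parent.getEntityResolver();
    }
    public void setDTDHandler(DTDHandler h) {
      parent.setDTDHandler(h);
    }
    public DTDHandler getDTDHandler() {
      return parent.getDTDHandler();
    }
    public void setContentHandler(ContentHandler h) {
      parent.setContentHandler(h);
    }
    public ContentHandler getContentHandler() {
      return parent.getContentHandler();
    }
    public void setErrorHandler(ErrorHandler h) {
      parent.setErrorHandler(h);
    }
    public ErrorHandler getErrorHandler() {
      return parent.getErrorHandler();
    }
    public void parse(InputSource input)
     throws SAXException, IOException {
      parent.parse(input);
    }
    public void parse(String systemId)
     throws SAXException, IOException {
      parent.parse(systemId);
    }
  }

  // Wraps a real parser in the requested number of filters
  static XMLReader buildChain(int depth) throws SAXException {
    XMLReader reader = XMLReaderFactory.createXMLReader();
    for (int i = 0; i < depth; i++) {
      XMLFilter filter = new PassThrough();
      filter.setParent(reader);
      reader = filter; // keep only the outermost reference
    }
    return reader;
  }

  // Counts the startElement() calls delivered by reader
  static int countElements(XMLReader reader, String xml)
   throws SAXException, IOException {
    final int[] count = {0};
    reader.setContentHandler(new DefaultHandler() {
      public void startElement(String uri, String localName,
       String qName, Attributes atts) {
        count[0]++;
      }
    });
    reader.parse(new InputSource(new StringReader(xml)));
    return count[0];
  }

  public static void main(String[] args)
   throws SAXException, IOException {
    String xml = "<root><child/><child/></root>";
    System.out.println("No filters:    "
     + countElements(buildChain(0), xml));
    System.out.println("Three filters: "
     + countElements(buildChain(3), xml));
  }

}
```

The handler set on the outermost filter is handed down link by link until it reaches the real parser, so the client should see the same events however many pass-through filters sit in between.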

Of course, in most cases you’re not going to go to either of these extremes. You’re going to pass some events through unchanged, block others, and modify still others. Let’s continue with a filter that adds a property to the list of those normally supported by an XML parser. This property will provide the wall-clock time needed to parse an XML document, and might be useful for benchmarking. I’ll write it as a filter so that it can be attached to different underlying parsers and used in benchmarks that include various content handlers.

The property name will be http://cafeconleche.org/properties/wallclock/. The value will be a java.lang.Long object containing the number of milliseconds needed to parse the last document. This can be stored in a private field initialized to null:

  private Long wallclock = null;

The wallclock time is available only after the parse() method has returned. At other times, requesting this property throws a SAXNotSupportedException. This will be a read-only property so trying to set it will always throw a SAXNotSupportedException. This will be implemented through the setProperty() and getProperty() methods:

  public Object getProperty(String name) 
   throws SAXNotRecognizedException, SAXNotSupportedException {
     
    if ("http://cafeconleche.org/properties/wallclock/"
     .equals(name)) {
      if (wallclock != null) {
        return wallclock;
      }
      else {
        throw 
         new SAXNotSupportedException("Timing not available");
      }
    }
    return parent.getProperty(name);
    
  }

  public void setProperty(String name, Object value)
   throws SAXNotRecognizedException, SAXNotSupportedException {
     
    if ("http://cafeconleche.org/properties/wallclock/"
     .equals(name)) {
      throw new SAXNotSupportedException(
       "Wallclock property is read-only");
    }
    parent.setProperty(name, value);
    
  }

For any property other than http://cafeconleche.org/properties/wallclock/, these calls just delegate the work to the parent parser.

The parse() method is responsible for tracking the wallclock time. I’ll put the work in the parse() method that takes an InputSource as an argument and call this method from the other overloaded parse() method that takes a system ID as the argument.

Timing inside the filter makes room for some standard benchmarking techniques. First, I’ll read the entire document into a byte array named cache so that it can be parsed from memory. This eliminates most of the I/O time, which would otherwise be likely to swamp the actual parsing time, especially if the test document were read from a slow network connection. Caching requires separate handling for each of the three possible sources an InputSource may offer: character stream, byte stream, and system ID:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Reader charStream = input.getCharacterStream();
    InputStream byteStream = input.getByteStream();
    
    String encoding = null; // I will only set this variable if 
                            // we have a reader because in this
                            // case we know the encoding is UTF-8
                            // regardless of what the encoding
                            // declaration says
    if (charStream != null) {
      OutputStreamWriter filter 
       = new OutputStreamWriter(out, "UTF-8");
      int c;
      while ((c = charStream.read()) != -1) filter.write(c);
      // Push any bytes still buffered in the writer into out
      filter.flush();
      encoding = "UTF-8";
    }
    else if (byteStream != null) {
      int c;
      while ((c = byteStream.read()) != -1) out.write(c);
    }
    else {
      URL u = new URL(input.getSystemId());
      InputStream in = u.openStream();
      int c;
      while ((c = in.read()) != -1) out.write(c);
      in.close();
    }
    out.flush();
    out.close();
    byte[] cache = out.toByteArray();
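
The loops above copy one byte or character at a time, which is simple but slow for large documents. As an alternative, here is a minimal sketch of the same caching step that copies in blocks. DocumentCache and its two cache() overloads are hypothetical helpers of my own, and the 8192-element buffer size is an arbitrary choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class DocumentCache {

  // Copies a byte stream into memory a block at a time
  static byte[] cache(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) out.write(buffer, 0, n);
    return out.toByteArray();
  }

  // Re-encodes a character stream as UTF-8 bytes in memory
  static byte[] cache(Reader in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Writer writer
     = new OutputStreamWriter(out, StandardCharsets.UTF_8);
    char[] buffer = new char[8192];
    int n;
    while ((n = in.read(buffer)) != -1) writer.write(buffer, 0, n);
    // The writer buffers encoded bytes internally, so flush
    // before reading the result out of the byte array stream
    writer.flush();
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] fromBytes = cache(
     new ByteArrayInputStream(new byte[] {60, 97, 47, 62}));
    byte[] fromChars = cache(new StringReader("<a/>"));
    System.out.println(fromBytes.length + " bytes, "
     + fromChars.length + " bytes");
  }

}
```

Either overload produces a byte array that can be wrapped in a fresh ByteArrayInputStream for every parse, exactly as the cache array is used in the rest of this section.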

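Next, I’ll warm up the JIT with ten untimed parses of the document before I begin taking measurements. Each pass wraps the cached bytes in a fresh stream and hands it to the parent parser through a new InputSource named is:

    InputSource is = new InputSource();
    if (encoding != null) is.setEncoding(encoding);
    // Keep the base URL so relative references still resolve
    is.setSystemId(input.getSystemId());
    for (int i=0; i < 10; i++) {
      InputStream in = new ByteArrayInputStream(cache);
      is.setByteStream(in); 
      parent.parse(is);
    }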

Finally, I’ll parse the same document 1000 times and set wallclock to the average time per parse, truncated to whole milliseconds:

    Date start = new Date();
    for (int i=0; i < 1000; i++) {
      InputStream in = new ByteArrayInputStream(cache);
      is.setByteStream(in);
      parent.parse(is); 
    }
    Date finish = new Date();
    long totalTime = finish.getTime() - start.getTime();
    
    // Average the time
    this.wallclock = new Long(totalTime/1000);
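
Two details of this measurement are worth noting. Date only resolves to milliseconds, and totalTime/1000 is integer division, so a parser that averages less than a millisecond per document would report 0. A minimal sketch of the same warm-up-then-average pattern using System.nanoTime() and a fractional result follows. This is a swapped-in technique rather than what the filter in this chapter does, and AverageTimer and averageMillis() are hypothetical names.

```java
public class AverageTimer {

  // Runs task `warmups` times untimed, then `runs` times timed,
  // and returns the mean duration of one timed run in
  // milliseconds (fractional, so short runs do not round to 0).
  static double averageMillis(Runnable task, int warmups,
   int runs) {
    for (int i = 0; i < warmups; i++) task.run();
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) task.run();
    long elapsed = System.nanoTime() - start;
    return elapsed / 1000000.0 / runs;
  }

  public static void main(String[] args) {
    // Stand-in workload; a real benchmark would parse here
    Runnable work = new Runnable() {
      public void run() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) sb.append(i);
      }
    };
    double ms = averageMillis(work, 10, 1000);
    System.out.println("Average: " + ms + " ms");
  }

}
```

In the filter, the body of the task would be a single parent.parse(is) call on a fresh ByteArrayInputStream, with its checked exceptions handled inside the task.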

Example 8.4 shows the complete benchmarking filter. Besides the previously described methods, it contains implementations of the other XMLReader methods that all forward their arguments to the equivalent method in the parent parser.

Example 8.4. A filter that times all parsing

import org.xml.sax.*;
import java.io.*;
import java.util.Date;
import java.net.URL;


public class WallclockFilter implements XMLFilter {

  private XMLReader parent;
  private Long wallclock = null;
  
  public Object getProperty(String name) 
   throws SAXNotRecognizedException, SAXNotSupportedException {
     
    if ("http://cafeconleche.org/properties/wallclock/"
     .equals(name)) {
      if (wallclock != null) {
        return wallclock;
      }
      else {
        throw 
         new SAXNotSupportedException("Timing not available");
      }
    }
    return parent.getProperty(name);
    
  }

  public void setProperty(String name, Object value)
   throws SAXNotRecognizedException, SAXNotSupportedException {
     
    if ("http://cafeconleche.org/properties/wallclock/"
     .equals(name)) {
      throw new SAXNotSupportedException(
       "Wallclock property is read-only");
    }
    parent.setProperty(name, value);
    
  }

  public void setParent(XMLReader parent) {
    this.parent = parent;
  }
  
  public XMLReader getParent() {
    return this.parent; 
  }

  public void parse(InputSource input)
   throws SAXException, IOException {
     
    //Reset the time
    this.wallclock = null;
     
    // Cache the document 
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Reader charStream = input.getCharacterStream();
    InputStream byteStream = input.getByteStream();
    
    String encoding = null; // I will only set this variable if 
                            // we have a reader because in this
                            // case we know the encoding is UTF-8
                            // regardless of what the encoding
                            // declaration says
    if (charStream != null) {
      OutputStreamWriter filter 
       = new OutputStreamWriter(out, "UTF-8");
      int c;
      while ((c = charStream.read()) != -1) filter.write(c);
      // Push any bytes still buffered in the writer into out
      filter.flush();
      encoding = "UTF-8";
    }
    else if (byteStream != null) {
      int c;
      while ((c = byteStream.read()) != -1) out.write(c);
    }
    else {
      URL u = new URL(input.getSystemId());
      InputStream in = u.openStream();
      int c;
      while ((c = in.read()) != -1) out.write(c);
      in.close();
    }
    out.flush();
    out.close();
    byte[] cache = out.toByteArray();
     
    InputSource is = new InputSource(); 
    if (encoding != null) is.setEncoding(encoding);
    // Keep the base URL so relative references still resolve
    is.setSystemId(input.getSystemId());
     
    // Warm up the JIT
    for (int i=0; i < 10; i++) {
      InputStream in = new ByteArrayInputStream(cache);
      is.setByteStream(in); 
      parent.parse(is);
    }
    System.gc();
    
    // Parse 1000 times 
    Date start = new Date();
    for (int i=0; i < 1000; i++) {
      InputStream in = new ByteArrayInputStream(cache);
      is.setByteStream(in);
      parent.parse(is); 
    }
    Date finish = new Date();
    long totalTime = finish.getTime() - start.getTime();
    
    // Average the time
    this.wallclock = new Long(totalTime/1000);
    
  }
  
  public void parse(String systemID) 
   throws SAXException, IOException {
    this.parse(new InputSource(systemID));
  }
 
  // Methods that delegate to the parent XMLReader
  public boolean getFeature(String name)
   throws SAXNotRecognizedException, SAXNotSupportedException {
    return parent.getFeature(name);
  } 

  public void setFeature(String name, boolean value) 
   throws SAXNotRecognizedException, SAXNotSupportedException { 
    parent.setFeature(name, value);
  }

  public void setEntityResolver(EntityResolver resolver) {
    parent.setEntityResolver(resolver);
  }
  
  public EntityResolver getEntityResolver() {
    return parent.getEntityResolver();
  }
  
  public void setDTDHandler(DTDHandler handler) {
    parent.setDTDHandler(handler);
  }
  
  public DTDHandler getDTDHandler() {
    return parent.getDTDHandler();
  }

  public void setContentHandler(ContentHandler handler) {
    parent.setContentHandler(handler);  
  }
  
  public ContentHandler getContentHandler() {
    return parent.getContentHandler();
  }

  public void setErrorHandler(ErrorHandler handler) {
    parent.setErrorHandler(handler);
  }
  
  public ErrorHandler getErrorHandler() {
    return parent.getErrorHandler();
  }  
 
}

We still need a driver class that constructs a filter XMLReader and a normal parser XMLReader, connects them to each other, and parses the test document. Example 8.5 is such a class. It contains a simple main() method that benchmarks a document named on the command line. No handlers are installed, so it tests raw parsing time. If I wanted to test the behavior of different parsers with various callback interfaces, I could install them on the filter before parsing. After parsing, WallclockDriver reads the value of the wallclock property from the filter. The parser tested can be changed by setting different values for the org.xml.sax.driver system property.

Example 8.5. Parsing a document through a filter

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class WallclockDriver {

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java WallclockDriver URL");
      return;
    }
    String document = args[0];
    
    try {
      XMLFilter filter = new WallclockFilter();
      filter.setParent(XMLReaderFactory.createXMLReader());
      filter.parse(document);
      Long parseTime = (Long) filter.getProperty(
       "http://cafeconleche.org/properties/wallclock/");
      double seconds = parseTime.longValue()/1000.0;
      System.out.println("Parsing " + document + " took "
       + seconds + " seconds on average.");
    }
    catch (SAXException e) {
      e.printStackTrace();
      System.out.println(e);
    }
    catch (IOException e) {
      e.printStackTrace();
      System.out.println(
       "Due to an IOException, the parser could not check "
       + args[0]
      );
    }
    
  }
  
}

I ran the XML form of the second edition of the XML 1.0 specification through this program with a few different parsers using Sun’s Java Runtime Environment 1.3.1 on my 300 MHz Pentium II running Windows NT 4.0 SP6. This isn’t a scientific test (at a minimum, that would require testing many different documents on multiple virtual and physical machines, considering pauses that might be caused by garbage collection, making sure background processes were kept to a minimum, and taking multiple measurements to test reproducibility of the results), but the output’s nonetheless mildly interesting:

% java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader
  WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml 
 took 1.209 seconds on average.
% java -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl
  WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml 
 took 1.414 seconds on average.
% java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
  WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml 
 took 1.133 seconds on average.
% java -Dorg.xml.sax.driver=com.bluecast.xml.Piccolo
  WallclockDriver http://www.w3.org/TR/2000/REC-xml-20001006.xml
Parsing http://www.w3.org/TR/2000/REC-xml-20001006.xml 
 took 0.849 seconds on average.

The four parsers I tested here were all fairly close to each other in raw performance. In fact, given that I’m testing the wallclock time instead of the actual time used by this program alone, I’d venture that the differences are all within the margin of error for the test. Of course, when choosing a parser, you’d want to run this across your own documents with your own content handlers in place.

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified May 26, 2002