Features and Properties

SAX parsers, that is, XMLReader objects, are configured by setting features and properties. A feature has a boolean true/false value. A property has an object value. Both features and properties are named by absolute URIs. This allows just a handful of standard methods to support an arbitrary number of standard and non-standard features and properties of various types.

Features and properties can be read-only, write-only (rare), or read-write. If you attempt to change a read-only feature or property, a SAXNotSupportedException, a subclass of SAXException, is thrown. The accessibility of a feature or property can change depending on whether or not the XMLReader is currently parsing a document. For example, you can turn validation on or off before or after parsing a document, but not while the XMLReader is parsing a document.

Getting and Setting Features

The XMLReader interface provides these two methods to turn features on and off:

public void setFeature(String name, boolean value)
    throws SAXNotRecognizedException, SAXNotSupportedException;

public boolean getFeature(String name)
    throws SAXNotRecognizedException, SAXNotSupportedException;

The first argument is the name of the feature to set or get. Feature names are absolute URIs. Standard features that are supported by multiple parsers have names that begin with http://xml.org/sax/features/. For example, this next code fragment checks to see if the XMLReader object parser is currently validating; and, if it isn’t, turns on validation by setting the feature http://xml.org/sax/features/validation to true.

if (!parser.getFeature("http://xml.org/sax/features/validation")) {
  parser.setFeature("http://xml.org/sax/features/validation", true);
}

However, different parsers also support non-standard, custom features. The names of these features begin with URLs somewhere in the parser vendor’s domain. For example, non-standard features of the Xerces parser from the XML Apache Project begin with http://apache.org/xml/features/.

If the XMLReader object can never access the feature you’re trying to get or set, getFeature() or setFeature() throws a SAXNotRecognizedException. On the other hand, if you try to get or set a feature that the parser recognizes but cannot access at the current time, the method throws a SAXNotSupportedException. Both are subclasses of SAXException. For example, if parser is a non-validating parser like gnu.xml.aelfred2.SAXDriver, then the above code would throw a SAXNotRecognizedException. However, if parser is a validating parser like Xerces but the setFeature() method were invoked while it was parsing a document, then it would throw a SAXNotSupportedException because you can’t turn on validation halfway through a document. Since these are checked exceptions, you’ll need to either catch them or declare that your method throws them. For example,

try {
  if (!parser.getFeature("http://xml.org/sax/features/validation")) {
    parser.setFeature("http://xml.org/sax/features/validation", true);
  }
}
catch (SAXNotRecognizedException e) {
  System.out.println(parser + " is not a validating parser.");
}
catch (SAXNotSupportedException e) {
  System.out.println(
   "Cannot turn on validation right now. Try again later."
  );
}
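
The same two exceptions can be folded into a small helper that reports the state of any feature without interrupting the program. What follows is only a sketch: the class name FeatureProbe, its method names, and the made-up feature URI are illustrative, and it assumes the JAXP SAXParserFactory is used to obtain the XMLReader (XMLReaderFactory.createXMLReader() would serve just as well).

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class FeatureProbe {

  // Returns "true" or "false" for a readable feature, or a short
  // explanation when the parser refuses to report it.
  public static String describe(XMLReader parser, String name) {
    try {
      return String.valueOf(parser.getFeature(name));
    }
    catch (SAXNotRecognizedException e) {
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      return "not supported right now";
    }
  }

  // Assumption: the JAXP factory is used to locate a parser.
  public static XMLReader newReader() throws Exception {
    return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
  }

  public static void main(String[] args) throws Exception {
    XMLReader parser = newReader();
    String[] names = {
      "http://xml.org/sax/features/validation",
      "http://xml.org/sax/features/namespaces",
      "http://example.com/features/made-up"  // hypothetical name
    };
    for (String name : names) {
      System.out.println(name + ": " + describe(parser, name));
    }
  }

}
```

Returning a string rather than rethrowing keeps the probe usable in a loop over many feature names, which is handy when you are trying to find out what an unfamiliar parser supports.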

Getting and Setting Properties

The XMLReader interface uses these two methods to set and get the values of properties:

public void setProperty(String name, Object value)
    throws SAXNotRecognizedException, SAXNotSupportedException;

public Object getProperty(String name)
    throws SAXNotRecognizedException, SAXNotSupportedException;

Properties are named by absolute URIs, just like features. Standard properties have names that begin with http://xml.org/sax/properties/ such as http://xml.org/sax/properties/declaration-handler and http://xml.org/sax/properties/xml-string. However, most parsers also support some non-standard, custom properties. The names of these will begin with URLs somewhere in the parser vendor’s domain. For example, non-standard properties of the Xerces parser from the XML Apache Project begin with http://apache.org/xml/properties/, for instance http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation.

The value of a property is an object. The type of the object depends on which property it is. For instance, the value of the http://xml.org/sax/properties/declaration-handler property is an org.xml.sax.ext.DeclHandler while the value of the http://xml.org/sax/properties/xml-string property is a java.lang.String. Passing an object of the wrong type for the property to setProperty() results in a SAXNotSupportedException.

For example, suppose you’re using Xerces and you want to set the schema location for elements that are not in any namespace to http://www.example.com/schema.xsd. This code fragment accomplishes that:

try {
  parser.setProperty("http://apache.org/xml/properties/schema/"
   + "external-noNamespaceSchemaLocation", 
   "http://www.example.com/schema.xsd");
}
catch (SAXNotRecognizedException e) {
  System.out.println(parser
   + " is not a schema-validating parser.");
}
catch (SAXNotSupportedException e) {
  System.out.println(
   "Cannot change the schema right now. Try again later."
  );
}
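
The rule that a value of the wrong type leads to a SAXNotSupportedException can be checked with a short experiment. This is a sketch under assumptions: the class PropertyTypeCheck and its trySet() method are invented for illustration, the parser is obtained through the JAXP SAXParserFactory, and the result depends on the parser actually recognizing the lexical-handler property (the parser bundled with the JDK does, as far as I can tell).

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class PropertyTypeCheck {

  static final String LEXICAL_HANDLER =
   "http://xml.org/sax/properties/lexical-handler";

  // Attempts to set a property and reports how the parser reacted.
  public static String trySet(XMLReader parser, String name,
   Object value) {
    try {
      parser.setProperty(name, value);
      return "accepted";
    }
    catch (SAXNotRecognizedException e) {
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      return "not supported";
    }
  }

  public static void main(String[] args) throws Exception {
    XMLReader parser =
     SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    // DefaultHandler2 implements LexicalHandler, so it is the right type
    System.out.println("handler object: "
     + trySet(parser, LEXICAL_HANDLER, new DefaultHandler2()));
    // A String is the wrong type for this property
    System.out.println("plain string:   "
     + trySet(parser, LEXICAL_HANDLER, "not a handler"));
  }

}
```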

Required Features

There are only a couple of features which all SAX parsers must support and no absolutely required properties. The two required features are:

  • http://xml.org/sax/features/namespaces

  • http://xml.org/sax/features/namespace-prefixes

The http://xml.org/sax/features/namespaces feature determines whether namespace URIs and local names are passed to startElement() and endElement(). The default, true, passes both namespace URIs and local names. However, if http://xml.org/sax/features/namespaces is false, then the parser may pass the namespace URI and the local name, or it may just pass empty strings for these two arguments. There’s not a lot of reason to change the default. (You can always ignore the URI and local name if you don’t need them.)

The http://xml.org/sax/features/namespace-prefixes feature determines two things:

  • Whether or not namespace declaration xmlns and xmlns:prefix attributes are included in the Attributes list passed to startElement(). The default, false, is not to include them.

  • Whether or not the qualified names should be passed as the third argument to the startElement() method. The default, false, is not to require qualified names. However, even if http://xml.org/sax/features/namespace-prefixes is false, parsers are allowed to report the qualified name, and most do so.

For example, consider this start-tag:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://www.purl.org/dc/" id="R1">

If http://xml.org/sax/features/namespace-prefixes is false and http://xml.org/sax/features/namespaces is true, then when a SAX parser reads this tag it may invoke the startElement() method in its registered ContentHandler object with these arguments:

startElement(
  namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  localName = "RDF",
  qualifiedName="",
  attributes={id="R1"}
)

Alternately, it can choose to provide the qualified name even though it isn’t required to:

startElement(
  namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  localName = "RDF",
  qualifiedName="rdf:RDF",
  attributes={id="R1"}
)

However, if http://xml.org/sax/features/namespace-prefixes is true and http://xml.org/sax/features/namespaces is true, then when a SAX parser reads this tag it invokes the startElement() method like this:

startElement(
  namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  localName = "RDF",
  qualifiedName="rdf:RDF",
  attributes={id="R1", xmlns:dc="http://www.purl.org/dc/",
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
)

If http://xml.org/sax/features/namespace-prefixes is true and http://xml.org/sax/features/namespaces is false, then when a SAX parser reads this tag it may invoke the startElement() method like this:

startElement(
  namespaceURI="",
  localName = "",
  qualifiedName="rdf:RDF",
  attributes={id="R1", xmlns:dc="http://www.purl.org/dc/",
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
)

Then again it may provide the namespace URI and local name anyway, even though it doesn’t have to:

startElement(
  namespaceURI="http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  localName = "RDF",
  qualifiedName="rdf:RDF",
  attributes={id="R1", xmlns:dc="http://www.purl.org/dc/",
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"}
)

In other words,

  • The parser is only guaranteed to provide the namespace URIs and local names of elements and attributes if http://xml.org/sax/features/namespaces is true (which it is by default).

  • The parser is only guaranteed to provide the qualified names of elements and attributes if http://xml.org/sax/features/namespace-prefixes is true (which it is not by default).

  • The parser provides namespace declaration attributes if and only if http://xml.org/sax/features/namespace-prefixes is true (which it is not by default).

  • The parser always has the option to provide the namespace URI, local name, and qualified name, regardless of the values of http://xml.org/sax/features/namespaces and http://xml.org/sax/features/namespace-prefixes. However, you should not rely on this behavior.

To summarize, the defaults are fine as long as you don’t care about namespace prefixes, only local names and URIs.
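
To see what a particular parser really does with these combinations, you can feed it the start-tag from above and record what arrives in startElement(). The following is a minimal sketch, not a definitive test: the class NamespaceFeatureDemo is invented for illustration, the start-tag has been closed as an empty element so that it forms a complete document, and the parser is assumed to come from the JAXP SAXParserFactory. Which optional values a parser supplies is its own choice, so the output can differ from one parser to the next.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NamespaceFeatureDemo extends DefaultHandler {

  static final String NAMESPACES =
   "http://xml.org/sax/features/namespaces";
  static final String PREFIXES =
   "http://xml.org/sax/features/namespace-prefixes";

  // The start-tag from the text, closed so it is a whole document
  static final String DOC =
     "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n"
   + "         xmlns:dc=\"http://www.purl.org/dc/\" id=\"R1\"/>";

  // What startElement() received for the root element
  String namespaceURI;
  String localName;
  String qualifiedName;
  int attributeCount;

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    this.namespaceURI = namespaceURI;
    this.localName = localName;
    this.qualifiedName = qualifiedName;
    this.attributeCount = atts.getLength();
  }

  // Parses DOC with the requested feature settings and returns
  // the handler so that the recorded values can be inspected.
  public static NamespaceFeatureDemo parse(boolean namespaces,
   boolean prefixes) throws Exception {
    XMLReader parser =
     SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    parser.setFeature(NAMESPACES, namespaces);
    parser.setFeature(PREFIXES, prefixes);
    NamespaceFeatureDemo handler = new NamespaceFeatureDemo();
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(DOC)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    boolean[][] settings = {{true, false}, {true, true}, {false, true}};
    for (boolean[] s : settings) {
      NamespaceFeatureDemo seen = parse(s[0], s[1]);
      System.out.println("namespaces=" + s[0]
       + ", namespace-prefixes=" + s[1]
       + ": namespaceURI=\"" + seen.namespaceURI
       + "\", localName=\"" + seen.localName
       + "\", qualifiedName=\"" + seen.qualifiedName
       + "\", attributes=" + seen.attributeCount);
    }
  }

}
```

The attribute count is the quickest thing to look at: the xmlns:rdf and xmlns:dc declarations should appear in the Attributes list only when http://xml.org/sax/features/namespace-prefixes is true.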

Standard Features

Besides the two required features, SAX defines a number of standard features which parsers may support if they choose. These have names which are consistent across different parsers and include:

  • http://xml.org/sax/features/external-general-entities

  • http://xml.org/sax/features/external-parameter-entities

  • http://xml.org/sax/features/string-interning

  • http://xml.org/sax/features/validation

external-general-entities

If http://xml.org/sax/features/external-general-entities is true, then the parser resolves all external general entity references. If it’s false, it does not. If the parser is validating, then this feature is required to be true.

The default value is parser-dependent. Not all parsers are able to resolve external entity references. Attempting to set this to true with a parser that cannot resolve external entity references will throw a SAXNotRecognizedException.

external-parameter-entities

If http://xml.org/sax/features/external-parameter-entities is true, then the parser resolves all external parameter entity references. If it’s false, it does not. If the parser is validating, then this feature is required to be true.

The default value of this feature is parser-dependent. Not all parsers are able to resolve external entity references. Attempting to set this to true with a parser that cannot resolve external entity references will throw a SAXNotRecognizedException.

string-interning

If http://xml.org/sax/features/string-interning is true, then the parser internalizes all XML names using the intern() method of the String class before passing them to the various callback methods. Thus if there are 100 different paragraph elements in your document, the parser will only use one "paragraph" string for all 100 start-tags and 100 end-tags rather than 200 separate strings. This can save memory as well as allowing you to compare element names using the == operator instead of the equals() method. Besides element names, this also affects attribute names, entity names, notation names, namespace prefixes, and namespace URIs. The default value is parser-dependent.
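
Because the default is parser-dependent, code that compares names with == ought to check the feature first and fall back to equals() otherwise. Here is a minimal sketch of that approach. The class InternedNameCounter and its countParagraphs() method are illustrative names only, and the parser is assumed to come from the JAXP SAXParserFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class InternedNameCounter extends DefaultHandler {

  private final boolean interned;
  private int count = 0;

  InternedNameCounter(boolean interned) {
    this.interned = interned;
  }

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    // String literals are interned by the JVM, so == is a valid
    // comparison, but only if the parser interns its names too.
    boolean match = interned
     ? qualifiedName == "paragraph"
     : "paragraph".equals(qualifiedName);
    if (match) count++;
  }

  public static int countParagraphs(String xml) throws Exception {
    XMLReader parser =
     SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    boolean interned;
    try {
      interned = parser.getFeature(
       "http://xml.org/sax/features/string-interning");
    }
    catch (SAXException e) {
      // Unrecognized or unreadable: assume names are not interned
      interned = false;
    }
    InternedNameCounter handler = new InternedNameCounter(interned);
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(xml)));
    return handler.count;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<doc><paragraph/><paragraph/><note/>"
     + "<paragraph/></doc>";
    System.out.println(countParagraphs(xml) + " paragraph elements");
  }

}
```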

validation

If http://xml.org/sax/features/validation is true, then the parser validates the document against its DTD. Of course not all parsers are capable of doing this. Attempting to set http://xml.org/sax/features/validation to true for a parser that doesn’t know how to validate will throw a SAXNotRecognizedException.

Since validation requires resolving all external entity references, setting http://xml.org/sax/features/validation to true automatically sets http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities to true as well.

The default value of this feature is allegedly parser-dependent, though I’ve yet to encounter a parser that turns it on by default.

Example 7.9 is a program that uses this feature to validate documents. As well as setting the http://xml.org/sax/features/validation feature to true, it’s also necessary to register an ErrorHandler object that can receive messages about validity errors.

Example 7.9. A SAX program that validates documents

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class SAXValidator implements ErrorHandler {
  
  // Flag to check whether any errors have been spotted.
  private boolean valid = true;
  
  public boolean isValid() {
    return valid; 
  }
  
  // If this handler is used to parse more than one document, 
  // its initial state needs to be reset between parses.
  public void reset() {
    // Assume document is valid until proven otherwise
    valid = true; 
  }
  
  public void warning(SAXParseException exception) {

    System.out.println("Warning: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber()
     + ", column " + exception.getColumnNumber());
    // Warnings do not affect validity, so the flag is left alone

  }

  public void error(SAXParseException exception) {

    System.out.println("Error: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber()
     + ", column " + exception.getColumnNumber());
    // Unfortunately there's no good way to distinguish between
    // validity errors and other kinds of non-fatal errors
    valid = false;

  }

  public void fatalError(SAXParseException exception) {

    System.out.println("Fatal Error: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber()
     + ", column " + exception.getColumnNumber());
    // Well-formedness is a prerequisite for validity
    valid = false;

  }
  

  public static void main(String[] args) {
  
    if (args.length <= 0) {
      System.out.println("Usage: java SAXValidator URL");
      return;
    }
    String document = args[0];
    
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      SAXValidator handler = new SAXValidator();
      parser.setErrorHandler(handler);
      // Turn on validation. 
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.parse(document);
      if (handler.isValid()) {
        System.out.println(document + " is valid.");
      }
      else {
        // If the document isn't well-formed, an exception has
        // already been thrown and this has been skipped.
        System.out.println(document + " is well-formed.");
      }
    }
    catch (SAXParseException e) {
      System.out.print(document + " is not well-formed at ");
      System.out.println("Line " + e.getLineNumber() 
       + ", column " +  e.getColumnNumber() );
    }
    catch (SAXException e) {
      System.out.println("Could not check document because " 
       + e.getMessage());
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
  
  }

}

Here’s the beginning of the output from running it across the DocBook XML source code for an early version of this chapter:

%java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser SAXValidator
 xmlreader.xml
Error: Element type "xinclude:include" must be declared.
 at line 344, column 92
Error: Attribute "href" must be declared for 
 element type "xinclude:include".
 at line 344, column 92
Error: Attribute "parse" must be declared for 
 element type "xinclude:include".
 at line 344, column 92
Error: The content of element type "programlisting" must match 
"(#PCDATA|footnoteref|xref|abbrev|acronym|citation|citerefentry
|citetitle|emphasis|firstterm|foreignphrase|glossterm|footnote
|phrase|quote|trademark|wordasword|link|olink|ulink|action
|application|classname|methodname|interfacename|exceptionname
|ooclass|oointerface|ooexception|command|computeroutput|database
|email|envar|errorcode|errorname|errortype|filename|function
|guibutton|guiicon|guilabel|guimenu|guimenuitem|guisubmenu
|hardware|interface|keycap|keycode|keycombo|keysym|literal
|constant|markup|medialabel|menuchoice|mousebutton|option
|optional|parameter|prompt|property|replaceable|returnvalue
|sgmltag|structfield|structname|symbol|systemitem|token|type
|userinput|varname|anchor|author|authorinitials|corpauthor
|modespec|othercredit|productname|productnumber|revhistory|remark
|subscript|superscript|inlinegraphic|inlinemediaobject
|inlineequation|synopsis|cmdsynopsis|funcsynopsis|classsynopsis
|fieldsynopsis|constructorsynopsis|destructorsynopsis
|methodsynopsis|indexterm|beginpage|co|lineannotation)*".
 at line 344, column 110
…
xmlreader.xml is well-formed.

SAXValidator is complaining about the XInclude elements I use to merge in source code examples like Example 7.9. These are not expected by the DocBook DTD. They need to be replaced before the file becomes valid. Once I do that, the merged file (ch07.xml) is valid:

%java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser SAXValidator
 ch07.xml
ch07.xml is valid.
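
For quick experiments it can be convenient to validate a small document held in a string rather than one fetched from a URL. The sketch below applies the same two ingredients, the validation feature plus an ErrorHandler, to documents that carry their DTD in the internal subset. The class InternalSubsetValidator and the two sample documents are made up for illustration, and the parser is assumed to come from the JAXP SAXParserFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

public class InternalSubsetValidator {

  static final String DTD =
   "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
  static final String GOOD = DTD + "<a><b/></a>";
  static final String BAD  = DTD + "<a><c/></a>";  // c is undeclared

  // Returns true if no validity errors were reported. A document
  // that is not well-formed causes a SAXParseException instead.
  public static boolean isValid(String xml) throws Exception {
    XMLReader parser =
     SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    parser.setFeature("http://xml.org/sax/features/validation", true);
    final boolean[] valid = {true};
    parser.setErrorHandler(new ErrorHandler() {
      public void warning(SAXParseException e) {
        // warnings do not make a document invalid
      }
      public void error(SAXParseException e) {
        valid[0] = false;
      }
      public void fatalError(SAXParseException e)
       throws SAXException {
        throw e;  // not well-formed, so give up
      }
    });
    parser.parse(new InputSource(new StringReader(xml)));
    return valid[0];
  }

  public static void main(String[] args) throws Exception {
    System.out.println("GOOD is valid: " + isValid(GOOD));
    System.out.println("BAD is valid:  " + isValid(BAD));
  }

}
```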

Standard Properties

SAX does not require parsers to support any properties. However, it does define four standard properties which parsers may support if they choose. These are:

  • http://xml.org/sax/properties/declaration-handler

  • http://xml.org/sax/properties/dom-node

  • http://xml.org/sax/properties/lexical-handler

  • http://xml.org/sax/properties/xml-string

xml-string

http://xml.org/sax/properties/xml-string is a read-only property that contains the string of text corresponding to the current SAX event. For example, in the startElement() method, this property would contain the actual start-tag that caused the method invocation.

This property can be used in a very straightforward program that echoes an XML document onto a Writer, as shown in Example 7.10. Assuming a validating parser, the parsing process merges a document that was originally split across multiple parsed entities into a single entity. Here each callback method in the ContentHandler simply invokes a private method that writes out the current value of the http://xml.org/sax/properties/xml-string property.

Example 7.10. A SAX program that echoes the parsed document

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.*;


public class DocumentMerger implements ContentHandler {

  private XMLReader parser;
  private Writer out;
  
  public DocumentMerger(XMLReader parser, Writer out) {
    this.parser = parser;
    this.out = out;   
  }
  
  private void output() throws SAXException {
    
    try {
      String s = (String) parser.getProperty(
       "http://xml.org/sax/properties/xml-string");
      out.write(s);
    }
    catch (IOException e) {
      throw new SAXException("Nested IOException", e);  
    }    
    
  }
    
  public void setDocumentLocator(Locator locator) {}
  
  public void startDocument() throws SAXException {
    this.output();
  }
  public void endDocument() throws SAXException {
    this.output();
  }
  
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {
    this.output();
  }
  
  public void endPrefixMapping(String prefix) 
   throws SAXException {
    this.output();
  }
  
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {
    this.output();
  }
  
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException {
    this.output();
  }
  
  public void characters(char[] text, int start, int length)
   throws SAXException {
    this.output();
  }
  
  public void ignorableWhitespace(char[] text, int start, 
   int length) throws SAXException {
    this.output();
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    this.output();
  }
  
   
  public void skippedEntity(String name)
   throws SAXException {
    this.output();
  }

  public static void main(String[] args) {
      
    if (args.length <= 0) {
      System.out.println(
       "Usage: java DocumentMerger url"
      );
      return;
    }
          
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      
      // Since this just writes onto the console, it's best
      // to use the system default encoding, which is what
      // we get by not specifying an explicit encoding here.
      Writer out = new OutputStreamWriter(System.out);
      ContentHandler handler = new DocumentMerger(parser, out);
      parser.setContentHandler(handler);
    
      parser.parse(args[0]);
      
      out.flush();
      out.close();
    }
    catch (Exception e) {
      System.err.println(e); 
    }
  
  }   
  
}

The document that's output may not be quite the same as the document that was read. Character references will have been resolved. General entity references will probably have been resolved. Parts of the prolog, especially the DOCTYPE declaration, may be missing. Attributes that were read in from defaults in the DTD will be explicitly specified. However, the complete information content of the original document should be present, even if the form is different.

The biggest issue with this program is finding a parser that recognizes the http://xml.org/sax/properties/xml-string property. In my tests, Xerces 1.4.3, Crimson, and Ælfred all threw a SAXNotRecognizedException or a SAXNotSupportedException. I have not yet found a parser that supports this property, and there's some suspicion in the SAX community that defining it in the first place may have been a mistake.

dom-node

The http://xml.org/sax/properties/dom-node property contains the org.w3c.dom.Node object corresponding to the current SAX event. For example, in the startElement() and endElement() methods, this property contains an org.w3c.dom.Element object representing that element. In the characters() method, this property contains the org.w3c.dom.Text object which contained the characters from which the text had been read.

lexical-handler

Lexical events are those ephemera of parsing that don’t really mean anything. In some sense, they really aren’t part of the document’s information. Comments are the most obvious example. However, lexical data also includes entity boundaries, CDATA section delimiters, and the DOCTYPE declaration. What unifies all these is that they really don’t matter 99.9% of the time. Unfortunately, there’s that annoying 0.1% when you really do care about some lexical detail you’d normally ignore.

Parsers are not required to report lexical data; but if they want to do so, SAX provides a standard callback interface they can use, LexicalHandler, shown in Example 7.11. However, this interface is optional. Parsers are not required to support it. Notice that it is in the org.xml.sax.ext package, not the core org.xml.sax package.

Example 7.11. The LexicalHandler interface

package org.xml.sax.ext;

public interface LexicalHandler {

  public void startDTD(String name, String publicId, 
   String systemId) throws SAXException;
  public void endDTD() throws SAXException;

  public void startEntity(String name)
   throws SAXException;
  public void endEntity(String name) throws SAXException;

  public void startCDATA() throws SAXException;
  public void endCDATA() throws SAXException;

  public void comment(char[] text, int start, int length)
   throws SAXException;

}

Because parsers are not required to support the LexicalHandler interface, it can’t be registered with a setLexicalHandler() method in XMLReader like the other callback interfaces. Instead, it’s set as the value of the http://xml.org/sax/properties/lexical-handler property.

For example, Example 7.12 is a concrete implementation of LexicalHandler that dumps comments from an XML document onto System.out.

Example 7.12. An implementation of the LexicalHandler interface

import org.xml.sax.*;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class CommentReader implements LexicalHandler {

  public void comment (char[] text, int start, int length)
   throws SAXException {

    String comment = new String(text, start, length);
    System.out.println(comment);

  }

  public static void main(String[] args) {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      System.err.println("Error: could not locate a parser.");
      System.err.println(
       "Try setting the org.xml.sax.driver system property to "
       + "the fully package qualified name of your parser class."
      );
      return;
    }

    // turn on comment handling
    try {
      LexicalHandler handler = new CommentReader();
      parser.setProperty(
       "http://xml.org/sax/properties/lexical-handler", handler);
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser does not provide lexical events...");
      return;
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on comment processing here");
      return;
    }

    if (args.length == 0) {
      System.out.println("Usage: java CommentReader URL");
      return; // without this, args[0] below would fail
    }

    // start parsing...
    try {
      parser.parse(args[0]);
    }
    catch (SAXParseException e) { // well-formedness error
      System.out.println(args[0] + " is not well formed.");
      System.out.println(e.getMessage()
       + " at line " + e.getLineNumber()
       + ", column " + e.getColumnNumber());
    }
    catch (SAXException e) { // some other kind of error
      System.out.println(e.getMessage());
    }
    catch (IOException e) {
      System.out.println("Could not read " + args[0]
       + " because of the IOException " + e);
    }

  }

  // do-nothing methods not needed in this example
  public void startDTD(String name, String publicId, 
   String systemId) throws SAXException {}
  public void endDTD() throws SAXException {}
  public void startEntity(String name) throws SAXException {}
  public void endEntity(String name) throws SAXException {}
  public void startCDATA() throws SAXException {}
  public void endCDATA() throws SAXException {}

}

The main() method builds an XMLReader, constructs an instance of CommentReader, and uses setProperty() to make this CommentReader the parser’s LexicalHandler. Then it parses the document indicated on the command line.

It’s amusing to run this across the XML source for various W3C specifications. For example, here’s the output when the XML version of the XML 1.0 specification, second edition, is fed into CommentReader:

%java CommentReader http://www.w3.org/TR/2000/REC-xml-20001006.xml
ArborText, Inc., 1988-2000, v.4002
 ...............................................................
 XML specification DTD .........................................
 ...............................................................

TYPICAL INVOCATION:
#  <!DOCTYPE spec PUBLIC
#       "-//W3C//DTD Specification V2.1//EN"
#       "http://www.w3.org/XML/1998/06/xmlspec-v21.dtd">

PURPOSE:
  This XML DTD is for W3C specifications and other technical reports.
  It is based in part on the TEI Lite and Sweb DTDs.
…      

The comments you’re seeing are actually from the DTD used by the XML specification. Comments and processing instructions in the DTD, both internal and external subsets, are reported to their respective callback methods, just like comments and processing instructions in the instance document.

Example 7.12 is a pure LexicalHandler that does not implement any of the other SAX callback interfaces like ContentHandler. However, it’s not uncommon to implement several callback interfaces in one class. Among other advantages, that makes it a lot easier to write programs that rely on information that’s available in different interfaces.
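
As one illustration of combining interfaces, the sketch below uses the startCDATA() and endCDATA() callbacks of LexicalHandler to decide which of the characters() calls from ContentHandler came from inside a CDATA section. It is only a sketch under assumptions: the class name CDATACollector is invented, it extends org.xml.sax.ext.DefaultHandler2 (a convenience class that supplies do-nothing implementations of both interfaces), and the parser is assumed to come from the JAXP SAXParserFactory and to recognize the lexical-handler property.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CDATACollector extends DefaultHandler2 {

  private boolean inCDATA = false;
  private final StringBuilder cdataText = new StringBuilder();

  // LexicalHandler callbacks mark the section boundaries
  @Override
  public void startCDATA() {
    inCDATA = true;
  }

  @Override
  public void endCDATA() {
    inCDATA = false;
  }

  // ContentHandler callback; text may arrive in several pieces
  @Override
  public void characters(char[] text, int start, int length) {
    if (inCDATA) cdataText.append(text, start, length);
  }

  // Returns the concatenated content of all CDATA sections in xml
  public static String collect(String xml) throws Exception {
    XMLReader parser =
     SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    CDATACollector handler = new CDATACollector();
    // The same object is registered twice, once per interface
    parser.setContentHandler(handler);
    parser.setProperty(
     "http://xml.org/sax/properties/lexical-handler", handler);
    parser.parse(new InputSource(new StringReader(xml)));
    return handler.cdataText.toString();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<r>plain <![CDATA[raw <b> & text]]> tail</r>";
    System.out.println(collect(xml));
  }

}
```

Neither interface could do this alone: ContentHandler sees the text but not the delimiters, while LexicalHandler sees the delimiters but not the text.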

declaration-handler

The http://xml.org/sax/properties/declaration-handler property identifies the parser’s DeclHandler. DeclHandler, summarized in Example 7.13, is an optional interface in the org.xml.sax.ext package used by parsers to report those parts of the DTD that don’t affect the content of instance documents, specifically ELEMENT, ATTLIST and parsed ENTITY declarations. Together with the information reported by the DTDHandler, this gives you enough information to reproduce a parsed document’s DTD. The reproduced DTD may not be exactly the same as the original DTD. For instance, parameter entities will have been resolved and only the first declaration of each general entity will be reported. Nonetheless, the model represented by the entire DTD should be intact.

Example 7.13. The DeclHandler interface

package org.xml.sax.ext;

public interface DeclHandler {

  public void elementDecl(String name, String model)
   throws SAXException;
  public void attributeDecl(String elementName, 
   String attributeName, String type, String mode, 
   String defaultValue) throws SAXException;
  public void internalEntityDecl(String name, String value) 
   throws SAXException;
  public void externalEntityDecl(String name, String publicID, 
   String systemID) throws SAXException;

}
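
To get a feel for how these callbacks are delivered, here is a minimal in-memory sketch that registers a handler through the declaration-handler property and records the content model reported for each element type. The class ContentModelLookup, its modelOf() method, and the tiny sample DTD are all invented for illustration. It extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the DeclHandler methods, and it assumes a parser obtained from the JAXP SAXParserFactory that recognizes the property.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class ContentModelLookup extends DefaultHandler2 {

  static final String DOC =
     "<!DOCTYPE doc [\n"
   + "  <!ELEMENT doc (title, p*)>\n"
   + "  <!ELEMENT title (#PCDATA)>\n"
   + "  <!ELEMENT p (#PCDATA)>\n"
   + "]>\n"
   + "<doc><title>Example</title></doc>";

  // Element name -> content model, in declaration order
  private final Map<String, String> models = new LinkedHashMap<>();

  // DeclHandler callback, invoked once per ELEMENT declaration
  @Override
  public void elementDecl(String name, String model) {
    models.put(name, model);
  }

  // Returns the reported content model for elementName,
  // or null if the DTD does not declare that element.
  public static String modelOf(String xml, String elementName)
   throws Exception {
    XMLReader parser =
     SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    ContentModelLookup handler = new ContentModelLookup();
    parser.setProperty(
     "http://xml.org/sax/properties/declaration-handler", handler);
    parser.parse(new InputSource(new StringReader(xml)));
    return handler.models.get(elementName);
  }

  public static void main(String[] args) throws Exception {
    System.out.println("doc: " + modelOf(DOC, "doc"));
    System.out.println("p:   " + modelOf(DOC, "p"));
  }

}
```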

Example 7.14 is a little DeclHandler I whipped up to help me make sense out of heavily modular, very customizable DTDs like XHTML 1.1 or SMIL 2.0. It implements the DeclHandler interface with methods that copy each declaration onto System.out. Because all parameter entity references and conditional sections are resolved before the methods of DeclHandler are called, it outputs a single monolithic DTD. I can see, for example, exactly what the content model for an element such as blockquote really is without having to manually trace the parameter entity references through seven separate modules and figure out which modules are likely to be included and which ignored.

Example 7.14. A program that prints out a complete DTD

import org.xml.sax.*;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class DTDMerger implements DeclHandler {

  public void elementDecl(String name, String model)
   throws SAXException {
    System.out.println("<!ELEMENT " + name + " " + model + " >");
  }
  
  public void attributeDecl(String elementName, 
   String attributeName, String type, String mode, 
   String defaultValue) throws SAXException {
     
    System.out.print("<!ATTLIST ");
    System.out.print(elementName);
    System.out.print(" ");
    System.out.print(attributeName);
    System.out.print(" ");
    System.out.print(type);
    System.out.print(" ");
    if (mode != null) {
      System.out.print(mode + " ");
    }
    if (defaultValue != null) {
      System.out.print('"' + defaultValue + "\" ");
    }
    System.out.println(">");   
     
  }
  
  public void internalEntityDecl(String name, 
   String value) throws SAXException {
     
    if (!name.startsWith("%")) { // ignore parameter entities
      System.out.println("<!ENTITY " + name + " \"" 
       + value + "\">");        
    }
    
  }
  
  public void externalEntityDecl(String name, 
   String publicID, String systemID) throws SAXException {
     
    if (!name.startsWith("%")) { // ignore parameter entities
      if (publicID != null) { 
        System.out.println("<!ENTITY " + name + " PUBLIC \"" 
         + publicID + "\" \"" + systemID + "\">");        
      
      }
      else {
        System.out.println("<!ENTITY " + name + " SYSTEM \"" 
         + systemID + "\">");        
      }
    }
    
  }

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java DTDMerger URL");
      return;
    }
    String document = args[0];
    
    XMLReader parser = null;
    try {
      parser = XMLReaderFactory.createXMLReader();
      DeclHandler handler = new DTDMerger();
      parser.setProperty(
       "http://xml.org/sax/properties/declaration-handler", 
       handler);
      parser.parse(document);
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(parser.getClass() 
       + " does not support declaration handlers.");
    }
    catch (SAXNotSupportedException e) {
      System.err.println(parser.getClass() 
       + " does not support declaration handlers.");

    }
    catch (SAXException e) {
      System.err.println(e);
      // As long as we finished with the DTD we really don't care
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
   
  }
   
}

I ran this program across the start of an XHTML document. (Actually it was the XHTML 1.1 specification itself, though that detail doesn’t really matter since it’s the DTD we care about here, not the instance document. In fact, the instance document doesn’t even need to be well-formed as long as the error isn’t spotted until after the DOCTYPE declaration has been read.) Here’s the beginning of the merged DTD:

%java DTDMerger http://www.w3.org/TR/xhtml11
<!ATTLIST a onfocus CDATA #IMPLIED >
<!ATTLIST a onblur CDATA #IMPLIED >
<!ATTLIST form onsubmit CDATA #IMPLIED >
<!ATTLIST form onreset CDATA #IMPLIED >
<!ATTLIST label onfocus CDATA #IMPLIED >
…

If what you really want to know is the content specification for a particular element type, the output from this program is a lot easier to read than the original DTD. For example, here’s the original ELEMENT declaration for the p element:

<!ENTITY % p.element  "INCLUDE" >
<![%p.element;[
<!ENTITY % p.content
     "( #PCDATA | %Inline.mix; )*" >
<!ENTITY % p.qname  "p" >
<!ELEMENT %p.qname;  %p.content; >
<!-- end of p.element -->]]>

Now here’s the merged version:

<!ELEMENT p 
  (#PCDATA|br|span|em|strong|dfn|code|samp|kbd|var|cite|abbr
   |acronym|q|tt|i|b|big|small|sub|sup|bdo|a|img|map|object|input
   |select|textarea|label|button|ruby|ins|del|script|noscript)* >

I think you’ll agree the second version is a lot easier to follow and understand. There are good and valid reasons to write the original DTD in the form used by the first declaration. However, that’s just not a form you want to present to a human being instead of a computer.
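The same effect can be reproduced on a small scale without touching the network. The following is only a sketch under several assumptions: the two-module DTD, the example.invalid system identifier, and the class name MergedModelDemo are all made up for illustration; the parser is whatever SAXParserFactory supplies (the one bundled with the JDK is assumed to support the declaration-handler property); and it needs Java 8 or later for the lambda. An EntityResolver serves the DTD from a string, and the DeclHandler records each ELEMENT declaration as the parser reports it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

public class MergedModelDemo implements DeclHandler {

  // Hypothetical DTD standing in for a heavily modular one
  static final String DTD =
      "<!ENTITY % inline \"em | strong\">\n"
    + "<!ELEMENT p (#PCDATA | %inline;)*>\n"
    + "<!ELEMENT em (#PCDATA)>\n"
    + "<!ELEMENT strong (#PCDATA)>\n";

  // The system identifier is a placeholder; it is never fetched
  static final String DOCUMENT =
      "<!DOCTYPE p SYSTEM \"http://example.invalid/p.dtd\"><p/>";

  final List<String> declarations = new ArrayList<>();

  public void elementDecl(String name, String model) {
    declarations.add("<!ELEMENT " + name + " " + model + " >");
  }

  // The other declaration types are ignored in this sketch
  public void attributeDecl(String elementName, String attributeName,
   String type, String mode, String defaultValue) {}
  public void internalEntityDecl(String name, String value) {}
  public void externalEntityDecl(String name, String publicID,
   String systemID) {}

  static List<String> merge() throws Exception {
    XMLReader parser
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    MergedModelDemo handler = new MergedModelDemo();
    parser.setProperty(
     "http://xml.org/sax/properties/declaration-handler", handler);
    // Serve the DTD from memory instead of from the network
    parser.setEntityResolver((publicID, systemID)
     -> new InputSource(new StringReader(DTD)));
    parser.parse(new InputSource(new StringReader(DOCUMENT)));
    return handler.declarations;
  }

  public static void main(String[] args) throws Exception {
    for (String declaration : merge()) {
      System.out.println(declaration);
    }
  }

}
```

When this runs, the declaration reported for p should no longer contain the %inline; reference; the names em and strong appear in its place.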

Xerces Custom Features

Individual parsers generally have a set of their own custom features and properties that control their own special capabilities. This allows you to configure a parser without having to go outside the standard SAX API, though using them does tie your code more closely to one specific parser. There’s generally no problem with using these non-standard features. Just make sure you watch out for SAXNotRecognizedException in case you later need to switch to a different parser that doesn’t support the same features.
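One way to stay portable is to treat every custom feature as optional. The helper below is a minimal sketch of that idea, not part of SAX itself: the class name FeatureProbe, the method trySetFeature(), and the example.invalid feature URI are all hypothetical. It simply reports whether the parser accepted the feature instead of letting the exception propagate.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class FeatureProbe {

  // Returns true if the parser accepted the feature, false if it
  // either doesn't know the feature or can't set it to this value
  static boolean trySetFeature(XMLReader parser, String name,
   boolean value) {
    try {
      parser.setFeature(name, value);
      return true;
    }
    catch (SAXNotRecognizedException e) {
      return false;
    }
    catch (SAXNotSupportedException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    XMLReader parser
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    String[] names = {
      // A standard SAX feature
      "http://xml.org/sax/features/namespaces",
      // A made-up feature no parser should recognize
      "http://example.invalid/features/no-such-feature"
    };
    for (String name : names) {
      System.out.println(name + ": "
       + (trySetFeature(parser, name, true) ? "set" : "refused"));
    }
  }

}
```

A program can then decide, feature by feature, whether a refusal is fatal or whether it can carry on with reduced functionality.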

For purposes of illustration, I’ll look at the custom features in Xerces 1.4.3, all of which are in the http://apache.org/xml/features/ hierarchy. Other parsers will have some features similar to these and some unique ones of their own. However, all will be in a domain of that parser vendor.

http://apache.org/xml/features/validation/schema

If true, Xerces will use any XML schemas it finds for applying default attribute values, assigning types to attributes, and possibly validation. (Validation also depends on the http://xml.org/sax/features/validation feature.) If false, then Xerces won’t use schemas at all, just the DTD. The default is true; that is, use the schema if present.

http://apache.org/xml/features/validation/schema-full-checking

A number of features of the W3C XML Schema Language are extremely compute-intensive to check. For example, the rather technical requirement for “Unique Particle Attribution” requires that given any element it’s possible to tell which part of a schema that element matches without considering the items the element contains or the elements that follow that element. This is extremely difficult to state, much less implement, in a precisely correct way. Consequently, Xerces by default skips these expensive checks. However, if you want them performed despite their cost, you can turn them on by setting this feature to true.

http://apache.org/xml/features/validation/dynamic

If true, then Xerces will only attempt to validate documents that have a DOCTYPE declaration or an xsi:schemaLocation attribute. It will not attempt to validate merely well-formed documents that have neither.
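The difference is easy to see by counting the validity errors reported for two tiny documents, one with a DOCTYPE declaration and one without. This is a sketch only. It assumes the parser returned by SAXParserFactory recognizes this Xerces feature (the parser bundled with the JDK is Xerces-derived and is assumed to); the class name and both documents are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DynamicValidationDemo {

  // No DOCTYPE: merely well-formed, so nothing to validate against
  static final String NO_DOCTYPE = "<order><item/></order>";

  // Has a DOCTYPE, and violates it: order is declared EMPTY
  static final String WITH_DOCTYPE =
      "<!DOCTYPE order [<!ELEMENT order EMPTY>]>"
    + "<order><item/></order>";

  static int countValidityErrors(String xml) throws Exception {
    XMLReader parser
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature(
     "http://apache.org/xml/features/validation/dynamic", true);
    final int[] errors = {0};
    parser.setErrorHandler(new DefaultHandler() {
      @Override
      public void error(SAXParseException exception) {
        errors[0]++; // validity errors arrive here
      }
    });
    parser.parse(new InputSource(new StringReader(xml)));
    return errors[0];
  }

  public static void main(String[] args) throws Exception {
    System.out.println("Without DOCTYPE: "
     + countValidityErrors(NO_DOCTYPE) + " error(s)");
    System.out.println("With DOCTYPE: "
     + countValidityErrors(WITH_DOCTYPE) + " error(s)");
  }

}
```

With dynamic validation on, the document that has no DOCTYPE should produce no validity errors at all, while the one that declares order as EMPTY and then gives it a child should produce at least one.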

http://apache.org/xml/features/validation/warn-on-duplicate-attdef

It is technically legal to declare the same attribute twice, and the declarations don’t even have to be compatible. For example,

<!ATTLIST Order id ID #IMPLIED>
<!ATTLIST Order id CDATA #REQUIRED>

The parser simply picks the first declaration and ignores the rest. Nonetheless, this probably indicates a mistake in the DTD. If the warn-on-duplicate-attdef feature is true, then Xerces should warn of duplicate attribute declarations by invoking the warning() method in the registered ErrorHandler. The default is to warn of this problem.

http://apache.org/xml/features/validation/warn-on-undeclared-elemdef

It is technically legal to declare an attribute for an element that has not been declared. This might happen if you delete an ELEMENT declaration but forget to delete one of the ATTLIST declarations for that element. Nonetheless, this almost certainly indicates a mistake. If this feature is true, then Xerces will warn of attribute declarations for non-existent elements. The default is to warn of this problem.

http://apache.org/xml/features/allow-java-encodings

By default Xerces only recognizes the standard encoding names like ISO-8859-1 and UTF-8. However, if this feature is turned on, then Xerces will also recognize Java style encoding names like 8859_1 and UTF8. The default is false.

http://apache.org/xml/features/continue-after-fatal-error

If true, Xerces will continue to parse a document after it detects a well-formedness error in order to detect and report more errors. This is useful for debugging because it allows you to be informed of and correct multiple errors before parsing a document again. The default is false. Note that the only thing Xerces will do after it sees the first well-formedness error is look for more errors. It will not invoke any methods in any of the callback interfaces except ErrorHandler.

http://apache.org/xml/features/nonvalidating/load-dtd-grammar

If true, Xerces will attach default attributes to elements and specify attribute types even if it isn’t validating. If false it won’t. The default is true; and if validation is turned on, this feature is automatically turned on and cannot be turned off.

http://apache.org/xml/features/nonvalidating/load-external-dtd

If true, Xerces will load the external DTD subset. If false it won’t. The default is true. If validation is turned on, this feature is automatically turned on and cannot be turned off.
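A quick way to check whether the external DTD subset is being loaded is to install an EntityResolver and count how often the parser consults it. The sketch below assumes a parser that recognizes this Xerces feature (again, the JDK's bundled parser is assumed to) and Java 8 or later; the class name, the document, and its example.invalid system identifier are placeholders, and the resolver hands back an empty DTD so nothing is ever fetched.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SkipExternalDTD {

  // The system identifier is a placeholder; it is never fetched
  static final String DOCUMENT =
      "<!DOCTYPE order SYSTEM \"http://example.invalid/order.dtd\">"
    + "<order/>";

  // Returns how many times the parser asked for an external entity
  static int resolverCalls(boolean loadExternalDTD) throws Exception {
    XMLReader parser
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    parser.setFeature(
     "http://apache.org/xml/features/nonvalidating/load-external-dtd",
     loadExternalDTD);
    final int[] calls = {0};
    parser.setEntityResolver((publicID, systemID) -> {
      calls[0]++;
      // An empty stand-in for the real DTD
      return new InputSource(new StringReader(""));
    });
    parser.parse(new InputSource(new StringReader(DOCUMENT)));
    return calls[0];
  }

  public static void main(String[] args) throws Exception {
    System.out.println("Feature on:  " + resolverCalls(true)
     + " request(s) for the DTD");
    System.out.println("Feature off: " + resolverCalls(false)
     + " request(s) for the DTD");
  }

}
```

Turning the feature off is a common choice when documents point at DTDs on slow or unreachable servers and neither validation nor default attribute values are needed.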

Example 7.15 is a variation of the earlier SAXValidator program that uses Xerces custom features to provide as many warnings and errors as possible. It uses dynamic validation so it only reports validity errors if the document is in fact trying to be valid. It turns on all optional warnings. And it continues parsing after a fatal error so it can find and report any more errors it spots in the document. This program is more useful for checking documents than the earlier generic program in Example 7.9. The downside is that it is totally dependent on the Xerces parser. It will not run with any other parser. Indeed it might even have troubles with earlier or later versions of Xerces. (I wrote this with 1.4.3.)

Example 7.15. Making maximal use of Xerces’s special abilities

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class XercesChecker implements ErrorHandler {
  
  // Flag to check whether any errors have been spotted.
  private boolean valid = true;
  
  public boolean isValid() {
    return valid; 
  }
  
  // Flag to check whether any fatal errors, that is, 
  // well-formedness errors, have been spotted.
  private boolean wellFormed = true;
  
  public boolean isWellFormed() {
    return wellFormed; 
  }
  
  // If this handler is used to parse more than one document, 
  // its initial state needs to be reset between parses.
  public void reset() {
    // Assume document is valid and well-formed 
    // until proven otherwise
    valid = true; 
    wellFormed = true; 
  }
  
  public void warning(SAXParseException exception) {
    
    System.out.println("Warning: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber() 
     + ", column " + exception.getColumnNumber());
    System.out.println(" in entity " + exception.getSystemId());
    
  }
  
  public void error(SAXParseException exception) {
     
    System.out.println("Error: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber() 
     + ", column " + exception.getColumnNumber());
    // Unfortunately there's no good way to distinguish between
    // validity errors and other kinds of non-fatal errors 
    valid = false;
    
  }
  
  public void fatalError(SAXParseException exception) {
     
    System.out.println("Fatal Error: " + exception.getMessage());
    System.out.println(" at line " + exception.getLineNumber() 
     + ", column " + exception.getColumnNumber()); 
    System.out.println(" in entity " + exception.getSystemId());
    // Because continue-after-fatal-error is turned on, the 
    // parser keeps going rather than throwing an exception, 
    // so the problem has to be recorded here.
    wellFormed = false;
    valid = false;
     
  }
  
  public static void main(String[] args) {
  
    if (args.length <= 0) {
      System.out.println("Usage: java XercesChecker URL");
      return;
    }
    String document = args[0];
    
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader(
       "org.apache.xerces.parsers.SAXParser"
      );
      XercesChecker handler = new XercesChecker();
      parser.setErrorHandler(handler);
      
      // This is a hack to fit some long lines of code that 
      // follow between the margins of this printed page
      String features = "http://apache.org/xml/features/";
      
      // Turn on Xerces specific features
      parser.setFeature(features + "validation/dynamic", true);
      parser.setFeature(features 
       + "validation/schema-full-checking", true); 
      parser.setFeature(features 
       + "validation/warn-on-duplicate-attdef", true);
      parser.setFeature(features 
       + "validation/warn-on-undeclared-elemdef", true);
      parser.setFeature(features + "continue-after-fatal-error", 
       true); 
      parser.parse(document);
      if (!handler.isWellFormed()) {
        // Fatal errors were reported to fatalError() but, with
        // continue-after-fatal-error on, no exception was thrown.
        System.out.println(document + " is not well-formed.");
      }
      else if (handler.isValid()) {
        System.out.println(document + " is valid.");
      }
      else {
        // Only non-fatal errors were reported.
        System.out.println(document + " is well-formed.");
      }
    }
    catch (SAXParseException e) {
      System.out.print(document + " is not well-formed at ");
      System.out.println("Line " + e.getLineNumber() 
       + ", column " +  e.getColumnNumber() 
       + " in file " + e.getSystemId());
    }
    catch (SAXException e) {
      System.out.println("Could not check document because " 
       + e.getMessage());
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
  
  }

}

Here’s the beginning of the output from running it across one of my web pages that was supposed to be well-formed HTML, but proved not to be:

%java XercesChecker http://www.cafeconleche.org/
Fatal Error: The element type "br" must be terminated by the 
 matching end-tag "</br>". 
 at line 73, column 4
Fatal Error: The element type "br" must be terminated by the 
 matching end-tag "</br>".
 at line 74, column 16
Fatal Error: The element type "dd" must be terminated by the 
 matching end-tag "</dd>".
 at line 123, column 4
Fatal Error: The element type "br" must be terminated by the 
 matching end-tag "</br>".
 at line 162, column 4
Fatal Error: The reference to entity "section" must end with 
 the ';' delimiter.
 at line 183, column 78
 …

There were actually quite a few more errors that I’ve omitted here. The advantage of using XercesChecker instead of one of the earlier generic checking programs is that XercesChecker gives me a reasonably complete list of all the errors in one pass. I couldn’t necessarily do this with any off-the-shelf parser. With the earlier programs that stopped at the first fatal error, I’d have to fix one error, retest, fix the next error, retest, and so on until I had fixed the final error.

Xerces Custom Properties

DTDs require instance documents to specify what DTDs they should be validated against. While often useful, this can be dangerous. For example, imagine that you’ve written an order processing system that accepts XML documents containing orders from many heterogeneous systems around the world. You can’t trust the people sending you orders to necessarily send them in the correct format, so as a first step you validate every order received. If the order document is invalid, your system rejects it.

However, this system has a flaw. Since the documents themselves specify which DTD they’ll be validated against, hackers can introduce bad data into your system by replacing the system identifier for your DTD with a URI for a DTD on a site they control. Then they can send you a document that will test as valid, even though it’s not, because it’s being validated against the wrong DTD!

For this and other reasons, the schema specification explicitly states that the xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes are not the only way to attach a schema to an instance document. The client application parsing a document is allowed to override the schema locations given in the document with schemas of its own choosing. For this purpose, Xerces has two custom properties:

  • http://apache.org/xml/properties/schema/external-schemaLocation

  • http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation

Both of these properties are strings telling the parser where a schema for elements in particular namespaces (or no namespace) can be found. They have the same syntax as the xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes in instance documents. For instance, this code fragment says that elements not in any namespace should be validated against the schema found at the relative URL orders.xsd:

parser.setProperty(
 "http://apache.org/xml/properties/schema/"
 + "external-noNamespaceSchemaLocation", "orders.xsd");

This code fragment says that elements in the http://schemas.xmlsoap.org/soap/envelope/ namespace should be validated against the schema found at the URL http://www.w3.org/2002/06/soap-envelope and that elements in the http://schemas.xmlsoap.org/soap/encoding/ namespace should be validated against the schema found at the URL http://www.w3.org/2002/06/soap-encoding:

parser.setProperty(
 "http://apache.org/xml/properties/schema/external-schemaLocation", 
 "http://schemas.xmlsoap.org/soap/envelope/ "
 + "http://www.w3.org/2002/06/soap-envelope "
 + "http://schemas.xmlsoap.org/soap/encoding/ "
 + "http://www.w3.org/2002/06/soap-encoding");

If these properties are used and xsi:schemaLocation and/or xsi:noNamespaceSchemaLocation attributes are present in the instance document, then the schemas named by the properties take precedence.
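Here is a self-contained sketch of an application imposing its own schema on documents that don't name one. Everything specific in it is an assumption made for illustration: the tiny orders schema, the two sample documents, the class name, and the use of a temporary file to hold the schema. It also assumes the parser obtained from SAXParserFactory supports the Xerces schema feature and property, as the JDK's bundled Xerces-derived parser is assumed to. Namespace awareness is turned on because schema validation depends on it.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ExternalSchemaDemo {

  // Hypothetical schema: an order contains exactly one item
  static final String SCHEMA =
      "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
    + "<xs:element name=\"order\"><xs:complexType><xs:sequence>"
    + "<xs:element name=\"item\" type=\"xs:string\"/>"
    + "</xs:sequence></xs:complexType></xs:element>"
    + "</xs:schema>";

  static final String GOOD = "<order><item>widget</item></order>";
  static final String BAD  = "<order><bogus/></order>";

  static int countErrors(String xml) throws Exception {
    // Stand-in for orders.xsd; deleted again after the parse
    Path schema = Files.createTempFile("orders", ".xsd");
    try {
      Files.write(schema, SCHEMA.getBytes(StandardCharsets.UTF_8));
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader parser = factory.newSAXParser().getXMLReader();
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setFeature(
       "http://apache.org/xml/features/validation/schema", true);
      parser.setProperty(
       "http://apache.org/xml/properties/schema/"
       + "external-noNamespaceSchemaLocation",
       schema.toUri().toString());
      final int[] errors = {0};
      parser.setErrorHandler(new DefaultHandler() {
        @Override
        public void error(SAXParseException exception) {
          errors[0]++;
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
      return errors[0];
    }
    finally {
      Files.deleteIfExists(schema);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("Good order: " + countErrors(GOOD)
     + " error(s)");
    System.out.println("Bad order:  " + countErrors(BAD)
     + " error(s)");
  }

}
```

Neither sample document mentions a schema, yet the second one should be reported as invalid, because the application rather than the document chose what to validate against.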

These properties are only available in Xerces. Other parsers may support something similar, but if so they’ll place it at their own URL. In fact, as I write this, Sun has just proposed adding http://java.sun.com/xml/jaxp/properties/schemaLanguage and http://java.sun.com/xml/jaxp/properties/schemaLocation properties to JAXP. The rough idea is the same, though Sun’s proposal would allow supporting arbitrary schema languages and allow the schemaLocation property to have a value from which the schema itself could be read rather than merely giving the location of the schema. For instance, it could be an InputStream or an InputSource object. Other parsers will doubtless implement this in other ways.


Copyright 2001, 2002 Elliotte Rusty Harold, elharo@metalab.unc.edu. Last Modified November 10, 2001