Parsing

Parsing is the process of reading an XML document and reporting its content to a client application while checking the document for well-formedness. SAX represents parsers as instances of the XMLReader interface. The specific class that implements this interface varies from parser to parser. For example, in Xerces it’s org.apache.xerces.parsers.SAXParser. In Crimson it’s org.apache.crimson.parser.XMLReaderImpl. Most of the time you don’t construct instances of these classes directly. Instead you use the static XMLReaderFactory.createXMLReader() factory method to create an instance of the parser-specific class. Then you pass InputSource objects containing XML documents to the parse() method of XMLReader. The parser reads the document and throws an exception if it detects any well-formedness errors.
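The hand-off from factory to parse() described above can be sketched in isolation. This is only a minimal sketch, not a definitive implementation: the class name InputSourceSketch and the helper isWellFormed() are hypothetical names chosen for illustration, and the document is read from an in-memory string so that no network connection is needed. It also assumes a Java version with a built-in parser. On recent JDKs XMLReaderFactory is deprecated, which is why the warning is suppressed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class InputSourceSketch {

  // Hypothetical helper: returns true if the text parses without an exception
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
  public static boolean isWellFormed(String xml) {
    try {
      // Let the factory pick the parser-specific XMLReader implementation
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // Wrap the document in an InputSource; here the text comes from a string
      InputSource source = new InputSource(new StringReader(xml));
      parser.parse(source);
      return true;
    }
    catch (SAXException e) {
      // Thrown for well-formedness errors; also reached if the factory
      // itself cannot locate a parser
      return false;
    }
    catch (IOException e) {
      // Unlikely with a StringReader, but parse() declares it
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<greeting>Hello</greeting>"));
    System.out.println(isWellFormed("<greeting>Hello</greting>"));
  }

}
```

With a working parser in the class path, the first call should report a well-formed document and the second, with its mismatched end-tag, should not. Some parsers also print a diagnostic on System.err when no error handler is installed.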

Example 6.1 demonstrates the complete process with a simple program whose main() method parses a document found at a URL entered on the command line. For brevity, it passes the URL to the overloaded parse() method that takes a system ID string rather than an InputSource. If this document is well-formed, a simple message to that effect is printed on System.out. If the document is not well-formed, the parser throws a SAXException. If an I/O error such as a broken network connection occurs, then the parse() method throws an IOException. In this case, you don’t know whether or not the document is well-formed.

Example 6.1. A SAX program that parses a document

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;


public class SAXChecker {

  public static void main(String[] args) {
  
    if (args.length <= 0) {
      System.out.println("Usage: java SAXChecker URL");
      return;
    }
    
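    try {
      // Ask the factory for the parser-specific XMLReader implementation
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // parse() also accepts a system ID (here, the URL) as a string
      parser.parse(args[0]);
      System.out.println(args[0] + " is well-formed.");
    }
    catch (SAXException e) {
      System.out.println(args[0] + " is not well-formed.");
    }
    catch (IOException e) {
      // The document could not be read, so nothing is known about it
      System.out.println(
       "Due to an IOException, the parser could not check "
       + args[0]
      );
    }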
  
  }

}

Note

Don’t forget that you’ll probably need to install a parser such as Xerces or Ælfred somewhere in your class path before you can compile or run this program. Only Java 1.4 and later include a built-in parser.

This program’s output is straightforward. For example, here’s the output I got when I first ran it against my Cafe con Leche home page:

%java SAXChecker http://www.cafeconleche.org
http://www.cafeconleche.org is not well-formed.

After I located and fixed the bugs in that document, I got this output:

%java SAXChecker http://www.cafeconleche.org
http://www.cafeconleche.org is well-formed.

However, some readers will encounter a different result when they run this program. In particular, you may get this output:

%java SAXChecker http://www.cafeconleche.org
org.xml.sax.SAXException: System property org.xml.sax.driver not specified

What this really means is that your parser has not properly customized its version of the XMLReaderFactory class. Unfortunately, far too many parsers, including Xerces and Crimson, fail to do this. Consequently, you need to set the org.xml.sax.driver Java system property to the fully package-qualified name of the Java class for your parser. For Xerces, it’s org.apache.xerces.parsers.SAXParser. For Crimson, it’s org.apache.crimson.parser.XMLReaderImpl. For other parsers, consult the parser documentation. You can specify a one-time value for this property using the -D flag to the Java interpreter like this:

%java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser \
  SAXChecker http://www.cafeconleche.org
http://www.cafeconleche.org is well-formed.
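If you would rather not depend on the command line, one possible alternative is to name the parser class in code by passing it to the XMLReaderFactory.createXMLReader(String) method. The following is a rough sketch under stated assumptions rather than a recommended design: the class ParserLocator, the method createParser(), and the KNOWN_DRIVERS list are hypothetical names invented for this illustration, and the fallback order is arbitrary. The two class names are the ones given above for Xerces and Crimson.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class ParserLocator {

  // Class names taken from the text; add others from your parser's documentation
  private static final String[] KNOWN_DRIVERS = {
    "org.apache.xerces.parsers.SAXParser",
    "org.apache.crimson.parser.XMLReaderImpl"
  };

  // Hypothetical helper: try the factory default first, then each known class
  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
  public static XMLReader createParser() throws SAXException {
    try {
      // Succeeds when the factory can locate a parser on its own
      return XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      // No default parser was found; fall through to the named classes
    }
    for (String driver : KNOWN_DRIVERS) {
      try {
        return XMLReaderFactory.createXMLReader(driver);
      }
      catch (SAXException e) {
        // This class could not be loaded; try the next one
      }
    }
    throw new SAXException("No SAX parser found in the class path");
  }

  public static void main(String[] args) {
    try {
      XMLReader parser = createParser();
      System.out.println("Using " + parser.getClass().getName());
    }
    catch (SAXException e) {
      System.out.println("Could not locate a parser: " + e.getMessage());
    }
  }

}
```

Because the parser is created in its own try block, a failure to locate a parser is reported separately from a well-formedness error found later by parse().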

Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last Modified October 17, 2001