Parsing documents with a DOM Parser

Unlike SAX, DOM does not have a class or interface that represents the XML parser. Each parser vendor provides their own unique class. In Xerces, this is org.apache.xerces.parsers.DOMParser. In Crimson it’s org.apache.crimson.jaxp.DocumentBuilderImpl. In Ælfred it’s an inner class, gnu.xml.dom.JAXPFactory$JAXPBuilder. In Oracle, it’s oracle.xml.parser.v2.DOMParser. In other implementations it will be something else.

Furthermore, since these classes do not share a common interface or superclass, the methods they use to parse documents vary too. For example, in Xerces, the two methods that read XML documents have these signatures:

public void parse(InputSource source)
    throws SAXException, IOException;

public void parse(String systemID)
    throws SAXException, IOException;

To get the Document object from the parser, you first call one of the parse methods and then call the getDocument() method.

public Document getDocument();

For example, if parser is a Xerces DOMParser object, then these lines of code load the DOM Level 2 Core specification into a DOM Document object named spec:

parser.parse("http://www.w3.org/TR/DOM-Level-2-Core");
Document spec = parser.getDocument();

In Crimson’s parser class, by contrast, the parse() method returns a Document object directly so no separate getDocument() method is needed. For example,

Document spec 
 = parser.parse("http://www.w3.org/TR/DOM-Level-2-Core");

Furthermore, the Crimson parse() method is five-way overloaded rather than two-way:

public Document parse(InputSource source)
    throws SAXException, IOException;

public Document parse(String uri)
    throws SAXException, IOException;

public Document parse(File file)
    throws SAXException, IOException;

public Document parse(InputStream in)
    throws SAXException, IOException;

public Document parse(InputStream in, String systemID)
    throws SAXException, IOException;

Example 9.3 is a simple program that uses Xerces to check documents for well-formedness. You can see that it depends directly on the org.apache.xerces.parsers.DOMParser class.

Example 9.3. A program that uses Xerces to check documents for well-formedness

import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;


public class XercesChecker {

  public static void main(String[] args) {
     
    if (args.length <= 0) {
      System.out.println("Usage: java XercesChecker URL"); 
      return;
    }
    String document = args[0];
    
    DOMParser parser = new DOMParser();
    try {
      parser.parse(document); 
      System.out.println(document + " is well-formed.");
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
   
  }

}

It’s not hard to port XercesChecker to a different parser like Oracle, but you do need to change the source code as shown in Example 9.4 and recompile.

Example 9.4. A program that uses the Oracle XML parser to check documents for well-formedness

import oracle.xml.parser.v2.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class OracleChecker {

  public static void main(String[] args) {
     
    if (args.length <= 0) {
      System.out.println("Usage: java OracleChecker URL"); 
      return;
    }
    String document = args[0];
    
    DOMParser parser = new DOMParser();
    try {
      parser.parse(document); 
      System.out.println(document + " is well-formed.");
    }
    catch (XMLParseException e) {
      System.out.println(document + " is not well-formed.");
      System.out.println(e);
      
    }
    catch (SAXException e) {
      System.out.println(document + " could not be parsed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
   
  }

}

Other parsers have slightly different methods still. What all of these have in common is that they read an XML document from a source of text, most commonly a file or a stream, and provide an org.w3c.dom.Document object. Once you have a reference to this Document object you can work with it using only the standard methods of the DOM interfaces. There’s no further need to use parser-specific classes.

JAXP DocumentBuilder and DocumentBuilderFactory

The lack of a standard means of parsing an XML document is one of the holes that JAXP fills. If your parser implements JAXP, then instead of using the parser-specific classes, you can use the javax.xml.parsers.DocumentBuilderFactory and javax.xml.parsers.DocumentBuilder classes to parse the documents. The basic approach is as follows:

Use the static DocumentBuilderFactory.newInstance() factory method to return a DocumentBuilderFactory object.
Use the newDocumentBuilder() method of this DocumentBuilderFactory object to return a parser-specific instance of the abstract DocumentBuilder class.
Use one of the five parse() methods of DocumentBuilder to read the XML document and return an org.w3c.dom.Document object.

Example 9.5 demonstrates with a simple program that uses JAXP to check documents for well-formedness.

Example 9.5. A program that uses JAXP to check documents for well-formedness

import javax.xml.parsers.*; // JAXP
import org.xml.sax.SAXException;
import java.io.IOException;


public class JAXPChecker {

  public static void main(String[] args) {
     
    if (args.length <= 0) {
      System.out.println("Usage: java JAXPChecker URL");
      return;
    }
    String document = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.parse(document); 
      System.out.println(document + " is well-formed.");
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
    catch (FactoryConfigurationError e) { 
      // JAXP suffers from excessive brain-damage caused by 
      // intellectual in-breeding at Sun. (Basically the Sun 
      // engineers spend way too much time talking to each other
      // and not nearly enough time talking to people outside 
      // Sun.) Fortunately, you can happily ignore most of the 
      // JAXP brain damage and not be any the poorer for it.
      
      // This, however, is one of the few problems you can't 
      // avoid if you're going to use JAXP at all. 
      // DocumentBuilderFactory.newInstance() should throw a 
      // ClassNotFoundException if it can't locate the factory
      // class. However, what it does throw is an Error,
      // specifically a FactoryConfigurationError. Very few 
      // programs are prepared to respond to errors as opposed
      // to exceptions. You should catch this error in your 
      // JAXP programs as quickly as possible even though the
      // compiler won't require you to, and you should 
      // never rethrow it or otherwise let it escape from the 
      // method that produced it. 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    }
   
  }

}

For example, here’s the output from when I ran this program across this chapter’s DocBook source code:

D:\books\XMLJAVA>java JAXPChecker file:///D:/books/xmljava/dom.xml
file:///D:/books/xmljava/dom.xml is well-formed.
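
The same three steps work with the other parse() overloads. As a minimal sketch, the following variant reads the document from an in-memory string through the parse(InputStream) overload instead of from a URL. The class name StringChecker and the helper rootName() are made up for this illustration; only the javax.xml.parsers and org.w3c.dom calls are standard JAXP.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class StringChecker {

  // Hypothetical helper: returns the name of the root element,
  // or null if the text is not well-formed.
  public static String rootName(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      InputStream in = new ByteArrayInputStream(
       xml.getBytes(StandardCharsets.UTF_8));
      Document doc = parser.parse(in);
      return doc.getDocumentElement().getTagName();
    }
    catch (SAXException e) {
      // The parser reports well-formedness errors this way
      return null;
    }
    catch (IOException | ParserConfigurationException e) {
      // Not expected for an in-memory document
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(rootName("<greeting>Hello</greeting>"));
    System.out.println(rootName("<greeting>Hello"));
  }

}
```

A parser that has no error handler installed may also print its own complaint about the second, malformed string to System.err before the SAXException arrives.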

How JAXP Chooses Parsers

You may be wondering which parser this program actually uses. JAXP, after all, is reasonably parser-independent. The answer depends on which parsers are installed in your class path and how certain system properties are set. JAXP first checks the javax.xml.parsers.DocumentBuilderFactory system property and, if it is set, uses the class it names. For example, if you want to make sure that Xerces is used to parse documents, then you would run JAXPChecker like this:

D:\books\XMLJAVA>java 
 -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl 
 JAXPChecker file:///D:/books/xmljava/dom.xml
file:///D:/books/xmljava/dom.xml is well-formed.

If the javax.xml.parsers.DocumentBuilderFactory property is not set, then JAXP looks in the lib/jaxp.properties properties file in the JRE directory to determine a default value for the javax.xml.parsers.DocumentBuilderFactory system property. If you want to consistently use a certain DOM parser, for instance gnu.xml.dom.JAXPFactory, place the following line in that file:

javax.xml.parsers.DocumentBuilderFactory=gnu.xml.dom.JAXPFactory

If this fails to locate a parser, next JAXP looks for a META-INF/services/javax.xml.parsers.DocumentBuilderFactory file in all JAR files available to the runtime to find the name of the concrete DocumentBuilderFactory subclass.

Finally, if that fails, then DocumentBuilderFactory.newInstance() returns an instance of a default class, generally the parser from the vendor who also provided the JAXP classes. For example, the JDK JAXP classes pick org.apache.crimson.jaxp.DocumentBuilderFactoryImpl by default but the Ælfred JAXP classes pick gnu.xml.dom.JAXPFactory instead.
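
If you're not sure which of these four rules won, you can simply ask the objects JAXP hands back. The following is a small sketch; WhichParser is a hypothetical class name, and what it prints depends entirely on your class path and system properties.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class WhichParser {

  // Name of the concrete factory class JAXP selected
  public static String factoryClassName() {
    DocumentBuilderFactory factory
     = DocumentBuilderFactory.newInstance();
    return factory.getClass().getName();
  }

  // Name of the concrete parser class that factory creates
  public static String builderClassName() {
    try {
      DocumentBuilder parser
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      return parser.getClass().getName();
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("Factory: " + factoryClassName());
    System.out.println("Builder: " + builderClassName());
  }

}
```

Running it once with and once without the -D option shown above is a quick way to confirm that the system property is really being honored.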

Configuring DocumentBuilderFactory

The DocumentBuilderFactory has a number of options that allow you to determine exactly how the parsers it creates behave. Most of the setter methods take a boolean that turns the feature on if true or off if false. However, a couple of the features are defined as confusing double negatives, so read carefully.

Coalescing

These two methods determine whether CDATA sections are merged with text nodes or not. If the coalescing feature is true, then the result tree will not contain any CDATA section nodes, even if the parsed XML document does contain CDATA sections.

public boolean isCoalescing();
public void setCoalescing(boolean coalescing);

The default is false, but in most situations you should set this to true, especially if you’re just reading the document and are not going to write it back out again. CDATA sections should not be treated differently than any other text. Whether or not certain text is written in a CDATA section should be purely a matter of syntax sugar for human convenience, not anything that has any effect on the data model.
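
To make the difference concrete, here is a minimal sketch that parses the same one-element document twice and counts the children of the root element. CoalescingDemo, SAMPLE, and childCount() are names invented for this illustration. With the parser bundled with current JDKs I would expect three children when coalescing is off (text, CDATA section, text) and a single merged text node when it is on.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CoalescingDemo {

  // A paragraph whose middle portion is written as a CDATA section
  public static final String SAMPLE
   = "<p>one <![CDATA[<two>]]> three</p>";

  // Counts the child nodes of the root element
  public static int childCount(String xml, boolean coalescing) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setCoalescing(coalescing);
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getChildNodes().getLength();
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("coalescing off: " + childCount(SAMPLE, false));
    System.out.println("coalescing on:  " + childCount(SAMPLE, true));
  }

}
```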

Expand Entity References

The following two methods determine whether the parsers produced by this factory expand entity references.

public boolean isExpandEntityReferences();
public void setExpandEntityReferences(boolean expandEntityReferences);

The default is true. If a parser is validating, then it will expand entity references, even if this feature is set to false. That is, the validation feature overrides the expand entity references feature.

The five predefined entity references (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, and &amp;apos;) will always be expanded regardless of the value of this property.

Ignore Comments

The following two methods determine whether the parsers produced by this factory will generate comment nodes for comments seen in the input document. The default, false, means that comment nodes will be produced. (Watch out for the double negative here. False means include comments, and true means don’t include comments. This confused me initially, and I was getting my poison pen all ready to write about the brain damage of throwing away comments although the spec required them to be included, when I realized that the method was in fact behaving like it should.)

public boolean isIgnoringComments();
public void setIgnoringComments(boolean ignoringComments);
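
If the double negative still seems slippery, a quick experiment settles it. The sketch below (CommentDemo, SAMPLE, and childCount() are hypothetical names) counts the children of an element that contains one comment and one run of text. I would expect two children with the default setting and one child once comments are ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CommentDemo {

  // One comment followed by one run of text
  public static final String SAMPLE = "<p><!-- a note -->text</p>";

  // Counts the child nodes of the root element
  public static int childCount(String xml, boolean ignoreComments) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      // true means "leave comments out of the tree"
      factory.setIgnoringComments(ignoreComments);
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getChildNodes().getLength();
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("default (false): " + childCount(SAMPLE, false));
    System.out.println("ignoring (true): " + childCount(SAMPLE, true));
  }

}
```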

Ignore Element Content Whitespace

The following two methods determine whether the parsers produced by this factory will generate text nodes for so-called “ignorable white space”; that is, white space that occurs between tags where the DTD specifies that parsed character data cannot appear.

public boolean isIgnoringElementContentWhitespace();
public void setIgnoringElementContentWhitespace(boolean ignoreElementContentWhitespace);

The default is false; that is, include text nodes for ignorable white space. Setting this to true might well be useful in record-like documents. However, for this property to make a difference, the documents must have a DTD and should be valid or very nearly so. Otherwise the parser can’t tell which white space is ignorable and which isn’t.
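
Here is a minimal sketch of the record-like case. The document carries its own DTD in the internal subset, so the parser can tell that a list element may only contain item elements. WhitespaceDemo, SAMPLE, and childCount() are names I made up. I've also assumed that validation needs to be turned on for the setting to take effect; that is what the JDK's API documentation says, though some parsers may be more lenient. With the JDK's bundled parser I would expect five children of list by default (three white space text nodes plus two items) and two when ignorable white space is dropped.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WhitespaceDemo {

  // A valid, record-like document with an internal DTD subset
  public static final String SAMPLE
   = "<!DOCTYPE list [\n"
   + "  <!ELEMENT list (item+)>\n"
   + "  <!ELEMENT item (#PCDATA)>\n"
   + "]>\n"
   + "<list>\n"
   + "  <item>a</item>\n"
   + "  <item>b</item>\n"
   + "</list>";

  // Counts the child nodes of the root element
  public static int childCount(String xml, boolean ignoreWhitespace) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      // Assumption: the parser needs to be validating before it
      // will act on the content models in the DTD
      factory.setValidating(true);
      factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getChildNodes().getLength();
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("keep:   " + childCount(SAMPLE, false));
    System.out.println("ignore: " + childCount(SAMPLE, true));
  }

}
```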

Namespace Aware

The following two methods determine whether the parsers produced by this factory are “namespace aware.” A namespace aware parser will set the prefix and namespace URI properties of element and attribute nodes that are in a namespace. A non-namespace aware parser won’t.

public boolean isNamespaceAware();
public void setNamespaceAware(boolean namespaceAware);

The default is false, which is truly brain-damaged. You should always set this to true. For example,

DocumentBuilderFactory factory
 = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
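
To see what you lose with the default, the following sketch asks the root element for its namespace URI under both settings. NamespaceDemo, SAMPLE, and rootNamespace() are hypothetical names chosen for this example. Without namespace awareness I would expect null to come back; with it, the URI declared in the document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NamespaceDemo {

  // A single prefixed element in the SVG namespace
  public static final String SAMPLE
   = "<svg:rect xmlns:svg='http://www.w3.org/2000/svg'/>";

  // Returns the namespace URI the parser recorded for the root element
  public static String rootNamespace(String xml, boolean namespaceAware) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(namespaceAware);
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getNamespaceURI();
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("not aware: " + rootNamespace(SAMPLE, false));
    System.out.println("aware:     " + rootNamespace(SAMPLE, true));
  }

}
```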

Validating

These methods determine whether or not the parsers produced by this factory validate the document against its DTD.

public boolean isValidating();
public void setValidating(boolean validating);

The default is false, do not validate. If you want to validate your documents, set this property to true. You’ll also need to register a SAX ErrorHandler with the DocumentBuilder using its setErrorHandler() method to receive notice of validity errors. Example 9.6 demonstrates with a program that uses JAXP to validate a document named on the command line.

Example 9.6. Using JAXP to check documents for validity

import javax.xml.parsers.*; // JAXP
import org.xml.sax.*;
import java.io.IOException;


public class JAXPValidator {

  public static void main(String[] args) {
     
    if (args.length <= 0) {
      System.out.println("Usage: java JAXPValidator URL");
      return;
    }
    String document = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      // Always turn on namespace awareness
      factory.setNamespaceAware(true);
      // Turn on validation
      factory.setValidating(true);

      DocumentBuilder parser = factory.newDocumentBuilder();
      
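      // SAXValidator was developed in Chapter 7. The variable is
      // declared as SAXValidator rather than ErrorHandler because
      // isValid() is not part of the ErrorHandler interface.
      SAXValidator handler = new SAXValidator();
      parser.setErrorHandler(handler);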
      
      parser.parse(document); 
      if (handler.isValid()) {
        System.out.println(document + " is valid.");
      }
      else {
        // If the document isn't well-formed, an exception has
        // already been thrown and this has been skipped.
        System.out.println(document 
         + " is well-formed but not valid.");
      }
      
    }
    catch (SAXException e) {
      System.out.println(document + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    }
   
  }

}
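
If you don't have the Chapter 7 SAXValidator class at hand, an anonymous ErrorHandler that merely counts validity errors does much the same job. The following is a sketch under that assumption, not a replacement for Example 9.6; ValidityCounter, VALID, INVALID, and validityErrors() are names invented here, and the two sample documents carry their DTD in the internal subset so the example needs no external files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ValidityCounter {

  public static final String VALID
   = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]><note>hi</note>";
  public static final String INVALID
   = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]><note><b/></note>";

  // Returns the number of validity errors reported,
  // or -1 if the document is not even well-formed.
  public static int validityErrors(String xml) {
    final int[] count = {0};
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) {
          // Warnings don't affect validity
        }
        public void error(SAXParseException e) {
          count[0]++;  // a validity error; parsing continues
        }
        public void fatalError(SAXParseException e)
         throws SAXException {
          throw e;     // a well-formedness error; give up
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
      return count[0];
    }
    catch (SAXException e) {
      return -1;
    }
    catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("valid:     " + validityErrors(VALID));
    System.out.println("invalid:   " + validityErrors(INVALID));
    System.out.println("malformed: " + validityErrors("<note>"));
  }

}
```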

Parser-specific Attributes

Many JAXP-aware parsers support various custom features. For example, Xerces has an http://apache.org/xml/features/dom/create-entity-ref-nodes feature that lets you choose whether or not to include entity reference nodes in the DOM tree. This is not the same as deciding whether or not to expand entity references. That determines whether the entity reference nodes that are placed in the tree have children representing their replacement text or not.

JAXP allows you to set and get these custom features as objects of the appropriate type using these two methods:

public Object getAttribute(String name)
    throws IllegalArgumentException;

public void setAttribute(String name, Object value)
    throws IllegalArgumentException;

For example, suppose you’re using Xerces and you don’t want to include entity reference nodes. They're included by default so you need to set http://apache.org/xml/features/dom/create-entity-ref-nodes to false. You would use setAttribute() on the DocumentBuilderFactory like this:

DocumentBuilderFactory factory
 = DocumentBuilderFactory.newInstance();
factory.setAttribute(
 "http://apache.org/xml/features/dom/create-entity-ref-nodes",
 Boolean.FALSE
);

The naming conventions for both attribute names and values depend on the underlying parser. Xerces uses URL strings like SAX feature names. Other parsers may do something different. JAXP 1.2 will add a couple of standard attributes related to schema validation.
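
Because the names are parser-specific, code that has to run on more than one parser should be ready for setAttribute() to reject a name it doesn't know. The following is a defensive sketch; AttributeProbe and trySet() are hypothetical names, and the second attribute name in main() is deliberately made up so that it should be refused.

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class AttributeProbe {

  // Returns true if the current factory accepted the attribute,
  // false if it rejected the name or the value.
  public static boolean trySet(String name, Object value) {
    DocumentBuilderFactory factory
     = DocumentBuilderFactory.newInstance();
    try {
      factory.setAttribute(name, value);
      return true;
    }
    catch (IllegalArgumentException e) {
      // This parser doesn't recognize the attribute
      return false;
    }
  }

  public static void main(String[] args) {
    // A Xerces feature name; other parsers may not know it
    System.out.println(trySet(
     "http://apache.org/xml/features/dom/create-entity-ref-nodes",
     Boolean.FALSE));
    // An invented name that no parser should recognize
    System.out.println(trySet(
     "http://example.com/no-such-attribute", Boolean.TRUE));
  }

}
```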

DOM3 Load and Save

JAXP only works for Java, and it is a Sun proprietary standard. Consequently, the W3C DOM working group is preparing an alternative cross-vendor means of parsing an XML document with a DOM parser. This will be published as part of DOM Level 3. DOM3 is not close to a finished recommendation at the time of this writing and is not yet implemented by any parsers, but I can show you pretty much what the interface is likely to look like.

Parsing a document with DOM3 requires four steps:

Load a DOMImplementation object by passing the feature string "LS-Load 3.0" to the DOMImplementationRegistry.getDOMImplementation() factory method. (This class is also new in DOM3.)
Cast this DOMImplementation object to DOMImplementationLS, the sub-interface that provides the extra methods you need.
Call the implementation’s createDOMBuilder() method to create a new DOMBuilder object. This is the new DOM3 class that represents the parser. The first argument to createDOMBuilder() specifies whether the document is parsed synchronously or asynchronously. The second argument is a URL identifying the type of schema to be used during the parse, "http://www.w3.org/2001/XMLSchema" for W3C XML Schemas, "http://www.w3.org/TR/REC-xml" for DTDs. You can pass null to ignore all schemas.
Pass the document’s URL to the builder object’s parseURI() method to read the document and return a Document object.

Example 9.7 demonstrates with a simple program that uses DOM3 to check documents for well-formedness.

Example 9.7. A program that uses DOM3 to check documents for well-formedness

import org.w3c.dom.*;
import org.w3c.dom.ls.*;
import java.io.IOException;


public class DOM3Checker {

  public static void main(String[] args) {
     
    if (args.length <= 0) {
      System.out.println("Usage: java DOM3Checker URL");
      return;
    }
    String document = args[0];
    
    try {
      DOMImplementationLS impl = (DOMImplementationLS) 
       DOMImplementationRegistry
       .getDOMImplementation("LS-Load 3.0");
      DOMBuilder parser = impl.createDOMBuilder(
       DOMImplementationLS.MODE_SYNCHRONOUS,
       "http://www.w3.org/TR/REC-xml");
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    // Use DTDs when parsing
      Document doc = parser.parseURI(document); 
      System.out.println(document + " is well-formed.");
    }
    catch (NullPointerException e) {
      System.err.println("The current DOM implementation does"
       + " not support DOM Level 3 Load and Save");
    }
    catch (DOMException e) {
      System.err.println(document + " is not well-formed");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " 
       + document
      ); 
    }
    catch (Exception e) {
      // Probably a ClassNotFoundException,
      // InstantiationException, or IllegalAccessException 
      // thrown by DOMImplementationRegistry.getDOMImplementation
      System.out.println("Probable CLASSPATH problem."); 
      e.printStackTrace(); 
    }
   
  }

}

For the time being, JAXP’s DocumentBuilderFactory is the obvious choice since it works today and is supported by almost all DOM parsers written in Java. Longer term, DOM3 will provide a number of important capabilities JAXP does not, including parse progress notification and document filtering. However, since these APIs are far from ready for prime time just yet, for the rest of this book, I’m mostly going to use JAXP without further comment.

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified May 26, 2002