DOM Parsers for Java

DOM is defined almost completely in terms of interfaces rather than classes. Different parsers provide their own custom implementations of these standard interfaces. This offers a great deal of flexibility. Generally, you do not install the DOM interfaces on their own. Instead, they come bundled with a parser distribution that provides the detailed implementation classes. DOM isn’t quite as broadly supported as SAX, but most of the major Java parsers provide it, including Crimson, Xerces, XML for Java, the Oracle XML Parser for Java, and GNU JAXP.
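As a small illustration of this split between interfaces and implementations, the following sketch prints the concrete class names behind the org.w3c.dom.Document and org.w3c.dom.Element interfaces. It assumes a JAXP-capable parser is available on the class path (JAXP is discussed below), and the class name ShowImplementation is made up for this example. The names printed depend entirely on which parser is installed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of DOM or of any parser.
public class ShowImplementation {

    // Parses a small document and returns the names of the concrete
    // classes that implement the Document and Element interfaces.
    public static String[] implementationClasses(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        return new String[] {
            doc.getClass().getName(),
            root.getClass().getName()
        };
    }

    public static void main(String[] args) throws Exception {
        // The output varies from one parser to the next.
        for (String name : implementationClasses("<greeting>Hello</greeting>")) {
            System.out.println(name);
        }
    }
}
```

Client code that sticks to the org.w3c.dom interfaces never needs to mention these class names, which is what makes moving between parsers feasible.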

DOM is not complete unto itself. Almost all significant DOM programs need to use some parser-specific classes. DOM programs are not too difficult to port from one parser to another, but a recompile is normally required. You can’t just change a system property to switch from one parser to another as you can with SAX. In particular, DOM2 does not specify how one parses a document, creates a new document, or serializes a document into a file or onto a stream. These important functions are all performed by parser-specific classes.

JAXP, the Java API for XML Processing, fills in a few of the holes in DOM by providing standard, parser-independent means to parse existing documents, create new documents, and serialize in-memory DOM trees to XML files. Most current Java parsers that support DOM Level 2 also support JAXP 1.1. JAXP is a standard part of Java 1.4. It is not included in earlier versions of Java, but it does work with Java 1.1 and later and is bundled with most parser class libraries.
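The three JAXP tasks named above can be sketched as follows. This is a minimal illustration rather than a complete program: the class name JaxpSketch and its method names are invented for the example, and serialization is done with an identity transform from javax.xml.transform, which is the usual JAXP route for writing a DOM tree as XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class; only the javax.xml and org.w3c.dom calls are standard.
public class JaxpSketch {

    private static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    // 1. Parse an existing document without naming a parser class.
    public static Document parse(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // 2. Create a new, empty document and add a root element to it.
    public static Document create(String rootName, String text) throws Exception {
        Document doc = newBuilder().newDocument();
        Element root = doc.createElementNS(null, rootName);
        root.appendChild(doc.createTextNode(text));
        doc.appendChild(root);
        return doc;
    }

    // 3. Serialize an in-memory DOM tree by copying it through an
    //    identity transform onto a character stream.
    public static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(create("greeting", "Hello")));
        System.out.println(parse("<a><b/></a>").getDocumentElement().getTagName());
    }
}
```

Nothing in this code names a parser-specific class, so it should run unchanged against any parser that supports JAXP.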

DOM Level 3 also promises to fill the same holes JAXP fills (parsing, serializing, and bootstrapping). However, it is not yet finished and is not yet widely supported by any parser.

Because DOM depends so heavily on parser classes, its performance characteristics vary widely from one parser to the next. Speed is something of a concern, but memory consumption is a much bigger issue for most applications. All DOM implementations I’ve seen use more space for the in-memory DOM tree than the actual file on the disk occupies. Generally, the in-memory DOM trees range from three to ten times as large as the actual XML text. Some parsers, including Xerces, offer a “lazy DOM” that leaves most of the document on the disk and reads into memory only those parts of the document the client actually requests.
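In Xerces the lazy DOM is controlled by a parser feature named http://apache.org/xml/features/dom/defer-node-expansion. The sketch below shows one way to request it through JAXP, assuming the underlying parser is Xerces or derived from it. The feature name is specific to Xerces, so the code falls back to the parser's default behavior when the name is not recognized. The class name LazyDomSketch is invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class illustrating a Xerces-specific setting.
public class LazyDomSketch {

    // Xerces feature name; other parsers are not required to recognize it.
    static final String DEFER_NODE_EXPANSION =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    public static Document parse(String xml, boolean lazy) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            // Xerces-based factories pass Boolean attributes through
            // to the underlying parser as features.
            factory.setAttribute(DEFER_NODE_EXPANSION, Boolean.valueOf(lazy));
        } catch (IllegalArgumentException ex) {
            // Not a Xerces-derived parser; keep its default tree building.
        }
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b/><c/></a>", true);
        // The tree looks the same to the client either way; only the
        // moment at which node objects are built differs.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Whether the setting saves memory in practice depends on how much of the tree the program actually walks, so it is worth measuring with a representative document.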

Another distinguishing factor between different DOM implementations is the extra features the parser provides. Most parsers provide methods to parse XML documents and serialize DOM trees to XML. Other useful features include schema validation, database access, XInclude, XSLT, XPath, support for different character sets, and application-specific DOMs like the MathML, SVG, and WML DOMs.

For example, the Oracle and Xerces parsers provide schema validation. Ælfred and Crimson don’t. Ælfred has partial support for XInclude. The other three don’t. The Oracle XML parser can produce a DOM Document object from a SQL query against a relational database or a JDBC ResultSet object. The other three can’t. The Oracle XML parser can decode the WAP binary XML format. The other three can’t. Xerces has specialized DOMs for HTML and WML documents. The other three don’t. These are all non-standard features, but if they’re useful to you, that would be a good reason to choose one parser over another. Table 9.2 summarizes parser support for various useful features.

Table 9.2. DOM Parser Features

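Feature          Xerces       Ælfred   Oracle  Crimson
---------------  -----------  -------  ------  -------
DTDs             X            X        X       X
Schemas          X            -        X       -
Namespaces       X            X        X       X
Lazy DOM         X            -        -       -
HTML DOM         X            -        -       -
Views            -            -        -       -
Style Sheets     -            -        -       -
CSS              -            -        -       -
CSS2             -            -        -       -
Events           X            X        X       -
UI Events        -            X        -       -
Mouse Events     -            -        -       -
Mutation Events  X            X        -       -
HTML Events      -            X        -       -
Traversal        X            partial  X       -
Range            -            -        X       -
XSLT/XPath       Via Xalan-J  -        X       -
XInclude         -            X        -       -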
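
Many of the rows in this table correspond to optional DOM Level 2 modules, and a program can ask an implementation at run time which of them it claims to support by calling the standard hasFeature() method of DOMImplementation. The following sketch does so for the DOM2 feature names. The class name FeatureReport is invented for this example, and the answers depend on the installed parser. Rows such as DTDs, schemas, and XInclude are not DOM features and cannot be queried this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;

// Hypothetical example class; hasFeature() itself is standard DOM2.
public class FeatureReport {

    // Feature names defined by the DOM Level 2 specifications.
    static final String[] MODULES = {
        "Core", "XML", "HTML", "Views", "StyleSheets", "CSS", "CSS2",
        "Events", "UIEvents", "MouseEvents", "MutationEvents",
        "HTMLEvents", "Traversal", "Range"
    };

    // Asks the installed DOM implementation which DOM2 modules it claims.
    public static Map<String, Boolean> report() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .getDOMImplementation();
        Map<String, Boolean> result = new LinkedHashMap<String, Boolean>();
        for (String module : MODULES) {
            result.put(module, Boolean.valueOf(impl.hasFeature(module, "2.0")));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, Boolean> entry : report().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

A true result is only the implementation's own claim of conformance, so features that matter to an application are still worth testing directly.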

Copyright 2001, 2002 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified July 27, 2002