DOM Parsers for Java

Chapter 9. The Document Object Model

DOM is defined almost completely in terms of interfaces rather than classes. Different parsers provide their own custom implementations of these standard interfaces, which offers a great deal of flexibility. Generally, you do not install the DOM interfaces on their own. Instead, they come bundled with a parser distribution that provides the detailed implementation classes. DOM isn’t quite as broadly supported as SAX, but most of the major Java parsers provide it, including Crimson, Xerces, XML for Java, the Oracle XML Parser for Java, and GNU JAXP.

DOM is not complete unto itself. Almost all significant DOM programs need to use some parser-specific classes. DOM programs are not too difficult to port from one parser to another, but a recompile is normally required. You can’t just change a system property to switch from one parser to another as you can with SAX. In particular, DOM2 does not specify how one parses a document, creates a new document, or serializes a document into a file or onto a stream. These important functions are all performed by parser-specific classes.

JAXP, the Java API for XML Processing, fills in a few of the holes in DOM by providing standard, parser-independent means to parse existing documents, create new documents, and serialize in-memory DOM trees to XML files. Most current Java parsers that support DOM Level 2 also support JAXP 1.1. JAXP is a standard part of Java 1.4. It is not included in earlier versions of Java, but it does work with Java 1.1 and later and is bundled with most parser class libraries.
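The three JAXP operations named above (parse, create, serialize) can be sketched as follows. This is a minimal illustration, not the only way to use JAXP: the class name JaxpRoundTrip and its method names are hypothetical, and the serializer is an identity transform from the javax.xml.transform package, which is how JAXP exposes serialization.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class; only the javax.xml.* and org.w3c.dom.* calls are standard.
class JaxpRoundTrip {

    // Parse XML text into a DOM Document using whichever parser JAXP locates.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories are not namespace aware by default
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Create a new Document and give it a root element.
    static Document create(String rootName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement(rootName);
        doc.appendChild(root);
        return doc;
    }

    // Serialize a DOM tree with an identity transform (a Transformer built without a stylesheet).
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document parsed = parse("<greeting>Hello DOM</greeting>");
        System.out.println(parsed.getDocumentElement().getTagName());
        System.out.println(serialize(create("memo")));
    }
}
```

Nothing in this sketch names a parser class, so the same source should run against any JAXP-compliant parser found on the class path.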

DOM Level 3 also promises to fill the same holes JAXP fills (parsing, serializing, and bootstrapping). However, it is not yet finished and is not yet widely supported by any parser.
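For comparison, here is a sketch of the same three tasks written against the DOM Level 3 Load and Save interfaces in the form they were eventually standardized (the org.w3c.dom.ls and org.w3c.dom.bootstrap packages). It assumes a runtime that ships a finished DOM Level 3 implementation, which is not the case for the parsers discussed in this section. The class name Dom3LoadSave is hypothetical.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper class; assumes a DOM Level 3 Load and Save implementation is available.
class Dom3LoadSave {

    // Bootstrapping: ask the registry for an implementation that supports the "LS" feature.
    static DOMImplementation bootstrap() throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementation impl = registry.getDOMImplementation("LS");
        if (impl == null) {
            throw new IllegalStateException("No DOM implementation with Load and Save support found");
        }
        return impl;
    }

    // Parsing: build a synchronous LSParser and feed it the XML text.
    static Document parse(String xml) throws Exception {
        DOMImplementationLS ls = (DOMImplementationLS) bootstrap();
        LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        return parser.parse(input);
    }

    // Serializing: write the whole tree to a String.
    static String serialize(Document doc) throws Exception {
        LSSerializer serializer = ((DOMImplementationLS) bootstrap()).createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        // Creating a new document uses the DOM Level 2 createDocument method on the bootstrapped implementation.
        Document fresh = bootstrap().createDocument(null, "memo", null);
        System.out.println(serialize(fresh));
        System.out.println(serialize(parse("<greeting>Hello DOM</greeting>")));
    }
}
```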

Because DOM depends so heavily on parser classes, its performance characteristics vary widely from one parser to the next. Speed is something of a concern, but memory consumption is a much bigger issue for most applications. All DOM implementations I’ve seen use more space for the in-memory DOM tree than the file occupies on disk. Generally, the in-memory DOM trees range from three to ten times as large as the actual XML text. Some parsers, including Xerces, offer a “lazy DOM” that leaves most of the document on the disk and reads into memory only those parts of the document the client actually requests.
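Xerces documents a parser feature, http://apache.org/xml/features/dom/defer-node-expansion, that controls its lazy DOM. The sketch below requests it through a JAXP factory. It assumes a JAXP version whose DocumentBuilderFactory has a setFeature method (JAXP 1.1 does not), and whether a given parser accepts the feature is also an assumption, so the code reports a refusal instead of failing. The class name LazyDomOption is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the feature URI is Xerces-specific, not part of DOM or JAXP.
class LazyDomOption {

    static final String DEFER_NODE_EXPANSION =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    // Returns true if the underlying parser accepted the setting, false if it refused it.
    static boolean requestLazyDom(DocumentBuilderFactory factory, boolean lazy) {
        try {
            factory.setFeature(DEFER_NODE_EXPANSION, lazy);
            return true;
        } catch (ParserConfigurationException e) {
            return false; // a non-Xerces parser may not recognize this feature
        }
    }

    // Parse with whatever configuration the factory ended up with.
    static Document parse(DocumentBuilderFactory factory, String xml) throws Exception {
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        boolean accepted = requestLazyDom(factory, true);
        System.out.println("Lazy DOM setting accepted: " + accepted);
        Document doc = parse(factory, "<root><child/></root>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```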

Measuring DOM Size

To test the memory usage of various implementations, I wrote a simple program that loaded the XML specification, 2nd edition, into a DOM Document object. The spec’s text format is 197K (not including the DTD, which adds another 56K but isn’t really modeled by DOM at all). Here’s the approximate amount of memory used by the Document objects built from this file by several parsers:

Xerces-J 2.0.1: 1489K
Crimson 1.1.3 (JDK 1.4 default): 1230K
Oracle XML Parser for Java 9.2.0.2.0: 2500K

I used a couple of different techniques to measure the memory used. In one case, I used OptimizeIt and the Java Virtual Machine Profiling Interface (JVMPI) to check the heap size. I ran the program both with and without loading the document, then subtracted the total heap memory used without loading the document from the memory used when the document was loaded to get the numbers reported above. In the other test, I used the Runtime class to measure the total memory and the free memory before and after the Document was created. In both cases, I garbage collected before taking the final measurements. The results from the separate tests were within 15% of each other. All tests were performed in Sun’s JDK 1.4.0 using HotSpot on Windows NT 4.0 SP6.
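The Runtime-based technique can be sketched roughly as follows. This is not the program used for the numbers above, only an illustration of the approach under stated assumptions: the class name DomMemoryProbe, its method names, and the generated sample document are all hypothetical. System.gc() is only a request to the virtual machine, so the readings are approximate.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical measurement harness; results depend heavily on when garbage is collected.
class DomMemoryProbe {

    // Heap currently in use: total memory minus free memory, read after requesting a collection.
    static long usedHeap() {
        Runtime runtime = Runtime.getRuntime();
        System.gc(); // a request only; the VM may ignore it
        return runtime.totalMemory() - runtime.freeMemory();
    }

    // Approximate number of bytes by which the heap grew while the Document was built.
    static long measureGrowth(InputSource source) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        long before = usedHeap();
        Document doc = builder.parse(source);
        long after = usedHeap();
        // Use the document after the second reading so it is still reachable when measured.
        if (doc.getDocumentElement() == null) {
            throw new IllegalStateException("Document has no root element");
        }
        return after - before;
    }

    // Builds a synthetic document for runs where no file name is supplied.
    static String sampleDocument(int items) {
        StringBuilder xml = new StringBuilder("<items>");
        for (int i = 0; i < items; i++) {
            xml.append("<item id=\"").append(i).append("\">text ").append(i).append("</item>");
        }
        return xml.append("</items>").toString();
    }

    public static void main(String[] args) throws Exception {
        InputSource source = args.length > 0
            ? new InputSource(new File(args[0]).toURI().toString())
            : new InputSource(new StringReader(sampleDocument(10000)));
        System.out.println("Approximate heap growth: " + measureGrowth(source) / 1024 + "K");
    }
}
```

Running the program several times and comparing the readings gives a feel for how noisy a single measurement is.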

I don’t claim these numbers are perfect, and I certainly don’t think this one test document justifies any claims whatsoever about the relative efficiency of the different DOM implementations. The difference between Crimson and Xerces is well within my margin of error. A more serious test would have to look at how the different implementations scale with the size of the initial document, and perhaps graph the curves of memory size vs. file size. For instance, it’s possible that each of these requires a minimum of 1024K per document, but grows relatively slowly after that point. I did run the same tests on a minimal document that contained a single empty element. The results ranged from 3K to 131K for this document. However, these numbers were extremely sensitive to exactly when and how garbage was collected. I wouldn’t claim the results are accurate to better than ±300K. Even so, I do think that together these tests demonstrate just how inefficient DOM is.

Another distinguishing factor between different DOM implementations is the extra features the parser provides. Most parsers provide methods to parse XML documents and serialize DOM trees to XML. Other useful features include schema validation, database access, XInclude, XSLT, XPath, support for different character sets, and application-specific DOMs like the MathML, SVG, and WML DOMs.

For example, the Oracle and Xerces parsers provide schema validation. Ælfred and Crimson don’t. Ælfred has partial support for XInclude. The other three don’t. The Oracle XML parser can produce a DOM Document object from a SQL query against a relational database or a JDBC ResultSet object. The other three can’t. The Oracle XML parser can decode the WAP binary XML format. The other three can’t. Xerces has specialized DOMs for HTML and WML documents. The other three don’t. These are all non-standard features, but if they’re useful to you, that would be a good reason to choose one parser over another. Table 9.2 summarizes parser support for various useful features.

Table 9.2. DOM Parser Features

Feature          Xerces       Ælfred   Oracle  Crimson
DTDs             X            X        X       X
Schemas          X            -        X       -
Namespaces       X            X        X       X
Lazy DOM         X            -        -       -
HTML DOM         X            -        -       -
Views            -            -        -       -
Style Sheets     -            -        -       -
CSS              -            -        -       -
CSS2             -            -        -       -
Events           X            X        X       -
UI Events        -            X        -       -
Mouse Events     -            -        -       -
Mutation Events  X            X        -       -
HTML Events      -            X        -       -
Traversal        X            partial  X       -
Range            -            -        X       -
XSLT/XPath       Via Xalan-J  -        X       -
XInclude         -            X        -       -
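Several rows of Table 9.2 (Views, Style Sheets, CSS, CSS2, the Events family, Traversal, Range, and the HTML DOM) correspond to DOM Level 2 modules, and a DOM implementation can be asked at runtime whether it claims to support them through the standard DOMImplementation.hasFeature method. The following sketch prints such a report for whichever parser JAXP locates. The class name DomFeatureReport is hypothetical, and the answers are self-reported by the implementation, so they may not match the table exactly.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;

// Hypothetical reporting class; hasFeature answers are the implementation's own claims.
class DomFeatureReport {

    // Feature names defined by the DOM Level 2 specifications.
    static final String[] MODULES = {
        "Core", "XML", "HTML", "Views", "StyleSheets", "CSS", "CSS2",
        "Events", "UIEvents", "MouseEvents", "MutationEvents", "HTMLEvents",
        "Traversal", "Range"
    };

    // Ask the implementation about each module at the given version, for example "2.0".
    static Map<String, Boolean> report(DOMImplementation impl, String version) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String module : MODULES) {
            result.put(module, impl.hasFeature(module, version));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        DOMImplementation impl =
            DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation();
        for (Map.Entry<String, Boolean> entry : report(impl, "2.0").entrySet()) {
            System.out.println(entry.getKey() + ": " + (entry.getValue() ? "X" : "-"));
        }
    }
}
```

Features such as schema validation, XInclude, and XSLT are outside DOM itself, so hasFeature cannot report on them.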


Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified July 27, 2002