XML News from Monday, June 28, 2004

XimpleWare has posted VTD-XML 0.5, a free (GPL) non-extractive Java library for processing XML. According to the announcement,

Capable of random-access, VTD-XML attempts to be both memory efficient and high performance. The starting point of this project is the observation that, for XML documents that don't declare entities in DTD, tokenization can indeed be done by only recording the starting offset and length of a token.

The core technology of VTD-XML is a binary format specification called Virtual Token Descriptor (VTD). A VTD record is a 64-bit integer that encodes the starting offset, length, type and nesting depth of a token in an XML document. Because VTD records don't contain actual token content, they work alongside the original XML document, which is maintained intact in memory by the processing model.
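To make the idea of a non-extractive token concrete, here is a minimal Java sketch of packing an offset, length, type and depth into a single 64-bit long and reading the token back out of the untouched document bytes. The field widths (32/20/4/8 bits), the type codes, and every name in the class are assumptions chosen for illustration; this is not XimpleWare's actual VTD layout or API.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical 64-bit token record, not the real VTD format.
final class VtdSketch {
    // Assumed field widths; they sum to 64 bits.
    static final int OFFSET_BITS = 32;
    static final int LENGTH_BITS = 20;
    static final int TYPE_BITS = 4;
    static final int DEPTH_BITS = 8;

    static final int LENGTH_SHIFT = OFFSET_BITS;              // 32
    static final int TYPE_SHIFT = LENGTH_SHIFT + LENGTH_BITS; // 52
    static final int DEPTH_SHIFT = TYPE_SHIFT + TYPE_BITS;    // 56

    // Hypothetical token type codes.
    static final int TYPE_ELEMENT_NAME = 0;
    static final int TYPE_TEXT = 1;

    // Pack the four fields into one long, rejecting values that do not fit.
    static long pack(int offset, int length, int type, int depth) {
        if (offset < 0
                || length < 0 || length >= (1 << LENGTH_BITS)
                || type < 0 || type >= (1 << TYPE_BITS)
                || depth < 0 || depth >= (1 << DEPTH_BITS)) {
            throw new IllegalArgumentException("field out of range");
        }
        return ((long) depth << DEPTH_SHIFT)
                | ((long) type << TYPE_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | (offset & 0xFFFFFFFFL);
    }

    static int offset(long rec) { return (int) (rec & 0xFFFFFFFFL); }
    static int length(long rec) { return (int) ((rec >>> LENGTH_SHIFT) & ((1 << LENGTH_BITS) - 1)); }
    static int type(long rec)   { return (int) ((rec >>> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)); }
    static int depth(long rec)  { return (int) ((rec >>> DEPTH_SHIFT) & ((1 << DEPTH_BITS) - 1)); }

    // The record holds no content; the text is read from the original bytes on demand.
    static String text(byte[] doc, long rec) {
        return new String(doc, offset(rec), length(rec), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b>hi</b></a>".getBytes(StandardCharsets.UTF_8);
        long name = pack(4, 1, TYPE_ELEMENT_NAME, 1); // the "b" in <b>
        long body = pack(6, 2, TYPE_TEXT, 2);         // the "hi" inside <b>
        System.out.println(text(doc, name) + " -> " + text(doc, body));
    }
}
```

Under these assumptions a token costs eight bytes regardless of how long its text is, which is where the memory savings over building String objects would come from. The fixed field widths are also the kind of thing that would impose limits on name length and document size.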

VTD's memory-conserving features can be summarized as follows:

Our benchmark indicates that VTD-XML processes XML at the performance level similar to (and often better than) SAX with NULL content handler. The memory usage is typically between 1.3x ~ 1.6x of the size of the document, with "1" being the document itself.

Other features included in this release are:

In the upcoming releases, we plan to add persistence support so that one can save/load VTD to/from the disk along with the XML documents to avoid repetitive parsing in read-only situations. XPath support is also on the development roadmap. However, we would like to collect as many suggestions and bug reports as possible before taking the next step.

The algorithms sound interesting. Unfortunately VTD-XML cannot process arbitrary XML, at least not yet. First off, it places some arbitrary limits on the size of qualified names and of the entire document, though this would seem to be fixable. The size of the qualified names could easily be run up to as much as is supported by a Java String, which is all competing APIs can claim. The limits on document size may be fundamental, but they are at least competitive with other in-memory APIs like DOM, though not with streaming APIs like SAX and StAX. Bigger problems include entity resolution, default attribute values, and attribute value normalization. VTD-XML does not support entity references other than the five predefined entities (&amp;, &lt;, etc.). The documentation doesn't discuss default attributes or attribute value normalization, but given the algorithms used these seem unlikely to be supported. More than once I've seen completing the last 10% of XML conformance demolish the speed that was so impressive in earlier, less complete betas. :-( It remains to be seen whether XimpleWare can extend their algorithms to support XML in all its complexity.


Ian E. Gorman has released GXParse 1.3, a free (LGPL) Java library that sits on top of a SAX parser and provides semi-random access to the XML document. The documentation isn't very clear, but as near as I can tell, it buffers various constructs like elements until their end is seen, rather than dumping pieces on you immediately like SAX does.
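The general technique of buffering on top of SAX can be sketched in a few lines of plain JAXP. The class below is a guess at the idea only, not GXParse's API: `BufferingHandler`, its `completed` list, and the `name=text` output format are all made up for illustration. It collects each element's character data and hands the element over only once its end tag arrives.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: buffers each element's direct text until its end tag is seen.
final class BufferingHandler extends DefaultHandler {
    // One text buffer per currently open element, innermost on top.
    private final Deque<StringBuilder> open = new ArrayDeque<>();
    // Elements delivered whole, in the order their end tags were seen.
    final List<String> completed = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks; append them all to the innermost element.
        if (!open.isEmpty()) {
            open.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the element complete, so only now is it reported.
        completed.add(qName + "=" + open.pop());
    }

    static List<String> parse(String xml) throws Exception {
        BufferingHandler handler = new BufferingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.completed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<a><b>hi</b><c>yo</c></a>"));
    }
}
```

The trade-off is the usual one: the client sees whole elements instead of fragments, at the cost of holding each open element's content in memory until it closes.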