XML News from Tuesday, October 18, 2005

XimpleWare has released VTD-XML 1.0, a free (GPL) non-extractive Java library for processing XML that supports XPath. This appears to be an example of what Sam Wilmot calls "in situ parsing". In other words, rather than creating objects representing the content of an XML document, VTD-XML just passes pointers into the actual, real XML. (These are the abstract pointers of your data structures textbook, not C-style addresses in memory. In this case the pointers are int indexes into the file.) You don't even need to hold the document in memory. It can remain on disk. This should improve speed and memory usage. Current tree models typically require at least three times the size of the actual document, and often more. Using a model based on indexes into one big array might allow these to reduce their requirements to twice the size of the original document or even less. VTD-XML claims 1.3 times, but I haven't verified that.
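
To make the "pointers into the real XML" idea concrete, here is a toy sketch in Java. It is not VTD-XML's API or its record layout; the class name InSituSketch and its methods are hypothetical, and the scanner is deliberately naive (it ignores comments, CDATA sections, and attributes). It only shows that a token can be an offset/length pair into the original bytes, with a String created only on demand.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of index-based tokens; not VTD-XML's actual API. */
final class InSituSketch {

    private static boolean isAsciiLetter(byte b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
    }

    private static boolean endsName(byte b) {
        return b == '>' || b == '/' || b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    /**
     * Records where each start-tag name begins and how long it is.
     * Each entry is {offset, length} into doc; no text is copied.
     */
    static List<int[]> elementNames(byte[] doc) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < doc.length - 1; i++) {
            // A '<' followed by a letter is treated as a start tag (toy rule).
            if (doc[i] == '<' && isAsciiLetter(doc[i + 1])) {
                int start = i + 1;
                int end = start;
                while (end < doc.length && !endsName(doc[end])) {
                    end++;
                }
                spans.add(new int[] {start, end - start});
                i = end - 1; // resume scanning right after the name
            }
        }
        return spans;
    }

    /** Materializes a String only when the caller actually asks for one. */
    static String text(byte[] doc, int[] span) {
        return new String(doc, span[0], span[1], StandardCharsets.UTF_8);
    }
}
```

For the input `<a><b x='1'/>hi</a>` this records two spans, one for `a` and one for `b`, and allocates nothing per element beyond two ints. A tree model would allocate at least one object and one string per element.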

However, VTD-XML currently "only supports built-in entity references(&quot; &apos; &gt; &lt;)." That means it's not an XML parser. Given this, comparisons to other parsers are unfair and misleading. I've seen many products that outperform real XML parsers by subsetting XML and cutting out the hard parts. It's often the last 10% that kills the performance. :-( The other question I have for anything claiming these speed gains is whether it correctly implements well-formedness testing, including the internal DTD subset. Will VTD-XML correctly report all malformed documents as malformed? Has it been tested against the W3C XML conformance test suite? I'm not sure.
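
For comparison, here is a small sketch of what a conforming parser has to handle. It uses the JAXP parser bundled with the JDK, not VTD-XML; the class name InternalSubsetDemo and its helper methods are made up for this illustration. An entity declared in the internal DTD subset must be expanded, and a reference to an undeclared entity in a document with no DTD must be reported as a well-formedness error.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo class; relies only on the JDK's standard JAXP/DOM API. */
final class InternalSubsetDemo {

    private static Document parse(String xml) throws SAXException {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Text content of the root element, with entity references expanded. */
    static String rootText(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (SAXException e) {
            throw new IllegalArgumentException("document is not well-formed", e);
        }
    }

    /** True if the parser accepts the document, false on a fatal parse error. */
    static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

Calling `rootText` on `<!DOCTYPE note [<!ENTITY c "Copyright 2005">]><note>&c; Example</note>` should give back the text with `&c;` replaced by its declared value, while `isWellFormed("<note>&nope;</note>")` should be false. A tool that recognizes only the built-in references can't get both of these right.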

Finally, even if everything works out once the holes are plugged, this seems like it would be slower than SAX/StAX for streaming use cases. VTD, like DOM, needs to read the entire document before it can work on any of it. SAX/StAX can begin processing the beginning of a document before most of the document has even arrived from the network. This isn't relevant to all use cases, but it's very relevant for many of the cases where speed is most critical and most problematic.
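
The streaming point can be checked with a rough experiment, sketched here with the JDK's bundled SAX parser. The names StreamingSketch, CountingStream, and sampleDocument are hypothetical. The idea is to wrap the input in a stream that counts how many bytes the parser has pulled so far, then note that count when the first startElement callback fires. The exact number depends on the parser's internal buffer size, so treat it as indicative only.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical experiment; uses only standard JDK SAX classes. */
final class StreamingSketch {

    /** Counts the bytes the parser has pulled from the underlying stream. */
    static final class CountingStream extends FilterInputStream {
        long count = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    /** Bytes consumed from the stream when the first element event fired. */
    static long bytesReadAtFirstElement(byte[] doc) {
        CountingStream in = new CountingStream(new ByteArrayInputStream(doc));
        long[] seen = {-1};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (seen[0] < 0) seen[0] = in.count; // remember the first event only
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    /** Builds a document with the given number of small item elements. */
    static byte[] sampleDocument(int items) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < items; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        return sb.append("</items>").toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

With a document of, say, 50,000 items, the count at the first event should be a small fraction of the document's length. A model that must index the whole document first cannot hand anything to the application until the last byte has arrived.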