XML News from Tuesday, September 23, 2003

CubeWerx has released CWXML, an open-source C-language library for parsing and generating XML, but really what they're interested in is BXML, yet another binary format that pollutes the XML brand. Guys, what you've created is not XML. Please don't call it that.

A report by Dr. Craig S. Bruce makes the usual batch of claims about the non-XML binary format being smaller and faster to parse than real XML. As usual, the data tested is far too small to support any real assertions about whether this is true. Unusually, they did include a large narrative document as one of their two test cases; most people offering systems like this ignore that kind of content completely. Still, even setting aside the very limited number of test cases (two), the numbers in their report don't support their claims. They show that gzipped XML is smaller than their binary format, and that once you gzip their format the size difference between the two is trivially small. They make bigger claims for speed, but the parse times they're working with are so small to begin with that they hardly matter. Cutting parsing from 0.044 seconds to 0.005 seconds may be an order of magnitude speedup, but it saves only about 39 milliseconds per document, hardly a significant gross savings.

As usual in this field, it appears the researchers have rigged the game by assuming a more homogeneous and thus more easily optimized world than XML actually supports. They only support Latin-1, and it looks probable that they skip some of the character checks an XML parser is required to make. They also make the common mistake of encoding numbers as binary for speed. It's possible to make this fast on one platform, but every optimization you perform for that platform exacts a comparable slowdown on other platforms with incompatible binary numeric formats.
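By way of illustration, here's a minimal sketch of what a consumer on a platform with a different native byte order has to do with every "pre-parsed" number it reads. This is my own illustration, not code from CWXML or the report, and it assumes the stream stores 32-bit integers little-endian:

    #include <stdint.h>

    /* Reassemble a 32-bit integer stored little-endian in the stream.
       A big-endian host cannot just copy the bytes into an int; it has
       to shift and mask each byte individually, so the "pre-parsed"
       binary number is no longer free on that platform. */
    static uint32_t read_le_uint32(const unsigned char *p)
    {
        return (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
    }

The same trade-off applies to floating-point values, where the mismatch can go beyond byte order to the representation itself.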

One thing I'm beginning to think is necessary in this field is a large collection of standard test cases that use the full panoply of XML: DTDs, entity references, CDATA sections, attributes, elements, narrative content, record-like data, recursive data, namespaces, multiple encodings, and more. This benchmark set would be useful not just for testing binary formats, but also for testing parser performance. All too often benchmarks in this field are based on just one or two documents. It's rare to see benchmarks that cover a broad and deep enough collection of documents to justify the conclusions drawn from them.
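For what it's worth, here's a rough sketch of the sort of harness such a corpus could feed. This is my own illustration, not an existing tool; it assumes Expat as the parser and simply reports wall-clock parse time per document, so no single kind of document dominates the result:

    #include <stdio.h>
    #include <time.h>
    #include <expat.h>

    /* Parse one file with Expat and return the wall-clock seconds spent.
       Returns a negative value on I/O or parse failure. */
    static double time_parse(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (f == NULL) return -1.0;

        XML_Parser p = XML_ParserCreate(NULL);
        struct timespec start, end;
        char buf[8192];
        size_t n;
        int ok = 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
            if (XML_Parse(p, buf, (int)n, 0) == XML_STATUS_ERROR) {
                ok = 0;
                break;
            }
        }
        if (ok) XML_Parse(p, buf, 0, 1);   /* signal end of document */
        clock_gettime(CLOCK_MONOTONIC, &end);

        XML_ParserFree(p);
        fclose(f);
        if (!ok) return -1.0;
        return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    }

    int main(int argc, char *argv[])
    {
        /* Each argument names one document from the benchmark corpus. */
        for (int i = 1; i < argc; i++)
            printf("%s: %.4f s\n", argv[i], time_parse(argv[i]));
        return 0;
    }

Reporting each document separately matters: an average over a corpus this varied would hide exactly the differences (narrative text versus record-like data, heavy entity use versus none) that the corpus exists to expose.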