Cafe con Leche News Thursday, April 13, 2006

Michael Kay has released version 8.7.1 of Saxon, his XSLT 2.0 and XQuery processor for Java and .NET. According to Kay, "This is a maintenance release that fixes known bugs and non-conformances; it also implements a few spec changes agreed by W3C since the Candidate Recommendation came out (for example the decision to put types such as xdt:dayTimeDuration into the XML Schema namespace - Saxon currently supports both the old and the new namespaces)."

Saxon is published in two versions, both of which require Java 1.4 or later (or .NET). Saxon 8.7B is an open source product published under the Mozilla Public License 1.0 that "implements the 'basic' conformance level for XSLT 2.0 and XQuery." Saxon 8.7SA is a £250.00 payware version that "allows stylesheets and queries to import an XML Schema, to validate input and output trees against a schema, and to select elements and attributes based on their schema-defined type. Saxon-SA also incorporates a free-standing XML Schema validator. In addition Saxon-SA incorporates some advanced extensions not available in the Saxon-B product. These include a try/catch capability for catching dynamic errors, improved error diagnostics, support for higher-order functions, and additional facilities in XQuery including support for grouping, advanced regular expression analysis, and formatting of dates and numbers."

Norm Walsh has published the fifth beta of DocBook 5.0. DocBook 5 is "a significant redesign that attempts to remain true to the spirit of DocBook." The schema is written in RELAX NG. A DTD and W3C XML Schema generated from the RELAX NG schema are also available. There's also a Schematron schema "that validates some extra-grammatical DocBook constraints. These patterns are also present directly in the RELAX NG Grammar and some validators, for example MSV, can perform both kinds of validation at the same time." This beta repairs the broken DTD from beta 4.

Norm Walsh has also posted the second candidate release of DocBook 4.5. Version 4.5 implements a minor bug-fix to citebiblioid and updates the reference documentation. As you may recall, I wrote Processing XML with Java in DocBook 4. I've been playing with DocBook 5 lately for a couple of possible future book projects. While it's clearly an improvement over DocBook 4 in numerous ways (for instance it uses namespaces, embeds SVG and MathML, and has reasonable XInclude support), the tool chain isn't up to snuff yet. The stylesheets and various editors like Oxygen haven't adapted to life in a DocBook 5 world yet. I'll probably continue to use DocBook 5 because I'm a bleeding edge sort of guy, but most users should stick to DocBook 4 for the time being.

Altsoft N.V. has released Xml2PDF 3.0, a $49 payware Windows program for converting XSL-FO, SVG, WordML, and XHTML documents into PDF files. New features in 3.0 include:

SVG Basic output
XSL-FO embedded in SVG
OpenType fonts with Type1 outlines and kernings
MathML as an external graphics format
Multipage floats
Extensions for absolutely positioned floats and tabulation support

This release should be faster and use less memory too.

XimpleWare has released VTD-XML 1.5, a free (GPL) non-extractive Java library for processing XML that supports XPath. This appears to be an example of what Sam Wilmot calls "in situ parsing". In other words, rather than creating objects representing the content of an XML document, VTD-XML just passes pointers into the actual, real XML. (These are the abstract pointers of your data structures textbook, not C-style addresses in memory. In this case the pointers are int indexes into the file.) You don't even need to hold the document in memory. It can remain on disk. This should improve speed and memory usage. Current tree models typically require at least 3 times the size of the actual document, and often more. Using a model based on indexes into one big array might allow these to reduce their requirements to twice the size of the original document or even less. VTD-XML claims 1.3 times, but I haven't verified that.
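
To make the in situ idea concrete, here is a minimal sketch in Java. It is not VTD-XML's API or its actual record format; the class InSituSketch, the Token type, and the indexElementNames method are hypothetical names invented for illustration. It scans a byte array once and records only an offset and a length for each start-tag name, so no String is created until a caller asks for one. It assumes ASCII element names and ignores comments, CDATA sections, processing instructions, and DTDs, so it is nowhere near a real parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class InSituSketch {

    /** Hypothetical token: two ints pointing into the original buffer, no copied text. */
    static final class Token {
        final int offset;
        final int length;

        Token(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Records where each start-tag (or empty-element tag) name sits in the buffer. */
    static List<Token> indexElementNames(byte[] doc) {
        List<Token> tokens = new ArrayList<>();
        for (int i = 0; i < doc.length - 1; i++) {
            // A '<' followed by a name-start character opens a start-tag.
            // End-tags ("</"), comments, and PIs fail the name-start test.
            if (doc[i] == '<' && isNameStart(doc[i + 1])) {
                int start = i + 1;
                int end = start;
                while (end < doc.length && isNameChar(doc[end])) {
                    end++;
                }
                tokens.add(new Token(start, end - start));
                i = end - 1; // resume scanning just after the name
            }
        }
        return tokens;
    }

    // Simplifying assumption: ASCII-only names, not the full XML name rules.
    static boolean isNameStart(byte b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || b == '_';
    }

    static boolean isNameChar(byte b) {
        return isNameStart(b) || (b >= '0' && b <= '9') || b == '-' || b == '.' || b == ':';
    }

    /** Materializes a String only on demand, reading straight from the buffer. */
    static String text(byte[] doc, Token t) {
        return new String(doc, t.offset, t.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b>hi</b><c/></a>".getBytes(StandardCharsets.UTF_8);
        for (Token t : indexElementNames(doc)) {
            System.out.println(text(doc, t) + " at offset " + t.offset);
        }
    }
}
```

The memory argument falls out of the representation: each token costs two ints no matter how long the name is, and a real implementation could pack both into one int or long and keep them in a primitive array rather than a list of objects.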

However, VTD-XML currently only supports the built-in entity references (&quot; &amp; &apos; &gt; &lt;). There are some other limits. Element names are limited to 2048 characters. Documents can't be much bigger than a billion characters, so SAX (or XOM) is still needed for really huge documents. There's also a maximum depth to the document, though exactly what it is isn't specified. All this means VTD-XML is not a conformant XML parser. Given this, comparisons to other parsers are unfair and misleading. I've seen many products that outperform real XML parsers by subsetting XML and cutting out the hard parts. It's often the last 10% that kills the performance. :-( The other question I have for anything claiming these speed gains is whether it correctly implements well-formedness testing, including the internal DTD subset. Will VTD-XML correctly report all malformed documents as malformed? Has it been tested against the W3C XML conformance test suite? I'm not sure.
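
For a sense of what such a test involves, here is a small sketch of a well-formedness check using the standard JAXP SAX parser that ships with Java. It does not touch VTD-XML at all; the class name WellFormedCheck, the method isWellFormed, and the three sample documents are made up for illustration and are not taken from the W3C test suite. The idea is that a conformant parser must accept an entity declared in the internal DTD subset and must reject both a reference to an undeclared entity and a mismatched tag.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Returns true if the JAXP SAX parser reads the document without a fatal error. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors, which is how
            // well-formedness violations surface here.
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                                         new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser reported the document as malformed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample documents, not drawn from any test suite.
        String declared = "<!DOCTYPE r [<!ENTITY e \"x\">]><r>&e;</r>";
        String undeclared = "<r>&nope;</r>";
        String mismatched = "<r><a></r>";

        System.out.println("declared entity:   " + isWellFormed(declared));
        System.out.println("undeclared entity: " + isWellFormed(undeclared));
        System.out.println("mismatched tags:   " + isWellFormed(mismatched));
    }
}
```

A parser that only knows the five built-in entity references cannot get the first case right, which is why entity support and conformance are really the same question.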
