XML News from Tuesday, August 2, 2005

I'm at the Extreme Markup Languages conference in Montreal this week. As the mood strikes me, I may update this site in real time. However, it will be a little slow going at first as I'm giving one of the first talks this morning, and I just found a bug in the software I'm announcing and demoing here, and have just over two hours in which to fix it. Plus I realize I left the most recent version of my notes sitting on my desktop at home, and have to recreate the recent changes. :-) My camera is acting up. Unless I can figure out how to fix it, I may not be posting any photos from this year's show.

Is overlap really a question of multiple trees and diffs between them?

My talk is over now. There were a lot of good suggestions during the talk, and I may be busy for a while trying to implement some of the ideas. What I talked about was a small program that obscures XML by randomizing its content and optionally its names while preserving its structure. This enables documents to be submitted to tool maintainers to reproduce bugs without exposing private information. It's currently in very rough shape, just raw source, not even a zip file. I need to improve that now that it's been officially announced. In the meantime, if you're curious, probably the best way to get started is by reading the conference paper. It's nice being the first talk at the conference. Now I can give my full concentration to listening to other people, without being constantly distracted by thinking about what I'm going to say. (Last year I basically ignored a talk I really should have heard on GXParse because it happened to fall right before my own XOM talk.)
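To make the idea concrete, here is a minimal sketch of the general technique, not the program from the talk. It uses the JDK's built-in DOM rather than the tool's real code, and the class and method names (ScrambleSketch, scramble, scrambleText) are made up for illustration. It swaps letters for random letters and digits for random digits in text nodes only, so element nesting, text lengths, and punctuation survive; attribute values and names are left alone in this sketch.

```java
import java.io.StringReader;
import java.util.Random;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not the obfuscator announced at the conference.
public class ScrambleSketch {

    // Replace letters with random lowercase letters and digits with random digits.
    // Whitespace and punctuation are copied through, so the length is unchanged.
    static String scramble(String s, Random rnd) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                sb.append((char) ('0' + rnd.nextInt(10)));
            } else if (Character.isLetter(c)) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Walk the tree recursively. Only text nodes are rewritten,
    // so the element structure is untouched.
    static void scrambleText(Node node, Random rnd) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(scramble(node.getNodeValue(), rnd));
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            scrambleText(c, rnd);
        }
    }

    // Convenience parser for the example; uses default factory settings.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

Calling scrambleText(doc, new Random()) on a parsed document leaves something like 123-45-6789 looking like a different nine-digit number with the hyphens in the same places, which is usually enough to reproduce a structural bug without revealing the data.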

You know the saying "It steam engines when it's steam engine time"? Sometimes at these conferences you can hear the whistle of the oncoming steam engine a little early. Of course, sometimes the steam engine derails on the way (Schemas); and sometimes it always seems to be right around the corner (XQuery, RDF, Topic Maps, Semantic Web). But sometimes it really does arrive on schedule (XML, UML, Java, HTML, HTTP, REST, RELAX NG). I think I'm picking up the sound of the next train. I've heard it in at least three different places just today, and I don't think these people are working together yet; but they're all heading toward the same station.

Right now I'm listening to Kristoffer H. Rose talk about the Data Format Description Language DFDL (pronounced "Daffodil"). This is a way of mapping from standard binary formats like JPEG, COBOL copybooks, and C code into XML. What jumps out at me about this is he specifically does not want to convert this to an XML document. He just wants to expose the data through an XML interface. And I'm hearing something similar on a lot of fronts right now, including some of my own work in XOM.

The point is that the sheer cost of converting all the data to Strings (and other objects) is starting to limit parser performance. To some extent, this is nothing new. SAX quite deliberately does not pass a String to the characters() method. Instead it passes a char[] array along with a start offset and a length. This allows the parser to keep passing the same array to the method and simply update the offset and length. However, SAX, StAX, and similar APIs still create a lot of strings: for each element and attribute name, for example. Tree models like DOM and XOM are even more profligate with object creation. Good parsers like Xerces reuse the same strings; but if you've ever profiled deeply into an XML application you're likely to see a lot of time spent in String creation regardless. What's really annoying about this is that most of the time you don't need most of those strings. A typical application only uses a small subset of the strings (and other objects) an XML parser creates. The I/O cost of moving all this data around can also be significant.
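Here is a small illustration of the characters() point, using the SAX parser that ships with the JDK. The class name CharCounter and the digit-counting task are invented for the example. The handler reads directly out of the parser's buffer and never builds a String from the text content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: count digit characters in text content
// without allocating a String per text chunk.
public class CharCounter extends DefaultHandler {
    private int digits = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this callback.
        // The rest of the array is the parser's business and may be reused.
        for (int i = start; i < start + length; i++) {
            if (Character.isDigit(ch[i])) {
                digits++;
            }
        }
    }

    public static int countDigits(String xml) throws Exception {
        CharCounter handler = new CharCounter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.digits;
    }

    public static void main(String[] args) throws Exception {
        // Attribute values do not go through characters(), so the 7 is not counted.
        System.out.println(countDigits("<order id='7'><qty>12</qty><sku>A-34</sku></order>")); // prints 4
    }
}
```

Note that the element and attribute names still reach startElement() as String objects, which is exactly the remaining cost described above.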

A lot of developers here seem to be converging on the same solution from different directions. The destination is what one of the posters downstairs calls "in situ parsing". In other words, rather than creating objects representing the content of an XML document, just pass pointers into the actual, real XML. In some cases you wouldn't even need to hold the document in memory. It could remain on disk. This won't work with traditional APIs like SAX and DOM. However, it might be important enough to justify a new API. Many, though not all, use cases could see an order of magnitude speed-up or better from such an approach. Memory usage could improve too. Current tree models typically require at least 3 times the size of the actual document, and often more. Using a model based on indexes into one big array might allow these to reduce their requirements to twice the size of the original document or even less. Finally, this approach would make retrieving the actual original text of the document feasible, so you could finally tell whether a document used &amp; or &#38;. Most programs don't need this ability, but it would be very useful for XML editors and other programs that want to do better round-tripping.
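A toy version of the "indexes into one big array" idea might look like the following. This is only a sketch under heavy assumptions: the class name InSituIndex is invented, it is not any of the APIs discussed at the conference, and the scanner handles only a simplified subset of XML (no comments, CDATA sections, or processing instructions, and no > inside attribute values). It records each text run as a pair of int offsets into the original character array and builds a String only when someone asks for one.

```java
import java.util.Arrays;

// Hypothetical sketch of in situ indexing over a simplified XML subset.
public class InSituIndex {
    private final char[] doc;          // the original document text, indexed in place
    private int[] spans = new int[16]; // (start, end) offset pairs, one pair per text run
    private int count = 0;             // number of text runs recorded

    public InSituIndex(String xml) {
        this.doc = xml.toCharArray();
        int i = 0;
        while (i < doc.length) {
            if (doc[i] == '<') {
                // Skip over the tag, up to and including the closing '>'.
                while (i < doc.length && doc[i] != '>') {
                    i++;
                }
                i++;
            } else {
                // Record where this text run starts and ends; copy nothing.
                int start = i;
                while (i < doc.length && doc[i] != '<') {
                    i++;
                }
                add(start, i);
            }
        }
    }

    private void add(int start, int end) {
        if (2 * count + 2 > spans.length) {
            spans = Arrays.copyOf(spans, spans.length * 2);
        }
        spans[2 * count] = start;
        spans[2 * count + 1] = end;
        count++;
    }

    public int size() {
        return count;
    }

    // Materialize the n-th text run exactly as it appeared in the source,
    // escapes and all. This is the only place a String gets created.
    public String rawText(int n) {
        int start = spans[2 * n];
        int end = spans[2 * n + 1];
        return new String(doc, start, end - start);
    }
}
```

Because the offsets point at the source text itself, rawText() returns whichever spelling of an ampersand escape the author actually typed, which is the round-tripping benefit mentioned above. The cost per text run is two ints rather than a String object plus its own copy of the characters.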

Jon Bosak is doing a last-minute fill-in presentation on UBL. Basically this is a plan to convert a lot of paper documents to XML forms by defining a "royalty-free library of standard business documents." I don't buy it. Bosak says, "You trade off all the wonderful ways you were customizing things, and you do without it." Customizations are not a mistake. They are necessary functions of doing business. I don't believe in making all businesses work alike just to standardize a few forms and take more humans out of the loop. (Plus it's not RESTful, but Bosak thinks that's fixable.)