XML News from Wednesday, August 3, 2005

Day two of Extreme commences. Simon St. Laurent and Roger Sperberg are also reporting from the show, and both paid more attention yesterday to what people were actually saying than I did. I confess yesterday's sessions on OWL, RDF, Topic Maps, and UBL pretty much put me to sleep. Plus, these days I'm a morning person. I'm ready to go by 6:00 A.M. and anything after lunch is a stretch. (Welcome to middle age.) It also didn't help that two of the talks I particularly wanted to hear yesterday were cancelled. At least the DFDL replacement session was interesting. I was a little too tired to get full value out of it, but it does sound worth exploring more in the future. However, there are lots of good talks to look forward to today, starting with two on XSLT and a talk from Walter Perry, one of the most iconoclastic thinkers in the XML space. He's so diametrically opposed to the conventional wisdom that most people can't even hear what he's saying. It's like trying to explain atheism to an eighth-grade class in a Texas Christian school. Atheist: "I don't believe in God." Class: "You worship Satan?!" Atheist: "No, I don't believe in any gods." Class: "But that means you believe in the devil." Atheist: "No, I don't believe in the devil either." Class: "But you just said you don't believe in God." Except in Walter's case it's schemas, DTDs, and preexisting agreements he doesn't believe in instead of God and the Devil. (Disclaimer: This is just a metaphor. I have no idea what Walter's religious beliefs are.)

Ken Holman is talking about synthesizing XSLT based on his experience with the UBL stylesheets. 25 stylesheets is too many to write by hand. Instead he annotates a literal result (a hand-authored instance of XSL-FO) and generates the XSLT from that. This was important because in UBL he needs to match the printed formatting very precisely. He annotates with namespaced attributes that an XSL-FO processor will ignore. RELAX NG helped him because it could validate only the annotations and ignore everything else. This seems like a very powerful idea. I don't fully understand his syntax yet, but it doesn't look too complex.

I think I've figured out what was wrong with the camera. The autofocus only works indoors if the flash is turned on. Here's a picture of Ken presenting:

Ken Holman talking about ResultXSLT at Extreme 2005

If only I'd remembered to install iPhoto or Photoshop Elements on the PowerBook before leaving New York.

Matthijs Breebaartis from Holland is talking about "Processing references to documents you don’t have access to: Constructing identifiers with Relax NG and XSLT". The problem is that a lot of information is organized into "vendor silos", and they want to be able to break the silos and show the users what they need from across many different silos without redundancy. They need to link all this stuff but they don't control it. Different publishers have different URLs for the same things. (So much for the "Uniform" part.) They tried to get everyone in one room and agree on basic concepts. He prefers meaningful identifiers to opaque IDs. They wrote everything in RELAX NG and used Trang to translate to W3C XML Schemas to satisfy company policies. Their element names are all in Dutch, but appear to be restricted to ASCII. They use Python.

Matthijs Breebaartis talks at Extreme 2005 (also seen, Eric van der Vlist, G. Ken Holman, and Elliotte's PowerBook)

Why don't the raw XML forms of the papers published on the Extreme web site have xml-stylesheet processing instructions? e.g. this one.

Ann Wrightson is talking about "Semantics of well-formed XML as a human and machine readable language."

Ann M Wrightson talking at Extreme Markup Languages 2005

The official title of Walter Perry's talk is "Indexing The Whole As Well As The Parts: Derived Schemas and Imputed Hierarchies in Document Management." He starts with a quote from Peter Murray-Rust about how the CML DTD must be flexible because we don't understand chemistry:

With CML (unlikely though it may seem) we have to have an extremely fluid DTD. That is because we don't understand chemistry. It was put well by Democritos "Nothing exists except atoms and empty space - all else is opinion". The Chemical Bond is simply an opinion and people fight about it just as much as over XML matters. So CML is increasingly becoming very sparse (atoms, bond and electrons, with a bit of geometry). That allows authors free expression.

Walter thinks we don't really understand documents, schematization, or much else; and hence need more flexibility. Schemata are effectively structural. They are interdocument in scope. They constrain lexical possibility. By contrast instance documents are hyperstructural. Schemas operate on the internal context of a document. External context includes hyperlinks, key/value indexing, processes that can consume a document, and processes that might produce a document. External contexts are significant for document search and query.

It is the process that should decide what kind of documents it can consume; not the document that decides what it can be consumed by; i.e. the document cannot specify its own type.

Indexing documents in semantic value spaces identifies bonds.

Walter Perry at Extreme Markup Languages 2005

The afternoon sessions begin with several talks about XQuery. The first is Daniela Florescu from Oracle with "Declarative XML processing with XQuery: reevaluating the big picture." She thinks we need more architectural work and less syntax sniping. She thinks we don't have a clear idea of the "final goal" of XML, and that we need one. "There is one problem: everybody likes it for different reasons." I disagree on that last point. The beauty of XML is that it solves so many goals so well, including goals no one has thought of. We don't all need to do the same thing.

She's got some interesting things to say; but she's going way too fast for me to get it all down. Her database colleagues don't believe in mixed content. (No surprises there.) Entity relationships (E/R) don't work with mixed content. "XML is the only tractable abstract information model that is not E/R based." She brings up LISP. "30-year malaise in IT infrastructure" as a result of schema dependence. XML is the first to allow instance documents to be created in advance of a schema. Schemas differ from community to community. Agreeing on a schema is the most expensive step.

Daniela Florescu talks about XQuery at Extreme

The power of //*; i.e. the ability to query something without knowing where it is or what it's called. (cf. SQL). "XQuery is not a query language. In my opinion it was a very bad name." The difference between XQuery and SQL is that SQL works on a table and XQuery on a tree. (Good point, nicely stated.)
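To make the //* point concrete, here is a minimal sketch in Python using the standard library's ElementTree, which supports a small subset of XPath. The sample document and its element names are made up for illustration; nothing here comes from the talk itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for this sketch.
doc = ET.fromstring(
    "<order>"
    "<customer><name>Ann</name></customer>"
    "<item><name>Widget</name><price>5</price></item>"
    "</order>"
)

# ".//*" selects every descendant element, whatever it is called and however
# deeply it is nested. It is evaluated relative to the root element here, so
# the root <order> itself is not part of the result.
all_tags = [el.tag for el in doc.findall(".//*")]

# ".//name" finds every <name> element without knowing what its parent is.
names = [el.text for el in doc.findall(".//name")]

print(all_tags)
print(names)
```

The query never mentions customer or item, which is the point: the same expression keeps working if the names move elsewhere in the tree.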

XML/XQuery doesn't fit anywhere into the current architecture without paying a large price. Architecture needs to change or XML will fail. XQuery data model must be a first class citizen. Must make XML a graph not a tree. (I'm just taking notes here. I disagree with quite a lot though not all of this.) She wants to deprecate document nodes.

An E/R model is cyclic. No standard way to support this in XML. Only hack solutions. No global and standard solutions. We need native references in XML. This would improve integration between XML and RDF. (In Q&A it comes out that we have them in XML. That's what ID and IDREFS are. It's XQuery that's lacking here.)
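As a rough illustration of the Q&A point, here is a sketch in Python of resolving IDREFS-style references, including a cycle. The document, the attribute names id and refs, and the helper resolve are all hypothetical. ElementTree does not read DTDs, so treating those attributes as ID and IDREFS is an assumption this script makes, not something the parser knows.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a cyclic relationship: p1 refers to p2,
# and p2 refers back to p1 (and to p3).
doc = ET.fromstring(
    '<people>'
    '<person id="p1" refs="p2"><name>Alice</name></person>'
    '<person id="p2" refs="p1 p3"><name>Bob</name></person>'
    '<person id="p3" refs=""><name>Carol</name></person>'
    '</people>'
)

# Index every element that carries an "id" attribute (assumed to be of type ID).
by_id = {el.get("id"): el for el in doc.iter() if el.get("id") is not None}

def resolve(element):
    """Return the elements named by the whitespace-separated "refs" value."""
    return [by_id[ref] for ref in element.get("refs", "").split()]

for person in doc.findall("person"):
    targets = [t.findtext("name") for t in resolve(person)]
    print(person.findtext("name"), "->", targets)
```

The tree stays a tree on disk; the graph only appears once something follows the references, which is the part the query language has to supply.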

She wants to deprecate xsi:nil. She wants to embed code behavior into schemas! She wants assertions (preconditions and postconditions) in schemas.

She wants continuous queries over infinite sequences in XQuery. (That might be useful.)

"XSLT is easier when the shape of the data is unknown. XQuery is easier when the shape of the data is known." (Another good point, nicely put.) Web services and XQuery don't work together closely enough. We need to make XQuery a full programming language. It's Turing complete, but inconvenient for programmers. Writing the code in Java kills the advantages of XML. She wants updates, variable assignment, error handling, and deterministic evaluation order added to XQuery.

"The industry needs to outgrow the 'XML is a syntax' myth." She's making so many points and claims so fast that she has little to no time to justify any of them. There's probably a day's worth of material she's trying to cram into 45 minutes. I see virtually no chance of her getting all (or even any) of the changes she wants, and that's probably a good thing.

Next, Jonathan Robie describes "XQuery Update Facility: Setting Up the Problem" (cowritten with Daniela Florescu). He's talking about one subbullet of one bullet on one of Daniela's slides in the last talk. "We're still not certain what we need." They aren't even sure about the use cases, much less implementations and strategies. XQuery updates are "mostly ACID." Isolation is the part that makes it only mostly.

IBM's Achille Fokoue is talking about "Extracting input/output dependencies from XSLT 2.0 and XQuery 1.0." This involves mappings between schemas (input and output formats) as defined by XSLT and XQuery.

The accuracy of the mappings is a trade-off with the cost of creating the mappings. It can have exponential behavior if you aren't careful. XSLT is too tricky to handle due to recursion. XQuery is easier.

The final talk of the day is C. Michael Sperberg-McQueen on "Applications of Brzozowski derivatives to XML Schema processing." First a brief lesson in Polish on how to pronounce "Brzozowski."

A Brzozowski derivative is:

the derivative of R with respect to s is the set of strings t which can follow s in sentences of R, or: the set of strings t such that the concatenation of s and t is a sentence in R.

Regular sets of strings can, of course, be denoted by regular expressions, and Brzozowski's contribution was to show how, given (1) a regular expression E denoting the language R and (2) a string s, to calculate a regular expression D denoting the derivative of R with respect to s. He also proved (3) that of all the derivatives of an expression, only a finite number would be distinct from each other in terms of recognizing different languages, and (4) that even if equal expressions are not always detected, there will still be only a finite number of dissimilar derivatives, if certain simple tests of similarity are performed; he then showed (5) how to construct a finite-state automaton from the set of characteristic derivatives thus identified.

This talk isn't quite as fast paced as Florescu's, but it too could really use three hours and a blackboard instead of 45 minutes and slides. It's quite mathematical. The notation he's using is unfamiliar to me. He explains it, but following this is going to be tough.

Brzozowski derivatives allow you to avoid building the finite state automaton when evaluating regular expressions. They can also handle non-deterministic regular expressions very simply. This has important implications for validation. Matching an input against a regular expression reduces to the question of whether the expression's derivative with respect to that input is nullable, that is, whether it accepts the empty string.
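Here is a minimal sketch in Python of matching by derivatives, to make the idea less abstract. It is not the notation or the algorithm from the talk: the tuple representation, the function names, and the sample content models are all my own assumptions. Symbols are element names rather than characters, since content models are the case of interest for schema validation.

```python
# Regular expressions as nested tuples (a representation chosen for this sketch).
EMPTY = ("empty",)   # matches nothing at all
EPS = ("eps",)       # matches only the empty sequence

def sym(name):
    return ("sym", name)

def seq(a, b):
    # Simplify as we build, so derivatives stay small.
    if a == EMPTY or b == EMPTY:
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("seq", a, b)

def alt(a, b):
    if a == EMPTY:
        return b
    if b == EMPTY or a == b:
        return a
    return ("alt", a, b)

def star(a):
    return EPS if a in (EMPTY, EPS) else ("star", a)

def nullable(r):
    """True if r accepts the empty sequence."""
    kind = r[0]
    if kind in ("eps", "star"):
        return True
    if kind == "seq":
        return nullable(r[1]) and nullable(r[2])
    if kind == "alt":
        return nullable(r[1]) or nullable(r[2])
    return False  # "empty" and "sym"

def derive(r, s):
    """Expression for what may follow the single symbol s in sentences of r."""
    kind = r[0]
    if kind == "sym":
        return EPS if r[1] == s else EMPTY
    if kind == "alt":
        return alt(derive(r[1], s), derive(r[2], s))
    if kind == "seq":
        first = seq(derive(r[1], s), r[2])
        return alt(first, derive(r[2], s)) if nullable(r[1]) else first
    if kind == "star":
        return seq(derive(r[1], s), r)
    return EMPTY  # "empty" and "eps"

def matches(r, symbols):
    # Take the derivative one symbol at a time; no automaton is ever built.
    for s in symbols:
        r = derive(r, s)
    return nullable(r)

# Hypothetical content model: title, (para | list)*
model = seq(sym("title"), star(alt(sym("para"), sym("list"))))
# A non-deterministic model: (a, b) | (a, c)
ambiguous = alt(seq(sym("a"), sym("b")), seq(sym("a"), sym("c")))

print(matches(model, ["title", "para", "list"]))
print(matches(ambiguous, ["a", "c"]))
```

The non-deterministic model needs no special handling: after seeing a, the derivative is simply (b | c), and the next symbol picks a branch.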

Empty sequences and empty choices (sets) are legal in the W3C XML schema language. But many think restrictions on xsd:all groups are too onerous. (I agree.)