XML News from Monday, April 19, 2004

I've arrived at XML Europe. I'm reporting this conference in chronological order (earliest item on top) as opposed to my usual archaeological order (most recent item on top), so if you're coming back to read this, scroll to the bottom to see if I've added anything.


Quick head count at the first keynote shows about 120 people here. I'd love to provide live updates from the conference, but the wireless network seems to be password protected. :-( Also the keynote hall is notably lacking in power plugs. On the other hand the chair in front of you can be folded down to form a very nice desk.


The first keynote is about Amazon Web Services by Amazon's Jeff Barr. I've been meaning to use this for some time to finally update the books pages here on Cafe con Leche and Cafe au Lait, but time is limited as always. He actually defines web services as any programmatic (as opposed to human) access to a web server, so it includes REST approaches as well as SOAP. The big difference he sees (I'm not sure this is accurate) is that REST is preferred by weakly typed scripters (Python, XSLT) whereas SOAP is preferred by strongly typed programmers (Java, C). Interesting statistic from this talk: Amazon provides both SOAP and REST interfaces to their data. About 80% of the calls come through REST, 20% through SOAP. He expected the opposite. BEEP, WSDL, etc. seem unnecessary for aggregation of web services. The developers he sees are doing just fine without them.
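For what it's worth, the REST style he's describing really is this simple. Here's a minimal sketch in Java (my own illustration with a made-up endpoint and token, not Amazon's actual URL format): the whole request goes in the URL, and what comes back is a plain XML document you can hand to any parser or stylesheet.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RestFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical REST-style query: the whole request is a URL,
        // and the response is an ordinary XML document.
        URL url = new URL(
            "http://example.com/catalog/xml?KeywordSearch=xml&mode=books&token=DEVTOKEN");
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // raw XML; feed it to SAX, DOM, or XSLT
        }
        in.close();
    }
}
```

No stubs, no WSDL, no envelope. That's presumably exactly why the scripting crowd prefers it.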


Next up is Steven Pemberton of the CWI, chair of the W3C HTML and Forms working groups. He's talking about notations in a generic sense, not specifically XML NOTATION type attributes. Examples include two-letter U.S. state abbreviations such as NY and FL. He suggests a better algorithm for generating these, but I don't think it would actually work. I see several possible conflicts. As he says, "I'm English. I just live in Holland." He recommends reading "The Goldilocks Theories" in Tog on Interface.

People writing with WYSIWYG editors produce higher quality text than people typing in text editors (he says, as I type this in BBEdit). Pen and paper is higher quality still. Very interesting picture that demonstrates that if you buy a new computer every 18 months or more, your current computer is more powerful than the sum of all the computers you have owned previously. "The only thing my computer has all those extra cycles for is to make it act more like a television...so why are we devising notations to make life easier for the computer?" I suggest that we're not so much making it easier for computers as for programmers. Software development, programmer skill, algorithms, etc. don't follow Moore's law. Hmm, seems Pemberton agrees with me. 90% of the cost of developing software is debugging, according to the DoD. A program that's 10 times longer is 31 times harder to write, according to Brooks of Mythical Man-Month fame (which sounds like effort growing as length to the 1.5 power: 10^1.5 ≈ 31.6). Therefore we should design programming languages to make life easier for the programmer rather than the computer. This was the goal of ABC, Python, etc. What is Lambert Meertens working on? An order of magnitude improvement on Python/ABC?

He's complaining about the difficulty of authoring XML (and XHTML), but he's exaggerating the problem by assuming validity, the XML declaration, namespaces, etc. are required. I think he's also overestimating the ease of writing unmarked-up text that can be processed by a computer. I don't think computers are really going to be able to parse real unmarked-up text until and unless we have real AI. I think it's easier to write explicitly marked up text than implicitly marked up text.


Chris Lilley of the W3C is talking about Architectural Principles of the World Wide Web. This is the first breakout session. Good crowd, about 50 people. According to Lilley, the TAG is only responsible for documenting the web architecture as it exists, not designing an architecture.

Principle 1: orthogonality of specifications is good. I agree. XML is harmed by its excessive reliance on Unicode and URLs. Big digression in the audience over why "orthogonal" is or is not the right word for this principle, but everyone agrees with the principle itself.

Principle 2: "Silent recovery from error is harmful." Someone in the audience claims that Opera error-corrects XML. There's some disagreement in the audience with this principle.

Principle 3: URIs (as redefined in RFC 2396bis). Open question whether or not IRIs can only be written using Unicode Normalization Form C. Check the spec.

Principle 4: URIs are compared character by character.
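To make that concrete (my example, not Lilley's): character-by-character comparison means two spellings that dereference to the same resource are still different URIs.

```java
public class CompareURIs {
    public static void main(String[] args) {
        // Percent-encoded and literal forms of the same address:
        String a = "http://example.com/%7Euser";
        String b = "http://example.com/~user";
        // Compared character by character, they don't match,
        // even though a server treats them identically.
        System.out.println(a.equals(b)); // false
    }
}
```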

Principle 5: Avoid unnecessary new URI schemes. "Making up stupid things like itunes that are exactly the same as http except they mean use my software instead of a web browser is a bad idea." Ditto for subscribe in RSS.

Principle 6: "User agents must not silently ignore authoritative server metadata."

Principle 7: Safe interactions. GET is safe (it does not incur obligations); POST may not be. The big issue with GET is character encoding in query strings. This breaks search engines in countries with less ASCII-like character sets.
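A quick sketch of the encoding problem (my illustration, not Lilley's): the same search term yields different query strings depending on which character encoding the client picks, and nothing in the URI says which one was used.

```java
import java.net.URLEncoder;

public class QueryEncoding {
    public static void main(String[] args) throws Exception {
        // The same word, encoded two different ways:
        System.out.println(URLEncoder.encode("café", "UTF-8"));      // caf%C3%A9
        System.out.println(URLEncoder.encode("café", "ISO-8859-1")); // caf%E9
        // A server or search engine receiving the query string
        // has to guess which encoding the client used.
    }
}
```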

Principle 8: Text vs. binary formats. Lilley likes text. A TAG finding summarizes the issue.

Principle 9: Extensibility and Versioning. Extensibility must be designed in. Must understand vs. must ignore.

Principle 10: Separate content, presentation, and interaction. Question from audience: "Isn't there someone from Microsoft on the Working Group?"

Principle 11: XML and Hypertext. Allow web-wide linking. Use URIs instead of IDREFs.

Principle 12: XML ID semantics.


Paul Prescod is talking about "Take REST: An Analysis of Two REST APIs". He's referring to Amazon and Atom. I'm not sure I like the title. I suppose these are interfaces, and can be used as interfaces to application programs, but they are not APIs in the traditional sense. They're simply a presentation of data as XML documents at particular URLs. Hmm, seems he may have thought the same thing. That title was from the show program, but on the slides it's morphed into "Take REST: A Tale of Two Service Interfaces".

Prescod prefers "data-centric interfaces" to "service-oriented interfaces". "XML is the solution to the problem, not the problem." Don't hide the XML! The big problem with Amazon's interfaces is embedding authentication info in the URIs. However, this does work better with XSLT. RPC is too fragile (not extensible) for wire protocols. Example: fixed-length argument lists.
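To unpack the fragility point (my sketch, not Prescod's): a fixed-length RPC signature breaks every existing caller the moment you add an argument, whereas a consumer that reads the elements it knows and ignores the rest keeps working when the document format grows.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MustIgnore {
    public static void main(String[] args) throws Exception {
        // Version 2 of the document adds <currency>; a version 1
        // consumer that skips unknown elements is unaffected.
        String doc = "<order><item>book</item><price>10</price>"
                   + "<currency>EUR</currency></order>";
        Document d = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(doc)));
        NodeList children = d.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String name = child.getNodeName();
            if (name.equals("item") || name.equals("price")) {
                System.out.println(name + ": " + child.getTextContent());
            }
            // Anything else is silently skipped, not a fatal error.
        }
    }
}
```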


Cool sighting of the day: Linux running on a dual-boot iPod.


Michael Kay, author of the popular Saxon open source XSLT processor, is talking about "XSLT and XPath Optimization (in Saxon)". There's a large crowd, more than 60 people in a small room. "Saxon is an engineering project, not a research project." He does not have a good performance test suite or reproducible measurements. His technique is mostly based on incrementally optimizing badly performing stylesheets. If he had been a reviewer of his own paper, he would have complained about this. Runtime optimizations can use knowledge of the input data. Compile-time optimizations avoid the cost of repeated optimization. The differences between XSLT 1 and XSLT 2 aren't that radical from the standpoint of optimization. Most optimizations in Saxon 7 could have been applied to Saxon 6 if he hadn't abandoned it. Some techniques are more effective in XSLT 2 due to strong typing, but even 1.0 processors can deduce type information. Namespace prefixes defined at runtime (often via variables) are a major pain. Saxon does more optimization on XPath expressions than on XSLT instructions.


Jonathan Robie of DataDirect is talking about "SQL/XML, XQuery, and Native XML Programming." Robie expects a second last call working draft of XQuery because of the significant changes still being made. "It is anticipated" that there will be support for the SQL/XML XML data type in JDBC 4.0. There should be a public draft of a Java API for XQuery soon.


IBM's Elena Litani, a major contributor to Xerces-Java, is talking about "An API to Query XML Schema Components and the PSVI," with about 20 people attending. The API she's describing is implemented in Xerces and has been submitted as a member proposal to the W3C. (I don't remember seeing this there. It may be members only. If the wireless network were working, I could check.)

They wanted a platform and language independent API, defined in IDL. Didn't the DOM prove once and for all that this was a bad idea? Here they don't even have the excuse of needing to run inside browsers.

The three main interfaces are ElementPSVI, AttributePSVI and their common superinterface, ItemPSVI. These are implemented by the same objects that implement DOM Level 3 standard Element, Attr, and Node interfaces (or equivalent in other APIs). Casting is required.

Streaming models would use a PSVIProvider pull interface instead. Xerces supports this in SAX: cast the XMLReader to PSVIProvider, and then call getElementPSVI(), getAttributePSVI(), etc. However, not all details may be available at every point. For instance, in startElement(), one doesn't yet know whether the element is valid.
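Since she names the actual calls, here's roughly what that looks like (my sketch, assuming a recent Xerces 2 release; the feature strings are the usual Xerces ones and may need adjusting):

```java
import org.apache.xerces.xs.ElementPSVI;
import org.apache.xerces.xs.ItemPSVI;
import org.apache.xerces.xs.PSVIProvider;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class PSVIDemo {
    public static void main(String[] args) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader(
            "org.apache.xerces.parsers.SAXParser");
        reader.setFeature("http://xml.org/sax/features/validation", true);
        reader.setFeature("http://apache.org/xml/features/validation/schema", true);
        final PSVIProvider psvi = (PSVIProvider) reader; // the cast Litani describes
        reader.setContentHandler(new DefaultHandler() {
            public void endElement(String uri, String local, String qName) {
                // By endElement() (unlike startElement()) validity is known:
                ElementPSVI e = psvi.getElementPSVI();
                if (e != null) {
                    System.out.println(qName + " valid? "
                        + (e.getValidity() == ItemPSVI.VALIDITY_VALID));
                }
            }
        });
        reader.parse(args[0]); // a document with a schema attached
    }
}
```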

This all looks very closely tied to the W3C XML Schema Language. I don't see how one could use this on a RELAX NG validated document, for example.

This API also includes a full read-only model for modeling schemas, including XSObject, XSModel, etc., for modeling element declarations, target namespaces, type definitions, and so on. I asked what the use case for this part of the API was. Litani suggests comparing two schemas and a schema-aware editor. According to Henry Thompson, it also allows you to navigate the type hierarchy; for instance, to find out if a user-defined type is a subtype of xsd:int.
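Thompson's subtype example maps onto this API something like the following (my sketch; the type name and namespace are hypothetical, and the XS-Loader bootstrap is the route Xerces documents, if I'm remembering it correctly):

```java
import org.apache.xerces.xs.XSConstants;
import org.apache.xerces.xs.XSImplementation;
import org.apache.xerces.xs.XSLoader;
import org.apache.xerces.xs.XSModel;
import org.apache.xerces.xs.XSTypeDefinition;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

public class SchemaQuery {
    public static void main(String[] args) throws Exception {
        // May require pointing the registry at Xerces'
        // DOMXSImplementationSourceImpl via a system property.
        DOMImplementationRegistry registry =
            DOMImplementationRegistry.newInstance();
        XSImplementation impl =
            (XSImplementation) registry.getDOMImplementation("XS-Loader");
        XSLoader loader = impl.createXSLoader(null);
        XSModel model = loader.loadURI(args[0]); // URL of a schema document
        // Hypothetical user-defined type: is it derived from xsd:int?
        XSTypeDefinition t =
            model.getTypeDefinition("myType", "http://example.com/ns");
        System.out.println(t.derivedFrom("http://www.w3.org/2001/XMLSchema",
            "int", XSConstants.DERIVATION_RESTRICTION));
    }
}
```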


Next up is Henry Thompson of the University of Edinburgh talking about "A Logical Foundation for W3C XML Schema." He admits the spec was written for implementers and is difficult to read for ordinary users. In the future he wants to better support logical reasoning about schema composition. He's speaking for himself as an individual, not the working group. He starts from the logic of feature structures as developed by Rounds, Moshier, et al. (And I'm already lost. Oh, he's going to give a mini-tutorial on what a "logic" is. Maybe this will explain it.) A logic is a triple: sentential forms, a model theory, and an interpretation.

Gee, that's clear. OK, he elaborates. A sentential form is a grammar for defining well-formedness, such as a BNF grammar. Now we're on ground I understand somewhat. A model theory is what the sentences are about: a set of individuals and a set of named subsets of that set. The interpretation relates the well-formed sentences to the model so truth values of sentences can be determined. The sentences are interpreted by comparing what's found in the sentence to the items in the sets. The sentences contain logical operators such as OR and AND. I think I see what he's saying. I just don't see why this is useful.

OK, that's a logic. Now on to schemas. According to Thompson, a schema is a component graph in which components are nodes and properties are edge labels. Non-component values are leaf nodes. However, this is a general graph with cycles. Unlike XML documents, it is not a tree. He wants to extend XPath to support such graphs.
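If I'm following him, the data model itself is simple enough to sketch (my own minimal rendering, not his notation): components are nodes, properties are labeled edges, non-component values are leaves, and back edges make it cyclic.

```java
import java.util.HashMap;
import java.util.Map;

// A bare-bones component graph: components are nodes, properties are
// edge labels, and a property may point back to an earlier component,
// so unlike an XML document tree the graph can contain cycles.
public class ComponentGraph {
    static class Component {
        final String kind;
        final Map<String, Object> properties = new HashMap<String, Object>();
        Component(String kind) { this.kind = kind; }
    }

    public static void main(String[] args) {
        Component elementDecl = new Component("elementDeclaration");
        Component typeDef = new Component("typeDefinition");
        elementDecl.properties.put("name", "price");           // leaf: a plain value
        elementDecl.properties.put("typeDefinition", typeDef); // edge to a component
        typeDef.properties.put("context", elementDecl);        // back edge: a cycle
        Component t = (Component) elementDecl.properties.get("typeDefinition");
        System.out.println(t.kind); // typeDefinition
    }
}
```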

Now he's showing a reformulation of parts of the schema spec using logic notation. I wouldn't have thought it possible, but it's even more opaque and reader-hostile than before! Maybe it makes more sense once you've had some time to absorb it. This does look like it may make life easier for implementers. I'm just not sure it's an improvement for users. I asked who this is meant for. Apparently it's supposed to replace the normative parts of the spec and allow the non-normative parts to be written more cleanly.

Dan Brickley wants to rewrite this on top of OWL (which is itself written on top of RDF) instead of Thompson's idiosyncratic XML.

Thompson notes that when the working group was originally working on the PSVI, they were frustrated that nothing in the Infoset spec told them how to be good citizens when extending the Infoset. It occurs to me that this is a problem for other specs that want to extend the Infoset, such as XInclude (which is trying to sneak a new language property in). Thompson claims this approach solves that problem, but I can't tell. As he says, "This stuff is dense." The formalization is very close to being a Prolog program, which would make an excellent reference implementation (if an inefficient one).

That's all for today. More tomorrow.