Cafe con Leche News Thursday, December 7, 2006

Microsoft's Craig Kitterman kicks off the morning by talking about "Ecma Office Open XML". That's a disingenuous name. This has nothing to do with OpenOffice. In fact, it's a direct competitor. Is there a trademark attorney in the house?

This is the default format in Office 2007. .docx is the file extension. "100% compatible with previous office documents." In other words, everything in classic Office binary files can be converted to XML with full pixel level fidelity. Licensed under Covenant Not to Sue and Open Specification Promise. The current draft spec is 6,000 pages long.

The basic message of this talk is that the format is an open standard, supported by many players. I don't buy it. ECMA is the rubber stamp of standards organizations, and any company the size of Microsoft can get a few friends to lend their names. 6,000 page specs that document legacy formats aren't open. There's no reasonable way anyone can hope to implement all of this faithfully without Microsoft's legacy code base. I doubt even Microsoft can do it. Documenting all the kinks and corner cases of a 10+ year old legacy format of one product does not turn it into a true open standard. Open standards start from scratch with full consideration for all players. They are not crippled by insistence on compatibility with decades of legacy code from one product and one company. They are independent of particular implementations. This is not a neutral file format. It vastly privileges Microsoft Office.

That said, I'm glad this exists. It is an improvement over Microsoft's classic, undocumented, binary file formats. However, it is not a plausible alternative to OpenDocument. It is far too complex and too baroque.

Paolo Marinelli and Stefano Zacchiroli from the Università di Bologna won the XML 20006 Student Scholarship with Co-constraint Validation in a Streaming Context. Paolo gets to give the morning keynote. Oh great. There's no wireless in the keynote room, again. Bleah.

This was the most technical talk I've seen at this conference. I don't think I followed it all, but the ideas seem good. The notion of automatically rewriting location paths such that reverse axes turn into forward axes was quite clever. e.g. /descendant::x[preceding-sibling::y] becomes /descendant::x[/descendant::y/following-sibling::node() == self::node()].

For the first morning session, Andrew Savikas from O'Reilly talks about the Atom Publishing Protocol, APP; and there's no wireless again. They use DocBook subsets fro Safari and Safari U (but not the same subset). Moving from classic paper book publishing to more continuous, adaptable publishing in a variety of formats is driving some changes in process.

APP supports the creation of arbitrary resources over the Web, not just blog entries. They publish DocBook 4.4, XHTML, PDF, and a variety of image formats. "Having to resolve 10 years of DocBook validity errors takes a while." O'Reilly chose DocBook 4.4 because it's closest to their existing content, and the DocBook XSL stylesheets don't perfectly support DocBook 5 yet (though that's improving fast). The repository is a Mark Logic native XML database.

After the lunch break, I returned to Back Bay A for the DITA (Darwin Information Typing Architecture) panel, despite the nonexistent wireless network in this room. I've heard a lot about DITA, but I'm not really sure what it does. My vague picture is sort of like DocBook but for man pages. Perhaps I'll get an idea whether or not this is worth paying further attention too.

First speaker is Allen Houser from Group Wellesley. He''s giving the 30,000 foot view. DITA was developed inside IBM to replace IBMIDDOC. Donated to OASIS, DITA 1.0 is current spec. 1.1 is under development. It's an architecture, not just a markup language.

Topics are core information units; a stand alone reusable chunk of information
Task, concept, and reference information types
Sppecialization and generalization: domain and structural specialization
Attribute based formatting; like class attribute in HTML
Maps organize topics, blocks, and words/phrases into books, web sites etc.
Metadata-based filtering excludes or annotates content at runtime.
DITA Open toolkit

Start by creating the map file (the outline) rather than the text.

Second speaker is Sean Angus from XyEnterprise talking about the RIM Blackberry and DITA. Cost of tranlsation reduced 75%. Productivity increased 20%. 14 month recouped investment.

Third speaker is Scott Hudson from Flatirons Solutions. He's talking about DocBook vs. DITA. This is interesting, but I'm falling asleep anyway. I may have to split early to hit Dunkin Donuts for some coffee. Twice as much coffee as Starbucks for half the price.

I would have liked to have heard morre about DITA, but I instead I defected at the half to hear IBM's Elias Torres talk about the Apache Project's Abdera, a Java class library for publishing, consuming, and transmitting APP. There's no client user interface or storage backend. A file system backed storage system is planned but not yet implemented. Lotus Ventura (whatever that is) is a major user.

Kenneth Sall and Ronald R. Reck are talking about "applying XQuery which is safe to say the hit technology of the conference" (Simon St. Laurent). Specifically they are talking about applying XQuery and OWL to Wikipedia, the CIA World Factbook, and Project Gutenberg. They want to combine these data sources.

Problem statement: find all the project Gutenberg books written by male European authors in the 19th century.

First they need to convert Project Gutenberg from text to RDF.
Find the authors of each book in Wikipedia to determine gender and time.
Use the factbook to identify European countries.

Wikipedia wasn't very structured or marked up. They had to use proximity search to determine nationality.

Used the DAML OWL version of the CIA World Factbook.

eXist was their XQuery implementation. It worked well for them. Scalability is not yet known.

The final slot of the conference was the first one where I didn't find at least two sessions I really wanted to see. I decided to stick around in the Web 2.0 track to hear Harry Halpin from the University of Edinburgh talk about Social Semantic Mashups: Exploring Social Networks with Microformats and GRDDL.

Data is trapped within HTML or in relational databases behind firewalls. He wants to liberate the data and put it in a common format: RDF. Microformats are "the lower case semantic web". Many sites are using microformats including Yahoo; but data checks in. It doesn't check out. They are writing XSLT stylesheets to convert microformats documents to RDF. Social networks (Linked In, etc.) trap users within their own network. Oh my god! Eigenvectors! There's something I haven't seen or thought of for ten years, probably more.

I think I get GRDDL for the first time. It's just a way of linking an arbitrary namespace well-formed XML document or an XHTML document to a stylesheet that transforms that document into RDF. That's it. You can also put the transformation links in the namespace document (e.g. RDDL) rather than in the document itself.

I still don't buy the semantic web though (upper or lower case).

XML News from Thursday, December 7, 2006