2004 XML News

Friday, December 31, 2004

Eric S. Raymond has released doclifter 2.0, an open source tool that transcodes {n,t,g}roff documentation to DocBook. Version 2.0 adds support for man, mandoc, ms, me, and TkMan source documents as well. Raymond claims the "result is usable without further hand-hacking about 95% of the time." This release fixes bugs. Doclifter is written in Python, and requires Python 2.2a1. doclifter is published under the GPL.


Peter Jipsen has released ASCIIMathML 1.4.3, a JavaScript program that converts calculator-style ASCII math notation and some LaTeX formulas to Presentation MathML while a Web page loads. The resulting MathML can be displayed in Mozilla-based browsers and Internet Explorer 6 with MathPlayer.

Monday, December 27, 2004

Satimage-software has released XMLLib 2.0, an XML parser for AppleScript based on the Gnome Project's libxml. XMLLib supports DOM, XPath, and XSLT. Mac OS X 10.2.8 or later is required.

Thursday, December 23, 2004

I'll be travelling for the holidays for the next week or so. Updates will likely be fairly slow until I return.


The OpenOffice Project has released OpenOffice 1.1.4, an open source office suite for Linux and Windows that saves all its files as zipped XML. I used the previous 1.0 version to write Effective XML. 1.1.4 is exclusively a bug fix release. OpenOffice is dual licensed under the LGPL and Sun Industry Standards Source License.

Wednesday, December 22, 2004

The XML Apache Project has released Xalan-C++ 1.9, an open source XSLT processor written in standard C++. Version 1.9 supports memory management, enables iterative processing, can pool all text node strings, and fixes assorted bugs.


Michael Kay has released Saxon 8.2, an implementation of XSLT 2.0, XPath 2.0, and XQuery in Java. Saxon 8.2 is published in two versions, both of which require Java 1.4. Saxon 8.2B is an open source product published under the Mozilla Public License 1.0 that "implements the 'basic' conformance level for XSLT 2.0 and XQuery." Saxon 8.2SA is a £250.00 payware version that "allows stylesheets and queries to import an XML Schema, to validate input and output trees against a schema, and to select elements and attributes based on their schema-defined type. Saxon-SA also incorporates a free-standing XML Schema validator. In addition Saxon-SA incorporates some advanced extensions not available in the Saxon-B product. These include a try/catch capability for catching dynamic errors, improved error diagnostics, support for higher-order functions, and additional facilities in XQuery including support for grouping, advanced regular expression analysis, and formatting of dates and numbers." Version 8.2 adds support for XOM, supports the JAXP 1.3 XPath and schema validation APIs, improves performance in a few areas, and is more backwards compatible with XSLT 1.0 stylesheets. Upgrades from 8.x are free.

Tuesday, December 21, 2004

I am pleased to announce what I expect is the final beta and release candidate of XOM 1.0, my open source dual streaming/tree-based API for processing XML with Java. XOM focuses on correctness, simplicity, and performance, in that order. This final (I hope) beta makes a number of improvements to performance in various areas of the API. Depending on the nature of your programs and documents, you should see speed-ups of somewhere between 0 and 20% compared to the previous beta. There are no over-the-covers changes in this release. Under-the-covers a few classes have undergone major rewrites, and a couple of non-public classes have been removed. All the unit tests (now over a thousand of them) still pass, but please do check this release out with your own code. If no problems are identified in this beta, I expect to officially release XOM 1.0 possibly as early as tomorrow, and certainly by the end of the year.


The W3C has released the final recommendation of XInclude 1.0. There do not appear to be any significant changes since the proposed recommendation was published a couple of months ago. Briefly, XInclude describes a means to build complex documents out of simpler documents by replacing elements like <xi:include href="chapter1.xml"/> with the contents of the file they refer to. For more details, I've written a Brief Introduction to XInclude.

There don't appear to be any fully conformant implementations yet. However, XOM's XIncluder class implements all the required functionality, and passes all the tests that don't depend on optional features. Specifically XOM does not support unparsed entities, notations, or the xpointer XPointer scheme. The Gnome Project's libxml also does a pretty good job with XInclude, and does support the xpointer scheme, though it doesn't handle all the edge cases quite as rigorously as XOM does. I've also written XInclude engines for DOM, JDOM, and SAX. However, these are much buggier and more incomplete than the XOM version. Some improvements have been made in CVS since the last milestone drop. However, they still flunk lots of the test cases, and may even get stuck in infinite loops or otherwise die horrible deaths when faced with relatively complex operations.
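As a rough illustration of the replacement XInclude performs, here is a minimal sketch using the ElementInclude module from Python's standard library (a small, partial XInclude implementation, not one of the engines discussed above). The in-memory DOCS table and the loader function are assumptions made so the example runs without touching the file system.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files" so the sketch needs no disk access.
DOCS = {
    "chapter1.xml": "<chapter><title>One</title></chapter>",
}

def loader(href, parse, encoding=None):
    """Resolve an xi:include href against the DOCS table above."""
    text = DOCS[href]
    if parse == "xml":
        return ET.fromstring(text)   # parsed inclusion
    return text                      # parse="text" inclusion

book = ET.fromstring(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1.xml"/>'
    '</book>'
)

# Replace each xi:include element in place with the document it refers to.
ElementInclude.include(book, loader=loader)
print(ET.tostring(book, encoding="unicode"))
```

After the call, the xi:include element is gone and the chapter element sits in its place as a child of book.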

Monday, December 20, 2004

The Apache XML Project has released XML Security v1.2, an implementation of security related XML standards including Canonical XML, XML Encryption, and XML Signature Syntax and Processing. A compatible Java Cryptography Extension provider is required. Version 1.2 improves performance and offers "Easier JCE integration".

Sunday, December 19, 2004

The Mozilla Project has released Mozilla 1.7.5. This release improves IE compatibility with non-standards compliant sites, and adds NPRuntime support. "NPRuntime is an extension to the Netscape Plugin API that was developed in cooperation with Apple, Opera, and a group of plugin vendors." More importantly, this release fixes over three hundred assorted bugs. Sadly none of them seem to be ones that have been bedeviling me. This release isn't too critical, but given the large number of fixes you should probably upgrade when you get a minute. Most of these fixes will be rolled into Firefox 1.1 sometime next year.

Saturday, December 18, 2004

The W3C XSL Working Group has published the second working draft of Extensible Stylesheet Language (XSL) Version 1.1. Despite the more generic name, this actually only covers XSL Formatting Objects, not XSL Transformations. New features in 1.1 include:

  • Multiple flows
  • Change marks
  • Back of the book indexing
  • Bookmarks
  • Markers in tables
  • fo:page-number-citation-last
  • fo:page-sequence-wrapper
  • clear and float inside and outside
  • prefixes and suffixes for page numbers

Friday, December 17, 2004

The W3C Technical Architecture Group (TAG) has published Architecture of the World Wide Web, First Edition. Quoting from the abstract:

The World Wide Web uses relatively simple technologies with sufficient scalability, efficiency and utility that they have resulted in a remarkable information space of interrelated resources, growing across languages, cultures, and media. In an effort to preserve these properties of the information space as the technologies evolve, this architecture document discusses the core design components of the Web. They are identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space. We relate core design components, constraints, and good practices to the principles and properties they support.

It's pretty good stuff overall. Everyone working on the Web, the Semantic Web, or with XML or URIs should read it.


Jens Låås has released version 1.5.4 of xmlclitools, a set of four Linux command-line tools for searching, modifying, and formatting XML data. The tools are designed to work in conjunction with standard utilities such as grep, sort, and shell scripts. Version 1.5.4 allows UTF-8 output from xmlgrep. They are published under the LGPL.

Thursday, December 16, 2004

Rich Salz and Dave Orchard have written a proposal for a URN scheme for XML QNames. Basically they propose that a qualified name such as xsl:template could be written as urn:qname:xsl:template:http://www.w3.org/1999/XSL/Transform. The default namespace can be handled by omitting the prefix. For example, the XHTML p element would be urn:qname::p:http://www.w3.org/1999/xhtml. You can ignore the prefix by using an asterisk. For instance, urn:qname:*:rect:http://www.w3.org/2000/svg matches both svg:rect and rect in the default SVG namespace. I've expressed a few technical and editorial quibbles with the current draft to the authors, mostly revolving around the distinctions between URI, IRI, and URI reference, but nothing fundamental. Overall this seems like a pretty solid idea. I'm not sure exactly where this would be used, but it sounds like it ought to be useful somewhere.
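The syntax described above is simple enough to sketch in a few lines of Python. This is only an illustration of the three examples given here, not an implementation of the draft; the function names to_urn, parse_urn, and matches are hypothetical, and the draft may well specify escaping or comparison rules this sketch ignores.

```python
def to_urn(local, namespace, prefix=""):
    """Build urn:qname:<prefix>:<local>:<namespace>; an empty prefix
    stands for the default namespace."""
    return f"urn:qname:{prefix}:{local}:{namespace}"

def parse_urn(urn):
    """Split a qname URN into (prefix, local, namespace).
    Only the first four colons are separators; the namespace URI
    keeps its own colons."""
    scheme, nid, prefix, local, namespace = urn.split(":", 4)
    if (scheme, nid) != ("urn", "qname"):
        raise ValueError(f"not a qname URN: {urn!r}")
    return prefix, local, namespace

def matches(urn, prefix, local, namespace):
    """True if the URN names the given element; '*' ignores the prefix."""
    p, l, ns = parse_urn(urn)
    return l == local and ns == namespace and p in ("*", prefix)

print(to_urn("template", "http://www.w3.org/1999/XSL/Transform", "xsl"))
print(to_urn("p", "http://www.w3.org/1999/xhtml"))
```

With the wildcard URN from the text, matches returns True for both the svg prefix and the empty (default namespace) prefix.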

Wednesday, December 15, 2004

The W3C Web Services Addressing Working Group has posted three new working drafts on the subject of (what else?) web services addressing. Web Services Addressing - Core defines generic extensions to the Infoset for endpoint references and message addressing properties. Web Services Addressing - SOAP Binding and Web Services Addressing - WSDL Binding describe how the abstract properties defined in the core spec are implemented in SOAP and WSDL respectively.


The W3C Multimodal Interaction working group has posted the fourth public working draft of EMMA: Extensible MultiModal Annotation markup language. According to the abstract, this spec "provides details of an XML markup language for describing the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user's inputs such as interaction managers."

Tuesday, December 14, 2004

Sleepycat Software has released Berkeley DB XML 2.0.7, an open source "application-specific, embedded data manager for native XML data" based on Berkeley DB. It supports the July working drafts of XQuery 1.0 and XPath 2.0. It includes C++, Java, Perl, Python, TCL and PHP APIs. This is the first public release in the 2.0 series.

Monday, December 13, 2004

Benjamin Pasero has posted the first release candidate of RSSOwl 1.0, an open source RSS reader written in Java and based on the SWT toolkit. RSSOwl is the best open source RSS client I've seen written in Java. That said, it still doesn't feel right to me. Even ignoring various small bugs and user interface inconsistencies, news just doesn't flow in this client. The three-pane layout that separates the news item titles from each news item, and places the news item titles above the text of the news item, doesn't work well for me.


The Mozilla Project has released Sage 1.3, an open source RSS plug-in for Firefox. Overall, I think this works better than RSSOwl. It also uses a three pane layout, but one of the panes is almost a full sized browser window, and includes complete news items, one after the other. This is almost good enough to actually use. However, it's got three major missing features:

  1. It does not respect the browser's font preferences. The default font is way too small for me, nor is it my preferred font face for onscreen reading.
  2. It does not aggregate news items from different RSS feeds. (RSSOwl does.)
  3. It does not hide news items I've already read.

Both RSSOwl and Sage still feel like toys to me, not serious tools. Neither of these products seems capable of handling hundreds of blogs and thousands of news items. Oh, don't get me wrong. I'm sure you could subscribe to hundreds, probably thousands, of feeds in either one of them; and it wouldn't crash or slow down. But the user interface is just not adequate for managing such large, ongoing, constantly updated information collections. So far I'm not sure if there is an RSS client that is capable of this. I don't know what a good RSS client will look like, but I'll know it when I see it, and so far I haven't seen it. RSS feeds are a new way of interacting with information, and we need some serious user interaction studies to understand how to properly design new user interfaces that fit. The old metaphors aren't working any more.

Sunday, December 12, 2004

IBM's alphaWorks has released IBM Forms for Mobile Devices, "a Java-based, distributed software solution that, using XForms (a W3C standard for forms definition), enables pervasive mobile devices to access and complete business forms. This forms solution allows developers to quickly create, deploy and use forms based applications. This software demonstrates the ability of intermittently connected mobile devices to access and complete business forms that are stored locally on the mobile device. The completed forms are transferred to a server for additional processing when connectivity is available."

Friday, December 10, 2004

Bare Bones Software has released version 8.0.3 of BBEdit, my preferred text editor on the Mac. This release adds various small features and fixes a number of bugs. BBEdit is $179 payware. Upgrades from 8.0 are free. Upgrades from earlier versions are $49 for 7.0 owners and $59 for owners of earlier versions.

Thursday, December 9, 2004

Ryan Tomayko has commenced work on Kid, "a simple Pythonic template language for XML based vocabularies. It was spawned as a result of a kinky love triangle between XSLT, TAL, and PHP." The language is based on just five attributes: kid:repeat, kid:if, kid:content, kid:omit, and kid:replace; each of which contains a Python expression. Since this expression can point to externally defined functions, this is most of what you need. In addition there are attribute value templates similar to XSLT's, and <?kid?> processing instructions can embed code directly in the XML document. I'm not sure I approve of the use of processing instructions in the language, but I'm not sure I don't either. Not having to escape XML-significant symbols like < and & in the embedded code is convenient. Kid templates are compiled to Python byte-code and can be imported and invoked like normal Python code. Kid templates generate SAX events and can be used with existing libraries that work along SAX pipelines. Overall it looks like a fairly well-designed, well-thought out system that has clearly learned from the mistakes of gnarly systems like PHP, JSP, and ASP. Why am I not surprised to see this coming out of the Python community?
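To make the idea of attributes that hold Python expressions concrete, here is a toy sketch. It is emphatically not Kid and does not use Kid's API or namespace: the urn:example:toy-template namespace, the render function, and the two attributes it handles (if and content) are all invented for illustration, and it calls eval on template text, which is only acceptable for templates you trust.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for this toy; Kid's real namespace is not assumed.
NS = "urn:example:toy-template"
IF_ATTR = f"{{{NS}}}if"
CONTENT_ATTR = f"{{{NS}}}content"

def render(elem, context):
    """Return a rendered copy of elem, or None when its 'if' is false.
    Attribute values are evaluated as Python expressions against context."""
    cond = elem.get(IF_ATTR)
    if cond is not None and not eval(cond, {}, context):
        return None                      # drop the element entirely
    # Copy ordinary attributes, leaving the template attributes behind.
    attrs = {k: v for k, v in elem.attrib.items()
             if not k.startswith(f"{{{NS}}}")}
    out = ET.Element(elem.tag, attrs)
    out.tail = elem.tail
    content = elem.get(CONTENT_ATTR)
    if content is not None:
        # Replace the element's content with the expression's value.
        out.text = str(eval(content, {}, context))
        return out
    out.text = elem.text
    for child in elem:
        rendered = render(child, context)
        if rendered is not None:
            out.append(rendered)
    return out

TEMPLATE = (
    '<div xmlns:t="urn:example:toy-template">'
    '<h1 t:content="title.upper()">placeholder</h1>'
    '<p t:if="show">secret</p>'
    '</div>'
)

page = render(ET.fromstring(TEMPLATE), {"title": "news", "show": False})
print(ET.tostring(page, encoding="unicode"))
```

A real system would also need repetition, attribute value templates, and safer expression handling; the sketch only shows why a handful of attributes plus an expression language covers so much ground.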

Wednesday, December 8, 2004

The Mozilla Project has posted Camino 0.8.2, a Mac OS X web browser based on the Gecko 1.7 rendering engine and the Quartz GUI toolkit. It supports pretty much all the technologies that Mozilla does: HTML, XHTML, CSS, XML, XSLT, etc. 0.8.2 is a bug fix release. Mac OS X 10.1.5 or later is required.


Kiyut has released Sketsa 2.2.1, a $29 payware SVG editor written in Java. Java 1.4.1 or later is required.

Tuesday, December 7, 2004

In anticipation of the upcoming release of the XInclude recommendation, I've posted a brief introduction to XInclude on The Cafes. This is an updated version of an article I've published in a couple of other venues over the last few years.

By the way, if you're using Internet Explorer and have had problems with the comment form on the Cafes, that has now been at least partially fixed. It still looks ugly, but at least it doesn't slide under the sidebar any more. The trick was setting width: 100%; on the form element.


The W3C has released version 8.7 of Amaya, their open source testbed web browser and authoring tool for Solaris, Linux, Windows, and Mac OS X that supports HTML 4.01, XHTML 1.0, XHTML Basic, XHTML 1.1, HTTP 1.1, MathML 2.0, SVG, and much of CSS 2. Besides bug fixes, there are a few new features in this release, including the display of non-breaking spaces and tabs in the source view as ~ and », and menu items to generate section numbers and tables of contents.

Monday, December 6, 2004

The Mozilla Project has posted the fifth alpha of Mozilla 1.8. New features in 1.8 include FTP uploads, improved junk mail filtering, better Eudora import, and an increase in the number of cookies that Mozilla can remember. It also makes various small user interface improvements, gives users the option to disable CSS globally or on a per-page basis, and adds support for CSS quotes. Alpha 5 fixes a slew of bugs and enables support for CSS columns.


The W3C Authoring Tool Accessibility Guidelines Working Group has posted a working draft of Implementation Techniques for Authoring Tool Accessibility Guidelines 2.0. "This document provides non-normative information to authoring tool developers who wish to satisfy the checkpoints of "Authoring Tool Accessibility Guidelines 2.0" [ATAG20]. It includes suggested techniques, sample strategies in deployed tools, and references to other accessibility resources (such as platform-specific software accessibility guidelines) that provide additional information on how a tool may satisfy each checkpoint."

Saturday, December 4, 2004

P & P Software has released XSLTdoc 1.0, a free-as-in-speech (GPL) Javadoc-like tool for XSLT stylesheets. Instead of using comments, it uses elements in the http://www.pnp-software.com/XSLTdoc namespace. These are top-level elements that appear before each documented top-level XSLT element. The processor is itself implemented in XSLT 2.0. The XSLT stylesheets are not complete unto themselves. A config file is also required, though mostly this just replaces what would be provided by command line arguments in javadoc. However, truly JavaDoc-like documentation should be able to be generated purely from the XSLT stylesheets themselves, without any extra files needing to be consulted.

Friday, December 3, 2004

The OpenOffice Project has posted the first release candidate of OpenOffice 1.1.4, an open source office suite for Linux and Windows that saves all its files as zipped XML. I used the previous 1.0 version to write Effective XML. 1.1.4 is exclusively a bug fix release. OpenOffice is dual licensed under the LGPL and Sun Industry Standards Source License.


The IETF has posted working drafts for five more URL schemes:

(Does anyone still use Prospero any more? For that matter, does anyone still use gopher?) These all document schemes that were originally documented in the soon to be historic RFC 1738. NNTP URLs are deprecated in favor of news. Otherwise, none of them make significant changes.

Thursday, December 2, 2004

The IETF has posted a working draft of The file URI Scheme. "This document specifies the file Uniform Resource Identifier (URI) scheme that was originally specified in RFC 1738. The purpose of this document is to allow RFC 1738 to be moved to historic while keeping the information about the scheme on standards track." Sadly, this draft does not attempt to correct the numerous ambiguities and inconsistencies in both the original RFC and the practice of file URLs in software today.


The IETF has also posted another last call working draft of Internationalized Resource Identifiers (IRIs). "An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources." In other words this lets you write URLs that use non-ASCII characters such as http://www.libération.fr/. The non-ASCII characters would be converted to a genuine URI using hexadecimally escaped UTF-8. For instance, http://www.libération.fr/ becomes http://www.lib%C3%A9ration.fr/. There's also an alternative, more complicated syntax to be used when the DNS doesn't allow percent escaped domain names. However, the other parts of the IRI (fragment ID, path, scheme, etc.) always use percent escaping. The changes in this draft mostly focus on specifying different possible mechanisms for comparing IRIs for equality.
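The percent-escaping step described above can be sketched with Python's standard urllib.parse.quote, which encodes non-ASCII characters as UTF-8 and then hex-escapes each byte. The helper name iri_to_uri and the particular set of characters left unescaped are assumptions for illustration; the draft's full mapping has more rules than this sketch covers, and the alternative syntax for domain names is not attempted here.

```python
from urllib.parse import quote

# ASCII characters that already have a role in URI syntax and so are
# left alone (an assumption for this sketch, including '%' so existing
# escapes are not escaped a second time).
KEEP = ":/?#[]@!$&'()*+,;=%"

def iri_to_uri(iri):
    """Escape every non-ASCII character as percent-encoded UTF-8."""
    return quote(iri, safe=KEEP, encoding="utf-8")

print(iri_to_uri("http://www.libération.fr/"))
```

An IRI that contains only ASCII characters passes through unchanged, which is what lets IRIs be used "instead of URIs where appropriate".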


The W3C Web Services Internationalization Task Force has published the second public working draft of Requirements for the Internationalization of Web Services. According to the intro,

A Web Service is a software application identified by a URI [RFC2396], whose interfaces and binding are capable of being defined, described and discovered by XML artifacts, and which supports direct interactions with other software applications using XML-based messages via Internet-based protocols. The full range of application functionality can be exposed in a Web service.

The W3C Internationalization Working Group, Web Services Task Force, was chartered to examine Web Services for internationalization issues. The result of this work is the Web Services Internationalization Usage Scenarios document [WSIUS]. Some of the scenarios in that document demonstrate that, in order to achieve worldwide usability, internationalization options must be exposed in a consistent way in the definitions, descriptions, messages, and discovery mechanisms that make up Web services.

According to the status section, "There were only very few changes since the last publication. The main change is the addition of requirement R007 about integration with the overall Web services architecture and existing technologies. The wording of the other requirements was changed to not favor solutions that are still under discussion. Text has been streamlined and references have been updated."


Michael Smith has posted version 1.67.2 of the DocBook XSL stylesheets. These support transforms to HTML, XHTML, and XSL-FO. This is mostly a bug fix release but does expand customizability in a few areas including tables and tables of contents.

Wednesday, December 1, 2004

The W3C Multimodal Interaction Working Group has posted a working draft of the Dynamic Properties Framework. According to the abstract, "This document defines platform and language neutral interfaces that provide Web applications with access to a hierarchy of dynamic properties representing device capabilities, configurations, user preferences and environmental conditions."


RenderX has released version 4.1 of XEP, its payware XSL Formatting Objects to PDF and PostScript converter. XEP also supports part of Scalable Vector Graphics (SVG) 1.1. New features in 4.1 include embedding of Adobe Compact Font Format (CFF) fonts, improved memory management, and reworked algorithms for automatic table layout. The basic client is $299.95. The developer edition with an API is $999.95. The server version is $3999.95. Updates from 3.0 range from free to full-price depending on when you bought it.

Tuesday, November 30, 2004

The Cafes seems to be off and running. There were a few initial glitches that I have now cleaned up. There's some interesting discussion in the comments fora for On Iterators and Indexes and Overloading Int Considered Harmful. Today's project is to make the staging server work enough like the production server that I can use it for testing and debugging without affecting the production server. Yesterday I got stymied by a slight difference in how the PHP engines were configured. (The staging server didn't have libtidy support that the site relies on heavily.) I had planned to post a backlist article today, but instead I found myself forced to think about spam.


Colin Paul Adams has commenced work on Gestalt, an open source, non-schema aware XSLT 2.0 processor written in Eiffel. Gestalt is published under the Eiffel Forum License V2.0.


The W3C Quality Assurance (QA) Activity has published a revised working draft of the QA Framework: Specification Guidelines. Quoting from the abstract, "A lot of effort goes into writing a good specification. It takes more than knowledge of the technology to make a specification precise, implementable and testable. It takes planning, organization, and foresight about the technology and how it will be implemented and used. The goal of this document is to help W3C editors write better specifications, by making a specification easier to interpret without ambiguity and clearer as to what is required in order to conform. It focuses on how to define and specify conformance for a specification. Additionally, it addresses how a specification might allow variation among conforming implementations. The document is presented as a set of guidelines or requirements, supplemented with good practices, examples, and techniques."


Altsoft N.V. has released Xml2PDF 2.1, a $49 payware Windows program for converting XSL Formatting Objects documents into PDF files. Version 2.1 makes various optimizations and adds support for patterns in SVG and XML image embedding with data: URLs.

Monday, November 29, 2004

Lately I've noticed that outlets for article sized content are becoming fewer and farther between. I've got lots of things I'd like to write about at a length somewhat longer than a typical Cafe con Leche news item, but much shorter than a full book; so I decided to do something about it. Hence I am announcing The Cafes, a new site for content that falls somewhere in the large territory between a blog post and a book. The initial articles include:

The Cafes will not be updated as regularly as Cafe con Leche; just when I've got something I want to write about. It's going to focus more on How-Tos and technical material, and less on product announcements. The goal is to write more substantive material that will be valuable for a longer period of time. I do have an RSS feed for the site to announce the most recent articles, but my assumption is that most readers will find the site through search engines and links, when looking for information on a specific topic, not by checking in every day. I will be adding new articles at a rate of about one a day this week, as I've got quite a back log to plow through. Indeed the back log was one of the motivating factors for launching the site.

Also unlike Cafe au Lait/con Leche, the Cafes includes a place for reader comments on each article. I rolled my own comments system on top of PHP and MySQL because I really couldn't find an existing system that did what I wanted it to do. At the same time, since no one's done comments like this before, I pretty much had to code the system from scratch, so it's more than likely there are some bugs flitting around, waiting to be squashed. If you happen on any of the critters, please let me know.

Judging by my server logs, a few of you have found the site already. There's still a lot of work to be done, but I think it's ready to be opened to the public. Check it out, and let me know what you think. Please post any comments you have on the Welcome to the Cafes page. I think you'll like what you see. I'm very excited about The Cafes, and I think it's going to be a very interesting and productive destination. Happy XML!

Sunday, November 28, 2004

The W3C Authoring Tool Accessibility Guidelines Working Group has posted the last call working draft of Authoring Tool Accessibility Guidelines 2.0. "This specification provides guidelines for designing authoring tools that lower barriers to Web accessibility for people with disabilities. An authoring tool that conforms to these guidelines will promote accessibility by providing an accessible authoring interface to authors with disabilities as well as enabling, supporting, and promoting the production of accessible Web content by all authors."

Friday, November 26, 2004

The W3C Quality Assurance Working Group has published The QA Handbook, "a non-normative handbook about the process and operational aspects of certain quality assurance practices of W3C's Working Groups, with particular focus on testability and test topics. It is intended for Working Group chairs and team contacts. It aims to help them to avoid known pitfalls and benefit from experiences gathered from the W3C Working Groups themselves. It provides techniques, tools, and templates that should facilitate and accelerate their work."

Thursday, November 25, 2004

The W3C Internationalization Working Group has published the proposed recommendation of Character Model for the World Wide Web 1.0: Fundamentals. "This Architectural Specification provides authors of specifications, software developers, and content developers with a common reference for interoperable text manipulation on the World Wide Web, building on the Universal Character Set, defined jointly by the Unicode Standard and ISO/IEC 10646. Topics addressed include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing."

This version spins out a new spec, Character Model for the World Wide Web 1.0: Resource Identifiers, which is in candidate recommendation. This spec basically says other specs should use IRIs everywhere, and should be careful to define when the conversion to URIs takes place.

Wednesday, November 24, 2004

IBM's developerWorks has published an article I wrote about RELAX NG with custom datatype libraries. This article explores one of the most powerful but little-known features of RELAX NG: the ability to define new simple data types using Java code. This enables one to check constraints like a number is prime, every left parenthesis is matched by a properly balanced right parenthesis, or the value of an SKU attribute matches the value of an SKU field in an external database. None of these constraints are expressible in the W3C XML Schema Language.
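The first two constraints mentioned above come down to ordinary predicates over the lexical value. The sketch below shows only that checking logic, in Python for brevity; it does not use the Java datatype library interface the article covers, and the function names is_prime and parens_balanced are hypothetical.

```python
def is_prime(value):
    """True if the lexical value is a decimal integer that is prime."""
    try:
        n = int(value.strip())
    except ValueError:
        return False                 # not even a number
    if n < 2:
        return False
    i = 2
    while i * i <= n:                # trial division up to sqrt(n)
        if n % i == 0:
            return False
        i += 1
    return True

def parens_balanced(value):
    """True if every '(' is closed by a later ')' and none is left over."""
    depth = 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:            # a ')' appeared before its '('
                return False
    return depth == 0

print(is_prime("17"), parens_balanced("(a (b) c)"))
```

A custom datatype library would wrap predicates like these so the validator can call them whenever a schema declares an attribute or element to be of the corresponding type.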


Ispras Modis has posted Sedna 0.3, an open source native XML database for Windows written in C++ and Scheme and published under the Apache License 2.0. This is not currently recommended for production. Sedna has partial support for XQuery and its own declarative update language.

Tuesday, November 23, 2004

The W3C SVG and CSS Working Groups have posted the second public working draft of SVG's XML Binding Language (sXBL). In brief, think of this as stylesheets on steroids. The goal is to be able to render any XML document by transforming it into SVG. This would allow the rendering of things that don't look remotely like text, such as MathML and MusicXML.


Stefan Champailler has posted DTDDoc 0.0.11, a JavaDoc-like tool for creating HTML documentation of document type definitions from embedded DTD comments. This release adds a DTD tree browser, an entities index for each DTD, clickable element models, and autodetection of root elements. DTDDoc is published under an MIT license.

Monday, November 22, 2004

The W3C XML Protocol Working Group has published three proposed recommendations covering XOP, a MIME multipart envelope format for bundling XML documents with binary data:

  • XML-binary Optimized Packaging "defines the XML-binary Optimized Packaging (XOP) convention, a means of more efficiently serializing XML Infosets (see [XMLInfoSet]) that have certain types of content. A XOP package is created by placing a serialization of the XML Infoset inside of an extensible packaging format (such as MIME Multipart/Related, see [RFC 2387]). Then, selected portions of its content that are base64-encoded binary data are extracted and re-encoded (i.e., the data is decoded from base64) and placed into the package. The locations of those selected portions are marked in the XML with a special element that links to the packaged data using URIs."
  • SOAP Message Transmission Optimization Mechanism "describes an abstract feature and a concrete implementation of it for optimizing the transmission and/or wire format of SOAP messages. The concrete implementation relies on the [XOP] format for carrying SOAP messages."
  • Resource Representation SOAP Header Block "describes the semantics and serialization of a SOAP header block for carrying resource representations in SOAP messages."

Basically this is another whack at the packaging problem: how to wrap up several documents, including both XML and non-XML documents, and transmit them in a single SOAP request or response. In brief, this proposes using a MIME envelope to do that. This is all reasonable. I do question the wisdom, however, of pretending this is just another XML document. It's not. The working group wants to ship binary data like images in their native binary form, which is sensible. What I don't like is that the working group wants to take their non-XML, MIME-based format and say that it's XML because you could theoretically translate the binary data into Base-64, reshuffle the parts, and come up with something that is an XML document, even though they don't expect anyone to actually do that.

Why is there this irresistible urge throughout the technology community to call everything XML, even when it clearly isn't and clearly shouldn't be? XML is very good for what it is, but it doesn't and shouldn't try to be all things to all people. Binary data is not something XML does well and not something it ever will do well. Render into binary what is binary, and render into XML what is text.
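To make the packaging idea concrete, here is a rough sketch in Python of a MIME-style multipart envelope that carries an XML part next to a raw binary part. It is only an illustration of the general shape, not a conformant XOP package: the boundary string, the Content-ID values, and the `data` element with its `href` attribute are all made up for this example, and the naive parser assumes the boundary never occurs inside a part.

```python
BOUNDARY = b"MIME_boundary_sketch"  # hypothetical; must not occur inside any part


def mime_part(content_type: str, content_id: str, body: bytes) -> bytes:
    """Serialize one part; the body is written as raw bytes, not Base-64."""
    headers = (
        f"Content-Type: {content_type}\r\n"
        f"Content-Transfer-Encoding: binary\r\n"
        f"Content-ID: <{content_id}>\r\n\r\n"
    ).encode("ascii")
    return b"--" + BOUNDARY + b"\r\n" + headers + body + b"\r\n"


def build_package(xml: bytes, attachments: dict) -> bytes:
    """Put the XML first, then each binary attachment keyed by Content-ID."""
    parts = [mime_part("application/xml", "root@example.org", xml)]
    for content_id, data in attachments.items():
        parts.append(mime_part("application/octet-stream", content_id, data))
    return b"".join(parts) + b"--" + BOUNDARY + b"--\r\n"


def split_package(package: bytes) -> dict:
    """Naively map each Content-ID back to its raw body."""
    content, _, _ = package.rpartition(b"--" + BOUNDARY + b"--")
    bodies = {}
    for chunk in content.split(b"--" + BOUNDARY + b"\r\n")[1:]:
        head, _, body = chunk.partition(b"\r\n\r\n")
        for line in head.decode("ascii").split("\r\n"):
            name, _, value = line.partition(": ")
            if name == "Content-ID":
                bodies[value.strip("<>")] = body[:-2]  # drop the trailing CRLF
    return bodies


# The XML points at the binary part by Content-ID instead of embedding it.
xml = b'<photo><data href="cid:image1@example.org"/></photo>'
image = bytes(range(256))  # stand-in for real image bytes
package = build_package(xml, {"image1@example.org": image})
```

Note that the result is a MIME document with an XML document inside it. Nothing about the envelope itself is XML, which is rather the point.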

Sunday, November 21, 2004

The W3C Web Content Accessibility Guidelines Working Group has posted five public working drafts covering various topics:

These describe "design principles for creating accessible Web content. When these principles are ignored, individuals with disabilities may not be able to access the content at all, or they may be able to do so only with great difficulty. When these principles are employed, they also make Web content accessible to a variety of Web-enabled devices, such as phones, handheld devices, kiosks, network appliances. By making content accessible to a variety of devices, that content will also be accessible to people in a variety of situations."

There's some useful information in here. I knew most of this stuff already, but I did find a few new ideas. Frames can have titles, which I didn't know, but then I rarely if ever use frames. More practical for me is that I can put an abbr attribute on th elements to provide terse substitutes for header labels to be used for screen readers. Also, "Use the address element to define a page's author." I'd forgotten about that one, but I'll be adding it to my pages now. And I should probably be using an abbr element with a title attribute rather than spelling out "Java Specification Request (JSR)" every time somebody submits a new draft to the JCP. See? I used it already!

Saturday, November 20, 2004

Jacob Roden has posted csv2xml, a simple open source (BSD license) command line utility for converting comma separated values files to XML.

Friday, November 19, 2004

The XML 1.1 Bible is now available as an eBook in Adobe Reader format. Diesel eBooks has it on sale for just $27.48. Amazon has lowered their price for the paper version to just $26.39, including free shipping. Bookpool is selling the paper version for $24.95, but you'll need to pay for shipping unless your total order exceeds $40.00.


Oleg Paraschenko has released TeXML 1.2, an XML vocabulary for TeX. The processor that transforms TeXML markup into TeX markup is written in Python, and thus should run on most modern platforms. The intended audience is developers who automatically generate TeX files. According to Paraschenko, "The main new feature is an automatic laying out of the generated LaTeX code. In fully automatic mode, the TeXML processor deletes redundant spaces and splits long lines on smaller chunks. The generated LaTeX code is legible enough for humans to read and modify." TeXML is published under the GPL.

Thursday, November 18, 2004

The W3C XForms working group has posted the first public working draft of XForms 1.1. Changes since 1.0 include:

  • A new namespace URI, http://www.w3.org/2004/xforms/
  • power, luhn, current and property XPath extension functions
  • An e-mail address datatype
  • An ID card number datatype
  • A duplicate action element and a corresponding xforms-duplicate event
  • A destroy action element and a corresponding xforms-destroy event
  • An xforms-close event
  • An xforms-submit-serialize event
  • Inline rendition of non-text media types
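The luhn function in the list above presumably refers to the Luhn checksum used for credit card and ID numbers. The draft's exact function signature and its handling of separators aren't described here, so the following Python sketch shows only the general checksum; the function name and the choice to ignore non-digit characters are my assumptions.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Assumption: spaces, dashes, and other separators are simply ignored.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

For example, `luhn_valid("79927398713")` is true, while changing the final digit makes the check fail.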

Andy Clark has posted version 0.9.4 of his CyberNeko Tools HTML Parser for the Xerces Native Interface (NekoXNI) and version 0.2.2 of his ManekiNeko RelaxNG Validator. This new version of the HTML parser is mostly a bug fix release. The RELAX NG validator adds the ability to set an entity resolver. CyberNeko is written in Java. Besides the HTML parser and RELAX NG validator, CyberNeko includes a generic XML pull parser, a DTD parser, and a DTD to XML converter.

Wednesday, November 17, 2004

Adobe has posted an update to their SVG viewer plug-in for Windows that fixes a couple of bugs including a security hole. Everyone using this on Windows should upgrade. Other platforms are not affected.


Apparently Amsterdam was a success last year. XML Europe is now XTech and has settled in Amsterdam again. This year it will take place May 24-27, convenient for most academic schedules. The call for papers has been posted. I'll have to think of something to submit. Submissions are due by January 7.

Tuesday, November 16, 2004

Happy Fifth Birthday XSLT!


Altova GmbH has released the Altova XSLT 1.0 and 2.0 Engines and the Altova XQuery Engine. These are closed source and free-beer products for Windows 2000 and later. These are the same engines used in XMLSpy. The XQuery and XSLT 2.0 engines are not fully standards conformant. I'm not sure about the XSLT 1.0 engine, but any bugs in XMLSpy's XSLT are probably found here too.

Monday, November 15, 2004

The W3C Voice Browser Working Group has posted a new working draft of Semantic Interpretation for Speech Recognition. According to the abstract,

This document defines the process of Semantic Interpretation for Speech Recognition and the syntax and semantics of semantic interpretation tags that can be added to speech recognition grammars to compute information to return to an application on the basis of rules and tokens that were matched by the speech recognizer. In particular, it defines the syntax and semantics of the contents of Tags in the Speech Recognition Grammar Specification.

Semantic Interpretation may be useful in combination with other specifications, such as the Stochastic Language Models (N-Gram) Specification, but their use with N-grams has not yet been studied.

The results of semantic interpretation describe the meaning of a natural language utterance. The current specification represents this information as an ECMAScript object, and defines a mechanism to serialize the result into XML. The W3C Multimodal Interaction Activity is defining a data format (EMMA) for representing information contained in user utterances. It is believed that semantic interpretation will be able to produce results that can be included in EMMA.

Sunday, November 14, 2004

Opera Software has posted the third beta of version 7.6.0 of their namesake web browser for Windows. Opera supports HTML, XML, XHTML, RSS, WML 2.0, and CSS. XSLT is not supported. Other features include IRC, mail, and news clients and pop-up blocking. There are lots of little changes, bug fixes, and usability enhancements in 7.60. However major new features include speech-enabled browsing (including support for XHTML+Voice), medium-screen rendering, and inline error pages. Opera is $39 payware.

Saturday, November 13, 2004

The W3C Synchronized Multimedia working group has posted a proposed edited recommendation of Synchronized Multimedia Integration Language (SMIL 2.0). According to the draft, "there are no substantial implementation issues arising as a result of this edition, which aims only to incorporate the published corrigenda to the first edition." Comments are due by December 5.

Friday, November 12, 2004

The Gnome Project has released version 2.6.16 of libxml2, the open source XML C library for Gnome. This release fixes various bugs.


Sun's released version 1.5 of the Java Web Services Developer Pack. This release adds "XML Web Services Security, a preview of the Sun Java Streaming XML Parser based on JSR 173, as well as updates to existing web services technologies previously released in the Java WSDP." The complete contents are:

  • XML and Web Services Security v1.0
  • XML Digital Signatures v1.0 EA2
  • Sun Java Streaming XML Parser v1.0 EA
  • Java Architecture for XML Binding (JAXB) v1.0.4
  • Java API for XML Processing (JAXP) v1.2.6_01
  • Java API for XML Registries (JAXR) v1.0.7
  • Java API for XML-based RPC (JAX-RPC) v1.1.2_01
  • SOAP with Attachments API for Java (SAAJ) v1.2.1_01
  • JavaServer Pages Standard Tag Library (JSTL) v1.1.1_01
  • Java WSDP Registry Server v1.0_08
  • Ant Build Tool 1.6.2
  • WS-I Attachments Sample Application 1.0 EA3

This should all run in Java 1.4 and later.

Thursday, November 11, 2004

Michael Smith has posted version 1.67.0 of the DocBook XSL stylesheets. These support transforms to HTML, XHTML, and XSL-FO. Besides bug fixes, major enhancements in this release include:

  • Enabled dbfo table-width on entrytbl in FO output
  • Added support for role=strong on emphasis in FO output
  • Added new FO parameter hyphenate.verbatim that can be used to turn on "intelligent" wrapping of verbatim environments.
  • Replaced all <tt></tt> output with <code></code>
  • Use strong/em instead of b/i in HTML output
  • Added Saxon8 extensions

Peter Eisentraut has released version 1.79 of the DocBook DSSSL stylesheets. According to Eisentraut, "This is a maintenance release. It fixes a number of outstanding bugs and contains updated translations." New features include:

  • The doctype declaration in the HTML output now contains a system identifier
  • CSS decoration has been added to procedure steps.
  • Uses of <VAR> in HTML output (often rendered in italic) have been changed to something more appropriate
  • Admonition titles and contents are kept together.
  • Programlistings with callouts now honor the width attribute.
  • "pc" is now allowed as abbreviation for "pica".
  • Bosnian and Bulgarian translations have been added.

Cladonia Ltd. has released the Exchanger XML Editor 3.0, a $98 payware Java-based XML Editor. Features include:

  • Schema Based Editing
  • Tag Prompting
  • Validation against DTD, XML Schema, RelaxNG
  • Tree View and Outliner for Tag Free editing
  • XPath and Regular expression searches
  • Schema Conversion
  • XSLT
  • Project Management
  • SVG Viewer and Conversion
  • Easy SOAP Invocations
  • Find in Files
  • Extension Handling
  • DTD editing
  • XML catalogs
  • RelaxNG and DTD based tag completion.
  • XSLT Debugger
  • XML Signature support
  • Better performance with large documents
  • WSDL Analyzer
  • WebDAV and FTP support
  • XInclude resolution

New features in version 3.0 include:

  • Unordered XML Differencing and Merging
  • Content Folding
  • Split Views
  • User defined Keyboard Shortcuts
  • Emacs Keyboard Shortcuts
  • Multiple Tag-Completion Schemas
  • Attribute Value Prompting
  • Navigator with XPath Filters

Wednesday, November 10, 2004

The W3C XML Core Working Group has posted the second and last call working draft of xml:id Version 1.0. This describes an idea that's been kicked around in the community for some time. The basic problem is how to link to elements by IDs when a document doesn't have a DTD or schema. The proposed solution is to predefine an xml:id attribute that would always be recognized as an ID, regardless of the presence or absence of a DTD or schema.
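As a rough illustration of what this buys you, the Python sketch below indexes elements by their xml:id attribute in a document that has no DTD or schema at all. Python's ElementTree does not itself implement xml:id; the helper function (a hypothetical name) just does by hand what an xml:id-aware processor would do, relying only on the fact that the xml prefix is always bound to the standard XML namespace.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is permanently bound to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def index_by_xml_id(root):
    """Map each xml:id value to its element; IDs must be unique."""
    index = {}
    for element in root.iter():
        value = element.get(XML_ID)
        if value is None:
            continue
        if value in index:
            raise ValueError(f"duplicate xml:id {value!r}")
        index[value] = element
    return index


# No DOCTYPE, no schema: the attribute name alone marks the IDs.
doc = ET.fromstring(
    '<book>'
    '<chapter xml:id="intro"><title>Introduction</title></chapter>'
    '<chapter xml:id="dom"><title>DOM</title></chapter>'
    '</book>'
)
ids = index_by_xml_id(doc)
```

Here `ids["dom"]` is the second chapter element, found without any ATTLIST declaration.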


The W3C XML Binary Characterization Working Group has published the third working draft of XML Binary Characterization Use Cases. Apparently the one they posted five days earlier had "some obsolete content. This new publication is meant to reflect the up-to-date state of the document, it is recommended not to read the previous version."

Tuesday, November 9, 2004

The Mozilla Project has released Firefox 1.0, the open source web browser that is rapidly gaining on Internet Explorer. Firefox supports HTML, XHTML, CSS, and XSLT. MathML and SVG aren't supported out of the box, but can be added.

Monday, November 8, 2004

In my continuing efforts to make XML dead-bang easy to manipulate with Java, I've posted beta 7 of XOM, my dual streaming/tree API for processing XML with Java. This release fixes a few bugs and approximately doubles the performance of a few common operations including getValue(), toXML(), DOM and SAX conversion, canonicalization, and XSL transformation.

This is the first release candidate. There are still a few open issues with regard to error handling in XInclude that require clarification from the XInclude working group. If they decide that how XOM currently behaves is correct, then XOM 1.0 is essentially complete. If they decide to require different behavior, a few changes may yet need to be made.

Sunday, November 7, 2004

Skipping right over candidate recommendation (I guess they think this has already been implemented), the W3C Technical Architecture Group (TAG) has posted the proposed recommendation of Architecture of the World Wide Web, First Edition. Quoting from the abstract:

The World Wide Web uses relatively simple technologies with sufficient scalability, efficiency and utility that they have resulted in a remarkable information space of interrelated resources, growing across languages, cultures, and media. In an effort to preserve these properties of the information space as the technologies evolve, this architecture document discusses the core design components of the Web. They are identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space. We relate core design components, constraints, and good practices to the principles and properties they support.

It's pretty good stuff overall. Everyone working on the Web, the Semantic Web, or with XML or URIs should read it. Even a cursory skim reveals a few surprises. For instance, apparently URIs with fragment identifiers are now considered to be full-fledged URIs, not just URI references (When did that change?) and there's no actual syntax for fragment identifiers for URIs that point to XML documents. (What happened to XPointer?) Comments are due by December 3.

Saturday, November 6, 2004

The W3C XSL Working Group has published the last call working draft of XSL Transformations (XSLT) Version 2.0. According to the draft, more significant changes since the previous XSLT 2 draft include:

  • A new attribute, use-when, allows compile-time conditional inclusion of sections of the stylesheet depending on the processing environment (for example, for schema-aware or non-schema-aware processing)

  • A switch, input-type-annotations, defines whether the stylesheet expects source data to have been validated and annotated by a schema processor.

  • A new instruction xsl:document is provided, to construct a document node.

  • Serialization attributes can now be specified (dynamically) on the xsl:result-document instruction.

  • A schema can now be included inline within the xsl:import-schema declaration.

Friday, November 5, 2004

The W3C XML Binary Characterization Working Group has published the second working draft of XML Binary Characterization Use Cases. This divides roughly 50-50 into things that should be done in plain vanilla XML (Web Services for Small Devices, Web Services as an Alternative to CORBA, Electronic Documents, FIXML) and things that should not be done in anything remotely like XML (Floating Point Arrays in the Energy Industry, PC-free Photo Printing). In brief, they're trying to turn a station wagon into a Ferrari, and instead they're going to end up with an Edsel. Despite the hype XML is not, cannot, and will not be all things to all people. At best this effort will fail. At worst, it will fail and take down XML with it.

A few of the use cases (Embedding External Data in XML Documents, PC-free Photo Album Generation) demonstrate legitimate needs to bundle binary data with XML. However, they're doing it inside out. The XML and the binary data should be combined in a non-XML envelope like XOP proposes, rather than forcing the binary's square pegs into XML's round holes.

Thursday, November 4, 2004

Wolfgang Meier of the Darmstadt University of Technology has posted the second beta of eXist 1.0, an open source native XML database that supports fulltext search. XML can be stored in either the internal, native XML database or an external relational database. The search engine supports XPath and XQuery. The server is accessible through HTTP and XML-RPC interfaces and supports the XML:DB API for Java programming.

According to Meier, "This release benefits from a lot of testing done by other projects, and fixes many instabilities and database corruptions that were still present in the previous version. In particular, the XUpdate implementation should now have reached a stable state. Concurrent XUpdates are fully supported. The XQuery implementation has matured, adding support for collations, computed constructors, and more. Module loading has been improved, allowing more complex web interfaces to be written entirely in XQuery (see new admin interface). Finally, there's a new WebDAV module, a reindex/repair option and support for running eXist as a system service." eXist is published under the LGPL.


MetaStuff Ltd. has released dom4j 1.5.1, a tree-based API for processing XML with Java. dom4j is based on interfaces rather than classes, which distinguishes it from alternatives like JDOM and XOM (Not to its credit, in my opinion. Using concrete classes instead of interfaces was one of the crucial decisions that made JDOM as simple as it is.) Version 1.5/1.5.1 seems to be mostly a collection of bug fixes and small, backwards compatible, API enhancements. It improves compliance to the DOM interfaces and adds support for StAX.

dom4j is published under a BSD license. However, it uses code from the GNU Classpath extension Project (specifically the Ælfred parser) in a manner incompatible with its license, and it really should be published under the GPL as a result. Because dom4j's own BSD license is incompatible with the GPL, any distribution of dom4j must violate either the copyright of MetaStuff or the copyright of the Free Software Foundation. You might be able to cure this for your own distribution by removing the org.dom4j.aelfred and org.dom4j.aelfred2 packages from your own code base, and linking to unmodified copies of GNU JAXP instead. However, there might be other license mines in other parts of the code base. dom4j has a long history of ignoring other projects' licenses (it started life as an illegal fork of JDOM, though that has since been cured), and it wouldn't surprise me in the least to find more misappropriated code in other packages.

Wednesday, November 3, 2004

The W3C XML Protocol Working Group has published the last call working draft of Assigning Media Types to Binary Data in XML. This spec attempts to preserve the original MIME media type of Base-64 encoded binary data stuffed in an XML element. The mechanism by which this happens is an xmlmime:contentType attribute for indicating the media type of XML element content whose type is xs:base64Binary. It also defines an expectedMediaType for use in schema annotations to indicate what the contentType attribute may say.
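A small Python sketch may make the mechanism clearer: the binary data goes into the element as Base-64 text, and a contentType attribute records what the bytes originally were. The namespace URI below is a placeholder rather than the one the draft assigns, and the element name and helper functions are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

# Placeholder namespace URI; substitute the one the spec actually defines.
XMLMIME_NS = "http://example.org/xmlmime"
CONTENT_TYPE = f"{{{XMLMIME_NS}}}contentType"
ET.register_namespace("xmlmime", XMLMIME_NS)


def wrap_binary(tag: str, data: bytes, media_type: str) -> ET.Element:
    """Store bytes as Base-64 text and label them with their media type."""
    element = ET.Element(tag)
    element.set(CONTENT_TYPE, media_type)
    element.text = base64.b64encode(data).decode("ascii")
    return element


def unwrap_binary(element: ET.Element):
    """Recover the media type and the original bytes."""
    return element.get(CONTENT_TYPE), base64.b64decode(element.text)


original = b"\x89PNG\r\n\x1a\n"  # just the PNG signature, as sample bytes
serialized = ET.tostring(wrap_binary("picture", original, "image/png"))
media_type, data = unwrap_binary(ET.fromstring(serialized))
```

Without the attribute, a receiver that decodes the Base-64 has only a bag of bytes and no reliable way to know it was once a PNG.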

Tuesday, November 2, 2004

The W3C Timed Text (TT) Working Group has posted the first public working draft of Timed Text (TT) Authoring Format 1.0 – Distribution Format Exchange Profile (DFXP). According to the abstract,

This document specifies the distribution format exchange profile (DFXP) of the timed text authoring format (TT AF) in terms of a vocabulary and semantics thereof.

The timed text authoring format is a content type that represents timed text media for the purpose of interchange among authoring systems. Timed text is textual information that is intrinsically or extrinsically associated with timing information.

The distribution format exchange profile is intended to be used for the purpose of transcoding or exchanging timed text information among legacy distribution content formats presently in use for subtitling and captioning functions.


The W3C Voice Browser working group has posted Pronunciation Lexicon Specification (PLS) Version 1.0 Requirements. According to the abstract, "This document is part of a set of requirements studies for voice browsers, and provides details of the requirements for markup used for specifying application specific pronunciation lexicons. Application specific pronunciation lexicons are required in many situations where the default lexicon supplied with a speech recognition or speech synthesis processor does not cover the vocabulary of the application. A pronunciation lexicon is a collection of words or phrases together with their pronunciations specified using an appropriate pronunciation alphabet."

Monday, November 1, 2004

The W3C XQuery and XSLT Working Groups have updated five working drafts:

The XSLT 2 working draft hasn't been updated for the second time in a row now. I'm not sure what's holding it up.

Most of the changes in XPath 2.0 in this draft seem to be editorial, aimed more at tightening up the spec than at changing the language itself. Some of the more substantive changes include:

  • SequenceType syntax has been simplified. SchemaContextPath is no longer part of the SequenceType syntax.

  • xdt:untypedAny has changed to xdt:untyped.

  • xs:anyType is no longer abstract, and is used to denote the type of a partially validated element node.

  • Value comparisons return () if either operand is ().

  • The precedence of the cast, treat, and unary arithmetic operators has been increased.

  • A new component has been added to the static context: context item static type.

Most of the changes in XQuery are a little more substantive and include:

  • The last step in a path expression can return a sequence of atomic values or a sequence of nodes (mixed nodes and atomic values are not allowed.)

  • A value of type xs:QName is now defined to consist of a "triple": a namespace prefix, a namespace URI, and a local name. Including the prefix as part of the QName value makes it possible to cast any QName into a string when needed.

  • Local namespace declarations have been deleted from computed element constructors. No namespace bindings may be declared by a computed element constructor.

  • The Prolog has been reorganized into three parts which must appear in this order: (a) Setters; (b) Namespace declarations and module and schema imports; (c) function and variable declarations.

  • A new "inherit-namespaces" declaration has been added to the Prolog, and "namespace inheritance mode" has been added to the static context.

  • An "encoding" subclause has been added to the Version Declaration in the Prolog.

  • A new declaration has been added to the Prolog to control the query-wide default handling of empty sequences in ordering keys ("empty greatest" or "empty least".)

  • In the static context, "current date" and "current time" have been replaced by "current dateTime", which is defined to include a timezone.

  • Computed comment constructors now raise an error rather than trying to "fix up" a malformed comment by inserting blanks.

  • The div operator can now divide two yearMonthDurations or two dayTimeDurations. In either case, the result is of type xs:decimal.

  • Support for XML 1.1 and Namespaces 1.1 have been bundled together and defined as an optional feature. Various aspects of query processing and serialization that depend on this optional feature have been identified.

  • Cyclic module imports are no longer permitted. A module M may not import another module that directly or indirectly imports module M.

  • The application/xquery MIME media type has been defined.

Sunday, October 31, 2004

The Gnome Project has released version 2.6.15 of libxml2, the open source XML C library for Gnome. This release fixes various bugs including a couple of security issues. It also improves the XInclude error reports, adds some convenience functions to the Reader API, and supports processing instructions in HTML. They've also released version 1.1.12 of libxslt, the GNOME XSLT library for C and C++. This is a bug fix release.

Saturday, October 30, 2004

The W3C XML Schema working group has released the second edition of the W3C XML Schema specifications. This is not a new language. The new specs just incorporate various errata found in the original specs since their publication a few years ago. The spec is still divided into three parts: Part 0: Primer, Part 1: Structures, and Part 2: Datatypes.

Friday, October 29, 2004

Opera Software has posted the second beta of version 7.6.0 of their namesake web browser for Windows. Opera supports HTML, XML, XHTML, RSS, WML 2.0, and CSS. XSLT is not supported. Other features include IRC, mail, and news clients and pop-up blocking. There are lots of little changes, bug fixes, and usability enhancements in 7.60. However major new features include speech-enabled browsing (including support for XHTML+Voice), medium-screen rendering, and inline error pages. Opera is $39 payware.


The W3C Scalable Vector Graphics Working Group has posted the last call working draft of Scalable Vector Graphics (SVG) 1.2. Non-editorial changes in this draft include:

  • requiredFormats test attribute
  • requiredFonts test attribute
  • "Various updates to SVGGlobal (formerly SVGWindow), including removal of documentStyleSheet and evt attributes, addition of screen and location attributes, addition of navigation method, addition of mouse capture and merging of existing new methods into the interface."
  • Return type of SVGImage::getPixel is now SVGColor.
  • Synchronization attributes imported from SMIL 2
  • Added background-fill-opacity property.
  • Changed compositing attribute to clipout
  • New Selection interfaces
  • "Auto" textLength
  • animation element for displaying animated vector content.
  • Added "overflow" and "underflow" events for flow regions.
  • Ogg Vorbis is required. No video formats are required.
  • Event notification for shape changes, and event notification for rendering bounding box modifications.
  • Mouse wheel events
  • Two new methods on SVGLocatable for obtaining rendered bounds.

Kiyut has released Sketsa 2.2, a $29 payware SVG editor written in Java. Java 1.4.1 or later is required.

Thursday, October 28, 2004

Martin Duerst has submitted a draft of The Archived-At Message Header Field. Briefly this proposes adding a semi-permanent URL for each e-mail message posted to a mailing list to the e-mail header. This is an insanely good idea, that I wish everyone would start using immediately. The W3C mailing lists already use X-Archived-At for this purpose. I just wish more lists would follow suit.


The Apache Software Foundation has published The Common Gateway Interface (CGI) Version 1.1, an informational RFC that describes "'current practice' parameters of the 'CGI/1.1' interface developed and documented at the U.S. National Centre for Supercomputing Applications. This document also defines the use of the CGI/1.1 interface on UNIX(R) and other, similar systems."

Wednesday, October 27, 2004

Jiri Pachman has written fo2wordml, a stylesheet to convert XSL-FO to Microsoft's WordprocessingML format. (This is the second such tool that I've seen in the last couple of weeks. I'm amazed this is even possible.)


Sébastien Cramatte has posted xslt2Xforms 0.7, an XSLT stylesheet that adds W3C XForms support to a web browser using XHTML, Javascript and CSS. This release only works in Mozilla.

Tuesday, October 26, 2004

Norm Walsh has written a draft of XML Chunk Equality, an attempt to decide when two XML infosets are and are not equal.


William F. Hammond has posted gellmu 0.8.0.5, "a LaTeX-like way to produce article-level for online display in the modern, fully accessible, form of HTML extended by the World Wide Web Consortium's Mathematical Markup Language (MathML)."

Monday, October 25, 2004

Dave Beckett has released the Raptor RDF Parser Toolkit 1.4.0, an open source C library for parsing the RDF/XML, N-Triples, Turtle, and Atom Resource Description Framework formats. It uses expat or libxml2 as the underlying XML parser. Version 1.4.0 can serialize RDF triples into RDF/XML and N-Triples and adds RSS enclosure support to the RSS tag soup parser. Raptor is dual licensed under the LGPL and Apache 2.0 licenses.

Sunday, October 24, 2004

XMLmind has released version 2.8 of their XML Editor. This $220 payware product features word processor and spreadsheet like views of XML documents. A free-beer hobbled version is also available.


The xframe project has posted beta 5 of xsddoc, an open source documentation generator for W3C XML Schemas based on XSLT. xsddoc generates JavaDoc-like documentation of schemas. Java 1.3 or later is required.


Recordare has released Dolet 2.0, an $89.95 payware Mac OS X Finale plug-in for reading and writing MusicXML files. Java 1.4, Finale 2004 or 2005, and Mac OS X are required.

Saturday, October 23, 2004

Adam Souzis has released Rx4RDF 0.4.1, a set of technologies designed to make the Resource Description Framework (RDF) easier to use. It includes:

  • RxPath for querying, transforming and updating RDF by specifying a deterministic mapping of the RDF model to the XPath data model
  • ZML, a Wiki-like text formatting language that lets you write arbitrary XML or HTML
  • RxML, yet another alternative XML serialization for RDF, this one designed for easy authoring in ZML
  • Raccoon, a simple application server that uses an RDF model for its data store
  • Rhizome, a content management and delivery system that runs on Raccoon
  • RDFScribbler, a web application that can display and edit any arbitrary RDF model using RxSLT and RxUpdate.

XMLmind has released the XMLmind FO Converter 2.0, an XSL-Formatting Objects to RTF converter written in Java. Version 2.0 adds the ability to convert XSL-FO documents to Microsoft's WordprocessingML format. The personal edition is free-beer. The professional edition adds an API for interacting with the product and costs $550.


Norm Walsh has posted the second candidate release of DocBook 4.4, an XML application designed for technical documentation and books such as Processing XML with Java. New elements in DocBook 4.4 include package and biblioref. This CR release fixes bugs and adds "wordsize as a global effectivity attribute" (whatever that means). He's also posted the second candidate release of simplified DocBook 1.1.

Friday, October 22, 2004

x-port.net has released formsPlayer 1.1, a free-beer (e-mail address required) XForms processor that "only works in Microsoft's Internet Explorer version 6 SP 1." Version 1.1 can dynamically set any aspect of submission "from instance data, including URLs and HTTP headers, making it possible to implement clients that use protocols like SOAP, Atom over SOAP or REST, WebDAV, and so on." It also fixes assorted bugs.

Thursday, October 21, 2004

Karl Waclawek has released SAX for .NET 1.0, an open source port of the Java SAX API to C# and .NET 1.1. An implementation of this API based on expat is available, but is not compatible with Mono 1.0.2.


YesLogic has released Prince 4.0, a $295 payware batch formatter for Linux and Windows that produces PDF and PostScript from XML documents with CSS stylesheets. New features in 4.0 include XHTML style and link elements, PDF compression and Font Embedding, Automatic table layout, Shrink-to-fit floats, block alignment and negative margins, and word-breaking at soft hyphens.

Wednesday, October 20, 2004

I've posted the notes from last night's Effective XML presentation to the XML Developers Network of the Capital District, where a good time was had by all. Compared to my usual notes, these are a little sparse. Most of the material is covered in much greater depth in the book. I'm next scheduled to give this talk at Software Development 2005 West in March, but if there are any user groups or conferences that would like to hear it before then, send me e-mail, and we'll see what we can do.

Tuesday, October 19, 2004

Yesterday evening I was doing some programming with DOM, when I was reminded of the importance of failing fast (as well as just how much I hate DOM). I was running the XInclude Test Suite across my DOMXIncluder and logging the results to a simple, record-like XML document. The format for the output was suggested by the XInclude working group, but it's quite simple: no namespaces; nothing fancy. It looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<testresults processor="com.elharo.xml.xinclude.DOMXIncluder">
    <testresult id="imaq-include-xml-01" result="pass"/>
    <testresult id="imaq-include-xml-02" result="pass"/>
    <testresult id="imaq-include-xml-03" result="skipped">
        <note>DOMXIncluder does not support the xpointer scheme</note>
    </testresult>
    <testresult id="imaq-include-xml-04" result="pass"/>
    <testresult id="imaq-include-xml-05" result="pass"/>
    <testresult id="imaq-include-xml-06" result="fail"/>
...

One of the things I logged into the document was exception messages encountered when running any one of the 150 or so tests. Somewhere along the line one or more of the exception messages I logged was null or contained an XML illegal character such as a form feed. However, I'm still not sure which ones because DOM doesn't actually complain if you create a text node with malformed data that cannot possibly be serialized. Quite a while later, when I was serializing the document, the serializer complained and died with an unhelpful error message that didn't actually tell me where to find the problem. (I tried two serializers. Apache's XMLSerializer complained about a bad character, but didn't tell me what the character was or where it appeared. JAXP's ID transform simply generated a blank document without any error message.) As a kludgy fix, for the time being I've stopped logging the exception messages into DOM.

The problem is not that the exception messages contained illegal characters. If I had been informed of this, it would have been trivial to work around it. The problem was that DOM didn't bother checking for this, and blindly created a malformed document. XOM would have caught the error immediately when it happened, rather than waiting for the entire document to be serialized. It would have pinpointed exactly where the problem was so I could fix it. Draconian error handling is a feature, not a bug. It is the API's responsibility to detect bad input. It must not rely on client programmers to provide correct data. Even when the programmers are experts who really do know all the ins and outs of which input is legal, they may not be creating the input by hand. They are often passing in data from another source that has no idea it is talking to an API with particular preconditions. Precondition verification is a sine qua non for robust, published APIs; and it is a sine qua non that DOM fails to implement.
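The kind of precondition check argued for here is small. What follows is a minimal sketch in Python rather than Java, and the function names are hypothetical; this is not XOM's actual code. It assumes only the Char production from the XML 1.0 specification, and it rejects null or illegal text at the moment the text is supplied, reporting the offending character and its position.

```python
def first_illegal_xml_char(text):
    """Return the index of the first character outside XML 1.0's Char production, or None."""
    for index, ch in enumerate(text):
        cp = ord(ch)
        legal = (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
        if not legal:
            return index
    return None


def checked_text(text):
    """Fail fast: refuse null or unserializable text before it enters the tree."""
    if text is None:
        raise ValueError("text content must not be null")
    index = first_illegal_xml_char(text)
    if index is not None:
        raise ValueError(
            "illegal XML character U+%04X at index %d" % (ord(text[index]), index)
        )
    return text
```

A logger that passes every exception message through a check like checked_text before creating the text node would stop on the first bad message, while the test that produced it is still known.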

Monday, October 18, 2004

Tomorrow evening (Tuesday) I'll be talking about Effective XML at the XML Developers Network of the Capital District in Albany New York. The meeting runs from 6:00 to 8:00 P.M. Everyone's invited.


Planamesa Software has posted the second alpha of NeoOffice/J 1.1, a Mac OS X variant of OpenOffice that replaces X-Windows with Java Swing. This release "supports the features in OpenOffice.org 1.1.2 including faster startup, right-to-left and vertical text editing, and the ability to save documents directly to PDF." I wrote the original version of the Effective XML presentation in OpenOffice on Linux, which proved not up to the task so I eventually moved it to PowerPoint. I'll check this out and see if maybe I can use NeoOffice/J for tomorrow night's presentation, but no promises.

OK: verdict's in. That was quick. This product is definitely not ready for prime time, at least in the presentation component. As soon as I opened my PowerPoint slides, NeoOffice/J seemed to get stuck in an infinite flashing loop of draw and redraw. At least it let me quit, but I couldn't do anything else. I will be using PowerPoint tomorrow night.


The GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG) has published the first public working draft of Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content 1.0. The table of contents provides a very nice summary of the rules:

  • Always declare the default text processing language of the page, using the html tag, unless there are more than one primary languages.
  • Consider using a Content-Language declaration in the HTTP header or a meta tag to declare metadata about the primary language of a document.
  • Do not use Content-Language to declare the default text processing language, and do not use language attributes to declare the primary language metadata.
  • Do not declare the language of a document in the body tag.
  • If you are using Content-Language to indicate the primary language metadata when there are multiple primary languages, provide a comma-separated list of all primary language tags.
  • For documents with multiple primary languages, decide whether you want to declare a single text processing language in the html tag, or leave it undefined.
  • For documents with multiple primary languages, try to divide the document at the highest possible level, and declare the appropriate text processing language in those blocks.
  • Use the lang and/or xml:lang attributes around text to indicate any changes in language.
  • For HTML use the lang attribute only, for XHTML 1.0 served as text/html use the lang and xml:lang attributes, and for XHTML served as XML use the xml:lang attribute only.
  • Follow the guidelines in RFC3066 for language attribute values.
  • Use the two-letter ISO 639 codes for the language code where there are both 2- and 3-letter codes.
  • Consider using the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively.
  • When pointing to a resource in another language, consider the pros and cons of using CSS to indicate the language, based on the value of the hreflang attribute of the a element.
  • If using CSS to generate a language marker from the hreflang attribute, do not use flag icons to indicate languages.
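The rule about which attribute to use in which serving mode is easy to get wrong by hand, so it may help to see it as a lookup. This is only a sketch of that one rule as summarized above; the function name and the three mode strings are made up for illustration and do not come from the draft.

```python
def language_attributes(serving, lang):
    """Pick the language attributes to emit for one element.

    serving is a hypothetical label for how the page is delivered:
    "html", "xhtml-as-html" (XHTML 1.0 served as text/html),
    or "xhtml-as-xml".
    """
    if serving == "html":
        return {"lang": lang}
    if serving == "xhtml-as-html":
        return {"lang": lang, "xml:lang": lang}
    if serving == "xhtml-as-xml":
        return {"xml:lang": lang}
    raise ValueError("unknown serving mode: %r" % serving)
```

For example, language_attributes("xhtml-as-html", "zh-Hans") yields both lang and xml:lang set to zh-Hans.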

Antenna House, Inc has released XSL Formatter 3.2 for Linux and Windows. This tool converts XSL-FO files to PDF. Newly supported XSL-FO properties in 3.2 include alignment-adjust, alignment-baseline, dominant-baseline, glyph-orientation-horizontal, and glyph-orientation-vertical. New features in 3.2 include MathML support, WordML transformation, XSL Template designer integration, and end user defined and private use characters. The lite version costs $300 and up on Windows and $900 and up on Linux/Unix, but is limited to 300 pages per document. Prices for the uncrippled version start around $1250 on Windows and $3000 on Linux/Unix.


David Holroyd has updated his CSS2 DocBook stylesheet to version 0.3. This stylesheet enables CSS Level 2 savvy web browsers such as Mozilla and Opera to display DocBook XML documents. The results aren't as pretty as what the XSLT stylesheets can produce, but they're serviceable. This release makes a number of small improvements including support for ulink, productname, and important.

Saturday, October 16, 2004

SyncroSoft has released version 5.0 of the <Oxygen/> XML editor. Oxygen supports XML, XSL, DTDs, and the W3C XML Schema Language. New features in version 5.0 include an XSLT 2.0 Editor and Debugger, XPath 2.0 evaluator, XQuery Editor, WSDL Editor, SOAP Analyzer, and SVG Viewer. It costs $128 with support. Upgrades from previous versions are $76.

Friday, October 15, 2004

The W3C RDF Data Access Working Group has published the first public working draft of SPARQL Query Language for RDF. According to the introduction,

An RDF graph is a set of triples, each consisting of a subject, an object, and a property relationship between them as defined in RDF Concepts and Abstract syntax. These triples can come from a variety of sources. For instance, they may come directly from an RDF document. They may be inferred from other RDF triples. They may be the RDF expression of data stored in other formats, such as XML or relational databases.

SPARQL is a query language for accessing such RDF graphs. It provides facilities to:

  • extract information in the form of URIs, bNodes, plain and typed literals.
  • extract RDF subgraphs.
  • construct new RDF graphs based on information in the queried graphs.

Here's a simple example SPARQL query adapted from the draft:

PREFIX  dc: <http://purl.org/dc/elements/1.1/>
PREFIX  : <http://example.org/book/>
SELECT  ?var
WHERE   ( :book1  dc:title  ?var )

The ? indicates a variable name. This query stores the title of a book in a variable named var. There are boolean and numeric operators as well.
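To make the matching concrete, here is a toy sketch of what evaluating a single triple pattern amounts to. It is plain Python over a list of tuples, not a SPARQL engine, and the graph contents in the usage note (including the title string) are hypothetical. A term that begins with ? binds to whatever appears in that position; any other term must match exactly.

```python
DC = "http://purl.org/dc/elements/1.1/"
BOOK = "http://example.org/book/"


def match(pattern, triples):
    """Return one dict of variable bindings per triple that fits the pattern."""
    results = []
    for triple in triples:
        bindings = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable seen twice in one pattern must bind to the same value.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            results.append(bindings)
    return results
```

Given a graph containing (BOOK + "book1", DC + "title", "An Example Title"), the pattern (BOOK + "book1", DC + "title", "?var") produces a single binding of ?var to that title, which is what the SELECT clause would report.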

Thursday, October 14, 2004

Happy Tenth Birthday Netscape!


The RDF Data Access Working Group has published the third public working draft of RDF Data Access Use Cases and Requirements. According to the introduction,

The W3C's Semantic Web Activity is based on RDF's flexibility as a means of representing data. While there are several standards covering RDF itself, there has not yet been any work done to create standards for querying or accessing RDF data. There is no formal, publicly standardized language for querying RDF information. Likewise, there is no formal, publicly standardized data access protocol for interacting with remote or local RDF storage servers.

Despite the lack of standards, developers in commercial and in open source projects have created many query languages for RDF data. But these languages lack both a common syntax and a common semantics. In fact, the extant query languages cover a significant semantic range: from declarative, SQL-like languages, to path languages, to rule or production-like systems. The existing languages also exhibit a range of extensibility features and built-in capabilities, including inferencing and distributed query.

Further, there may be as many different methods of accessing remote RDF storage servers as there are distinct RDF storage server projects. Even where the basic access protocol is standardized in some sense—HTTP, SOAP, or XML-RPC—there is little common ground upon which to develop generic client support to access a wide variety of such servers.

The following use cases characterize some of the most important and most common motivations behind the development of existing RDF query languages and access protocols. The use cases, in turn, inform decisions about requirements, that is, the critical features that a standard RDF query language and data access protocol require, as well as design objectives that aren't on the critical path.


RenderX has released version 4.0 of XEP, its payware XSL Formatting Objects to PDF and PostScript converter. XEP also supports part of Scalable Vector Graphics (SVG) 1.1. It's not immediately clear what, if anything, is new in 4.0. The basic client is $299.95. The developer edition with an API is $999.95. The server version is $3999.95. Updates from 3.0 range from free to full-price depending on when you bought it.

Wednesday, October 13, 2004

The W3C Web Services Choreography Working Group has posted the second public working draft of Web Services Choreography Description Language Version 1.0. According to the abstract,

The Web Services Choreography Description Language (WS-CDL) is an XML-based language that describes peer-to-peer collaborations of parties by defining, from a global viewpoint, their common and complementary observable behavior; where ordered message exchanges result in accomplishing a common business goal.

The Web Services specifications offer a communication bridge between the heterogeneous computational environments used to develop and host applications. The future of E-Business applications requires the ability to perform long-lived, peer-to-peer collaborations between the participating services, within or across the trusted domains of an organization.

The Web Services Choreography specification is targeted for composing interoperable, peer-to-peer collaborations between any type of party regardless of the supporting platform or programming model used by the implementation of the hosting environment.

Tuesday, October 12, 2004

Tatu Saloranta has posted WoodStox 1.0, a free-as-in-speech (LGPL) non-validating XML processor written in Java that implements the StAX API. "StAX specifies interface for standard J2ME 'pull-parsers' (as opposed to "push parser" like SAX API ones); at high-level StAX specifies 2 types (iterator and event based) readers and writers that used to access and output XML documents." WoodStox supports XML 1.0 and 1.1.

Monday, October 11, 2004

The W3C XML Binary Characterization Working Group has posted the first public working draft of XML Binary Characterization Properties. This describes the goals/hopes/dreams the group has for a binary format to replace XML. These include:

  • Accelerated Sequential Access
  • Byte Preserving
  • Compact
  • Data Model Versatility
  • Efficient Update
  • Embedding of arbitrary files
  • Encryptable
  • Extensible at the format level
  • Fragmentable
  • Hinting (I have no idea what this is, and neither does the draft)
  • Human Readable/Editable/Deducible
  • Integratable into the Web
  • Integratable into XML Family
  • No Arbitrary Limits
  • Fast
  • Random Access
  • Robust
  • Round Trippable
  • Schema Instance Change Resilience
  • Self Contained
  • Signable
  • Specialized codecs
  • Streamable
  • Support for Error Correction
  • Transcodable to XML
  • Transport Independence
  • Support for Open Content Models
  • Verifiable Integrity
  • Version Identification
  • Draconian error handling
  • Forward Compatible
  • Free
  • Small Footprint
  • Ubiquitous Implementation
  • Net decrease in entropy

OK. I snuck that last point in myself. It seems only slightly less likely than satisfying all the rest of these goals in a single format. I am glad the working group is so ambitious. Hopefully they'll either fail, and let the rest of us go back to doing real work with plain vanilla XML, or they'll succeed and produce something quite useful. However, I do hope they won't accept half measures. Some of these goals have not been important to vendors in this space before, human readability perhaps foremost among them. The group does use a very unorthodox definition of human readable though. Human deducible is more accurate; i.e. the format needs to be able to be reverse engineered without access to the specification or documentation. The requirement to be self contained rules out a lot of schema based compression systems.

There's one important goal that's notable by its absence. There's no requirement here that the format be language neutral. Actually that's two requirements: one that it not prefer one programming language over another and one that it not prefer one human language over another. A lot of the proposals I've seen have been designed so that they ran very fast in one particular environment but slowed down noticeably in environments with different byte orders, primitive data type widths, and other memory layout characteristics. Binary formats by their nature tend to be very tied to one particular architecture to the detriment of others. Java byte code, for instance, happens to look a lot like what a Sparc engineer would expect to see, and that meant Java was less than optimal on X86 systems even though it was nominally platform independent. XSLT 1.0 is very hard to implement outside of Java (and these days, even inside Java) because it normatively references the Java 1.1 specification. XSLT 1.1 died due to infighting between the Python and Java communities.
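The byte order problem is easy to see with a single integer. The following small sketch uses Python's standard struct module; the value 258 is arbitrary. The same four bytes mean different numbers depending on which end a reader starts from, so a format that simply dumps one machine's memory layout makes every other machine pay for a conversion.

```python
import struct

value = 258  # 0x00000102

big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # least significant byte first

# Reading big-endian bytes as if they were little-endian gives a different number.
misread = struct.unpack("<I", big)[0]
```

Here big is the bytes 00 00 01 02, little is 02 01 00 00, and misread comes out as 33619968 rather than 258.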

Even more seriously the specification should not favor some natural languages over others. For instance, it would be unacceptable to design a format where English ASCII data was highly compressed but Chinese data wasn't. UTF-8 has this issue, but Unicode and XML don't because they allow individual documents to choose their own encodings. Each document can be optimized for its own needs. Typical data-neutral compression schemes like basic Huffman coding have this property naturally. However, some of the schemes for XML compression I've seen make a lot of assumptions about what the data looks like, and optimize for particular scenarios at the expense of others. At least when it comes to text, we need to make sure that English and Chinese are both supported well (as should be the 6,000 or so other languages on the planet too, of course).
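The UTF-8 bias is simple to measure. This is a quick sketch using only Python's built-in codecs; the helper name and the sample strings are arbitrary. It compares the encoded size of the same text in UTF-8 and in UTF-16 without a byte order mark.

```python
def encoded_sizes(text):
    """Byte counts for the same text in two encodings."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-be")}
```

For the three ASCII letters "XML" this gives 3 bytes in UTF-8 and 6 in UTF-16. For the two Chinese characters "中文" it gives 6 bytes in UTF-8 and 4 in UTF-16. Neither encoding is best for every document, which is why letting each document choose matters.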

Sunday, October 10, 2004

I was just getting ready to upgrade to Apache httpd 2.0.52 when I noticed an apparent security issue on their web server. This won't allow anybody to crack into www.apache.org, so let me just describe it here, and see what people think. I wanted to check the PGP signature of the file I downloaded from a mirror, so I grabbed the signature from Apache's web site at http://www.apache.org/dist/httpd/httpd-2.0.52.tar.gz.asc. Notice anything funny about that URL? The scheme is http, not https. That means the connection is unauthenticated, which means it's vulnerable to a man in the middle attack. Shouldn't these signatures only be served over an authenticated connection? Am I out to sea, or is this a real problem? Let me know what you think. Before commenting, please take note of two things:

  1. The encryption or lack thereof doesn't matter here. It's authentication I care about.
  2. I'm not really interested in hearing how unlikely man-in-the-middle attacks are. Many, many people and organizations depend on Apache HTTPD; and some of them do need to defend against attacks from governments that can subvert the ISPs.

Possibly the verification of the KEYS file from the root certificates might cover this. But if that's the case, then why are we warned to "Make sure you get these files from the main distribution directory, rather than from a mirror"? If we really do need to get them from the main site, and not some other site, then we really do need to prevent man-in-the-middle attacks.

Indeed, when I tried to verify the file, a problem showed up:

gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
gpg: Fingerprint: 33 16 9B 46 FC 12 D4 01  CA 6D DB D7 DE EA 4F D7

This looks like a classic case of good algorithms compromised by bad protocol implementation. This is exactly how codes are broken and security subverted in the real world.


The IETF has posted another last call working draft of Internationalized Resource Identifiers (IRIs). "An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources." In other words this lets you write URLs that use non-ASCII characters such as http://www.libération.fr/. The non-ASCII characters would be converted to a genuine URI using hexadecimally escaped UTF-8. For instance, http://www.libération.fr/ becomes http://www.lib%C3%A9ration.fr/. There's also an alternative, more complicated syntax to be used when the DNS doesn't allow percent escaped domain names. However, the other parts of the IRI (fragment ID, path, scheme, etc.) always use percent escaping. The changes in this draft appear to be editorial in nature.
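The basic percent-escaping step can be tried with Python's standard library. This is a rough sketch, not a complete implementation of the draft: the function name and the list of characters left alone are assumptions, and it ignores the alternative syntax for domain names mentioned above. It encodes each non-ASCII character as UTF-8 and writes every resulting byte as a percent escape.

```python
from urllib.parse import quote

# Characters that already have a role in URI syntax and are left untouched.
# This particular set is an assumption made for the sketch.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Percent-escape the UTF-8 bytes of anything that is not plain URI syntax."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")
```

Calling iri_to_uri("http://www.libération.fr/") returns http://www.lib%C3%A9ration.fr/, matching the example above. A URI that is already pure ASCII passes through unchanged.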


Aaron Swartz has registered application/rdf+xml as the MIME media type for the Extensible Markup Language (XML) serialization of the Resource Description Framework (RDF). Mark Baker and Mark Nottingham have registered application/soap+xml as the MIME media type for SOAP 1.2 messages serialized as XML 1.0. Time to update your mime.types files.


James Kass has updated Code2001, a freeware TrueType font covering some of the scripts in the new Plane 1, including Deseret, Old Italic, Gothic, Aegean Numbers, Cypriot Syllabary, Pollard Script, and Ugaritic. It also provides experimental support for Old Persian Cuneiform, Tengwar, and Cirth, though the code points here will change in the future. According to Kass, "Code2001 works on Windows 2000, but may not work on other operating systems. Mac OS X supports Unicode's higher planes and Code2001 will work with certain applications." This release adds "rough glyphs for many of the scripts which were added in Unicode 4.0 are now found in the font. These rough, filler glyphs were added quickly in order to enable testing of these additions. These rough glyphs will be improved for the next release, meanwhile they should be better than those little squares."


Sonic Software has released Stylus Studio 6.0, a $495 payware XML editor for Windows. Features include:

  • XML differencing
  • XSLT debugging
  • XSLT mapping
  • XSLT profiling
  • XSL:FO
  • XQuery editing, mapping, and debugging.
  • XML Schema Editor
  • Document Type Definition (DTD) Editor
  • XPath Evaluator
  • XPath Expression Generator
  • Web Service Call Composer
  • UDDI Registry Browser
  • Tools for mapping to and from XML documents, Web service data, relational data, and flat files
  • Import/export utilities for RDBMS, XML, CSV, ADO, and flat files
  • JSP Editor

New features in 6.0 include:

  • XSLT 2.0 Editor and Debugger
  • Supports the July 2004 XQuery 1.0 working drafts
  • Convert flat files, binary data, EDI, and other formats to XML
  • XML Schema Editor
  • XML grid view for editing tabular XML data

Steve Ball has released version 1.2.1 of the XSLT Standard Library. This is a collection of commonly-used templates written purely in XSLT. Besides bug fixes, version 1.2.1 adds new SVG and comparison modules and new templates in the string, date-time and math modules. I used the date templates from this library in the stylesheets for Processing XML with Java. xsltsl is open source published under the LGPL.


Engage Interactive has released DOMIT! 0.9.9, a free-as-in-speech (LGPL) DOM implementation for PHP. Version 0.9.9 is a bug fix release.

Saturday, October 9, 2004

John Cowan has posted the first release candidate of TagSoup, an open source, Java-language, SAX parser for nasty, ugly HTML. I use TagSoup to convert JavaDoc to well-formed XHTML. "Improvements for this release include better JavaDoc and an extension to TSSL, support for proper attribute normalization, namespace prefixes on elements, and an expanded public API for schema components." TagSoup is dual licensed under the Academic Free License and the GPL.

Friday, October 8, 2004

Henry S. Thompson and Richard Tobin have released XSV 2.8-1, a partial W3C XML Schema Validator for Linux and Windows. There's also a web form based interface. This release fixes bugs and can now validate XML 1.1 documents and read XML 1.1 schemas (though the W3C XML Schema language still doesn't support XML 1.1 names, so documents that actually use XML 1.1 features are likely to be invalid).


Michael Kay has released Saxon 8.1.1, an implementation of XSLT 2.0, XPath 2.0, and XQuery in Java. Saxon 8.1 is published in two versions for both of which Java 1.4 is required. Saxon 8.1B is an open source product published under the Mozilla Public License 1.0 that "implements the 'basic' conformance level for XSLT 2.0 and XQuery." Saxon 8.1SA is a £250.00 payware version that "allows stylesheets and queries to import an XML Schema, to validate input and output trees against a schema, and to select elements and attributes based on their schema-defined type. Saxon-SA also incorporates a free-standard XML Schema validator. In addition Saxon-SA incorporates some advanced extensions not available in the Saxon-B product. These include a try/catch capability for catching dynamic errors, improved error diagnostics, support for higher-order functions, and additional facilities in XQuery including support for grouping, advanced regular expression analysis, and formatting of dates and numbers." Version 8.1.1 is a bug fix release. Upgrades from 8.x are free.

Thursday, October 7, 2004

The OpenOffice Project has released OpenOffice 1.1.3, an open source office suite for Linux and Windows that saves all its files as zipped XML. I used the previous 1.0 version to write Effective XML. 1.1.3 is exclusively a bug fix release. OpenOffice is dual licensed under the LGPL and Sun Industry Standards Source License.

Community Manager Louis Suarez-Potts also writes, "OpenOffice.org 2.0 will be ready for general use in March 2005. Early versions--pre-Alpha versions--are ready for download now. They are not meant for daily use but are meant to give a taste of things to come. To download the early developer version of OpenOffice.org 2.0, visit our 680 page."

Wednesday, October 6, 2004

I'm playing with the CSS again to try to allow the content to be the very first part of the document encountered by screen readers and older, non-CSS compliant web browsers. Please holler if anything looks too weird, and let me know what browser, version, and platform you're using.

Tuesday, October 5, 2004

The third edition of XML in a Nutshell is back in stock at Amazon. Be the first on your block to get one! The current sales rank is just barely in the top 2,500. I'd love to see it break into the top 1,000, though that's harder to do than it used to be now that so many non-techies shop at Amazon.

Monday, October 4, 2004

Amazon has done it again. Thanks to all the pre-orders from Cafe con Leche readers, the third edition of XML in a Nutshell has gone straight from "Not Yet Released" status to sold out and "Usually ships within 3 to 5 weeks" without stopping at "Ships within 24 hours" first. Thanks for all the orders! Based on past experience, Amazon should have more copies in stock a lot sooner than 3 to 5 weeks. If you order it today, you'll probably get it some time next week.


In my continuing efforts to make XML a lot less irritating for in-memory manipulation, I've posted beta 6 of XOM, my dual streaming/tree API for processing XML with Java. This beta is primarily a bug fix release. It also polishes off some rough edges in various corners of the API. Changes in this release include:

  • The deprecated setNodeFactory() method in XSLTransform has been removed. This is the only API-level change in this release.

  • The strings returned by toString in Comment, ProcessingInstruction, Attribute, and Text are all now truncated if they get too long. Furthermore any embedded line breaks and tabs are escaped as \n, \r, and \t. This makes the objects easier to inspect in various debuggers and loggers.

  • SAXConverter no longer converts XOM xml:base attributes into SAX attributes. Instead the xml:base attributes are used to determine the URI information the Locator reports. Providing xml:base attributes as well would risk double counting some relative URLs.
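The toString change described in the list above is the sort of thing that is easy to picture in code. This is an illustration only, written in Python rather than Java; the function name, the bracketed output shape, and the 40 character limit are all assumptions, not XOM's actual behavior.

```python
def debug_string(kind, value, limit=40):
    """Build a one-line, bounded description of a node for debuggers and logs."""
    # Escape the characters that would otherwise break a log line.
    escaped = (
        value.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
    )
    # Truncate long values, leaving room for the ellipsis.
    if len(escaped) > limit:
        escaped = escaped[: limit - 3] + "..."
    return "[%s: %s]" % (kind, escaped)
```

A text node holding a tab and a line break then prints on one line with visible \t and \n markers, and a multi-kilobyte text node no longer floods the debugger's variable view.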

Windows users may have a little trouble with the zip archive, because it contains some files used to test the conversion of file names to URIs when the file names have illegal characters such as angle brackets that Windows doesn't like. You may, therefore, see some error messages while unzipping. Ignore them. The only practical effect this has is that seven of the 964 unit tests will fail on Windows. This will be fixed in the next release. All platforms may also have trouble with a file named resumé.xml, included to test the conversion of file names with non-ASCII characters to base URLs. In this case, all modern platforms should be able to handle the file. However, the Ant zip and tar tasks are mangling the é character when they add the file to the archive. Either the CVS repository on java.net or the Eclipse CVS client has a similar problem. Regardless, the functionality of the core API is not affected, and XOM does work well with files whose names contain unusual characters, as these test cases were written to prove. I just can't seem to get them into the distro. Suggestions for fixing this are appreciated.

Saturday, October 2, 2004

The Apache XML Project has released version 2.6.0 of Xerces-C, an open source schema validating XML parser written in reasonably cross-platform C++. Version 2.6.0 includes a number of small changes and improvements, mostly focussed on performance. The XML 1.1 implementation is no longer considered experimental. The deprecated parts of DOM are now built as a separate library.


Bare Bones Software has released version 8.0.2 of BBEdit, my preferred text editor on the Mac. This release fixes one bug in FTP/SFTP uploads that occurred in 8.0.1. BBEdit is $179 payware. Upgrades from 8.0 are free. Upgrades from earlier versions are $49 for 7.0 owners and $59 for owners of earlier versions.

Friday, October 1, 2004

The W3C has posted the proposed recommendation of XInclude. This draft includes many editorial clarifications, but doesn't make any real changes to the underlying syntax or semantics. There don't appear to be any fully conformant implementations yet. However, XOM's XIncluder class implements all the required functionality, and passes all the tests that don't depend on optional features. Comments are due by October 29.


The Gnome Project has released version 2.6.14 of libxml2, the open source XML C library for Gnome. This release improves W3C schema support and fixes various bugs. They've also released version 1.1.11 of libxslt, the GNOME XSLT library for C and C++. This is a bug fix release.

Thursday, September 30, 2004

So far everyone seems happy with the new layout, but I noticed one more problem. The top navbar (i.e. the list of links to other pages) still appears before the content of the page. This is a violation of web accessibility guidelines. The content should be the very first element on the page for maximum accessibility for anyone using a screen reader. Having to listen to 14 links before reaching the quote of the day isn't as bad as listening to the entire side bar droned out, but it's still annoying as hell. I suspect I'm going to have to absolutely position the header div at the top of the page. That's probably going to muck with the rest of the layout. As usual I'll post test pages for everyone to look at before doing anything too drastic. Expect more CSS shenanigans soon.


The W3C Multimodal Interaction Working Group has published the third public working draft of the Ink Markup Language. According to the abstract,

The Ink Markup Language serves as the data format for representing ink entered with an electronic pen or stylus. The markup allows for the input and processing of handwriting, gestures, sketches, music and other notational languages in Web-based (and non Web-based) applications. It provides a common format for the exchange of ink data between components such as handwriting and gesture recognizers, signature verifiers, and other ink-aware modules.

The following example of writing the word "hello" in InkML is given in the spec:

<ink>
   <trace>
     10 0 9 14 8 28 7 42 6 56 6 70 8 84 8 98 8 112 9 126 10 140
     13 154 14 168 17 182 18 188 23 174 30 160 38 147 49 135
     58 124 72 121 77 135 80 149 82 163 84 177 87 191 93 205
   </trace>
   <trace>
     130 155 144 159 158 160 170 154 179 143 179 129 166 125
     152 128 140 136 131 149 126 163 124 177 128 190 137 200
     150 208 163 210 178 208 192 201 205 192 214 180
   </trace>

   <trace>
     227 50 226 64 225 78 227 92 228 106 228 120 229 134
     230 148 234 162 235 176 238 190 241 204
   </trace>
   <trace>
     282 45 281 59 284 73 285 87 287 101 288 115 290 129
     291 143 294 157 294 171 294 185 296 199 300 213
   </trace>
   <trace>

     366 130 359 143 354 157 349 171 352 185 359 197
     371 204 385 205 398 202 408 191 413 177 413 163
     405 150 392 143 378 141 365 150
   </trace>
</ink>

<sarcasm>Gee, that's not the least bit opaque.</sarcasm> This looks like the SVG mistake all over again. I wrote about this in Item 11 of Effective XML, "Make Structure Explicit through Markup." The right way to solve this problem is something like this:

<ink>
  <trace>
    <coordinate><x>10</x> <y>0</y></coordinate>
    <coordinate><x>9</x> <y>14</y></coordinate>
    <coordinate><x>8</x> <y>28</y></coordinate>
    <coordinate><x>7</x> <y>42</y></coordinate>
    <coordinate><x>6</x> <y>56</y></coordinate>
    <coordinate><x>6</x> <y>70</y></coordinate>
    <coordinate><x>8</x> <y>84</y></coordinate>
    <coordinate><x>8</x> <y>98</y></coordinate>
    <coordinate><x>8</x> <y>112</y></coordinate>
    <coordinate><x>9</x> <y>126</y></coordinate>
    <coordinate><x>10</x> <y>140</y></coordinate>
    <coordinate><x>13</x> <y>154</y></coordinate>
    <coordinate><x>14</x> <y>168</y></coordinate>
    <coordinate><x>17</x> <y>182</y></coordinate>
    <coordinate><x>18</x> <y>188</y></coordinate>
    <coordinate><x>23</x> <y>174</y></coordinate>
    <coordinate><x>30</x> <y>160</y></coordinate>
    <coordinate><x>38</x> <y>147</y></coordinate>
    <coordinate><x>49</x> <y>135</y></coordinate>
    <coordinate><x>58</x> <y>124</y></coordinate>
    <coordinate><x>72</x> <y>121</y></coordinate>
    <coordinate><x>77</x> <y>135</y></coordinate>
    <coordinate><x>80</x> <y>149</y></coordinate>
    <coordinate><x>82</x> <y>163</y></coordinate>
    <coordinate><x>84</x> <y>177</y></coordinate>
    <coordinate><x>87</x> <y>191</y></coordinate>
    <coordinate><x>93</x> <y>205</y></coordinate>
  </trace>
</ink>

That's more verbose, but it's also much clearer. It would let the data be extracted with standard XML tools rather than requiring each user to write their own micro-parser for the trace elements. If InkML really can't afford to actually mark up the x and y coordinates as x and y coordinates instead of raw text, then one wonders why it's using XML at all.
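To make the difference concrete, here is a minimal Python sketch that reads the same four points from both forms using only the standard library. It assumes the whitespace-separated "x y x y" layout shown in the spec excerpt above; the coordinate, x, and y element names are the hypothetical ones from the alternative markup, not anything InkML defines.

```python
import xml.etree.ElementTree as ET

# First four points of the first trace, in the draft's raw-text form.
raw = "<ink><trace>10 0 9 14 8 28 7 42</trace></ink>"

# The same points in the hypothetical explicit-markup form.
explicit = """<ink><trace>
  <coordinate><x>10</x> <y>0</y></coordinate>
  <coordinate><x>9</x> <y>14</y></coordinate>
  <coordinate><x>8</x> <y>28</y></coordinate>
  <coordinate><x>7</x> <y>42</y></coordinate>
</trace></ink>"""


def points_from_raw(doc):
    """Micro-parser: split the text and pair the numbers up by position."""
    trace = ET.fromstring(doc).find("trace")
    numbers = [int(token) for token in trace.text.split()]
    return list(zip(numbers[0::2], numbers[1::2]))


def points_from_markup(doc):
    """No custom parsing: the element names say which number is which."""
    root = ET.fromstring(doc)
    return [(int(c.findtext("x")), int(c.findtext("y")))
            for c in root.iter("coordinate")]


print(points_from_raw(raw))
print(points_from_markup(explicit))
```

The first function only works if the reader already knows the pairing convention; the second would keep working with an XPath expression, an XSLT stylesheet, or any other generic XML tool.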

This draft adds a metadata element, defines the application/inkml+xml MIME media type, adds a documentID attribute, and includes more tutorial material. However, the fundamental problems with the proposed format have not been addressed or acknowledged.

Wednesday, September 29, 2004

Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.8.7 exposes a SAX-like interface. Version 0.8.7 fixes a bug in attribute parsing. DOMIT! 0.9.8 exposes an API based on the Document Object Model (DOM). Version 0.9.8 is compatible with PHP 5 and supports HTTP proxy connections with basic authorization. Both are published under the GPL.

Tuesday, September 28, 2004

The new layout seems to be causing problems for readers with particularly long and difficult to spell last names (well probably not, but the first people to report the problem were Larry Zappaterrini and Richard Duivenvoorde). It seems to be caused by a small font size. For the moment, if you're not seeing everything, try increasing the text size a few points. The stylesheet does not set any absolute font sizes, just uses whatever default you've picked.

OK. I have a partial fix. It should be OK for any readable font size, though you can still reproduce it if you shrink the font to something illegible. The problem is that the width of the navbar is specified in ems and the min-width is specified in pixels. I need to do this to make sure the navbar is wide enough for both the text and the images. However, the right margin of the content area is given in ems, which works as long as the distinction between ems and pixels doesn't get too far out of whack from what it usually is. The hack is to specify roughly half the margin I want in ems and put the other half in pixels in padding. This isn't a perfect solution. CSS desperately needs a way to place an element at an offset from another element, rather than at a fixed position on the page.

Another big missing feature is the ability to specify min and max margins and paddings. The problem here happens because I need to set the content's right margin to at least the width of the navigation bar on the right. Sometimes that margin is set by the width property and sometimes it's set by the min-width property. Which is bigger depends on the font size. I have no way to specify that the right margin of the contents is the maximum of the navbar's width and min-width. I have to set the margin to one or the other. :-(
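The arithmetic behind this can be sketched in a few lines of Python. The numbers are invented for illustration (a navbar 12em wide with a 150px min-width); they are not the values in this site's stylesheet.

```python
# Hypothetical stylesheet values, chosen only to show the crossover.
WIDTH_EM = 12.0
MIN_WIDTH_PX = 150.0


def navbar_width_px(font_px):
    """Rendered navbar width: the em-based width, but never below min-width."""
    return max(WIDTH_EM * font_px, MIN_WIDTH_PX)


for font_px in (8, 12, 16, 24):
    needed = navbar_width_px(font_px)
    em_margin = WIDTH_EM * font_px   # margin copied from the width property
    px_margin = MIN_WIDTH_PX         # margin copied from the min-width property
    print(font_px, needed,
          "em margin ok" if em_margin >= needed else "em margin too small",
          "px margin ok" if px_margin >= needed else "px margin too small")
```

With these assumed values the em-based margin is too small at small font sizes and the pixel-based margin is too small at large ones, so neither single value tracks the maximum of the two.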


Can anyone suggest why Mozilla/Firefox can't find its AppleScript dictionary? I've been able to use some old Mac OS 9 AppleScripts with Mozilla 1.7 so it has the necessary classes, but the Script Editor only shows a few basic entries in the dictionary. Alternately, does anyone have a good reference to scripting Mozilla on Mac OS X? Suggestions appreciated.


There didn't seem to be any major issues with the last iteration of the CSS layout for this page, so I've switched the main page over. If no one notices any problems with this, I'll probably switch Cafe au Lait in a week or two.

A lot of people kept getting hung up on the pre element in the September 24 entry that extends past the width of the window if the window isn't grossly large. I can't quite find it in myself to call this a bug though. It's an unusual but legal and important use of the pre element. In the particular example at hand, it was critical to reproduce the white space exactly, long though it was. In essence this was an unintentional stress test that demonstrated how different layout algorithms coped with an impossible situation. The three major different layouts (table, absolute positioning, and float-based) all handled this differently. The table layout expanded the content area to be wide enough to hold the content and added scroll bars if necessary. The absolutely positioned layout ran the extra text under the righthand navbar, but didn't change the size of the rest of the content area. The float layout overlaid the overly wide text on top of the navbar, but also didn't change the size of the content area. Readers who noted the problem were roughly equally divided as to which solution they preferred. There was no clear consensus that one was right and one was wrong. Personally, I think the smartest solution was overlaying the wide content like the float-based layout did. It's also possible the content area should have received horizontal scroll bars of its own (something none of the layouts did). Possibly I can hack this behavior in with additional CSS properties. I'll explore that soon, but this is an unusual situation and not worth holding up the major changes for. It's a complete fluke that one of the rare news items I posted with over-wide content happened to fall within the week where I was exploring CSS layouts. Most weeks this won't be an issue, and any other week no one would have noticed the problem.

One issue a couple of people raised was whether I should set a maximum width on the content area. This is based on the general principle that long lines are harder to read. I explored that option but decided against it. Practically, if I set a maximum width on the content area, I couldn't make the navbar flush left up against the content instead of leaving big ugly patches of white space in the middle of the page. More importantly, the more I thought about it, the more it seemed to me that this principle was just plain wrong for the Web. Unlike a newspaper, a web browser allows users to set the width of the window to fit their needs, rather than those of the publisher. If the lines are too long, it's easy to make the window smaller. While there may be a nice sweet spot for line length that mostly satisfies most users most of the time, it's not going to fit everyone. In particular, I suspect it may not fit readers with very bad eyesight who like to bump up their font sizes very high. Therefore I think it's much more important to give users the option to make the lines narrower or wider than it is to set a maximum width. I'm still open on this one though, if anyone can make a really strong argument for maximum line layout that takes into account the difference between screen based layout and print, and considers the needs of visually impaired but not completely blind users. However, so far all I've seen on this point seems to be based on old rules for working in print, repurposed for the web world without a great consideration for the very real differences between onscreen display and paper.

Monday, September 27, 2004

Note to Dare (posted here because your comments are broken): The basic rule I had for creating an RSS feed for this site was that it couldn't require me to do more work than I was doing already, at least not on an ongoing basis. That's why I use XSLT driven by a cron job to generate the feed. I can edit the same way I always have, and the RSS happens automatically. The tools adjust to fit me rather than me adjusting to fit the tools. Adding individual URLs for each story would require bending myself around the tools, and I have this silly idea that computers were meant to serve people rather than the other way around. Actually the solution for the problem you note is to use XPointers to identify the individual news items. It would be easy enough to generate them automatically using XSLT. I haven't actually tried that. Maybe it would work, but I sort of expect it might run into some problems with browser compatibility.

The initial stumbling block that kept me from adding an RSS feed to this site was that my news items don't have titles, but then it occurred to me that I could use the first sentence of each item as the title. Of course this broke some RSS software written by developers who hadn't actually paid much attention to the specs because sometimes my sentences are on the long side and tend to drone on and on and on, but you get the idea and anyway this should be handled by clients because of course developers don't write arbitrary limitations on string size into their code because sooner or later those assumptions are going to be violated, as they were for some web browsers' layout algorithms a couple of days ago when I posted a pre fragment that couldn't fit within a browser window although in that case it was really important for semantic reasons to reproduce the exact line breaks and anyway how's that for a run-on sentence—once in high school I wrote an entire 500 word theme as a single run-on sentence.

Anyway, back to the point. I'm not going to rearrange my site to fit the needs of broken news readers. RSS got a lot of things wrong, and one of those things may be the lack of any unique identifier for articles separate from titles and URLs. However, that doesn't mean a client is justified in assuming that other things in the feed are in fact unique identifiers. If an RSS 0.92 client really needs to figure out whether two items are the same or different, it needs to use a combination of heuristics rather than relying on some assumed uniqueness that isn't actually present. Most simply, it could retrieve both items, and see if the old one is still there or not. It could also compare the descriptions, URLs, and titles and do a fuzzy match, without assuming that a change in a single byte reflected a completely new item. And if it can't do that, then it needs to be designed to operate correctly without any information about which items are new and which aren't. None of this is rocket science. It simply requires implementing the spec as it is, rather than as we might wish it to be.
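As a rough illustration of the fuzzy-match idea, the following Python sketch compares two items field by field with the standard library's difflib. The dictionary layout, the equal weighting of the three fields, and the 0.9 threshold are all assumptions made for the example; a real client would need to tune them.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def probably_same_item(old, new, threshold=0.9):
    """Guess whether two feed items are the same story.

    Items are plain dicts with optional title, link, and description keys.
    The threshold is an arbitrary illustrative value.
    """
    fields = ("title", "link", "description")
    scores = [similarity(old.get(f, ""), new.get(f, "")) for f in fields]
    return sum(scores) / len(scores) >= threshold


saxon = {
    "title": "Saxon 8.1 released",
    "link": "http://www.example.com/news.html",
    "description": "Michael Kay has released Saxon 8.1, an XSLT 2.0 processor.",
}
# The same story after a one-character edit to the description.
saxon_edited = dict(saxon, description=saxon["description"][:-1] + "!")
# A different story that shares the same page URL.
jdom = {
    "title": "JDOM 1.0 released",
    "link": "http://www.example.com/news.html",
    "description": "Jason Hunter has released JDOM 1.0, a tree-based API for Java.",
}

print(probably_same_item(saxon, saxon_edited))
print(probably_same_item(saxon, jdom))
```

Because the score is averaged over several fields, a shared URL or a single changed byte does not decide the outcome on its own.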


Ian E. Gorman has released GXParse 1.5, a free (LGPL) Java library that sits on top of a SAX parser and provides semi-random access to the XML document. The documentation isn't very clear, but as near as I can tell, it buffers various constructs like elements until their end is seen, rather than dumping pieces on you immediately like SAX does. This release completes namespace support, eases exception handling, and adds a few operators to CurrentElement.
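For readers who haven't seen this style of API, a loose analogy in Python's standard library is iterparse restricted to "end" events: nothing is handed over until an element is complete. This is not GXParse's API, only a sketch of the general buffering idea as described above, and the news/item document is made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document.
doc = (b"<news>"
       b"<item><title>JDOM 1.0</title><date>2004-09-10</date></item>"
       b"<item><title>Saxon 8.1</title><date>2004-09-26</date></item>"
       b"</news>")


def complete_items(stream):
    """Yield (title, date) only once a whole item element has been read."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title"), elem.findtext("date")
            elem.clear()  # release the children now that the item is handled


for title, date in complete_items(io.BytesIO(doc)):
    print(title, date)
```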

Sunday, September 26, 2004

One more time. I've got a new CSS laid out version of this page to check out. This one returns to absolute positioning instead of floats. However, it uses relative measurements in ems and exs instead of absolute pixel counts. This makes it more flexible in the face of changing font sizes. This pretty much seems to work across all the browsers I tried including Firefox, Safari, Mozilla, Internet Explorer 5 for Mac OS 9, Internet Explorer 5.5 for Mac OS X, and Netscape 4.7 for Mac OS 9. I haven't tested IE for Windows yet. As usual, please check it out and let me know if anything looks too funky to live with. If so, please let me know what browser on what platform you're using.

It does seem apparent to me that CSS is missing some crucial features needed for truly dynamic, flexible layouts. The most important is that there's no way to set the width, height, or position of one element to be a function of the width, height, or position of another element. All I'm really trying to do here is say that the navbar on the right starts about 1.2 times the height of the header below the header, regardless of font size, window size, and the number of characters in the header. For example,

top: 1.2*height(#header)

or perhaps

top: bottom(#header) + 20px


Michael Kay has released Saxon 8.1, an implementation of XSLT 2.0, XPath 2.0, and XQuery in Java. Saxon 8.1 is published in two versions for both of which Java 1.4 is required. Saxon 8.1B is an open source product published under the Mozilla Public License 1.0 that "implements the 'basic' conformance level for XSLT 2.0 and XQuery." Saxon 8.1SA is a £250.00 payware version that "allows stylesheets and queries to import an XML Schema, to validate input and output trees against a schema, and to select elements and attributes based on their schema-defined type. Saxon-SA also incorporates a free-standard XML Schema validator. In addition Saxon-SA incorporates some advanced extensions not available in the Saxon-B product. These include a try/catch capability for catching dynamic errors, improved error diagnostics, support for higher-order functions, and additional facilities in XQuery including support for grouping, advanced regular expression analysis, and formatting of dates and numbers." Version 8.1 incorporates various recent changes in the XQuery/XSLT 2.0/XPath 2.0 family of specs, including some that have not been published yet. Upgrades from 8.0 are free.

Saturday, September 25, 2004

Yesterday's experimental version of this page using CSS instead of table layouts proved insufficiently liquid in the face of font size changes. That seems to be a flaw of any layout involving absolute positioning. I've written a different layout that uses floats instead. Again, please check it out and let me know if anything looks too funky to live with. If so, please let me know what browser on what platform you're using. It seems to work well on all the browsers I have conveniently available, but I haven't tested IE5 for Windows yet.

This page looks better, but it's got one major flaw compared to yesterday's. In order to float the navbar on the right I had to move the div containing the navbar to the beginning of the HTML before the content. This is a major hassle for anyone using a non-CSS browser, Lynx, or a screen reader. I really need to move the navigation after the main content. Any suggestions?

Several people commented on the overly long pre element. That's a coincidence but it is a good test. What does/should a browser do with a pre element that's too wide for a page or a window or a panel?

A couple of people viewed source, and noted it wasn't valid XHTML. That's correct, and that's a deliberate decision. The page is well-formed, which is all XML processors need; and the additional markup I've added for my own use follows the longstanding principle that browsers should ignore any tags they don't recognize. Validity is overemphasized in XHTML. At one point, I did use modular XHTML to make this page valid. I even wrote about that in Chapter 7 of XML in a Nutshell. However, that broke far too many existing web browsers for me to seriously consider putting it into production. It's important that elements be used in the ways the spec intends them to be used. It's not important that there be no other elements about which the spec says nothing.

Friday, September 24, 2004

Do me a favor. Please take a look at the experimental version of this page that uses CSS instead of table layouts, and let me know if anything looks too funky to live with. If so, please let me know what browser on what platform you're using. It seems to work well on all the browsers I have conveniently available, but I haven't tested IE5 for Windows yet. Netscape 4 isn't great, but I was able to hack that enough so it isn't unreadable. One thing I still haven't figured out how to do is center the h1 header ("Cafe con Leche XML News and Resources") within the left hand panel. I can center it relative to the page, but that's not quite the same thing, especially in a wide window.


Amazon has reduced the price of XML in a Nutshell, 3rd edition to $27.17, a 32% savings off the cover price. Be the first on your block to get one!


I've been spending a lot of time reviewing RSS readers lately, and overall they're a pretty poor lot. Latest example. Yesterday's Cafe con Leche feed contained this completely legal title element:

<title>I'm very pleased to announce the publication of XML in a Nutshell, 3rd edition by myself and W.
          Scott Means, soon to be arriving at a fine bookseller near you.
          </title>

Note the line break in the middle of the title content. This confused at least two RSS readers even though there's nothing wrong with it according to the RSS 0.92 spec. Other features from my RSS feeds that have caused problems in the past include long titles, a single URL that points to several stories, and not including more than one day's worth of news in a feed.
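Coping with this takes very little code. Here is a minimal Python sketch, assuming a reader that wants a single-line title for display: it parses the item and collapses every run of white space, line breaks included, into one space. The surrounding item element is added only to make the fragment a complete document.

```python
import xml.etree.ElementTree as ET

item = """<item><title>I'm very pleased to announce the publication of XML in a Nutshell, 3rd edition by myself and W.
          Scott Means, soon to be arriving at a fine bookseller near you.
          </title></item>"""


def display_title(item_xml):
    """Return the item's title with all white space runs collapsed."""
    title = ET.fromstring(item_xml).findtext("title", default="")
    return " ".join(title.split())


print(display_title(item))
```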

Cafe con Leche and Cafe au Lait use XSLT to generate their RSS feeds, so they're always completely well-formed. The home pages are edited by hand, and may not always be well-formed; but if so the XSLT processor reports an error and does not generate a new RSS document. I really wish RSS vendors would focus on implementing the actual specs reasonably before they wasted time on supporting brain damage like malformed feeds and double escaped HTML. It's well-known that supporting non-conformant documents poisons the well for everyone. What's less well-known is that adding support for non-conformant documents tends to break the support for sites that actually follow the specifications. Everyone gets sucked into a race to the bottom, and we end up back in the world where everyone's browser handles sites just a little bit differently from everyone else's, and vendors compete based on how many broken sites they can make sense out of instead of how well they can present genuinely good data. This is the HTML hell XML was supposed to save us from. Those who forget the past are condemned to repeat it.
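The "no well-formed page, no new feed" rule can be sketched as follows. The real feeds here are generated with XSLT; this Python version only stands in for the control flow, and treating each p element as one item is an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def build_feed(page_source):
    """Return RSS text for the page, or None if the page is not well-formed."""
    try:
        page = ET.fromstring(page_source)
    except ET.ParseError as err:
        # Report the error and leave the previously generated feed alone.
        print("Page is not well-formed; keeping the old feed:", err)
        return None
    rss = ET.Element("rss", version="0.92")
    channel = ET.SubElement(rss, "channel")
    for para in page.iter("p"):  # assumption: one paragraph per news item
        text = " ".join("".join(para.itertext()).split())
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "description").text = text
    return ET.tostring(rss, encoding="unicode")


print(build_feed("<html><body><p>News</p></body></html>"))
print(build_feed("<html><body><p>News</body></html>"))
```

The output is built through an XML API rather than by string concatenation, so whatever is emitted is well-formed by construction.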

Thursday, September 23, 2004
The Peacock Book

I'm very pleased to announce the publication of XML in a Nutshell, 3rd edition by myself and W. Scott Means, soon to be arriving at a fine bookseller near you. XML in a Nutshell is quite simply the most complete and succinct treatment of the major technologies in XML you'll find anywhere. Topics covered include elements, attributes, syntax, namespaces, well-formedness, DTDs, schemas, XPath, XSLT, XSL-FO, CSS, SAX, DOM, internationalization, XHTML, and more. The third edition is a major update that syncs the book with the latest developments in XML including:

  • XML 1.1
  • XInclude
  • DOM Level 3
  • SAX 2.0.1
  • Unicode 4.0.1
  • XPointer 1.0
  • Namespaces 1.1

If you don't have a copy, you need a copy. Do you need to upgrade your old copy? If you're sticking to XML 1.0 (a recommendation I've made in my two previous books and continue to stand by in this one), the second edition will probably continue to serve you well. However, if you're still thumbing through a very dog-eared copy of the first edition, it's definitely time to upgrade. XML hasn't stood still in the three years since the first edition was published, and there's a lot of new and improved material here.

My author's copy arrived a couple of days ago, and generally it ships to me from the warehouse at the same time it ships to bookstores, just by slightly faster courier, so bookstores should have it in stock any day now. Amazon is still listing it at the full cover price of $39.95, but they normally drop that as soon as it gets in stock, so you may want to wait a day or two to order it. Update: Sometime today they dropped the price to $27.17, a 32% savings, plus they're offering free supersaver shipping. They normally don't do this until they have the book in stock, so I expect it to arrive any day now. Go ahead and order it. Powell's, Barnes & Noble, and other bookstores should have it momentarily as well. It will also be available on Safari in the not too distant future for those readers who prefer their books in electronic format. The book is XML in a Nutshell, 3rd edition. The ISBN is 0-596-00292-0. It's published by O'Reilly, and written by W. Scott Means and me, Elliotte Rusty Harold. Check it out!

Wednesday, September 22, 2004

Benjamin Pasero has posted version 0.9b of RSSOwl, an open source RSS reader written in Java and based on the SWT toolkit. This release can search a site for news feeds, improves PDF export, and can use Mozilla 1.7 as an internal browser. It also fixes a number of user interface inconsistencies I reported.


Ranchero Software has posted a beta of NetNewsWire 2.0, a closed source RSS client for Mac OS X available in both free-beer lite and payware full versions. Version 2.0 removes the weblog editor and adds Atom support, flagged items, searching, news persistence, and an embedded browser.


JAPISoft has released JXP 1.3.3, a €139 payware XPath 1.0 API that can be customized to fit different object models. This release fixes bugs.


JAPISoft has also released FastParser 1.6.8, a $199 payware, non-validating, XML parser for Java that supports SAX and some of DOM. I'm very skeptical of this parser, and JAPISoft products in general. I notice that every other release they announce "new features" that are absolutely essential to any minimally conformant implementation of the technologies they claim to implement. It makes me wonder what's missing from the current version. Plus they are completely misusing the phrase "open source." These products are not available under an open source license, JAPISoft claims to the contrary notwithstanding.


Kevin Howe has written a Ruby wrapper for HTML Tidy. It's distributed under the Ruby license.

Tuesday, September 21, 2004

Dave Beckett has released the Raptor RDF Parser Toolkit 1.3.3, an open source C library for parsing the RDF/XML, N-Triples, Turtle, and Atom Resource Description Framework formats. It uses expat or libxml2 as the underlying XML parser. Version 1.3.3 restores Unicode Normalization Form C checking and fixes various bugs. Raptor is now dual licensed under the LGPL and Apache 2.0 licenses.


Mikhail Grushinskiy has posted XMLStarlet 0.95, a command line utility for Linux that exposes a lot of the functionality in libxml and libxslt including validation, pretty printing, and canonicalization. This release fixes some security bugs and has been recompiled against libxml2 2.6.13 and libxslt 1.1.10.

Sunday, September 19, 2004

I've posted beta 5 of XOM, my dual streaming/tree API for processing XML with Java. This beta primarily focuses on fixing bugs in XInclude and improving performance of builders when reading from files. It also deprecates the setNodeFactory() method in XSLTransform which will be removed in the next drop. In its place, there's a new constructor:

public XSLTransform(Document stylesheet, NodeFactory factory)

Finally, the four XSLTransform constructors deprecated in the last release have been removed.

I don't have any other major issues in the TODO list for 1.0. If nobody finds any bugs in this beta, I may label the next drop release candidate 1.

Saturday, September 18, 2004

The W3C Voice Browser Working Group has released the Recommendation of the Speech Synthesis Markup Language Version 1.0. According to the abstract, the Speech Synthesis Markup Language "is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms."

Friday, September 17, 2004

Pekka Enberg has posted version 0.2.17 of XML Indent, an open source (GPL) "XML stream reformatter written in ANSI C" that "is analogous to GNU indent." This release supports multiple input files on the command line.


Altsoft N.V. has posted a beta of Xml2PDF 2.0, a $49 payware Windows program for converting XSL Formatting Objects documents into PDF files. Version 2.0 improves XSL-FO 1.0 and 1.1 draft support, adds support for SVG, and accepts XHTML as an input format.

Thursday, September 16, 2004

Norm Walsh has posted the second beta of DocBook 4.4, an XML application designed for technical documentation and books such as Processing XML with Java. New elements in DocBook 4.4 include package and biblioref.


Bob Stayton has posted version 1.66.0 of the DocBook XSL stylesheets. These support transforms to HTML, XHTML, and XSL-FO. Besides bug fixes, major enhancements in this release include:

  • Improved handling of olinks
  • Can create multiple indices containing different categories of entries
  • A section.autolabel.max.depth parameter to turn off section numbering below a certain depth
  • Better handling of relative URLs in image references, including xml:base support
  • Support for DocBook 4.3 corpcredit element.
  • Increased footnote customization

David Holroyd has written a CSS2 DocBook stylesheet that enables CSS Level 2 savvy web browsers such as Mozilla and Opera to display DocBook XML documents. The results aren't as pretty as what the XSLT stylesheets can produce, but they're serviceable.

Wednesday, September 15, 2004

The Mozilla Project has posted the first preview release of Firefox 1.0, the open source web browser that has recently crossed 10% market share and is rapidly gaining on Internet Explorer. They've also upgraded Mozilla to 1.7.3 to fix some security bugs. Both Firefox and Mozilla support XML, XSLT, HTML, XHTML, and CSS. New features in this release include RSS support. This release may break some extensions.

Tuesday, September 14, 2004

GotDotNet has released EXSLT.NET 1.1, a .NET library that implements 65 EXSLT extensions to XSLT from these modules:

  • Dates and Times
  • Common
  • Math
  • Random
  • Regular Expressions
  • Sets and Strings

The EXSLT.NET library also provides 13 unique extension functions of its own. New functions in this release include:

  • str:encode-uri()
  • str:decode-uri()
  • random:random-sequence()
  • dyn2:evaluate()
  • date2:day-name()
  • date2:day-abbreviation()
  • date2:month-name()
  • date2:month-abbreviation()

EXSLT.NET is published under the GOTDOTNET WORKSPACES COMMERCIAL DERIVATIVES LICENSE, which Oleg Tkachenko says is an open source license, though I don't see it listed as one by the Open Source Initiative. IANALL (I am not a license lawyer) but the patent clauses probably disqualify this license from being true open source.

Monday, September 13, 2004

The W3C XML Protocol Working Group has published three candidate recommendations covering XOP, a MIME multipart envelope format for bundling XML documents with binary data:

  • SOAP Message Transmission Optimization Mechanism "describes an abstract feature and a concrete implementation of it for optimizing the transmission and/or wire format of SOAP messages. The concrete implementation relies on the [XOP] format for carrying SOAP messages."
  • Resource Representation SOAP Header Block "describes the semantics and serialization of a SOAP header block for carrying resource representations in SOAP messages."
  • XML-binary Optimized Packaging "defines the XML-binary Optimized Packaging (XOP) convention, a means of more efficiently serializing XML Infosets that have certain types of content."

Basically this is another whack at the packaging problem: how to wrap up several documents including both XML and non-XML documents and transmit them in a single SOAP request or response. In brief, this proposes using a MIME envelope to do that. This is all reasonable. I do question the wisdom, however, of pretending this is just another XML document. It's not. The working group wants to ship binary data like images in their native binary form, which is sensible. What I don't like is that the working group wants to take their non-XML, MIME based format and say that it's XML because you could theoretically translate the binary data into Base-64, reshuffle the parts, and come up with something that is an XML document, even though they don't expect anyone to actually do that.
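The cost of the theoretical Base-64 translation is easy to measure. The Python sketch below uses made-up bytes standing in for an image and a hypothetical photo element; it is not the XOP processing model, only the plain embed-as-text round trip that XOP is designed to avoid.

```python
import base64
import xml.etree.ElementTree as ET

# 3000 arbitrary bytes standing in for an image.
binary = bytes(range(250)) * 12

# Embedding binary data in XML means encoding it as text first.
encoded = base64.b64encode(binary).decode("ascii")
photo = ET.Element("photo")  # hypothetical element name
photo.text = encoded
document = ET.tostring(photo)

# Getting the bytes back requires parsing and then decoding.
roundtrip = base64.b64decode(ET.fromstring(document).text)

print(len(binary), len(encoded))
```

Base-64 turns every 3 bytes into 4 characters, so the encoded text is a third larger than the original, before counting the encode and decode work on each end.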

Why is there this insane urge across the technological spectrum to call everything XML, even when it clearly isn't and clearly shouldn't be? XML is very good for what it is, but it doesn't and shouldn't try to be all things to all people. Binary data is not something XML does well and not something it ever will do well. Render into binary what is binary, and render into XML what is text.

Comments are due by September 15.


JAPISoft has released EditiX 2.0, a $99 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 2.0 adds an XSLT debugger, DocBook support, and multi-view preview. EditiX is available for Mac OS X, Linux, and Windows. Upgrades from 1.x are $59.

Sunday, September 12, 2004

I clicked on the RSS feed on Tim Bray's web log, which serves it as application/rss+xml; and Mozilla tried to open it. That's funny. Shouldn't it display it? It is an XML file after all. I tried passing it to RSSOwl, but nothing happened.


The Apache Commons Team has released Digester 1.6, a SAX-based XML to object mapper, designed primarily for parsing XML configuration files though it has other uses too. Digester is configured through an XML to Java object mapping module, which triggers actions whenever a pattern of nested XML elements is recognized. Version 1.6 "includes many bug fixes and minor enhancements as well as several new features including plugins (framework supporting dynamic rule reconfiguration) and variable expansion."
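The pattern-triggers-action idea can be sketched in Python on top of SAX. This is not Digester's API, which is Java and considerably richer; the handler class, the slash-separated path syntax, and the config/server document are all invented for the illustration.

```python
import xml.sax


class PatternHandler(xml.sax.ContentHandler):
    """Fire a callback whenever the current element path matches a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules  # maps a path such as "config/server" to a callable
        self.path = []      # names of the currently open elements

    def startElement(self, name, attrs):
        self.path.append(name)
        action = self.rules.get("/".join(self.path))
        if action is not None:
            action(dict(attrs.items()))

    def endElement(self, name):
        self.path.pop()


servers = []
rules = {"config/server": servers.append}

xml.sax.parseString(
    b'<config><server host="a" port="80"/><server host="b" port="8080"/></config>',
    PatternHandler(rules),
)
print(servers)
```

Keeping the rules in a table separate from the parsing code is what lets a configuration format change without touching the handler.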


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.8.6 exposes a SAX like interface. DOMIT! 0.9.7 exposes an API based on the Document Object Model (DOM). Both are published under the GPL. These releases fix a bug in comment parsing.


Michael B. Allen's posted domc 0.8, an open source implementation of the Document Object Model Level 1 (DOM1) in ANSI C. It depends on the Expat XML Parser Toolkit. This is a bug fix release.


Jeff Key's posted a free-beer XmlDocViewer for Windows.

Friday, September 10, 2004

Jason Hunter has released JDOM 1.0, an open source, tree-based API for processing XML with Java. The API is unchanged since beta 10. A few bugs have been fixed. Java 1.2 or later is required.


The Mozilla Project has posted the multilanguage (Danish, Dutch, English, French, German, Italian, Japanese, and Slovak with Chinese in preparation) version of Camino 0.8.1, a Mac OS X web browser based on the Gecko 1.7 rendering engine and the Quartz GUI toolkit. This release makes major updates to the rendering engine. Mac OS X 10.1.5 or later is required.


Get RSSOwl

Now that I'm moving my primary machine to Mac OS X, I've decided to finally break down and get an RSS reader. A lot of people like the payware NetNewsWire or the free-beer NetNewsWire Lite, but I'm going to try RSSOwl first instead. It's open source and written in Java, so I figure I should be able to fix it so it works the way I want if necessary. Of course if it requires too much fixing that decision may change fast. Hmm, it doesn't open pages in my browser so I'm forced to use a third of a screen in RSSOwl to read an entire article, and there's no AppleScript support. This may change faster than I thought. Hmm, OK. Maybe not. It can use an external browser, but you have to configure that in the preferences first, which are not in the right place. (I filed a bug on this.) Hmm, and even after you do that it picks the wrong browser (Safari instead of Mozilla) but at least it uses an external browser. OK, looks like NetNewsWire has the same bug, but NetNewsWire has a FAQ that explains how to work around the bug. Wouldn't you know it? That fixes the bug in RSSOwl too.

Thursday, September 9, 2004

John Cowan has updated TagSoup, his open source, Java-language, SAX parser for nasty, ugly HTML, to version 0.10.2. "Version 0.10.2 fixes some long-standing bugs in the areas of entity references within attribute values, well-formed names for elements and processing instructions, empty tags, and end-tags inside script and style elements. In addition, I have removed the misfeature introduced in 0.10.1 whereby > terminated a tag even inside quotes." TagSoup is dual licensed under the Academic Free License and the GPL.

Wednesday, September 8, 2004

I've posted the fourth beta of XOM, my absolutely correct, open source, dual tree-streaming API for processing XML with Java. This release is still backwards compatible with 1.0d25. However it does deprecate four XSLTransform constructors that will be removed in the next release. The upside is that XSLT transformation should now be much faster and more memory efficient since XOM is no longer writing documents out into strings and passing those to the transform. Instead it is streaming SAX events directly into the transformer. Furthermore, implementing this exposed several bugs in the SAXConverter (and indirectly one in the DOMConverter) that have now been fixed. I've also fixed some minor cosmetic bugs involving serialization of internal DTD subsets.

I've also completed the first draft of the XOM tutorial (XHTML compliant browser required). At this point I'm open for any comments anyone has, ranging from spelling errors, grammatical mistakes and code bugs to missing topics and poorly explained subjects.

For what it's worth, I've also updated the future directions document with some more thoughts about what may or may not appear in XOM post-1.0. Finally, I've posted some instructions for building XOM that should assist anyone who wants to build this from source.

Tuesday, September 7, 2004

The W3C has posted the call for papers for WWW2005, to be held in Chiba, Japan, May 10-14, 2005. Papers are due by November 8 and "will be peer-reviewed by at least 3 reviewers from an International Program Committee. Accepted papers will appear in the conference proceedings published by the Association for Computing Machinery (ACM), and will also be accessible to the general public via http://www2005.org/." Tutorial proposals are due by October 15, and tutorials may be given in either English or Japanese. The latter's a nice touch. I'm always amazed by conferences in non-English speaking countries where even the locals deliver in English.

Chiba's a great choice for a conference like this, but I'm afraid I'll have to pass on this one. It's too far, too expensive, and too unpaid for me to justify the trip. My bank account is already showing the effect of attending too many academic conferences this year, and I've had to decline several interesting shows already because they expect the speakers to subsidize them rather than the other way around. That said, if anyone has a budget to pay speakers and is interested in hearing me present on XOM, Effective XML, XQuery, or many other XML related subjects, to either a conference or your company, please do drop me a line. I'm generally willing to talk to local user groups for free, and I do quite a bit of that here in New York—I'll be talking to the XML User's Group in Albany in October; topic and location will be announced soon—but anything that requires long distance travel needs to be compensated to justify the time away from other projects.


Steve Ball has released tkxmllint 1.6 and tkxsltproc 1.6, GUI front-ends for libxml2's xmllint and libxslt's xsltproc. They're available for Windows and Mac OS X. Version 1.6:

  • Improves handling of character encodings
  • Supports relative URLs as system identifiers
  • Configures entity substitution
  • Can pretty print the result document
  • Sets the base URI of the result document so chunking works

Monday, September 6, 2004

The W3C Quality Assurance Working Group has published the second public working draft of the QA Handbook. According to the abstract, "The QA Handbook (QAH) is a non-normative handbook about the process and operational aspects of certain quality assurance practices of W3C's Working Groups, with particular focus on testability and test topics. It is intended for Working Group chairs and team contacts. It aims to help them to avoid known pitfalls and benefit from experiences gathered from the W3C Working Groups themselves. It provides techniques, tools, and templates that should facilitate and accelerate their work. This document is one of the QA Framework (QAF) family of documents of the Quality Assurance (QA) Activity. QAF includes the other in-progress or planned specifications: Specification Guidelines (in progress), and Test Guidelines."


The W3C Quality Assurance (QA) Activity has also published a revised working draft of the QA Framework: Specification Guidelines. Quoting from the abstract, "A lot of effort goes into writing a good specification. It takes more than knowledge of the technology to make a specification precise, implementable and testable. It takes planning, organization, and foresight about the technology and how it will be implemented and used. The goal of this document is to help W3C editors write better specifications, by making a specification easier to interpret without ambiguity and clearer as to what is required in order to conform. It focuses on how to define and specify conformance for a specification. Additionally, it addresses how a specification might allow variation among conforming implementations. The document is presented as a set of guidelines or principles, supplemented with good practices, examples, and techniques."


Finally, the W3C QA Group has published the first working draft of Variability in Specifications:

This document details and deepens some of the most important conformance-related concepts evoked in the QA Specification Guidelines, developing some of the analysis axes that need to be considered while designing a specification and providing advanced techniques, particularly for dealing with conformance variability and complexity.

Sunday, September 5, 2004

The W3C Multimodal Interaction working group has posted the third public working draft of EMMA: Extensible MultiModal Annotation markup language. According to the abstract, this spec "provides details of an XML markup language for describing the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user's inputs such as interaction managers."

Saturday, September 4, 2004

Tom Bradford has released dbXML 2.0, an open source native XML database written in Java and published under the GNU General Public License. New features in version 2.0 include:

  • Journaling transactions
  • XSL transformations
  • Full text indexing and Full text querying
  • Pluggable security models
  • SSL connection support
  • JSP Tag Library

Changes since the last release candidate are only bug fixes. Java 1.4 is required.


Jason Hunter has posted the first release candidate of JDOM 1.0, an open source API for processing XML with Java. Changes since beta 10 consist of bug fixes and documentation updates. The API is unchanged. Java 1.2 or later is required.

Friday, September 3, 2004

The Gnome Project has released version 2.6.12 of libxml2, the open source XML C library for Gnome. This release improves W3C schema support, the Python bindings, and the command line tools. They also fixed some bugs. They've also released version 1.1.9 of libxslt, the GNOME XSLT library for C and C++. This is a bug fix release.


Mikhail Grushinskiy has posted XMLStarlet 0.93, a command line utility for Linux that exposes a lot of the functionality in libxml and libxslt including validation, pretty printing, and canonicalization. This release has been recompiled against libxml2 2.6.12 and libxslt 1.1.9.


Ian E. Gorman has released GXParse 1.4, a free (LGPL) Java library that sits on top of a SAX parser and provides semi-random access to the XML document. The documentation isn't very clear, but as near as I can tell, it buffers various constructs like elements until their end is seen, rather than dumping pieces on you immediately like SAX does. This release makes various API changes.
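The difference between getting pieces immediately and getting a construct only once its end has been seen can be sketched with Python's standard library. This is only an analogy, not GXParse's API: the function name complete_elements and the sample document are made up for illustration, and the buffering is done by xml.etree.ElementTree.iterparse.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
SAMPLE = b"<log><entry id='1'>first <b>bold</b> item</entry><entry id='2'>second</entry></log>"

def complete_elements(xml_bytes, tag):
    """Yield (id, text) for each element named `tag`, but only after its
    end tag has been seen, so the whole element is available at once."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == tag:
            # By the "end" event the element's children and text are all buffered.
            yield elem.attrib.get("id"), "".join(elem.itertext())
            elem.clear()  # release the buffered content once it has been consumed

for entry_id, text in complete_elements(SAMPLE, "entry"):
    print(entry_id, text)
```

A plain SAX handler would instead see the start tag, each text chunk, and the nested b element as separate callbacks and would have to reassemble them itself.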


Nate Nielsen has released RTFX 0.9.4 (formerly RTFM), an open source (BSD license) tool for converting Rich Text Format (RTF) files into XML. "It majors on keeping meta data like style names, etc... rather than every bit of formatting. This makes it handy for converting RTF documents into a custom XML format (using XSL or an additional processing step)." Version 0.9.4 should be much faster.


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.8.5 exposes a SAX like interface. DOMIT! 0.9.6 exposes an API based on the Document Object Model (DOM). Both are published under the GPL. These releases add namespace support and fix bugs. This also means DOM Level 2 is now mostly supported.


The Apache Web Services Project has posted JaxMe 2 0.31, an open source implementation of the Java API for XML Data Binding (JAXB). It also extends JAXB with various features including a persistency layer. However, according to the FAQ, "As of this writing, JaxMe 2 isn't sufficiently mature for large projects."

Thursday, September 2, 2004

John Cowan has updated TagSoup, his open source, Java-language, SAX parser for nasty, ugly HTML, to version 0.10.1. "Version 0.10.1 finally makes Tag Soup Markup Language and State Machine Markup Language documents the source from which the HTMLScanner and HTMLSchema classes are generated. In addition, xmlns and xmlns:* attributes are now removed from the input, as are XML declarations. Prefixes are now partly recognized on attribute names: the xml: prefix is correctly recognized, and other prefixes are mapped to URIs of the form urn:x-prefix:prefix. XML-style empty tags are now recognized in a variety of cases. Finally, various bugs have been fixed." TagSoup is dual licensed under the Academic Free License and the GPL.


The W3C SVG Working Group has published the first public working draft of SVG's XML Binding Language (sXBL).

sXBL is a mechanism for defining the presentation and interactive behavior of elements described in a namespace other than SVG's.

sXBL is intended to be used to enable XML vocabularies (tag sets) to be implemented in terms of SVG elements. For instance, a tag set describing a flowchart could be mapped to low-level SVG path and text elements, possibly including interactivity and animation.

sXBL is intended to be an SVG-specific first version of a more general-purpose XBL specification (e.g., "XBL 2.0"). The intent is that, in the future, a general-purpose and modularly-defined XBL specification will be developed which will replace this specification and will define additional features that are necessary to support scenarios beyond SVG, such as integration into web browsers that support CSS. Once a general-purpose XBL is defined, sXBL would just become an SVG-specific subset (i.e., a profile) of the larger XBL specification.

This was formerly developed as the Rendering Custom Content feature in SVG 1.2, but has now been split off into its own spec.

Wednesday, September 1, 2004

Bare Bones Software has released version 8.0 of BBEdit, my preferred text editor on the Mac. New features in this release include improved handling of Unicode files, HTML Tidy integration, HTML fragment syntax checking, and support for the Perforce source code control system.

BBEdit is $179 payware. Upgrades are $49 for 7.0 owners and $59 for owners of earlier versions. Mac OS X 10.3.5 or later is required so I won't be upgrading as all my systems are running 10.2 or earlier. I'm always amazed at the willingness of software vendors to just throw away potential sales by refusing to support anything but the newest and buggiest OS release.

By the way, if you noticed a glitch in the quote of the day earlier, that was BBEdit's fault.


Alexandre Brilliant has posted the first release candidate of JXMLPad 2.5, a $499 JavaBean component for editing XML. Version 2.5 adds a bookmark API, a tree action toolbar, and horizontal split views. It also fixes various bugs. Java 1.4 or later is required.

Saturday, August 28, 2004

Sun has posted the second proposed final draft of Java Specification Request 206, Java API for XML Processing (JAXP) 1.3, to the Java Community Process. New features in JAXP 1.3 include:

  • An XPath 1.0 query API
  • A javax.xml.datatype package for XML<-->Java type mappings
  • A generic schema validation API

Plus the DOM APIs are upgraded to Level 3, and SAX is upgraded to 2.0.1.

Tuesday, August 24, 2004

I'll be on vacation for the next week on an island with limited Internet access, so updates will be infrequent to nonexistent until I return in September.


Wolfgang Hoschek found some bugs in DOM conversion in XOM beta 2 so I've posted beta 3 to fix the problem. I've also done some more work to improve the connection between Java 1.5's built-in parser and XOM. None of the public APIs have changed in any way. All beta and alpha code should still run without a recompile.

Monday, August 23, 2004

The RDF Data Access Working Group has published the second public working draft of RDF Data Access Use Cases and Requirements. According to the introduction,

The W3C's Semantic Web Activity is based on RDF's flexibility as a means of representing data. While there are several standards covering RDF itself, there has not yet been any work done to create standards for querying or accessing RDF data. There is no formal, publicly standardized language for querying RDF information. Likewise, there is no formal, publicly standardized data access protocol for interacting with remote or local RDF storage servers.

Despite the lack of standards, developers in commercial and in open source projects have created many query languages for RDF data. But these languages lack both a common syntax and a common semantics. In fact, the extant query languages cover a significant semantic range: from declarative, SQL-like languages, to path languages, to rule or production-like systems. The existing languages also exhibit a range of extensibility features and built-in capabilities, including inferencing and distributed query.

Further, there may be as many different methods of accessing remote RDF storage servers as there are distinct RDF storage server projects. Even where the basic access protocol is standardized in some sense—HTTP, SOAP, or XML-RPC—there is little common ground upon which to develop generic client support to access a wide variety of such servers.

The following use cases characterize some of the most important and most common motivations behind the development of existing RDF query languages and access protocols. The use cases, in turn, inform decisions about requirements, that is, the critical features that a standard RDF query language and data access protocol require, as well as design objectives that aren't on the critical path.


Schemas don't get no respect. IBM's alphaWorks has released version 1.2 of Xeena, a syntax directed XML editor, specifically to support DTDs and deprecate the support they added a few years ago for the W3C XML Schema Language.

Sunday, August 22, 2004

The W3C Quality Assurance (QA) Activity has posted the fourth public working draft of the QA Framework: Test Guidelines. "The principal goal of QA Framework: Test Guidelines is to help W3C Working Groups to develop more useful and usable test materials. The material is to be presented as a set of principles and good practices."

Saturday, August 21, 2004

Opera Software has released version 7.5.4 of their namesake web browser for Windows, Mac, Linux, FreeBSD and Solaris. Opera supports HTML, XML, XHTML, RSS, WML 2.0, and CSS. XSLT is not supported. Other features include IRC, mail, and news clients and pop-up blocking. This release fixes several security bugs. All users should upgrade. Opera is $39 payware.

Friday, August 20, 2004

The W3C Technical Architecture Group (TAG) has posted the first last call working draft of Architecture of the World Wide Web, First Edition. Quoting from the abstract:

The World Wide Web is an information space of interrelated resources. This information space is the basis of, and is shared by, a number of information systems. Within each of these systems, people and software retrieve, create, display, analyze, relate, and reason about resources.

Web architecture defines the information space in terms of identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space. Web architecture is influenced by social requirements and software engineering principles. These lead to design choices and constraints on the behavior of systems that use the Web in order to achieve desired properties of the shared information space: efficiency, scalability, and the potential for indefinite growth across languages, cultures, and media. Good practice by agents in the system is also important to the success of the system. This document reflects the three bases of Web architecture: identification, interaction, and representation.

Last call ends September 17.


Syntext has posted the third beta of Serna 2.0, a $299 payware XSL-based WYSIWYG XML Document Editor for Mac OS X, Windows, and Unix. Features include on-the-fly XSL-driven XML rendering and transformation, on-the-fly XML Schema validation, and spell checking. Version 2.0 adds a customizable GUI, "liquid" dialog boxes, multiple validation modes (strict, on, and off), and large document support. Beta 3 adds support for CALS tables and IDEAlliance proceedings, plus a GUI customization tool.

Thursday, August 19, 2004

Peter Jipsen has released ASCIIMathML 1.4.1, a JavaScript program that converts calculator-style ASCII math notation and some LaTeX formulas to Presentation MathML while a Web page loads. The resulting MathML can be displayed in Mozilla-based browsers and Internet Explorer 6 with MathPlayer.


David Tolpin has released RNV 1.7.1, an open source Relax NG Compact Syntax validator written in ANSI C. (I wonder why it only supports the compact syntax? Even if the validator expects the compact syntax it shouldn't be hard to write a preprocessor that converts the full syntax to the compact syntax.) Version 1.7.1 fixes bugs. RNV is published under a BSD license.


The Mozilla Project has posted the third alpha of Mozilla 1.8. New features in 1.8 include FTP uploads, improved junk mail filtering, better Eudora import, and an increase in the number of cookies that Mozilla can remember. It also makes various small user interface improvements, gives users the option to disable CSS via Use Style > None or a global preference, and adds support for CSS quotes.

Wednesday, August 18, 2004

Release early; release often. Stefan Matthias Aust noticed some problems using XOM beta 1 with Java 1.5 when the standard Xerces was not in the CLASSPATH so I've posted beta 2 to fix the problem. I would appreciate it if people could test this in both Java 1.4 and 1.5 without having any parsers besides the JDK bundled parsers in their classpaths. There are some pretty skanky hacks behind the scenes to make this all work seamlessly between different parsers and JVM versions, and I've stumbled across at least one major bug in JDK 1.5 while implementing this. (Sun promises me the bug will be fixed in their next drop. For the moment, XOM uses a Sun-suggested work-around so it should still work until the underlying bug is fixed.)

While I was at it, I've used TagSoup to make the JavaDoc well-formed (possibly valid, I haven't checked) XHTML. tagsoup-0.9.7.jar is now bundled with the complete distributions. However it's only necessary to build XOM, not to run it.

None of the public APIs have changed in any way. All beta 1 and alpha code should still run without a recompile.


From the "I thought they were dead" department, I'm pleased to note that Netscape has released version 7.2 of their namesake web browser for Mac OS X, Linux, and Windows 98 and later. This release is derived from Mozilla 1.7 and features pop-up blocking, tabbed browsing, fit-to-screen image sizing, full-screen mode, table editing, Flash 7, and print preview. I haven't tested it, but assuming its XML support is similar to Mozilla's (and I have no reason to believe it's not) it should support XML, XHTML, CSS, XSLT, XLinks, and the XPointer element() and xpath1() schemes.

Tuesday, August 17, 2004

The IETF has published the proposed standard version of Uniform Resource Identifier (URI): Generic Syntax. This is a replacement for RFC 2396 that tries to actually be clear and consistent. When it comes to day-to-day use of URLs, this really doesn't change anything. However, the new spec should be a lot more helpful to authors of specifications that depend on URLs and programmers who write URL processing software. The inconsistencies and poor language in RFC 2396 have long been a thorn in the Web's side. They've led otherwise intelligent and agreeable people into flame fests on such subjects as whether a relative URL is really a URL or just a URL reference, arguments that often degenerate into, "I know the spec doesn't say that, but that's what it should say" or "That's what we meant it to say." Some of these problems have had real consequences for other specs. This new spec is much cleaner and should alleviate a lot of those issues. Update: Tim Bray pointed out that this is actually more akin to a W3C last call working draft than a proposed standard. It's not quite as far along in the process as I thought.

Monday, August 16, 2004

I'm very pleased to announce the first beta of XOM 1.0, my tree-based API for processing XML with Java. XOM emphasizes correctness, ease-of-use, and performance, in that order. It can also process documents in a streaming mode that handles documents larger than available memory. Features include XML canonicalization, XInclude resolution, and XSL transformation. XPath queries are sadly lacking. :-(

You should note that my criteria for entering beta are much stricter than most. In particular, unlike some companies who shall remain nameless but who release major operating systems with thousands of known bugs, XOM Beta 1 has exactly zero known bugs. It is feature complete.

XOM has been tested in both client and server-side environments. It has been tested with a number of parsers including those bundled with Java 1.4 and 1.5. The internals of XOM have some pretty nasty hacks to work around various bugs in third party libraries it depends on, sometimes including the JDK itself. I'm fairly confident it works as expected when bundled with the latest versions of Xerces and Xalan; and, aside from a few corner cases, works pretty damn well with earlier versions of these products and with other parsers and transformation engines too. (The XOM unit tests are very tough. They have exposed numerous bugs in underlying libraries over the years. I picked Xerces and Xalan because the Apache Foundation has fixed the bugs I found while working on XOM. All other parsers and transformation engines I've tested still have open bugs. Where possible, I've included code in XOM to work around known bugs, especially for processors bundled with JDKs, but it's not always been possible to do so.)

I now believe XOM to be ready for production use. Of course, I could be wrong about that, which is why there's a beta cycle. I encourage you to try out this library, and let me know of any issues that arise, be they bugs, missing documentation, or weak performance. However, omitted functionality will probably have to wait for 1.1 to be filled in. Unless new bugs are uncovered, this may be the one and only beta release. All that's left on my TODO list before final release is finishing the documentation and doing some minor code clean-ups. These include such housekeeping tasks as splitting long lines, spell checking the comments, and making sure the Javadoc is all valid XHTML. It may take a few months before I get all of this done, but none of it should have any effect on client code.

Beta 1 makes no backwards incompatible changes to the published API. Bug fixes since the final alpha include:

  • The XInclude test suite is loaded and run from the W3C CVS server if it's not installed locally. Mistakes in the test suite (mostly involving document type declarations) are corrected on the fly.
  • Work-arounds for various JDK bugs that prevent round-tripping of some characters in Japanese encodings
  • Work-arounds for bugs in some versions of Xalan, as well as for bugs in the OASIS XSLT conformance test suite.
  • Improved compatibility with Java 1.5

XOM is published under the LGPL. Java 1.2 or later is required.

Sunday, August 15, 2004

The W3C SVG Working Group has posted the last call working draft of Mobile SVG Profile: SVG Tiny, Version 1.2. "This document defines a mobile profile of SVG 1.2. The profile is called SVG Tiny 1.2 and is defined to be suitable for cellphones."

Saturday, August 14, 2004

John Cowan has updated TagSoup, his Java-language SAX parser for nasty, ugly HTML, to version 0.9.7. This is a bug fix release that improves compatibility with XOM. "In addition, the new 'bogons-empty' feature lets you control whether a non-HTML element gets a content model of EMPTY (as previously) or ANY." (Actually I think he did this in 0.9.6, but I never got an announcement about that version.)


Eric S. Raymond has released version 1.13 of doclifter, an open source tool that transcodes {n,t,g}roff documentation to DocBook. He claims the "result is usable without further hand-hacking about 95% of the time." This release fixes bugs. Doclifter is written in Python, and requires Python 2.2a1. doclifter is published under the GPL.


Tatu Saloranta has posted WoodStox 0.9, a free-as-in-speech (GPL) XML processor written in Java that implements the StAX API. "StAX specifies interface for standard J2ME 'pull-parsers' (as opposed to "push parser" like SAX API ones); at high-level StAX specifies 2 types (iterator and event based) readers and writers that used to access and output XML documents."


Apple has released Safari 1.0.3 for Mac OS X 10.2 Jaguar, a web browser based on the KHTML rendering engine. Safari supports direct display of XML documents with CSS stylesheets but does not support XSLT. This release "improves the Safari rendering engine to expand 3rd party application support and delivers the latest security enhancements." You can get it through Software Update or download it from Apple's web site. I'm glad to see Apple hasn't completely abandoned users on not quite the latest release of their operating system. I just wish the same was true for their Java development team.

Friday, August 13, 2004

The IETF has posted the last call working draft of Internationalized Resource Identifiers (IRIs). "An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources." In other words this lets you write URLs that use non-ASCII characters such as http://www.libération.fr/. The non-ASCII characters would be converted to a genuine URI using hexadecimally escaped UTF-8. For instance, http://www.libération.fr/ becomes http://www.lib%C3%A9ration.fr/. There's also an alternative, more complicated syntax to be used when the DNS doesn't allow percent escaped domain names. However, the other parts of the IRI (fragment ID, path, scheme, etc.) always use percent escaping. Comments are due by September 8.
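The percent-escaping step can be sketched in a few lines of Python. This is a minimal illustration of escaping non-ASCII characters as UTF-8 bytes, not an implementation of the draft: the function name iri_to_uri is made up, normalizing to NFC first is an assumption of this sketch, and the alternative domain name syntax is not handled.

```python
import unicodedata

def iri_to_uri(iri: str) -> str:
    """Replace each non-ASCII character with the %XX escapes of its UTF-8 bytes."""
    # Assumption of this sketch: compose characters (NFC) first, so that
    # "e" + combining acute accent is escaped the same way as a precomposed e-acute.
    iri = unicodedata.normalize("NFC", iri)
    pieces = []
    for ch in iri:
        if ord(ch) < 128:
            pieces.append(ch)  # ASCII characters pass through unchanged
        else:
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)

print(iri_to_uri("http://www.lib\u00e9ration.fr/"))  # http://www.lib%C3%A9ration.fr/
```

The é (U+00E9) is two bytes in UTF-8, C3 and A9, which is where the two escapes in the example come from.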


Steve Cheng has posted docbook2X 0.8.5, an open source package for Unix that converts DocBook files to man pages and Texinfo. This release fixes bugs.

Thursday, August 12, 2004

Toni Uusitalo has posted Parsifal 0.8.3, a minimal, non-validating XML parser written in ANSI C. The API is based on SAX2. Version 0.8.3 speeds up the parser and fixes bugs. Parsifal is in the public domain.

Wednesday, August 11, 2004

Hot diggety dog! IBM and Novell are teaming up to add XForms support to Mozilla! If I were Microsoft, I'd be very, very worried right now.


More immediately, IBM has updated their XML Security Suite with "XPath Filter 2.0 support and lots of bug fixes." This is a Java class library that supports XML encryption, digital signatures, and canonicalization. It's officially supported on Windows and Linux but will probably run on other platforms. It's mostly a programmer tool. The user interface ranges from poor to non-existent.

Tuesday, August 10, 2004

Version 1.95.8 of expat, an open source XML parser written in C has been released. This release adds support for suspending and resuming the parser in mid-parse. This strikes me as an incredibly useful and unique feature that I can see a lot of uses for. A lot of things I've tried to do over the years would have been a lot easier if I had a way to stop the parser at a certain place. Various small bugs have also been fixed.


Speaking of expat, Late Night Software has released version 2.7 of its free-beer, expat based XML Tools AppleScript scripting addition. This release fixes a number of conformance bugs in earlier versions. Mac OS 8.5/AppleScript 1.3 or later are required. I'm told this is Mac OS X native, so maybe I was using it wrong before when it launched Classic on me? I need to explore this further.

Monday, August 9, 2004

I've posted alpha 5 of XOM, my tree-based API for processing XML with Java. Beta status is not far away. Alpha 5 makes no backwards incompatible changes to the published API. Changes since the previous release include:

  • The ParsingException and ValidityException classes now have getURI() methods that return the URI of the document whose error caused the exception.

  • The test suite now runs the OASIS Microsoft and Xalan XSLT tests

  • Improved compatibility with Java 1.2

  • Improved compatibility with recent releases of Xalan, including those bundled with JDK 1.4.2_03 and later

I had a real revelation while talking about XOM at Extreme Markup Languages last week. As Larry Wall is fond of saying, "Easy things should be easy, and hard things should be possible." The revelation I had was that Wall's principle is almost tautological. What's easy and what's hard are functions of what the language allows and enables. A different language or API can make different tasks easy or hard. XOM is not designed around Wall's principle. Instead it follows what I autonymously call Harold's principle: The right things should be easy, and the wrong things should be difficult to impossible. In other words, XOM is designed to make it easy to process XML as it was intended to be processed (e.g. using elements, attributes, and mixed content) and difficult to impossible to process using the wrong techniques (e.g. treating CDATA sections, attribute order, and encoding as semantically significant).
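
To make the "wrong things" concrete, here is a minimal sketch. It uses Python's standard library rather than XOM, purely as an assumption for illustration, and the sample documents are made up. Two documents that differ only in attribute order, quote style, and the use of a CDATA section come out identical once they are canonicalized, which is why code shouldn't treat those details as significant.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8 or later

# Hypothetical sample documents: same information, different surface syntax.
doc_a = '<price currency="USD" unit="each"><![CDATA[1 < 2]]></price>'
doc_b = "<price unit='each' currency='USD'>1 &lt; 2</price>"

# Canonical XML sorts attributes, normalizes quoting, and replaces
# CDATA sections with ordinary escaped text.
canonical_a = canonicalize(doc_a)
canonical_b = canonicalize(doc_b)

print(canonical_a == canonical_b)
```

Any program whose behavior changes between doc_a and doc_b is depending on something XML doesn't promise to preserve.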


The Jakarta Apache Project has released JXPath 1.2, a class library that "applies XPath expressions to graphs of objects of all kinds: JavaBeans, Maps, Servlet contexts, DOM etc, including mixtures thereof." According to the release notes:

Most changes in 1.2 are in the internal implementation and do not affect public APIs. However there are a couple of publicly visible features:

  • Namespace registration. Namespace URIs can now be associated with prefixes at the level of JXPathContext.
  • JXPathContext has two new convenience methods: selectNodes and selectSingleNode.
  • Type conversion is integrated with BeanUtils.

This release also includes countless bug fixes and implementation improvements.

Sunday, August 8, 2004

The Mozilla Project has released Mozilla 1.7.2 and Firefox 0.9.3 to fix several security vulnerabilities with varying levels of severity. All users should upgrade.

Friday, August 6, 2004

The final day of Extreme kicks off with Simon St. Laurent talking about "General parsed Entities: Unfinished Business". Simon says his is the only presentation looking at the layers below XML rather than building on top of XML. General parsed entities mostly worked in the early days when he was writing XML: A Primer. "Then I discovered that general parsed entities don't actually work" because parsers are allowed to ignore external general entities. "If James Clark wrote it, it must be right", he says, referring to expat and its lack of support for external general entities. He encountered problems in the real world when working with DocBook manuscripts at O'Reilly. "Then there's the SGML envy problem which afflicts anyone who uses XML for an extended period of time." SDATA isn't necessary but CDATA and subdocs are cool. XML provides enough rope to hang oneself, but not as comfortably as you could hang yourself with SGML.

XInclude parse="text" replaces CDATA in SGML. XInclude content negotiation is a good thing. He thinks there's a real problem with character entities though the W3C disagrees. The last five years have seen a steady march away from DTDs. "No one wants to open up the XML spec and start over. I'll propose doing so later, but I don't expect to be taken seriously." "Is there room for a general solution?" The solution for character entities is "dead simple". In general, he's presenting a system for resolving entities based on an XML instance document format defining the entities, and an extra processing layer.


The second session of the morning I switched rooms again to hear Walter Perry talk about Dealing with the Instance: Markup and Processing. According to the abstract,

The two practices which define our work are marking up instance texts and then processing those marked-up instances. A particular text might, or might not, be marked up within the confines of a particular vocabulary, schema or DTD, just as an instance text might or might not be processed within the constraints of a particular schematic, which might or might not be the schematic anticipated when the instance was marked up. Thus it is the instance which is crucial both to markup and to processing, and the schematic, if any, is not the primary subject of either. The implications of this premise will determine the future of our field and the applicability of our practices to discrete areas of expertise. In this presentation I intend to derive from this notion of instance-first a clear picture of what our practices must look like in order to carry out this premise and to fulfill its promise.

Yesterday during a coffee break, Walter told me he's been working on marking up Sanskrit grammars for the last six months. Sometimes he might as well be writing in Sanskrit for all that people understand him. His ideas are so radically nonconformant that nobody ever understands him the first time they hear him, or believes him when they finally do understand him. But he's basically right.

Talk begins. Markup and processing are both performed on the instance, not on the schematic. The instance is key. "The processor expects specific markup." It has to. If it can't find the markup it expects, it can't do its job. "Markup expects particular processing." "Processing expects particular markup." But expectations are not requirements. "It is possible (and useful) to break the expected correlation of markup and processing." This allows us to achieve specific expertise in the process. This is useful for fraud detection, audits, and generation of different forms of output.

Validation of output is more important than validation of input. It is more important to produce what you need than to receive what you need. Data structures belong to processes, not markup. Data structures are instantiated from the instance document and its markup. Expectations are generated by business processes, not by the documents and their markup. Bills of lading turn into bills of shipping by the action of a process. Each separate document should be appropriate to what it is (and what the expertise of the process that created it is), not what it expects to be turned into. "This instance centricity is the complement of REST."

Processes expect certain input. You can test instances that they comply to the document types that are expected as input. (Walter says, this is not quite the same as validation, though the difference is unclear to me.)

Someone in the audience is resisting the idea that one cannot reject invalid documents that nonetheless provide the actual information needed. This always happens at Walter's talks.

Markup delineates the author's reading of the document, but it loses intent. Different processes have different intentions.

He's telling a very interesting story about how standardized schemas eliminate the unique value and expertise of different organizations. I may have to transcribe this story later. This reminds me a lot of Joel Spolsky's calls to never start over from scratch. He says you should consolidate "like processes" (and only like processes).

  • Create a catalog of instance documents
  • Create a library of transforms to instantiate input to each process from every document consumed
  • Create a library of transforms to instantiate output for each useful document from the process

Processes are easily schematized and rationalized by consolidation.

What is net.syntags?


I'm editing this page manually, almost in realtime. I mentioned in the coffee break that I was having trouble keeping it well-formed. I don't want the DTD to force me to fill in end-tags and such before I'm ready; but it would be nice if BBEdit (or another editor) gave me a little icon somewhere that indicated the current status of the document, perhaps a green check mark for well-formed, a red X for malformed.
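
The check behind such an indicator is cheap. Here's a rough sketch of what an editor plugin might call on each pause in typing. It uses Python's expat binding; the function name is my own invention, not anything BBEdit provides.

```python
import xml.parsers.expat


def well_formedness_status(text):
    """Return (True, None) for well-formed input, else (False, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of the document.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        # str(err) includes the line and column of the first error found.
        return False, str(err)
    return True, None


ok, message = well_formedness_status("<p>An <em>unfinished</p>")
print("green check" if ok else "red X: " + message)
```

Only the first error is reported, which is all a status icon needs.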


Bruce Rosenblum (co-author Irina Golfman) is giving the final regular session on "Automated quality assurance for heuristic-based XML creation systems." Schemas aren't enough. Validity isn't enough. By show of hands, the vast majority of the authors at this conference used hand editing in emacs or equivalent to write their papers. He's looking at heuristic based, pattern based, and manual conversion from existing documents. In these cases we need quality assurance beyond simply validating against a DTD or a schema.

Some techniques like color coding help manual proofing a lot. But they want more automated checking. They want to run a test suite across the output of an automatic conversion. So far the techniques seem pretty obvious. This is nothing anybody accustomed to running automated test suites on software doesn't already know. They're just testing that their software that converts existing data to XML runs properly. Maybe this is news to some XML-developers (though I doubt it) but anybody doing extreme programming already knows this. Their test suite takes in the ballpark of 8-10 hours to run so they do overnight testing rather than continuous testing.


As is tradition, Michael Sperberg-McQueen delivers the closing keynote. The nominal title is "Runways, product differentiation, snap-together joints, airplane glue, and switches that really switch." The question is "Does XML have a model? A supermodel? Does it matter?"

He begins by talking about glottal stops and learning Danish to read secondary literature about Norse sagas. As an English speaker, he couldn't hear glottal stops. He doesn't understand the effort to drop models into XML any more than he could hear glottal stops. He feels like an alien. He doesn't know what people mean by models. He describes various meanings of the word model and decides they don't apply. Model trains, fashion models. One use of the word model is for things that are a simplification or theory of reality, possibly discredited (phlogiston model of fire, Rutherford model of the atom). Something has to be different between a model and the real thing. Useful models are simpler, or more familiar, or easier to calculate with.

He's suspicious of having a model. He wants different models at different times because not all models capture the same things. To understand the model of SGML you have to understand the difference between a document type declaration and a document type definition. He's bringing up Wilkins and Leibnitz's effort to design a perfect language that does not allow untruthful statements. (Shades of QuickSilver, The Confusion, and Daniel Waterhouse.) But we do not believe there is a perfect universal vocabulary. SQL is too constrained by a single model. Thus, "users of SQL miss XML, but users of XML don't miss SQL." The lack of a single data model is a strength of XML. The models are owned by the users, not by ISO or the W3C. "Go forth and become models."


The conference will take place next year in Montreal, probably in August, probably in the same hotel. Some of the papers seem to be online on the Mulberry Tech website.

Thursday, August 5, 2004

Day 3 of Extreme kicks off with the W3C's Liam Quin (who does win the weirdest hat award for the conference) talking about the status of binary XML, though he's careful not to use the phrase "binary XML" since that's a bit of an oxymoron.

The W3C recently published XML Binary Characterization Use Cases. I haven't had the time to read through all the use cases yet, but what I've skimmed through so far is not compelling. I'll probably read through this and respond when I get back to New York next week.

Last night, Liam told me the W3C has not decided whether or not this effort makes sense. They're worried that if they don't do anything, somebody else may or, worse yet, a hundred different people may do a hundred different things. It's not clear that one format can solve everybody's use cases, or even a majority of the use cases. However, I get the impression that the W3C is convinced that if one format would solve a lot of use cases, they'd like to do it.

This would be a horrible mistake. Of course, there are legitimate needs for binary data; but nothing is gained by polluting XML with binary data. Text readability is an important cornerstone of XML. It is a large part of what makes XML so interoperable. XML does what it does so well precisely because it is text. Adding binary data would simply bloat existing tools, while providing no benefit to existing applications. To the extent there's a need for binary data, what's needed are completely new formats optimized for different use cases. There is no call for a single uber-format that tries to be all things to all people, and likely ends up being nothing to no one.

C. Michael Sperberg-McQueen is talking about "Pulling the ladder up behind you", asking whether we're resisting a hostile takeover of the spec or pulling up the ladder behind us. He doesn't know the answer, but I do. We are resisting a hostile takeover. Other groups are free to build other ladders that fit their needs. He mentions that there was a binary SGML effort at one point. It's not clear how far it got. Anyone know what happened to it?

Liam Quin: "We're all sisters in the church of XML," and he's challenging the orthodoxy of that church. Once a week someone contacts him and says they need binary XML, but these people often don't know what they mean by "binary XML." Liam says they're not looking at an incompatible document format; just different ways to transfer the regular XML documents. (I disagree, and I think his statement's more than a little jesuitical. Most of the proposals clearly anticipate changing the infoset and APIs. If the W3C is ruling out such efforts, they should state that clearly.)

Various goals for binary XML:

  • "You don't know what you're doing or you're a looney," says the man wearing a blinking orange crown. :-)
  • Reduced bandwidth
  • Reduced storage
  • Faster to parse
  • Lower memory usage (especially in so far as it leads to increased battery life)
  • Random Access

Cell phones and other micro devices are an important use case. However, as I discussed with Liam last night, there's a question whether these will still be small enough for this to be a problem by the time any binary XML spec were finished. Someone in the audience mentions that at least one cell phone now has a gigabyte of memory (though that serves double duty as both disk and RAM). Moore's law may solve this problem faster than a spec can be agreed on and implemented, though.

The W3C is not interested in hardware-specific, time-sensitive, or schema-required formats. Encryption is not a goal.

Costs of binary XML:

  • Reduces interoperability
  • Harder to debug
  • Creates islands of binary goop
  • "Weakens religious dogma that all processors understand all documents"

"WAP stands for totally useless protocol."

Benefits:

  1. Embrace new communities: XML compatible structures can be used where XML is not used at present
  2. Embrace new technologies (e.g. graphics, maps)
  3. Open up information currently stored in proprietary binary formats

Me: 1 is disingenuous. Redefining XML as something new does not truly expand the universe of what's done with XML. 2 is done today with text XML. It's not clear binary XML is necessary. There are some size issues here for some documents, but these use cases are very well addressed by gzip. 3 is a very good idea, but has nothing to do with XML. It could be done with a non-XML format.
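
The gzip point is easy to try. This is a minimal sketch with a made-up, highly repetitive document of the kind maps and graphics tend to produce; the element names and sizes are assumptions for illustration, and real documents will compress by different amounts.

```python
import gzip


def build_sample(n):
    """Build a hypothetical point list as UTF-8 encoded XML text."""
    rows = ['<point x="%d" y="%d"/>' % (i, i * 2) for i in range(n)]
    return ("<points>" + "".join(rows) + "</points>").encode("utf-8")


raw = build_sample(1000)
packed = gzip.compress(raw)

# The round trip is lossless: the receiver gets ordinary text XML back.
assert gzip.decompress(packed) == raw
print(len(raw), "bytes as text,", len(packed), "bytes gzipped")
```

Nothing about the infoset or the APIs changes. The compression is purely a transport concern.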

Liam: The question is not whether we have binary XML-like formats or not. The question is whether we have many or few. Me: It sounds like there's a real control issue here. The W3C gets very nervous with the idea that somebody besides them might define this. Even if they decide it's a bad idea, they'd rather be the one to write the spec than letting someone else do it. They do say a lot of people have come to them asking for this. They did not initiate it.

Side note: In Q&A Liam misunderstands web services. He sees it as an alternative to CORBA/IDL rather than an alternative to REST, and therefore thinks it's a good thing.

This has been a few quick notes typed as I listened. I'll have more to say about this here in a week or so.


The second, less controversial topic of the morning is Microsoft's Matthew Fuchs' talk on "Achieving extensibility and reuse for XSLT 2.0 stylesheets". He's a self-avowed advocate of object orientation in XML. He's going to talk about how to use the OO features of the W3C XML Schema Language in conjunction with XSLT2 (especially XPath 2.0's element() and attribute() functions) to achieve extensibility in the face of unexpected changes.


Stephan Kesper is presenting a simple proof that XSLT and XQuery are Turing complete using μ-recursive functions. There is no concept known that is more powerful for computation than a Turing machine. It is believed but not proved that no more powerful concept of computation exists. On the other hand, what exactly a "concept of computation" means is not precisely specified, so this is uncertain.

His technique requires stacks, which he implements using concat(), substring-before(), and substring-after(). Hmm, that seems a bit surprising. I don't think of these functions as a core part of XSLT. I wonder if the proof could be rewritten without using these functions? Kesper thinks so, but the proof would be much more complex.
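
To see how three string functions can yield a stack, here's a sketch that mimics the XPath 1.0 functions in Python. The delimiter and the push/top/pop encoding are my own assumptions. I haven't seen Kesper's actual encoding, so don't read this as his construction.

```python
SEP = "|"  # assumed delimiter; stack items must not contain it


def concat(*parts):
    """Like XPath concat(): join the arguments into one string."""
    return "".join(parts)


def substring_before(s, sep):
    """Like XPath substring-before(): empty string if sep is absent."""
    i = s.find(sep)
    return s[:i] if i >= 0 else ""


def substring_after(s, sep):
    """Like XPath substring-after(): empty string if sep is absent."""
    i = s.find(sep)
    return s[i + len(sep):] if i >= 0 else ""


def push(stack, item):
    return concat(item, SEP, stack)


def top(stack):
    return substring_before(stack, SEP)


def pop(stack):
    return substring_after(stack, SEP)


stack = push(push("", "1"), "2")   # the string "2|1|"
print(top(stack), top(pop(stack)))
```

The whole stack is a single string value, so it can be passed as a parameter through recursive template calls.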

The XQuery proof is much simpler, because XQuery allows function definition.

He really hates XSLT's XML-based syntax. He much prefers XQuery. He thinks (incorrectly IMO) that XSLT is machine oriented and XQuery is more human oriented.

In Q&A, Norm Walsh thinks the proof depends on a bug in Saxon (Kesper disagrees). The question is whether it's legal to dynamically choose the template name to be called by xsl:call-template; i.e., is the name attribute of xsl:call-template an attribute value template or not? The audience can't decide. Hmm, looking at the spec, I think Norm is right, but Dimitre Novatchev may have presented a workaround for this last year so the flaw in the proof may be fixable.


I could follow most of the last paper, barely. This next one, "Balanced context-free grammars, hedge grammars and pushdown caterpillar automata", by Anne Brüggemann-Klein, (co-author Derrik Wood) may run right past my limits so take what I write with a grain of salt. She says the referees made the same comments. According to the abstract,

The XML community generally takes trees and hedges as the model for XML document instances and element content. In contrast, computer scientists like Berstel and Boasson have discussed XML documents in the framework of extended context-free grammar, modeling XML documents as Dyck strings and schemas as balanced grammars. How can these two models be brought closer together? We examine the close relationship between Dyck strings and hedges, observing that trees and hedges are higher level abstractions than are Dyck primes and Dyck strings. We then argue that hedge grammars are effectively identical to balanced grammars, and that balanced languages are identical to regular hedge languages, modulo encoding. From the close relationship between Dyck strings and hedges, we obtain a two-phase architecture for the parsing of balanced languages. This architecture is based on a caterpillar automaton with an additional pushdown stack.

She's actually only going to talk about hedges, and omit the Dyck strings to make the talk easier to digest. A hedge differs from a forest in that the individual trees are placed in order. In a forest, the trees have no specified order.
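
The distinction fits in a few lines of code. This is a minimal sketch using nested tuples for trees, a representation I picked for illustration and not anything from the paper. The same two trees compare unequal as hedges but equal as forests.

```python
from collections import Counter


def tree(label, *children):
    """A tree is a label plus an ordered tuple of child trees."""
    return (label, tuple(children))


first = (tree("a", tree("b")), tree("c"))
second = (tree("c"), tree("a", tree("b")))

# As hedges, the order of the top-level trees matters.
same_as_hedges = first == second

# As forests, only which trees are present (and how often) matters.
same_as_forests = Counter(first) == Counter(second)

print(same_as_hedges, same_as_forests)
```

Element content in XML is a hedge: swapping two sibling elements gives a different document.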

In DTDs element names and types are conflated.


I wish the conference put the papers online earlier. They should be up after the conference is over, but for these more technical talks it would be really helpful to be able to read the paper to clarify the points, not to mention I could link to them for people reading remotely who aren't actually here.


This conference has good wireless access. I've also recorded a few talks for later relistening using my laptop's built-in microphone. It makes me wonder if it might be possible to stream the conference out? I suspect the organizers might have something to say about that, but it would be interesting to webcast an entire conference in realtime, not just an occasional keynote.


The afternoon sessions split between the data-heads and the doc-heads. Since I have feet planted firmly in each camp, I had some tough choices to make but I decided to start with the docheads and TEI. Lou Burnard is talking about "Relaxing with Son of ODD, or What the TEI did Next" (co-author Sebastian Rahtz). According to the abstract,

The Text Encoding Initiative is using literate schema design, as instantiated in the completely redesigned ODD system, for production of the next edition of the TEI Guidelines. Key new aspects of the system include support of multiple schema languages; facilities for interoperability with other ontologies and vocabularies; and facilities for user customization and modularization (including a new web-based tool for schema generation). We'll try to explain the rationale behind the ongoing revision of the TEI Guidelines, how the new tools developed to go with it are taking shape, and describe the mechanics by which the new ODD system delivers its promised goals of customizability and extensibility, while still being a good citizen of a highly inter-operable digital world.

TEI v5 is a major update. They're using Perforce to manage the content management system, and are experimenting with switching to eXist. They chose RELAX NG for the schema language. "DTDs are not XML, and need specialist software." W3C XML Schema Language is much too complex. They use Trang to convert the RELAX NG schemas.

At this point, lunch began to disagree with me so I had to make a quick exit back to my room and missed the rest of the talk. :-( I really shouldn't have had that second dessert. But in the elevator on the way up, I did consider how many projects are switching to RELAX NG. TEI is just the latest victory. Others include DocBook, XHTML 2, and SVG.


Blaise Doughan of Oracle (co-author Donald Smith) is talking about Mapping Java Objects to XML and Relational Databases. Michael Sperberg-McQueen objects to classifying Oracle as a relational database in his introduction to the talk.

So far this looks like another W3C XML Schema Language based Java-XML data binding system. The difference from JAXB is that they use XPath to define the mappings rather than element names. A question from the audience points out that they've made the classic database-centric mistake of ignoring arity problems (i.e. what happens when an element contains multiple child elements of the same type/name). They do support position based mapping. The demo crashed.
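
Here is roughly what path-driven mapping looks like, including the arity question. This is a sketch in Python with ElementTree's limited XPath subset, not the Oracle product. The document, the Customer class, and the mapping table are all hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<customer>
  <name><first>Ada</first><last>Lovelace</last></name>
  <phone>555-0100</phone>
  <phone>555-0199</phone>
</customer>"""

# Hypothetical mapping: field name -> (path, whether many values are expected)
MAPPING = {
    "first_name": ("name/first", False),
    "last_name": ("name/last", False),
    "phones": ("phone", True),             # arity: collect every match
    "second_phone": ("phone[2]", False),   # position-based mapping
}


class Customer:
    pass


def unmarshal(xml_text, mapping):
    """Populate a Customer from the paths in the mapping table."""
    root = ET.fromstring(xml_text)
    obj = Customer()
    for field, (path, many) in mapping.items():
        if many:
            value = [el.text for el in root.findall(path)]
        else:
            el = root.find(path)
            value = el.text if el is not None else None
        setattr(obj, field, value)
    return obj


customer = unmarshal(DOC, MAPPING)
print(customer.first_name, customer.phones, customer.second_phone)
```

A mapping that only ever asks for a single match silently drops the second phone element, which is exactly the arity mistake the questioner raised.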


The final session of the day is Norm Walsh talking about "Extreme DocBook", particularly the redesign of DocBook in version 5.0 using RELAX NG. This is called DocBook NG. There's currently an experimental "Eaux de Vie" release. "DocBook is my hammer." DocBook has about 400 elements, about half of which have mixed content and almost none have simple content. It is a DTD for prose.

It was originally designed as an exchange format, but today it's pretty much just an authoring format. DocBook has grown by accretion. Thus, it's time to start over. Plus the DTD fails to capture all the constraints. They'll use RELAX NG. Trang can convert to W3C XML Schemas, but Trang isn't smart enough to produce the DTD.

RELAX NG patterns allow DocBook NG to combine elements (e.g. one info element instead of book.info, chapter.info, article.info, etc.) while still allowing different content models in different places. DocBook NG has only 356 elements.

He also likes RELAX NG's & connector for unordered content. Co-constraints are useful too. This all allows him to untangle the conflicting CALS and HTML table models. (Wouldn't namespaces be a better fix here?) A few elements such as pubdate benefit from RELAX NG data types (but this is arguable).

For what RELAX NG can't specify, they'll use a rule-based technology such as Schematron. This could implement an exclusion.

Traditionally the version of DocBook has been identified by the public ID. This no longer applies, so a version attribute is needed. This is also useful for referential integrity constraints, such as a footnoteref points to a footnote. They can embed Schematron in a RELAX NG schema.

Ease of use, including ease of subsetting and extending, is crucial. Migration issues are also important. XSLT helps. His XSLT stylesheet to convert covers about 94% of the test cases.

He doesn't think the W3C XML Schema language is the right abstraction for schemas that have lots of mixed content such as DocBook. John Cowan: "XML DTDs are a weak and non-conforming syntax for RELAX NG".

Wednesday, August 4, 2004

I've installed a server side spam filter. Hopefully this will bring my e-mail load down to a manageable level, even when I'm on a low-speed connection or accessing my e-mail via pine. On the flip side, even though it's set to a very conservative level, it's now more likely that I'll miss a few messages since it's much harder to check my spam folder on the server than on the client. If you send something important to me, and you expect a response and don't get it, it might be worth contacting me in some other way. It's funny how dependent we've become on e-mail. John Cowan and I are in the same building for a few days (a relatively rare event) and we're still trading e-mail with each other while tracking down a buggy interaction between XOM and TagSoup.


Murphy's Law strikes again. Five minutes after I wrote the last entry, I'm at the coffee break and John Cowan comes up to me and says, "I got your e-mail but I didn't understand it. What did you mean?" We straightened it out quickly. Sometimes it pays to be physically present. :-)


The second day of Extreme Markup Languages 2004 began with Ian Gorman's presentation on GXParse, about which I'm afraid I can't say too much because I was busy getting ready for my own presentation on XOM Design Principles which was, I'm happy to say, well received. The big question from the audience was whether Java 1.5 and generics changed any of this. The answer is that XOM needs to support Java 1.4 (and indeed 1.3 and 1.2) so generics are not really an option. If I were willing to require Java 1.5 for XOM, the answer might be different. Still it might not be because the lack of type safety in generics is a big problem.


Following the coffee break, B. Tommie Usdin is giving Steve DeRose's paper (Steve wasn't here for some reason) on Markup overlap: A review and a horse. It was very entertaining, even if Usdin and the audience didn't always agree on what DeRose was actually trying to say in his paper. This amusing session was followed by Wendell Piez (who is here) delivering his own talk on Half-steps toward LMNL (Layered Markup and Annotation Language).

The question in both of these papers (and a couple of earlier sessions I missed while in the other track) is how to handle overlapping markup such as

<para>David said, 
<quote>I tell you, I was nowhere near your house. 
I've never been to your house!
I don't know who took your cat. 
I don't even know what your cat looks like.
</para>
<para>
Why are you accusing me of this anyway? 
It's because you don't like my dog, isn't it? 
You've never liked my dog!
You're a dog hater!
</quote>. 
At that moment, David's bag began to roll around on the floor and meow.
</para>

Apparently, this sort of structure shows up frequently in Biblical studies.
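
One way to think about such structures is as named ranges over the text rather than as a tree, which is roughly the idea behind LMNL. Here is a minimal sketch that detects when two ranges overlap without nesting. The tuple layout and the shortened sample text are my own assumptions, not LMNL syntax.

```python
def overlaps(a, b):
    """True if two (name, start, end) ranges cross without nesting."""
    (_, s1, e1), (_, s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


# A shortened version of the example: two paragraphs and one quotation.
para1_text = "David said, I was nowhere near your house. "
para2_text = "You're a dog hater! David's bag began to meow."
text = para1_text + para2_text

para1 = ("para", 0, len(para1_text))
para2 = ("para", len(para1_text), len(text))
quote = ("quote", text.index("I was"), text.index("David's bag"))

print(overlaps(para1, quote), overlaps(para2, quote), overlaps(para1, para2))
```

The quote crosses both paragraph boundaries, so no single tree of elements can hold all three ranges.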


The afternoon begins with a session that looks quite interesting. Christian Siefkes is scheduled to talk about "A shallow algorithm for correcting nesting errors and other well-formedness violations in XML-like input." According to the abstract,

There are some special situations where it can be useful to repair well-formedness violations occurring in XML-like input. Examples from our own work include character-level and simple nesting errors, widowed tags, and missing root elements. We analyze the types of errors that can occur in XML-like input and present a shallow algorithm that fixes most of these errors, without requiring knowledge of a DTD or XML Schema.

I tried to do something like this with XOM and eventually decided it was just too unreliable. You couldn't be sure you were inserting the missing end-tags in the right places. I'm curious to see how or whether he's addressed this problem.

XML has "The most conservative approach to error handling I have ever heard of." The idea is to repair the errors at the generating side, not the receiving side, because different receivers might repair it differently. (Right away that's a difference with what I was trying in XOM.) XML-like input is input that is meant to be XML, but may be malformed. Possible errors include:

  • Unescaped < and &
  • Simple overlap errors where end-tags are in wrong order
  • Singleton start- or end-tags
  • Missing/multiple root elements

He's interested mostly in errors caused by programs that automatically add markup to existing documents by linguistic analysis on plain text. But also by errors caused by human authoring.

When fixing tags, need to choose heuristics. For instance, should the end-tag for a widowed start-tag be placed immediately after the start-tag or as far away as possible?

A multipass algorithm. The first pass tokenizes and fixes character errors such as unescaped < signs and missing quotes on attribute values. Second pass fills in missing tags. It's not always possible to do this perfectly. Overall, this looks mildly useful, but there's nothing really earth-shattering here.
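
For a feel of the two passes, here is a rough sketch. It is not Siefkes's algorithm: the regular expressions, the "close as early as possible" heuristic, and the wrapper element name are all my own assumptions. It also skips attribute quoting and will be fooled by a > inside an attribute value.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-:]*)((?:\s[^<>]*?)?)(/?)>")
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape(text):
    """Escape & that doesn't start a reference, and any leftover <."""
    return BARE_AMP.sub("&amp;", text).replace("<", "&lt;")


def tokenize(text):
    """Pass 1: split into (kind, name, raw) tokens, fixing character errors."""
    tokens, pos = [], 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            tokens.append(("text", None, escape(text[pos:m.start()])))
        closing, name, _attrs, empty = m.groups()
        kind = "end" if closing else ("empty" if empty else "start")
        tokens.append((kind, name, m.group(0)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", None, escape(text[pos:])))
    return tokens


def repair(text, wrapper="root"):
    """Pass 2: fix nesting, drop widowed end-tags, ensure a single root."""
    out, stack = [], []
    top_level_elements, top_level_text = 0, False
    for kind, name, raw in tokenize(text):
        if kind == "text":
            if not stack and raw.strip():
                top_level_text = True
            out.append(raw)
        elif kind == "end":
            if name not in stack:
                continue  # widowed end-tag: drop it
            while stack:  # close everything opened since the matching start
                open_name = stack.pop()
                out.append("</%s>" % open_name)
                if open_name == name:
                    break
        else:
            if not stack:
                top_level_elements += 1
            out.append(raw)
            if kind == "start":
                stack.append(name)
    while stack:  # widowed start-tags: close them at the very end
        out.append("</%s>" % stack.pop())
    result = "".join(out)
    if top_level_elements != 1 or top_level_text:
        result = "<%s>%s</%s>" % (wrapper, result, wrapper)
    return result


print(repair("<p>AT&T <b><i>bold</b></i> 1 < 2"))
```

Every choice here is a guess about what the author meant. A different receiver could reasonably guess differently, which is the argument for repairing on the generating side.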


For the second session of the afternoon, I swapped rooms away from a session that smelled of the semantic web into a more user-interface focused session delivered by Y. S. Kuo (co-authors N. C. Shih, Lendle Tseng, and Jaspher Wang) on "Avoiding syntactic violations in Forms-XML". This seems to be about some sort of XML editing forms toolkit.

Current XML editors provide a text view, a tree view, and/or a presentation view. He thinks only the presentation view is appropriate for end users. I tend to agree. He thinks narrative-focused XML editors are more mature than editors for record-like documents so he's going to concentrate on the latter. Syntactic constraints are independent of the user interface layout.


The final session of the day is a panel discussion of "Update on the Topic Map Reference Model." Hard as it is to believe, Patrick is not wearing the strangest hat at this conference.

Patrick as Gandalf

First up is Lars Marius Garshol. In the beginning there was no topic map model. Topic maps use XLinks. PMTM4, a graph based model with three kinds of nodes vs. the infoset based model. These two models were trying to do completely different things. Now there are two different models, Reference Model and Data Model. No user will ever interact with the model directly. The model is marketing machinery.

Now Patrick Durusau. "The goal is to define what it means to be a 'topic map', independent of implementation detail or data model concerns" so that users of different implementations can merge their topic maps. They want more people to join the mailing list and give input.

Tommie Usdin wants to know if when the panel sees something they always all agree on whether it is or is not a topic map. The answer appears to be no, they do not know one when they see it, at least if this requires agreeing on what is and isn't a topic map, even though there is an ISO standard that describes this.


John Cowan is delivering a nocturne on TagSoup, a SAX2 parser for ugly, nasty HTML. It processes HTML as it is, not as it should be. "Almost all data is ugly legacy data at any given time. Fashions change." However, this does not work with XHTML! Empty-element tags are not supported. TagSoup is driven by a custom schema in a custom language. It generates well-formed XML 1.0. It does not guarantee namespace well-formedness. TSaxon is a fork of Saxon 6.5.3 that bundles TagSoup. Simon St. Laurent: "It's nice when people give you the same crap over and over instead of different crap." Cowan demoed TagSoup over a bunch of nasty HTML people submitted on a poster over the last couple of days. It mostly worked, with one well-formedness error (an attribute named 4).

Tuesday, August 3, 2004

Extreme Markup Languages 2004 is set to get underway in about 45 minutes. It appears the wireless access does work in the conference rooms where I'm typing this so expect live reports throughout the day. As usual when I'm updating in close to real time, please excuse any spelling errors, typos, misstatements, etc. I also reserve the right to go back and rewrite, correct, and edit my remarks if I misquote, misunderstand, or misparaphrase somebody.


I'm pleased to see the conference has put out lots of power strips for laptops. And in a stroke of genius, they're all in the first three rows so people have to come to the front of the room to get power rather than hiding in the back. :-)


Over 2000 messages piled up in my INBOX since Saturday. 29 of them weren't spam. It didn't help that IBiblio recently decided to start accepting messages from servers that don't follow the SMTP specs correctly. Since I'm on my laptop I'm logging into the server and scanning the headers manually with pine, rather than using Eudora and my normal spam filters. This probably makes me more likely to accidentally delete important email, especially since pine occasionally confuses my terminal emulator when faced with the non-ASCII subject lines of a lot of spam, so I don't always see the right headers. If you're expecting a response from me, and don't get one, try resending the message next week.


I'm typing this on my PowerBook G4 running Mac OS X. This means a lot of the Mac OS 9 AppleScripts I use to do things like change the quote of the day and increment the day's news don't work. I expect to be upgrading my main work system to Mac OS X in the next month or two (depending on when Apple finally ships the G5 I ordered a month and a half ago) so I really need to update all these scripts to work with Mac OS X. Right now this is all done with regular expressions and baling wire. The only excuse I can offer for this is that these are legacy scripts that were originally developed for Cafe au Lait in the pre-XML years. However, the regular expressions package doesn't work in Mac OS X, and as long as I'm rewriting this anyway I might as well base it on XML. Does anyone know of a good XML parser for Mac OS X that's AppleScriptable? So far the only one I've found only runs in the classic environment. I could probably teach myself enough about Mac programming in Objective-C to write an AppleScript wrapper around Xerces, but surely someone's done this already?


B. Tommie Usdin is giving the keynote address, "Don't pull up the ladder behind you." From the abstract it sounds like she may be targeting me (among others):

The papers submitted for Extreme Markup Languages in 2004 were remarkable in a number of ways. There were more of them than we have ever received before, and their technical quality was, on the whole, better than we have received before. More were based on real-life experience and more used formalisms. Also new this year, however, was a tendency to identify, poorly characterize, and then attack a specification, technology, or approach seen as competition to the author’s. I saw evidence of the implicit assumption that there is limited space for XML applications and specifications and that success of one limits the opportunity for success for another. It is this attitude that leads to “specification poaching” and a variety of other anti-social behaviors that can in fact hurt us all. I believe that XML-space is in fact not limited any more than, for example, UNICODE-space is limited. We can and should learn from each other’s requirements, applications, and implementations. Pulling up that ladder after yourself in this environment hurts not only the people who want to come after you, it hurts you by limiting your access to new ideas, techniques, and opportunities.

I definitely do target some other technologies in my presentation on XOM. However, it's not because I think there's only room for one spec in a space. It's because I think some of the competitors are actively harmful in a number of ways. 20 competing APIs may be a good thing, but only if all 20 of the APIs are good when considered in isolation. One of the nice things about having a plethora of APIs is we can differentiate the good from the bad, and throw away the bad ones. When there's only one API we're stuck with it, and we may not even realize how bad it is (*cough* DOM *cough*). 15 bad APIs and 5 good ones just confuse everyone and cost productivity. It's better to pick and choose only the technologies that actually work. But I'll get my own 45 minutes to argue this point tomorrow. :-)


Usdin is talking now. She says there were some problems with MathML and internal DTD subsets in the paper submissions, plus one person who insisted that XML was only for bank transactions and it was ridiculous to write conference papers in anything other than Word. MathML may be ready for prime time but is not yet ready for amateur use. She believes in shared document models and doesn't like when people make up their own markup, even when the document submitted is in fact valid. In other words, she doesn't like internal DTD subsets because they really break the whole notion of validity. You can't easily check the validity of a document against the external DTD subset without using the internal DTD subset too. Me: Interesting. I've known the internal DTD subset was a major source of problems for parser implementers for a long time, but it seems to be a problem for users, even experienced users, too. More evidence that XML made the wrong cut on how parsers behave. Parsers should be able to ignore the internal DTD subset, but the spec doesn't allow this.
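A small illustration of why a parser can't just skip the internal DTD subset: the document's content can depend on it. The sample document and the entity name conf below are made up for this sketch, and Python's ElementTree (which sits on expat) stands in for any conforming parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical submission that defines its own entity in the internal
# DTD subset and then uses it in the content.
DOC = """<!DOCTYPE paper [
  <!ENTITY conf "Extreme Markup Languages 2004">
]>
<paper><title>Notes from &conf;</title></paper>"""

root = ET.fromstring(DOC)
# The entity reference is resolved from the internal subset.
title = root.findtext("title")

# The same content without the internal subset is not even well-formed,
# because &conf; is then an undeclared entity.
without_subset = "<paper><title>Notes from &conf;</title></paper>"
try:
    ET.fromstring(without_subset)
    parsed_without_subset = True
except ET.ParseError:
    parsed_without_subset = False
```

A parser that ignored the declarations would either lose text or have to reject the document, which is roughly the bind the spec puts implementers in.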

Now she's talking about pulling up the ladder behind us. She seems mostly focused on different XML applications (vocabularies) rather than libraries and APIs.

  • RELAX NG vs. DTDs vs. schemas
  • XLink vs. link tags
  • The regular XML model for narrative text
  • Authorities (W3C, OASIS, etc.) require compatibility with existing specs for new specs, even when the requirements for the new specs don't exactly match the requirements for the old specs. e.g. coffee growers vs. coffee consumers
  • XSL-FO is weakened by requirement to be compatible with CSS
  • Hijacking of spec
  • "Complete" specifications pull up the ladder behind them.
  • Refrain from gratuitous standardization

She compares this to Unicode. Unicode doesn't try to stop other people from using Unicode. I note that Unicode does actively try to stop competing character encodings though. Overall though, it sounds like I might have misguessed where this talk is going. Her objection is to overspecifying the applications of XML, not the core technologies on which these applications build.


James David Mason (co-author Thomas M. Insalaco) of the Y-12 National Security Complex is talking about "Navigating the production maze: The Topic-Mapped enterprise." According to the abstract, "A manufacturing enterprise is an intricate web of links among products, their components, their materials, and the facilities and trained staff needed to turn materials into components and completed products. The Y-12 National Security Complex, owned by the U.S. Department of Energy, has a rather specialized product line, but its problems are typical of large-scale manufacturing in many high-tech industries. A self-contained microcosm of manufacturing, Y-12 has foundries, rolling mills, chemical processing facilities, machine shops, and assembly lines. Faced with the need to maintain products that may have been built decades ago, Y-12 needs to develop information-management capabilities that can find out rapidly for example, what tools are needed to make parts for a particular product or, if tools are replaced, which products will be affected. A topic map, still in preliminary development, reflects a portion of Y-12's web of relationships by treating the products in detail, the component flows, and the facilities and tools available. This topic map has already taught us new things about Y-12. We hope to extend the topic map by including other operational aspects, such as the skills and staffing levels necessary to operate its various processes."

Legacy issues are a big deal for them. They need to be able to resume production on very old equipment with a 36 month lead time. They need to be able to rebuild anything, even though original suppliers like DuPont may not make the components anymore. "Fogbank is a real material, but I can't tell you what it is." "They'd shoot me if I told you what Fogbank is." (I suspect it's made of highly enriched uranium, though. It also appears to be producing lots of tetrachloroethylene and K-40. Ain't Google wonderful?) Topic maps integrate the data. They help people understand the data, and help find errors in the existing data. Naming issues (e.g. oralloy vs. enriched uranium) are a big deal. They need to be able to unite topics that go by different names. Access to the data is an as yet unsolved issue, and needs to be added to the ontology. They want to show different associations depending on credentials.

He's used XSLT on HTML tables generated by Excel to create topic maps (among other things).


Next up are Duane Degler and Renee Lewis talking about "Maintaining ontology implementations: The art of listening." They do government work too. They're giving an example of shopping for a car. They seem to be interpreting the inability to find what they want as a metadata problem. I suspect it's more of a user interaction problem. Specifically, web sites tend to be designed according to the mental model of the publisher, rather than the mental model of the reader. This is why you can search for cars by model number but not by four-seat convertible. Auto dealers think about model numbers. Consumers think about features. But they don't believe that. They want to listen to the users' questions to evolve the language and refine the topic maps. Specifically they want to monitor search requests, Q&A requests, threaded discussions, etc. Lewis says, "Topic map versioning we put on the back burner because it looked big and ugly." It sounds like a good idea overall, but it's a little light on the practical specifics. User buy-in and management approval are a big problem.


The afternoon sessions split into two tracks. I'm skipping topic maps for Bryan Thompson talking about "Server-Side XPointer" (co-authors: Graham Moore, Bijan Parsia, and Bradley R. Bebee). He wants to relate core REST principles to linking and document interchange. RESTful interfaces should be opaque. You shouldn't be able to see what's behind the URI. Generic semantics based on request method and response codes.

POST to create a resource and PUT to update one?!? That sounds backwards. XPointer is not just XPath. It is extensible to new schemes. Clients don't need to understand these schemes! It's enough that the server understand them! Now that he's said it, it's obvious. I don't know why I didn't see this before. It makes XPointers much more useful. Use range-unit field in HTTP header to say "xpointer". The server responds with 206 Partial Content. Hmm, this does require some client support. He doesn't like the XML data model and XPath for addressing. These aren't extensible enough. We need to use more application specific addressing schemes and content negotiation.
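To make the request side concrete, here is a minimal sketch of what a client might send. The exact header syntax is an assumption based on the general HTTP form `Range: unit=range`; the URL, the helper names, and the xpath(...) pointer are hypothetical. Nothing is sent over the network. The point is that the client only has to pass the pointer along as an opaque string and check for 206; it never has to understand the scheme.

```python
from urllib.request import Request


def xpointer_request(url, pointer):
    """Build a GET request that asks the server to resolve an XPointer.

    The pointer goes in the Range header with "xpointer" as the range
    unit (assumed syntax). The client treats it as an opaque string.
    """
    return Request(url, headers={"Range": f"xpointer={pointer}"}, method="GET")


def describe_response(status):
    """Interpret the status codes that matter for a range request."""
    if status == 206:
        return "partial content: the server resolved the pointer"
    if status == 200:
        return "full document: the server ignored the Range header"
    return f"unexpected status {status}"


# Hypothetical resource and pointer.
req = xpointer_request("http://example.org/catalog.xml", "xpath(//item[1])")
range_header = req.get_header("Range")
```

A server that doesn't recognize the range unit is free to ignore the header and return the whole document with 200, so the client-side fallback is just ordinary processing of the full resource.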


The next talk has Sebastian Schaffert giving a "Practical Introduction to Xcerpt" (co-author François Bry). He says current query languages intertwine querying with document construction which makes them unsuitable for reasoning with meta-data, a la the Semantic Web. He uses an alternate syntax called "data terms" that is less verbose than XML, and not a strict isomorphism because they want to research alternate possibilities.


Jeremy J. Carroll (co-author Patrick Stickler) is talking about RDF Triples in XML (TriX). RDF can be serialized in XML but is not XML, and is not about markup. The graph should be obvious, more N-triples than RDF/XML. It uses uri, plainLiteral, id, and typedLiteral elements, all of which contain plain text. A triple element contains a subject, predicate and object. (Order matters.) The predicate must be a uri, but the subject can be a literal. A graph contains triples and can be named with a uri. A trix can contain multiple graphs. Interesting idea: XSL transform can be applied before RDF validation to transform document into the form to be validated. That's it. This seems much simpler than RDF/XML.
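A rough sketch of the structure described above, built and read back with Python's ElementTree. The element names graph, triple, uri, and plainLiteral come from the talk; the root element name TriX and the namespace URI are written from memory and should be treated as assumptions, as should the example URIs and the helper names.

```python
import xml.etree.ElementTree as ET

# Assumed TriX namespace; check it against the TriX paper before relying on it.
TRIX_NS = "http://www.w3.org/2004/03/trix/trix-1/"


def q(local_name):
    """Qualify a local name with the (assumed) TriX namespace."""
    return f"{{{TRIX_NS}}}{local_name}"


def build_trix(graph_name, triples):
    """Build a document holding one named graph.

    Each triple is three (kind, text) pairs in subject, predicate,
    object order, where kind is uri, plainLiteral, id or typedLiteral.
    """
    root = ET.Element(q("TriX"))
    graph = ET.SubElement(root, q("graph"))
    ET.SubElement(graph, q("uri")).text = graph_name
    for triple in triples:
        element = ET.SubElement(graph, q("triple"))
        for kind, text in triple:
            ET.SubElement(element, q(kind)).text = text
    return root


def read_triples(root):
    """Return every triple as a tuple of (kind, text) pairs, in order."""
    found = []
    for element in root.iter(q("triple")):
        found.append(tuple((child.tag.split("}")[1], child.text) for child in element))
    return found


triples = [(
    ("uri", "http://example.org/TagSoup"),
    ("uri", "http://example.org/terms/author"),
    ("plainLiteral", "John Cowan"),
)]
document = build_trix("http://example.org/graph1", triples)
serialized = ET.tostring(document, encoding="unicode")
round_tripped = read_triples(ET.fromstring(serialized))
```

Because order carries the subject/predicate/object roles, reading the graph back is just a walk over child elements, with none of the striping rules RDF/XML needs.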


In the final regular session of the day, Eric Miller (co-author C. M. Sperberg-McQueen) is talking about "On mapping from colloquial XML to RDF using XSLT". This is about schema annotation. Two kinds of data integration:

  • Integrating data from DTD X into DTD Y
  • Creation of DTD-neutral information stores; i.e. RDF

He's talking about the second type. Apple's plist format is really nasty. But it works for Apple. Use XSLT to transform to what we want, but this data is too valuable to be sitting in make files, so put it in the schemas.

Monday, August 2, 2004

John Cowan has updated TagSoup, his Java-language SAX parser for nasty, ugly, short HTML, to version 0.9.5. This is a bug fix release.

Saturday, July 31, 2004

I'm in transit today (Saturday) between New York and Montreal, where I'll be at Extreme Markup Languages talking about XOM Design Principles and listening to others talk about various topics. I'm told the hotel has good wireless access so if it extends to the meeting rooms things should get interesting here again on Tuesday. In the meantime, if anyone else is arriving early for the conference (or just happens to live in or around Montreal) and would like to do some birding on Sunday or Monday, leave a message for me at the Hotel Europa. (You can try e-mail, but I can't promise I'll get it in time.) I've rented a car, and am going to try to hit some of the parks and other promising birding areas on the outskirts of the city.

Friday, July 30, 2004

Sun has posted the proposed final draft specification for JSR-226, Scalable 2D Vector Graphics API. "This API is targeted for low-end mobile devices with constraints in memory, screen size, and computational power. The goal of this specification is to define an optional API package for rendering Scalable 2D vector images, including external images in SVG format. The main target use cases of this API are map visualization, scalable icons and other applications which require scalable, animated graphics."


Altsoft N.V. has released Xml2PDF, a .NET-based XSL-FO and SVG formatting engine that converts XSL-FO and SVG documents to PDF. There are actually several versions of this with different prices:

  • A $49 GUI client for individual users
  • A $399 command line version intended for batch processing on the server
  • A $1599 .NET library version that can be embedded into custom applications, but apparently is not redistributable

Thursday, July 29, 2004

The W3C XQuery and XSLT Working Groups have updated five working drafts:

Changes in XPath 2.0 include:

  • xdt:untypedAny has been renamed xdt:untyped.

  • Value comparisons now return () if either operand is ().

  • The precedence of the cast and treat operators in the grammar has been increased.

  • The precedence of unary arithmetic operators has been increased.

  • A new component has been added to the static context: context item static type.

  • The specification now clearly distinguishes between "statically-known namespaces" (a static property of an expression) and "in-scope namespaces" (a dynamic property of an element).

  • XPath allows host languages to specify whether they recognize the concept of a namespace node. XQuery does not recognize namespace nodes. Instead, it recognizes an "in-scope namespaces" property of an element node.

Changes in XQuery include:

  • An ordering declaration has been added to the Prolog, which affects the ordering semantics of path expressions, FLWOR expressions, and union, intersect, and except expressions. In addition, ordered and unordered operators have been introduced that permit ordering semantics to be controlled at the expression level within a query.

  • Validation has been separated from construction. Validation now occurs only as a result of an explicit validate expression. Validation modes are strict and lax, and are specified on the validate expression. New construction modes strip and preserve have been defined and are declared in the Prolog. The notion of "validation context" has been deleted. The XQuery definition of validation has been converged with the definition used in XSLT.

  • User defined function overloading: that is, multiple user-defined functions can have the same name as long as they have different numbers of arguments.

  • Computed namespace constructors are now completely static and are allowed only inside a computed element constructor. Namespace declarations in a computed element constructor must come before the element content, and must consist entirely of literals. The namespace prefix is optional. If absent, it has the effect of setting the default namespace for elements and types within the scope of the constructed element.

  • The precedence of the cast and treat operators has increased.

  • The precedence of unary arithmetic operators has increased.

  • Variable initialization in the Prolog now uses an assignment operator (":="). Also, circularities in variable initialization are now considered to be static errors.

  • Module imports and schema imports now accept multiple location hints, representing multiple physical resources in the same module or schema.

  • CData Sections are no longer considered to be constructors, but are simply a notational convenience for embedding special characters in the content of an element or attribute constructor.

  • Three new components have been added to the static context: XQuery Flagger status, XQuery Static Flagger status, and context item static type.

  • An order by clause may now accept values of mixed type if they have a common type that is reachable by numeric promotion and/or moving up the type derivation hierarchy, and if this common type has a gt operator.

  • In element and document node constructors, if the content sequence contains a document node, that node is replaced by its children (this was previously treated as an error).

  • It is now implementation-defined whether undeclaration of namespace prefixes in an element constructor is supported. If supported, this feature conforms to the semantics of Namespaces 1.1. In other words, if an element constructor binds a namespace prefix to the zero-length string, any binding of that prefix defined at an outer level is suspended within the scope of the constructed element.

Wednesday, July 28, 2004

The Apache Web Services Project has posted version 0.3 of JaxMe 2, an open source implementation of the Java Architecture for XML Binding. Quoting from the web page,

JaxMe 2 is an open source implementation of JAXB, the specification for Java/XML binding.

A Java/XML binding compiler takes as input a schema description (in most cases an XML schema but it may be a DTD, a RelaxNG schema, a Java class inspected via reflection or a database schema). The output is a set of Java classes:

  • A Java bean class compatible with the schema description. (If the schema was obtained via Java reflection, then the original Java bean class.)
  • An unmarshaller that converts a conforming XML document into the equivalent Java bean.
  • Vice versa, a marshaller that converts the Java bean back into the original XML document.
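The bean, unmarshaller, and marshaller trio listed above can be sketched by hand to show what a binding compiler generates. This is a Python illustration only, not JaxMe output or the JAXB API; the Address class, its fields, and the element names are hypothetical stand-ins for what a schema would define.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Address:
    """Stand-in for the generated bean class (hypothetical schema)."""
    street: str
    city: str


def unmarshal(xml_text):
    """Convert a conforming XML document into the equivalent bean."""
    root = ET.fromstring(xml_text)
    return Address(street=root.findtext("street"), city=root.findtext("city"))


def marshal(bean):
    """Convert the bean back into an XML document."""
    root = ET.Element("address")
    ET.SubElement(root, "street").text = bean.street
    ET.SubElement(root, "city").text = bean.city
    return ET.tostring(root, encoding="unicode")


source = "<address><street>1 Main St</street><city>Montreal</city></address>"
bean = unmarshal(source)
round_trip = marshal(bean)
```

A real binding compiler writes this kind of code for every type in the schema, which is why the approach only pays off when the schema is stable.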

In the case of JaxMe, the generated classes may also

  • Store the Java bean into a database. Preferrably an XML database like eXist, Xindice, or Tamino, but it may also be a relational database like MySQL. (If the schema is sufficiently simple. :-)
  • Query the database for bean instances.
  • Implement an EJB entity or session bean with the same abilities.

In other words, by simply creating a schema and running the JaxMe binding compiler, you have automatically generated classes that implement the complete workflow of a typical web application:

Tuesday, July 27, 2004

The W3C Semantic Web Best Practices and Deployment Working Group has published two new working drafts, Representing Classes As Property Values on the Semantic Web and Defining N-ary Relations on the Semantic Web: Use With Individuals. The first "addresses the issue of using classes as property values in OWL. While OWL Full and RDF Schema do not put any restriction on using classes as property values, OWL DL and OWL Lite do not generally allow this use. The only property that can have a class as its value is rdf:type (and its subproperties). The document examines different approaches to representing this ontological pattern in OWL DL and discusses considerations that the users should keep in mind when choosing one of the approaches." The second describes how to handle non-binary relations. "In Semantic Web languages, such as RDF and OWL, a property is a binary relation; that is, it links two individuals or an individual and a value. How do we represent relations among more than two individuals? How do we represent properties of a relation, such as our certainty about it, severity or strength of a relation, relevance of a relation, and so on? The document presents ontology patterns for representing n-ary relations and discusses what users must consider when choosing these patterns." These are expected to become notes, and are not on the recommendation track.


Tim Bray has posted another beta of Genx, his pure C library for outputting canonical XML. This release plugs a memory leak. Haven't we learned by now that programmers should not be managing their own memory? I know there are perfectly good garbage collection libraries for C++. Are there any for plain vanilla C? Genx is published under the expat license. This is a very liberal, non-viral but GPL-compatible license.


In related news, Garrett Rooney has written GenX4r, a Ruby wrapper around Genx.


The W3C CSS working group has updated the working draft of CSS3 Speech Module. This spec defines CSS properties used when documents are read out loud. These include voice-volume, voice-balance, speak, pause-before, pause-after, pause, cue-before, cue-after, cue, voice-rate, voice-family, voice-pitch, voice-pitch-range, voice-stress, voice-duration, phonemes, and @phonetic-alphabet. This draft adds new mark-before, mark-after, and mark properties that attach named markers to the audio stream.

Monday, July 26, 2004

The W3C Privacy Activity has posted the third public working draft of the Platform for Privacy Preferences 1.1 (P3P1.1) Specification. "P3P 1.1 is based on the P3P 1.0 Recommendation and adds some features using the P3P 1.0 Extension mechanism. It also contains a new binding mechanism that can be used to bind policies for XML Applications beyond HTTP transactions." New features in P3P 1.1 include a mechanism to name and group statements together so user agents can organize the summary display of those policies and a generic means of binding P3P Policies to arbitrary XML to support XForms, WSDL, and other XML applications.

Sunday, July 25, 2004

The W3C Voice Browser Working Group has posted the Proposed Recommendation of the Speech Synthesis Markup Language Version 1.0. According to the abstract, the Speech Synthesis Markup Language "is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms."
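For a sense of what such markup looks like, here is a small sketch that builds an SSML document with Python's ElementTree and reads it back. The speak, prosody, and break elements and the namespace are from SSML 1.0 as I understand it; the sentence being spoken and the attribute values chosen are just examples.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize SSML elements in the default namespace (module-wide setting).
ET.register_namespace("", SSML_NS)


def q(local_name):
    """Qualify a local name with the SSML namespace."""
    return f"{{{SSML_NS}}}{local_name}"


speak = ET.Element(q("speak"), {"version": "1.0", XML_LANG: "en-US"})
speak.text = "The final session of the day is "

# Slow down and raise the pitch for the part that matters.
prosody = ET.SubElement(speak, q("prosody"), {"rate": "slow", "pitch": "high"})
prosody.text = "a panel discussion"

# Half a second of silence before the next sentence.
pause = ET.SubElement(speak, q("break"), {"time": "500ms"})
pause.tail = "Questions are welcome."

ssml = ET.tostring(speak, encoding="unicode")
reparsed = ET.fromstring(ssml)
```

The markup only expresses intent (slower, higher, pause here); how those hints are rendered is left to each synthesizer.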

Saturday, July 24, 2004

Tom Bradford has posted the first release candidate of dbXML 2.0, an open source native XML database written in Java and published under the GNU General Public License. New features in version 2.0 include:

  • Journaling transactions
  • XSLT transformations
  • Full text indexing and Full text querying
  • Pluggable security models
  • SSL connection support
  • JSP Tag Library

Java 1.4 is required. Changes since the last beta include reverting the paging system to standard Java I/O, UTF-8 support, and an improved administrator tool.


Dave Beckett has released the Raptor RDF Parser Toolkit 1.3.2, an open source C library for parsing the RDF/XML and N-Triples Resource Description Framework formats. It uses expat or libxml2 as the underlying XML parser. Version 1.3.2 fixes some bugs and makes some changes to the build process. Raptor is published under the LGPL.


John Krasnay has released Vex 1.0, an open source (LGPL) XML editor that features a word processor-like interface. Vex is based on the Eclipse platform. It supports DocBook 4.1.2, 4.2, 4.3, Simplified DocBook 1.0, and XHTML 1.0 Strict and can be configured for other DTDs.

Friday, July 23, 2004

Design Science has released MathPlayer 2.0, a free-as-in-beer Internet Explorer 6.0 plug-in for displaying MathML. Version 2.0 gives MathPlayer the ability to speak mathematical equations, as well as render them. It also adds support for the XHTML+MathML format currently supported by Mozilla and Netscape.

Thursday, July 22, 2004

The W3C XHTML working group has published the sixth public working draft of XHTML 2.0. XHTML 2.0 is the next, backwards incompatible version of HTML that incorporates XFrames, XForms, and lots of other crunchy XML goodness. However, XLink is not yet included and may never be. (The HTML Working Group are extreme XLink skeptics.) "This version includes an early implementation of XHTML 2.0 in RELAX NG [RELAXNG], but does not include the implementations in DTD or XML Schema form." (It's interesting that even the W3C working groups are starting to prefer RELAX NG.) This release adds:

  • XHTML Hypertext Attributes Module
  • XHTML I18N Attribute Module
  • XHTML Bi-directional Text Attribute Module
  • XHTML Edit Attributes Module
  • XHTML Embedding Attributes Module
  • XHTML Image Map Attributes Module

Also, the <hr /> element has been renamed <separator/>.


The Big Faceless Organization has released the Big Faceless Report Generator 1.1.20, a $1200 payware Java application for converting XML documents to PDF. Unlike most similar tools it appears to be based on HTML and CSS rather than XSL Formatting Objects. This is mostly a bug fix release but does add the ability to specify page sizes in fractions of a point. Java 1.2 or later is required.

Wednesday, July 21, 2004

I've posted the fourth alpha release of XOM 1.0, my free-as-in-speech library for processing XML with Java. While there are still some flaky parts in the internals to be worked out in the next release — there's a nasty bug involving FIXED default attribute values in the internal DTD subset, and the XSLT processing is not completely compatible with the latest versions of Xalan — the API is now considered to be complete and frozen. Code you write to XOM today should not require recompilation against any future 1.x release, and if a really major flaw is discovered in the API design, I'll try to provide a deprecation cycle first before removing the flawed methods. The API has not changed in a backwards incompatible way since 1.0d25. All code that ran with 1.0d25 should not require recompilation to work with 1.0a1. Internal and backwards compatible changes since alpha 1 include:

  • Nodes.remove(int) now returns the node removed.
  • The IBM virtual machine 1.4.1 is no longer special cased.
  • The API documentation has undergone extensive editing.
  • The unpublished nu.xom.xerces package has been removed.
  • Most methods have been made non-recursive, so they shouldn't cause stack overflows in deep documents.
  • The W3C XML Schema Language and WML and HTML DOMs have been removed from the bundled version of Xerces to save space.
  • XOM now uses character references only when necessary for all encodings supported by the local virtual machine. However, this may be quite a bit slower than the explicitly supported encodings like UTF-8 and the ISO-8859 character sets. Measurements remain to be performed.
  • URI verification and base URI resolution are now performed according to the RFC2396bis algorithm, rather than by using the Xerces and java.net URI classes.
  • The Builder class no longer sets any Java system properties for improved compatibility with applets and multiclassloader environments.
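The item above about using character references only when necessary is easy to see in miniature. This is a Python sketch of the idea, not XOM's serializer: the function name is made up, and escaping of markup characters such as < and & is deliberately left out. A character is written as a numeric character reference only if the target encoding can't represent it.

```python
def encode_text(text, encoding):
    """Encode text, falling back to numeric character references
    only for characters the target encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9 \u20ac5"  # e with acute accent, then a euro sign

as_ascii = encode_text(sample, "ascii")        # both need references
as_latin1 = encode_text(sample, "iso-8859-1")  # only the euro sign does
as_utf8 = encode_text(sample, "utf-8")         # neither does
```

Asking the encoder character by character is what makes this work for any encoding the runtime supports, and it is also the likely reason it may be slower than hand-written tables for the common encodings.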

I plan one more alpha release to fix the bugs with XSLT processing and FIXED ATTLIST declarations, and then it's on to beta. The beta releases will be pretty much feature complete and bug free, and focus mostly on finishing the documentation.

Tuesday, July 20, 2004

The W3C XML Schema Working Group has posted the first public drafts of XML Schema 1.1 Part 1: Structures and XML Schema 1.1 Part 2: Datatypes. According to the introduction to the structures spec,

The Working Group has two main goals for this version of W3C XML Schema:

  • Significant improvements in simplicity of design and clarity of exposition without loss of backward or forward compatibility;
  • Provision of support for versioning of XML languages defined using the XML Schema specification, including the XML transfer syntax for schemas itself.

These goals are slightly in tension with one another -- the following summarizes the Working Group's strategic guidelines for changes between versions 1.0 and 1.1:

  1. Add support for versioning (acknowledging that this may be slightly disruptive to the XML transfer syntax at the margins)
  2. Allow bug fixes (unless in specific cases we decide that the fix is too disruptive for a point release)
  3. Allow editorial changes
  4. Allow design cleanup to change behavior in edge cases
  5. Allow relatively non-disruptive changes to type hierarchy (to better support current and forthcoming international standards and W3C recommendations)
  6. Allow design cleanup to change component structure (changes to functionality restricted to edge cases)
  7. Do not allow any significant changes in functionality
  8. Do not allow any changes to XML transfer syntax except those required by version control hooks and bug fixes

The overall aim as regards compatibility is that

  • All schema documents conformant to version 1.0 of this specification should also conform to version 1.1, and should have the same validation behaviour across 1.0 and 1.1 implementations (except possibly in edge cases and in the details of the resulting PSVI);
  • The vast majority of schema documents conformant to version 1.1 of this specification should also conform to version 1.0, leaving aside any incompatibilities arising from support for versioning, and when they are conformant to version 1.0 (or are made conformant by the removal of versioning information), should have the same validation behaviour across 1.0 and 1.1 implementations (again except possibly in edge cases and in the details of the resulting PSVI);

Changes in the data type spec include:

  • Distinction between identity and equality; for instance positive and negative zero would be equal but not identical. Think of the difference between == and equals() in Java.
  • New yearMonthDuration and dayTimeDuration types
  • A precisionDecimal type
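The identity/equality distinction in the first item above can be shown with ordinary floating point numbers. This is a Python sketch of the idea rather than the schema spec's definitions; the function names are made up, and NaN is deliberately left out.

```python
import math


def equal(a, b):
    """Numeric equality: positive and negative zero compare equal."""
    return a == b


def identical(a, b):
    """Identity (sketch): equal values that also carry the same sign,
    so +0.0 and -0.0 are told apart. NaN is not handled here."""
    return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)


zeros_equal = equal(0.0, -0.0)
zeros_identical = identical(0.0, -0.0)
```

Here zeros_equal is True and zeros_identical is False, the same split the draft draws between the two zeros.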

There are also numerous open issues including how to align the W3C XML schema language with XML 1.1 and whether to allow a year 0.

Monday, July 19, 2004

The OASIS XSLT Conformance Working Group finally noticed that their test suite didn't actually contain any expected output, which rendered the entire suite pretty much useless. They never bothered to acknowledge or respond to my repeated queries on this issue, but the Mozilla Project managed to get their attention on this matter. They've posted a new version of the test suite that includes the expected output for the first time. However, it seems the catalog is still out of sync with the actual output files. That is, the catalog says you'll find the output somewhere like attribset/attribset01.out but you'll actually find it in REF_OUT/attribset/attribset01.out. And there are some tests like boolean_bool15, extend_ac137.xsl, namespace_nspc05, namespace_nspc24, expression_expr04, processorinfo_prop03, and most of the Xalan output and string test cases where the file names given in the catalog don't match the file names you'll find in the directories. There's at least one test case, mdocs_mdocsxxx, where either the files are missing or the test was never written or finished. Some of the Xalan string test cases found in the archive aren't actually referenced in the catalog (numbers 31 to 36). I'd report the bugs directly to the working group, but based on past experience using their online form or posting to the mailing list appears to drop the message into a black hole where it's never heard from again. I figure if I post it here, there's at least some chance someone in the working group might see it. Still, despite the bugs, I'm glad to see the test suite is moving forward again. For a long time, I was afraid they'd given up.
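This sort of catalog drift is easy to check mechanically. A minimal sketch, assuming the expected-output paths have already been pulled out of the catalog into a list (the catalog's own XML format isn't parsed here, and the function name and report strings are made up): for each path, see whether the file is where the catalog says, under REF_OUT instead, or nowhere.

```python
import tempfile
from pathlib import Path


def locate_outputs(suite_root, catalog_paths, fallback_dir="REF_OUT"):
    """Classify each expected-output path listed in the catalog."""
    suite_root = Path(suite_root)
    report = {}
    for relative in catalog_paths:
        if (suite_root / relative).is_file():
            report[relative] = "ok"
        elif (suite_root / fallback_dir / relative).is_file():
            report[relative] = f"found under {fallback_dir}"
        else:
            report[relative] = "missing"
    return report


# Recreate the situation described above in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    moved = Path(scratch) / "REF_OUT" / "attribset" / "attribset01.out"
    moved.parent.mkdir(parents=True)
    moved.write_text("<out/>")
    report = locate_outputs(
        scratch,
        ["attribset/attribset01.out", "mdocs/mdocsxxx.out"],
    )
```

Run against the real suite, a report like this would also flag the renamed files, since a name that matches nothing in either location shows up as missing.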

Sunday, July 18, 2004

The W3C XML Schema Working Group has published an updated working draft of XML Schema: Component Designators. This spec proposes a scheme for naming and identifying XML Schema components. Such components include:

  • Simple and complex type definitions
  • Attribute declarations
  • Element declarations
  • Attribute and model group definitions
  • Identity-constraint definitions
  • Notation declarations
  • Annotations
  • Model groups
  • Particles
  • Wildcards
  • Attribute uses
  • The master schema component representing the schema as a whole.
  • Facets

The goal is to be able to name, for example, the literallayout notation in the DocBook schema, as well as every other significant piece of the schema. These names could then be used as fragment identifiers in URI references that point to schemas. The draft gives these examples of the current syntax proposal:

schema-URI#xscd(/type(Items))
schema-URI#xscd(/type(Items)/item)
schema-URI#xscd(/type(Items)/item/type())
schema-URI#xscd(/type(Items)/item/productName)
schema-URI#xscd(/type(Items)/item/quantity)
schema-URI#xscd(/type(Items)/item/quantity/type())
schema-URI#xscd(/type(Items)/item/quantity/type()/facet(maxExclusive))
schema-URI#xscd(/type(Items)/item/USPrice)
schema-URI#xscd(/comment)
schema-URI#xscd(/type(Items)/item/shipDate)
schema-URI#xscd(/type(Items)/item/@partNum)
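
To make the shape of these designators concrete, here is a minimal Python sketch that splits one of them into the schema URI and its path steps. It assumes only the draft syntax shown in the examples above (a fragment of the form xscd(...) with steps separated by slashes); the function name and the example.com URL are invented for illustration, and this is not a full parser for the proposal.

```python
def split_designator(uri_ref: str):
    """Split 'schema-URI#xscd(/a/b)' into (schema URI, ['a', 'b']).

    Assumes the fragment is 'xscd(' + path + ')' and that steps are
    separated by '/'. Anything else is rejected.
    """
    base, _, fragment = uri_ref.partition("#")
    if not (fragment.startswith("xscd(") and fragment.endswith(")")):
        raise ValueError("not an xscd fragment: " + fragment)
    path = fragment[len("xscd("):-1]          # strip the xscd( ... ) wrapper
    steps = [step for step in path.split("/") if step]
    return base, steps

# Hypothetical schema location, used only for this example
base, steps = split_designator(
    "http://example.com/po.xsd#xscd(/type(Items)/item/@partNum)")
print(base, steps)
```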

Saturday, July 17, 2004

OpenOffice.org has posted snapshot m47 of OpenOffice 2.0, the open source word processor/spreadsheet/drawing program/presentation software that saves all its documents in zipped XML. "This snapshot identifies itself for the first time as 1.9.x to indicate that 2.0 release is not that far away." Among other new features in 2.0, HTML export should now produce valid 'XHTML 1.0 Strict' documents, and XHTML export will be enabled for Calc, Draw and Impress as well as Writer. Furthermore, "The expert user should be able to install XSLT-based input/output filters for StarOffice/OpenOffice.org. The average user should be able to use them seamlessly. For OOo2.0, XSL-transformations are also supported for the modules Writer/Web, Master Document, and Math." It's not immediately clear if this is true in the current milestone, however. Other new features include Word Count in selections, nested tables, and improved compatibility with Microsoft Office. So far only an English build is available though localized versions will be added closer to release.

Friday, July 16, 2004

Version 4.2 of the payware <Oxygen/> XML editor has been released. Oxygen supports XML, XSL, DTDs, and the W3C XML Schema Language. New features in version 4.2 include:

  • A Schema Model View
  • Content completion now shows the schema documentation
  • Contrast control to change the transparency levels for markup and text.
  • The ability to cancel transforms in progress
  • JSP 1.2 XML editing support

Oxygen requires Java 1.3 or later. It costs $128 with support. Upgrades from 4.x are free. Upgrades from previous versions are $76.

Thursday, July 15, 2004

OpenOffice.org 1.1.2, the open source word processor/spreadsheet/drawing program/presentation software that saves all its documents in zipped XML, has now been ported to Mac OS X. This is still based on X11 rather than Aqua, and I haven't tested it myself yet, but I'm told it's a major step forward in usability on the Mac. The press release claims, "This release has attained a maturity that enables it to be used as the default productivity suite by individuals and organizations alike. Not only is the release significantly faster and more robust than early versions, and available in numerous languages, but fonts are smooth, and overall integration into the Mac OS X desktop has been enhanced. For instance, there is clipboard functionality and, through a helper application, OpenOffice.org documents can be opened in OpenOffice.org 1.1.2 the usual Mac way, by double clicking on them." They're still looking for experienced Mac developers to help them port this to Aqua.

Wednesday, July 14, 2004

After a two-year absence, Yuval Oren has resurfaced and released Piccolo 1.0.4, a non-validating, open source SAX parser written in Java. In my initial tests, this version is much improved over 1.0.3 although there are still a few bugs where the wrong exception is thrown or ID and IDREF type attributes aren't normalized properly. I need to update and rerun my SAX conformance tests. The GNU JAXP folks have raised a couple of issues with some of the test cases out on the corners, so I'll need to fix those first. But in the meantime, I plugged Piccolo into XOM's unit test suite (all of which passes with Xerces 2.6.1 under JDK 1.4.2_02 on Linux) and there were 18 test failures. Still, that's a lot better than it did before. Besides bug fixes, this release changes the license from the LGPL to the Apache License 2.0.


Oracle has also released version 10.1.0.2.0 of the Oracle XML Developer's Kit for both Java and C++. This is available for AIX, HP-UX64 and LINUX. New features in this release include:

  • XSLT 2.0
  • JAXB Class Generator
  • DOM 3.0 Load & Save
  • DOM 3 Validation
  • JAXP 1.2 support allows XML schema validation using JAXP
  • New C++ XML APIs provide unified DOM support for XMLType
  • A new XMLSAXSerializer class

Registration is required to download it. After plugging the Oracle parser into XOM's test suite, there were 19 test failures. Some of them looked like things I could work around in XOM if it seemed important, particularly with respect to incorrect decodings of characters in certain ISO character sets.

Tuesday, July 13, 2004

The Mozilla Project has released Mozilla 1.7.1 and Firefox 0.9.2 to fix a serious Windows security vulnerability. Mac OS X, Linux, and other versions are not affected.


Version 1.1 of ButterflyXML, a free-as-in-speech (GPL) XML IDE, has been released. ButterflyXML is "built on top of a new real-time incremental XML parsing algorithm. The editor features syntax and error highlighting, incremental validation, code completion, XSLT pipelines, and side by side DOM and source viewing." This release adds an XSLT debugger, a DocBook editor and renderer, and an XSL-FO editor and renderer.


Nate Nielsen has released RTFX 0.9.2 (formerly RTFM), an open source (BSD license) tool for converting Rich Text Format (RTF) files into XML. "It majors on keeping meta data like style names, etc... rather than every bit of formatting. This makes it handy for converting RTF documents into a custom XML format (using XSL or an additional processing step)." Version 0.9.2 adds support for footnotes, superscripts, and subscripts and fixes various bugs.


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.8.4 exposes a SAX-like interface. DOMIT! 0.9.5 exposes an API based on the Document Object Model (DOM) Level 1. Both are published under the GPL. These are bug fix releases.

Monday, July 12, 2004

The W3C XQuery Working Group has updated XQuery 1.0 and XPath 2.0 Full-Text Use Cases and posted the first public working draft of XQuery 1.0 and XPath 2.0 Full-Text. According to the latter,

XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search that data using Information Retrieval techniques such as full-text search. Full-text search is different from substring search in many ways:

  1. A full-text search searches for phrases (a sequence of words) rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the phrase "lease" will not.

  2. There is an expectation that a full-text search will support language- and token-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a word with the same linguistic stem as "mouse" (finds "mouse" and "mice"). An example of a token-based search is "find me all the news items that contain the word "XML" within 3 words (tokens) of "Query".

  3. Full-text search is subject to the vageries and nuances of language. The results it returns are often of varying usefulness. When you search a web site for all cameras that cost less than $100, this is an exact search. There is a set of cameras that match this search, and a set that do not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for, say, all the news items that contain the word "mouse", you probably expect to find news items with the word "mice", and possibly "rodents" (or possibly "computers"!). But not all results are equal : some results are more "mousey" than others. Because full-text search can be inexact, we have the notion of score or relevance : we generally expect to see the most relevant results at the top of the results list. Of course, relevance is in the eye of the beholder. Note: as XQuery/XPath evolves, it may apply the notion of score to querying structured search. For example, when making travel plans or shopping for cameras, it is sometimes more useful to get an ordered list of near-matches. If XQuery/XPath defines a generalized inexact match, we assume that XQuery/XPath can utilize the scoring framework provided by the full-text language.

  4. As XML becomes mainstream, users expect to be able to store and search all their documents in XML. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT standard. SQL/MM-FT defines extensions to SQL to express full-text queries providing similar functionality as this full-text language extension to XQuery 1.0/XPath 2.0 does.

  5. Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.

  6. A word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which can contain any number of words.

  7. Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming).

  8. We use the namespace "ft" (for full-text) that corresponds to the URL http://www.w3.org/2004/07/xquery-full-text and defines the namespace of full-text search. We also use "fts" for definitional purposes in semantics Section.
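
The difference between substring search and token-based search described in points 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the tokenizer below is deliberately naive (the draft leaves tokenization implementation-defined), the function names are invented, and none of this is the XQuery Full-Text syntax.

```python
import re

def tokenize(text: str) -> list:
    """Naive tokenizer: lowercased runs of letters and digits.
    Real full-text tokenizers are implementation-defined and far richer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_match(text: str, needle: str) -> bool:
    """Plain substring search: matches inside words too."""
    return needle.lower() in text.lower()

def word_match(text: str, word: str) -> bool:
    """Token-based search: the whole word must appear as a token."""
    return word.lower() in tokenize(text)

def within(text: str, first: str, second: str, distance: int) -> bool:
    """True if `second` occurs within `distance` tokens of `first`."""
    tokens = tokenize(text)
    a = [i for i, t in enumerate(tokens) if t == first.lower()]
    b = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= distance for i in a for j in b)

item = "Foobar Corporation releases the 20.9 version"
print(substring_match(item, "lease"))   # substring search finds "lease" in "releases"
print(word_match(item, "lease"))        # token search does not
print(within("XML and XPath Query languages", "XML", "Query", 3))
```

Stemming ("mouse" finding "mice") and relevance scoring are the harder parts and are not attempted here.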

Thursday, July 8, 2004

The W3C Resource Description Framework (RDF) and Web Ontology Language (OWL) working groups have declared victory and gone home. The W3C has officially closed both working groups for having completed all their deliverables. The question now shifts to whether or not these technologies will actually achieve broad adoption.


JAPISoft has released EditiX 1.4.2, a $59 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.4 adds a new XSLT editor and makes various bug fixes and speed-ups. EditiX is available for Mac OS X, Linux, and Windows.

Wednesday, July 7, 2004

The W3C Technical Architecture Group (TAG) has updated the working draft of Architecture of the World Wide Web, First Edition. "Web architecture includes the definition of the information space in terms of identification and representation of its contents, and of the protocols that support the interaction of agents in an information system making use of the space. Web architecture is influenced by social requirements and software engineering principles. These lead to design choices and constraints on the behavior of systems that use the Web in order to achieve desired properties of the shared information space: efficiency, scalability, and the potential for indefinite growth across languages, cultures, and media. Good practice by agents in the system is also important to the success of the system. This document reflects the three bases of Web architecture: identification, interaction, and representation."

Tuesday, July 6, 2004

Antenna House, Inc has released XSL Formatter 3.1 Lite for Linux and Windows. This tool converts XSL-FO files to PDF. The lite version costs $300 and up on Windows and $900 and up on Linux/Unix, but is limited to 300 pages. Prices for the uncrippled version start around $1250 on Windows and $3000 on Linux/Unix.


The Gnome Project has released version 2.6.11 of libxml2, the open source XML C library for Gnome. This release doesn't add any new features but fixes lots of bugs in ancillary technologies like XInclude and schemas (the core XML parser seems fairly stable and bug free at this point) and makes various performance improvements. They've also released version 1.1.8 of libxslt, the GNOME XSLT library for C and C++. This is also a bug fix release.


Jim Rankin has commenced work on Excelsior! Path, an open source XPath library for Mac OS X's Cocoa API.

Monday, July 5, 2004

I think I've got the various servlets on elharo.com that drive the Fibonacci web services used as examples in Processing XML with Java running again. They went down a few months ago when security issues necessitated switching the server from Red Hat to a supported distribution. If you weren't able to connect over the last few months, please try them again and let me know if they work for you.

I'm now using Apache 2.0 and Tomcat 5. Tomcat 5 is a definite improvement over Tomcat 4 in several ways. It was much easier to install the servlets than it had ever been before. I just uploaded a .war file through a web interface and they worked immediately. However, the connection between Tomcat and Apache is still painful to set up, and the documentation ranges from poor to actively misleading. The Apache Project has a nasty habit of not considering decent documentation to be a prerequisite for a release.


Living Logic has released XIST 2.5, a Python library for generating XHTML. "XIST is also a DOM parser (built on top of SAX2) with a very simple and pythonesque tree API. Every XML element type corresponds to a Python class and these Python classes provide a conversion method to transform the XML tree (e.g. into HTML)."

Sunday, July 4, 2004

A new working draft of Streaming Transformations for XML (STX) has been posted. "Streaming Transformations for XML (STX) is a one-pass transformation language for XML documents that builds on the Simple API for XML (SAX). STX is intended as a high-speed, low memory consumption alternative to XSLT. Since it does not require the construction of an in-memory tree, it is suitable for use in resource constrained scenarios." This draft derives the STX data model from the XPath 2.0 data model, and derives STXPath from XPath 2.

Saturday, July 3, 2004

Andy Clark has posted version 0.9.3 of his CyberNeko Tools HTML Parser for the Xerces Native Interface (NekoXNI). This is mostly a bug fix release, but does add one class to return the current version of the library. He's also released version 0.1.11 of the CyberNeko DTD parser to add support for entity resolvers. CyberNeko is written in Java. Besides the HTML and DTD parsers, CyberNeko includes a generic XML pull parser, a RELAX NG validator, and a DTD to XML converter.

Friday, July 2, 2004

The Mozilla Project has released Firefox 0.9.1, an open source web browser for Windows, Mac OS X, and Linux that supports XML, XHTML, XSL, HTML, and CSS. Unlike the heavier weight Mozilla from which it is derived, this is just a browser; no e-mail client, newsreader, LDAP browser, or microwave oven is included. "This maintenance release provides a few updates based on user feedback - including changes to the Extension System and icon improvements."

Thursday, July 1, 2004

The W3C SVG Working Group has posted the third working draft of Mobile SVG Profiles: SVG Tiny and SVG Basic, Version 1.2. "This document defines a mobile profiles [sic] of SVG 1.2. The profile is called SVG Tiny 1.2 and is defined to be suitable for cellphones." The name of the spec is a little confusing, but "SVG Mobile 1.1 defined two profiles: SVG Tiny and and SVG Basic. The SVG Mobile 1.2 specification only defines one profile: SVG Tiny 1.2." Changes in this draft include support for text wrapping, a DOM subset, and zooming and panning.

Wednesday, June 30, 2004

Once again, I'll be chairing the XML track for Software Development 2005 West in Santa Clara next March. The Call for Proposals is now live. Besides XML, tracks include Web Services, Java, Emerging Technologies, C++, Requirements & Analysis, .NET, and Mobile Development. Plus we have two new tracks this year: System Security and Scripting. Most sessions are 90 minute classes, but we also have room for half and full-day tutorials (I prefer half-day tutorials in the XML track), Birds-of-a-feather sessions, and panels. Submissions are due by August 6th.

For the XML track, we're interested in practical sessions covering all aspects of XML. This is not specifically an XML show, so we tend to find that our audience responds better to more practical, how-to, basic sessions as opposed to more theoretical, high-level sessions. For instance, a simple introduction to XQuery would go over better than a detailed comparison of XQuery optimization techniques. One thing previous attendees have told us is that they'd like to see more new sessions at each show, so we're going to be looking preferentially for talks that have not previously been given at SD West.


Sonic Software has released Stylus Studio 5.3, a $395 payware XML editor for Windows. Features include:

  • XML differencing
  • XSLT debugging
  • XSLT mapping
  • XSLT profiling
  • XSL-FO
  • XQuery editing, mapping, and debugging.
  • XML Schema Editor
  • Document Type Definition (DTD) Editor
  • XPath Evaluator
  • XPath Expression Generator
  • Web Service Call Composer
  • UDDI Registry Browser
  • Tools for mapping to and from XML documents, Web service data, relational data, and flat files
  • Import/export utilities for RDBMS, XML, CSV, ADO, and flat files
  • JSP Editor

IBM's alphaWorks has updated the XML Schema Quality Checker. This program reads "an XML Schema written in the W3C XML schema language and diagnoses improper uses of the schema language." This update appears to be a very minor bug fix release.

Tuesday, June 29, 2004

JAPISoft has released EditiX 1.4, a $59 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.4 makes various small improvements here and there, including a "Surround CDATA Section" command. EditiX is available for Mac OS X, Linux, and Windows.

Monday, June 28, 2004

XimpleWare has posted VTD-XML 0.5, a free (GPL) non-extractive Java library for processing XML. According to the announcement,

Capable of random-access, VTD-XML attempts to be both memory efficient and high performance. The starting point of this project is the observation that, for XML documents that don't declare entities in DTD, tokenization can indeed be done by only recording the starting offset and length of a token.

The core technology of VTD-XML is a binary format specification called Virtual Token Descriptor (VTD). A VTD record is a 64-bit integer that encodes the starting offset, length, type and nesting depth of a token in an XML document. Because VTD records don't contain actually token content, they work alongside of the original XML document, which is maintained intact in memory by the processing model.

VTD's memory-conserving features can be summarized as follows:

  • Avoid Per-object overhead -- In many VM-based object-oriented programming languages, per-object allocation incurs a small amount of memory overhead. A VTD record is immune to the overhead because it is not an object.
  • Bulk-allocation of storage -- Fixed in length, VTD records can be stored in large memory blocks, which are more efficient to allocate and GC. By allocating a large array for 4096 VTD records, one incurs the per-array overhead (16 bytes in JDK 1.4) only once across 4096 records, thus reducing per-record overhead to very little.

Our benchmark indicates that VTD-XML processes XML at the performance level similar to (and often better than) SAX with NULL content handler. The memory usage is typically between 1.3x ~ 1.6x of the size of the document, with "1" being the document itself.

Other features included in this release are:

  • Incremental update -- VTD-XML allows one to modify content of XML without touching irrelevant parts of the document.
  • Content extraction -- VTD-XML also allows one to pull an element out of XML in its serialized format. This can be an important feature for partial signing/encryption of SOAP payload for WS-security.

In the upcoming releases, we plan to add the persistence support so that one can save/load VTD to/from the disk along with the XML documents to avoid repetitive parsing in read-only situations. XPATH support is also on the development roadmap. However, we would like to collect as many suggestions and bug reports before taking the next step.

The algorithms sound interesting. Unfortunately VTD-XML cannot process arbitrary XML, at least not yet. First off, it places some arbitrary limits on the size of qualified names and of the entire document, though this would seem to be fixable. The size of the qualified names could easily be run up to as much as is supported by a Java String, which is all competing APIs can claim. The limits on document size may be fundamental, but they are at least competitive with other in-memory APIs like DOM, though not with streaming APIs like SAX and StAX. Bigger problems include entity resolution, default attribute values, and attribute value normalization. VTD-XML does not support entity references other than the five predefined entities (&amp;, &lt;, etc.). The documentation doesn't discuss default attributes or attribute value normalization, but given the algorithms used these seem unlikely to be supported. More than once I've seen completing the last 10% of XML conformance demolish the speed that was so impressive in earlier, less complete betas. :-( It remains to be seen whether XimpleWare can extend their algorithms to support XML in all its complexity.
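
To show what "a 64-bit integer that encodes the starting offset, length, type and nesting depth" could look like, here is a rough Python sketch. The field widths and field order are purely my assumption for illustration; they are not taken from the VTD specification, and the `pack` and `unpack` names are invented.

```python
# Hypothetical field widths (NOT the actual VTD-XML layout); they sum to 64.
TYPE_BITS, DEPTH_BITS, LENGTH_BITS, OFFSET_BITS = 4, 8, 20, 32

def pack(token_type: int, depth: int, length: int, offset: int) -> int:
    """Pack four fields into one integer that fits in 64 bits."""
    fields = ((token_type, TYPE_BITS), (depth, DEPTH_BITS),
              (length, LENGTH_BITS), (offset, OFFSET_BITS))
    record = 0
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit in %d bits" % bits)
        record = (record << bits) | value
    return record

def unpack(record: int):
    """Recover (token_type, depth, length, offset) from a packed record."""
    offset = record & ((1 << OFFSET_BITS) - 1)
    record >>= OFFSET_BITS
    length = record & ((1 << LENGTH_BITS) - 1)
    record >>= LENGTH_BITS
    depth = record & ((1 << DEPTH_BITS) - 1)
    record >>= DEPTH_BITS
    return record, depth, length, offset

# The record holds no text itself; it only points into the original bytes.
document = b"<a><b>text</b></a>"
record = pack(token_type=1, depth=1, length=4, offset=6)
_, _, length, offset = unpack(record)
token = document[offset:offset + length]
print(token)
```

Fixed-width fields like these would also explain limits on name and document size: whatever does not fit in the bits reserved for length or offset cannot be represented.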


Ian E. Gorman has released GXParse 1.3, a free (LGPL) Java library that sits on top of a SAX parser and provides semi-random access to the XML document. The documentation isn't very clear, but as near as I can tell, it buffers various constructs like elements until their end is seen, rather than dumping pieces on you immediately like SAX does.

Sunday, June 27, 2004

MetaStuff Ltd. has posted the second beta of dom4j 1.5, an open source tree-based API for processing XML with Java. dom4j is based on interfaces rather than classes, which distinguishes it from alternatives like JDOM and XOM. (Not to its credit, in my opinion. Using concrete classes instead of interfaces was one of the crucial decisions that made JDOM as simple as it is.) Version 1.5 seems to be mostly a collection of bug fixes and small, backwards compatible, API enhancements.

Saturday, June 26, 2004

The Mozilla Project has posted Camino 0.8, a Mac OS X web browser based on the Gecko 1.7 rendering engine and the Quartz GUI toolkit. This release makes major updates to the rendering engine. For me, the most important new feature is XSLT support. (Now if we could just get XSLT supported in Safari.) Other new features include a Google Search bar, session history on back/forward buttons, improved cookie management, pop-up white-lists, and incremental type-ahead find. Mac OS X 10.1.5 or later is required.

Friday, June 25, 2004

Sun has released the Java Web Services Developer Pack 1.4. Version 1.4 supports the April 19, 2004 draft versions of the WS-I basic profile 1.1, the WS-I Attachment Profile 1.0, Web Services Security: SOAP Message Security 1.0, and the WS-I Simple SOAP Binding Profile 1.0. Bundled Java APIs include:

  • XML and Web Services Security 1.0 early access release 3
  • XML Digital Signatures 1.0 early access release
  • Java Architecture for XML Binding (JAXB) 1.0.3
  • Java API for XML Processing (JAXP) 1.2.6
  • Java API for XML Registries (JAXR) 1.0.6
  • Java API for XML-based RPC (JAX-RPC) 1.1.2
  • SOAP with Attachments API for Java (SAAJ) 1.2.1
  • JavaServer Pages Standard Tag Library (JSTL) 1.1
  • Java WSDP Registry Server 1.0_06 FCS
  • Ant 1.5.4

Tomcat's been dropped from this release, but it should all work with Tomcat if you install that separately.

Potentially useful (especially if this gets into the next version of the JDK) is that they've now renamed the bundled Apache packages from org.apache to com.sun.org.apache so it should be more easily possible to replace them without having to use endorsed directories, -Xbootclasspath, and other CLASSPATH-fu.

Thursday, June 24, 2004

Sun has posted an early draft review of Java Specification Request 222: Java™ API for XML Data Binding 2.0. This makes various minor updates to support Java 1.5 features, align the spec with JAX-RPC 2.0, and support substitution groups.

This whole spec is based on a very outmoded and discredited W3C XML Schema-centric view of the world. Most laughable line: "Any nontrivial application of XML will, then, be based upon one or more schemas and will involve one or more programs that create, consume, and manipulate documents whose syntax and semantics are governed by those schemas;" and by "schemas" they really do mean W3C XML Schema Language schemas exclusively. Hmm, I guess XSLT, DocBook, OpenOffice, and RSS are all trivial, then? But really, this spec gets things monstrously wrong in almost every paragraph. I don't think the authors have any clue just how myopic their view of XML really is. Comments are due by July 23, but sometimes it's best just to ignore fundamentally broken things and let them die a quiet death in the market.


Sun has also posted an early draft review of Java Specification Request 224: Java™ API for XML-Based RPC (JAX-RPC) 2.0. JAX-RPC is a Java API for working with SOAP and WSDL based web services.

This draft addresses the following goals and requirements:

  • Addition of client side asynchrony
  • Improved support for document and message centric usage
  • Integration with JAXB
  • Improvements to the handler framework
  • Improved protocol neutrality

Subsequent versions of this document will address the following additional goals and requirements:

  • Support for WS-I Basic Profile 1.1 and Attachments Profile 1.0
  • Support for SOAP 1.2
  • Support for WSDL 2.0
  • Versioning and evolution of web services
  • Web services security
  • Integration with JSR 181 (Web Services Metadata)
  • Service endpoint model
  • Runtime services

Comments are due by July 23.


Syntext has released Dtd2Xs 2.0, a tool for converting complex, modularized XML DTDs to W3C XML Schema Language schemas. Dtd2Xs runs on Windows and Linux. Dtd2Xs is $49 on Windows, $39 on Linux, and free for non-commercial use. Version 2.0 adds a graphical user interface.

Wednesday, June 23, 2004

The xframe project has posted beta 5 of xsddoc, an open source documentation generator for W3C XML Schemas based on XSLT. xsddoc generates JavaDoc-like documentation of schemas. Java 1.3 or later is required.

Tuesday, June 22, 2004

Tip of the day:

  1. Open Microsoft Excel.
  2. Select Tools/Options (Excel/Preferences on Mac OS X, Tools/Preferences on Mac OS 9).
  3. Click General.
  4. In the middle of the dialog, where it says "Sheets in new workbook", type 1.
  5. Press the OK Button.

I am so sick of seeing Excel spreadsheets with two blank sheets per document. I'm sure there are Excel power users out there who need multiple sheet documents, but I've never met them; and every Excel document I receive has one filled sheet and two blank ones.

Monday, June 21, 2004

The W3C XML Protocol Working Group has published several working drafts relating to XML-binary Optimized Packaging (XOP). The XML-binary Optimized Packaging spec defines a new XOP Infoset (which is not an XML Infoset) based on a XOP document (which is not an XML document). A XOP document is a MIME multipart document containing a single XML document and lots of unparsed binary data in the other MIME parts. The XML document contains xop:Include elements that point to the binary data. The XOP document can be converted to an XML document by extracting the main XML document from the MIME document and then replacing all the xop:Include elements with the Base-64 encoded versions of the binary data they point to.
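
The conversion from a XOP document back to plain XML can be sketched in Python roughly as follows. This is a minimal sketch under several assumptions of mine: the MIME parts have already been extracted into a dictionary keyed by href, each xop:Include is the only child of its parent element, and the namespace URI is the one I know from the later XOP Recommendation, which may not match this draft. The `inline_xop` name is made up.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed namespace for xop:Include; this working draft may use a different URI.
XOP_INCLUDE = "{http://www.w3.org/2004/08/xop/include}Include"

def inline_xop(root, parts):
    """Replace each xop:Include with the Base-64 text of the part it points to.

    `parts` maps href values to raw bytes and stands in for the other MIME
    parts. Assumes each xop:Include is the sole child of its parent.
    """
    for parent in list(root.iter()):      # snapshot, since the tree is modified
        for child in list(parent):
            if child.tag == XOP_INCLUDE:
                data = parts[child.get("href")]
                parent.remove(child)
                parent.text = base64.b64encode(data).decode("ascii")
    return root

# Hypothetical main XML part and one binary part
doc = ET.fromstring(
    '<photo xmlns:xop="http://www.w3.org/2004/08/xop/include">'
    '<xop:Include href="cid:image1"/></photo>')
inline_xop(doc, {"cid:image1": b"\x89PNG"})
print(ET.tostring(doc, encoding="unicode"))
```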

SOAP Resource Representation Header "describes the semantics and serialization of a SOAP header block for carrying resource representations in SOAP messages." This defines elements that hold a Base-64 encoded picture, MPEG file, plain text file, database, or just about anything else retrieved from a particular URL, so that it can be stuffed into a SOAP header. In practice, this would not actually be Base-64 encoded. The data would be replaced by a xop:Include element that points to a copy of the original binary data. This way a client could receive a collection of interlinked resources as a single SOAP message.

Assigning Media Types to Binary Data in XML attempts to make sure the MIME media types of all this binary data flying around doesn't get lost when the XOP document is transmitted, stored, or serialized. To this end, it defines a contentType attribute for indicating the media type of XML element content whose type is xs:base64Binary. It also defines an expectedMediaType for use in schema annotations to indicate what the contentType attribute may say.

A few other specs try to explain how all this fits together and justify the work to skeptics. (The working group has up till now done a notoriously poor job of communicating what they're up to, thereby engendering much hostility from other groups and people that don't really disagree with them in any fundamental way.) A new FAQ list attempts to answer some frequently asked questions about all this, but it's rather poorly written and makes several obvious mistakes so it probably won't help matters any. The other specs in this family seem to be written more carefully and correctly. SOAP Message Transmission Optimization Mechanism describes how these different technologies work together, and SOAP Optimized Serialization Use Cases and Requirements explains just why the W3C XML Protocol Working Group thinks this is so important.

Comments on these drafts are due by June 29.


Brian Quinlan has posted version 0.9 of Pyana, a Python interface to the Xalan C XSLT processor. This release supports Xalan 1.8 and Xerces-C 2.5, implements basic tracing support, and removes transform to DOM support.

Saturday, June 19, 2004

application/soap+xml has been registered as the official MIME media type for SOAP requests.


The OpenOffice Project has released OpenOffice 1.1.2, an open source office suite for Linux and Windows that saves all its files as zipped XML. I used the previous 1.0 version to write Effective XML. 1.1.2 introduces the FontOOo Autopilot, a tool that downloads and installs fonts from various sources. This release also improves support for dBase database files and XML export; but mostly it fixes a lot of bugs and makes numerous improvements to localisation in a few languages. OpenOffice is dual licensed under the LGPL and Sun Industry Standards Source License.

Friday, June 18, 2004

The Mozilla Project has released Mozilla 1.7, an open source web browser, chat and e-mail client that supports XML, CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness. It's available for Linux, Mac OS X, and Windows. Version 1.7 improves popup blocking, lets users review and open blocked popups, supports multiple identities in the same email account, provides a "show passwords" mode that displays saved passwords, and makes various other small improvements. It also has some small but significant performance optimizations.

Thursday, June 17, 2004

Nicholas Cull has released version 1.0 of his XHTML negotiation module for Apache 1.3.x that enables this web server to negotiate content types for XHTML documents appropriate for different browsers. That is, it allows you to serve application/xhtml+xml to modern, standards conformant browsers like Mozilla, and text/html to out of date, non-conformant browsers like Internet Explorer.
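The general technique can be sketched in a few lines of Python. This is not the Apache module's code, and the function name is made up for this example; it simply assumes the decision is driven by the browser's HTTP Accept header, sending application/xhtml+xml only to clients that explicitly list it with a non-zero q value.

```python
def choose_xhtml_media_type(accept_header):
    """Pick a media type for an XHTML page based on the Accept header.

    Returns application/xhtml+xml only when the client names it
    explicitly with q > 0; everything else gets text/html.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # the default quality value when none is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q as "not acceptable"
        if q > 0:
            return "application/xhtml+xml"
    return "text/html"


# A Mozilla-style header that lists the XHTML type explicitly
modern = choose_xhtml_media_type(
    "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,*/*;q=0.5")
# An IE-style header that only offers a wildcard
legacy = choose_xhtml_media_type("image/gif, image/jpeg, */*")
```

Ignoring the */* wildcard is a deliberate choice here: a browser that claims to accept everything has not really promised that it can render XHTML.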


The Free Software Foundation has released GNU Libidn 0.4.9, "an implementation of the Stringprep, Punycode and IDNA specifications defined by the IETF Internationalized Domain Names (IDN) working group, used for internationalized domain names." It's available in Java and C with bindings to Perl. This is published under the Lesser General Public License.


Dave Beckett has released the Raptor RDF Parser Toolkit 1.3.1, an open source C library for parsing the RDF/XML and N-Triples Resource Description Framework formats. It uses expat or libxml2 as the underlying XML parser. Version 1.3.1 fixes some bugs and is more portable to Windows. Raptor is published under the LGPL.


RenderX has released version 3.8 of XEP, its payware XSL Formatting Objects to PDF and PostScript converter. Version 3.8 supports kerning, fixes bugs, and adds a rx:page-number-citation-last extension element. The basic client is $299.95. The developer edition with an API is $999.95. The server version is $4999.95. Updates from 3.0 are free.


Toni Uusitalo has posted Parsifal 0.8.1, a minimal, non-validating XML parser written in ANSI C. The API is based on SAX2. Version 0.8.1 fixes bugs. Parsifal is in the public domain.


Version 4.1 of the payware <Oxygen/> XML editor has been released. Oxygen supports XML, XSL, DTDs, and the W3C XML Schema Language. Version 4.1 improves a few features here and there, but adds nothing earth-shattering since 4.0. Oxygen requires Java 1.3 or later. It costs $128 with support. Upgrades from 4.0 are free. Upgrades from previous versions are $76.

Wednesday, June 16, 2004

The Mozilla Project has released FireFox 0.9, an open source web browser for Windows, Mac OS X, and Linux that supports XML, XHTML, XSL, HTML, and CSS. Unlike the heavier weight Mozilla from which it is derived, this is just a browser; no e-mail client, newsreader, LDAP browser, or microwave oven is included. New features in this release include

  • A New Default Theme
  • Comprehensive Data Migration from Internet Explorer, Mozilla 1.x, Netscape 4.x, 6.x and 7.x, and Opera.
  • New Extension and Theme Managers
  • A new online help system
  • Copy Image
  • SMB/SFTP support on GNOME
  • Linux/GTK2 Installer

Tuesday, June 15, 2004

www.elharo.com is back up after a three-month absence. I still need to set up the virtual hosts on that server (macfaq.com, namespaces.cafeconleche.org, xom.nu), get the servlets working, and update the page itself; but overall this seems to be going a lot more smoothly than my experiences with Linux in the past. This time I installed Debian. Apt is a very nice program, but what really makes a difference seems to be installing a bare minimum of functionality and adding the rest one package at a time as I need it. Debian's bare minimum is still too maxxed out for my tastes. For instance, "vacation" is hardly an essential package. (I removed it.) I also manually installed the key programs I wanted on the server past the minimum: Apache, Tomcat 5, JDK, etc. This way I have a lot more control over what's running and how it's configured than I did by installing distro packages.

I reiterate my call for distro vendors to stop competing by seeing how much junk they can squeeze onto a CD, or now a DVD. I suggest they carefully select the packages they include, and hew to the maxim, "Less is more." One GUI toolkit, not two. One kernel, not three. One window manager, not seven. One text editor, not seventeen. No servers running by default. No office suites. No games or graphics editors. Do make it easy for users to install extra functionality they happen to want. (Apt seems to do this quite well.) However, installing everything any user might ever need does not accomplish this.


Noted on Slashdot, BugMeNot is a cool site that provides communal user names and passwords for various free registration sites like the New York Times.

Monday, June 14, 2004

The W3C Device Independence Working Group has posted the first public working draft of Content Selection for Device Independence (DISelect) 1.0. According to the abstract, "This document specifies a syntax and processing model general purpose selection. Selection involves conditional processing of various parts of an XML information set according to the results of the evaluation of expressions. Using this mechanism some parts of the information set can be selected for further processing and others can be suppressed. The specification of the parts of the infoset affected and the expressions that govern processing is by means of XML-friendly syntax. This includes elements, attributes and XPath expressions. This document specifies how these components work together to provide general purpose selection."

That sounds unobjectionable, but what the working group is really proposing is XML markup that can be added to a page to indicate which devices certain content is appropriate for. For example, this sel:if element says that the image should only be displayed if the user's device supports color and has a window size wider than 500 pixels.

<sel:if sel:expr="dc:cssmq-width('px') &gt; 500
    and dc:cssmq-color() &gt; 0">
  <object src="picture.png"/>
</sel:if>

This feels more than a little like presentation based markup. This is very much like using JavaScript or server side programs to identify different browsers and send them content tailored specifically to them. This syntax is definitely easier to use and more powerful than the various JavaScript and server-side hacks people use today; but should we be doing this at all? Whatever happened to the vision of sending browsers XML documents with appropriate stylesheets and letting the client decide how to best present it? The thing that bothers me the most about this proposal is that the syntax mixes the presentation information straight into the document, rather than linking to it from a separate hints sheet. In many ways, this document seems to reflect a belief that the W3C has been going down the wrong road for the last eight years in attempting to separate content from presentation.

Sunday, June 13, 2004

Michael Kay has released Saxon 8.0, an implementation of XSLT 2.0, XPath 2.0, and XQuery in Java. Saxon 8.0 is published in two versions for both of which Java 1.4 is required. Saxon 8.0B is an open source product published under the Mozilla Public License 1.0. Saxon 8.0SA is a payware version (price as yet unannounced) that adds support for schema types.

Saturday, June 12, 2004

The W3C XForms working group has posted a note on XForms 1.1 Requirements. New features planned for XForms 1.1 include:

  • Repeat/Insert Enhancements
  • Email-address Datatype
  • Association from XML data to Documents containing XForms
  • Power Function (for raising x to the yth power)
  • Search for Instance Data by Key Value
  • Selection across Models

Friday, June 11, 2004

The Mozilla Project has posted the third release candidate of Mozilla 1.7, an open source web browser, chat and e-mail client that supports XML, CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness. It's available for Linux, Mac OS X, and Windows. Version 1.7 improves popup blocking, lets users review and open blocked popups, supports multiple identities in the same email account, provides a "show passwords" mode that displays saved passwords, and makes various other small improvements. It also has some small but significant performance optimizations.


Syntext has posted the first beta of Serna 2.0, a $299 payware XSL-based WYSIWYG XML Document Editor for Mac OS X, Windows, and Unix. Features include on-the-fly XSL-driven XML rendering and transformation, on-the-fly XML Schema validation, and spell checking. Version 2.0 adds a customizable GUI, "liquid" dialog boxes, multiple validation modes (strict, on, and off), and large document support.

Thursday, June 10, 2004

I'm pleased to announce the first alpha of XOM 1.0, my free-as-in-speech library for processing XML with Java. My definition of alpha is a lot stricter than most people's. There are still some flaky parts in the internals to be worked out (the internal URI parsing is a complete hack, and I'd like to remove recursion from the last few methods where it occurs to avoid potential stack overflows in very deep documents), but the API is now considered to be complete and frozen. Code you write to XOM today should not require recompilation against any future 1.x release, and if a really major flaw is discovered in the API design, I'll try to provide a deprecation cycle first before removing the flawed methods.

The API has changed not at all since 1.0d25. All code that ran with 1.0d25 should not require recompilation to work with 1.0a1. Internal changes include:

  • Support for the 2nd candidate recommendation of XInclude; including preservation of xml:lang values
  • The base URI handling has been modified as follows:
    1. getBaseURI() always returns an absolute URI or the empty string if the base URI is not known. Other than the empty string it never returns a relative URI. It never returns null.
    2. The actual base URI of an element does not change when it is detached or copied, unless affected by a different set of xml:base attributes.
    3. The setBaseURI() method only accepts an absolute URI. It throws a MalformedURIException if you attempt to pass it a relative URI, or a URI with a fragment identifier. (Relative URIs are still allowed in xml:base attributes.)
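The third rule can be approximated in a few lines. The following is a language-neutral sketch written in Python, not XOM's actual Java implementation; the function name and the exception class are stand-ins invented for this example, and the test for "absolute" is simply whether the URI has a scheme.

```python
from urllib.parse import urlsplit


class MalformedURIError(ValueError):
    """Hypothetical stand-in for XOM's MalformedURIException."""


def check_base_uri(uri):
    """Accept only absolute URIs that carry no fragment identifier."""
    if not urlsplit(uri).scheme:
        raise MalformedURIError("base URI must be absolute: %r" % uri)
    if "#" in uri:
        raise MalformedURIError("base URI must not have a fragment: %r" % uri)
    return uri


# An absolute URI passes through unchanged
accepted = check_base_uri("http://www.example.com/docs/a.xml")
```

Passing "../a.xml" or "http://www.example.com/a.xml#part2" to this function raises the error, mirroring the relative-URI and fragment-identifier cases described above.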

In addition several bugs have been fixed and some operations have been sped up. Bradley Huffman is pushing back on the design of the base URI handling, but this is just a really ugly, nasty part of the XML world; and XOM can't help but reflect some of that ugliness. The Infoset, URI, and XML Base specifications really don't say how to handle parts of documents that don't have base URIs. Trying to make this work in a sensible way is like trying to push a bubble out of wall-to-wall carpet. Anywhere I fix the problem, it pops up somewhere else. I tried to push the base URI bubble into the least apparent place I could find, and there I think it's going to stay.


Sun has released Rome 0.1 (alpha), an open source (Apache license) library for parsing, translating, and saving various syndication formats including RSS 0.90, 0.91, 0.92, 0.93, 0.94, 1.0, and 2.0 and Atom 0.3. "The parsers can give you back Java objects that are either specific for the format you want to work with, or a generic normalized SyndFeed object that lets you work on with the data without bothering about the underlying format. Rome is called like this because all feeds lead to Rome." It's not immediately clear if Rome will or will not handle malformed feeds, but since Rome is based on JDOM and Xerces I suspect not.

Wednesday, June 9, 2004

IBM's alphaWorks has released Stylesheet Splicer, a customizable XSLT processor for Windows that can manage multiple versions of XSL stylesheets from a common code base.

A single XSL stylesheet, with a stylesheet processor such as Stylesheet Splicer, will transform a source XML document from one format (such as ColdFusion CFXML) into a target XML document of another format (such as Rosetta PIP). Usually nodes and values not present in the source document are added. When a variety of target documents is required, a number of different stylesheets are usually prepared with extensive duplication among them. With Stylesheet Splicer, multiple stylesheets may be segmented into their unique component parts, without duplication, and assembled dynamically for the particular target variation desired. Stylesheet Splicer implements a simple preprocessor language to control the mixing and matching of XSLT fragments and also to create run-time pop-up windows, allowing users to enter XSLT parameter values needed in the transformation.

Cladonia Ltd. has released the Exchanger XML Editor 2.0, a $98 payware Java-based XML Editor. Features include

  • Schema Based Editing
  • Tag Prompting
  • Validation against DTD, XML Schema, RelaxNG
  • Tree View and Outliner for Tag Free editing
  • XPath and Regular expression searches
  • Schema Conversion
  • XSLT
  • Project Management
  • SVG Viewer and Conversion
  • Easy SOAP Invocations
  • Find in Files
  • Extension Handling
  • DTD editing
  • XML catalogs
  • RelaxNG and DTD based tag completion.

New features in version 2.0 include:

  • XSLT Debugger
  • XML Signature support
  • Better performance with large documents
  • WSDL Analyzer
  • WebDAV and FTP support
  • XInclude resolution

Upgrades from 1.0 are free.

Tuesday, June 8, 2004

Stefan Champailler has posted DTDDoc 0.0.10, a JavaDoc like tool for creating HTML documentation of document type definitions from embedded DTD comments. This release improves support for pre elements. DTDDoc is published under the GPL.

Monday, June 7, 2004

Ian E. Gorman has posted the first alpha of GXParse 1.2, a free (LGPL) Java library that sits on top of a SAX parser and provides semi-random access to the XML document. The documentation isn't very clear, but as near as I can tell, it buffers various constructs like elements until their end is seen, rather than dumping pieces on you immediately like SAX does.
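To make the contrast with raw SAX concrete, here is a minimal sketch of the buffering idea using Python's standard xml.sax module. It is not GXParse's API (which is Java, and which I haven't examined closely); the handler class and callback are invented for illustration. Each element is held back until its end tag is seen, and only then handed to the caller in one piece.

```python
import xml.sax


class BufferingHandler(xml.sax.ContentHandler):
    """Collect each element's attributes and direct text until its end tag."""

    def __init__(self, on_element):
        super().__init__()
        self._stack = []              # one entry per currently open element
        self._on_element = on_element

    def startElement(self, name, attrs):
        self._stack.append((name, dict(attrs), []))

    def characters(self, content):
        if self._stack:
            # Text belongs to the innermost open element
            self._stack[-1][2].append(content)

    def endElement(self, name):
        name, attrs, chunks = self._stack.pop()
        # Only now, with the element complete, is the caller notified
        self._on_element(name, attrs, "".join(chunks))


completed = []
handler = BufferingHandler(lambda n, a, t: completed.append((n, a, t)))
xml.sax.parseString(
    b"<doc><title lang='en'>Hello</title><p>One</p></doc>", handler)
```

After the parse, completed lists title, then p, then doc, each with its attributes and direct text content already assembled, so the client code never has to stitch together separate start, characters, and end events itself.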

Sunday, June 6, 2004

Version 0.95 of Chiba, an open source, web-based implementation of XForms based on servlets and XSLT, has been released. Chiba enables XForms to be used in current browsers without plugins or special requirements on the client-side. Version 0.95 "finally adds the missing upload control with support for anyURI, base64 and hexBinary encoding. EXSLT functions now also can be used in Chiba along with the internal XForms functions." Chiba is published under the artistic license.

Saturday, June 5, 2004

Germane Software has released REXML 3.1.0, an XML parser for Ruby. This release adds preliminary support for RELAX NG. REXML is published under the Ruby license.

Friday, June 4, 2004

The RDF Data Access Working Group has published the first public working draft of RDF Data Access Use Cases and Requirements. According to the introduction,

The W3C's Semantic Web Activity is based on RDF's flexibility as a means of representing data. While there are several standards covering RDF itself, there has not yet been any work done to create standards for querying or accessing RDF data. There is no formal, publicly standardized language for querying RDF information. Likewise, there is no formal, publicly standardized data access protocol for interacting with remote or local RDF storage servers.

Despite the lack of standards, developers in commercial and in open source projects have created many query languages for RDF data. But these languages lack both a common syntax and a common semantics. In fact, the extant query languages cover a significant semantic range: from declarative, SQL-like languages, to path languages, to rule or production-like systems. The existing languages also exhibit a range of extensibility features and built-in capabilities, including inferencing and distributed query.

Further, there may be as many different methods of accessing remote RDF storage servers as there are distinct RDF storage server projects. Even where the basic access protocol is standardized in some sense—HTTP, SOAP, or XML-RPC—there is little common ground upon which to develop generic client support to access a wide variety of such servers.

The following use cases characterize some of the most important and most common motivations behind the development of existing RDF query languages and access protocols. The use cases, in turn, inform decisions about requirements, that is, the critical features that a standard RDF query language and data access protocol require, as well as design objectives that aren't on the critical path.


The W3C Quality Assurance (QA) Activity has sent the QA Specification Guidelines back to working draft status "after the decision to completely redesign the QA Framework (QAF), resolved by the QA Working Group (QAWG) at its 2004 Technical Plenary face-to-face. It is a lighter-weight, less authoritarian, more user-friendly and useful version of the previously published Candidate Recommendation version of the Specification Guidelines."

Thursday, June 3, 2004

Toni Uusitalo has posted Parsifal 0.8.0, a minimal, non-validating XML parser written in ANSI C. The API is based on SAX2. Version 0.8.0 adds support for processing the DTD, something that's necessary even in a non-validating parser to supply default attribute values, identify ignorable white space, resolve entities, and so forth. Parsifal is in the public domain.
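The point about default attribute values is easy to demonstrate with any non-validating parser that reads the internal DTD subset. The sketch below uses Python's expat binding rather than Parsifal itself, and the memo document is made up for the example. The start tag carries no attributes, yet the parser reports one because the ATTLIST declaration supplies a default.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE memo [
  <!ATTLIST memo priority CDATA "normal">
]>
<memo>Lunch at noon</memo>"""

seen = {}


def start_element(name, attrs):
    # attrs includes values defaulted from the DTD, not just those
    # written in the start tag
    seen[name] = attrs


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(DOC, True)

priority = seen["memo"].get("priority")
```

A parser that skipped the DTD entirely would report no priority attribute at all, so an application relying on the default would see different data depending on which parser it happened to use.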


Eric S. Raymond has released version 1.9 of doclifter, an open source tool that transcodes {n,t,g}roff documentation to DocBook. He claims the "result is usable without further hand-hacking about 95% of the time." This release supports the Vt macro in mdoc. Doclifter is written in Python, and requires Python 2.2a1. doclifter is published under the GPL.

Wednesday, June 2, 2004

The IETF has posted a new working draft of Internationalized Resource Identifiers (IRIs). "An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources." In other words this lets you write URLs that use non-ASCII characters such as http://www.libération.fr/. The non-ASCII characters would be converted to a genuine URI using hexadecimally escaped UTF-8. For instance, http://www.libération.fr/ becomes http://www.lib%C3%A9ration.fr/. There's also an alternative, more complicated syntax to be used when the DNS doesn't allow percent escaped domain names. However, the other parts of the IRI (fragment ID, path, scheme, etc.) always use percent escaping.
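The percent-escaping step is simple enough to sketch. The function below is a simplified illustration written for this entry, not the full mapping defined in the draft: it leaves every ASCII character alone and replaces each non-ASCII character with the percent-escaped bytes of its UTF-8 encoding.

```python
def iri_to_uri(iri):
    """Percent-escape the UTF-8 bytes of every non-ASCII character."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through untouched
        else:
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


# \u00e9 is the precomposed e with acute accent used in the example above
uri = iri_to_uri("http://www.lib\u00e9ration.fr/")
```

Here uri comes out as http://www.lib%C3%A9ration.fr/, matching the example in the draft. Note that the result depends on the Unicode normalization form of the input: a decomposed e followed by a combining acute accent would produce different escapes.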

Thursday, May 27, 2004

I'm going to Maine for the holiday weekend to visit friends and look at birds. Regular updates will resume on Tuesday.


The W3C Multimodal Interaction Working Group has posted a note on Modality Component to Host Environment DOM Requirements and Capabilities Assessment. (That's a mouthful.) According to the abstract,

This document describes the DOM capabilities needed to support a heterogeneous multimodal environment and the current state of DOM interfaces supporting those capabilities. These DOM interfaces are used between modality components and their host environment in the W3C Multimodal Interaction Framework as proposed by the W3C Multimodal Interaction Activity.

The Multimodal Interaction Framework separates multimodal systems into a set of functional units, including Input and Output components, an Interaction Manager, Session Components, System and Environment, and Application Functions. In order for those functional components to interact with each other to form an application interpreter, the browser implementation must allow for communication and coordination between those components. This DOM interface identifies the DOM APIs used to communicate and coordinate at the browser implementation level. Multimodal browsers can be stand-alone or distributed systems.

Wednesday, May 26, 2004

IBM and Oracle have posted an early draft specification of Java Specification Request (JSR) 225, XQuery API for Java™ (XQJ). Comments are due by July 9.


Altova has released XMLSPY 2004r4, a popular closed source XML editor for Windows. This release adds a free beer Home Edition and improves various features. The $499 payware professional edition adds numerous editing features including context sensitive entry, text folding, line numbers, spell checking, and so forth. It also adds XPath support, XSLT debugging, XML differencing, integration with various databases, APIs for controlling XMLSpy from external programs, macros, and templates for various common DTDs like DocBook. The $999 payware enterprise edition adds various SOAP and WSDL features.

Tuesday, May 25, 2004

The Mozilla Project has posted the first alpha of Mozilla 1.8a (and in case you missed it amidst all the WWW2004 coverage last week, they've also posted the second release candidate of Mozilla 1.7 and the first beta of Camino 0.8 which is based on Mozilla 1.7). New features in 1.8 include FTP uploads, improved junk mail filtering, and an increase in the number of cookies that Mozilla can remember. It also makes various small user interface improvements and adds support for CSS quotes.


Yesterday I wrote about what I didn't like at WWW2004 (the Semantic Web). Today I'm going to write about what I did like, because there was one technology presented that really impressed me, and that I think is going to be a key part of development in the very near future, with an exponential growth rate for the next couple of years. That technology is XForms.

Like many successful technologies before it (XML, HTML, Java, Linux), XForms doesn't really let you do anything you can't do today. It is not radically new. It does not require reorganizing the way one runs a business or develops software. Unlike the semantic web, it does not require learning completely new and unfamiliar areas of technology such as ontologies and inference systems. What XForms does do is give developers the tools to write a lot of the applications they're already writing today much more quickly, cleanly, and robustly.

I'm still learning about XForms, but what I see impresses me a lot. They are much better designed than HTML forms ever were. They have been designed with usability, accessibility, and internationalization in mind (not that any of those features ever really sell development tools, but they're nice to have when they don't cost you anything more). More importantly for XForms adoption, they lend themselves to the writing of clean, powerful code. Many things programmers really do care about like data validation are built right into XForms. Previously these have had to be hacked together with really ugly JavaScript tricks and server side frameworks. More importantly XForms are going to enable really new, server-deployed applications running inside browsers that just can't be handled in HTML. For instance, one XForm demoed at the conference essentially embedded a spreadsheet inside an HTML page. This wasn't done with ActiveX code like in IE. It was all written with declarative XForms. Up till now, browser-based applications have had fairly poor user interfaces and limited functionality. XForms changes all that. You still wouldn't write Photoshop or Quake in XForms, but you certainly can write a lot of business based apps for retail, data entry, customer service, and the like; and this brings me to my next point:

I've seen many businesses standardize on IE over the last few years because they like using Microsoft's proprietary technologies for integrating data into their Intranets. It's easier to develop a lot of applications by feeding data into grid controls embedded in web pages. They get richer, more powerful user interfaces that work a lot better than anything they can achieve with standard HTML+JavaScript; and they can deploy the app over the network straight into the browser. It's a very cost effective, compelling story for a shop that's mostly all-Microsoft anyway. But with XForms the development advantage shifts. The same rich interfaces can be prepared with XForms (or XForms+CSS) even faster. The resulting applications will be much more robust, and have naturally greater separation of the data model from the user interface. (Of course, you can separate the data model from the user interface in an IE DOM solution as well, but you have to think about it and work to make it happen. With XForms, by contrast, you'd have to work pretty hard not to separate the model from the user interface.) The benefits are even larger for someone moving from traditional HTML forms+JavaScript, or PHP, or JSP. You'll still be able to use all these technologies, but you'll need to use them a lot less. Much more of the work can be offloaded into the forms themselves, and the data presented to your application in a much richer format than x-www-form-urlencoded name=value pairs. XForms won't do everything, but like XML, they will do a lot more so your application can do less.

There is one big open question about XForms and that's implementation. From what I gathered at the conference there's currently exactly one high quality implementation, x-port's FormsPlayer, and it isn't open source. Pricing is of the "Call us, and we'll figure out how much we can shake out of you" variety. There's a free-beer version for developers who just want to play with this stuff, but the cost for widespread deployment is too high. There is one open source implementation of the spec, Chiba, but it isn't complete, and the people who should know at the conference didn't seem too impressed with it. Still, it's at least worth a look, and might be a good investment of time for any programmer with the time to spare and an itch to scratch.

However the holy grail for XForms support is not a browser plug-in like FormsPlayer, a standalone application like X-Smiles, or a Java/JavaScript/servlet hack like IBM's XForms compiler. Rather it's direct support for XForms built right into the browser. An organization that installed an XForms aware browser on every user's desktop could thereafter forget about most installation hassles. New custom apps could be delivered just by loading the right web page. I know we've heard this story before, but XForms looks like the technology that could finally deliver it. So, how likely is browser based support for XForms? In IE, not very likely. IE is essentially frozen, and even a few years down the line Microsoft is really focused on its own XAML, likely to the exclusion of competing technologies like XForms. The more likely host is Mozilla. Mozilla doesn't have the market share IE does, especially in business, but that's changing. Real XForms support would give businesses the same ease-of-application-development incentive to make the switch from IE to Mozilla that caused them to move from Netscape to IE several years ago. Mozilla+XForms is a potential IE killer. Unfortunately, this doesn't look likely to happen tomorrow. The vendors and developers of alternative browsers are showing surprising resistance to XForms. For instance, Opera's Ian Hickson writes:

For us (the Web browser vendors: Opera, Mozilla, and Apple), the "backwards compatible" requirement is not really negotiable, because it is quite clear that solutions that don't work in the market leader's browser won't be accepted by mainstream Web developers. I think a lot of people in the W3C world are having difficulty accepting this, especially given that Microsoft have basically said that IE has been end-of-lined (it is my understanding that IE in the next version of Windows will have no changes to its HTML/CSS/DOM/XML implementations and still no support for XHTML, and Microsoft have also stated that there will be no new separately-downloadable versions of IE available anyway, so even if they did upgrade it, it would only be used by those who upgraded their operating system).

I think what he misses is that XForms is a compelling enough story to displace IE. Market leaders can be tossed out. IE displaced Navigator, partially because Microsoft had the advantage of bundling IE with Windows, but also because they produced a superior product, at least on some platforms for a few years. It's no longer true that IE is a better browser than Mozilla, Opera, Safari, and other competitors. In fact, almost everyone who's tried one or more of the alternatives agrees that Microsoft is falling farther behind every day. Microsoft's decision to halt IE development until Longhorn ships (possibly in 2006, probably later) is a disaster for them. If Mozilla implemented XForms quickly, Longhorn would enter a market in which businesses and developers were already strongly committed to XForms. XAML, instead of hitting the market like Visual Basic did 15 years ago, would be more like C#: a day late and a dollar short; a nice idea, but not one that would offer developers anything they didn't already have. As Ray Kroc liked to note, and as Bill Gates undoubtedly knows, when you see a competitor drowning, shove a fire hose down their throat. IE is drowning. XForms is a nice, big fire hose.


Malcolm Wallace and Colin Runciman have released version 1.12 of HaXml, a bug fix release of the XML processing library for the Haskell language. According to the web page,

HaXml is a collection of utilities for using Haskell and XML together. Its basic facilities include:

  • a parser for XML,
  • a separate error-correcting parser for HTML,
  • an XML validator,
  • pretty-printers for XML and HTML.

For processing XML documents, the following components are provided:

  • Combinators is a combinator library for generic XML document processing, including transformation, editing, and generation.
  • Haskell2Xml is a replacement class for Haskell's Show/Read classes: it allows you to read and write ordinary Haskell data as XML documents. The DrIFT tool (available from http://repetae.net/~john/computer/haskell/DrIFT/) can automatically derive this class for you.
  • DtdToHaskell is a tool for translating any valid XML DTD into equivalent Haskell types. In conjunction with the Xml2Haskell class framework, this allows you to generate, edit, and transform documents as normal typed values in programs, and to read and write them as human-readable XML documents.
  • Finally, Xtract is a grep-like tool for XML documents, loosely based on the XPath and XQL query languages. It can be used either from the command-line, or within your own code as part of the library.

This release changes the license to LGPL for the libraries and GPL for the tools. It also fixes bugs.

Monday, May 24, 2004

Digesting last week's WWW2004 conference, I think I've come to two conclusions. The first conclusion is that the semantic web as envisioned at the W3C (RDF, OWL, URIs) is hype. Nobody is actually using it to accomplish anything useful. It's of great interest to theoreticians but has little to no practical impact, and is not likely to have any for the foreseeable future. I watched a lot of semantic web presentations over four days and they basically divided into two broad types:

  • Discussions of how to search, infer, query, store, manipulate or otherwise process RDF; without actually showing any practical applications.

  • Case studies that sounded good and useful, and indeed were. However, when I actually googled my way into the projects' web sites and looked at the code, the projects proved to be using little to no semantic web technology. A few had rdf:about attributes stuck somewhere in a mass of plain XML. Some didn't even have that. Any semantics these applications had was based purely on element names and namespace URIs. RDF tuples were an afterthought, if indeed they were present at all.

Someone suggested during the lunch discussions at the end of the conference that XML was over; that the XML track had the least attendance and least interest from the attendees. The latter point may be true. The conference was full of academics who've hitched their wagons to the semantic web ox; and most XML focussed academics go to the IDEAlliance conferences instead of the W3C ones, so the audience in New York was both positively selected for the semantic web and negatively selected for XML. However, I saw no evidence that XML was over. When one looked a little deeper than the paper title, it was pretty obvious that applications were being built on top of XML, and that the semantic web helped little to none. XML may not sound as sexy as the semantic web, but it's infinitely more useful for getting the job done.

What's the second thing I noticed? Well, there was one technology shown at the conference that did sneak up on me, that I haven't paid much attention to in the past, and that I'm now convinced is going to be a critical part of application development in the near future. This could well be the next big thing with an exponential growth curve that matches past stars like Java, XML, HTML, and Linux and with similarly disruptive effects on the development ecosystem. More on that subject tomorrow.


The W3C has released version 8.5 of Amaya, their open source testbed web browser and authoring tool for Solaris, Linux, Windows, and Mac OS X that supports HTML 4.01, XHTML 1.0, XHTML Basic, XHTML 1.1, HTTP 1.1, MathML 2.0, SVG, and much of CSS 2. This release fixes assorted bugs.

Saturday, May 22, 2004

Day 4 (Developer Day) of WWW2004 kicks off with an 8:30 A.M. morning keynote. There are many fewer people here this morning, maybe 150. I remembered the camera today so expect more low quality, amateur pictures throughout the day. About three minutes before the conference closed, Stuart Roebuck explained why none of the microphones worked on my PowerBook. I'll know for next time.


The Developer Day Plenary is Doug Cutting talking about Nutch, an open source search engine. His OpenOffice Impress slides "totally failed to open this morning" so he reconstructed them in 5 minutes. He contrasts Nutch with Lucene. Lucene is a mature, Apache project. It is a Java library for text indexing and search meant to be embedded in other applications. It is not an end-user application. JIRA, FURL, Lookout, and others use Lucene. It is not designed for site search. "Documentation is always a hard one for open source projects to get right and books to ?tend to? help here."

Nutch is a young open source project for webmasters, still not end users. It has a few part time paid developers. It is written in Java based on NekoHTML. It is a search site, but wants "to power lots of sites". They may modify the license in the future. They may use the license to force people to make the searches transparent. "Algorithms aren't really objective." It's a research project, but they're trying to reinvent what a lot of commercial companies have already invented. The goal is to increase the transparency of the web search. Why are pages ranked the way they are? Nutch is used by MozDex, Objects Search and other search engines.

Technical goals are scaling well, billions of pages, millions of servers, very noisy data, complete crawl takes weeks, thousands of searches per second, state-of-the-art search result quality. One terabyte of disk per 100 million pages. It's a distributed system. They use link analysis over the link database (various possible algorithms) and anchor text indexing (matches to search terms). 90% of the improvement is done by anchor text indexing as opposed to link analysis. They're within an order of magnitude of the state of the art and probably better than that. Calendars are tricky (infinite page count). Link analysis doesn't help on the Intranet.
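To make the anchor text idea concrete, here is a toy sketch in Python. It is not Nutch's code; the function names and the link data are made up for illustration. Each page is indexed under the words other pages use when linking to it, so a query matches what people call a page rather than what the page says about itself.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each lowercased anchor-text word to the set of target URLs.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return the sorted URLs whose incoming anchor text has every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical link data, invented for this example.
LINKS = [
    ("http://a.example/", "http://lucene.example/", "Lucene search library"),
    ("http://b.example/", "http://nutch.example/", "open source search engine"),
    ("http://c.example/", "http://nutch.example/", "Nutch web search"),
]

if __name__ == "__main__":
    index = build_anchor_index(LINKS)
    print(search(index, "search engine"))
```

A real engine would also weight and rank the hits; this sketch only shows where the matches come from.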

In Q&A a concern is raised about the performance of using threads vs. asynchronous I/O for fetching pages. Cutting says he tried java.nio and it didn't work. They could saturate the ISPs' connections using threads. The I/O API is not a bottleneck.


Paul Ford of Harper's is showing an amusing semantic web app. However, it uses pure XML. It does not use RDF. They do have taxonomies. "It's all done by hand." At least the markup is done by hand in vi and BBEdit. This is then slurped up in XSLT 2 (Saxon), and HTML is spit out onto the site. It was hard to get started but easy to keep rolling. RDF is not right for redundant content and conditionals. They can use XSLT to move to real RDF if they ever need to. This is in the semantic web track, but it occurs to me that if this had been presented five years ago we would have just called it an XML app. They do use a taxonomy they've developed, but it's all custom markup and names. They aren't even using URIs to name things as near as I can tell. The web site publishes HTML and RSS. The original marked up content is not published.


The MuseumFinland is trying to enable search across all 1000 or so Finnish museums.


The Simile Project is trying to provide semantic interoperability for digital library metadata. "Metadata quality is a function of heterogeneity." Open questions for the semantic web: How do you deal with disagreements? How do you distinguish disagreements from mistakes?


This conference is making me think a lot about the semantic web. I'm certainly learning more about the details (RDF, OWL etc.). However, I still don't see the point. For instance what does RDF bring to the party? The basic idea of RDF is that a collection of URIs forms a vocabulary. Different organizations and people define different vocabularies, and the URIs sort out whose name, date, title, etc. property you're using at any given time. Remind you of anything? It reminds me a lot of XML + namespaces. What exactly does RDF bring to the party? OWL (if I understand it) lets you connect different vocabularies. But so does XSLT. I guess the RDF model is a little simpler. It's all just triples that can be automatically combined with other triples, and thereby inferences can be drawn. Does this actually produce anything useful, though? I don't see the killer app. Theoretically a lot of people are talking about combining RDF and ontologies from multiple sources to find knowledge that isn't obvious from any one source. However, no one's actually publishing their RDF. They're all transforming to HTML and publishing that.
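The "combine triples and draw inferences" idea can be sketched in a few lines of Python. This is a toy, not a real RDF toolkit: the predicate names "type" and "subClassOf" are shortened stand-ins for properties like rdf:type and rdfs:subClassOf, and the a.example and b.example URIs are invented. Two sources each publish a few triples; neither one alone says that Rex is an animal, but the merged set does once a single rule is applied.

```python
def infer_types(triples):
    """Apply one rule until nothing new appears:

    (x, "type", c) and (c, "subClassOf", d)  =>  (x, "type", d)
    """
    facts = set(triples)
    while True:
        new = {
            (x, "type", d)
            for (x, p1, c) in facts if p1 == "type"
            for (c2, p2, d) in facts if p2 == "subClassOf" and c2 == c
        }
        if new <= facts:
            return facts
        facts |= new


# Hypothetical data from two independent publishers.
SOURCE_A = {
    ("http://a.example/rex", "type", "http://a.example/Dog"),
}
SOURCE_B = {
    ("http://a.example/Dog", "subClassOf", "http://b.example/Mammal"),
    ("http://b.example/Mammal", "subClassOf", "http://b.example/Animal"),
}

if __name__ == "__main__":
    merged = SOURCE_A | SOURCE_B  # merging is just set union
    for triple in sorted(infer_types(merged) - merged):
        print(triple)
```

Merging really is just set union here, which is the simplicity the triple model buys. Whether the inferred facts are useful is a separate question.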

Usability of RDF is a common theme among the semanticists. They all see various GUIs being used to store and present RDF. They all want to hide the RDF from end users. It's not clear, however, if there is (or can be, or should be) a generic RDF GUI like the browser is for HTML (or even XML, with style sheets).


Lunch at Developer's Day at WWW2004

After an entertaining lunch featuring Q&A with Tim Berners-Lee (shown above), I decided to desert the semantic web for the final afternoon of the show. Instead I've gone off to learn about the much more practical XForms. Unlike the semantic web, I believe XForms really can work. My main question is whether browsers will ever implement this, or if there'll be other interesting implementations.

The session begins with a tutorial from the W3C's always entertaining Steven Pemberton. He claims 20+ implementations on the day of the release and about 30 now. Some large organizations (U.S. Navy, Bristol-Myers-Squibb, Daiwa, Fraunhofer) are already using this. He's repeating a lot of his points from Wednesday. XForms separates the data being collected and the constraints on it (the model) from the user interface.

XForms speakers at WWW2004

What's d16n? It's not just for forms. You can also use it for editing XML, spreadsheet like apps, output transformations, etc.

Fields are hidden by default. There's no form element any more. I'm not sure I'm going to write any more about the syntax. It's too hard to explain without seeing the examples, and I can't type that fast, but it is quite clean. XForms supports HTTP PUT! (and GET and POST. DELETE and WebDAV methods are not supported in 1.0 but may be added in the future.) You can submit to XML-RPC and SOAP as well as HTTP servers. And it works with all well-formed XML pages, not just XHTML (but not classic, malformed HTML). XForms has equivalents for all HTML form widgets, and may customize some according to data type. Plus it adds a range control and an output control. There are toggles and switches to hide and reveal parts of the user interface. These are useful for wizards. There are repeating fields like in FileMaker. It supports conditionals. A single form can be submitted to multiple servers.

One thing strikes me as dangerous about XForms: they provide so much power to restrict what can be entered that server side developers are likely to not validate the input like they should, instead relying on the client to have checked the input. It would still be possible to feed unexpected input to the server by using a different server or a client that doesn't enforce the constraints.
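Here is a minimal sketch of what re-checking on the server might look like, in Python. The order/quantity instance document and the 1 to 100 range are hypothetical, made up for this example; nothing here comes from the XForms spec. The point is only that the server repeats the check no matter what the form promised.

```python
import re
import xml.etree.ElementTree as ET

QUANTITY_ERROR = "quantity must be an integer from 1 to 100"


def validate_order(xml_text):
    """Return a list of problems in a submitted instance; empty means OK."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Anything can arrive over HTTP, including text that isn't XML.
        return [f"not well-formed: {err}"]

    errors = []
    if root.tag != "order":
        errors.append(f"unexpected root element: {root.tag}")

    # Re-apply the same constraint the form was supposed to enforce.
    quantity = root.findtext("quantity", default="").strip()
    if not re.fullmatch(r"[0-9]{1,3}", quantity) or not 1 <= int(quantity) <= 100:
        errors.append(QUANTITY_ERROR)
    return errors


if __name__ == "__main__":
    for body in (
        "<order><quantity>3</quantity></order>",
        "<order><quantity>-5</quantity></order>",
        "<order><quantity>3</order>",
    ):
        print(validate_order(body) or "OK")
```

A production service would more likely validate against a schema than hand-code each rule, but either way the check has to live on the server.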

An XForm result may replace only the form instance, not the whole page, but what then about namespace mappings, etc.? What is the URI of the combined page? This is a very tricky area.


T. V. Raman, author of XForms: XML Powered Web Forms, is talking about XForms accessibility and usability. "People get so hung up on RDF's syntax. That's not why it's there." He predicts that Mark Birbeck will implement a semantic web browser in XForms, purely by writing angle brackets (XForms code) without writing regular source code. According to Pemberton, CSS is adding a lot of properties specifically for use with XForms.

T.V. Raman at WWW2004

Next Mark Birbeck is doing a demo of FormsPlayer. "We've had a semantic web overload."

Mark Birbeck at WWW2004

Pure Edge's John Boyer is speaking about security in XForms. Maybe he'll address my question about server side validation of user input. Hmm, it seems sending the data as a full XML document rather than as a list of name=value pairs might solve this. It would allow the server to easily use standard validation tools. This talk is mostly concerned with XML digital signatures and its supporting technologies. Now on the client side, how does the client know what he or she is signing? If an XML document is transmitted, what's the guarantee that that XML document was correctly and fully displayed to the user? Is what you see what you signed? E.g. was the color of the fine print the same as the background color of the page? It turns out they have thought of this. The entire form document, plus all style sheets, etc., are signed.

John Boyer at WWW2004

The last speaker of the afternoon (and the conference) is Mark Seaborne, who will be talking about using XForms. His slide show is in an XForm. He's using an XForm control to step through the forms! He works in insurance in the UK, an industry that's completely paper forms driven. It's important that the users not have to fill in the full form before rejections are detected. "There's a huge number of projects listed on the W3C website, but most of them aren't finished and some of them are actually dead." There are lots of different bugs and inconsistencies between different implementations, many of which have to do with CSS. IBM has announced that they're starting work on putting XForms support into Mozilla.

Slides in XForms in Virtual PC at WWW2004

That's it. The show is over. Go home. I'm exhausted. I'll see you on Monday.

Friday, May 21, 2004

Day 3 of WWW2004 kicks off with the usual 9:00 A.M. morning keynote. The conference is beginning to drag a bit, and there are maybe half as many people here this morning as yesterday. No pictures today. I remembered to charge up the camera and then forgot to put it back in my bag when I left the house.

Audio Recorder did provide a nice simple way to record an MP3 file of the day's proceedings. However, the internal microphone in my PowerBook paid more attention to the keyboard clicks than the speakers. I've brought a couple of external microphones this morning to try them out. Hmm, OK the first microphone has the wrong plug for this laptop. That's why I brought two. Hmm, the second one plugs in but doesn't seem to be hearing anything. Possibly that's not a microphone port on the back of my TiBook. Maybe I need a USB microphone? No, according to one Apple tech note and another tech note, that is indeed a line-in port on the back of my laptop. Maybe I need a newer microphone? A sensible person might have tried plugging these in at home before dragging them into town, but what would be the excitement of that? Maybe I'll run down to CompUSA at lunch.


This morning's keynote will be delivered by Mozelle Thompson, of the Federal Trade Commission. "I have a few notes, but I don't have a PowerPoint presentation. I hate those. Good morning. I'm from the government and I'm here to help you." "We're a small agency but there are those who love us." When listing the laws they enforce, the National Do Not Call list got applause. The Can-Spam Act got silence. About 4 or 5 people in the room admit to having been victims of identity theft. It looks like about half the audience is from outside the U.S. There are 53,000,000 phone numbers on the Do-Not-Call list. He opposes spyware legislation "at this time." He thinks such legislation is likely to be overbroad, and cover legitimate, consumer beneficial activities like instant updates and anti-virus updates. To head off Congress, he asked industry to, within 90 days, give consumers meaningful notice of what they're doing. Next, he asked them to develop a public education campaign about spyware. Finally he asked the industry to develop a mechanism to talk to Washington to identify the really bad actors. If industry does not act, he thinks legislators will act on incomplete, inaccurate information. Stopping spam requires international cooperation. Defensive use of patents is not well reflected in the current system.


Today's second keynote is "Higher Learning in the Digital Age" from James J. Duderstadt of the University of Michigan. He thinks the Internet/Web could change higher education in the next decade or two as radically as it changed in the two decades after the Civil War. Peer-to-peer interaction is replacing traditional professor-student learning. (I'm not sure I believe this. I certainly don't see it in my own classes at Polytechnic. To the extent they're learning from each other, they're learning bad habits. Perhaps they could learn very rough things like how to put a window on the screen, but they certainly aren't learning how to write code well by reading the Internet and talking to each other. Maybe what he describes is more true in elite universities like Carnegie-Mellon, one of his examples, or in undergraduate classes; I mostly teach graduate students, but with some undergraduates mixed in. However, I rarely see my students teaching each other. Half of them barely talk to anyone else in the class. When they do talk to each other, they're more interested in copying the homework than learning or teaching how to do something.) For the near-term (next decade) universities look pretty much like they do now. But over the longer term, the basic structure of the university may change in dramatic ways.


For the morning sessions I decided to go to the panel discussion of "Multimodal Interaction with XML: Are We There Yet?" Alan Turing: "A machine is intelligent only if it can carry on an intelligent conversation." Participants include Kuansan Wang of Microsoft, IBM's Yi-Ming Chee, Motorola's Mark Randolph, AT&T's Michael Johnson, the W3C's Max Froumentin, and Carnegie Mellon's Alex Rudnicky.

Question: Is it possible yet to use XML as the prime language in multimodal interaction, and what is still missing in current XML technology in order for XML to play that role? According to Michael Johnson?, speech is the most developed modality. Pen and gesture support is far behind. "Is it useful to create common XML standards for multimodal interaction?" "What kind of role should XML standards play, and what part of multimodal interaction can be standardized to accelerate the use of XML in multimodal interaction?" There's some restrained dispute about the verbosity of XML, and whether it matters. "What levels of semantics can XML represent in multimodal interaction?" Alex Rudnicky says we don't yet know the right primitives or levels of abstraction for multimodal interaction. "What is the product/research use of XML for multimodal interaction?"


First afternoon session. The current talk is about TeXQuery, which has nothing to do with TeX. It's an extension to XQuery for full text search. It allows you to prioritize matches to particular terms; e.g. finding all documents with a 0.8 score for "Goddesses" and a 0.2 score for "Nike." Co-authors are Sihem Amer-Yahia, Chavdar Botev, and Jayavel Shanmugasundaram. I missed which one was actually presenting. It looks interesting, but the name has to change.


Sebastiano Vigna, of the Università degli Studi di Milano, is talking about "The WebGraph Framework I: Compression Techniques" (co-author: Paolo Boldi). He wants to store the entire web directed graph (in a mathematical sense where URIs are nodes and links are arcs) which requires significant compression because the Web is so big. It's nice to see some math for a change, but the practical impact (if any) escapes me. Apparently it escapes Vigna as well. "We do it because it's fun." It does let you do Google-like page rank tests on a PC. They got slashdotted. They can compress down to 3-3.5 bits per link. "WebGraph exploits the fact that many links within a page are consecutive (with respect to lexicographic order)." He's running RedHat on his laptop. First time I've noticed that at this show (though PowerBooks are fairly common).
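The remark about consecutive links can be illustrated with a toy sketch in Python. This is not the actual WebGraph format, just two standard tricks under the assumption that pages are numbered in lexicographic URL order: store the differences between sorted link targets instead of the targets themselves, and collapse runs of consecutive targets into (start, length) pairs. The node numbers are invented.

```python
def to_gaps(successors):
    """Sorted node numbers -> first value followed by successive differences."""
    gaps = []
    previous = 0
    for node in successors:
        gaps.append(node - previous)
        previous = node
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sums recover the original node numbers."""
    nodes = []
    total = 0
    for gap in gaps:
        total += gap
        nodes.append(total)
    return nodes


def to_intervals(successors):
    """Collapse runs of consecutive node numbers into (start, length) pairs."""
    intervals = []
    for node in successors:
        if intervals and node == intervals[-1][0] + intervals[-1][1]:
            start, length = intervals[-1]
            intervals[-1] = (start, length + 1)
        else:
            intervals.append((node, 1))
    return intervals


# Hypothetical page linking mostly to its neighbors on the same site.
SUCCESSORS = [1000, 1001, 1002, 1003, 1010, 2500]

if __name__ == "__main__":
    print(to_gaps(SUCCESSORS))
    print(to_intervals(SUCCESSORS))
```

Small gaps and short interval lists take few bits under a variable-length integer code, which is where the saving comes from when most links stay within a site.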


For the final session of the regular conference, I plan to return to the W3C track to hear about "Future Work in W3C - Public Q&A" chaired by Steve Bratt. First, Tim Berners-Lee is talking about "What is coming up in W3C?" and then there's the public Q&A session. TimBL uses a PowerBook, and writes (or at least publishes) his slides in HTML, and he has iTunes playing some New York themed jazz before the session starts. The outline for the talk is:

What Might be Next for W3C? ...
New Working Groups on XML Binary Characterization, SYMM, Math Interest Group, RDF Data Access, and Semantic Web Best Practices and Deployment. The QA working group has not attracted a lot of volunteers from W3C members. Everyone wants someone else to do it.
Considerations for New Work (part 1)

Future of XML. XML attracted the least interest of all the tracks here at WWW 2004. "Maybe it's kind of done." XML is the foundation. "For a lot of people XML's done". (I'm not convinced: I think a lot of XML folks may have just gone to Amsterdam last month instead of this show. There were a lot of people there I don't see here. In fact, off the top of my head there's exactly one speaker in common between the two.) However, RDF has noticed that merging pieces of documents isn't really well addressed. Maybe this requires more work?

The Semantic Web is driving interest in privacy because people are scared of what the semantic Web may do.

Should the W3C open up broad horizontal apps/vocabularies such as life sciences, geospatial, calendaring, social networking (e.g. Friendster), publishing/syndication/RSS, etc.? Calendaring work is ongoing at the IETF. He mistakenly claims RSS 0.9 wasn't RDF. I don't think that's quite true. It was consistent with the RDF draft in existence at the time.

There's an upcoming workshop on device independence in October, location to be announced.

Considerations for New Work (part 2)
Usability is not just accessibility. Why content filtering for just mobile devices? "It's amazing how little we do that is secure. How few things are signed and encrypted." What are we going to do about the digital divide between the developed and developing world?

Q&A commences. According to IBM's Mary Ellen Zurko, Germany is considering banning JavaScript. One of the panelists (Phillippe?) suggests that asking users if they want to run active code is pointless. They always click OK. Zurko agrees. TimBL: "Should the W3C start looking at mail?" The W3C has three people fighting spam full time. "Nominally not really in our area." Are the groups looking at spam too academic, he wonders?

TimBL: Haystack is a new UI metaphor. There's a "dire need" for a good user interface for the semantic web.

A couple of questions from the audience: What about publishing and annotations? TimBL recounts some interesting history. Amaya does do annotations.

Fabio Vitali wants to know why there aren't synchronous HTML editors/browsers. If I understand him, this is editing directly in the browser frame without switching modes and saving just as you would in a word processor (e.g. like I've been saving these notes in BBEdit directly onto the site for the last three days).

TimBL: "The Semantic Web is not AI. It's just databases....nobody doing the semantic web is holding their breath for strong artificial intelligence."


The hotel air conditioning seems to be set to "Ice Box" and I've developed a nasty cold over the last three days. If it doesn't get worse, I'll probably be back tomorrow for the Developer Day.

Thursday, May 20, 2004

Today I'm at WWW2004 in New York again. I'll be here through Saturday, and live updates will continue as long as my laptop battery and the wireless network hold out. There may be fewer photos today, though, since I forgot to charge up my camera battery last night. :-(

By the way, if anyone wants to post notes from the sessions I can't attend (there are about six running in parallel) I'll be happy to post them here.


Thursday kicks off with keynotes by Udi Manber of Amazon.com and Rick Rashid from Microsoft. Apparently this morning is a joint session with the ACM SIG-ECOM. I guess there's an e-commerce conference running simultaneously.

Rashid's nominal topic is "Empowering the Individual". "Scientists and engineers don't create the future. We create the physical and intellectual raw materials from which the future can be built." He's mostly telling us things we already knew, including that the Internet squashed Microsoft's vision of the Information superhighway as effectively as an 18-wheeler squashes an armadillo (my metaphor, not his). Plus he's said several things that sound great, but that Microsoft is actively working against, notably sharing what you want, when you want, with who you want. He's feeding us a line of absolute crap about a virtual telescope that just doesn't fit with how astronomy really works. There may be something valuable in what he's referring to, but there's no way it's anything like what he's saying.

He's talking about people sharing with each other. He's not (yet) talking about sharing with big corporations, telemarketers, law enforcement, etc. Something called Wallop is Microsoft's response to blogging/Friendster. They add rich multimedia information. It involves a dynamically computed social network.

Moving on to storage. What happens when the individual has a terabyte of storage? Today, it costs about $1000. That's a lifetime of conversation or a year of full-motion video. (I'm not sure those numbers add up. One DVD is five gigs and that's only two hours.) Stuff I've Seen: archive everything you've ever looked at on the Web, all e-mail (from Outlook), and everything in your Documents folder. The logical extreme is the SenseCam (Lyndsay Williams, et al), a gargoyle? in Neal Stephenson's terminology (Snow Crash) that records everything you see and hear. Well, not quite everything. The current prototype only captures 2000 still images a day. Image captures can be set off by various triggers. It uses a fisheye lens, which is a bad idea (IMO). A regular lens that focuses on what I'm looking at would make a lot more sense. I'm generally not very interested in what happens 90° away from where I'm looking. Of course, this raises privacy issues. Can the government subpoena this stuff? Of course. They can subpoena personal diaries now.
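The doubt about the video figure can be checked with quick arithmetic, sketched here in Python. The only inputs are the numbers already in the paragraph (five gigabytes per two-hour DVD, one terabyte of disk) plus the assumption of decimal units, 1 TB = 1000 GB.

```python
# Figures from the text, plus one assumption: decimal units (1 TB = 1000 GB).
TERABYTE_GB = 1000
DVD_GB = 5
DVD_HOURS = 2
HOURS_PER_YEAR = 365 * 24

# Storage rate of DVD-quality video.
gb_per_hour = DVD_GB / DVD_HOURS

# How much DVD-quality video fits in a terabyte.
hours_per_terabyte = TERABYTE_GB / gb_per_hour
days_per_terabyte = hours_per_terabyte / 24

# How much disk a full year of DVD-quality video would need.
terabytes_per_year = HOURS_PER_YEAR * gb_per_hour / TERABYTE_GB

if __name__ == "__main__":
    print(f"{hours_per_terabyte:.0f} hours (~{days_per_terabyte:.0f} days) per terabyte")
    print(f"{terabytes_per_year:.1f} terabytes per year")
```

At DVD bit rates a terabyte holds about 400 hours, a bit under 17 days, and a full year would need roughly 22 terabytes. A year in a terabyte only works at a much lower bit rate than DVD.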


Anyone know of a good audio recording package for Mac OS X that allows me to record lectures straight from my PowerBook? Maybe iMovie can do it? Hmm, it's complaining about the disk responding too slowly. It also seemed to stop recording after about 14 seconds. And the audio quality is pretty low. Pradeep Bashyal and Stuart Roebuck both suggested Audio Recorder. Downloading it now. At first glance, it seems to work, and it's nice and simple. It takes about 1 meg a minute for MP3 recording. After the session is over I'll have to see if it actually recorded anything. David Pogue suggests Microsoft Word 2004.


Udi Manber of Amazon/A9 isn't allowed to talk about other companies or what they're working on now by Amazon policy. (Someone needs to buy a ticket on the cluetrain.) They aren't trying to compete with Google. Some observations about search:

  • Ease of use is a huge barrier to advanced search techniques.
  • Relevancy is hard to measure, and changes all the time, and is different for different people. It's difficult because it's about people, not material.
  • Anecdotes (particular queries) will lead you astray.
  • It's not about speed or size; it's all about quality/relevancy of results.

Bottom line: search is hard.

He's talking about search inside the book. They scanned all the books themselves. They originally planned to use screensavers to do the OCR, but they found idle disaster recovery machines to do the work. About half the audience here today has tried this (according to a quick show of hands). Edd (on IRC) suggests ego-surfing inside the book.

A9 can remember searches over time and call out the new results. You can annotate pages with a diary using the A9 toolbar.

A pessimist sees the glass as half empty. An optimist sees the glass as half full. An engineer sees a glass that's twice as large as necessary. If the audio recording works so I can get the exact wording on that, and if it turns out not to be by somebody else, that may be a quote of the day. Nope, appears to not be original.

What if everyone becomes an author?


For the first parallel sessions this morning I think I'm going to return to the W3C track for Semantic Web, Phase 2: Developments and Deployment. Hopefully I can snag an electrical outlet since my battery is running down. If not, you may not hear from me till lunch.


Eric Miller is delivering a Semantic Web Activity Update. (When looking at the notes, just cancel the username/password dialog and you'll get in.) The RDF working group charter will expire and not be renewed at the end of May. Application work will continue. The RDF Data Access Group led by Dan Connolly is moving quickly and will publish its first public draft soon. The Semantic Web Best Practices and Deployment Working Group provides "guidance, in the form of documents and demonstrators, for developers of Semantic Web applications." Several other individuals are now going to demonstrate examples of deployment beginning with Chuck Meyers.

Eric Miller at WWW2004

Chuck Meyers from Adobe says the latest Adobe publishing products are all RDF enabled using the Extensible Metadata Platform (XMP). The problem is someone has to enter the metadata to go with the pictures, but they do import Exif data from digital photos. (Now if only my digital camera would stop forgetting the date.) This work goes back six years. The toolkit is open source, and 3rd party ports have expanded platform support. Going open source is unusual for Adobe. XMP uses processing instructions?!? for a lot of content. It's not clear why they don't use plain elements. I couldn't see what RDF brought to this party that you can't get from regular XML.

Chuck Meyers at WWW2004

Frank Careccia from Brandsoft is talking about Commercializing RDF. "RDF is to enterprise software what IP was to networking." Lack of ownership of corporate web sites is a problem. There's no control. He sees this as a problem rather than a strength. He wants to present one unified picture of the company to the outside world, rather than letting the individual people and departments explain themselves. Another person who needs a ticket on the cluetrain.

Frank Careccia at WWW2004

Dennis Kwan from IBM is talking about BioHaystack, a gateway to the Biological Semantic Web. They want to automate the gathering of data from multiple sites. LSID URNs are a common naming convention. The trick is identifying the same objects across different databases. They want non-programmers to be able to use BioHaystack to access these heterogeneous biological data sources. So far this talk came closer than most SW talks to actually saying something. I felt like I could almost see a real problem being solved here, but still not quite. The talk was not concrete enough to show how RDF and friends actually solved a concrete problem.

Dennis Kwan at WWW2004

Jeff Pollock? from Network Inference is talking about the business case for the Semantic Web. Customers are not interested in Semantic Web solutions per se. They're looking for painkillers rather than vitamins, but he thinks SW is a painkiller that can reduce cost. RDF and OWL enable standard, machine interpretable semantics. XML enables only syntax. Or at least so he claims. I agree about XML, but so far I've yet to see evidence that there's more semantics in RDF/OWL than in plain XML.

Jeff Pollock at WWW2004

There's a huge number of big words being tossed around this morning (business inferencing, .NET tier, dynamic applications, reclassify corporate data, proprietary metadata markup, "align the semantics of federated distributed sources", "rich, automatic, service orchestration", etc.) which mostly seem to obscure the fact that none of this does anything. OK, finally he talks about a use case of a chart of accounts problem for a Fortune 500 electronics corp.: reporting dollars for cameras vs. phones for camera phones. OK. This is somewhat concrete, but I'd like to see more details.

Finally Dave Reynolds from HP is talking about work at HP Labs. Jena is an open source semantic web framework. Joseki is Jena's RDF server. "There's no single killer app." They're investing in and exploring a broad range of applications: Semantic blogging, information portals, and SMILE Joint. He's describing several application areas, but again I don't see how RDF/OWL have anything to do with what he's talking about. I realized part of the problem: no one is showing any code. I feel like I'm a mechanical engineer in 1904 listening to a bunch of other engineers talk about airplanes, but nobody's willing to show me how they actually expect to get their flying machines into the air. Maybe they can do it, but I won't believe it until I see a plane in the air, and even then I really want to take the machine apart before I believe it isn't a disguised hot air balloon. A lot of what I'm hearing this morning sounds like it could float a few balloons.

Dave Reynolds at WWW2004

Question from the audience: have any of you done any testing to make sure your products are interoperable?


For the first block of afternoon parallel sessions I headed down to conference room E, which apparently also has one of those annoying routers that lets you connect but doesn't provide DNS service. Bleah. Consequently you won't get to read these notes in real time, but at least I can spell check them before posting. One of the conference organizers is hunting down the network admins to see if they can fix it, so maybe you'll get to read these a little sooner, but I'm not holding my breath. I finally managed to connect to the wireless network in the next room over. I'm not sure how long this will work. This room's router still isn't working. Hmm, I think I just spotted it. I wonder what would happen if I just walked over and rebooted it?

The first afternoon session is Optimizing Encoding, chaired by Junghoo Cho of UCLA. The first of three talks in this session is Xinyi Yin of the National University of Singapore on "Using Link Analysis to Improve layout on Mobile Devices." (Co-author: Wee Sun Lee). The challenge is to provide a reasonable browsing experience on a mobile device with a small screen and low bandwidth. They use an idea similar to Google's PageRank to select the most important information on the page, and present that most important content only (or first) on the small screen. That sounds very challenging and original. (The presenter does know of some prior art though, which he cites, even though I hadn't heard of it.) However, they weight elements' importance by the probability of the user focusing on an object and by the probability of the user's eye moving from that object to another object. Mostly this seems to be based on size, with some additional weight given to objects in the center, the width/height ratio (unimportant info like navigation bars and copyright info tends to be quite narrow), and the connection between words in the object and words in the URL. This is primarily designed for news sites like CNN.
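The general idea of ranking page objects with a PageRank-style random walk can be sketched as follows. This is only an illustration of the technique as I understood it from the talk, not the authors' algorithm: the function name, the damping factor, and the sample probabilities are all assumptions made up for the example.

```python
def rank_blocks(focus, transition, damping=0.85, iterations=50):
    """PageRank-style ranking of page blocks (hypothetical sketch).

    focus:      block -> probability the user looks at it first (sums to 1)
    transition: block -> {block: probability the eye moves there next}
                (each row sums to 1)
    """
    rank = dict(focus)
    for _ in range(iterations):
        new_rank = {}
        for block in focus:
            # Attention flowing into this block from every other block.
            inflow = sum(rank[src] * transition[src].get(block, 0.0)
                         for src in focus)
            new_rank[block] = (1 - damping) * focus[block] + damping * inflow
        rank = new_rank
    return rank


# Made-up numbers for a three-block news page.
focus = {"headline": 0.6, "nav": 0.1, "story": 0.3}
transition = {
    "headline": {"story": 0.9, "nav": 0.1},
    "nav": {"headline": 0.5, "story": 0.5},
    "story": {"headline": 0.7, "story": 0.2, "nav": 0.1},
}
ranks = rank_blocks(focus, transition)
```

A small-screen renderer could then show the blocks in descending order of rank, or drop the lowest ranked ones entirely.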

Xinyi Yin at WWW2004

The talk that sold me on the Optimizing Encoding session was IBM's Roberto J. Bayardo "An Evaluation of Binary XML Encoding Optimizations for Fast Stream based XML Processing." (Co-authors: Daniel Gruhl, Vanja Josifovski, and Jussi Myllymaki) I suspect I'll hate it, but I could be wrong. My prediction for yesterday's schema talk proved to be quite off base. Maybe these guys will surprise me too. My specific prediction is that they'll speed up XML processing by failing to do well-formedness checking. Of course, they'll hide that in a binary encoding and claim the speed gain is from using binary rather than text. However, I've seen a dozen of these binary panaceas and they never really work the way the authors think they work. An XML processor needs to be able to accept any stream of data and behave reliably, without crashing, freezing, or reporting incorrect content. And indeed a real XML parser can do exactly this. No matter what you feed it, it either succeeds or reports an error in a predictable, safe way. Most binary schemes incorrectly assume that because the data is binary they don't need to check it for well-formedness. But as anyone who's ever had Microsoft Word crash because of a corrupted .doc file or had to restore a hosed database from a backup should know, that's not actually true. Binary data is often incorrect, and robust programs need to be prepared for that. XML parsers are robust. Most binary parsers aren't. The fact is when you pull well-formedness checking out of a text-based XML parser it speeds up a lot too, as many developers of incomplete parsers like ???? have found. Binary is not fundamentally faster than text, as long as the binary parser does the same work as the text based parser does. Anyway, that's my prediction for where this talk is going (I haven't read the paper yet.) In an hour, we'll see if my prediction was right.
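To make the "succeeds or reports an error in a predictable, safe way" point concrete, here is a minimal sketch using Python's standard library parser (which sits on top of expat). The helper name parse_or_report is made up for the example; the only behavior relied on is that ElementTree raises ParseError for input that is not well-formed.

```python
import xml.etree.ElementTree as ET


def parse_or_report(data):
    """Return ("ok", root tag) or ("error", message); never anything else."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as err:
        # Malformed input is rejected with a well-defined exception
        # rather than a crash or silently wrong content.
        return ("error", str(err))
    return ("ok", root.tag)


good = parse_or_report("<a><b/></a>")
mismatched = parse_or_report("<a><b></a>")     # end tag does not match
truncated = parse_or_report("<a>")             # document ends too early
garbage = parse_or_report("not xml at all")    # no markup whatsoever
```

A binary format that wants to claim the same robustness has to do the equivalent checks on its own input, and that work costs time in any encoding.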

He thinks XML is too heavyweight for high performance apps, both in parsing overhead and space requirements. (I completely disagree. I'm really having to bite my tongue here to let the speaker finish in his small allotted time.) "Bad side of XML is addressed by throwing away (much of) the good side." XML compression trades speed for small size. It also negatively impacts the ability to stream. Some parsing optimizations slow down the encoding. In other words they trade speed of encoding for speed of decoding. He's only considering single stream encodings. Strategies tested include:

  • Alternate delimiters
  • Tokenization of common strings: a string table makes this inappropriate for streaming because you need to know the tokenized strings in advance. Demand-driven tokenization may work, but causes problems for random access.
  • Random access (indexing) support: doesn't work well with most APIs and many queries
  • Schema-based optimizations: doesn't work for schema-less documents and brittle in face of schema changes
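Demand-driven tokenization from the list above can be sketched like this. It is a generic illustration, not XTalk's actual encoding: the ("def", ...) and ("ref", ...) record shapes are assumptions invented for the example. Each string is sent in full the first time it appears and as a small integer afterwards, so no string table has to be known in advance.

```python
def encode(names):
    """Replace repeated names with integer references, defining on first use."""
    table = {}
    out = []
    for name in names:
        if name in table:
            out.append(("ref", table[name]))
        else:
            table[name] = len(table)
            out.append(("def", name))
    return out


def decode(stream):
    """Rebuild the names while reading the stream front to back."""
    table = []
    for kind, value in stream:
        if kind == "def":
            table.append(value)
            yield value
        else:
            yield table[value]


names = ["book", "title", "author", "book", "title", "author"]
encoded = encode(names)
decoded = list(decode(encoded))
```

The sketch also shows why random access suffers: a ("ref", 0) record in the middle of the stream means nothing unless the decoder has already seen the matching definition earlier on.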

Most of their evaluations are based on IBM's XTalk plus various enhanced versions of XTalk. (He seems to be confusing SAX with unofficial C++ versions.) TurboXPath is a representative application. Two sample data sets: record-oriented (a big bibliography, 180 MB) and a more deeply nested collection of real estate listings grouped by cities (50 MB). Used Visual C++ on Windows. Linux results were similar. They tested expat and Xerces-C. (Why not libxml2?) Their numbers show maybe a third to a half improvement by using binary parsing. expat was three times faster than Xerces-C (on their first example query). On the second query, improvement is a little more, but in the same ballpark. Skip pointers are a significant improvement here. "Don't throw out plain XML if you haven't tried expat." "Java and DOM based parsers are not competitive."

It took till Q&A to find out for sure, but I was right. They aren't doing full well-formedness checking. For instance, they don't check for legal characters in names. That's why XTalk is faster. They claim expat isn't either (I need to check that) but there was disagreement from the audience on this point. But the bottom line is they failed to demonstrate a difference in speed due to binary vs. text.


The closing talk of this session is Les Kitchen of the University of Melbourne (co-author Jacqueline Spiesser) on "Optimization of HTML Automatically Generated by WYSIWYG programs" such as Microsoft Word, Microsoft Excel, Microsoft FrontPage and Microsoft Publisher. (The talk should really be titled "Optimization of HTML Automatically Generated by Microsoft programs.") This was inspired by an HTML train timetable that took 20 minutes to download over a modem.

Techniques:

  • Parsing/unparsing
  • Factoring out style classes
  • Attribute rearrangement (moving common attributes to parent elements)
  • Dead code elimination
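As a rough illustration of the "factoring out style classes" step, the sketch below replaces inline style strings that occur more than once with generated class names. This is not the authors' Haskell implementation; the function name, the generated class names (c0, c1, ...), and the min_uses threshold are assumptions for the example.

```python
from collections import Counter


def factor_styles(styles, min_uses=2):
    """Turn repeated inline styles into CSS classes.

    styles: the inline style string of each element, in document order.
    Returns the generated CSS and, per element, what attribute to emit.
    """
    counts = Counter(styles)
    classes = {}
    for style, uses in counts.items():
        if uses >= min_uses:
            classes[style] = "c%d" % len(classes)
    css = "\n".join(".%s { %s }" % (name, style)
                    for style, name in classes.items())
    rewritten = [("class", classes[s]) if s in classes else ("style", s)
                 for s in styles]
    return css, rewritten


css, rewritten = factor_styles(
    ["color: red", "color: red", "font-weight: bold"])
```

A style used only once is left inline, since moving it into a class would add bytes rather than save them.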

Based on parsing and unparsing alone, sizes ranged from 53% to 142% of the original. Adding the factoring out of style classes brings this to between 39% and 89% of the original sizes. Optimal attribute placement improves this again, but I'm not sure I'm reading the numbers in his table correctly. Optimal attribute placement is slow. The total of all four optimizations reduces size to between 33% and 72% of the original size. 56% is the "totally bogus average on such a small data set." Their implementation was written in Haskell.

He suggests plugging this in as a proxy server or a stand-alone optimizer. However, to my way of thinking this needs to be fixed in Microsoft Word (and similar tools). Fixing it anywhere else will have a negligible impact.


For the second session block of the afternoon I was torn between XForms and XML: Progress Report and new Initiatives; but I picked the latter to see what Liam Quin had to say about the progress of "Binary Interchange of XML." However, the session begins with Lionel Villard's inoffensive XSLT 2.0 tutorial. There don't seem to be any surprises here.

Lionel Villard at WWW2004

Michael Rys, SQL Server program manager at Microsoft (which has recently pledged not to support XSLT 2) is now going to talk about XQuery 1.0 and XPath 2. XQuery was the largest working group up to that time. SOAP later surpassed it. "It's huge. There are functions in there I haven't even seen yet." There are over 1200 public comments; 50/50 editorial vs. substantive. "It probably takes another four years to get through with them. I hope not." Summer 2004 XQuery Full-Text language proposal published. Spring 2005 short last call period for XQuery 1.0/XPath 2.0. Plan to finish in 2005.

Michael Rys at WWW2004

Next up the W3C's Hugo Haas is going to talk about XML Security; i.e. XML Signatures and XML Encryption. Nothing majorly new here; a brief overview of the existing specs.

Hugo Haas at WWW2004

Finally, the talk I came to hear: Liam Quin is talking about Binary Interchange of XML. "Wake up! This is the controversial stuff!" No wonder: less than a minute in and he's already going off the rails. He explicitly defines an XML document as some in-memory model that needs to be serialized with angle brackets. No! The angle brackets serialization is the XML. The in memory data structure is not. Apparently schema-based schemes for binary formats have been ruled out. That's good to hear. "You might lose comments or processing instructions"?!? That's shocking. These are in the infoset. Furthermore, he really wants to bring in binary data; not just binary encoded XML text but real binary data for video or mapping data. This is a big extension of the XML story. It really adds something to the Infoset that isn't there today. What's being proposed is a lot more than a simple binary encoding for XML. It is a superset of the XML model, not an encoding of it! This is not even Infoset or XML API (SAX, DOM, XOM, etc.) compatible in any meaningful sense. I know you could always encode everything in Base-64 but that would eliminate the speed and size advantages this effort is trying to achieve. (David Booth suggests doing this with unparsed entities? Would that work? I need to think further on this.) This is worse than I thought. However, there may be hope. According to the notes, the working group "has a One-year charter; if they can't make a case by then, the case isn't yet strong enough." The clock started ticking three months ago.

Liam Quin at WWW2004

Wednesday, May 19, 2004

Today I'm at WWW2004 in New York. I'll be here for the next four days. Fortunately there's wireless access here (in fact, there are several wireless networks) so I'll be reporting live. There don't seem to be a lot of PowerBooks here, but I should probably finally figure out how to use this iChat thingie to see if anyone else is chatting about this live. Wasn't there an O'Reilly article about this a year or so ago? Anyone have a URL? There don't seem to be a lot of power plugs here, though. This may grind to a halt in about two hours if I can't find a place to plug in.

Actually, it looks like people are using IRC. xmlhack (may not be dead yet) is running an IRC channel collecting various comments about the show. "To join in, point your IRC client at irc.freenode.net's #www2004 channel. The channel is publicly logged." I tried to login, but Mozilla's IRC client froze on me. It looks like Mozilla 1.7 RC 2 has just been posted. I'll have to try that. OK, that seems to work. The chat is being logged for anyone who can't connect, or who just wants to read. 34 users online as of 3:00.

I'm doing something I don't normally do, editing this file directly on the live site, so there may be occasional glitches and spelling errors (or more than usual). I'll correct them as quickly as I can. Hmm, I already see one glitch. The non-ASCII characters got munged by BBEdit (again!). Can't we just all agree to use UTF-8, all the time? I'll fix it as soon as I get a minute. Also I reserve the right to change my mind and delete or rewrite something I've already written.


Scanning the papers, I now understand why my XOM poster was rejected. I submitted the actual poster instead of a paper about the subject of the poster. I just wish I could have gotten someone to tell me this before I wrote the thing. I asked repeatedly and was unable to get any useful information about what they were looking for in poster submissions.


Day 1 (after two days of tutorials) kicks off with a keynote address from Tim Berners-Lee. They're doing it in a ballroom with everyone seated at tables, which makes it feel a lot more crowded than it actually is. Still, it looks like there's about a thousand people here, give or take three hundred.

There are a few suits before Berners-Lee including someone from the ACM (a cosponsor I assume) pitching ACM membership, someone who brings up 9/11 for no apparent reason, and someone from the Mayor's office. (The Mayor is sorry he couldn't be here. We're very glad to have you here. Technology is important. He proclaims today to be World Wide Web Day. Yadda, yadda, yadda...)

OK, Berners-Lee is up now. He wants to randomly muse, especially on top-level domain names and the semantic web. He thinks .biz and .info have been little more than added expense. He doesn't see why we need more. .xxx is wrong place to put metadata about what is and is not porn. People have different ideas about this. "I have a fairly high tolerance for people with no clothes on and a fairly low tolerance for violence." .mobi is also misguided, but for different reasons. Shouldn't need a different URI for mobile content. Instead use content negotiation (my inference, not his statement). Plus it's not clear what a mobile device is: no keyboard? low bandwidth? These issues change from device to device and over time. The device independence working group is addressing these, but the problem is more complicated than can be fixed by adding .mobi to the end of a few URIs. The W3C site works pretty well on cell phones. CSS is a success story we should celebrate.

Tim Berners-Lee opening keynote at WWW2004

On to the semantic Web. This actually helps with the mobile device problem by shipping the real info rather than a presentation of the real info. The user interface can be customized client side to the device. MIT's Haystack is an implementation of this idea. He wants to remove processing instructions from XML and bags from RDF (not that he really plans to do this or expects it to happen). His point is that we have RDF even if the syntax is imperfect. (That's an understatement.) It's time to start using RDF. The semantic web phase 2 is coming. I didn't realize phase 1 was finished or working. I guess he sees phase 1 as the RDF and OWL specs. Phase 2 is actually using this stuff to build applications. I'm still a skeptic. The semantic web requires URIs for names and identifiers. To drive adoption, don't change existing systems. Instead use RDF adapters. Justify on short term gain rather than network effect. We have to put up the actual OWL ontologies, not hide them behind presentational HTML. Let a thousand ontologies bloom. We don't need to standardize the ontologies. It's all about stitching different ontologies together.


I found a power plug, so I should be able to keep this up through at least the next couple of sessions. Unfortunately, the wireless network in the current room is having DNS troubles (or at least it is with my PowerBook; others seem to be connecting fine) so you may not read this in real time. :-( The XML sessions this afternoon are in a different room. Hopefully the wireless will work there.


As usual at good conferences, there are a lot of interesting sessions to choose from. I decided to sit in the security and privacy track for this morning. There appears to be only one half-day track on XML, happening this afternoon.

The session has three talks. First up is an IBM/Verity talk on "Anti-Aliasing on the Web" by Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins (presented by Novak). The question: Is it possible to discover if an author is writing under multiple aliases/pseudonyms by analyzing their text? Stylometrics uses a large body of text to figure out who wrote what. This goes back at least 40 years. Think of the outing of Joe Klein as the author of Primary Colors, but there are also cases where this technique has failed. Traditionally this uses small, function words like "but", "and", etc. that are topical-vocabulary independent. Also emoticons. She claims they achieved an error rate of less than 8% based on fewer words. Their research is based on Court TV message boards, specifically the Laci Peterson case and the war in Iraq. They gave postings fake aliases and then attempted to reunite them. They used 50 messages, and removed signatures, headers, and other content that would clearly reveal who was who. The idea is to see if they can then reconnect the messages to the authors. Using all words proved to be the best algorithm, though in this case everyone was writing on the same topic to begin with, and the amount of text analyzed was small so the traditional reasons for analyzing by function words don't apply. They use compression theory to measure the similarity of two texts, based on optimal encodings (KL similarity).
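One common way to compare two texts along these lines is a Kullback-Leibler divergence between their word distributions. The sketch below is a generic version of that idea and makes no claim to match the paper's exact measure: whitespace tokenization, add-one smoothing, and symmetrizing by summing both directions are all assumptions chosen for the example.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary, smoothing=1.0):
    """Smoothed unigram probabilities of `text` over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """Extra bits needed to encode text drawn from p with a code built for q."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def kl_distance(text_a, text_b):
    """Symmetric distance: 0 for identical word usage, larger otherwise."""
    vocabulary = set(text_a.lower().split()) | set(text_b.lower().split())
    p = word_distribution(text_a, vocabulary)
    q = word_distribution(text_b, vocabulary)
    return kl_divergence(p, q) + kl_divergence(q, p)


post = "but the jury and the judge"
same_style = "but the judge and the jury saw"
other_style = "war oil troops iraq"
```

To reunite aliases, one would compute this distance between every pair of anonymized posting sets and pair up the closest ones.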

Could this be fooled by a deliberate effort to fool it? e.g. I overuse the word use. What if I deliberately wrote under a pseudonym without using the word use? What if I used tools to check how similar something I was writing pseudonymously was to my usual writing? They have considered this, and think it would be possible.

There seems to be a big flaw in this research: the extremely small sample size, which is not nearly large enough to prove the algorithm actually works, especially since they used this data to choose the algorithm. To be valid this needs to be tested on many other data sets. Ideally this should be a challenge-response test; i.e. I send them data sets where they don't know who's who. I may be misinterpreting what she's saying about the methodology. I'll have to read the paper.

She says this would not scale to millions of aliases and terabytes of data. In Q&A she admits a problem with two people on opposite sides of the argument being misidentified as identical because they tend to use the same words when replying to each other.


Next up Fang Yu of Academia Sinica is talking about static analysis of web application security. I'm quite fond of static analysis tools, in general. Because web applications are public, almost by definition, firewalls can't help. Symantec estimates that 8 of the top 10 threats are web application vulnerabilities. The problem is well known. You write a CGI script that passes commands to the shell, and the attacker passes unexpected data (e.g. "elharo; \rm -r *" instead of just a user name) to run arbitrary shell commands. JWIG and JIF are Java tools for doing static analysis for these problems. Yu is interested in weakly typed languages such as PHP. His WebSSARI tool does static analysis on PHP. This requires type inference which imposes a performance cost. They analyzed over one million lines of open source PHP code in 38 projects. It reported 863 vulnerabilities, of which 607 were real problems. The rest were false positives. That's pretty good. (Question from the audience: how does one assess the false negative rate?) Co-authors include Yao-Wen Huang, Christian Hang, Chung-Hung Tsai, Der-Tsai Lee, and Sy-Yen Kuo.
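The class of bug being hunted here is easy to show in a few lines. This sketch is in Python rather than PHP and has nothing to do with how WebSSARI itself works; the finger command and the helper names are hypothetical, chosen only to illustrate untrusted data reaching a shell command and two ways of sanitizing it.

```python
import re
import shlex

# Hypothetical policy: user names are letters, digits, and underscores only.
USERNAME = re.compile(r"[A-Za-z0-9_]+")


def unsafe_command(user):
    # Vulnerable: untrusted input is pasted straight into a shell command.
    return "finger " + user


def is_valid_username(user):
    # Whitelist validation: reject anything that is not a plain user name.
    return USERNAME.fullmatch(user) is not None


def safe_command(user):
    # Quoting turns the whole input into a single shell word.
    return "finger " + shlex.quote(user)


attack = r"elharo; \rm -r *"
```

A static analyzer of this kind essentially checks that every path from user input to a shell call passes through something like is_valid_username or safe_command first.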


Halvard Skogsrud of the University of New South Wales is talking about "Trust-Serv: Model-Driven Lifecycle Management of Trust Negotiation Policies for Web Services." (Co-authors: Boualem Benatallah and Fabio Casati.) Trust negotiation maps well to state machines.


The XML track kicks off the afternoon sessions (in a room with no wireless access at all, I'm sorry to say, so once again there won't be any realtime updates. Oh wait, looks like someone just turned on the router. Let's see if it works. Cool! It does. Updates commencing now.) There are about 45 people attending. There doesn't seem to be a lot of specifically XML content at this show, just three sessions. I guess most of the XML folks are going to the IDEAlliance shows instead. I've seen relatively few people I recognize, neither the usual XML folks nor the usual New York folks. I do know a lot of the New York web crowd are getting ready for CEBIT next week.

Quanzhong Li of the University of Arizona/IBM is talking about "XVM: A Bridge between XML Data and its Behavior." (Co-authors: Michelle Y Kim, Edward So, and Steve Wood) So far this seems like another server side framework that decides whether to send XML or HTML based on browser capabilities. It loads different code components for processing different elements as necessary. XVM stands for "XML Virtual Machine", but I don't quite understand why. I'll have to read the paper to learn more.

Quanzhong Li speaking at WWW 2004

Fabio Vitali of the University of Bologna is talking about "SchemaPath, a Minimal Extension to XML Schema for Conditional Constraints" (Co-authors: Claudio Sacerdoti Coen and Paolo Marinelli) Before I hear it, I'd guess this is about how to say things like "the xinclude:include element can have an xpointer attribute only if the parse attribute has the value xml." OK, let's see if I guessed right.

They divide schema languages into grammar based (RELAX NG, DTD, W3C XML Schema, etc.) and rule-based languages (XLinkIt, Schematron). Hmm, I wonder if he wants to add rules to W3C XML Schemas? He notes requirements that cannot be expressed in W3C XML schemas:

  • Mutual exclusion
  • Deep exclusions (An XHTML a element cannot contain another a element, even as a descendant, not just as children)
  • Structure dependent structures
  • Data dependent structures: this is like the XInclude constraint I mentioned. Yep, I guessed right

He calls these co-constraints or, more commonly, co-occurrence constraints.

They add an xsd:alt element and an xsd:error type. The xsd:alt element allows conditional typing (one type if condition is true, another type if it's false.) The condition is expressed as an XPath expression in a cond attribute. The xsd:error type is used if the alternate condition is forbidden. I need to look at this further, but this feels like a really solid idea that fills a major hole in the W3C XML Schema Language. If I have this right, the XInclude constraint could be written like this:

<xsd:element name="xinclude:include">
  <xsd:alt cond="@parse='text' and @xpointer" type="xsd:error"/>
  <xsd:alt type="SomeNamedType"/>
</xsd:element>

This seems much simpler than adding a Schematron schema to a W3C XML Schema Language annotation. I wonder if these folks are working on Schemas 1.1 at the W3C?
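For comparison, here is what the same XInclude co-occurrence constraint looks like when checked by hand in application code instead of in the schema. This is just a sketch: the function name is made up, and the namespace URI is assumed to be http://www.w3.org/2001/XInclude.

```python
import xml.etree.ElementTree as ET

# Assumed XInclude namespace URI.
XINCLUDE_NS = "http://www.w3.org/2001/XInclude"


def find_violations(document):
    """Return the href of every include that combines parse="text"
    with an xpointer attribute, which the constraint forbids."""
    root = ET.fromstring(document)
    bad = []
    for elem in root.iter("{%s}include" % XINCLUDE_NS):
        if elem.get("parse") == "text" and elem.get("xpointer") is not None:
            bad.append(elem.get("href"))
    return bad


sample = """<doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="a.xml" xpointer="x"/>
  <xi:include href="b.txt" parse="text"/>
  <xi:include href="c.txt" parse="text" xpointer="y"/>
</doc>"""
violations = find_violations(sample)
```

Every application that reads the documents has to repeat a check like this, which is exactly the duplication a declarative co-constraint in the schema would remove.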

Fabio Vitali talking at WWW2004

Martin Bernauer of the Vienna University of Technology is talking about "Composite Events for XML" (Co-authors: Gerti Kappel and Gerhard Kramler). By event based processing he means things like SAX or DOM 2 Events. Rules are executed/fired when triggers are detected. But DOM events are not always sufficient. He wants events that include multiple elements; e.g. an item element event followed by a price element event. Composite events combine these individual events into the level of granularity/clumpiness appropriate for the application. Traditionally this has been done with sequential events, but XML documents/events also have a hierarchical order. This work is based on something called "Snoop", which I haven't heard of before.
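The "item element event followed by a price element event" example can be sketched as a tiny composite event detector layered over a stream of start-element events. This is only an illustration of the sequential case, not the Snoop-based design from the talk; the function name and the sample document are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def item_price_events(document):
    """Report each start of `item` that is immediately followed by a
    start of `price`, as a list of (item tag, price tag) pairs."""
    composite = []
    previous = None
    for _event, elem in ET.iterparse(io.BytesIO(document), events=("start",)):
        if previous == "item" and elem.tag == "price":
            # Two primitive events combine into one composite event.
            composite.append((previous, elem.tag))
        previous = elem.tag
    return composite


order = b"<order><item><price>5</price></item><item/><note/><price/></order>"
events = item_price_events(order)
```

Only the first item/price pair fires here; the second price follows a note element, so the sequence is broken. Handling the hierarchical order he mentions would mean tracking end events and nesting depth as well.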

Martin Bernauer talking at WWW2004

One thing I've noted is how international this show is. Despite the U.S. location, it looks like a majority of the presenters and authors are from Europe or Asia. The attendees have a somewhat higher percentage of U.S. residents, but I'm still hearing a lot of Italian, Chinese, British English, and other languages in the halls.


For the second session of the afternoon I was torn between Search Engineering I and Mixing Markup and Style for Interactive Content, so I decided based on which room had power plugs and wireless access. "Two of the presenters are the two blondest W3C staff members."

First up is the W3C's Steven Pemberton on Web Forms - XForms 1.0. XForms had more implementations on the day of release than any other W3C spec. HTML forms are an immense success. XForms try to learn the lessons of HTML forms. To this end:

  • They add type checking.
  • Don't mix presentation with data and function
  • No reliance on scripting
  • Integrate with existing data streams
  • Device independent: e.g. different widgets for lists depending on screen size
  • Better for complex forms
  • Internationalizable, especially with respect to character set differences between client and server
  • Accessible: which names go with which fields?

Look at CSS Zen Garden to prove you don't need to mix the document with presentation.

Steven Pemberton lecturing at WWW2004

According to a study of 8,000 users, there are only four reasons people prefer one site to another:

  • Good content
  • Usability
  • I missed one????
  • Fresh Content

For complex forms XForms adds:

  • Repeating structures
  • Wizards
  • Multiple submissions
  • Prefilling forms with XML data

XForms is more like an application environment. It could be used for things like spreadsheets.


Bert Bos is talking about CSS3 for Behaviors and Hypertext. Should we style user interface widgets such as buttons? Probably not. Users won't recognize them, and they'll be ugly. But we might want to blend them in with the site's CSS.

Bert Bos lecturing at WWW2004

Dean Jackson closes the day's talks with Mixed markup; i.e. compound documents like an XHTML+SVG+MathML document or XHTML+XForms. No browser can handle this, so he had to write his own for this presentation. He uses SVG as a graphics API and maps HTML into the SVG graphics. The Adobe SVG viewer 6 preview is required to display this. edd (Dumbill?) on IRC: "every year I rant that I can't see SVG, every year I'm told it'll be better :( no dice." According to Jackson, the real problem is authoring these compound documents, and writing a DTD for them is painful.

Dean Jackson lecturing at WWW2004

I'm going to go look at the posters and the exhibits now. More tomorrow.

Tuesday, May 18, 2004

JAPISoft has released EditiX 1.3, a $59 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.3 adds a tree element view, supports relative XPaths, and makes various improvements and speed-ups. EditiX is available for Mac OS X, Linux, and Windows.

Monday, May 17, 2004

The W3C CSS working group has posted the candidate recommendation of CSS3 module: Basic User Interface. According to the abstract, "CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. It uses various selectors, properties and values to style basic user interface elements in a document. This specification describes those user interface related selectors, properties and values that are proposed for CSS level 3 to style HTML and XML (including XHTML and XForms). It includes and extends user interface related features from the selectors, properties and values of CSS level 2 revision 1 and Selectors specifications." Defined properties (and property additions) include:

  • appearance
  • content
  • icon
  • box-sizing
  • outline
  • outline-width
  • outline-style
  • outline-color
  • outline-offset
  • resize
  • cursor
  • nav-index
  • nav-up
  • nav-right
  • nav-down
  • nav-left

This spec also defines these pseudo-elements and pseudo-classes:

  • :active
  • :default
  • :valid
  • :invalid
  • :in-range
  • :out-of-range
  • :required
  • :optional
  • :read-only
  • :read-write
  • ::value
  • ::choices
  • ::repeat-item
  • ::repeat-index

Sunday, May 16, 2004

The W3C Web Services Internationalization Task Force has published the third public working draft of Web Services Internationalization Usage Scenarios. This describes various issues that arise when using SOAP services in multi-language environments. For example, is it possible to send error messages in both English and Japanese?

Saturday, May 15, 2004

The GEO (Guidelines, Education & Outreach) Task Force of the W3C Internationalization Working Group (I18N WG) has published three new working drafts on Authoring Techniques for XHTML & HTML Internationalization:

  • Authoring Techniques for XHTML & HTML Internationalization: Handling Bidirectional Text 1.0 "provides advice for the use of markup and CSS to create pages for languages that use bidirectional text, such as Arabic and Hebrew. It attempts to counter many of the misunderstandings or over-complexities that currently abound. It also offers advice to those preparing content that will be localized into scripts that behave like Arabic and Hebrew."

  • Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings 1.0 "provides practical techniques related to character sets, encodings, and other character-specific matters that HTML content authors can use to ensure that their HTML is easily adaptable for an international audience. These are techniques that are best addressed from the start of content development if unnecessary costs and resource issues are to be avoided later on."

  • Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content 1.0. According to the abstract, "Specifying the language of content is useful for a wide number of applications, from linguistically sensitive searching to applying language-specific display properties. In some cases the full application is still awaiting full development, whereas in others, such as detection of language by voice browsers, it is a necessity today. Marking up language meta information is something that can and should be done today. Without it, none of these applications can be taken advantage of." Reading this document, I notice that I've been negligent in declaring the languages of my pages. I'll start fixing that now.

These documents are derived from the previous, monolithic Authoring Techniques for XHTML & HTML Internationalization 1.0. "The material in that document will now be published as a number of smaller independent documents to allow for easier ongoing improvements and updates. The total number of such documents is not fixed, but will grow as material and resources become available. The title of all related documents will begin with 'Authoring Techniques for XHTML & HTML Internationalization:...' and they can be found in the W3C technical reports index."

Friday, May 14, 2004

Recently I've received queries from several readers of Chinese translations of Processing XML with Java asking why they can't get to this site (which provides errata and updates for the book, among many other things). I've now confirmed my suspicion that the Chinese government is blocking this and all other sites hosted on IBiblio. I suspect IBiblio's support for Tibetan freedom and independence might have something to do with this. There's probably a way to get around the firewall, but off the top of my head I couldn't tell you how to do it; and of course even if I did, the people who need to know couldn't read it here anyway. :-(

Pointless it may be to post it here, but Bil Hays suggests, "Assuming that ibiblio is completely blocked, the best thing would be if he could get an account with shell access on any other machine in the world that isn't being blocked to china, and isn't blocking ibiblio. Then he could use ssh to that machine to build a tunnel to ibiblio, mapping a local port on his machine to port 80 on ibiblio through the intermediate server. It would be a longer round trip, tho. If you haven't looked into this kind of thing before, I know it sounds complicated, but it's not really. I've got some theory up at <http://wwwx.cs.unc.edu/help/network/info_sheets/tunneling.html>."

Wes Felter and Fred Stutzman both suggested using the Tor overlay network. According to Felter, "It goes without saying that it's probably illegal in China, but I suppose people who are bypassing the firewall know the risk."


RenderX has released three schemas for XSL Formatting Objects: a DTD, a validating XSLT stylesheet, and a RELAX NG schema.

Thursday, May 13, 2004

Opera Software has released version 7.5 of their namesake web browser for Windows, Mac, Linux, FreeBSD and Solaris. Opera supports HTML, XML, XHTML, RSS, WML 2.0, and CSS. XSLT is not supported. Other features include IRC, mail, and news clients and pop-up blocking. New features in this release include an IRC client, RSS support, full-text indexing of e-mail messages, and spell check. Opera is $39 payware.

Wednesday, May 12, 2004

The W3C has posted version 0.6.5 of their Markup Validator. Version 0.6.5 makes the error messages more explicit, simplifies navigation, improves consistency with many different browsers, supports more HTTP Status Codes, expands the documentation, uses more recent DTDs for ISO-HTML and SVG 1.0, supports the ISO-8859-16 (Romanian) and Big5-HKSCS (Chinese) encodings, supports the data: URI scheme, and no longer treats a missing DOCTYPE or Charset as a fatal error. It can be used from the W3C's web site or installed on a local server.


The W3C Quality Assurance Working Group has published the first public working draft of the QA Handbook. According to the abstract, "The QA Handbook (QAH) is a non-normative handbook about the process and operational aspects of the quality practices of W3C's Working Groups (WG). It is intended for Working Group chairs and team contacts, to help them to avoid known pitfalls and to benefit from experiences gathered from the W3C Working Groups themselves. It provides techniques, tools, and templates that should facilitate and accelerate the work of the WGs. This document is one of the QA Framework family of documents of the Quality Assurance (QA) Activity, which includes the other existing or in-progress specifications: Specification Guidelines; and, Test Guidelines."


Dave Beckett has released the Raptor RDF Parser Toolkit 1.3.0, an open source C library for parsing the RDF/XML and N-Triples Resource Description Framework formats. It uses expat or libxml2 as the underlying XML parser. Version 1.3 adds integer literals and utility sequence and stringbuffer classes. It also fixes some bugs. Raptor is published under the LGPL.

Tuesday, May 11, 2004

The W3C Scalable Vector Graphics Working Group has posted the seventh public working draft of Scalable Vector Graphics (SVG) 1.2. "This release is identical to the sixth draft, other than Appendix A, the SVG Tiny 1.2 DOM. This release was timed to coincide with the publication of the Java Community Process JSR 226 Expert Group's specification which relies on the SVG Tiny 1.2 DOM."

Monday, May 10, 2004

The IETF has posted the last call working draft of Internationalized Resource Identifiers (IRIs). "An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources." In other words this lets you write URLs that use non-ASCII characters such as http://www.libération.fr/. The non-ASCII characters would be converted to a genuine URI using hexadecimally escaped UTF-8. For instance, http://www.libération.fr/ becomes http://www.lib%C3%A9ration.fr/. There's also an alternative, more complicated syntax to be used when the DNS doesn't allow percent escaped domain names. However, the other parts of the IRI (fragment ID, path, scheme, etc.) always use percent escaping. Comments are due by May 23.
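The percent-escaping step is easy to try out. The following is only a minimal sketch of the escaping described above, not the full IRI-to-URI mapping from the draft: it leaves ASCII alone and replaces every other character with its hexadecimally escaped UTF-8 bytes. The function name is made up for illustration, the NFC normalization step is my own assumption about sensible input handling, and the alternative syntax for domain names is not covered.

```python
import unicodedata


def percent_escape_utf8(iri: str) -> str:
    """Replace each non-ASCII character with its %HH-escaped UTF-8 bytes."""
    # Assumption: normalize to NFC first so "e" + combining accent and the
    # precomposed character produce the same escaped result.
    iri = unicodedata.normalize("NFC", iri)
    pieces = []
    for ch in iri:
        if ord(ch) < 128:
            pieces.append(ch)  # ASCII passes through untouched
        else:
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


print(percent_escape_utf8("http://www.lib\u00e9ration.fr/"))
# prints http://www.lib%C3%A9ration.fr/
```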

Sunday, May 9, 2004

Tim Bray has posted beta 6 of Genx, his pure C library for outputting canonical XML. This version fixes a bug that caused the code to break under some optimizers. This involves some changes to method signatures. Genx is published under the expat license. This is a very liberal, non-viral but GPL-compatible license.


Version 0.9.4 of TM4J has been released. TM4J is "a topic map processing engine written in Java providing a pure Java API, support for the Tolog query language, support for importing XTM and LTM syntaxes; support for exporting XTM syntax; persistence of topic map information in a wide variety of databases." This release supports the TMAPI 1.0 beta.

Friday, May 7, 2004

Processing XML with Java has been translated into Chinese.


Mick Twomey has released pygenx, a set of Python bindings for Tim Bray's genx Canonical XML serializer.

Thursday, May 6, 2004

Alexandre Brillant has released FastParser 1.6.4, a $50 shareware, non-validating XML parser for Java that supports SAX and some of DOM. Version 1.6.4 fixes various bugs. Brillant claims this parser is faster than Xerces, but his benchmarks only test one file; and it's not clear from his results whether FastParser was used in a mode that doesn't perform full well-formedness checking.

Wednesday, May 5, 2004

At Bob DuCharme's suggestion, I've added id attributes to most of the elements in the permalink versions of these pages starting with yesterday's news. This should allow for reasonably stable, long-term linking to individual news items and parts thereof. There's no particular scheme for how the ID values are chosen. You'll need to view source on the page for the date's news (not the main page at http://www.cafeconleche.org/) to find them. If a page is edited during the day, the IDs might change, but they should stabilize within 24 hours of posting, and normally much faster than that. Even if the IDs do change when I add a new item or edit a preexisting one, the old IDs should still point to something close to what they originally pointed to. This is very much a "worse-is-better" solution, just like the one that generates the permalink pages. It will solve probably 95-99% of the problem at a cost well below what a full solution would require. (In other words, I hacked this together in about 15 minutes instead of the days that would have been required to make the links immediately stable.)


The W3C XML Protocol Working Group has published the first public working draft of SOAP Resource Representation Header. In brief this proposes encoding resources such as JPEG images or other XML documents in a SOAP header. The spec provides this example:

<soap:Envelope xmlns:soap='http://www.w3.org/2002/12/soap-envelope' 
               xmlns:rep='http://www.w3.org/2004/02/representation' 
               xmlns:xmime='@@@@'>
  <soap:Header>
    <rep:Representation resource='http://example.org/me.png'>
      <rep:Data xmime:media-type='image/png'>
        /aWKKapGGyQ=
      </rep:Data>
    </rep:Representation>
  </soap:Header>
  <soap:Body>
    <x:MyData xmlns:x='http://example.org/mystuff'>
      <x:name>John Q. Public</x:name>
      <x:img src='http://example.org/me.png'/>
    </x:MyData>
  </soap:Body>
</soap:Envelope>

A processor that was decoding the document could load the PNG image referenced by the img element by decoding the header rather than making a second trip to the server. There might be a reason to do this, though I'm very nervous that as soon as I say that people are going to start suggesting that we change the APIs like DOM and SAX to not provide the real XML. And then they'll want to stop shipping around real XML, and instead send the binary data itself, because, hey, that's what everyone's going to use anyway. I say this because I've seen every bit of this before. The relatively reasonable XOP proposal, which achieves essentially the same goals but by bundling everything in a MIME envelope rather than an XML document, is now proposing exactly this. This may be the first step down a very slippery slope that leads right over a cliff; and at the bottom of the cliff XML will be shattered into a confusing mess of uninteroperable, inefficient, vendor-locked-in, patented, DRM-encumbered binary data.
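For the curious, the header lookup described at the start of the previous paragraph might look something like the sketch below. It is only an illustration built on the spec's own example: the function name is hypothetical, it assumes the rep:Data content is Base64 (which is how the example reads), and the eight decoded bytes are obviously a placeholder rather than a real PNG.

```python
import base64
import xml.etree.ElementTree as ET

SOAP = "http://www.w3.org/2002/12/soap-envelope"
REP = "http://www.w3.org/2004/02/representation"

# The example from the working draft, including its '@@@@' placeholder namespace.
ENVELOPE = """<soap:Envelope xmlns:soap='http://www.w3.org/2002/12/soap-envelope'
               xmlns:rep='http://www.w3.org/2004/02/representation'
               xmlns:xmime='@@@@'>
  <soap:Header>
    <rep:Representation resource='http://example.org/me.png'>
      <rep:Data xmime:media-type='image/png'>
        /aWKKapGGyQ=
      </rep:Data>
    </rep:Representation>
  </soap:Header>
  <soap:Body>
    <x:MyData xmlns:x='http://example.org/mystuff'>
      <x:name>John Q. Public</x:name>
      <x:img src='http://example.org/me.png'/>
    </x:MyData>
  </soap:Body>
</soap:Envelope>"""


def cached_representations(envelope_xml):
    """Map each resource URI in the SOAP header to its decoded bytes."""
    root = ET.fromstring(envelope_xml)
    cache = {}
    for rep in root.iterfind(f"{{{SOAP}}}Header/{{{REP}}}Representation"):
        data = rep.find(f"{{{REP}}}Data")
        if data is not None and data.text:
            # Assumption: the element content is Base64 text
            cache[rep.get("resource")] = base64.b64decode(data.text.strip())
    return cache


cache = cached_representations(ENVELOPE)
# A client resolving x:img's src would check the cache before hitting the network.
print(len(cache["http://example.org/me.png"]), "bytes found in the header")
```

Note that the application still sees ordinary XML here; the decoding happens above the parser, which is exactly where I'd like it to stay.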

Personally, I wonder if bundling everything up in a zip file wouldn't be architecturally cleaner, not to mention smaller. The big issue with that approach is that resolving the URLs (especially absolute URLs) becomes tricky, and there's no convenient place to store the URLs of the cached resources. But perhaps we could do this with a manifest file as in Java's JAR archive, which is really just a zip file anyway?

Tuesday, May 4, 2004

The W3C Web Services Choreography Working Group has posted the first public working draft of Web Services Choreography Description Language Version 1.0. According to the abstract,

The Web Services Choreography Description Language (WS-CDL) is an XML-based language that describes peer-to-peer collaborations of Web Services participants by defining, from a global viewpoint, their common and complementary observable behavior; where ordered message exchanges result in accomplishing a common business goal.

The Web Services specifications offer a communication bridge between the heterogeneous computational environments used to develop and host applications. The future of E-Business applications requires the ability to perform long-lived, peer-to-peer collaborations between the participating services, within or across the trusted domains of an organization.

The Web Services Choreography specification is targeted for composing interoperable peer-to-peer collaborations between any type of Web Service participant regardless of the supporting platform or programming model used by the implementation of the hosting environment.


The W3C Privacy Activity has posted the second public working draft of the Platform for Privacy Preferences 1.1 (P3P1.1) Specification. "P3P 1.1 is based on the P3P 1.0 Recommendation and adds some features using the P3P 1.0 Extension mechanism. It also contains a new binding mechanism that can be used to bind policies for XML Applications beyond HTTP transactions." New features in P3P 1.1 include a mechanism to name and group statements together so user agents can organize the summary display of those policies and a generic means of binding P3P Policies to arbitrary XML to support XForms, WSDL, and other XML applications.


The W3C Voice Browser Working Group has published the first last call working draft of Voice Browser Call Control: CCXML Version 1.0. According to the spec abstract, "CCXML is designed to provide telephony call control support for VoiceXML [VOICEXML] or other dialog systems. CCXML has been designed to complement and integrate with a VoiceXML interpreter. Because of this there are many references to VoiceXML's capabilities and limitations. There are also details on how VoiceXML and CCXML can be integrated. However it should be noted that the two languages are separate and are not required in an implementation of either language. For example CCXML could be integrated with a more traditional Interactive Voice Response (IVR) system and VoiceXML or other dialog systems could be integrated with some other call control systems."


YesLogic has released Prince 3.1, a $295 payware batch formatter for Linux and Windows that produces PDF and PostScript from XML documents with CSS stylesheets. Version 3.1 supports SVG shapes and CSS properties including vertical-align. There's also a demo version that stamps pages with a YesLogic link.

Monday, May 3, 2004

Version 4.0 of the payware <Oxygen/> XML editor has been released. Oxygen supports XML, XSL, DTDs, and the W3C XML Schema Language. New features in version 4.0 include an XSLT debugger and XInclude support. Oxygen requires Java 1.3 or later. It costs $74. Upgrades from previous versions are $32.

Saturday, May 1, 2004

RenderX has released XEP.NET 3.7.8, a payware XSL Formatting Objects to PDF and PostScript converter. As near as I can tell, this is the same basic product as XEP except that it's written in .NET instead of Java. The basic client is $299.95. The developer edition with an API is $999.95. The server version is $4999.95.

Friday, April 30, 2004

The Gnome Project has posted version 2.6.9 of libxml2, the open source XML C library for Gnome. This release adds support for xml:id and fixes various bugs. They've also released version 1.1.6 of libxslt, the GNOME XSLT library for C and C++. This is also a bug fix release.


RenderX has released version 3.7.7 of XEP, its payware XSL Formatting Objects to PDF and PostScript converter. Version 3.7.7 is a bug fix release. The basic client is $299.95. The developer edition with an API is $999.95. The server version is $4999.95. Updates from 3.0 are free.

Thursday, April 29, 2004

I have posted the SAX test suite I talked about last week at XML Europe. You should read the paper that describes the suite (and presents the results) first before attempting to run the suite, so you'll understand what's being tested and why. This is primarily intended for developers writing SAX parsers, not for casual users. The framework is fairly rough at this point, and may require my personal help to get running.

The main result of the test is that you should be using Xerces. Xerces isn't perfect, but it is the best SAX parser currently available, and several of the bugs I identified are already fixed in CVS, and should be available with the upcoming 2.7 release.

One surprising result of the test was just how poor the selection of SAX parsers in Java really is. There are only two actively maintained parsers, Xerces and Oracle; maybe three if you count GNU JAXP. However, GNU JAXP has serious bugs that aren't being fixed, and seem unlikely to be fixed in the near future unless a motivated new maintainer is found. Both Oracle and Xerces are heavyweight parsers which support DTD and schema validation, and many other features. They are both quite large. In particular, there is no current reliable small parser that simply parses XML documents without validating, either with or without external DTD subset support. A lot of people need such a parser, and many people who do are using either Piccolo or an Ælfred derivative. However, my tests made it very obvious that Piccolo and all the Ælfreds are seriously buggy, and should not be relied on. Furthermore, none of these are actively developed at this time, so it's unlikely any of the bugs the tests identified will be fixed. :-(


The Big Faceless Organization has released the Big Faceless Report Generator 1.1.18, a $1200 payware Java application for converting XML documents to PDF. Unlike most similar tools it appears to be based on HTML and CSS rather than XSL Formatting Objects. This is mostly a bug fix release. Java 1.2 or later is required.


IBM's alphaworks has released version 3.1 of its Web Services Tool Kit for Mobile Devices which "provides tools and run-time environments that allow development of applications that use Web Services on small mobile devices, gateway devices, and intelligent controllers. This tool kit's JavaTM Web service run-time environment is supported devices that support the J2ME, WCE, and SMF environments. The C Web service run-time environment is supported on the Palm and Symbian." This release now "includes new IBM Mobile Soap Server and Mobile Device Gateway code" and "provides an implementation of the Device Web Services Framework."

Wednesday, April 28, 2004

David Megginson has released SAX 2.0.2. "This mini-release is intended mainly to add support for XML 1.1 and Namespaces 1.1, including Unicode normalization. The changes are all in features, properties, and JavaDoc, not in the actual class and method signatures; since the new features and properties are optional, existing libraries should continue to work with the new release." New features in this release include:

  • The Attributes2 interface exposes which attributes were specified in the source text, rather than defaulted through the DTD. The http://xml.org/sax/features/use-attributes2 feature is true if the parser passes Attributes2 objects to startElement().
  • DefaultHandler2 extends org.xml.sax.helpers.DefaultHandler and additionally implements DeclHandler, LexicalHandler, and EntityResolver2 with do-nothing methods.
  • EntityResolver2 can provide external DTD subsets for documents that don't have one and provides a resolveEntity() method with more parameters, which lets you cope with situations where the base URI of the document and the entity declaration have become mismatched. The http://xml.org/sax/features/use-entity-resolver2 feature determines whether the parser will use EntityResolver2 methods where appropriate.
  • The Locator2 interface exposes the version (1.0 or 1.1) and encoding of the current entity. If it's unclear to you what this has to do with location information, you're not alone. The http://xml.org/sax/features/use-locator2 feature is true if the Locator object passed to ContentHandler.setDocumentLocator() supports the new Locator2 interface.
  • The read-only http://xml.org/sax/features/is-standalone feature is true if and only if the document is declared to be standalone.
  • The http://xml.org/sax/features/resolve-dtd-uris feature can be set to false to prevent DTDHandler and DeclHandler from absolutizing system IDs.
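Since the new features are optional, client code should probe for them rather than assume they're present. The sketch below shows that probing pattern using the SAX binding in Python's standard library, purely as an illustration. The helper name is hypothetical, and I'd expect the bundled expat-based reader to report the SAX 2.0.2 feature URIs as unrecognized, since it is not a SAX 2.0.2 implementation. In Java the equivalent is a try/catch around XMLReader.getFeature().

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException

FEATURES = [
    "http://xml.org/sax/features/namespaces",
    "http://xml.org/sax/features/use-attributes2",
    "http://xml.org/sax/features/use-entity-resolver2",
    "http://xml.org/sax/features/use-locator2",
    "http://xml.org/sax/features/is-standalone",
    "http://xml.org/sax/features/resolve-dtd-uris",
]


def probe_features(reader, names):
    """Return {feature URI: True/False, or None if the reader doesn't know it}."""
    report = {}
    for name in names:
        try:
            report[name] = bool(reader.getFeature(name))
        except (SAXNotRecognizedException, SAXNotSupportedException):
            report[name] = None  # optional feature this parser lacks
    return report


reader = xml.sax.make_parser()
for name, state in probe_features(reader, FEATURES).items():
    print(name, "->", "unrecognized" if state is None else state)
```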

This should all be bundled with Xerces-J 2.7 in the near future, and Java 1.5 a little further out.

Tuesday, April 27, 2004

Syntext has released Serna 1.5, a $299 payware XSL-based WYSIWYG XML Document Editor for Mac OS X, Windows, and Unix. Features include on-the-fly XSL-driven XML rendering and transformation, on-the-fly XML Schema validation, and spell checking. Version 1.5 adds support for XML Catalogs, Docbook 4.3, and DITA 1.3 (whatever that is).

Monday, April 26, 2004

The Mozilla Project has posted the first release candidate of Mozilla 1.7, an open source web browser, chat and e-mail client that supports XML, CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness. It's available for Linux, Mac OS X, and Windows. Version 1.7 improves popup blocking, lets users review and open blocked popups, supports multiple identities in the same email account, provides a "show passwords" mode that displays saved passwords, and makes various other small improvements. It also has some small but significant performance optimizations.


The XML Apache Project has released version 1.1 of XML Security, Java version. This release fixes a few bugs in digital signatures and adds beta support for XML encryption.


Tim Bray has posted beta 4 of Genx, his pure C library for outputting canonical XML. Bray tells me this version now passes my canonical XML test suite. Genx is published under the expat license. This is a very liberal, non-viral but GPL-compatible license.

Sunday, April 25, 2004

I've returned safely from Amsterdam. I'll be playing catch-up over the next few days. Updates will remain slow for a while though while I catch up with various work that was sidelined while I was away, including reviewing the latest SAX pre-release, packaging up the SAX conformance test suite for release, and finishing the next editions of a couple of books.


The XML Apache Project has released XMLBeans 1.0.2, one of many XML data binding frameworks for Java. This one is based on the W3C XML Schema Language and also provides access to the full underlying XML Infoset through an XML Cursor API. It's not immediately clear what's changed in 1.0.2. One assumes this is a bug fix release.

Wednesday, April 21, 2004

Day 3 of XML Europe begins with Ken Holman talking about "Writing Formatting Specifications for XML Documents: A UBL Case Study." There are about 50 people in the room, a good turnout for 9:00 A.M. UBL is the Universal Business Language. This is based on what happened in the UBL Forms subcommittee. UBL is the payload for EBXML. EBXML says how things go around, not what goes around. The UBL 1.0 release package is being assembled this week.

Jon Bosak asked Ken to write some stylesheets for UBL 1.0, but the group didn't really know what they wanted the result documents to contain or look like. So Ken formed a committee to design standard forms presentations. These are not based on XSL-FO, but they do use XPath to identify the information in the documents to be displayed in the visual representation. Other technologies such as PDF or PostScript could be used instead.

Ken seems to believe the visual representation needs to be standardized, but this denies the value of XML. The normative document becomes the printout, not the XML.


I just noticed something weird: I've been at this conference for more than two days now and I haven't yet heard anyone say the words "architectural forms." Times have changed. I've heard a lot of people saying the words "RDF", "OWL", and "topic maps". I still don't really know what those words mean myself, but people are still saying them.


Second talk of the morning is Klaas Bals, CTO of Inventive Designers, the developer of Scriptura, an XSL-FO rendering engine, on "Using XSL-FO 1.1 for Business Type Documents." He prefers XSLT pull (xsl:for-each, xsl:value-of) to XSLT push (xsl:apply-templates) for business type documents, because in a business type document such as an invoice, the layout drives where the content is placed. In forms there are lots of absolute positions and limited, if any, flowing of content.

XSL-FO 1.1 is a working draft that will change in the future.

The absolute-position property can only be applied to fo:block-container!?!

BarcodeML is an XML application for bar codes that can be generated easily by XSLT. He suggests developing a ChartML for charts that would make it easier to generate charts from XSLT, and which could be processed by special purpose processors as MathML and SVG are processed by MathML and SVG renderers today. It's an interesting idea. It would certainly be easier than generating the SVG for a chart directly from the XML data using XSLT. You perhaps could implement a renderer on top of JFreeChart, and similar libraries. I wish I had time to work on it. It might make a nice paper for XML 2004 in Philly in December, but I doubt I could do it in time for the deadline. Hmm, perhaps there's something like this built into OpenOffice or Excel? Do those products' XML formats use a special charting vocabulary or just a generic graphics vocabulary? I should check. He says Chrysalis has also begun work on a charting XML application.


Back to room D and the technical track for the final two sessions of the show. First Alex Brown is talking about "Refactoring XML." He wants to refactor XML itself, the technology, not refactor XML instance documents. Oh god, he wants to talk about elements vs. attributes, again! Hasn't everything there is to say about this already been said? He thinks the question (which to use when) indicates a "bad smell" in XML. I disagree.

DocHeads (developers who work with narrative documents) work round XML by augmentation. DataHeads (developers who work with record-like documents) work around XML by reduction. E.g. SOAP forbidding processing instructions. Interesting point. It seems reasonable, and I hadn't thought of the split that way before.

He suggests that XML was not intended for human consumption based on the XML design goals, specifically "Terseness in XML markup is of minimal importance." "XML is verbose by design", and "XML is text, but isn't meant to be read." I disagree. First of all he's confusing consumption with production. Secondly, I do not think lack of terseness is a problem for humans. He ignored Item 3, "XML documents should be human-legible and reasonably clear," and I can't find two of his other quotes in the spec. OK, they're from XML in 10 Points, not the spec. And the idea that XML isn't meant to be read is almost 180° wrong. Liam Quin appears to agree with me. He makes reference to Terry Pratchett's "Lies to Children", and suggests these goals are examples of such "Lies to Children." See the Science of Discworld (Great book by the way. Last I checked it wasn't available in the States. You can order it from Amazon UK.)

He proposes to leave out everything from XML except tags and text: no processing instructions, DTDs, attributes, comments, etc. He wants to derive it from SGML rather than XML. He claims this will control proliferation of ad hoc XML subsets, and terminate permathreads on other formats. Then new lexical layers (non-angle brackets, short tags, SDATA entities, etc.) can be built on top of this. However, this is only for DataHeads, not DocHeads. The main benefit is simpler parser implementation (in other words a non-conformant parser that doesn't support real XML). Google Rick Jelliffe on Goldilocks and SML for a rebuttal.

This is so wrong on so many levels, I don't know where to begin. I think this talk gets the booby prize for the single worst idea of the conference. Henry Thompson sums up, "Sorry, but no. I came with an open mind, but I'm not convinced."

Interesting historical point, according to Thompson, "Tim" (Berners-Lee? or Bray?) directly and personally rejected the original (processing instructions?) namespaces proposal, and Eliot Kimber walked out of the working group as a result. But in hindsight, Thompson thinks Tim was right. He didn't at the time.


Next Mark Birbeck is giving a late-breaking news presentation about XHTML and RDF (co-authored with Steven Pemberton). The goal is a new syntax for RDF and XHTML that would allow the two of them to integrate better.

  • HTML meta element's name attribute becomes the property attribute, which may appear on any element:

    <span property="dc:date">January 24, 2003</span>

  • HTML href attribute becomes the resource attribute, which can appear on any element and makes it a clickable link

    The <span resource="http://www.davidbeckham.com">England captain</span> had his hair cut

    The right choice of URIs is necessary to make this reliable metadata. The taxonomies are based on URIs.
  • Elements (not just head) can contain link and meta child elements to identify metadata about the element.
  • Add a content attribute. For instance this tells you which England captain (football or rugby) is referred to:

    The <span content="David Beckham">England captain</span> had his hair cut

  • Adds a datatype attribute:

    <span datatype="xsd:date">2003-01-24</span>

According to Pemberton, this is a snapshot of an unfinished work in progress. The XML limitation of one ID per element is apparently a problem that remains to be solved.
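To get a feel for what a consumer of this markup would do, here's a rough sketch that walks a fragment and collects the four attributes mentioned in the talk. Everything beyond the attribute names is my own assumption: the sample fragment combines attributes on one element in ways the slides did not show, the namespace URIs are the usual Dublin Core and XML Schema ones, and the function name is made up. Since the syntax is unfinished, treat it as illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment assembled from the examples in the talk
SAMPLE = """<p xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <span property="dc:date" datatype="xsd:date">2003-01-24</span>
  The <span resource="http://www.davidbeckham.com"
            content="David Beckham">England captain</span>
  had his hair cut
</p>"""

METADATA_ATTRS = ("property", "resource", "content", "datatype")


def collect_metadata(xml_text):
    """Return (element text, {attribute: value}) for each annotated element."""
    found = []
    for element in ET.fromstring(xml_text).iter():
        attrs = {
            name: element.get(name)
            for name in METADATA_ATTRS
            if element.get(name) is not None
        }
        if attrs:
            found.append(("".join(element.itertext()).strip(), attrs))
    return found


for text, attrs in collect_metadata(SAMPLE):
    print(repr(text), attrs)
```

Notice that the values dc:date and xsd:date come back as plain strings. Resolving those prefixes requires the namespace declarations that were in scope, which a generic tree API does not keep attached to attribute values.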

Overall, this seems interesting and it might be helpful, but it really doesn't do anything about the fundamental problem of getting content publishers to provide accurate, useful metadata. Maybe that's too harsh. This syntax would make adding metadata easier, which might expand its use somewhat. This syntax is a lot easier to stomach than traditional RDF syntax. Liam Quin points out that this has the problem of QNames in element content, which makes cut and paste fragile because you can lose the namespace declarations. Henry Thompson doesn't think this is such a big problem. Liam also notes validating these XHTML documents is a problem, but I don't see that. (Then again, I don't really care if my XHTML validates. This page doesn't.)


Conference chair Edd Dumbill is giving the closing keynote address on "The State of XML." "The state of XML is pretty good." He's been told Microsoft writes their schemas in RELAX NG and then translates them into W3C XML Schema Language. He's worried about a lot of the web services specs being developed outside the W3C (shows the Feigenbaum? Swale?, period doubling diagram) and suspects we're heading for a train wreck. He likes REST and document oriented web services. He's optimistic about XForms. "These days we all need to be librarians." He believes we need standard schemas and taxonomies to achieve interoperability. (I disagree.) Mobile devices (PDAs, etc.) will drive adoption of XHTML. He foresees more regulations governing the Web as it becomes more and more important to our daily lives. The general buzz of the conference is the human readability and editability of XML. Over 80% of attendees (at a previous conference) used a text editor to edit their XML. "A successful document type is a readable document type." Microsoft is starting to get this. Illegible XML is a problem for RDF. XML syntax may not be right for all applications (RELAX NG, RDF). He wants us to be inspired about the state of XML. His speech will be posted on xml.com tomorrow.

Marion Elledge introduces Edd Dumbill

The conference didn't provide us with a CD or a printed copy of the proceedings. These should be posted on the conference web site soon, if they're not there already. I'll upload my own paper here on Cafe con Leche on Monday when I get back to the States.

Tuesday, April 20, 2004

Memo to self: I really need to update the scripts used to edit this site so they use a real XML parser instead of regular expressions, which don't even work on my PowerBook anyway. The only excuse I have for this bogosity is that these scripts predate XML by a year or two.


Second memo to self: I have to figure out whether it's BBEdit on Mac OS X, rsync, or something else that keeps corrupting all my UTF-8 files when I move them from the Linux box to the Powerbook.


Day 2 of XML Europe. More stream of consciousness notes from the show, though probably fewer today since I also have to prepare for and deliver my own talk on SAX Conformance Testing. I'll put the notes, paper, and software for that up here next week when I return to the U.S., and have the time to discuss it on various mailing lists.


Memo to conference organizers: open wireless access at the conference is a must in 2004. If the venue won't allow this, find another venue!

Memo to conference attendees: ask the conference if they provide open wireless access. If the conference doesn't, find another conference!

Having wireless access radically changes the experience at the conference. It enables many things (besides net surfing in the boring talks). Live note taking and Rendezvous enable the audience to communicate with each other and comment on the talks in real time without disturbing others. When you're curious about a speaker's point, it's easy to Google it. Providing wireless access makes the sessions much more interactive.


The morning began with a session entitled, "Topic Maps Are Emerging. Why Should I Care?" Unfortunately the question in the title wasn't really answered in the session. I've been hearing about topic maps for years, and have yet to see what they (or RDF, or OWL, or other similar technologies) actually accomplish. What application is easier to write with topic maps than without? What problem does this stuff actually solve? All I really want to hear is one or two clear, specific examples and use cases. So far I haven't seen one.


Next Alexander Peshkov is talking about a RELAX NG schema for XSL FO.


After some technical glitches, Uche Ogbuji is talking about XML good practices and antipatterns in a talk entitled "XML Design Principles for Form and Function". Subjects include (I love these names)

  • "Boochist markup": written by somebody who'd rather be writing C++ than XML
  • "Tool Chic markup": never intended to be read by a human (e.g. WSDL)
  • "Jeweler's markup": every single thing is marked up
  • "Carpet tag bombs": nothing is marked up. e.g. putting an entire Java class in one java element
  • "Punched-card markup"
  • "Namespace race"
Uche Ogbuji at XML Europe 2004

He doesn't like "hump case" (camel case).

Using attributes to qualify other attributes is a big No-No. If you're doing this, you're swimming upstream. You should switch to elements.

Envelope elements (company contains employees contains employee; library contains books contains book) make processing easier, but not always. Use them only if they really represent something in the problem domain, not just to make life easier for the processing tools.

Don't overuse namespaces. He (unlike Tim Bray) likes URNs for namespaces, mostly to avoid accidental dereferencing. He also suggests RDDL. He suggests "namespace normal form": Declare all namespaces at the top of the document. Do not declare two different prefixes for the same namespace.
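Both of those rules are mechanical enough to check automatically. Here's a minimal sketch of such a check; the function name and the message wording are mine, not anything from the talk, and it only covers the two rules noted above.

```python
import io
import xml.etree.ElementTree as ET


def namespace_normal_form_problems(xml_text):
    """List violations of the two rules: declare at the top, one prefix per URI."""
    problems = []
    prefixes_by_uri = {}
    elements_started = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            # Declarations on the root arrive before the first "start" event
            if elements_started > 0:
                problems.append(f"prefix {prefix!r} is declared below the root")
            prefixes_by_uri.setdefault(uri, set()).add(prefix)
        else:
            elements_started += 1
    for uri, prefixes in sorted(prefixes_by_uri.items()):
        if len(prefixes) > 1:
            problems.append(f"{uri} is bound to {sorted(prefixes)}")
    return problems


GOOD = '<a:r xmlns:a="urn:x-example:a"><a:c/></a:r>'
BAD = '<r xmlns:a="urn:x-example:a"><c xmlns:b="urn:x-example:a"/></r>'
print(namespace_normal_form_problems(GOOD))
print(namespace_normal_form_problems(BAD))
```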

A very good talk. I look forward to reading the paper. FYI, he's a wonderful speaker; probably the best I've heard here yet. (Steven Pemberton and Chris Lilley were pretty good too.) Someone remind me to invite them to SD next year.

Componentize XML. Avoid large (gigabyte) documents.

Be wary of reflex use of data typing. Pre-packaged data types often don't fit your problem.

"Enforce well-formedness checks at every application boundary."

Forget "Binary XML." Use gzip. "The idea of binary XML flies in the face of all the concepts that make XML work."
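The gzip alternative takes only a few lines to try. This is just a sketch with a made-up, highly repetitive document, so the sizes it prints say nothing about real-world data. The point is that the compressed form round-trips to exactly the same text, so every XML tool still works on the other end.

```python
import gzip

# Hypothetical record-like document; real documents will compress differently
books = "".join(
    f"<book><title>Title {i}</title><author>Author {i}</author></book>"
    for i in range(1000)
)
document = f"<library>{books}</library>".encode("utf-8")

packed = gzip.compress(document)

# The receiver gets back the identical bytes of real, parseable XML
assert gzip.decompress(packed) == document
print(len(document), "bytes as XML,", len(packed), "bytes gzipped")
```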

The acetaminophen paracetamol acid test for markup vocabularies: Show a sample document to a typical XML-aware but non-expert user. Does it give them a headache?


Next up is Brandon Jockman of Innodata Isogen on "Test-Driven XML Development". Hmm, the A/V equipment in this room seems to be giving everyone fits today. It worked well yesterday. This does not bode well for my presentation this afternoon.

One thing I'm noting in this and several of the other talks is that in a mere 45-minute session the traditional tripartite outline structure (tell your audience what you're going to tell them, tell them, and then tell them what you told them) doesn't really work. There's not enough time to do it, nor is the talk long enough that it's necessary. At most summarize the talk in one sentence, not even an entire slide. In fact the title of the talk (if it isn't too cute) is often a sufficient summary.

"XSLT gives you a really big hammer to hit yourself with." He suggests using Eric van der Vlist's XSLTunit for writing XSLT that tests XSLT. He also recommends XMLUnit for the .NET folks. I should look at this to see if there are any good ideas here I can borrow for XOM's XOMTestCase class.


Mark Scardina, "owner" of Oracle's XML Developer Kit, is talking about "High Performance XML Data Retrieval"

Mark Scardina, Oracle XML Product Manager

XPath is the preferred query language, apparently because of its broad support in different standards like DOM, XSLT, and XQuery.

The DOM Working Group is finished and will not be rechartered. DOM Level 3 XPath is limited to XPath 1.0. Multiple XPath queries require multiple tree traversals (at least in a naive, non-caching implementation -ERH).

High performance requirements include managed memory resources, even for very large (gigabyte) documents. This requires streaming, but SAX/StAX aren't good fits. Also need to handle multiple XPaths (i.e. XPath location paths) with minimum node traversals. Knowing the XPath in advance helps. Will not handle situation where everything is dynamic. This must support both DTDs and schemas (and documents with neither).

These requirements led to "Extractor for XPath." This is based on SAX, for streaming and multicasting support. First you need to register the XPaths and handlers. This absolutizes the XPaths. Then Extractor compiles XPaths. This requires determining whether or not the XPath is streamable. Can reject non-streamable XPaths. It also builds a predicate table and an index tree.

"XPath Tracking" maintains XPath state and matches document XPaths with the indexed XPaths. XPath is implemented as a state machine implemented via a stack. It uses fake nodes to handle /*/ and //. Output sends matching XPaths along with document content. Henry Thompson seems skeptical of the performance of the state machine. He thinks a "bottom-up parser" might be much faster. I really don't understand this. I'm just copying Scardina's notes.
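
For what it's worth, the general idea of registering paths up front and tracking the current position with a stack during a SAX parse can be sketched in a few lines of Python. This is emphatically not Oracle's Extractor API: the class and function names are hypothetical, it handles only absolute child-step paths with `*` wildcards, and it simply rejects anything else (including `//` and predicates) as non-streamable.

```python
import xml.sax


class PathExtractor(xml.sax.ContentHandler):
    """Match pre-registered absolute paths against a SAX event stream."""

    def __init__(self):
        super().__init__()
        self.registered = []   # list of (steps, callback)
        self.stack = []        # names of the currently open elements

    def register(self, path, callback):
        # Only simple absolute paths are treated as streamable in this sketch.
        if not path.startswith("/") or "//" in path or "[" in path:
            raise ValueError(f"not streamable in this sketch: {path}")
        self.registered.append((path.strip("/").split("/"), callback))

    def startElement(self, name, attrs):
        self.stack.append(name)
        for steps, callback in self.registered:
            if len(steps) == len(self.stack) and all(
                    step in ("*", actual)
                    for step, actual in zip(steps, self.stack)):
                callback("/" + "/".join(self.stack), dict(attrs))

    def endElement(self, name):
        self.stack.pop()


def extract(xml_text, paths):
    """Return (registered path, matched document path) pairs in document order."""
    hits = []
    handler = PathExtractor()
    for path in paths:
        handler.register(
            path, lambda matched, attrs, path=path: hits.append((path, matched)))
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return hits


print(extract('<library><book id="1"/><cd/><book id="2"/></library>',
              ["/library/book", "/*/cd"]))
```

All registered paths are checked in a single pass over the document, which is the point of registering them in advance.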


I ran all the way across the convention hall carrying my sleeping laptop, something which I hate to do, (Has anyone noticed that age is directly correlated to the care one takes of computer equipment? I am amazed at how cavalierly the students at Polytechnic treat their laptops. I suspect it involves both the cost and fragility of computers when one first learned to use them. At the rate we're going, children born this year will be playing hacky-sack with their laptops in the school yard.) to catch Sebastian Rahtz talking about "A Unified Model for Text Markup: TEI, DocBook, and Beyond." The "Beyond" part includes other formats like HTML and MathML. The main purpose of this seems to be to allow DocBook to be used in TEI and vice versa, for elements that one has that the other has no real equivalent; e.g. a DocBook guimenu element in a mainly TEI document. This is done with RELAX NG schemas. He recommends David Tolpin's RNV parser and James Clark's emacs mode for XML.

Sebastian Rahtz lecturing on DocBook and TEI

That's it for today. I'm going to wander into the park behind the convention center to see if it looks like a good site for some birding. Come back tomorrow for updates from the final day of the show.

Monday, April 19, 2004

I've arrived at XML Europe. I'm reporting this conference in chronological order (earliest item on top) as opposed to my usual archaeological order (most recent item on top) so if you're coming back to read this, scroll to the bottom to see if I've added anything.


Quick head count at the first keynote shows about 120 people here. I'd love to provide live updates from the conference, but the wireless network seems to be password protected. :-( Also the keynote hall is notably lacking in power plugs. On the other hand the chair in front of you can be folded down to form a very nice desk.


The first keynote is about Amazon Web Services by Amazon's Jeff Barr. I've been meaning to use this for some time to finally update the books pages here on Cafe con Leche and Cafe au Lait, but time is limited as always. He actually defines web services as any programmatic (as opposed to human) access to a web server so it includes REST approaches as well as SOAP. The big difference he sees (I'm not sure this is accurate) is that REST is preferred by weakly typed scripters (Python, XSLT) whereas SOAP is preferred by strongly typed programmers (Java, C). Interesting statistic from this talk: Amazon provides both SOAP and REST interfaces to their data. About 80% of the calls come through REST, 20% through SOAP. He expected the opposite. BEEP, WSDL, etc. seem unnecessary for aggregation of web services. The developers he sees are doing just fine without them.


Next up is Stephen Pemberton of the CWI and chair of the W3C HTML and Forms working group. He's talking about notations in a generic sense, not specifically XML NOTATION type attributes. Examples include two-letter U.S. state abbreviations such as NY and FL. He suggests a better algorithm for this, but I don't think it would actually work. I see several possible conflicts. As he says, "I'm English. I just live in Holland." He recommends reading "The Goldilocks Theories" in Tog on Interface. People writing with WYSIWYG editors produce higher quality text than people typing in text editors (he says, as I type this in BBEdit). Pen and paper is higher quality still. Very interesting picture that demonstrates if you buy a new computer every 18 months or more, your current computer is more powerful than the sum of all computers you have owned previously. "The only thing my computer has all those extra cycles for is to make it act more like a television...so why are we devising notations to make life easier for the computer?" I suggest that we're not so much making it easier for computers as for programmers. Software development, programmer skill, algorithms, etc. don't follow Moore's law. Hmm, seems Pemberton agrees with me. 90% of the cost of developing software is debugging according to the DoD. A program that's 10 times longer is 31 times harder to write according to Brooks of Mythical Man-Month fame. Therefore we should write programming languages to make life easier for the programmer rather than the computer. This was the goal of ABC, Python etc. What is Lambert Meertens working on? An order of magnitude improvement on Python/ABC? He's complaining about the difficulty of authoring XML (and XHTML), but he's exaggerating the problem by assuming validity, XML declaration, namespaces, etc. are required. I think he's also overestimating the ease of writing unmarked up text that can be processed by a computer. I don't think computers are really going to be able to parse real unmarked up text until and unless we have real AI. I think it's easier to write explicitly marked up text than implicitly marked up text.
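
Both numerical claims in the keynote are easy to check. The sketch below assumes computing power doubles between purchases (the 18-month Moore's law reading) and assumes effort grows as size to the power 1.5; that exponent is my assumption, picked because it reproduces the quoted factor of 31, and the function names are made up.

```python
def newest_beats_all_previous(purchases, growth_per_purchase=2.0):
    """True if each new machine outpowers all earlier ones combined."""
    powers = [growth_per_purchase ** i for i in range(purchases)]
    return all(powers[i] > sum(powers[:i]) for i in range(1, purchases))


def relative_effort(size_ratio, exponent=1.5):
    """Effort multiplier for a program size_ratio times longer (assumed power law)."""
    return size_ratio ** exponent


print(newest_beats_all_previous(20))        # doubling between purchases
print(newest_beats_all_previous(20, 1.5))   # slower growth: the claim fails
print(round(relative_effort(10), 1))
```

With doubling, machine n has power 2^n while all its predecessors sum to 2^n - 1, which is why waiting 18 months or longer between purchases always satisfies the claim.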


Chris Lilley of the W3C is talking about Architectural Principles of the World Wide Web. This is the first breakout session. Good crowd, about 50 people. According to Lilley, the TAG is only responsible for documenting the web architecture as it exists, not designing an architecture.

The first principle is that orthogonality of specifications is good. I agree. XML is harmed by excessive reliance on Unicode and URLs. Big digression in the audience over why "orthogonal" is or is not the right word for this principle, but everyone agrees with the principle.

2nd principle: "Silent recovery from error is harmful." Does Opera error correct XML? Claim is made in audience. Some disagreement in audience with this principle.

Principle 3: URIs (as redefined in RFC 2396bis). Open question whether or not IRIs can only be written using Unicode Normalization Form C. Check the spec.

Principle 4: URIs are compared character by character.
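
A tiny Python illustration of what character-by-character comparison implies. The example.com URIs are made up; the point is only that no case folding or percent-decoding happens before the comparison.

```python
def same_uri(a, b):
    """Compare two URIs character by character, with no normalization."""
    return a == b


print(same_uri("http://www.w3.org/2001/XInclude",
               "http://www.w3.org/2001/XInclude"))   # True
print(same_uri("http://www.example.com/ns",
               "http://WWW.EXAMPLE.COM/ns"))         # False: case differs
print(same_uri("http://www.example.com/%7Euser",
               "http://www.example.com/~user"))      # False: escaping differs
```

The last two pairs would most likely retrieve the same resource, yet they count as different identifiers under this rule.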

Principle 5: Avoid unnecessary new URI schemes. "Making up stupid things like itunes that are exactly the same as http except they mean use my software instead of a web browser is a bad idea." Ditto for subscribe in RSS.

Principle 6: "User agents must not silently ignore authoritative server metadata."

Principle 7: Safe interactions. GET is safe (does not incur obligations). POST may not be. Big issue with GET is character encoding in query strings. This breaks search engines in countries with less ASCII-like character sets.

Principle 8: Text vs. binary. Lilley likes text. TAG finding summarizes the issue.
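
The query string problem is easy to reproduce. In this Python sketch (the search term and function name are just examples) the same word is percent-encoded differently depending on which character set the client assumed, and a server that guesses wrong gets garbage.

```python
from urllib.parse import quote, unquote


def encode_query_value(text, charset):
    """Percent-encode a query value using the given character encoding."""
    return quote(text, safe="", encoding=charset)


term = "café"
as_utf8 = encode_query_value(term, "utf-8")         # caf%C3%A9
as_latin1 = encode_query_value(term, "iso-8859-1")  # caf%E9

# A server that assumes Latin-1 but receives UTF-8 bytes misreads the term.
misread = unquote(as_utf8, encoding="iso-8859-1")

print(as_utf8, as_latin1, misread)
```

Nothing in the URI itself says which encoding was used, so the two ends have to agree by some other means.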

Principle 9: Extensibility and Versioning. Extensibility must be designed in. Must understand vs. must ignore.

Principle 10: Separate content, presentation, and interaction. Question from audience: "Isn't there someone from Microsoft on the Working Group?"

Principle 11: XML and Hypertext. Allow web wide linking. Use URIs instead of IDREFs.

Principle 12: XML ID semantics.


Paul Prescod is talking about "Take REST: An Analysis of Two REST APIs". He's referring to Amazon and ATOM. I'm not sure I like the title. I suppose these are interfaces, and can be used as interfaces to application programs, but they are not APIs in the traditional sense. They're simply a presentation of data as XML documents at particular URLs. Hmm, seems he may have thought the same thing. That title was from the show program, but on the slides it's morphed into "Take REST: A Tale of Two Service Interfaces".

Prescod prefers "data-centric interfaces" to "service oriented interfaces". "XML is the solution to the problem, not the problem." Don't hide the XML! Big problem with Amazon interfaces is embedding authentication info in the URIs. However, this does work better with XSLT. RPC is too fragile (not extensible) for wire protocols. Example: fixed length argument lists.


Cool sighting of the day: Linux running on a dual-boot iPod.


Michael Kay, author of the popular Saxon open source XSLT processor, is talking about "XSLT and XPath Optimization (in Saxon)". There's a large crowd, more than 60 people in a small room. "Saxon is an engineering project, not a research project." He does not have a good performance test suite and reproducible measurements. His technique is mostly based on incrementally optimizing badly performing stylesheets. If he had been a reviewer of his own paper, he would have complained about this. Runtime optimizations can use knowledge of input data. Compile time optimizations avoid cost of repeated optimization. Differences between XSLT 1 and XSLT 2 aren't that radical from the standpoint of optimizations. Most optimizations in Saxon 7 could have been applied to Saxon 6 if he hadn't abandoned it. Some techniques are more effective in XSLT 2 due to strong types, but even 1.0 processors can deduce type information. Namespace prefixes defined at runtime (often via variables) are a major pain. Saxon does more optimization on XPath expressions than XSLT instructions.


Jonathan Robie of Data Direct is talking about "SQL/XML, XQuery, and Native XML Programming." Robie expects a second last call working draft of XQuery, because of the significant changes still being made. "It is anticipated" that there will be support for the SQL/XML XML data type in JDBC 4.0. There should be a public draft of a Java API for XQuery soon.


IBM's Elena Litani, a major contributor to Xerces-Java, is talking about "An API to Query XML Schema Components and the PSVI," about 20 people attending. The API she's describing is implemented in Xerces and has been submitted as a member proposal to the W3C. (I don't remember seeing this there. It may be members only. If the wireless network were working I could check.)

Elena Litani lecturing at XML Europe 2004

They wanted a platform and language independent API, defined in IDL. Didn't the DOM prove once and for all that this was a bad idea? Here they don't even have the excuse of needing to run inside browsers.

The three main interfaces are ElementPSVI, AttributePSVI and their common superinterface, ItemPSVI. These are implemented by the same objects that implement DOM Level 3 standard Element, Attr, and Node interfaces (or equivalent in other APIs). Casting is required.

Streaming models would use a PSVIProvider pull interface instead. Xerces supports this in SAX. Cast XMLReader to PSVIProvider, and then call getElementPSVI(), getAttributePSVI(), etc. However not all details may be available. For instance, in startElement(), one doesn't yet know if the element is valid.

This all looks very closely tied to the W3C XML Schema Language. I don't see how one could use this on a RELAX NG validated document, for example.

This API also includes a full read-only model for modelling schemas including XSObject, XSModel, etc. for modelling element declarations, target namespaces, type definitions, etc. I asked what the use case for this part of the API was. Litani suggests comparing two schemas and a schema-aware editor. According to Henry Thompson, it also allows you to navigate the type hierarchy; for instance to find out if a user defined type is a subtype of xsd:int.


Next up is Henry Thompson of the University of Edinburgh talking about "A Logical Foundation for W3C XML Schema." He admits the spec was written for implementors, and difficult to read for ordinary users. In the future he wants to better support logical reasoning about schema composition. He's speaking for himself as an individual, not the working group. He starts from the logic of feature structures as developed by Rounds, Moshier, et al. (And I'm already lost. Oh, he's going to give a mini-tutorial on what a "logic" is. Maybe this will explain it.) A logic is

  • A sentential form
  • A model theory
  • An interpretation

Gee, that's clear. OK, he elaborates. A sentential form is a grammar for defining well-formedness such as a BNF grammar. Now we're on ground I understand somewhat. A model theory is what the sentences are about. It's also a set of individuals and a set of named subsets of the set. The interpretation relates the well-formed sentences to the model so truth values of sentences can be determined. The sentences are interpreted by comparing what's found in the sentence to the items in the sets. The sentences contain logical operators such as OR and AND. I think I see what he's saying. I just don't see why this is useful.

OK, that's a logic. Now on to schemas. According to Thompson, a schema is a component graph in which components are nodes and properties are edge labels. Non-component values are leaf nodes. However, this is a general graph with cycles. Unlike XML documents it is not a tree. He wants to extend XPath to support such graphs.

Now he's showing a reformulation of parts of the schema spec using logic notation. I wouldn't have thought it possible, but it's even more opaque and reader hostile than before! Maybe it makes more sense once you've had some time to absorb it. This does look like it may make life easier for implementors. I'm just not sure it's an improvement for users. I asked, who is this meant for? Apparently it's supposed to replace the normative parts of the spec, and allow the non-normative parts to be written more cleanly.

Dan Brickley wants to rewrite this on top of OWL (which is itself written on top of RDF) instead of Thompson's idiosyncratic XML.

Thompson notes that when originally working on the PSVI, the working group was frustrated that there was nothing in the Infoset spec that told them how to be good citizens when extending the Infoset. It occurs to me that this is a problem for other specs like XInclude (which is trying to sneak a new language property into the Infoset) that want to extend the Infoset. Thompson claims this approach solves that problem, but I can't tell. As he says, "This stuff is dense." The formalization is very close to being a Prolog program, which would make an excellent reference implementation (if an inefficient one).

That's all for today. More tomorrow.

Sunday, April 18, 2004

The W3C has released version 8.4 of Amaya, their open source testbed web browser and authoring tool for Solaris, Linux, Windows, and Mac OS X that supports HTML, XHTML, XML, CSS, MathML, and SVG. This release fixes assorted bugs.

Saturday, April 17, 2004

Sun and IBM have posted the proposed final draft specification for Java Specification Request (JSR) 105, XML Digital Signature APIs. "The purpose of this JSR is to define a standard Java™ API for generating and validating XML signatures."

Friday, April 16, 2004

I'm leaving tonight for XML Europe in Amsterdam. I should have Internet access while I'm at the show, but updates will likely be sporadic until I return.


Dominique Hazaël-Massieux and Dan Connolly have written a W3C Note about Gleaning Resource Descriptions from Dialects of Languages (GRDDL). "This document presents GRDDL, a mechanism for encoding RDF statements in XHTML and XML to be extracted by programs such as XSLT transformations." Do you think just maybe the acronym came first? Hmm, what other expansions can we think of? Maybe "Gratuitous RDF Defenestrates Descriptive Labelling"?

Thursday, April 15, 2004

The W3C DOM working group has released the final Document Object Model (DOM) Level 3 Core Specification and the Document Object Model (DOM) Level 3 Load and Save Specification. The load and save specification defines platform and language independent means to parse an XML document to create a DOM Document object, and to serialize DOM Document objects into a file or output stream, something that was previously implementation dependent and accomplished with APIs like JAXP. DOM Level 3 Core adds various new features and methods to the standard DOM classes including:

  • Base URI support
  • A standard bootstrapping procedure to load a DOMImplementation in Java by reading a DOMImplementationRegistry
  • A DOMConfiguration interface that controls lots of little fiddly details like normalization
  • The XML declaration information such as encoding is now available in the Document object
  • User data attached to nodes
  • You can get the whole text from a text node.
  • DOMErrorHandler for reporting errors (like SAX2's ErrorHandler)
  • DOMLocator interface for reporting document positions (like SAX2's Locator)
  • A TypeInfo interface for DTDs and schemas
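
Two of these additions can be tried out in Python, whose standard-library minidom implements a small part of DOM Level 3. This is only a sketch of the whole-text and user-data features; the key name "reviewed" is arbitrary, and minidom is not a complete Level 3 implementation.

```python
from xml.dom import minidom

doc = minidom.parseString("<greeting>Hello</greeting>")
root = doc.documentElement

# Appending a second text node leaves two adjacent text nodes in the tree.
root.appendChild(doc.createTextNode(", world"))
first_text = root.firstChild

# wholeText returns the text of all logically adjacent text nodes together.
whole = first_text.wholeText
print(len(root.childNodes), repr(first_text.data), repr(whole))

# User data: attach an application object to a node under a key.
# The third argument is an optional handler; None means no callback.
root.setUserData("reviewed", True, None)
flag = root.getUserData("reviewed")
print(flag)
```

Before Level 3, getting the combined text meant walking the siblings by hand or normalizing the tree first.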
Wednesday, April 14, 2004

The W3C XInclude working group has posted the second candidate recommendation of the XInclude specification. Substantive changes include:

  • The accept and accept-language attributes are now limited to printable ASCII characters between 0x20 and 0x7E inclusive. This closes a major security hole in the last draft.
  • The accept-charset attribute has been removed.
  • The namespace URI is once again http://www.w3.org/2001/XInclude instead of http://www.w3.org/2003/XInclude
  • XInclude processors are recommended to add xml:lang attributes as necessary to the included elements to try to retain the language tagging.
  • The prohibition on fragment identifiers has changed from a "should not" to a "must not", though the exact error behavior is not yet specified.
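
The new restriction on accept and accept-language is simple to enforce. A minimal sketch in Python, with a hypothetical function name; the example of a rejected value (line breaks that would smuggle an extra HTTP header) is my guess at the kind of abuse the restriction prevents, not something quoted from the spec.

```python
def is_valid_accept_value(value):
    """True if every character is printable ASCII, 0x20 through 0x7E inclusive."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in value)


print(is_valid_accept_value("text/xml, application/xml;q=0.9"))   # True
print(is_valid_accept_value("text/xml\r\nX-Injected: 1"))         # False
print(is_valid_accept_value("fran\u00e7ais"))                     # False
```

A processor would run this check before copying the attribute value into an outgoing request.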

In my opinion this draft is the best yet, and probably ready for final release. I still think the xpointer attribute is a kludge, but it's just ugly; it doesn't really prevent users from doing anything. I do still wish there were a profile with no XPointer support at all to better support streaming implementations. Otherwise, this spec feels very solid.

Tuesday, April 13, 2004

The XML Apache Project has posted the fourth beta of XIndice 1.1, an open source native XML database published under the Apache Software License. XIndice supports XPath for queries and XML:DB XUpdate for XML updates and the XML:DB XML database API for Java as well as an XML-RPC interface. Changes since 1.0 are mostly minor and include bug fixes and Java 1.4 support.

Monday, April 12, 2004

The W3C XML Core Working Group has posted a note on XML Processing Model Requirements. The goal is to develop an XML vocabulary describing how to combine different processes such as XSL transformation and XInclude resolution in a certain order for manipulating infosets. The requirements are as follows:

  • The language must be rich enough to address practical interoperability concerns.
  • The language should be as small and simple as possible.
  • The language must allow the inputs, outputs, and other parameters of a component to be specified.
  • The language must define the basic minimal set of mandatory input processing options and associated error reporting options required to achieve interoperability.
  • Given a set of components and a set of documents, the language must allow the order of processing to be specified.
  • It should be relatively easy to implement a conformant implementation of the language, but it should also be possible to build a sophisticated implementation that can perform parallel operations, lazy or greedy processing, and other optimizations.
  • The model should be extensible enough so that applications can define new processes and make them a component in a pipeline.
  • The model must provide mechanisms for addressing error handling and fallback behaviors.
  • The model could allow conditional processing so that different components are selected depending on run-time evaluation.
  • The model should not prohibit the existence of streaming pipelines.
  • The model should allow multiple inputs and multiple outputs for a component.
  • The model should allow any data set conforming to one of the W3C standards, such as XML 1.1, XSLT 1.0, XML Query 1.0, etc., to be specified as an input or output of a component.
  • Information should be passed between components in a standard way, for example, as one of the data sets conforming to an industry standard.
  • The language should be expressed in XML. It should be possible to author and manipulate documents expressed in the pipeline language using standard XML tools.
  • The pipeline language should be declarative, not based on APIs.
  • The model should be neutral with respect to implementation language. Just as there is no single language that can process XML exclusively, there should be no single language that can implement the language of this specification exclusively. It should be possible to interoperably exchange pipeline documents across various computing platforms. These computing platforms should not be limited to any particular class of platforms such as clients, servers, distributed computing infrastructures, etc.

Rob Mckinnon has posted Delineate 0.5, a "tool for converting raster images to SVG (Scalable Vector Graphics) using AutoTrace or potrace. It loads images using JIU and displays results using Batik. Input formats are JPEG, PNG, GIF, BMP, TIFF, PNM, PBM, PGM, PPM, IFF, PCD, PSD, RAS."


Oleg Paraschenko has released TeXML 1.0, an XML vocabulary for TeX. The processor that transforms TeXML markup into TeX markup is written in Python, and thus should run on most modern platforms. The intended audience is developers who automatically generate TeX files. TeXML is published under the GPL.

Sunday, April 11, 2004

Emmanuil Batsis has posted Sarissa 0.9, an open source (GPL) JavaScript library for processing XML under Mozilla and Internet Explorer. It provides methods to obtain DOM Document/XMLHTTP objects, synchronous and asynchronous loading, XSLT transformations, implementations of some non-standard IE extensions for Mozilla, and NodeType constants for IE. This version can map the default namespace to a prefix (important when transforming XHTML with XSLT) and implements document.importNode for IE.


Nicholas Cull has released version 0.94 of his XHTML negotiation module for Apache 1.3.x that enables this web server to negotiate content types for XHTML documents appropriate for different browsers. That is, it allows you to serve application/xhtml+xml to modern, standards conformant browsers like Mozilla, and text/html to out of date, non-conformant browsers like Internet Explorer. This release fixes bugs.
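
The basic negotiation decision can be sketched in a few lines of Python. This is not the module's actual algorithm, which I haven't read; it's a simplified guess that serves application/xhtml+xml only when the browser's Accept header lists it explicitly with a nonzero q value. The function name and sample headers are illustrative.

```python
def choose_content_type(accept_header):
    """Pick a media type for an XHTML document from an HTTP Accept header."""
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0                      # default quality when no q parameter is given
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0          # treat a malformed q value as "not acceptable"
        if media_type == "application/xhtml+xml" and q > 0:
            return "application/xhtml+xml"
    # Wildcards such as */* are deliberately not treated as XHTML support.
    return "text/html"


mozilla = ("text/xml,application/xml,application/xhtml+xml,"
           "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5")
explorer = "image/gif, image/jpeg, */*"

print(choose_content_type(mozilla))
print(choose_content_type(explorer))
```

Ignoring the `*/*` wildcard matters: a browser that claims to accept anything has not thereby promised to render XHTML.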

Friday, April 9, 2004

The W3C XML Core Working Group has posted the first public working draft of xml:id Version 1.0. This describes an idea that's been kicked around in the community for some time. The basic problem is how to link to elements by IDs when a document doesn't have a DTD or schema. The proposed solution is to predefine an xml:id attribute that would always be recognized as an ID, regardless of the presence or absence of a DTD or schema. There are a few issues with the spec as currently written, which I've been sending to the working group as I discover them, but the basic idea seems solid.
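
To show why this is attractive, here is a minimal Python sketch that finds an element by its xml:id with no DTD or schema in sight. The helper name and sample document are made up, and the sketch does none of the validation the draft discusses (uniqueness, name syntax); it only does the lookup.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace, so no declaration is needed.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_by_xml_id(root, wanted):
    """Return the first element whose xml:id equals wanted, or None."""
    for element in root.iter():
        if element.get(XML_ID) == wanted:
            return element
    return None


doc = ET.fromstring(
    '<doc><section xml:id="intro"><para xml:id="p1">Hi</para></section></doc>')

print(find_by_xml_id(doc, "p1").text)
print(find_by_xml_id(doc, "missing"))
```

Because the attribute name alone marks the ID, the parser needs no external information to resolve the link.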

Thursday, April 8, 2004

The Organization for the Advancement of Structured Information Standards (OASIS) has voted to approve several Web Services Security specifications as official standards. Of course, they won't actually tell anybody what those specifications are, and their web site seems to be down about one out of every two connection attempts; and from what one can find, it seems there was at least one major flaw in the specification as written which will have to be addressed later in an erratum; but hey, they've got some specs out, somewhere!

In my experience, OASIS is the place companies go when they want to be able to brand something as a standard without doing the hard work of verifying that it actually makes sense. OASIS does have the benefit of being much more open to individual participation than most other standards bodies, which has led a couple of smart developers to move their projects there. (RELAX NG and DocBook come to mind.) However, the company-led projects at OASIS seem to be uniformly disastrous: big, clunky, and irrelevant.


The W3C XKMS Working Group has posted candidate recommendations of XML Key Management Specification (XKMS) and XML Key Management Specification (XKMS) Bindings. XKMS is a set of "protocols for distributing and registering public keys, suitable for use in conjunction with the standard for XML Signatures [XML-SIG] defined by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) and companion standard for XML encryption [XML-ENC]. The XML Key Management Specification (XKMS) comprises two parts -- the XML Key Information Service Specification (X-KISS) and the XML Key Registration Service Specification (X-KRSS). These protocols do not require any particular underlying public key infrastructure (such as X.509) but are designed to be compatible with such infrastructures." Comments are due by October 1.

Wednesday, April 7, 2004

The W3C Web Services Description working group has uploaded two WSDL working drafts:

Part 1 "defines a language for describing the abstract functionality of a service as well as a framework for describing the concrete details of a service description. It also defines criteria for a conformant processor of this language." Part 2 "defines the sequence and cardinality of abstract messages sent or received by an operation."

Tuesday, April 6, 2004

IBM's alphaWorks has released the XQuery Normalizer and Static Analyzer, a "Java API and GUI for normalizing and computing the static type of XQuery expressions."


Apple has posted Security Update 2004-04-05 for Mac OS X 10.3.3, available through Software Update. I don't normally announce such things here, but among other fixes this release includes a new version of libxml2. I'm guessing it fixes this buffer overflow attack on the URI parsing code in nanohttp and nanoftp.


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.8.1 exposes a SAX like interface. DOMIT! 0.9.2 exposes an API based on the Document Object Model (DOM) Level 1. Both are published under the GPL. These are bug fix releases.

Monday, April 5, 2004

The W3C Web Services Choreography Working Group has published the first public working draft of WS Choreography Model Overview. According to the introduction,

Business or other activities that involve multiple different organizations or independent processes that use Web service technology to exchange information can only be successful if they are properly coordinated. This means that the sender and receiver of a message know and agree in advance:

  • The format and structure of the (SOAP) messages that are exchanged, and

  • The sequence and conditions in which the messages are exchanged.

WSDL and its extensions provide a mechanism by which the first objective is realized, however, it does not define the sequence and conditions, or choreography, in which messages are exchanged.

To solve this problem, a shared common or "global" definition of the sequence and conditions in which messages are exchanged is produced that describes the observable complementary behavior of all the participants involved. Each participant can then use the definition to build and test solutions that conform to the global definition.

The main advantage of a global definition approach is that it separates the process being followed by an individual business or system within a "domain of control" from the definition of the sequence in which each business or system exchanges information with others. This means that, as long as the "observable" sequence does not change, the rules and logic followed within the domain of control can change at will.

The purpose of this paper is to describe an information model or "meta model" for a Choreography Definition Language that identifies the information and structures required to build a "global" definition.

I don't believe these premises. I think loosely coupled systems with limited if any prior agreements work much better than systems that attempt to legislate what one does with the data one receives. I think any attempt to globally define behavior is doomed to failure.

Sunday, April 4, 2004

Antenna House, Inc has released XSL Template Designer 1.0, a $3000 payware (no, I didn't leave out a decimal point) forms-based GUI designer for XSL-FO that runs on Windows. It supports fixed layout for rigid forms, flow layout for forms that have expanding fields to accommodate the data, and label layouts. The Antenna House XSL Formatter V3.1 or later is a prerequisite.

Saturday, April 3, 2004

Frank McIngvale has released the Gnosis Utils 1.1.1, a public domain collection of Python modules for processing XML:

  • gnosis.xml.pickle serializes objects to and from XML using an API compatible with the standard pickle module
  • gnosis.xml.objectify turns arbitrary XML documents into Python objects
  • gnosis.xml.validity checks validity against DTDs or schemas
  • gnosis.xml.indexer provides full text indexing and searching of XML documents based on XPath
  • gnosis.util.dtd2sql converts a DTD into SQL 'CREATE TABLE' statements
  • gnosis.util.sql2dtd creates a DTD for the results of a given SQL query
  • gnosis.util.xml2sql converts XML into SQL 'INSERT INTO' statements

This release adds support for RELAX NG.


I think the future is clear, and it ain't spelled "XSD". Major recent RELAX NG wins include DocBook, OpenOffice, XHTML, and SVG; all of which are planning to move to RELAX NG in their next versions. I have yet to encounter a group that seriously explored RELAX NG and still chose to use the W3C XML Schema Language. Which reminds me. Henry S. Thompson has released a new version of XSV, his "open source (GPLed) work-in-progress attempt at a conformant schema-aware processor." This is a bug fix release. Honestly, if the very bright Professor Thompson, one of the editors of the W3C XML Schema specification, still can't get this right three years after the spec was released, what hope is there for the rest of us?

Friday, April 2, 2004

Norm Walsh has released DocBook 4.3. DocBook is an XML application designed for technical documentation and books such as Processing XML with Java. There are no changes since the third candidate release. Changes since 4.2 are fairly minor and include allowing xml:base attributes on most elements, step alternatives, a new URI element for non-resolvable URIs such as namespace names and SAX property names, much better support for Java-like function prototypes, prefix, namespace, and localname classes for sgmltag, and adding emailmessage, webpage and newsposting as types of pubwork. Version 4.3 also registers the MIME media type application/docbook+xml.

Walsh has also announced that DocBook 5.0 will not be backwards compatible with DocBook 4.3. It will include changes that were not announced as deprecated in the 4.x releases. The exact list of changes has not been announced yet. It will be based on RELAX NG rather than DTDs or W3C XML Schemas.


Nate Nielsen has released RTFM 0.9, an open source (BSD license) tool for converting Rich Text Format (RTF) files into XML. "It majors on keeping meta data like style names, etc... rather than every bit of formatting. This makes it handy for converting RTF documents into a custom XML format (using XSL or an additional processing step)."


Nicholas Cull has released version 0.93 of his XHTML negotiation module for Apache 1.3.x that enables this web server to negotiate content types for XHTML documents appropriate for different browsers. That is, it allows you to serve application/xhtml+xml to modern, standards conformant browsers like Mozilla, and text/html to out of date, non-conformant browsers like Internet Explorer. This release fixes a bug that inadvertently locked out IE users.
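
The decision such a module has to make can be sketched in a few lines. This is only a Python illustration of the idea, not the module's actual code (which is an Apache module); the function name choose_content_type is made up for the example, and it assumes the only signal used is whether the browser's Accept header explicitly lists application/xhtml+xml.

```python
def choose_content_type(accept_header):
    """Pick a media type for an XHTML document from an HTTP Accept header.

    Hypothetical helper: serves application/xhtml+xml only to clients
    that explicitly list it, and falls back to text/html otherwise.
    """
    for item in (accept_header or "").split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if media_type != "application/xhtml+xml":
            continue
        # Honor an explicit q=0, which means "not acceptable".
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            return "application/xhtml+xml"
    return "text/html"
```

A browser that sends only wildcards such as */* gets text/html under this rule, which is the conservative choice for clients that do not really understand XHTML.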


Bare Bones Software has released BBEdit 7.1.3. This is a free bug fix update for all 7.0 users. BBEdit is the $179 payware Macintosh text/HTML/XML/programmer's editor I normally use to write this page. Mac OS X 10.2 or later is required. Mac OS 9 is not supported.

Thursday, April 1, 2004

Sadly, this is not an April Fool's joke. The W3C has chartered an XML Binary Characterization Working Group to "analyze and develop use cases and measurements for alternate encodings of XML. Its goal is to determine if serialized binary XML transmission and formats are feasible." It's likely that only people who believe in binary XML will participate, which indicates the conclusion is foreordained. However, there are so many different conflicting goals for binary XML (streaming, file size, bandwidth, parsing time, random access, strong typing, direct binding of data to memory, memory footprint, high performance in CPU limited environments like cell phones, etc.) from the players in this space that it's possible the working group won't be able to reach consensus. However there's virtually no chance that "the Working Group may determine that the benefits brought by an alternate encoding of XML may not be sufficient to justify the loss of interoperability incurred."


Andy Clark has posted version 0.9.2 of his CyberNeko Tools HTML Parser for the Xerces Native Interface (NekoXNI). CyberNeko is written in Java. Besides the HTML parser, CyberNeko includes a DTD parser, a generic XML pull parser, a RELAX NG validator, and a DTD to XML converter. The RELAX NG validator, pull parser, and DTD converter have also been updated. This release works with the latest version of Xerces and fixes assorted other bugs.

Wednesday, March 31, 2004

The W3C Scalable Vector Graphics Working Group has posted the sixth public working draft of Scalable Vector Graphics (SVG) 1.2. "The SVG Working Group consider the feature set of SVG 1.2 to be approaching stability. However, there are some cases where the descriptions in this document are incomplete and simply show the current thoughts of the SVG Working Group on the feature, or list the open issues. Therefore, this document should not be considered stable." Changes since the fifth public draft include:

  • Added support for CSS 3 Color property value rgba() syntax.
  • Switch is dynamic; that is, constantly evaluated.
  • Cursors allow SVG content and animation
  • A new interface for filtering events such as mutations only on a particular attribute name, mouse drags, or events in a particular phase
  • New :highlight pseudo-class
  • Non Scaling Strokes
  • overlay and overlay-host properties
  • Methods to set and retrieve client-side data stored between sessions
  • Removed the requiredView attribute, renamed min-pixel-width to min-unit-scale.
  • Three new rendering hints, static, cache and snap, for sprites
  • The traitDef element allows custom attributes to be exposed to the animation engine
  • vector-effect property.

The W3C SVG Working Group has also posted the second working draft of Mobile SVG Profiles: SVG Tiny and SVG Basic, Version 1.2. SVG Tiny 1.2 is a "backwardly-compatible update of SVG Tiny 1.1 which adds some new features from SVG 1.2 and adds other features based on implementor and designer feedback on SVG Tiny 1.1."

Tuesday, March 30, 2004

Conrad Roche has posted Matra 0.8b, an open source Java library for parsing DTDs. It might be useful for some purposes. However, it's divided into way too many packages, and uses short constants for no good reason that I can see. In many cases it should be using type-safe enums instead, or at least ints. Furthermore many methods are exposed as public that really shouldn't be, making the API far more complex than it needs to be. This may well be a symptom of the excessive number of packages. Matra is a read-write API. It's not immediately clear to what extent the write side of the library maintains the syntactical correctness of DTDs. For instance, does it make sure element names are legal XML names, and does it prevent multiple declarations for the same element? Matra is published under the MPL 1.1.


Dennis Sosnoski has posted JiBX beta 3a, yet another open source (BSD license) framework for binding XML data to Java objects using your own class structures. It falls into the custom-binding document camp as opposed to the schema driven binding frameworks like JaxMe and JAXB. Quoting from the JiBX web site,

JiBX is a framework for binding XML data to Java objects. It lets you work with data from XML documents using your own class structures. The JiBX framework handles all the details of converting your data to and from XML based on your instructions. JiBX is designed to perform the translation between internal data structures and XML with very high efficiency, but still allows you a high degree of control over the translation process.

How does it manage this? JiBX uses binding definition documents to define the rules for how your Java objects are converted to or from XML (the binding). At some point after you've compiled your source code into class files you execute the first part of the JiBX framework, the binding compiler. This compiler enhances binary class files produced by the Java compiler, adding code to handle converting instances of the classes to or from XML. After running the binding compiler you can continue the normal steps you take in assembling your application (such as building jar files, etc.).

The second part of the JiBX framework is the binding runtime. The enhanced class files generated by the binding compiler use this runtime component both for actually building objects from an XML input document (called unmarshalling, in data binding terms) and for generating an XML output document from objects (called marshalling). The runtime uses a parser implementing the XMLPull API for handling input documents, but is otherwise self-contained.
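
To make the marshalling and unmarshalling vocabulary concrete, here's a rough Python sketch of a binding driven by a separate definition. It is not JiBX and does not imitate its binding compiler or bytecode enhancement; the Customer class, the BINDING dictionary, and the marshal and unmarshal functions are all invented for illustration.

```python
import xml.etree.ElementTree as ET


class Customer:
    def __init__(self, name="", city=""):
        self.name = name
        self.city = city


# Hypothetical binding definition: root element name plus a mapping
# from object attribute names to child element names.
BINDING = {"element": "customer", "fields": {"name": "full-name", "city": "city"}}


def marshal(obj, binding=BINDING):
    """Generate an XML string from an object (marshalling)."""
    root = ET.Element(binding["element"])
    for attr, tag in binding["fields"].items():
        ET.SubElement(root, tag).text = getattr(obj, attr)
    return ET.tostring(root, encoding="unicode")


def unmarshal(xml_text, cls=Customer, binding=BINDING):
    """Build an object from an XML string (unmarshalling)."""
    root = ET.fromstring(xml_text)
    if root.tag != binding["element"]:
        raise ValueError("unexpected root element: " + root.tag)
    obj = cls()
    for attr, tag in binding["fields"].items():
        setattr(obj, attr, root.findtext(tag, default=""))
    return obj
```

The point of keeping the mapping outside the class is that the class itself carries no XML knowledge; only the binding definition ties attribute names to element names.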

I heard Sosnoski talk about JiBX at SD West a couple of weeks ago. Overall I found it a little too inflexible for my tastes. It assumes a much tighter coupling between XML documents and the corresponding Java classes than I'm comfortable with. It did give me some ideas for a more flexible, loosely coupled framework I may work on in the future.

This beta adds bind-on-load, optional source location tracking for unmarshalling, custom mappings of array classes, custom marshallers/unmarshallers for DOM or dom4j document models, and escaping of output characters based on encoding.

Monday, March 29, 2004

Nicholas Cull has written an XHTML negotiation module for Apache 1.3.x that enables this web server to negotiate content types for XHTML documents appropriate for different browsers. That is, it allows you to serve application/xhtml+xml to modern, standards conformant browsers like Mozilla, and text/html to out of date, non-conformant browsers like Internet Explorer.


Mikhail Grushinskiy has posted XMLStarlet 0.91, a command line utility for Linux that exposes a lot of the functionality in libxml and libxslt including validation, pretty printing, and canonicalization. This release has been recompiled against libxml2 2.6.8 and libxslt 1.1.5, and fixes a bug I discovered in canonicalization in the previous version.


Pekka Enberg has posted version 0.2.16 of XML Indent, an open source (GPL) "XML stream reformatter written in ANSI C" that "is analogous to GNU indent." This is a bug fix release.


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.8 exposes a SAX like interface. DOMIT! 0.9.1 exposes an API based on the Document Object Model (DOM) Level 1. Both are published under the GPL. This release of SAXY fixes bugs in entity resolution. This release of DOMIT! cleans up the code and build process.


Alexandre Brilliant has released JXMLPad 2.4, a €90 shareware JavaBean component for editing XML. This release lets you edit nodes in a table view. It also fixes various bugs. Java 1.2 or later is required.

Sunday, March 28, 2004

The W3C XML Schema Working Group has posted three "proposed edited recommendations":

These incorporate errata, but try not to change the base specs too much. However, there are some backwards incompatible changes in these drafts. For instance, --10-- is no longer a legal xsd:gMonth value but --10 is. Comments are due by April 16.
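
The gMonth change is easy to state as a lexical check. The following Python sketch reflects my reading of the revised form (two hyphens, a two-digit month, and an optional timezone); the function name is made up, and the timezone pattern is a simplification that should be checked against the drafts before anyone relies on it.

```python
import re

# Optional timezone: "Z" or a signed hh:mm offset up to 14:00.
_TZ = r"(Z|[+-](?:(?:0\d|1[0-3]):[0-5]\d|14:00))?"
# Revised form: two hyphens, then a two-digit month from 01 to 12.
_GMONTH = re.compile(r"--(0[1-9]|1[0-2])" + _TZ)


def is_valid_gmonth(lexical):
    """Return True if the string matches the --MM form of xsd:gMonth."""
    return _GMONTH.fullmatch(lexical) is not None
```

With this check, is_valid_gmonth("--10") is true and is_valid_gmonth("--10--") is false, which is exactly the backwards incompatibility: documents that were valid under the old lexical form stop validating.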

Saturday, March 27, 2004

My DSL router died early yesterday afternoon. Speakeasy said they'd get me a new one by early this morning, but until it gets here I'm stuck with dial-up. And of course this happens just a couple of weeks after I finally disconnect my second phone line. :-(


Paul DuBois has released xmlformat 1.03, an open source pretty-printer for XML documents written in Perl (or Ruby) that can adjust indentation, line-breaking, and text wrapping on a per-element basis. Version 1.03 is more compatible with Ruby 1.8. xmlformat is published under a BSD license.


Antenna House, Inc has released XSL Formatter 3.1 for Linux and Windows. Version 3.1 implements more of the XSL-FO Specification, adds support for CMYK color, enhances SVG drawing support, supports Arabic, Hebrew and Thai output to PDF, enables the user to choose between PDF 1.3, 1.4 and 1.5, can embed WMF and EMF graphics and ZIP compressed TIFF files, and integrates better with .NET. XSL Formatter is $1250 payware for a single user Windows license, plus another $100 if you want hyphenation support, plus royalty fees if you want GIF or TIFF support. Linux/Unix prices start at $3000.


JAPISoft has released EditiX 1.2.1, a $39 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.2.1 adds an embedded FAQ and file drag and drop support on Mac OS X. EditiX is available for Mac OS X, Linux, and Windows.

Friday, March 26, 2004

BEA Systems has posted the final draft of Java Specification Request (JSR) 173, Streaming API for XML (StAX), in the Java Community Process. StAX is a Java-based, pull-parsing API for XML. StAX offers two approaches. XMLStreamReader and XMLStreamWriter are a cursor API designed to read and write XML as efficiently as possible. XMLEventReader and XMLEventWriter are an iterator API designed to be easy to use, event based, easy to extend, and allow easy pipelining. The iterator API sits on top of the cursor API.

BEA has published a reference implementation. I haven't had time to write code with it yet, or to test the performance; but overall from the spec and JavaDoc I'd say this is the cleanest, most XML conformant pull parser I've seen to date. It's definitely a substantial improvement on XMLPULL. Overall, this looks like a very nice API.
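
For readers who haven't used a pull parser, the key difference from SAX is that the application asks for the next event instead of being called back. The sketch below shows that style using Python's standard xml.etree.ElementTree.XMLPullParser. It is not StAX and makes no claim about the StAX API; it is only loosely analogous to the iterator approach, and the element_names function is invented for the example.

```python
import xml.etree.ElementTree as ET


def element_names(xml_text):
    """Pull parse events one at a time and collect element names."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)
    parser.close()
    names = []
    for event, element in parser.read_events():
        # The application decides when to ask for the next event,
        # rather than being called back by the parser.
        if event == "start":
            names.append(element.tag)
    return names
```

Because the loop belongs to the caller, it can stop early, skip ahead, or hand the parser to another function, which is what makes pipelining natural in pull APIs.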


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.7 exposes a SAX like interface. DOMIT! 0.9 exposes an API based on the Document Object Model (DOM) Level 1. These releases fix various bugs. They also add "Lite" versions. Both are published under the GPL.


The Exolab Group has released Castor 0.9.5.3, an open source (BSD license) data binding tool for XML and Java. Castor can marshal and unmarshal XML documents into Java objects, and store those objects in SQL databases. Automatic generation of Java classes from W3C XML schema language schema is supported, though that doesn't seem to be required.


The Big Faceless Organization has released the Big Faceless Report Generator 1.1.17, a $1200 payware Java application for converting XML documents to PDF. Unlike most similar tools it appears to be based on HTML and CSS rather than XSL Formatting Objects. This is mostly a bug fix release. Java 1.2 or later is required.

Thursday, March 25, 2004

The Gnome Project has posted version 2.6.8 of libxml2, the open source XML C library for Gnome. This release improves support for the W3C XML schema language, fixes a few bugs, and attempts a couple of optimizations. They've also released version 1.1.5 of libxslt, the GNOME XSLT library for C and C++. This is also a bug fix release. Both versions are only available in CVS for the time being. Sources and binaries have not yet been posted.


Toni Uusitalo has posted Parsifal 0.7.5, a minimal, non-validating XML parser written in ANSI C. The API is based on SAX2. Parsifal doesn't yet catch all the well-formedness errors it should, but unlike a lot of so-called fast parsers the author does seem to realize this is important, and is working on fixing the problems. I can't recommend this parser just yet, but by the time it hits 1.0, it may be a worthy addition to the C programmer's toolbox. Version 0.7.5 fixes one nasty bug. Parsifal is in the public domain.

Wednesday, March 24, 2004

Tim Bray has updated the beta of Genx, his pure C library for outputting canonical XML. This version "can now declare any namespace to have an empty prefix ("") so that it is the default namespace when in effect. Also, you can redeclare a namespace from one prefix to another as long as you’re not in the scope of a declaration. (This restriction will be lifted soon). There are new calls genxAddNamespace for explicitly inserting a namespace declaration, and genxUnsetDefaultNamespace for removing the current default. Finally, there is genxGetNamespacePrefix, which is useful for creating QNames-in-content where Genx generated the prefix." Genx is published under the expat license. This is a very liberal, non-viral but GPL-compatible license.


IBM has updated their XML Parser for Java to version 4.3. This release is based on Xerces-J 2.6.2 and supports XML 1.0 and 1.1, Namespace 1.0 and 1.1, W3C XML Schema Recommendation 1.0, SAX 1.0 and 2.0, DOM Level 1, DOM Level 2, XML catalogs, some experimental features of DOM Level 3 Core and Load/Save Working Drafts, JAXP 1.2, and XNI.


Didier Demany has released xmloperator 3.0, an open source, tree-based XML editor written in Java. Editing can be guided by a RELAX NG schema or a DTD. Version 3.0 adds support for editing documents larger than available memory. xmloperator is published under a BSD-like license.

Tuesday, March 23, 2004

The W3C Voice Browser Working Group has released the VoiceXML 2.0 recommendation. "VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications." Changes since VoiceXML 1.0 include new log and metadata elements, the deprecation of dtmf, emp, div, pros, and sayas elements, and better integration with the Speech Synthesis Markup Language and other generic XML applications.


The W3C Voice Browser Working Group has also released the Speech Recognition Grammar Specification Version 1.0 Recommendation. According to the abstract, this "document defines syntax for representing grammars for use in speech recognition so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF Form and an XML Form. The specification makes the two representations mappable to allow automatic transformations between the two forms."

Monday, March 22, 2004

I'm back from Software Development 2004 West, where a good time was had by all. I got some interesting ideas about data binding APIs during Dennis Sosnoski's talk about JAXB and JiBX. I think I see now how to design a data binding API that doesn't suffer from the numerous problems of the existing schema dependent, tightly coupled systems. More on that after I get XOM out the door. I'm currently hoping to post alpha 1 around the first of next month. Speaking of XOM, it got a very positive reception at the conference; and a few more groups are likely to start using it.

During my talk on Effective XML, I recommended that developers always use a parser to handle XML, because regular expressions aren't sufficiently aware of XML rules. I really should follow my own advice. The regex based software I use to update this site from home just broke because of some well-formed changes I made to attribute order while hand editing the site in Santa Clara last week. The only excuse I have is that the software for managing this site predates XML by a year or two. It certainly would be more reliable and less flaky if I took the time to rewrite it to use an XML parser, though. I'll be catching up on news from the last week over the day as I wade through my e-mail backlog.
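
Here's a small Python sketch of the kind of failure I mean. It is not the software that runs this site; the pattern and function names are invented for the example. The regex assumes href always comes before title, while the parser doesn't care about attribute order at all.

```python
import re
import xml.etree.ElementTree as ET

# Brittle: assumes href always comes before title.
LINK_PATTERN = re.compile(r'<a href="([^"]*)" title="([^"]*)"')


def links_by_regex(markup):
    """Extract (href, title) pairs with a regular expression."""
    return LINK_PATTERN.findall(markup)


def links_by_parser(markup):
    """Extract (href, title) pairs with a real XML parser."""
    root = ET.fromstring(markup)
    return [(a.get("href"), a.get("title")) for a in root.iter("a")]
```

Given <a title="X" href="x.html">, which is the same element as <a href="x.html" title="X"> as far as XML is concerned, links_by_regex silently finds nothing while links_by_parser returns the same result for both.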


The Mozilla Project has posted the first beta of Mozilla 1.7, an open source web browser, chat and e-mail client that supports XML, CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness. It's available for Linux, Mac OS X, and Windows. Version 1.7 improves popup blocking, lets users review and open blocked popups, supports multiple identities in the same email account, and makes various other small improvements. It also has some small but significant performance optimizations.


IBM's alphaWorks has released an XQuery Normalizer and Static Analyzer (XQNSTA). "Given an XQuery expression, the tool uses the normalization rules specified in the W3C XQuery Formal Semantics document to convert the given expression to an expression in a core grammar (a subset of the XQuery grammar). The tool also comes with a parser that gives an XML representation of the XQuery expression; this XML representation can be used for manipulating the expression. Given a normalized expression, the tool again uses the static typing rules specified in the W3C XQuery Formal Semantics document to determine the output type of the expression. The normalized expression can be obtained from the Normalizer, and the static type of the expression can be obtained from the Static Analyzer. The Static Analyzer also checks for semantic errors (such as passing an empty expression to a function call where an integer argument is expected) and generates error messages whenever semantic errors are found during the static type checking." XQNSTA is written in Java. There's both an API for this and a GUI interface.


AlphaWorks has also published Views for XML, an XQuery based "mechanism for defining and querying views on native XML data. It is designed for XML users and Java™ developers working with applications that deal with data stored in XML format. It provides mechanisms for defining views relevant to the application. The application developer can then have the view itself as a data abstraction and not be bothered about the remaining data in the repository (which is irrelevant to the application being developed)."


Slava Pestov has uploaded the eleventh pre-release of jEdit 4.2, an open source programmer's editor written in Java with extensive plug-in support and my preferred text editor on Windows and Unix. New features in this release include the ability to customize the metal look and feel fonts in Java 1.5, use of the locale's short date format in the file system browser, various new macros, and S# syntax highlighting.

Friday, March 19, 2004

I've posted the notes from Wednesday's StAX and Effective XML classes at Software Development 2004 West. Regular updates should resume Monday.

Thursday, March 18, 2004

I've posted the notes from this morning's DOM Level 3 and XOM classes at Software Development 2004 West. I've also posted the notes from yesterday's unscheduled XSLT seminar.

Tuesday, March 16, 2004

I've posted the notes from this morning's Processing XML with SAX and DOM tutorial at Software Development 2004 West.


JAPISoft has released EditiX 1.2, a $39 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.2 adds a multiple view XSLT editor, SVG preview, generation of temporary schemas for document completion, and a DTD syntax checker. EditiX is available for Mac OS X, Linux, and Windows.

Monday, March 15, 2004

I've posted the notes from this morning's XML Fundamentals tutorial at Software Development 2004 West.

Sunday, March 14, 2004

I'm leaving today for Santa Clara and the Software Development 2004 West Conference. I'll have intermittent Internet access at the show, but I'll be quite busy so updates may be sporadic until I return next week.


With a great deal of effort, I've managed to push out one more release of XOM, my tree-based streaming API for processing XML with Java, before I leave. I had hoped that this would be alpha 1 and I could declare the API frozen. However, a few too many good ideas were submitted at the last minute to ignore, so I thought I'd make one more development release to shake out any bugs in the new classes and methods. So herewith is XOM 1.0d25. Anything that didn't change since the last release is probably not going to change now. However, there are some new features in this release that are worth reviewing and are not necessarily stable:

  • All 21 protected checkFoo methods have been removed. Instead the various mutator methods (setters and other methods that change the state of an object) are now non-final so they can be overridden. The getter methods are still final and the fields are all private. Thus, to change the state of an object, setter methods still need to call the constraint-verifying superclass methods. This should give subclasses a lot more flexibility while not compromising on well-formedness. (I'm not sure why I didn't think of this a year and a half ago. I suspect I was too focused on the way JDOM did things, and incorrectly assuming that the only way to make setters non-final was to expose the fields as well. Clearly that's not true, but sometimes when you get a wrong idea in your head, it's really hard to shake it. It took an offhand remark from John Cowan about JavaDoc comments to make me realize that I didn't really need to expose the fields just because the methods that wrote to them were overridable.)

  • The Serializer now throws UnavailableCharacterException, a subclass of XMLException, instead of a raw XMLException when it encounters a character it can neither write nor escape in the current encoding.

  • NodeFactory.makeDocument has been renamed startMakingDocument. NodeFactory.endDocument has been renamed finishMakingDocument.

  • DOMConverter can convert a DocumentFragment to a Nodes.

  • Added an XSLTransform.toDocument() method that converts a Nodes to a Document.

  • Element.removeChildren() now returns a Nodes object containing the children removed.

  • The LeafNode class has been removed. DocType, Text, Comment, and ProcessingInstruction now directly extend Node.

  • Removed the hasChildren method from Element, Node, ParentNode, Attribute and Document.

  • Element.addAttribute is declared to throw the more specific MultipleParentException instead of IllegalAddException.
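
The design in the first item above (overridable setters, private fields, constraint checks kept in the superclass) can be sketched roughly as follows. This is a Python analogy rather than XOM code: Python has no final or private, so a name-mangled attribute stands in for a private field, and the class names, the ALLOWED set, and the simplified ASCII-only name check are all assumptions made for the example.

```python
import re

# Simplified ASCII-only name check; real XML names allow many more characters.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


class Element:
    def __init__(self, name):
        self.__name = None  # name-mangled, so subclasses do not write it directly
        self.set_name(name)

    def get_name(self):
        return self.__name

    def set_name(self, name):
        """Overridable setter that enforces the well-formedness constraint."""
        if not _NAME.match(name):
            raise ValueError("not a legal XML name: %r" % name)
        self.__name = name


class HtmlElement(Element):
    ALLOWED = {"p", "div", "span"}

    def set_name(self, name):
        # Add a stricter rule, then delegate to the superclass,
        # which is the only code that actually writes the field.
        if name not in self.ALLOWED:
            raise ValueError("not an allowed element: %r" % name)
        super().set_name(name)
```

A subclass can tighten the rules but cannot loosen them, because the only route to the field runs through the superclass setter and its check.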

Saturday, March 13, 2004

The W3C Cascading Style Sheets working group has posted the first public working draft of The CSS 'Reader' Media Type. "'Reader' is a keyword for use in Media Queries [MEDIAQ]. When a Media Query that includes the 'reader' keyword is attached to (a link to) a style sheet, it indicates that that style sheet is designed to be used by a "reader" device (typically a screen reader), that both displays and speaks a document at the same time. It may also display the document and render it in braille at the same time, or do all three."


The W3C Cascading Style Sheets working group has posted the first public working draft of CSS3 Hyperlink Presentation Module. "This specification is a module of level 3 of CSS and contains the functionality required to describe the presentation of hyperlink source anchors and the effects of hyperlink activation." The draft includes this convenient summary of the four properties it defines:

  • target-name: current | root | parent | new | modal | <string>
  • target-new: window | tab | none
  • target-position: above | behind | front | back
  • target: <target-name> || <target-new> || <target-position>

These properties are long overdue. The equivalent HTML markup was deprecated some years ago, but there was effectively no other way to specify where a link would load.

Friday, March 12, 2004

The W3C Web Services Choreography Working Group has posted the third public working draft of Web Services Choreography Requirements 1.0. According to the abstract, "As the momentum around Web Services grows, the need for effective mechanisms to co-ordinate the interactions among Web Services and their users becomes more pressing. The Web Services Choreography Working Group has been tasked with the development of such a mechanism in an interoperable way. This document describes a set of requirements for Web Services choreography based around a set of representative use cases, as well as general requirements for interaction among Web Services. This document is intended to be consistent with other efforts within the W3C Web Services Activity."

Thursday, March 11, 2004

The W3C Web Content Accessibility Guidelines Working Group has posted the fourth public working draft of Web Content Accessibility Guidelines 2.0. Quoting from the introduction:

This document outlines design principles for creating accessible Web content. When these principles are ignored, individuals with disabilities may not be able to access the content at all, or they may be able to do so only with great difficulty. When these principles are employed, they also make Web content accessible to a variety of Web-enabled devices, such as phones, handheld devices, kiosks, network appliances. By making content accessible to a variety of devices, that content will also be accessible to people in a variety of situations.

The design principles in this document represent broad concepts that apply to all Web-based content. They are not specific to HTML, XML, or any other technology. This approach was taken so that the design principles could be applied to a variety of situations and technologies, including those that do not yet exist.

The table of contents provides a very nice summary of the guidelines:

  • Principle 1: Content must be perceivable.
    • Guideline 1.1 For non-text content, provide text equivalents that serve the same purpose or convey the same information as the non-text content, except when the sole purpose of the non-text content is to create a specific sensory experience (for example, music and visual art) in which case a text label or description is sufficient.
    • Guideline 1.2 Provide synchronized media equivalents for time-dependent presentations.
    • Guideline 1.3 Ensure that information, functionality, and structure are separable from presentation.
    • Guideline 1.4 In visual presentations, make it easy to distinguish foreground words and images from the background.
    • Guideline 1.5 In auditory presentations, make it easy to distinguish foreground speech and sounds from background sounds. [level 2 guideline]
  • Principle 2: Interface elements in the content must be operable.
    • Guideline 2.1 Make all functionality operable via a keyboard or a keyboard interface.
    • Guideline 2.2 Allow users to control time limits on their reading or interaction unless specific real-time events or rules of competition make such control impossible.
    • Guideline 2.3 Allow users to avoid content that could cause photosensitive epileptic seizures.
    • Guideline 2.4 Facilitate the ability of users to orient themselves and move within the content. [level 2 guideline]
    • Guideline 2.5 Help users avoid mistakes and make it easy to correct them. [level 2 guideline]
  • Principle 3: Content and controls must be understandable.
    • Guideline 3.1 Ensure that the meaning of content can be determined.
    • Guideline 3.2 Organize content consistently from "page to page" and make interactive components behave in predictable ways.
  • Principle 4: Content must be robust enough to work with current and future technologies.
    • Guideline 4.1 Use technologies according to specification.
    • Guideline 4.2 Ensure that user interfaces are accessible or provide an accessible alternative(s)

Wednesday, March 10, 2004

The W3C XML Schema Working Group has published the second working draft of XML Schema: Component Designators. This spec proposes a scheme for naming and identifying XML Schema components. Such components include:

  • Simple and complex type definitions
  • Attribute declarations
  • Element declarations
  • Attribute and model group definitions
  • Identity-constraint definitions
  • Notation declarations
  • Annotations
  • Model groups
  • Particles
  • Wildcards
  • Attribute uses
  • The master schema component representing the schema as a whole.
  • Facets

The goal is to be able to name, for example, the literallayout notation in the DocBook schema, as well as every other significant piece of the schema. Neither qualified names nor URIs obviously solve this problem.

Tuesday, March 9, 2004

The W3C has released version 8.3 of Amaya, their open source testbed web browser and authoring tool for Solaris, Linux, Windows, and Mac OS X that supports HTML, XHTML, XML, CSS, MathML, and SVG. This release fixes a lot of bugs, adds some new menu shortcuts, and adds a current date macro in the editor.


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.6 exposes a SAX like interface. DOMIT! 0.8 exposes an API based on the Document Object Model (DOM) Level 1. These releases improve the handling of document prologs and fix various bugs. Both are published under the GPL.

Monday, March 8, 2004

Michael Kay has released Saxon 7.9, an experimental open source implementation of large parts of XSLT 2.0, XPath 2.0, and XQuery in Java. According to Kay, version 7.9 "is not yet the schema-aware version, but it contains the structural changes needed to prepare the way for schema awareness. It kills all known bugs, and fills in a number of gaps in the coverage of the XSLT 2.0 and XQuery 1.0 specifications." Java 1.4 is required. Saxon is published under the Mozilla Public License 1.0. However, a "schema-aware version of the product is planned: this will be a commercial product available from Saxonica Limited."

Sunday, March 7, 2004

Nicholas Cull has posted a beta of mod_xhtml_neg, an Apache module that does content negotiation for XHTML pages. In other words, it serves XHTML pages to Internet Explorer using the MIME media type text/html and to browsers that actually follow the standards using application/xhtml+xml. (Actually, it looks quite a bit more powerful than that, but this seems to be the main purpose.)
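
The basic negotiation idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not mod_xhtml_neg's actual algorithm. The function names `parse_accept` and `choose_xhtml_media_type` are hypothetical, and the Accept parsing is deliberately simplified: it honors q-values but ignores wildcards such as */*.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into a {media_type: q} dict (simplified)."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


def choose_xhtml_media_type(accept_header):
    """Pick the Content-Type for an XHTML page (hypothetical helper).

    Serve application/xhtml+xml only to clients that list it explicitly
    and rank it at least as high as text/html; everyone else gets text/html.
    """
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    html_q = prefs.get("text/html", 0.0)
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"
```

A browser that sends only */* never names application/xhtml+xml, so under this sketch it falls through to text/html.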

Saturday, March 6, 2004

Syntext has released Serna 1.3.1, a $299 payware XSL-based WYSIWYG XML Document Editor. The new feature in this release is Mac OS X support. Other platforms are not affected. Features include on-the-fly XSL-driven XML rendering and transformation, on-the-fly XML Schema validation, and spell checking.

Friday, March 5, 2004

Sun has released the J2ME Web Services Specification. This describes a subset of JAXP and JAX-RPC intended for talking to SOAP services from Java 2 Micro Edition devices. "The goal of this optional package is to define a strict subset wherever possible of the XML parsing functionality defined in JSR-063 JAXP 1.2 [2] that can be used on the Java 2 Micro Edition Platform (J2ME)".

There are major problems in the SAX subset. Sun is using the confusing, underspecified SAXParser and SAXParserFactory classes instead of the much cleaner, better specified XMLReader and XMLReaderFactory classes. They've also removed ContentHandler completely and replaced it with DefaultHandler. This requires altering signatures in JAXP and makes it substantially more difficult to port standard SAX programs to J2ME.

If Sun really finds true SAX to be inappropriate for a micro environment, then they're free to develop their own API that better fits the needs of small devices. And indeed they're doing exactly that with the StAX API. Thus it's completely unclear to me why they felt the need to fork SAX in this fashion. They keep claiming SAX is too big, but if size were really the concern, I'd expect them to limit themselves to one API for this use case rather than two.

Thursday, March 4, 2004

The W3C Multimodal Interaction Working Group has published the second public working draft of the Ink Markup Language. According to the abstract,

The Ink Markup Language serves as the data format for representing ink entered with an electronic pen or stylus. The markup allows for the input and processing of handwriting, gestures, sketches, music and other notational languages in Web-based (and non Web-based) applications. It provides a common format for the exchange of ink data between components such as handwriting and gesture recognizers, signature verifiers, and other ink-aware modules.

The following example of writing the word "hello" in InkML is given in the spec:

<ink>
   <trace>
     10 0 9 14 8 28 7 42 6 56 6 70 8 84 8 98 8 112 9 126 10 140
     13 154 14 168 17 182 18 188 23 174 30 160 38 147 49 135
     58 124 72 121 77 135 80 149 82 163 84 177 87 191 93 205
   </trace>
   <trace>
     130 155 144 159 158 160 170 154 179 143 179 129 166 125
     152 128 140 136 131 149 126 163 124 177 128 190 137 200
     150 208 163 210 178 208 192 201 205 192 214 180
   </trace>
   <trace>
     227 50 226 64 225 78 227 92 228 106 228 120 229 134
     230 148 234 162 235 176 238 190 241 204
   </trace>
   <trace>
     282 45 281 59 284 73 285 87 287 101 288 115 290 129
     291 143 294 157 294 171 294 185 296 199 300 213
   </trace>
   <trace>
     366 130 359 143 354 157 349 171 352 185 359 197
     371 204 385 205 398 202 408 191 413 177 413 163
     405 150 392 143 378 141 365 150
   </trace>
</ink>

<sarcasm>Gee, that's not the least bit opaque.</sarcasm> This looks like the SVG mistake all over again. I wrote about this in Item 11 of Effective XML, "Make Structure Explicit through Markup." The right way to solve this problem is something like this:

<ink>
  <trace>
    <coordinate><x>10</x> <y>0</y></coordinate>
    <coordinate><x>9</x> <y>14</y></coordinate>
    <coordinate><x>8</x> <y>28</y></coordinate>
    <coordinate><x>7</x> <y>42</y></coordinate>
    <coordinate><x>6</x> <y>56</y></coordinate>
    <coordinate><x>6</x> <y>70</y></coordinate>
    <coordinate><x>8</x> <y>84</y></coordinate>
    <coordinate><x>8</x> <y>98</y></coordinate>
    <coordinate><x>8</x> <y>112</y></coordinate>
    <coordinate><x>9</x> <y>126</y></coordinate>
    <coordinate><x>10</x> <y>140</y></coordinate>
    <coordinate><x>13</x> <y>154</y></coordinate>
    <coordinate><x>14</x> <y>168</y></coordinate>
    <coordinate><x>17</x> <y>182</y></coordinate>
    <coordinate><x>18</x> <y>188</y></coordinate>
    <coordinate><x>23</x> <y>174</y></coordinate>
    <coordinate><x>30</x> <y>160</y></coordinate>
    <coordinate><x>38</x> <y>147</y></coordinate>
    <coordinate><x>49</x> <y>135</y></coordinate>
    <coordinate><x>58</x> <y>124</y></coordinate>
    <coordinate><x>72</x> <y>121</y></coordinate>
    <coordinate><x>77</x> <y>135</y></coordinate>
    <coordinate><x>80</x> <y>149</y></coordinate>
    <coordinate><x>82</x> <y>163</y></coordinate>
    <coordinate><x>84</x> <y>177</y></coordinate>
    <coordinate><x>87</x> <y>191</y></coordinate>
    <coordinate><x>93</x> <y>205</y></coordinate>
  </trace>
</ink>

That's more verbose, but it's also much clearer. It would let the data be extracted with standard XML tools rather than requiring each user to write their own micro-parser for the trace elements. If InkML really can't afford to actually mark up the x and y coordinates as x and y coordinates instead of raw text, then one wonders why it's using XML at all.
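
To make the difference concrete, here is a rough Python sketch of both approaches. It assumes trace content is nothing more than whitespace-separated integer x y pairs, as in the spec's example, and it uses only the first four points of the first trace. The function names and the two sample constants are mine, invented for illustration.

```python
import xml.etree.ElementTree as ET

# First four points of the first trace, in the spec's raw-text form.
RAW = """<ink>
  <trace>
    10 0 9 14 8 28 7 42
  </trace>
</ink>"""

# The same four points with the structure made explicit through markup.
MARKED_UP = """<ink>
  <trace>
    <coordinate><x>10</x> <y>0</y></coordinate>
    <coordinate><x>9</x> <y>14</y></coordinate>
    <coordinate><x>8</x> <y>28</y></coordinate>
    <coordinate><x>7</x> <y>42</y></coordinate>
  </trace>
</ink>"""


def points_from_raw(xml_text):
    """Hand-rolled micro-parser: split the trace text and pair up the numbers."""
    points = []
    for trace in ET.fromstring(xml_text).iter("trace"):
        numbers = [int(token) for token in (trace.text or "").split()]
        if len(numbers) % 2:
            raise ValueError("odd number of values in trace")
        points.extend(zip(numbers[0::2], numbers[1::2]))
    return points


def points_from_markup(xml_text):
    """No micro-parser needed: the XML parser has already found x and y."""
    root = ET.fromstring(xml_text)
    return [(int(c.findtext("x")), int(c.findtext("y")))
            for c in root.iter("coordinate")]
```

Both functions return the same list of (x, y) tuples, but only the first has to guess at token order, and it is the first that breaks as soon as the text inside a trace gets any more complicated.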

Wednesday, March 3, 2004

Amazon is now showing the XML 1.1 Bible as in stock for 24 hour shipment. They have it on sale for just $27.99.


The W3C Internationalization Working Group has published two new working drafts, Character Model for the World Wide Web 1.0: Fundamentals and Character Model for the World Wide Web 1.0: Normalization. These refactor and replace the previous single spec draft Character Model for the World Wide Web 1.0.

The fundamentals draft is in last call. "This Architectural Specification provides authors of specifications, software developers, and content developers with a common reference for interoperable text manipulation on the World Wide Web, building on the Universal Character Set, defined jointly by the Unicode Standard and ISO/IEC 10646. Topics addressed include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, string indexing, and URI conventions."

The Normalization spec "provides authors of specifications, software developers, and content developers with a common reference for early uniform normalization and string identity matching to improve interoperable text manipulation on the World Wide Web."

Tuesday, March 2, 2004

As usually happens once I announce one of my books here, Amazon completely sold out their initial shipment of the XML 1.1 Bible as soon as it arrived at their warehouses. The publisher is shipping them more, and it should be available very soon. It should not take the "3 to 5 weeks" Amazon is currently listing on their web site before more copies arrive. In the meantime, I've posted the usual five sample chapters:

The XPointers chapter has changed the most since the Gold edition to bring it in line with the final XPointer recommendations, but the other four chapters have been cleaned up and rewritten as well.

These were all originally written using Wiley's rather unusual Word stylesheet. (For instance, inline code is identified with strike through rather than monospace.) I cleaned that up some in Word, then saved the files as HTML, and finally did a lot of grepping to try to clean up Word's horrific HTML; but it's still pretty ugly and malformed if you view source. If you notice anything that seems out of sorts formatting-wise when reading these chapters, please drop me a line and I'll try to clean it up by hand.


The XML Apache Project has released Xalan-Java 2.6.0, an open source XSLT processor. This release improves compatibility with Java 1.5, upgrades Xerces to 2.6.2, and fixes various bugs.


JAPISoft has released EditiX 1.1, a $39 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.1 can generate DTDs or W3C XML Schemas from an instance document, supports drag and drop, and fixes several bugs.

Monday, March 1, 2004

I am very pleased to announce the release of the XML 1.1 Bible, the first book dedicated to the new version of the XML standard released by the W3C less than a month ago. XML 1.1 has a number of features that make it much more suitable for use by people whose operating system is MVS or VM/CMS or whose native language is Amharic, Burmese, or Cambodian. That doesn't describe you? Then to be honest there really isn't much in XML 1.1 to interest you, which is why the XML 1.1 Bible is quite clear in its recommendation that most users should stick to XML 1.0 for the foreseeable future.

However, although the new edition of this bestselling work is titled the XML 1.1 Bible, I didn't stop with the relatively minor changes needed to make it 1.1 savvy. A number of other sections were updated as well including the chapters on XPointers, Schemas, CSS, and XHTML. Most importantly, the book was substantially reduced in size and price. The last edition topped out at 1600 pages and cost almost $70. This edition cuts both the size and the price by almost half. The XML 1.1 Bible should strain neither your back nor your wallet. It comes in at a little over 1000 pages, and just under $40 (on top of which Amazon and other booksellers are currently offering it for 30% off).

How did I manage such a radical reduction in size? As the French philosopher and mathematician Blaise Pascal once wrote, "I have only made this longer because I have not had the time to make it shorter." I know how he felt. The first edition of the XML Bible was written under great time pressure, was finished well after deadline, and was the largest book I had written up to that point. My favorite reader comment about that edition was, "It would seem to me that if you asked the author to write 10,000 words about the colour blue, he would be able to do it without breaking into a sweat." While I probably could write 10,000 words about blue, for this edition, I restrained myself and took the time to write more concisely. I rewrote the book from the ground up; and while I retained the basic flavor and outline that proved so popular with the last three editions, I tightened up the writing and cut many examples down to size. With the benefit of five years of hindsight, I have also been able to expand coverage of promising new technologies (schemas, XInclude, XHTML, SVG, XML Base, and RDDL) while eliminating coverage of applications that proved to be less useful than they initially appeared (WML, VML, CDF, HTML+TIME, RDF, aural style sheets, and so on). The result is a more concise, approachable volume that covers more of what you need to know and less of what you don’t. If you liked the first or second edition, you’re going to like the third edition even more. I’m confident you’ll find this an even more useful tutorial and reference.

One change deserves special note. The baseball examples are history. They've been replaced throughout by a shorter, more approachable XML application involving television listings that I hope is a little less offputting to European audiences and other non-baseball fans. The baseball examples were a real dividing line among readers. You either loved them or hated them. I'm hopeful that the new television examples in this edition will be somewhat less controversial. If nothing else, they are certainly shorter.

Should you buy the new edition? If you already have the second or Gold edition, probably not. There's not a lot of new material here. Besides bringing the coverage of a few specs like XML itself and XPointer up to date, the main focus of this revision was to make the whole subject more approachable and accessible for novices. If you're still tooling around with a dog-eared copy of the first edition, it may be time to replace it. However, the second and Gold editions are still pretty up-to-date on most topics, and will continue to serve you well. On the other hand, if you're just learning XML, or if you're looking for a book to recommend to a colleague or to use as a text for a class, then I think the XML 1.1 Bible is better than ever. It covers the basics with more depth and detail, with fewer digressions into the more obscure parts of XML. The price has been reduced to just $39.99, and Amazon has it on sale for just $27.99. They aren't showing it in stock yet, but my publisher assures me it has been shipped and will be arriving at their warehouses very soon. Enjoy, and Happy XML 1.1!

Sunday, February 29, 2004

The W3C CSS Working Group has posted three candidate recommendations:

Cascading Style Sheets, level 2 revision 1
This spec describes CSS 2.1, a revision of CSS 2 that removes rarely implemented features and adds a few new ones including media-specific style sheets, content positioning, table layout, features for internationalization and some properties related to user interface. It also fixes a few bugs in the CSS2 spec "the most important being a new definition of the height/width of absolutely positioned elements, more influence for HTML's 'style' attribute and a new calculation of the "clip" property". Features removed include text-shadow, display: marker, display: compact, and content: <uri>.
CSS Print Profile
This module "defines a subset of Cascading Style Sheets Level 2 [CSS2] and CSS3 module: Paged Media [PAGEMEDIA] specifically for printing to low-cost devices. It is designed for printing from mobile devices, where it is not feasible or desirable to install a printer-specific driver, and for situations were some variability between the device's view of the document and the formatting of the output is acceptable."
CSS3 Paged Media Module
This module "describes the page model that partitions a flow into pages. It builds on the CSS3 Box model module and introduces and defines the page model and paged media. It adds functionality for pagination, page margins, headers and footers, image orientation. Finally it extends generated content for the purpose of cross-references with page numbers."

Saturday, February 28, 2004

Norm Walsh has released version 1.65 of the DocBook XSL stylesheets. These support transforms to HTML, XHTML, and XSL-FO. Major enhancements in this release include an alternate, more internationalizable index-generation mechanism and a "hack to support styling DocBook NG documents."

Friday, February 27, 2004

The W3C Scalable Vector Graphics Working Group has posted the fifth public working draft of Scalable Vector Graphics (SVG) 1.2. According to working group member Chris Lilley, "This document shows ongoing work, and is published now to get early feedback and to give a document for the SVG WG face to face meeting next week. Portions are known to be incomplete; we anticipate a new draft in a few weeks time." Changes since the fourth public draft include:

  • animateClock element.
  • New :editable CSS pseudo class.
  • Page elements can have transition effects and animation timing attributes.
  • Added support for the SMIL speed attribute.
  • Added vertical and horizontal alignment for text flow
  • Renamed flowText to flowRoot

Cameron McCormack has released Constraint, an open source SVG browser based on the Apache Project's Batik "that allows attributes to be specified in terms of expressions to be evaluated at display time. These simple one-way constraints allow a great amount of adaptivity to be built in to documents to account for, for example, canvas dimensions, language, text size, etc."


Engage Interactive has released DOMIT! 0.7, an open source Document Object Model (DOM) Level 1 implementation for PHP. This release improves conformance to the DOM Level 1 specification. DOMIT! is published under the GPL.


Andy Clark has posted version 0.9.1 of his CyberNeko Tools HTML Parser for the Xerces Native Interface (NekoXNI). CyberNeko is written in Java. Besides the HTML parser, CyberNeko includes a DTD parser, a generic XML pull parser, a RELAX NG validator, and a DTD to XML converter. This release fixes a bug in namespace handling introduced in the last release.

Thursday, February 26, 2004

Bodo Tasche has released Majix 1.2.2, an open source program written in Java that transforms RTF files into XML. Majix is Java compliant. It supports headings, lists, simple tables, bold face, italics, and underlines.


Jez Higgins has posted a new version of Arabica (nee SAXinC++), an open source C++ XML parser toolkit that supports SAX2 and DOM2 by wrapping an underlying parser such as expat, Xerces, libxml, or the Microsoft XML parser COM component. It supports various string types. It is published under a BSD-style license. This is a bug fix release.


Tim Bray has updated the beta of Genx, his pure C library for outputting canonical XML. This version adds a genxNextUnicodeChar function that enables you to skip over malformed UTF-8 data. Frankly, this strikes me as more than a little dangerous. This second beta also integrates more cleanly with C++ than previous versions. Genx is published under the expat license. This is a very liberal, non-viral but GPL-compatible license.
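
Here is roughly why skipping makes me nervous, sketched in Python rather than C. This does not use Genx or its API at all; Python's built-in UTF-8 decoder stands in for it, with errors="ignore" playing the role of "skip over the bad bytes." The sample bytes and function names are made up for illustration.

```python
# "café au lait" encoded as Latin-1: the lone 0xE9 byte is not valid UTF-8 here.
MALFORMED = b"caf\xe9 au lait"


def decode_strict(data):
    """Reject malformed input outright (raises UnicodeDecodeError)."""
    return data.decode("utf-8")


def decode_skipping(data):
    """Silently drop any bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")
```

The strict version fails loudly on this input. The skipping version quietly hands back "caf au lait", and nothing downstream will ever know a character went missing.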

Wednesday, February 25, 2004

The Mozilla Project has posted the first alpha of Mozilla 1.7, an open source web browser, chat and e-mail client that supports XML, CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness. It's available for Linux, Mac OS X, and Windows. Version 1.7 improves popup blocking, lets users review and open blocked popups, supports multiple identities in the same email account, and makes various other small improvements.


Tim Bray has updated the beta of Genx, his pure C library for outputting canonical XML. Several nasty bugs have been fixed, and the whole package builds and runs much more reliably on Windows now. Genx is published under the expat license. This is a very liberal, non-viral but GPL-compatible license.


Cladonia Ltd. has released the Exchanger XML Editor 1.3, a $98 payware Java-based XML Editor. Features include:

  • Schema Based Editing
  • Tag Prompting
  • Validation against DTD, XML Schema, RelaxNG
  • Tree View and Outliner for Tag Free editing
  • XPath and Regular expression searches
  • Schema Conversion
  • XSLT
  • Project Management
  • SVG Viewer and Conversion
  • Easy SOAP Invocations
  • Find in Files
  • Extension Handling

Version 1.3 adds support for DTD editing, XML catalogs, and RelaxNG and DTD based tag completion.

Tuesday, February 24, 2004

Daniel Veillard has released version 2.6.7 of libxml2, the open source XML C library for Gnome. This release fixes a few bugs and attempts a couple of optimizations.

Monday, February 23, 2004

The W3C XHTML working group has posted the first working draft of Modularization of XHTML, 2nd edition. Besides the incorporation of errata, the big change in this edition is the addition of W3C XML Schema Language modules for XHTML.


Tim Bray has posted a beta of Genx, his pure C library for outputting canonical XML. The beta fixes bugs and adds a genxGetVersion function. Genx is now published under the expat license. This is a very liberal, non-viral but GPL-compatible license.

Sunday, February 22, 2004

The W3C XQuery Working Group has posted the last call working draft of XQuery 1.0 and XPath 2.0 Formal Semantics. Changes since the previous working draft include:

  • Specification of schema context in [7.6.2 Elements in validation context] and in [7.1.4 Element and attribute type lookup (Static)]. This closes Issue 481.
  • Specification in [5 Modules and Prologs] of module import. This closes Issue 555.
  • Specification in [3.4.4 SequenceType Matching] and in [7 Auxiliary Judgments] of SequenceType matching. This closes Issue 559.

Frank McIngvale has released the Gnosis Utils 1.1.1, a public domain collection of Python modules for processing XML:

  • xml.pickle serializes objects to and from XML using an API compatible with the standard pickle module
  • xml.objectify turns arbitrary XML documents into Python objects
  • xml.validity checks validity against DTDs or schemas
  • xml.indexer provides full text indexing and searching of XML documents

This release adds support for the RELAX NG schema language and fixes various bugs.

Saturday, February 21, 2004

The XML Apache Project has released version 2.6.2 of Xerces-J, the popular open source XML parser for Java that supports SAX and DOM. This is a bug fix release. Java 1.2 or later is required.

Friday, February 20, 2004

The Apache XML Project has released version 2.5.0 of Xerces-C, the popular open source parser written in C++. This is primarily a bug fix release.


Andy Clark has posted version 0.9 of his CyberNeko Tools HTML Parser for the Xerces Native Interface (NekoXNI). CyberNeko is written in Java. Besides the HTML parser, CyberNeko includes a DTD parser, a generic XML pull parser, a RELAX NG validator, and a DTD to XML converter. This release adds:

  • Namespace processing
  • CDATA scanning
  • Settings to add or override namespace bindings
  • Settings to add or override doctype declaration
  • Filter to "purify" input to produce well-formed XML

It also fixes a few bugs.

Thursday, February 19, 2004

Jason Hunter has posted beta 10 of JDOM, a tree-based API for processing XML with Java. There are a few minor changes and fixes since beta 10 RC 1 last week. Get your comments in on the new API now. Jason hopes to release 1.0 in about 5 weeks.


Ed Willink has posted NiceXSL, "a conventional textual representation of XSL that is more amenable to conventional editing. XML overheads are much reduced and abbreviated syntaxes are provided for all XSLT 2.0 constructs. NiceXSL is supported by translators to and from standard XSLT." He offers the following example:

stylesheet version="17.0" {
  match (/) {
    if (system-property('xsl:version') >= 17.0)
      <xsl:exciting-new-17.0-feature>;
    else {
      <html
        <head
          <title "XSLT 17.0 required">
        >
        <body
          <p "Sorry, this stylesheet requires XSLT 17.0.">
        >
      >
    }
  }
}

Honestly, I don't think this is as clear as the usual XML syntax for XSLT. I understand the desire to move the control structures out of XML syntax, but why can't the literal result elements be XML? This sort of malformed, pseudo-XML markup is just plain confusing. :-(


Syntext has released Serna 1.3, a $299 payware XSL-based WYSIWYG XML Document Editor for Windows and Linux. Features include on-the-fly XSL-driven XML rendering and transformation, on-the-fly XML Schema validation, and spell checking. New features in this release include XslBricks (a rapid stylesheet development template library), NITF support, and the ability to open another document from the command line in the running instance (something I really wish Mozilla could figure out how to do).

Wednesday, February 18, 2004

Tim Bray has posted an initial drop of Genx, his C API for generating canonical XML.


Dave Malcolm has posted Conglomerate 0.7.12, an open source GUI XML editor for Linux written in C, and based on libxml2 and the GTK+ and Gnome libraries. This release adds Croatian and Japanese localizations, provides preliminary support for TEI Lite, adds a dialog for selecting which child element to insert when a DTD requires a choice, and makes numerous bug fixes. Conglomerate is published under the GPL.

Tuesday, February 17, 2004

The W3C Web Services Architecture Working Group has updated five working drafts:

Web Services Architecture
"This document defines the Web Services Architecture. It identifies the functional components and defines the relationships among those components to effect the desired properties of the overall architecture."
Web Services Glossary
This document defines various terms used in the various specs like "actor", "digital signature", and "SOAP receiver". The definition given of "Web Service" is
A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP-messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.
Web Services Architecture Usage Scenarios
This document describes a series of use cases like travel agent interactions with customers and EDI transactions.
Web Services Architecture Requirements
"This document describes a set of requirements for a standard reference architecture for Web services developed by the Web Services Architecture Working Group. These requirements are intended to guide the development of the reference architecture and provide a set of measurable constraints on Web services implementations by which conformance can be determined."
Web Service Management: Service Life Cycle
"This document describes the life cycle of a Web service, and of the processing of a request by a Web service."

Monday, February 16, 2004

The W3C Privacy Activity has posted the first public working draft of the Platform for Privacy Preferences 1.1 (P3P1.1) Specification. "P3P 1.1 is based on the P3P 1.0 Recommendation and adds some features using the P3P 1.0 Extension mechanism. It also contains a new binding mechanism that can be used to bind policies for XML Applications beyond HTTP transactions." New features in P3P 1.1 include a mechanism to name and group statements together so user agents can organize the summary display of those policies and a generic means of binding P3P Policies to arbitrary XML to support XForms, WSDL, and other XML applications.


Oleg Tkachenko has released nxslt 1.4, a Windows command line utility for accessing the .NET XSLT engine. This release updates various libraries nxslt depends on. nxslt is written in C# and requires the .NET Framework version 1.0 to be installed.

Sunday, February 15, 2004

Toni Uusitalo has posted Parsifal 0.7.4, a minimal, non-validating XML parser written in ANSI C. The API is based on SAX2. Parsifal doesn't yet catch all the well-formedness errors it should, but unlike a lot of so-called fast parsers the author does seem to realize this is important, and is working on fixing the problems. I can't recommend this parser just yet, but by the time it hits 1.0, it may be a worthy addition to the C programmer's toolbox. Version 0.7.4 adds support for linking with GNU libiconv so that it can now parse documents in many different encodings such as UTF-16, UTF-32, EUC-JP, and SHIFT_JIS. And of course, many bugs were fixed. Parsifal is in the public domain.

Saturday, February 14, 2004

The XML Protocol Working Group has posted the first public working draft of the XML-binary Optimized Packaging (XOP) specification.

This specification defines the XML-binary Optimized Packaging (XOP) convention, a means of more efficiently serializing XML Query 1.0 and XPath 2.0 Data Model [XML Query Data Model] that have certain types of content.

A XOP package is created by placing a serialization of the XML Data Model inside of an extensible packaging format (such as MIME Multipart/Related, see [RFC 2387]) and then re-encoding selected portions of its content alongside it, while marking their locations in the XML with a special element that links to the packaged data using URIs.

Optimization in XOP is limited to the content of those elements which contain characters that can be interpreted as the canonical lexical representation of the XML Schema base64Binary datatype (see [XML Schema Part 2] 3.2.16 base64Binary and Errata in XML Schema, E2-54). Attributes, non-base64-compatible character data, and data not in the canonical representation of the base64Binary datatype cannot be successfully optimized by XOP.

Fortunately, this does not seem to be a generic binary encoding of XML, just a more efficient means of bundling non-XML binary data with XML documents.
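
The eligibility rule quoted above (content must be in the canonical lexical form of base64Binary) can be approximated with a simple round trip. This Python sketch is my own illustration, not anything taken from the XOP draft. The function name is hypothetical, and I am assuming that decode-then-re-encode equality is a fair stand-in for the schema's canonical form: no whitespace, correct padding, no stray bits.

```python
import base64


def is_canonical_base64(text):
    """Return True if text survives a strict decode/encode round trip unchanged."""
    try:
        # validate=True rejects whitespace and other non-alphabet characters.
        decoded = base64.b64decode(text, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return base64.b64encode(decoded).decode("ascii") == text
```

Under this test, element content such as aGVsbG8= would be a candidate for optimization, while the same data with a line break in the middle, or ordinary character data, would stay inline in the XML.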


The XML Protocol Working Group has also posted a new working draft of the SOAP Message Transmission Optimization Mechanism that relies on XOP. Quoting from the introduction,

Unlike SOAP itself, which is defined in terms of XML Infosets [XML InfoSet], this feature models message envelopes using the XQuery 1.0 and XPath 2.0 Data Model [XML Query Data Model], which is a typed superset of the Infoset. This feature uses type information only for optimization purposes; it does not provide for reconstruction of type information at receivers, except as necessary to support optimization. Nonetheless, use of the Data Model in this specification facilitates optimized transmission of query results through SOAP, and should provide a useful foundation if, for example, digital signature canonicalizations were to be developed for Data Model instances. Use of the Data Model here should also facilitate the work of those who may wish to develop features to provide for optimized transmission of the full typed Data Model: the changes needed to this specification should be straightforward, and the optimizations provided herein should be easy to generalize for such use.

The usage of the Abstract Transmission Optimization Feature is a hop-by-hop contract between a SOAP node and the next SOAP node in the SOAP message path, providing no mandatory convention for optimization of SOAP transmission through intermediaries. The feature does provide optional means by which binding implementations MAY choose to facilitate the efficient pass-through of optimized data contained within headers or bodies relayed by an intermediary (see 2.3.4 Binding Optimizations at Intermediaries). Additional specifications might also be written to provide for other optimized multi-hop capabilities, perhaps building on the mechanisms provided herein.

The second part (3. An Optimized MIME Multipart Serialization of SOAP Messages) describes an Optimized MIME Multipart Serialization of SOAP Messages implementing the Abstract Transmission Optimization Feature in a binding independent way. This implementation relies on the [XOP] format.

The third part (4. HTTP Transmission Optimization Feature) uses this Optimized MIME Multipart Serialization of SOAP Messages for describing an implementation of the Abstract Transmission Optimization Feature for the SOAP 1.2 HTTP binding (see [SOAP Part 2] 7. SOAP HTTP Binding).

I find the string typing implicit in this model to be seriously broken. It loses information (i.e. it's a lossy compression format) and makes too many assumptions about what content is and is not relevant.

Friday, February 13, 2004

Daniel Veillard has released version 2.6.6 of libxml2, the open source XML C library for Gnome. This release fixes a potentially serious buffer overflow error. All users should upgrade.


The Apache Project has released Cocoon 2.1.4, an open source "web development framework built around the concepts of separation of concerns and component-based web development. Cocoon implements these concepts around the notion of 'component pipelines', each component on the pipeline specializing on a particular operation. This makes it possible to use a Lego(tm)-like approach in building web solutions, hooking together components into pipelines without any required programming." Cocoon can assemble data from many sources including filesystems, SQL databases, LDAP, native XML databases, and SAP. It can customize the output to generate HTML, WML, PDF, SVG, and RTF from the same inputs. Processes it supports include XSL transformation and XInclude resolution. Cocoon can run as a servlet inside an existing web server or standalone through a commandline interface. 2.1.4 is primarily a bug fix release.


Ernst de Haan has posted xmlenc 0.42, an open source library for streaming XML output. It's marginally more convenient than System.out.println(). This release improves verification of well-formedness.


Thursday, February 12, 2004

The W3C Resource Description Framework (RDF) Core Working Group has published six proposed recommendations. This set of six replaces the original two Resource Description Framework specifications from 1999, RDF Model and Syntax and RDF Schema. The six new specs are:

RDF Primer
According to the introduction,

The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web. It is particularly intended for representing metadata about Web resources, such as the title, author, and modification date of a Web page, copyright and licensing information about a Web document, or the availability schedule for some shared resource. However, by generalizing the concept of a "Web resource", RDF can also be used to represent information about things that can be identified on the Web, even when they cannot be directly retrieved on the Web. Examples include information about items available from on-line shopping facilities (e.g., information about specifications, prices, and availability), or the description of a Web user's preferences for information delivery.

RDF is intended for situations in which this information needs to be processed by applications, rather than being only displayed to people. RDF provides a common framework for expressing this information so it can be exchanged between applications without loss of meaning. Since it is a common framework, application designers can leverage the availability of common RDF parsers and processing tools. The ability to exchange information between different applications means that the information may be made available to applications other than those for which it was originally created.

RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers, or URIs), and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, and their properties and values.

Resource Description Framework (RDF): Concepts and Abstract Syntax
"This document defines an abstract syntax on which RDF is based, and which serves to link its concrete syntax to its formal semantics. This abstract syntax is quite distinct from XML's tree-based infoset [XML-INFOSET]. It also includes discussion of design goals, key concepts, datatyping, character normalization and handling of URI references."
RDF Semantics
"This is a specification of a precise semantics, and corresponding complete systems of inference rules, for the Resource Description Framework (RDF) and RDF Schema (RDFS)."
RDF Vocabulary Description Language 1.0: RDF Schema
"This specification describes how to use RDF to describe RDF vocabularies. This specification defines a vocabulary for this purpose and defines other built-in RDF vocabulary initially specified in the RDF Model and Syntax Specification."
RDF/XML Syntax Specification (Revised)
"This document defines an XML syntax for RDF called RDF/XML in terms of Namespaces in XML, the XML Information Set and XML Base. The formal grammar for the syntax is annotated with actions generating triples of the RDF graph as defined in RDF Concepts and Abstract Syntax. The triples are written using the N-Triples RDF graph serializing format which enables more precise recording of the mapping in a machine processable form."
RDF Test Cases
This document describes a set of machine-processable test cases for RDF though it does not contain the test cases themselves which are available separately.

Changes since the December proposed recommendations are mostly editorial in nature.
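
To make the graph model from the Primer excerpt above a little more concrete, here is a minimal Python sketch, not anything taken from the specs themselves. It models statements as subject/property/value triples and writes them out in an N-Triples-like form. The class names, the helper functions, and the example.org URIs are all assumptions invented for illustration, and the escaping covers only a few common characters (the \uXXXX escapes for non-ASCII characters are omitted).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


def escape_literal(text: str) -> str:
    """Escape backslash, double quote, and a few control characters."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def to_ntriples(triple: Triple) -> str:
    """Serialize one statement as a single N-Triples-style line."""
    def term(t: Union[URI, Literal]) -> str:
        if isinstance(t, URI):
            return f"<{t.value}>"
        return f'"{escape_literal(t.value)}"'
    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


# Hypothetical resource and property values, for illustration only.
page = URI("http://www.example.org/index.html")
graph = [
    Triple(page, URI("http://purl.org/dc/elements/1.1/title"),
           Literal("Example Home Page")),
    Triple(page, URI("http://purl.org/dc/elements/1.1/creator"),
           URI("http://www.example.org/staffid/85740")),
]

for statement in graph:
    print(to_ntriples(statement))
```

Each printed line is one arc of the graph: the resource being described, the property, and the value, which may itself be another resource or a plain literal.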


The W3C Web Ontology Working Group has released the final recommendations of all six of its specifications:

Quoting from the overview document,

The OWL Web Ontology Language is designed for use by applications that need to process the content of information instead of just presenting information to humans. OWL facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. OWL has three increasingly-expressive sublanguages: OWL Lite, OWL DL, and OWL Full.

Wednesday, February 11, 2004

Who else noticed that the latest Microsoft security hole was really a bug in their ASN.1 implementation? In theory text and binary formats are isomorphic and equivalent. In practice text formats in general (and XML formats in particular) are simpler, easier-to-use, easier-to-debug and less prone to security problems. Also of note was that it took Microsoft six months to fix this problem. Somehow I suspect that even if a security hole were found in Xerces or libxml, it wouldn't take their vendors half a year before releasing a patch.


The Software Development West 2004 Expo in Santa Clara next month (March 15-19) is looking for a few more volunteers to man doors, distribute notes, and similar tasks. For each day you volunteer you get to attend the conference for a day free, and most volunteer days involve nothing more than sitting in the back of the room listening to the presentation, and collecting eval forms at the end; so really, it's a nice way to attend the show for free.


Valéry Febvre has posted PyXMLSec 0.2.0, a set of Python bindings for the XML Security Library. This is published under the GPL.

Tuesday, February 10, 2004

Version 3.1 of the payware <Oxygen/> XML editor has been released. Oxygen supports XML, XSL, DTDs, and the W3C XML Schema Language. New features in version 3.1 include:

  • The Outliner shows more information.
  • DocBook XSL stylesheets have been upgraded to version 1.64.1.
  • File associations are supported on Mac OS X.
  • WebDAV tries to recover from non fatal HTTP errors.
  • Redesigned Eclipse Plugin options
  • Bug fixes

Oxygen requires Java 1.3 or later. It costs $74.


The Mozilla Project has released FireFox (nee Firebird) 0.8, an open source web browser for Windows, Mac OS X, and Linux that supports XML, XHTML, XSL, HTML, and CSS. Unlike the heavier weight Mozilla from which it is derived, this is just a browser; no e-mail client, newsreader, LDAP browser, or microwave oven is included. Besides the name change, new features in this release include

  • A Windows Installer
  • A new streamlined download manager
  • An enhanced Add Bookmark Dialog that allows the creation of new bookmark folders.
  • Offline mode
  • Better Handling of mislabelled binary files
  • A new XPInstall Frontend
  • A new default Pinstripe theme for MacOS X

Of course, many bugs have been fixed as well. FireFox is published under the Mozilla Public License.


IBM has released Version 5.4 of XML for C++, a schema-validating XML parser based on Xerces-C. This release has a number of bug fixes and performance optimizations, especially in schema handling.


Sleepycat Software has released Berkeley DB XML 1.2.1, an open source (Non-GPL viral) "application-specific, embedded data manager for native XML data" based on Berkeley DB. It includes C++ and Java APIs and supports XPath 1.0. 1.2.1 is a bug fix release.

Monday, February 9, 2004

The W3C Document Object Model (DOM) Working Group has released the DOM Level 3 Validation Recommendation. Implementations of DOM3 validation enable programs to validate documents in memory without reparsing, and to determine whether or not particular changes to the document such as adding or removing an element or changing text content are allowed by the schema. The system is schema-language neutral. Different implementations can support different schema languages including DTDs, RELAX NG, and the W3C XML Schema Language. Comments are due by January 14.


The W3C DOM working group has released proposed recommendations of the Document Object Model (DOM) Level 3 Core Specification and the Document Object Model (DOM) Level 3 Load and Save Specification. Changes since the previous candidate recommendation drafts in both the Core and Load and Save specifications appear to be mostly editorial.

I'll be talking about all three of these specs in a session at Software Development 2004 West in San Jose next month. I'm now quite glad I waited till the last minute to prepare my notes for that talk. :-)


JAPISoft has released EditiX 1.0.1, a $39 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews. Version 1.0.1 cleans up the user interface and fixes bugs.

Sunday, February 8, 2004

Paul DuBois has released xmlformat 1.02, an open source pretty-printer for XML documents written in Perl (or Ruby) that can adjust indentation, line-breaking, and text wrapping on a per-element basis. Version 1.02 adds options for in-place formatting and backup creation. xmlformat is published under a BSD license.


Engage Interactive has updated two open source XML parsers written in PHP. SAXY 0.5 exposes a SAX like interface. DOMIT! 0.6 exposes an API based on the Document Object Model (DOM) Level 1. These releases improve performance. Both are published under the GPL.

Saturday, February 7, 2004

I warned the W3C that XML 1.1 was going to be a disaster. I warned them that it was going to cause massive problems for numerous people. I warned them that clueless users were going to start typing version="1.1" for no good reason, and thereby making their documents uninteroperable with most of the installed base of XML software. But they went ahead and released it anyway, and now my prophecies have come to pass. I must admit I didn't think it would happen quite this quickly, but I do know what Cassandra felt like. :-(


The W3C Voice Browser Working Group has posted the Proposed Recommendation of VoiceXML 2.0. "VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications." Changes since VoiceXML 1.0 include new log and metadata elements, the deprecation of dtmf, emp, div, pros, and sayas elements, and better integration with the Speech Synthesis Markup Language and other generic XML applications.

Friday, February 6, 2004

Wolfgang Meier of the Darmstadt University of Technology has posted the first beta of eXist 1.0 (after several earlier pre-beta releases), an open source native XML database that supports fulltext search. XML can be stored in either the internal, native XML-DB or an external relational database. The search engine has been designed to provide fast XPath queries, using indexes for all element, text and attribute nodes. The server is accessible through HTTP and XML-RPC interfaces and supports the XML:DB API for Java programming.

This beta release adds support for XQuery 1.0 including an XQueryServlet and XQueryGenerator for Cocoon that enable you to write web applications in XQuery. This release also improves locking and caching strategies, adds support for binary resources, and fixes numerous bugs. eXist is published under the LGPL.


Jason Hunter has posted beta 10/release candidate 1 of JDOM, a tree-based API for Processing XML with Java. The biggest change in this release is that there's a Parent interface that abstracts out the common functionality of Element and Document and a Content superclass that abstracts out the common functionality of child nodes. Get your comments in on the new API now. Jason hopes to release 1.0 in about 6 weeks.


Mikhail Grushinskiy has posted XMLStarlet 0.81, a command line utility for Linux that exposes a lot of the functionality in libxml and libxslt including validation, pretty printing, and canonicalization. I have to test that last one. I've been looking for a tool to do this for a while now.

Thursday, February 5, 2004

The W3C has released four updated recommendations:

The 3rd edition of XML 1.0 and the second edition of the XML Infoset just incorporate errata. Namespaces in XML 1.1 allows prefixes to be undeclared. That is, you can now say xmlns:pre="". Everything you need to know about XML 1.1 can be summed up in two rules:

  1. Don't use it.
  2. (For experts only) If you speak Mongolian, Yi, Cambodian, Amharic, Dhivehi, Burmese or a very few other languages and you want to write your markup (not your text but your markup) in these languages, then you can set the version attribute of the XML declaration to 1.1. Otherwise, refer to rule 1.

For more details, see Chapter 3 of Effective XML.
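
To see what prefix undeclaration means for the installed base, here is a small Python sketch. It feeds a document containing xmlns:pre="" to the standard library's ElementTree, which sits on top of the namespace-aware Expat parser and implements the original Namespaces in XML rules. The element names and namespace URI are made up for illustration, and the accepts() helper is just a convenience invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and the namespace URI are invented.
# The inner element tries to undeclare the "pre" prefix, which only
# Namespaces in XML 1.1 permits.
DOC = (
    '<root xmlns:pre="http://www.example.com/ns">'
    '<pre:child><inner xmlns:pre=""/></pre:child>'
    '</root>'
)


def accepts(document: str) -> bool:
    """Return True if ElementTree parses the document without error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print("with xmlns:pre=\"\":", accepts(DOC))
print("without it:", accepts(DOC.replace(' xmlns:pre=""', "")))
```

A parser that follows the older namespaces rules should reject the first document and accept the second, which is the interoperability cost in miniature. Undeclaring the default namespace with xmlns="" has always been legal; it is only prefixed declarations that changed.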

Wednesday, February 4, 2004

Late Night Software has released XSLT Tools 2.0, a scripting addition that adds XSLT and XPath support to AppleScript. It also allows extending the XSLT processor using AppleScript functions. It's based on Xerces-C 2.3 and Xalan-C 1.6. XSLT Tools is free for Mac OS X.


Bare Bones Software has released BBEdit 7.1.2. This is a free update for all 7.0 users. BBEdit is the $179 payware Macintosh text/HTML/XML/programmer's editor I normally use to write this page. This is mostly a bug fix release, with some small new features including smarter auto indentation, cancelling grep searches, and alternate ports for SFTP. Mac OS X 10.2 or later is required. Mac OS 9 is not supported.


Engage Interactive has posted DOMIT! 0.52, an open source DOM Level 1 implementation for PHP. This is a bug fix release. DOMIT! is published under the GPL.

Tuesday, February 3, 2004

Release early and release often. Wolfgang Hoschek uncovered a nasty bug in the new Verifier code that prevented it from working in multi-classloader environments such as servlet containers, so I've posted XOM 1.0d24 to fix the problem. So far the fix has only been tested in Tomcat 4, but it seems to work. I also took the opportunity to fix bugs that were causing a couple of test cases to fail on Windows, fixed a bad bug in the FibonacciServlet sample that was failing to add child elements to the root element, and improved the API documentation for several classes. The API is unchanged since yesterday. XOM is still in last call. Discussion takes place on the xom-interest mailing list. You don't have to subscribe to post, though I do moderate non-subscribers for spam control.


Ranchero has released version 1.0.8 of the NetNewsWire and NetNewsWire Lite RSS readers for Mac OS X. Version 1.0.8 adds confirmation for quick-subscribing and fixes some bugs. NetNewsWire is $39.95. The Lite version is free beer.

Monday, February 2, 2004

I am pleased to announce the release of XOM 1.0d23, my open source, dual streaming/tree-based API for processing XML with Java. XOM is quite parsimonious with memory, far more so than most competing tree-based APIs, and it is the most "XML correct" of any of the major tree-based APIs for processing XML. XOM is fanatical about maintaining well-formedness at all times. I also suspect XOM is more code correct than most libraries for processing XML. It has an extensive unit test suite that achieves approximately 90% code coverage. The XOM tests have actually exposed numerous bugs in other libraries including Xerces, Crimson, Oracle, Piccolo, libxml, JDOM, Saxon, and more. (This weekend's 2.6.1 release of Xerces did fix the last two bugs XOM's unit tests were tripping over, so that is the preferred parser for XOM.)

This is the *LAST CALL* development release of XOM 1.0. The next release will be 1.0 alpha 1, at which point I will declare API freeze and rule out gratuitous, backwards incompatible changes in the API until at least 2.0 (at some point in the indefinite future). If there are any method names or signatures that bother you in XOM, now is the time to let me know. I do plan some future releases, 1.0.1, 1.1, 1.2, etc. to add features and improve performance. However, I don't want to change the existing API after alpha 1 without a very good reason.

There are several big changes in this release. Some programs will need minor modifications to compile and run against the new release. Most notably, the various makeNode() methods in the NodeFactory class all return Nodes objects. This means a factory can replace one node type with a different node type (e.g. changing elements into attributes and vice versa) or replace a single node with several nodes.

In addition, I fiddled with the exception hierarchy to try to rationalize it. IllegalDataException and its subclasses have getData and setData methods to get and set the exact text that caused the exception. Subclasses include IllegalNameException, IllegalTargetException, and IllegalCharacterDataException. IllegalCharacterDataException is now used where IllegalDataException was used previously. Furthermore, NamespaceException has been broken up. IllegalNameException is used for problems with a namespace prefix. MalformedURIException is used for problems with a namespace URI. NamespaceConflictException, a subclass of WellformednessException, is used for cases where attributes, elements, and/or additional namespace declarations have conflicting bindings for the same prefix.

The nu.xom.xinclude package now supports the November 2003 Working Draft syntax of XInclude, including the xpointer, accept, accept-charset, and accept-language attributes. Documents will need to be rewritten to use the new syntax.

All legacy JDOM code has been replaced. The XOM code base is now completely independent of JDOM. (There never was much JDOM code in the first place, just parts of the Verifier class; but these parts have now all been replaced with different, faster algorithms based on table lookup.)

Other changes are listed on the web page. XOM is published under the Lesser General Public License (LGPL). Enjoy!


Daniel Veillard has released version 2.6.5 of libxml2, the open source XML C library for Gnome. This release fixes numerous bugs.

Sunday, February 1, 2004

The XML Apache Project has released version 2.6.1 of Xerces-J, the popular open source XML parser for Java that supports SAX and DOM. New features in this release include

  • XML Catalogs
  • Well-formedness checking for LSSerializer
  • Xerces can be built on Mac OS X with Apple JDK 1.4
  • Supports the November 2003 XInclude syntax, except for XPointers

This release also fixes a number of bugs, including a couple that were bedeviling XOM. Most importantly, default attribute values in invalid documents are now correctly reported by SAX. This is probably the last release that will support Java 1.1. Xerces 2.7 will require Java 1.2 or later.


Joshua Baker has released SBXP 1.0.4, an open source, stream-based XML parser written in C. Notable features include the ability to parse HTML as well as XML.

Saturday, January 31, 2004

The W3C XForms working group has posted a note on XForms 1.1 Requirements. Listed requirements include:

  • SOAP Integration
  • Improved Control over Submission
  • Repeat/Insert Enhancements
  • Bind Attribute on Bind Element
  • Email-address Datatype
  • XForms Processor as XML Editor
  • A power Function for exponentiation
  • Referencing Bind Sites in XPath Expressions
  • Improved Search for Instance Data by Key Value
  • Simplify Authoring XForms in XHTML2
  • XForms Model as Distinct Conformance Level

Engage Interactive has posted two open source XML parsers written in PHP. SAXY 0.3 exposes a SAX like interface. DOMIT! 0.51 exposes an API based on the Document Object Model (DOM) Level 1. These releases allow CDATA sections to be preserved. Both are published under the GPL.

Friday, January 30, 2004

Paul DuBois has released xmlformat 1.01, an open source pretty-printer for XML documents written in Perl (or Ruby) that can adjust indentation, line-breaking, and text wrapping on a per-element basis. xmlformat is published under a BSD license.

Thursday, January 29, 2004

David Tolpin has released RNV 1.5.3, an open source Relax NG Compact Syntax validator written in ANSI C. Version 1.5.3 adds support for extension datatype libraries and external system entities. RNV is published under a BSD license.


Version 0.94 of Chiba, an open source, web-based implementation of XForms based on servlets and XSLT, has been released. Chiba enables XForms to be used in current browsers without plugins or special requirements on the client-side. Version 0.94 "adds some detail functionality and some new Connectors and well as a bunch of unit-tests." Chiba is published under the artistic license.

Wednesday, January 28, 2004

Engage Interactive has posted two open source XML parsers written in PHP. SAXY 0.21 exposes a SAX like interface. DOMIT! 0.4 exposes an API based on the Document Object Model (DOM) Level 1. Both are published under the GPL.

Tuesday, January 27, 2004

The W3C DOM Working Group has posted the final recommendation of DOM Level 3 Validation. "This specification defines the Document Object Model Validation Level 3, a platform- and language-neutral interface. This module provides the guidance to programs and scripts to dynamically update the content and the structure of documents while ensuring that the document remains valid, or to ensure that the document becomes valid." There do not appear to have been any substantive changes since December's proposed recommendation.

Monday, January 26, 2004

The W3C HTML Working Group has published the XHTML-Print candidate recommendation. According to the abstract, "XHTML-Print is member of the family of XHTML languages defined by the Modularization of XHTML [XHTMLMOD]. It is designed to be appropriate for printing from mobile devices to low-cost printers that might not have a full-page buffer and that generally print from top-to-bottom and left-to-right with the paper in a portrait orientation. XHTML-Print is also targeted at printing in environments where it is not feasible or desirable to install a printer-specific driver and where some variability in the formatting of the output is acceptable." In essence, this subsets XHTML with the features appropriate for printing.

Sunday, January 25, 2004

The W3C Device Independence Working Group has released the final recommendation of Composite Capability/Preference Profiles (CC/PP): Structure and Vocabularies 1.0. According to the abstract,

This document describes CC/PP (Composite Capabilities/Preference Profiles) structure and vocabularies. A CC/PP profile is a description of device capabilities and user preferences. This is often referred to as a device's delivery context and can be used to guide the adaptation of content presented to that device.

The Resource Description Framework (RDF) is used to create profiles that describe user agent capabilities and preferences. The structure of a profile is discussed. Topics include:

  • structure of client capability and preference descriptions, AND
  • use of RDF classes to distinguish different elements of a profile, so that a schema-aware RDF processor can handle CC/PP profiles embedded in other XML document types.

CC/PP vocabulary is identifiers (URIs) used to refer to specific capabilities and preferences, and covers:
  • the types of values to which CC/PP attributes may refer,
  • an appendix describing how to introduce new vocabularies,
  • an appendix giving an example small client vocabulary covering print and display capabilities, and
  • an appendix providing a survey of existing work from which new vocabularies may be derived.

Saturday, January 24, 2004

The W3C has released version 8.2+ of Amaya, their open source testbed web browser and authoring tool for Solaris, Linux, Windows, and Mac OS X that supports HTML, XHTML, XML, CSS, MathML, SMIL, and SVG. This release fixes bugs and adds "new features for dates, tables, shortcuts and transformations."

Friday, January 23, 2004

John Cowan has updated TagSoup, his Java-language SAX parser for nasty, ugly HTML, to version 0.9. This is a bug fix release. Cowan has also launched a mailing list for discussion and support. Subscribe by sending a blank email to tagsoup-friends-subscribe@yahoogroups.com.

Thursday, January 22, 2004

I've posted the notes from last night's XQuery presentation at the XML Developers Network of the Capital District, where a good time was had by all. I have also posted the third alpha of XQuisitor, my GUI tool for querying XML documents based on XQuery and Saxon. Alpha 3 adds support for printing, fixes some bugs, cleans up the user interface, is better integrated with Mac OS X and Windows, and externalizes most strings to support future localization. The build file can now create a single runnable JAR archive that doesn't require you to put Saxon in the bootclasspath, though for the moment I'm not distributing that because of license conflicts.

XQuisitor has been tested primarily on Linux and a little on Mac OS X 10.2. It should run on any Java 1.4 platform, but I haven't verified that. Java 1.4 and Saxon 7.8 are required. XQuisitor is published under the GPL.

Wednesday, January 21, 2004

The XML Apache Project has released Xalan-C++ 1.7, an open source XSLT processor written in standard C++. Version 1.7 fixes bugs, provides support for message localization, improves the build process, and adds some new EXSLT functions.

Tuesday, January 20, 2004

x-port.net has posted a release candidate of formsPlayer 1.0, an XForms processor that "only works in Microsoft's Internet Explorer version 6 SP 1."


Malcolm Wallace and Colin Runciman have released version 1.10 of HaXml, a bug fix release of the XML processing library for the Haskell language. According to the web page,

HaXml is a collection of utilities for using Haskell and XML together. Its basic facilities include:

  • a parser for XML,
  • a separate error-correcting parser for HTML,
  • an XML validator,
  • pretty-printers for XML and HTML.

For processing XML documents, the following components are provided:

  • Combinators is a combinator library for generic XML document processing, including transformation, editing, and generation.
  • Haskell2Xml is a replacement class for Haskell's Show/Read classes: it allows you to read and write ordinary Haskell data as XML documents. The DrIFT tool (available from http://repetae.net/~john/computer/haskell/DrIFT/) can automatically derive this class for you.
  • DtdToHaskell is a tool for translating any valid XML DTD into equivalent Haskell types.
  • In conjunction with the Xml2Haskell class framework, this allows you to generate, edit, and transform documents as normal typed values in programs, and to read and write them as human-readable XML documents.
  • Finally, Xtract is a grep-like tool for XML documents, loosely based on the XPath and XQL query languages. It can be used either from the command-line, or within your own code as part of the library.

HaXml is distributed under the Artistic License.


The Big Faceless Organization has released the Big Faceless Report Generator 1.1.16, a $1200 payware Java application for converting XML documents to PDF. Unlike most similar tools it appears to be based on HTML and CSS rather than XSL Formatting Objects. This is a bug fix release. Java 1.2 or later is required.

Monday, January 19, 2004

Jonathan Borden and Tim Bray have posted a new draft version of RDDL 2.0. RDDL is the Resource Directory Description Language. It is an XML application based on XHTML Basic that is intended for documents placed at the end of namespace URIs to allow humans and software to retrieve information about the XML application which uses that namespace. The new version uses Tim Bray's non-XLink syntax which looks something like this:

<a rddl:nature="http://www.w3.org/1999/xhtml"
		rddl:purpose="..."
		href="foo.html">Example</a>

To my mind this is a significant step backwards from the XLink-based syntax of RDDL 1.0. Requiring all related resources to be identified by a elements is very restrictive. I'm not sure why people think a new version of RDDL is needed, but if one is, this isn't the right one.


David Tolpin has released incelim, an open source XSLT stylesheet that splices RELAX NG schemas together. It reads "a Relax NG grammar in XML syntax, expands all includes and externalRefs, and optionally replaces references to text, empty, or notAllowed with the patterns. The result is a 'compiled' schema convenient for distribution, as well as for consumption by tools which do not yet support include and externalRef." incelim is published under a BSD license.

Saturday, January 17, 2004

JAPISoft has released EditiX 1.0, a $39 payware XML editor written in Java. Features include XPath location and syntax error detection, context sensitive popups based on DTD, W3C XML Schema Language, and RelaxNG schemas, and XSLT and XSL-FO previews.

Friday, January 16, 2004

Mozilla 1.6 has been released. Mozilla is an open source web browser that supports XML, CSS, XSLT, XUL, HTML, XHTML, MathML, SVG, and lots of other crunchy XML goodness. It's available for Linux, Mac OS X, and Windows. Version 1.6 adds support for NTLM authentication on all platforms, automatic page translation feature via Google Language Tools, ChatZilla 0.9.48, about:about, several improvements to the mail client user interface, vCard support, Ask Jeeves searching, reload in View Source, and of course various bugs are fixed.


Valéry Febvre has posted PyXMLSec 0.1.0, a set of Python bindings for the XML Security Library. This is published under the GPL.


RenderX has released version 3.7.1 of XEP, its payware XSL Formatting Objects to PDF and PostScript converter. Version 3.7.1 adds a new non-static Java API, supports repeatable table headers, recognizes all values of the white-space-treatment property, applies CSS1 styles in SVG, improves processing of Type 1 fonts with subsetting enabled, and makes all glyphs accessible regardless of the encoding. The basic client is $299.95. The developer edition with an API is $999.95. The server version is $4999.95. Updates from 3.0 are free.

Thursday, January 15, 2004

Asami Tomoharu has released Relaxer 1.0, a schema compiler for RELAX NG that generates Java source code, DTDs, XSLT stylesheets, HTML documents with FORM tags, and more from RELAX NG schemas.


Bocaloca Software has released XMLBuddy Pro 2.0.2, a $35 Eclipse XML editor plug-in. Features include

  • Outliner
  • Bookmarks
  • DTD, W3 XML Schema, RELAX NG (XML and compact) support
  • Translate between DTDs and Schemas
  • Generate DTD or schema from XML instances
  • Code completion for DTD, XSD, RNG or RNC editing
  • RELAX NG compact syntax editor with configurable coloring
  • Open definition in DTD or schema
  • Edit and run XSL 1.0 or 2.0 (experimental) transforms
  • Auto-validation while you edit
  • Line numbering

Version 3.0 of the payware <Oxygen/> XML editor has been released. Oxygen supports XML, XSL, DTDs, and the W3C XML Schema Language. New features in version 3.0 include:

  • Outliner
  • Bookmarks
  • Experimental XInclude support.
  • Access to FTP/WebDAV from transformation dialog
  • WebDAV over HTTPS
  • Schema conversion
  • Better validation error reporting
  • NRL (Namespace Routing Language) support
  • Japanese localization
  • Preserve spaces and strip spaces elements list
  • More format and indent options
  • Pretty print for Relax NG Compact (RNC) schemas
  • Code completion for RNC editing
  • Code completion for DTD editing
  • Support for more encodings
  • Automatically learn the document structure when no schema or DTD is specified

Oxygen requires Java 1.3 or later. It costs $74.

Wednesday, January 14, 2004

One week from tonight, on Wednesday, January 21, 2004, starting at 6:30 P.M. I will be speaking to the XML Developers Network of the Capital District in Albany, New York. The topic is "XQuery: Exquisite or Excruciating?". I'll be exploring XQuery 1.0, talking about both the syntax and semantics of this powerful new language. There'll also be a personal demo of XQuisitor, a new open source GUI tool for querying XML based on Michael Kay's Saxon 7. The meeting takes place at the HANYS offices in Rensselaer. I'm told the building will be locked; but if you come to the front door (the side facing I-90), someone will let you in.

Tuesday, January 13, 2004

New results seem to have stopped coming in, so I think it's time to wrap up the results from the initial HTTP digest authentication tests. Bottom line: every significant browser except Netscape 4.x and Lynx supports HTTP digest authentication as implemented by Apache 2.0. Specifically, Internet Explorer 5.0 and later (4.0 and later on the Mac) supports HTTP digest authentication, reports to the contrary notwithstanding, at least for simple cases.

It is still possible that some more complex URLs using query strings, percent escapes, and/or fragment identifiers have interoperability issues. It is also possible that there are interoperability issues with POST even if there are none with GET. I need to cook up some tests for these possibilities. However, at a minimum, we have discovered that HTTP digest authentication is much more prevalent in the installed base than was previously thought. It should at least be offered as a preferred option to BASIC authentication for those browsers that do support it. In 2004, I can live with locking out Netscape 4.x users if that's what it takes to get increased security, but I'm not willing to turn off BASIC authentication until Lynx supports digest authentication. Adding digest authentication to Lynx should be a high priority item. Does anyone know the Lynx developers?


David Tolpin has released RNV 1.4, an open source Relax NG Compact Syntax validator written in ANSI C. Version 1.4 adds RVP, a pipe that receives validation primitives on one end and emits validation diagnostics from the other. RNV is published under a BSD license.


Ernst de Haan has posted xmlenc 0.39, an open source library for streaming XML output. It's marginally more convenient than System.out.println(). However, it does not guarantee well-formedness of the output, which to my mind is a sine qua non for any XML output library.
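
To make the well-formedness complaint concrete, here is a rough sketch, in Python rather than Java, of the kind of checking I would expect from a streaming output library. The StrictXMLWriter class is purely hypothetical; it is not xmlenc's API or anyone else's. It only tracks element nesting and escapes text and attribute values. It does not verify that names are legal XML names or that characters are legal XML characters, so it is still some way short of a full guarantee.

```python
from xml.sax.saxutils import escape, quoteattr


class StrictXMLWriter:
    """Hypothetical streaming writer that refuses badly nested output."""

    def __init__(self):
        self._parts = []        # output fragments, in order
        self._open = []         # stack of currently open element names
        self._root_closed = False

    def start(self, name, **attrs):
        if self._root_closed:
            raise ValueError("document already has a root element")
        attr_text = "".join(
            f" {key}={quoteattr(value)}" for key, value in attrs.items()
        )
        self._parts.append(f"<{name}{attr_text}>")
        self._open.append(name)

    def text(self, data):
        if not self._open:
            raise ValueError("text outside the root element")
        self._parts.append(escape(data))  # escapes &, <, and >

    def end(self):
        if not self._open:
            raise ValueError("no open element to close")
        # The writer, not the caller, supplies the end-tag name,
        # so start-tags and end-tags cannot be mismatched.
        self._parts.append(f"</{self._open.pop()}>")
        if not self._open:
            self._root_closed = True

    def result(self):
        if self._open or not self._root_closed:
            raise ValueError("unclosed or missing root element")
        return "".join(self._parts)


writer = StrictXMLWriter()
writer.start("a", href="x&y")
writer.text("1 < 2")
writer.end()
print(writer.result())
```

The design point is that mistakes such as a forgotten end-tag or an unescaped ampersand become exceptions at write time instead of malformed documents discovered later by whoever has to parse them.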

Saturday, January 10, 2004

As part of an ongoing discussion on xml-dev, I'm running a test of browser support for HTTP digest authentication. So far, these are the results I've collected:

  • Internet Explorer 6.0 Windows 2000: Works
  • Internet Explorer 6.0, service pack 2, Windows XP.
  • Internet Explorer 6.0.2800.1106, Windows XP SP2: Works
  • Internet Explorer Version 5.50.4807.2300 on Windows 2000 Pro 5.00.2195 SP4: Works
  • Internet Explorer 5.00.3700, Windows 2000: Works
  • Internet Explorer 5.0 on Windows 98SE, NT4, and 2000: Works
  • Internet Explorer 5.5 on Windows 98SE, NT4, and 2000: Works
  • Internet Explorer 5.1.6 Mac OS 9: Works
  • Internet Explorer 4.0.1 Mac OS 9: Works
  • Internet Explorer 5.2 Mac OS X: Works
  • Internet Explorer 4 on Windows XP Pro: Works (*)
  • Internet Explorer 5.00 on MS Windows 2000 SP4 Workstation: Works
  • Internet Explorer 5.01 on Windows XP Pro: Works (*)
  • Internet Explorer 5.50.4807 SP2 on Windows2000: Works
  • Internet Explorer 5.5 on Windows XP Pro: Works (*)
  • Mozilla 1.6a on Windows XP Pro: Works
  • Mozilla 1.5, Windows XP: Works
  • Mozilla 1.5, Linux: Works
  • Mozilla 1.4.1 Linux: Works
  • Mozilla 1.4.1 Windows XP: Works
  • Mozilla 1.4 Linux: Works
  • Mozilla 1.4 on Windows 2000 Pro 5.00.2195 SP4: Works
  • Mozilla 1.3 Mac OS 9: Works
  • Mozilla 1.2.1 Mac OS 9: Works
  • Mozilla 1.1 Mac OS 9: Works
  • Mozilla 1.0.2, Linux: Works
  • Mozilla 1.0, Windows 98: Works
  • Firebird 0.7 on Windows XP pro: Works
  • Firebird 0.7, Windows SP4: Works
  • Firebird 0.7, Mac OS X 10.3: Works
  • Firebird 0.7 Linux i686 (SuSe 8.2)
  • Firebird 0.6 (Platform?): Works
  • Safari 1.1.1 Mac OS X 10.3: Works
  • Safari 1.0 Mac OS X 10.2: Works
  • Netscape 7.01 on Win2000: Works
  • Netscape Communicator 6.2 Mac OS 9: Fails
  • Netscape Communicator 4.8 on Windows XP Pro: Fails
  • Netscape 4.79 on Windows XP Pro: Fails
  • Netscape 4.78 on Windows 98SE: Fails
  • Netscape 4.7 on Linux: Fails
  • Netscape Communicator 4.75 Mac OS 9: Fails
  • Opera 7.23 on Windows XP Pro: Works
  • Opera 7.11 Windows: Works
  • Opera 7.1, Windows XP home edition: Works
  • Opera 7.01 on Windows XP pro: Works
  • Opera 6.01 (German) on Windows XP Pro: Works
  • Opera 6.0 Windows, build 1010, running on NT 4, SP6: Works
  • Opera 5.0.485 Mac OS 9: Works
  • Opera 5.0 on Windows XP Pro: Works
  • Opera 3.62 on Windows XP Pro: Fails
  • Omniweb 4.5 on OS X 10.3: Works
  • Mozilla Firebird 0.7 for Windows: Works
  • Lynx 2.8.5dev.7 character mode browser for Linux: Fails
  • Lynx 2.8.3, Mac OS X: Fails
  • Lynx 2.8.3, Windows XP Pro: Fails
  • w3m version w3m/0.4.1-m17n-20030308 character mode browser for Linux: Fails
  • links (aka Elinks) 0.4.2 character mode browser for Linux: Fails
  • Amaya 6.0 on Windows 2000 Pro 5.00.2195 SP4: Works
  • Konqueror 3.14 on Linux 2.4.22 (Gentoo): Works
  • Konqueror 1.3-13 on Red Hat 9.0: Works

* This one needs confirmation. The correspondent has "an unconventional multi-version IE installation that makes exact version identification difficult and *may* (or may not) mean they're sharing some DLL's". I do now have one report of failure with IE 4.0 on Windows.

I'd like to get more results, particularly Opera versions prior to 5.0, Konqueror, Amaya, and Internet Explorer on Windows prior to 6.0. If you have these or any other browser not yet listed here, could you please go to http://elharo.com/staging/ and attempt to log in with the user name "invited" and the password "test". Whether you succeed or fail, please e-mail me the results, along with the version, vendor, and platform of your browser. I don't really need any more results for Netscape 4.x and earlier. I'm pretty convinced those all fail (unless you find a version that works. That would be interesting.) Similarly I'm fairly sure Mozilla 1.0 and later all work. (Again, if you have evidence otherwise, please let me know.) Thanks.

Friday, January 9, 2004

jCatalog Software has released XSLfast 1.3, an €899 graphical editor for XSL Formatting Objects documents that supports mail merge and form processing. New features in version 1.3 include:

  • Zoom in / Zoom out
  • Measurements can be either points or millimeters
  • Index creation
  • Wizards for document creation
  • XPath integration with layouts

Thursday, January 8, 2004

Romeo Anghelache has posted Hermes 0.8.2, an open source LaTeX to MathML authoring and editing tool that handles both MathML presentation and content markup. Hermes is published under the GPL.

Wednesday, January 7, 2004

RenderX have recently released several XSLT stylesheets that produce SVG representations of several barcode formats. The stylesheets are heavily parameterized to control bar width and height, wide-to-narrow ratio, quiet zone size, font properties for the human readable label, etc.


Syntext has released Serna 1.2, a $299 payware XSL-based WYSIWYG XML Document Editor for Windows and Linux. Features include on-the-fly XSL-driven XML rendering and transformation, on-the-fly XML Schema validation, and spell checking. New features in this release include support for XHTML and GCA proceedings.

Tuesday, January 6, 2004

The Big Faceless Organization has released the Big Faceless Report Generator 1.1.14, a $1200 payware Java application for converting XML documents to PDF. Unlike most similar tools it appears to be based on HTML and CSS rather than XSL Formatting Objects. This is mostly a bug fix release. Java 1.2 or later is required.

Monday, January 5, 2004

Jez Higgins has released Arabica, an open source XML parser toolkit that provides SAX2 and DOM implementations for Standard C++ and sits on top of an existing parser such as libxml2 or expat. Arabica can work with UTF-8 encoded std::strings, UCS-2 std::wstrings, or custom string types. Arabica is published under a BSD license.

Sunday, January 4, 2004

David Tolpin has released RNV 1.3, an open source Relax NG Compact Syntax validator written in ANSI C. Version 1.3 improves performance and adds a VIM plug-in. RNV is published under a BSD license.


Eric S. Raymond has released version 1.6 of doclifter, an open source tool that transcodes {n,t,g}roff documentation to DocBook. He claims the "result is usable without further hand-hacking about 95% of the time." This release has simpler entity resolution logic, and removes the -s and -x options. Doclifter is written in Python, and requires Python 2.2a1. doclifter is published under the GPL.

Saturday, January 3, 2004

I thought I'd catch up on a slew of DocBook news today. DocBook is an XML application designed for technical documentation and books such as Processing XML with Java. The first item of note is the posting of the second candidate release of DocBook 4.3.


Norm Walsh has posted the first beta of Simplified DocBook 1.1, "a small subset of the DocBook XML DTD." This is based on full DocBook 4.3CR2 and adds support for HTML tables. All other markup should be the same as in simplified DocBook 1.0.


Walsh has also posted version 3.3.0 of DocBook Slides, a system for replacing PowerPoint with XML. This release "contains a lot of patches and bug fixes." It is based on Simplified DocBook 1.1b1.


DocBook: The Definitive Guide has been updated to version 2.0.9 to cover DocBook 4.3CR2. According to Walsh, this is "a 'work in progress'. It purports to document DocBook V4.3 with the EBNF, HTML Forms, MathML, and SVG modules. As it is being actively updated, it may be inconsistent in some areas."


Walsh has also released version 1.6.4.1 of his XSLT stylesheets for DocBook. 1.6.4 "includes many bugfixes (including an experimental fix for correctly generating links when the dbhtml 'dir' PI is used), some performance improvements, and some new features, including a new option for controlling which sections are included in running headers or footers, better control over superscript/subscript properties, and support for the newly add 'code' and 'stepalternatives' markup."


Michael Smith has written a DocBook Menu Mode for Emacs that adds a hierarchical, customizable reference menu that provides "quick access to a variety of DocBook-related documentation directly from within Emacs, and to files in the DocBook XSLT stylesheets distribution." Menu items include

  • DocBook: The Definitive Guide
  • DocBook: The Definitive Guide (HTML Help)
  • DocBook: Element Reference (Alphabetical)
  • DocBook: Element Reference (Logical)
  • DocBook XSL: The Complete Guide
  • DocBook XSL: Parameter Reference - FO
  • DocBook XSL: Parameter Reference - HTML
  • DocBook XSL: Stylesheet Distribution
  • DocBook XSL: Stylesheet Documentation
  • DocBook FAQ
  • DocBook Wiki
  • DocBook Mailing List Search Form
  • Customize DocBook Menu

Friday, January 2, 2004

The W3C Voice Browser Working Group has posted the Candidate Recommendation of the Speech Synthesis Markup Language Version 1.0. According to the abstract, the Speech Synthesis Markup Language "is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms."


The Voice Browser Working Group has also published the proposed recommendation of the Speech Recognition Grammar Specification Version 1.0. "This document defines syntax for representing grammars for use in speech recognition so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF Form and an XML Form. The specification makes the two representations mappable to allow automatic transformations between the two forms."

Thursday, January 1, 2004

Happy New Year. This is normally the point where I make a lot of promises about cool new features for this site that don't come to pass. However, for 2004 my plans are focused on other areas. My goals for this year are three third editions, at least one peer-reviewed paper, and 1.0 releases of XOM and XQuisitor.


The XML Apache Project has posted the third beta of Xindice 1.1, an open source native XML database published under the Apache Software License. Xindice supports XPath for queries, XML:DB XUpdate for XML updates, and the XML:DB XML database API for Java, as well as an XML-RPC interface. Changes since 1.0 are mostly minor and include bug fixes and Java 1.4 support.




Copyright 2004 Elliotte Rusty Harold
elharo@ibiblio.org
Last Modified January 20, 2005