Item 13: Remember Mixed Content

XML was designed for narrative documents meant to be read by humans: books, novels, plays, poems, technical manuals, and most especially web pages. Its use for record-oriented data was a happy accident. Narrative documents have a number of characteristics that are not often true of more record-like data. The most significant is mixed content. For example, consider this simple paragraph taken from the second edition of the XML specification:

<p diff="add">This second edition is <emph>not</emph> a new 
version of XML (first published 10 February 1998); it merely 
incorporates the changes dictated by the first-edition errata 
(available at <loc 
href="http://www.w3.org/XML/xml-19980210-errata">http://www.w3.org/XML/xml-19980210-errata</loc>) as a 
convenience to readers. The errata list for this second edition 
is available at <loc 
href="http://www.w3.org/XML/xml-V10-2e-errata">http://www.w3.org/XML/xml-V10-2e-errata</loc>.</p>

This p element has seven child nodes, in the following order:

  1. A text node starting "This second edition is"
  2. The emph element
  3. A text node starting " a new version of XML"
  4. A loc element
  5. A text node starting ") as a convenience"
  6. A loc element
  7. A text node containing a single period.

The text is on the same level of the tree as the child elements. It is a crucial part of the meaning of the paragraph. It cannot be ignored.
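Walking this tree with a standard DOM parser makes the structure concrete. The sketch below is illustrative, not from any particular toolkit: the class name is invented, and the paragraph is shortened to keep the example small. Because it collects each child as a generic Node rather than an Element, the interleaved text nodes survive alongside the emph element:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildWalker {

    // Returns one label per child node, in document order:
    // "TEXT: ..." for text nodes, "ELEMENT: name" for elements.
    public static List<String> describeChildren(String xml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        List<String> labels = new ArrayList<>();
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            switch (kid.getNodeType()) {
                case Node.TEXT_NODE:
                    labels.add("TEXT: " + kid.getNodeValue());
                    break;
                case Node.ELEMENT_NODE:
                    labels.add("ELEMENT: " + kid.getNodeName());
                    break;
                default:
                    labels.add("OTHER: " + kid.getNodeName());
            }
        }
        return labels;
    }

    public static void main(String[] args) throws Exception {
        // A shortened version of the spec paragraph, for illustration.
        String p = "<p>This second edition is <emph>not</emph>"
                 + " a new version of XML.</p>";
        for (String label : describeChildren(p)) {
            System.out.println(label);
        }
    }
}
```

The output interleaves text and elements in document order: a text node, the emph element, and another text node. Drop the text nodes and the paragraph's meaning inverts.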

Nonetheless, numerous tools and APIs blithely assume mixed content simply doesn't exist. They also often assume that order doesn't matter, and that documents are not recursive. These assumptions are true of relational tables. While these assumptions may also be accurate about XML formats that are little more than database dumps (for instance, RSS 0.9.x), they are definitely not true of most real-world XML. They aren't even true of as many specific applications as their inventors often think. For example, RSS was originally designed to provide simple record-like news items like this one:

  <item>
    <title>Xerlin 1.3 released</title>
    <description>
      Xerlin 1.3, an open source XML Editor written in Java, has been released.
      Users can extend the application via custom editor interfaces for specific
      DTDs. New features in version 1.3 include XML Schema support, WebDAV
      capabilities, and various user interface enhancements.
      Java 1.2 or later is required.
    </description>
  </item>

However, it rapidly became apparent that this wasn't enough to meet the needs of most web sites. In particular, site authors often wanted to put mixed content in the description like so:

    <a href=""><strong>Xerlin 
    an open source XML Editor written in 
    <a href="">Java</a>, has been released.
    Users can extend the application via custom editor 
    interfaces for specific 
    DTDs. New features in version 1.3 
       <li>XML Schema support</li>
       <li>WebDAV capabilities</li>
       <li>Various user interface enhancements</li>
    Java 1.2 or later is required.

However, since RSS doesn't allow this, authors and vendors instead converged on the truly awful solution of escaping the markup for eventual display as HTML. For example,

    <description>
    &lt;a href="">&lt;strong>Xerlin 
    1.3&lt;/strong>&lt;/a>, an open source XML Editor written in 
    &lt;a href="">Java&lt;/a>, has been 
    released.
    Users can extend the application via custom editor
    interfaces for specific 
    DTDs. New features in version 1.3 include:
       &lt;ul>
       &lt;li>XML Schema support&lt;/li>
       &lt;li>WebDAV capabilities&lt;/li>
       &lt;li>Various user interface enhancements&lt;/li>
       &lt;/ul>
    Java 1.2 or later is required.
    </description>

This ugliness isn't created just so mixed content can be avoided. It also avoids the use of namespaces (Item 20) and modularization (Item 8). But fear of mixed content is certainly a major contributing factor. What's really telling in this example is that the community promptly hacked their own uglier version of mixed content back into RSS, even though the original developers had tried to avoid it. Mixed content is not a mistake. It is not something to be feared. It is at the core of much of the information XML is designed to mark up.
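The cost of this escaping hack shows up as soon as a consumer parses the feed. The sketch below is a stripped-down, hypothetical item (invented class name, abbreviated content): to the XML parser, everything inside the description collapses into a single text node whose value merely happens to contain angle brackets. To recover any structure, an aggregator has to run a second, HTML parse over that string:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EscapedDescription {

    // Returns the string an RSS reader actually receives for the
    // description: escaped markup decoded into literal characters,
    // with no element structure at all.
    public static String descriptionText(String item) throws Exception {
        Element desc = (Element) DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(item)))
                .getElementsByTagName("description")
                .item(0);
        // All the escaped "markup" is just character data to the parser.
        return desc.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String item = "<item><description>"
                    + "&lt;strong>Xerlin 1.3&lt;/strong> has been released."
                    + "</description></item>";
        // Prints literal <strong> tags as text, not as elements;
        // a second parse is needed to turn them back into structure.
        System.out.println(descriptionText(item));
    }
}
```

With real mixed content, the strong element would simply be a child node of the description, and no second parsing pass would be needed.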

Tools that fail to handle mixed content properly range from simple programs such as XML pretty printers to complete data binding APIs. One particularly perverse API I encountered read mixed content, but reordered it so all the plain text nodes came after all the child elements. Many other tools came into existence without support for mixed content, and had to undergo complicated and expensive retrofitting when the need to support it became obvious.

Another common problem is software that claims to be able to handle mixed content, but was never extensively tested with narrative documents. I've brought more than one XML editor to its knees by loading in a book written in DocBook. Too often programmers introduce bugs into their code based on mistaken notions of what XML documents can look like. For example, a programmer who forgets about mixed content may try to store the children of an element as a list of Element objects, rather than a more generic list of Object or Node objects. True XML software needs to be prepared to handle all the many forms XML can take, including both narrative and record-oriented documents.
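The Element-only mistake is easy to demonstrate. In the hypothetical sketch below (the class and method names are my own), filtering an element's children down to a list of Element objects, as a record-oriented binding might, silently discards every text node in mixed content:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LossyChildren {

    // The mistaken model: keep only Element children. Every text
    // node sitting between the elements is silently dropped.
    public static List<Element> elementChildrenOnly(Node parent) {
        List<Element> kept = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) {
                kept.add((Element) kids.item(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p>This second edition is <emph>not</emph>"
                   + " a new version of XML.</p>";
        Node p = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        // Only the emph element survives; the surrounding text,
        // and with it the sentence's meaning, is gone.
        System.out.println(elementChildrenOnly(p).size() + " of "
                + p.getChildNodes().getLength() + " children kept");
    }
}
```

A List<Node> (or, in DOM, simply iterating the NodeList) is the generic structure the paragraph above calls for.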

The underlying cause of these problems is that the designers started with the question "How do I convert an object into an XML document?" rather than the much tougher question "How do I convert an XML document into an object?" A variant starts with the question "How do I convert a relational table to an XML document?", but the underlying problem is the same. This is a toothpaste problem: It's a lot easier to squirt XML out of an object than to push it back in. Most of these tools claim to be able to read XML documents into Java or C++, but they fail very quickly as soon as you start throwing real-world documents at them. Generally speaking, the developers designing these tools are laboring under the faulty assumptions noted earlier: that mixed content doesn't exist, that order doesn't matter, and that documents are not recursive.

The same issues arise when developers try to store XML data in relational tables. XML documents are not tables. You can force them in, in a variety of very ugly ways, but this is simply not the task a relational database is designed for. You'll be happier with a database and API designed for XML from the start that doesn't try to pretend XML is simpler than it really is.

The fact is, XML documents considered in their full generality are extremely complicated. They are not tables. They are not objects. Any reasonable model for them has to take this complexity into account. Their structures very rarely match the much more restrictive domains of tables and objects. You can certainly design mappings from XML to classes, but unless you're working in a very limited domain, it's questionable whether you can invent anything much simpler than JDOM. And if you are working in a restricted domain, all you really need is a standard way of serializing and deserializing instances of particular classes to and from a particular XML format. This can be almost hidden from the client programmer. Be wary of tools that implicitly subset XML, and handle only some kinds of XML documents. Robust, reliable XML processing needs to use tools that are ready to handle all of XML, including mixed content.