Chapter 1. XML for Data

Table of Contents

Motivating XML
A Thought Experiment
Robustness
Extensibility
Ease of Use
XML Syntax
XML Documents
XML Applications
Elements and Tags
Text
Attributes
XML Declaration
Comments
Processing Instructions
Entities
Namespaces
Validity
DTDs
Schemas
Schematron
The Last Mile
Style sheets
CSS
Associating Style Sheets with XML Documents
XSL
Summary

XML was designed to be “SGML for the Web”. It was meant for the same sorts of narrative documents SGML and HTML had been used for previously: articles, books, short stories, poems, technical manuals, web pages, and so forth. Much to its inventors’ surprise, it achieved its first great successes not in the publishing and writing arenas it was intended for, but rather in the much more prosaic world of data formats. XML was enthusiastically adopted by programmers who needed a robust, extensible, standard format for data. For the most part, this was not narrative data like stories and articles, but record-oriented data such as that found in databases. Uses included object serialization, financial records, vector graphics, remote procedure calls, and similar tasks. This chapter explores some of the flaws in traditional formats for such data and elucidates the features of XML that make it surprisingly well suited to such tasks.

Motivating XML

If you’re reading this book, you’re a developer. (At least I hope you are. Otherwise a lot of what I say isn’t going to make any sense. :-) ) Over the course of your career you’ve doubtless written numerous programs that read and write files. And every time you wrote a new program you had to invent or learn a new file format. File formats I’ve personally had to deal with over the years include RTF, Word .doc files, tab-delimited text, FITS, PDF, PostScript, and many more. You’ve probably encountered a few of these yourself, and many other formats besides.

If you’re like me you’ve learned to dread encountering a new file format. If it’s documented at all, the documentation is likely incomplete or worse yet misleading. Important details like byte order and line ending conventions are often left unspecified. Different tools that all claim to read and write the same format actually produce subtly different variants that are often incompatible in practice. When you think you’ve finally wrestled the last bug out of your code, you discover a file written by somebody else’s software that you can’t read; and you realize you’ve made one too many assumptions about the format, so you have to go back to the drawing board.

Consequently, when designing new file formats, developers have tended to gravitate toward the simplest formats they can imagine, often tab-delimited text or comma-separated values. Nonetheless, even these plain, undecorated formats often present unexpected problems. For example, should two tabs in a row be interpreted as the empty string, as null, or as the same as one tab? In fact, all three variations are used in practice. Java’s StringTokenizer class takes the last interpretation (two consecutive tabs are the same as one tab) even though this is the least common approach in actual data files. This has surprised many Java programmers and led to not a few bugs in Java programs.[1]
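The difference is easy to see in a few lines of Java. The following is only a minimal sketch; the class name TabFields and the sample record are invented for illustration. It splits the same tab-delimited line first with java.util.StringTokenizer and then with String.split using a negative limit, which keeps empty fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

final class TabFields {

    // StringTokenizer treats a run of delimiters as a single separator,
    // so the empty field between two consecutive tabs disappears.
    static List<String> withTokenizer(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, "\t");
        while (tokenizer.hasMoreTokens()) {
            fields.add(tokenizer.nextToken());
        }
        return fields;
    }

    // String.split with a negative limit keeps every field, including empty ones.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        String line = "Chez Fred\t\tNarragansett"; // hypothetical record; the middle field is empty
        System.out.println(withTokenizer(line).size()); // prints 2
        System.out.println(withSplit(line).size());     // prints 3
    }
}
```

Neither answer is wrong in itself. The trouble is that the format never said which one was intended, so two programs reading the same file can disagree about how many fields it has.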

A Thought Experiment

With all that in mind, let’s do a thought experiment. Imagine you’ve been tasked with writing a server side program that accepts orders over the Internet for an e-commerce site. The web server must send each completed order to the internal system, one order at a time. You’re responsible for writing the code on the server that sends the order to the internal system and for writing the code on the internal system that receives and processes the order. The only connection between the two systems is a TCP/IP network; that is, you don’t have some sort of higher level API like JDBC that lets you move data between the two systems. You need to invent a data format you can generate on one end and parse on the other end that’s flexible enough to contain all the information in a typical order. This includes the customer name, the product ordered, its price, the manufacturer’s stock keeping unit (SKU) number, the address to ship to, the tax, and the shipping and handling charges. One possibility is to place each piece of information on a separate line as shown in Example 1.1:

Example 1.1. A plain text document indicating an order for 12 Birdsong Clocks, SKU 244

c32
Chez Fred
Birdsong Clock
244
12
USD
21.95
135 Airline Highway
Narragansett
RI
02882
USD
263.40
7.0
USD
18.44
USPS
USD
8.95
USD
290.79

An alternative is to use a more complex and verbose XML format such as Example 1.2:

Example 1.2. An XML document indicating an order for 12 Birdsong Clocks, SKU 244

<?xml version="1.0" encoding="ISO-8859-1"?>
<Order>
  <Customer id="c32">Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <SKU>244</SKU>
    <Quantity>12</Quantity>
    <Price currency="USD">21.95</Price >
  </Product>
  <ShipTo>
    <Street>135 Airline Highway</Street >
    <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
  </ShipTo>
  <Subtotal currency='USD'>263.40</Subtotal>
  <Tax rate="7.0" 
       currency='USD'>18.44</Tax>
  <Shipping  method="USPS" currency='USD'>8.95</Shipping>
  <Total currency='USD' >290.79</Total>
</Order>

Would you rather write the code to send and receive orders that are formatted as nice, simple linefeed-delimited files as shown in Example 1.1, or as complex, marked-up XML documents such as Example 1.2? Both documents contain the same information. Most uninitiated developers prefer the first, simpler form. After all, each piece of information is presented on a line by itself with no extraneous markup characters getting in the way. It’s my goal to convince you that, contrary to most developers’ first intuition, the second form is more robust, more extensible, and much easier to work with.

Robustness

Let’s consider robustness first. Suppose your program receives the order in Example 1.3:

Example 1.3. A document indicating an order for 12 Birdsong Clocks, SKU 244?

c32
Chez Fred
Birdsong Clock
12
244
USD
21.95
135 Airline Highway
Narragansett
RI
02882
USD
263.40
7.0
USD
18.44
USPS
USD
290.79
USD
8.95

Looks the same as Example 1.1, doesn’t it? However, if you compare it very carefully with Example 1.1 you may notice that the 12 and the 244 have changed places. What used to be an order for 12 bird clocks may now be an order for 244 whoopee cushions. Maybe somebody will notice the problem before the order is shipped and maybe they won’t. Worse yet, the shipping charge and the total price got flipped around. This entire order now costs eight dollars and ninety-five cents. Again, maybe someone will notice the problem before it’s too late and maybe not. These sorts of problems aren’t theoretical. More than one e-commerce site has lost both revenue and customer goodwill by mispricing items.
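To see why the receiving program cannot catch this, consider a sketch of the most obvious reader for the line-based format. The class name PositionalOrder and the zero-based line positions are my own assumptions, read off the layout of Example 1.1; nothing in the format itself defines them.

```java
final class PositionalOrder {

    // Zero-based line positions, assumed from the layout of Example 1.1.
    static final int SKU = 3;
    static final int QUANTITY = 4;
    static final int SHIPPING = 18;
    static final int TOTAL = 20;

    static final String EXAMPLE_1_1 = String.join("\n",
        "c32", "Chez Fred", "Birdsong Clock", "244", "12", "USD", "21.95",
        "135 Airline Highway", "Narragansett", "RI", "02882",
        "USD", "263.40", "7.0", "USD", "18.44", "USPS",
        "USD", "8.95", "USD", "290.79");

    static final String EXAMPLE_1_3 = String.join("\n",
        "c32", "Chez Fred", "Birdsong Clock", "12", "244", "USD", "21.95",
        "135 Airline Highway", "Narragansett", "RI", "02882",
        "USD", "263.40", "7.0", "USD", "18.44", "USPS",
        "USD", "290.79", "USD", "8.95");

    // Reads an order purely by position. Every line is just a string,
    // so nothing here can tell that two lines have traded places.
    static String summarize(String order) {
        String[] lines = order.split("\n");
        return "sku=" + lines[SKU] + " quantity=" + lines[QUANTITY]
             + " shipping=" + lines[SHIPPING] + " total=" + lines[TOTAL];
    }

    public static void main(String[] args) {
        System.out.println(summarize(EXAMPLE_1_1)); // sku=244 quantity=12 shipping=8.95 total=290.79
        System.out.println(summarize(EXAMPLE_1_3)); // sku=12 quantity=244 shipping=290.79 total=8.95
    }
}
```

Both calls succeed without complaint. The second one simply returns the wrong order.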

In the XML version, this simply would not be an issue because each datum is marked up with what it means. You can freely reorder the quantity and the SKU or the shipping cost and the total price without any confusion about which is which. Example 1.4 demonstrates. What can be devastating mistakes in a traditional system are harmless in XML.

Example 1.4. Still an order for 12 Birdsong Clocks, SKU 244

<?xml version="1.0" encoding="ISO-8859-1"?>
<Order>
  <Customer id="c32">Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <Quantity>12</Quantity>
    <SKU>244</SKU>
    <Price currency="USD">21.95</Price >
  </Product>
  <ShipTo>
    <Street>135 Airline Highway</Street >
    <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
  </ShipTo>
  <Subtotal currency='USD'>263.40</Subtotal>
  <Tax rate="7.0" 
       currency='USD'>18.44</Tax>
  <Total currency='USD' >290.79</Total>
  <Shipping  method="USPS" currency='USD'>8.95</Shipping>
</Order>
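A short sketch shows what this means for the receiving code. It uses the standard JAXP DOM parser to look elements up by name, so the order of siblings does not matter. The class name OrderByName is invented, the two documents are trimmed-down fragments of Examples 1.2 and 1.4, and the helper assumes the requested element is present.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class OrderByName {

    // Trimmed fragment in the element order of Example 1.2.
    static final String SKU_FIRST =
        "<Order><Product><SKU>244</SKU><Quantity>12</Quantity></Product>"
      + "<Shipping currency='USD'>8.95</Shipping>"
      + "<Total currency='USD'>290.79</Total></Order>";

    // The same data in the element order of Example 1.4.
    static final String QUANTITY_FIRST =
        "<Order><Product><Quantity>12</Quantity><SKU>244</SKU></Product>"
      + "<Total currency='USD'>290.79</Total>"
      + "<Shipping currency='USD'>8.95</Shipping></Order>";

    // Returns the text of the first element with the given name.
    // Assumes such an element exists; a real program would check.
    static String text(String xml, String elementName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(elementName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the order", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text(SKU_FIRST, "Quantity"));      // prints 12
        System.out.println(text(QUANTITY_FIRST, "Quantity")); // prints 12
        System.out.println(text(QUANTITY_FIRST, "Total"));    // prints 290.79
    }
}
```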

Some readers will be objecting at this point that you would never let a mistake like that through your system. After all you check every value for sensibility. You look up the SKU in the company database to make sure it matches the product name and price before completing an order. You check every return value from a method call to see if it’s null and you catch every exception. You write extensive tests to verify that each method is doing what you think it’s doing. You use a source code control system so you can always back out changes, and you never check code in until it’s passed all the regression tests. Every line of code is scrupulously documented. In fact, you write more documentation than actual code. And you’ve never, ever missed church on Sunday. In this case your name is Donald Knuth. The rest of us need a little more help making sure we don’t do something stupid.

Even if you are that conscientious, are you really willing to gamble on everyone else who sends or receives data from you being equally anal retentive? Wouldn’t it make more sense to use the most robust format possible so that when the inevitable errors do creep in, they’ll do less damage?

Of course, XML has a lot to offer the anal developer as well. When defining constraints such as “Every order must have a shipping address”, “the currency must be one of the three-letter codes USD, CAD, or GBP” or “the total cost must be the sum of the unit price times the number of items, the tax, and the shipping”, it’s easiest to use a declarative language that specifies what the constraints are without elaborating the actual code to check these constraints. When your data is XML, you can use a declarative schema language to define and test such constraints. Indeed, you have a choice of several schema languages. The simplest and most broadly supported, the classic document type definition (DTD), allows you to verify that all required elements are present in the required order with any necessary attributes. The W3C XML Schema language goes further and lets you constrain the contents of particular elements and attributes so that you can guarantee, for example, that the total price is a decimal number greater than 1.00. Schematron, the most powerful schema language of all, allows you to state multi-element constraints such as “the actual price must be less than or equal to the suggested retail price”. I’ll discuss all of these languages in more detail later in this chapter and the rest of the book. For now what you need to know is that you can list all the constraints on a document in a simple fashion and check those constraints without writing a lot of extra code to do so. You feed your documents through a validator before you act on them. Validation becomes a separate, modular and more maintainable part of the process. You can even change constraints or add new ones without recompiling your code.
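As a rough illustration of the first constraint, here is a sketch that checks “every order must have a shipping address” with a DTD and the validating parser that ships with the JDK. The DTD is a hypothetical, heavily trimmed one I made up for this sketch (a real order DTD would declare every element), and the class name OrderValidator is likewise invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class OrderValidator {

    // Hypothetical DTD: an Order is a Customer followed by a ShipTo.
    static final String DTD =
        "<!DOCTYPE Order [\n"
      + "  <!ELEMENT Order (Customer, ShipTo)>\n"
      + "  <!ELEMENT Customer (#PCDATA)>\n"
      + "  <!ATTLIST Customer id ID #REQUIRED>\n"
      + "  <!ELEMENT ShipTo (#PCDATA)>\n"
      + "]>\n";

    // Returns the validity errors the parser reports; an empty list means valid.
    static List<String> problems(String orderElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            List<String> problems = new ArrayList<>();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    problems.add(e.getMessage()); // validity errors arrive here
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + orderElement)));
            return problems;
        } catch (Exception e) {
            // Well-formedness errors and other failures end up here.
            throw new IllegalStateException("Could not parse the order", e);
        }
    }
}
```

Calling problems with an order that contains both a Customer and a ShipTo returns an empty list, while an order with no ShipTo returns at least one message. The constraint lives in the DTD, not in the Java code, so tightening it means editing the DTD rather than the program.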

Extensibility

Robustness isn’t the only advantage of the XML approach. The XML solution is also far more extensible. For example, suppose you suddenly discover a need to add a discount percentage to some products. The change to the XML is straightforward. Just add an extra element:

  <Product>
    <Name>Birdsong Clock</Name>
    <Quantity>12</Quantity>
    <SKU>244</SKU>
    <Price currency="USD">21.95</Price >
    <Discount>.10</Discount> 
  </Product>
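On the reading side the new element is just as easy to accommodate. The sketch below, with an invented class name Discounts, treats a missing Discount element as a zero discount. That default is my assumption about what the business would want, not something the format dictates.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class Discounts {

    // Returns the first Discount in the document, or 0.0 if there is none.
    static double discount(String productXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(productXml)));
            NodeList discounts = doc.getElementsByTagName("Discount");
            if (discounts.getLength() == 0) {
                return 0.0; // older documents without the element still work
            }
            return Double.parseDouble(discounts.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the product", e);
        }
    }
}
```

Programs written before the Discount element existed never ask for it, so they keep working unchanged on documents that contain it.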

The change to the plain text file (or the equivalent binary file) is much less obvious. You can certainly add an extra line of data. However, then everything that follows it will be out of order. You could put the new information at the end of the document, but then it isn’t close to the item it logically belongs with. And suppose not all orders have discounts. Will there be blank lines for products that don’t have discounts? How will your program recognize that it’s supposed to convert an empty string into a zero discount rather than NaN or throwing an exception? This is not an insurmountable problem, but the simple solution is becoming more complex.

Now suppose someone wants to add a gift message field whose value can contain line breaks. Now the data can contain the delimiter character! You can probably escape the line breaks as \n or some such, and then escape the backslash character as \\, but your nice simple solution is becoming quite a bit more complex. However, once again this is not a problem for XML as this solution demonstrates:

  <GiftMessage>
     Happy Birthday Monica!

    Love Always,
    Tracy
  </GiftMessage>

Throughout this example, I’ve assumed that each order is for exactly one product. That’s probably not true. Some customers will order multiple products at a time. Thus each order will contain between one and an indefinite number of products. Different products may even be going to different addresses. Do you break each individual item into a separate order document and repeat the customer information? If so how do you calculate the total shipping and total cost? Or do you allow multiple products in a single order? If so how do you tell where one product ends and the next begins? Again, none of these problems are unsolvable, but the simple solution proves more and more complex as the needs grow. The XML approach, by contrast, scales very well to expanded functionality in a very obvious way. Example 1.5 is an XML document that accomplishes all of the above. The boundaries between the individual parts are obvious.

Example 1.5. An XML document indicating an order for multiple products shipped to multiple addresses

<?xml version="1.0" encoding="ISO-8859-1"?>
<Order>
  <Customer id="c32">Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <SKU>244</SKU>
    <Quantity>12</Quantity>
    <Price currency="USD">21.95</Price >
    <ShipTo>
      <Street>135 Airline Highway</Street >
      <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip>
    </ShipTo>
  </Product>
  <Product>
    <Name>Brass Ship's Bell</Name>
    <SKU>258</SKU>
    <Quantity>1</Quantity>
    <Price currency="USD">144.95</Price >
    <Discount>.10</Discount>
    <ShipTo>
      <GiftRecipient>Samuel Johnson</GiftRecipient>
      <Street>271 Old Homestead Way</Street >
      <City>Woonsocket</City> <State>RI</State> <Zip>02895</Zip>
    </ShipTo>
    <GiftMessage>
      Happy Father's Day to a great Dad!
      
      Love,
      Sam and Beatrice
    </GiftMessage>
  </Product>
  <Subtotal currency='USD'>393.85</Subtotal>
  <Tax rate="7.0" 
       currency='USD'>28.20</Tax>
  <Shipping  method="USPS" currency='USD'>8.95</Shipping>
  <Total currency='USD' >431.00</Total>
</Order>

This example still isn’t really complete. Many pieces are missing including the credit card information, billing address, and more. Real world examples are larger and more complex than can comfortably fit in a book. Adding these other parts would only stretch the flat format further and make the advantages of XML still more obvious. The more complex your data is, the more important it is to use a hierarchical format like XML rather than a flat format like tab or line-delimited text.
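To make the point about boundaries concrete, here is a sketch that walks each Product in a multi-product order and pairs its name with the city it ships to. The class name ShipmentList is invented, and ORDER is a trimmed-down version of Example 1.5 that keeps only the elements the sketch reads.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ShipmentList {

    // Trimmed version of Example 1.5.
    static final String ORDER =
        "<Order>"
      + "<Product><Name>Birdsong Clock</Name>"
      + "<ShipTo><City>Narragansett</City></ShipTo></Product>"
      + "<Product><Name>Brass Ship's Bell</Name>"
      + "<ShipTo><City>Woonsocket</City></ShipTo></Product>"
      + "</Order>";

    // Returns one "name -> city" entry per Product element.
    static List<String> shipments(String orderXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(orderXml)));
            NodeList products = doc.getElementsByTagName("Product");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < products.getLength(); i++) {
                // Each Product element is its own subtree, so lookups
                // inside it cannot stray into a neighboring product.
                Element product = (Element) products.item(i);
                String name = product.getElementsByTagName("Name").item(0).getTextContent();
                String city = product.getElementsByTagName("City").item(0).getTextContent();
                result.add(name + " -> " + city);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the order", e);
        }
    }
}
```

The loop never has to guess where one product ends and the next begins; the start-tags and end-tags say so.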

Ease of Use

Now here’s the real kicker: not only is the XML document far more robust. Not only is it much more extensible in the face of both expected and unexpected changes. Not only does it more easily adapt to more complex structures. It is also easier for your programs to read! Writing a program to accept orders written in XML will be many times easier than writing a program to accept orders delivered in simple line delimited files. “How can that be?” you may be asking. After all, the program reading the XML document has to hunt for less than signs and quotation marks rather than just picking each piece of data off of a line. It has to make sure not to confuse any less than signs and quotation marks that may appear in the data itself with those in the markup. It has to deal with data that may extend across multiple lines. And in fact, there are many more possibilities not evident in this simple example that a real program has to handle.

Fortunately none of this matters to you as a developer because you don’t have to do any of it. Instead of writing the code to process XML documents directly, you let an XML parser do the hard work for you. A parser is a software library that knows how to read XML documents and handle all the markup it finds. The parser takes responsibility for checking documents for well-formedness and validity. Your own code reads the XML document only through the parser’s API. At this level, you can simply ask the parser to tell you what it saw in any particular element. Or you can ask the parser to tell you everything it sees as soon as it sees it. In either case, the parser just gives you the data after resolving all the markup. For instance, if you want to ask the parser what the total price was, it can tell you 290.79 and that this price has the currency USD. You don’t have to concern yourself with stripping off the markup around the information you want. Nor do you necessarily have to take the information in the order it appears in the input document. If you want the total price before the customer name, you can have it. If you just want to look at the price and ignore the rest of the order completely, you can do that too. You take the information in the form that’s convenient to you without worrying excessively about low level serialization details.
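Here is a sketch of the second style, in which the parser reports what it sees as it sees it, using the SAX parser that JAXP provides. The class names TotalReader and TotalHandler are invented for this sketch. The handler ignores everything except the Total element and its currency attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TotalReader {

    // Remembers the text and currency of the Total element as the parser reports them.
    static final class TotalHandler extends DefaultHandler {
        private final StringBuilder amount = new StringBuilder();
        private String currency;
        private boolean inTotal;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if (qName.equals("Total")) {
                inTotal = true;
                currency = attributes.getValue("currency");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one element's text in several pieces,
            // so append rather than assign.
            if (inTotal) {
                amount.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("Total")) {
                inTotal = false;
            }
        }
    }

    // Returns the total and its currency, for example "290.79 USD".
    static String total(String orderXml) {
        try {
            TotalHandler handler = new TotalHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(orderXml)), handler);
            return handler.amount.toString().trim() + " " + handler.currency;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the order", e);
        }
    }
}
```

Nowhere does this code look for a less-than sign or a quotation mark. The parser has already turned the markup into method calls by the time the handler runs.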

Note

One of the original ten goals for XML was that “It shall be easy to write programs which process XML documents.” Originally, this was interpreted as meaning that a “Desperate Perl Hacker” could write an XML parser in a weekend. Later it became clear that XML was simply too complex, even in its simplest form, for this goal to be met. However, the understanding of this requirement changed to mean that a typical programmer could use any of a number of free tools and libraries to process XML. Given this interpretation, the goal has most certainly been met.

The parser shields you from a lot of irrelevant details that you don’t really care about. These include:

  • How text is encoded: in Unicode, ASCII, Latin-1, SJIS, or something else

  • Whether carriage returns, line feeds, or both separate lines

  • How reserved characters such as < are escaped when used in the plain text parts of the document

  • Whether the byte order is big-endian or little-endian

None of these issues actually matter. None of them have any effect on what the data means or what the format allows you to say. However, when designing a data format, you must answer all these questions. As soon as you’ve said, “The underlying format of the data is XML”, every one of these questions is answered. Some are answered by simply choosing one possible solution. (The less than sign is escaped as &lt;.) Others are answered by allowing all possibilities and letting the parser sort things out (line endings). In all cases, the design problem is greatly simplified by picking XML as the underlying format.
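Two of these answers can be watched in action with a small sketch. The class name ParserShield and the sample message are invented. The input mixes escaped reserved characters with both Windows-style and old Macintosh-style line endings, and the parser hands back plain text with every line ending normalized to a single line feed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParserShield {

    // Returns the text content of the document's root element.
    static String text(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical gift message: escaped <, & and >, then \r\n and a bare \r.
        String xml = "<GiftMessage>1 &lt; 2 &amp; 3 &gt; 2\r\nLove,\rTracy</GiftMessage>";
        // The parser resolves the escapes and turns both line endings into \n.
        System.out.println(text(xml).equals("1 < 2 & 3 > 2\nLove,\nTracy")); // prints true
    }
}
```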



[1] This interpretation makes sense once you realize that java.util.StringTokenizer is designed for parsing Java source code, not for reading tab delimited data files. Nonetheless many programmers do use it for reading tab delimited data.


Copyright 2001, 2002 Elliotte Rusty Harold, elharo@metalab.unc.edu. Last Modified May 21, 2002