12. Store metadata in attributes

There's a recurring mild flame war on the xml-dev mailing list about when one should use attributes and when one should use elements. There's a slightly hotter one about whether one should ever use attributes at all. The bottom line is that it's really up to you. Do what feels right for your application. Most developers prefer to use attributes for metadata as opposed to the data itself, but this is a very rough rule of thumb at best. Of course, what's data and what's metadata depends heavily on who is reading your documents for what purpose.

One way to determine whether information is metadata is to ask yourself whether a person reading the text would want to see it. For example, consider the following paragraph from the XML Base specification:

The set of characters allowed in xml:base attributes is the same as for XML, namely [Unicode]. However, some Unicode characters are disallowed from URI references, and thus processors must encode and escape these characters to obtain a valid URI reference from the attribute value.

If I were to mark this up in DocBook, every word above would be part of the element content:

<para id="p32">
  The set of characters allowed in <markup>xml:base</markup> 
  attributes is the same as for XML, namely 
  <ulink url="http://www.w3.org/TR/xmlbase/#Unicode">[Unicode]</ulink>.
  However, some Unicode characters are disallowed from URI references, 
  and thus processors <ulink type="Must, May, etc." 
  url="http://www.w3.org/TR/xmlbase/#dt-must">must</ulink> encode 
  and escape these characters to obtain a valid URI reference from the
  attribute value.
</para>

However, the following parts would be stored in attribute values:

IDs
Styles
URLs of the remote links
Titles of the remote links
Revision dates
Author's name

What unifies all these pieces of data is that the reader doesn't want to see any of them as part of the normal flow of text. They're useful, and they have a purpose. For instance, without the URL in the url attribute, the browser doesn't know where a link goes to. Without the id attribute, other pages can't link to this paragraph. Without revision tracking information, the author can't review and accept or reject changes. However, none of these things matter to the end reader, who just wants to read words in a row. Some of the attribute values may affect how the text is presented to the reader; for example, whether a word is italicized or whether a link is underlined. But in no case does the reader actually want to see the text that makes up the attribute value. After all, this would be much more confusing:

p32 The set of characters allowed in xml:base attributes is the same as for XML, namely [Unicode http://www.w3.org/TR/xmlbase/#Unicode ]. However, some Unicode characters are disallowed from URI references, and thus processors must Must, May, etc. http://www.w3.org/TR/xmlbase/#dt-must encode and escape these characters to obtain a valid URI reference from the attribute value.

In a few cases, the reader may want to see the content of the attribute value in special circumstances. For instance, when the user moves the mouse over a link, the browser may show the link title or URL in a tooltip or the browser status bar. However, this is extra information that still isn't provided as part of the normal flow of text.

The dividing line between data and metadata isn't nearly as clear in record-like documents as it is in narrative documents. Different users may well be interested in different aspects of the content. What is irrelevant data for one reader may well be the whole point for another. There are often several reasonable ways to divide the data between element content and attribute values. Nonetheless, the same basic principle applies: if the information is core, place it in element content. Reserve attributes for housekeeping information such as arbitrary ID numbers.

Although the rough distinction between data and metadata is a useful way to decide whether or not to place some simple text in an attribute, there is one rule that trumps this. Structured data should be part of element content. Attribute values contain undifferentiated text. They have no substructure, at least none that's accessible to the XML parser. An attribute value can contain a number, a URL, a date, a time, or some other atomic value. However, more complicated structures often require division into component parts, and you can only reasonably do this with elements. For example, consider this blockquote element where a src attribute is used to identify the source of the quote:

<blockquote src="Christopher Locke, &quot;Post-Apocalypso,&quot; in 
The Cluetrain Manifesto, (Cambridge: Perseus Books, 1999), p. 175">
  <p>
    There never was any grand plan on the Internet, and there isn't one 
    today. The Net is just the Net. But it <em>has</em> provided an
    extraordinarily efficient means of communication to people so long
    ignored, so long invisible, that they're only now figuring out 
    what to do with it. Funny thing: lawless, planless, 
    management-free, they're figuring out what to do with the Internet
    much faster than government agencies, academic institutions, media
    conglomerates, and Fortune-class corporations.
  </p>
</blockquote>
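To make the point about missing substructure concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The choice of Python, the shortened paragraph text, and the string-splitting trick are assumptions made for illustration; they are not part of the original example. The sketch shows that the parser hands back the whole citation as one flat string, so extracting a single field means writing ad hoc string code that depends on the citation style.

```python
import xml.etree.ElementTree as ET

# The src value is the one used above; the paragraph text is shortened.
doc = (
    '<blockquote src="Christopher Locke, &quot;Post-Apocalypso,&quot; in '
    'The Cluetrain Manifesto, (Cambridge: Perseus Books, 1999), p. 175">'
    '<p>There never was any grand plan on the Internet.</p>'
    '</blockquote>'
)

quote = ET.fromstring(doc)
src = quote.get("src")

# The parser resolves &quot; but otherwise returns one undifferentiated string.
print(type(src).__name__)
print(src)

# Getting the page number requires string handling that only works while the
# value happens to end in "p. N" (an assumption about the citation style).
page = src.rsplit("p. ", 1)[-1]
print(page)
```

Nothing in the XML API distinguishes the author from the publisher here; that knowledge lives only in the string-handling code.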

There are many different units of information in the src attribute: the author, the title of the chapter, the title of the book, the publisher, the page number, and so on. Perhaps these could be divided into separate attributes like this:

<blockquote author_name="Christopher Locke"
            chapter_title="Post-Apocalypso"
            book_title="The Cluetrain Manifesto"
            page_number="175"
            publisher_name="Perseus Books"
            publisher_city="Cambridge"
            year="1999">
...

However, this loses track of the substructure such as the difference between first and last name, or the order. For instance, I carefully wrote the original citation so that it adheres to the rules of the Chicago Manual of Style. Once it's split into separate attributes, that's no longer true.

Even worse, what if the chapter has more than one author? There can be only one attribute with a given name, such as author_name, on any given element, but there's no limit to the number of child elements it can have. No, what's really called for here is a child source element, even though the source is obviously metadata about the quotation rather than core information.

<blockquote>
  <p>
    There never was any grand plan on the Internet, and there isn't one 
    today. The Net is just the Net. But it <em>has</em> provided an
    extraordinarily efficient means of communication to people so long
    ignored, so long invisible, that they're only now figuring out 
    what to do with it. Funny thing: lawless, planless, 
    management-free, they're figuring out what to do with the Internet
    much faster than government agencies, academic institutions, media
    conglomerates, and Fortune-class corporations.
  </p>
  
  <source>
    <author>
      <name><given>Christopher</given> <family>Locke</family></name>
    </author>
    <chapter>
      <title>Post-Apocalypso</title>
    </chapter>
    <book>The Cluetrain Manifesto</book>
    <page>175</page>
    <city>Cambridge</city>
    <publisher>Perseus Books</publisher>
    <year>1999</year>
  </source>
</blockquote>
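The contrast between repeated attributes and repeated child elements can be checked directly. The following is a minimal sketch, again assuming Python's xml.etree.ElementTree; the second author is invented purely for illustration and does not appear in the book being cited.

```python
import xml.etree.ElementTree as ET

# Two attributes with the same name on one element violate well-formedness,
# so the parser rejects the document outright.
try:
    ET.fromstring('<source author="Christopher Locke" author="A. N. Other"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

# Repeated child elements are fine. The second author is hypothetical.
doc = """<source>
  <author><name><given>Christopher</given> <family>Locke</family></name></author>
  <author><name><given>Ann</given> <family>Other</family></name></author>
  <book>The Cluetrain Manifesto</book>
</source>"""

source = ET.fromstring(doc)
families = [a.findtext("name/family") for a in source.findall("author")]

print(duplicate_accepted)
print(families)
```

Because each author is its own element, the given and family names stay separately addressable no matter how many authors are listed.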

Once the substructure is expressed with elements, it's straightforward for a stylesheet to show or hide any parts of the content you do or do not want shown. For example, you might want to include the name of the author, the title of the book, the year, and the page number, but leave out the publisher and chapter. These CSS rules accomplish that:

source, author, book, year, page { display: inline }
publisher, chapter, city { display: none }

You really couldn't do that if you just had one big attribute to work with.
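The same kind of selection is open to any program, not only to a stylesheet. As a rough sketch, assuming Python's xml.etree.ElementTree and a simple space-separated output format chosen only for illustration, a script can pick out the parts it wants by path and ignore the rest:

```python
import xml.etree.ElementTree as ET

source = ET.fromstring("""<source>
  <author>
    <name><given>Christopher</given> <family>Locke</family></name>
  </author>
  <chapter><title>Post-Apocalypso</title></chapter>
  <book>The Cluetrain Manifesto</book>
  <page>175</page>
  <city>Cambridge</city>
  <publisher>Perseus Books</publisher>
  <year>1999</year>
</source>""")

# Paths of the parts to show. Publisher, chapter, and city are simply
# left out of the list, which mirrors the display: none rule.
shown = ["author/name/given", "author/name/family", "book", "year", "page"]
citation = " ".join(source.findtext(path) for path in shown)

print(citation)  # Christopher Locke The Cluetrain Manifesto 1999 175
```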

Elements have one final advantage over attributes: they are much more extensible in the face of future changes. For example, many libraries like to give the author's birth year in their card catalogs. With elements this is easy to add:

<author>
  <name><given>Christopher</given> <family>Locke</family></name>
  <born>1950</born>
</author>

With attributes, adding content like this is much more cumbersome.

One common use of attributes that I think clearly does qualify as metadata belonging in an attribute, rather than as data for a child element, is identifying the subtype of a particular element. For instance, in HTML and XHTML, you often see elements annotated with a class attribute, most commonly div and span:

<div class="sect2">...</div>
<div class="titlepage">...</div>
<div class="informalexample">...</div>
<div class="summary">...</div>
<span class="person">...</span>
<span class="book">...</span>
<img class="equation" src="maxwell.gif" width="120" height="30"/>

Here, the class attribute is extending the normally fairly fixed HTML vocabulary. Identifying elements by class enables the author to apply different styles or processing rules to elements of different classes, even though they have the same name. This, I think, is clearly metadata. It is metadata in the same way that an element name is metadata. In effect, these attributes are standing in for element names that the vocabulary does not allow. Thus they belong in the start-tag, just like an element name. If you find you're using such attributes frequently, then it indicates that your markup vocabulary is not a good fit for your data.
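Applying different processing rules by class might look like the following minimal sketch. It assumes Python's xml.etree.ElementTree, whose limited XPath support includes attribute-value predicates; the wrapping body element and the element text are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A small XHTML-like fragment; the content is hypothetical.
doc = """<body>
  <div class="titlepage">Title</div>
  <div class="summary">Summary</div>
  <span class="person">Christopher Locke</span>
  <span class="book">The Cluetrain Manifesto</span>
</body>"""

root = ET.fromstring(doc)

# Same element name, different class: the attribute plays the role that a
# more specific element name would play in a richer vocabulary.
people = [e.text for e in root.findall(".//span[@class='person']")]
books = [e.text for e in root.findall(".//span[@class='book']")]

print(people)
print(books)
```

The selection logic treats the class value exactly as it would treat an element name, which is the sense in which the attribute is metadata about the markup rather than content.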

DocBook uses the role attribute in a similar way, to allow authors to attach arbitrary roles to elements that the designers of DocBook did not anticipate. A little more formally, the DocBook systemitem element has a class attribute whose value is given as an enumerated list of specific types of systemitem such as domainname, ipaddress, newsgroup, and username:

<phrase role="formula">H2O</phrase>
<phrase role="prescription">Paxil, 20 mg</phrase>
<personname role="plumber">
  <firstname>Laurence</firstname> 
  <lastname>Bienvenue</lastname>
</personname>
<systemitem class="domainname">www.cluetrain.com</systemitem>
<systemitem class="username">eharold</systemitem>

DocBook treats this very sensibly. If a particular role or class is found to be used frequently in practice, then it's a strong candidate for addition to the next version of DocBook as an element. Indeed several current DocBook elements such as environvar and prompt started life as systemitem classes or mere roles.

In the end, if you have any doubt about whether information is metadata or data, I suggest that you place it in element content. There's little an attribute can do that an element can't, but much that an element can do that an attribute can't. The costs of mis-marking data that should be elements as attributes are much higher than the costs of mis-marking data that should be attributes as elements.