Chapter 24 of the XML Bible, Gold Edition: Schemas

In This Chapter

What's Wrong with DTDs?
What is a Schema?
The W3C XML Schema Language
Hello Schemas
Complex Types
Grouping
Simple Types
Deriving Simple Types
Empty Elements
Attributes
Namespaces
Annotations

What's Wrong with DTDs?

Document Type Definitions (DTDs) are an outgrowth of XML's heritage in the Standard Generalized Markup Language (SGML). SGML was always intended for narrative-style documents: books, reports, technical manuals, brochures, Web pages, and the like. DTDs were designed to serve the needs of these sorts of documents, and indeed they serve them very well. DTDs let you state very simply and straightforwardly that every book must have one or more authors, that every song has exactly one title, that every PERSON element has an ID attribute, and so forth. Indeed, for narrative documents that are intended for human beings to read from start to finish, that are more or less composed of words in a row, there's really no need for anything beyond a DTD. However, XML has gone well beyond the uses envisioned for SGML. XML is being used for object serialization, stock trading, remote procedure calls, vector graphics, and many more things that look nothing like traditional narrative documents; and it is in these new arenas that DTDs are showing some limits.

The limitation most developers notice first is the almost complete lack of data typing, especially for element content. DTDs can't say that a PRICE element must contain a number, much less a number that's greater than zero with two decimal digits of precision and a dollar sign. There's no way to say that a MONTH element must be an integer between 1 and 12. There's no way to indicate that a TITLE must contain between 1 and 255 characters. None of these are particularly important things to do for the narrative documents SGML was aimed at; but they're very common things to want to do with data formats intended for computer-to-computer exchange of information rather than computer-to-human communication. Humans are very good at handling fuzzy systems where expected data is missing, or perhaps in not quite the right format; computers are not. Computers need to know that when they expect an element to contain an integer between 1 and 12, the element really contains an integer in that range and nothing else.
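
To make the contrast concrete, here is a minimal sketch of how the MONTH and TITLE constraints just described might be written in the W3C XML Schema language that the rest of this chapter covers. The element names come from the examples above. Nesting an anonymous type inside each declaration is only one of several ways to arrange this; treat it as an illustration rather than a recommended style.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- MONTH must be an integer between 1 and 12 inclusive -->
  <xsd:element name="MONTH">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="1"/>
        <xsd:maxInclusive value="12"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- TITLE must contain between 1 and 255 characters -->
  <xsd:element name="TITLE">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:minLength value="1"/>
        <xsd:maxLength value="255"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

</xsd:schema>
```

A DTD can say no more about either element than (#PCDATA).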

The second problem is that DTDs have an unusual non-XML syntax. You actually need different parsers and APIs to handle DTDs than you do to handle XML documents themselves. For instance, consider this common element declaration:

<!ELEMENT TITLE (#PCDATA)>

This is not a legal XML element. You can’t begin an element name with an exclamation point. TITLE is not an attribute. Neither is (#PCDATA). This is a very different way of describing information than is used in XML document instances. One would expect that if XML were really powerful enough to live up to all its hype then it would be powerful enough to describe itself. You shouldn’t need two different syntaxes: one for the information and one for the meta-information detailing the structure of the information. XML element and attribute syntax should suffice for both info and meta-info.

The third problem is that DTDs are only marginally extensible and don’t scale very well. It's difficult to combine independent DTDs together in a sensible way. You can do this with parameter entity references. Indeed, SMIL 2.0 and modular XHTML are based on this idea. However, the modularized DTDs are very messy and very hard to follow. The largest DTDs in use today are in the ballpark of 10,000 lines of code, and it's questionable whether much larger XML applications can be defined before the entire DTD becomes completely unmanageable and incomprehensible. By contrast, the largest computer programs in existence today, which are much more intrinsically complex than even the most ambitious DTDs, easily reach sizes of 1,000,000 lines of code and more; sometimes even 10,000,000 lines of code or more.

Perhaps most annoyingly, DTDs are only marginally compatible with namespaces. The first principle of namespaces is that only the URI matters. The prefix does not. The prefix can change as long as the URI remains the same. However, validation of documents that use namespace prefixes works only if the DTD declares the prefixed names. You cannot use namespace URIs in a DTD. You must use the actual prefixes. If you change the prefixes in the document but don’t change the DTD, then the document immediately ceases to be valid. There are some tricks that you can perform with parameter entity references to make DTDs less dependent on the actual prefix, but they're complicated and not well understood in the XML community. And even when they are understood, these tricks simply feel far too much like a dirty hack rather than a clean, maintainable solution.
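
The following small document sketches the problem. The song prefix and the http://www.example.com/song URI are made up for illustration; any prefix and URI would behave the same way.

```xml
<?xml version="1.0"?>
<!DOCTYPE song:SONG [
  <!-- The DTD can only name the prefixed forms song:SONG and song:TITLE.
       It has no way to refer to the namespace URI itself. -->
  <!ELEMENT song:SONG (song:TITLE)>
  <!ATTLIST song:SONG xmlns:song CDATA #FIXED "http://www.example.com/song">
  <!ELEMENT song:TITLE (#PCDATA)>
]>
<song:SONG xmlns:song="http://www.example.com/song">
  <song:TITLE>Hot Cop</song:TITLE>
</song:SONG>
```

This document is valid as written. Change every song prefix in the instance to s, leaving the URI alone, and a namespace-aware application sees exactly the same elements; but a validating parser now complains that s:SONG, s:TITLE, and the xmlns:s attribute have not been declared.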

Finally, there are a number of annoying minor limitations where DTDs don’t allow you to do things that it really feels like you ought to be able to do. For instance, DTDs cannot enforce the order or number of child elements in mixed content. That is, you can't make statements such as each PARAGRAPH element must begin with exactly one SUMMARY element that is followed by plain text. Similarly you can’t enforce the number of child elements without also enforcing their order. For instance, you cannot easily say that a PERSON element must contain a FIRST_NAME child, a MIDDLE_NAME child, and a LAST_NAME child, but that you don’t care what order they appear in. Again, there are workarounds; but they grow combinatorially complex with the number of possible child elements.
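
To see how quickly the workaround grows, here is a sketch of the DTD declarations needed to allow the three PERSON children in any order. It is one possible way to write the content model, not the only one.

```xml
<!-- All six orderings of three required children, written out by hand.
     The alternatives are grouped by first child so that the content
     model stays deterministic, as XML 1.0 requires. -->
<!ELEMENT PERSON (
    (FIRST_NAME,  ((MIDDLE_NAME, LAST_NAME)   | (LAST_NAME,   MIDDLE_NAME)))
  | (MIDDLE_NAME, ((FIRST_NAME,  LAST_NAME)   | (LAST_NAME,   FIRST_NAME)))
  | (LAST_NAME,   ((FIRST_NAME,  MIDDLE_NAME) | (MIDDLE_NAME, FIRST_NAME)))
)>
<!ELEMENT FIRST_NAME  (#PCDATA)>
<!ELEMENT MIDDLE_NAME (#PCDATA)>
<!ELEMENT LAST_NAME   (#PCDATA)>
```

Three children already require six orderings. A fourth child would require 24, and a fifth 120.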

Schemas are an attempt to solve all these problems by defining a new XML-based syntax for describing the permissible contents of XML documents that includes:

Powerful data typing including range checking
Namespace-aware validation based on namespace URIs rather than on prefixes
Extensibility and scalability

However, schemas are not a be-all and end-all solution. In particular, schemas do not replace DTDs! You can use both schemas and DTDs in the same document. DTDs can do several things that schemas cannot do, most importantly declaring entities. And of course, DTDs still work very well for the classic sort of narrative documents they were originally designed for. Indeed, for these types of documents, a DTD is often considerably easier to write than an equivalent schema. Parsers and other software will continue to support DTDs for as long as they support XML.

What is a Schema?

The word schema derives from the Greek word σχημα, meaning form or shape. It was first popularized in the Western world by Immanuel Kant in the late 1700s. According to the 1933 edition of the Oxford English Dictionary, Kant used the word schema to mean, "Any one of certain forms or rules of the ‘productive imagination’ through which the understanding is able to apply its ‘categories’ to the manifold of sense-perception in the process of realizing knowledge or experience." (And you thought computer science was full of unintelligible technical jargon!)

Schemas remained the province of philosophers for the next 200 years, until the word schema entered computer science, probably through database theory. Here, schema originally meant any document that described the permissible content of a database. More specifically, a schema was a description of all the tables in a database and the fields in each table. A schema also described what type of data each field could contain: CHAR, INT, CHAR[32], BLOB, DATE, and so on.

The word schema has grown from that source definition to a more generic meaning of any document that describes the permissible contents of other documents, especially if data typing is involved. Thus, you'll hear about different kinds of schemas from different technologies, including vocabulary schemas, RDF schemas, organizational schemas, X.500 schemas and, of course, XML schemas.

You say schemas, I say schemata

Probably no single topic has been more controversial in the schema world than the proper plural form of the word schema. The original Greek plural is σχηματα, schemata in Latin transliteration; and this is the form which Kant used and which you'll find in most dictionaries. This was fine for the 200 years when only people with PhDs in philosophy actually used the word. However, as often happens when words from other languages are adopted into popular English, its plural changed to something that sounds more natural to an Anglophone ear. In this case, the plural form schemata seems to be rapidly dying out in favor of the simpler schemas. In fact, the three World Wide Web Consortium (W3C) schema specifications all use the plural form schemas. I follow this convention in this book.

Since schemas is such a generic term, it shouldn't come as any surprise to you that there's more than one schema language for XML. In fact there are many, each with its own unique advantages and disadvantages. These include Murata Makoto's Relax (http://www.xml.gr.jp/relax/), Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html), James Clark's TREX - Tree Regular Expressions for XML (http://www.thaiopensource.com/trex/), the Document Definition Markup Language (DDML, also known as XSchema, http://purl.oclc.org/NET/ddml), and the W3C's misleadingly, generically titled XML Schema language. In addition, traditional XML DTDs can be considered to be yet another schema language.

There are also a number of dead XML schema languages that have been abandoned by their manufacturers in favor of other languages. These include Document Content Description (DCD), Commerce One's Schema for Object-Oriented XML (SOX), and Microsoft's XML-Data Reduced (XDR). None of these are worth your time or investment at this point. They never achieved broad adoption, and their vendors are now moving to the W3C XML Schema language instead.

This chapter focuses almost exclusively on the W3C XML Schema language. Nonetheless, TREX, Relax, and Schematron are definitely worthy of your attention as well. In particular, if you find W3C schemas to be excessively complex (and many people do) and if you want a simpler schema language that still offers a complete set of extensible data types, you should consider Relax. Relax adopts the less controversial data types half of the W3C XML Schema recommendation, but replaces the much more complex and much less popular structures half with a much simpler language. Relax also has the advantage of being an official JIS and ISO standard.

Most schema languages, including W3C schemas, Relax, TREX, DDML, and DTDs, take the approach that you must carefully specify what is allowed in the document. They are conservative: Everything not permitted is forbidden. If, on the other hand, you're looking for a less-restrictive schema language in which everything not forbidden is permitted, you should consider Schematron. Schematron is based on XPath, which allows it to make statements none of the other major schema languages can, such as "An a element cannot have another a element as a descendant, even though an a element can contain a strong element which can contain an a element if it itself is not a descendant of an a element." This isn’t a theoretical example. This is a real restriction in XHTML that has to be made in the prose of the specification because neither DTDs nor schemas are powerful enough to say it. What it means is that links can’t nest; that is, a link cannot contain another link.

From this point forward, I will use the unqualified word schema to refer to the W3C's XML schema language; but please keep in mind that alternatives that are equally deserving of the appellation do exist.

The W3C XML Schema Language

The W3C XML Schema language was created by the W3C XML Schema Working Group based on many different submissions from a variety of companies and individuals. It is a very large specification designed to handle a broad range of use cases. In fact, the schema specification is considerably larger and more complex than the XML 1.0 specification. It is an open standard, free to be implemented by any interested party. There are no known patent, trademark, or other intellectual property restrictions that would prevent you from doing anything you might reasonably want to do with schemas. (Unfortunately, this is not quite the same thing as saying that there are no known patent, trademark, or other intellectual property restrictions that would prevent you from doing anything you might reasonably want to do. The U.S. Patent Office has been a little out of control lately, granting patents left and right for inventions that really don’t deserve it, including a lot of software and business processes. I would not be surprised to learn of an as yet unnoticed patent that at least claims to cover some or all of the W3C XML Schema language.)

Caution

This chapter is based on the May 2, 2001 Recommendation of XML Schemas. At the time of this writing (June 2001), no software yet implements all of the final Recommendation. In fact, only one parser, Xerces-J, currently supports most of the W3C XML Schema language. Eventually, of course, this should be less of an issue as the standard evolves toward its final incarnation and more vendors implement the full schema language described here. In the meantime, if you do encounter something that doesn’t seem to work quite right, please report the problem to your parser vendor, not to me.

Hello Schemas

Let's begin our exploration of schemas with the ubiquitous Hello World example. Recall, once again, Listing 3-2 (greeting.xml) from Chapter 3. It is shown below:

Listing 3-2: greeting.xml

<?xml version="1.0"?>
<GREETING>
Hello XML!
</GREETING>

This XML document contains a single element, GREETING. (Remember that <?xml version="1.0"?> is the XML declaration, not an element.) This element contains parsed character data. A schema for this document has to declare the GREETING element. It may declare other elements too, including ones that aren’t present in this particular document, but it must at least declare the GREETING element.

The greeting schema

Listing 24-1 is a very simple schema for GREETING elements. By convention it would be stored in a file with the three-letter extension .xsd, greeting.xsd for example, but that's not required. It is an XML document so it has an XML declaration. It can be written and saved in any text editor that knows how to save Unicode files. As always, you can use a different character set if you declare it in an encoding declaration. Schema documents are XML documents and have all the privileges and responsibilities of other XML documents. They can even have DTDs, DOCTYPE declarations, and style sheets if that seems useful to you, although in practice most do not.

Listing 24-1: greeting.xsd

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="GREETING" type="xsd:string"/>
</xsd:schema>

The root element of this and all other schemas is schema. This must be in the http://www.w3.org/2001/XMLSchema namespace. Normally, this namespace is bound to the prefix xsd or xs, although this can change as long as the URI stays the same. The other common approach is to make this URI the default namespace, although that generally requires a few extra attributes to help separate the names in the XML application the schema describes from the names of the schema elements themselves. You'll see this when namespaces are discussed at the end of this chapter.

Elements are declared using xsd:element elements. Listing 24-1 includes a single such element declaring the GREETING element. The name attribute specifies which element is being declared, GREETING in this example. This xsd:element element also has a type attribute whose value is the data type of the element. In this case the type is xsd:string, a standard type for elements that can contain any amount of text in any form but not child elements. It's equivalent to a DTD content model of #PCDATA. That is, this xsd:element says that a valid GREETING element must look like this:

<GREETING>
  various random text but no markup
</GREETING>

There's no restriction on what text the element can contain. It can be zero or more Unicode characters with any meaning. Thus a GREETING element can also look like this:

<GREETING>Hello!</GREETING>

Or even this:

<GREETING></GREETING>

However, a valid GREETING element may not look like this:

<GREETING>
  <SOME_TAG>various random text</SOME_TAG>
  <SOME_EMPTY_TAG/>
</GREETING>

Nor may it look like this:

<GREETING>
  <GREETING>various random text</GREETING>
</GREETING>

Each GREETING element must consist of nothing more and nothing less than parsed character data between an opening <GREETING> tag and a closing </GREETING> tag.

Validating the document against the schema

Before a document can be validated against a DTD, the document itself must contain a document type declaration pointing to the DTD it should be validated against. You cannot easily receive a document from a third party and validate it against your own DTD. You have to validate it against the DTD that the document's author specified. This is excessively limiting.

For example, imagine you're running an e-commerce business that accepts orders for products using SOAP or XML-RPC. Each order comes to you over the Internet as an XML document. Before accepting that order the first thing you want to do is check that it's valid against a DTD you've defined to make sure that it contains all the necessary information. However, if DTDs are all you have to validate with, then there's nothing to prevent a hacker sending you a document whose DOCTYPE declaration points to a different DTD. Then your system may report that the document is valid according to the hacked DTD, even though it would be invalid when compared to the correct DTD. If your system accepts the invalid document, it could introduce corrupt data that crashes the system or lets the hacker order goods they haven’t paid for, all because the person authoring the document got to choose which DTD to validate against rather than the person validating the document.

Schemas are more flexible. The schema specification specifically allows for a variety of different means for associating documents with schemas. For instance, one possibility is that both the name of the document to validate and the name of the schema to validate it against could be passed to the validator program on the command line like this:

C:\>validator greeting.xml greeting.xsd

Parsers could also let you choose the schema by setting a SAX property or an environment variable. Many other schemes are possible. The schema specification does not mandate any one way of doing this. However, it does define one particular way to associate a document with a schema. As with DOCTYPE declarations and DTDs, this requires modifying the instance document to point to the schema. The difference is that with schemas, unlike with DTDs, this is not the only way to do it. Parser vendors are free to develop other mechanisms if they want to.

To attach a schema to a document, add an xsi:noNamespaceSchemaLocation attribute to the document's root element. (You can also add it to the first element in the document that the schema applies to, but most of the time adding it to the root element is simplest.) The xsi prefix is mapped to the http://www.w3.org/2001/XMLSchema-instance URI. As always, the prefix can change as long as the URI stays the same. Listing 24-2 demonstrates.

Listing 24-2: valid_greeting.xml

<?xml version="1.0"?>
<GREETING xsi:noNamespaceSchemaLocation="greeting.xsd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
Hello XML!
</GREETING>

You can now run the document through any parser that supports schema validation. One such parser is Xerces Java 1.4.0 from the XML Apache Project. In fact, you can use the same SAXCount program you learned about in Chapter 8 to validate against schemas as well as DTDs. When you set the -v flag, SAXCount validates the documents it parses against a DTD if it sees a DOCTYPE declaration and against a schema if it finds an xsi:noNamespaceSchemaLocation attribute. Assuming SAXCount finds no errors, it simply returns the amount of time that was required to parse the document:

C:\XML>java sax.SAXCount -v valid_greeting.xml
valid_greeting.xml: 701 ms (1 elems, 1 attrs, 0 spaces, 12 chars)

Note

This chapter uses Xerces Java 1.4.0, which provides partial support for the May 2, 2001 Recommendation of XML Schema. At the time of this writing Xerces C++ has no schema support at all. Furthermore, earlier versions of Xerces Java support earlier drafts of the W3C XML Schema language that use different namespace URIs. In particular, they support the http://www.w3.org/2000/10/XMLSchema and http://www.w3.org/1999/XMLSchema namespaces. You can download the latest version of Xerces from http://xml.apache.org/xerces-j/.

Now let's suppose you have a document that's not valid, such as Listing 24-3. This document uses a P element that hasn't been declared in the schema.

Listing 24-3: invalid_greeting.xml

<?xml version="1.0"?>
<GREETING
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="greeting.xsd">
  <P>Hello XML!</P>
</GREETING>

Running it through sax.SAXCount, you now get this output showing you what the problems are:

C:\XML>java sax.SAXCount -v invalid_greeting.xml
[Error] invalid_greeting.xml:5:6: Element type "P" must be declared.
[Error] invalid_greeting.xml:6:13: Datatype error: In element 'GREETING' : Can not have element children within a simple type content.
invalid_greeting.xml: 1292 ms (2 elems, 2 attrs, 0 spaces, 14 chars)

The validator found two problems. The first is that the P element is used but is not, itself, declared. The second is that the GREETING element is declared to have type xsd:string, one of several "simple" types that cannot have any child elements. However, in this case, the GREETING element does contain a child element: the P element.

Complex Types

The W3C XML Schema language divides elements into complex and simple types. A simple type element is one like GREETING that can only contain text and does not have any attributes. It cannot contain any child elements. It may, however, be more limited in the kind of text it can contain. For instance, a schema can say that a simple element contains an integer, a date, or a decimal value between 3.76 and 98.24. Complex elements can have attributes and can have child elements.

Most documents need a mix of both complex and simple elements. For example, consider Listing 24-4. This document describes the song Yes I Am by Melissa Etheridge. The root element is SONG. This element has a number of child elements giving the title of the song, the composer, the producer, the publisher, the duration of the song, the year it was released, the price, and the artist who sang it. Except for SONG itself, these are all simple elements that can have type xsd:string. You might see documents like this used in CD databases, MP3 players, Gnutella clients, or anything else that needs to store information about songs.

Listing 24-4: yesiam.xml

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="song.xsd">
  <TITLE>Yes I Am</TITLE>
  <COMPOSER>Melissa Etheridge</COMPOSER>
  <PRODUCER>Hugh Padgham</PRODUCER>
  <PUBLISHER>Island Records</PUBLISHER>
  <LENGTH>4:24</LENGTH>
  <YEAR>1993</YEAR>
  <ARTIST>Melissa Etheridge</ARTIST>
  <PRICE>$1.25</PRICE>
</SONG>

Now you need a schema that describes this and all other reasonable song documents. Listing 24-5 is the first attempt at such a schema.

Listing 24-5: song.xsd

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="xsd:string"/>
      <xsd:element name="PRODUCER"  type="xsd:string"/>
      <xsd:element name="PUBLISHER" type="xsd:string"/>
      <xsd:element name="LENGTH"    type="xsd:string"/>
      <xsd:element name="YEAR"      type="xsd:string"/>
      <xsd:element name="ARTIST"    type="xsd:string"/>
      <xsd:element name="PRICE"     type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

The root element of this schema is once again xsd:schema, and once again the prefix xsd is mapped to the namespace URI http://www.w3.org/2001/XMLSchema. This will be the case for all schemas in this chapter, and indeed all schemas that you write. I won’t note it again.

This schema declares a single top-level element. That is, there is exactly one element declared in an xsd:element declaration that is an immediate child of the root xsd:schema element. This is the SONG element. Only top-level elements can be the root elements of documents described by this schema, though in general they do not have to be the root element.

The SONG element is declared to have type SongType. The W3C Schema Working Group wasn't prescient. They built a lot of common types into the language, but they didn’t know that I was going to need a song type, and they didn’t provide one. Indeed, they could not reasonably have been expected to predict and provide for the numerous types that schema designers around the world were ever going to need. Instead, they provided facilities to allow users to define their own types. SongType is one such user-defined type. In fact, you can tell it's not a built-in type because it doesn’t begin with the prefix xsd. All built-in types are in the http://www.w3.org/2001/XMLSchema namespace.

The xsd:complexType element defines a new type. The name attribute of this element names the type being defined. Here that name is SongType, which matches the type previously assigned to the SONG element. Forward references (for example, xsd:element using the SongType type before it's been defined) are perfectly acceptable in schemas. Circular references are okay, too. Type A can depend on type B which depends on type A. Schema processors sort all this out without any difficulty.
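
As a sketch of both points, the following schema uses a forward reference and a circular one. The ALBUM, TRACK, and BONUS_ALBUM names and the AlbumType and TrackType types are hypothetical, invented only for this illustration; they are not part of the song examples in this chapter.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="ALBUM" type="AlbumType"/>

  <!-- AlbumType refers to TrackType before TrackType is defined -->
  <xsd:complexType name="AlbumType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="xsd:string"/>
      <xsd:element name="TRACK" type="TrackType" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- TrackType refers back to AlbumType, completing the circle.
       The reference is optional so that a finite document can be valid. -->
  <xsd:complexType name="TrackType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="xsd:string"/>
      <xsd:element name="BONUS_ALBUM" type="AlbumType" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

Note the minOccurs="0" on BONUS_ALBUM. If every element in the circle were required, no finite document could ever satisfy the schema.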

The contents of the xsd:complexType element specify what content a SongType element must contain. In this example, the schema says that every SongType element contains a sequence of eight child elements: TITLE, COMPOSER, PRODUCER, PUBLISHER, LENGTH, YEAR, ARTIST, and PRICE. Each of these is declared to have the built-in type xsd:string. Each SongType element must contain exactly one of each of these in exactly that order. The only other content it may contain is insignificant white space between the tags.

minOccurs and maxOccurs

You can validate Listing 24-4, yesiam.xml, against the song schema, and it does, indeed, prove valid. Are you done? Is song.xsd now an adequate description of legal song documents? Suppose you instead wanted to validate Listing 24-6, a song document that describes Hot Cop by the Village People. Is it valid according to the schema in Listing 24-5?

Listing 24-6: hotcop.xml

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="song.xsd">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

The answer is no, it is not. The reason is that this song was a collaboration between three different composers and the existing schema only allows a single composer. Furthermore, the price is missing. If you looked at other songs, you'd find similar problems with the other child elements. Under Pressure has two artists, David Bowie and Queen. We Are the World has dozens of artists. Many songs have multiple producers. A garage band without a publisher might record a song and post it on Napster in the hope of finding one.

The song schema needs to be adjusted to allow for varying numbers of particular elements. This is done by attaching minOccurs and maxOccurs attributes to each xsd:element element. These attributes specify the minimum and maximum number of instances of the element that may appear at that point in the document. The value of each attribute is an integer greater than or equal to zero. The maxOccurs attribute may also have the value unbounded to indicate that an unlimited number of the particular element may appear. Listing 24-7 demonstrates.

Listing 24-7: minOccurs and maxOccurs

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"
                   minOccurs="1"    maxOccurs="1"/>
      <xsd:element name="COMPOSER"  type="xsd:string"
                   minOccurs="1"    maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string"
                   minOccurs="0"    maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"    maxOccurs="1"/>
      <xsd:element name="LENGTH"    type="xsd:string"
                   minOccurs="1"    maxOccurs="1"/>
      <xsd:element name="YEAR"      type="xsd:string"
                   minOccurs="1"    maxOccurs="1"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   minOccurs="1"    maxOccurs="unbounded"/>
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0"    maxOccurs="1"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

This schema says that every SongType element must have, in order,

Exactly one TITLE (minOccurs="1" maxOccurs="1")
At least one, and possibly a great many, COMPOSERs (minOccurs="1" maxOccurs="unbounded")
Any number of PRODUCERs, although possibly no producer at all (minOccurs="0" maxOccurs="unbounded")
Either one PUBLISHER or no PUBLISHER at all (minOccurs="0" maxOccurs="1")
Exactly one LENGTH (minOccurs="1" maxOccurs="1")
Exactly one YEAR (minOccurs="1" maxOccurs="1")
At least one ARTIST, possibly more (minOccurs="1" maxOccurs="unbounded")
An optional PRICE (minOccurs="0" maxOccurs="1")

This is much more flexible and easier to use than the limited ?, *, and + that are available in DTDs. It is very straightforward to say, for example, that you want between 4 and 7 of a given element. Just set minOccurs to 4 and maxOccurs to 7.
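
For instance, a cut-down song schema that insisted on between four and seven composers might look like the sketch below. The limits are arbitrary and chosen only to illustrate the syntax; the comment shows roughly what a DTD would need to express the same constraint.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="xsd:string"/>
      <!-- Between four and seven COMPOSER elements, inclusive.
           A DTD would need a content model something like
           (COMPOSER, COMPOSER, COMPOSER, COMPOSER,
            (COMPOSER, (COMPOSER, COMPOSER?)?)?)
           to say the same thing. -->
      <xsd:element name="COMPOSER" type="xsd:string"
                   minOccurs="4" maxOccurs="7"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```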

If minOccurs and maxOccurs are not present, then the default value of each is 1. Taking advantage of this, the song schema can be written a little more compactly as shown in Listing 24-8.

Listing 24-8: Taking advantage of the default values of minOccurs and maxOccurs

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string"
                   minOccurs="0"    maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:string"/>
      <xsd:element name="YEAR"      type="xsd:string"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Element content

The examples so far have all been relatively flat. That is, a SONG element contained other elements; but those elements only contained parsed character data, not child elements of their own. Suppose, however, that some child elements do contain other elements, as in Listing 24-9. Here the COMPOSER and PRODUCER elements each contain NAME elements.

Listing 24-9: A deeper hierarchy

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="24-10.xsd">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>
    <NAME>Jacques Morali</NAME>
  </COMPOSER>
  <COMPOSER>
    <NAME>Henri Belolo</NAME>
  </COMPOSER>
  <COMPOSER>
    <NAME>Victor Willis</NAME>
  </COMPOSER>
  <PRODUCER>
    <NAME>Jacques Morali</NAME>
  </PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Because the COMPOSER and PRODUCER elements now have complex content, you can no longer use one of the built-in types such as xsd:string to declare them. Instead you have to define a new ComposerType and ProducerType using top-level xsd:complexType elements. Listing 24-10 demonstrates.

Listing 24-10: Defining separate ComposerType and ProducerType types

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="ComposerType">
    <xsd:sequence>
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="ProducerType">
    <xsd:sequence>
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="ComposerType"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="ProducerType"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH" type="xsd:string"/>
      <xsd:element name="YEAR"   type="xsd:string"/>
      <xsd:element name="ARTIST" type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Sharing content models

You may have noticed that PRODUCER and COMPOSER are very similar. Each contains a single NAME child element and nothing else. In a DTD you'd take advantage of this shared content model via a parameter entity reference. In a schema, it's much easier. Simply give them the same type. While you could declare that the PRODUCER has ComposerType or vice versa, it's better to declare that both have a more generic PersonType. Listing 24-11 demonstrates.

Listing 24-11: Using a single PersonType for both COMPOSER and PRODUCER

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="PersonType"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="PersonType"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH" type="xsd:string"/>
      <xsd:element name="YEAR"   type="xsd:string"/>
      <xsd:element name="ARTIST" type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Anonymous types

Suppose you wanted to divide the NAME elements into separate GIVEN and FAMILY elements like this:

<NAME>
  <GIVEN>Victor</GIVEN>
  <FAMILY>Willis</FAMILY>
</NAME>
<NAME>
  <GIVEN>Jacques</GIVEN>
  <FAMILY>Morali</FAMILY>
</NAME>

To declare this, you could use an xsd:complexType element to define a new NameType like this:

  <xsd:complexType name="NameType">
    <xsd:sequence>
      <xsd:element name="GIVEN"  type="xsd:string"/>
      <xsd:element name="FAMILY" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

Then the PersonType would be defined like this:

  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME" type="NameType"/>
    </xsd:sequence>
  </xsd:complexType>

However, NameType is only used for the NAME elements inside PersonType elements. Perhaps it shouldn't be a top-level definition. For instance, you may not want other declarations to reuse it for root elements, or for children of things that aren't PersonType elements. You can prevent this by giving NAME an anonymous type. To do this, instead of assigning the NAME element a type with a type attribute on the corresponding xsd:element element, you give it an xsd:complexType child element to define its type. Listing 24-12 demonstrates.

Listing 24-12: Anonymous types

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="PersonType"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="PersonType"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH" type="xsd:string"/>
      <xsd:element name="YEAR"   type="xsd:string"/>
      <xsd:element name="ARTIST" type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Defining the element types inside the xsd:element elements that are themselves children of xsd:complexType elements is a very powerful technique. Among other things, it enables you to give elements with the same name different types when used in different elements. For example, you can say that the NAME of a PERSON contains GIVEN and FAMILY child elements while the NAME of a MOVIE contains an xsd:string and the NAME of a VARIABLE contains a string containing only alphanumeric characters from the ASCII character set.
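Here's a minimal sketch of the first two cases, assuming hypothetical PERSON and MOVIE elements with PersonType and MovieType types that are not part of the song schema. The VARIABLE case is left out because it needs restrictions that haven't been introduced yet.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- PERSON, MOVIE, and MovieType are made-up names for illustration -->
  <xsd:element name="PERSON" type="PersonType"/>
  <xsd:element name="MOVIE"  type="MovieType"/>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <!-- Inside a person, NAME has GIVEN and FAMILY children -->
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="MovieType">
    <xsd:sequence>
      <!-- Inside a movie, NAME is just a string -->
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

Because both NAME declarations are local to their enclosing types, they don't conflict with each other even though they share a name.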

Mixed content

Schemas offer much greater control over mixed content than DTDs do. In particular, schemas let you enforce the order and number of elements appearing in mixed content. For example, suppose you wanted to allow extra text to be mixed in with the names to provide middle initials, titles, and the like as shown in Listing 24-13.

Caution

The format used here is purely for illustrative purposes. In practice, I'd recommend that you make the middle names and titles separate elements as well.

Listing 24-13: Mixed content

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="24-14.xsd">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>
    <NAME>
      Mr. <GIVEN>Jacques</GIVEN> <FAMILY>Morali</FAMILY> Esq.
    </NAME>
  </COMPOSER>
  <COMPOSER>
    <NAME>
      Mr. <GIVEN>Henri</GIVEN> L. <FAMILY>Belolo</FAMILY>, M.D.
    </NAME>
  </COMPOSER>
  <COMPOSER>
    <NAME>
      Mr. <GIVEN>Victor</GIVEN> C. <FAMILY>Willis</FAMILY>
    </NAME>
  </COMPOSER>
  <PRODUCER>
    <NAME>
      Mr. <GIVEN>Jacques</GIVEN> S. <FAMILY>Morali</FAMILY>
    </NAME>
  </PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

It's very easy to declare that an element has mixed content in schemas. First, set up the xsd:complexType exactly as you would if the element only contained child elements. Then add a mixed attribute to it with the value true. Listing 24-14 demonstrates. It is almost identical to Listing 24-12 except for the addition of the mixed="true" attribute.

Listing 24-14: Declaring mixed content in a schema

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType mixed="true">
          <xsd:sequence>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="PersonType"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="PersonType"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH" type="xsd:string"/>
      <xsd:element name="YEAR"   type="xsd:string"/>
      <xsd:element name="ARTIST" type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Grouping

So far, all the schemas you've seen have held that order mattered; for example, that it would be wrong to put the COMPOSER before the TITLE or the PRODUCER after the ARTIST. Given these schemas, the document shown below in Listing 24-15 is clearly invalid. But should it be? Element order often does matter in narrative documents such as books and Web pages. However, it's not nearly as important in data-centric documents like the examples in this chapter. Do you really care whether the TITLE comes first or not, as long as there is a TITLE? After all, if the document's going to be shown to a human being, it will probably first be transformed with an XSLT style sheet that can easily place the contents in any order it likes.

Listing 24-15: A song document that places the elements in a different order

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="song.xsd">
  <ARTIST>Village People</ARTIST>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>
    <NAME><GIVEN>Jacques</GIVEN> <FAMILY>Morali</FAMILY></NAME>
  </COMPOSER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <COMPOSER>
    <NAME><FAMILY>Belolo</FAMILY> <GIVEN>Henri</GIVEN></NAME>
  </COMPOSER>
  <YEAR>1978</YEAR>
  <COMPOSER>
    <NAME><FAMILY>Willis</FAMILY> <GIVEN>Victor</GIVEN></NAME>
  </COMPOSER>
  <PRODUCER>
    <NAME><GIVEN>Jacques</GIVEN> <FAMILY>Morali</FAMILY></NAME>
  </PRODUCER>
  <PRICE>$1.25</PRICE>
</SONG>

The W3C XML Schema language provides three grouping constructs that specify whether and how ordering of individual elements is important. These are:

The xsd:all group requires that each element in the group must occur at most once, but that order is not important.
The xsd:choice group specifies that any one element from the group should appear. It can also be used to say that between N and M elements from the group should appear in any order.
The xsd:sequence group requires that each element in the group appear exactly once, in the specified order.

Unfortunately, these constructs are not everything you might desire. In particular, you can't specify the constraints that would be required to really handle Listing 24-15. You can't say that you want a SONG to have exactly one TITLE, one or more COMPOSERs, zero or more PRODUCERs, and one or more ARTISTs, but that you don't care in what order the individual elements occur.

The xsd:all Group

You can specify that you want each NAME element to have exactly one GIVEN child and one FAMILY child, but that you don’t care what order they appear in. The xsd:all group accomplishes this. For example,

<xsd:complexType name="PersonType">
  <xsd:sequence>
    <xsd:element name="NAME">
      <xsd:complexType>
        <xsd:all>
          <xsd:element name="GIVEN" type="xsd:string"
                       minOccurs="1" maxOccurs="1"/>
          <xsd:element name="FAMILY" type="xsd:string"
                       minOccurs="1" maxOccurs="1"/>
        </xsd:all>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>

The extension to handle what you want for Listing 24-15 seems obvious. It would look like this:

<xsd:complexType name="SongType">
  <xsd:all>
    <xsd:element name="TITLE" type="xsd:string"
                 minOccurs="1" maxOccurs="1"/>
    <xsd:element name="COMPOSER" type="PersonType"
                 minOccurs="1" maxOccurs="unbounded"/>
    <xsd:element name="PRODUCER" type="PersonType"
                 minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="PUBLISHER" type="xsd:string"
                 minOccurs="0" maxOccurs="1"/>
    <xsd:element name="LENGTH" type="xsd:string"
                 minOccurs="1" maxOccurs="1"/>
    <xsd:element name="YEAR" type="xsd:string"
                 minOccurs="1" maxOccurs="1"/>
    <xsd:element name="ARTIST" type="xsd:string"
                 minOccurs="1" maxOccurs="unbounded"/>
    <xsd:element name="PRICE" type="xsd:string" minOccurs="0"/>
  </xsd:all>
</xsd:complexType>

Unfortunately, the W3C XML Schema language restricts the use of minOccurs and maxOccurs inside xsd:all elements. In particular, each one's value must be 0 or 1. You cannot set it to 4 or 7 or unbounded. Therefore the above type definition is invalid. Furthermore, xsd:all can only contain individual element declarations. It cannot contain xsd:choice or xsd:sequence elements. xsd:all offers somewhat more expressivity than DTDs do, but probably not as much as you want.
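For comparison, here is a sketch of a type that does stay within those limits. It assumes a simplified song that has at most one of each child, with plain string content throughout; SimpleSongType is a hypothetical name used only here.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- SimpleSongType is a made-up name for illustration -->
  <xsd:element name="SONG" type="SimpleSongType"/>
  <xsd:complexType name="SimpleSongType">
    <!-- Children may appear in any order; each occurs 0 or 1 times -->
    <xsd:all>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="xsd:string"/>
      <xsd:element name="PUBLISHER" type="xsd:string" minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:string"/>
      <xsd:element name="YEAR"      type="xsd:string"/>
      <xsd:element name="ARTIST"    type="xsd:string"/>
      <xsd:element name="PRICE"     type="xsd:string" minOccurs="0"/>
    </xsd:all>
  </xsd:complexType>
</xsd:schema>
```

This accepts the children in any order, but a song with three composers would no longer be valid.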

Choices

The xsd:choice element is the schema equivalent of the | in DTDs. When xsd:element elements are combined inside an xsd:choice, then exactly one of those elements must appear in instance documents. For example, the choice in this xsd:complexType requires either a PRODUCER or a COMPOSER, but not both.

<xsd:complexType name="SongType">
  <xsd:sequence>
    <xsd:element name="TITLE" type="xsd:string"/>
    <xsd:choice>
      <xsd:element name="COMPOSER" type="PersonType"/>
      <xsd:element name="PRODUCER" type="PersonType"/>
    </xsd:choice>
    <xsd:element name="PUBLISHER" type="xsd:string"
                 minOccurs="0"/>
    <xsd:element name="LENGTH" type="xsd:string"/>
    <xsd:element name="YEAR"   type="xsd:string"/>
    <xsd:element name="ARTIST" type="xsd:string"
                 maxOccurs="unbounded"/>
    <xsd:element name="PRICE" type="xsd:string" minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType>

The xsd:choice element itself can have minOccurs and maxOccurs attributes that establish exactly how many selections may be made from the choice. For example, setting minOccurs to 1 and maxOccurs to 6 would indicate that between one and six elements listed in the xsd:choice should appear. Each of these can be any of the elements in the xsd:choice. For example, you could have six different elements, three of the same element and three of another, or up to six of the same element. This next xsd:choice allows for any number of artists, composers, and producers. However, in order to require that there be at least one ARTIST element and at least one COMPOSER element, rather than allowing all spaces to be filled by PRODUCER elements, it's necessary to place xsd:element declarations for these two outside the choice. This has the unfortunate side-effect of locking in more order than is really needed.

<xsd:complexType name="SongType">
  <xsd:sequence>
    <xsd:element name="TITLE" type="xsd:string"/>
    <xsd:element name="COMPOSER" type="PersonType"/>
    <!-- The required ARTIST precedes the choice so that each ARTIST
         element matches exactly one declaration -->
    <xsd:element name="ARTIST" type="xsd:string"/>
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element name="PRODUCER" type="PersonType"/>
      <xsd:element name="COMPOSER" type="PersonType"/>
      <xsd:element name="ARTIST"   type="xsd:string"/>
    </xsd:choice>
    <xsd:element name="PUBLISHER" type="xsd:string"
                 minOccurs="0"/>
    <xsd:element name="LENGTH" type="xsd:string"/>
    <xsd:element name="YEAR"   type="xsd:string"/>
    <xsd:element name="PRICE" type="xsd:string" minOccurs="0"/>
  </xsd:sequence>
</xsd:complexType>

Sequences

An xsd:sequence element requires each member of the sequence to appear in the same order in the instance document as in the xsd:sequence element. I've used this frequently as the basic group for xsd:complexType elements in this chapter so far. The number of times each element is allowed to appear can be controlled by the xsd:element's minOccurs and maxOccurs attributes. You can add minOccurs and maxOccurs attributes to the xsd:sequence element to specify the number of times the sequence should repeat.
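As a sketch of a repeating sequence, consider a hypothetical PLAYLIST element that holds TITLE and ARTIST pairs. PLAYLIST and PlaylistType are assumptions made for this example and do not appear elsewhere in the chapter.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- PLAYLIST and PlaylistType are made-up names for illustration -->
  <xsd:element name="PLAYLIST" type="PlaylistType"/>
  <xsd:complexType name="PlaylistType">
    <!-- The whole TITLE, ARTIST pair repeats one or more times -->
    <xsd:sequence minOccurs="1" maxOccurs="unbounded">
      <xsd:element name="TITLE"  type="xsd:string"/>
      <xsd:element name="ARTIST" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

Here the occurrence attributes sit on the xsd:sequence itself, so a document must contain TITLE, ARTIST, TITLE, ARTIST, and so on; two TITLE elements in a row would be invalid.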

Simple Types

Until now I've focused on writing schemas that validate the element structures in an XML document. However, there's also a lot of non-XML structure in the song documents. The YEAR element isn't just a string. It's an integer, and maybe not just any integer either, but a positive integer with four digits. The PRICE element is some sort of money. The LENGTH element is a duration of time. DTDs have absolutely nothing to say about such non-XML structures that are inside the parsed character data content of elements and attributes. Schemas, however, do let you make all sorts of statements about what forms the text inside elements may take and what it means. Schemas provide much more sophisticated semantics for documents than DTDs do.

Listing 24-16 is a new schema for song documents. It's based on Listing 24-8, but read closely and you should notice that a few things have changed.

Listing 24-16: A schema with simple data types

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string"
                   minOccurs="0"    maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Did you spot the changes? The values of the type attributes of the LENGTH and YEAR declarations are no longer xsd:string. Instead, LENGTH has the type xsd:duration and YEAR has the type xsd:gYear. These declarations say that it's no longer okay for the YEAR and LENGTH elements to contain just any old string of text. Instead they must contain strings in particular formats. In particular, the YEAR element must contain a year; and the LENGTH element must contain a recognizable length of time. When you check a document against this schema, the validator will check that these elements contain the proper data. It's not just looking at the elements. It's looking at the content inside the elements!

Let's actually validate hotcop.xml against this schema and see what we get:

C:\XML>java sax.SAXCount -v hotcop.xml
[Error] hotcop.xml:10:25: Datatype error: In element 'LENGTH' :
Value '6:20' is not legal value for current datatype.
hotcop.xml: 1783 ms (10 elems, 2 attrs, 28 spaces, 98 chars)

That's unexpected! The problem is that 6:20 is not in the proper format for time durations, at least not the format that the W3C XML Schema language uses and that schema validators know how to check. Schema validators expect that time types are expressed in the format defined in ISO standard 8601, Representations of dates and times (http://www.iso.ch/markete/8601.pdf). This standard says that time durations should have the form PnYnMnDTnHnMdS, where n is an integer and d is a decimal number. P stands for "Period". nY gives the number of years; the first nM gives the number of months; and nD gives the number of days. T separates the date from the time. Following the T, nH gives the number of hours; the second nM gives the number of minutes; and dS gives the number of seconds. If d has a fraction part, then the duration can be specified to an arbitrary level of precision.

In this format, a duration of 6 minutes and 20 seconds should be written as P0Y0M0DT0H6M20S. If you prefer, the zero pieces can be left out, so you can write this more compactly as PT6M20S. Listing 24-17 shows the fixed version of hotcop.xml with the LENGTH in the right format.

Listing 24-17: fixed hotcop.xml

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="24-16.xsd">
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>P0YT6M20S</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Admittedly the ISO 8601 format for time durations is a little obscure, if precise. You may well be asking whether there's a type that you can specify for the LENGTH that would make lengths such as 6:20 and 4:24 legal. In fact, there's no such type built into the W3C XML Schema language; but you can define one yourself. You'll learn how to do that soon, but first let's explore some of the other data types that are built into the W3C XML Schema language.

There are 44 built-in simple types in the W3C XML Schema language. These can be unofficially divided into seven groups:

Numeric types
Time types
XML types
String types
The boolean type
The URI reference type
The binary types

Numeric data types

The most obvious data types, and the ones most familiar to programmers, are the numeric data types. Among computer scientists, there's quite a bit of disagreement about how numbers should be represented in computer systems. The W3C XML Schema language tries to make everyone happy by providing almost every numeric type imaginable including:

Integer and floating point numbers
Finite size numbers similar to those in Java and C and infinitely precise, unlimited-size numbers similar to those in Eiffel and Java's java.math package
Signed and unsigned numbers

You'll probably only use a subset of these. For instance, you wouldn’t use both the arbitrarily large xsd:integer type and the four-byte limited xsd:int type. Table 24-1 summarizes the different numeric types.

Table 24-1: Schema Numeric Types

Name:	Type:	Examples:
`xsd:float`	IEEE 754 32-bit floating point number, or as close as you can get using a base 10 representation; same as Java's `float` type	-INF, -1E4, -0, 0, 12.78E-2, 12, INF, NaN
`xsd:double`	IEEE 754 64-bit floating point number, or as close as you can get using a base 10 representation; same as Java's `double` type	-INF, 1.401E-90, -1E4, -0, 0, 12.78E-2, 12, INF, NaN, 3.4E42
`xsd:decimal`	Arbitrary precision decimal numbers, written without an exponent; same as `java.math.BigDecimal`	-3.1415292, -0.5, 0, 7.8, 90200.76, 123456789012345678901234567890.123456789
`xsd:integer`	An arbitrarily large or small integer; same as `java.math.BigInteger`	-500000000000000000000000, -9223372036854775809, -126789, -1, 0, 1, 5, 23, 42, 126789, 9223372036854775808, 4567349873249832649873624958
`xsd:nonPositiveInteger`	An integer less than or equal to zero	0, -1, -2, -3, -4, -5, -6, -7, -8, -9, . . .
`xsd:negativeInteger`	An integer strictly less than zero	-1, -2, -3, -4, -5, -6, -7, -8, -9, . . .
`xsd:long`	An eight-byte two's complement integer such as Java's `long` type	-9223372036854775808, -9223372036854775807, . . . -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . ., 2147483645, 2147483646, 2147483647, 2147483648, . . .9223372036854775806, 9223372036854775807
`xsd:int`	An integer that can be represented as a four-byte, two's complement number such as Java's `int` type	-2147483648, -2147483647, -2147483646, -2147483645, . . . -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . ., 2147483645, 2147483646, 2147483647
`xsd:short`	An integer that can be represented as a two-byte, two's complement number such as Java's `short` type	-32768, -32767, -32766, . . ., -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, . . . 32765, 32766, 32767
`xsd:byte`	An integer that can be represented as a one-byte, two's complement number such as Java's `byte` type	-128, -127, -126, -125, . . ., -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, . . .121, 122, 123, 124, 125, 126, 127
`xsd:nonNegativeInteger`	An integer greater than or equal to zero	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, . . .
`xsd:unsignedLong`	An eight-byte unsigned integer	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, . . .18446744073709551614, 18446744073709551615
`xsd:unsignedInt`	A four-byte unsigned integer	0, 1, 2, 3, 4, 5, . . .4294967294, 4294967295
`xsd:unsignedShort`	A two-byte unsigned integer	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . .65533, 65534, 65535
`xsd:unsignedByte`	A one-byte unsigned integer	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . . 252, 253, 254, 255
`xsd:positiveInteger`	An integer strictly greater than zero	1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . .
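To give a feel for how a few of these types might be used together, here's a minimal sketch. The TRACK element, its children, and TrackType are hypothetical names invented for this illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- TRACK, NUMBER, PLAYS, RATING, and GAIN are made-up names -->
  <xsd:element name="TRACK" type="TrackType"/>
  <xsd:complexType name="TrackType">
    <xsd:sequence>
      <!-- Track numbers start at 1, so zero and negatives are rejected -->
      <xsd:element name="NUMBER" type="xsd:positiveInteger"/>
      <!-- A play count can be zero but never negative -->
      <xsd:element name="PLAYS"  type="xsd:nonNegativeInteger"/>
      <!-- An exact decimal value such as 4.5 -->
      <xsd:element name="RATING" type="xsd:decimal"/>
      <!-- A floating point value such as -1.5E-1 -->
      <xsd:element name="GAIN"   type="xsd:float"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

Given this schema, a NUMBER of 0 or a PLAYS of -3 should be reported as invalid, while a RATING of 4.5 should pass.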

Time data types

The next set of simple types the W3C XML Schema language provides are more familiar to database designers than to procedural programmers; these are the time types. These can represent times of day, dates, or durations of time. The formats, shown in Table 24-2, are all based on the ISO standard 8601, Representations of dates and times (http://www.iso.ch/markete/8601.pdf). Time zones are given as offsets from Coordinated Universal Time (Greenwich Mean Time to laypeople) or as the letter Z to indicate Coordinated Universal Time.

Table 24-2: XML Schema Time Types

Name:	Type:	Examples:
`xsd:dateTime`	A particular moment in Coordinated Universal Time, up to an arbitrarily small fraction of a second	1999-05-31T13:20:00.000-05:00, 1999-05-31T18:20:00.000Z, 1999-05-31T13:20:00.000, 1999-05-31T13:20:00.321-05:00
`xsd:date`	A specific day in history	-0044-03-15, 0001-01-01, 1969-06-27, 2000-10-31, 2001-11-17
`xsd:time`	A specific time of day that recurs every day	14:30:00.000, 09:30:00.000-05:00, 14:30:00.000Z
`xsd:gDay`	A day in no particular month, or rather in every month	---01, ---02, . . . ---09, ---10, ---11, ---12, . . ., ---28, ---29, ---30, ---31
`xsd:gMonth`	A month in no particular year (the original 2001 Recommendation wrote these with two trailing hyphens, as in --01--; a later erratum dropped them)	--01, --02, --03, --04, . . . --09, --10, --11, --12
`xsd:gYear`	A given year	. . . -0002, -0001, 0001, 0002, 0003, . . .1998, 1999, 2000, 2001, 2002, . . .9997, 9998, 9999
`xsd:gYearMonth`	A specific month in a specific year	1999-12, 2001-04, 1968-07
`xsd:gMonthDay`	A date in no particular year, or rather in every year	--10-31, --02-28, --02-29
`xsd:duration`	A length of time, without fixed endpoints, to an arbitrary fraction of a second	P2000Y10M31DT09H32M7.4312S

Notice in particular that in all the date formats the year comes first, followed by the month, then the day, then the hour, and so on. The largest unit of time is on the left and the smallest unit is on the right. This helps avoid questions such as whether 2001-02-11 is February 11, 2001 or November 2, 2001. (It's February 11.)
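Here's a small sketch of how several of the time types might be declared side by side. SESSION, its children, and SessionType are hypothetical names; the sample values in the comments are taken from Table 24-2 and from the duration discussed earlier.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- SESSION, DAY, START, LENGTH, and LOGGED are made-up names -->
  <xsd:element name="SESSION" type="SessionType"/>
  <xsd:complexType name="SessionType">
    <xsd:sequence>
      <!-- A calendar date, for example 2001-11-17 -->
      <xsd:element name="DAY"    type="xsd:date"/>
      <!-- A time of day, for example 14:30:00.000Z -->
      <xsd:element name="START"  type="xsd:time"/>
      <!-- A length of time, for example PT6M20S -->
      <xsd:element name="LENGTH" type="xsd:duration"/>
      <!-- A specific moment, for example 1999-05-31T18:20:00.000Z -->
      <xsd:element name="LOGGED" type="xsd:dateTime"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

A DAY written as 11/17/2001 should be rejected, because it does not follow the year-month-day order.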

XML data types

The next batch of schema data types should be quite familiar. These are the types related to XML constructs themselves. Most of these types match attribute types in DTDs such as NMTOKENS or IDREF. The difference is that with schemas these types can be applied to both elements and attributes. These also include four new types related to other XML constructs: xsd:language, xsd:Name, xsd:QName, and xsd:NCName. Table 24-3 summarizes the different types.

Table 24-3: XML Schema XML Types

Name:	Type:	Examples:
`xsd:ID`	XML 1.0 `ID` attribute type; any XML name that's unique among ID type attributes and elements	`p1`, `p2`, `ss124-45-6789`, `_92`, `red`, `green`, `NT-Decl`, `seventeen`
`xsd:IDREF`	XML 1.0 `IDREF` attribute type; any XML name that's used as the value of an ID type attribute or element elsewhere in the document	`p1`, `p2`, `ss124-45-6789`, `_92`, `p1`, `p2`, `red`, `green`, `NT-Decl`, `seventeen`
`xsd:ENTITY`	XML 1.0 `ENTITY` attribute type; any XML name that's declared as an unparsed entity in the DTD	`PIC1`, `PIC2`, `PIC3`, `cow_movie`, `MonaLisa`, `Warhol`
`xsd:NOTATION`	XML 1.0 `NOTATION` attribute type; any XML name that's declared as a notation name in the schema using `xsd:notation`	`GIF`, `jpeg`, `TIF`, `pdf`, `TeX`
`xsd:IDREFS`	XML 1.0 `IDREFS` attribute type; a white space-separated list of XML names that are used as values of ID type attributes or elements elsewhere in the document	`p1 p2`, `ss124-45-6789 _92`, `red green NT-Decl seventeen`
`xsd:ENTITIES`	XML 1.0 `ENTITIES` attribute type; a white space-separated list of `ENTITY` names	`PIC1 PIC2 PIC3`
`xsd:NMTOKEN`	XML 1.0 `NMTOKEN` attribute type	`12`, `are`, `you`, `ready`, `199`
`xsd:NMTOKENS`	XML 1.0 `NMTOKENS` attribute type, a white space-separated list of name tokens	`MI NY LA CA`, `p1 p2 p3 p4 p5 p6`, `1 2 3 4 5 6`
`xsd:language`	Valid values for `xml:lang` as defined in XML 1.0	`en`, `en-GB`, `en-US`, `fr`, `i-lux`, `ama`, `ara`, `ara-EG`, `x-choctaw`
`xsd:Name`	An XML 1.0 Name, with or without colons	`set`, `title`, `rdf`, `math`, `math123`, `xlink:href`, `song:title`
`xsd:QName`	A qualified name: an optional namespace prefix and colon, followed by a local name	`song:title`, `math:set`, `xsd:element`
`xsd:NCName`	A local name without any colons	`set`, `title`, `rdf`, `math`, `tei.2`, `href`

Cross-Reference

For more details on the permissible values for elements and attributes declared to have these types, see Chapter 11.
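Since schemas let you apply these types to element content, here's a sketch of what that might look like. The CREDITS, PERSON, KEY, COMPOSED_BY, and LANGUAGE names are hypothetical and exist only for this example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- CREDITS, PERSON, KEY, COMPOSED_BY, and LANGUAGE are made-up names -->
  <xsd:element name="CREDITS" type="CreditsType"/>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <!-- Each KEY value must be unique in the document, for example p1 -->
      <xsd:element name="KEY"  type="xsd:ID"/>
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="CreditsType">
    <xsd:sequence>
      <xsd:element name="PERSON" type="PersonType"
                   maxOccurs="unbounded"/>
      <!-- A list such as "p1 p2"; each name must match some ID value -->
      <xsd:element name="COMPOSED_BY" type="xsd:IDREFS"/>
      <!-- A language code such as en-US -->
      <xsd:element name="LANGUAGE"    type="xsd:language"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

If you need to stay compatible with DTD-based tools, keep in mind that they only understand ID and IDREF on attributes, not on elements.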

String data types

You've already encountered the xsd:string type. It's the most generic simple type. It requires a sequence of Unicode characters of any length, which is exactly what all XML element content and attribute values are. There are also two very closely related types: xsd:normalizedString and xsd:token. These are the same as xsd:string except that they limit the amount, location, and type of white space that can be used. Table 24-4 summarizes the string data types.

Table 24-4: XML Schema String Types

Name:	Type:	Examples:
`xsd:string`	A sequence of zero or more Unicode characters that are allowed in an XML document; essentially the only forbidden characters are most of the C0 controls, surrogates, and the noncharacters U+FFFE and U+FFFF	`p1`, `p2`, `123 45 6789`, `^&^&_92`, `red green blue`, `NT-Decl`, `seventeen`, `Mary had a little lamb`, `The love of money is the root of all Evil.`, `Would you paint the lily?`, `Would you gild gold?`
`xsd:normalizedString`	A string that does not contain any tabs, carriage returns, or linefeeds	`PIC1`, `PIC2`, `PIC3`, `cow_movie`, `MonaLisa`, `Hello World`, `Warhol`, `red green`
`xsd:token`	A string with no leading or trailing white space, no tabs, no carriage returns, no linefeeds, and not more than one consecutive space	`p1 p2`, `ss123 45 6789`, `_92`, `red`, `green`, `NT Decl`, `seventeen`, `p1`, `p2`, `123 45 6789`, `^&^&_92`, `red green blue`, `NT-Decl`, `Mary had a little lamb`, `The love of money is the root of all Evil.`

Binary types

It's impossible to include arbitrary binary files in XML documents because they might contain illegal characters such as a form feed or a null that would make the XML document malformed. Therefore, any such data must first be encoded in legal characters. The W3C XML Schema Language supports two such encodings, xsd:base64Binary and xsd:hexBinary.

Hexadecimal binary encodes each byte of the input as two hexadecimal digits — 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F, 10, 11, 12, and so on. Thus, an entire file can be encoded using only the digits 0 through 9 and the letters A through F. (Lowercase letters are also allowed, but uppercase letters are customary.) On the other hand, each byte is replaced by two bytes so this encoding doubles the size of the data. It's not a very efficient encoding. Hexadecimal binary encoded data tends to look like this:

A4E345EC54CC8D52198000FFEA6C807F41F332127323432147A89979EEF3

Base64 encoding uses a more complex algorithm and a larger character set, 65 ASCII characters chosen for their ability to pass through almost all gateways, mail relays, and terminal servers intact, as well as their existence with the same code points in ASCII, EBCDIC, and most other common character sets. Base64 encodes every three bytes as four characters, typically only increasing file size by a third, so it's somewhat more efficient than xsd:hexBinary. Base64 encoded data tends to look something like this:

6jKpNnmkkWeArsn5Oeeg2njcz+nXdk0f9kZI892ddlR8Lg1aMhPeFTYuoq3I6n BjWzuktNZKiXYBfKsSTB8U09dTiJo2ir3HJuY7eW/p89osKMfixPQsp9vQMgzph6Qa lY7j4MB7y5ROJYsTr1/fFwmj/yhkHwpbpzed1LE=

XML Digital Signatures use Base64 encoding to encode the binary signatures before wrapping them in an XML element.
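
As a rough illustration of the two encodings, a schema might declare one element of each binary type as follows. The element names BINARY_SAMPLES, DIGEST, and THUMBNAIL are hypothetical and aren't part of the song examples; the sample values in the comments both encode the five ASCII bytes of the word Hello.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical container element used only for this sketch -->
  <xsd:element name="BINARY_SAMPLES">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Two hexadecimal digits per byte, for example
             <DIGEST>48656C6C6F</DIGEST> -->
        <xsd:element name="DIGEST"    type="xsd:hexBinary"/>
        <!-- Four Base64 characters per three bytes, padded with =,
             for example <THUMBNAIL>SGVsbG8=</THUMBNAIL> -->
        <xsd:element name="THUMBNAIL" type="xsd:base64Binary"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Here five bytes of input become ten characters in hexadecimal binary but only eight in Base64.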

Caution

I really discourage you from using either of these if at all possible. If you have binary data, it's much more efficient and much less obtuse to link to it using XLink or unparsed entities rather than encoding it in Base64 or hexadecimal binary.

Miscellaneous data types

There are two types left over that don’t fit neatly into the previous categories: xsd:boolean and xsd:anyURI. The xsd:boolean type represents something similar to C++'s bool data type. It has four legal values: 0, 1, true, and false. 0 is considered to be the same as false, and 1 is considered the same as true.

The final schema simple type is xsd:anyURI. An element of this type contains a relative or absolute URI, possibly a URL, such as urn:isbn:0764547607, http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/#timeDuration, /javafaq/reports/JCE1.2.1.htm, /TR/2000/WD-xmlschema-2-20000407/, or ../index.html.
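
The following sketch shows how these two types might be used in a schema. The element names EXPLICIT and HOMEPAGE are hypothetical and don't appear in the song schemas developed in this chapter.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical declarations, for illustration only -->
  <xsd:element name="EXPLICIT" type="xsd:boolean"/>
  <xsd:element name="HOMEPAGE" type="xsd:anyURI"/>
  <!-- Instances that should validate:
         <EXPLICIT>true</EXPLICIT>
         <EXPLICIT>0</EXPLICIT>
         <HOMEPAGE>http://www.w3.org/</HOMEPAGE>
       An instance that should not validate, because the four
       legal xsd:boolean values are case sensitive:
         <EXPLICIT>TRUE</EXPLICIT>
  -->
</xsd:schema>
```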

Caution

Xerces 1.4.0 doesn’t yet accept relative URLs in elements and attributes with the type xsd:anyURI. This is scheduled to be fixed in Xerces 1.4.1.

Deriving Simple Types

You're not limited to the 44 simple types that the W3C XML Schema Language defines. As in object-oriented programming languages, you can create new data types by deriving from the existing types. The most common such derivation is to restrict a type to a subset of its normal values. For instance, you can define an integer type that only holds numbers between 1 and 20 by deriving from xsd:positiveInteger. You can create enumerated types that only allow a finite list of fixed values. You can create new types that join together the ranges of existing types through a union. For instance you can derive a type that can hold either an xsd:date or an xsd:int.

New simple types are created by xsd:simpleType elements, just as new complex types are created by xsd:complexType elements. The name attribute of xsd:simpleType assigns a name to the new type by which it can be referred to in xsd:element type attributes. The allowed content of elements and attributes with the new type can be specified by one of three child elements:

xsd:restriction to select a subset of the values allowed by the base type
xsd:union to combine multiple types
xsd:list to specify a white space-separated list of values of an existing simple type

Deriving by restriction

To create a new type by restricting from an existing type you give the xsd:simpleType element an xsd:restriction child element. The base attribute of this element specifies what type you're restricting. For example, this xsd:simpleType element creates a new type named phonoYear that's derived from xsd:gYear:

<xsd:simpleType name="phonoYear">
  <xsd:restriction base="xsd:gYear">
  </xsd:restriction>
</xsd:simpleType>

With this declaration any legal xsd:gYear is also a legal phonoYear, and any illegal year is also an illegal phonoYear. You can limit phonoYear to a subset of the normal year values by using facets to specify which values are and are not allowed. For instance, the minInclusive facet defines the minimum legal value for a type. This facet is added to a restriction as an xsd:minInclusive child element. The value attribute of the xsd:minInclusive element sets the minimum allowed value for the year:

<xsd:simpleType name="phonoYear">
  <xsd:restriction base="xsd:gYear">
    <xsd:minInclusive value="1877"/>
  </xsd:restriction>
</xsd:simpleType>

Here the value of xsd:minInclusive is set to 1877, the year Thomas Edison invented the phonograph. Thus, 1877 is a legal phonoYear, 1878 is a legal phonoYear, 2001 is a legal phonoYear, and 3005 is a legal phonoYear. However, 1876, 1875, 1874, and earlier years are not legal phonoYears, even though they are legal xsd:gYears.

Once the phonoYear type has been defined, you can use it just like one of the built-in types. For example, in the SONG schema, you'd declare that the year element has the type phonoYear like this:

<xsd:element name="YEAR" type="phonoYear"/>

minInclusive is not the only facet you can apply to xsd:gYear. Other facets of xsd:gYear are:

xsd:minExclusive: the minimum value that all instances must be strictly greater than
xsd:maxInclusive: the maximum value that all instances must be less than or equal to
xsd:maxExclusive: the maximum value that all instances must be strictly less than
xsd:enumeration: a list of all legal values
xsd:whiteSpace: how white space is treated within the element
xsd:pattern: a regular expression to which the instance is compared

Each facet is represented as an empty element inside an xsd:restriction element. Each facet has a value attribute giving the value of that facet. One restriction can contain more than one facet. For example, this xsd:simpleType element defines a phonoYear as any year between 1877 and 2100, inclusive:

<xsd:simpleType name="phonoYear">
  <xsd:restriction base="xsd:gYear">
    <xsd:minInclusive value="1877"/>
    <xsd:maxInclusive value="2100"/>
  </xsd:restriction>
</xsd:simpleType>

It's possible for multiple facets to conflict. For instance, the minInclusive value could be 2100 and the maxInclusive value could be 1877. The schema document would still be well-formed XML, but no year can be both at least 2100 and at most 1877, so no phonoYear element could ever be valid. The W3C XML Schema Language treats a minInclusive value greater than the maxInclusive value as an error in the schema itself, so a conforming schema processor should report the mistake rather than quietly accept an empty type.
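
Returning to the integer type mentioned at the start of this section, one that only holds numbers between 1 and 20, a restriction of xsd:positiveInteger might look like the following sketch. The type name OneToTwenty is hypothetical.

```xml
<xsd:simpleType name="OneToTwenty">
  <xsd:restriction base="xsd:positiveInteger">
    <!-- xsd:positiveInteger already excludes 0 and negative
         numbers, so only the upper bound needs a facet -->
    <xsd:maxInclusive value="20"/>
  </xsd:restriction>
</xsd:simpleType>
```

With this definition 1, 7, and 20 should be legal values, while 0 and 21 should not.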

Facets

Facets are shared among many types. For instance, the minInclusive facet can constrain essentially any well-ordered type, including not only xsd:gYear, but also xsd:byte, xsd:unsignedByte, xsd:integer, xsd:positiveInteger, xsd:negativeInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:int, xsd:unsignedInt, xsd:long, xsd:unsignedLong, xsd:short, xsd:unsignedShort, xsd:decimal, xsd:float, xsd:double, xsd:time, xsd:dateTime, xsd:duration, xsd:date, xsd:gMonth, xsd:gYearMonth, and xsd:gMonthDay. The complete list of constraining facets that can be applied to different types is:

xsd:minInclusive: the value that all instances must be greater than or equal to
xsd:minExclusive: the value that all instances must be strictly greater than
xsd:maxInclusive: the value that all instances must be less than or equal to
xsd:maxExclusive: the value that all instances must be strictly less than
xsd:enumeration: a list of all legal values
xsd:whiteSpace: how white space is treated within the element
xsd:pattern: a regular expression to which the instance is compared
xsd:length: the exact number of characters in the element
xsd:minLength: the minimum number of characters allowed in the element
xsd:maxLength: the maximum number of characters allowed in the element
xsd:totalDigits: the maximum number of digits allowed in the element
xsd:fractionDigits: the maximum number of digits allowed in the fractional part of the element

Not all facets apply to all types. For instance it doesn’t make much sense to talk about the minimum value of an xsd:NMTOKEN or the number of fraction digits in an xsd:gYear. However, when the same facet is shared by different types, it has the same syntax and basic meaning for all the types.

Facets for strings: length, minLength, maxLength

The three length facets (xsd:length, xsd:minLength, and xsd:maxLength) apply to the xsd:string type and its subtypes, such as xsd:normalizedString, xsd:token, xsd:language, xsd:NCName, xsd:ID, xsd:IDREF, xsd:ENTITY, and xsd:NMTOKEN. They also apply to xsd:hexBinary, xsd:base64Binary, xsd:QName, xsd:anyURI, and xsd:NOTATION, and to the list types xsd:IDREFS, xsd:ENTITIES, and xsd:NMTOKENS. For most of these types the facets specify the number of characters allowed in the element or attribute value. For the two binary types they count the bytes of decoded data, and for the list types they count the items in the list. The value attribute of each of these facets must contain a nonnegative integer. xsd:length sets the exact length of the value, whereas xsd:minLength sets the minimum length and xsd:maxLength sets the maximum length.

For example, the schema in Listing 24-18 uses the xsd:minLength and xsd:maxLength facets to derive a new Str255 data type from xsd:string. Whereas xsd:string allows strings of any length from zero on up, Str255 requires each string to have a minimum length of 1 and a maximum length of 255. The schema then assigns this data type to all the names and titles to indicate that each must contain between 1 and 255 characters:

Listing 24-18: A schema that derives a Str255 data type from xsd:string

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="Str255">
    <xsd:restriction base="xsd:string">
      <xsd:minLength value="1"/>
      <xsd:maxLength value="255"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="Str255"/>
      <xsd:element name="COMPOSER"  type="Str255"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="Str255"
                   minOccurs="0"    maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="Str255"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="Str255"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
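
The xsd:length facet is used the same way when a value must have one exact length. The sketch below assumes a hypothetical CountryCode type that is not part of the song schema.

```xml
<xsd:simpleType name="CountryCode">
  <xsd:restriction base="xsd:string">
    <!-- Exactly two characters: US and FR are legal,
         but U and USA are not -->
    <xsd:length value="2"/>
  </xsd:restriction>
</xsd:simpleType>
```

Note that this only constrains the length. It does nothing to require that the two characters be letters; a value such as 42 would also be accepted.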

The whiteSpace facet

The whiteSpace facet is unusual. Unlike the other 11 facets, xsd:whiteSpace does not in any way constrain the allowed content of elements. Instead, it suggests what the application should do with any white space that it finds in the instance document. It says how significant that white space is. However, it does not in any way say that any particular kind of white space is legal or illegal.

The xsd:whiteSpace facet has three possible values:

preserve: The white space in the input document is unchanged.
replace: Each tab, carriage return, and linefeed is replaced with a single space.
collapse: Each tab, carriage return, and linefeed is replaced with a single space. Furthermore, after this replacement is performed, all runs of multiple spaces are condensed to a single space. Leading and trailing white space is deleted.

Again, these are all just hints to the application. None of them have any effect on validation.

The whiteSpace facet can only be applied to xsd:string, xsd:normalizedString, and xsd:token types. Furthermore, it only fully applies to elements. XML 1.0 requires that parsers replace all white space in attributes, and collapse white space in attributes whose type is anything other than CDATA, regardless of what the schema says.

The schema in Listing 24-19 uses the xsd:whiteSpace facet to derive a new CollapsedString data type from xsd:string. Then it assigns this data type to all the names and titles to indicate that white space should be collapsed in these elements:

Listing 24-19: A schema that suggests collapsing white space in elements

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:simpleType name="CollapsedString">
    <xsd:restriction base="xsd:string">
       <xsd:whiteSpace value="collapse"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="CollapsedString"/>
      <xsd:element name="COMPOSER"  type="CollapsedString"
        maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="CollapsedString"
        minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="CollapsedString"
        minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="CollapsedString"
        maxOccurs="unbounded"/>
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
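
To make the effect of collapse concrete, consider how a TITLE element with the CollapsedString type might be treated. The layout of the first element is a made-up example.

```xml
<!-- As written in the instance document -->
<TITLE>
     Yes    I
  Am
</TITLE>

<!-- What a schema-aware application should treat it as
     once white space has been collapsed -->
<TITLE>Yes I Am</TITLE>
```

With the value replace instead of collapse, each linefeed would become a space, but the runs of spaces and the leading and trailing white space would remain.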

Facets for decimal numbers: totalDigits and fractionDigits

When formatting numbers, it's useful to be able to specify how many digits should be used in the entire number, the integer parts, and the fraction parts. Schemas don’t go as far in this regard as the printf() function in C or the java.text.DecimalFormat class in Java, but they do offer you some control.

The xsd:totalDigits facet specifies the maximum number of decimal digits in a number. It applies to most numeric types including xsd:byte, xsd:unsignedByte, xsd:integer, xsd:positiveInteger, xsd:negativeInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:int, xsd:unsignedInt, xsd:long, xsd:unsignedLong, xsd:short, xsd:unsignedShort, and xsd:decimal. The only exceptions are the IEEE 754 types that occupy a fixed number of bytes; that is, xsd:float and xsd:double. The value of this facet must be a positive integer.

The xsd:fractionDigits facet specifies the maximum number of decimal digits to the right of the decimal point. (There is no facet that allows you to specify the minimum number of digits or fraction digits.) This only really applies to xsd:decimal. Technically, it applies to all the integer types too, but for those types it's fixed to the value zero; that is, no fraction digits at all. You're only allowed to change it for xsd:decimal. The value of this facet must be a nonnegative integer.
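
A minimal sketch of the two facets used together follows. The type name ShortDecimal is hypothetical.

```xml
<xsd:simpleType name="ShortDecimal">
  <xsd:restriction base="xsd:decimal">
    <!-- At most five digits altogether -->
    <xsd:totalDigits value="5"/>
    <!-- At most two digits after the decimal point -->
    <xsd:fractionDigits value="2"/>
  </xsd:restriction>
</xsd:simpleType>
<!-- Legal:   1.25   999.99   12345
     Illegal: 1.255    (three fraction digits)
              1234.56  (six digits in total) -->
```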

The enumeration facet

Rather than setting some sort of range on legal values, the xsd:enumeration facet simply lists all allowed values. It applies to every simple type except xsd:boolean. The syntax is a little unusual. Each possible value gets its own xsd:enumeration element as a child of the xsd:restriction element.

Listing 24-20 uses an enumeration to derive a PublisherType from xsd:string. It requires that the publisher be one of the oligopoly that controls 90 percent of all U.S. music (Warner-Elektra-Atlantic, Universal Music Group, Sony Music Entertainment, Inc., Capitol Records, Inc., and BMG Music).

Listing 24-20: A schema that uses an enumeration to derive a type from xsd:string

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="songType"/>
  <xsd:simpleType name="PublisherType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="Warner-Elektra-Atlantic"/>
      <xsd:enumeration value="Universal Music Group"/>
      <xsd:enumeration value="Sony Music Entertainment, Inc."/>
      <xsd:enumeration value="Capitol Records, Inc."/>
      <xsd:enumeration value="BMG Music"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:complexType name="songType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="xsd:string"
        maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string"
        minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="PublisherType"
        minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="xsd:string"
        maxOccurs="unbounded"/>
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

xsd:string is far from the only type you can derive from via enumeration. You can derive from xsd:int, xsd:NMTOKEN, xsd:date, and, indeed, from all simple types except xsd:boolean. Of course, the enumerated values all have to be legal instances of the base type.
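
For instance, an enumeration derived from a numeric type might look like the following sketch. The RecordSpeed type is hypothetical and is not used elsewhere in the song schema.

```xml
<xsd:simpleType name="RecordSpeed">
  <xsd:restriction base="xsd:positiveInteger">
    <!-- Each value must itself be a legal xsd:positiveInteger -->
    <xsd:enumeration value="33"/>
    <xsd:enumeration value="45"/>
    <xsd:enumeration value="78"/>
  </xsd:restriction>
</xsd:simpleType>
```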

The pattern facet

There's one element in the song examples that clearly deserves a data type, but so far doesn’t have one — PRICE. However none of the built-in data types really match the format for prices. Recall that PRICE elements look like this:

<PRICE>$1.25</PRICE>

This isn’t an integer of any kind, because it has a decimal point. It could be a floating point number, but that wouldn’t account for the currency sign. You could drop off the currency sign like this:

<PRICE>1.25</PRICE>

However, then you'd have to assume you were working in dollars. What if you wanted to sell songs priced in pounds or yen or lira? Perhaps you could make the currency sign part of a separate element, like this:

<PRICE>
  <CURRENCY>$</CURRENCY>
  <AMOUNT>1.25</AMOUNT>
</PRICE>

AMOUNT could be an xsd:float, and CURRENCY could be an xsd:string. However, this still isn’t perfect. You want to limit the CURRENCY to exactly one character, and that character must be a currency sign. You don't want to allow it to contain any arbitrary string. Furthermore, you'd like to limit the precision of the AMOUNT to exactly two decimal places. You probably don’t want to sell songs that cost $1.1 or $1.99999.

The solution to this problem, and to all similar problems where the values you want to allow don’t quite fit any of the existing types, is to use the xsd:pattern facet whose value attribute contains a regular expression that matches all legal values and doesn’t match any illegal values.

The regular expressions used in schemas are similar to the regular expressions you might be familiar with from Perl, grep, or other languages. You use statements like [A-Z]+ to mean "a string containing one or more of the capital letters from A to Z" or (club)* to mean "a string composed of zero or more repetitions of the word club".

Table 24-5 summarizes the grammar of XML schema regular expressions. In this table A and B represent some string or another regular expression particle from elsewhere in the table; that is, they will be replaced by something else when actually used in a regular expression. n and m represent some integer that will be replaced by a specific number.

Table 24-5: Regular Expression Symbols for XML Schema

Symbol:	Meaning:
`A?`	Zero or one occurrences of A
`A*`	Zero or more occurrences of A
`A+`	One or more occurrences of A
`A{n,m}`	Between n and m occurrences of A
`A{n}`	Exactly n occurrences of A
`A{n,}`	At least n occurrences of A
`A|B`	Either A or B
`AB`	A followed by B
`.`	Any one character
`\p{A}`	One character from Unicode character class A
`[abcdefg]`	A single occurrence of any of the characters contained in the brackets
`[^abcdefg]`	A single occurrence of any of the characters not contained in the brackets
`[a-z]`	A single occurrence of any character from a to z inclusive
`[^a-z]`	A single occurrence of any character except those from a to z inclusive
`\n`	Linefeed
`\r`	Carriage return
`\t`	Tab
`\\`	The backward slash \
`\|`	The vertical bar |
`\.`	The period .
`\-`	The hyphen -
`\^`	The caret ^
`\?`	The question mark ?
`\*`	The asterisk *
`\+`	The plus sign +
`\{`	The open brace {
`\}`	The closing brace }
`\(`	The open parenthesis (
`\)`	The closing parenthesis )
`\[`	The open bracket [
`\]`	The close bracket ]

For the most part, these symbols have exactly the same meanings that they have in Perl. The schema regular expression syntax is somewhat weaker than Perl's, but then whose isn’t? In any case, this should be sufficient power to meet any reasonable needs that schemas have.
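
As a small worked example of the symbols in Table 24-5, the sketch below defines a hypothetical CatalogNumber type; neither the type nor the catalog number format comes from the song examples. One difference from Perl is worth keeping in mind: a schema pattern has to match the entire value, so no ^ or $ anchors are used.

```xml
<xsd:simpleType name="CatalogNumber">
  <xsd:restriction base="xsd:string">
    <!-- [A-Z]{2,4}   two to four capital letters
         \-           a hyphen
         [0-9]{3,6}   three to six digits
         Legal:   AB-123   ISLD-004512
         Illegal: ab-123   AB123   AB-12 -->
    <xsd:pattern value="[A-Z]{2,4}\-[0-9]{3,6}"/>
  </xsd:restriction>
</xsd:simpleType>
```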

Schema regular expressions do have one important feature that isn’t available prior to Perl 5.6 and is unfamiliar to most developers — you can use \p{} to stand in for a character in a particular Unicode character class. For instance, N is the Unicode character class for numbers. This doesn’t just include the European digits 0 through 9, but also the Arabic-Indic digits, the Devanagari digits, the Thai digits, and many more besides. Therefore \p{N} represents any digit defined anywhere in Unicode. \p{N}+ represents a string consisting of one or more Unicode digits. Table 24-6 lists the various Unicode character classes you can take advantage of in regular expressions. For the money regular expression, you need the Sc class for currency indicators and the Nd class for decimal digits. This is a little more restrictive than the N class, which includes nondecimal digits, such as the Roman numerals and the Han ideograph representing 100,000,000.

Table 24-6: Unicode Character Classes

Abbreviation:	Includes:	Examples:
Letters:
L	All Letters	a, b, c, A, B, C, ü, Ü, ç, Ç, ζ, θ, Ζ, Θ, а, б, в, А, Б, В, א, ב, ג, ǳ, ǲ, Ǳ
Lu	Uppercase letters	A, B, C, Ü, Ç, Ζ, Θ, А, Б, В, Ǳ
Ll	Lowercase letters	a, b, c, ü, ç, ζ, θ, а, б, в, ǳ
Lt	Title case letters	ǲ
Lm	Modifier letters; letters that are attached to the previous characters somehow	ʰ, ʲ, ʳ, ʷ
Lo	Other letters; typically ones from languages that don’t distinguish upper- and lowercase	א, ב, ג, Japanese Katakana and Hiragana, most Han ideographs
Marks:
M	All Marks
Mn	Nonspacing marks; mostly accent marks that are attached to the previous character on the top or bottom, and thus do not change the amount of space the character occupies	`, ', ¨, ¯
Mc	Spacing combining marks; accent marks that are attached to the previous character on the left or right, and thus do change the amount of space the character occupies	^T, Gurmukhi vowel sign AA
Me	Enclosing marks that completely surround a character	The Cyrillic hundred thousands and millions signs
Numbers:
N	All numbers	0, 1, 2, 3, ¼, ½, ², ³, ٠, ٩, Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ, 〡, 〢, 〣, 〤
Nd	Decimal digits; characters that represent one of the numbers 0 through 9	0, 1, 2, 3, ٠, ٩
Nl	Numbers based on letters	Ⅰ, Ⅱ, Ⅲ, Ⅳ, 〡, 〢, 〣, 〤
No	Other numbers	¼, ½, ², ³
Punctuation:
P	All punctuation	-, _, ・, (, [, {, ), ], }, ‘, “, «, ’, ”, », !, ?, @, *, ¡, ¿, ·
Pc	Connectors	_, ・
Pd	Dashes	Hyphens, soft hyphens, em dashes, en dashes, etc.
Ps	Opening punctuation	(, [, {
Pe	Closing punctuation	), ], }
Pi	Initial quote marks	‘, “, «
Pf	Final quote marks	’, ”, »
Po	Other punctuation marks	!, ?, @, *, ¡, ¿, ·
Separators:
Z	All separators
Zs	Space	Space, non-breaking space, en space, em space
Zl	Line separators	Unicode character 2028, the line separator
Zp	Paragraph separators	Unicode character 2029, the paragraph separator
Symbols:
S	All Symbols	∂, ∆, ∏, $, ¥, £, ~, ¯, ¨, @@i, ©, ®, °, ╟▲, ☺
Sm	Mathematical symbols	∂, ∆, ∏, ∑, √, ≠, ≤, ≥, ≈
Sc	Currency signs	$, ¥, £, ¤, €, ₣, ₤, ₧, ₪, ₫
Sk	Modifier symbols	~, ¯, ¨
So	Other symbols	@@i, ©, ®, °, §, ¶, ↔, ℅, ℓ, @@N, ╓, ╗,╟▲, ☺, ♀, ♂, ♠, ♪, Braille, Han radicals
Other:
C	All Others
Cc	Control characters	Carriage return, line feed, tab and the C1 controls
Cf	Format characters	The left-to-right and right-to-left marks used to indicate change of direction in bidirectional text
Co	Private use characters; code points which may be used for a program's internal purposes
Cn	Unassigned; code points which, while legal in XML, the Unicode specification has not yet assigned a character to.

You're now ready to put together a regular expression that describes money strings such as $1.25. What you want to say is that each such string contains:

1. A currency symbol
2. One or more decimal digits
3. An optional fractional part which, if present at all, consists of a decimal point and two decimal digits

Here's the regular expression that says that:

\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?

It begins with \p{Sc} to indicate a currency symbol such as $, ¥, £, or ¤.

This is followed by \p{Nd}+. \p{Nd} represents any decimal digit character. The + indicates one or more of these characters.

Next there's a parenthesized expression followed by a question mark, (\.\p{Nd}\p{Nd})?. The question mark indicates the parenthesized expression is optional. However, if it does appear its entire contents must be present, not just part. In other words, the question mark stands for zero or one, just as it does in DTDs. The contents of the parentheses are \.\p{Nd}\p{Nd}, which represents a period followed by two decimal digits, for example .35. Normally a period in a regular expression means any character at all, so here it's escaped with a preceding backslash to indicate that we really do want the actual period character.

Now that you have a regular expression that represents money, you're ready to define a money type. As for the other facets, this is done with the xsd:simpleType and xsd:restriction elements. Putting these together with the regular expression produces this type definition:

<xsd:simpleType name="money">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
  </xsd:restriction>
</xsd:simpleType>
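
To see what this pattern accepts and rejects, here are a few sample PRICE elements, assuming PRICE is declared with the money type. The specific prices are made up for illustration.

```xml
<!-- Should validate against the money type -->
<PRICE>$1.25</PRICE>
<PRICE>£12</PRICE>
<PRICE>¥300.00</PRICE>

<!-- Should not validate -->
<PRICE>1.25</PRICE>    <!-- no currency sign -->
<PRICE>$1.1</PRICE>    <!-- only one digit after the decimal point -->
<PRICE>$ 1.25</PRICE>  <!-- space between the sign and the digits -->
<PRICE>$.99</PRICE>    <!-- no digit before the decimal point -->
```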

Listing 24-21 provides the complete song schema including this type definition. Take special note of the XML comment used to elucidate the regular expression. Regular expressions can be quite opaque, and a comment like this one can go a long way toward making the schema more understandable.

Listing 24-21: A schema that defines a custom money type

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:simpleType name="money">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
      <!--
         Regular Expression:
         \p{Sc}             Any Unicode currency indicator;
                            e.g., $, &#xA5;, &#xA3;, &#xA4;, etc.
         \p{Nd}             A Unicode decimal digit character
         \p{Nd}+            One or more Unicode decimal digits
         \.                 The period character
         (\.\p{Nd}\p{Nd})   A period followed by two decimal digits
         (\.\p{Nd}\p{Nd})?  Zero or one strings of the form .35
         This works for any decimalized currency.
      -->
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="PersonType"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="PersonType"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="money" maxOccurs="1"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:all>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:all>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Unions

Restriction is not the only way to create a new simple type, although it is the most common way. You can also combine types using unions. For example, you could combine the built-in xsd:decimal type with the money type just defined to create a type that could contain either a decimal or a money value. To do this, give the xsd:simpleType element an xsd:union child element instead of an xsd:restriction child element. The xsd:union element contains more xsd:simpleType elements identifying the types you're combining in the union. For example, this is the above described money/xsd:decimal combined type:

<xsd:simpleType name="MoneyOrDecimal">
  <xsd:union>
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
      </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:union>
</xsd:simpleType>
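
When the member types already have names, xsd:union also accepts a memberTypes attribute containing a white space-separated list of type names, which can be more compact than nested xsd:simpleType elements. The sketch below assumes that the money type from Listing 24-21 is defined in the same schema.

```xml
<xsd:simpleType name="MoneyOrDecimal">
  <xsd:union memberTypes="xsd:decimal money"/>
</xsd:simpleType>
<!-- Legal values include 1.25 (as an xsd:decimal) and
     $1.25 (as a money); "one dollar" is legal as neither -->
```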

Lists

Schemas can also specify that an element or attribute contains a list of a particular simple type. For example, this YEARS element contains a list of years:

<YEARS>1987 1999 1992   2002</YEARS>

Elements such as this can be specified using an xsd:list in the xsd:simpleType. The itemType attribute says what type of strings may appear in the list. For example:

<xsd:simpleType name="YearList">
  <xsd:list itemType="xsd:gYear"/>
</xsd:simpleType>

requires that elements with type YearList contain a white space-separated list of legal xsd:gYear values.

Caution

I must admit that I'm not very fond of list types, especially for elements. It seems to me that if you're going to have a list of different items, each of those items should be a separate element, possibly a child element of some parent element, but still its own element. Lists make a little more sense for attributes, but if there's a lot of substructure in the text, you should probably be using an element instead of an attribute anyway.

You can derive another list type from an existing list type. When so doing, you can restrict it according to the length, minLength, maxLength, and enumeration facets. In this case, the values of the three length facets refer to the number of items in the list rather than the number of characters in the content. For example, this xsd:simpleType element derives a DoubleYear list type that must hold exactly two years from the YearList type defined above:

<xsd:simpleType name="DoubleYear">
  <xsd:restriction base="YearList">
    <xsd:length value="2"/>
  </xsd:restriction>
</xsd:simpleType>
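
As a quick illustration, suppose a YEARS element is declared with this type. The declaration below is hypothetical; it is followed by one instance that should validate and one that should not.

```xml
<!-- In the schema: a hypothetical declaration using DoubleYear -->
<xsd:element name="YEARS" type="DoubleYear"/>

<!-- In an instance document: exactly two xsd:gYear items,
     so this should validate -->
<YEARS>1987 1999</YEARS>

<!-- Four items, so this should not validate as a DoubleYear,
     although it would be a legal YearList -->
<YEARS>1987 1999 1992 2002</YEARS>
```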

Empty Elements

Empty elements are those that cannot contain any child elements or parsed character data. This is the same as using the EMPTY content model in a DTD. As an example of this technique I'll define an empty PHOTO element. This will be used in the next section when attributes are introduced.

To create an empty element, you define a complex type for it, but don’t give that type an xsd:sequence, xsd:all, or xsd:choice child. Thus, the type doesn’t actually allow any child elements. For example:

  <!-- An empty element -->
  <xsd:complexType name="PhotoType">
  </xsd:complexType>
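
A PHOTO element could then be declared with this type in the usual way. This is only a sketch of the declaration; the attributes are added in the next section.

```xml
<xsd:element name="PHOTO" type="PhotoType"/>
<!-- An instance with content, such as
     <PHOTO>guitar.jpg</PHOTO>
     should be rejected, because PhotoType allows neither
     child elements nor character data -->
```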

Caution

This does not require the PHOTO element to be written with an empty-element tag such as <PHOTO/>. The start-tag-end-tag pair <PHOTO></PHOTO> is also acceptable. In fact, the XML 1.0 specification says these two forms are equivalent. Schemas change nothing about XML 1.0. An XML 1.0 parser that knows nothing about schemas will have no trouble reading a document that uses schemas.

Attributes

In the examples so far, two XML constructs have been conspicuous by their absence: entities and attributes. The omission of entities was quite deliberate. Schemas cannot declare entities. If you need entities, you must use a DTD. (Of course, you can use a schema as well as the DTD.) However, schemas are fully capable of declaring attributes. Indeed they do a much better job of it than DTDs do because schemas can use the full set of data types like xsd:float and xsd:anyURI.

Note

You may not have noticed my avoidance of attributes because the examples all used xmlns:xsi and xsi:noNamespaceSchemaLocation attributes on the root element. However, as far as a schema validator is concerned, attributes used to declare namespaces, or to attach documents to schemas, "don't count". You do not have to, and indeed should not, declare these attributes. However, you do have to declare all the other attributes you use.

As a concrete example, let's consider how you might add an empty PHOTO element to the SONG documents. This element would be similar to the IMG element in HTML, and have an SRC attribute that contained a URL pointing to the photo's location, an ALT attribute containing some text in the event that the PHOTO can’t be displayed, and WIDTH and HEIGHT attributes that together give the size of the image in pixels. Listing 24-22 demonstrates:

Listing 24-22: The PHOTO element has several attributes of different types

<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="24-23.xsd">
  <TITLE>Yes I Am</TITLE>
  <PHOTO ALT="Melissa Etheridge holding a guitar"
         WIDTH="100" HEIGHT="300"
         SRC="guitar.jpg"/>
  <COMPOSER>
    <NAME>
      <GIVEN>Melissa</GIVEN>
      <FAMILY>Etheridge</FAMILY>
    </NAME>
  </COMPOSER>
  <PRODUCER>
    <NAME>
      <GIVEN>Hugh</GIVEN>
      <FAMILY>Padgham</FAMILY>
    </NAME>
  </PRODUCER>
  <PRODUCER>
    <NAME>
      <GIVEN>Melissa</GIVEN>
      <FAMILY>Etheridge</FAMILY>
    </NAME>
  </PRODUCER>
  <PUBLISHER>Island Records</PUBLISHER>
  <LENGTH>P0YT4M24S</LENGTH>
  <YEAR>1993</YEAR>
  <ARTIST>Melissa Etheridge</ARTIST>
  <PRICE>$1.25</PRICE>
</SONG>

Even though the PHOTO element is empty, because it has attributes it has a complex type. You define a PhotoType just as you previously defined a PersonType and a SongType. However, where those types used xsd:element to declare child elements, this type will use xsd:attribute to declare attributes.

  <xsd:complexType name="PhotoType">
    <xsd:attribute name="SRC"    type="xsd:anyURI"/>
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger"/>
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger"/>
    <xsd:attribute name="ALT"    type="xsd:string"/>
  </xsd:complexType>

Because the SRC attribute should contain a URL, it's been given the type xsd:anyURI. Because the HEIGHT and WIDTH attributes should each be an integer greater than zero, they're given the type xsd:positiveInteger. Finally, because the ALT attribute can contain essentially any string of text of any length, it's set to the most general type, xsd:string.
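
To see what these types buy you, consider a few hypothetical PHOTO elements. These are illustrations rather than part of the song documents, and the exact error messages depend on the validator:

```xml
<!-- Valid: both dimensions are integers greater than zero -->
<PHOTO SRC="guitar.jpg" WIDTH="100" HEIGHT="300"
       ALT="Melissa Etheridge holding a guitar"/>

<!-- Invalid: 0 is not an xsd:positiveInteger -->
<PHOTO SRC="guitar.jpg" WIDTH="0" HEIGHT="300" ALT="No width"/>

<!-- Invalid: 100px is not an integer at all -->
<PHOTO SRC="guitar.jpg" WIDTH="100px" HEIGHT="300" ALT="Units"/>
```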

In this particular example, all the elements either have child elements or attributes, not both. However, that's certainly not required. In general, elements can have both child elements and attributes. Just use both xsd:element and xsd:attribute in the same xsd:complexType element. The xsd:attribute elements must come after the xsd:sequence, xsd:choice, or xsd:all group that forms the body of the element. For example, this xsd:complexType says that an element of type PersonType may have an optional attribute named ID with type xsd:ID:

  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:all>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:all>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="ID" type="xsd:ID"/>
  </xsd:complexType>
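
In an instance document, an element declared with this type, such as COMPOSER or PRODUCER, could then carry the attribute or omit it. The ID value below is made up for illustration; like any xsd:ID, it must be a name that contains no colon and is unique within the document:

```xml
<COMPOSER ID="c1">
  <NAME>
    <GIVEN>Melissa</GIVEN>
    <FAMILY>Etheridge</FAMILY>
  </NAME>
</COMPOSER>
<!-- The ID attribute is optional, so this is valid too -->
<PRODUCER>
  <NAME>
    <GIVEN>Hugh</GIVEN>
    <FAMILY>Padgham</FAMILY>
  </NAME>
</PRODUCER>
```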

Attributes can also be attached to elements that can only contain text such as an xsd:string or an xsd:gYear. The details are a little more complex, because an element with attributes by definition has a complex type. To make this work, you derive a new complex type from a simple type by giving the xsd:complexType element an xsd:simpleContent child element instead of an xsd:sequence, xsd:choice, or xsd:all. The xsd:simpleContent element itself has an xsd:extension child element whose base attribute identifies the simple type to extend such as xsd:string. The xsd:attribute elements are placed inside the xsd:extension element.

For example, suppose you want to allow the TITLE elements to have ID attributes like this:

<TITLE ID="test">Yes I Am</TITLE>

Previously TITLE was defined with type xsd:string. Instead let's derive a new type called StringWithID from xsd:string like this:

<xsd:complexType name="StringWithID">
  <xsd:simpleContent>
    <xsd:extension base="xsd:string">
      <xsd:attribute name="ID" type="xsd:ID"/>
    </xsd:extension>
  </xsd:simpleContent>
</xsd:complexType>

The StringWithID type can then be applied to the TITLE element in the usual way like this:

<xsd:element name="TITLE" type="StringWithID"/>
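
The same pattern works for any simple base type. As a sketch, suppose the YEAR element should carry a CIRCA attribute marking approximate dates. Both the attribute and the type name YearWithCirca are hypothetical and are not used elsewhere in this chapter:

```xml
<xsd:complexType name="YearWithCirca">
  <xsd:simpleContent>
    <!-- The element content is still validated as an xsd:gYear -->
    <xsd:extension base="xsd:gYear">
      <xsd:attribute name="CIRCA" type="xsd:boolean"/>
    </xsd:extension>
  </xsd:simpleContent>
</xsd:complexType>

<xsd:element name="YEAR" type="YearWithCirca"/>
```

With this declaration <YEAR CIRCA="true">1978</YEAR> would be valid, while <YEAR CIRCA="true">seventies</YEAR> would not.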

By default attributes declared in schemas are optional (#IMPLIED in DTD terminology). However, an xsd:attribute can have a use attribute with the value required to indicate that the attribute must occur. For the PHOTO element, you probably do want to insist that each of the four attributes be present. Therefore the declaration of PhotoType becomes this:

  <xsd:complexType name="PhotoType">
    <xsd:attribute name="SRC"    type="xsd:anyURI"
                   use="required" />
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="ALT"    type="xsd:string"
                   use="required" />
  </xsd:complexType>

The use attribute can also have the value optional to indicate that the attribute may or may not be present. (This is also the default if there is no use attribute.) If optional, then xsd:attribute may also have a default attribute giving the value the parser will provide if it doesn’t find one in the instance document. If there is no default attribute, then this is the same as #IMPLIED in ATTLIST declarations in DTDs. An xsd:attribute can also have a fixed attribute whose value is the constant value for the attribute, whether present in the instance document or not. This has the same effect as #FIXED in DTDs. Listing 24-23 puts this all together in a complete schema for songs, including a PHOTO element with several required attributes.

Listing 24-23: A SONG schema that declares attributes

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="PhotoType">
    <xsd:attribute name="SRC"    type="xsd:anyURI"
                   use="required" />
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="ALT"    type="xsd:string"
                   use="required" />
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="PHOTO"     type="PhotoType"/>
      <xsd:element name="COMPOSER"  type="PersonType"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="PersonType"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="money"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:simpleType name="money">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
      <!--
         Regular Expression:
         \p{Sc}             Any Unicode currency indicator;
                            e.g., $, &#xA5;, &#xA3;, &#xA4;, etc.
         \p{Nd}             A Unicode decimal digit character
         \p{Nd}+            One or more Unicode decimal digits
         \.                 The period character
         (\.\p{Nd}\p{Nd})   A period followed by two decimal digits
         (\.\p{Nd}\p{Nd})?  Zero or one strings of the form .35
         This works for any decimalized currency.
      -->
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:all>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:all>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
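
Listing 24-23 makes every PHOTO attribute required, so it does not exercise default or fixed. A minimal sketch of both follows. The type name PhotoWithDefaultsType, the default text, and the UNITS attribute are all hypothetical:

```xml
<xsd:complexType name="PhotoWithDefaultsType">
  <xsd:attribute name="SRC" type="xsd:anyURI" use="required"/>
  <!-- Optional; supplied as "Photo" when the document omits it -->
  <xsd:attribute name="ALT" type="xsd:string" default="Photo"/>
  <!-- Optional; if present, the value must be exactly "pixels" -->
  <xsd:attribute name="UNITS" type="xsd:string" fixed="pixels"/>
</xsd:complexType>
```

Under this type a validator should accept <PHOTO SRC="guitar.jpg"/>, with a schema-aware parser filling in ALT="Photo" and UNITS="pixels", but it should reject UNITS="inches".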

Namespaces

So far the example song documents have been blissfully namespace-free. Adding namespaces to the documents, and designing a schema that applies to the namespace-qualified documents is not particularly difficult. Namespaces add some important features, such as the ability to write schemas and validate documents that use elements and attributes from multiple XML applications. However, the terminology is a little on the confusing side. Some words, such as qualified, don’t mean quite the same thing in schemas as they do in other XML technologies, so you do need to pay close attention and read what follows carefully.

Schemas for default namespaces

Let's begin with a simple example in which the XML application described by the schema uses a single default, nonprefixed namespace. Most of the time each namespace URI maps to exactly one schema (though later you'll learn several techniques to break large schemas into parts using xsd:import and xsd:include).

The schema for elements that are not in any namespace is identified by an xsi:noNamespaceSchemaLocation attribute. The schemas for elements that are in namespaces are identified by an xsi:schemaLocation attribute. This attribute contains a list of namespace URI/schema URI pairs. Each namespace URI is followed by one schema URI. The namespace URI is almost always absolute, but the schema URI is almost always a URL and often a relative URL.

Listing 24-24 demonstrates. This is the familiar hotcop.xml document that you've seen several times already, though it's been simplified a bit to keep the examples smaller. All the elements in this document are in the http://ibiblio.org/xml/namespace/song namespace defined by the xmlns attribute on the root element. The attributes in this document are not in any namespace because they don’t have prefixes. There are two things you need to remember here:

1. Attributes without prefixes are never in any namespace, no matter what namespace their parent element is in, no matter what default namespace the document uses.
2. For purposes of schema validation, namespace declaration attributes, such as xmlns and xmlns:xsi, and schema attachment attributes, such as xsi:schemaLocation, don’t count. You do not need to declare these in your schema.

In this case, all the elements are in the http://ibiblio.org/xml/namespace/song namespace, so an xsi:schemaLocation attribute is needed to associate this namespace with a URL where the schema can be found, namespace_song.xsd for this example.

Listing 24-24: A SONG document in the http://ibiblio.org/xml/namespace/song namespace

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<SONG xmlns="http://ibiblio.org/xml/namespace/song"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation =
       "http://ibiblio.org/xml/namespace/song
        namespace_song.xsd"
>
  <TITLE>Hot Cop</TITLE>
  <!-- I've temporarily dropped the SRC attribute on this
       element. I'm going to replace it with XLinks shortly.
    -->
  <PHOTO ALT="Victor Willis in Cop Outfit" WIDTH="100"
         HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>P0YT6M20S</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

What does namespace_song.xsd look like? Listing 24-25 shows you. It's much the same schema as before, although I've dropped the money and PersonType types to save a little room.

Listing 24-25: A schema for SONG documents in the http://ibiblio.org/xml/namespace/song namespace

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://ibiblio.org/xml/namespace/song"
  targetNamespace="http://ibiblio.org/xml/namespace/song"
  elementFormDefault="qualified"
  attributeFormDefault="unqualified"
>
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="PhotoType">
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="ALT"    type="xsd:string"
                   use="required" />
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="PHOTO"     type="PhotoType"/>
      <xsd:element name="COMPOSER"  type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

The main body of the schema is much the same as before. However, the xsd:schema start tag has several new attributes. It looks like this:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://ibiblio.org/xml/namespace/song"
  targetNamespace="http://ibiblio.org/xml/namespace/song"
  elementFormDefault="qualified"
  attributeFormDefault="unqualified"
>

The first new attribute, xmlns, establishes the default namespace for this schema, which is, after all, an XML document itself. It sets the namespace to http://ibiblio.org/xml/namespace/song, the same as in the instance documents you're trying to model. This says that the unprefixed names used in this schema, such as the type names PhotoType and SongType, are in the http://ibiblio.org/xml/namespace/song namespace.

The second attribute, targetNamespace, says that this schema applies to documents in the http://ibiblio.org/xml/namespace/song namespace; that is, the elements identified by name attributes such as SONG, PHOTO, and TITLE are in the http://ibiblio.org/xml/namespace/song namespace.

The third attribute, elementFormDefault, has the value qualified. This means that the elements being described in this document are in fact in a namespace; specifically they're in the target namespace given previously by the targetNamespace attribute. This does not mean that the elements being modeled necessarily have prefixes, merely that they are in some namespace.

Finally, the fourth attribute, attributeFormDefault, has the value unqualified. This means that the attributes described by this schema are not in a namespace.
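
The effect of elementFormDefault is easiest to see by contrast. If Listing 24-25 instead set elementFormDefault to unqualified (which is also what you get when the attribute is omitted), only the globally declared SONG element would be in the target namespace. The elements declared inside SongType would have to be in no namespace at all. A valid instance would then look something like this sketch, which leaves out the xsi:schemaLocation attribute for brevity:

```xml
<song:SONG xmlns:song="http://ibiblio.org/xml/namespace/song">
  <!-- Locally declared elements are unqualified: no prefix, and no
       default namespace in scope -->
  <TITLE>Hot Cop</TITLE>
  <PHOTO ALT="Victor Willis in Cop Outfit" WIDTH="100"
         HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <LENGTH>P0YT6M20S</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</song:SONG>
```

A document such as Listing 24-24, which puts every element in the default namespace, would no longer validate against such a schema.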

Schemas have one major advantage over DTDs when working with documents with namespaces. They validate against the local name and the namespace URIs of the elements and attributes, not the prefix and the local name like DTDs do. This means the prefixes do not have to match in the schema and in the instance documents. Indeed one might use prefixes and the other might use the default namespace.

For instance, consider Listing 24-26. This is the same as Listing 24-24 except that it uses the song prefix rather than the default namespace to indicate the http://ibiblio.org/xml/namespace/song namespace. However, it can use the exact same schema! The schema does not need to change just because the prefix (or lack thereof) has changed. As long as the namespace URI stays the same, the schema is happy.

Listing 24-26: A SONG document in the http://ibiblio.org/xml/namespace/song namespace with prefixes

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<song:SONG
      xmlns:song="http://ibiblio.org/xml/namespace/song"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation =
       "http://ibiblio.org/xml/namespace/song
        namespace_song.xsd"
>
  <song:TITLE>Hot Cop</song:TITLE>
  <!-- I've temporarily dropped the SRC attribute on this
       element. I'm going to replace it with XLinks shortly.
    -->
  <song:PHOTO ALT="Victor Willis in Cop Outfit" WIDTH="100"
         HEIGHT="200"/>
  <song:COMPOSER>Jacques Morali</song:COMPOSER>
  <song:COMPOSER>Henri Belolo</song:COMPOSER>
  <song:COMPOSER>Victor Willis</song:COMPOSER>
  <song:PRODUCER>Jacques Morali</song:PRODUCER>
  <song:PUBLISHER>PolyGram Records</song:PUBLISHER>
  <song:LENGTH>P0YT6M20S</song:LENGTH>
  <song:YEAR>1978</song:YEAR>
  <song:ARTIST>Village People</song:ARTIST>
</song:SONG>

Multiple namespaces, multiple schemas

Now let's consider the case in which one document mixes markup from different vocabularies. In particular, let’s suppose that you want to use XLink to connect the PHOTO element to the actual JPEG image rather than application-specific markup such as SRC. You need to set xlink:type, xlink:href, xlink:show, and xlink:actuate attributes on the PHOTO element to give it the proper meaning and behavior like this:

<PHOTO xlink:type="simple" xlink:href="hotcop.jpg"
       xlink:show="embed"  xlink:actuate="onLoad"
       ALT="Victor Willis in Cop Outfit"
       WIDTH="100" HEIGHT="200"/>

Cross-Reference

XLinks are discussed in Chapter 20.

Now the document uses two main namespaces, the http://ibiblio.org/xml/namespace/song namespace for songs and the http://www.w3.org/1999/xlink namespace for XLinks. Thus, it needs two schemas. However, because the root element can have only one xsi:schemaLocation attribute, it has to serve double duty and declare both. Listing 24-27 demonstrates.

Listing 24-27: A SONG document that uses XLink to embed photos

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<SONG xmlns="http://ibiblio.org/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation =
      "http://ibiblio.org/xml/namespace/song 24-29.xsd
       http://www.w3.org/1999/xlink xlink.xsd"
>
  <TITLE>Hot Cop</TITLE>
  <PHOTO xlink:type="simple" xlink:href="hotcop.jpg"
         xlink:show="embed"  xlink:actuate="onLoad"
         ALT="Victor Willis in Cop Outfit"
         WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>P0YT6M20S</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Listing 24-28 shows the XLink schema. It only declares attributes, no elements at all. You haven’t seen an example of this yet, but it's not hard. Just use xsd:attribute elements at the top-level, that is, as direct children of the xsd:schema element. The other difference between these top-level xsd:attribute elements and the ones you've seen before is that three of the attributes have fixed values, and don’t even need to be explicitly included in the instance document. Only the xlink:href attribute asks the author to supply a value. However, this is rather specific to this particular use of XLink. Almost anything else you'd do with an XLink other than embedding an image or other non-XML content into the document would require a different schema that used different defaults.

Listing 24-28: xlink.xsd: An XLink schema

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://www.w3.org/1999/xlink"
  targetNamespace="http://www.w3.org/1999/xlink"
  attributeFormDefault="unqualified"
>
  <xsd:attribute name="type"    type="xsd:string"
                 fixed="simple"/>
  <xsd:attribute name="href"    type="xsd:anyURI"/>
  <xsd:attribute name="actuate" type="xsd:string"
                 fixed="onLoad"/>
  <xsd:attribute name="show"    type="xsd:string"
                 fixed="embed"/>
</xsd:schema>

This schema doesn’t actually apply these attributes to any elements. Therefore, the schema that does describe the PHOTO element needs to import xlink.xsd in order to reference these declarations. This is done with an xsd:import element. The xsd:import's schemaLocation attribute tells the processor where to find the schema to import. The namespace attribute says which elements and attributes the schema declares. Once this schema has been imported, you can add those attributes to any xsd:complexType by giving it an xsd:attribute child whose ref attribute identifies the attribute to be attached. Listing 24-29 demonstrates.

Listing 24-29: A SONG schema that imports the XLink schema

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns="http://ibiblio.org/xml/namespace/song"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  targetNamespace="http://ibiblio.org/xml/namespace/song"
  elementFormDefault="qualified"
  attributeFormDefault="unqualified"
>
  <xsd:import namespace="http://www.w3.org/1999/xlink"
              schemaLocation="xlink.xsd"/>
  <xsd:element name="SONG" type="SongType"/>
  <xsd:complexType name="PhotoType">
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger"
                   use="required" />
    <xsd:attribute name="ALT"    type="xsd:string"
                   use="required" />
    <xsd:attribute ref="xlink:type"/>
    <xsd:attribute ref="xlink:href" use="required"/>
    <xsd:attribute ref="xlink:actuate"/>
    <xsd:attribute ref="xlink:show"/>
  </xsd:complexType>
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="PHOTO"     type="PhotoType"/>
      <xsd:element name="COMPOSER"  type="xsd:string"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0"/>
      <xsd:element name="LENGTH"    type="xsd:duration"/>
      <xsd:element name="YEAR"      type="xsd:gYear"/>
      <xsd:element name="ARTIST"    type="xsd:string"
                   maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
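
Because Listing 24-29 references xlink:type, xlink:actuate, and xlink:show without making them required, and Listing 24-28 gives each one a fixed value, a PHOTO element may leave them out. The fragments below are a sketch that assumes the namespace declarations from the SONG start tag in Listing 24-27 are in scope:

```xml
<!-- Valid: only the required xlink:href is supplied -->
<PHOTO xlink:href="hotcop.jpg"
       ALT="Victor Willis in Cop Outfit"
       WIDTH="100" HEIGHT="200"/>

<!-- Invalid: xlink:show is fixed to embed in xlink.xsd -->
<PHOTO xlink:href="hotcop.jpg" xlink:show="new"
       ALT="Victor Willis in Cop Outfit"
       WIDTH="100" HEIGHT="200"/>
```

An XLink-aware application that does not read the schema will not see the omitted attributes, so spelling them out as Listing 24-27 does is the safer choice.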

Annotations

At some point in this chapter, it's likely to have occurred to you that schemas can get rather large and rather complex. If that hasn’t occurred to you yet, just imagine a schema not for the very small and simple song documents demonstrated in this chapter, but for much larger XML applications such as Scalable Vector Graphics, XHTML, and DocBook.

You can certainly use regular XML comments to describe schemas, and I encourage you to do so, especially when you're doing something less than obvious in the schema. The W3C XML Schema language also provides a more formal mechanism for annotating schemas. Both the top-level xsd:schema element itself and the various other schema elements (xsd:complexType, xsd:all, xsd:element, xsd:attribute, and so on) can contain xsd:annotation child elements that describe that part of the schema for human readers or for other computer programs. This element has two kinds of child elements:

The xsd:documentation child element describes the schema for human readers. It often contains copyright and similar information.
The xsd:appinfo child element describes the schema for computer programs. For instance, it might contain instructions about what style sheets to apply to the schema.

Each xsd:annotation element can contain any number of either of these. However, no special syntax has been defined for the content of these elements. You can put anything in there you find convenient, including other XML markup, subject only to the usual well-formedness constraints. Thus an xsd:documentation element might contain XHTML and an xsd:appinfo element might contain XSLT. Then again either or both might simply contain plain, unmarked-up text. For example, this annotation could be added to the song schemas developed in this chapter:

  <xsd:annotation>
   <xsd:documentation>
    Song schema for Chapter 24 of the XML Bible, Gold Edition
    Copyright 2001 Elliotte Rusty Harold.
    elharo@metalab.unc.edu
   </xsd:documentation>
  </xsd:annotation>
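
Annotations can also be attached to individual declarations. The following sketch documents the TITLE element and adds an application hint. The hint element and the example.com URI are invented for illustration and mean nothing to a schema validator. Note that the W3C specification spells the element name xsd:appinfo, all lowercase:

```xml
<xsd:element name="TITLE" type="xsd:string">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      The title of the song as printed on the album.
    </xsd:documentation>
    <xsd:appinfo source="http://www.example.com/form-builder">
      <!-- Hypothetical markup for a hypothetical form generator -->
      <hint widget="textfield" columns="40"/>
    </xsd:appinfo>
  </xsd:annotation>
</xsd:element>
```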

Summary

In this chapter, you learned that:

Schemas address a number of perceived limitations of DTDs, including a strange, non-XML syntax, namespace incompatibility, lack of data typing, and limited extensibility and scalability.
There are multiple XML schema languages including Relax, Schematron, TREX, and the W3C XML Schema language described in this chapter.
An XML document can indicate the schema that applies to its non-namespace-qualified elements via an xsi:noNamespaceSchemaLocation attribute, which is normally placed on the root element.
An XML document can indicate the schema that applies to its namespace qualified elements via an xsi:schemaLocation attribute, which is normally placed on the root element.
Schemas declare elements with xsd:element elements.
The type attribute of xsd:element specifies the data type of that element.
Elements with complex types can have attributes and child elements.
Elements with simple types only contain parsed character data.
The xsd:complexType element defines a new type for an element that can contain child elements, attributes, and/or mixed content.
The xsd:group, xsd:all, xsd:choice, and xsd:sequence elements let you specify particular combinations of elements in an element's content model.
The minOccurs and maxOccurs attributes of xsd:element determine how many of a given element are allowed in the instance document at that point. The default for each is 1. maxOccurs can be set to unbounded to indicate that any number of the element may appear.
There are 44 built-in simple types, including many numeric, string, time, and XML types.
The xsd:simpleType element defines a new type for an element or attribute that can only contain character data.
You can define your own simple types by restricting an existing type such as xsd:string with the xsd:restriction element. The base attribute of the xsd:restriction child specifies what type you're deriving from.
Each xsd:restriction element contains one or more child elements representing facets: xsd:minInclusive, xsd:minExclusive, xsd:maxInclusive, xsd:maxExclusive, xsd:enumeration, xsd:whiteSpace, xsd:pattern, xsd:length, xsd:minLength, xsd:maxLength, xsd:totalDigits, and/or xsd:fractionDigits.
An xsd:simpleType element can create a new type by unifying the value spaces of existing types. The existing types combined into the new type are identified by its xsd:union child element.
A list type can hold one or more white space-separated instances of an existing type. Such a type is defined by the xsd:list child of an xsd:simpleType element.
Schemas declare attributes with xsd:attribute elements.
The xsd:import element imports declarations for elements and attributes in a different namespace from another schema document.
Adding xsd:annotation elements helps make your schemas more readable.
The xsd:documentation child of an xsd:annotation element provides information for human readers.
The xsd:appinfo child of an xsd:annotation element provides information for software programs reading the schema, though schema validators ignore it.

In the next chapter, we explore another standard XML application from the W3C, the Resource Description Framework (RDF). RDF is an XML application for encoding meta-data and information structures.
