Document Type Definitions (DTDs) are an outgrowth of XML's heritage in the Standard Generalized Markup Language (SGML). SGML
was always intended for narrative-style documents: books, reports, technical manuals, brochures, Web pages, and the like.
DTDs were designed to serve the needs of these sorts of documents, and indeed they serve them very well. DTDs let you state
very simply and straightforwardly that every book must have one or more authors, that every song has exactly one title, that
every PERSON element has an ID attribute, and so forth. Indeed, for narrative documents that are intended for human beings to read from start to finish,
that are more or less composed of words in a row, there's really no need for anything beyond a DTD. However, XML has gone
well beyond the uses envisioned for SGML. XML is being used for object serialization, stock trading, remote procedure calls,
vector graphics, and many more things that look nothing like traditional narrative documents; and it is in these new arenas
that DTDs are showing some limits.
The limitation most developers notice first is the almost complete lack of data typing, especially for element content. DTDs
can't say that a PRICE element must contain a number, much less a number that's greater than zero with two decimal digits of precision and a dollar
sign. There's no way to say that a MONTH element must be an integer between 1 and 12. There's no way to indicate that a TITLE must contain between 1 and 255 characters. None of these are particularly important things to do for the narrative documents
SGML was aimed at; but they're very common things to want to do with data formats intended for computer-to-computer exchange
of information rather than computer-to-human communication. Humans are very good at handling fuzzy systems where expected
data is missing, or perhaps in not quite the right format; computers are not. Computers need to know that when they expect
an element to contain an integer between 1 and 12, the element really contains an integer in that range and nothing else.
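For comparison, here is a minimal sketch of how the W3C XML Schema language introduced later in this chapter could state the constraints just listed. The element names come from the examples above. The choice of facets is an assumption made for illustration, and the pattern shown for PRICE checks only the dollar sign and the two decimal digits; it does not enforce a value greater than zero.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- MONTH must be an integer between 1 and 12 -->
  <xsd:element name="MONTH">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="1"/>
        <xsd:maxInclusive value="12"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- TITLE must contain between 1 and 255 characters -->
  <xsd:element name="TITLE">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:minLength value="1"/>
        <xsd:maxLength value="255"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- PRICE: a dollar sign, one or more digits, a decimal point,
       and exactly two decimal digits (illustrative pattern only) -->
  <xsd:element name="PRICE">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="[$][0-9]+\.[0-9]{2}"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

</xsd:schema>
```

None of these three declarations has an equivalent in DTD syntax, where the closest available content model for each element is (#PCDATA).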
The second problem is that DTDs have an unusual non-XML syntax. You actually need different parsers and APIs to handle DTDs than you do to handle XML documents themselves. For instance, consider this common element declaration:
<!ELEMENT TITLE (#PCDATA)>
This is not a legal XML element. You can’t begin an element name with an exclamation point. TITLE is not an attribute. Neither is (#PCDATA). This is a very different way of describing information than is used in XML document instances. One would expect that if
XML were really powerful enough to live up to all its hype then it would be powerful enough to describe itself. You shouldn’t
need two different syntaxes: one for the information and one for the meta-information detailing the structure of the information.
XML element and attribute syntax should suffice for both info and meta-info.
The third problem is that DTDs are only marginally extensible and don’t scale very well. It's difficult to combine independent DTDs together in a sensible way. You can do this with parameter entity references. Indeed, SMIL 2.0 and modular XHTML are based on this idea. However, the modularized DTDs are very messy and very hard to follow. The largest DTDs in use today are in the ballpark of 10,000 lines of code, and it's questionable whether much larger XML applications can be defined before the entire DTD becomes completely unmanageable and incomprehensible. By contrast, the largest computer programs in existence today, which are much more intrinsically complex than even the most ambitious DTDs, easily reach sizes of 1,000,000 lines of code and more; sometimes even 10,000,000 lines of code or more.
Perhaps most annoyingly, DTDs are only marginally compatible with namespaces. The first principle of namespaces is that only the URI matters. The prefix does not. The prefix can change as long as the URI remains the same. However, validation of documents that use namespace prefixes works only if the DTD declares the prefixed names. You cannot use namespace URIs in a DTD. You must use the actual prefixes. If you change the prefixes in the document but don’t change the DTD, then the document immediately ceases to be valid. There are some tricks that you can perform with parameter entity references to make DTDs less dependent on the actual prefix, but they're complicated and not well understood in the XML community. And even when they are understood, these tricks simply feel far too much like a dirty hack rather than a clean, maintainable solution.
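To make the prefix problem concrete, the following sketch shows the same element written with two different prefixes. EXAMPLES is a hypothetical wrapper element used only to hold both forms in one well-formed document, and the SVG namespace is used purely as an illustration.

```xml
<?xml version="1.0"?>
<!-- EXAMPLES is a hypothetical wrapper element, not part of any real vocabulary. -->
<EXAMPLES>
  <!-- Both rect elements are in the namespace http://www.w3.org/2000/svg,
       so namespace-aware software treats them as the same element type. -->
  <svg:rect xmlns:svg="http://www.w3.org/2000/svg" width="10" height="10"/>
  <g:rect xmlns:g="http://www.w3.org/2000/svg" width="10" height="10"/>
</EXAMPLES>
```

A DTD that declares only svg:rect accepts the first form and rejects the second, even though the namespaces specification considers the two names identical.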
Finally, there are a number of annoying minor limitations where DTDs don’t allow you to do things that it really feels like
you ought to be able to do. For instance, DTDs cannot enforce the order or number of child elements in mixed content. That
is, you can't make statements such as each PARAGRAPH element must begin with exactly one SUMMARY element that is followed by plain text. Similarly you can’t enforce the number of child elements without also enforcing their
order. For instance, you cannot easily say that a PERSON element must contain a FIRST_NAME child, a MIDDLE_NAME child, and a LAST_NAME child, but that you don’t care what order they appear in. Again, there are workarounds; but they grow combinatorially complex
with the number of possible child elements.
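As a point of comparison, the W3C XML Schema language has a construct aimed at exactly this case. The sketch below assumes the PERSON example from the paragraph above and uses xsd:all, which requires each listed child to appear but leaves the order open. In this form each child may appear at most once.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="PERSON">
    <xsd:complexType>
      <!-- xsd:all: every child must appear exactly once, in any order -->
      <xsd:all>
        <xsd:element name="FIRST_NAME" type="xsd:string"/>
        <xsd:element name="MIDDLE_NAME" type="xsd:string"/>
        <xsd:element name="LAST_NAME" type="xsd:string"/>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The equivalent DTD content model has to spell out all six possible orderings of the three children, and the count grows factorially as children are added.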
Schemas are an attempt to solve all these problems by defining a new XML-based syntax for describing the permissible contents of XML documents, one that adds data typing, extensibility, and namespace-aware validation.
However, schemas are not a be-all and end-all solution. In particular, schemas do not replace DTDs! You can use both schemas and DTDs in the same document. DTDs can do several things that schemas cannot do, most importantly declaring entities. And of course, DTDs still work very well for the classic sort of narrative documents they were originally designed for. Indeed, for these types of documents, a DTD is often considerably easier to write than an equivalent schema. Parsers and other software will continue to support DTDs for as long as they support XML.
The word schema derives from the Greek word σχημα, meaning form or shape. It was first popularized in the Western world by Immanuel Kant in the late 1700s. According to the 1933 edition of the Oxford English Dictionary, Kant used the word schema to mean, "Any one of certain forms or rules of the ‘productive imagination’ through which the understanding is able to apply its ‘categories’ to the manifold of sense-perception in the process of realizing knowledge or experience." (And you thought computer science was full of unintelligible technical jargon!)
Schemas remained the province of philosophers for the next 200 years, until the word schema entered computer science, probably through database theory. Here, schema originally meant any document that described the permissible content of a database. More specifically, a schema was a description of all the tables in a database and the fields in the table. A schema also described what type of data each field could contain: CHAR, INT, CHAR[32], BLOB, DATE, and so on.
The word schema has grown from that source definition to a more generic meaning of any document that describes the permissible contents of other documents, especially if data typing is involved. Thus, you'll hear about different kinds of schemas from different technologies, including vocabulary schemas, RDF schemas, organizational schemas, X.500 schemas and, of course, XML schemas.
You say schemas, I say schemata
Probably no single topic has been more controversial in the schema world than the proper plural form of the word schema. The original Greek plural is σχηματα, schemata in Latin transliteration; and this is the form which Kant used and which you'll find in most dictionaries. This was fine for the 200 years when only people with PhDs in philosophy actually used the word. However, as often happens when words from other languages are adopted into popular English, its plural changed to something that sounds more natural to an Anglophone ear. In this case, the plural form schemata seems to be rapidly dying out in favor of the simpler schemas. In fact, the three World Wide Web Consortium (W3C) schema specifications all use the plural form schemas. I follow this convention in this book.
Since schemas is such a generic term, it shouldn't come as any surprise to you that there's more than one schema language
for XML. In fact there are many, each with its own unique advantages and disadvantages. These include Murata Makoto's Relax
(http://www.xml.gr.jp/relax/), Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html), James Clark's TREX - Tree Regular Expressions for XML (http://www.thaiopensource.com/trex/), the Document Definition Markup Language (DDML, also known as XSchema, http://purl.oclc.org/NET/ddml), and the W3C's misleadingly, generically titled XML Schema language. In addition, traditional XML DTDs can be considered to be yet another schema language.
There are also a number of dead XML schema languages that have been abandoned by their manufacturers in favor of other languages. These include Document Content Description (DCD), Commerce One's Schema for Object-Oriented XML (SOX), and Microsoft's XML-Data Reduced (XDR). None of these are worth your time or investment at this point. They never achieved broad adoption, and their vendors are now moving to the W3C XML Schema language instead.
This chapter focuses almost exclusively on the W3C XML Schema language. Nonetheless, TREX, Relax, and Schematron are definitely worthy of your attention as well. In particular, if you find W3C schemas to be excessively complex (and many people do so find them) and if you want a simpler schema language that still offers a complete set of extensible data types, you should consider Relax. Relax adopts the less controversial data types half of the W3C XML Schema recommendation, but replaces the much more complex and much less popular structures half with a much simpler language. Relax also has the advantage of being an official JIS and ISO standard.
Most schema languages, including W3C schemas, Relax, TREX, DDML, and DTDs, take the approach that you must carefully specify
what is allowed in the document. They are conservative: Everything not permitted is forbidden. If, on the other hand, you're
looking for a less-restrictive schema language in which everything not forbidden is permitted, you should consider Schematron.
Schematron is based on XPath, which allows it to make statements none of the other major schema languages can, such as "An
a element cannot have another a element as a descendant, even though an a element can contain a strong element which can contain an a element if it itself is not a descendant of an a element." This isn’t a theoretical example. This is a real restriction in XHTML that has to be made in the prose of the specification
because neither DTDs nor schemas are powerful enough to say it. What it means is that links can’t nest; that is, a link cannot
contain another link.
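A rule of this kind might be sketched in Schematron roughly as follows. This example is an assumption on two counts: it uses the later ISO Schematron namespace rather than the syntax available when this chapter was written, and it assumes the a elements are in the XHTML namespace, bound here to the arbitrary prefix h.

```xml
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- Bind the prefix h to the XHTML namespace for use in XPath expressions -->
  <sch:ns prefix="h" uri="http://www.w3.org/1999/xhtml"/>
  <sch:pattern>
    <!-- The rule fires for every a element in the document -->
    <sch:rule context="h:a">
      <!-- report emits its message whenever the test is true -->
      <sch:report test="ancestor::h:a">
        An a element cannot have another a element as an ancestor.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The XPath expression ancestor::h:a looks arbitrarily far up the tree, which is what lets the rule catch an a element nested inside a strong element that is itself inside another a element.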
From this point forward, I will use the unqualified word schema to refer to the W3C's XML schema language; but please keep in mind that alternatives that are equally deserving of the appellation do exist.
The W3C XML Schema language was created by the W3C XML Schema Working Group based on many different submissions from a variety of companies and individuals. It is a very large specification designed to handle a broad range of use cases. In fact, the schema specification is considerably larger and more complex than the XML 1.0 specification. It is an open standard, free to be implemented by any interested party. There are no known patent, trademark, or other intellectual property restrictions that would prevent you from doing anything you might reasonably want to do with schemas. (Unfortunately, this is not quite the same thing as saying that no such restrictions exist. The U.S. Patent Office has been a little out of control lately, granting patents left and right for inventions that really don’t deserve them, including a lot of software and business processes. I would not be surprised to learn of an as yet unnoticed patent that at least claims to cover some or all of the W3C XML Schema language.)
Caution
This chapter is based on the May 2, 2001 Recommendation of XML Schemas. At the time of this writing (June 2001), no software yet implements all of the final Recommendation. In fact, only one parser, Xerces-J, currently supports most of the W3C XML Schema language. Eventually, of course, this should be less of an issue as the standard evolves toward its final incarnation and more vendors implement the full schema language described here. In the meantime, if you do encounter something that doesn’t seem to work quite right, please report the problem to your parser vendor, not to me.
Let's begin our exploration of schemas with the ubiquitous Hello World example. Recall, once again, Listing 3-2 (greeting.xml) from Chapter 3. It is shown below:
Listing 3-2: greeting.xml
<?xml version="1.0"?>
<GREETING>
Hello XML!
</GREETING>
This XML document contains a single element, GREETING. (Remember that <?xml version="1.0"?> is the XML declaration, not an element.) This element contains parsed character data. A schema for this document has to declare
the GREETING element. It may declare other elements too, including ones that aren’t present in this particular document, but it must at
least declare the GREETING element.
Listing 24-1 is a very simple schema for GREETING elements. By convention it would be stored in a file with the three-letter extension .xsd, greeting.xsd for example, but
that's not required. It is an XML document so it has an XML declaration. It can be written and saved in any text editor that
knows how to save Unicode files. As always, you can use a different character set if you declare it in an encoding declaration.
Schema documents are XML documents and have all the privileges and responsibilities of other XML documents. They can even
have DTDs, DOCTYPE declarations, and style sheets if that seems useful to you, although in practice most do not.
Listing 24-1: greeting.xsd
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="GREETING" type="xsd:string"/>
</xsd:schema>
The root element of this and all other schemas is schema. This must be in the http://www.w3.org/2001/XMLSchema namespace. Normally, this namespace is bound to the prefix xsd or xs, although this can change as long as the URI stays the same. The other common approach is to make this URI the default namespace,
although that generally requires a few extra attributes to help separate the names defined by the XML application the schema
describes from the names of the schema elements themselves. You'll see this when namespaces are discussed at the end of this
chapter.
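As a minimal sketch of the default-namespace approach, the greeting schema could be written without any prefix as shown here. This version works without extra attributes only because the schema refers to nothing but the built-in string type; a schema that referenced its own user-defined types would need the additional namespace declarations mentioned above.

```xml
<?xml version="1.0"?>
<!-- The schema namespace is the default, so schema, element, and the
     built-in type string all need no prefix. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="GREETING" type="string"/>
</schema>
```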
Elements are declared using xsd:element elements. Listing 24-1 includes a single such element declaring the GREETING element. The name attribute specifies which element is being declared, GREETING in this example. This xsd:element element also has a type attribute whose value is the data type of the element. In this case the type is xsd:string, a standard type for elements that can contain any amount of text in any form but not child elements. It's equivalent to
a DTD content model of #PCDATA. That is, this xsd:element says that a valid GREETING element must look like this:
<GREETING>
various random text but no markup
</GREETING>
There's no restriction on what text the element can contain. It can be zero or more Unicode characters with any meaning. Thus
a GREETING element can also look like this:
<GREETING>Hello!</GREETING>
Or even this:
<GREETING></GREETING>
However, a valid GREETING element may not look like this:
<GREETING>
<SOME_TAG>various random text</SOME_TAG>
<SOME_EMPTY_TAG/>
</GREETING>
Nor may it look like this:
<GREETING>
<GREETING>various random text</GREETING>
</GREETING>
Each GREETING element must consist of nothing more and nothing less than parsed character data between an opening <GREETING> tag and a closing </GREETING> tag.
Before a document can be validated against a DTD, the document itself must contain a document type declaration pointing to the DTD it should be validated against. You cannot easily receive a document from a third party and validate it against your own DTD. You have to validate it against the DTD that the document's author specified. This is excessively limiting.
For example, imagine you're running an e-commerce business that accepts orders for products using SOAP or XML-RPC. Each order
comes to you over the Internet as an XML document. Before accepting that order the first thing you want to do is check that
it's valid against a DTD you've defined to make sure that it contains all the necessary information. However, if DTDs are
all you have to validate with, then there's nothing to prevent a hacker sending you a document whose DOCTYPE declaration points to a different DTD. Then your system may report that the document is valid according to the hacked DTD,
even though it would be invalid when compared to the correct DTD. If your system accepts the invalid document, it could introduce
corrupt data that crashes the system or lets the hacker order goods they haven’t paid for, all because the person authoring
the document got to choose which DTD to validate against rather than the person validating the document.
Schemas are more flexible. The schema specification specifically allows for a variety of different means for associating documents with schemas. For instance, one possibility is that both the name of the document to validate and the name of the schema to validate it against could be passed to the validator program on the command line like this:
C:\>validator greeting.xml greeting.xsd
Parsers could also let you choose the schema by setting a SAX property or an environment variable. Many other schemes are
possible. The schema specification does not mandate any one way of doing this. However, it does define one particular way
to associate a document with a schema. As with DOCTYPE declarations and DTDs, this requires modifying the instance document to point to the schema. The difference is that with
schemas, unlike with DTDs, this is not the only way to do it. Parser vendors are free to develop other mechanisms if they
want to.
To attach a schema to a document, add an xsi:noNamespaceSchemaLocation attribute to the document's root element. (You can also add it to the first element in the document that the schema applies
to, but most of the time adding it to the root element is simplest.) The xsi prefix is mapped to the http://www.w3.org/2001/XMLSchema-instance URI. As always, the prefix can change as long as the URI stays the same. Listing 24-2 demonstrates.
Listing 24-2: valid_greeting.xml
<?xml version="1.0"?>
<GREETING xsi:noNamespaceSchemaLocation="greeting.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
Hello XML!
</GREETING>
You can now run the document through any parser that supports schema validation. One such parser is Xerces Java 1.4.0 from
the XML Apache Project. In fact, you can use the same SAXCount program you learned about in Chapter 8 to validate against
schemas as well as DTDs. When you set the -v flag, SAXCount validates the documents it parses against a DTD if it sees a DOCTYPE declaration and against a schema if it finds an xsi:noNamespaceSchemaLocation attribute. Assuming SAXCount finds no errors, it simply returns the amount of time that was required to parse the document:
C:\XML>java sax.SAXCount -v valid_greeting.xml
valid_greeting.xml: 701 ms (1 elems, 1 attrs, 0 spaces, 12
chars)
Note
This chapter uses Xerces Java 1.4.0, which provides partial support for the May 2, 2001 Recommendation of XML Schema. At the
time of this writing Xerces C++ has no schema support at all. Furthermore, earlier versions of Xerces Java support earlier
drafts of the W3C XML Schema language that use different namespace URIs. In particular, they support the http://www.w3.org/2000/10/XMLSchema and http://www.w3.org/1999/XMLSchema namespaces. You can download the latest version of Xerces from http://xml.apache.org/xerces-j/.
Now let's suppose you have a document that's not valid, such as Listing 24-3. This document uses a P element that hasn't been declared in the schema.
Listing 24-3: invalid_greeting.xml
<?xml version="1.0"?>
<GREETING
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="greeting.xsd">
<P>Hello XML!</P>
</GREETING>
Running it through sax.SAXCount, you now get this output showing you what the problems are:
C:\XML>java sax.SAXCount -v invalid_greeting.xml
[Error] invalid_greeting.xml:5:6: Element type "P" must be
declared.
[Error] invalid_greeting.xml:6:13: Datatype error: In element
'GREETING' : Can not have element children within a simple type
content.
invalid_greeting.xml: 1292 ms (2 elems, 2 attrs, 0 spaces, 14
chars)
The validator found two problems. The first is that the P element is used but is not, itself, declared. The second is that the GREETING element is declared to have type xsd:string, one of several "simple" types that cannot have any child elements. However, in this case, the GREETING element does contain a child element: the P element.
The W3C XML Schema language divides elements into complex and simple types. A simple type element is one like GREETING that can only contain text and does not have any attributes. It cannot contain any child elements. It may, however, be more
limited in the kind of text it can contain. For instance, a schema can say that a simple element contains an integer, a date,
or a decimal value between 3.76 and 98.24. Complex elements can have attributes and can have child elements.
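The following sketch shows what such restricted simple elements might look like. COUNT, RELEASED, and SCORE are hypothetical element names invented for this illustration, and the range is assumed to include both endpoints.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Simple elements using built-in types -->
  <xsd:element name="COUNT" type="xsd:integer"/>
  <xsd:element name="RELEASED" type="xsd:date"/>

  <!-- A simple element limited to decimals from 3.76 through 98.24 -->
  <xsd:element name="SCORE">
    <xsd:simpleType>
      <xsd:restriction base="xsd:decimal">
        <xsd:minInclusive value="3.76"/>
        <xsd:maxInclusive value="98.24"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

</xsd:schema>
```

All three elements are still simple: each holds only text, with no attributes and no child elements. The restriction narrows which text is acceptable.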
Most documents need a mix of both complex and simple elements. For example, consider Listing 24-4. This document describes
the song Yes I Am by Melissa Etheridge. The root element is SONG. This element has a number of child elements giving the title of the song, the composer, the producer, the publisher, the
duration of the song, the year it was released, the price, and the artist who sang it. Except for SONG itself, these are all simple elements that can have type xsd:string. You might see documents like this used in CD databases, MP3 players, Gnutella clients, or anything else that needs to store
information about songs.
Listing 24-4: yesiam.xml
<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="song.xsd">
<TITLE>Yes I Am</TITLE>
<COMPOSER>Melissa Etheridge</COMPOSER>
<PRODUCER>Hugh Padgham</PRODUCER>
<PUBLISHER>Island Records</PUBLISHER>
<LENGTH>4:24</LENGTH>
<YEAR>1993</YEAR>
<ARTIST>Melissa Etheridge</ARTIST>
<PRICE>$1.25</PRICE>
</SONG>
Now you need a schema that describes this and all other reasonable song documents. Listing 24-5 is the first attempt at such a schema.
Listing 24-5: song.xsd
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="xsd:string"/>
<xsd:element name="PRODUCER" type="xsd:string"/>
<xsd:element name="PUBLISHER" type="xsd:string"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"/>
<xsd:element name="PRICE" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
The root element of this schema is once again xsd:schema, and once again the prefix xsd is mapped to the namespace URI http://www.w3.org/2001/XMLSchema. This will be the case for all schemas in this chapter, and indeed all schemas that you write. I won’t note it again.
This schema declares a single top-level element. That is, there is exactly one element declared in an xsd:element declaration that is an immediate child of the root xsd:schema element. This is the SONG element. Only top-level elements can be the root elements of documents described by this schema, though in general they do
not have to be the root element.
The SONG element is declared to have type SongType. The W3C Schema Working Group wasn't prescient. They built a lot of common types into the language, but they didn’t know
that I was going to need a song type, and they didn’t provide one. Indeed, they could not reasonably have been expected to
predict and provide for the numerous types that schema designers around the world were ever going to need. Instead, they provided
facilities to allow users to define their own types. SongType is one such user-defined type. In fact, you can tell it's not a built-in type because it doesn’t begin with the prefix xsd. All built-in types are in the http://www.w3.org/2001/XMLSchema namespace.
The xsd:complexType element defines a new type. The name attribute of this element names the type being defined. Here that name is SongType, which matches the type previously assigned to the SONG element. Forward references (for example, xsd:element using the SongType type before it's been defined) are perfectly acceptable in schemas. Circular references are okay, too. Type A can depend
on type B which depends on type A. Schema processors sort all this out without any difficulty.
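A circular reference might look like the following sketch. FOLDER, CONTENTS, FolderType, and ContentsType are hypothetical names chosen for illustration. The minOccurs and maxOccurs attributes, explained later in this chapter, make the nested elements optional so that the recursion can end somewhere.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Forward reference: FolderType is used before it is defined -->
  <xsd:element name="FOLDER" type="FolderType"/>

  <!-- FolderType depends on ContentsType... -->
  <xsd:complexType name="FolderType">
    <xsd:sequence>
      <xsd:element name="NAME" type="xsd:string"/>
      <xsd:element name="CONTENTS" type="ContentsType" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- ...and ContentsType depends on FolderType in turn -->
  <xsd:complexType name="ContentsType">
    <xsd:sequence>
      <xsd:element name="FOLDER" type="FolderType"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```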
The contents of the xsd:complexType element specify what content a SongType element must contain. In this example, the schema says that every SongType element contains a sequence of eight child elements: TITLE, COMPOSER, PRODUCER, PUBLISHER, LENGTH, YEAR, ARTIST, and PRICE. Each of these is declared to have the built-in type xsd:string. Each SongType element must contain exactly one of each of these in exactly that order. The only other content it may contain is insignificant
white space between the tags.
You can validate Listing 24-4, yesiam.xml, against the song schema, and it does, indeed, prove valid. Are you done? Is song.xsd now an adequate description of legal song documents? Suppose you instead wanted to validate Listing 24-6, a song document that describes Hot Cop by the Village People. Is it valid according to the schema in Listing 24-5?
Listing 24-6: hotcop.xml
<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="song.xsd">
<TITLE>Hot Cop</TITLE>
<COMPOSER>Jacques Morali</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
The answer is no, it is not. The reason is that this song was a collaboration between three different composers and the existing schema only allows a single composer. Furthermore, the price is missing. If you looked at other songs, you'd find similar problems with the other child elements. Under Pressure has two artists, David Bowie and Queen. We Are the World has dozens of artists. Many songs have multiple producers. A garage band without a publisher might record a song and post it on Napster in the hope of finding one.
The song schema needs to be adjusted to allow for varying numbers of particular elements. This is done by attaching minOccurs and maxOccurs attributes to each xsd:element element. These attributes specify the minimum and maximum number of instances of the element that may appear at that point
in the document. The value of each attribute is an integer greater than or equal to zero. The maxOccurs attribute may also have the value unbounded to indicate that an unlimited number of the particular element may appear. Listing 24-7 demonstrates.
Listing 24-7: minOccurs and maxOccurs
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="COMPOSER" type="xsd:string"
minOccurs="1" maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="xsd:string"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0" maxOccurs="1"/>
<xsd:element name="LENGTH" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="YEAR" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="ARTIST" type="xsd:string"
minOccurs="1" maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
This schema says that every SongType element must have, in order:
Exactly one TITLE (minOccurs="1" maxOccurs="1")
At least one, and possibly more, COMPOSERs (minOccurs="1" maxOccurs="unbounded")
Any number of PRODUCERs, although possibly no producer at all (minOccurs="0" maxOccurs="unbounded")
Either one PUBLISHER or no PUBLISHER at all (minOccurs="0" maxOccurs="1")
Exactly one LENGTH (minOccurs="1" maxOccurs="1")
Exactly one YEAR (minOccurs="1" maxOccurs="1")
At least one ARTIST, possibly more (minOccurs="1" maxOccurs="unbounded")
An optional PRICE (minOccurs="0" maxOccurs="1")
This is much more flexible and easier to use than the limited ?, *, and + that are available in DTDs. It is very straightforward to say, for example, that you want between 4 and 7 of a given element.
Just set minOccurs to 4 and maxOccurs to 7.
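For instance, a bounded count could be sketched as follows. COMMITTEE, MEMBER, and CommitteeType are hypothetical names used only for this illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="COMMITTEE" type="CommitteeType"/>
  <xsd:complexType name="CommitteeType">
    <xsd:sequence>
      <!-- At least four and at most seven MEMBER children -->
      <xsd:element name="MEMBER" type="xsd:string"
                   minOccurs="4" maxOccurs="7"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

The nearest DTD equivalent has to list the element seven times, four times as required and three times followed by a question mark.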
If minOccurs and maxOccurs are not present, then the default value of each is 1. Taking advantage of this, the song schema can be written a little more
compactly as shown in Listing 24-8.
Listing 24-8: Taking advantage of the default values of minOccurs and maxOccurs
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="xsd:string"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
The examples so far have all been relatively flat. That is, a SONG element contained other elements; but those elements only contained parsed character data, not child elements of their own.
Suppose, however, that some child elements do contain other elements, as in Listing 24-9. Here the COMPOSER and PRODUCER elements each contain NAME elements.
Listing 24-9: A deeper hierarchy
<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="24-10.xsd">
<TITLE>Hot Cop</TITLE>
<COMPOSER>
<NAME>Jacques Morali</NAME>
</COMPOSER>
<COMPOSER>
<NAME>Henri Belolo</NAME>
</COMPOSER>
<COMPOSER>
<NAME>Victor Willis</NAME>
</COMPOSER>
<PRODUCER>
<NAME>Jacques Morali</NAME>
</PRODUCER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
Because the COMPOSER and PRODUCER elements now have complex content, you can no longer use one of the built-in types such as xsd:string to declare them. Instead you have to define a new ComposerType and ProducerType using top-level xsd:complexType elements. Listing 24-10 demonstrates.
Listing 24-10: Defining separate ComposerType and ProducerType types
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="ComposerType">
<xsd:sequence>
<xsd:element name="NAME" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="ProducerType">
<xsd:sequence>
<xsd:element name="NAME" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="ComposerType"
maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="ProducerType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
You may have noticed that PRODUCER and COMPOSER are very similar. Each contains a single NAME child element and nothing else. In a DTD you'd take advantage of this shared content model via a parameter entity reference.
In a schema, it's much easier. Simply give them the same type. While you could declare that the PRODUCER has ComposerType or vice versa, it's better to declare that both have a more generic PersonType. Listing 24-11 demonstrates.
Listing 24-11: Using a single PersonType for both COMPOSER and PRODUCER
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="PersonType">
<xsd:sequence>
<xsd:element name="NAME" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="PersonType"
maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="PersonType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
Suppose you wanted to divide the NAME elements into separate GIVEN and FAMILY elements like this:
<NAME>
<GIVEN>Victor</GIVEN>
<FAMILY>Willis</FAMILY>
</NAME>
<NAME>
<GIVEN>Jacques</GIVEN>
<FAMILY>Morali</FAMILY>
</NAME>
To declare this, you could use an xsd:complexType element to define a new NameType type like this:
<xsd:complexType name="NameType">
<xsd:sequence>
<xsd:element name="GIVEN" type="xsd:string"/>
<xsd:element name="FAMILY" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
Then the PersonType would be defined like this:
<xsd:complexType name="PersonType">
<xsd:sequence>
<xsd:element name="NAME" type="NameType"/>
</xsd:sequence>
</xsd:complexType>
However, the NAME element is only used inside PersonType elements. Perhaps it shouldn't be a top-level definition. For instance, you may not want to allow NAME elements to be used as root elements, or to be children of things that aren’t PersonType elements. You can prevent this by defining NAME with an anonymous type. To do this, instead of assigning the NAME element a type with a type attribute on the corresponding xsd:element element, you give it an xsd:complexType child element to define its type. Listing 24-12 demonstrates.
Listing 24-12: Anonymous types
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="PersonType">
<xsd:sequence>
<xsd:element name="NAME">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="GIVEN" type="xsd:string"/>
<xsd:element name="FAMILY" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="PersonType"
maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="PersonType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
Defining the element types inside the xsd:element elements that are themselves children of xsd:complexType elements is a very powerful technique. Among other things, it enables you to give elements with the same name different types
when used in different elements. For example, you can say that the NAME of a PERSON contains GIVEN and FAMILY child elements while the NAME of a MOVIE contains an xsd:string and the NAME of a VARIABLE contains a string containing only alphanumeric characters from the ASCII character set.
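As a minimal sketch of this idea, the following schema gives NAME two different types depending on where it appears. The PERSON and MOVIE element declarations and the MovieType name are assumptions made up for this illustration; they are not part of the song schema.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical root elements, used only for this illustration -->
  <xsd:element name="PERSON" type="PersonType"/>
  <xsd:element name="MOVIE" type="MovieType"/>

  <!-- Inside a PERSON, NAME contains GIVEN and FAMILY children -->
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="GIVEN" type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Inside a MOVIE, NAME is just a string -->
  <xsd:complexType name="MovieType">
    <xsd:sequence>
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

Because each NAME declaration is local to the complex type that contains it, the two declarations don't conflict with each other.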
Schemas offer much greater control over mixed content than DTDs do. In particular, schemas let you enforce the order and number of elements appearing in mixed content. For example, suppose you wanted to allow extra text to be mixed in with the names to provide middle initials, titles, and the like as shown in Listing 24-13.
Caution
The format used here is purely for illustrative purposes. In practice, I'd recommend that you make the middle names and titles separate elements as well.
Listing 24-13: Mixed content
<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="24-14.xsd">
<TITLE>Hot Cop</TITLE>
<COMPOSER>
<NAME>
Mr. <GIVEN>Jacques</GIVEN> <FAMILY>Morali</FAMILY> Esq.
</NAME>
</COMPOSER>
<COMPOSER>
<NAME>
Mr. <GIVEN>Henri</GIVEN> L. <FAMILY>Belolo</FAMILY>, M.D.
</NAME>
</COMPOSER>
<COMPOSER>
<NAME>
Mr. <GIVEN>Victor</GIVEN> C. <FAMILY>Willis</FAMILY>
</NAME>
</COMPOSER>
<PRODUCER>
<NAME>
Mr. <GIVEN>Jacques</GIVEN> S. <FAMILY>Morali</FAMILY>
</NAME>
</PRODUCER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
It's very easy to declare that an element has mixed content in schemas. First, set up the xsd:complexType exactly as you would if the element only contained child elements. Then add a mixed attribute to it with the value true. Listing 24-14 demonstrates. It is almost identical to Listing 24-12 except for the addition of the mixed="true" attribute.
Listing 24-14: Declaring mixed content in a schema
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="PersonType">
<xsd:sequence>
<xsd:element name="NAME">
<xsd:complexType mixed="true">
<xsd:sequence>
<xsd:element name="GIVEN" type="xsd:string"/>
<xsd:element name="FAMILY" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="PersonType"
maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="PersonType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
So far, all the schemas you've seen have held that order mattered; for example, that it would be wrong to put the COMPOSER before the TITLE or the PRODUCER after the ARTIST. Given these schemas, the document shown below in Listing 24-15 is clearly invalid. But should it be? Element order often
does matter in narrative documents such as books and Web pages. However, it's not nearly as important in data-centric documents
like the examples in this chapter. Do you really care whether the TITLE comes first or not, as long as there is a TITLE? After all, if the document's going to be shown to a human being, it will probably first be transformed with an XSLT style
sheet that can easily place the contents in any order it likes.
Listing 24-15: A song document that places the elements in a different order
<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="song.xsd">
<ARTIST>Village People</ARTIST>
<TITLE>Hot Cop</TITLE>
<COMPOSER>
<NAME><GIVEN>Jacques</GIVEN> <FAMILY>Morali</FAMILY></NAME>
</COMPOSER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<COMPOSER>
<NAME><FAMILY>Belolo</FAMILY> <GIVEN>Henri</GIVEN></NAME>
</COMPOSER>
<YEAR>1978</YEAR>
<COMPOSER>
<NAME><FAMILY>Willis</FAMILY> <GIVEN>Victor</GIVEN></NAME>
</COMPOSER>
<PRODUCER>
<NAME><GIVEN>Jacques</GIVEN> <FAMILY>Morali</FAMILY></NAME>
</PRODUCER>
<PRICE>$1.25</PRICE>
</SONG>
The W3C XML Schema language provides three grouping constructs that specify whether and how ordering of individual elements is important. These are:
The xsd:all group requires that each element in the group occur at most once, but that order is not important.
The xsd:choice group specifies that any one element from the group should appear. It can also be used to say that between N and M elements from the group should appear in any order.
The xsd:sequence group requires that each element in the group appear exactly once, in the specified order.
Unfortunately, these constructs are not everything you might desire. In particular, you can’t specify the constraints that would be required to really handle Listing 24-15: you can’t say that you want a SONG to have exactly one TITLE, one or more COMPOSERs, zero or more PRODUCERs, one or more ARTISTs, but that you don’t care in what order the individual elements occur.
You can specify that you want each NAME element to have exactly one GIVEN child and one FAMILY child, but that you don’t care what order they appear in. The xsd:all group accomplishes this. For example,
<xsd:complexType name="PersonType">
<xsd:sequence>
<xsd:element name="NAME">
<xsd:complexType>
<xsd:all>
<xsd:element name="GIVEN" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="FAMILY" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
</xsd:all>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
The extension to handle what you want for Listing 24-15 seems obvious. It would look like this:
<xsd:complexType name="SongType">
<xsd:all>
<xsd:element name="TITLE" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="COMPOSER" type="PersonType"
minOccurs="1" maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="PersonType"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0" maxOccurs="1"/>
<xsd:element name="LENGTH" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="YEAR" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="ARTIST" type="xsd:string"
minOccurs="1" maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string" minOccurs="0"/>
</xsd:all>
</xsd:complexType>
Unfortunately, the W3C XML Schema language restricts the use of minOccurs and maxOccurs inside xsd:all elements. In particular, each one's value must be 0 or 1. You cannot set it to 4 or 7 or unbounded. Therefore the above type definition is invalid. Furthermore, xsd:all can only contain individual element declarations. It cannot contain xsd:choice or xsd:sequence elements. xsd:all offers somewhat more expressivity than DTDs do, but probably not as much as you want.
The xsd:choice element is the schema equivalent of the | in DTDs. When xsd:element elements are combined inside an xsd:choice, then exactly one of those elements must appear in instance documents. For example, the choice in this xsd:complexType requires either a PRODUCER or a COMPOSER, but not both.
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:choice>
<xsd:element name="COMPOSER" type="PersonType"/>
<xsd:element name="PRODUCER" type="PersonType"/>
</xsd:choice>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
The xsd:choice element itself can have minOccurs and maxOccurs attributes that establish exactly how many selections may be made from the choice. For example, setting minOccurs to 1 and maxOccurs to 6 would indicate that between one and six elements listed in the xsd:choice should appear. Each of these can be any of the elements in the xsd:choice. For example, you could have six different elements, three of the same element and three of another, or up to six of the
same element. This next xsd:choice allows for any number of artists, composers, and producers. However, in order to require that there be at least one ARTIST element and at least one COMPOSER element, rather than allowing all spaces to be filled by PRODUCER elements, it's necessary to place xsd:element declarations for these two outside the choice. This has the unfortunate side-effect of locking in more order than is really
needed.
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="PersonType"/>
<!-- The required ARTIST comes before the choice. Placing it after a
repeating choice that also allows ARTIST would make the content
model ambiguous. -->
<xsd:element name="ARTIST" type="xsd:string"/>
<xsd:choice minOccurs="0" maxOccurs="unbounded">
<xsd:element name="PRODUCER" type="PersonType"/>
<xsd:element name="COMPOSER" type="PersonType"/>
<xsd:element name="ARTIST" type="xsd:string"/>
</xsd:choice>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:string"/>
<xsd:element name="YEAR" type="xsd:string"/>
<xsd:element name="PRICE" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
An xsd:sequence element requires each member of the sequence to appear in the same order in the instance document as in the xsd:sequence element. I've used this frequently as the basic group for xsd:complexType elements in this chapter so far. The number of times each element is allowed to appear can be controlled by the xsd:element's minOccurs and maxOccurs attributes. You can add minOccurs and maxOccurs attributes to the xsd:sequence element to specify the number of times the sequence should repeat.
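Here is a small sketch of a repeating sequence. The CreditsType name and the ROLE element are hypothetical, invented only to show minOccurs and maxOccurs on the xsd:sequence element itself rather than on the individual xsd:element declarations.

```xml
<!-- Hypothetical type: one or more ROLE/NAME pairs, always in that order -->
<xsd:complexType name="CreditsType">
  <xsd:sequence minOccurs="1" maxOccurs="unbounded">
    <xsd:element name="ROLE" type="xsd:string"/>
    <xsd:element name="NAME" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>
```

An element of this type could contain ROLE, NAME, ROLE, NAME, and so on, but not two ROLE elements in a row, because the whole pair repeats as a unit.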
Until now I've focused on writing schemas that validate the element structures in an XML document. However, there's also a
lot of non-XML structure in the song documents. The YEAR element isn't just a string. It's an integer, and maybe not just any integer either, but a positive integer with four digits.
The PRICE element is some sort of money. The LENGTH element is a duration of time. DTDs have absolutely nothing to say about such non-XML structures that are inside the parsed
character data content of elements and attributes. Schemas, however, do let you make all sorts of statements about what forms
the text inside elements may take and what it means. Schemas provide much more sophisticated semantics for documents than
DTDs do.
Listing 24-16 is a new schema for song documents. It's based on Listing 24-8, but read closely and you should notice that a few things have changed.
Listing 24-16: A schema with simple data types
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="SongType">
<xsd:sequence>
<xsd:element name="TITLE" type="xsd:string"/>
<xsd:element name="COMPOSER" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="xsd:string"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0"/>
<xsd:element name="LENGTH" type="xsd:duration"/>
<xsd:element name="YEAR" type="xsd:gYear"/>
<xsd:element name="ARTIST" type="xsd:string"
maxOccurs="unbounded"/>
<xsd:element name="PRICE" type="xsd:string"
minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
Did you spot the changes? The values of the type attributes of the LENGTH and YEAR declarations are no longer xsd:string. Instead, LENGTH has the type xsd:duration and YEAR has the type xsd:gYear. These declarations say that it's no longer okay for the YEAR and LENGTH elements to contain just any old string of text. Instead they must contain strings in particular formats. In particular,
the YEAR element must contain a year; and the LENGTH element must contain a recognizable length of time. When you check a document against this schema, the validator will check
that these elements contain the proper data. It's not just looking at the elements. It's looking at the content inside the
elements!
Let's actually validate hotcop.xml against this schema and see what we get:
C:\XML>java sax.SAXCount -v hotcop.xml
[Error] hotcop.xml:10:25: Datatype error: In element 'LENGTH' :
Value '6:20' is not legal value for current datatype.
hotcop.xml: 1783 ms (10 elems, 2 attrs, 28 spaces, 98 chars)
That's unexpected! The problem is that 6:20 is not in the proper format for time durations, at least not the format that the
W3C XML Schema language uses and that schema validators know how to check. Schema validators expect that time types are expressed
in the format defined in ISO standard 8601, Representations of dates and times (http://www.iso.ch/markete/8601.pdf). This standard says that time durations should have the form PnYnMnDTnHnMdS, where n is an integer and d is a decimal number. P stands for "Period". nY gives the number of years; the first nM gives the number of months; and nD gives the number of days. T separates the date from the time. Following the T, nH gives the number of hours; the second nM gives the number of minutes; and dS gives the number of seconds. If d has a fraction part, then the duration can be specified to an arbitrary level of precision.
In this format, a duration of 6 minutes and 20 seconds should be written as P0Y0M0DT0H6M20S. If you prefer, the zero pieces
can be left out, so you can write this more compactly as PT6M20S. Listing 24-17 shows the fixed version of hotcop.xml with
the LENGTH in the right format.
Listing 24-17: fixed hotcop.xml
<?xml version="1.0"?>
<SONG xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="24-16.xsd">
<TITLE>Hot Cop</TITLE>
<COMPOSER>Jacques Morali</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<PUBLISHER>PolyGram Records</PUBLISHER>
<LENGTH>P0YT6M20S</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
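To get a feel for the format, here are a few more LENGTH values that should be legal xsd:duration strings. The particular durations are made up for illustration.

```xml
<!-- 6 minutes, 20 seconds -->
<LENGTH>PT6M20S</LENGTH>
<!-- 4 minutes, 24 seconds -->
<LENGTH>PT4M24S</LENGTH>
<!-- 1 hour, 2 minutes, 3.5 seconds -->
<LENGTH>PT1H2M3.5S</LENGTH>
<!-- 1 day, 12 hours -->
<LENGTH>P1DT12H</LENGTH>
```

Notice that the T is required whenever any hours, minutes, or seconds are present, and that the P is always required.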
Admittedly the ISO 8601 format for time durations is a little obscure, if precise. You may well be asking whether there's a
type that you can specify for the LENGTH that would make lengths such as 6:20 and 4:24 legal. In fact, there's no such type built into the W3C XML Schema language;
but you can define one yourself. You'll learn how to do that soon, but first let's explore some of the other data types that
are built into the W3C XML Schema language.
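There are 44 built-in simple types in the W3C XML Schema language. These can be unofficially divided into seven groups: numeric types, time types, XML types, string types, binary types, the boolean type, and the URI type.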
The most obvious data types, and the ones most familiar to programmers, are the numeric data types. Among computer scientists, there's quite a bit of disagreement about how numbers should be represented in computer systems. The W3C XML Schema language tries to make everyone happy by providing almost every numeric type imaginable, including integer and floating point numbers, fixed-size numbers like those in Java and C, arbitrarily large numbers like those in Java's java.math package, and signed and unsigned numbers.
You'll probably only use a subset of these. For instance, you wouldn’t use both the arbitrarily large xsd:integer type and the four-byte limited xsd:int type. Table 24-1 summarizes the different numeric types.
Table 24-1: Schema Numeric Types
| Name | Type | Examples |
|---|---|---|
| xsd:float | IEEE 754 32-bit floating point number, or as close as you can get using a base 10 representation; same as Java's float | -INF, -1E4, -0, 0, 12.78E-2, 12, INF, NaN |
| xsd:double | IEEE 754 64-bit floating point number, or as close as you can get using a base 10 representation; same as Java's double | -INF, 1.401E-90, -1E4, -0, 0, 12.78E-2, 12, INF, NaN, 3.4E42 |
| xsd:decimal | Arbitrary precision, decimal numbers; same as java.math.BigDecimal | -3.1415292, 0, 7.8, 90200.76 |
| xsd:integer | An arbitrarily large or small integer; same as java.math.BigInteger | -500000000000000000000000, -9223372036854775809, -126789, -1, 0, 1, 5, 23, 42, 126789, 9223372036854775808, 4567349873249832649873624958 |
| xsd:nonPositiveInteger | An integer less than or equal to zero | 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, . . . |
| xsd:negativeInteger | An integer strictly less than zero | -1, -2, -3, -4, -5, -6, -7, -8, -9, . . . |
| xsd:long | An eight-byte two's complement integer such as Java's long | -9223372036854775808, -9223372036854775807, . . . -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . ., 2147483645, 2147483646, 2147483647, 2147483648, . . . 9223372036854775806, 9223372036854775807 |
| xsd:int | An integer that can be represented as a four-byte, two's complement number such as Java's int | -2147483648, -2147483647, -2147483646, -2147483645, . . . -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . ., 2147483645, 2147483646, 2147483647 |
| xsd:short | An integer that can be represented as a two-byte, two's complement number such as Java's short | -32768, -32767, -32766, . . ., -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, . . . 32765, 32766, 32767 |
| xsd:byte | An integer that can be represented as a one-byte, two's complement number such as Java's byte | -128, -127, -126, -125, . . ., -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, . . . 121, 122, 123, 124, 125, 126, 127 |
| xsd:nonNegativeInteger | An integer greater than or equal to zero | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, . . . |
| xsd:unsignedLong | An eight-byte unsigned integer | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, . . . 18446744073709551614, 18446744073709551615 |
| xsd:unsignedInt | A four-byte unsigned integer | 0, 1, 2, 3, 4, 5, . . . 4294967294, 4294967295 |
| xsd:unsignedShort | A two-byte unsigned integer | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . . 65533, 65534, 65535 |
| xsd:unsignedByte | A one-byte unsigned integer | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . . 252, 253, 254, 255 |
| xsd:positiveInteger | An integer strictly greater than zero | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, . . . |
The next set of simple types the W3C XML Schema language provides are more familiar to database designers than to procedural
programmers; these are the time types. These can represent times of day, dates, or durations of time. The formats, shown in
Table 24-2, are all based on the ISO standard 8601, Representations of dates and times (http://www.iso.ch/markete/8601.pdf). Time zones are given as offsets from Coordinated Universal Time (Greenwich Mean Time to laypeople) or as the letter Z to
indicate Coordinated Universal Time.
Table 24-2: XML Schema Time Types
| Name | Type | Examples |
|---|---|---|
| xsd:dateTime | A particular moment in Coordinated Universal Time, up to an arbitrarily small fraction of a second | 1999-05-31T13:20:00.000-05:00, 1999-05-31T18:20:00.000Z, 1999-05-31T13:20:00.000, 1999-05-31T13:20:00.321-05:00 |
| xsd:date | A specific day in history | -0044-03-15, 0001-01-01, 1969-06-27, 2000-10-31, 2001-11-17 |
| xsd:time | A specific time of day that recurs every day | 14:30:00.000, 09:30:00.000-05:00, 14:30:00.000Z |
| xsd:gDay | A day in no particular month, or rather in every month | ---01, ---02, . . . ---09, ---10, ---11, ---12, . . ., ---28, ---29, ---30, ---31 |
| xsd:gMonth | A month in no particular year | --01--, --02--, --03--, --04--, . . . --09--, --10--, --11--, --12-- |
| xsd:gYear | A given year | . . . -0002, -0001, 0001, 0002, 0003, . . . 1998, 1999, 2000, 2001, 2002, . . . 9997, 9998, 9999 |
| xsd:gYearMonth | A specific month in a specific year | 1999-12, 2001-04, 1968-07 |
| xsd:gMonthDay | A date in no particular year, or rather in every year | --10-31, --02-28, --02-29 |
| xsd:duration | A length of time, without fixed endpoints, to an arbitrary fraction of a second | P2000Y10M31DT09H32M7.4312S |
Notice in particular that in all the date formats the year comes first, followed by the month, then the day, then the hour, and so on. The largest unit of time is on the left and the smallest unit is on the right. This helps avoid questions such as whether 2001-02-11 is February 11, 2001 or November 2, 2001.
The next batch of schema data types should be quite familiar. These are the types related to XML constructs themselves. Most
of these types match attribute types in DTDs such as NMTOKENS or IDREF. The difference is that with schemas these types can be applied to both elements and attributes. These also include four
new types related to other XML constructs: xsd:language, xsd:Name, xsd:QName, and xsd:NCName. Table 24-3 summarizes the different types.
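Table 24-3: XML Schema XML Types
| Name | Type | Examples |
|---|---|---|
| xsd:ID | XML 1.0 ID attribute type | |
| xsd:IDREF | XML 1.0 IDREF attribute type | |
| xsd:IDREFS | XML 1.0 IDREFS attribute type | |
| xsd:ENTITY | XML 1.0 ENTITY attribute type | |
| xsd:ENTITIES | XML 1.0 ENTITIES attribute type | |
| xsd:NMTOKEN | XML 1.0 NMTOKEN attribute type | |
| xsd:NMTOKENS | XML 1.0 NMTOKENS attribute type | |
| xsd:NOTATION | XML 1.0 NOTATION attribute type | |
| xsd:language | Valid values for the xml:lang attribute | |
| xsd:Name | An XML 1.0 Name, with or without colons | |
| xsd:QName | A prefixed name | |
| xsd:NCName | A local name without any colons | |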
Cross-Reference
For more details on the permissible values for elements and attributes declared to have these types, see Chapter 11.
You've already encountered the xsd:string type. It's the most generic simple type. It requires a sequence of Unicode characters of any length, but this is what all
XML element content and attribute values are. There are also two very closely related types: xsd:normalizedString and xsd:token. These are the same as xsd:string except that they limit the amount, location, and type of white space that can be used. Table 24-4 summarizes the string data
types.
Table 24-4: XML Schema String Types
| Name | Type | Examples |
|---|---|---|
| xsd:string | A sequence of zero or more Unicode characters that are allowed in an XML document; essentially the only forbidden characters are most of the C0 controls, surrogates, and the noncharacters U+FFFE and U+FFFF | |
| xsd:normalizedString | A string that does not contain any tabs, carriage returns, or linefeeds | |
| xsd:token | A string with no leading or trailing white space, no tabs, no linefeeds, and not more than one consecutive space | |
It's impossible
to include arbitrary binary files in XML documents because they might contain illegal characters such as a form feed or a
null that would make the XML document malformed. Therefore, any such data must first be encoded in legal characters. The W3C
XML Schema Language supports two such encodings, xsd:base64Binary and xsd:hexBinary.
Hexadecimal binary encodes each byte of the input as two hexadecimal digits — 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F, 10, 11, 12, and so on. Thus, an entire file can be encoded using only the digits 0 through 9 and the letters A through F. (Lowercase letters are also allowed, but uppercase letters are customary.) On the other hand, each byte is replaced by two bytes so this encoding doubles the size of the data. It's not a very efficient encoding. Hexadecimal binary encoded data tends to look like this:
A4E345EC54CC8D52198000FFEA6C807F41F332127323432147A89979EEF3
Base64 encoding uses a more complex algorithm and a larger character set, 65 ASCII characters chosen for their ability to
pass through almost all gateways, mail relays, and terminal servers intact, as well as their existence with the same code
points in ASCII, EBCDIC, and most other common character sets. Base64 encodes every three bytes as four characters, typically
only increasing file size by a third, so it's somewhat more efficient than xsd:hexBinary. Base64 encoded data tends to look something like this:
6jKpNnmkkWeArsn5Oeeg2njcz+nXdk0f9kZI892ddlR8Lg1aMhPeFTYuoq3I6n BjWzuktNZKiXYBfKsSTB8U09dTiJo2ir3HJuY7eW/p89osKMfixPQsp9vQMgzph6Qa lY7j4MB7y5ROJYsTr1/fFwmj/yhkHwpbpzed1LE=
XML Digital Signatures use Base64 encoding to encode the binary signatures before wrapping them in an XML element.
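As a rough sketch of how the two encodings compare, the fragments below declare one element of each type and then encode the same five bytes, the ASCII text Hello, both ways. The HEX_DATA and BASE64_DATA element names are hypothetical.

```xml
<!-- Schema fragment: hypothetical element names -->
<xsd:element name="HEX_DATA" type="xsd:hexBinary"/>
<xsd:element name="BASE64_DATA" type="xsd:base64Binary"/>

<!-- Instance fragment: the five ASCII bytes of "Hello" in each encoding -->
<HEX_DATA>48656C6C6F</HEX_DATA>
<BASE64_DATA>SGVsbG8=</BASE64_DATA>
```

The hexadecimal form needs ten characters for the five bytes, while the Base64 form needs eight, including the trailing = that pads the final group out to four characters.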
Caution
I really discourage you from using either of these if at all possible. If you have binary data, it's much more efficient and much less obscure to link to it using XLink or unparsed entities rather than encoding it in Base64 or hexadecimal binary.
There are two types left over that don’t fit neatly into the previous categories: xsd:boolean, and xsd:anyURI. The xsd:boolean type represents something similar to C++'s bool data type. It has four legal values: 0, 1, true, and false. 0 is considered to be the same as false, and 1 is considered the same as true.
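For instance, assuming a hypothetical IN_PRINT element declared with this type, all four of the instances below should validate.

```xml
<!-- Schema fragment: IN_PRINT is a hypothetical element -->
<xsd:element name="IN_PRINT" type="xsd:boolean"/>

<!-- Instance fragments: each of these is a legal xsd:boolean value -->
<IN_PRINT>true</IN_PRINT>
<IN_PRINT>1</IN_PRINT>
<IN_PRINT>false</IN_PRINT>
<IN_PRINT>0</IN_PRINT>
```

The values are case sensitive, so variations such as TRUE or yes are not legal.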
The final schema simple type is xsd:anyURI. An element of this type contains a relative or absolute URI, possibly a URL, such as urn:isbn:0764547607, http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/#timeDuration, /javafaq/reports/JCE1.2.1.htm, /TR/2000/WD-xmlschema-2-20000407/, or ../index.html.
Caution
Xerces 1.4.0 doesn’t yet accept relative URLs in elements and attributes with the type xsd:anyURI. This is scheduled to be fixed in Xerces 1.4.1.
You're not limited to the 44 simple types that the W3C XML Schema Language defines. As in object-oriented programming languages,
you can create new data types by deriving from the existing types. The most common such derivation is to restrict a type to
a subset of its normal values. For instance, you can define an integer type that only holds numbers between 1 and 20 by deriving
from xsd:positiveInteger. You can create enumerated types that only allow a finite list of fixed values. You can create new types that join together
the ranges of existing types through a union. For instance you can derive a type that can hold either an xsd:date or an xsd:int.
New simple types are created by xsd:simpleType elements, just as new complex types are created by xsd:complexType elements. The name attribute of xsd:simpleType assigns a name to the new type by which it can be referred to in xsd:element type attributes. The allowed content of elements and attributes with the new type can be specified by one of three child elements:
xsd:restriction to select a subset of the values allowed by the base type
xsd:union to combine multiple types
xsd:list to specify a list of elements of an existing simple type
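Restriction is covered in detail next. As a brief sketch of the other two, the fragment below defines one union type and one list type. The type names dateOrInt and yearList are assumptions chosen for this illustration.

```xml
<!-- Union: a value may be either an xsd:date or an xsd:int -->
<xsd:simpleType name="dateOrInt">
  <xsd:union memberTypes="xsd:date xsd:int"/>
</xsd:simpleType>

<!-- List: a white-space-separated list of xsd:gYear values -->
<xsd:simpleType name="yearList">
  <xsd:list itemType="xsd:gYear"/>
</xsd:simpleType>
```

An element declared with the yearList type could then contain content such as 1978 1979 1980, while an element of type dateOrInt could contain either 1978-06-27 or 42.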
To create a new type by restricting from an existing type you give the xsd:simpleType element an xsd:restriction child element. The base attribute of this element specifies what type you're restricting. For example, this xsd:simpleType element creates a new type named phonoYear that's derived from xsd:gYear:
<xsd:simpleType name="phonoYear">
<xsd:restriction base="xsd:gYear">
</xsd:restriction>
</xsd:simpleType>
With this declaration any legal xsd:gYear is also a legal phonoYear, and any illegal year is also an illegal phonoYear. You can limit phonoYear to a subset of the normal year values by using facets to specify which values are and are not allowed. For instance, the minInclusive facet defines the minimum legal value for a type. This facet is added to a restriction as an xsd:minInclusive child element. The value attribute of the xsd:minInclusive element sets the minimum allowed value for the year:
<xsd:simpleType name="phonoYear">
<xsd:restriction base="xsd:gYear">
<xsd:minInclusive value="1877"/>
</xsd:restriction>
</xsd:simpleType>
Here the value of xsd:minInclusive is set to 1877, the year Thomas Edison invented the phonograph. Thus, 1877 is a legal phonoYear, 1878 is a legal phonoYear, 2001 is a legal phonoYear, and 3005 is a legal phonoYear. However, 1876, 1875, 1874, and earlier years are not legal phonoYears, even though they are legal xsd:gYears.
Once the phonoYear type has been defined, you can use it just like one of the built-in types. For example, in the SONG schema, you'd declare that the YEAR element has the type phonoYear like this:
<xsd:element name="YEAR" type="phonoYear"/>
minInclusive is not the only facet you can apply to xsd:gYear. Other facets of xsd:gYear are:
xsd:minExclusive: the minimum value that all instances must be strictly greater than
xsd:maxInclusive: the maximum value that all instances must be less than or equal to
xsd:maxExclusive: the maximum value that all instances must be strictly less than
xsd:enumeration: a list of all legal values
xsd:whiteSpace: how white space is treated within the element
xsd:pattern: a regular expression to which the instance is compared
Each facet is represented as an empty element inside an xsd:restriction element. Each facet has a value attribute giving the value of that facet. One restriction can contain more than one facet. For example, this xsd:simpleType element defines a phonoYear as any year between 1877 and 2100, inclusive:
<xsd:simpleType name="phonoYear">
  <xsd:restriction base="xsd:gYear">
    <xsd:minInclusive value="1877"/>
    <xsd:maxInclusive value="2100"/>
  </xsd:restriction>
</xsd:simpleType>
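Multiple facets can conflict. For instance, the minInclusive value could be 2100 and the maxInclusive value could be 1877. This is a design mistake, and the XML Schema specification treats a minInclusive value greater than the maxInclusive value of the same type as a schema error, so a conforming processor should reject the declaration. Other conflicts are subtler. A pattern and a range that share no values are both syntactically legal, but together they make the set of legal values the empty set, so elements of that type could not actually be used in instance documents.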
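The enumeration and pattern facets follow the same shape as the range facets. The sketch below is illustrative only: the type names reissueYear and plainYear and the particular years are invented. Note that each legal value gets its own xsd:enumeration element, and that a pattern is matched against the text of the value as written in the document.

```xml
<!-- Hypothetical type: only the listed years are legal -->
<xsd:simpleType name="reissueYear">
  <xsd:restriction base="xsd:gYear">
    <xsd:enumeration value="1978"/>
    <xsd:enumeration value="1994"/>
    <xsd:enumeration value="2001"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Hypothetical type: exactly four digits, so no time zone suffix and no leading minus sign -->
<xsd:simpleType name="plainYear">
  <xsd:restriction base="xsd:gYear">
    <xsd:pattern value="[0-9]{4}"/>
  </xsd:restriction>
</xsd:simpleType>
```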
Facets are shared among many types. For instance, the minInclusive facet can constrain essentially any well-ordered type, including not only xsd:gYear, but also xsd:byte, xsd:unsignedByte, xsd:integer, xsd:positiveInteger, xsd:negativeInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:int, xsd:unsignedInt, xsd:long, xsd:unsignedLong, xsd:short, xsd:unsignedShort, xsd:decimal, xsd:float, xsd:double, xsd:time, xsd:dateTime, xsd:duration, xsd:date, xsd:gMonth, xsd:gYearMonth, and xsd:gMonthDay. The complete list of constraining facets that can be applied to different types is:
xsd:minInclusive: the value that all instances must be greater than or equal to
xsd:minExclusive: the value that all instances must be strictly greater than
xsd:maxInclusive: the value that all instances must be less than or equal to
xsd:maxExclusive: the value that all instances must be strictly less than
xsd:enumeration: a list of all legal values
xsd:whiteSpace: how white space is treated within the element
xsd:pattern: a regular expression to which the instance is compared
xsd:length: the exact number of characters in the element
xsd:minLength: the minimum number of characters allowed in the element
xsd:maxLength: the maximum number of characters allowed in the element
xsd:totalDigits: the maximum number of digits allowed in the element
xsd:fractionDigits: the maximum number of digits allowed in the fractional part of the element
Not all facets apply to all types. For instance it doesn’t make much sense to talk about the minimum value of an xsd:NMTOKEN or the number of fraction digits in an xsd:gYear. However, when the same facet is shared by different types, it has the same syntax and basic meaning for all the types.
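As a sketch of how the same facets carry over to other base types, the declarations below address two of the cases DTDs cannot express: a month number between 1 and 12, and a positive price with at most two digits after the decimal point. The type names MonthNumber and Money are hypothetical. Because xsd:fractionDigits is a maximum, values such as 5 and 5.5 would also be accepted by Money, and xsd:decimal does not allow a currency symbol.

```xml
<!-- Hypothetical type: an integer from 1 to 12 inclusive -->
<xsd:simpleType name="MonthNumber">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="1"/>
    <xsd:maxInclusive value="12"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Hypothetical type: a decimal greater than zero with no more than two fraction digits -->
<xsd:simpleType name="Money">
  <xsd:restriction base="xsd:decimal">
    <xsd:minExclusive value="0"/>
    <xsd:fractionDigits value="2"/>
  </xsd:restriction>
</xsd:simpleType>
```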
The three length facets, xsd:length, xsd:minLength, and xsd:maxLength, apply to xsd:string and the types derived from it (xsd:normalizedString, xsd:token, xsd:language, xsd:NCName, xsd:ID, xsd:IDREF, xsd:ENTITY, and xsd:NMTOKEN), and also to xsd:hexBinary, xsd:base64Binary, xsd:QName, xsd:anyURI, xsd:NOTATION, and the list types xsd:IDREFS, xsd:ENTITIES, and xsd:NMTOKENS. For the string types and xsd:anyURI, these facets specify the number of characters allowed in the element or attribute value. For the two binary types they count bytes of decoded data, and for the list types they count items in the list. The value attribute of each of these facets must contain a nonnegative integer. xsd:length sets the exact length of the value, whereas xsd:minLength sets the minimum length and xsd:maxLength sets the maximum length.
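As a small illustration of the exact-length case, the sketch below defines a string type whose values must be exactly eight characters long. The type name CatalogCode and the length of 8 are hypothetical, chosen only for the example.

```xml
<!-- Hypothetical type: a string of exactly 8 characters -->
<xsd:simpleType name="CatalogCode">
  <xsd:restriction base="xsd:string">
    <xsd:length value="8"/>
  </xsd:restriction>
</xsd:simpleType>
```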
For example, the schema in Listing 24-18 uses the xsd:minLength and xsd:maxLength facets to derive a new Str255 data type from xsd:string. Whereas xsd:string allows strings of any length from zero on up, Str255 requires each string to have a minimum length of 1 and a maximum length of 255. The schema then assigns this data type to all the names and titles to indicate that each must contain between 1 and 255 characters:
Listing 24-18: A schema that derives a Str255 data type from xsd:string
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:simpleType name="Str255">
    <xsd:restriction base="xsd:string">
      <xsd:minLength value="1"/>
      <xsd:maxLength value="255"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="SONG" type="SongType"/>

  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="Str255"/>
      <xsd:element name="COMPOSER" type="Str255"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER" type="Str255"
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="Str255"
                   minOccurs="0"/>
      <xsd:element name="LENGTH" type="xsd:duration"/>
      <xsd:element name="YEAR" type="xsd:gYear"/>
      <xsd:element name="ARTIST" type="Str255"
                   maxOccurs="unbounded"/>
      <xsd:element name="PRICE" type="xsd:string"
                   minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
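An instance document that should satisfy Listing 24-18 might look like the sketch below. All the element values are placeholders invented for illustration. The optional PUBLISHER and PRICE elements are omitted, and COMPOSER is repeated to show maxOccurs="unbounded" at work. Replacing the title with an empty <TITLE></TITLE> would make the document invalid, because Str255 requires at least one character.

```xml
<?xml version="1.0"?>
<!-- Placeholder values, invented for illustration -->
<SONG>
  <TITLE>Example Title</TITLE>
  <COMPOSER>First Composer</COMPOSER>
  <COMPOSER>Second Composer</COMPOSER>
  <PRODUCER>Example Producer</PRODUCER>
  <LENGTH>PT6M20S</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Example Artist</ARTIST>
</SONG>
```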
The whiteSpace facet is unusual. Unlike the other 11 facets, xsd:whiteSpace does not in any way constrain the allowed content of elements. Instead, it suggests what the application should do with any white space that it finds in the instance document. It says how significant that white space is. However, it does not in any way say that any particular kind of white space is legal or illegal.
The xsd:whiteSpace facet has three possible values:
preserve: The white space in the input document is unchanged.
replace: Each tab, carriage return, and linefeed is replaced with a single space.
collapse: Each tab, carriage return, and linefeed is replaced with a single space. Furthermore, after this replacement is performed, all runs of multiple spaces are condensed to a single space. Leading and trailing white space is deleted.
Again, these are all just hints to the application. None of them has any effect on validation.
The whiteSpace facet can only be applied to xsd:string, xsd:normalizedString, and xsd:token types. Furthermore, it only fully applies to elements. XML 1.0 requires that parsers replace all white space in attributes, and collapse white space in attributes whose type is anything other than CDATA, regardless of what the schema says.
The schema in Listing 24-19 uses the xsd:whiteSpace facets to derive a new CollapsedString data type from