Validity

Programmers have long known the value of verifiable preconditions on functions and methods. (A lot of us carelessly don’t use them, but that’s a topic for another book.) One of the important innovations of XML is the ability to place preconditions on the data the programs read, and to do this in a simple declarative way. XML allows you to say that every Order element must contain exactly one Customer element, that each Customer element must have an id attribute that contains an XML name token, that every ShipTo element must contain one or more Streets, one City, one State, and one Zip, and so forth. Checking an XML document against this list of conditions is called validation. Validation is an optional step but an important one.

There is more than one language in which you can express such conditions. Generically, these are called schema languages, and the documents that list the constraints are called schemas. Different schema languages have different strengths and weaknesses. The document type definition (DTD) is the only schema language built into most XML parsers and endorsed as a standard part of XML. However, because of the extensible nature of XML, many other schema languages have been invented that can easily be integrated with your systems.

DTDs

A DTD focuses on the element structure of a document. It says what elements a document may contain, what each element may and must contain in what order, and what attributes each element has.

Element Declarations

In order to be valid according to a DTD, each element used in the document must be declared in an ELEMENT declaration. For example, this is an ELEMENT declaration that says that Name elements contain #PCDATA, that is, text but no child elements.

<!ELEMENT Name (#PCDATA)>

Elements that can have children are declared by listing the names of their children in order, separated by commas. For example, this ELEMENT declaration says that an Order element contains a Customer element, a Product element, a Subtotal element, a Tax element, a Shipping element, and a Total element in that order:

<!ELEMENT Order (Customer, Product, Subtotal, Tax, Shipping, Total)>

The parenthesized list of things an element can contain is called the element’s content model. You can attach a question mark after an element name in the content model to indicate that the element is optional; that is, that either zero or one instance of the element may occur at that position. You can attach an asterisk after the element name to indicate that zero or more instances of the element may occur at that position, or a plus sign to indicate that one or more instances of the element must occur at that position. For example, this element declaration states that a ShipTo element must contain zero or one GiftRecipient elements, one or more Street elements, and exactly one City, State, and Zip elements in that order:

<!ELEMENT ShipTo (GiftRecipient?, Street+, City, State, Zip)>

You can use a vertical bar instead of a comma to indicate that either one or the other of the elements may appear. You can group collections of elements with parentheses to indicate that the entire group should be treated as a unit. You can suffix a *, ?, or + to the group to indicate that zero or more, zero or one, or one or more of those groups may appear at that point. Finally, you may replace the entire content model with the keyword EMPTY to specify that the element must not contain any content at all.
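As a rough illustration of these operators, the following Java sketch validates small documents against a DTD carried in the internal DTD subset, using the JAXP DOM parser. The Payment, CreditCard, Check, Note, and Paid elements are hypothetical names invented for this sketch; they are not part of the order vocabulary used in this chapter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ContentModelDemo {

    // Hypothetical DTD: a choice group, a repeatable element,
    // and an optional element that must be empty.
    static final String DTD =
          "<!DOCTYPE Payment [\n"
        + "  <!ELEMENT Payment ((CreditCard | Check), Note*, Paid?)>\n"
        + "  <!ELEMENT CreditCard (#PCDATA)>\n"
        + "  <!ELEMENT Check (#PCDATA)>\n"
        + "  <!ELEMENT Note (#PCDATA)>\n"
        + "  <!ELEMENT Paid EMPTY>\n"
        + "]>\n";

    // Returns true only if the parser reports no validity errors.
    static boolean isValid(String body) {
        final boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { valid[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // a malformed document cannot be valid
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            return false;
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // One alternative from the choice, one Note, and an empty Paid element
        System.out.println(isValid(
            "<Payment><Check>1001</Check><Note>rush</Note><Paid/></Payment>"));
        // Both alternatives of the choice at once
        System.out.println(isValid(
            "<Payment><CreditCard>1</CreditCard><Check>2</Check></Payment>"));
    }
}
```

The first document should be reported as valid and the second as invalid, because the vertical bar allows only one of CreditCard or Check. A Paid element with any content would likewise be rejected by the EMPTY keyword.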

Attribute Declarations

A DTD also specifies which attributes may and must appear on which elements. Each attribute is declared in an ATTLIST declaration which specifies:

  • The element to which the attribute belongs

  • The name of the attribute

  • The type of the attribute

  • The default value of the attribute

For example, this ATTLIST declaration says that every Customer element must have an attribute named id with type ID:

<!ATTLIST Customer id ID #REQUIRED>

DTDs define ten different types for attributes:

CDATA

Any string of text; the default type for undeclared attributes in invalid documents

NMTOKEN

A string composed of one or more legal XML name characters. Unlike an XML name, a name token may start with a digit.

NMTOKENS

A white space separated list of name tokens

ID

An XML name that is unique among ID type attributes in the document

IDREF

An XML name used as an ID attribute value on some element in the document

IDREFS

A white space separated list of XML names used as ID attribute values somewhere in the document

ENTITY

The name of an unparsed entity declared in an ENTITY declaration in the DTD

ENTITIES

A white space separated list of unparsed entities declared in the DTD

NOTATION

The name of a notation declared in a NOTATION declaration in the DTD

Enumeration

A list of all legal values for the attribute, separated by vertical bars. Each possible value must be an XML name token.

Most parsers and APIs will tell you what the type of an attribute is if you want to know, but in practice this knowledge is not very useful. W3C XML schema language schemas offer much more complete data typing for both elements and attributes, including not only these types but also the more customary data types like int and double.

DTDs allow four possible default values for attributes:

#REQUIRED

Each element in the instance document must provide a value for this attribute.

#IMPLIED

Each element in the instance document may or may not provide a value for this attribute. If an element does not, then no default value is provided from the DTD.[7]

#FIXED "value"

The attribute always has the value that follows #FIXED in double or single quotes, whether or not it’s present in the instance document.

"value"

By default the attribute has the value specified in the DTD in single or double quotes. However, individual instances of the element may specify a different value.

Parsers may or may not tell you whether an attribute came from the instance document or was defaulted in from the DTD. It’s relatively rare that you care about this one way or the other. However, if you’re using a document that relies heavily on attribute values from DTDs (e.g., for namespace declarations), make sure you’re using a parser that does read the external DTD subset.
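To make attribute defaulting concrete, here is a minimal Java sketch using the JAXP DOM parser. It reuses the Shipping declarations from the order DTD but places them in the internal DTD subset so that the example is self-contained. The class name and the sample value 4.95 are assumptions made for illustration. Whether a given parser exposes the specified flag reliably is worth checking for yourself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultAttributeDemo {

    // The instance document deliberately omits the method attribute,
    // so any value seen for it must come from the DTD default.
    static final String XML =
          "<!DOCTYPE Shipping [\n"
        + "  <!ELEMENT Shipping (#PCDATA)>\n"
        + "  <!ATTLIST Shipping currency (USD | CAN | GBP) #REQUIRED\n"
        + "                     method (USPS | UPS | Overnight) \"UPS\">\n"
        + "]>\n"
        + "<Shipping currency=\"USD\">4.95</Shipping>";

    // Parses the sample document and returns its root element.
    static Element parseRoot() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        Element shipping = parseRoot();
        Attr method = shipping.getAttributeNode("method");
        Attr currency = shipping.getAttributeNode("currency");
        // The defaulted attribute looks like any other attribute...
        System.out.println("method = " + method.getValue());
        // ...but DOM records whether it was written in the instance document.
        System.out.println("method specified? " + method.getSpecified());
        System.out.println("currency specified? " + currency.getSpecified());
    }
}
```

Because the declarations sit in the internal DTD subset, even a non-validating parser applies the default here. With an external DTD subset, the result depends on whether the parser reads that subset.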

Example 1.8 is a complete DTD for order documents of the type shown in this chapter. It uses both ELEMENT and ATTLIST declarations.

Example 1.8. A DTD for order documents

<!ELEMENT Order (Customer, Product+, Subtotal, Tax, Shipping, Total)>
<!ELEMENT Customer (#PCDATA)>
<!ATTLIST Customer id ID #REQUIRED>
<!ELEMENT Product (Name, SKU, Quantity, Price, Discount?, 
                   ShipTo, GiftMessage?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT SKU (#PCDATA)>
<!ELEMENT Quantity (#PCDATA)>
<!ELEMENT Price (#PCDATA)>
<!ATTLIST Price currency (USD | CAN | GBP) #REQUIRED>
<!ELEMENT Discount (#PCDATA)>
<!ELEMENT ShipTo (GiftRecipient?, Street+, City, State, Zip)>
<!ELEMENT GiftRecipient (#PCDATA)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT City   (#PCDATA)>
<!ELEMENT State  (#PCDATA)>
<!ELEMENT Zip    (#PCDATA)>
<!ELEMENT GiftMessage (#PCDATA)>
<!ELEMENT Subtotal (#PCDATA)>
<!ATTLIST Subtotal currency (USD | CAN | GBP) #REQUIRED>
<!ELEMENT Tax (#PCDATA)>
<!ATTLIST Tax currency (USD | CAN | GBP) #REQUIRED
              rate CDATA "0.0"
>

<!ELEMENT Shipping (#PCDATA)>
<!ATTLIST Shipping currency (USD | CAN | GBP) #REQUIRED
                   method   (USPS | UPS | Overnight) "UPS">
<!ELEMENT Total (#PCDATA)>
<!ATTLIST Total currency (USD | CAN | GBP) #REQUIRED>

Document Type Declarations

Documents are associated with particular DTDs using document type declarations. This is a document type declaration that points to the DTD in Example 1.8:

<!DOCTYPE Order SYSTEM "order.dtd">

The document type declaration is placed in the instance document’s prolog, after the XML declaration but before the root element start-tag. For example,

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Order SYSTEM "order.dtd">
<Order>
  ...

This does assume that the DTD can be found in the same directory where the document itself is. If you prefer you can use an absolute URL instead. For example,

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Order SYSTEM "http://www.ibiblio.org/xml/dtds/order.dtd">
<Order>
  ...

Even though Example 1.5 satisfies all the conditions expressed in Example 1.8, it is not valid because it does not have a document type declaration pointing to that DTD.
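The effect of a missing document type declaration can be seen in a short Java sketch. It assumes a cut-down stand-in for order.dtd (only Order, Customer, and Total) and a made-up customer, and it serves the DTD from memory through a SAX EntityResolver so that nothing needs to exist on disk. None of these names or values come from the chapter's examples.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DoctypeDemo {

    // Cut-down stand-in for order.dtd (an assumption for this sketch).
    static final String ORDER_DTD =
          "<!ELEMENT Order (Customer, Total)>\n"
        + "<!ELEMENT Customer (#PCDATA)>\n"
        + "<!ATTLIST Customer id ID #REQUIRED>\n"
        + "<!ELEMENT Total (#PCDATA)>\n";

    // Hypothetical instance data that satisfies the DTD above.
    static final String BODY =
        "<Order><Customer id=\"c32\">Chez Fred</Customer>"
      + "<Total>12.50</Total></Order>";

    static boolean isValid(String xml) {
        final boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Answer every external entity request with the in-memory DTD.
            builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader(ORDER_DTD)));
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { valid[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false;
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // With a document type declaration: the DTD is found and satisfied.
        System.out.println(isValid("<!DOCTYPE Order SYSTEM \"order.dtd\">" + BODY));
        // Same content, no document type declaration: not valid.
        System.out.println(isValid(BODY));
    }
}
```

The two documents have identical content. Only the one that names its DTD in a document type declaration passes a validating parse.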

Caution

The acronym DTD is correctly used only to mean “document type definition”. It should never be used to mean “document type declaration”. The document type declaration may contain or point to the document type definition (or both); but it is not the same thing.

DTDs are not just about validation. They can also affect the content of the instance document itself. In particular, they can:

  • Define entities

  • Define notations

  • Provide default values for attributes

Assuming you’re using a validating parser, there is little reason to care about how such things happen. The entities the DTD defines will be resolved before you see them. The notations will be applied to the appropriate elements and entities. A default attribute value will be just one more attribute in an element’s list of attributes. Some APIs may tell you what entity a particular element came from or whether an attribute value was defaulted from the DTD or present in the instance document. However, most of the time you simply do not need to know this.

Schemas

The W3C XML Schema Language (schemas for short, though it’s hardly the only schema language) addresses several limitations of DTDs. First, schemas are written in XML instance document syntax, using tags, elements, and attributes. Secondly, schemas are fully namespace aware. Thirdly, schemas can assign data types like integer and date to elements, and validate documents not only based on the element structure but also on the contents of the elements.

Example 1.9 shows a schema for order documents. Where order.dtd uses an ELEMENT declaration, order.xsd uses an xsd:element element. Where order.dtd uses an ATTLIST declaration, order.xsd uses an xsd:attribute element.

But order.xsd doesn’t just repeat the same constraints found in order.dtd. It also assigns types and ranges to the elements. For instance, it requires that all the money elements (Tax, Shipping, Subtotal, Total, and Price) contain a decimal number such as 9.85, 7.2, or -3.25. [8] If one of these elements contained text such as “France” that is not a decimal number, then the validator would notice and report the problem. DTDs cannot detect mistakes like this. A DTD can note that there is no Price element where one is expected, but it cannot determine that the Price element does not actually give a price.

Example 1.9. order.xsd: a schema for order documents

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="Order">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Customer">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="xsd:string">
                <xsd:attribute name="id" type="xsd:ID"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
         </xsd:element>
        <xsd:element name="Product" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Name"     type="xsd:string"/>
              <xsd:element name="SKU" 
               type="xsd:positiveInteger"/>
              <xsd:element name="Quantity" 
               type="xsd:positiveInteger"/>
              <xsd:element name="Price"    type="MoneyType"/>
              <xsd:element name="Discount" type="xsd:decimal" 
                           minOccurs="0"/>
              <xsd:element name="ShipTo">
                <xsd:complexType>
                  <xsd:sequence>
                    <xsd:element name="GiftRecipient" 
                     type="xsd:string" 
                     minOccurs="0"/>
                    <xsd:element name="Street" 
                     type="xsd:string" maxOccurs="unbounded"/>
                    <xsd:element name="City" type="xsd:string"/>
                    <xsd:element name="State" 
                     type="xsd:string"/>
                    <xsd:element name="Zip" type="xsd:string"/>
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
              <xsd:element name="GiftMessage" type="xsd:string" 
                           minOccurs="0"/>
            </xsd:sequence>        
          </xsd:complexType>           
        </xsd:element>
        <xsd:element name="Subtotal" type="MoneyType"/>
        <xsd:element name="Tax">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="MoneyType">
                <xsd:attribute name="rate" type="xsd:decimal"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>        
        </xsd:element>
        <xsd:element name="Shipping">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="MoneyType">
                <xsd:attribute name="method" type="xsd:string"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>                
        </xsd:element>
        <xsd:element name="Total" type="MoneyType"/>
      </xsd:sequence>
    </xsd:complexType>  
  </xsd:element>

  <xsd:complexType name="MoneyType">
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

</xsd:schema>

There are multiple ways to indicate that a document should satisfy a known schema. The most common is an xsi:noNamespaceSchemaLocation attribute on the root element of the instance document. The xsi prefix is bound to the http://www.w3.org/2001/XMLSchema-instance URI. For example,

<?xml version="1.0" encoding="ISO-8859-1"?>
<Order xsi:noNamespaceSchemaLocation="order.xsd"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  ...

Some parsers also provide ways to specify a schema from inside a program, for instance by setting various properties. I’ll discuss this more when we get to programmatic validation in Chapter 7.
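As one possible illustration of specifying a schema from inside a program, the sketch below uses the javax.xml.validation package that later versions of JAXP added to the standard Java library. That package postdates this writing, so treat it as an assumption about your environment rather than as the technique a particular parser uses. The schema is a fragment: MoneyType is copied from order.xsd, while the stand-alone global Total element and the sample values are simplifications invented for the sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaDemo {

    // MoneyType as in order.xsd; the global Total element is a simplification.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>\n"
        + "  <xsd:element name='Total' type='MoneyType'/>\n"
        + "  <xsd:complexType name='MoneyType'>\n"
        + "    <xsd:simpleContent>\n"
        + "      <xsd:extension base='xsd:decimal'>\n"
        + "        <xsd:attribute name='currency' type='xsd:string'/>\n"
        + "      </xsd:extension>\n"
        + "    </xsd:simpleContent>\n"
        + "  </xsd:complexType>\n"
        + "</xsd:schema>\n";

    // Returns true if the document satisfies the schema, false otherwise.
    static boolean satisfiesSchema(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, the first violation is thrown
            // as a SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Schema violation or malformed document (also reached if the
            // schema itself fails to compile).
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(satisfiesSchema("<Total currency='USD'>9.85</Total>"));
        System.out.println(satisfiesSchema("<Total currency='USD'>France</Total>"));
    }
}
```

The second document is well-formed, but its content is not an xsd:decimal, so it is rejected.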

Schemas are still pretty bleeding edge technology at the time of this writing (May, 2002). Only a few parsers support them, and even those implement the full W3C XML Schema 1.0 specification incompletely. Nonetheless, developers have been clamoring for this functionality (if not necessarily this syntax) for some time, so schemas seem likely to achieve broad adoption relatively quickly.

For the moment, schema support is limited to simple validation, much as DTD support is. A schema-aware parser will read an XML document, compare what it sees there to a schema, and return a boolean result: either the document satisfies the schema or it does not. In the event the document fails to satisfy the schema, the parser might give you a line number and a more detailed error message about exactly what the problem is, but that’s it. More complete uses of schemas, in which parsers tell you what the type of any element is so you can, for example, convert elements with type xsd:int to actual Java ints, are still a matter for research and experiment.

Schematron

Rick Jelliffe’s Schematron is a radically different approach to an XML schema language. Whereas other languages are conservative (everything not permitted is forbidden), Schematron is liberal (everything not forbidden is permitted). Furthermore, Schematron is based on XPath, so it can check co-occurrence constraints between elements and attributes; for example, that the content of the total price element must be equal to the sum of the content of the subtotal, tax, and shipping elements. Finally, Schematron can be implemented as an XSLT stylesheet rather than requiring special software.

Example 1.10 shows a Schematron schema for order documents. To keep the example smaller, I did not test absolutely everything I could. Instead, I took advantage of Schematron’s liberality to test only those conditions that neither DTDs nor schemas can validate; for instance, that the total price is the sum of the subtotal, the tax, and the shipping. I haven’t necessarily lost anything by doing this. I can validate a single document against multiple different kinds of schemas. For instance, orders could first be checked against the DTD, then checked against a W3C XML Schema Language schema, and only checked against this Schematron schema if they passed the first two tests.

Example 1.10. order.sct: a Schematron schema for order documents

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>A Schematron Schema for Orders</title>
  <pattern>
    <rule context="Order">
      <!-- Due to round-off error, floating point numbers 
           should rarely be compared for direct equality. 
           For this purpose, it's enough if they're accurate 
           within one penny. -->
      <assert test="(Shipping+Subtotal+Tax - Total)&lt;0.01 
                and (Shipping+Subtotal+Tax - Total)&gt;-0.01">
        The subtotal, tax, and shipping 
        must add up to the total.
      </assert>
      
      <assert test= 
       "(Subtotal+Shipping)*((Tax/@rate) div 100.0) 
         - Tax &lt; 0.01 and (Subtotal+Shipping)*((Tax/@rate) 
         div 100.0)-Tax &gt; -0.01
      ">
        The tax was incorrectly calculated.
      </assert>
  
    </rule>
  </pattern>
</schema>

XPath is not by itself Turing complete so there are still some limits to what you can express in a Schematron schema. For instance, you can’t sum up the Quantity times the Price for each Product element and make sure that equals the Subtotal. However, Schematron is still a lot more powerful than other schema languages.
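Since the assertions are ordinary XPath expressions, the first one can be tried on its own with the JAXP XPath API, as in the sketch below. This is not how Schematron itself runs; it is only a way to experiment with the expression. The order fragment and its figures are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TotalAssertionDemo {

    // The test from the first assert element of the Schematron schema,
    // with the escaped &lt; and &gt; written as plain characters.
    static final String TEST =
          "(Shipping+Subtotal+Tax - Total) < 0.01 "
        + "and (Shipping+Subtotal+Tax - Total) > -0.01";

    // Evaluates the test with the root Order element as the context node.
    static boolean totalsAgree(String orderXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(orderXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(
                TEST, doc.getDocumentElement(), XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the assertion", e);
        }
    }

    // Builds a hypothetical order fragment with the given total.
    static String order(String total) {
        return "<Order><Subtotal>10.00</Subtotal>"
             + "<Tax rate=\"5.0\">0.50</Tax>"
             + "<Shipping>2.00</Shipping>"
             + "<Total>" + total + "</Total></Order>";
    }

    public static void main(String[] args) {
        System.out.println(totalsAgree(order("12.50"))); // 10.00 + 0.50 + 2.00
        System.out.println(totalsAgree(order("13.50"))); // off by one dollar
    }
}
```

The first order passes the assertion and the second fails it, because its total differs from the sum by more than a penny.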

Schematron is implemented in a very unusual fashion. First you run your Schematron schema through an XSLT processor using a skeleton stylesheet Jelliffe provides. This produces a new XSLT stylesheet. In essence, this compiles the Schematron schema into an XSLT stylesheet. The compiler is itself written in XSLT. You then transform all your instance documents using the compiled schema. If any of the assertions fail, the output will contain the assertion message. Otherwise it will contain just the XML declaration. For example, using Michael Kay’s SAXON XSLT processor, to validate Example 1.2 against Example 1.10:

C:\XMLJAVA>saxon order.sct skeleton1-5.xsl>order_sct.xsl
C:\XMLJAVA>saxon order.xml order_sct.xsl
<?xml version="1.0" encoding="utf-8"?>

Schematron is the idiosyncratic product of one person. It is therefore not a standard part of any major parsers, unlike DTDs and the W3C XML Schema Language. However, it’s not particularly difficult to install Jelliffe’s Schematron validation software into most systems. Since Schematron is implemented in XSLT, all you need is a good API to access an XSLT engine. I’ll take this up again in the final chapter when I discuss APIs for XSLT.

The Last Mile

Schematron is powerful, but there are still some checks it cannot perform. In particular, it cannot perform any checks that require information external to the document and the schema. For example, it cannot verify that the page at a referenced URL is reachable. It cannot verify that a file exists on the local file system. It cannot compare the SKUs, names, and prices in an order document with their values in a remote database. None of the extant schema languages allow you to state conditions like these.

Java can do all of these things. The java.net.URL class can easily test whether a URL is live. The exists() method of the java.io.File class is a simple test for whether a file is where you think it is. JDBC is a whole API for remote database access. However, unlike the more limited constraints of DTDs, the W3C XML Schema Language, or even Schematron, simply listing the conditions is not enough. To test such conditions, you have to write the code that tests them. Nobody’s done the hard work for you. There will always be some constraints you need a full-blown programming language to check. Indeed, doing exactly this will be a major focus of the remainder of this book.
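As a small taste of such hand-written checks, the sketch below uses DOM to verify a constraint that the schema languages discussed here do not express: that the Subtotal equals the sum of Quantity times Price over all Product elements. It assumes a pared-down order containing only those elements, ignores Discount, and uses invented figures. It is a sketch of the idea, not a complete validation layer.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SubtotalCheck {

    // Text of the first descendant element with the given name.
    static String text(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    // True if Subtotal equals the sum of Quantity * Price over all products.
    static boolean subtotalMatches(String orderXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(orderXml)));
            Element order = doc.getDocumentElement();
            BigDecimal sum = BigDecimal.ZERO;
            NodeList products = order.getElementsByTagName("Product");
            for (int i = 0; i < products.getLength(); i++) {
                Element product = (Element) products.item(i);
                BigDecimal quantity = new BigDecimal(text(product, "Quantity"));
                BigDecimal price = new BigDecimal(text(product, "Price"));
                sum = sum.add(quantity.multiply(price));
            }
            BigDecimal subtotal = new BigDecimal(text(order, "Subtotal"));
            // compareTo ignores differences in scale (12.0 versus 12.00).
            return sum.compareTo(subtotal) == 0;
        } catch (Exception e) {
            throw new IllegalStateException("Could not check the order", e);
        }
    }

    // Hypothetical order: 2 items at 3.50 plus 1 item at 5.00.
    static String order(String subtotal) {
        return "<Order>"
             + "<Product><Quantity>2</Quantity><Price>3.50</Price></Product>"
             + "<Product><Quantity>1</Quantity><Price>5.00</Price></Product>"
             + "<Subtotal>" + subtotal + "</Subtotal>"
             + "</Order>";
    }

    public static void main(String[] args) {
        System.out.println(subtotalMatches(order("12.00")));
        System.out.println(subtotalMatches(order("11.00")));
    }
}
```

BigDecimal is used rather than double so that money amounts compare exactly, without the one-penny tolerance a floating point comparison needs.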

One thing you can learn from the existing languages is the clean way they separate validation from processing. If you design your own validation layer, you should do that too. Perform all validation before the document is processed for its contents. If possible, separate the constraints from the code that checks them.



[7] This is really a bad choice of terminology. Nothing is being implied here. A more accurate keyword would be #OPTIONAL. However #IMPLIED is what XML gives us.

[8] It would be possible to go further and require that each money item be a positive number with two decimal digits of precision such as 9.85 but not 7.2 or -3.25, but for now I wanted to keep this example smaller.


Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified May 21, 2002