Before we can explore the available APIs for processing XML documents with Java, we’re going to need a few good examples. For most of this book, my examples are going to focus on XML protocols. These are XML applications used for machine-to-machine exchange of information across the Internet over HTTP. In this chapter I’ll show you how such documents move from one machine to another, and how you can use Java to interpose yourself in the process. However, since this is not a book about network programming, I’m going to be careful to keep all the details of network transport separate from the generation and processing of XML documents. When you work with an XML document, you don’t care whether it came from a file, a network socket, a string, or something else.
Three such XML protocol applications are of particular interest. The first is a very straightforward application called RSS. RSS is used to exchange headlines and abstracts between different Web news sites. It is available in two versions: RSS 0.9.1, which is based on an early working draft of the Resource Description Framework (RDF), and RSS 1.0, which is based on the final W3C recommendation of RDF. Both variants are used on the Web today.
The second XML application I’ll investigate in some detail is XML-RPC. XML-RPC supports remote procedure calls across the Internet by passing method names and arguments embedded in an XML document over HTTP. The third example application is a more complex implementation of this idea called SOAP. Whereas XML-RPC uses only elements, SOAP adds attributes and namespaces as well. SOAP even lets the body of the message be an XML element from some other vocabulary, so it opens up a host of other interesting examples.
One of the major uses of XML is for exchanging data between heterogeneous systems. Given almost any collection of data, it’s straightforward to design some XML markup that fits it. Since XML is natively supported on essentially any platform of interest, you can send data encoded in such an XML application from point A to point B without worrying about whether point A and point B agree on how many bytes there are in a float, whether ints are big endian or little endian, whether strings are null delimited or use an initial length byte, or any of the myriad other issues that arise when moving data between systems. As long as both ends of the connection agree on the XML application used, they can exchange information without worrying about what software produced the data. One side can use Perl and the other Java. One can use Windows and the other Unix. One can run on a mainframe and the other on a Mac. The document can be passed over HTTP, e-mail, NFS, BEEP, Jabber, or sneakernet. Everything except the XML document itself can be ignored.
The details of the XML markup used depend heavily on the information you’re exchanging. If you’re exchanging financial data, you might use the Open Financial Exchange (OFX). If you’re exchanging genetic codes, you might use the Gene Expression Markup Language (GEML). If you’re exchanging news articles in a syndication service, you might use NewsML. And if no standard XML application exists that fits your needs, you’ll probably invent your own; but whatever XML application you choose, there are certain features that will crop up again and again and that can benefit from standardization. These include the envelope used to pass the data and the representations of basic data types like integer and date.
When only two systems are involved, they only talk to each other, and they always send the same type of message, an envelope may not be needed. It’s enough for one system to send the other the message in the agreed upon XML format. However, when it’s actually many dozens, hundreds, or even thousands of different systems exchanging many different kinds of messages in many different ways, it’s useful to have some standards that are independent of the content of the message. This offers up some hope that when a message in an unrecognized format is received, it can still be processed in a reasonable fashion. For example, a system might receive a message ordering one thousand “Frodo Lives” buttons but not know how to handle that order. However, it may be able to read enough information from the envelope to route the request to the program that does know how to process the order.
In XML-RPC, essentially all the markup is the envelope and all the text content is the data inside the envelope. SOAP and RSS are a little more complex. For SOAP, the envelope is an XML document, and the data is too. In some ways RSS, especially RSS 1.0, is the most complex of all because it’s based on the relatively complex RDF syntax. RDF mixes the envelope and the data together so that you can’t point to any one element in the whole document and say “That’s the envelope,” or “That element is the data.” Instead, pieces of both the envelope and the data are intermingled throughout the complete document. In all three cases, however, it’s straightforward to extract the data from the envelope for further processing.
Another area that’s ripe for standardization is the proper representation of low-level data such as dates and numbers. Nobody really cares how many bytes there are in an int as long as there are enough to hold all the values they want to hold. Nobody really cares whether dates are written Day-Month-Year or Month-Day-Year as long as it’s easy to tell which is which. It doesn’t really matter how this information is passed, as long as there’s one standard way of doing it that everyone can agree on and process without excessive hassle.
In XML all data of any type must be passed as text. The proper textual representation of simple data types such as integer and date is trickier than most developers initially assume. For example, integers can be straightforwardly represented in the form 42, -76, +34562, 0, and so forth. The normal base-10 representation with optional plus or minus signs is fully adequate for most needs. However, consider the number 28562476535, the dollar value of Bill Gates’s Microsoft stock holdings alone as of July 24, 2002. This is a perfectly good integer, albeit a large one. However, it exceeds 2147483647, the largest value a signed 32-bit integer can hold, so trying to use it in many programs will lead to a crash or some other form of error.
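To make the size problem concrete, here is a minimal Java sketch. The class and method names (LargeIntegerDemo, fitsInInt, fitsInLong) are my own inventions for illustration. It only shows that the same text is rejected by a 32-bit parse but accepted by a 64-bit or arbitrary-precision one.

```java
import java.math.BigInteger;

final class LargeIntegerDemo {

    // Returns true if the decimal text fits in a 32-bit Java int.
    static boolean fitsInInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException ex) {
            return false; // out of range, or not an integer at all
        }
    }

    // Returns true if the decimal text fits in a 64-bit Java long.
    static boolean fitsInLong(String text) {
        try {
            Long.parseLong(text);
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        String holdings = "28562476535";
        System.out.println("fits in int:  " + fitsInInt(holdings));
        System.out.println("fits in long: " + fitsInLong(holdings));
        // BigInteger has no fixed upper bound, so it accepts any integer text.
        System.out.println("as BigInteger: " + new BigInteger(holdings));
    }
}
```

A receiver that does not know in advance how large a value can be is safest reading it into a BigInteger and narrowing only after a range check.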
Floating point numbers are even worse. Two different computers can look at an unambiguous string such as 65431987467.324345192 and interpret it as two different numbers. Dates cause problems even for humans. Is 07/04/01 the fourth of July, 2001? The fourth of July, 1901? The seventh of April, 2001? Some other date? These are all very real issues that cause real problems in systems today.
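The following Java sketch illustrates both ambiguities under a few assumptions of my own. The class name AmbiguousTextDemo and the helper parseDate are hypothetical, and the two date patterns are just two of the many conventions a sender might have had in mind. The same decimal string yields different values as a float and as a double, and the same date string yields different days depending on the assumed field order.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class AmbiguousTextDemo {

    // Parses a date string using an explicitly chosen field order.
    static LocalDate parseDate(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        String number = "65431987467.324345192";

        float asFloat = Float.parseFloat(number);     // nearest 4-byte IEEE-754 value
        double asDouble = Double.parseDouble(number); // nearest 8-byte IEEE-754 value
        BigDecimal exact = new BigDecimal(number);    // keeps every digit of the text

        // Printing through BigDecimal shows the exact value each type stores.
        System.out.println("float:  " + new BigDecimal(asFloat));
        System.out.println("double: " + new BigDecimal(asDouble));
        System.out.println("exact:  " + exact.toPlainString());

        String date = "07/04/01";
        // Month first, as is common in the United States.
        System.out.println("MM/dd/yy: " + parseDate(date, "MM/dd/yy"));
        // Day first, as is common in much of Europe.
        System.out.println("dd/MM/yy: " + parseDate(date, "dd/MM/yy"));
    }
}
```

Note that java.time resolves a two-digit year relative to the year 2000, so neither pattern can produce 1901. That is yet another convention the sender and receiver would have to agree on.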
XML itself doesn’t standardize the text representation of data, but the W3C XML Schema Language does. In particular, schemas define the 44 simple data types shown in Table 2.1. By assigning these types to particular elements, you can clearly state what a particular string means in a syntax everyone can understand. And if these types aren’t enough, the W3C XML Schema Language also lets you define new types that are combinations or restrictions of these basic types.
Table 2.1. Simple Data Types Defined in the W3C XML Schema Language
|Type||Meaning|
|xsd:string||The schema equivalent of #PCDATA, any string of Unicode characters that may appear in an XML document|
|xsd:boolean||true, false, 1, 0|
|xsd:decimal||A decimal number such as 44.145629 or -0.32, with an arbitrary size and precision; similar to the java.math.BigDecimal class|
|xsd:float||The 4-byte IEEE-754 floating point number which best approximates the specified decimal string, same as Java’s float type|
|xsd:double||The 8-byte IEEE-754 floating point number which best approximates the specified decimal string, same as Java’s double type|
|xsd:integer||An integer of arbitrary size, similar to the java.math.BigInteger class|
|xsd:nonPositiveInteger||An integer less than or equal to zero|
|xsd:negativeInteger||An integer strictly less than zero|
|xsd:nonNegativeInteger||An integer greater than or equal to zero|
|xsd:long||An integer between -9223372036854775808 and +9223372036854775807 inclusive; equivalent to Java’s long primitive data type|
|xsd:int||An integer between -2147483648 and 2147483647 inclusive; equivalent to Java’s int primitive data type|
|xsd:short||An integer between -32768 and 32767 inclusive; equivalent to Java’s short primitive data type|
|xsd:byte||An integer between -128 and 127 inclusive; equivalent to Java’s byte primitive data type|
|xsd:unsignedLong||An integer between 0 and 18446744073709551615.|
|xsd:unsignedInt||An integer between 0 and 4294967295|
|xsd:unsignedShort||An integer between 0 and 65535|
|xsd:unsignedByte||An integer between 0 and 255|
|xsd:positiveInteger||An integer strictly greater than zero|
|xsd:duration||A length of time given in the ISO 8601 extended format: PnYnMnDTnHnMnS. The number of seconds can be a decimal or an integer. All the other values must be non-negative integers. For example, P1Y2M3DT4H5M6.7S is one year, two months, three days, four hours, five minutes, and 6.7 seconds.|
|xsd:dateTime||A particular moment of time on a particular day up to an arbitrary fraction of a second in the ISO 8601 format: CCYY-MM-DDThh:mm:ss. This can be suffixed with a Z to indicate coordinated universal time (UTC) or an offset from UTC. For example, Neil Armstrong set foot on the moon at 1969-07-20T21:56:00-05:00 by the clock in Houston mission control, which is also known as 1969-07-21T02:56:00Z.|
|xsd:time||A certain time of day on no particular day in the ISO 8601 format: hh:mm:ss.sss. A time zone specified as an offset from UTC is optional. For example, on most days I wake up around 07:00:00.000-05:00 and go to bed around 23:30:00.000-05:00.|
|xsd:date||A particular date in history given in ISO 8601 format: YYYY-MM-DD; e.g. 2001-07-06 or 1969-09-20.|
|xsd:gYearMonth||A certain month in a certain year; e.g. 2001-12 or 1999-03.|
|xsd:gYear||A year in the Gregorian calendar ranging from 0001 to 2001 to 9999, 10000, 10001 and beyond. Earlier dates can be represented as ‑0001, ‑0002, ‑0003, and so forth back to the Big Bang. There is no year zero, however.|
|xsd:gMonthDay||A specific day of a specific month in no particular year in the form ‑‑02-28. For example, Christmas comes on ‑‑12-25.|
|xsd:gDay||A particular day of no particular month in the form ‑‑‑01, ‑‑‑02, ‑‑‑03, through ‑‑‑31|
|xsd:gMonth||A particular month in no particular year in the form ‑‑01‑‑, ‑‑02‑‑, ‑‑03‑‑, through ‑‑12‑‑|
|xsd:hexBinary||Hexadecimal encoded binary data; each byte of the data is replaced by the two hexadecimal digits that represent its unsigned value|
|xsd:base64Binary||Base-64 encoded binary data|
|xsd:anyURI||An absolute or relative URL or a URN|
|xsd:QName||An optionally prefixed XML name such as SOAP-ENV:Body or Body. Unprefixed names must be in the default namespace.|
|xsd:NOTATION||The name of a notation declared in the current schema|
|xsd:normalizedString||A string in which carriage returns (\r), line feeds (\n) and tab (\t) characters should be treated the same as spaces|
|xsd:token||A string in which all runs of white space should be treated the same as a single space|
|xsd:language||An RFC 1766 language identifier such as en, fr-CA, or i-klingon|
|xsd:NMTOKEN||An XML name token|
|xsd:NMTOKENS||A white space separated list of XML name tokens|
|xsd:Name||An XML name|
|xsd:NCName||An XML name that does not contain any colons; that is, an unprefixed name|
|xsd:ID||An NCName which is unique among other things of ID type in the same document|
|xsd:IDREF||An NCName used as an ID somewhere in the document|
|xsd:IDREFS||A whitespace separated list of IDREFs|
|xsd:ENTITY||An NCName that has been declared as an unparsed entity in the document’s DTD|
|xsd:ENTITIES||A white space separated list of ENTITY names|
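Several of the lexical forms in Table 2.1 can be read directly with classes in the standard Java library. The sketch below is only an illustration: the class SchemaTypeDemo and its method names are hypothetical, and the sample values are my own. It uses java.time.OffsetDateTime for xsd:dateTime values that carry a Z or an offset, javax.xml.datatype.Duration for xsd:duration, and simple encoders for xsd:hexBinary and xsd:base64Binary.

```java
import java.time.OffsetDateTime;
import java.util.Base64;
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.Duration;

final class SchemaTypeDemo {

    // True if two xsd:dateTime strings (each with Z or an offset) name the same instant.
    static boolean sameInstant(String first, String second) {
        return OffsetDateTime.parse(first).toInstant()
                .equals(OffsetDateTime.parse(second).toInstant());
    }

    // Parses an xsd:duration such as P1Y2M3DT4H5M6.7S.
    static Duration parseDuration(String text) {
        try {
            return DatatypeFactory.newInstance().newDuration(text);
        } catch (DatatypeConfigurationException ex) {
            throw new IllegalStateException("No DatatypeFactory available", ex);
        }
    }

    // Encodes bytes as xsd:hexBinary: two hexadecimal digits per byte.
    static String toHexBinary(byte[] data) {
        StringBuilder result = new StringBuilder(data.length * 2);
        for (byte b : data) {
            result.append(String.format("%02X", b & 0xFF)); // mask to the unsigned value
        }
        return result.toString();
    }

    // Encodes bytes as xsd:base64Binary.
    static String toBase64Binary(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        System.out.println(sameInstant("2002-07-24T09:30:00-05:00", "2002-07-24T14:30:00Z"));
        Duration d = parseDuration("P1Y2M3DT4H5M6.7S");
        System.out.println(d.getYears() + " year, " + d.getMonths() + " months");
        System.out.println(toHexBinary(new byte[] {(byte) 0xCA, (byte) 0xFE}));
        System.out.println(toBase64Binary(new byte[] {'M', 'a', 'n'}));
    }
}
```

OffsetDateTime.parse rejects an xsd:dateTime that omits the time zone, so a more complete reader would need a fallback for that case.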
Even without using schema validation or the full schema apparatus, you can use these types in your own documents. Simply attach an xsi:type attribute to any element identifying the type of that element’s content. The xsi prefix is mapped to the http://www.w3.org/2001/XMLSchema-instance namespace URI. Example 2.1 shows an XML document that uses these types to label different parts of an order document. Notice that some things that might naively be assumed to be numeric types are in fact strings.
Example 2.1. An XML document that labels elements with schema simple types
<?xml version="1.0" encoding="ISO-8859-1"?>
<Order xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Customer id="c32" xsi:type="xsd:string">Chez Fred</Customer>
  <Product>
    <Name xsi:type="xsd:string">Birdsong Clock</Name>
    <SKU xsi:type="xsd:string">244</SKU>
    <Quantity xsi:type="xsd:positiveInteger">12</Quantity>
    <Price currency="USD" xsi:type="xsd:decimal">21.95</Price>
    <ShipTo>
      <Street xsi:type="xsd:string">135 Airline Highway</Street>
      <City xsi:type="xsd:string">Narragansett</City>
      <State xsi:type="xsd:NMTOKEN">RI</State>
      <Zip xsi:type="xsd:string">02882</Zip>
    </ShipTo>
  </Product>
  <Product>
    <Name xsi:type="xsd:string">Brass Ship's Bell</Name>
    <SKU xsi:type="xsd:string">258</SKU>
    <Quantity xsi:type="xsd:positiveInteger">1</Quantity>
    <Price currency="USD" xsi:type="xsd:decimal">144.95</Price>
    <Discount xsi:type="xsd:decimal">.10</Discount>
    <ShipTo>
      <GiftRecipient xsi:type="xsd:string">
        Samuel Johnson
      </GiftRecipient>
      <Street xsi:type="xsd:string">271 Old Homestead Way</Street>
      <City xsi:type="xsd:string">Woonsocket</City>
      <State xsi:type="xsd:NMTOKEN">RI</State>
      <Zip xsi:type="xsd:string">02895</Zip>
    </ShipTo>
    <GiftMessage xsi:type="xsd:string">
      Happy Father's Day to a great Dad!
      Love,
      Sam and Beatrice
    </GiftMessage>
  </Product>
  <Subtotal currency='USD' xsi:type="xsd:decimal">
    393.85
  </Subtotal>
  <Tax rate="7.0" currency='USD' xsi:type="xsd:decimal">28.20</Tax>
  <Shipping method="USPS" currency='USD' xsi:type="xsd:decimal">8.95</Shipping>
  <Total currency='USD' xsi:type="xsd:decimal">431.00</Total>
</Order>
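A program that receives a document like Example 2.1 can read the xsi:type labels without any schema processing. The following is a minimal sketch using the standard DOM API. The class XsiTypeDemo, its methods, and the abbreviated SAMPLE document are assumptions of mine for illustration, not part of any library. It looks up the xsi:type attribute by namespace URI, resolves the prefix of its value, and reports the local name of the schema type.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class XsiTypeDemo {

    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";
    static final String XSD = "http://www.w3.org/2001/XMLSchema";

    // A cut-down version of the order document, kept short for the example.
    static final String SAMPLE =
        "<Order xmlns:xsd='" + XSD + "' xmlns:xsi='" + XSI + "'>"
      + "<Quantity xsi:type='xsd:positiveInteger'>12</Quantity>"
      + "<Price currency='USD' xsi:type='xsd:decimal'>21.95</Price>"
      + "<Zip xsi:type='xsd:string'>02882</Zip>"
      + "</Order>";

    // Parses the text with namespaces enabled and returns the first element with that name.
    static Element firstElement(String xml, String name) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getAttributeNS to work
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName(name).item(0);
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse document", ex);
        }
    }

    // Returns the local name of the element's schema type, or null if it has
    // no xsi:type attribute or the type is not in the XML Schema namespace.
    static String schemaTypeOf(Element element) {
        String qname = element.getAttributeNS(XSI, "type");
        if (qname.isEmpty()) return null;
        int colon = qname.indexOf(':');
        String prefix = colon < 0 ? null : qname.substring(0, colon);
        String localName = qname.substring(colon + 1);
        // The prefix is only shorthand; what matters is the URI it maps to here.
        return XSD.equals(element.lookupNamespaceURI(prefix)) ? localName : null;
    }

    public static void main(String[] args) {
        Element price = firstElement(SAMPLE, "Price");
        if ("decimal".equals(schemaTypeOf(price))) {
            BigDecimal value = new BigDecimal(price.getTextContent().trim());
            System.out.println("Price as BigDecimal: " + value);
        }
        // The Zip code is labeled xsd:string, so its leading zero is preserved.
        System.out.println("Zip type: " + schemaTypeOf(firstElement(SAMPLE, "Zip")));
    }
}
```

Comparing the namespace URI rather than the literal prefix matters because another document could legally bind the schema namespace to a prefix other than xsd.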
As well as explicit labeling, a document can use a schema to indicate the type. However, right now the APIs for such things aren’t finished so it’s best to explicitly label elements when the types are important.
XML-RPC only uses the int, boolean, double, dateTime, and base64 types as well as a string type that’s restricted to ASCII. It also does not allow the NaN, Inf, and -Inf values for double. It does not use xsi:type attributes, relying instead on predefined semantics for particular elements. SOAP allows all 44 types and does use xsi:type attributes to label elements.
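The XML-RPC restrictions just described are easy to enforce before a value is ever written into a message. Here is a rough sketch, not a complete XML-RPC encoder. The class XmlRpcValueDemo and its methods are hypothetical names of my own, and writing the number in plain decimal notation without an exponent is a conservative choice on my part.

```java
import java.math.BigDecimal;

final class XmlRpcValueDemo {

    // Wraps a double in XML-RPC markup, refusing the values XML-RPC cannot carry.
    static String doubleValue(double number) {
        if (Double.isNaN(number) || Double.isInfinite(number)) {
            throw new IllegalArgumentException("XML-RPC cannot represent " + number);
        }
        // toPlainString avoids exponent notation such as 1.0E10.
        String text = BigDecimal.valueOf(number).toPlainString();
        return "<value><double>" + text + "</double></value>";
    }

    // True if every character in the text is in the 7-bit ASCII range.
    static boolean isAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(doubleValue(21.95));
        System.out.println(isAscii("Chez Fred"));
        System.out.println(isAscii("Caf\u00e9 con Leche")); // contains a non-ASCII e-acute
    }
}
```

The type here is carried by the element name double rather than by an xsi:type attribute, which is the difference from SOAP noted above.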
|Copyright 2001, 2002 Elliotte Rusty Harold, email@example.com||Last Modified September 08, 2002|