Effective XML
Elliotte Rusty Harold
elharo@metalab.unc.edu
http://www.cafeconleche.org/

Part I: Syntax
Stay with XML 1.0
XML 1.1:
New name characters
C0 control characters
C1 control characters
NEL
Undeclare namespace prefixes
Incompatible with
Most XML parsers
W3C and RELAX NG schema languages
XOM, JDOM

Part II: Structure
The XML Stack
Allow all XML syntax
CDATA sections
Entity references
Processing instructions
Comments
Numeric character references
Document type declarations
Different ways of representing the same core content; not different information

Distinguish text from markup
A DocBook element
<programlisting><![CDATA[<value>
  <double>28657</double>
</value>]]></programlisting>
The content is:
<value>
  <double>28657</double>
</value>
This is the same:
<programlisting>&lt;value&gt;
  &lt;double&gt;28657&lt;/double&gt;
&lt;/value&gt;</programlisting>
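A minimal sketch of the point above, assuming Python's standard xml.etree.ElementTree parser stands in for whatever parser you use. It parses the CDATA form and the escaped form of the listing and compares the character data the application receives.

```python
import xml.etree.ElementTree as ET

# The same listing written two ways: a CDATA section and escaped characters.
cdata_form = (
    "<programlisting><![CDATA[<value>\n"
    "  <double>28657</double>\n"
    "</value>]]></programlisting>"
)
escaped_form = (
    "<programlisting>&lt;value&gt;\n"
    "  &lt;double&gt;28657&lt;/double&gt;\n"
    "&lt;/value&gt;</programlisting>"
)

cdata_element = ET.fromstring(cdata_form)
escaped_element = ET.fromstring(escaped_form)

# Both forms yield the same text, and neither element has child elements:
# the angle brackets are content, not markup.
cdata_text = cdata_element.text
escaped_text = escaped_element.text
print(cdata_text == escaped_text)
print(len(cdata_element), len(escaped_element))
```

Code that branches on which syntax was used is depending on something the parser is not required to report.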

The reverse problem
Tools that create XML from strings:
Tree-based editors like <Oxygen/> or XML Spy
WYSIWYG applications like OpenOffice Writer
Programming APIs such as DOM, JDOM, and XOM
The tool automatically escapes reserved characters like <, >, or &.
Just because something looks like an XML tag does not mean it is an XML tag.
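The reverse problem can be sketched the same way, again assuming Python's xml.etree.ElementTree as a stand-in for DOM, JDOM, or XOM. The element name para and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Text handed to the API is text, even when it looks like tags.
para = ET.Element("para")
para.text = "Use <b> for bold & <i> for italic"

# The serializer escapes the reserved characters on output.
serialized = ET.tostring(para, encoding="unicode")
print(serialized)

# Reading it back gives the original string and no child elements.
reparsed = ET.fromstring(serialized)
print(reparsed.text)
```

If you actually want a b element in the output, build an element through the API rather than passing a string that contains "<b>".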

White space matters
Parsers report all white space in element content, including boundary white space
An xml:space attribute is for the client application only, not the parser
White space in attribute values is normalized
Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags.
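A small sketch of two of these rules, assuming Python's xml.dom.minidom (which sits on the Expat parser). The list and item element names and the note attribute are hypothetical.

```python
from xml.dom import minidom

# The attribute value contains a literal line break and a literal tab;
# the element content contains only boundary white space around <item/>.
source = '<list note="one\ntwo\tthree">\n  <item/>\n</list>'
root = minidom.parseString(source).documentElement

# Boundary white space shows up as text nodes on either side of the child.
child_names = [node.nodeName for node in root.childNodes]
print(child_names)

# White space inside the attribute value has been normalized to spaces.
note = root.getAttribute("note")
print(repr(note))
```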

Make structure explicit through markup
Bad
<Transaction>Withdrawal 2003 12 15 200.00</Transaction>
Better
<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>
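With the structure marked up, a program can pick out each field by name instead of guessing at positions in a string. A short sketch, assuming Python's standard library; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

record = """<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>"""

tx = ET.fromstring(record)

# Each field is addressed by name; no splitting on spaces is needed.
kind = tx.get("type")
when = date.fromisoformat(tx.findtext("Date"))
amount = Decimal(tx.findtext("Amount"))
print(kind, when, amount)
```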

Store metadata in attributes
Material the reader doesn’t want to see
URLs
IDs
Styles
Revision dates
Author's name
No substructure
Revision tracking
Citations
No multiple elements

Remember mixed content
Narrative documents
Record-like documents
The RSS problem
<item>
  <title>Xerlin 1.3 released</title>
  <description>
    Xerlin 1.3, an open source XML Editor written in
    Java, has been released. Users can extend the
    application via custom editor interfaces for
    specific DTDs. New features in version 1.3 include
    XML Schema support, WebDAV capabilities, and
    various user interface enhancements. Java 1.2
    or later is required.
  </description>
<link>http://www.cafeconleche.org/#news2003April7</link>
</item>

What you really want is this:
What people do is this:

Prefer URLs to unparsed entities and notations
URLs are simple and well understood
Notations and unparsed entities are confusing and little used
URLs don’t require the DTD to be read
Many APIs don’t even support notations and unparsed entities

Part III: Semantics
Use processing instructions for process-specific content
For a very particular, even local, process
Describes how a particular process acts on the data in the document
Does not describe or add to the content itself
A unit that can be treated in isolation
Content is not XML-like.
Applies to the entire document

Processing instructions are not appropriate when:
Content is closely related to the content of the document itself
Structure extends beyond a single processing instruction
Needs to be validated
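As one illustration of process-specific content, the sketch below reads an xml-stylesheet processing instruction with Python's xml.dom.minidom. The file name style.css and the doc element are assumptions made for the example.

```python
from xml.dom import minidom, Node

source = '<?xml-stylesheet type="text/css" href="style.css"?><doc>Hello</doc>'
doc = minidom.parseString(source)

# The processing instruction sits beside the root element, not inside the content.
pis = [n for n in doc.childNodes
       if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
target = pis[0].target
data = pis[0].data
print(target, data)

# The document's own content is unaffected by the instruction.
content = doc.documentElement.firstChild.data
print(content)
```

The data of a processing instruction is an opaque string: the pseudo-attributes here are a convention of the stylesheet process, not something the parser understands or validates.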

Include all information in instance documents
Not all parsers read the DTD
Especially browsers
Beware
Default attribute values
Parsed entity references
XInclude
ID type dependence (XPath, DOM, etc.)
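A sketch of the risk with default attribute values, assuming Python's xml.etree.ElementTree, which does not fetch an external DTD. The file length.dtd is hypothetical and is imagined to declare a default of units="cm"; it is never read.

```python
import xml.etree.ElementTree as ET

# Relies on a default value that would live in the (unread) external DTD.
fragile = '<!DOCTYPE length SYSTEM "length.dtd"><length>12</length>'

# States the value in the instance document itself.
robust = '<length units="cm">12</length>'

fragile_units = ET.fromstring(fragile).get("units")
robust_units = ET.fromstring(robust).get("units")
print(fragile_units, robust_units)
```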

Encode binary data using quoted printable and/or Base64
Quoted printable works well for mostly text
Base64 for non-text data
Can you link to the data with a URL instead?
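A minimal Base64 round trip, assuming Python's base64 module. The data element and its encoding attribute are invented for this sketch and are not part of any standard vocabulary.

```python
import base64
import xml.etree.ElementTree as ET

# Every possible byte value, including ones that are illegal in XML text.
raw = bytes(range(256))

# Encode to plain ASCII before placing the data in element content.
elem = ET.Element("data", encoding="base64")
elem.text = base64.b64encode(raw).decode("ascii")
document = ET.tostring(elem, encoding="unicode")

# The receiver parses the document and decodes the text back to bytes.
decoded = base64.b64decode(ET.fromstring(document).text)
print(decoded == raw)
```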

Use namespaces for modularity and extensibility
Not hard; simple cases can use one default namespace
http URIs are normally preferred
DTD validation is tricky
Code to namespace URIs, not prefixes
Avoid namespace prefixes in element content and attribute values
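"Code to namespace URIs, not prefixes" can be sketched as follows, assuming Python's xml.etree.ElementTree and using the SVG namespace purely as a familiar example. The helper count_rects is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Two documents with the same meaning: one uses a prefix, one the default namespace.
prefixed = '<svg:svg xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:svg>'
unprefixed = '<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>'

def count_rects(xml_text):
    """Count rect children by namespace URI and local name, ignoring prefixes."""
    root = ET.fromstring(xml_text)
    return len(root.findall("{%s}rect" % SVG_NS))

print(count_rects(prefixed), count_rects(unprefixed))
```

Code that searched for the literal string "svg:rect" would miss the second document even though it is equivalent.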

Reuse XHTML for generic narrative content
Choose the right schema language for the job
DTDs
The W3C XML Schema Language
RELAX NG
Schematron

Use only what you need
You need
Well-formed XML 1.0
A parser
You probably need:
Namespaces
You may not need:
DTDs
Schemas
XInclude
SOAP
WS-Kitchen-Sink
etc.

Always use a parser
Can’t use regular expressions:
Detecting encoding
Comments and processing instructions that contain tags
CDATA sections
Unexpected placement of spaces and line breaks within tags
Default attribute values
Character and entity references
Malformed documents
Internal DTD Subset
Why not?
Unfamiliarity with parsers
Too slow
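A small demonstration of why regular expressions go wrong, assuming Python's re module and xml.etree.ElementTree. The items document is contrived to trigger three of the cases listed above: a comment that contains tags, a space inside a tag, and a CDATA section.

```python
import re
import xml.etree.ElementTree as ET

document = (
    "<items>"
    "<!-- <item>old</item> -->"
    "<item >new &amp; improved</item>"
    "<![CDATA[<item>not markup</item>]]>"
    "</items>"
)

# The regular expression finds tags inside the comment and the CDATA section,
# and misses the one real element because of the space before the ">".
regex_hits = re.findall(r"<item>(.*?)</item>", document)
print(regex_hits)

# The parser reports only the real element, with the entity reference resolved.
parser_hits = [e.text for e in ET.fromstring(document).iter("item")]
print(parser_hits)
```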

Layer Functionality
Program to standard APIs
Easier to deploy in Java 1.4/1.5
Different implementations have different performance characteristics
SAX is fast
DOM interoperates
Semi-standard:
JDOM
XOM
Bleeding edge
StAX
JAXB
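The slide is about Java, but the same idea carries over to other languages. As a rough sketch, the handler below is written against Python's port of the SAX API (xml.sax) and never names the underlying parser implementation. ElementCounter is a made-up class for illustration.

```python
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags by element name using only standard SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

handler = ElementCounter()
# The library picks the parser; the handler only depends on the SAX interface.
xml.sax.parseString(b"<a><b/><b/><c/></a>", handler)
print(handler.counts)
```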

Read the complete DTD
Be conservative in what you generate; liberal in what you accept
Important content from DTD:
Default attribute values
Namespace declarations
Entity references
ID types
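A sketch of content that arrives only through the DTD, assuming Python's xml.etree.ElementTree, which reads the internal DTD subset. The memo element, status attribute, and company entity are hypothetical.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
  <!ENTITY company "Example Corp">
]>
<memo>From &company;</memo>"""

memo = ET.fromstring(source)

# Neither value appears in the element itself; both come from the DTD.
status = memo.get("status")
text = memo.text
print(status, text)
```

A consumer that skipped the DTD would see no status attribute and could not resolve the entity reference at all.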

Navigate with XPath
More robust against unexpected structure
Allow optimization by engine
Easier to code; enhanced programmer productivity
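A sketch of XPath-style navigation, assuming Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The Ledger and Batch elements are invented here; the point is that the expression still works when transactions appear at an unexpected depth.

```python
import xml.etree.ElementTree as ET

ledger = ET.fromstring("""<Ledger>
  <Transaction type="withdrawal"><Date>2003-12-15</Date><Amount>200.00</Amount></Transaction>
  <Batch>
    <Transaction type="deposit"><Amount>75.50</Amount></Transaction>
    <Transaction type="withdrawal"><Amount>20.00</Amount></Transaction>
  </Batch>
</Ledger>""")

# One expression states what is wanted; no hand-written tree walking,
# and no assumption about where in the tree a Transaction sits.
path = ".//Transaction[@type='withdrawal']/Amount"
withdrawals = [amount.text for amount in ledger.findall(path)]
print(withdrawals)
```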

Validate inside your program with schemas

Part IV: Implementation
Write documents in Unicode
Prefer UTF-8
Smaller in English
ASCII compatible
Normalization
É, ü, ì and so forth
NFC
ICU
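A brief sketch of both points, using Python's built-in unicodedata module in place of ICU. The sample strings are arbitrary.

```python
import unicodedata

# "E" followed by a combining acute accent, versus the single precomposed "É".
decomposed = "E\u0301tude"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# For ASCII-range English text, UTF-8 uses one byte per character
# where UTF-16 uses two.
english = "Effective XML"
utf8_size = len(english.encode("utf-8"))
utf16_size = len(english.encode("utf-16-be"))
print(utf8_size, utf16_size)
```

Without normalization, the two spellings of "É" compare as different strings even though they render identically.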

Avoid vendor lock-in; beware
Opaque, binary data used in place of marked up text.
Over-abbreviated, nonobvious names like F17354 and grgyt
APIs that hide the XML
Products that focus on the "Infoset"
Alternate serializations of XML
Patented formats

Hang on to your relational database

Document Namespaces with RDDL

Pick the correct MIME type
application/xml
Not text/xml!
Don't use charset
application/mathml+xml
image/svg+xml
application/xslt+xml

TagSoup Your HTML

Catalog common resources
<?xml version="1.0"?>
<catalog xmlns=
  "urn:oasis:names:tc:entity:xmlns:xml:catalog"
>
  <public publicId=
     "-//OASIS//DTD DocBook XML V4.2//EN"
          uri=
   "file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>
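To show what a catalog buys you, here is a toy lookup written with Python's xml.etree.ElementTree. It handles only public entries and is not a full OASIS catalog resolver; load_public_map is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

catalog_xml = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>"""

def load_public_map(text):
    """Map each public identifier in the catalog to its local URI."""
    root = ET.fromstring(text)
    entries = root.iter("{%s}public" % CATALOG_NS)
    return {entry.get("publicId"): entry.get("uri") for entry in entries}

public_map = load_public_map(catalog_xml)
local_uri = public_map.get("-//OASIS//DTD DocBook XML V4.2//EN")
print(local_uri)
```

A resolver built on this idea serves the local copy of the DTD instead of fetching it over the network each time a document is parsed.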

Compress if space is a problem
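A quick sketch of how well repetitive markup compresses, assuming Python's gzip module. The sample document is synthetic, so the exact ratio will differ for real data.

```python
import gzip

# A synthetic document with heavily repeated markup.
body = "<item>Xerlin 1.3 released</item>" * 200
xml_bytes = ("<items>" + body + "</items>").encode("utf-8")

# Compress for storage or transfer; decompress before handing to the parser.
compressed = gzip.compress(xml_bytes)
restored = gzip.decompress(compressed)
print(len(xml_bytes), len(compressed))
```

The document stays plain XML; compression is a separate layer that can be added or removed without touching the markup.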
To Learn More
This Presentation: http://cafeconleche.org/slides/lxny/effectivexml
Effective XML: 50 Specific Ways to Improve Your XML Documents
Elliotte Rusty Harold
Addison-Wesley, 2003
ISBN 0-321-15040-6
$44.99
http://cafeconleche.org/books/effectivexml