Cutting Edge XML Programming


Elliotte Rusty Harold

XMLOne Europe

Wednesday, September 19, 2001

elharo@metalab.unc.edu

http://www.ibiblio.org/xml/


Outline


Part I: SAX 2.1

Actually, SAX2 has ** MUCH ** better infoset support than DOM does. Yes, I've done the detailed analysis.

--David Brownell on the xml-dev mailing list


Goals


Specified vs. Defaulted Attributes


standalone declaration

<?xml version="1.0" standalone="yes"?>


The version and encoding properties

<?xml version="1.0" encoding="UTF-16"?>


Feature/Property discovery


DefaultHandler infoset extensions


Parser identification


A Verifier Class as in JDOM

package org.jdom;

public final class Verifier {

    public static final String checkElementName(String name) {}
    public static final String checkAttributeName(String name) {}
    public static final String checkCharacterData(String text) {}
    public static final String checkNamespacePrefix(String prefix) {}
    public static final String checkNamespaceURI(String uri) {}
    public static final String checkProcessingInstructionTarget(String target) {}
    public static final String checkCommentData(String data) {}
 
    public static boolean isXMLCharacter(char c) {}
    public static boolean isXMLNameCharacter(char c) {}
    public static boolean isXMLNameStartCharacter(char c) {}
    public static boolean isXMLLetterOrDigit(char c) {}
    public static boolean isXMLLetter(char c) {}
    public static boolean isXMLCombiningChar(char c) {}
    public static boolean isXMLExtender(char c) {}
    public static boolean isXMLDigit(char c) {}

}
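
As a rough illustration of the kind of check such a class performs, the sketch below validates a name against an ASCII-only simplification of the XML 1.0 Name production. The class name AsciiNameCheck and its messages are made up for this example; it is not JDOM's implementation, which would also need the full Unicode ranges. It assumes, as the String return types above suggest, that a check method returns null for an acceptable name and a reason string otherwise. Colons are rejected on the assumption that prefixes are checked separately.

```java
final class AsciiNameCheck {

  // ASCII-only approximation of an XML 1.0 name start character.
  static boolean isNameStart(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
  }

  // ASCII-only approximation of an XML 1.0 name character.
  static boolean isNameChar(char c) {
    return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
  }

  // Returns null if the name is acceptable, otherwise a reason string.
  static String checkName(String name) {
    if (name == null || name.length() == 0) {
      return "Names cannot be null or empty";
    }
    if (!isNameStart(name.charAt(0))) {
      return "Names cannot begin with '" + name.charAt(0) + "'";
    }
    for (int i = 1; i < name.length(); i++) {
      if (!isNameChar(name.charAt(i))) {
        return "Names cannot contain '" + name.charAt(i) + "'";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(checkName("title"));  // acceptable name
    System.out.println(checkName("1title")); // starts with a digit
  }
}
```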

To Learn More


Part II: DOM Level 3


DOM Evolution


New Features in DOM Level 3


DOM Level 3 Core Additions


DOMKey


New methods in Node


New methods in Entity


New methods in Document


New methods in Text


Bootstrapping


DOM3 Bootstrapping


Load and Save


The DOM Process

  1. Library specific code creates a parser

  2. The parser parses the document and returns a DOM org.w3c.dom.Document object.

  3. The entire document is stored in memory.

  4. DOM methods and interfaces are used to extract data from this object
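
The four steps above can be sketched in a parser-independent way with JAXP. This is only a minimal illustration: the class name DomProcessSketch, the countElements method, and the tiny inline document are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomProcessSketch {

  // Parses the given XML text and counts the elements with the given name.
  static int countElements(String xml, String elementName) {
    try {
      // 1. Library-specific code creates a parser (here, through JAXP).
      DocumentBuilder parser =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // 2 and 3. The parser reads the whole document into an in-memory Document.
      Document doc = parser.parse(new InputSource(new StringReader(xml)));
      // 4. Standard DOM methods extract data from the tree.
      NodeList matches = doc.getElementsByTagName(elementName);
      return matches.getLength();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse document", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(countElements("<bib><book/><book/></bib>", "book"));
  }
}
```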


Parsing documents with DOM2

This program uses the Xerces-specific DOMParser class. Other parsers require different, parser-specific code to create the parser.

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
   
  }

}

Parsing documents with DOM3

import org.w3c.dom.*;

public class DOM3ParserMaker {

  public static void main(String[] args) {

    DOMImplementationLS impl =
      (DOMImplementationLS) DOMImplementationFactory.getDOMImplementation();
    DOMBuilder parser = impl.getDOMBuilder();

    for (int i = 0; i < args.length; i++) {
      try {
        Document d = parser.parseURI(args[i]);
      }
      catch (DOMSystemException e) {
        System.err.println(e);
      }
      catch (DOMException e) {
        System.err.println(e);
      }

    }

  }

}

This code will not actually compile or run until some parser supports DOM3 Load and Save.
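
For comparison, here is a sketch against the names the Load and Save module ended up with when it was finalized: DOMBuilder became LSParser, DOMInputSource became LSInput, and bootstrapping goes through DOMImplementationRegistry. It assumes a JDK that bundles the org.w3c.dom.ls and org.w3c.dom.bootstrap packages; the class name LSParserSketch and the inline document are made up for the example.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

final class LSParserSketch {

  // Parses XML text into a Document using only standard DOM interfaces.
  static Document parse(String xml) {
    try {
      // Bootstrapping: ask the registry for an implementation with the LS feature.
      DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
      DOMImplementationLS impl =
          (DOMImplementationLS) registry.getDOMImplementation("LS");
      // A synchronous parser with no particular schema type.
      LSParser parser =
          impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
      LSInput input = impl.createLSInput();
      input.setStringData(xml);
      return parser.parse(input);
    } catch (Exception e) {
      throw new IllegalStateException("Could not load document", e);
    }
  }

  public static void main(String[] args) {
    Document d = parse("<bib><book/></bib>");
    System.out.println(d.getDocumentElement().getTagName());
  }
}
```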


Load and Save

DOMImplementationLS
A new DOMImplementation interface that provides the factory methods for creating the objects required for loading and saving.
DOMBuilder
A parser interface
DOMInputSource
Encapsulates information about the source of the XML to be loaded, like SAX's InputSource.
DOMEntityResolver
During loading, provides a way for applications to redirect references to external entities.
DOMBuilderFilter
Provides the ability to examine and optionally remove Element nodes as they are processed during parsing, much like SAX filters.
DOMWriter
An interface for serializing DOM documents onto a stream.
DOMCMBuilder
An interface for parsing content models (e.g. DTDs and schemas) and building the corresponding CMModel tree.
DOMCMWriter
An interface for serializing content models
DocumentLS
A "mechanism by which the content of a document can be replaced with the DOM tree produced when loading a URL, or parsing a string."
ParserErrorEvent
Some sort of error detected in the input document (well-formedness? validity?)
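
The serializing half of the module can be sketched the same way. In the finalized module the role described for DOMWriter above is played by LSSerializer, so that is the name used here. This assumes a JDK whose built-in DOM implementation exposes the "LS" feature; the class name LSSerializerSketch and the two-element document are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

final class LSSerializerSketch {

  // Builds a small document in memory and serializes it to a string.
  static String serialize() {
    try {
      Document doc =
          DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element bib = doc.createElement("bib");
      bib.appendChild(doc.createElement("book"));
      doc.appendChild(bib);

      // Ask the document's own implementation for its Load and Save support.
      DOMImplementationLS impl =
          (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
      LSSerializer writer = impl.createLSSerializer();
      return writer.writeToString(doc);
    } catch (Exception e) {
      throw new IllegalStateException("Could not serialize document", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(serialize());
  }
}
```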

DOMImplementationLS


DOMBuilder


DOMInputSource


DOMEntityResolver


DOMWriter


DOMBuilderFilter


DOMCMBuilder


DOMCMWriter


DocumentLS


ParserErrorEvent


Grammar Access/Abstract Schemas


Abstract Schema Interfaces


Abstract Schema and AS-Editing Interfaces


The ASModel Interface


The ASExternalModel Interface


The ASNode Interface


The ASNodeList Interface


The ASNamedNodeMap Interface


The ASDataType Interface


The ASPrimitiveDataType Interface


The ASElementDeclaration Interface


The ASChildren Interface


The ASAttributeDeclaration Interface


The EntityDeclaration Interface


The ASNotationDeclaration Interface


Validation and Other Interfaces:


The Document Interface


The DocumentAS Interface


The DOMImplementationAS Interface


Schema-guided Document-Editing Interfaces:


The NodeAS Interface


The ElementAS Interface


The CharacterDataAS Interface


The DocumentTypeAS Interface


The AttributeAS Interface


DOM Error Handler Interfaces


The DOMErrorHandler Interface


The DOMLocator Interface


To Learn More


Part III: XSLT 2.0 and Beyond

In SQL, the query language is not expressed in tables and rows. In XQuery, the query language is not expressed in XML. Why is this a problem?
--Jonathan Robie on the xml-dev mailing list


XPath 2.0


XPath 2.0 Goals


XPath 2.0 Requirements


XSLT 2.0


XSLT 2.0 Goals


XSLT 2.0 Non-goals


XSLT 2.0 Requirements


Some specific improvements that are likely


Identifying 2.0 compliant stylesheets

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Top level elements -->

</xsl:stylesheet>

No result tree fragments


Multiple Output Documents


xsl:document Example

     <xsl:document method="html" encoding="ISO-8859-1" href="index.html">
       <html>
         <head>
           <title><xsl:value-of select="title"/></title>         
         </head>
         <body> 
           <h1 align="center"><xsl:value-of select="title"/></h1> 
           <ul>
             <xsl:for-each select="slide">
               <li><a href="{format-number(position(),'00')}.html"><xsl:value-of select="title"/></a></li>
             </xsl:for-each>    
           </ul>           
           
           <p><a href="{translate(title,' ', '_')}.html">Entire Presentation as Single File</a></p>
              
           <hr/>
           <div align="center">
             <a href="01.html">Start</a> | <a href="/xml/">Cafe con Leche</a>
           </div>
           <hr/>
           <font size="-1">
              Copyright 2001 
              <a href="http://www.macfaq.com/personal.html">Elliotte Rusty Harold</a><br/>       
              <a href="mailto:elharo@metalab.unc.edu">elharo@metalab.unc.edu</a><br/>
              Last Modified <xsl:apply-templates select="last_modified" mode="lm"/>
           </font>
         </body>     
       </html>     
     </xsl:document>  

xsl:script Top-level Element


xsl:script with Java

<?xml version="1.0"?>
<xsl:stylesheet version="1.1"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:date="http://www.cafeconleche.org/ns/"
>

  <xsl:template match="/">
    <xsl:value-of select="date:new()"/>
  </xsl:template>

  <xsl:script
    implements-prefix="date"
    language="java"
    src="java:java.util.Date"
  />

</xsl:stylesheet>

xsl:script with JavaScript

<?xml version="1.0"?>
<xsl:stylesheet version="1.1"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:date="http://www.cafeconleche.org/ns/date"
>

  <xsl:template match="/">
    <xsl:value-of select="date:clock()"/>
  </xsl:template>

  <xsl:script
    implements-prefix="date"
    language="javascript"><![CDATA[
    
    // Returns the current time on a 12-hour clock, e.g. 3:07:09 PM.
    // The CDATA section keeps the < operators from breaking well-formedness.
    function clock() {
      var time = new Date();
      var hours = time.getHours();
      var min = time.getMinutes();
      var sec = time.getSeconds();
      var status = "AM";
      if (hours > 11) {
        status = "PM";
      }
      if (hours > 12) {
        hours -= 12;
      }
      if (hours == 0) {
        hours = 12;
      }
      if (min < 10) {
        min = "0" + min;
      }
      if (sec < 10) {
        sec = "0" + sec;
      }
      return hours + ":" + min + ":" + sec + " " + status;
   }
   
  ]]></xsl:script>  

</xsl:stylesheet>
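
The hour arithmetic in a function like clock() is easy to get wrong, so here is the intended 12-hour conversion sketched in Java, where it can be exercised with fixed inputs instead of the current time. The class name Clock12 and the format method are illustrative only.

```java
final class Clock12 {

  // Formats a 24-hour time as h:mm:ss AM/PM.
  static String format(int hours, int min, int sec) {
    String status = hours > 11 ? "PM" : "AM";
    int h = hours % 12;   // 13..23 become 1..11
    if (h == 0) {
      h = 12;             // midnight and noon both display as 12
    }
    return String.format("%d:%02d:%02d %s", h, min, sec, status);
  }

  public static void main(String[] args) {
    System.out.println(format(0, 5, 3));
    System.out.println(format(13, 0, 9));
  }
}
```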

XQuery

Three parts:


XQuery Language


Documents to Query


Physical Representations to Query


Where is XQuery used?


The XML Model vs. the Relational Model

Relational model | XML model
A relational database contains tables | An XML database contains collections
A relational table contains records with the same schema | A collection contains XML documents with the same DTD
A relational record is an unordered list of named values | An XML document is a tree of nodes
A SQL query returns an unordered set of records | An XQuery returns an ordered node set

Query Data Types


An example document to query

Most of the examples in this talk query this bibliography document at the (relative) URL bib.xml:

<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="1992">
<title>Advanced Programming in the Unix Environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="2000">
<title>Data on the Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>39.95</price>
</book>

<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor>
<last>Gerbarg</last><first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>

</bib>

Adapted from Mary Fernandez, Jerome Simeon, and Phil Wadler: XML Query Languages: Experiences and Exemplars, 1999, as adapted in XML Query Use Cases


The XQuery FLWR


Query: List titles of all books

   FOR $t IN document("bib.xml")/bib/book/title
   RETURN
      $t 

Adapted from XML Query Use Cases


Query Result: Book Titles

  <title>TCP/IP Illustrated</title>
  <title>Advanced Programming in the Unix Environment</title>
  <title>Data on the Web</title>
  <title>The Economics of Technology and Content for Digital TV</title>
 

Adapted from XML Query Use Cases
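
For readers more at home in Java, roughly the same selection can be sketched with the JAXP XPath API, since this query is nothing more than a path expression. This assumes a JDK that includes javax.xml.xpath; the class name TitleQuery and the cut-down inline document are made up for the example, and the method returns title text rather than title elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TitleQuery {

  // Returns the text of every /bib/book/title element, in document order.
  static List<String> titles(String xml) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList nodes = (NodeList) xpath.evaluate(
          "/bib/book/title",
          new InputSource(new StringReader(xml)),
          XPathConstants.NODESET);
      List<String> result = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        result.add(nodes.item(i).getTextContent());
      }
      return result;
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate query", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<bib><book><title>TCP/IP Illustrated</title></book>"
        + "<book><title>Data on the Web</title></book></bib>";
    System.out.println(titles(xml));
  }
}
```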


XQueryX


Element Constructors

List titles of all books in a bib element. Put each title in a book element.

<bib>
  {
   FOR $t IN document("bib.xml")/bib/book/title
   RETURN
    <book>
     { $t }
    </book>
  }
</bib>

Adapted from XML Query Use Cases


Query Result: Book Titles

<bib>
  <book>
    <title>TCP/IP Illustrated</title>
  </book>
  <book>
    <title>Advanced Programming in the Unix Environment</title>
  </book>
  <book>
    <title>Data on the Web</title>
  </book>
  <book>
    <title>The Economics of Technology and Content for Digital TV</title>
  </book>
</bib>
 

Adapted from XML Query Use Cases


Query with WHERE

Adapted from XML Query Use Cases


Query Result: Titles of books published by Addison-Wesley

<bib>
    <title>TCP/IP Illustrated</title>
    <title>Advanced Programming in the Unix Environment</title>
</bib>
 

Adapted from XML Query Use Cases


Query with Booleans

Adapted from XML Query Use Cases


Query Result: books published by Addison-Wesley after 1993

<bib>
    <title>TCP/IP Illustrated</title>
</bib>
 

Adapted from XML Query Use Cases


Attribute Constructors

Adapted from XML Query Use Cases


Query Result: books published by Addison-Wesley after 1993, including their year and title.

<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
  </book>
</bib>
 

Adapted from XML Query Use Cases


Query with multiple variables

Create a list of all the title-author pairs, with each pair enclosed in a result element.

<results>
 {
   FOR $b IN document("bib.xml")/bib/book,
     $t IN $b/title,
     $a IN $b/author
   RETURN
    <result>
    { $t }
    { $a }
    </result>
  }
</results>

Adapted from XML Query Use Cases


Query Result: A list of all the title-author pairs

<results>
    <result>
         <title>TCP/IP Illustrated</title>
         <author><last>Stevens</last><first>W.</first></author>
    </result>
    <result>
         <title>Advanced Programming in the Unix Environment</title>
         <author><last>Stevens</last><first>W.</first></author>
    </result>
    <result>
         <title>Data on the Web</title>
         <author><last>Abiteboul</last><first>Serge</first></author>
    </result>
    <result>
         <title>Data on the Web</title>
         <author><last>Buneman</last><first>Peter</first></author>
    </result>
    <result>
         <title>Data on the Web</title>
         <author><last>Suciu</last><first>Dan</first></author>
    </result>
</results>
 

Adapted from XML Query Use Cases
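
A FOR clause with several variables behaves like nested loops: every binding of $b is combined with every $t and every $a inside that book, which is why the book with no author produces no result at all. A minimal Java sketch of that iteration follows; the class name TitleAuthorPairs and the in-memory array standing in for bib.xml (authors cut down to last names) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class TitleAuthorPairs {

  // Each row holds a title followed by zero or more author last names.
  static final String[][] BOOKS = {
    {"TCP/IP Illustrated", "Stevens"},
    {"Advanced Programming in the Unix Environment", "Stevens"},
    {"Data on the Web", "Abiteboul", "Buneman", "Suciu"},
    {"The Economics of Technology and Content for Digital TV"}
  };

  static List<String> pairs() {
    List<String> result = new ArrayList<>();
    for (String[] book : BOOKS) {               // FOR $b IN .../bib/book
      String title = book[0];                   // $t IN $b/title
      for (int i = 1; i < book.length; i++) {   // $a IN $b/author
        result.add(title + " / " + book[i]);    // RETURN one result per pair
      }
    }
    return result;
  }

  public static void main(String[] args) {
    for (String pair : pairs()) {
      System.out.println(pair);
    }
  }
}
```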


Nested Queries

For each book in the bibliography, list the title and authors, grouped inside a result element.

<results>
 {
   FOR $b IN document("bib.xml")/bib/book
   RETURN
    <result>
     { $b/title }
     {  
       FOR $a IN $b/author
       RETURN $a
     }
    </result>
 }
</results>

Adapted from XML Query Use Cases


Query Result: A list of the title and authors of each book in the bibliography

<?xml version="1.0"?>
<results xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xql="http://metalab.unc.edu/xql/">
  <result>
    <title>TCP/IP Illustrated</title>
    <author>
      <last>Stevens</last>
      <first>W.</first>
    </author>
  </result>
  <result>
    <title>Advanced Programming in the Unix Environment</title>
    <author>
      <last>Stevens</last>
      <first>W.</first>
    </author>
  </result>
  <result>
    <title>Data on the Web</title>
    <author>
      <last>Abiteboul</last>
      <first>Serge</first>
    </author>
    <author>
      <last>Buneman</last>
      <first>Peter</first>
    </author>
    <author>
      <last>Suciu</last>
      <first>Dan</first>
    </author>
  </result>
  <result>
    <title>The Economics of Technology and Content for Digital TV</title>
  </result>
</results> 

Adapted from XML Query Use Cases


Query with distinct

For each author in the bibliography, list the author's name and the titles of all books by that author, grouped inside a result element.

<results>
 {
   FOR $a IN distinct(document("bib.xml")//author)
   RETURN
    <result>
     { $a }
     {  FOR $b IN document("bib.xml")/bib/book[author=$a]
        RETURN $b/title
     }
    </result>
 }
</results>

Adapted from XML Query Use Cases


Query Result

<results>
  <result>
    <author><last>Stevens</last><first>W.</first></author>
    <title>TCP/IP Illustrated</title>
    <title>Advanced Programming in the Unix Environment</title>
  </result>

  <result>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <title>Data on the Web</title>
  </result>

  <result>
    <author><last>Buneman</last><first>Peter</first></author>
    <title>Data on the Web</title>
  </result>

  <result>
    <author><last>Suciu</last><first>Dan</first></author>
    <title>Data on the Web</title>
  </result>
</results>
 

Adapted from XML Query Use Cases
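
The shape of this query is a grouping: collect each distinct author once, then scan the books again for every author. A rough Java equivalent is sketched below. The class name TitlesByAuthor and the in-memory array standing in for bib.xml are assumptions, authors are reduced to last names, and keeping authors in document order is a choice made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TitlesByAuthor {

  // Each row holds a title followed by zero or more author last names.
  static final String[][] BOOKS = {
    {"TCP/IP Illustrated", "Stevens"},
    {"Advanced Programming in the Unix Environment", "Stevens"},
    {"Data on the Web", "Abiteboul", "Buneman", "Suciu"},
    {"The Economics of Technology and Content for Digital TV"}
  };

  static Map<String, List<String>> group() {
    // distinct(//author): each author once, in order of first appearance.
    Set<String> authors = new LinkedHashSet<>();
    for (String[] book : BOOKS) {
      for (int i = 1; i < book.length; i++) {
        authors.add(book[i]);
      }
    }
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String author : authors) {             // FOR $a IN distinct(...)
      List<String> titles = new ArrayList<>();
      for (String[] book : BOOKS) {             // book[author = $a]
        for (int i = 1; i < book.length; i++) {
          if (book[i].equals(author)) {
            titles.add(book[0]);
            break;
          }
        }
      }
      result.put(author, titles);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(group());
  }
}
```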


Query with sorting

List the titles and years of all books published by Addison-Wesley after 1991, in alphabetic order.

<bib>
 {
   FOR $b IN document("bib.xml")//book
    [publisher = "Addison-Wesley" AND @year > "1991"]
   RETURN
    <book>
     { $b/@year } { $b/title }
    </book> SORTBY (title)
 }
</bib>

Adapted from XML Query Use Cases


Query Result

<bib>
  <book year="1992">
    <title>Advanced Programming in the Unix Environment</title>
  </book>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
   </book>
</bib>
  

Adapted from XML Query Use Cases


Queries with functions

Adapted from XML Query Use Cases


Query Result

<result>
 <book>
  <title> Data on the Web </title>
  <author> <last> Suciu </last> <first> Dan </first> </author>
 </book>
</result>

Adapted from XML Query Use Cases


Tentative Function List

Numeric Constructors
Functions on Numeric Values
String Constructors
Equality and Comparison of Strings
Functions on String Values
Boolean Constructors
Functions on Boolean Values
Duration and Datetime Constructors
Component Extraction Functions on Datetime Values
Component Extraction Functions on Duration Values
Arithmetic Functions on Dates
Functions on TimePeriod Values
Constructors for QNames
Functions on QNames
Constructor for anyURI
NOTATION Constructor
Functions on Nodes
Constructors on Sequences
Functions on Sequences
Equals, Union, Intersection and Except
Aggregate Functions
Functions that Generate Sequences
Casting Functions
Miscellaneous casting functions

A different document about books

Sample data at "reviews.xml":

<reviews>
  <entry>
    <title>Data on the Web</title>
    <price>34.95</price>
    <review>
       A very good discussion of semi-structured database
       systems and XML.
    </review>
  </entry>
  <entry>
    <title>Advanced Programming in the Unix Environment</title>
    <price>65.95</price>
    <review>
      A clear and detailed discussion of UNIX programming.
    </review>
  </entry>
  <entry>
    <title>TCP/IP Illustrated</title>
    <price>65.95</price>
    <review>
      One of the best books on TCP/IP.
    </review>
  </entry>
</reviews>

Adapted from XML Query Use Cases


This document uses a different DTD

<!ELEMENT reviews (entry*)>
<!ELEMENT entry   (title, price, review)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT price   (#PCDATA)>
<!ELEMENT review  (#PCDATA)>

Query that joins two documents

For each book found in both bib.xml and reviews.xml, list the title of the book and its price from each source.

<books-with-prices>
 {
   FOR $b IN document("bib.xml")//book,
     $a IN document("reviews.xml")//entry
   WHERE $b/title = $a/title
   RETURN
    <book-with-prices>
     { $b/title }
       <price-amazon> { $a/price/text() } </price-amazon>
       <price-bn> { $b/price/text() } </price-bn>
    </book-with-prices>
 }
</books-with-prices>

Adapted from XML Query Use Cases


Result

<books-with-prices>
  <book-with-prices>
    <title>TCP/IP Illustrated</title>
    <price-amazon>65.95</price-amazon>
    <price-bn>65.95</price-bn>
  </book-with-prices>

  <book-with-prices>
    <title>Advanced Programming in the Unix Environment</title>
    <price-amazon>65.95</price-amazon>
    <price-bn>65.95</price-bn>
  </book-with-prices>

  <book-with-prices>
    <title>Data on the Web</title>
    <price-amazon>34.95</price-amazon>
    <price-bn>39.95</price-bn>
  </book-with-prices>
</books-with-prices>
  

Adapted from XML Query Use Cases
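
A join like this one is, conceptually, a nested loop over both documents with the WHERE clause as the match condition. The Java sketch below shows that under some simplifying assumptions: the class name PriceJoin and the two in-memory arrays standing in for bib.xml and reviews.xml are made up for the example, and each result is flattened to a single string.

```java
import java.util.ArrayList;
import java.util.List;

final class PriceJoin {

  // {title, price} rows standing in for bib.xml.
  static final String[][] BIB = {
    {"TCP/IP Illustrated", "65.95"},
    {"Advanced Programming in the Unix Environment", "65.95"},
    {"Data on the Web", "39.95"},
    {"The Economics of Technology and Content for Digital TV", "129.95"}
  };

  // {title, price} rows standing in for reviews.xml.
  static final String[][] REVIEWS = {
    {"Data on the Web", "34.95"},
    {"Advanced Programming in the Unix Environment", "65.95"},
    {"TCP/IP Illustrated", "65.95"}
  };

  static List<String> join() {
    List<String> result = new ArrayList<>();
    for (String[] b : BIB) {                    // FOR $b IN ...//book
      for (String[] a : REVIEWS) {              // $a IN ...//entry
        if (b[0].equals(a[0])) {                // WHERE $b/title = $a/title
          result.add(b[0] + ": amazon=" + a[1] + ", bn=" + b[1]);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    for (String row : join()) {
      System.out.println(row);
    }
  }
}
```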


prices.xml Query Sample Data

The next query also uses an input document named "prices.xml":

<prices>
  <book>
    <title>Advanced Programming in the Unix Environment</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title>Advanced Programming in the Unix Environment</title>
    <source>www.bn.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title>TCP/IP Illustrated</title>
    <source>www.amazon.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title>TCP/IP Illustrated</title>
    <source>www.bn.com</source>
    <price>65.95</price>
  </book>
  <book>
    <title>Data on the Web</title>
    <source>www.amazon.com</source>
    <price>34.95</price>
  </book>
  <book>
    <title>Data on the Web</title>
    <source>www.bn.com</source>
    <price>39.95</price>
  </book>
</prices>


Adapted from XML Query Use Cases


Query with reused variables

<results>
 {
   LET $doc := document("prices.xml")
   FOR $t IN distinct($doc//book/title)
   LET $p := $doc//book[title = $t]/price
   RETURN
    <minprice title = { $t/text() } >
     { min($p) }
    </minprice>
 }
</results>

Adapted from XML Query Use Cases


Query Result

<results>
  <minprice title="Advanced Programming in the Unix Environment"> 65.95 </minprice>
  <minprice title="TCP/IP Illustrated"> 65.95 </minprice>
  <minprice title="Data on the Web"> 34.95 </minprice>
</results>   

Adapted from XML Query Use Cases
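
The aggregation in this query amounts to keeping the lowest price seen for each title. The Java sketch below does that in a single pass rather than with distinct() followed by a LET, so it shows the same result by a different route. The class name MinPrices and the parallel arrays standing in for prices.xml are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MinPrices {

  // Parallel arrays standing in for the book entries of prices.xml.
  static final String[] TITLES = {
    "Advanced Programming in the Unix Environment",
    "Advanced Programming in the Unix Environment",
    "TCP/IP Illustrated",
    "TCP/IP Illustrated",
    "Data on the Web",
    "Data on the Web"
  };
  static final double[] PRICES = {65.95, 65.95, 65.95, 65.95, 34.95, 39.95};

  // Maps each title to the lowest price listed for it.
  static Map<String, Double> minPrices() {
    Map<String, Double> result = new LinkedHashMap<>();
    for (int i = 0; i < TITLES.length; i++) {
      Double current = result.get(TITLES[i]);
      if (current == null || PRICES[i] < current) {
        result.put(TITLES[i], PRICES[i]);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(minPrices());
  }
}
```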


Multiple FLWR Queries

<bib>
 {
   FOR $b IN document("bib.xml")//book[author]
   RETURN
    <book>
     { $b/title }
     { $b/author }
    </book>,
   FOR $b IN document("bib.xml")//book[editor]
   RETURN
    <reference>
     { $b/title }
     <org> { $b/editor/affiliation/text() } </org>
    </reference>
 }
</bib>

Adapted from XML Query Use Cases


Query Result

<bib>
    <book>
         <title>TCP/IP Illustrated</title>
         <author><last>Stevens</last><first>W.</first></author>
    </book>

    <book>
         <title>Advanced Programming in the Unix Environment</title>
         <author><last>Stevens</last><first>W.</first></author>
    </book>

    <book>
         <title>Data on the Web</title>
         <author><last>Abiteboul</last><first>Serge</first></author>
         <author><last>Buneman</last><first>Peter</first></author>
         <author><last>Suciu</last><first>Dan</first></author>
    </book>

    <reference>
        <title>The Economics of Technology and Content for Digital TV</title>
        <org>CITI</org>
    </reference>
</bib>

  

Adapted from XML Query Use Cases


Query Software


To Learn More


Copyright 2000, 2001 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified September 13, 2001