XML Fundamentals


Elliotte Rusty Harold

Software Development 2001 East

Wednesday, August 29, 2001

elharo@metalab.unc.edu

http://www.ibiblio.org/xml/


What is XML?


Extensible Markup Language

Language
Markup Language
Extensible

XML is a Meta Markup Language


XML Applications


Some XML Applications


XML describes structure and semantics, not formatting


A Song Description in HTML

<dt>Hot Cop
<dd> by Jacques Morali, Henri Belolo, and Victor Willis
<ul>
<li>Producer: Jacques Morali
<li>Publisher: PolyGram Records
<li>Length: 6:20
<li>Written: 1978
<li>Artist: Village People
</ul>

A Song Description in XML

<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Editing and Saving XML Files


Style Sheets provide formatting

SONG {display: block; font-family: "New York", "Times New Roman", serif}
TITLE {display: block; font-size: 24pt; 
       font-weight: bold; font-family: Helvetica, sans-serif}
COMPOSER {display: block}
PRODUCER {display: block}
YEAR {display: block}
PUBLISHER {display: block}
LENGTH {display: block}
ARTIST {display: block; font-style: italic}

Attaching style sheets to documents

<?xml-stylesheet type="text/css" href="song.css"?>
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Style Sheet Languages


An XSLT stylesheet

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <html>
      <head><title>Song</title></head>
      <body>
        <xsl:apply-templates select="SONG"/>    
      </body>
    </html>
  </xsl:template>
  
  <xsl:template match="SONG">
    <h1>
      <xsl:value-of select="TITLE"/> 
      by the 
      <xsl:value-of select="ARTIST"/>
    </h1>
    
    <ul>
      <li>Length: <xsl:value-of select="LENGTH"/></li>
      <li>Producer: <xsl:value-of select="PRODUCER"/></li>
      <li>Publisher: <xsl:value-of select="PUBLISHER"/></li>
      <li>Year: <xsl:value-of select="YEAR"/></li>
      <xsl:apply-templates select="COMPOSER"/>
    </ul>
  </xsl:template>

  <xsl:template match="COMPOSER">
    <li>Composer: <xsl:value-of select="."/></li>
  </xsl:template>

</xsl:stylesheet>

Transforming the Document

D:\fundamentals\examples>saxon hotcop.xml song3.xsl
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <title>Song</title>
   </head>
   <body>
      <h1>Hot Cop
         by the
         Village People
      </h1>
      <ul>
         <li>Length: 6:20</li>
         <li>Producer: Jacques Morali</li>
         <li>Publisher: PolyGram Records</li>
         <li>Year: 1978</li>
         <li>Composer: Jacques Morali</li>
         <li>Composer: Henri Belolo</li>
         <li>Composer: Victor Willis</li>
      </ul>
   </body>
</html>

Or alternatively:

% java com.icl.saxon.StyleSheet hotcop.xml song3.xsl
<html>
...
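
The same transformation can also be driven from Java code instead of the command line. What follows is only a minimal sketch: it assumes the standard JAXP transformation API (javax.xml.transform) bundled with current JDKs rather than Saxon's own launcher class, and the class name TransformSketch, the helper method transform, and the abbreviated inline copies of the song document and style sheet are all made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class; not part of the original examples.
class TransformSketch {

  // Abbreviated copy of the song document (shortened for this sketch)
  static final String SONG =
      "<SONG><TITLE>Hot Cop</TITLE>"
    + "<COMPOSER>Jacques Morali</COMPOSER>"
    + "<COMPOSER>Henri Belolo</COMPOSER>"
    + "<ARTIST>Village People</ARTIST></SONG>";

  // Abbreviated copy of the style sheet (shortened for this sketch)
  static final String STYLE =
      "<xsl:stylesheet version='1.0'"
    + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
    + "<xsl:template match='/'>"
    + "<html><body><xsl:apply-templates select='SONG'/></body></html>"
    + "</xsl:template>"
    + "<xsl:template match='SONG'>"
    + "<h1><xsl:value-of select='TITLE'/></h1>"
    + "<ul><xsl:apply-templates select='COMPOSER'/></ul>"
    + "</xsl:template>"
    + "<xsl:template match='COMPOSER'>"
    + "<li>Composer: <xsl:value-of select='.'/></li>"
    + "</xsl:template>"
    + "</xsl:stylesheet>";

  // Applies the style sheet to the document and returns the result as text
  static String transform(String xml, String xslt) {
    try {
      TransformerFactory factory = TransformerFactory.newInstance();
      Transformer transformer =
          factory.newTransformer(new StreamSource(new StringReader(xslt)));
      StringWriter out = new StringWriter();
      transformer.transform(
          new StreamSource(new StringReader(xml)), new StreamResult(out));
      return out.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException("Transformation failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(transform(SONG, STYLE));
  }
}
```

Whichever XSLT processor the JDK locates does the work here; the exact indentation of the HTML it emits may differ from the Saxon output shown above.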


CSS or XSL?


Well-formedness

Rules:


Validity

To be valid, an XML document must:

  1. Be well-formed

  2. Have a Document Type Definition (DTD)

  3. Comply with the constraints specified in the DTD


A DTD for Songs

<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>

<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four-digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

A Valid Song Document

<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Checking Validity

To check validity, pass the document through a validating parser, which should report any errors it finds. For example, when the document does not match its DTD:

% java dom.DOMCount -v validhotcop.xml
[Error] validhotcop.xml:13:9: The content of element type "SONG" must match "(TITLE,COMPOSER+,PRODUCER*,PUBLISHER*,LENGTH?,YEAR?)".
validhotcop.xml: 550 ms (10 elems, 0 attrs, 28 spaces, 98 chars)

A valid document:

% java dom.DOMCount -v validhotcop.xml
validhotcop.xml: 291 ms (10 elems, 0 attrs, 28 spaces, 98 chars)
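
A validity check can also be made from a program. The sketch below is one possible approach, not the DOMCount sample used above: it assumes the JAXP DocumentBuilderFactory API from current JDKs, and the class name ValidityCheckSketch, the helper validate, and the shortened DTD (placed in an internal subset so the example needs no external files) are hypothetical. Validity errors are reported to an ErrorHandler rather than thrown, so the sketch collects them in a list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class; not part of the original examples.
class ValidityCheckSketch {

  // Shortened song DTD, carried in the internal subset (an assumption
  // made so the sketch is self-contained)
  static final String DOCTYPE =
      "<!DOCTYPE SONG ["
    + "<!ELEMENT SONG (TITLE, COMPOSER+, ARTIST+)>"
    + "<!ELEMENT TITLE (#PCDATA)>"
    + "<!ELEMENT COMPOSER (#PCDATA)>"
    + "<!ELEMENT ARTIST (#PCDATA)>"
    + "]>";

  static final String VALID = DOCTYPE
    + "<SONG><TITLE>Hot Cop</TITLE>"
    + "<COMPOSER>Jacques Morali</COMPOSER>"
    + "<ARTIST>Village People</ARTIST></SONG>";

  // Well-formed, but COMPOSER and ARTIST are missing
  static final String INVALID = DOCTYPE
    + "<SONG><TITLE>Hot Cop</TITLE></SONG>";

  // Returns the validity error messages; an empty list means "valid"
  static List<String> validate(String xml) {
    final List<String> errors = new ArrayList<>();
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(true); // request DTD validation
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new ErrorHandler() {
        @Override public void warning(SAXParseException e) {
          // warnings are ignored in this sketch
        }
        @Override public void error(SAXParseException e) {
          errors.add(e.getMessage()); // validity errors arrive here
        }
        @Override public void fatalError(SAXParseException e)
            throws SAXException {
          throw e; // well-formedness errors stop the parse
        }
      });
      builder.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse document", e);
    }
    return errors;
  }

  public static void main(String[] args) {
    System.out.println("valid document: " + validate(VALID));
    System.out.println("invalid document: " + validate(INVALID));
  }
}
```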

A More Complex Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "expanded_song.dtd">
<SONG xmlns="http://www.ibiblio.org/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="embed" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

The XML Declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


Attributes

<PHOTO xlink:type="simple" xlink:show="embed" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />


Empty Element Tags

<PHOTO xlink:type="simple" xlink:show="embed" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />


Comments

<!-- You can tell what album I was listening to when I wrote this example -->


Namespaces

<SONG xmlns="http://www.ibiblio.org/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="embed" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <ARTIST>Village People</ARTIST>
</SONG>
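
To see what the namespace declarations mean to a program, the following sketch parses a cut-down version of this document and asks the DOM for namespace URIs instead of prefixes. It is an illustration under stated assumptions: it uses the JAXP and DOM Level 2 calls in current JDKs, and the class name NamespaceSketch and the helper describePublisher are hypothetical. Note that JAXP parsers are not namespace aware unless asked to be.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; not part of the original examples.
class NamespaceSketch {

  static final String SONG_NS = "http://www.ibiblio.org/xml/namespace/song";
  static final String XLINK_NS = "http://www.w3.org/1999/xlink";

  // Cut-down copy of the song document above
  static final String XML =
      "<SONG xmlns='" + SONG_NS + "' xmlns:xlink='" + XLINK_NS + "'>"
    + "<TITLE>Hot Cop</TITLE>"
    + "<PUBLISHER xlink:type='simple' xlink:href='http://www.amrecords.com/'>"
    + "A &amp; M Records</PUBLISHER>"
    + "</SONG>";

  // Returns { namespace URI, xlink:href value, text } of the PUBLISHER element
  static String[] describePublisher(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // off by default in JAXP
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      // Look the element up by namespace URI and local name, not by prefix
      Element publisher =
          (Element) doc.getElementsByTagNameNS(SONG_NS, "PUBLISHER").item(0);
      return new String[] {
        publisher.getNamespaceURI(),                // from the default namespace
        publisher.getAttributeNS(XLINK_NS, "href"), // prefixed attribute
        publisher.getTextContent()                  // &amp; is resolved to &
      };
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse document", e);
    }
  }

  public static void main(String[] args) {
    for (String value : describePublisher(XML)) {
      System.out.println(value);
    }
  }
}
```

The unprefixed PUBLISHER element picks up the default namespace, while the xlink:href attribute is found only through the XLink namespace URI; the prefix itself is just a local abbreviation.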

Entity References

A &amp; M Records


A More Complex DTD

<!ELEMENT SONG (TITLE, PHOTO?, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>
<!ATTLIST SONG xmlns       CDATA #REQUIRED
               xmlns:xlink CDATA #REQUIRED>
<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT PHOTO EMPTY>
<!ATTLIST PHOTO xlink:type CDATA #FIXED "simple"
                xlink:href CDATA #REQUIRED
                xlink:show CDATA #IMPLIED
                ALT        CDATA #REQUIRED
                WIDTH      CDATA #REQUIRED
                HEIGHT     CDATA #REQUIRED
>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ATTLIST PUBLISHER xlink:type CDATA #IMPLIED
                    xlink:href CDATA #IMPLIED
>

<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four-digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

What is XML used for?


Domain-Specific Markup Languages


Self-Describing Data


An XML Fragment

<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>

Interchange of Data Among Applications


Example XML Applications


Mathematical Markup Language

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/TR/REC-html40"
      xmlns:m="http://www.w3.org/TR/REC-MathML/"
>
<head>
<title>Fiat Lux</title>
<meta name="GENERATOR" content="amaya V1.3b" />
</head>
<body>

<P>
And God said,
</P>

<m:math>
  <m:mrow>
    <m:msub>
      <m:mi>&delta;</m:mi>
      <m:mi>&alpha;</m:mi>
    </m:msub>
    <m:msup>
      <m:mi>F</m:mi>
      <m:mi>&alpha;&beta;</m:mi>
    </m:msup>
    <m:mi></m:mi>
    <m:mo>=</m:mo>
    <m:mi></m:mi>
    <m:mfrac>
      <m:mrow>
        <m:mn>4</m:mn>
        <m:mi>&pi;</m:mi>
      </m:mrow>
      <m:mi>c</m:mi>
    </m:mfrac>
    <m:mi></m:mi>
    <m:msup>
      <m:mi>J</m:mi>
      <m:mrow>
        <m:mi>&beta;</m:mi>
        <m:mo></m:mo>
      </m:mrow>
    </m:msup>
  </m:mrow>
</m:math>

<P>
and there was light
</P>
</body>
</html>

Channel Definition Format

<?xml version="1.0"?>
<CHANNEL HREF="http://www.ibiblio.org/xml/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://www.ibiblio.org/xml/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://www.ibiblio.org/xml/tradeshows.html">
    <TITLE>Trade shows and conferences about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://www.ibiblio.org/xml/lists.htm">
    <TITLE>Mailing Lists dedicated to XML</TITLE>
  </ITEM>
</CHANNEL>

Classic Literature


Vector Graphics

A VML document

The Resource Description Framework (RDF)


An Example of RDF

<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://www.ibiblio.org/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>

File Formats, in-house applications, and other behind-the-scenes uses


XML for XML


XSL: The Extensible Stylesheet Language


An XML document

<?xml version="1.0"?>
<?xml-stylesheet type="text/xml" href="atoms.xsl"?>
<PERIODIC_TABLE>

  <ATOM STATE="GAS">
    <NAME>Hydrogen</NAME>
    <SYMBOL>H</SYMBOL>
    <ATOMIC_NUMBER>1</ATOMIC_NUMBER>
    <ATOMIC_WEIGHT>1.00794</ATOMIC_WEIGHT>
    <BOILING_POINT UNITS="Kelvin">20.28</BOILING_POINT>
    <MELTING_POINT UNITS="Kelvin">13.81</MELTING_POINT>
    <DENSITY UNITS="grams/cubic centimeter">
      <!-- At 300K, 1 atm -->
      0.0000899
    </DENSITY>
  </ATOM>

  <ATOM STATE="GAS">
    <NAME>Helium</NAME>
    <SYMBOL>He</SYMBOL>
    <ATOMIC_NUMBER>2</ATOMIC_NUMBER>
    <ATOMIC_WEIGHT>4.0026</ATOMIC_WEIGHT>
    <BOILING_POINT UNITS="Kelvin">4.216</BOILING_POINT>
    <MELTING_POINT UNITS="Kelvin">0.95</MELTING_POINT>
    <DENSITY UNITS="grams/cubic centimeter"><!-- At 300K -->
      0.0001785
    </DENSITY>
  </ATOM>

</PERIODIC_TABLE>

An XSLT style sheet that converts to XSL-FO

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

      <fo:layout-master-set>
        <fo:simple-page-master master-name="only">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>

      <fo:page-sequence master-name="only">

        <fo:flow flow-name="xsl-region-body">
          <xsl:apply-templates select="//ATOM"/>
        </fo:flow>

      </fo:page-sequence>

    </fo:root>
  </xsl:template>

  <xsl:template match="ATOM">
    <fo:block font-size="20pt" font-family="serif"
              line-height="30pt">
      <xsl:value-of select="NAME"/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>

The XSL-FO Output

<?xml version="1.0"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <fo:layout-master-set>
    <fo:simple-page-master master-name="only">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>

  <fo:page-sequence master-name="only">

    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="20pt" font-family="serif"
                line-height="30pt">
        Hydrogen
      </fo:block>
      <fo:block font-size="20pt" font-family="serif"
                line-height="30pt" >
        Helium
      </fo:block>
    </fo:flow>

  </fo:page-sequence>

</fo:root>

The PDF Result

W3C XML Schemas


XML Hypertext

Linking in XML is divided into multiple parts:


XML Hypertext Example

<?xml version="1.0"?>
<story date="January 9, 2001"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:xi="http://www.w3.org/2001/XInclude"
       xml:base="http://www.cafeaulait.org/">

  <p>
    The W3C XML Linking Working Group has pushed the 
    <cite xlink:href="http://www.w3.org/TR/2001/WD-xptr-20010108"
          xlink:type="simple">XPointer specification</cite> 
    back to working draft status. The specific issue that was 
    uncovered during Candidate Recommendation was some 
    <em xlink:type="simple"
      xlink:href="http://www.w3.org/TR/xptr#xpointer(//div[@class='div3'][7])">
      confusion
    </em> 
    over how to integrate XPointers, particularly those  
   in non-XML documents, with namespaces. 
   </p>

   <p>
     It's also come to light in this draft that Sun has 
     <em xlink:type="simple"
      xlink:href=
      "http://lists.w3.org/Archives/Public/www-xml-linking-comments/2000OctDec/0092.html"
      >
      claimed a patent</em> on some of the technologies needed to 
      implement XPointer. I think this is particularly offensive 
      because Eve L. Maler, a Sun employee, served as co-chair of 
      the XML Linking Working Group and a co-editor of the XPointer 
      specification. As usual Sun wants to use this as a club to lock 
      implementers and users into a licensing agreement that goes 
      beyond what Sun and the W3C could otherwise demand. The specific 
      patent is <cite>United States Patent No. 5,659,729, Method and 
      system for implementing hypertext scroll attributes</cite>, issued 
      to Jakob Nielsen in 1997. The patent was filed on February 1, 1996. 
      It claims:
  </p>
  <blockquote>
    <xi:include href=
      "http://www.delphion.com/details?&amp;pn=US05659729__#xpointer(//abstract)"
    ></xi:include>
  </blockquote>
  
</story>

XLinks: The XML Linking Language

<footnote xlink:type="simple" xlink:href="footnote7.xml">7</footnote>

Extended Links


Extended Link Example

<WEBSITE xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="extended" xlink:title="Cafe au Lait">

  <NAME xlink:type="resource" xlink:label="source">
    Cafe au Lait
  </NAME>

  <HOMESITE xlink:type="locator"
           xlink:href="http://www.cafeaulait.org/"
           xlink:label="ny"/>

  <MIRROR xlink:type="locator"
         xlink:title="Cafe au Lait Swedish Mirror"
         xlink:label="se"
         xlink:href="http://sunsite.kth.se/javafaq"/>

  <MIRROR xlink:type="locator"
         xlink:title="Cafe au Lait U.S. Mirror"
         xlink:label="nc"
         xlink:href="http://ibiblio.org/javafaq/"/>

  <MIRROR xlink:type="locator"
         xlink:title="Cafe au Lait Swiss Mirror"
         xlink:label="ch"
         xlink:href="http://sunsite.cnlab-switch.ch/javafaq"/>

  <CONNECTION xlink:type="arc" xlink:from="source"
              xlink:to="ch"    xlink:show="replace"
              xlink:actuate="onRequest"/>
  <CONNECTION xlink:type="arc" xlink:from="source"
              xlink:to="ny"    xlink:show="replace"
              xlink:actuate="onRequest"/>
  <CONNECTION xlink:type="arc" xlink:from="source"
              xlink:to="se"    xlink:show="replace"
              xlink:actuate="onRequest"/>
  <CONNECTION xlink:type="arc" xlink:from="source"
              xlink:to="nc"    xlink:show="replace"
              xlink:actuate="onRequest"/>

</WEBSITE>

Diagram of an Extended Link

An extended link with arcs

XInclude

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE book SYSTEM "book.dtd" >
<book xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <title>The Java Developer's Resource</title>
  <last_modified>December 3, 2000</last_modified>
    
  <xinclude:include href="getting_started.xml"/>
  <xinclude:include href="procedural_java.xml"/>
  
</book>

Non-XML for XML


XPath

descendant::language[position()=2]
/child::spec/child::body/child::*/child::language[2]
/spec/body/*/language[2]
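
These expressions can be tried out from Java. The sketch below is only an illustration: it assumes the javax.xml.xpath API in current JDKs, and the class name XPathSketch, the helper evaluate, and the tiny stand-in for a spec document are hypothetical. In this particular stand-in document all three expressions happen to select the same element; against other documents they need not.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical class; not part of the original examples.
class XPathSketch {

  // Made-up miniature of a specification document
  static final String SPEC =
      "<spec><body>"
    + "<div><language>EBNF</language><language>SGML</language></div>"
    + "<div><language>XML</language></div>"
    + "</body></spec>";

  // Evaluates the expression with the document root as the context node
  // and returns the string value of the first selected node
  static String evaluate(String expression, String xml) {
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
      return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("Bad XPath: " + expression, e);
    }
  }

  public static void main(String[] args) {
    String[] expressions = {
      "descendant::language[position()=2]",
      "/child::spec/child::body/child::*/child::language[2]",
      "/spec/body/*/language[2]"
    };
    for (String expression : expressions) {
      System.out.println(expression + " -> " + evaluate(expression, SPEC));
    }
  }
}
```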

XPointers

xpointer(id("ebnf"))
xpointer(descendant::language[position()=2])
ebnf
xpointer(/child::spec/child::body/child::*/child::language[2])
xpointer(/spec/body/*/language[2])
/1/14/2
xpointer(id("ebnf"))xpointer(id("EBNF"))

XPointers and URIs

http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(id("ebnf"))
http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(descendant::language[position()=2])
http://www.w3.org/TR/1998/REC-xml-19980210.xml#ebnf
http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(/child::spec/child::body/child::*/child::language[2])
http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(/spec/body/*/language[2])
http://www.w3.org/TR/1998/REC-xml-19980210.xml#/1/14/2
http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(id("ebnf"))xpointer(id("EBNF"))


Programming with XML


Several APIs to choose from


SAX


SAX2


The SAX Process


Parsing a Document with XMLReader

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Checker {

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2..."); 
      return;
    } 
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

The ContentHandler interface

package org.xml.sax;


public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    
    public void startDocument() throws SAXException;
    
    public void endDocument() throws SAXException;
    
    public void startPrefixMapping(String prefix, String uri) 
     throws SAXException;

    public void endPrefixMapping(String prefix) throws SAXException;

    public void startElement(String namespaceURI, String localName,
     String rawName, Attributes atts) throws SAXException;

    public void endElement(String namespaceURI, String localName,
     String rawName) throws SAXException;

    public void characters(char[] ch, int start, int length) 
     throws SAXException;

    public void ignorableWhitespace(char ch[], int start, int length)
     throws SAXException;

    public void processingInstruction(String target, String data)
     throws SAXException;

    public void skippedEntity(String name) throws SAXException;
     
}
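
A class that implements ContentHandler directly has to supply every method listed above, even the ones it does not care about. SAX2 also provides org.xml.sax.helpers.DefaultHandler, which implements them all as no-ops so that a subclass overrides only what it needs. The sketch below illustrates that approach; the class name ElementCounter and its count helper are hypothetical, and it assumes the JAXP SAXParserFactory in current JDKs to locate a parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; not part of the original examples.
class ElementCounter extends DefaultHandler {

  private int numElements;

  @Override
  public void startDocument() {
    numElements = 0; // reset so the handler can be reused
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qName, Attributes atts) {
    numElements++; // one callback per start-tag or empty-element tag
  }

  // Every other ContentHandler method is inherited as a no-op

  int getCount() {
    return numElements;
  }

  // Parses the given XML text and returns the number of elements seen
  static int count(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      ElementCounter counter = new ElementCounter();
      reader.setContentHandler(counter);
      reader.parse(new InputSource(new StringReader(xml)));
      return counter.getCount();
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse document", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<SONG><TITLE>Hot Cop</TITLE>"
               + "<ARTIST>Village People</ARTIST></SONG>";
    System.out.println(count(xml) + " elements");
  }
}
```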

SAX Example

import org.apache.xerces.parsers.*;
import org.xml.sax.*;
import java.io.IOException;
import java.util.StringTokenizer;


public class SAXWordCount implements ContentHandler {

  private int numWords;
    
  public void startDocument() throws SAXException {
    this.numWords = 0; 
  }

  public void endDocument() throws SAXException {
    System.out.println(numWords + " words");
    System.out.flush();
  }
  
  private StringBuffer sb = new StringBuffer();
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    
    sb.append(text, start, length);
    
  }
  
  private void flush() {
    numWords += countWords(sb.toString());
    sb = new StringBuffer();    
  }
  
  // methods that signify a word break
  public void startElement(String namespaceURI, String localName,
	 String rawName, Attributes atts) throws SAXException {
    this.flush(); 
  }
  
  public void endElement(String namespaceURI, String localName,
	 String rawName) throws SAXException {
    this.flush(); 
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    this.flush(); 
  }

  // methods that aren't necessary in this example
  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {
    // ignore; 
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    // ignore; 
  }
  
  public void endPrefixMapping(String prefix) throws SAXException {
    // ignore; 
  }

  public void skippedEntity(String name) throws SAXException {
    // ignore; 
  }   
  
  public void setDocumentLocator(Locator locator) {}

  private static int countWords(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  } 

  public static void main(String[] args) {
     
    SAXParser parser = new SAXParser();
    SAXWordCount counter = new SAXWordCount();
    parser.setContentHandler(counter);
    
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]); 
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

% java SAXWordCount hotcop.xml
16 words

Event-Based API Caveats


Document Object Model


The Design of the DOM API


DOM Evolution


Eight Modules:


DOM Trees


org.w3c.dom


The DOM Process


Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;
import org.w3c.dom.*;


public class DOMChecker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  }

}

DOM Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import java.util.StringTokenizer;


public class DOMWordCount {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        int numWords = countWordsInNode(d);
        System.out.println(numWords + " words");

      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static int countWordsInNode(Node node) {
    
    int numWords = 0;
    
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        numWords += countWordsInNode(children.item(i));
      } 
    }  

    int type = node.getNodeType();
    if (type == Node.TEXT_NODE) {
      String s = node.getNodeValue();
      numWords += countWordsInString(s);
    }
    
    return numWords;  
    
  }
  
  private static int countWordsInString(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  } 

}

% java DOMWordCount hotcop.xml
16 words

JDOM


The JDOM Process


Parsing a Document with JDOM

import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;


public class JDOMChecker {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMChecker URL1 URL2..."); 
      return;
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        builder.build(args[i]);
        // If there are no well-formedness errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (JDOMException e) { // indicates a well-formedness or other error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

}

Parser Results

% java JDOMChecker shortlogs.xml HelloJDOM.java
shortlogs.xml is well formed.
HelloJDOM.java is not well formed.
The markup in the document preceding the root element must be well-formed.: 
Error on line 1 of XML document: The markup in the document preceding the 
root element must be well-formed.

JDOM Example

import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.util.*;


public class JDOMWordCount {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMWordCount URL1 URL2..."); 
      return;
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        Document doc = builder.build(args[i]);
        Element root = doc.getRootElement();
        int numWords = countWordsInElement(root);
        System.out.println(numWords + " words");

      }
      catch (JDOMException e) { // indicates a well-formedness or other error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

  public static int countWordsInElement(Element element) {
    
    int numWords = 0;
    
    List children = element.getMixedContent();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof String) {
        numWords += countWordsInString((String) o);
      } 
      else if (o instanceof Element) {
        // note use of recursion
        numWords += countWordsInElement((Element) o); 
      } 
    }
    
    return numWords;  
    
  }

  private static int countWordsInString(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  }

}

% java JDOMWordCount hotcop.xml
16 words

XML and Databases


Integrating XML with Databases


Middleware


Database Exchange and Integration


To Learn More


Copyright 2000, 2001 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified August 29, 2001