Processing XML with SAX and DOM


Processing XML with SAX and DOM

Elliotte Rusty Harold

Software Development 2004 West

Tuesday, March 16, 2004

elharo@metalab.unc.edu

http://www.cafeconleche.org/


Where we're going


Processing XML with Java is easy


Prerequisites


XML API Styles


Parser APIs


Part I: XML Infoset

The Infoset is the unfortunate standard to which those in retreat from the radical and most useful implications of well-formedness have rallied. At its core the Infoset insists that there is 'more' to XML than the straightforward syntax of well-formedness. By imposing its canonical semantics the Infoset obviates the infinite other semantic outcomes which might be elaborated in particular unique circumstances from an instance of well-formed XML 1.0 syntax. The question we should be asking is not whether the Infoset has chosen the correct canonical semantics, but whether the syntactic possibilities of XML 1.0 should be curtailed in this way at all.
--Walter Perry on the xml-dev mailing list


A simple example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://www.cafeconleche.org/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->
View in Browser

Markup and Character Data


Markup and Character Data Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://www.cafeconleche.org/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Elements and Tags


Entities


Parsed Character Data


CDATA sections


Comments


Processing Instructions


The XML Declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Document Type Declaration

<!DOCTYPE SONG SYSTEM "song.dtd">


Document Type Definition (DTD)

<!ELEMENT SONG (TITLE, PHOTO?, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>

<!ELEMENT TITLE     (#PCDATA)>
<!ELEMENT COMPOSER  (#PCDATA)>
<!ELEMENT PRODUCER  (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH    (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR   (#PCDATA)>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT PHOTO EMPTY>
<!ATTLIST PHOTO xlink:type (simple) #FIXED "simple" 
                xlink:show (onLoad) #FIXED "onLoad" 
                xlink:href CDATA #REQUIRED
                ALT CDATA #REQUIRED
                WIDTH NMTOKEN #REQUIRED
                HEIGHT NMTOKEN #REQUIRED
>
<!ATTLIST PUBLISHER xlink:type (simple) #FIXED "simple" 
                    xlink:href CDATA #REQUIRED

>
<!ATTLIST SONG xmlns CDATA       #FIXED "http://www.cafeconleche.org/namespace/song"
               xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink"
>

XML Names


XML Namespaces


Namespace Syntax


Namespace URIs


Binding Prefixes to Namespace URIs


The Default Namespace


How Parsers Handle Namespaces


The Five Layers of XML Processing

Semantics
Structure
Syntax
Lexical
Binary

To Learn More


Part II: Writing XML Documents with Java

I have learned to be even more skeptical than before about the slew of APIs doing the rounds in the XML development community. An XML instance is just a documents, guys; you need to understand the document structure and document interchange choreography of your systems. Don't let some API get in the way of your understanding of XML systems at the document level. If you do, you run the risk becoming a slave to the APIs and hitting a wall when the APIs fail you.

--Sean McGrath
Read the rest in ITworld.com - XML IN PRACTICE - APIs Considered Harmful


You don't always need a new API


Unicode


Readers and Writers


A Java program that writes Fibonacci numbers into a text file

import java.math.BigInteger;
import java.io.*;


public class FibonacciText {

  public static void main(String[] args) {

    try {
      OutputStream fout = new FileOutputStream("fibonacci.txt");
      Writer out = new OutputStreamWriter(fout, "8859_1");

      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      for (int i = 1; i <= 25; i++) {
        out.write(low.toString() + "\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write(high.toString() + "\r\n");

      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

fibonacci.txt

1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
10946
17711
28657
46368
75025
121393
317811

A Java program that writes Fibonacci numbers into an XML file

import java.math.BigInteger;
import java.io.*;


public class FibonacciXML {

  public static void main(String[] args) {
   
    try {
      OutputStream  fout = new FileOutputStream("fibonacci.xml");
      Writer out = new OutputStreamWriter(fout);      
      
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      out.write("<?xml version=\"1.0\"?>\r\n");  
      out.write("<Fibonacci_Numbers>\r\n");  
      for (int i = 1; i <= 25; i++) {
        out.write("  <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");  
 
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

fibonacci.xml

<?xml version="1.0"?>
<Fibonacci_Numbers>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
  <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>

Single quoted attribute values are a little cleaner

import java.math.BigInteger;
import java.io.*;


public class FibonacciApos {

  public static void main(String[] args) {
   
    try {
      OutputStream  fout = new FileOutputStream("fibonacci_apos.xml");
      Writer out = new OutputStreamWriter(fout);      
      
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      out.write("<?xml version='1.0'?>\r\n");  
      out.write("<Fibonacci_Numbers>\r\n");  
      for (int i = 1; i <= 25; i++) {
        out.write("  <fibonacci index='" + i + "'>");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");  
 
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

Suppose we want to use a different encoding than UTF-8

import java.math.BigInteger;
import java.io.*;


public class FibonacciLatin1 {

  public static void main(String[] args) {
   
    try {
      OutputStream fout = new FileOutputStream("fibonacci_Latin_1.xml");
      Writer out = new OutputStreamWriter(fout, "8859_1");      
      
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n");  
      out.write("<Fibonacci_Numbers>\r\n");  
      for (int i = 1; i <= 25; i++) {
        out.write("  <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");  
 
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

fibonacci_Latin_1.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<Fibonacci_Numbers>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
  <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>

Suppose you want to include a DTD

import java.math.BigInteger;
import java.io.*;


public class FibonacciDTD {

  public static void main(String[] args) {
   
    try {
      OutputStream fout = new FileOutputStream("valid_fibonacci.xml");
      Writer out = new OutputStreamWriter(fout, "UTF-8");      
      
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      out.write("<?xml version=\"1.0\"?>\r\n");  
      out.write("<!DOCTYPE Fibonacci_Numbers [\r\n");
      out.write("  <!ELEMENT Fibonacci_Numbers (fibonacci*)>\r\n");      
      out.write("  <!ELEMENT fibonacci (#PCDATA)>\r\n");      
      out.write("  <!ATTLIST fibonacci index CDATA #IMPLIED>\r\n");      
      out.write("]>\r\n");  
      out.write("<Fibonacci_Numbers>\r\n");  
      for (int i = 1; i <= 25; i++) {
        out.write("  <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");  
 
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

valid_fibonacci.xml

<?xml version="1.0"?>
<!DOCTYPE Fibonacci_Numbers [
  <!ELEMENT Fibonacci_Numbers (fibonacci*)>
  <!ELEMENT fibonacci (#PCDATA)>
  <!ATTLIST fibonacci index CDATA #IMPLIED>
]>
<Fibonacci_Numbers>
  <fibonacci index="0">0</fibonacci>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
</Fibonacci_Numbers>

To Learn More


Part III: Reading XML Documents with SAX

Actually, SAX2 has ** MUCH ** better infoset support than DOM does. Yes, I've done the detailed analysis.

--David Brownell on the xml-dev mailing list


Reading XML Documents


SAX


SAX Parsers for Java

Parser URL Validating Namespaces DOM1 DOM2 SAX1 SAX2 License
Yuval Oren's Piccolo http://piccolo.sourceforge.net/ X X X LGPL
Apache XML Project's Xerces Java http://xml.apache.org/xerces2-j/index.html X X X X X X Apache Software License, Version 1.1
IBM's XML for Java http://www.alphaworks.ibm.com/formula/xml X X X X X X Apache Software License, Version 1.1
Microstar/David Brownell's Ælfred http://www.gnu.org/software/classpathx/jaxp/jaxp.html X X X   X X GPL with library exception
Silfide's SXP http://www.loria.fr/projets/XSilfide/EN/sxp/    X   X  Non-GPL viral open source license
Sun's Crimson http://xml.apache.org/crimson/ X X X   X   Apache
Oracle's XML Parser for Java http://technet.oracle.com/ X X X X X Xfree beer
Caucho Resin http://www.caucho.com/products/resin-xml/index.xtp ? X ? ? X Xpayware
Saxon's AElfred http://saxon.sourceforge.net/aelfred.html   X X   X XBSD-ish license

The Horrors of the CLASSPATH


SAX1


SAX2


The SAX2 Process

  1. Use the factory method XMLReaderFactory.createXMLReader() to retrieve a parser-specific implementation of the XMLReader interface

  2. Your code registers a ContentHandler with the parser

  3. An InputSource feeds the document into the parser

  4. As the document is read, the parser calls back to the methods of the ContentHandler to tell it what it's seeing in the document.


Making an XMLReader


Parsing a Document with XMLReader

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Checker {

  public static void main(String[] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException ex2) {
        System.out.println("Could not locate a parser."
         + "Please set the the org.xml.sax.driver property.");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

Sample Output from SAX2Checker

C:\>java SAX2Checker http://www.cafeconleche.org/
http://www.cafeconleche.org/ is not well formed.
The element type "dt" must be terminated by the 
matching end-tag "</dt>". 
at line 186, column 5

JAXP Brain Damage


The ContentHandler interface

package org.xml.sax;


public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    
    public void startDocument() throws SAXException;
    
    public void endDocument() throws SAXException;
    
    public void startPrefixMapping(String prefix, String uri) 
     throws SAXException;

    public void endPrefixMapping(String prefix) throws SAXException;

    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) throws SAXException;

    public void endElement(String namespaceURI, String localName,
     String qualifiedName) throws SAXException;

    public void characters(char[] text, int start, int length) 
     throws SAXException;

    public void ignorableWhitespace(char[] text, int start, int length)
     throws SAXException;

    public void processingInstruction(String target, String data)
     throws SAXException;

    public void skippedEntity(String name) throws SAXException;
     
}

SAX2 Event Reporter

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class EventReporter implements ContentHandler {

  public void setDocumentLocator(Locator locator) {
    System.out.println("setDocumentLocator(" + locator + ")");
  }

  public void startDocument() throws SAXException {
    System.out.println("startDocument()");
  }

  public void endDocument() throws SAXException {
    System.out.println("endDocument()");
  }

  public void startElement(String namespaceURI, String localName, 
   String qualifiedName, Attributes atts)
   throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qualifiedName = '"' + qualifiedName + '"';
    String attributeString = "{";
    for (int i = 0; i < atts.getLength(); i++) {
      attributeString += atts.getQName(i) + "=\"" 
       + atts.getValue(i) + "\"";
      if (i != atts.getLength()-1) attributeString += ", ";
    }
    attributeString += "}";
    System.out.println("startElement(" + namespaceURI + ", " 
     + localName + ", " + qualifiedName + ", " + attributeString + ")");
  }

  public void endElement(String namespaceURI, String localName, 
   String qualifiedName)
   throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qualifiedName = '"' + qualifiedName + '"';
    System.out.println("endElement(" + namespaceURI + ", " 
     + localName + ", " + qualifiedName + ")");
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {
    String textString = "[" + new String(text) + "]";
    System.out.println("characters(" + textString + ", " 
     + start + ", " +  length + ")");
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("ignorableWhitespace()");
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("processingInstruction(" + target + ", " 
     + data + ")");
  }

  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {
    System.out.println("startPrefixMapping(\"" + prefix + "\", \"" 
     + uri + "\")");
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("endPrefixMapping(\"" + prefix + "\")");
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("skippedEntity(" + name + ")");
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {

    XMLReader parser;
    try {
     parser = XMLReaderFactory.createXMLReader();
    }
    catch (Exception e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;
      }
    }


    if (args.length == 0) {
      System.out.println(
       "Usage: java EventReporter URL1 URL2...");
    }

    // Install the content handler
    parser.setContentHandler(new EventReporter());

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

Event Reporter Output

View in Browser

A Sample Application

Full list

Goal: Return a list of all the URLs in this list as java.net.URL objects

Design Decisions


SAX Design


User Interface Class

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.util.*;
import java.io.*;


public class WeblogsSAX {
     
  public static List listChannels() 
   throws IOException, SAXException {
    return listChannels(
     "http://static.userland.com/weblogMonitor/logs.xml"); 
  }
  
  public static List listChannels(String uri) 
   throws IOException, SAXException {
    
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      parser = XMLReaderFactory.createXMLReader(
       "org.apache.xerces.parsers.SAXParser"
      );
    }
    Vector urls = new Vector(1000);
    ContentHandler handler = new URIGrabber(urls);
    parser.setContentHandler(handler);
    parser.parse(uri);
    return urls;
    
  }
  
  public static void main(String[] args) {
   
    try {
      List urls;
      if (args.length > 0) urls = listChannels(args[0]);
      else urls = listChannels();
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next()); 
      }
    }
    catch (IOException e) {
      System.err.println(e); 
    }
    catch (SAXParseException e) {
      System.err.println(e); 
      System.err.println("at line " + e.getLineNumber() 
       + ", column " + e.getColumnNumber()); 
    }
    catch (SAXException e) {
      System.err.println(e); 
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace(); 
    }
    
  }
  
}

ContentHandler Class

import org.xml.sax.*;
import java.net.*;
import java.util.Vector;

             // conflicts with java.net.ContentHandler
class URIGrabber implements org.xml.sax.ContentHandler {

  private Vector urls;

  URIGrabber(Vector urls) {
    this.urls = urls;
  }

  // do nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() throws SAXException {}
  public void endDocument() throws SAXException {}
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {}
  public void endPrefixMapping(String prefix) throws SAXException {}
  public void skippedEntity(String name) throws SAXException {}
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {}
  public void processingInstruction(String target, String data)
   throws SAXException {}


  // Remember, there's no guarantee all the text of the
  // url element will be returned in a single call to characters
  private StringBuffer urlBuffer;
  private boolean collecting = false;

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {

    if (qualifiedName.equals("url")) {
      collecting = true;
      urlBuffer = new StringBuffer();
    }

  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    if (collecting) {
      urlBuffer.append(text, start, length);
    }

  }

  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException {

    if (qualifiedName.equals("url")) {
      collecting = false;
      String url = urlBuffer.toString();
      try {
        urls.addElement(new URL(url));
      }
      catch (MalformedURLException e) {
        // skip this url
      }
    }

  }

}

Weblogs Output

% java Weblogs shortlogs.xml
http://www.mozillazine.org
http://www.salonherringwiredfool.com/
http://www.slashdot.org/

Features and Properties


Feature/Property SAXExceptions

SAXNotRecognizedException
The parser never allows you to set or get this feature or property
SAXNotSupportedException
The parser does not allow this value for a requested feature/property, or the feature/property is read-only, or the feature/property cannot be read/written at this moment in the parsing process.

Required Features


Core Features

adapted from SAX2 documentation by David Megginson


Turning on Validation


Three Levels of Errors

In increasing order of severity:

  1. A warning; e.g. ambiguous content model, a constraint for compatibility

  2. A recoverable error: typically a validity error

  3. A fatal error: typically a well-formedness error


The ErrorHandler interface

package org.xml.sax;

public interface ErrorHandler {
 
  public void warning(SAXParseException exception)
   throws SAXException;

  public void error(SAXParseException exception)
   throws SAXException;
    
  public void fatalError(SAXParseException exception)
   throws SAXException;
    
}

An ErrorHandler for Reporting Validity Errors

import org.xml.sax.*;
import java.io.*;


public class ValidityErrorReporter implements ErrorHandler {
 
  private Writer out;
 
  public ValidityErrorReporter(Writer out) {
    this.out = out;
  }
 
  public ValidityErrorReporter() {
    this(new OutputStreamWriter(System.out));
  }
 
  public void warning(SAXParseException ex)
   throws SAXException {

    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }

  public void error(SAXParseException ex)
   throws SAXException {
    
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }
    
  public void fatalError(SAXParseException ex)
   throws SAXException {
    
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }
    
}

Validating

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import org.apache.xerces.parsers.*; 
import java.io.*;


public class SAX2Validator {

  public static void main(String[] args) {
    
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser"
        );
      }
      catch (SAXException ex2) {
        System.err.println("Could not locate a SAX2 Parser");
        return;
      }
    }
     
    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate;"
       + " checking for well-formedness instead...");
    } 
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + "checking for well-formedness instead...");
    } 
     
    if (args.length == 0) {
      System.out.println("Usage: java SAX2Validator URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

Core Properties

adapted from SAX2 documentation by David Megginson


Nonstandard Features in Xerces


Nonstandard Properties in Xerces

http://apache.org/xml/properties/schema/external-schemaLocation
http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation

Properties for Extension Handlers


Handling Attributes in SAX2


Attributes Example

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.xml.sax.helpers.*;


public class XLinkSpider extends DefaultHandler {

  public static Enumeration listURIs(String systemId) 
   throws SAXException, IOException {
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }
      
    // Install the Content Handler   
    XLinkSpider spider = new XLinkSpider();   
    parser.setContentHandler(spider);
    parser.parse(systemId);
    return spider.uris.elements();
      
  }
  
  private Vector uris = new Vector();

  public void startElement(String namespaceURI, String localName, 
   String rawName, Attributes atts) throws SAXException {
    
     String uri = atts.getValue(
      "http://www.w3.org/1999/xlink", "href");
     if (uri != null) uris.addElement(uri);
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java XLinkSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        Enumeration uris = listURIs(args[i]);
        while (uris.hasMoreElements()) {
          String s = (String) uris.nextElement();
          System.out.println(s);
        }
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end XLinkSpider

Resolving Entities


EntityResolver Example

import org.xml.sax.*;

public class RSSResolver implements EntityResolver {

  public InputSource resolveEntity(String publicID, String systemID) {

    if ( publicID.equals(
          "-//Netscape Communications//DTD RSS 0.91//EN")
     || systemID.equals(
          "http://my.netscape.com/publish/formats/rss-0.91.dtd")) {
      return new InputSource(
       "http://www.cafeconleche.org/dtds/rss.dtd");
    } 
    else {
      // use the default behaviour
      return null;
    }
    
  }
   
}
 

The NamespaceSupport class


Filtering XML


XMLFilter Example

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;


public class UnparsedTextFilter extends XMLFilterImpl {

  private TextEntityReplacer replacer;

  public UnparsedTextFilter(XMLReader parent) {
    super(parent);
  }

  public void parse(InputSource input) 
   throws IOException, SAXException {
                 System.out.println("parsing");

    replacer = new TextEntityReplacer(input.getPublicId(), 
     input.getSystemId());
    this.setDTDHandler(replacer); 
    this.setContentHandler(this); 
  }
  // The other parse() method just calls this one 

  public void parse(String systemId) 
   throws IOException, SAXException {
    parse(new InputSource(systemId)); 
  }

  public void startElement(String uri, String localName, 
   String qualifiedName, Attributes attributes) throws SAXException {
              System.out.println("startElement");

    Vector extraText = new Vector();

    // Are there any unparsed entities in the attributes?
    for (int i = 0; i < attributes.getLength(); i++) {
      if (attributes.getType(i).equals("ENTITY")) {
        try {
          System.out.println("replacing");
          String s = replacer.getText(attributes.getValue(i));
          if (s != null) extraText.addElement(s);
        }
        catch (IOException e) {
          System.err.println(e); 
        }
      } 
      
    }    

    super.startElement(uri, localName, qualifiedName, attributes);
    
    // Now spew out the values of the unparsed entities:
    Enumeration e = extraText.elements();
    while (e.hasMoreElements()) {
      Object o = e.nextElement();
      String s = (String) o;
      super.characters(s.toCharArray(), 0, s.length()); 
    }
    
  }

}

TextMerger

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;
import org.apache.xml.serialize.*;


public class TextMerger {

  public static void main(String[] args) {
  
    XMLReader base;
    try {
     base = XMLReaderFactory.createXMLReader(
      "org.apache.xerces.parsers.SAXParser");
    }
    catch (Exception e) {
      // fall back on default parser
      try {
        base = XMLReaderFactory.createXMLReader();
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;          
      }
    }
    
    XMLReader parser = new UnparsedTextFilter(base);
    
    //essentially a pretty printer
    XMLSerializer printer 
     = new XMLSerializer(System.out, new OutputFormat());
    
    base.setContentHandler(printer);
    
    for (int i = 0; i < args.length; i++) {
      try {
        System.out.println("Parsing " + args[i]);
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i] 
         + " because of the IOException " + e);
      }      
    } // end for
    System.out.flush();
  
  }

}

InputSource


The InputSource interface

package org.xml.sax;

import java.io.*;

public class InputSource {

  public InputSource() 
  public InputSource(String systemID) 
  public InputSource(InputStream in)
  public InputSource(Reader in)

  public void   setPublicId(String publicID)
  public String getPublicId()
  public void   setSystemId(String systemID)
  public String getSystemId()

  public void        setByteStream(InputStream byteStream)
  public InputStream getByteStream()
  public void        setEncoding(String encoding)
  public String      getEncoding()
  public void        setCharacterStream(Reader characterStream)
  public Reader      getCharacterStream()

}

Example of InputSource

import org.xml.sax;
import java.io.*;
import java.net.*;
import java.util.zip.*;
...
try {

  URL u = new URL(
   "http://www.cafeconleche.org/examples/1998validstats.xml.gz"); 
  InputStream raw = u.openStream();
  InputStream decompressed = new GZIPInputStream(raw);
  InputSource in = new InputSource(decompressed);
  // read the document... 

}
catch (IOException e) {
  System.err.println(e);
}
catch (SAXException e) {
  System.err.println(e);
}

What SAX2 doesn't do


Event Based API Caveats


To Learn More



Part IV: DOM, The Document Object Model

The DOM (like XML) is not a triumph of elegance; it's a triumph of "if we do not hang together, we shall hang separately." At least the Browser Wars were not followed by API Wars. Better a common API that we all love to hate than a bazillion contending APIs that carve the Web up into contending enclaves of True Believers.

--Mike Champion on the xml-dev mailing list, Thursday, September 27, 2001


Where we're going


Trees


Document Object Model


DOM Evolution


DOM Implementations for Java


Eight Modules:


DOM Trees


org.w3c.dom


The DOM Process

  1. Library specific code creates a parser

  2. The parser parses the document and returns a DOM org.w3c.dom.Document object.

  3. The entire document is stored in memory.

  4. DOM methods and interfaces are used to extract data from this object


Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
   
  }

}

The JAXP Process

  1. javax.xml.parsers.DocumentBuilderFactory.newInstance() creates a DocumentBuilderFactory

  2. Configure the factory

  3. The factory's newBuilder() method creates a DocumentBuilder

  4. Configure the builder

  5. The builder parses the document and returns a DOM org.w3c.dom.Document object.

  6. The entire document is stored in memory.

  7. DOM methods and interfaces are used to extract data from this object


Parsing documents with a JAXP DocumentBuilder

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class JAXPParserMaker {

  public static void main(String[] args) {
     
    try {       
      DocumentBuilderFactory builderFactory 
       = DocumentBuilderFactory.newInstance();
      builderFactory.setNamespaceAware(true);
      DocumentBuilder parser 
       = builderFactory.newDocumentBuilder();
    
      for (int i = 0; i < args.length; i++) {
        try {
          // Read the entire document into memory
          Document d = parser.parse(args[i]); 
          // work with the document...
        }
        catch (SAXException e) {
        System.err.println(e); 
        }
        catch (IOException e) {
          System.err.println(e); 
        }
      
      } // end for
      
    }
    catch (ParserConfigurationException e) {
      System.err.println("You need to install a JAXP aware parser.");
    }
   
  }

}

The Node Interface

package org.w3c.dom;

public interface Node {

  // NodeType
  public static final short ELEMENT_NODE                = 1;
  public static final short ATTRIBUTE_NODE              = 2;
  public static final short TEXT_NODE                   = 3;
  public static final short CDATA_SECTION_NODE          = 4;
  public static final short ENTITY_REFERENCE_NODE       = 5;
  public static final short ENTITY_NODE                 = 6;
  public static final short PROCESSING_INSTRUCTION_NODE = 7;
  public static final short COMMENT_NODE                = 8;
  public static final short DOCUMENT_NODE               = 9;
  public static final short DOCUMENT_TYPE_NODE          = 10;
  public static final short DOCUMENT_FRAGMENT_NODE      = 11;
  public static final short NOTATION_NODE               = 12;

  public String       getNodeName();
  public String       getNodeValue() throws DOMException;
  public void         setNodeValue(String nodeValue) throws DOMException;
  public short        getNodeType();
  public Node         getParentNode();
  public NodeList     getChildNodes();
  public Node         getFirstChild();
  public Node         getLastChild();
  public Node         getPreviousSibling();
  public Node         getNextSibling();
  public NamedNodeMap getAttributes();
  public Document     getOwnerDocument();
  public Node         insertBefore(Node newChild, Node refChild) throws DOMException;
  public Node         replaceChild(Node newChild, Node oldChild) throws DOMException;
  public Node         removeChild(Node oldChild) throws DOMException;
  public Node         appendChild(Node newChild) throws DOMException;
  public boolean      hasChildNodes();
  public Node         cloneNode(boolean deep);
  public void         normalize();
  public boolean      supports(String feature, String version);
  public String       getNamespaceURI();
  public String       getPrefix();
  public void         setPrefix(String prefix) throws DOMException;
  public String       getLocalName();
  
}

The NodeList Interface

package org.w3c.dom;

public interface NodeList {
  public Node item(int index);
  public int  getLength();
}

Now we're really ready to read a document


Node Reporter

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class NodeReporter {

  public static void main(String[] args) {
     
    try {       
      DocumentBuilderFactory builderFactory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser 
       = builderFactory.newDocumentBuilder();
      NodeReporter iterator = new NodeReporter();
        
      for (int i = 0; i < args.length; i++) {
        try {
          // Read the entire document into memory
          Document doc = parser.parse(args[i]); 
          iterator.followNode(doc);
        }
        catch (SAXException ex) {
          System.err.println(args[i] + " is not well-formed."); 
        }
        catch (IOException ex) {
          System.err.println(ex); 
        }
      }
    }
    catch (ParserConfigurationException ex) {
      System.err.println("You need to install a JAXP aware parser.");
    }
  
  } // end main

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    System.out.println("Type " + type + ": " + name);
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: 
        return "Element";
      case Node.ATTRIBUTE_NODE: 
        return "Attribute";
      case Node.TEXT_NODE: 
        return "Text";
      case Node.CDATA_SECTION_NODE: 
        return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: 
        return "Entity Reference";
      case Node.ENTITY_NODE: 
        return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: 
        return "Processing Instruction";
      case Node.COMMENT_NODE : 
        return "Comment";
      case Node.DOCUMENT_NODE: 
        return "Document";
      case Node.DOCUMENT_TYPE_NODE: 
        return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: 
        return "Document Fragment";
      case Node.NOTATION_NODE: 
        return "Notation";
      default: 
        return "Unknown Type"; 
    }
    
  }

}

Node Reporter Output

% java NodeReporter hotcop.xml
Type Document: #document
Type Processing Instruction: xml-stylesheet
Type Document Type Declaration: SONG
Type Element: SONG
Type Text: #text
Type Element: TITLE
Type Text: #text
Type Text: #text
Type Element: PHOTO
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: PRODUCER
Type Text: #text
Type Text: #text
Type Comment: #comment
Type Text: #text
Type Element: PUBLISHER
Type Text: #text
Type Text: #text
Type Element: LENGTH
Type Text: #text
Type Text: #text
Type Element: YEAR
Type Text: #text
Type Text: #text
Type Element: ARTIST
Type Text: #text
Type Text: #text
Type Comment: #comment

Attributes are missing from this output. They are not nodes. They are properties of nodes.


Node Values as returned by getNodeValue()

Node TypeNode Value
element nodenull
attribute nodeattribute value
text nodetext of the node
CDATA section nodetext of the section
entity reference nodenull
entity nodenull
processing instruction nodecontent of the processing instruction, not including the target
comment nodetext of the comment
document nodenull
document type declaration nodenull
document fragment nodenull
notation nodenull

The Document Node


The Document Interface

package org.w3c.dom;

  public interface Document extends Node {
  
    public DocumentType      getDoctype();
    public DOMImplementation getImplementation();
    public Element           getDocumentElement();

    public NodeList        getElementsByTagName(String tagname);
    public NodeList        getElementsByTagNameNS(String namespaceURI, String localName);
    public Element         getElementById(String elementId);

    // Factory methods    
    public Element           createElement(String tagName) throws DOMException;
    public Element           createElementNS(String namespaceURI, String qualifiedName) throws DOMException;
    public DocumentFragment  createDocumentFragment();
    public Text              createTextNode(String data);
    public Comment           createComment(String data);
    public CDATASection      createCDATASection(String data) throws DOMException;
    public ProcessingInstruction createProcessingInstruction(String target, String data)
     throws DOMException;
    public Attr            createAttribute(String name) throws DOMException;
    public Attr            createAttributeNS(String namespaceURI, String qualifiedName) throws DOMException;
    public EntityReference createEntityReference(String name) throws DOMException;

    public Node            importNode(Node importedNode, boolean deep) throws DOMException;
    
}

A Sample Application

Full list

DOM Design


Weblogs with DOM

import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.util.*;
import java.net.*;


public class WeblogsDOM {

  public static String DEFAULT_URL
   = "http://static.userland.com/weblogMonitor/logs.xml";

  public static List listChannels() throws DOMException {
    return listChannels(DEFAULT_URL);
  }

  public static List listChannels(String uri) throws DOMException {

    if (uri == null) {
      throw new NullPointerException("URL must be non-null");
    }

    org.apache.xerces.parsers.DOMParser parser
     = new org.apache.xerces.parsers.DOMParser();

    Vector urls = null;

    try {
      // Read the entire document into memory
      parser.parse(uri);
      Document doc = parser.getDocument();
      NodeList logs = doc.getElementsByTagName("url");

      urls = new Vector(logs.getLength());

      for (int i = 0; i < logs.getLength(); i++) {
        try {
          Node element = logs.item(i);
          Node text = element.getFirstChild();
          String content = text.getNodeValue();
          URL u = new URL(content);
          urls.addElement(u);
        }
        catch (MalformedURLException e) {
          // bad input data from one third party; just ignore it
        }
      }
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (IOException e) {
      System.err.println(e);
    }

    return urls;

  }

  public static void main(String[] args) {

    try {
      List urls;
      if (args.length > 0) {
        try {
          URL url = new URL(args[0]);
          urls = listChannels(args[0]);
        }
        catch (MalformedURLException e) {
          System.err.println("Usage: java WeblogsDOM url");
          return;
        }
      }
      else {
        urls = listChannels();
      }
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next());
      }
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace();
    }

  } // end main

}

Weblogs Output

% java WeblogsDOM
http://2020Hindsight.editthispage.com/
http://www.sff.net/people/mitchw/weblog/weblog.htp
http://nate.weblogs.com/
http://plugins.launchpoint.net
http://404.psistorm.net
http://home.att.net/~geek9000
http://daubnet.tzo.com/weblog
several hundred more...

Element Nodes


The Element Interface

package org.w3c.dom;

public interface Element extends Node {

  public String   getTagName();

  public NodeList getElementsByTagName(String name);
  public NodeList getElementsByTagNameNS(String namespaceURI, 
   String localName);

  public String   getAttribute(String name);
  public String   getAttributeNS(String namespaceURI, 
   String localName);
  public void     setAttribute(String name, String value) 
   throws DOMException;
  public void     setAttributeNS(String namespaceURI, 
   String qualifiedName, String value) throws DOMException;
  public void     removeAttribute(String name) throws DOMException;
  public void     removeAttributeNS(String namespaceURI, 
   String localName) throws DOMException;
  public Attr     getAttributeNode(String name);
  public Attr     getAttributeNodeNS(String namespaceURI, String localName);
  public Attr     setAttributeNode(Attr newAttr) throws DOMException;
  public Attr     setAttributeNodeNS(Attr newAttr) throws DOMException;
  public Attr     removeAttributeNode(Attr oldAttr) throws DOMException;

}

IDTagger

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import org.apache.xml.serialize.*;


public class IDTagger {

  int id = 1;

  public void processNode(Node node) {
    
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      
      Element element = (Element) node;
      String currentID = element.getAttribute("ID");
      if (currentID == null || currentID.equals("")) {
        element.setAttribute("ID", "_" + id);
        id = id + 1; 
      }
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public static void main(String[] args) {
     
    DOMParser parser  = new DOMParser();
    IDTagger iterator = new IDTagger();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);       
        
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from IDTagger

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?><!-- This should be a four digit year like "1999",
     not a two-digit year like "99" --><SONG xmlns="http://www.cafeconleche.org/namespace/song" ID="_1" xmlns:xlink="http://www.w3.org/1999/xlink">   <TITLE ID="_2">Hot Cop</TITLE>   <PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" ID="_3" WIDTH="100" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>   <COMPOSER ID="_4">Jacques Morali</COMPOSER>   <COMPOSER ID="_5">Henri Belolo</COMPOSER>   <COMPOSER ID="_6">Victor Willis</COMPOSER>   <PRODUCER ID="_7">Jacques Morali</PRODUCER>   <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->   <PUBLISHER ID="_8" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.amrecords.com/" xlink:type="simple">     A &amp; M Records   </PUBLISHER>   <LENGTH ID="_9">6:20</LENGTH>   <YEAR ID="_10">1978</YEAR>   <ARTIST ID="_11">Village People</ARTIST> </SONG><!-- You can tell what album I was 
     listening to when I wrote this example -->
View Output in Browser

CharacterData interface


The CharacterData Interface

package org.w3c.dom;

public interface CharacterData extends Node {

  public String getData() throws DOMException;
  public void   setData(String data) throws DOMException;
  public int    getLength();
  public String substringData(int offset, int count) 
   throws DOMException;
  public void   appendData(String arg) 
   throws DOMException;
  public void   insertData(int offset, String arg) 
   throws DOMException;
  public void   deleteData(int offset, int count) 
   throws DOMException;
  public void   replaceData(int offset, int count, String arg) 
   throws DOMException;
  
}

ROT13 XML Text

import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ROT13XML {

  public void processNode(Node node) {
    
    if (node.getNodeType() == Node.TEXT_NODE
     || node.getNodeType() == Node.COMMENT_NODE
     || node.getNodeType() == Node.CDATA_SECTION_NODE) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }
  
  public static String rot13(String s) {
    
    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') result.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') result.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') result.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') result.append((char) (c-13));
      else result.append((char) c);
      
    } 
    return result.toString();
    
  }

  public static void main(String[] args) {
     
    DOMParser parser   = new DOMParser();
    ROT13XML  iterator = new ROT13XML();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);
               
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

ROT13 XML Output

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
xmlns:xlink="http://www.w3.org/1999/xlink">   <TITLE>Ubg Pbc</TITLE>
<PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" WIDTH="100"
xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>
<COMPOSER>Wnpdhrf Zbenyv</COMPOSER>   <COMPOSER>Uraev Orybyb</COMPOSER>
<COMPOSER>Ivpgbe Jvyyvf</COMPOSER>   <PRODUCER>Wnpdhrf Zbenyv</PRODUCER>
<!-- Gur choyvfure vf npghnyyl Cbyltenz ohg V arrqrq         na rknzcyr
bs n trareny ragvgl ersrerapr. -->   <PUBLISHER
xlink:href="http://www.amrecords.com/" xlink:type="simple">     N &amp;
Z Erpbeqf   </PUBLISHER>   <LENGTH>6:20</LENGTH>   <YEAR>1978</YEAR>
<ARTIST>Ivyyntr Crbcyr</ARTIST> </SONG>
<!-- Lbh pna gryy jung nyohz V jnf 
     yvfgravat gb jura V jebgr guvf rknzcyr -->

Text Nodes


The Text Interface

package org.w3c.dom;

public interface Text extends CharacterData {

  public Text splitText(int offset) throws DOMException;
  
}

CDATA section Nodes


The CDATASection Interface

package org.w3c.dom;

public interface CDATASection extends Text {
}

DocumentType Nodes


The DocumentType Interface

package org.w3c.dom;

public interface DocumentType extends Node {

  public String       getName();
  public NamedNodeMap getEntities();
  public NamedNodeMap getNotations();
  public String       getPublicId();
  public String       getSystemId();
  public String       getInternalSubset();
  
}

Example of the DocumentType Interface


XHTMLValidator

import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.*;
import org.xml.sax.*;


public class XHTMLValidator {

  public static void main(String[] args) {
    
    if (args.length == 0) {
       System.err.println("Usage: java XHTMLValidator URL");
       return;   
    }

    try {
      DocumentBuilderFactory builderFactory 
       = DocumentBuilderFactory.newInstance();
      builderFactory.setNamespaceAware(true);
      builderFactory.setValidating(true);
      DocumentBuilder parser 
       = builderFactory.newDocumentBuilder();
      parser.setErrorHandler(new ValidityErrorReporter());

      Document document;
      try {
        document = parser.parse(args[0]); 
        // ValidityErrorReporter prints any validity errors detected
      }
      catch (SAXException e) {  
        System.out.println(args[0] + " is not valid."); 
        return; 
      }
      
      // If we get this far, then the document is valid XML.
      // Check to see whether the document is actually XHTML    
      DocumentType doctype = document.getDoctype();
  
      if (doctype == null) {
        System.out.println("No DOCTYPE"); 
        return;
      }
  
      String name     = doctype.getName();
      String systemID = doctype.getSystemId();
      String publicID = doctype.getPublicId();
    
      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name); 
      }
  
      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
           && !publicID.equals(
                "-//W3C//DTD XHTML 1.0 Transitional//EN")
           && !publicID.equals(
                "-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(args[0] 
         + " does not seem to use an XHTML 1.0 DTD");
      }
  
      // Check the namespace on the root element
      Element root = document.getDocumentElement();
      String xmlnsValue = root.getAttribute("xmlns");
      if (!xmlnsValue.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(args[0] 
         + " does not properly declare the"
         + " http://www.w3.org/1999/xhtml"
         + " namespace on the root element");  
        return;      
      }
      
      System.out.println(args[0] + " is valid XHTML.");
      
    }
    catch (IOException e) {
      System.err.println("Could not read " + args[0]);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }
    
  }

}

Attr Nodes


The Attr Interface

package org.w3c.dom;

public interface Attr extends Node {

  public String   getName();
  public boolean  getSpecified();
  public String   getValue();
  public void     setValue(String value) throws DOMException;
  public Element  getOwnerElement();
  
}

XLinkSpider with DOM

import org.xml.sax.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;


public class DOMSpider {

  private static DocumentBuilder parser;
  
  // namespace support is turned off by default in JAXP
  static {
    try {
      DocumentBuilderFactory builderFactory 
       = DocumentBuilderFactory.newInstance();
      builderFactory.setNamespaceAware(true);
      parser = builderFactory.newDocumentBuilder();
    }
    catch (Exception ex) {
      throw new RuntimeException("Couldn't build a parser!");
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemId);
    
        Vector uris = new Vector();
        // search the document for uris, 
        // store them in vector, and print them
        searchForURIs(document.getDocumentElement(), uris); 
    
        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          listURIs(uri); 
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end DOMSpider

ProcessingInstruction Nodes


The ProcessingInstruction Interface

package org.w3c.dom;

public interface ProcessingInstruction extends Node {

  public String getTarget();
  public String getData();
  public void   setData(String data) throws DOMException;
  
}

XLinkSpider that Respects robots processing instruction

import org.xml.sax.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;


public class PoliteDOMSpider {

  private static DocumentBuilder parser;
  
  // namespace support is turned off by default in JAXP
  static {
    try {
      DocumentBuilderFactory builderFactory 
       = DocumentBuilderFactory.newInstance();
      builderFactory.setNamespaceAware(true);
      parser = builderFactory.newDocumentBuilder();
    }
    catch (Exception ex) {
      throw new RuntimeException("Couldn't build a parser!");
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 

  public static boolean robotsAllowed(Document document) {
    
    NodeList children = document.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof ProcessingInstruction) {
        ProcessingInstruction pi = (ProcessingInstruction) n;
        if (pi.getTarget().equals("robots")) {
          String data = pi.getData();
          if (data.indexOf("follow=\"no\"") >= 0) {
            return false; 
          } 
        }
      }
    }
    
    return true;
    
  }
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemId);
    
        if (robotsAllowed(document)) {
          Vector uris = new Vector();
          // search the document for uris,
          // store them in vector, print them
          searchForURIs(document.getDocumentElement(), uris);
    
          Enumeration e = uris.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.addElement(uri);
            listURIs(uri); 
          }
          
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java PoliteDOMSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end PoliteDOMSpider

Comment Nodes


The Comment Interface

package org.w3c.dom;

public interface Comment extends CharacterData {
}

Comment Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class DOMCommentReader {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        processNode(d);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static void processNode(Node node) {
    
    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE) {
      System.out.println(node.getNodeValue());
      System.out.println();
    }
    else {
      if (node.hasChildNodes()) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          processNode(children.item(i));
        } 
      }
    }
    
  }

}

DOMCommentReader Output

% java DOMCommentReader hotcop.xml
 The publisher is actually Polygram but I needed
       an example of a general entity reference.

 You can tell what album I was
     listening to when I wrote this example

Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output


DOMException


The org.w3c.dom.traversal Package

Four interfaces:


NodeIterator

package org.w3c.dom.traversal;

public interface NodeIterator {

  public int        getWhatToShow();
  public NodeFilter getFilter();
  public boolean    getExpandEntityReferences();
  public Node       nextNode() throws DOMException;
  public Node       previousNode() throws DOMException;
  public void       detach();
    
}

ValueReporter

import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.*;
import java.io.*;


public class ValueReporter {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(//filter
         doc.getDocumentElement(), NodeFilter.SHOW_ALL, null
        );
        Node node;
        while ((node = iterator.nextNode()) != null) {
          processNode(node);      
        }
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  public static void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    String value = node.getNodeValue();
    System.out.println("Type " + type + ": " + name 
     + " \"" + value + "\"");
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: 
        return "Element";
      case Node.ATTRIBUTE_NODE: 
        return "Attribute";
      case Node.TEXT_NODE: 
        return "Text";
      case Node.CDATA_SECTION_NODE: 
        return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: 
        return "Entity Reference";
      case Node.ENTITY_NODE: 
        return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: 
        return "Processing Instruction";
      case Node.COMMENT_NODE: 
        return "Comment";
      case Node.DOCUMENT_NODE: 
        return "Document";
      case Node.DOCUMENT_TYPE_NODE: 
        return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: 
        return "Document Fragment";
      case Node.NOTATION_NODE: 
        return "Notation";
      default: 
        return "Unknown Type"; 
    }
    
  }

}

ValueReporter Output

% java ValueReporter hotcop.xml
Type Element: SONG "null"
Type Text: #text "
  "
Type Element: TITLE "null"
Type Text: #text "Hot Cop"
Type Text: #text "
  "
Type Element: PHOTO "null"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Jacques Morali"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Henri Belolo"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Victor Willis"
Type Text: #text "
  "
Type Element: PRODUCER "null"
Type Text: #text "Jacques Morali"
Type Text: #text "
  "
Type Comment: #comment " The publisher is actually Polygram but I needed
       an example of a general entity reference. "
Type Text: #text "
  "
Type Element: PUBLISHER "null"
Type Text: #text "
    A & M Records
  "
Type Text: #text "
  "
Type Element: LENGTH "null"
Type Text: #text "6:20"
Type Text: #text "
  "
Type Element: YEAR "null"
Type Text: #text "1978"
Type Text: #text "
  "
Type Element: ARTIST "null"
Type Text: #text "Village People"
Type Text: #text "
"

Attributes are missing from this output. They are not children. They are properties of nodes.


NodeFilter

package org.w3c.dom.traversal;

public interface NodeFilter {

  // Constants returned by acceptNode
  public static final short FILTER_ACCEPT             = 1;
  public static final short FILTER_REJECT             = 2;
  public static final short FILTER_SKIP               = 3;

  // Constants for whatToShow
  public static final int   SHOW_ALL                  = 0x0000FFFF;
  public static final int   SHOW_ELEMENT              = 0x00000001;
  public static final int   SHOW_ATTRIBUTE            = 0x00000002;
  public static final int   SHOW_TEXT                 = 0x00000004;
  public static final int   SHOW_CDATA_SECTION        = 0x00000008;
  public static final int   SHOW_ENTITY_REFERENCE     = 0x00000010;
  public static final int   SHOW_ENTITY               = 0x00000020;
  public static final int   SHOW_PROCESSING_INSTRUCTION = 0x00000040;
  public static final int   SHOW_COMMENT              = 0x00000080;
  public static final int   SHOW_DOCUMENT             = 0x00000100;
  public static final int   SHOW_DOCUMENT_TYPE        = 0x00000200;
  public static final int   SHOW_DOCUMENT_FRAGMENT    = 0x00000400;
  public static final int   SHOW_NOTATION             = 0x00000800;

  public short        acceptNode(Node n);
    
}

DOM based TagStripper

import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class DOMTagStripper {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(
         doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true
        );
        Node node;
        while ((node = iterator.nextNode()) != null) {
          System.out.print(node.getNodeValue());      
        }
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based TagStripper

% java DOMTagStripper hotcop.xml

  Hot Cop
  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali

  A & M Records
  6:20
  1978
  Village People

TreeWalker

package org.w3c.dom.traversal;

public interface TreeWalker {

  public int        getWhatToShow();
  public NodeFilter getFilter();
  public boolean    getExpandEntityReferences();
  
  public Node getCurrentNode();
  public void setCurrentNode(Node currentNode) throws DOMException;
   
  public Node parentNode();
  public Node firstChild();
  public Node lastChild();
  public Node previousSibling();
  public Node nextSibling();
  public Node previousNode();
  public Node nextNode();
  
}


Writing XML Documents with DOM


The DOMImplementation interface

package org.w3c.dom;

public interface DOMImplementation {

  public boolean hasFeature(String feature, String version) 
  
  public DocumentType createDocumentType(String qualifiedName, 
   String publicID, String systemID, String internalSubset)
                                          
  public Document createDocument(String namespaceURI, 
   String qualifiedName, DocumentType doctype)
   throws DOMException

} 

org.apache.xerces.dom.DOMImplementationImpl

package org.apache.xerces.dom;

public class DOMImplementationImpl implements DOMImplementation {

  public boolean hasFeature(String feature, String version) 
  
  public static DOMImplementation getDOMImplementation()
  
  public DocumentType createDocumentType(String qualifiedName, 
   String publicID, String systemID, String internalSubset)
                                          
  public Document createDocument(String namespaceURI, 
   String qualifiedName, DocumentType doctype)
   throws DOMException

} 

A Xerces/DOM program that writes Fibonacci numbers into an XML document

import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;


public class FibonacciDOM {

  public static void main(String[] args) {

    try {

      DOMImplementation impl 
       = DOMImplementationImpl.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);

      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 1; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Now the document has been created and exists in memory
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

A JAXP/DOM program that writes Fibonacci numbers into an XML document

import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;


public class FibonacciJAXP {

  public static void main(String[] args) {

    try {       
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);

      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 1; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Now the document has been created and exists in memory
    }
    catch (DOMException e) {
      e.printStackTrace();
    }
    catch (ParserConfigurationException e) {
      System.err.println("You need to install a JAXP aware DOM implementation.");
    }
    
  }

}

Serialization


A DOM program that writes Fibonacci numbers onto System.out

import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class FibonacciDOMSerializer {

  public static void main(String[] args) {
   
    try {
      
      DOMImplementation impl 
       = DOMImplementationImpl.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);
      
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      Element root = fibonacci.getDocumentElement(); 

      for (int i = 1; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      
      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(fibonacci);
      }
      catch (IOException e) {
        System.err.println(e); 
      }
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

fibonacci.xml

<?xml version="1.0" encoding="UTF-8"?>
<Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci></Fibonacci_Numbers>

OutputFormat

package org.apache.xml.serialize;

public class OutputFormat extends Object {

  public OutputFormat()
  public OutputFormat(String method, 
   String encoding, boolean indenting)
  public OutputFormat(Document doc)
  public OutputFormat(Document doc, 
   String encoding, boolean indenting)
  
  public String   getMethod()
  public void     setMethod(String method)
  public String   getVersion()
  public void     setVersion(String version)
  public int      getIndent()
  public boolean  getIndenting()
  public void     setIndent(int indent)
  public void     setIndenting(boolean on)
  public String   getEncoding()
  public void     setEncoding(String encoding)
  public String   getMediaType()
  public void     setMediaType(String mediaType)
  public void     setDoctype(String publicID, String systemID)
  public String   getDoctypePublic()
  public String   getDoctypeSystem()
  public boolean  getOmitXMLDeclaration()
  public void     setOmitXMLDeclaration(boolean omit)
  public boolean  getStandalone()
  public void     setStandalone(boolean standalone)
  public String[] getCDataElements()
  public boolean  isCDataElement(String tagName)
  public void     setCDataElements(String[] cdataElements)
  public String[] getNonEscapingElements()
  public boolean  isNonEscapingElement(String tagName)
  public void     setNonEscapingElements(String[] nonEscapingElements)
  public String   getLineSeparator()
  public void     setLineSeparator(String lineSeparator)
  public boolean  getPreserveSpace()
  public void     setPreserveSpace(boolean preserve)
  public int      getLineWidth()
  public void     setLineWidth(int lineWidth)
  public char     getLastPrintable()
  
  public static String whichMethod(Document doc)
  public static String whichDoctypePublic(Document doc)
  public static String whichDoctypeSystem(Document doc)
  public static String whichMediaType(String method)
  
}

Better formatted output

 try {
  // Now that the document is created we need to *serialize* it
  OutputFormat format = new OutputFormat(fibonacci, "8859_1", true);
  format.setLineSeparator("\r\n");
  format.setLineWidth(72);
  format.setDoctype(null, "fibonacci.dtd");
  XMLSerializer serializer = new XMLSerializer(System.out, format);
  serializer.serialize(root);
}
catch (IOException e) {
  System.err.println(e); 
}

formatted_fibonacci.xml

<?xml version="1.0" encoding="8859_1"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers>
    <fibonacci index="0">0</fibonacci>
    <fibonacci index="1">1</fibonacci>
    <fibonacci index="2">1</fibonacci>
    <fibonacci index="3">2</fibonacci>
    <fibonacci index="4">3</fibonacci>
    <fibonacci index="5">5</fibonacci>
    <fibonacci index="6">8</fibonacci>
    <fibonacci index="7">13</fibonacci>
    <fibonacci index="8">21</fibonacci>
    <fibonacci index="9">34</fibonacci>
    <fibonacci index="10">55</fibonacci>
    <fibonacci index="11">89</fibonacci>
    <fibonacci index="12">144</fibonacci>
    <fibonacci index="13">233</fibonacci>
    <fibonacci index="14">377</fibonacci>
    <fibonacci index="15">610</fibonacci>
    <fibonacci index="16">987</fibonacci>
    <fibonacci index="17">1597</fibonacci>
    <fibonacci index="18">2584</fibonacci>
    <fibonacci index="19">4181</fibonacci>
    <fibonacci index="20">6765</fibonacci>
    <fibonacci index="21">10946</fibonacci>
    <fibonacci index="22">17711</fibonacci>
    <fibonacci index="23">28657</fibonacci>
    <fibonacci index="24">46368</fibonacci>
    <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>

DOM based XMLPrettyPrinter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class DOMPrettyPrinter {

  public static void main(String[] args) { 
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        
        OutputFormat format 
         = new OutputFormat(document, "UTF-8", true);
        format.setLineSeparator("\r\n");
        format.setIndenting(true);
        format.setIndent(2);
        format.setLineWidth(72);
        format.setPreserveSpace(false);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);     
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based XMLPrettyPrinter

<?xml version="1.0" encoding="UTF-8"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
  <log>
    <name>MozillaZine</name>
    <url>http://www.mozillazine.org</url>
    <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
    <ownerName>Jason Kersey</ownerName>
    <ownerEmail>kerz@en.com</ownerEmail>
    <description>THE source for news on the Mozilla Organization.
      DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
    <imageUrl/>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
  </log>
  <log>
    <name>SalonHerringWiredFool</name>
    <url>http://www.salonherringwiredfool.com/</url>
    <ownerName>Some Random Herring</ownerName>
    <ownerEmail>salonfool@wiredherring.com</ownerEmail>
    <description/>
  </log>
  <log>
    <name>Scripting News</name>
    <url>http://www.scripting.com/</url>
    <ownerName>Dave Winer</ownerName>
    <ownerEmail>dave@userland.com</ownerEmail>
    <description>News and commentary from the cross-platform scripting community.</description>
    <imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
  </log>
  <log>
    <name>SlashDot.Org</name>
    <url>http://www.slashdot.org/</url>
    <ownerName>Simply a friend</ownerName>
    <ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
    <description>News for Nerds, Stuff that Matters.</description>
  </log>
</weblogs>

The point is this:


To Learn More



Index | Cafe con Leche

Copyright 2000-2004 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified February 23, 2004