Chapter 11. The Document Object Model Core

Table of Contents

The Element Interface

Extracting Elements
Attributes

The NamedNodeMap Interface

The CharacterData Interface

The Text Interface

The CDATASection Interface

The EntityReference Interface

The Attr Interface

The ProcessingInstruction Interface

The Comment Interface

The DocumentType Interface

The Entity Interface

The Notation Interface

Summary

The last two chapters have considered a DOM document to be mostly a tree of nodes. That is, it is composed of instances of the Node interface; and indeed for many purposes this is all you need to know. However, not all nodes are the same. Elements have properties that attributes don’t have. Attributes have properties that processing instructions don’t have. Processing instructions have properties comments don’t have, and so forth. In this chapter, we look at the unique properties and methods of the individual interfaces that make up an XML document.

The Element Interface

The Element interface is perhaps the most important of all the DOM component interfaces. After all, XML documents can be written without any comments, processing instructions, attributes, CDATA sections, entity references, or even text nodes. However, every XML document has at least one element and most have many more. Elements, more than any other component, define the structure of an XML document.

Example 11.1 summarizes the Element interface. This interface includes methods to get the prefixed name of the element, manipulate the attributes on the element, and select from the element’s descendants. Of course, Element objects also have all the methods of the Node super-interface such as appendChild() and getNamespaceURI().

Example 11.1. The Element interface

package org.w3c.dom;

public interface Element extends Node {

  public String  getTagName();
  
  public boolean hasAttribute(String name);
  public boolean hasAttributeNS(String namespaceURI, 
   String localName);
  public String  getAttribute(String name);
  public void    setAttribute(String name, String value)
   throws DOMException;
  public void    removeAttribute(String name) 
   throws DOMException;
  public Attr    getAttributeNode(String name);
  public Attr    setAttributeNode(Attr newAttr) 
   throws DOMException;
  public Attr    removeAttributeNode(Attr oldAttr) 
   throws DOMException;
  public String  getAttributeNS(String namespaceURI, 
   String localName);
  public void    setAttributeNS(String namespaceURI, 
   String qualifiedName, String value) throws DOMException;
  public void    removeAttributeNS(String namespaceURI, 
   String localName) throws DOMException;
  public Attr    getAttributeNodeNS(String namespaceURI, 
   String localName);
  public Attr    setAttributeNodeNS(Attr newAttr) 
   throws DOMException;
   
  public NodeList getElementsByTagName(String name);
  public NodeList getElementsByTagNameNS(String namespaceURI, 
   String localName);
 
}
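
As a quick illustration of how the one naming method declared in Element sits alongside the naming methods inherited from Node, here is a minimal sketch. The namespace URI and the x:item element name are made up for this illustration; they are not part of any example in this book.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ElementNamesDemo {

  // Hypothetical namespace URI used only for this illustration
  static final String NS = "http://www.example.com/ns";

  // Creates a detached element named x:item in the example namespace
  static Element createItem() {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().newDocument();
      return doc.createElementNS(NS, "x:item");
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element item = createItem();
    System.out.println(item.getTagName());      // declared in Element
    System.out.println(item.getLocalName());    // inherited from Node
    System.out.println(item.getPrefix());       // inherited from Node
    System.out.println(item.getNamespaceURI()); // inherited from Node
  }

}
```

getTagName() returns the prefixed name, x:item here, while getLocalName() returns only the part after the colon.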

The aesthetics of this interface are seriously marred by DOM’s requirement to avoid method overloading. The differences in the argument lists are redundantly repeated in the method names. For instance, if DOM had been written in pure Java, there’d probably be three setAttribute() methods with these signatures:

public void setAttribute(String name, String value)
    throws DOMException;

public void setAttribute(String namespaceURI, String localName, String value)
    throws DOMException;

public void setAttribute(Attr attribute)
    throws DOMException;

Instead, Element has these four methods with slightly varying names:

public void setAttribute(String name, String value)
    throws DOMException;

public void setAttributeNS(String namespaceURI, String qualifiedName, String value)
    throws DOMException;

public Attr setAttributeNode(Attr attribute)
    throws DOMException;

public Attr setAttributeNodeNS(Attr attribute)
    throws DOMException;

The distinction between setAttributeNode() and setAttributeNodeNS() is unnecessary. setAttributeNode() can only be used for attributes in no namespace, whereas setAttributeNodeNS() can only be used with attributes in a namespace. The only motivation I can imagine for this is symmetry with the getter methods, where the distinction is relevant because the argument lists are different. For the setter methods though, this is frankly a mistake. Attr objects include their own namespace information. There’s no need for separate methods to set nodes with and without namespaces.
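
The following sketch shows the two setters side by side. It is only an illustration: the namespace URI, the element name item, and the attribute names id and x:role are all invented here. Notice that the Attr created with createAttributeNS() already knows its own namespace URI before it is attached to the element.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttrNodeDemo {

  // Hypothetical namespace URI used only for this illustration
  static final String NS = "http://www.example.com/ns";

  // Builds <item id="a1" x:role="demo"/> using Attr nodes
  static Element buildElement() {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().newDocument();
      Element item = doc.createElementNS(null, "item");

      // An attribute in no namespace
      Attr plain = doc.createAttribute("id");
      plain.setValue("a1");
      item.setAttributeNode(plain);

      // An attribute in a namespace; the Attr itself records the URI
      Attr qualified = doc.createAttributeNS(NS, "x:role");
      qualified.setValue("demo");
      item.setAttributeNodeNS(qualified);

      return item;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element item = buildElement();
    System.out.println(item.getAttribute("id"));
    System.out.println(item.getAttributeNS(NS, "role"));
    System.out.println(
     item.getAttributeNodeNS(NS, "role").getNamespaceURI());
  }

}
```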

Extracting Elements

The getElementsByTagName() and getElementsByTagNameNS() methods behave the same as the similarly named methods in Document that you encountered in the last chapter. The only difference is that they search through a single element instead of the entire document. These methods return a NodeList containing all the elements with the specified name.
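
To make the difference in scope concrete, here is a small sketch that builds a toy document in memory and runs the same query from the Document and from a single Element. The book, chapter, and example element names mirror the structure used later in this section, but the document itself is invented for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ScopeDemo {

  // Builds a book with two chapters: the first holds two
  // example elements and the second holds one
  static Document buildBook() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element book = doc.createElement("book");
      doc.appendChild(book);

      Element first = doc.createElement("chapter");
      first.appendChild(doc.createElement("example"));
      first.appendChild(doc.createElement("example"));
      book.appendChild(first);

      Element second = doc.createElement("chapter");
      second.appendChild(doc.createElement("example"));
      book.appendChild(second);

      return doc;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = buildBook();
    Element firstChapter
     = (Element) doc.getElementsByTagName("chapter").item(0);

    // Searches the whole document
    System.out.println(
     doc.getElementsByTagName("example").getLength());
    // Searches only the descendants of the first chapter
    System.out.println(
     firstChapter.getElementsByTagName("example").getLength());
  }

}
```

The first call counts all three example elements. The second counts only the two inside the first chapter.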

An asterisk (*) can be passed as either argument to indicate that all names or namespaces are desired. This is particularly useful for the local name passed to getElementsByTagNameNS(). For example, this NodeList would contain all RDF elements that are descendants of element:

NodeList rdfs = element.getElementsByTagNameNS(
 "http://www.w3.org/1999/02/22-rdf-syntax-ns#", "*");

The list returned is sorted in document order. In other words, elements are arranged in order of the appearance of their start-tags. If the start-tag for element A appears earlier in the document than the start-tag for element B, then element A comes before element B in the list.
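
Here is a minimal sketch of both points: the asterisk wildcard and document order. The namespace URI and the element names a, b, c, and d are assumptions made up for this illustration. The element c is nested inside b, and the unqualified element d follows b, so the start-tags appear in the order b, c, d.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocumentOrderDemo {

  // Hypothetical namespace URI used only for this illustration
  static final String NS = "http://www.example.com/ns";

  // Builds <x:a><x:b><x:c/></x:b><d/></x:a>; d is in no namespace
  static Element buildTree() {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().newDocument();
      Element a = doc.createElementNS(NS, "x:a");
      Element b = doc.createElementNS(NS, "x:b");
      Element c = doc.createElementNS(NS, "x:c");
      Element d = doc.createElementNS(null, "d");
      doc.appendChild(a);
      a.appendChild(b);
      b.appendChild(c);
      a.appendChild(d);
      return a;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Joins the local names of the listed elements with spaces
  static String localNames(NodeList list) {
    StringBuilder names = new StringBuilder();
    for (int i = 0; i < list.getLength(); i++) {
      if (i > 0) names.append(' ');
      names.append(list.item(i).getLocalName());
    }
    return names.toString();
  }

  public static void main(String[] args) {
    Element root = buildTree();
    // Every descendant in the example namespace
    System.out.println(
     localNames(root.getElementsByTagNameNS(NS, "*")));
    // Every descendant, whatever its namespace
    System.out.println(
     localNames(root.getElementsByTagNameNS("*", "*")));
  }

}
```

The root element itself never appears in either list, because these methods return only descendants.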

The next example was inspired by the source code for this very book. Before publishing, I had to extract all the code examples from the source text and put them in separate files. The examples from each chapter go into separate directories. That is, the examples from Chapter 1 go into examples/1; the examples from Chapter 2 go into examples/2; and so forth. XSLT 1.0 isn’t quite up to this task, but DOM and Java are more powerful.^[1]

The source code for this book is structured like this:

<book>
  …
  <chapter>
    …
    <example id="filename.java">
      <title>Some Java Program</title>
      <programlisting>import javax.xml.parsers.*;
        // more Java code…      
      </programlisting>
    </example>  
    …
    <example id="filename.xml">
      <title>Some XML document</title>
      <programlisting><![CDATA[<?xml version="1.0"?>
<root>
  …     
</root>]]></programlisting>
    </example>  
    …
  </chapter>
  more chapters…
</book>

At least, that’s the part that’s relevant to this example. The advantage to getElementsByTagName() and getElementsByTagNameNS() is that a program can extract just the parts that interest it very straightforwardly without explicitly walking the entire tree.^[2] These methods effectively flatten the hierarchy to just the elements of interest. In this case those elements are chapter and example. Inside each example, the complete structure is somewhat more relevant so the normal tree-walking methods of Node are indicated.

The program follows these steps:

1. Parse the entire book into a Document object.
2. Use Document’s getElementsByTagName() method to retrieve a list of all the chapter elements in the document. (DocBook doesn’t use namespaces so getElementsByTagName() is chosen over getElementsByTagNameNS().)
3. For each element in that list, use Element’s getElementsByTagName() method to retrieve a list of all the example elements in that chapter.
4. From each element in that list, extract its programlisting child element.
5. Write the text content of that programlisting element into a new file named by the ID of the example element.

This example is quite specific to one XML application, DocBook. Indeed it won’t even work with all DocBook documents because it relies on various private conventions of this particular DocBook document, especially that the id attribute of each example element contains a file name. However, that’s OK. Most programs you write will be designed to process only certain XML documents in certain situations.

To increase robustness, I do require that the DocBook document be valid, and the parser does validate the document. If validation fails, this program aborts without extracting the examples, since it can’t be sure whether the document meets the preconditions. Example 11.2 demonstrates.

Example 11.2. Extracting examples from DocBook

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class ExampleExtractor {
  
  public static void extract(Document doc) throws IOException {

    NodeList chapters = doc.getElementsByTagName("chapter");
    
    for (int i = 0; i < chapters.getLength(); i++) {
      
      Element chapter = (Element) chapters.item(i);
      NodeList examples = chapter.getElementsByTagName("example");
      
      for (int j = 0; j < examples.getLength(); j++) {
        
        Element example = (Element) examples.item(j);
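        String fileName = example.getAttribute("id");
        // All examples should have id attributes but it's safer
        // not to assume that. getAttribute() returns the empty
        // string, not null, when the attribute is missing.
        if (fileName == null || fileName.length() == 0) {
          throw 
           new IllegalArgumentException("Missing id on example"); 
        }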
        NodeList programlistings 
         = example.getElementsByTagName("programlisting");
        // Each example is supposed to contain exactly one 
        // programlisting, but we should verify that
        if (programlistings.getLength() != 1) {
          throw new IllegalArgumentException(
           "Expected exactly one programlisting in " + fileName); 
        }
        Element programlisting = (Element) programlistings.item(0);
        
        // Extract text content; this is a little tricky because
        // these often contain CDATA sections and entity
        // references which can be represented as separate nodes
        // so we can't just ask for the first text node child of
        // each program listing.
        String code = getText(programlisting);
        
        // write code into a file; chapters are numbered from 1
        // but the loop index starts at 0
        File dir = new File("examples/" + (i + 1));
        dir.mkdirs();
        File file = new File(dir, fileName);
        System.out.println(file);
        FileOutputStream fout = new FileOutputStream(file);
        Writer out = new OutputStreamWriter(fout, "UTF-8");
        // Buffering almost always helps performance a lot
        out = new BufferedWriter(out);
        out.write(code);
        // Be sure to remember to flush and close your streams
        out.flush();
        out.close();
        
      } // end examples loop
      
    } // end chapters loop

  }
  
  public static String getText(Node node) {
    
    // We need to retrieve the text from elements, entity
    // references, CDATA sections, and text nodes; but not
    // comments or processing instructions
    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE 
     || type == Node.PROCESSING_INSTRUCTION_NODE) {
       return "";
    } 
    
    StringBuffer text = new StringBuffer();

    String value = node.getNodeValue();
    if (value != null) text.append(value);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);  
        text.append(getText(child));
      }
    }
    
    return text.toString();
    
  }
  
  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java ExampleExtractor URL");
      return;
    }
    String url = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.setErrorHandler(new ValidityRequired());
      
      // Read the document
      Document document = parser.parse(url); 
      
      // Extract the examples
      extract(document);

    }
    catch (SAXException e) {
      System.out.println(e);
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not read " + url
      ); 
      System.out.println(e);
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    } 
     
  } // end main
  
}

// Make validity errors fatal
class ValidityRequired implements ErrorHandler {

  public void warning(SAXParseException e)
    throws SAXException {
    // ignore warnings  
  }
  
  public void error(SAXParseException e)
   throws SAXException {
    // Mostly validity errors. Make them fatal.
    throw e;
  }
  
  public void fatalError(SAXParseException e)
   throws SAXException {
    throw e;
  }
  
}

Since ExampleExtractor is fairly involved, I’ve factored it into several relatively independent pieces. The main() method builds the parser and parses the document as usual. The non-public class ValidityRequired converts validity errors into fatal errors by rethrowing the exception it’s been passed; warnings are ignored. Assuming validation succeeds, the document is then passed to the extract() method.

The extract() method iterates through all the chapters and examples in the book using getElementsByTagName(). Each example is supposed to have an id attribute and a single programlisting child element. However, since this is just a convention for this one document rather than a rule enforced by the DTD, the code checks to make sure that’s true. If it isn’t, it throws an IllegalArgumentException.

Next comes one of the trickiest parts of working with elements in DOM. I need to extract the text content of the programlisting element, which sounds simple enough. However, there’s no method in either Element or Node that performs this routine task. You might expect getNodeValue() to do this, especially if you’re used to XPath. However, in DOM, unlike XPath, the value of an element is null. Only its children have values. Thus we need to recursively descend through the children of the programlisting element, accumulating the values of all text nodes, entity references, CDATA sections, and other elements as we go. The getText() method accomplishes this.
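
The sketch below shows the null element value directly. The programlisting content is invented for this illustration. The last line calls getTextContent(), which was added to Node in DOM Level 3, after the DOM Level 2 interfaces described in this chapter; it is shown only for comparison and assumes a parser that implements DOM Level 3, such as the one bundled with Java 5 and later.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ElementValueDemo {

  // Builds a programlisting element containing a text node,
  // a CDATA section, and a comment
  static Element buildListing() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element listing = doc.createElement("programlisting");
      doc.appendChild(listing);
      listing.appendChild(doc.createTextNode("if (a "));
      listing.appendChild(doc.createCDATASection("< b) run();"));
      listing.appendChild(doc.createComment("not code"));
      return listing;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element listing = buildListing();
    // The element itself has no value
    System.out.println(listing.getNodeValue());
    // Its first child, a text node, does
    System.out.println(listing.getFirstChild().getNodeValue());
    // DOM Level 3 only: concatenated text of all descendants,
    // skipping comments and processing instructions
    System.out.println(listing.getTextContent());
  }

}
```

With a DOM Level 2 only implementation, a recursive method along the lines of getText() is still needed.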

Once we’ve got the actual example code from the programlisting element, it can be written into a file. The file location is relative to the current working directory and the chapter number. The file name has been read from the id attribute. UTF-8 works well as the default encoding.
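
On newer Java versions (7 and later), the file-writing step could be sketched with java.nio.file and try-with-resources, which closes the writer even if the write fails. This is an alternative sketch, not the code used in Example 11.2; the ExampleWriter class, the write() method, and its parameters are hypothetical names introduced here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExampleWriter {

  // Writes code to baseDir/<chapterNumber>/<fileName> as UTF-8,
  // creating the chapter directory if necessary
  static Path write(Path baseDir, int chapterNumber,
   String fileName, String code) {

    Path dir = baseDir.resolve(Integer.toString(chapterNumber));
    Path file = dir.resolve(fileName);
    try {
      Files.createDirectories(dir);
      // try-with-resources flushes and closes the buffered
      // writer even if write() throws
      try (Writer out
       = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
        out.write(code);
      }
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return file;

  }

  public static void main(String[] args) throws IOException {
    Path base = Files.createTempDirectory("examples");
    Path written
     = write(base, 11, "Hello.java", "class Hello {}\n");
    System.out.println(written);
  }

}
```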

Attributes

Although DOM has an Attr interface, the Element interface is the primary means of reading and writing attributes. Since each element can have no more than one attribute with the same name, attributes can be stored and retrieved just by their names. There’s no need to manage complex list structures, as there is with other kinds of nodes.

Here are a few tips that explain how the attribute methods work in DOM:

Most attributes are not in any namespace. In particular, unprefixed attributes are never in any namespace. For these attributes, just use the name and value.
For attributes that are in a namespace, specify the prefixed name and URI when setting them. Specify the local name and namespace URI when getting them.
Getting the value of an attribute that doesn’t exist returns the empty string.
Setting an attribute that already exists changes the value of the existing attribute.
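
These tips can be checked with a short sketch. The link element and the index and missing attribute names are invented for this illustration; the XLink namespace URI is the real one.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttributeTipsDemo {

  static final String XLINK = "http://www.w3.org/1999/xlink";

  // Builds a link element with one plain attribute, set twice,
  // and one attribute in the XLink namespace
  static Element buildLink() {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().newDocument();
      Element link = doc.createElementNS(null, "link");

      // No namespace: just a name and a value
      link.setAttribute("index", "1");
      // Setting it again replaces the value; no second attribute
      link.setAttribute("index", "2");

      // In a namespace: set with the URI and the prefixed name
      link.setAttributeNS(XLINK, "xlink:href",
       "http://www.example.com/");
      return link;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element link = buildLink();
    System.out.println(link.getAttribute("index"));
    // In a namespace: get with the URI and the local name
    System.out.println(link.getAttributeNS(XLINK, "href"));
    // A missing attribute yields the empty string, so use
    // hasAttribute() to tell missing from empty
    System.out.println(link.getAttribute("missing").length());
    System.out.println(link.hasAttribute("missing"));
  }

}
```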

With these few principles in mind, it’s straightforward to write programs that read and write attributes. I’ll demonstrate by revising last chapter’s Fibonacci program. That example just used elements. Now let’s add an index attribute to each fibonacci element as shown in Example 11.3:

Example 11.3. A document that uses attributes

<?xml version="1.0"?>
<Fibonacci_Numbers>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
</Fibonacci_Numbers>

This is really quite simple to implement. You just need to calculate a string name and value for the attribute and call setAttribute() in the right place. Example 11.4 demonstrates.

Example 11.4. A DOM program that adds attributes

import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.math.BigInteger;


public class FibonacciAttributeDOM {

  public static void main(String[] args) {

    try {

      // Find the implementation
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();
      
      // Create the document
      Document doc = impl.createDocument(null, 
       "Fibonacci_Numbers", null);
       
      // Fill the document
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = doc.getDocumentElement();

      // Indexes run from 1 to 10 to match Example 11.3
      for (int i = 1; i <= 10; i++) {
        Element number = doc.createElement("fibonacci");
        String value = Integer.toString(i);
        number.setAttribute("index", value);
        Text text = doc.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Serialize the document onto System.out
      TransformerFactory xformFactory 
       = TransformerFactory.newInstance();  
      Transformer idTransform = xformFactory.newTransformer();
      Source input = new DOMSource(doc);
      Result output = new StreamResult(System.out);
      idTransform.transform(input, output);
      
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a JAXP factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println(
        "Could not locate a JAXP DocumentBuilder class"
      ); 
    }
    catch (DOMException e) {
      System.err.println(e); 
    }
    catch (TransformerConfigurationException e) {
      System.err.println(e); 
    }
    catch (TransformerException e) {
      System.err.println(e); 
    }
    
  }

}

^[1]XSLT 2.0 could handle this, and many XSLT engines include extension functions that could pull this off in XSLT 1.0. However, I needed the example. :-)

^[2]A naive DOM implementation probably would implement getElementsByTagName() and getElementsByTagNameNS() by walking the tree or sub-tree, but more efficient implementations based on detailed knowledge of the data structures that implement the various interfaces also exist. For instance, a DOM that sits on top of a native XML database might have access to an index of all the elements in the document.

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified January 10, 2002