DOM Level 3

DOM Level 3 will finally add a standard Load and Save package so that it will be possible to write completely implementation-independent DOM programs. This package, org.w3c.dom.ls, is identified by the feature strings LS-Load and LS-Save. The loading part includes the DOMBuilder interface you've already encountered. The saving part is based on the DOMWriter interface. DOMWriter is more powerful than XMLSerializer. Whereas XMLSerializer is limited to outputting documents, document fragments, and elements, DOMWriter can output any kind of node at all. Furthermore, you can install a filter into a DOMWriter that controls its output.

Caution

This section is based on very early bleeding edge technology and specifications, particularly the July 25, 2002 Working Draft of the Document Object Model (DOM) Level 3 Abstract Schemas and Load and Save Specification and Xerces-J 2.0.2. Even with Xerces-J 2.1, most of the code in this section won’t even compile, much less run. Furthermore, it’s virtually guaranteed that the details in this section will change before DOM3 becomes a final recommendation.

As shown by the method signatures in Example 13.3, DOMWriter can copy a Node object from memory into serialized bytes or characters. It has methods to write XML nodes onto a Java OutputStream or into a String. The most common kind of node you’ll write is a Document, but you can write all the other kinds of nodes as well, such as Element, Attr, and Text. This interface also has methods to control exactly how the output is formatted and how errors are reported.

Example 13.3. The DOM3 DOMWriter interface

package org.w3c.dom.ls;

public interface DOMWriter {

  public void    setFeature(String name, boolean state)
   throws DOMException;
  public boolean canSetFeature(String name, boolean state);
  public boolean getFeature(String name) throws DOMException;

  public String  getEncoding();
  public void    setEncoding(String encoding);
  public String  getNewLine();
  public void    setNewLine(String newLine);
  
  public boolean writeNode(OutputStream out, Node node)
   throws Exception;
  public String writeToString(Node node) throws DOMException;
  
  public DOMErrorHandler getErrorHandler();
  public void setErrorHandler(DOMErrorHandler errorHandler);

  public DOMWriterFilter getFilter();
  public void setFilter(DOMWriterFilter filter);

}

Note

DOMWriter is not a java.io.Writer. In fact, it even prefers OutputStreams to Writers. The name is just a coincidence.

The primary purpose of this interface is to write nodes into strings or onto streams. These nodes can be complete documents or parts thereof like elements or text nodes. For example, this code fragment uses the DOMWriter object writer to copy the Document object doc onto System.out and copy its root element into a String:

try {
  DOMWriter writer;
  // initialize the DOMWriter...
  writer.writeNode(System.out, doc);
  String root = writer.writeToString(doc.getDocumentElement());
}
catch (Exception e) {
  System.err.println(e);
}
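Because the draft interfaces above do not compile against current parsers, here is a minimal sketch of the same two operations that should run on a stock JDK. It assumes the names the JDK's org.w3c.dom.ls package uses, which differ from this draft: LSSerializer stands in for DOMWriter, and the stream is wrapped in an LSOutput rather than passed directly. The class WriteNodeSketch and its helper methods are hypothetical, and the cast in loadSave() relies on the JDK's bundled DOM implementation.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public class WriteNodeSketch {

  static final String SVG = "http://www.w3.org/2000/svg";

  // Builds a one-circle SVG document so there is something to write.
  static Document sampleDocument() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().newDocument();
      Element root = doc.createElementNS(SVG, "svg");
      Element circle = doc.createElementNS(SVG, "circle");
      circle.setAttribute("r", "100");
      root.appendChild(circle);
      doc.appendChild(root);
      return doc;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Assumption: the node's DOMImplementation also implements
  // DOMImplementationLS, as the JDK's bundled one does.
  static DOMImplementationLS loadSave(Node node) {
    Document owner = (node instanceof Document)
     ? (Document) node : node.getOwnerDocument();
    return (DOMImplementationLS) owner.getImplementation();
  }

  // Counterpart of writeToString(): works for any node.
  static String nodeToString(Node node) {
    LSSerializer writer = loadSave(node).createLSSerializer();
    return writer.writeToString(node);
  }

  // Counterpart of writeNode(out, node): serializes onto a byte stream.
  static byte[] nodeToBytes(Node node) {
    DOMImplementationLS ls = loadSave(node);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    LSOutput output = ls.createLSOutput();
    output.setByteStream(bytes);
    ls.createLSSerializer().write(node, output);
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    Document doc = sampleDocument();
    System.out.println(nodeToString(doc.getDocumentElement()));
    System.out.println(nodeToBytes(doc).length + " bytes written");
  }

}
```

The string form is convenient for logging and tests; the byte form is the one to use when the encoding matters.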

DOMWriter also has several methods to configure the output. The setNewLine() method chooses the line separator used for output. The only legal values are a carriage return, a line feed, or both; that is, in Java parlance, "\r", "\n", or "\r\n". You can also set this to null to indicate you want the platform’s default value.

The setEncoding() method changes the character encoding used for the output. Which encodings any given serializer supports varies from implementation to implementation, but common values include UTF-8, UTF-16, and ISO-8859-1. UTF-8 is the default if a value is not supplied. For example, this code fragment sets up a writer to produce output for use on a Macintosh:

DOMWriter writer;
// initialize the DOMWriter...
writer.setNewLine("\r");
writer.setEncoding("MacRoman");

More detailed control of the output can be achieved by getting and setting features of the DOMWriter, as you’ll see shortly.

The setErrorHandler() method installs an org.w3c.dom.DOMErrorHandler object that receives notification of any problems that arise when outputting a node, such as an element that uses the same prefix for two different namespace URIs on two attributes. This is a callback interface, similar to org.xml.sax.ErrorHandler but even simpler since it doesn’t use different methods for different kinds of errors. Example 13.4 shows this interface. The handleError() method returns true if processing should continue after the error, false if it shouldn’t.

Example 13.4. The DOM3 DOMErrorHandler interface

package org.w3c.dom;

public interface DOMErrorHandler {

  public boolean handleError(DOMError error);

}
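To make the true/false contract of handleError() concrete, here is a small sketch of a handler that records every message, continues after warnings, and stops after anything more serious. It assumes the DOMError interface as found in the JDK's org.w3c.dom package, including its getSeverity() method and SEVERITY_WARNING constant; the class CollectingErrorHandler and the fakeError() helper are hypothetical and exist only so the example can run without a serializer. With the JDK's serializer, such a handler would be registered through the "error-handler" configuration parameter rather than a setErrorHandler() method.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.DOMError;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.DOMLocator;

public class CollectingErrorHandler implements DOMErrorHandler {

  private final List<String> messages = new ArrayList<String>();

  // Records every message; continues after warnings, stops otherwise.
  public boolean handleError(DOMError error) {
    messages.add(error.getMessage());
    return error.getSeverity() == DOMError.SEVERITY_WARNING;
  }

  public List<String> getMessages() {
    return messages;
  }

  // Hypothetical stand-in for an error a serializer might report.
  static DOMError fakeError(final short severity, final String message) {
    return new DOMError() {
      public short getSeverity() { return severity; }
      public String getMessage() { return message; }
      public String getType() { return "sketch"; }
      public Object getRelatedException() { return null; }
      public Object getRelatedData() { return null; }
      public DOMLocator getLocation() { return null; }
    };
  }

  public static void main(String[] args) {
    CollectingErrorHandler handler = new CollectingErrorHandler();
    boolean goOn = handler.handleError(
     fakeError(DOMError.SEVERITY_WARNING, "just a warning"));
    System.out.println("continue after warning: " + goOn);
    goOn = handler.handleError(
     fakeError(DOMError.SEVERITY_ERROR, "a real problem"));
    System.out.println("continue after error: " + goOn);
    System.out.println(handler.getMessages());
  }

}
```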

In Xerces-2, the XMLSerializer class implements the DOMWriter interface, so if you prefer you can use these methods instead of the ones discussed in the last section. Example 13.5 demonstrates a complete program that builds a simple SVG document in memory and writes it onto System.out using a \r\n line ending and the UTF-16 encoding. The error handler is set to an anonymous inner class that prints error messages on System.err and returns false to indicate that processing should stop when an error is detected.

Example 13.5. Serializing with DOMWriter

import org.w3c.dom.*;
import org.apache.xerces.dom3.*;
import org.apache.xerces.dom3.ls.DOMWriter;
import org.apache.xml.serialize.XMLSerializer;
import java.io.IOException;
import javax.xml.parsers.*;


public class SVGCircle {

  public static void main(String[] args) {
     
    try {
      // Find the implementation
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();
      
      // Create the document
      DocumentType svgDOCTYPE = impl.createDocumentType(
       "svg", "-//W3C//DTD SVG 1.0//EN", 
       "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd"
      );
      Document doc = impl.createDocument(
       "http://www.w3.org/2000/svg", "svg", svgDOCTYPE);
       
      // Fill the document
      Node rootElement = doc.getDocumentElement();
      Element circle = doc.createElementNS(
       "http://www.w3.org/2000/svg", "circle");
      circle.setAttribute("r", "100");
      rootElement.appendChild(circle);

      // Serialize the document onto System.out
      DOMWriter writer = new XMLSerializer();
      writer.setNewLine("\r\n");
      writer.setEncoding("UTF-16");
      writer.setErrorHandler(
        new DOMErrorHandler() {
          public boolean handleError(DOMError error) {
            System.err.println(error.getMessage());
            return false;
          }
        }
      );
      writer.writeNode(System.out, doc);
      
    }
    catch (Exception e) {
      System.err.println(e);
    }
  
  }

}

Note

Xerces-J 2.1 currently puts the DOMWriter interface in the org.apache.xerces.dom3.ls package instead of the org.w3c.dom.ls package. The Xerces team is trying to keep the experimental DOM3 classes separate from the main API until DOM3 is more stable.

Creating DOMWriters

Example 13.5 depends on Xerces-specific classes. It won’t work with GNU-JAXP or Oracle or other parsers, even after these parsers are upgraded to support DOM3. However, you can write the code in a much more parser-independent fashion by using the DOMImplementationLS interface, shown in Example 13.6, to create concrete implementations of DOMWriter, rather than constructing the implementation classes directly. DOMImplementationLS is a companion interface to DOMImplementation, implemented by the same object, that adds three methods to create new DOMBuilders, DOMWriters, and DOMInputSources.

Example 13.6. The DOM3 DOMImplementationLS interface

package org.w3c.dom.ls;

public interface DOMImplementationLS {

  public static final short MODE_SYNCHRONOUS  = 1;
  public static final short MODE_ASYNCHRONOUS = 2;

  public DOMWriter      createDOMWriter();
  public DOMInputSource createDOMInputSource();
  public DOMBuilder     createDOMBuilder(short mode, 
   String schemaType) throws DOMException;

}

You retrieve a concrete instance of this factory interface by using the DOM3 DOMImplementationRegistry factory class introduced in Chapter 10 to request a DOMImplementation object that supports the LS-Save feature. Then you cast that object to DOMImplementationLS. For example,

try {
  DOMImplementation impl = DOMImplementationRegistry
   .getDOMImplementation("Core 2.0 LS-Save 3.0");
  if (impl != null) {
      DOMImplementationLS implls = (DOMImplementationLS) impl;
      DOMWriter writer = implls.createDOMWriter();
      writer.writeNode(System.out, document);
  }
  else {
    System.out.println(
     "Could not find a DOM3 Save compliant parser.");
  }  
}
catch (Exception e) {
  System.err.println(e);   
}
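The same lookup-then-cast pattern can be tried today with a short sketch. It assumes the bootstrap class the JDK ships, org.w3c.dom.bootstrap.DOMImplementationRegistry, which is created with newInstance() instead of being called statically, and it assumes the feature name "LS" in place of the draft's LS-Save. The class RegistrySketch and the method findLoadSave() are hypothetical names.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class RegistrySketch {

  // Returns a Load and Save factory, or null if none can be found.
  static DOMImplementationLS findLoadSave() {
    try {
      DOMImplementationRegistry registry
       = DOMImplementationRegistry.newInstance();
      // Ask for an implementation that supports the "LS" feature.
      DOMImplementation impl = registry.getDOMImplementation("LS");
      if (impl == null) {
        return null;
      }
      return (DOMImplementationLS) impl;
    }
    catch (Exception e) {
      // newInstance() reports reflection problems as checked exceptions;
      // this sketch treats them the same as "not found".
      return null;
    }
  }

  public static void main(String[] args) {
    DOMImplementationLS implls = findLoadSave();
    if (implls == null) {
      System.out.println("Could not find a Load and Save implementation.");
      return;
    }
    LSSerializer writer = implls.createLSSerializer();
    System.out.println("Found serializer: " + writer.getClass().getName());
  }

}
```

Checking for null before the cast keeps the program from failing with a NullPointerException on a parser that lacks the feature.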

Using this technique, it’s straightforward to write a completely implementation-independent program to generate and serialize XML documents. Example 13.7 demonstrates. It uses the DOMImplementationRegistry class to load the DOMImplementationLS and the DOMWriter interface to output the final result. Otherwise, it just uses the standard DOM2 classes that you've seen in previous chapters.

Example 13.7. An implementation-independent DOM3 program to build and serialize an XML document

import org.w3c.dom.*;
import org.w3c.dom.ls.*;


public class SVGDOMCircle {

  public static void main(String[] args) {
     
    try {
      // Find the implementation
      DOMImplementation impl 
       = DOMImplementationRegistry.getDOMImplementation(
          "Core 2.0 LS-Load 3.0 LS-Save 3.0");
      if (impl == null) {
        System.out.println(
         "Could not find a DOM3 Load-Save compliant parser.");
        return;
      }
      
      // Create the document
      DocumentType svgDOCTYPE = impl.createDocumentType(
       "svg", "-//W3C//DTD SVG 1.0//EN", 
       "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd"
      );
      Document doc = impl.createDocument(
       "http://www.w3.org/2000/svg", "svg", svgDOCTYPE);
       
      // Fill the document
      Node rootElement = doc.getDocumentElement();
      Element circle = doc.createElementNS(
       "http://www.w3.org/2000/svg", "circle");
      circle.setAttribute("r", "100");
      rootElement.appendChild(circle);

      // Serialize the document onto System.out
      DOMImplementationLS implls = (DOMImplementationLS) impl;
      DOMWriter writer = implls.createDOMWriter();
      writer.writeNode(System.out, doc);
      
    }
    catch (Exception e) {
      System.err.println(e);
    }
  
  }

}

This program has to test for both the LS-Load and LS-Save features because it’s not absolutely guaranteed that an implementation that has one will have the other, especially in the early days of DOM3.

Serialization Features

The defaults used by the writeNode() and writeToString() methods are acceptable for most uses. However, occasionally you want a little more control over the serialized form. For instance, you might want the output to be pretty printed with extra white space added to indent the elements nicely. Or perhaps you want the output to be in canonical form. All of this and more can be controlled by setting features in the writer before invoking the write method.

Defined features include:

normalize-characters, optional, default true

If true, output text should be normalized according to the W3C Character Model. For example, the word café would be represented as the four character string c a f é rather than the five character string c a f e combining_acute_accent. Implementations are only required to support a false value for this feature.

split-cdata-sections, required, default true

If true, CDATA sections containing the CDATA section end delimiter ]]> are split into pieces and the ]]> included in a raw text node. If false, such a CDATA section is not split. Instead an error is reported and output stops.

entities, required, default true

If true, entity references like &copy; are included in the output. If false, they are not; their replacement text is included instead.

whitespace-in-element-content, optional, default true

If true, all white space is output. If false, text nodes containing only white space are deleted if the parent element’s declaration from the DTD/schema does not allow #PCDATA to appear at that point.

discard-default-content, required, default true

If true, the implementation will attempt to leave out any nodes whose presence can be inferred from the DTD or schema; e.g. default attribute values. If false, it will write them out explicitly.

canonical-form, optional, default false

If true, the document will be written according to the rules specified by the Canonical XML specification. For instance, attributes will be lexically ordered and CDATA sections will not be included. If false, then the exact output is implementation dependent.

format-pretty-print, optional, default false

If true, white space will be adjusted to “pretty print” the XML. Exactly what this means, e.g. how many spaces elements are indented or what maximum line length is used, is left up to implementations.

validation, optional, default false

If true, then the document’s schema is used to validate the document as it is being output. Any validation errors that are discovered are reported to the registered error handler. (Both validation and error handlers are other new features in DOM3.)

In addition, implementations may define additional custom features. These names will generally begin with vendor-specific prefixes like “apache:” or “oracle:”. For portability, you should check for the existence of such a feature with canSetFeature() before setting it. Otherwise, you’re likely to encounter an unexpected DOMException when the program is run with a different parser.

For example, this code fragment attempts to output the Document object doc onto the OutputStream out in canonical form. However, if the implementation of DOMWriter doesn’t support Canonical XML, it just outputs the document in the normal way:

try {
  DOMWriter writer = new XMLSerializer();
  if (writer.canSetFeature("canonical-form", true)) {
    writer.setFeature("canonical-form", true);
  }
  writer.writeNode(out, doc);
}
catch (Exception e) {
  System.err.println(e);
}
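The same check-before-set pattern can be sketched in runnable form for the format-pretty-print feature. This assumes the API found in the JDK, where features are called parameters and are reached through the serializer's DOMConfiguration (getDomConfig(), canSetParameter(), setParameter()) rather than through setFeature() and canSetFeature(). The class FeatureSketch and its methods are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class FeatureSketch {

  // Builds a small nested document so indentation has something to act on.
  static Document sampleDocument() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element root = doc.createElement("svg");
      Element circle = doc.createElement("circle");
      circle.setAttribute("r", "100");
      root.appendChild(circle);
      doc.appendChild(root);
      return doc;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Serializes doc, pretty printing only if asked and only if supported.
  static String serialize(Document doc, boolean pretty) {
    // Assumption: the JDK's DOMImplementation supports this cast.
    DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
    LSSerializer writer = ls.createLSSerializer();
    DOMConfiguration config = writer.getDomConfig();
    if (pretty
     && config.canSetParameter("format-pretty-print", Boolean.TRUE)) {
      config.setParameter("format-pretty-print", Boolean.TRUE);
    }
    // If the parameter is unsupported, fall through to the default form.
    return writer.writeToString(doc);
  }

  public static void main(String[] args) {
    Document doc = sampleDocument();
    System.out.println(serialize(doc, false));
    System.out.println(serialize(doc, true));
  }

}
```

Exactly how the pretty-printed form is indented is left to the implementation, so the sketch makes no promise about the resulting white space.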

Filtering Output

One of the more original aspects of the DOMWriter API is the ability to attach filters to a writer that remove certain nodes from the output. A DOMWriterFilter is a sub-interface of NodeFilter from last chapter’s traversal API, and works almost exactly like it. This shouldn’t be too surprising since serializing a document is just another tree-walking operation.

To perform output filtering you first implement the DOMWriterFilter interface shown in Example 13.8. As with the NodeFilter superinterface, the acceptNode() method returns one of the three named constants NodeFilter.FILTER_ACCEPT, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_SKIP to indicate whether or not a particular node and its descendants should be output. (This method isn’t listed here because it’s inherited from the superinterface.)

Example 13.8. The DOMWriterFilter interface

package org.w3c.dom.ls;

public interface DOMWriterFilter extends NodeFilter {

  public int getWhatToShow();

}

The getWhatToShow() method returns an int constant indicating which kinds of nodes are passed to this filter for processing. This is a combination of the bit constants used by NodeIterator and TreeWalker in the last chapter; that is, NodeFilter.SHOW_ELEMENT, NodeFilter.SHOW_TEXT, NodeFilter.SHOW_COMMENT, etc.

Chapter 8 demonstrated a SAX filter that removed everything that wasn’t in the XHTML namespace from a document. Example 13.9 is a DOMWriterFilter that accomplishes the same task.

Example 13.9. Filtering everything that isn’t XHTML on output

import org.w3c.dom.*;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.ls.DOMWriterFilter;


public class XHTMLFilter implements DOMWriterFilter {
  
  public final static String XHTML_NAMESPACE 
   = "http://www.w3.org/1999/xhtml";

  // This filter only operates on elements. Everything else
  // will be output without passing through the filter. However
  // descendants of non-XHTML elements will not be output
  // because their ancestor elements have been rejected.
  // Note that this means we don't fully handle nested XHTML;
  // e.g. XHTML contains SVG which contains XHTML.
  // XHTML inside SVG will not be output.
  public int getWhatToShow() {
    return NodeFilter.SHOW_ELEMENT;    
  }
  
  
  public short acceptNode(Node node) {
     
    // Is this necessary or does getWhatToShow() handle this????
    // I've requested clarification from the DOM working group.
    int type = node.getNodeType();
    if (type != Node.ELEMENT_NODE) {
      return NodeFilter.FILTER_ACCEPT;
    }   

    String namespace = node.getNamespaceURI();
    if (XHTML_NAMESPACE.equals(namespace)) {
      return NodeFilter.FILTER_ACCEPT;
    }
    else {
      return NodeFilter.FILTER_REJECT;
    }

  }

}

The one thing this doesn’t filter out is non-XHTML attributes. Those are written out with their elements. They are not passed to acceptNode(). To filter out attributes from other namespaces would require a custom DOMWriter. You might be able to remove them from the element nodes passed to acceptNode(), but this would modify the in-memory tree as well as the streamed output. Furthermore, although Java doesn’t support this, the IDL code for DOMWriter indicates that the Node passed to acceptNode() is read-only. The underlying implementation is probably not expecting acceptNode() to modify its argument. Doing so is asking for corrupt data structures.

You can install a filter into a DOMWriter using the setFilter() method. Then any node the filter rejects will not be serialized. Example 13.10 uses the above XHTMLFilter to output pure XHTML from an input document that might contain SVG, MathML, SMIL, or other non-XHTML elements.

Example 13.10. Using a DOMWriterFilter

import org.w3c.dom.*;
import org.w3c.dom.ls.*;


public class XHTMLPurifier {
  
  public static void main(String[] args) {
     
    try {
      // Find the implementation
      DOMImplementation impl 
       = DOMImplementationRegistry.getDOMImplementation(
          "Core 2.0 LS-Load 3.0 LS-Save 3.0");
      if (impl == null) {
        System.out.println(
         "Could not find a DOM3 Load-Save compliant parser.");
        return;
      }
      DOMImplementationLS implls = (DOMImplementationLS) impl;

      // Load the parser; a null schema type allows any schema
      DOMBuilder parser = implls.createDOMBuilder(
       DOMImplementationLS.MODE_SYNCHRONOUS, null);
      
      // Parse the document named on the command line
      Document doc = parser.parseURI(args[0]);
      
      // Serialize the document onto System.out while filtering
      DOMWriter writer = implls.createDOMWriter();
      DOMWriterFilter filter = new XHTMLFilter();
      writer.setFilter(filter);
      writer.writeNode(System.out, doc);
      
    }
    catch (Exception e) {
      System.err.println(e);
    }
  
  }
  
}
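For experimentation, here is a hedged sketch of the same filtering idea that should compile against a stock JDK. It assumes the interface names in the JDK's org.w3c.dom.ls package, where LSSerializerFilter takes the place of DOMWriterFilter and LSSerializer the place of DOMWriter. Unlike Example 13.10, it builds its input in memory instead of parsing a URI, and it returns FILTER_REJECT for elements outside the XHTML namespace. How a given serializer treats the descendants of a rejected element is worth verifying before relying on it. The class name XHTMLOnlySketch is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.w3c.dom.ls.LSSerializerFilter;
import org.w3c.dom.traversal.NodeFilter;

public class XHTMLOnlySketch implements LSSerializerFilter {

  static final String XHTML = "http://www.w3.org/1999/xhtml";
  static final String SVG = "http://www.w3.org/2000/svg";

  // Only elements are passed to acceptNode().
  public int getWhatToShow() {
    return NodeFilter.SHOW_ELEMENT;
  }

  // Accepts XHTML elements and rejects elements from other namespaces.
  public short acceptNode(Node node) {
    if (node.getNodeType() != Node.ELEMENT_NODE) {
      return NodeFilter.FILTER_ACCEPT;
    }
    if (XHTML.equals(node.getNamespaceURI())) {
      return NodeFilter.FILTER_ACCEPT;
    }
    return NodeFilter.FILTER_REJECT;
  }

  // Builds an XHTML document that embeds a small SVG picture.
  static Document sampleDocument() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().newDocument();
      Element html = doc.createElementNS(XHTML, "html");
      Element p = doc.createElementNS(XHTML, "p");
      p.appendChild(doc.createTextNode("A circle:"));
      Element svg = doc.createElementNS(SVG, "svg");
      svg.appendChild(doc.createElementNS(SVG, "circle"));
      html.appendChild(p);
      html.appendChild(svg);
      doc.appendChild(html);
      return doc;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Serializes doc with the filter installed.
  static String purify(Document doc) {
    // Assumption: the JDK's DOMImplementation supports this cast.
    DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
    LSSerializer writer = ls.createLSSerializer();
    writer.setFilter(new XHTMLOnlySketch());
    return writer.writeToString(doc);
  }

  public static void main(String[] args) {
    System.out.println(purify(sampleDocument()));
  }

}
```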

Copyright 2001, 2002 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified September 11, 2002