Receiving Documents

In general, a single XMLReader may parse multiple documents and may do so with the same ContentHandler. Consequently, it’s important to tell where one document ends and the next begins. To provide this information, the parser invokes startDocument() as soon as it begins parsing a new document, before it invokes any other content-reporting method in ContentHandler. It calls endDocument() after it has finished parsing the document, and it will not report any further content from that document. No arguments are passed to either of these methods. They serve no purpose other than marking the beginning and end of a complete XML document.

Because an XMLReader may parse multiple documents with the same ContentHandler object, per-document data structures are normally initialized in the startDocument() method rather than in a constructor. These data structures can be flushed, saved, or committed as appropriate by the endDocument() method.

Caution

If you are using one ContentHandler for multiple documents, do not assume that the endDocument() method for the previous document actually ran. If one of the earlier methods such as startElement() threw an exception, it’s likely that the parsing was not finished and that any cleanup code you put in endDocument() was not executed. For safety, it’s a good idea to reinitialize all per-document data structures in startDocument().
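The following sketch illustrates why resetting in startDocument() is the safe choice. It is only an illustration: the class names ResetOnStartDemo and RecordingHandler, the element name bad, and the sample documents are all hypothetical, and the reader is obtained through JAXP's SAXParserFactory purely for convenience. The handler deliberately throws a SAXException partway through the first document, then the same reader and handler parse a second document. Whether endDocument() runs after such an exception is up to the parser, so the sketch does not rely on it either way.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ResetOnStartDemo {

  // Hypothetical handler that records how often each boundary method ran
  static class RecordingHandler extends DefaultHandler {
    int starts = 0;
    int ends = 0;
    final List<String> finished = new ArrayList<>();
    private StringBuilder current;

    @Override public void startDocument() {
      starts++;
      current = new StringBuilder(); // discards anything left from a failed parse
    }

    @Override public void endDocument() {
      ends++;
      finished.add(current.toString());
    }

    @Override public void startElement(String uri, String localName,
        String qName, Attributes atts) throws SAXException {
      // Simulate a handler that rejects a document halfway through
      if ("bad".equals(qName)) throw new SAXException("rejected element");
    }

    @Override public void characters(char[] text, int start, int length) {
      current.append(text, start, length);
    }
  }

  static RecordingHandler parseFailingThenGood() throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    RecordingHandler handler = new RecordingHandler();
    reader.setContentHandler(handler);
    try {
      reader.parse(new InputSource(new StringReader("<doc>stale<bad/></doc>")));
    } catch (SAXException expected) {
      // The first parse was abandoned; "stale" may still sit in the buffer
    }
    reader.parse(new InputSource(new StringReader("<doc>fresh</doc>")));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    RecordingHandler handler = parseFailingThenGood();
    System.out.println("startDocument() calls: " + handler.starts);
    System.out.println("endDocument() calls:   " + handler.ends);
    System.out.println("last stored text:      "
        + handler.finished.get(handler.finished.size() - 1));
  }
}
```

Because the buffer is replaced in startDocument(), the text stored for the second document cannot be contaminated by the leftover "stale" text from the abandoned first parse, regardless of whether the parser called endDocument() for it.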

For example, let’s revise the tag stripper program so that it can operate on multiple XML documents in series. Furthermore, rather than printing the results on a Writer, we’ll store them in a List of Strings. As is common in SAX programs, we need a data structure that holds the information collected from each document. For this program, a very basic one suffices, namely a StringBuffer stored in the currentDocument field. This field is initialized to a new StringBuffer object in the startDocument() method, then converted to a String and added to the documents list in the endDocument() method. Example 6.6 demonstrates the necessary ContentHandler class. The characters() method simply appends text to the currentDocument buffer.

Example 6.6. A ContentHandler implementation that resets its data structures between documents

import org.xml.sax.*;
import java.util.List;


public class MultiTextExtractor implements ContentHandler {

  private List documents;
  
  // This field is deliberately not initialized in the
  // constructor. It is initialized for each document parsed, not
  // for each object constructed.
  private StringBuffer currentDocument;
  
  public MultiTextExtractor(List documents) {
    
    if (documents == null) {
      throw new NullPointerException(
       "Documents list must be non-null");
    }
    this.documents = documents;   
  }

  // Initialize the per-document data structures
  public void startDocument() {
    
    currentDocument = new StringBuffer();
    
  }
  
  // Flush and commit the per-document data structures
  public void endDocument() {
    
    String text = currentDocument.toString();
    documents.add(text);
    
  }
    
  // Update the per-document data structures
  public void characters(char[] text, int start, int length) {

    currentDocument.append(text, start, length); 
      
  }  
    
  // do-nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startPrefixMapping(String prefix, String uri) {}
  public void endPrefixMapping(String prefix) {}
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {}
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {}
  public void ignorableWhitespace(char[] text, int start, 
   int length) {}
  public void processingInstruction(String target, 
   String data) {}
  public void skippedEntity(String name) {}

}
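As a rough sketch of how such a handler might be driven, the following program parses two documents in series with one XMLReader and one handler. To keep the listing self-contained it uses a compact stand-in for MultiTextExtractor built on DefaultHandler; the names SerialExtractionDemo, TextCollector, and extractAll, as well as the sample documents, are hypothetical, and obtaining the reader through JAXP's SAXParserFactory is just one convenient option.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SerialExtractionDemo {

  // Compact stand-in for MultiTextExtractor (hypothetical name)
  static class TextCollector extends DefaultHandler {
    private final List<String> documents;
    private StringBuilder currentDocument; // set per document, not per object

    TextCollector(List<String> documents) {
      this.documents = documents;
    }

    @Override public void startDocument() {
      currentDocument = new StringBuilder();
    }

    @Override public void endDocument() {
      documents.add(currentDocument.toString());
    }

    @Override public void characters(char[] text, int start, int length) {
      currentDocument.append(text, start, length);
    }
  }

  // Parses each XML string in turn using ONE reader and ONE handler
  public static List<String> extractAll(String... xmlDocuments)
      throws Exception {
    List<String> results = new ArrayList<>();
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setContentHandler(new TextCollector(results));
    for (String xml : xmlDocuments) {
      // Each call finishes before the next begins, so reuse is safe here
      reader.parse(new InputSource(new StringReader(xml)));
    }
    return results;
  }

  public static void main(String[] args) throws Exception {
    List<String> texts = extractAll(
        "<greeting>Hello <b>SAX</b></greeting>",
        "<note>Second document</note>");
    for (String text : texts) {
      System.out.println(text);
    }
  }
}
```

Each call to parse() triggers one startDocument()/endDocument() pair, so the list ends up with exactly one entry per successfully parsed document, in the order the documents were parsed.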

Caution

Parsers and ContentHandlers are not thread safe or reentrant. While it’s straightforward to design a SAX program that operates on multiple documents in series, it is almost impossible to design one that operates on multiple documents in parallel. If you need to perform XML parsing in multiple, simultaneous threads, give each thread its own XMLReader and ContentHandler objects. Similarly, if you want to parse another document from inside one of the ContentHandler methods, create a new XMLReader and a new ContentHandler object to parse it with. Do not try to reuse the existing XMLReader and ContentHandler before they've finished with the current document.
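One way to follow this advice, sketched under a few assumptions, is to create the reader and handler inside the task that each thread runs, so that no parser state is ever shared. The names ParallelExtractionDemo, TextHandler, extractText, and extractInParallel are hypothetical, as is the pool size of four; the reader again comes from JAXP's SAXParserFactory, and a fresh factory is created per task because factories are not guaranteed to be thread safe either.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParallelExtractionDemo {

  // Hypothetical handler that collects the text of a single document
  static class TextHandler extends DefaultHandler {
    private StringBuilder text;

    @Override public void startDocument() {
      text = new StringBuilder();
    }

    @Override public void characters(char[] chars, int start, int length) {
      text.append(chars, start, length);
    }

    String getText() {
      return text.toString();
    }
  }

  // Creates a fresh reader and handler on every call; nothing is shared
  static String extractText(String xml) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    TextHandler handler = new TextHandler();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler.getText();
  }

  // Parses the documents concurrently and returns results in input order
  public static List<String> extractInParallel(List<String> xmlDocuments)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String xml : xmlDocuments) {
        Callable<String> task = () -> extractText(xml);
        futures.add(pool.submit(task));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        results.add(future.get()); // waits for that document's parse
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<String> docs = new ArrayList<>();
    docs.add("<a>first</a>");
    docs.add("<b>second</b>");
    docs.add("<c>third</c>");
    System.out.println(extractInParallel(docs));
  }
}
```

The same pattern applies to nested parsing: a handler method that needs to parse another document would call something like extractText(), which builds its own reader and handler, rather than touching the reader that is currently invoking it.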


Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last modified February 28, 2002.