The Text Interface

The Text interface represents a text node. This can be a child of an element, an attribute, or an entity reference. When a document is built by a parser, each text node will contain the longest possible run of contiguous parsed character data from the document, and thus no text node will be adjacent to any other. However, documents built in memory may contain adjacent text nodes. Invoking the normalize() method in the Node interface on any ancestor of the text nodes will merge these together.
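As a minimal sketch of that merging behavior, the following program builds a small document in memory with two adjacent text nodes and then normalizes their parent. The class name NormalizeDemo, the greeting element, and the sample strings are hypothetical, chosen only for illustration; the DOM and JAXP calls are the standard ones.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {

  // Builds a hypothetical <greeting> element in memory
  // whose content is held in two adjacent text nodes
  public static Element buildGreeting() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element greeting = doc.createElement("greeting");
      doc.appendChild(greeting);
      greeting.appendChild(doc.createTextNode("Hello, "));
      greeting.appendChild(doc.createTextNode("world"));
      return greeting;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element greeting = buildGreeting();
    // Two adjacent text nodes before normalization
    System.out.println(greeting.getChildNodes().getLength());
    // normalize() merges adjacent text nodes in this subtree
    greeting.normalize();
    System.out.println(greeting.getChildNodes().getLength());
    System.out.println(greeting.getFirstChild().getNodeValue());
  }

}
```

Run as written, this should report two children before the call to normalize() and one child, containing "Hello, world", afterward.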

Example 11.9 summarizes the Text interface. Besides methods such as setData() and getNodeValue() that Text inherits from its super-interfaces, it declares one new method, which splits a Text object into two.

Example 11.9. The Text interface

package org.w3c.dom;

public interface Text extends CharacterData {

  public Text splitText(int offset) throws DOMException;

}

The splitText() method changes one text node into two by dividing its data at a specified offset. All characters at and after the offset are cut out of the original node and placed in a new text node, which is returned. If the original node has a parent, the new node is inserted into the tree as the original node's next sibling, so both are included in the tree. If the offset is less than zero or greater than the length of the data, splitText() throws a DOMException with the code INDEX_SIZE_ERR.
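The following sketch shows a single split and the exception for an out-of-range offset. The class name SplitTextDemo, the helper newTextInElement(), the p element, and the sample string are all hypothetical, introduced only for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class SplitTextDemo {

  // Hypothetical helper: creates a text node holding data
  // and attaches it to a new <p> root element
  public static Text newTextInElement(String data) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element parent = doc.createElement("p");
      doc.appendChild(parent);
      Text text = doc.createTextNode(data);
      parent.appendChild(text);
      return text;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Text head = newTextInElement("Hello, world");
    // Characters 0 through 6 stay in head; the rest move to tail
    Text tail = head.splitText(7);
    System.out.println("[" + head.getData() + "]");
    System.out.println("[" + tail.getData() + "]");
    // The new node follows the original in the tree
    System.out.println(head.getNextSibling() == tail);

    try {
      tail.splitText(99); // well past the end of "world"
    }
    catch (DOMException e) {
      System.out.println(e.code == DOMException.INDEX_SIZE_ERR);
    }
  }

}
```

An offset equal to the length of the data is legal; in that case the new node is simply empty.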

The main reason to split a text node is so that you can move or delete part of some text, but not the entire node. It can also be used to insert a new node in the middle of a run of text. For example, suppose date is an Element object representing this element:

<date>2002-01-08</date>

Now suppose you want to change date to represent this element:

<date><year>2002</year><month>01</month><day>08</day></date>

This code will do it:

Document document = date.getOwnerDocument();

// Split "2002-01-08" into "2002", "-", "01", "-", and "08"
Text yearText = (Text) date.getFirstChild();
Text hyphen = yearText.splitText(4);
Text monthText = hyphen.splitText(1);
Text nextHyphen = monthText.splitText(2);
Text dayText = nextHyphen.splitText(1);

Element year = document.createElement("year");
Element month = document.createElement("month");
Element day = document.createElement("day");

// Detach all five text nodes from the date element
date.removeChild(yearText);
date.removeChild(hyphen);
date.removeChild(monthText);
date.removeChild(nextHyphen);
date.removeChild(dayText);

// Put each piece of the date inside its new element;
// the two hyphens are discarded
year.appendChild(yearText);
month.appendChild(monthText);
day.appendChild(dayText);
date.appendChild(year);
date.appendChild(month);
date.appendChild(day);

Often these operations can be implemented more simply by reading the text with getData(), manipulating it with String methods, and building new nodes from the results.
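For comparison, here is a sketch of the same date transformation written with String methods rather than splitText(). The class name DateSplitter and the helper buildDate() are hypothetical, and the code assumes the date element contains exactly one text node of the form year-month-day.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class DateSplitter {

  // Hypothetical helper: builds <date>value</date> in memory
  public static Element buildDate(String value) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element date = doc.createElement("date");
      doc.appendChild(date);
      date.appendChild(doc.createTextNode(value));
      return date;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Assumes date holds a single text node such as "2002-01-08"
  public static void splitDate(Element date) {
    Document document = date.getOwnerDocument();
    Text original = (Text) date.getFirstChild();
    String[] parts = original.getData().split("-");
    date.removeChild(original);

    String[] names = {"year", "month", "day"};
    for (int i = 0; i < names.length; i++) {
      Element field = document.createElement(names[i]);
      field.appendChild(document.createTextNode(parts[i]));
      date.appendChild(field);
    }
  }

  public static void main(String[] args) {
    Element date = buildDate("2002-01-08");
    splitDate(date);
    // List each new child element and its text
    for (int i = 0; i < date.getChildNodes().getLength(); i++) {
      Element field = (Element) date.getChildNodes().item(i);
      System.out.println(field.getTagName() + ": "
       + field.getFirstChild().getNodeValue());
    }
  }

}
```

This version creates fresh text nodes instead of reusing pieces of the original one, which is usually acceptable unless other code holds references to the original node.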

Example 11.10 is a simple program that recursively descends a DOM tree and prints all text nodes on System.out. This has the effect of stripping out the markup while leaving all text inside the document intact:

Example 11.10. Printing the text nodes in an XML document

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class DOMTextExtractor {

  public void processNode(Node node) {
    
    if (node instanceof Text) {
      Text text = (Text) node;
      String data = text.getData();
      System.out.println(data);
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java DOMTextExtractor URL");
      return;
    }
    
    String url = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      // If expandEntityReferences isn't turned off, there
      //  won't be any entity reference nodes in the DOM tree.
      //  The factory must be configured before it creates
      //  the parser; later changes don't affect that parser.
      factory.setExpandEntityReferences(false);
      DocumentBuilder parser = factory.newDocumentBuilder();
      
      // Read the document
      Document document = parser.parse(url); 
      
      // Process the document
      DOMTextExtractor extractor = new DOMTextExtractor();
      extractor.followNode(document);

    }
    catch (SAXException e) {
      System.out.println(url + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not read " + url
      ); 
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    }
     
  } // end main

}

Here’s the result of running the XML specification through this program:

D:\books\XMLJAVA>java DOMTextExtractor 
 http://www.w3.org/TR/2000/REC-xml-20001006.xml






Extensible Markup Language (XML)


1.0 (Second Edition)


REC-xml-20001006


W3C Recommendation


6
October
2000
…

Notice that white space is included in text nodes and is significant; the blank lines in this output are white-space-only text nodes that fall between elements. Text inside entity references is also found, one way or another. If the DOM parser is producing entity reference nodes, then the replacement text of each entity appears as the children of the corresponding entity reference node. Otherwise, the replacement text of the entity is simply resolved into the surrounding text nodes.
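To see how much of a tree can be white space alone, the following sketch counts the text nodes in a small document parsed from a string. The class name WhiteSpaceCounter, its two methods, and the sample document are hypothetical. The document has no DTD, so the parser has no way to know that the white space between elements is ignorable and reports all of it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WhiteSpaceCounter {

  // Parses a document held in a string
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  // Counts the text nodes at or below node. If blankOnly is true,
  // only nodes containing nothing but white space are counted.
  public static int countTextNodes(Node node, boolean blankOnly) {
    int count = 0;
    if (node instanceof Text) {
      String data = ((Text) node).getData();
      if (!blankOnly || data.trim().length() == 0) count++;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      count += countTextNodes(children.item(i), blankOnly);
    }
    return count;
  }

  public static void main(String[] args) {
    Document doc = parse("<root>\n  <a>1</a>\n  <b>2</b>\n</root>");
    System.out.println(countTextNodes(doc, false)); // all text nodes
    System.out.println(countTextNodes(doc, true));  // white space only
  }

}
```

With the JAXP parser bundled with the JDK, this sample should report five text nodes, three of which contain only white space.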


Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last Modified July 29, 2002