The ProcessingInstruction Interface

The ProcessingInstruction interface represents a processing instruction such as <?xml-stylesheet type="text/css" href="order.css"?> or <?php echo "Hello World";?>.

Example 11.17 summarizes the ProcessingInstruction interface. This interface adds methods to get the target and the data of the processing instruction as strings. Even if the data has a pseudo-attribute format, as in <?xml-stylesheet type="text/css" href="order.css"?>, DOM doesn’t recognize that. For this processing instruction the target is xml-stylesheet and the data is type="text/css" href="order.css".

Example 11.17. The ProcessingInstruction interface

package org.w3c.dom;

public interface ProcessingInstruction extends Node {

  public String getTarget();
  public String getData();
  public void   setData(String data) throws DOMException;

}

As usual, ProcessingInstruction objects also have all the methods of the Node super-interface such as getNodeName() and getNodeValue(). The value of a processing instruction is its data. However, processing instructions do not have children, so Node methods like getFirstChild() return null, and methods like appendChild() throw a DOMException with the code HIERARCHY_REQUEST_ERR.
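
For example, this minimal sketch (the class name PIExample and the inline sample document are mine, not part of this chapter’s examples) parses a small document and reads the same processing instruction through both the PI-specific methods and the generic Node methods:

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class PIExample {

  public static void main(String[] args) throws Exception {
  
    String xml = "<?xml-stylesheet type=\"text/css\" "
     + "href=\"order.css\"?><order/>";
    DocumentBuilderFactory factory 
     = DocumentBuilderFactory.newInstance();
    DocumentBuilder parser = factory.newDocumentBuilder();
    Document doc 
     = parser.parse(new InputSource(new StringReader(xml)));
    
    // The processing instruction precedes the root element, 
    // so it is the document node's first child
    ProcessingInstruction pi 
     = (ProcessingInstruction) doc.getFirstChild();
    
    System.out.println(pi.getTarget());     // xml-stylesheet
    System.out.println(pi.getNodeName());   // also xml-stylesheet
    System.out.println(pi.getData());       
    // type="text/css" href="order.css"
    System.out.println(pi.getNodeValue());  // same as getData()
    System.out.println(pi.getFirstChild()); // null; no children
    
  }
  
}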

As an example, let’s extend the earlier DOMSpider program so that it respects robots processing instructions. Such an instruction appears in the prolog of an XML document and looks like this:

<?robots index="yes" follow="no"?>

The semantics of this instruction are deliberately similar to those of the robots META tag in HTML. That is, follow="yes" means robots should follow links they find in this page, and follow="no" means they shouldn’t. Similarly, index="yes" means search engines should include this page, and index="no" means they shouldn’t.

Like many processing instructions, this one uses a syntax based on pseudo-attributes. DOM doesn’t provide any means to parse these, even though pseudo-attributes are a very common format for processing instruction data. However, you can fake DOM out. What I’m going to do is extract the target and data of the processing instruction and use them to form a string that has this format:

<target data/>

In other words, a processing instruction like <?robots index="yes" follow="no"?> is going to turn into the String <robots index="yes" follow="no"/>. This string is in turn a well-formed XML document that can be parsed so that its attributes can be extracted. Admittedly, this approach is circuitous and probably not optimally efficient. However, it’s a lot easier to code and explain than writing your own mini-parser just to handle pseudo-attributes. Example 11.18 is a simple utility class that implements this hack. The parsing is completely hidden inside the constructor, so if it’s too offensive to your sensibilities, you can replace it with more appropriate code without changing the public interface. Since this class is quite useful in practice, not merely an example for this book, I’ve placed it in the com.macfaq.xml package. Don’t forget to configure your class and source paths appropriately when compiling it.

Example 11.18. Reading PseudoAttributes from a ProcessingInstruction

package com.macfaq.xml;

import org.w3c.dom.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import java.io.*;


public class PseudoAttributes {

  private NamedNodeMap pseudo;

  public PseudoAttributes(ProcessingInstruction pi) 
   throws SAXException {
  
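    // Reassemble the target and data as an empty-element tag, 
    // for example <robots index="yes" follow="no"/>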
    StringBuffer sb = new StringBuffer("<");
    sb.append(pi.getTarget());
    sb.append(" ");
    sb.append(pi.getData());
    sb.append("/>");
    StringReader reader = new StringReader(sb.toString());
    InputSource source = new InputSource(reader);
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      
      // This line will throw a SAXException if the processing
      // instruction does not use pseudo-attributes
      Document doc = parser.parse(source);
      Element root = doc.getDocumentElement();
      pseudo = root.getAttributes();
      
    }
    catch (FactoryConfigurationError e) { 
      // I don't absolutely need to catch this, but I hate to 
      // throw an Error for no good reason.
      throw new SAXException(e.getMessage()); 
    }    
    catch (SAXException e) { 
      throw e; 
    }    
    catch (Exception e) { 
      throw new SAXException(e); 
    }    
    
  }

  // delegator methods
  public Attr item(int index) {
    return (Attr) pseudo.item(index);
  }
  
  public int getLength() {
    return pseudo.getLength();
  }

  public String getValue(String name) {
    Attr att = (Attr) pseudo.getNamedItem(name);
    if (att == null) return "";
    return att.getValue();
  }
  
}
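
Here’s a short usage sketch (the class name RobotsTest and the inline document are mine, purely illustrative) that exercises the class on a robots processing instruction:

import com.macfaq.xml.PseudoAttributes;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class RobotsTest {

  public static void main(String[] args) throws Exception {
  
    String xml = "<?robots index=\"yes\" follow=\"no\"?><doc/>";
    DocumentBuilderFactory factory 
     = DocumentBuilderFactory.newInstance();
    DocumentBuilder parser = factory.newDocumentBuilder();
    Document doc 
     = parser.parse(new InputSource(new StringReader(xml)));
    
    // The robots PI is the first node in the prolog here
    ProcessingInstruction pi 
     = (ProcessingInstruction) doc.getFirstChild();
    PseudoAttributes pseudo = new PseudoAttributes(pi);
    
    System.out.println(pseudo.getValue("index"));  // yes
    System.out.println(pseudo.getValue("follow")); // no
    System.out.println(pseudo.getLength());        // 2
    
  }
  
}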

This class makes it easy for the earlier DOMSpider program to recognize the robots processing instruction. I won’t repeat the entire program here, since most of it hasn’t changed. The relevant change is in the spider() method. It now has to look for a robots processing instruction in each document and use that instruction to decide whether to call process() (governed by index) and whether to call findLinks() (governed by follow).

  public void spider(String systemID) {
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemID);
        
        // Look for a robots PI and read its index and follow values
        boolean index = true;
        boolean follow = true;
        NodeList children = document.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          Node child = children.item(i); 
          int type = child.getNodeType();
          if (type == Node.PROCESSING_INSTRUCTION_NODE) {
            ProcessingInstruction pi 
             = (ProcessingInstruction) child; 
            if (pi.getTarget().equals("robots")) {
               PseudoAttributes pseudo = new PseudoAttributes(pi); 
               if (pseudo.getValue("index").equals("no")) {
                 index = false; 
               }
               if (pseudo.getValue("follow").equals("no")) {
                 follow = false; 
               }
            }
          }
        } // end for
        
        if (index) process(document, systemID);
        
        if (follow) {
          Vector toBeVisited = new Vector();
          // search the document for uris, 
          // store them in vector, and print them
          findLinks(
           document.getDocumentElement(), toBeVisited, systemID);
    
          Enumeration e = toBeVisited.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.add(uri);
            spider(uri); 
          } // end while
        } // end if 
      
      }
    
    }
    catch (SAXException e) {
      // Couldn't parse the document or its robots PI, 
      // probably not well-formed XML; skip it 
    }
    catch (IOException e) {
      // Couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
