Chapter 5. Reading XML

Table of Contents

InputStreams and Readers
XML Parsers
Choosing an XML API
Choosing an XML Parser
Available Parsers
SAX
DOM
JAXP
JDOM
dom4j
ElectricXML
XMLPULL
Summary

Writing XML documents is very straightforward, as I hope the last two chapters proved. Reading them is not nearly as simple. Fortunately, you don’t have to do all the work yourself. You can use an XML parser to read the document for you. The XML parser exposes the contents of an XML document through an API. A client application reads an XML document through this API. As well as reading the document and providing the contents to the client application, the parser also checks the document for well-formedness and (optionally) validity. If it finds an error, it informs the client application.

InputStreams and Readers

It’s time to reverse the examples of the previous two chapters. Instead of putting information into an XML document, I’m going to take it out of one. In particular, I’m going to use an example that reads the response from the Fibonacci XML-RPC servlet introduced in Chapter 3. This document has the form shown in Example 5.1.

Example 5.1. A response from the Fibonacci XML-RPC server

<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value><double>28657</double></value>
    </param>
  </params>
</methodResponse>

The clients for the XML-RPC server developed in Chapter 3 simply printed the entire document on the console. Now I want to extract just the answer and strip out all the markup. That is, the user interface will look something like this:

C:\XMLJAVA>java FibonacciClient 9
34

From the user’s perspective, the XML is completely hidden. The user neither knows nor cares that the request is being sent and the response received in an XML document. That’s merely an implementation detail. In fact, the user may not even know that the request is being sent over the network rather than being processed locally. All the user sees is the very basic command line interface. Obviously you could attach a fancier GUI front-end, but since this is not a book about GUI programming, I’ll leave that as an exercise for the reader.

Given that you’re writing a client to talk to an XML-RPC server, you know that the documents you’re processing always take this form. You know that the root element is methodResponse. You know that the methodResponse element contains a single params element that in turn contains a param element. You know that this param element contains a single value element. (For the moment, I'm going to ignore the possibility of a fault response to keep the examples smaller and simpler. Adding that would be straightforward, and we'll do that in later chapters.) All of this is specified by the XML-RPC specification. If any of this is violated in the response you get back from the server, then that server is not sending correct XML-RPC. You’d probably respond to this by throwing an exception.

Given that you’re writing a client to talk to the specific servlet at http://www.elharo.com/fibonacci/XML-RPC, you know that the value element contains a single double element that in turn contains a string representing a double. This isn’t true for all XML-RPC servers, but it is for this one. If the server returns a value with a type other than double, you’d probably respond by throwing an exception, just as you would if a local method you expected to return a Double instead returned a String. The only significant difference is that in the XML-RPC case neither the compiler nor the virtual machine can do any type checking. Thus you may want to be a little bit more explicit about handling the case where something unexpected is returned.

The main point is this: most programs you write are going to read documents written in a specific XML vocabulary. They are not going to be designed to handle absolutely any well-formed document that comes down the pipe. Your programs will make assumptions about the content and structure of those documents, just as they now make assumptions about the content and structure of external objects. If you are concerned that your assumptions may occasionally be violated (and you should be), you can validate your documents against a schema of some kind so you know up-front if you’re being fed bad data. However, you do need to make some assumptions about the format of your documents before you can reasonably process them.

It’s simple enough to hook up an InputStream and/or an InputStreamReader to the document, and read it out. For example, this method reads an input XML document from the specified input stream and copies it to System.out:

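public static void printXML(InputStream xml) throws IOException {

  int c;
  // Copy the document byte by byte until the end of the stream
  while ((c = xml.read()) != -1) System.out.write(c);
  System.out.flush();
   
}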

To actually extract the information a little more work is required. You need to determine which pieces of the input you actually want and separate those out from all the rest of the text. In the Fibonacci XML-RPC example, you need to extract the text string between the <double> and </double> tags and then convert it to a java.math.BigInteger object. (Remember, I’m only using a double here because XML-RPC’s ints aren’t big enough to handle Fibonacci numbers. However, all the responses should contain an integral value.)

The readFibonacciXMLRPCResponse() method in Example 5.2 does exactly this by first reading the entire XML document into a StringBuffer, converting the buffer to a String, and then using the indexOf() and substring() methods to extract the desired information. The main() method connects to the server using the URL and URLConnection classes, sends a request document to the server using the OutputStream and OutputStreamWriter classes, and passes the InputStream containing the response XML document to the readFibonacciXMLRPCResponse() method.

Example 5.2. Reading an XML-RPC Response

import java.net.*;
import java.io.*;
import java.math.BigInteger;


public class FibonacciClient {

  static String defaultServer 
   = "http://www.elharo.com/fibonacci/XML-RPC";
   
  public static void main(String[] args) {
      
    if (args.length <= 0) {
      System.out.println(
       "Usage: java FibonacciClient number url"
      );
      return;
    }
    
    String server = defaultServer;
    if (args.length >= 2) server = args[1];
      
    try {
      // Connect to the server
      URL u = new URL(server);
      URLConnection uc = u.openConnection();
      HttpURLConnection connection = (HttpURLConnection) uc;
      connection.setDoOutput(true);
      connection.setDoInput(true); 
      connection.setRequestMethod("POST");
      OutputStream out = connection.getOutputStream();
      Writer wout = new OutputStreamWriter(out, "UTF-8");
       
      // Write the request
      wout.write("<?xml version=\"1.0\"?>\r\n");  
      wout.write("<methodCall>\r\n"); 
      wout.write(
       "  <methodName>calculateFibonacci</methodName>\r\n");
      wout.write("  <params>\r\n"); 
      wout.write("    <param>\r\n"); 
      wout.write("      <value><int>" + args[0] 
       + "</int></value>\r\n");
      wout.write("    </param>\r\n"); 
      wout.write("  </params>\r\n"); 
      wout.write("</methodCall>\r\n"); 
        
      wout.flush();
      wout.close();
      
      // Read the response
      InputStream in = connection.getInputStream();
      BigInteger result = readFibonacciXMLRPCResponse(in);
      System.out.println(result);
        
      in.close();
      connection.disconnect();
    }
    catch (IOException e) {
      System.err.println(e); 
    }
  
  }

  private static BigInteger readFibonacciXMLRPCResponse(
   InputStream in) throws IOException, NumberFormatException, 
   StringIndexOutOfBoundsException {
    
    StringBuffer sb = new StringBuffer();
    Reader reader = new InputStreamReader(in, "UTF-8");
    int c;
    while ((c = reader.read()) != -1) sb.append((char) c);
    
    String document = sb.toString();
    String startTag = "<value><double>";
    String endTag = "</double></value>";
    int start = document.indexOf(startTag) + startTag.length();
    int end = document.indexOf(endTag);
    String result = document.substring(start, end);
    return new BigInteger(result);
    
  }  

}

Reading the response XML document is more work than writing the request document, but still manageable. However, this stream- and string-based solution is far from robust. In particular, it will fail if:

  • The document returned is encoded in UTF-16 instead of UTF-8.

  • An earlier part of the document contains the text “<value><double>”, even in a comment.

  • The response is written with line breaks between the value and double tags like this:

    <value>
      <double>28657</double>
    </value>

  • There’s extra white space inside the double tags like this:

    <double >28657</double >

Perhaps worse than these are all the malformed responses FibonacciClient will accept even though it should recognize and reject them. And this is a simple example where we just want one piece of data that’s clearly marked up. The more data you want from an XML document, and the more complex and flexible the markup, the harder it is to find using basic string matching or even the regular expressions introduced in Java 1.4.
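To make these failure modes concrete, here is a minimal sketch that applies the same indexOf() and substring() logic as readFibonacciXMLRPCResponse() to a few in-memory variations of the response. The class name NaiveExtractionDemo and the tryExtract() helper are hypothetical names invented for this illustration; they are not part of Example 5.2.

```java
import java.math.BigInteger;

class NaiveExtractionDemo {

  // Same string matching as readFibonacciXMLRPCResponse(), but
  // applied to a String that is already in memory
  static BigInteger extract(String document) {
    String startTag = "<value><double>";
    String endTag = "</double></value>";
    int start = document.indexOf(startTag) + startTag.length();
    int end = document.indexOf(endTag);
    return new BigInteger(document.substring(start, end));
  }

  // Hypothetical helper: returns the extracted number as a string,
  // or the name of the exception that the extraction threw
  static String tryExtract(String document) {
    try {
      return extract(document).toString();
    }
    catch (RuntimeException e) {
      return "failed: " + e.getClass().getName();
    }
  }

  public static void main(String[] args) {
    String compact = "<value><double>28657</double></value>";
    String lineBreaks = "<value>\n  <double>28657</double>\n</value>";
    String extraSpace = "<value><double >28657</double ></value>";
    String comment = "<!-- <value><double> -->" + compact;
    // Not well-formed: the methodResponse start-tag is never closed
    String malformed = "<methodResponse>" + compact;

    System.out.println(tryExtract(compact));
    System.out.println(tryExtract(lineBreaks));
    System.out.println(tryExtract(extraSpace));
    System.out.println(tryExtract(comment));
    System.out.println(tryExtract(malformed));
  }

}
```

Only the compact form yields 28657 for the right reason. The line-break and extra-space variants fail with a StringIndexOutOfBoundsException, and the comment variant fails with a NumberFormatException, even though all three are perfectly legal XML. The malformed document, on the other hand, is silently accepted.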

Straight text parsing is not the appropriate tool with which to navigate an XML document. The structure and semantics of an XML document are encoded in the document’s markup, its tags and its attributes; and you need a tool that is designed to recognize and understand this structure and to report any errors in it. This tool is called an XML parser.
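As a rough preview of what a parser buys you, the following sketch reads the same response through the JAXP DocumentBuilder interface and the DOM classes bundled with current JDKs. It is only one of several possible approaches, not the definitive way to process XML-RPC. The class name ParserPreview, the readResponse() method, and the rule that the response must contain exactly one double element are assumptions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class ParserPreview {

  static BigInteger readResponse(InputStream in)
   throws ParserConfigurationException, SAXException, IOException {

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // parse() throws a SAXException if the document is malformed
    Document doc = builder.parse(in);

    // Assumption for this sketch: exactly one double element
    NodeList doubles = doc.getElementsByTagName("double");
    if (doubles.getLength() != 1) {
      throw new IllegalStateException(
       "Expected exactly one double element");
    }
    String text = doubles.item(0).getTextContent().trim();
    return new BigInteger(text);

  }

  public static void main(String[] args) throws Exception {

    // Line breaks and extra white space in the tags are legal XML
    String xml = "<?xml version=\"1.0\"?>\n"
     + "<methodResponse><params><param>\n"
     + "<value>\n  <double >28657</double >\n</value>\n"
     + "</param></params></methodResponse>";
    InputStream in = new ByteArrayInputStream(
     xml.getBytes(StandardCharsets.UTF_8));
    System.out.println(readResponse(in));

  }

}
```

Because the parser works with elements rather than raw characters, white space between or inside tags no longer matters, comments are not mistaken for markup, and a response that is not well-formed is rejected with an exception instead of being quietly accepted. The parser also determines the character encoding from the document itself rather than relying on a hard-coded "UTF-8".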


Copyright 2001, 2002 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified September 16, 2001