Parsing non-XML Documents

One of the neatest tricks that a SAX filter can perform is presenting something that isn’t an XML document as if it were one. The client application doesn’t need to know that what it’s parsing is really a tab-delimited text file or a relational database if the filter hides all those details and just presents a view of that source as a well-formed XML document.

There are several advantages to using this approach as opposed to actually converting the source. For one, it’s a lot more dynamically adaptable to changing data. For another, it’s easier to integrate into existing XML processing chains. A small advantage is that you don’t need to worry about escaping illegal characters like < and & because these would normally be unescaped before the parser passed them back to the client application.

For example, JDBC makes it easy to write a SAX filter that puts an XML face on a SQL table. The parse() method reads from a JDBC ResultSet which it annotates with tags and attributes by firing the appropriate events. The entire ResultSet becomes a single root table element. Each record is presented in a record element. This record element contains one field element for each field. Each field element contains the value of the field as text. It also has an xsi:type attribute identifying the type of the field and a name attribute giving the name of the field. An alternate approach would use the SQL field names as the element names rather than the generic field with a name attribute. Example 8.13 demonstrates.

Example 8.13. Accessing databases through SAX

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.sql.*;
import java.io.*;


public class DatabaseFilter extends XMLFilterImpl {

  private Connection connection;
  
  // The string passed to the constructor must be a JDBC URL that
  // contains all necessary information for connecting to the
  // database such as host, port, username, password, and
  // database name. For example, 
  // jdbc:mysql://host:port]/dbname?user=username&password=pass
  // The driver should have been loaded before this method is
  // called
  public DatabaseFilter(String jdbcURL) throws SQLException {
    connection = DriverManager.getConnection(jdbcURL);
  }

  public void parse(InputSource in) throws SAXException, 
   IOException, UnsupportedOperationException {
    throw new UnsupportedOperationException(
     "Can't read a database from an InputStream or Reader"
    );
  }
  
  public void parse(String selectQuery) 
   throws SAXException, IOException {
    
    try {
      Statement statement = connection.createStatement();
      ResultSet data = statement.executeQuery(selectQuery);
      ResultSetMetaData metadata = data.getMetaData();
      int numFields = metadata.getColumnCount();
      
      Attributes emptyAttributes = new AttributesImpl();
      startElement("", "table", "table", emptyAttributes);
      
      while (data.next()) {
        startElement("", "record", "record", emptyAttributes);
        for (int field = 1; field <= numFields; field++) {
          AttributesImpl fieldAtts = new AttributesImpl();
          int type = metadata.getColumnType(field);
          String typeName = getSchemaType(type);
          fieldAtts.addAttribute(
           "http://www.w3.org/2001/XMLSchema-instance", "type", 
           "xsi:type", "NMTOKEN", typeName);
          String name = metadata.getColumnName(field);
          fieldAtts.addAttribute(
           "", "name", "name", "NMTOKEN", name);
          
          // Convert nulls to empty elements with xsi:nil="true"
          Object value = data.getObject(field);
          if (value == null) { // null value in database
            fieldAtts.addAttribute(
             "http://www.w3.org/2001/XMLSchema-instance", "nil", 
             "xsi:nil", "NMTOKEN", "true");
            startElement("", "field", "field", fieldAtts);
            endElement("", "field", "field");
          }
          else { // non-null value
            startElement("", "field", "field", fieldAtts);
            convertToXML(data, field, type);
            endElement("", "field", "field");
          }
        }
        endElement("", "record", "record");
      }
      endElement("", "table", "table");
      statement.close();
    }
    catch (SQLException e) {  // convert exception type
      throw new SAXException(e); 
    }
    
  }

  // I want the XML document to store values in the standard W3C
  // XML Schema Language forms. This requires certain conversions 
  // depending on the type of the data
  private void convertToXML(ResultSet data, int field, int type)
   throws SQLException, SAXException {

    switch (type) {
      case Types.BINARY: 
      case Types.VARBINARY: 
      case Types.LONGVARBINARY: 
        hexEncode(data.getBinaryStream(field));
        break;
      case Types.BLOB:
        Blob blob = data.getBlob(field);
        hexEncode(blob.getBinaryStream());
        break;
      case Types.CLOB: 
        Clob clob = data.getClob(field);
        Reader r = clob.getCharacterStream();
        char[] text = new char[1024];
        int numRead;
        try {
          while ((numRead = r.read(text, 0, 1024)) != -1) {
            escapeText(text, 0, numRead);
            characters(text, 0, numRead); 
          }
        }
        catch (IOException e) {
          throw new SAXException("Read from CLOB failed", e); 
        }
        break;
      case Types.ARRAY:
        Array array = data.getArray(field);
        writeArray(array);
        break;
      default: // All other types can be handled as strings
        Object o = data.getObject(field); 
        if (o == null) return;                
        String s = o.toString(); 
        char[] value = s.toCharArray();
        escapeText(value, 0, value.length);
        characters(value, 0, value.length);
    }     

  }
  
  private void hexEncode(InputStream in) 
   throws SQLException, SAXException {
    
    try {
      int octet;
      while ((octet = in.read()) != -1) {
        StringWriter out = new StringWriter(2);
        if (octet < 16) out.write('0');
        out.write(Integer.toHexString(octet));
        char[] text = out.toString().toCharArray();
        characters(text, 0, 2);
      }
    }
    catch (IOException e) {
      throw new SAXException(e);
    }
    
  }
 
  // String types may contain C0 control characters that are
  // not legal in XML. I convert these to the Unicode replacement
  // character 0xFFFD
  private void escapeText(char[] text, int start, int length) {
    for (int i = start; i < length; i++) {
        text[i] = escapeChar(text[i]);
    }
  }

  private char escapeChar(char c) {
    if (c >= 0x20) return c;
    else if (c == '\n') return c;
    else if (c == '\r') return c;
    else if (c == '\t') return c;
    return '\uFFFD';
  }
 
  private void writeArray(Array array) 
   throws SQLException, SAXException {
    
    ResultSet data = array.getResultSet();
    int type = array.getBaseType();
    String typeName = getSchemaType(type);

    while (data.next()) {
      AttributesImpl fieldAtts = new AttributesImpl();
      fieldAtts.addAttribute(
        "http://www.w3.org/2001/XMLSchema-instance", "type", 
        "xsi:type", "NMTOKEN", typeName);
      startElement("", "component", "component", fieldAtts);
      convertToXML(data, 2, type);
      endElement("", "component", "component");
    }
    
  }
  
  public static String getSchemaType(int type) {
   
    switch (type) {
      case Types.ARRAY:         return "array";
      case Types.BIGINT:        return "xsd:long";
      case Types.BINARY:        return "xsd:hexBinary";
      case Types.BIT:           return "xsd:boolean";
      case Types.BLOB:          return "xsd:hexBinary";
      case Types.CHAR:          return "xsd:string";
      case Types.CLOB:          return "xsd:string";
      case Types.DATE:          return "xsd:date";
      case Types.DECIMAL:       return "xsd:decimal";
      case Types.DOUBLE:        return "xsd:double";
      case Types.FLOAT:         return "xsd:decimal";
      case Types.INTEGER:       return "xsd:int";
      case Types.JAVA_OBJECT:   return "xsd:string";
      case Types.LONGVARBINARY: return "xsd:hexBinary";
      case Types.LONGVARCHAR:   return "xsd:string";
      case Types.NUMERIC:       return "xsd:decimal";
      case Types.REAL:          return "xsd:float";
      case Types.REF:           return "xsd:IDREF";
      case Types.SMALLINT:      return "xsd:short";
      case Types.STRUCT:        return "struct";
      case Types.TIME:          return "xsd:time";
      case Types.TIMESTAMP:     return "xsd:dateTime";
      case Types.TINYINT:       return "xsd:byte";
      case Types.VARBINARY:     return "xsd:hexBinary";
                                // most general type
      default:                  return "xsd:string"; 
    }
    
  }
  
  // Warn clients that this filter does not receive its events 
  // from another XML parser
  public void setParent(XMLReader parent) 
   throws UnsupportedOperationException {
    throw new UnsupportedOperationException(
     "A DatabaseFilter reads from an underlying SQL database;"
     + " not an underlying XML parser"
    );
  }
  
}

The trickiest part of this design was not the XML output. It was figuring out how to pass in the database connection parameters and the SQL queries. SUN has defined a JDBC URL. However, such a URL only indicates the database, username, and password to access. It does not go all the way down to the level of the SQL query, so it could not be used as a system ID. I chose instead to pass in the database connection parameters through constructors and the SQL query as a system ID string passed to the parse() method. This is not the customary URI system ID, but since this will only be used in this program and not directly in XML documents, this doesn’t cause any major problems. Nonetheless, I changed the name of the formal argument to the parse() method from systemID to SQLQuery to try to make this more obvious. This filter cannot parse InputSource objects because it’s really not possible to read a database from a stream. Thus the parse() method that takes an InputSource as an argument throws an UnsupportedOperationException.

A large part of the logic in this filter involves converting JDBC results into text. Most of the JDBC/SQL types have natural string representations which conveniently match W3C XML Schema Language primitive types. These can be retrieved by calling toString() on the corresponding Java object returned by the JDBC getObject() method. The binary types (BINARY, VARBINARY, LONGVARBINARY, and BLOB) have to be hex encoded first.[2] CLOBs don’t have to be hex encoded, but they aren’t available as a single string, and thus have to be read from a stream too. Finally, arrays and structs require more detailed treatment as a complex type rather than a simple type.

The sample driver program for this filter, shown in Example 8.14, is very similar to earlier driver programs except that it requires the user to specify two arguments on the command line, the JDBC URL giving the connection parameters and the SQL query to execute. Because SQL queries normally contain white space, it must be enclosed in double quotes. A more serious example would reuse the same driver (or filter) for multiple queries.

Example 8.14. A very simple user interface for extracting XML data from a relational database

import org.xml.sax.*;
import com.megginson.sax.DataWriter;


public class SQLDriver {

  public static void main(String[] args) {
    
    if (args.length < 2) {
      System.out.println(
       "Usage: java SQLDriver URL query driverClass");
      return;
    }
    String url = args[0];
    String query = args[1];
    String driverClass = "org.gjt.mm.mysql.Driver"; // MySQL
    if (args.length >= 3) driverClass = args[2];
    
    try {
      // Load JDBC driver
      Class.forName(driverClass).newInstance();
      // Technically, the newInstance() call isn't needed, 
      // but the MM.MySQL documentation suggests this to 
      // "work around some broken JVMs"
    
      XMLFilter filter = new DatabaseFilter(url);
      DataWriter out = new DataWriter();
      out.forceNSDecl(
       "http://www.w3.org/2001/XMLSchema-instance", "xsi");
      out.forceNSDecl("http://www.w3.org/2001/XMLSchema", "xsd");
      out.setIndentStep(2);
      filter.setContentHandler(out);
      filter.parse(query);
      out.flush();
    }
    catch (InstantiationException e) { 
      System.out.println(driverClass 
       + " could not be instantiated");
    }
    catch (ClassNotFoundException e) { 
      System.out.println(driverClass + " could not be found");
    }
    catch (Exception e) { // SQL, SAX, and IO
      e.printStackTrace();
      System.out.println(e);
    }
    
  }
  
}

Instead of XMLWriter, this driver uses David Megginson’s other public domain writer program, DataWriter, which is designed for record oriented data of this sort and inserts extra white space to prettify the output. Here’s some output from when I ran SQLDriver against one of my databases:

% java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader
 SQLDriver 'jdbc:mysql://luna.oit.unc.edu/NYWC?user=elharo&password=MyPassword' 
 "SELECT LastName, FirstName FROM composers
  WHERE LastName like 'A%'"
<table xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record>
    <field xsi:type="xsd:string" name="LastName">Anderson</field>
    <field xsi:type="xsd:string" name="FirstName">Beth</field>
  </record>
  <record>
    <field xsi:type="xsd:string" name="LastName">Austin</field>
    <field xsi:type="xsd:string" name="FirstName">Dorothea</field>
  </record>
  <record>
    <field xsi:type="xsd:string" name="LastName">Austin</field>
    <field xsi:type="xsd:string" name="FirstName">Elizabeth</field>
  </record>
  <record>
    <field xsi:type="xsd:string" name="LastName">Ayers</field>
    <field xsi:type="xsd:string" name="FirstName">Lydia</field>
  </record>
</table>

Warning

Depending on platform, inputting the above arguments can be a little tricky. The SQL query contains white space, and the JDBC URL contains characters that are important to the shell like &. You may need to use quote marks to prevent some of these arguments from being interpreted by the shell. Details vary from platform to platform. Of course, this is just a quick hack to demonstrate the filter. A real program would provide a GUI that made these points moot.



[2] Base-64 encoding would be more efficient but I didn’t want to introduce another class library just to do Base-64 encoding.


Copyright 2001, 2002 Elliotte Rusty Haroldelharo@metalab.unc.eduLast Modified December 02, 2001
Up To Cafe con Leche