Input

Chapter 4. Converting Flat Files to XML

By far the hardest part of this or any similar problem is parsing the non-XML input data. Everything else pales in comparison. Unlike with XML, you generally cannot rely on a library to do the hard work for you. You have to do it yourself. Also unlike with XML, there’s little guarantee that the data is well-formed. It’s more likely than not that you will encounter incorrectly formatted data.

In this case, since the records are separated into lines, I’ll read each line, one at a time, using the readLine() method of java.io.BufferedReader. This method works well enough as long as the data is in a file, though it’s potentially buggy when the data is served over a network socket.
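As a minimal sketch of this line-at-a-time loop, the following collects every line from a Reader. The class name LineReaderSketch and the method readLines() are hypothetical names chosen for illustration, and a StringReader stands in for the budget file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the class and method names are assumptions,
// not part of Example 4.1.
class LineReaderSketch {

  // Reads every line from the source. readLine() strips the line
  // terminator and returns null at the end of the stream.
  static List<String> readLines(Reader src) throws IOException {
    BufferedReader in = new BufferedReader(src);
    List<String> lines = new ArrayList<String>();
    String line;
    while ((line = in.readLine()) != null) {
      lines.add(line);
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    // Made-up input that mixes \n and \r\n line endings
    Reader src = new StringReader("first,record\nsecond,record\r\nthird,record");
    List<String> lines = readLines(src);
    System.out.println(lines.size() + " records read");
  }

}
```

Because readLine() recognizes a linefeed, a carriage return, or a carriage return followed by a linefeed as a line terminator, the same loop handles files produced on different platforms.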

Each line is dissected into its component fields inside the splitLine() method. Each record is stored in its own map. The keys for the map are read from a constant array, because the fields are always in the same positions in each record.
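To make the positional mapping concrete, here is a small sketch that pairs already-split field values with keys taken from a constant array. The class RecordMapSketch, the shortened KEYS array, and the sample values are all invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of one record stored in its own map.
class RecordMapSketch {

  // A shortened key list; the real list has one entry per field.
  static final String[] KEYS = {"AgencyCode", "AgencyName", "FY1976"};

  // The i-th field always belongs to the i-th key.
  static Map<String, String> toRecord(String[] fields) {
    Map<String, String> record = new HashMap<String, String>();
    for (int i = 0; i < KEYS.length; i++) {
      record.put(KEYS[i], fields[i]);
    }
    return record;
  }

  public static void main(String[] args) {
    // Invented sample values, not taken from the budget data
    String[] fields = {"A1", "Example Agency", "100"};
    Map<String, String> record = toRecord(fields);
    System.out.println(record.get("AgencyName"));
  }

}
```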

Caution

For parsing the data out of each line, a lot of Java programmers immediately reach for the java.util.StringTokenizer or java.io.StreamTokenizer classes. Don’t. These classes are very strangely designed and rarely do what developers expect them to do. For example, if StreamTokenizer encounters a \n inside a string literal, it will convert it to a linefeed. This makes sense when parsing Java source code, but in most other environments \n is just two more characters with no special meaning. Java’s tokenizer classes are designed for and suited to parsing Java source code. They are not suitable for reading tab-delimited or comma-delimited data. If you want to design your program around a tokenization function, you should write one yourself that behaves appropriately for your data format.
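Two of these surprises are easy to reproduce. The sketch below, in a hypothetical class named TokenizerPitfalls, shows StringTokenizer silently dropping an empty field and StreamTokenizer rewriting a backslash followed by n inside a quoted string. The input strings are made up.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.StringTokenizer;

// Hypothetical demonstration class; not part of Example 4.1.
class TokenizerPitfalls {

  // Counts the tokens StringTokenizer reports for a comma-delimited line.
  static int countTokens(String line) {
    return new StringTokenizer(line, ",").countTokens();
  }

  // Returns the string value StreamTokenizer produces for the first token.
  static String firstStringValue(String input) throws IOException {
    StreamTokenizer st = new StreamTokenizer(new StringReader(input));
    st.nextToken();
    return st.sval;
  }

  public static void main(String[] args) throws IOException {
    // Three fields, the middle one empty, but adjacent delimiters are
    // collapsed, so the empty field disappears.
    System.out.println(countTokens("a,,c"));   // prints 2

    // The input holds six characters: quote, a, backslash, n, b, quote.
    // StreamTokenizer replaces the backslash and n with one linefeed.
    String value = firstStringValue("\"a\\nb\"");
    System.out.println(value.length());        // prints 3
  }

}
```

With 43 positional fields per record, a single dropped empty field shifts every later value onto the wrong key.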

Example 4.1 shows the input code. To use this, open an input stream to the file containing the budget data and pass that stream as an argument to the parse() method. You’ll get back a List containing the parsed data. Each object in this list is a Map containing the data for one line item. Both keys and values for this map are strings. Since the keys are constant, they’re stored in a final static array named keys. At various times I plan to use the keys as XML element names, XML attribute names, or SQL field names. Therefore, it’s necessary to begin them all with letters. Thus the keys for the fiscal year fields will be named FY1976, FY1977, FY1978, and so forth instead of just 1976, 1977, 1978, and so forth. This means the keys can’t trivially be stored as ints. However, that would not have been possible anyway, because one of the year fields is the transitional quarter in 1976, which does not represent a full year and does not have a numeric name.
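The naming scheme for the year fields can be sketched as follows. The class FiscalYearKeys and both of its methods are hypothetical. Example 4.1 simply lists the keys by hand. The check looks only at the first character, so it is not a complete test of XML name validity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; Example 4.1 hard-codes these names instead.
class FiscalYearKeys {

  // Builds FY-prefixed keys for each year, inserting the one
  // non-numeric key right after fiscal year 1976.
  static List<String> yearKeys(int firstYear, int lastYear) {
    List<String> keys = new ArrayList<String>();
    for (int year = firstYear; year <= lastYear; year++) {
      keys.add("FY" + year);
      if (year == 1976) {
        keys.add("TransitionQuarter");
      }
    }
    return keys;
  }

  // True if the name begins with a letter, which a bare year does not.
  static boolean startsWithLetter(String name) {
    return name.length() > 0 && Character.isLetter(name.charAt(0));
  }

  public static void main(String[] args) {
    List<String> keys = yearKeys(1976, 2006);
    System.out.println(keys.size());                 // prints 32
    System.out.println(startsWithLetter("1976"));    // prints false
    System.out.println(startsWithLetter("FY1976"));  // prints true
  }

}
```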

Caution

In 1976 the government’s fiscal year shifted forward one quarter. That means that the 1977 fiscal year started in October, a quarter after the 1976 fiscal year ended. There was a transitional quarter from July through September that year, so not all the data actually represents a whole year. Here the special case is inherent in the data itself, which means the data can’t be fixed. It still requires extra code, however, and makes the examples less clean than they otherwise would be.

This sort of funky data (a year with only three months in it that can easily be confused with another year) is exactly the sort of thing you have to watch out for when processing legacy data. The real world does not always fit into neatly typed categories. There's almost always some outlier data that just doesn't fit the schema. All too often it's been forced into the existing system by some manager or data entry clerk in ways the original designers never intended. This happens all the time. You cannot assume the data actually adheres to its schema, either implicit or explicit.

The code to parse each line of input is hidden inside the private splitLine() method. This code is relatively complex. It iterates through the record looking for comma delimiter characters. However, it has to ignore commas that appear inside quoted strings. Furthermore, it must also recognize that the end of the string delimits the last token. Even so, this method is not very robust. It throws an uncaught StringIndexOutOfBoundsException when the record runs out before every key has a value, which is what happens if there are too few fields or, usually, if a quote is omitted. It will not notice and report the error if a record contains too many fields.

Example 4.1. A class that parses comma-separated values into a List of HashMaps

import java.io.*;
import java.util.*;


public class BudgetData {

  public static List parse(InputStream src) throws IOException {
      
    // The document as published by the OMB is encoded in Latin-1
    InputStreamReader isr = new InputStreamReader(src, "8859_1");
    BufferedReader in = new BufferedReader(isr);
    List records = new ArrayList();  
    String lineItem;
    while ((lineItem = in.readLine()) != null) {
      records.add(splitLine(lineItem));
    }       
    return records;
        
  } 

  // the field names in order
  public final static String[] keys = {
    "AgencyCode",
    "AgencyName",
    "BureauCode",
    "BureauName",
    "AccountCode",
    "AccountName",
    "TreasuryAgencyCode",
    "SubfunctionCode",
    "SubfunctionTitle",
    "BEACategory",
    "On-Off-BudgetIndicator",
    "FY1976", "TransitionQuarter", "FY1977", "FY1978", "FY1979",  
    "FY1980", "FY1981", "FY1982", "FY1983", "FY1984", "FY1985",  
    "FY1986", "FY1987", "FY1988", "FY1989", "FY1990", "FY1991",  
    "FY1992", "FY1993", "FY1994", "FY1995", "FY1996", "FY1997",  
    "FY1998", "FY1999", "FY2000", "FY2001", "FY2002", "FY2003", 
    "FY2004", "FY2005", "FY2006" 
   };

  private static Map splitLine(String record) {
     
    record = record.trim();
    
    int index = 0;
    Map result = new HashMap();
    for (int i = 0; i < keys.length; i++) {
      //find the next comma    
      StringBuffer sb = new StringBuffer();
      char c;
      boolean inString = false;
      while (true) {
        c = record.charAt(index);
        if (!inString && c == '"') inString = true;
        else if (inString && c == '"') inString = false;
        else if (!inString && c == ',') break;
        else sb.append(c);
        index++;
        if (index == record.length()) break;
      }
      String s = sb.toString().trim();
      result.put(keys[i], s);
      index++;
    }  
        
    return result;   
        
  } 

}
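
The weaknesses of splitLine() noted earlier could be addressed along the following lines. This is only a sketch under stated assumptions, not a replacement for Example 4.1: the class StrictCsvSplitter, its split() method, and the choice of IllegalArgumentException are all hypothetical. Like splitLine(), it drops the quote characters and trims each field, but it reports an unterminated quoted string and any record with the wrong number of fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stricter variant of the field-splitting logic.
class StrictCsvSplitter {

  // Splits one record on commas that fall outside double-quoted strings.
  // Throws IllegalArgumentException if a quoted string is never closed
  // or if the number of fields differs from expectedFields.
  static List<String> split(String record, int expectedFields) {
    List<String> fields = new ArrayList<String>();
    StringBuilder field = new StringBuilder();
    boolean inString = false;
    for (int i = 0; i < record.length(); i++) {
      char c = record.charAt(i);
      if (c == '"') {
        inString = !inString;              // quotes toggle state and are dropped
      } else if (c == ',' && !inString) {
        fields.add(field.toString().trim());
        field.setLength(0);                // start the next field
      } else {
        field.append(c);
      }
    }
    if (inString) {
      throw new IllegalArgumentException("Unterminated quoted string: " + record);
    }
    fields.add(field.toString().trim());   // end of record closes the last field
    if (fields.size() != expectedFields) {
      throw new IllegalArgumentException("Expected " + expectedFields
          + " fields but found " + fields.size() + ": " + record);
    }
    return fields;
  }

  public static void main(String[] args) {
    // Made-up record with a comma inside quotes and an empty last field
    System.out.println(split("\"Smith, John\",42,", 3));
  }

}
```

A caller would pass keys.length as the expected count and decide whether to skip, log, or abort when the exception is thrown. Walking the record once and counting afterward also means a trailing empty field is kept rather than causing an index error.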


Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified August 21, 2001