Determining the Output Format

Once all the data has been stored in a convenient data structure, you have to write it out as XML. The first step in this process is deciding what the XML should look like. XML offers the possibility of a much more hierarchical structure than is available in the flat format. For instance, you could sort budgets by department and year. You might use the natural hierarchy of the government in which agencies contain bureaus, bureaus contain accounts, and accounts contain subfunctions. I’ll demonstrate several possibilities, beginning with the simplest, a fairly flat XML representation that stays very close to the original flat format.

As published, the data is essentially one table. Thus the simplest output format merely duplicates this table structure. The table will be the root element, here called Budget. Each record in the table—that is, each line in the text file—will be encoded as a separate LineItem element. Each field will be encoded in a child element of the LineItem element whose name is the name of that field. Example 4.2 produces this format.

Example 4.2. Naively reproducing the original table structure in XML

import java.io.*;
import java.util.*;


public class FlatXMLBudget {

  public static void convert(List data, OutputStream out) 
   throws IOException {
      
    Writer wout = new OutputStreamWriter(out, "UTF8"); 
    wout.write("<?xml version=\"1.0\"?>\r\n");
    wout.write("<Budget>\r\n");
          
    Iterator records = data.iterator();
    while (records.hasNext()) {
      wout.write("  <LineItem>\r\n");
      Map record = (Map) records.next();
      Set fields = record.entrySet();
      Iterator entries = fields.iterator();
      while (entries.hasNext()) {
        Map.Entry entry = (Map.Entry) entries.next();
        String name = (String) entry.getKey();
        String value = (String) entry.getValue();
        // some of the values contain ampersands and less than
        // signs that must be escaped
        value = escapeText(value);
        
        wout.write("    <" + name + ">");
        wout.write(value);        
        wout.write("</" + name + ">\r\n");
      }
      wout.write("  </LineItem>\r\n");
    }
    wout.write("</Budget>\r\n");
    wout.flush();
        
  } 

  public static String escapeText(String s) {
   
    if (s.indexOf('&') != -1 || s.indexOf('<') != -1 
     || s.indexOf('>') != -1) {
      StringBuffer result = new StringBuffer(s.length() + 4);
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '&') result.append("&amp;");
        else if (c == '<') result.append("&lt;");
        else if (c == '>') result.append("&gt;");
        else result.append(c);
      }
      return result.toString();  
    }
    else {
      return s;   
    }
        
  }

  public static void main(String[] args) {
  
    try {
        
      if (args.length < 1) {
       System.out.println("Usage: FlatXMLBudget infile outfile");
       return;
      }
      
      InputStream in = new FileInputStream(args[0]); 
      OutputStream out; 
      if (args.length < 2) {
        out = System.out;
      }
      else {
        out = new FileOutputStream(args[1]); 
      }

      List results = BudgetData.parse(in);
      convert(results, out);
    }
    catch (IOException e) {
      System.err.println(e);       
    }
  
  }

}

The main() method reads the names of the input and output files from the command line. If no output file is named, it writes to System.out. It parses the input file using the previously designed BudgetData.parse() method. This produces a List of line items. This list is passed to the convert() method along with the OutputStream the output will be written to. I used an OutputStream rather than a Writer here because a Writer that is handed to you has already fixed its encoding, whereas with an OutputStream you can choose the encoding when you chain the Writer to it.
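
The point about encodings can be seen in a minimal sketch. The EncodingChoice class and its encode() method are hypothetical names introduced only for this illustration; they are not part of the budget program. The method chains an OutputStreamWriter to a byte stream with a caller-supplied encoding name, so the same text comes out as different bytes depending on that choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class EncodingChoice {

  // Hypothetical helper: encodes text by chaining a Writer to an
  // OutputStream, as convert() does, but with the encoding name
  // supplied by the caller
  static byte[] encode(String text, String encoding) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      Writer wout = new OutputStreamWriter(out, encoding);
      wout.write(text);
      wout.flush();
      return out.toByteArray();
    }
    catch (IOException ex) {
      // an unknown encoding name also lands here
      throw new IllegalStateException(ex);
    }
  }

}
```

For instance, the single character é (\u00E9) is encoded as two bytes in UTF-8 but only one byte in ISO-8859-1.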

The convert() method iterates through that list. It extracts each record in turn, outputs a <LineItem> start-tag, then iterates through the Map representing the record, outputting each field in turn. The keys in the Map serve double duty, also becoming the names of the child elements of LineItem. The escapeText() method turns any <, >, or & characters that appear in the value into their respective escape sequences. Finally, the </LineItem> end-tag is output.

To some extent I’ve exposed more of the details here than is ideal. This group of classes is not as well encapsulated as I’d normally prefer. For the most part that’s because I’m deliberately trying to show you how everything works internally. I’m not viewing this as reusable code. If I did need to make this more reusable, I’d probably define a Budget class that contained a list of LineItem objects. Public methods in these classes would hide the detailed storage as a List of Maps.
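
A more encapsulated design along those lines might look something like the following sketch. It is only one possible shape: the method names (getField(), getAmount(), add(), getLineItems()) are assumptions made up for this illustration, and nothing else in the chapter depends on them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around one record; hides the underlying Map
class LineItem {

  private final Map<String, String> fields;

  LineItem(Map<String, String> fields) {
    // copy the map so later changes by the caller don't leak in
    this.fields = new HashMap<String, String>(fields);
  }

  String getField(String name) {
    return fields.get(name);
  }

  // Assumes the yearly amounts are stored under keys like "FY1976"
  String getAmount(int fiscalYear) {
    return fields.get("FY" + fiscalYear);
  }

}

// Hypothetical wrapper around the table; hides the underlying List
class Budget {

  private final List<LineItem> items = new ArrayList<LineItem>();

  void add(LineItem item) {
    items.add(item);
  }

  int size() {
    return items.size();
  }

  List<LineItem> getLineItems() {
    return Collections.unmodifiableList(items);
  }

}
```

Client code would then ask a LineItem for its agency name or its 1976 amount without knowing that a Map sits underneath.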

Here’s the prolog, root element start-tag, and first record from the XML output:

<?xml version="1.0"?>
<Budget>
  <LineItem>
    <FY1994>0</FY1994>
    <FY1993>0</FY1993>
    <FY1992>0</FY1992>
    <FY1991>0</FY1991>
    <FY1990>0</FY1990>
    <AccountCode></AccountCode>
    <On-Off-BudgetIndicator>On-budget</On-Off-BudgetIndicator>
    <FY1989>0</FY1989>
    <AccountName>Receipts, Central fiscal operations</AccountName>
    <FY1988>0</FY1988>
    <FY1987>0</FY1987>
    <FY1986>0</FY1986>
    <FY1985>0</FY1985>
    <FY1984>0</FY1984>
    <FY1983>0</FY1983>
    <FY1982>0</FY1982>
    <SubfunctionCode>803</SubfunctionCode>
    <FY1981>0</FY1981>
    <FY2006>0</FY2006>
    <FY1980>0</FY1980>
    <FY2005>0</FY2005>
    <FY2004>0</FY2004>
    <FY2003>0</FY2003>
    <FY2002>0</FY2002>
    <FY2001>0</FY2001>
    <FY2000>0</FY2000>
    <AgencyCode>001</AgencyCode>
    <BEACategory>Mandatory</BEACategory>
    <TransitionQuarter>-132</TransitionQuarter>
    <FY1979>-726</FY1979>
    <FY1978>-385</FY1978>
    <FY1977>-429</FY1977>
    <FY1976>-287</FY1976>
    <TreasuryAgencyCode></TreasuryAgencyCode>
    <AgencyName>Legislative Branch</AgencyName>
    <BureauCode>00</BureauCode>
    <BureauName>Legislative Branch</BureauName>
    <FY1999>0</FY1999>
    <FY1998>0</FY1998>
    <FY1997>0</FY1997>
    <FY1996>0</FY1996>
    <SubfunctionTitle>Central fiscal operations</SubfunctionTitle>
    <FY1995>0</FY1995>
  </LineItem>

As you can see, one thing lost in the transition to XML is the order of the input fields. This is a side effect of storing each record as an unordered HashMap. However, order is really not important for this use-case, so nothing significant has been lost. Order is rarely a problem for data-oriented XML applications. If order is important in your data, you can maintain it by using a list or an array to hold the fields instead of a map. You could even use a SortedMap with a custom Comparator, or with keys that implement the Comparable interface, to define the order.
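
As a rough sketch of the SortedMap idea, the following hypothetical OrderedFields class copies a record into a TreeMap whose Comparator ranks each field name by its position in a list of names. The class, its method names, and the idea of passing the desired order as a List are all assumptions made for this illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class OrderedFields {

  // Compares field names by their position in a fixed list.
  // Names missing from the list sort after all the known names,
  // alphabetically among themselves.
  static Comparator<String> byPosition(final List<String> order) {
    return new Comparator<String>() {
      public int compare(String a, String b) {
        int ia = order.indexOf(a);
        int ib = order.indexOf(b);
        if (ia == -1) ia = Integer.MAX_VALUE;
        if (ib == -1) ib = Integer.MAX_VALUE;
        if (ia != ib) return ia < ib ? -1 : 1;
        return a.compareTo(b);
      }
    };
  }

  // Copies the record into a map that iterates in the given order
  static SortedMap<String, String> reorder(
   Map<String, String> record, List<String> order) {
    SortedMap<String, String> sorted
     = new TreeMap<String, String>(byPosition(order));
    sorted.putAll(record);
    return sorted;
  }

}
```

Iterating over the entrySet() of the reordered map then visits the fields in the listed order. If the code that builds each record already inserts the fields in input order, a java.util.LinkedHashMap would preserve that order without any comparator at all.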

The architecture used here is somewhat memory intensive. The entire input data is read in and parsed before the first line is written out. Since in this case the size of the input data is reasonably small for modern systems (just over a megabyte), this isn’t really an issue. However, if the data were much bigger it would make sense to stream it in reasonably sized chunks. Each line item could be output as soon as it was read rather than waiting for the last one to be read. For example, instead of storing each record in a big list, the BudgetData.parse() method could immediately write the data onto a Writer:

  public static void parse(InputStream src, OutputStream out) 
   throws IOException {
      
    Writer wout = new OutputStreamWriter(out, "UTF8"); 
    wout.write("<?xml version=\"1.0\"?>\r\n");
    wout.write("<Budget>\r\n");
    
    // The document as published by the OMB is encoded in Latin-1
    InputStreamReader isr = new InputStreamReader(src, "8859_1");
    BufferedReader in = new BufferedReader(isr);
    String lineItem;
    while ((lineItem = in.readLine()) != null) {
      // splitLine() returns one record: a Map of field names to values
      Map record = splitLine(lineItem);
      wout.write("  <LineItem>\r\n");
      Set fields = record.entrySet();
      Iterator entries = fields.iterator();
      while (entries.hasNext()) {
        Map.Entry entry = (Map.Entry) entries.next();
        String name = (String) entry.getKey();
        String value = (String) entry.getValue();
        // some of the values contain ampersands and less than
        // signs that must be escaped
        value = escapeText(value);
        
        wout.write("    <" + name + ">");
        wout.write(value);        
        wout.write("</" + name + ">\r\n");
      }  // end entries loop
      wout.write("  </LineItem>\r\n");
    }  // end lineitem loop
        
    wout.write("</Budget>\r\n");
    wout.flush();     
        
  }

The disadvantage to this approach is that it only works when the output is fairly close to the input. In this case, each line of text in the input produces a single element in the output. The other extreme would be reordering the line items by year, essentially exchanging the rows and the columns. In this case, each element of output would include some content from every line of text. We'll see some examples of this (and some in-between cases) shortly. In these cases, a streaming architecture would require multiple passes over the input data and be very inefficient. Still, if the data set were large enough, that might be the only workable approach.

Validation

Error checking in this program is minimal. For instance, the convert() method does not check to see that each item in the List is in fact a Map. If somebody passes in a List that contains non-Map objects, an uncaught ClassCastException will be thrown. convert() also doesn’t check that each Map has all 43 required entries or that the keys in the Map match the known key list. No exception will be thrown in these cases. However, if a record turns out to be short a few fields or to have the wrong fields, then validation of the output should pick it up.
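
If you wanted to catch such problems before any output is written, a check along the following lines could be run on each item in the List. This is only a sketch: the RecordCheck class and its check() method are hypothetical, and the list of expected keys is passed in by the caller rather than taken from BudgetData.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RecordCheck {

  // Returns a list of problems found in one item of the List.
  // The list is empty when the item is a Map with exactly the
  // expected keys.
  static List<String> check(Object item, List<String> expectedKeys) {
    List<String> problems = new ArrayList<String>();
    if (!(item instanceof Map)) {
      problems.add("not a Map: " + item);
      return problems;
    }
    Map<?, ?> record = (Map<?, ?>) item;
    for (String key : expectedKeys) {
      if (!record.containsKey(key)) {
        problems.add("missing field: " + key);
      }
    }
    for (Object key : record.keySet()) {
      if (!expectedKeys.contains(key)) {
        problems.add("unexpected field: " + key);
      }
    }
    return problems;
  }

}
```

A caller could log the returned problems, skip the record, or throw an exception, depending on how strict the conversion needs to be.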

The first test you should make of the output is for simple well-formedness. Whatever tool you normally use to check well-formedness, for example, sax.Counter from Xerces-J, will suffice. The first time I ran the output from this program through sax.Counter, it told me the output was indeed not well-formed. The problem was that some of the names and titles used & characters that had to be escaped, so I added the escapeText() method to the FlatXMLBudget class, and reran the program. I checked my output again, and it was still malformed. Although I’d written the escapeText() method I’d forgotten to invoke it. I edited FlatXMLBudget one more time and reran the program. My output was still malformed. This time I’d forgotten to recompile the program before rerunning it, so I recompiled and ran it one more time. This time my output proved well-formed. This is just normal back-and-forth development. Although the test is specific to XML, there’s nothing XML-specific about the process. You keep fixing problems, and trying again until you succeed. Many fixes will simply expose new bugs and errors. Eventually you fix the last one, and the program runs correctly.
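
If you would rather run the well-formedness test from inside a Java program than from a command-line tool, something like the following sketch would do. It is not what I described above: it swaps sax.Counter for the standard JAXP SAXParserFactory API, and the WellFormedCheck class name is made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

  // Returns true if the text parses as well-formed XML.
  // No validation is attempted.
  static boolean isWellFormed(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser parser = factory.newSAXParser();
      // DefaultHandler ignores all content but rethrows fatal errors
      parser.parse(new InputSource(new StringReader(xml)),
       new DefaultHandler());
      return true;
    }
    catch (SAXException ex) {
      // a well-formedness error is reported as a SAXParseException
      return false;
    }
    catch (ParserConfigurationException | IOException ex) {
      throw new IllegalStateException(ex);
    }
  }

}
```

An unescaped ampersand in an agency name, the very bug described above, makes this method return false.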

Once you’re confident your program is outputting well-formed XML, the next step is to validate it. Even if your use-case doesn’t actually require valid XML, validation is still an invaluable testing and debugging tool. Example 4.3 is a schema for the intended output of Example 4.2. I chose to use a schema rather than a DTD so I could check that each field contained the correct form of data.

Example 4.3. A schema for the XML budget data

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="Budget">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="LineItem" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:all>
              <xsd:element name="AccountCode">
                <xsd:simpleType>
                  <xsd:union 
                       memberTypes="FourDigitCode SixDigitCode"/>
                </xsd:simpleType> 
              </xsd:element>
              <xsd:element name="AccountName">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:string">
                    <xsd:maxLength value="160"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="AgencyCode" 
                           type="ThreeDigitCode"/>
              <xsd:element name="TreasuryAgencyCode" 
                           type="TwoDigitCode"/>
              <xsd:element name="AgencyName">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:string">
                    <xsd:maxLength value="89"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="BEACategory">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="Mandatory"/>
                    <xsd:enumeration value="Discretionary"/>
                    <xsd:enumeration value="Net interest"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="BureauCode" 
                           type="TwoDigitCode"/>
              <xsd:element name="BureauName">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:string">
                    <xsd:maxLength value="89"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="On-Off-BudgetIndicator">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="On-budget"/>
                    <xsd:enumeration value="Off-budget"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="SubfunctionCode" 
                           type="ThreeDigitCode"/>
              <xsd:element name="SubfunctionTitle">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:string">
                    <xsd:maxLength value="72"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="FY1976" type="xsd:integer"/>
              <xsd:element name="TransitionQuarter" 
                           type="xsd:integer"/>
              <xsd:element name="FY1977" type="xsd:integer"/>
              <xsd:element name="FY1978" type="xsd:integer"/>
              <xsd:element name="FY1979" type="xsd:integer"/>
              <xsd:element name="FY1980" type="xsd:integer"/>
              <xsd:element name="FY1981" type="xsd:integer"/>
              <xsd:element name="FY1982" type="xsd:integer"/>
              <xsd:element name="FY1983" type="xsd:integer"/>
              <xsd:element name="FY1984" type="xsd:integer"/>
              <xsd:element name="FY1985" type="xsd:integer"/>
              <xsd:element name="FY1986" type="xsd:integer"/>
              <xsd:element name="FY1987" type="xsd:integer"/>
              <xsd:element name="FY1988" type="xsd:integer"/>
              <xsd:element name="FY1989" type="xsd:integer"/>
              <xsd:element name="FY1990" type="xsd:integer"/>
              <xsd:element name="FY1991" type="xsd:integer"/>
              <xsd:element name="FY1992" type="xsd:integer"/>
              <xsd:element name="FY1993" type="xsd:integer"/>
              <xsd:element name="FY1994" type="xsd:integer"/>
              <xsd:element name="FY1995" type="xsd:integer"/>
              <xsd:element name="FY1996" type="xsd:integer"/>
              <xsd:element name="FY1997" type="xsd:integer"/>
              <xsd:element name="FY1998" type="xsd:integer"/>
              <xsd:element name="FY1999" type="xsd:integer"/>
              <xsd:element name="FY2000" type="xsd:integer"/>
              <xsd:element name="FY2001" type="xsd:integer"/>
              <xsd:element name="FY2002" type="xsd:integer"/>
              <xsd:element name="FY2003" type="xsd:integer"/>
              <xsd:element name="FY2004" type="xsd:integer"/>
              <xsd:element name="FY2005" type="xsd:integer"/>
              <xsd:element name="FY2006" type="xsd:integer"/>
            </xsd:all>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>  
  </xsd:element>

  <xsd:simpleType name="TwoDigitCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9][0-9]"/>
    </xsd:restriction>
  </xsd:simpleType> 
  
  <xsd:simpleType name="ThreeDigitCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9][0-9][0-9]"/>
    </xsd:restriction>
  </xsd:simpleType> 
  
  <xsd:simpleType name="FourDigitCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9][0-9][0-9][0-9]"/>
    </xsd:restriction>
  </xsd:simpleType> 
  
  <xsd:simpleType name="SixDigitCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9][0-9][0-9][0-9][0-9][0-9]"/>
    </xsd:restriction>
  </xsd:simpleType> 
  
</xsd:schema>

As well as indicating bugs in your code that you need to fix, schema checking can uncover dirty input data that isn’t what it’s supposed to be. This happens all the time in real-world data. When I validated the XML output against this schema, a few problems did crop up. A few account codes were blank, and the third BEA category was actually “Net interest”, not “Net Interest” as the documentation claimed. However, these problems were quite minor overall. In fact, I was quite surprised that there weren’t more problems. The programmers at the OMB are doing a much-better-than-average job.
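
Schema validation can also be driven from Java code. The following is a minimal sketch that assumes the standard javax.xml.validation API, which is not necessarily the tool used for the checks described above; the SchemaCheck class and its firstError() method are hypothetical names. It reports the first validity error rather than collecting all of them.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaCheck {

  // Returns null if the document is valid against the schema;
  // otherwise returns the message of the first error. A broken
  // schema is reported the same way as an invalid document.
  static String firstError(String schemaText, String xml) {
    try {
      SchemaFactory factory = SchemaFactory.newInstance(
       XMLConstants.W3C_XML_SCHEMA_NS_URI);
      Schema schema = factory.newSchema(
       new StreamSource(new StringReader(schemaText)));
      Validator validator = schema.newValidator();
      // with no ErrorHandler set, the first error throws
      validator.validate(new StreamSource(new StringReader(xml)));
      return null;
    }
    catch (SAXException ex) {
      return ex.getMessage();
    }
    catch (IOException ex) {
      throw new IllegalStateException(ex);
    }
  }

}
```

Given a schema that enumerates “Net interest” as a legal BEACategory value, a document containing “Net Interest” would produce an error message, because enumeration values are compared case-sensitively.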

When faced with dirty data that’s static, the normal solution is to edit the input data by hand and rerun the program. However, if the data is dynamic (e.g., the sales figures for various retail locations submitted to a central server when each franchise closes down for the night), then you’ll have to write extra code to massage the bad data into an acceptable format, at least until you can fix the process that’s generating the dirty data.
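
Such massaging code is usually short. The sketch below assumes a hypothetical feed in which the BEACategory field sometimes arrives with stray whitespace or inconsistent capitalization; the RecordCleaner class is invented for this illustration and handles only that one field.

```java
import java.util.HashMap;
import java.util.Map;

class RecordCleaner {

  // Returns a cleaned copy of the record; the original is untouched
  static Map<String, String> clean(Map<String, String> record) {
    Map<String, String> result
     = new HashMap<String, String>(record);
    String category = result.get("BEACategory");
    if (category != null) {
      category = category.trim();
      // normalize to the spelling the schema enumerates
      if (category.equalsIgnoreCase("Net interest")) {
        category = "Net interest";
      }
      result.put("BEACategory", category);
    }
    return result;
  }

}
```

Each record would pass through clean() before being written out, so the output validates even while the upstream problem persists.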

Attributes

Two facts, that the order of child elements really isn’t significant within a LineItem and that no child element is repeated, suggest that some of this data could be stored as attributes. Besides the obvious change in the convert() method, this also requires changing the escapeText() method to escape any embedded quotes, which are legal in element content but cannot appear unescaped in an attribute value delimited by the same kind of quote. Example 4.4 demonstrates.

Example 4.4. Converting to XML with attributes

import java.io.*;
import java.util.*;


public class AttributesXMLBudget {

  public static void convert(List data, OutputStream out) 
   throws IOException {
      
    Writer wout = new OutputStreamWriter(out, "UTF8"); 
    wout.write("<?xml version=\"1.0\"?>\r\n");
    wout.write("<Budget>\r\n");
          
    Iterator records = data.iterator();
    while (records.hasNext()) {
      // leave the start-tag open; its > follows the attributes
      wout.write("  <LineItem");
      Map record = (Map) records.next();

      // write the attributes
      writeAttribute(wout, "AgencyCode", record);
      writeAttribute(wout, "AgencyName", record);
      writeAttribute(wout, "BureauCode", record);
      writeAttribute(wout, "BureauName", record);
      writeAttribute(wout, "AccountCode", record);
      writeAttribute(wout, "AccountName", record);
      writeAttribute(wout, "TreasuryAgencyCode", record);
      writeAttribute(wout, "SubfunctionCode", record);
      writeAttribute(wout, "SubfunctionTitle", record);
      writeAttribute(wout, "BEACategory", record);
      writeAttribute(wout, "On-Off-BudgetIndicator", record);
      wout.write(">\r\n");
      writeAmount(wout, 1976, record);
      wout.write("    <Amount year='TransitionQuarter'>");
      wout.write(
        escapeText((String) record.get("TransitionQuarter"))
      );
      wout.write("</Amount>\r\n");
      for (int year=1977; year <= 2006; year++) {
        writeAmount(wout, year, record);
      }
      wout.write("  </LineItem>\r\n");
    }
    wout.write("</Budget>\r\n");
    wout.flush();
        
  } 

  // Just a couple of private methods to factor out repeated code 
  private static void writeAttribute(Writer out, String name, 
   Map record) throws IOException {
    out.write(" " + name + "='" 
     + escapeText((String) record.get(name)) + "'");       
  }

  private static void writeAmount(Writer out, int year, 
   Map record) throws IOException {
    out.write("    <Amount year='" + year + "'>");
    out.write(escapeText((String) record.get("FY" + year)));
    out.write("</Amount>\r\n");
  }

  public static String escapeText(String s) {
   
    if (s.indexOf('&') != -1 || s.indexOf('<') != -1
     || s.indexOf('>') != -1 || s.indexOf('"') != -1
     || s.indexOf('\'') != -1 ) {
      StringBuffer result = new StringBuffer(s.length() + 6);
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '&') result.append("&amp;");
        else if (c == '<') result.append("&lt;");
        else if (c == '"') result.append("&quot;");
        else if (c == '\'') result.append("&apos;");
        else if (c == '>') result.append("&gt;");
        else result.append(c);
      }
      return result.toString();  
    }
    else {
      return s;   
    }
        
  }

  public static void main(String[] args) {
  
    try {
        
      if (args.length < 1) {
        System.out.println(
         "Usage: AttributesXMLBudget infile outfile"
        );
        return;
      }
      
      InputStream in = new FileInputStream(args[0]); 
      OutputStream out; 
      if (args.length < 2) {
        out = System.out;
      }
      else {
        out = new FileOutputStream(args[1]); 
      }

      List results = BudgetData.parse(in);
      convert(results, out);
    }
    catch (IOException ex) {
      System.err.println(ex);       
    }
  
  }

}

In this case, since different fields are treated differently, the program can’t just iterate through the map. It must ask for each field by name and put it where it belongs. This has the advantage of making the output a little more human legible, as this fragment of output demonstrates.

<?xml version="1.0"?>
<Budget>
  <LineItem AgencyCode='001' AgencyName='Legislative Branch' BureauCode='00' 
            BureauName='Legislative Branch' AccountCode='' 
            AccountName='Receipts, Central fiscal operations' 
            TreasuryAgencyCode='' SubfunctionCode='803' 
            SubfunctionTitle='Central fiscal operations' 
            BEACategory='Mandatory' On-Off-BudgetIndicator='On-budget'>
    <Amount year='1976'>-287</Amount>
    <Amount year='TransitionQuarter'>-132</Amount>
    <Amount year='1977'>-429</Amount>
    <Amount year='1978'>-385</Amount>
    <Amount year='1979'>-726</Amount>
    <Amount year='1980'>0</Amount>
    <Amount year='1981'>0</Amount>
    <Amount year='1982'>0</Amount>
    <Amount year='1983'>0</Amount>
    <Amount year='1984'>0</Amount>
    <Amount year='1985'>0</Amount>
    <Amount year='1986'>0</Amount>
    <Amount year='1987'>0</Amount>
    <Amount year='1988'>0</Amount>
    <Amount year='1989'>0</Amount>
    <Amount year='1990'>0</Amount>
    <Amount year='1991'>0</Amount>
    <Amount year='1992'>0</Amount>
    <Amount year='1993'>0</Amount>
    <Amount year='1994'>0</Amount>
    <Amount year='1995'>0</Amount>
    <Amount year='1996'>0</Amount>
    <Amount year='1997'>0</Amount>
    <Amount year='1998'>0</Amount>
    <Amount year='1999'>0</Amount>
    <Amount year='2000'>0</Amount>
    <Amount year='2001'>0</Amount>
    <Amount year='2002'>0</Amount>
    <Amount year='2003'>0</Amount>
    <Amount year='2004'>0</Amount>
    <Amount year='2005'>0</Amount>
    <Amount year='2006'>0</Amount>
  </LineItem>

The attribute version is a little more compact than the element-only alternative. It also has the advantage of keeping the years in chronological order.

Clearly, there are many other ways you might choose to divide the output between elements and attributes. For instance, you might make all the names elements but all the codes attributes. There’s certainly more than one way to do it. The modifications to produce slightly different forms of this output are straightforward.
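
As one illustration of such a variation, the following sketch writes two of the codes as attributes and the matching names as child elements. The MixedXMLBudget class is hypothetical; it handles only four fields, omits the Amount elements to stay short, and assumes every field it asks for is present in the record.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Map;

class MixedXMLBudget {

  // Codes become attributes; names become child elements
  static void writeLineItem(Writer out, Map<String, String> record)
   throws IOException {
    out.write("  <LineItem");
    writeAttribute(out, "AgencyCode", record);
    writeAttribute(out, "BureauCode", record);
    out.write(">\r\n");
    writeElement(out, "AgencyName", record);
    writeElement(out, "BureauName", record);
    out.write("  </LineItem>\r\n");
  }

  private static void writeAttribute(Writer out, String name,
   Map<String, String> record) throws IOException {
    out.write(" " + name + "='"
     + escapeText(record.get(name)) + "'");
  }

  private static void writeElement(Writer out, String name,
   Map<String, String> record) throws IOException {
    out.write("    <" + name + ">");
    out.write(escapeText(record.get(name)));
    out.write("</" + name + ">\r\n");
  }

  // Escapes the same five characters as Example 4.4
  static String escapeText(String s) {
    StringBuilder result = new StringBuilder(s.length() + 6);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '&') result.append("&amp;");
      else if (c == '<') result.append("&lt;");
      else if (c == '>') result.append("&gt;");
      else if (c == '"') result.append("&quot;");
      else if (c == '\'') result.append("&apos;");
      else result.append(c);
    }
    return result.toString();
  }

  // Convenience wrapper that returns the element as a String
  static String toXML(Map<String, String> record) {
    try {
      StringWriter out = new StringWriter();
      writeLineItem(out, record);
      return out.toString();
    }
    catch (IOException ex) {
      throw new IllegalStateException(ex);
    }
  }

}
```

The rest of the fields, and the loop over all the records, would follow the same pattern as Example 4.4.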

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified September 08, 2002