Building Hierarchical Structures from Flat Data

Sometimes you want to convert only a subset of the available data. For example, you might want to store each year in a separate XML document. With a few tweaks to the basic code, this is not hard to do. Another possibility is to arrange the data hierarchically, with Agency elements containing Bureau elements which contain Account elements which contain Subfunction elements. This is a little more complex, but far from impossible. Or perhaps you want to massage the data while converting it, for instance by changing the amounts of thousands of dollars to amounts of single dollars. Or you can do all three. The hierarchy would be arranged something like Example 4.5:

Example 4.5. A hierarchical arrangement of the budget data

<?xml version="1.0"?>
<Budget year="2001">
  <Agency>
    <Name>Farm Credit System Financial Assistance Corporation</Name>
    <Code>354</Code>
    <Bureau>
      <Name>Farm Credit System Financial Assistance Corporation</Name>
      <Code>00</Code>
      <Account>
        <Name>Financial assistance corporation trust fund</Name>
        <Code>8202</Code>
        <BEACategory>Mandatory</BEACategory>
        <Subfunction>
          <Title>Farm income stabilization</Title>
          <Code>351</Code>
          <Amount>-1000</Amount>
        </Subfunction>
      </Account>
    </Bureau>
  </Agency>
  <Agency>
    <Name>Department of Education</Name>
    <Code>018</Code>
    <Bureau>
      <Name>Office of Elementary and Secondary Education</Name>
      <Code>10</Code>
      <Account>
        <Name>Reading excellence</Name>
        <Code>0011</Code>
        <BEACategory>Discretionary</BEACategory>
        <Subfunction>
          <Title>
            Elementary, secondary, and vocational education
          </Title>
          <Code>501</Code>
          <Amount>286000000</Amount>
        </Subfunction>
      </Account>
      <Account>
        <Name>Indian education</Name>
        <Code>0101</Code>
        <BEACategory>Discretionary</BEACategory>
        <Subfunction>
          <Title>
            Elementary, secondary, and vocational education
          </Title>
          <Code>501</Code>
          <Amount>116000000</Amount>
        </Subfunction>
      </Account>
      <!-- more accounts...-->
    </Bureau>
    <Bureau>
      <Name>
        Office of Bilingual Education and Minority Languages Affairs
      </Name>
      <Code>15</Code>
      <Account>
        <Name>Bilingual and immigrant education</Name>
        <Code>1300</Code>
        <BEACategory>Discretionary</BEACategory>
        <Subfunction>
          <Title>
            Elementary, secondary, and vocational education
          </Title>
          <Code>501</Code>
          <Amount>460000000</Amount>
        </Subfunction>
      </Account>
    </Bureau>
    <!-- more bureaus...-->
  </Agency>
  <!-- many more agencies... -->
</Budget>

There are two basic approaches to imposing hierarchy on the flat structures we’re starting with. One is to build the hierarchy into the Java data structures in memory. Then these data structures can be output in XML quite simply. The alternative is to leave the data structures in memory flat, but try to insert the hierarchy as the output code is being written. The latter approach is not too difficult when the data can be easily sorted, but most of the time I find the former approach to be simpler.

Although all the data could still be stored in lists of lists and lists of maps and maps of lists and so forth, I really do find that at this point the data structures are clearer if we name them; and in Java there really isn’t any difference between a named data structure and a class. Thus we can describe the structure as classes and objects. Informally, the structure looks like this:

class Budget {
 
  String year;
  List   agencies;
            
}

class Agency {
 
  String code;
  String treasuryCode;
  String name;
  List   bureaus;
            
}

class Bureau {
 
  String code;
  String name;
  List   accounts;       
        
}

class Account {
 
  String code;
  String name;
  String BEACategory;
  List   subfunctions;
    
}

class Subfunction {
 
  String code;
  String title;
  long   amount;  
        
}

Note

I must admit that the lack of generic types (templates to C++ programmers) hurts a little here. For example, I want to say that an Agency contains a list of Bureaus. However, all I can really say is that an Agency contains a list of objects and that this list happens to be named bureaus. It’s not a huge problem—if it were, I could always define my own custom list classes with the appropriate types—but it does lead to a less natural solution than generic types do. I thought seriously about using generics for this example, but that would limit this code to Java 1.4. Maybe for the second edition.

Of course, we’ll also need a number of methods in these classes. Factory methods will be used instead of constructors to guarantee that 19 IRS Bureau objects aren’t created just because the IRS has 19 line items in the budget. (One IRS is bad enough. :-) ) Each class will have a getXML() method that returns a String containing the XML form of the object. And each class will have a method to add a child node of the appropriate type. Figure 4.2 shows a UML static structure diagram for the complete group of classes.

Figure 4.2. A UML diagram for the budget class hierarchy

The Budget class, Example 4.6 is at the top of the tree. It contains a list of agencies and a year. Line items encoded as maps by BudgetData are fed into the tree through the add() method. This method breaks that map up into the relevant parts, encodes each as an object, adds each of those objects to the tree in the right place, and stores the agency in the list of agencies. The various add() methods in both this and the other classes are responsible for putting each piece in the right place.

Example 4.6. The Budget class

import java.util.*;


public class Budget {

  private List   agencies = new ArrayList();
  private String year;
  
  public Budget(String year) {
    this.year = year;
  }
  
  public void add(Agency agency) {
    if (!agencies.contains(agency)) agencies.add(agency);     
  }

  public void add(Map lineItem) { 
           
    String agencyName = (String) lineItem.get("AgencyName");
    agencyName = escapeText(agencyName);
    String agencyCode = (String) lineItem.get("AgencyCode");
    String treasuryAgencyCode 
     = (String) lineItem.get("TreasuryAgencyCode");
    Agency agency = Agency.getInstance(agencyName, agencyCode, 
     treasuryAgencyCode, year);
    this.add(agency);
    
    String bureauName = (String) lineItem.get("BureauName");
    bureauName = escapeText(bureauName);
    String bureauCode = (String) lineItem.get("BureauCode");
    Bureau bureau = Bureau.getInstance(bureauName, bureauCode, 
     agencyCode, year);
    agency.add(bureau);
    
    // Names and codes of two accounts in different bureaus 
    // can be the same
    String accountName = (String) lineItem.get("AccountName");
    accountName = escapeText(accountName);
    String accountCode = (String) lineItem.get("AccountCode");
    String category    = (String) lineItem.get("BEACategory");
    Account account = Account.getInstance(accountName,  
     accountCode, category, bureauCode, agencyCode, year);
    bureau.add(account);
    
    // Names and codes of two subfunctions in different accounts 
    // can be the same
    String subfunctionTitle 
     = escapeText((String) lineItem.get("SubfunctionTitle"));
    String subfunctionCode
     = (String) lineItem.get("SubfunctionCode");
    String yearKey = year;
    if (!yearKey.equals("TransitionQuarter")) {
      yearKey = "FY" + year;
    }
    long amount
     = 1000L * Long.parseLong((String) lineItem.get(yearKey));
    Subfunction subfunction = new Subfunction(subfunctionTitle,
     subfunctionCode, amount);
    account.add(subfunction);
        
  } 

  public String getXML() {
        
    StringBuffer result = new StringBuffer("<Budget year='" 
     + this.year +"'>\r\n");
    Iterator iterator = agencies.iterator();
    while (iterator.hasNext()) {
      Agency agency = (Agency) iterator.next();
      result.append(agency.getXML());
    }
    result.append("</Budget>\r\n");
    return result.toString();
    
  }
  
  public static String escapeText(String s) {
   
    if (s.indexOf('&') != -1 || s.indexOf('<') != -1
     || s.indexOf('>') != -1 || s.indexOf('"') != -1
     || s.indexOf('\'') != -1 ) {
      StringBuffer result = new StringBuffer(s.length() + 6);
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '&') result.append("&amp;");
        else if (c == '<') result.append("&lt;");
        else if (c == '"') result.append("&quot;");
        else if (c == '\'') result.append("&apos;");
        else if (c == '>') result.append("&gt;");
        else result.append(c);
      }
      return result.toString();  
    }
    else {
      return s;   
    }
        
  }
  
}

Most of the other classes are structurally the same with subtle variations. Each has a list of its children, a private constructor, a public factory method, an add() method to add a child of the appropriate type, and a getXML() method that returns the XML representation of the object as a String. I’ll begin with the Agency class in Example 4.7.

Example 4.7. The Agency class

import java.util.*;


public class Agency {
 
  private String code;
  private String name;
  private String treasuryCode;
  private String year;
  
  private List   bureaus = new ArrayList();
  
  private static Map instances = new HashMap();

  // A private constructor so instantiators 
  // have to use the factory method
  private Agency(String name, String code, String treasuryCode, 
	 String year) {
        
    this.name = name;
    this.code = code;
    this.treasuryCode = treasuryCode;
    this.year = year;
    
  }
  
  public static Agency getInstance(String name, String code, 
	 String treasuryCode, String year) {
        
    // Agencies can be uniquely identified by code alone
    String key = code+" "+year;
    Agency agency = (Agency) instances.get(key);
    if (agency == null) {
      agency = new Agency(name, code, treasuryCode, year);
      instances.put(key, agency);
    }
    
    return agency;
        
  }
  
  public void add(Bureau b) {
    if (!bureaus.contains(b)) {
        bureaus.add(b);
    }
  }
  
  public String getXML() {
        
    StringBuffer result = new StringBuffer("  <Agency>\r\n");
    result.append("    <Name>" + name + "</Name>\r\n");
    result.append("    <Code>" + code + "</Code>\r\n");
    result.append("    <TreasuryAgencyCode>" + treasuryCode 
     + "</TreasuryAgencyCode>\r\n");
    Iterator iterator = bureaus.iterator();
    while (iterator.hasNext()) {
      Bureau bureau = (Bureau) iterator.next();
      result.append(bureau.getXML());
    }
    result.append("  </Agency>\r\n");
    return result.toString();
    
  }  
           
}

This is not a general purpose class. The public interface is limited to a factory method, a method to add a Bureau to the Agency, and a method to get the XML representation of the object as a String. There aren’t even any getter methods. I’ve deliberately chosen not to burden this and the other classes with a lot of error checking. I’m assuming that will be done both in the database before the data is input here and later through validation after the XML document has been produced. Doing it here too just seemed a tad superfluous.

Clients retrieve instances of the class using the getInstance() factory method. The first time this class is asked to get a particular agency it constructs it using the private constructor, stores it in the static instances map, and returns a reference to it. Subsequent requests will return the formerly created object from the map. The key for the map is the concatenation of the agency’s code and year.

The add() method simply stores Bureau objects in the bureaus list. Since bureaus is private and only accessed through this add() method, we can be certain that no non-Bureau objects will ever get added to this list. We can further guarantee that the list doesn’t contain any duplicates.

The getXML() method returns an XML representation of this Agency object. It loops through all the bureaus the agency contains, and uses their getXML() methods to get the XML representations for each of them. The XML is returned as a String. An alternative approach that might be more memory efficient and perhaps a little faster, especially in streaming applications, would be to pass the OutputStream or Writer that the XML should be written onto to the getXML() method. In essence, this would be a writeXML() method rather than a getXML() method.

The code here is far from the most efficient possible. There are a lot of optimizations that could be done, including many that would build far fewer intermediate string representations of each element. For instance, you could pass in the StringBuffer to append to to each getXML() method. However, since this is intended primarily as a one-off solution, the optimization didn’t seem to be worth the investment of time or the added complexity.

The Bureau class shown in Example 4.8 is much the same except that it contains a list of accounts. Furthermore, both an agency code and a bureau code are required to uniquely identify bureaus. The bureau code alone is not sufficient, so the keys for the instances map are formed by concatenating the agency code, the bureau code, and the year.

Example 4.8. The Bureau Class

import java.util.*;


public class Bureau {
 
  // Agency code plus bureau code uniquely identify a bureau  
  // Bureau code alone is definitely not sufficient
  private String code;
  private String name;
  private String year;
  private String agencyCode;
  
  private List   accounts = new ArrayList();
  
  private static Map instances = new HashMap();

  // Use a private constructor so instantiators 
  // have to use the factory method
  private Bureau(String name, String code, String agencyCode, 
    String year) {
        
    this.name = name;
    this.code = code;
    this.agencyCode = agencyCode;
    this.year = year;
    
  }
  
  public static Bureau getInstance(String name, String code, 
   String agencyCode, String year) {
        
    String key = agencyCode+" "+code+" "+year;
    Bureau bureau = (Bureau) instances.get(key);
    if (bureau == null) {
      bureau = new Bureau(name, code, agencyCode, year);
      instances.put(key, bureau);
    }
    
    return bureau;
        
  }
  
  public void add(Account account) {
    if (!accounts.contains(account)) accounts.add(account);     
  }
  
  public String getXML() {
        
    StringBuffer result = new StringBuffer("    <Bureau>\r\n");
    result.append("      <Name>" + name + "</Name>\r\n");
    result.append("      <Code>" + code + "</Code>\r\n");
    Iterator iterator = accounts.iterator();
    while (iterator.hasNext()) {
      Account account = (Account) iterator.next();
      result.append(account.getXML());
    }
    result.append("    </Bureau>\r\n");
    return result.toString();
    
  }
          
}

The Account class shown in Example 4.9 is similar to Bureau and Agency. Accounts are uniquely identified by an account code, a bureau code, an agency code, and a BEA code. Otherwise, this class is structured the same as the last two.

Example 4.9. An Account Class

import java.util.*;


public class Account {
 
  // An account is uniquely identified by account code,
  // bureau code, agency code and BEA category
  private String code;
  private String name;
  private String BEACategory;
  private String bureauCode;
  private String agencyCode;
  private String year;
  
  private List   subfunctions = new ArrayList();
  
  private static Map instances = new HashMap();

  // Use a private constructor so clients 
  // have to use the factory method
  private Account(String name, String code, String BEACategory, 
   String bureauCode, String agencyCode, String year) {
        
    this.name = name;
    this.code = code;
    this.BEACategory = BEACategory;
    this.bureauCode = bureauCode;
    this.agencyCode = agencyCode;
    this.year = year;
    
  }
  
  public static Account getInstance(String name, String code, 
   String BEACategory, String bureauCode, String agencyCode, 
   String year) {
        
    String key = code + " " + BEACategory + " " + bureauCode 
     + " " + agencyCode + " " + year;
    Account account = (Account) instances.get(key);
    if (account == null) {
      account = new Account(name, code, BEACategory, bureauCode, 
       agencyCode, year);
      instances.put(key, account);
    }
    
    return account;
        
  }
  
  public void add(Subfunction sfx) {
    if (!subfunctions.contains(sfx)) subfunctions.add(sfx);     
  }
  
  public String getXML() {
        
    StringBuffer result = new StringBuffer();
    result.append("      <Account>\r\n");
    result.append("        <Name>" + name + "</Name>\r\n");
    result.append("        <Code>" + code + "</Code>\r\n");
    result.append("        <BEACategory>" + BEACategory 
     + "</BEACategory>\r\n");
    Iterator iterator = subfunctions.iterator();
    while (iterator.hasNext()) {
      Subfunction subfunction = (Subfunction) iterator.next();
      result.append(subfunction.getXML());
    }
    result.append("      </Account>\r\n");
    return result.toString();
    
  }
           
}

Finally the Subfunction class, Example 4.10, contains a title, a code, and an amount of money for a given year. Unlike agencies, bureaus, and accounts, subfunctions are not necessarily unique. For example, the “Department of Defense-Military” subfunction appears in 222 different accounts. Therefore this class does not use a factory method, just a regular constructor.

Example 4.10. The Subfunction Class

public class Subfunction {
 
  private String code;
  private String title;
  private long   amount;
    
  public Subfunction(String title, String code, long amount) {
        
    this.title  = title;
    this.code   = code;
    this.amount = amount;
    
  }
  
  public String getXML() {
        
    StringBuffer result 
      = new StringBuffer("        <Subfunction>\r\n");
    result.append("          <Name>" + title + "</Name>\r\n");
    result.append("          <Code>" + code + "</Code>\r\n");
    result.append("          <Amount>");
    result.append(amount + "</Amount>\r\n");
    result.append("        </Subfunction>\r\n");
    return result.toString();
    
  } 
               
}

The ultimate plan is to read all the records of the budget data, and use that information to create objects of these classes. As each object is created, it will be added to its parent object’s list. When all records have been processed, the getXML() method of the Budget object will be invoked to retrieve the XML representation of the budget. This XML can be written onto an output stream. Example 4.11 does this. It has a main() method to provide a simple user interface. This just reads the year and the input and output file names from the user, and calls convert(). The convert() method delegates reading the input to the BudgetData class and the data-structure building and most of the output to the Budget class.

Example 4.11. The driver class that builds the data structure and writes it out again

import java.io.*;
import java.util.*;


public class HierarchicalXMLBudget {

  public static void convert(List budgetData, String year, 
   OutputStream out) throws IOException { 
     
    Budget budget = new Budget(year);
    Iterator records = budgetData.iterator();
    while (records.hasNext()) {
      Map lineItem = (Map) records.next();
      budget.add(lineItem);
    }

    Writer wout = new OutputStreamWriter(out, "UTF8"); 
    wout.write("<?xml version=\"1.0\"?>\r\n");
    wout.write(budget.getXML());
    wout.flush();
        
  }

  public static void main(String[] args) {
  
    try {
        
      if (args.length < 2) {
        System.out.println(
         "Usage: HierarchicalXMLBudget year infile outfile");
        return;
      }
      
      // simple error checking on the year value
      try {
        if (!args[0].equals("TransitionQuarter")) {
          Integer.parseInt(args[0]);
        }
      }
      catch (NumberFormatException e) {
        System.out.println(
         "Usage: HierarchicalXMLBudget year infile outfile");
        return;        
      }
      
      InputStream in = new FileInputStream(args[1]); 
      OutputStream out; 
      if (args.length < 3) {
        out = System.out;
      }
      else {
        out = new FileOutputStream(args[2]); 
      }

      List results = BudgetData.parse(in);
      convert(results, args[0], out);
    }
    catch (IOException e) {
      System.err.println(e);       
    }
  
  }

}

Here’s the start of a sample run:

C:\XMLJAVA>java HierarchicalXMLBudget 2002 budauth.txt
<?xml version="1.0"?>
<Budget year='2002'>
  <Agency>
    <Name>Legislative Branch</Name>
    <Code>001</Code>
    <TreasuryAgencyCode></TreasuryAgencyCode>
    <Bureau>
      <Name>Legislative Branch</Name>
      <Code>00</Code>
      <Account>
        <Name>Receipts, Central fiscal operations</Name>
        <Code></Code>
        <BEACategory>Mandatory</BEACategory>
        <Subfunction>
          <Name>Central fiscal operations</Name>
          <Code>803</Code>
          <Amount>0</Amount>
        </Subfunction>
      </Account>
      <Account>
        <Name>Receipts, Central fiscal operations</Name>
        <Code></Code>
        <BEACategory>Net interest</BEACategory>
        <Subfunction>
...

The technique used here is actually an extremely powerful one. The data is read one item at a time. Each item is stored in a data structure. The data structure automatically moves each item into its appropriate slot. The final result is generated by a single method call to the root of the data structure.

Note

A similar approach would be to define classes that represent the XML elements and attributes rather than the input data. Indeed this is the approach taken by the Document Object Model, DOM. I’ll explore this alternative in coming chapters.


Copyright 2001, 2002 Elliotte Rusty Haroldelharo@metalab.unc.eduLast Modified August 17, 2001
Up To Cafe con Leche