Alternatives to Java

When all you have is a hammer, most problems look a lot like nails. Since you’re reading this book, I’m willing to bet that Java is your hammer of choice, and indeed Java is a very powerful hammer. However, sometimes you really could use a screwdriver; and this may be one of those times. I must admit that the solution for imposing hierarchy developed in the last section feels more than a little like pounding a screw with a hammer. Maybe it would be better to use the hammer to set the screw, but then use a screwdriver to drive it in. In this section I want to explore a few possible screwdrivers including XSLT and XQuery. Rather than using such complex Java code, I’ll use Java to get the data into the simple XML format produced by Example 4.2 that closely matches the flat input data. Then I’ll use XSLT to transform this simple intermediate XML format into the less-flat final XML format. To refresh your memory, the flat XML data is organized like this:

<?xml version="1.0"?>
<Budget>
  <LineItem>
    <FY1994>-1982</FY1994>
    <FY1993>4946</FY1993>
    <FY1992>-3251</FY1992>
    <FY1991>-17373</FY1991>
    <FY1990>-90008</FY1990>
    <AccountCode>265197</AccountCode>
    <On-Off-BudgetIndicator>On-budget</On-Off-BudgetIndicator>
    <TransitionQuarter>0</TransitionQuarter>
    <FY1989>-80069</FY1989>
    <AccountName>Sale of scrap and salvage materials</AccountName>
    <FY1988>-72411</FY1988>
    <FY1987>-60964</FY1987>
    <FY1986>-61462</FY1986>
    <FY1985>-68182</FY1985>
    <FY1984>-79482</FY1984>
    <FY1983>0</FY1983>
    <FY1982>0</FY1982>
    <SubfunctionCode>051</SubfunctionCode>
    <FY1981>0</FY1981>
    <FY2006>-1000</FY2006>
    <FY1980>0</FY1980>
    <FY2005>-1000</FY2005>
    <FY2004>-1000</FY2004>
    <FY2003>-1000</FY2003>
    <FY2002>-1000</FY2002>
    <FY2001>-1000</FY2001>
    <FY2000>-2000</FY2000>
    <AgencyCode>007</AgencyCode>
    <BEACategory>Mandatory</BEACategory>
    <FY1979>0</FY1979>
    <FY1978>0</FY1978>
    <FY1977>0</FY1977>
    <FY1976>0</FY1976>
    <TreasuryAgencyCode>97</TreasuryAgencyCode>
    <AgencyName>Department of Defense--Military</AgencyName>
    <BureauCode>00</BureauCode>
    <BureauName>Department of Defense--Military</BureauName>
    <FY1999>-1000</FY1999>
    <FY1998>-2000</FY1998>
    <FY1997>-4000</FY1997>
    <FY1996>-1000</FY1996>
    <SubfunctionTitle>Department of Defense-Military</SubfunctionTitle>
    <FY1995>-1000</FY1995>
  </LineItem>
  <!-- several thousand more LineItem elements... -->
</Budget>

Imposing Hierarchy with XSLT

The XSLT stylesheet shown in Example 4.12 will convert flat XML budget data of this type into an output document of the same form produced by Example 4.11. Because the input file is so large, you may need to raise the memory allocation for your XSLT processor before running the transform.

Example 4.12. An XSLT stylesheet that converts flat XML data to hierarchical XML data

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  
  <!-- Try to make the output look half decent -->
  <xsl:output indent="yes" encoding="ISO-8859-1"/>
  
  <!-- Muenchian method -->
  <xsl:key name="agencies" match="LineItem" use="AgencyCode"/>
  <xsl:key name="bureaus"  match="LineItem" 
    use="concat(AgencyCode,'+',BureauCode)"/>
  <xsl:key name="accounts" match="LineItem" 
    use="concat(AgencyCode,'+',BureauCode,'+',AccountCode)"/>
  <xsl:key name="subfunctions" match="LineItem" 
    use="concat(AgencyCode,'+',BureauCode,'+',AccountCode,
    '+',SubfunctionCode)"/>
  
  <xsl:template match="Budget">
    <Budget year='2001'>
      <xsl:for-each select="LineItem[generate-id() 
       = generate-id(key('agencies',AgencyCode)[1])]">
        <Agency>
          <Name><xsl:value-of select="AgencyName"/></Name>
          <Code><xsl:value-of select="AgencyCode"/></Code>
          <xsl:for-each 
            select="/Budget/LineItem[AgencyCode
            =current()/AgencyCode]
             [generate-id() = 
               generate-id(key('bureaus', 
                     concat(AgencyCode, '+', BureauCode))[1])]">
            <Bureau>
              <Name><xsl:value-of select="BureauName"/></Name>
              <Code><xsl:value-of select="BureauCode"/></Code>
              <xsl:for-each select="/Budget/LineItem
                  [AgencyCode=current()/AgencyCode]
                  [BureauCode=current()/BureauCode]
                  [generate-id() = generate-id(key('accounts',
                   concat(AgencyCode,'+',BureauCode,'+',
                                            AccountCode))[1])]">
                <Account>
                  <Name>
                    <xsl:value-of select="AccountName"/>
                  </Name>
                  <Code>
                    <xsl:value-of select="AccountCode"/>
                  </Code>
                  <xsl:for-each select=
                    "/Budget/LineItem
                     [AgencyCode=current()/AgencyCode]
                     [BureauCode=current()/BureauCode]
                     [AccountCode=current()/AccountCode]
                     [generate-id()=generate-id(
                       key('subfunctions', concat(AgencyCode,'+',
                       BureauCode,'+',AccountCode,'+',
                       SubfunctionCode))[1])]">
                    <Subfunction BEACategory="{BEACategory}"
                     BudgetIndicator="{On-Off-BudgetIndicator}">
                      <Title>
                       <xsl:value-of select="SubfunctionTitle"/>
                      </Title>
                      <Code>
                       <xsl:value-of  select="SubfunctionCode"/>
                      </Code>
                      <Amount>
                        <xsl:value-of select="FY2001"/>
                      </Amount>                       
                    </Subfunction>
                  </xsl:for-each> 
                </Account>
              </xsl:for-each>
            </Bureau>
          </xsl:for-each> 
        </Agency>
      </xsl:for-each>
    </Budget>
  </xsl:template>
  
</xsl:stylesheet>

The algorithm for converting flat data to hierarchical data with XSLT is known as the Muenchian method after its inventor, Steve Muench of Oracle. The trick of the Muenchian method is to use the xsl:key element and the key() function to create node-sets of all the LineItem elements that share the same agency, bureau, account, or subfunction. Inside the template, the generate-id() function is used to compare the current node to the first node in any given group. An Agency, Bureau, Account, or Subfunction element is output only if we are indeed processing the first LineItem element with the specified combination of codes. Also note that the select attributes in the xsl:for-each elements keep returning to the root rather than processing children and descendants as is customary. This reflects the fact that the hierarchy in the input is not the same as the hierarchy in the output.
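To see the key()/generate-id() comparison in isolation, here is a minimal sketch that groups on a single level (agency only) and runs the stylesheet through the JDK's javax.xml.transform API. The class name MuenchianSketch, the three-item sample input, and the code and items attributes are my own illustrations, not part of the budget examples. The items attribute shows that key() really does return every LineItem in the group.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MuenchianSketch {

    // Hypothetical sample; the real input has several thousand LineItem elements.
    static final String FLAT =
        "<Budget>"
      + "<LineItem><AgencyCode>001</AgencyCode>"
      + "<AgencyName>Legislative Branch</AgencyName></LineItem>"
      + "<LineItem><AgencyCode>007</AgencyCode>"
      + "<AgencyName>Department of Defense--Military</AgencyName></LineItem>"
      + "<LineItem><AgencyCode>001</AgencyCode>"
      + "<AgencyName>Legislative Branch</AgencyName></LineItem>"
      + "</Budget>";

    // One-level version of the grouping used in Example 4.12.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:key name='agencies' match='LineItem' use='AgencyCode'/>"
      + "<xsl:template match='Budget'>"
      + "<Budget>"
        // Keep a LineItem only if it is the first node in its agency group.
      + "<xsl:for-each select=\"LineItem[generate-id()"
      + " = generate-id(key('agencies', AgencyCode)[1])]\">"
      + "<Agency code='{AgencyCode}'"
      + " items=\"{count(key('agencies', AgencyCode))}\">"
      + "<xsl:value-of select='AgencyName'/>"
      + "</Agency>"
      + "</xsl:for-each>"
      + "</Budget>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the input and returns the serialized result.
    static String transform(String stylesheet, String input) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(input)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(STYLESHEET, FLAT));
    }
}
```

The third LineItem repeats agency 001, so it fails the generate-id() test and produces no second Agency element. It is still counted in the items attribute of the first.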

Note

XSLT 2.0 is going to make it much easier to write stylesheets that group elements in this fashion. This will likely involve a new xsl:for-each-group element that groups elements according to an XPath expression and a current-group() function that selects all the members of the current group so they can be processed together.

One minor advantage of using XSLT instead of Java data structures is that XSLT preserves the order of the input data. You’ll notice that the output begins with the Legislative Branch agency and bureau and the Receipts, Central fiscal operations account, the same as the input data does. This was not the case for the output produced by Java.

<?xml version="1.0" encoding="ISO-8859-1"?>
<Budget year="2001">
   <Agency>
      <Name>Legislative Branch</Name>
      <Code>001</Code>
      <Bureau>
         <Name>Legislative Branch</Name>
         <Code>00</Code>
         <Account>
            <Name>Receipts, Central fiscal operations</Name>
            <Code/>
            <Subfunction BEACategory="Mandatory" 
              BudgetIndicator="On-budget">
               <Title>Central fiscal operations</Title>
               <Code>803</Code>
               <Amount>0</Amount>
            </Subfunction>
            <Subfunction BEACategory="Net interest" 
              BudgetIndicator="On-budget">
               <Title>Other interest</Title>
               <Code>908</Code>
               <Amount>0</Amount>
            </Subfunction>
         </Account>
         <Account>
            <Name>Charges for services to trust funds</Name>
            ...
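The ordering difference noted above is not inherent to Java. Assuming the Java solution grouped its records in hash-based collections such as HashMap (an assumption on my part, since that code is not repeated in this section), iteration order is unspecified. Swapping in java.util.LinkedHashMap keeps groups in first-seen order. The class name and the sample rows below are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderedGrouping {

    // Groups {agencyName, bureauName} rows by agency. LinkedHashMap iterates
    // in insertion order, so agencies come out in the order first seen.
    static Map<String, List<String>> groupByAgency(String[][] rows) {
        Map<String, List<String>> agencies = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> bureaus =
                agencies.computeIfAbsent(row[0], k -> new ArrayList<>());
            // Record each bureau once, again in first-seen order.
            if (!bureaus.contains(row[1])) {
                bureaus.add(row[1]);
            }
        }
        return agencies;
    }

    public static void main(String[] args) {
        // Hypothetical rows standing in for the flat budget records.
        String[][] rows = {
            {"Legislative Branch", "Legislative Branch"},
            {"Legislative Branch", "Senate"},
            {"Judicial Branch", "Supreme Court of the United States"},
            {"Legislative Branch", "Senate"},
        };
        groupByAgency(rows).forEach((agency, bureaus) ->
            System.out.println(agency + " -> " + bureaus));
    }
}
```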

The XML Query Language

Caution

This section describes bleeding edge technology. The broad picture presented here is likely to be correct, but the details are almost certain to change. Furthermore, the exact subset of XQuery implemented by early experimental tools varies a lot from one product to the next.

XSLT is Turing complete. Nonetheless some operations are more than a little cumbersome in XSLT. Using the Muenchian method to impose hierarchy is definitely not something envisioned by XSLT’s inventors. The W3C has begun work on a language more suitable for querying XML documents. This language is called, simply enough, the XML Query Language, or XQuery for short. XQuery is to XML documents as SQL is to relational tables. However, XQuery is limited to SELECT. It does not have any equivalent of INSERT, UPDATE, or DELETE. It is a read-only language.

XQuery queries are not in general well-formed XML. Although there is an XML syntax for XQuery, it is not intended to be used by human beings. Instead humans are supposed to write in a more natural 4GL syntax which will be compiled to XML documents if necessary. If you think about it, this shouldn’t be so surprising. SQL statements aren’t tables. Why should XQuery statements be XML documents?

The basic nature of an XQuery query is the FLWR (pronounced “flower”) statement. This is an acronym for for-let-where-return, the basic form of an XQuery query. In brief: for each node in a node-set, let a variable have a certain value, where some condition is true, and return an XML fragment based on the values of these variables. Variables are set and XML returned using XPath 2.0 expressions.

For example, here’s an XQuery that generates a list of agency names from the flat XML budget:

for $name in document("budauth.xml")/Budget/LineItem/AgencyName
return $name 

The for clause iterates over every node in the node set returned by the XPath 2.0 expression document("budauth.xml")/Budget/LineItem/AgencyName. This expression returns a node set containing 3175 AgencyName elements. The XQuery variable $name is set to each of these elements in turn. The return clause is evaluated for each value of $name. In this case, the return clause says to simply return the node the $name variable currently points to. In this example, the $name variable always points to an AgencyName element so the output would begin like this:

<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
…

This is not a well-formed XML document because it does not have a root element. However, it is a well-formed XML document fragment.

You can use the XPath 2.0 distinct-values() function around the XPath expression to select only one of each AgencyName element:

for $name in distinct-values(document("budauth.xml")/Budget/LineItem/AgencyName)
return $name

The output would now begin like this, repeating each agency name only once:

<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Judicial Branch</AgencyName>
<AgencyName>Department of Agriculture</AgencyName>
<AgencyName>Department of Commerce</AgencyName>
<AgencyName>Department of Defense--Military</AgencyName>
…
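If you don't have an XQuery tool at hand, the two queries above can be approximated in plain Java. This is only a sketch under stated assumptions: it uses the JDK's javax.xml.xpath API, which implements XPath 1.0 rather than XPath 2.0, so there is no distinct-values() function and the duplicates are removed in Java with a LinkedHashSet instead. It also returns string values rather than element nodes. The class name and the inline sample document are hypothetical stand-ins for budauth.xml.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DistinctAgencyNames {

    // Evaluates an XPath 1.0 expression and returns the distinct string
    // values of the selected nodes, in the order they were returned.
    static Set<String> distinctValues(String xml, String path) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)),
                          XPathConstants.NODESET);
            Set<String> values = new LinkedHashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical three-item sample in the flat format.
        String xml =
            "<Budget>"
          + "<LineItem><AgencyName>Legislative Branch</AgencyName></LineItem>"
          + "<LineItem><AgencyName>Legislative Branch</AgencyName></LineItem>"
          + "<LineItem><AgencyName>Judicial Branch</AgencyName></LineItem>"
          + "</Budget>";
        for (String name : distinctValues(xml, "/Budget/LineItem/AgencyName")) {
            System.out.println("<AgencyName>" + name + "</AgencyName>");
        }
    }
}
```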

As well as copying existing elements, XQuery can create new elements. You can type the tags literally where you want them to appear. To include the value of a variable (or other expression) inside the tags, enclose it in curly braces. For example, this query places <Name> and </Name> tags around each agency name rather than <AgencyName> and </AgencyName>. Notice also that it only selects the text content of each AgencyName element rather than the complete element node:

for $name in distinct-values(document("budauth.xml")//AgencyName/text())
return <Name>{ $name }</Name>

The output now begins like this:

<Name>Legislative Branch</Name>
<Name>Judicial Branch</Name>
<Name>Department of Agriculture</Name>
<Name>Department of Commerce</Name>
<Name>Department of Defense--Military</Name>
…

More complex queries typically require multiple variables. These can be set in a let clause based on XPath expressions that refer to the variable in the for clause. For example, this query selects distinct agency codes but returns agency names:

for $code in distinct-values(document("budauth.xml")//AgencyCode)
let $name := $code/../AgencyName
return $name

A where clause can further restrict the members of the node set for which results are generated. where conditions can use boolean connectors such as and, or, and not(). For example, this query finds all the bureaus in the Department of Agriculture:

for $bureau in distinct-values(document("budauth.xml")/Budget/LineItem/BureauName)
where $bureau/../AgencyName = "Department of Agriculture"
return $bureau

XQuery expressions may nest. That is, the return statement of the FLWR may contain another FLWR. For example, this statement lists all the bureau names inside their respective agencies:

for $ac in distinct-values(document("budauth.xml")//AgencyCode)
return
  <Agency>
    <Name>{ $ac/../AgencyName/text() }</Name>
    {  
      for $bc in distinct-values(document("budauth.xml")//BureauCode)
      where $bc/../AgencyCode = $ac
      return 
        <Bureau>
          <Name>{ $bc/../BureauName/text() }</Name>
        </Bureau>
    }
  </Agency>

The output now begins like this:

<Agency>
  <Name>Legislative Branch</Name>
  <Bureau>
    <Name>Legislative Branch</Name>
  </Bureau>
  <Bureau>
    <Name>Senate</Name>
  </Bureau>
  <Bureau>
    <Name>House of Representatives</Name>
  </Bureau>
  <Bureau>
    <Name>Joint Items</Name>
  </Bureau>
…

This is all the syntax needed to write a query that will convert flat budget data such as that produced by Example 4.2 into a hierarchical XML document. Such a query is shown in Example 4.13, which selects the data from just 2001.

Example 4.13. An XQuery that converts flat data to hierarchical data

<Budget year="2001">
  {
  for $ac in distinct-values(document("budauth.xml")//AgencyCode)
  return
    <Agency>
      <Name>{ $ac/../AgencyName/text() }</Name>
      <Code>{ $ac/text() }</Code>
      {  
        for $bc 
         in distinct-values(document("budauth.xml")//BureauCode)
        where $bc/../AgencyCode = $ac
        return 
          <Bureau>
            <Name>{ $bc/../BureauName/text() }</Name>
            <Code>{ $bc/text() }</Code>
            {  
            for $acct in distinct-values(
             document("budauth.xml")//AccountCode)
            where $acct/../AgencyCode = $ac 
             and $acct/../BureauCode = $bc
            return 
              <Account 
                BEACategory="{ $acct/../BEACategory/text() }">
                <Name>{ $acct/../AccountName/text() }</Name>
                <Code>{ $acct/text() }</Code>  
                {  
                  for $sfx 
                    in document("budauth.xml")//SubfunctionCode
                  where $sfx/../AgencyCode = $ac 
                    and $sfx/../BureauCode = $bc 
                    and $sfx/../AccountCode = $acct
                  return 
                    <Subfunction>
                 <Title>{$sfx/../SubfunctionTitle/text()}</Title>
                      <Code>{ $sfx/text() }</Code>  
                      <Amount>{ $sfx/../FY2001/text() }</Amount>
                    </Subfunction>
               }    
            </Account>
           }
          </Bureau>
      }
    </Agency>
  }
</Budget>

There’s a lot more to XQuery, but this should give you the idea of what it can do. It’s definitely worth a look any time you have to perform database-like operations on XML documents.


Copyright 2001, 2002 Elliotte Rusty Harold, elharo@metalab.unc.edu. Last Modified July 25, 2002