XML News from Friday, January 15, 2010

I've converted all the old quotes archives to well-formed (though not necessarily valid) XHTML and uploaded them into eXist. Now I have to come up with an XQuery that breaks them up into individual quotes. This is proving trickier than expected (and I expected it to be pretty tricky, especially since a lot of the old quotes aren't in perfectly consistent formats).

Maybe it's time to try out Oxygen's XQuery debugger since they sent me a freebie? If only the interface weren't such a horror show. They say they have a debugger but I can't find it, and the buttons they're using in the screencast don't seem to be present in the latest version. In the meantime, can anyone see the syntax error in this code?

xquery version "1.0";
declare namespace xmldb="http://exist-db.org/xquery/xmldb";
declare namespace html="http://www.w3.org/1999/xhtml";

     for $dt in doc("/db/quoteshtml/quotes2010.html")/html:html/html:body/html:dl/html:dt
        let $id := string($dt/@id)
        let $date := string($dt)
        let $dd := $dt/following-sibling::html:dd
        let $quote := $dd/html:blockquote
        let $cite := string($quote/@cite)
        let $source := $quote/following-sibling::html:p
        let $author := normalize-space(substring-after($source/*[1], "--"))
     return
        <quote>
           <id>{$id}</id>
           <date>{$date}</date>
           <quote>{$quote}</quote>
           <cite>{$cite}</cite>
           <source>{$source}</source>
           <author>{$author}</author>
        </quote>

The error message from eXist is "The actual cardinality for parameter 1 does not match the cardinality declared in the function's signature: string($arg as item()?) xs:string. Expected cardinality: zero or one, got 4."

Found the bug: the debugger wasn't very helpful (once I found it--apparently Author and Oxygen are not the same thing), but Saxon had much better error messages than eXist. I needed to change let $dd := $dt/following-sibling::html:dd to let $dd := $dt/following-sibling::html:dd[1]. eXist didn't tell me which line had the problem so I was looking in the wrong place. Saxon pointed me straight to it. Score 1 for Saxon.
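
For anyone puzzling over why a missing [1] produces a cardinality complaint about string(): without the predicate, following-sibling::html:dd selects every later dd in the list, so $quote holds several blockquotes and string($quote/@cite) is handed several attributes at once. Here is a minimal sketch of the difference. The inline dl, its ids, and the example.com URLs are made up for illustration and are not taken from the real archives.

```xquery
xquery version "1.0";
declare namespace html="http://www.w3.org/1999/xhtml";

(: Hypothetical sample data: two dt/dd pairs in one dl, mimicking the archive layout :)
let $dl :=
  <dl xmlns="http://www.w3.org/1999/xhtml">
    <dt id="q1">Monday</dt>
    <dd><blockquote cite="http://example.com/a"><p>First</p></blockquote></dd>
    <dt id="q2">Tuesday</dt>
    <dd><blockquote cite="http://example.com/b"><p>Second</p></blockquote></dd>
  </dl>
let $dt := $dl/html:dt[1]
(: Without a predicate, the axis selects every later dd sibling :)
let $all := $dt/following-sibling::html:dd
(: With [1], only the dd that immediately follows this dt is selected :)
let $first := $dt/following-sibling::html:dd[1]
return
  <result>
    <all-count>{count($all)}</all-count>
    <first-count>{count($first)}</first-count>
    <!-- string() is safe here because $first yields exactly one cite attribute -->
    <cite>{string($first/html:blockquote/@cite)}</cite>
  </result>
```

For the first dt, $all contains both dd elements while $first contains one. Calling string($all/html:blockquote/@cite) instead would pass two attributes to string() and raise the same kind of cardinality error as above.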

Here's the finished script. It works for at least the last couple of years. I still have to test it out on some of the older files:

xquery version "1.0";
declare namespace xmldb="http://exist-db.org/xquery/xmldb";
declare namespace html="http://www.w3.org/1999/xhtml";

 for $dt in doc("/db/quoteshtml/quotes2009.html")/html:html/html:body/html:dl/html:dt
    let $id := string($dt/@id)
    let $date := string($dt)
    let $dd := $dt/following-sibling::html:dd[1]
    let $quote := $dd/html:blockquote
    let $cite := string($quote/@cite)
    let $source := $quote/following-sibling::*
    let $sourcetext := normalize-space(substring-after($source, "--"))
    let $author := if (contains($sourcetext, "Read the"))
                   then substring-before($sourcetext, "Read")
                   else substring-before($sourcetext, "on the")
    let $location := if ($source/html:a)
                   then $source/html:a
                   else substring-after($sourcetext, "on the")
    let $quotedate := if (contains($sourcetext, "list,"))
                   then  normalize-space(substring-after($sourcetext, "list,"))
                   else ""
    let $justlocation := if (contains($location, "list,"))
                   then  normalize-space(substring-after(substring-before($sourcetext, ","), "on the"))
                   else $location
 return
    <quote>
       <id>{$id}</id>
       <postdate>{$date}</postdate>
       <quote>{$quote}</quote>
       <cite>{$cite}</cite>
       <author>{$author}</author>
       <location>{$justlocation}</location>
       {
         if ($quotedate) 
         then <quotedate>{$quotedate}</quotedate>
         else ""
       }
    </quote>
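
To see what the string slicing in the script above does, here is a small sketch that runs the same substring-before and substring-after steps on a single attribution line. The sample line and the name in it are hypothetical, the html:a and "Read the" branches are left out, and I've wrapped the author in normalize-space to trim the trailing space, which the script itself does not do.

```xquery
xquery version "1.0";

(: Hypothetical attribution line; the real archives vary in format :)
let $source := "--Jane Doe on the xml-dev mailing list, Friday, 1 Jan 2010"
(: Drop the leading "--" and tidy up whitespace :)
let $sourcetext := normalize-space(substring-after($source, "--"))
(: Everything before "on the" is treated as the author :)
let $author := normalize-space(substring-before($sourcetext, "on the"))
(: Everything after "on the" is the location, possibly with a date attached :)
let $location := substring-after($sourcetext, "on the")
(: Anything after "list," is taken to be the date of the quoted message :)
let $quotedate := if (contains($sourcetext, "list,"))
                  then normalize-space(substring-after($sourcetext, "list,"))
                  else ""
(: Cut the location off at the first comma so the date is not included :)
let $justlocation := if (contains($location, "list,"))
                  then normalize-space(substring-after(substring-before($sourcetext, ","), "on the"))
                  else $location
return
  <parsed>
    <author>{$author}</author>
    <location>{$justlocation}</location>
    <quotedate>{$quotedate}</quotedate>
  </parsed>
```

On this sample the author comes out as Jane Doe, the location as xml-dev mailing list, and the quote date as Friday, 1 Jan 2010. The approach depends on the literal markers "on the" and "list," so an attribution with a comma in the author's name, or one that isn't from a mailing list, would need different handling.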