Cafe con Leche XML News and Resources
the PSVI (post-schema validation infoset) represents a fundamental break with the basic relevant XML Specs. Indeed, it might be said that XML Schemas are not schemas for documents, but schemas for databases that have an XML serialization. The two are not the same.
--Rick Jelliffe
Read the rest in Comment on XSD 1.1 from Rick Jelliffe on 2009-05-13
The W3C XQuery working group has posted a new candidate recommendation of XQuery and XPath Full Text 1.0 as well as a new working draft of XQuery and XPath Full Text 1.0 Use Cases.
1.1 Full-Text Search and XML
As XML becomes mainstream, users expect to be able to search their XML documents. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT [SQL/MM] standard. SQL/MM-FT defines extensions to SQL to express full-text searches providing functionality similar to that defined in this full-text language extension to XQuery 1.0 and XPath 2.0.
XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.
Full-text search is different from substring search in many ways:
- A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.
- There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example based on token proximity is "find me all the news items that contain the tokens 'XML' and 'Query' allowing up to 3 intervening tokens".
- Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
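The "lease" versus "releases" distinction can be made concrete in a few lines of Java. This is only a rough sketch, not anything taken from the specification: the class and method names are made up for illustration, and it assumes that a token is a maximal run of letters and digits, which is just one of many tokenizations an implementation could choose.

```java
import java.util.Arrays;
import java.util.Locale;

/** Contrasts a substring match with a whole-token match (illustrative names). */
final class TokenVsSubstring {

    /** Substring search: true if the needle occurs anywhere in the text. */
    static boolean substringMatch(String text, String needle) {
        return text.toLowerCase(Locale.ROOT).contains(needle.toLowerCase(Locale.ROOT));
    }

    /** Token search: true only if some whole token equals the needle. */
    static boolean tokenMatch(String text, String needle) {
        // Assumption: tokens are maximal runs of letters and digits.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        return Arrays.asList(tokens).contains(needle.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String item = "Foobar Corporation releases the 20.9 version";
        System.out.println("substring: " + substringMatch(item, "lease"));
        System.out.println("token:     " + tokenMatch(item, "lease"));
    }
}
```

With this tokenization the substring search finds "lease" inside "releases", while the token search does not, because "releases" is a single token.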
Note:
As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full Text.
[Definition: Full-text queries are performed on tokens and phrases. Tokens and phrases are produced via tokenization.] Informally, tokenization breaks a character string into a sequence of tokens, units of punctuation, and spaces.
Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of tokens found in the target text of a search. These tokens are characterized by integers that capture the relative position(s) of the token inside the string, the relative position(s) of the sentence containing the token, and the relative position(s) of the paragraph containing the token. The positions typically comprise a start and an end position.
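To give a feel for what "tokens characterized by integers" might look like in practice, here is a minimal Java sketch. It is an assumption-laden illustration, not the tokenizer of any real implementation: the SimpleTokenizer and Token names are hypothetical, tokens are taken to be runs of letters and digits, and only the ordinal position and the start and end character offsets are recorded (sentence and paragraph positions are left out).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A toy tokenizer that records where each token sits in the source string. */
final class SimpleTokenizer {

    /** A token with its ordinal position and character offsets. */
    static final class Token {
        final String text;
        final int position; // 1-based ordinal among the tokens
        final int start;    // offset of the first character (inclusive)
        final int end;      // offset just past the last character (exclusive)

        Token(String text, int position, int start, int end) {
            this.text = text;
            this.position = position;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return position + ":" + text + "[" + start + "," + end + ")";
        }
    }

    // Assumption: a token is a maximal run of letters and digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Splits the string into tokens, numbering them in document order. */
    static List<Token> tokenize(String s) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            tokens.add(new Token(m.group(), tokens.size() + 1, m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("XML and Query"));
    }
}
```

Keeping the positions alongside the text is what lets later operators reason about distance and order rather than just presence.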
Tokenization, including the definition of the term "tokens", SHOULD be implementation-defined. Implementations SHOULD expose the rules and sample results of tokenization as much as possible to enable users to predict and interpret the results of tokenization. Tokenization operates on the string value of an item; for element nodes this does not include the content of attribute nodes, but for attribute nodes it does. Tokenization is defined more formally in 4.1 Tokenization.
[Definition: A token is a non-empty sequence of characters returned by a tokenizer as a basic unit to be searched. Beyond that, tokens are implementation-defined.] [Definition: A phrase is an ordered sequence of any number of tokens. Beyond that, phrases are implementation-defined.]
Note:
Consecutive tokens need not be separated by either punctuation or space, and tokens may overlap.
Note:
In some natural languages, tokens and words can be used interchangeably.
[Definition: A sentence is an ordered sequence of any number of tokens. Beyond that, sentences are implementation-defined. A tokenizer is not required to support sentences.]
[Definition: A paragraph is an ordered sequence of any number of tokens. Beyond that, paragraphs are implementation-defined. A tokenizer is not required to support paragraphs.]
Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries. Some formatting markup serves well as token boundaries, for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization. In the absence of an implementation-defined way to differentiate, element markup (start tags, end tags, and empty-element tags) creates token boundaries.
A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.
Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).
Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).
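The earlier proximity example ('XML' and 'Query' with up to 3 intervening tokens) can be sketched once token positions are available. The Java below is a simplified illustration under stated assumptions: the Proximity class and near method are hypothetical names, tokens are runs of letters and digits compared case-insensitively, and the two tokens may appear in either order. It is not how XQuery and XPath Full Text defines or evaluates its distance operators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** A toy proximity check over token positions (illustrative names). */
final class Proximity {

    /** Returns the array indexes at which the given token occurs. */
    static List<Integer> positions(String[] tokens, String token) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(token)) {
                found.add(i);
            }
        }
        return found;
    }

    /**
     * True if tokens a and b both occur in the text with at most
     * maxIntervening tokens between them, in either order.
     */
    static boolean near(String text, String a, String b, int maxIntervening) {
        // Assumption: tokens are maximal runs of letters and digits.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        List<Integer> pa = positions(tokens, a.toLowerCase(Locale.ROOT));
        List<Integer> pb = positions(tokens, b.toLowerCase(Locale.ROOT));
        for (int i : pa) {
            for (int j : pb) {
                int intervening = Math.abs(i - j) - 1;
                // intervening is -1 when both positions are the same token
                if (intervening >= 0 && intervening <= maxIntervening) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String item = "XML documents need a Query language";
        System.out.println(near(item, "XML", "Query", 3));
        System.out.println(near(item, "XML", "Query", 2));
    }
}
```

In the sample sentence there are exactly three tokens between "XML" and "Query", so the check succeeds with a limit of 3 and fails with a limit of 2.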
This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.
The W3C has published a candidate recommendation of SKOS Simple Knowledge Organization System Reference and a new working draft of SKOS Simple Knowledge Organization System Primer. According to the primer:
SKOS — Simple Knowledge Organisation System — provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary. As an application of the Resource Description Framework (RDF), SKOS allows concepts to be composed and published on the World Wide Web, linked with data on the Web and integrated into other concept schemes.
This document is a user guide for those who would like to represent their concept scheme using SKOS.
In basic SKOS, conceptual resources (concepts) are identified with URIs, labeled with strings in one or more natural languages, documented with various types of note, semantically related to each other in informal hierarchies and association networks, and aggregated into concept schemes.
In advanced SKOS, conceptual resources can be mapped across concept schemes and grouped into labeled or ordered collections. Relationships can be specified between concept labels. Finally, the SKOS vocabulary itself can be extended to suit the needs of particular communities of practice or combined with other modeling vocabularies.
This document is a companion to the SKOS Reference, which gives the normative reference on SKOS.
YesLogic has posted the first beta of Prince 7.0, a $495-$3900 payware batch formatter for Linux, Windows, and Mac OS X that produces PDF and PostScript from XML documents with CSS stylesheets and passes the Acid2 test. Version 7.0 adds support for Arabic, Hebrew, and Hindi, as well as kerning and ligatures.
SyncroSoft has released <Oxygen/> 10.3, a $349 payware XML editor written in Java. Oxygen supports XML, XSL, DTDs, XQuery, SVG, Relax NG, Schematron, and the W3C XML Schema Language. According to the announcement:
Version 10.3 of Oxygen XML Editor improves both the XML Authoring and the XML Development capabilities. As a result of user feedback the Oxygen XML Author API was reorganized and extended with additional functionality. There are various improvements to the existing frameworks (DITA, DocBook, TEI, etc.) like automatic ID generation or DITA aware search and replace. An important new XML development feature is the Component Dependencies View that presents a tree of component dependencies starting with a specified component for XSLT, XML Schema, Relax NG and NVDL. The new version also integrates the Saxon SA XQuery Update functionality and updates a number of components to their latest versions.
If you must have a specialized XML development environment, then Oxygen is the one to buy, though personally I still prefer using plain vanilla text editors and the command line. At the end of the day, XML is just text; and an excellent text editor does a better job of it than a text editor that's an afterthought in a product designed to shield users from raw XML. At most, I want some extra features on the side that don't get in my way when I'm just typing; for instance, a menu item to check the document for well-formedness or a spell checker that's smart enough to ignore tags. I don't want anything that gets in the way of my typing like auto-tag closing or tree views.
Mark Logic has released version 4.1 of their namesake XML database for Linux, Solaris, and Windows. New features in 4.1 include:
Pricing's hidden, but seems to be in the ballpark of $60,000 as best I can tell.
Norm Walsh has posted version 0.9.12 of Calabash, an open source XProc implementation written in Java. This release fixes bugs and adds a non-standard “general values extension”. Java 5 or later is required. Calabash is published under the GNU General Public License Version 2.0.
Oracle has released the final version of Java Specification Request (JSR) 225, XQuery API for Java™ (XQJ). There's also a reference implementation and technology compatibility kit. As JDBC is to SQL, XQJ is to XQuery.
The following sample Java code is meant to convey a first look and feel of the style and usage of the XQJ API. It is by no means exhaustive or complete; e.g., no error handling is shown and it is assumed that xqds is an XQDataSource object representing a given data source. It illustrates the basic steps that an application would perform to execute an XQuery expression at a given XQuery implementation.
    // establish a connection to the XQuery engine
    XQConnection conn = xqds.getConnection();

    // create an expression object that is later used
    // to execute an XQuery expression
    XQExpression expr = conn.createExpression();

    // the XQuery expression to be executed
    String es = "for $n in fn:doc('catalog.xml')//item " +
                "return fn:data($n/name)";

    // execute the XQuery expression
    XQResultSequence result = expr.executeQuery(es);

    // process the result (sequence) iteratively
    while (result.next()) {
        // retrieve the current item of the sequence as a String
        String str = result.getAtomicValue();
        System.out.println("Product name: " + str);
    }

    // free all resources allocated for the result
    result.close();

    // free all resources allocated for the expression
    expr.close();

    // free all resources allocated for the connection
    conn.close();
On a side note, kudos to the spec authors for putting this simple example in the spec right up front. Something like this would help a lot of other JSRs.
The W3C Voice Browser Working Group has published the second working draft of the VoiceXML 3.0 specification. VoiceXML is used to describe those annoying call trees you hear when calling most major companies. "Press 1 if you want to wait on hold for 20 minutes and then be hung up on; press 2 if you want to wait indefinitely; press 3 if you'd rather we just hung up on you now."
How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.
To accommodate both, the Voice Browser Working Group
- Developed the detailed semantic descriptions of VoiceXML functionality that versions 2.0 and 2.1 lacked. The semantic descriptions clarify the meaning of the VoiceXML 2.0 and 2.1 functionalities and how they relate to each other. The semantic descriptions are represented in this document as English text, UML state chart visual diagrams [ref] and/or textual SCXML representations [ref]. Figure 1 illustrates the VoiceXML 3.0 framework which contains some abstract UML state chart visual diagrams representing some existing VoiceXML functionality.
- Described the detailed semantics for new functionality. New functions include, for example, speaker identification and verification, video capture and replay, and a more powerful prompt queue. These semantic descriptions for these new functions are also represented in this document as English text, UML state chart visual diagrams [ref] and/or textual SCXML representations [ref]. Figure 2 contains some abstract UML state chart visual diagrams representing new functionality.
- Organized the functionality into modules, with each module implementing different functions. One reason for the introduction of a more rigorous semantic definition is that it allows us to assign semantics to individual modules. This makes it easier to understand what happens when modules are combined or new ones are defined. In contrast, VoiceXML 2.0 and 2.1 had a single global semantic definition (the FIA), which made it difficult to understand what would happen if certain elements were removed from the language or if new ones were added. Figure 3 contains some modules, each containing VoiceXML 3.0 functionality. Vendors may extend VoiceXML functionality by creating additional modules with additional functionality not described in this document. For example, a vendor might create a new GPS input module. Application developers should be cautious about using vendor-specific modules because the resulting application may not be portable.
- Restructured and revised the syntax of each module to incorporate any new functionality. Application developers use the syntax of each module as an API to invoke the module’s functions. Figure 4 illustrates some simplified syntax associated with modules.
- Introduced the concept of a profile (language) which incorporates the syntax of several modules. Figure 5 illustrates two profiles. For example, a VoiceXML 2.1 profile incorporates the syntax of most of the modules corresponding to the VoiceXML 2.1 functionality which will support most existing VoiceXML 2.1 applications. Thus most VoiceXML 2.1 applications can be easily ported to VoiceXML 3.0 using the VoiceXML 2.1 profile. Another profile omits the VoiceXML 2.1 Form Interpretation Algorithm (FIA). This profile may be used by developers who want to define their own flow control rather than using the FIA. Profiles enable platform developers to select just the functionality that application developers need for a platform or class of application. Multiple profiles enable developers to use just the profile (language) needed for a platform or class of applications. For example, a lean profile for portable devices, or a full-function profile for server-based applications using all of the new functionality of VoiceXML 3.0.
One of the benefits of detailed semantic descriptions is improving portability within VoiceXML. Two vendors may implement the same functionality differently; however, the functionality must be consistent with the semantic meanings described in this document so that application authors are isolated from the different implementations. This increases portability among platforms that support the same syntax. Note that there are many other factors that affect portability that are outside the scope of this document (e.g. speech recognition capabilities, telephony).
Wolfgang Meier has released eXist 1.2.6:
an open source database management system entirely built on XML technology. It stores XML data according to the XML data model and features efficient, index-based XQuery processing.
eXist-db supports many (web) technology standards making it an excellent application platform:
- XQuery 1.0 / XPath 2.0
- XSLT 1.0 (using Apache Xalan) or XSLT 2.0 (optional using Saxon)
- HTTP interfaces: REST, WebDAV, SOAP, XMLRPC, Atom Publishing Protocol
- XML database specific: XMLDB, XQJ/JSR-225 (under development), XUpdate, XQuery update extensions (to be aligned with the new XQuery Update Facility 1.0)
eXist-db is highly compliant with the XQuery standard (current XQTS score is 99.4%). The query engine is highly extensible and features a large collection of XQuery Function Modules.
1.2.6 fixes several scary database corruption issues.
The W3C XQuery working group has posted a new candidate recommendation of XQuery Update Facility. XQuery as it currently exists is basically just SELECT in SQL terms. XQuery Update adds INSERT, UPDATE, and DELETE. More specifically it is:
- upd:mergeUpdates
- upd:revalidate
- upd:applyUpdates
- upd:insertBefore
- upd:insertAfter
- upd:insertInto
- upd:insertIntoAsFirst
- upd:insertIntoAsLast
- upd:insertAttributes
- upd:delete
- upd:replaceNode
- upd:replaceValue
- upd:replaceElementContent
- upd:rename
- upd:removeType
- upd:setToUntyped
The following features are considered to be at risk:
They may be removed if implementations of them do not exist at the end of the Candidate Recommendation period.
Comments are due by August 31.
Could it really be 7 years? Yes, it could. Back from the dead after 7 years as a last call working draft, the W3C CSS Working Group has posted a new working draft of CSS Fonts Module Level 3. Described properties include:
- font-family
- font-weight
- font-stretch
- font-style
- font-variant
- font-size
- font-size-adjust
- font
- @font-face
"This draft consolidates material previously divided between the CSS3 Fonts and CSS3 Web Fonts modules."
The first release candidate of Firefox 3.5 is out; though you'll need to get it by auto-updating 3.5 beta 4. It's ugly as sin, breaks the back button, breaks the scrollbars, and still hasn't fixed this AppleScript bug. To add insult to injury the feedback page uses a pointless, illegible CAPTCHA:
I think I'm giving up on Firefox. I just need to get del.icio.us integrated into Safari and I'll be done.
They've also released Firefox 3.0.11 to fix a security vulnerability.
Michael Kay has released version 9.1.0.7 of Saxon, his XSLT 2.0 and XQuery processor for Java and .NET. This is a bug fix release.
Saxon is published in two versions, both of which require Java 1.4 or later (or .NET). Saxon 9.1B is an open source product published under the Mozilla Public License 1.0 that "implements the 'basic' conformance level for XSLT 2.0 and XQuery." Saxon 9.1 SA is £300.00 payware. According to Kay,
The most obvious difference between Saxon-SA and Saxon-B is that Saxon-SA is schema-aware: it allows stylesheets and queries to import an XML Schema, to validate input and output trees against a schema, and to select elements and attributes based on their schema-defined type. Saxon-SA also incorporates a free-standing XML Schema validator.
In addition Saxon-SA incorporates some advanced extensions and optimizations not available in the Saxon-B product:
- Saxon-SA is able to compile XQuery code directly into Java classes.
- Saxon-SA has an advanced optimizer which recognizes joins in XPath expressions, XQuery FLWOR expressions, and in XSLT templates (nested xsl:for-each instructions). Whereas Saxon-B always implements these as nested loops, Saxon-SA uses a variety of strategies including indexes and hash joins. This can give dramatic improvements in execution time for large documents: some of the queries in the XMark benchmark improve by a factor of 300 (from 16 seconds to 45 milliseconds) to process a 10Mbyte source file.
- Saxon-SA has a facility to process large documents in streaming mode. This enables documents to be handled that are too large to hold in memory (it has been tested up to 20Gb).
Additional extensions available in Saxon-SA include a try/catch capability for catching dynamic errors, improved error diagnostics, support for higher-order functions, and additional facilities in XQuery including support for grouping, advanced regular expression analysis, and formatting of dates and numbers.