Normalization

Implementations have quite a bit of leeway in exactly how they parse and serialize any given document. For example, a parser may represent CDATA sections as CDATASection objects or it may merge them into neighboring Text objects. A parser may include entity reference nodes in the tree, or it may instead include the nodes corresponding to each entity’s replacement text. A parser may include comments or it may ignore them. DOM Level 3 adds four methods to the Document interface to control exactly how a parser makes these choices.

public void normalizeDocument();
public boolean canSetNormalizationFeature(String name, boolean state);
public void setNormalizationFeature(String name, boolean state);
public boolean getNormalizationFeature(String name);

The canSetNormalizationFeature() method tests whether the implementation supports the desired value (true or false) for the named feature. The setNormalizationFeature() method sets the value of the named feature. It throws a DOMException with the error code NOT_FOUND_ERR if the implementation does not support the feature at all. It throws a DOMException with the error code NOT_SUPPORTED_ERR if the implementation does not support the requested value for the feature, for example if you try to set to true a feature that must have the value false. The getNormalizationFeature() method returns the current value of the named feature. Finally, after all the features have been set, client code can invoke the normalizeDocument() method to modify the tree in accordance with the current values for all the different features.
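The test-then-set pattern described above can be sketched in runnable form, with one important assumption: the draft methods named here are not part of the org.w3c.dom package bundled with current JDKs. In the final DOM Level 3 Core Recommendation the same operations live on the DOMConfiguration object returned by Document.getDomConfig(), as canSetParameter(), setParameter(), and getParameter(). The class FeatureProbe and its methods are hypothetical names used only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper, not from the original text. It uses the final
// DOM Level 3 API (DOMConfiguration) instead of the draft methods.
class FeatureProbe {

    /** Tries to set a boolean normalization parameter; returns true on success. */
    static boolean trySet(Document doc, String name, boolean state) {
        DOMConfiguration config = doc.getDomConfig();
        Boolean value = Boolean.valueOf(state);
        // Ask first, so that an unsupported value is an ordinary outcome
        // rather than an exception.
        if (!config.canSetParameter(name, value)) {
            return false;
        }
        try {
            config.setParameter(name, value);
            return true;
        } catch (DOMException e) {
            // NOT_FOUND_ERR: unknown name; NOT_SUPPORTED_ERR: unsupported value
            return false;
        }
    }

    /** Creates an empty Document with the JDK's default parser. */
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = emptyDocument();
        System.out.println("comments=false accepted: "
            + trySet(doc, "comments", false));
        System.out.println("normalize-characters=true accepted: "
            + trySet(doc, "normalize-characters", true));
        // Apply whatever settings were accepted.
        doc.normalizeDocument();
    }
}
```

Checking before setting keeps the exception path for genuinely unexpected failures, which matters because most features are optional and an implementation may support only one of the two values.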

Caution

These are bleeding-edge ideas from the latest DOM Level 3 Core Working Draft. Xerces 2.0.2 is the only DOM implementation that supports any of this yet.

The DOM 3 specification defines 13 standard features:

normalize-characters, optional, default false

If true, document text should be normalized according to the W3C Character Model. For example, the word resumé would be represented as the six-character string r e s u m é rather than the seven-character string r e s u m e followed by a combining acute accent. Implementations are only required to support a false value for this feature.
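The difference between the two spellings of resumé can be seen at the string level with the JDK's java.text.Normalizer. This is only a sketch of the idea, assuming that the composed form described above corresponds to Unicode Normalization Form C (NFC); it says nothing about how a DOM implementation would apply the feature internally. The class name ComposeDemo is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration, not from the original text.
class ComposeDemo {

    /** Returns the NFC (composed) form of the given string. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "resume" followed by U+0301, the combining acute accent
        String decomposed = "resume\u0301";
        String composed = compose(decomposed);

        System.out.println(decomposed.length()); // 7
        System.out.println(composed.length());   // 6
        // The composed form ends in the single character U+00E9.
        System.out.println(composed.equals("resum\u00e9")); // true
    }
}
```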

split-cdata-sections, required, default true

If true, CDATA sections containing the CDATA section end delimiter ]]> are split into pieces and the ]]> included in a raw text node. If false, such a CDATA section is not split.

entities, optional, default true

If false, entity reference nodes are replaced by their children. If true, they’re not.

whitespace-in-element-content, optional, default true

If true, all white space is retained. If false, text nodes containing only white space are deleted if the parent element’s declaration from the DTD/schema does not allow #PCDATA to appear at that point.

discard-default-content, required, default true

If true, the implementation will throw away any nodes whose presence can be inferred from the DTD or schema; e.g. default attribute values.

canonical-form, optional, default false

If true, the document will be arranged according to the rules specified by the Canonical XML specification, at least within the limits of what can be represented in a DOM implementation. For example, EntityReference nodes would be replaced by their content and CDATASection objects would be converted to Text objects. However, there’s no way in DOM to specify everything canonicalization requires. For instance, a DOM Element does not know the order of its attributes or whether an empty element will be written as a single empty-element tag or start-tag/end-tag pair. Thus, full canonicalization has to be deferred to serialization time.

namespace-declarations, optional, default true

If false, then all Attr nodes representing namespace declarations are deleted from the tree. Otherwise they’re retained. This has no effect on the namespaces associated with individual elements and attributes.

validate, optional, default false

If true, then the document’s schema or DTD is used to validate the document as it is being normalized. Any validation errors that are discovered are reported to the registered error handler. (Both validation and error handlers are other new features in DOM3.)

validate-if-schema, optional, default false

If true and the validate feature is also true, then the document is validated if and only if it has some kind of schema (DTD, W3C XML Schema Language schema, RELAX NG schema, etc.).

datatype-normalization, required, default false

If true, datatype normalization is performed according to the schema. For example, an element declared to have type xsd:boolean and represented as <state>1</state> could be changed to <state>true</state>.

cdata-sections, optional, default true

If false, all CDATASection objects are changed into Text objects and merged with any adjacent Text objects. If true, each CDATA section is represented as its own CDATASection object.

comments, required, default true

If true, comments are included in the Document; if false, they’re not.

infoset, optional

If true, the Document only contains information provided by the XML Infoset. This is the same as setting namespace-declarations, validate-if-schema, entities, and cdata-sections to false and datatype-normalization, whitespace-in-element-content, and comments to true.

In addition, vendors are allowed to define their own non-standard features. Feature names must be XML 1.0 names, and should use a vendor-specific prefix such as apache: or oracle:.

For an example of how these could be useful, consider the SOAP servlet earlier in this chapter. It needed to locate the calculateFibonacci element in the request document and extract its full text content. This had to work even if that element contained comments and CDATA sections. The getFullText() method that accomplished this wasn’t too hard to write. Nonetheless, in DOM3 it’s even easier. Set the cdata-sections and comments features to false and call normalizeDocument() as soon as the document is parsed. Once this is done, the calculateFibonacci element contains only one text node child.

try {
    Document request = parser.parse(in);
    // Turn CDATA sections into plain text and drop comments
    request.setNormalizationFeature("cdata-sections", false);
    request.setNormalizationFeature("comments", false);
    request.normalizeDocument();
    
    NodeList ints = request.getElementsByTagNameNS(
       "http://namespaces.cafeconleche.org/xmljava/ch3/", 
       "calculateFibonacci");
    Node calculateFibonacci = ints.item(0);
    // After normalization the element has a single Text child
    Node text = calculateFibonacci.getFirstChild();
    String generations = text.getNodeValue();
    // ...
}
catch (DOMException e) {
    // The cdata-sections feature is true by default and 
    // parsers aren’t required to support a false value for it, so 
    // you should be prepared to fall back on manual normalization 
    // if necessary. The comments feature, however, is required.
}
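For readers who want to try this with the DOM that ships in a current JDK, here is a self-contained sketch of the same idea. It rests on an assumption that goes beyond the draft described in this section: the final DOM Level 3 API exposes these settings as the cdata-sections and comments parameters of the DOMConfiguration object returned by getDomConfig(), not through setNormalizationFeature(). The class NormalizeRequest, the method normalizedElement(), and the sample request string are all invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch using the final DOM Level 3 API in the JDK.
class NormalizeRequest {

    static final String NS =
        "http://namespaces.cafeconleche.org/xmljava/ch3/";

    /** Parses xml, normalizes it, and returns the calculateFibonacci element. */
    static Node normalizedElement(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document request = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            DOMConfiguration config = request.getDomConfig();
            // Convert CDATA sections to text and remove comments
            config.setParameter("cdata-sections", Boolean.FALSE);
            config.setParameter("comments", Boolean.FALSE);
            request.normalizeDocument();

            return request
                .getElementsByTagNameNS(NS, "calculateFibonacci").item(0);
        } catch (Exception e) {
            // Parsing failures and DOMExceptions both end up here; real
            // code would fall back on manual normalization instead.
            throw new IllegalStateException("Could not normalize request", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<calculateFibonacci xmlns='" + NS + "'>"
            + "<!-- how many -->1<![CDATA[0]]></calculateFibonacci>";
        Node element = normalizedElement(xml);
        System.out.println("children: " + element.getChildNodes().getLength());
        System.out.println("text: " + element.getFirstChild().getNodeValue());
    }
}
```

The sample input deliberately splits the number across a text node and a CDATA section with a comment in front, which is exactly the situation the normalization is meant to flatten.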

This wouldn’t work for the XML-RPC case, however, because XML-RPC documents can contain processing instructions, and there’s no feature to turn off processing instructions.
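When normalization is unavailable, or when processing instructions may be present, the text has to be collected by hand. The following is a minimal sketch of such a fallback. It is not the getFullText() method from earlier in the chapter; the class FullText and its methods are hypothetical, and the choice to descend into child elements and entity references is an assumption about what "full text" should mean.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical manual fallback, not the author's getFullText().
class FullText {

    /** Concatenates all text and CDATA content beneath the given node. */
    static String fullText(Node node) {
        StringBuilder result = new StringBuilder();
        collect(node, result);
        return result.toString();
    }

    private static void collect(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    out.append(child.getNodeValue());
                    break;
                case Node.ELEMENT_NODE:
                case Node.ENTITY_REFERENCE_NODE:
                    collect(child, out); // descend into nested content
                    break;
                default:
                    // Comments and processing instructions are skipped
                    break;
            }
        }
    }

    /** Parses a string into a Document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<value><?pi data?>4<!--c--><![CDATA[2]]></value>");
        System.out.println(fullText(doc.getDocumentElement()));
    }
}
```

Because it only relies on node types and sibling traversal from the core DOM interfaces, this approach does not depend on any optional normalization feature.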


Copyright 2001, 2002 Elliotte Rusty Harold, elharo@metalab.unc.edu. Last Modified July 29, 2002