How are documents canonicalized?

  1. The document is encoded in UTF-8

  2. Line breaks are normalized to a linefeed (ASCII 10, \n)

  3. Attribute values are normalized, as if by a validating processor

  4. Character and parsed entity references are replaced

  5. CDATA sections are replaced with their character content

  6. The XML and document type declarations are removed

  7. Empty elements are converted to start tag-end tag pairs

  8. White space outside of the document element and within start and end tags is normalized

  9. All white space in character content is retained (except for characters removed during linefeed normalization)

  10. Attribute value delimiters are set to double quotes

  11. Special characters in attribute values and character content are replaced by character references

  12. Superfluous namespace declarations are removed from each element

  13. Default attributes are added to each element

  14. Lexicographic order is imposed on the namespace declarations and attributes of each element
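
A few of the steps above can be seen in a minimal sketch using Python's standard library. It assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize() is available. That function implements the later C14N 2.0 specification rather than the draft this list summarizes, so treat it as an illustration, not an exact rendering of every rule. The sample document and variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: an XML declaration, extra white space inside the
# start tag, single-quoted attributes in non-alphabetical order, an empty
# element, and a CDATA section.
source = (
    '<?xml version="1.0"?>\n'
    "<doc   b='2'  a='1' ><empty/><![CDATA[x < y]]></doc>"
)

# With no output file given, canonicalize() returns the canonical form as a string.
canonical = ET.canonicalize(source)
print(canonical)
# <doc a="1" b="2"><empty></empty>x &lt; y</doc>
```

In the result the XML declaration is gone (step 6), the empty element has become a start tag-end tag pair (step 7), the CDATA section has been replaced by its escaped character content (steps 5 and 11), the attribute values are delimited by double quotes (step 10), and the attributes are sorted (step 14).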



Copyright 2000, 2001 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified September 13, 2000