Ignorable White Space

One of the more obscure parts of the XML 1.0 specification is the perhaps misleadingly named “ignorable white space”. This is white space that occurs between tags in places where the DTD does not allow mixed content. For example, consider the XML-RPC document in Example 6.13:

Example 6.13. A document that uses ignorable white space to prettify the XML

<?xml version="1.0"?>
<!DOCTYPE methodCall [
  <!ELEMENT methodCall (methodName, params)>
  <!ELEMENT params (param+)>
  <!ELEMENT param (value)>
  <!ELEMENT value (string)>
  <!ELEMENT methodName (#PCDATA)>
  <!ELEMENT string (#PCDATA)>
]>
<methodCall>
  <methodName>lookupSymbol</methodName>
  <params>
    <param>
      <value>
        <string>
          Red Hat 
        </string>
      </value>
    </param>
  </params>
</methodCall>

This example contains quite a bit of white space that exists only for indenting. In particular, the spaces, carriage returns, and line feeds between <methodCall> and <methodName>, </methodName> and <params>, <params> and <param>, <param> and <value>, <value> and <string>, </string> and </value>, </value> and </param>, </param> and </params>, and </params> and </methodCall> are there purely for indentation. Furthermore, the DTD says that these elements cannot contain #PCDATA, so a parser that has read the DTD knows this white space is ignorable. A validating parser therefore does not pass these white space characters to the characters() method. Instead it passes them to the ignorableWhitespace() method. A non-validating parser might do the same, or it might pass the ignorable white space to the characters() method instead. If this matters to you, make sure you use a validating parser.
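The following is a minimal sketch of how a handler can observe this split, assuming a JAXP parser such as the one bundled with the JDK. The class name WhiteSpaceCounter, its two fields, and the count() helper are illustrative names, not part of SAX. It turns validation on, then tallies how many characters arrive through each of the two callbacks:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name and fields are illustrative only.
class WhiteSpaceCounter extends DefaultHandler {

  int characterCount = 0;  // characters reported through characters()
  int ignorableCount = 0;  // characters reported through ignorableWhitespace()

  @Override
  public void characters(char[] text, int start, int length) {
    characterCount += length;
  }

  @Override
  public void ignorableWhitespace(char[] text, int start, int length) {
    ignorableCount += length;
  }

  @Override
  public void error(SAXParseException e) throws SAXException {
    throw e;  // treat validity errors as fatal rather than ignoring them
  }

  // Parses the document and returns the handler holding both tallies.
  static WhiteSpaceCounter count(String xml, boolean validating) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(validating);
    XMLReader reader = factory.newSAXParser().getXMLReader();
    WhiteSpaceCounter counter = new WhiteSpaceCounter();
    reader.setContentHandler(counter);
    reader.setErrorHandler(counter);
    reader.parse(new InputSource(new StringReader(xml)));
    return counter;
  }

  public static void main(String[] args) throws Exception {
    // A cut-down version of the document above: value may not contain
    // #PCDATA, but string may.
    String xml = "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE value [\n"
        + "  <!ELEMENT value (string)>\n"
        + "  <!ELEMENT string (#PCDATA)>\n"
        + "]>\n"
        + "<value>\n"
        + "  <string> Red Hat </string>\n"
        + "</value>";
    WhiteSpaceCounter counter = count(xml, true);
    System.out.println("characters():          " + counter.characterCount);
    System.out.println("ignorableWhitespace(): " + counter.ignorableCount);
  }
}
```

With validation turned on, the indentation around the string element should be counted under ignorableWhitespace(), while the text inside it, including its leading and trailing spaces, should be counted under characters(). Passing false for the validating argument shows what your particular parser does when it is not validating.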

The space and line break characters in the string element are not ignorable because the DTD allows this element to contain #PCDATA. This white space is passed to the characters() method along with the words Red and Hat. White space is considered ignorable only where #PCDATA is invalid.

For purposes of this method, white space consists exclusively of the ASCII space (&#x20;), tab (&#x9;), carriage return (&#xD;), and line feed (&#xA;). Unicode includes many more white space characters, including next line (&#x85;), em space (&#x2003;), en space (&#x2002;), and more. However, these characters are never ignorable.
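If you need to apply the same definition in your own code, a small predicate is enough. The sketch below uses hypothetical names (XMLWhiteSpace, isXMLWhiteSpace(), isAllXMLWhiteSpace()) and deliberately avoids Character.isWhitespace(), which also returns true for characters such as the em space that XML does not treat as white space:

```java
// Hypothetical utility class; the names are illustrative only.
class XMLWhiteSpace {

  // True only for the four characters XML 1.0 treats as white space:
  // space, tab, carriage return, and line feed.
  static boolean isXMLWhiteSpace(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
  }

  // True if every character in text[start] through text[start+length-1]
  // is XML white space. An empty range is trivially all white space.
  static boolean isAllXMLWhiteSpace(char[] text, int start, int length) {
    for (int i = start; i < start + length; i++) {
      if (!isXMLWhiteSpace(text[i])) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isXMLWhiteSpace(' '));              // ASCII space
    System.out.println(isXMLWhiteSpace('\u2003'));         // em space
    System.out.println(Character.isWhitespace('\u2003'));  // Java's broader test
  }
}
```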

The ignorableWhitespace() method has the same arguments and the same caveats as the characters() method. For instance, there’s no guarantee that each call to this method will contain the maximum contiguous run of ignorable white space. However, its text[] argument should contain nothing except spaces, tabs, carriage returns, and line feeds, at least in the sub-array delineated by start and start+length.
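Because a single run of ignorable white space may be split across several calls, a handler that cares about whole runs has to accumulate them itself. Here is one possible sketch, assuming that a run ends at the next start-tag or end-tag. The class name IgnorableRunCollector and its runs list are illustrative, not part of SAX. Note that it reads only the sub-array from start to start+length, never the whole text[] array:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; the name and the runs field are illustrative only.
class IgnorableRunCollector extends DefaultHandler {

  private final StringBuilder run = new StringBuilder();
  final List<String> runs = new ArrayList<>();  // one entry per completed run

  @Override
  public void ignorableWhitespace(char[] text, int start, int length) {
    // Only text[start] through text[start+length-1] belongs to this call;
    // the rest of the array may hold unrelated data.
    run.append(text, start, length);
  }

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes atts) {
    flush();
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    flush();
  }

  // Stores the accumulated run, if any, and resets the buffer.
  private void flush() {
    if (run.length() > 0) {
      runs.add(run.toString());
      run.setLength(0);
    }
  }
}
```

A handler that also wants comments or processing instructions to end a run would need to flush in those callbacks as well.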


Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last Modified October 16, 2001