XML and whitespaces

I thought the handling of white space in XML documents was simple. Now I know better. Let me share my understanding. White space is processed at two main levels:

at the parser level;
at the application level, either when manipulating text in the DOM or when rendering text.

Parsing

White space may appear almost anywhere in a document, but an application only cares about the white space found in attribute values and in text content. XML 1.0 (and XML 1.1, which is the same in this respect) specifies what should be done with white space:

An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
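To see this rule in practice, here is a minimal sketch using Python's xml.dom.minidom, a non-validating parser. The sample document and the variable names are mine, not part of the spec. The indentation between the elements shows up in the DOM as ordinary text nodes, because the parser has no grammar telling it that this white space is insignificant.

```python
from xml.dom import minidom

# Hypothetical sample document, indented for readability.
SOURCE = "<list>\n  <item>one</item>\n  <item> two </item>\n</list>"

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# List the direct children of <list>: (node name, text data or None).
children = [
    (node.nodeName, node.data if node.nodeType == node.TEXT_NODE else None)
    for node in root.childNodes
]

for name, data in children:
    print(name, repr(data))

# The white space inside the second <item> is passed through untouched too.
second_item_text = root.getElementsByTagName("item")[1].firstChild.data
print(repr(second_item_text))
```

Without validation, it is up to the application to decide which of these text nodes are mere indentation.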

XML defines what is markup and what is not.

Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).

Attribute values are therefore markup, as part of the start-tag construct, so the rule quoted above does not apply to them. White space in attribute values is handled differently:

Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
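As an illustration, the sketch below parses a made-up element with Python's xml.etree.ElementTree (built on expat) and looks at the values it hands back. The element and attribute names are hypothetical. It assumes no DTD, so every attribute is treated as CDATA: literal tabs and newlines become spaces, characters written as character references survive, and runs of spaces are not collapsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical element. The first attribute contains a literal tab and a
# literal newline, the second the same characters as character references.
SOURCE = (
    '<e literal="one\ttwo\nthree"'
    ' ref="one&#9;two&#10;three"'
    ' spaced="  a   b  "/>'
)

root = ET.fromstring(SOURCE)

literal = root.get("literal")  # literal tab and newline normalized to spaces
ref = root.get("ref")          # character references are kept as-is
spaced = root.get("spaced")    # CDATA attribute: spaces are not collapsed

print(repr(literal))
print(repr(ref))
print(repr(spaced))
```

With a DTD declaring an attribute as a non-CDATA type (NMTOKENS, for example), a processor that reads the declaration would additionally trim and collapse the spaces.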

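Additionally, XML defines the xml:space attribute, but it has no effect at the parsing level: the parser passes the same characters through whatever its value. It only signals to the application the intent that white space should be preserved.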
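A quick way to check this is to parse the same content with both values of xml:space and compare what the parser delivers. This is only a sketch with Python's xml.etree.ElementTree; the helper name parsed_text and the sample content are mine.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def parsed_text(xml_space_value):
    """Parse a small sample element and return (xml:space value, text)."""
    source = f'<p xml:space="{xml_space_value}">  a \n b  </p>'
    root = ET.fromstring(source)
    return root.get(XML_SPACE), root.text

print(parsed_text("default"))
print(parsed_text("preserve"))
```

In both cases the text comes back with every space and newline intact; the attribute is just one more attribute for the application to interpret.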

DOM Manipulations

The characters passed on by the parser end up in the DOM as text and CDATA section nodes, in the form of a DOMString. In DOM Level 3 it is possible to get or set the text content of a DOM subtree without walking the whole tree, using the textContent attribute. On a DOM node, this attribute returns the text content with all white space if no validation is performed, or without the white space that the associated grammar (DTD, Schema, RNG, …) identifies as white space in element content:

On getting, no serialization is performed, the returned string does not contain any markup. No whitespace normalization is performed and the returned string does not contain the white spaces in element content (see the attribute Text.isElementContentWhitespace).
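The sketch below approximates this behaviour in Python with xml.etree.ElementTree, which has no textContent of its own. The function text_content and its element_only parameter are my own invention: element_only stands in for what a validating parser would know from the grammar, namely which elements may only contain child elements. It is an illustration of the rule, not an implementation of the DOM API.

```python
import xml.etree.ElementTree as ET

XML_WHITE_SPACE = " \t\r\n"

def text_content(elem, element_only=frozenset()):
    """Concatenate the text of a subtree, roughly like DOM3 textContent.

    White-space-only text found directly inside an element listed in
    element_only is skipped, as if a validating parser had flagged it
    as white space in element content.
    """
    skip_ws = elem.tag in element_only
    parts = []

    def keep(text):
        if not text:
            return
        if skip_ws and text.strip(XML_WHITE_SPACE) == "":
            return
        parts.append(text)

    keep(elem.text)
    for child in elem:
        parts.append(text_content(child, element_only))
        keep(child.tail)  # text that follows the child inside elem
    return "".join(parts)

root = ET.fromstring("<a>\n  <b>x y</b>\n  <c> z</c>\n</a>")

no_validation = text_content(root)
with_grammar = text_content(root, element_only={"a"})

print(repr(no_validation))
print(repr(with_grammar))
```

Note that the white space inside <b> and <c> is kept in both cases: no normalization is performed, only element content white space is left out.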

Rendering

The rendering algorithm is what finally deals with white space. For example, SVG overrides the semantics of the XML-defined xml:space attribute. When its value is “default”, SVG 1.1 normalizes the text (it removes newline characters, converts tabs into spaces, strips leading and trailing spaces, and collapses runs of spaces) and renders the result without modifying the value in the DOM tree. When its value is “preserve”, only newlines and tabs are converted into spaces, and all spaces are displayed. In an HTML context, CSS defines the white-space property and its associated processing model.
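To make the SVG case concrete, here is a small Python sketch of that processing as I read it in SVG 1.1, where the “default” mode removes newlines rather than turning them into spaces. The function name svg_render_text is hypothetical. It returns the string to draw and leaves the original (DOM) value untouched.

```python
import re

def svg_render_text(text, xml_space="default"):
    """Return the characters SVG 1.1 would draw for a text string."""
    if xml_space == "preserve":
        # Newlines and tabs become spaces; every space is drawn.
        return re.sub(r"[\r\n\t]", " ", text)
    # "default": remove newlines, turn tabs into spaces,
    # strip leading/trailing spaces, then collapse runs of spaces.
    text = re.sub(r"[\r\n]", "", text)
    text = text.replace("\t", " ")
    text = text.strip(" ")
    return re.sub(r" {2,}", " ", text)

dom_value = "  a\n\tb   c  "

drawn_default = svg_render_text(dom_value, "default")
drawn_preserve = svg_render_text(dom_value, "preserve")

print(repr(drawn_default))
print(repr(drawn_preserve))
```

The same DOM value therefore renders differently depending on xml:space, while a parser would have delivered it identically in both cases.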

Cyril @ Telecom ParisTech

Thanks to Robin for his help in understanding all of this mess.

