W3C

XSLT 2.0 and XQuery 1.0 Serialization

W3C Working Draft 11 February 2005

This version:
http://www.w3.org/TR/2005/WD-xslt-xquery-serialization-20050211/
Latest version:
http://www.w3.org/TR/xslt-xquery-serialization/
Previous versions:
http://www.w3.org/TR/2004/WD-xslt-xquery-serialization-20041029/
http://www.w3.org/TR/2004/WD-xslt-xquery-serialization-20040723/
http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/
http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20030502/
Editors:
Michael Kay, Saxonica (formerly of Software AG) <http://www.saxonica.com>
Norman Walsh, Sun Microsystems <Norman.Walsh@Sun.COM>
Henry Zongaro, IBM <zongaro@ca.ibm.com>
Scott Boag, IBM <scott_boag@us.ibm.com>
Joanne Tong, IBM <joannet@ca.ibm.com>

This document is also available in these non-normative formats: XML.


Abstract

This document defines serialization of an instance of the data model as defined in [Data Model] into a sequence of octets. [Definition: Serialization is designed to be a component of a host language such as [XSLT 2.0] or [XQuery 1.0].]

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a Public Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document describes how [XSLT 2.0], [XQuery 1.0] and other related XML standards convert an instance of the [Data Model] into a sequence of octets.

This draft includes many corrections and changes based on member-only and public comments on the Last Call Working Draft (http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/). The XML Query and XSL WGs wish to thank the people who have sent in comments for their close reading of the document.

This draft reflects decisions taken up to and including the joint meeting in Redwood Shores, CA during the week of November 8, 2004. These decisions are recorded in the Last Call issues list (http://www.w3.org/2005/02/xquery-serialization-issues.html). However, some of these decisions may not yet be reflected in this document.

XSLT 2.0 and XQuery 1.0 Serialization has been defined jointly by the XSL Working Group and the XML Query Working Group (both part of the XML Activity).

Public comments on this document and its open issues are invited. Comments should be sent to the W3C XSLT/XPath/XQuery mailing list, public-qt-comments@w3.org (archived at http://lists.w3.org/Archives/Public/public-qt-comments/), with “[Serial]” at the beginning of the subject field.

The patent policy for this document is the 5 February 2004 W3C Patent Policy. Patent disclosures relevant to this specification may be found on the XML Query Working Group's patent disclosure page and the XSL Working Group's patent disclosure page. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
    1.1 Terminology
2 Sequence Normalization
3 Serialization Parameters
4 Phases of Serialization
5 XML Output Method
    5.1 The Influence of Serialization Parameters upon the XML Output Method
        5.1.1 XML Output Method: the version Parameter
        5.1.2 XML Output Method: the encoding Parameter
        5.1.3 XML Output Method: the indent Parameter
        5.1.4 XML Output Method: the cdata-section-elements Parameter
        5.1.5 XML Output Method: the omit-xml-declaration and standalone Parameters
        5.1.6 XML Output Method: the doctype-system and doctype-public Parameters
        5.1.7 XML Output Method: the undeclare-namespaces Parameter
        5.1.8 XML Output Method: the normalization-form Parameter
        5.1.9 XML Output Method: Other Parameters
6 XHTML Output Method
7 HTML Output Method
    7.1 The Influence of Serialization Parameters upon the HTML Output Method
        7.1.1 HTML Output Method: Markup for Elements
        7.1.2 HTML Output Method: Writing Attributes
        7.1.3 HTML Output Method: Indentation
        7.1.4 HTML Output Method: Writing Character Data
        7.1.5 HTML Output Method: Encoding
        7.1.6 HTML Output Method: Document Type Declaration
        7.1.7 HTML Output Method: Unicode Normalization
        7.1.8 HTML Output Method: Other Parameters
8 Text Output Method
9 Character Maps
10 Conformance

Appendices

A References
    A.1 Normative References
    A.2 Non-normative References
B Summary of Error Conditions


1 Introduction

This document defines serialization of the W3C XQuery 1.0 and XPath 2.0 Data Model, which is the data model of at least [XPath 2.0], [XSLT 2.0], and [XQuery 1.0], and any other specifications that reference it.

Serialization is the process of converting an instance of the [Data Model] into a sequence of octets. Serialization is well-defined for most data model instances.

Ed. Note: The document assumes the reader already knows generally what serialization is. A brief explanation will be added, especially to disabuse any reader who thinks it might mean Java (or .NET) serialization.

1.1 Terminology

In this specification, where they appear in upper case, the words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", "MAY", "REQUIRED", and "RECOMMENDED" are to be interpreted as described in [RFC2119].

[Definition: As is indicated in 10 Conformance, conformance criteria for serialization are determined by other specifications that refer to this specification. A serializer is software that implements some or all of the requirements of this specification in accordance with such conformance criteria.] A serializer is not REQUIRED to directly provide a programming interface that permits a user to set serialization parameters or to provide an input sequence for serialization.

Certain aspects of serialization are described in this specification as implementation-defined or implementation-dependent.

[Definition: Implementation-defined indicates an aspect that MAY differ between serializers, but whose actual behaviour MUST be specified either by another specification that sets conformance criteria for serialization (see 10 Conformance) or in documentation that accompanies the serializer.]

[Definition: Implementation-dependent indicates an aspect that MAY differ between serializers, and whose actual behaviour is not REQUIRED to be specified either by another specification that sets conformance criteria for serialization (see 10 Conformance) or in documentation that accompanies the serializer.]

[Definition: In some instances, the sequence that is input to serialization cannot be successfully converted into a sequence of octets given the set of serialization parameter (3 Serialization Parameters) values specified. A serialization error is said to occur in such an instance.] In some cases, a serializer is REQUIRED to signal such an error. What it means to signal a serialization error is determined by the relevant conformance criteria (10 Conformance) to which the serializer conforms. In other cases, there is an implementation-defined choice between signalling a serialization error and performing a recovery action. Such a recovery action will allow a serializer to produce a sequence of octets that might not fully reflect the usual requirements of the parameter settings that are in effect.

Many terms used in this document are defined in the XPath specification [XPath 2.0] or the Data Model specification [Data Model]. Particular attention is drawn to the following:

2 Sequence Normalization

An instance of the data model that is input to the serialization process is a sequence. Prior to serializing a sequence using any of the output methods whose behavior is specified by this document (3 Serialization Parameters) the serializer MUST first compute a normalized sequence for serialization; it is the normalized sequence that is actually serialized. [Definition: The purpose of sequence normalization is to create a sequence that can be serialized as a well-formed XML document or external general parsed entity, that also reflects the content of the input sequence to the extent possible.] [Definition: The result of the sequence normalization process is a result tree.]

The normalized sequence for serialization is constructed by applying all of the following rules in order, with the initial sequence being input to the first step, and the sequence that results from any step being used as input to the subsequent step. For any implementation-defined output method, it is implementation-defined whether this sequence normalization process takes place.

Where the process of converting the input sequence to a normalized sequence indicates that a value MUST be cast to xs:string, that operation is as defined in Section 17.1.2 Casting to xs:string and xdt:untypedAtomic of [Functions and Operators]. The steps in computing the normalized sequence are:

  1. If the sequence that is input to serialization is empty, create a sequence S1 that consists of a zero-length string. Otherwise, copy each item in the sequence that is input to serialization to create the new sequence S1.

  2. For each item in S1, if the item is atomic, obtain the lexical representation of the item by casting it to an xs:string and copy the string representation to the new sequence; otherwise, copy the item, which will be a node, to the new sequence. The new sequence is S2.

  3. For each subsequence of adjacent strings in S2, copy a single string to the new sequence equal to the values of the strings in the subsequence concatenated in order, each separated by a single space. Copy all other items to the new sequence. The new sequence is S3.

  4. For each item in S3, if the item is a string, create a text node in the new sequence whose string value is equal to the string; otherwise, copy the item to the new sequence. The new sequence is S4.

  5. For each item in S4, if the item is a document node, copy its children to the new sequence; otherwise, copy the item to the new sequence. The new sequence is S5.

  6. It is a serialization error [err:SE0001] if an item in S5 is an attribute node or a namespace node. Otherwise, construct a new sequence, S6, that consists of a single document node and copy all the items in the sequence, which are all nodes, as children of that document node.

S6 is the normalized sequence.

The result tree rooted at the document node that is created by the final step of this sequence normalization process is the instance of the data model to which the rules of the appropriate output method are applied. If the sequence normalization process results in a serialization error, the serializer MUST signal the error.
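
The six normalization steps above can be illustrated with a short, non-normative Python sketch. The Node class, the SerializationError exception and the cast_to_string helper are hypothetical stand-ins introduced only for this illustration; in particular, cast_to_string is a simplification of the casting rules in [Functions and Operators], and atomic values are modelled as plain Python values.

```python
from dataclasses import dataclass, field
from typing import List


class SerializationError(Exception):
    """Hypothetical exception carrying an err:SExxxx code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


@dataclass
class Node:
    """Toy stand-in for a data model node (not a real data model API)."""
    kind: str                  # "document", "element", "text", "attribute", ...
    value: str = ""            # string value of a text node, or a name
    children: List["Node"] = field(default_factory=list)


def cast_to_string(atom) -> str:
    # Simplified stand-in for casting an atomic value to xs:string.
    if isinstance(atom, bool):
        return "true" if atom else "false"
    return str(atom)


def normalize_sequence(items) -> Node:
    # Step 1: an empty input becomes a single zero-length string.
    s1 = list(items) or [""]
    # Step 2: atomic values are replaced by their string representation.
    s2 = [i if isinstance(i, Node) else cast_to_string(i) for i in s1]
    # Step 3: adjacent strings are joined, separated by a single space.
    s3, run = [], []
    for item in s2:
        if isinstance(item, str):
            run.append(item)
            continue
        if run:
            s3.append(" ".join(run))
            run = []
        s3.append(item)
    if run:
        s3.append(" ".join(run))
    # Step 4: each string becomes a text node.
    s4 = [Node("text", i) if isinstance(i, str) else i for i in s3]
    # Step 5: document nodes are replaced by their children.
    s5 = []
    for item in s4:
        s5.extend(item.children if item.kind == "document" else [item])
    # Step 6: attribute and namespace nodes are an error; otherwise wrap.
    for item in s5:
        if item.kind in ("attribute", "namespace"):
            raise SerializationError("err:SE0001", f"parentless {item.kind} node")
    return Node("document", children=s5)
```

For instance, under these assumptions the input sequence (1, true, an element node, "x") yields a document node whose children are a text node "1 true", the element node, and a text node "x".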

Note:

The sequence normalization process for a sequence $seq is equivalent to constructing a document node using the XSLT instruction:

<xsl:document>
  <xsl:copy-of select="$seq"/>
</xsl:document>

or the XQuery expression:

document {
  for $s in $seq return
    if ($s instance of document-node())
    then $s/child::node()
    else $s
}

This process results in a serialization error [err:SE0001] with sequences containing parentless attribute and namespace nodes.

3 Serialization Parameters

There are a number of parameters that influence how serialization is performed. Host languages MAY allow users to specify any or all of these parameters, but they are not REQUIRED to be able to do so. However, the host language MUST specify all applicable parameters except doctype-public and doctype-system, which are optional even when they are applicable.

The following serialization parameters are defined:

Serialization parameter name Permitted values for parameter
byte-order-mark One of the enumerated values yes or no. This parameter indicates whether the serialized sequence of octets is to be preceded by a Byte Order Mark. (See Section 5.1 of [Unicode Encoding].) The actual octet order used is implementation-dependent. If the concept of a Byte Order Mark is not meaningful in connection with the value of the encoding parameter, the byte-order-mark parameter is ignored.
cdata-section-elements A list of expanded QNames, possibly empty.
doctype-public A string of Unicode characters. This parameter may be absent.
doctype-system A string of Unicode characters. This parameter may be absent.
encoding A string of Unicode characters in the range #x21 to #x7E (that is, printable ASCII characters); the value SHOULD be a charset registered with the Internet Assigned Numbers Authority [IANA], [RFC2278] or begin with the characters x- or X-.
escape-uri-attributes One of the enumerated values yes or no.
include-content-type One of the enumerated values yes or no.
indent One of the enumerated values yes or no.
media-type A string of Unicode characters specifying the media type (MIME content type) [RFC2046]; the charset parameter of the media type MUST NOT be specified explicitly in the value of the media-type parameter. If the destination of the serialized output is annotated with a media type, this parameter MAY be used to provide such an annotation. For example, it MAY be used to set the media type in an HTTP header.
method An expanded QName with an empty namespace URI, and the local part of the name equal to one of xml, xhtml, html or text, or having a non-empty namespace URI. If the namespace URI is non-empty, the parameter specifies an implementation-defined output method.
normalization-form One of the enumerated values NFC, NFD, NFKC, NFKD, fully-normalized, none or an implementation-defined value.
omit-xml-declaration One of the enumerated values yes or no.
standalone One of the enumerated values yes, no or none.
undeclare-namespaces One of the enumerated values yes or no.
use-character-maps A list of pairs, possibly empty, with each pair consisting of a single Unicode character and a string of Unicode characters.
version A string of Unicode characters.

The value of the method parameter is an expanded QName. If the value has an empty namespace URI, then the local name identifies a method specified in this document and MUST be one of xml, html, xhtml, or text; in this case, the output method specified MUST be used for serializing. If the namespace URI is non-empty, then it identifies an implementation-defined output method; the behavior in this case is not specified by this document.
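
As a non-normative illustration, the interpretation of the method parameter might be sketched in Python as follows. The function name classify_method and the tuple it returns are assumptions made for this example only; this document does not define such an interface, and the use of ValueError for an invalid value is likewise an assumption.

```python
# Output methods whose behavior is specified by this document.
BUILT_IN_METHODS = {"xml", "html", "xhtml", "text"}


def classify_method(namespace_uri: str, local_name: str):
    """Classify an expanded QName given as the value of the method parameter.

    Hypothetical helper: returns ("built-in", local_name) or
    ("implementation-defined", (namespace_uri, local_name)).
    """
    if not namespace_uri:
        # Empty namespace URI: the local name must be one of the four
        # methods defined in this document.
        if local_name not in BUILT_IN_METHODS:
            raise ValueError(f"unknown output method: {local_name}")
        return ("built-in", local_name)
    # Non-empty namespace URI: behavior is outside the scope of this document.
    return ("implementation-defined", (namespace_uri, local_name))
```

A value such as ("", "xhtml") selects the XHTML output method, while a value with a non-empty namespace URI is handed to whatever implementation-defined method the serializer associates with it.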

In those cases where they have no important effect on the content of the serialized result, details of the output methods defined by this specification are left unspecified and are regarded as implementation-dependent. Whether a serializer uses apostrophes or quotation marks to delimit attribute values in the XML output method is an example of such a detail.

The detailed semantics of each parameter will be described separately for each output method for which it is applicable. If the semantics of a parameter are not described for an output method, then it is not applicable to that output method.

4 Phases of Serialization

Following the sequence normalization process described in 2 Sequence Normalization, serialization can be regarded as involving three phases of processing.

For an implementation-defined output method, any of these phases MAY be skipped or MAY be performed in a different order than is specified here. For the output methods defined in this specification, these phases are carried out sequentially as follows:

  1. Markup generation produces the character representation of those parts of the serialized result that describe the structure of the normalized sequence. In the cases of the XML, HTML and XHTML output methods, this phase produces the character representations of the following:

    • the document type declaration;

    • start tags and end tags (except for attribute values, whose representation is produced by the character expansion phase);

    • processing instructions; and

    • comments.

    In the cases of the XML and XHTML output methods, this phase also produces the following:

    • the XML or text declaration; and

    • empty element tags (except for the attribute values).

    In the case of the text output method, this phase has no effect.
  2. Character expansion is concerned with the representation of characters appearing in text and attribute nodes in the normalized sequence. The substitution processes that apply are listed below, in priority order: a character that is handled by one process in this list will be unaffected by processes appearing later in the list, except that a character affected by Unicode Normalization MAY be affected by creation of CDATA sections and by character escaping:

    1. URI escaping (in the case of URI-valued attributes in the HTML and XHTML output methods), as determined by the escape-uri-attributes parameter. [Definition: URI escaping is a process where non-ASCII characters in URI attribute values are escaped using the method defined by Section 5.4 of [XLink].]

    2. Character mapping, as determined by the use-character-maps parameter. Text nodes that are children of elements specified by the cdata-section-elements parameter are not affected by this step.

    3. Unicode Normalization, if requested by the normalization-form parameter. [Definition: Unicode Normalization is the process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence, as specified in [UAX #15: Unicode Normalization Forms]. For specific recommendations for character normalization on the World Wide Web, see [Character Model for the World Wide Web 1.0: Normalization].]

      The meanings associated with the possible values of the normalization-form parameter are as follows:

    4. Creation of CDATA sections, as determined by the cdata-section-elements parameter. Note that this is also affected by the encoding parameter, in that characters not present in the selected encoding cannot be represented in a CDATA section.

    5. Escaping, according to XML or HTML rules, of special characters and of characters that cannot be represented in the selected encoding. For example, replacing < with &lt;.

    6. If a quote (") character is in an attribute, and the attribute is delimited by quotes, the character will be changed to an apostrophe ('). Likewise, if an apostrophe (') character is in an attribute, and the attribute is delimited by apostrophes, the character will be changed to a quote (").

  3. Encoding, as controlled by the encoding parameter, converts the character stream produced by the previous phases into an octet stream.

    Note:

    Serialization is only defined in terms of encoding the result as a stream of octets. However, a serializer may provide an option that allows the encoding phase to be skipped, so that the result of serialization is a stream of Unicode characters. The effect of any such option is implementation-defined, and a serializer is not required to support such an option.

5 XML Output Method

The XML output method serializes the normalized sequence as an XML entity that MUST satisfy the rules for either a well-formed XML document entity or a well-formed XML external general parsed entity, or both. A serialization error [err:SE0003] results if the serializer is unable to satisfy those rules, except for contents modified by the character expansion phase of serialization, as described in 4 Phases of Serialization, which could result in the serialized output being not well-formed but will not result in a serialization error. If a serialization error results, the serializer MUST signal the error.

If the document node of the normalized sequence has a single element node child and no text node children, then the serialized output is a well-formed XML document entity, and the serialized output MUST conform to the appropriate version of the XML Namespaces Recommendation [XML Names] or [XML Names 1.1]. If the normalized sequence does not take this form, then the serialized output is a well-formed XML external general parsed entity, which, when referenced within a trivial XML document wrapper like this:


<?xml version="version"?>
<!DOCTYPE doc [
<!ENTITY e SYSTEM "entity-URI">
]>
<doc>&e;</doc>

where entity-URI is a URI for the entity, and the value of the version pseudo-attribute is the value of the version parameter, produces a document which MUST itself be a well-formed XML document conforming to the corresponding version of the XML Namespaces Recommendation [XML Names] or [XML Names 1.1].

[Definition: A reconstructed tree may be constructed by parsing the XML document and converting it into an instance of the data model as specified in [Data Model].] The result of serialization MUST be such that the reconstructed tree may be different than the result tree as follows:

It is a serialization error [err:SE0004] to specify the doctype-system parameter, or to specify the standalone parameter with a value other than 'none', if the instance of the data model contains text nodes or multiple element nodes as children of the root node. The serializer MUST either signal the error, or recover by ignoring the request to output a document type declaration or standalone parameter.

The result of serialization using the XML output method is not guaranteed to be well-formed XML if character maps have been specified (see 9 Character Maps).

5.1 The Influence of Serialization Parameters upon the XML Output Method

5.1.1 XML Output Method: the version Parameter

The version parameter specifies the version of XML and the version of Namespaces in XML to be used for outputting the instance of the data model. The version output in the XML declaration (if an XML declaration is output) MUST correspond to the version of XML that the serializer used for outputting the instance of the data model.

If the serialized result would contain an NCName that contains a character that is not permitted by the version of Namespaces in XML specified by the version parameter, a serialization error [err:SE0005] results. The serializer MUST signal the error.

If the serialized result would contain a character that is not permitted by the version of XML specified by the version parameter, a serialization error [err:SE0006] results. The serializer MUST signal the error.

For example, if the version parameter has the value 1.0, and the instance of the data model contains a non-whitespace control character in the range #x1 to #x1F, a serialization error [err:SE0006] results. If the version parameter has the value 1.1 and a comment node in the instance of the data model contains a non-whitespace control character in the range #x1 to #x1F or a control character other than NEL in the range #x7F to #x9F, a serialization error [err:SE0006] results.
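
The character check described in this example can be sketched, non-normatively, in Python. The ranges below follow the Char production of XML 1.0 and the Char and RestrictedChar productions of XML 1.1; the function names and the SerializationError class are hypothetical, and the sketch assumes that the version parameter is either 1.0 or 1.1.

```python
class SerializationError(Exception):
    """Hypothetical exception carrying an err:SExxxx code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def is_xml_char(cp: int, version: str) -> bool:
    """True if code point cp matches Char for the given XML version."""
    upper = 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF
    if version == "1.0":
        return cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF or upper
    return 0x1 <= cp <= 0xD7FF or upper


def is_restricted_char(cp: int) -> bool:
    """True if cp matches RestrictedChar of XML 1.1 (NEL, #x85, is excluded)."""
    return (0x1 <= cp <= 0x8 or cp in (0xB, 0xC) or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)


def check_comment(text: str, version: str) -> None:
    """Raise err:SE0006 if a comment holds a character it cannot contain.

    Comments do not recognize character references, so in XML 1.1 a
    restricted character has no valid representation there.
    """
    for ch in text:
        cp = ord(ch)
        bad = not is_xml_char(cp, version)
        if version != "1.0" and is_restricted_char(cp):
            bad = True
        if bad:
            raise SerializationError("err:SE0006", f"character #x{cp:X} not permitted")
```

Under these assumptions, a comment containing #x1 is rejected for both versions, a comment containing NEL (#x85) is accepted for version 1.1, and a comment containing #x80 is rejected only for version 1.1.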

5.1.2 XML Output Method: the encoding Parameter

The encoding parameter specifies the encoding to use for outputting the instance of the data model. Serializers are REQUIRED to support values of UTF-8 and UTF-16. A serialization error [err:SE0007] occurs if an output encoding other than UTF-8 or UTF-16 is requested and the serializer does not support that encoding. The serializer MUST signal the error, or recover by using UTF-8 or UTF-16 instead. The serializer MUST NOT use an encoding whose name does not match the EncName production of the XML Recommendation [XML10].
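
A non-normative Python sketch of this choice follows. It assumes that the set of encodings supported by the serializer is whatever the Python codecs registry provides, that recovery always falls back to UTF-8, and that a name failing the EncName production is treated in the same way as an unsupported encoding; choose_encoding and SerializationError are hypothetical names.

```python
import codecs
import re


class SerializationError(Exception):
    """Hypothetical exception carrying an err:SExxxx code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


# EncName production of XML 1.0: [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def choose_encoding(requested: str, recover: bool = True) -> str:
    """Return the encoding to use, or raise err:SE0007."""
    if ENC_NAME.fullmatch(requested):
        try:
            codecs.lookup(requested)   # raises LookupError if unknown
            return requested
        except LookupError:
            pass
    if recover:
        return "UTF-8"                 # recovery action chosen by this sketch
    raise SerializationError("err:SE0007", f"unsupported encoding: {requested}")
```

Whether to signal the error or to recover is an implementation-defined choice; the recover flag only makes both branches visible in the sketch.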

When outputting a newline character in the instance of the data model, the serializer is free to represent it using any character sequence that will be normalized to a newline character by an XML parser, unless a specific mapping for the newline character is provided in a character map: see 9 Character Maps.

When outputting any other character that is defined in the selected encoding, the character MUST be output using the correct representation of that character in the selected encoding.

It is possible that the instance of the data model will contain a character that cannot be represented in the encoding that the serializer is using for output. In this case, if the character occurs in a context where XML recognizes character references (that is, in the value of an attribute node or text node), then the character MUST be output as a character reference. A serialization error [err:SE0008] occurs if such a character appears in a context where character references are not allowed (for example if the character occurs in the name of an element). The serializer MUST signal the error.

For example, if a text node contains the character LATIN SMALL LETTER E WITH ACUTE (#xE9), and the value of the encoding parameter is US-ASCII, the character MUST be serialized as a character reference. If a comment node contained the same character, a serialization error [err:SE0008] would result.
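
The example above can be sketched, non-normatively, in Python. The helper names and the SerializationError class are assumptions of this illustration; the sketch escapes only <, & and > and writes character references in hexadecimal form, which is one of several equally valid representations.

```python
class SerializationError(Exception):
    """Hypothetical exception carrying an err:SExxxx code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def representable(ch: str, encoding: str) -> bool:
    """True if the single character ch can be encoded in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def serialize_text(text: str, encoding: str) -> str:
    """Character form of a text node; unrepresentable characters become references."""
    escapes = {"<": "&lt;", "&": "&amp;", ">": "&gt;"}
    out = []
    for ch in text:
        if ch in escapes:
            out.append(escapes[ch])
        elif representable(ch, encoding):
            out.append(ch)
        else:
            out.append("&#x%X;" % ord(ch))
    return "".join(out)


def serialize_comment(text: str, encoding: str) -> str:
    """Character form of a comment; character references are not available here."""
    for ch in text:
        if not representable(ch, encoding):
            raise SerializationError("err:SE0008", f"cannot encode #x{ord(ch):X}")
    return "<!--" + text + "-->"
```

With the encoding US-ASCII, serialize_text applied to a string containing #xE9 produces the reference &#xE9; in its place, whereas serialize_comment applied to the same string raises the error.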

5.1.3 XML Output Method: the indent Parameter

If the indent parameter has the value yes, then the XML output method MAY output whitespace in addition to the whitespace in the instance of the data model in order to indent the result so that a person will find it easier to read; if the indent parameter has the value no, it MUST NOT output any additional whitespace. If the XML output method does output additional whitespace, it MUST use an algorithm to output additional whitespace that satisfies the following constraints:

  • Whitespace characters MUST NOT be added adjacent to a text node that contains non-whitespace characters.

  • Whitespace MAY only be added adjacent to an element node, that is, immediately before a start tag or immediately after an end tag.

  • The new whitespace characters MAY replace existing whitespace characters in the same position, for example a tab MAY be inserted as a replacement for existing spaces. However, existing whitespace MUST NOT be removed without such a replacement.

  • Whitespace characters MUST NOT be inserted in a part of the result document that is controlled by an xml:space attribute with value preserve. (See [XML10] for more information about the xml:space attribute.)

  • Whitespace characters SHOULD NOT be added in places where the characters would be significant — for example, in the content of an element whose content model is known to be mixed.

Note:

The effect of these rules is to ensure that whitespace is only added in places where (a) XSLT's <xsl:strip-space> declaration could cause it to be removed, and (b) it does not affect the string value of any element node with simple content. It is usually not safe to indent document types that include elements with mixed content.

Note:

The whitespace added may possibly be based on whitespace stripped from either the source document or the stylesheet (in the case of XSLT), or guided by other means that might depend on the host language, in the case of an instance of the data model created using some other process.
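
One possible indentation algorithm that respects the constraints of this section is sketched below in Python, using the standard xml.etree.ElementTree module as a stand-in for the instance of the data model. It is non-normative and deliberately conservative: it adds whitespace only inside elements whose existing text children are all whitespace, and it has no knowledge of content models. The function name indent and its parameters are assumptions of this example.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
XML_WS = " \t\r\n"


def indent(elem, level=0, preserve=False, step="  "):
    """Add indentation in place, only where the rules above allow it."""
    # xml:space is inherited until overridden by a descendant.
    space = elem.get(XML_SPACE)
    if space == "preserve":
        preserve = True
    elif space == "default":
        preserve = False

    children = list(elem)
    # Text directly inside elem: its leading text and each child's tail.
    texts = [elem.text] + [child.tail for child in children]
    mixed = any(t and t.strip(XML_WS) for t in texts)

    if children and not preserve and not mixed:
        # Existing whitespace-only text is replaced, never simply removed.
        inner = "\n" + step * (level + 1)
        elem.text = inner
        for child in children[:-1]:
            child.tail = inner
        children[-1].tail = "\n" + step * level

    for child in children:
        indent(child, level + 1, preserve, step)


root = ET.fromstring("<a><b><c/></b><d>text</d></a>")
indent(root)
```

After the call, whitespace has been added before and after the b, c and d elements, while the text node "text" and any element with mixed content or under xml:space="preserve" are left untouched.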

5.1.4 XML Output Method: the cdata-section-elements Parameter

The cdata-section-elements parameter contains a list of expanded QNames. If the expanded QName of the parent of a text node is a member of the list, then the text node MUST be output as a CDATA section, except in those circumstances described below.

If the text node contains the sequence of characters ]]>, then the currently open CDATA section MUST be closed following the ]] and a new CDATA section opened before the >.

If the text node contains characters that are not representable in the character encoding being used to output the instance of the data model, then the currently open CDATA section MUST be closed before such characters, the characters MUST be output using character references or entity references, and a new CDATA section MUST be opened for any further characters in the text node.

CDATA sections MUST NOT be used except where they have been explicitly requested by the user, either by using the cdata-section-elements parameter, or by using some other implementation-defined mechanism.

Note:

This is phrased to permit an implementor to provide an option that attempts to preserve CDATA sections present in the source document.
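
The two splitting rules of this section can be illustrated with the following non-normative Python sketch. The function name write_cdata is an assumption of this example, and the sketch always uses hexadecimal character references for characters that cannot be represented in the encoding.

```python
def write_cdata(text: str, encoding: str) -> str:
    """Character form of a text node whose parent is in cdata-section-elements."""
    out = []
    run = []

    def flush():
        # Emit the pending representable characters as CDATA section(s).
        if run:
            chunk = "".join(run)
            # Close the section after "]]" and reopen it before ">".
            chunk = chunk.replace("]]>", "]]]]><![CDATA[>")
            out.append("<![CDATA[" + chunk + "]]>")
            run.clear()

    for ch in text:
        try:
            ch.encode(encoding)
            run.append(ch)
        except UnicodeEncodeError:
            # Close the open section, write a character reference, and
            # let any further characters start a new section.
            flush()
            out.append("&#x%X;" % ord(ch))
    flush()
    return "".join(out)
```

For the text a]]>b this produces two adjacent CDATA sections, the first ending with ]] and the second beginning with >, so that the sequence ]]> never appears inside a single section.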

5.1.5 XML Output Method: the omit-xml-declaration and standalone Parameters

The XML output method MUST output an XML declaration if the omit-xml-declaration parameter has the value no. The XML declaration MUST include both version information and an encoding declaration. If the standalone parameter has the value yes or the value no, the XML declaration MUST include a standalone document declaration with the same value as the value of the standalone parameter. If the standalone parameter has the value none, the XML declaration MUST NOT include a standalone document declaration; this ensures that it is both an XML declaration (allowed at the beginning of a document entity) and a text declaration (allowed at the beginning of an external general parsed entity).

A serialization error [err:SE0009] results if the omit-xml-declaration parameter has the value yes, and

  • the standalone parameter has a value other than none; or

  • the version parameter has a value other than 1.0 and the doctype-system parameter is specified.

The serializer MUST signal the error.

Otherwise, if the omit-xml-declaration parameter has the value yes, the XML output method MUST NOT output an XML declaration.
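
The rules of this section can be summarized in a short, non-normative Python sketch. The function xml_declaration, its parameter names and the SerializationError class are hypothetical; parameter values are passed as the strings used in this document, and an unspecified doctype-system parameter is modelled as None.

```python
class SerializationError(Exception):
    """Hypothetical exception carrying an err:SExxxx code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def xml_declaration(version, encoding, omit_xml_declaration="no",
                    standalone="none", doctype_system=None) -> str:
    """Return the XML declaration to output, or "" if it is omitted."""
    if omit_xml_declaration == "yes":
        if standalone != "none":
            raise SerializationError(
                "err:SE0009", "standalone requires an XML declaration")
        if version != "1.0" and doctype_system is not None:
            raise SerializationError(
                "err:SE0009", "version and doctype-system require an XML declaration")
        return ""
    decl = f'<?xml version="{version}" encoding="{encoding}"'
    if standalone in ("yes", "no"):
        decl += f' standalone="{standalone}"'
    return decl + "?>"
```

With standalone set to none, the result is usable both as an XML declaration and as a text declaration, as required above.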

5.1.6 XML Output Method: the doctype-system and doctype-public Parameters

If the doctype-system parameter is specified, the XML output method MUST output a document type declaration immediately before the first element. The name following <!DOCTYPE MUST be the name of the first element, if any. If the doctype-public parameter is also specified, then the XML output method MUST output PUBLIC followed by the public identifier and then the system identifier; otherwise, it MUST output SYSTEM followed by the system identifier. The internal subset MUST be empty. The doctype-public parameter MUST be ignored unless the doctype-system parameter is specified.
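
A non-normative Python sketch of these rules is given below. The function name doctype_declaration is an assumption of this example, unspecified parameters are modelled as None, and the sketch always delimits identifiers with quotation marks, which assumes that the identifiers themselves contain no quotation mark.

```python
def doctype_declaration(first_element_name, doctype_system=None,
                        doctype_public=None) -> str:
    """Return the document type declaration to output before the first element."""
    if doctype_system is None:
        # doctype-public is ignored unless doctype-system is specified.
        return ""
    if doctype_public is not None:
        external_id = f'PUBLIC "{doctype_public}" "{doctype_system}"'
    else:
        external_id = f'SYSTEM "{doctype_system}"'
    # The internal subset is always empty.
    return f"<!DOCTYPE {first_element_name} {external_id}>"
```

For an html element with both parameters specified, the result has the form <!DOCTYPE html PUBLIC "public-id" "system-id">.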

5.1.7 XML Output Method: the undeclare-namespaces Parameter

The Data Model allows an element node that binds a non-empty prefix to have a child element node that does not bind that same prefix. In Namespaces in XML 1.1 ([XML Names 1.1]), this can be represented accurately by undeclaring namespaces. If the undeclare-namespaces parameter has the value yes and the output method is XML and the version is greater than 1.0, the serializer MUST undeclare namespaces.

Consider an element x:foo with four in-scope namespaces that associate prefixes with URIs as follows:

  • x is associated with http://example.org/x

  • y is associated with http://example.org/y

  • z is associated with http://example.org/z

  • xml is associated with http://www.w3.org/XML/1998/namespace

Suppose that it has a child element x:bar with three in-scope namespaces:

  • x is associated with http://example.org/x

  • y is associated with http://example.org/y

  • xml is associated with http://www.w3.org/XML/1998/namespace

If namespace undeclaration is in effect, it will be serialized this way:

<x:foo xmlns:x="http://example.org/x"
       xmlns:y="http://example.org/y"
       xmlns:z="http://example.org/z">
      <x:bar xmlns:z="">...</x:bar>
</x:foo>

In Namespaces in XML ([XML Names]), namespace undeclaration is not possible. If the output method is XML, the value of the undeclare-namespaces parameter is yes, and the value of the version parameter is 1.0, a serialization error [err:SE0010] results; the serializer MUST signal the error.
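
The x:foo and x:bar example can be reproduced with the following non-normative Python sketch, which computes the namespace attributes to write on a child element from the in-scope namespaces of the parent and the child. The function name, the dictionary representation of in-scope namespaces and the SerializationError class are assumptions of this example; the sketch handles only non-empty prefixes when undeclaring and treats any version other than 1.0 as greater than 1.0.

```python
class SerializationError(Exception):
    """Hypothetical exception carrying an err:SExxxx code."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def namespace_attributes(parent_ns, child_ns, undeclare_namespaces="no",
                         version="1.0"):
    """Map of xmlns attributes to write on the child element.

    parent_ns and child_ns map prefixes to namespace URIs.
    """
    if undeclare_namespaces == "yes" and version == "1.0":
        raise SerializationError(
            "err:SE0010", "namespace undeclaration is not possible in XML 1.0")
    attrs = {}
    for prefix, uri in child_ns.items():
        if prefix == "xml":
            continue                     # the xml prefix is never declared
        if parent_ns.get(prefix) != uri:
            attrs[f"xmlns:{prefix}" if prefix else "xmlns"] = uri
    if undeclare_namespaces == "yes":
        for prefix in parent_ns:
            if prefix and prefix != "xml" and prefix not in child_ns:
                attrs[f"xmlns:{prefix}"] = ""   # undeclare the prefix
    return attrs


XML_NS = "http://www.w3.org/XML/1998/namespace"
foo = {"x": "http://example.org/x", "y": "http://example.org/y",
       "z": "http://example.org/z", "xml": XML_NS}
bar = {"x": "http://example.org/x", "y": "http://example.org/y", "xml": XML_NS}
result = namespace_attributes(foo, bar, "yes", "1.1")
```

For these inputs, the only attribute computed for x:bar is xmlns:z with an empty value, matching the serialization shown above.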

5.1.8 XML Output Method: the normalization-form Parameter

The normalization-form parameter is applicable for the XML output method. The values NFC and none MUST be supported by the serializer. A serialization error [err:SE0011] results if the value of the normalization-form parameter specifies a normalization form that is not supported by the serializer; the serializer MUST signal the error.

It is a serialization error [err:SE0012] if the value of the parameter is fully-normalized and any relevant construct of the result begins with a combining character. The serializer MUST signal the error. See Section 2.13 of [XML11] for the definition of the relevant constructs of XML.
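The following Python sketch shows how the two required values and the two error conditions might be handled. It is an approximation under stated assumptions: the function names are hypothetical, only NFC and none are treated as supported, and the test for a leading combining character uses the canonical combining class from the standard unicodedata module, which may not match the precise definition used by the specifications cited above.

```python
import unicodedata

# The two values that every serializer is required to support.
SUPPORTED_FORMS = {"NFC", "none"}


def normalize_text(text, normalization_form="none"):
    """Apply the requested normalization form, or raise err:SE0011."""
    if normalization_form not in SUPPORTED_FORMS:
        raise ValueError(
            "err:SE0011: unsupported normalization form " + normalization_form)
    if normalization_form == "none":
        return text
    return unicodedata.normalize(normalization_form, text)


def starts_with_combining_character(construct):
    """Approximate test for the err:SE0012 condition on one construct."""
    return bool(construct) and unicodedata.combining(construct[0]) != 0
```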

5.1.9 XML Output Method: Other Parameters

The media-type parameter is applicable for the XML output method. See 3 Serialization Parameters for more information.

The use-character-maps parameter is applicable for the XML output method. See 9 Character Maps for more information.

The byte-order-mark parameter is applicable for the XML output method. See 3 Serialization Parameters for more information.

6 XHTML Output Method

The XHTML output method serializes the instance of the data model as XML, using the HTML compatibility guidelines defined in the XHTML specification.

It is entirely the responsibility of the person or process that creates the instance of the data model to ensure that the instance of the data model conforms to the [XHTML 1.0] or [XHTML 1.1] specification. It is not an error if the instance of the data model is invalid XHTML. Equally, it is entirely under the control of the person or process that creates the instance of the data model whether the output conforms to XHTML Strict, XHTML Transitional, XHTML Frameset, or XHTML Basic.

The serialization of the instance of the data model follows the same rules as for the XML output method, with the exceptions noted below. These differences are based on the HTML compatibility guidelines published in Appendix C of [XHTML 1.0], which are designed to ensure that as far as possible, XHTML is rendered correctly on user agents designed originally to handle HTML.

Note:

As with the XML output method, the XHTML output method specifies that an XML declaration will be output unless it is suppressed using the omit-xml-declaration parameter. Appendix C.1 of [XHTML 1.0] provides advice on the consequences of including, or omitting, the XML declaration.

Note:

Appendix C of [XHTML 1.0] describes a number of compatibility guidelines for users of XHTML who wish to render their XHTML documents with HTML user agents. In some cases, such as the guideline on the form empty elements should take, only the serialization process itself has the ability to follow the guideline. In such cases, those guidelines are reflected in the requirements on the serializer described above.

In all other cases, the guidelines can be adhered to by the instance of the data model that is input to the serialization process. The guideline on the use of whitespace characters in attribute values is one such example. It is the responsibility of the person or process that creates the instance of the data model that is input to the serialization process to ensure it is created in a way that is consistent with the guidelines. No serialization error results if the input instance of the data model does not adhere to the guidelines.

7 HTML Output Method

The HTML output method serializes the instance of the data model as HTML.

For example, the following XSLT stylesheet generates HTML output:

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" version="4.0"/>

<xsl:template match="/">
<html>
<xsl:apply-templates/>
</html>
</xsl:template>

...

</xsl:stylesheet>

The version attribute of the xsl:output element indicates the version of the HTML Recommendation [HTML] to which the serialized result is to conform. If the serializer does not support the version of HTML specified by this attribute (which corresponds to the version parameter defined in this specification), it MUST signal a serialization error [err:SE0013].

7.1 The Influence of Serialization Parameters upon the HTML Output Method

7.1.1 HTML Output Method: Markup for Elements

The HTML output method MUST NOT output an element differently from the XML output method unless the expanded QName of the element has a null namespace URI; an element whose expanded QName has a non-null namespace URI MUST be output as XML. If the expanded QName of the element has a null namespace URI, but the local part of the expanded QName is not recognized as the name of an HTML element, the element MUST be output in the same way as a non-empty, inline element such as span. In particular:

  1. If the result tree contains namespace nodes for namespaces other than the XML namespace, the HTML output method MUST represent these namespaces using attributes named xmlns or xmlns:prefix in the same way as the XML output method would represent them when the version parameter is set to 1.0.

  2. If the result tree contains elements or attributes whose names have a non-null namespace URI, the HTML output method MUST generate namespace-prefixed QNames for these nodes in the same way as the XML output method would do when the version parameter is set to 1.0.

  3. Where special rules are defined later in this section for serializing specific HTML elements and attributes, these rules MUST NOT be applied to an element or attribute whose name has a non-null namespace URI. However, the generic rules for the HTML output method that apply to all elements and attributes, for example the rules for escaping special characters in the text and the rules for indentation, MUST be used also for namespaced elements and attributes.

  4. When serializing an element whose name is not defined in the HTML specification, but that is in the null namespace, the HTML output method MUST apply the same rules (for example, indentation rules) as when serializing a span element. The descendants of such an element MUST be serialized as if they were descendants of a span element.

  5. When serializing an element whose name is in a non-null namespace, the HTML output method MUST apply the same rules (for example, indentation rules) as when serializing a div element. The descendants of such an element MUST be serialized as if they were descendants of a div element.

The HTML output method MUST NOT output an end-tag for empty elements. For HTML 4.0, the empty elements are area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta and param. For example, an element written as <br/> or <br></br> in an XSLT stylesheet MUST be output as <br>.

The HTML output method MUST recognize the names of HTML elements regardless of case. For example, elements named br, BR or Br MUST all be recognized as the HTML br element and output without an end-tag.
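The two preceding rules can be illustrated with a short Python sketch. It covers only elements in the null namespace that have no attributes and no children; the function name serialize_childless_element is hypothetical.

```python
# Empty elements of HTML 4.0, as listed above.
HTML4_EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


def serialize_childless_element(local_name):
    """Serialize a null-namespace element that has no attributes or children."""
    # Element names are recognized regardless of case.
    if local_name.lower() in HTML4_EMPTY_ELEMENTS:
        return "<%s>" % local_name          # no end-tag for empty elements
    return "<%s></%s>" % (local_name, local_name)
```

In this sketch the names br, BR and Br all produce a start-tag with no end-tag, while a name such as p produces both tags.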

The HTML output method MUST NOT perform escaping for the content of the script and style elements.

For example, a script element created by an XQuery direct element constructor or an XSLT literal result element, such as:

<script>if (a &lt; b) foo()</script>

or

<script><![CDATA[if (a < b) foo()]]></script>

MUST be output as

<script>if (a < b) foo()</script>
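The behavior shown above can be sketched in Python as follows. The function name serialize_text_node is hypothetical, and escape from the standard xml.sax.saxutils module is used here only as a stand-in for the escaping that applies to ordinary text.

```python
from xml.sax.saxutils import escape

# Elements whose content is written without escaping.
UNESCAPED_CONTENT_ELEMENTS = {"script", "style"}


def serialize_text_node(parent_local_name, text):
    """Serialize a text node whose parent is a null-namespace element."""
    if parent_local_name.lower() in UNESCAPED_CONTENT_ELEMENTS:
        return text            # written as is
    return escape(text)        # &, < and > are escaped elsewhere
```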

A common requirement is to output a script element as shown in the example below:

<script type="text/javascript">
      document.write ("<em>This won't work</em>")
</script>

This is illegal HTML, for the reasons explained in section B.3.2 of the HTML 4.01 specification. Nevertheless, it is possible to output this fragment, using either of the following constructs:

Firstly, by use of a script element created by an XQuery direct element constructor or an XSLT literal result element:

<script type="text/javascript">
      document.write ("<em>This won't work</em>")
</script>

Secondly, by constructing the markup from ordinary text characters:

<script type="text/javascript">
      document.write ("&lt;em&gt;This won't work&lt;/em&gt;")
</script>

As the HTML specification points out, the correct way to write this is to use the escape conventions for the specific scripting language. For JavaScript, it can be written as:

<script type="text/javascript">
      document.write ("&lt;em&gt;This will work&lt;\/em&gt;")
</script>

The HTML 4.01 specification also shows examples of how to write this in various other scripting languages. The escaping MUST be done manually; it will not be done by the serializer.

7.1.2 HTML Output Method: Writing Attributes

The HTML output method MUST NOT escape "<" characters occurring in attribute values.

If the indent parameter has the value yes, then the HTML output method MAY add or remove whitespace as it serializes the instance of the data model, so long as it does not change how an HTML user agent would render the output.

If the escape-uri-attributes parameter has the value yes, the HTML output method MUST apply URI escaping to URI attribute values, except that relative URIs MUST NOT be absolutized.

Note:

This escaping is deliberately confined to non-ASCII characters, because escaping of ASCII characters is not always appropriate, for example when URIs or URI fragments are interpreted locally by the HTML user agent. Even in the case of non-ASCII characters, escaping can sometimes cause problems. More precise control of URI escaping is therefore available by setting escape-uri-attributes to no, and controlling the escaping of URIs by means of the fn:escape-uri function defined in [Functions and Operators].
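A minimal Python sketch of escaping confined to non-ASCII characters is given below. It assumes that each non-ASCII character is replaced by the percent-encoded octets of its UTF-8 representation; the function name escape_non_ascii is hypothetical, and the sketch does not attempt to decide which attributes are URI attributes.

```python
def escape_non_ascii(uri_value):
    """Percent-encode non-ASCII characters; leave ASCII characters alone."""
    out = []
    for ch in uri_value:
        if ord(ch) < 128:
            out.append(ch)     # ASCII characters, including spaces, are kept
        else:
            # Assumption: escape each octet of the UTF-8 representation.
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(out)
```

Because the value is processed character by character and never resolved against a base URI, a relative URI stays relative.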

The HTML output method MUST output boolean attributes (that is, attributes with only a single allowed value that is equal to the name of the attribute) in minimized form.

For example, a start-tag created using the following XQuery direct element constructor or XSLT literal result element

<OPTION selected="selected">

MUST be output as

<OPTION selected>

The HTML output method MUST NOT escape a & character occurring in an attribute value immediately followed by a { character (see Section B.7.1 of the HTML 4.0 Recommendation).

For example, a start-tag created using the following XQuery direct element constructor or XSLT literal result element

<BODY bgcolor='&amp;{{randomrbg}};'>

MUST be output as

<BODY bgcolor='&{randomrbg};'>
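The attribute rules in this section can be sketched in Python as follows. The function name serialize_html_attribute is hypothetical, the set of boolean attributes is an illustrative list drawn from HTML 4.01, and the choice of double quotation marks as the delimiter is an assumption of the sketch.

```python
import re

# Boolean attributes of HTML 4.01, listed here for illustration only.
BOOLEAN_ATTRIBUTES = {
    "checked", "compact", "declare", "defer", "disabled", "ismap",
    "multiple", "nohref", "noresize", "noshade", "nowrap", "readonly",
    "selected",
}


def serialize_html_attribute(name, value):
    """Serialize one attribute of a null-namespace HTML element."""
    if name.lower() in BOOLEAN_ATTRIBUTES and value.lower() == name.lower():
        return name                                  # minimized form
    # Escape & unless it is immediately followed by {; never escape <.
    escaped = re.sub(r"&(?!\{)", "&amp;", value)
    escaped = escaped.replace('"', "&quot;")
    return '%s="%s"' % (name, escaped)
```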

7.1.3 HTML Output Method: Indentation

If the indent parameter has the value yes, then the HTML output method MAY add or remove whitespace as it serializes the result tree, so long as it does not change the way that a conforming HTML user agent would render the output.

Note:

This rule can be satisfied by observing the following constraints:

Whitespace MUST only be added before or after an element, or adjacent to an existing whitespace character.

Whitespace MUST NOT be added or removed adjacent to an inline element. The inline elements are those included in the %inline category of any of the HTML 4.01 DTDs, as well as the ins and del elements if they are used as inline elements (i.e., if they do not contain element children).

Whitespace MUST NOT be added or removed inside a formatted element, the formatted elements being pre, script, style, and textarea.

Note that the HTML definition of whitespace is different from the XML definition: see section 9.1 of the [HTML] specification.

7.1.4 HTML Output Method: Writing Character Data

The HTML output method MAY output a character using a character entity reference in preference to using a numeric character reference, if an entity is defined for the character in the version of HTML that the output method is using. Entity references and character references SHOULD be used only where the character is not present in the selected encoding, or where the visual representation of the character is unclear (as with &nbsp;, for example).

When outputting a sequence of whitespace characters in the instance of the data model, within an element where whitespace is treated normally (but not in elements such as pre and textarea), the HTML output method MAY represent it using any sequence of whitespace that will be treated in the same way by an HTML user agent. See section 3.5 of [XHTML Modularization] for some additional information on handling of whitespace by an HTML user agent.

Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML. It is a serialization error [err:SE0014] to use the HTML output method when such characters appear in the instance of the data model. The serializer MUST signal the error.
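The check for these control characters might look like the following Python sketch, in which the function name check_html_characters is hypothetical and the error is modeled as a Python exception.

```python
def check_html_characters(text):
    """Raise if text contains a control character in the range #x7F-#x9F."""
    for ch in text:
        if 0x7F <= ord(ch) <= 0x9F:
            raise ValueError(
                "err:SE0014: character #x%X is not permitted in HTML" % ord(ch))
    return text
```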

The HTML output method MUST terminate processing instructions with > rather than ?>.

7.1.5 HTML Output Method: Encoding

The encoding parameter specifies the encoding to be used. Serializers are REQUIRED to support values of UTF-8 and UTF-16. A serialization error [err:SE0007] occurs if an output encoding other than UTF-8 or UTF-16 is requested and the serializer does not support that encoding. The serializer MUST signal the error.

If there is a head element, and the include-content-type parameter has the value yes, the HTML output method MUST add a meta element as the first child element of the head element specifying the character encoding actually used.

For example,

<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
...

The content type MUST be set to the value given for the media-type parameter.

If a meta element has been added to the head element as described above, then any existing meta element child of the head element having an http-equiv attribute with the value "Content-Type" MUST be discarded.

Note:

This process removes possible parameters in the attribute value. For example,

<meta http-equiv="Content-Type"
        content="text/html;version='3.0'"/>

in the data model instance would be replaced by,

<meta http-equiv="Content-Type"
        content="text/html;charset=utf-8"/>
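The insertion and replacement of the meta element can be sketched in Python as follows. The representation of the children of the head element as a list of (name, attributes) pairs is a hypothetical model chosen for this sketch, as is the function name; comparing the http-equiv value without regard to case is also an assumption.

```python
def rewrite_head_children(children, media_type, encoding):
    """Return the children of head with the generated meta element first.

    children is a list of (element_name, attributes_dict) pairs.
    """
    kept = [
        (name, attrs) for name, attrs in children
        if not (name.lower() == "meta"
                and attrs.get("http-equiv", "").lower() == "content-type")
    ]
    generated = ("meta", {
        "http-equiv": "Content-Type",
        # Content type from media-type, plus the encoding actually used.
        "content": "%s; charset=%s" % (media_type, encoding),
    })
    return [generated] + kept
```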

It is possible that the instance of the data model will contain a character that cannot be represented in the encoding that the serializer is using for output. In this case, if the character occurs in a context where HTML recognizes character references, then the character MUST be output as a character entity reference or decimal numeric character reference; otherwise (for example, in a script or style element or in a comment), the serializer MUST signal a serialization error [err:SE0008].
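The following Python sketch illustrates the two cases. It relies on the standard xmlcharrefreplace error handler, which substitutes decimal numeric character references; the function name encode_html_text and its in_raw_context flag (standing for script, style and comment content) are hypothetical.

```python
def encode_html_text(text, encoding, in_raw_context=False):
    """Encode text, using character references where HTML recognizes them."""
    if in_raw_context:
        # Character references are not recognized here, so an
        # unrepresentable character is an error.
        try:
            return text.encode(encoding)
        except UnicodeEncodeError as exc:
            raise ValueError(
                "err:SE0008: character cannot be represented in " + encoding
            ) from exc
    # Elsewhere, fall back to decimal numeric character references.
    return text.encode(encoding, errors="xmlcharrefreplace")
```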

7.1.6 HTML Output Method: Document Type Declaration

If the doctype-public or doctype-system parameters are specified, then the HTML output method MUST output a document type declaration immediately before the first element. The name following <!DOCTYPE MUST be HTML or html. If the doctype-public parameter is specified, then the output method MUST output PUBLIC followed by the specified public identifier; if the doctype-system parameter is also specified, it MUST also output the specified system identifier following the public identifier. If the doctype-system parameter is specified but the doctype-public parameter is not specified, then the output method MUST output SYSTEM followed by the specified system identifier.
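Unlike the XML output method, the HTML output method emits a declaration when only doctype-public is given. A minimal Python sketch of these rules follows; the function name html_doctype is hypothetical, the lowercase name html is one of the two permitted spellings, and the sketch assumes that neither identifier contains a double quotation mark.

```python
def html_doctype(doctype_public=None, doctype_system=None):
    """Return the document type declaration for the HTML output method."""
    if doctype_public is None and doctype_system is None:
        return ""
    if doctype_public is not None:
        decl = '<!DOCTYPE html PUBLIC "%s"' % doctype_public
        if doctype_system is not None:
            # The system identifier follows the public identifier.
            decl += ' "%s"' % doctype_system
    else:
        decl = '<!DOCTYPE html SYSTEM "%s"' % doctype_system
    return decl + ">"
```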

7.1.7 HTML Output Method: Unicode Normalization

The normalization-form parameter is applicable for the HTML output method. The values NFC and none MUST be supported by the serializer. A serialization error [err:SE0011] results if the value of the normalization-form parameter specifies a normalization form that is not supported by the serializer; the serializer MUST signal the error.

7.1.8 HTML Output Method: Other Parameters

The media-type parameter is applicable for the HTML output method. See 3 Serialization Parameters for more information.

The use-character-maps parameter is applicable for the HTML output method. See 9 Character Maps for more information.

The byte-order-mark parameter is applicable for the HTML output method. See 3 Serialization Parameters for more information.

8 Text Output Method

The text output method serializes the instance of the data model by outputting the string value of the document node created by sequence normalization, without any escaping.

A newline character in the instance of the data model MAY be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment.

The media-type parameter is applicable for the text output method. See 3 Serialization Parameters for more information.

The encoding parameter identifies the encoding that the text output method MUST use to convert sequences of characters to sequences of bytes. Serializers are REQUIRED to support values of UTF-8 and UTF-16. A serialization error [err:SE0007] occurs if the serializer does not support the encoding specified by the encoding parameter. The serializer MUST signal the error. If the instance of the data model contains a character that cannot be represented in the encoding that the serializer is using for output, the serializer MUST signal a serialization error [err:SE0008].
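A minimal Python sketch of the text output method, covering line endings and the two encoding errors, is shown below. The function name serialize_as_text and its line_ending argument are hypothetical, and encoding names are interpreted by the Python codec registry rather than by the rules of this specification.

```python
def serialize_as_text(string_value, encoding="UTF-8", line_ending="\n"):
    """Encode the string value without escaping any characters."""
    # A newline may be written using the platform's line-ending convention.
    text = string_value.replace("\n", line_ending)
    try:
        return text.encode(encoding)
    except LookupError as exc:
        raise ValueError("err:SE0007: unsupported encoding " + encoding) from exc
    except UnicodeEncodeError as exc:
        raise ValueError(
            "err:SE0008: character cannot be represented in " + encoding
        ) from exc
```

Characters such as < and & pass through unchanged, because the text output method performs no escaping.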

The normalization-form parameter is applicable for the text output method. The values NFC and none MUST be supported by the serializer. A serialization error [err:SE0011] results if the value of the normalization-form parameter specifies a normalization form that is not supported by the serializer; the serializer MUST signal the error.

The use-character-maps parameter is applicable for the text output method. See 9 Character Maps for more information.

The byte-order-mark parameter is applicable for the text output method. See 3 Serialization Parameters for more information.

9 Character Maps

The use-character-maps parameter is a list of characters and corresponding string substitutions.

Character maps allow a specific character appearing in a text or attribute node in the instance of the data model to be replaced with a specified string of characters during serialization. The string that is substituted is output "as is," and the serializer performs no checks that the resulting document is well-formed. This mechanism can therefore be used to introduce arbitrary markup in the serialized output. See Section 20.1 Character Maps of [XSLT 2.0] for examples of using character mapping in XSLT.

Character mapping is applied to the characters that actually appear in a text or attribute node in the instance of the data model, before any other serialization operations such as escaping or Unicode Normalization are applied. If a character is mapped, then it is not subjected to XML or HTML escaping, nor to Unicode Normalization. The string that is substituted for a character is not validated or processed in any way by the serializer, except for translation into the target encoding. In particular, it is not subjected to XML or HTML escaping, it is not subjected to Unicode Normalization, and it is not subjected to further character mapping.

Character mapping is not applied to characters in text nodes whose parent elements are listed in the cdata-section-elements parameter, nor to characters for which output escaping has been disabled (disabling output escaping is an [XSLT 2.0] feature), nor to characters in attribute values that are subject to the URI escaping defined for the HTML and XHTML output methods, unless URI escaping has been disabled using the escape-uri-attributes parameter in the output definition.

On serialization, occurrences of a character specified in the use-character-maps parameter in text nodes and attribute values are replaced by the corresponding string from the use-character-maps parameter.

Note:

Using a character map can result in non-well-formed documents if the string contains XML-significant characters. For example, it is possible to create documents containing unmatched start and end tags, references to entities that are not declared, or attributes that contain tags or unescaped quotation marks.

If a character is mapped, then it is not subjected to XML or HTML escaping.

A serialization error [err:SE0008] occurs if character mapping causes the output of a string containing a character that cannot be represented in the encoding that the serializer is using for output. The serializer MUST signal the error.
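The ordering of character mapping and escaping can be illustrated with the following Python sketch. The function name apply_character_map and the dictionary form of the map are assumptions made for illustration; escape from the standard xml.sax.saxutils module stands in for XML or HTML escaping, and Unicode Normalization and encoding are not modeled.

```python
from xml.sax.saxutils import escape


def apply_character_map(text, character_map):
    """Serialize text, substituting mapped characters before escaping."""
    out = []
    for ch in text:
        if ch in character_map:
            # The substituted string is written as is: it is not escaped
            # and is not subjected to further character mapping.
            out.append(character_map[ch])
        else:
            out.append(escape(ch))
    return "".join(out)
```

For instance, mapping the no-break space to the string &nbsp; leaves that string unescaped, while an unmapped < in the same text is still written as &lt;.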

10 Conformance

Serialization is intended primarily as a component that can be used by other specifications. Therefore, this document relies on specifications that use it to specify conformance criteria for Serialization in their respective environments.

Specifications that set conformance criteria for their use of Serialization MUST NOT change the semantic definitions of Serialization as given in this specification, except by subsetting and/or compatible extensions. Thus, it is the responsibility of those specifications to avoid any behavior that would conflict with the semantic definition of Serialization.

A References

A.1 Normative References

Character Model for the World Wide Web 1.0: Normalization
World Wide Web Consortium, Character Model for the World Wide Web 1.0: Normalization. See http://www.w3.org/TR/2004/WD-charmod-norm-20040225.
Data Model
World Wide Web Consortium, XQuery 1.0 and XPath 2.0 Data Model. See http://www.w3.org/TR/xpath-datamodel/.
Functions and Operators
World Wide Web Consortium, XQuery 1.0 and XPath 2.0 Functions and Operators. W3C Working Draft. See http://www.w3.org/TR/xpath-functions/.
HTML
World Wide Web Consortium. HTML 4.01 specification. W3C Recommendation. See http://www.w3.org/TR/html4/.
IANA
Internet Assigned Numbers Authority. Character Sets. See http://www.iana.org/assignments/character-sets.
RFC2046
N. Freed, N. Borenstein. Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. IETF RFC 2046. See http://www.ietf.org/rfc/rfc2046.txt.
RFC2119
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. IETF RFC 2119. See http://www.ietf.org/rfc/rfc2119.txt.
RFC2278
N. Freed, J. Postel. IANA Charset Registration Procedures. IETF RFC 2278. See http://www.ietf.org/rfc/rfc2278.txt.
RFC3236
M. Baker, P. Stark. The 'application/xhtml+xml' Media Type. IETF RFC 3236. See http://www.ietf.org/rfc/rfc3236.txt.
Unicode Encoding
Unicode Consortium. Unicode Character Encoding Model. Unicode Standard Annex #17. See http://www.unicode.org/unicode/reports/tr17/.
UAX #15: Unicode Normalization Forms
Unicode Consortium. Unicode Normalization Forms. Unicode Standard Annex #15. See http://www.unicode.org/unicode/reports/tr15/.
XHTML 1.0
World Wide Web Consortium. XHTML 1.0: The Extensible HyperText Markup Language (Second Edition). W3C Recommendation. See http://www.w3.org/TR/xhtml1/.
XHTML 1.1
World Wide Web Consortium. XHTML 1.1: Module-Based XHTML. W3C Recommendation. See http://www.w3.org/TR/xhtml11/.
XML10
World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation. See http://www.w3.org/TR/2000/REC-xml-20001006.
XML11
World Wide Web Consortium. Extensible Markup Language (XML) 1.1. W3C Recommendation. See http://www.w3.org/TR/2004/REC-xml11-20040204/.
XML Names
World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names/.
XML Names 1.1
World Wide Web Consortium. Namespaces in XML 1.1. W3C Recommendation. See http://www.w3.org/TR/xml-names11/.
XLink
World Wide Web Consortium. XML Linking Language (XLink). W3C Recommendation. See http://www.w3.org/TR/2001/REC-xlink-20010627/.
XML Schema
World Wide Web Consortium. XML Schema Part 1: Structures and XML Schema Part 2: Data Types. W3C Recommendation. See http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/.
XPath 2.0
World Wide Web Consortium, XML Path Language (XPath) 2.0. See http://www.w3.org/TR/xpath20/.
XQuery 1.0
World Wide Web Consortium, XQuery 1.0: An XML Query Language. See http://www.w3.org/TR/xquery/.
XSLT 2.0
World Wide Web Consortium, XSL Transformations Language (XSLT) Version 2.0. See http://www.w3.org/TR/xslt20/.

A.2 Non-normative References

XHTML Modularization
World Wide Web Consortium, Modularization of XHTML. See http://www.w3.org/TR/xhtml-modularization/.
XHTML Media Types
World Wide Web Consortium, XHTML Media Types. W3C Note 1 August 2002. See http://www.w3.org/TR/xhtml-media-types/.

B Summary of Error Conditions

err:SE0001

It is an error [err:SE0001] if an item in S5 in sequence normalization is an attribute node or a namespace node.

err:SE0003

It is an error if the serializer is unable to satisfy the rules for either a well-formed XML document entity or a well-formed XML external general parsed entity, or both, except for contents modified by the character expansion phase of serialization.

err:SE0004

It is an error to specify the doctype-system parameter, or to specify the standalone parameter with a value other than 'none', if the instance of the data model contains text nodes or multiple element nodes as children of the root node.

err:SE0005

It is an error if the serialized result would contain an NCName that contains a character that is not permitted by the version of Namespaces in XML specified by the version parameter.

err:SE0006

It is an error if the serializer does not support the version of XML and the version of Namespaces in XML specified in the version parameter.

err:SE0007

It is an error if an output encoding other than UTF-8 or UTF-16 is requested and the serializer does not support that encoding.

err:SE0008

It is an error if a character that cannot be represented in the encoding that the serializer is using for output appears in a context where character references are not allowed (for example if the character occurs in the name of an element).

err:SE0009

It is an error if the omit-xml-declaration parameter has the value yes, and the standalone parameter has a value other than none; or the version parameter has a value other than 1.0 and the doctype-system parameter is specified.

err:SE0010

It is an error if the output method is xml, the value of the undeclare-namespaces parameter is yes, and the value of the version parameter is 1.0.

err:SE0011

It is an error if the value of the normalization-form parameter specifies a normalization form that is not supported by the serializer.

err:SE0012

It is an error if the value of the normalization-form parameter is fully-normalized and any relevant construct of the result begins with a combining character.

err:SE0013

It is an error if the serializer does not support the version of HTML specified by the version parameter.

err:SE0014

It is an error to use the HTML output method when characters which are legal in XML but not in HTML, specifically the control characters #x7F-#x9F, appear in the instance of the data model.