Copyright © 1999 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document describes a subset of the information contained in an XML document and a syntax for expressing that subset. This syntax, called Canonical XML, is designed to encode the "logical structure" of XML documents; two XML documents whose Canonical-XML form is identical will be considered equivalent for the purposes of many applications.
This is a W3C Working Draft for review by W3C members and other interested parties. This represents the consensus position of the W3C XML Syntax Working Group, based on its own discussions and analysis of feedback on previous versions. The Working Group does not expect to introduce any substantial changes to the design described here, and intends to proceed towards Recommendation status on that basis. Nonetheless, this is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current W3C working drafts can be found at http://www.w3.org/TR.
The XML 1.0 Recommendation [XML] describes the syntax of a class of data objects called XML documents. It is possible for XML documents which are equivalent for the purposes of many applications to differ in their physical representation. In particular, they may differ in their entity structure, attribute ordering, and character encoding. This means that much equivalence testing of XML documents cannot be done at the byte-comparison level. This Canonical XML specification aims to introduce a notion of equivalence between XML documents which can be tested at the syntactic level and, in particular, by byte-for-byte comparison. In the syntax it describes, "logically equivalent" documents are byte-for-byte identical.
The syntax described in this specification is called Canonical XML. XML documents may be transformed into Canonical XML (with potentially some information loss) - the result of this transformation is described as the canonical form of the original document. Canonical XML is XML - that is to say, the canonical form of any XML document is an XML document.
There are two essential aspects to the specification of Canonical XML:
Which information from an XML document is included in its canonical form (and which is not).
How information is expressed in Canonical XML.
For the purposes of this specification, the information in an XML document is that described by the XML Information Set Specification [Infoset]. The canonical form of an XML document, which is itself an XML document, also has an information set. This section describes what portion of an XML document's information set is included in that of its canonical form.
Note that information not included in Canonical XML may still be used in producing it. In particular:
Attribute types serve as the basis of the normalization process for attribute values in Canonical XML, but the type of attributes is not preserved in it.
The replacement text of general parsed entities that are referenced is included in Canonical XML, but the information about which entity any character or logical structure came from is not.
Attribute values provided by default are included in Canonical XML, but the fact that the value was provided by default is not.
The canonical form includes only the "children" property of the document information item. It does not include any of the optional properties of the document information item, nor the "notations" or "entities" properties.
The canonical form includes the properties: "namespace URI," "local name," "children" and "attributes" from each element information item. It does not include the "declared namespaces" property, nor any of the optional properties. Note that the infoset lists the "children" property as including references to skipped entity information items but the canonical form does not include these.
The canonical form includes all of the required properties, but none of the optional properties, of the attribute information item.
For Processing Instructions appearing outside of the Document Type Definition, the canonical form includes all of the required properties, but none of the optional properties, of the processing instruction information item. For those which appear in the Document Type Definition, the canonical form includes no Processing Instruction information items.
References to skipped entity information items are not included in the canonical form of a document. Such information items could not appear in Canonical XML because canonicalization requires the reading of declarations for all entities referenced in a document.
The canonical form includes the required "character code" property of the character information item. None of the optional properties of the character information item are included.
Canonical XML does not include comment information items.
Canonical XML does not include document type declaration information items.
Canonical XML does not include entity information items.
Canonical XML does not include notation information items.
Canonical XML does not include entity start marker information items.
Canonical XML does not include entity end marker information items.
Canonical XML does not include CDATA start marker information items.
No CDATA sections occur in the canonical form. They are not necessary since all syntactically-significant characters in Canonical XML are escaped in the fashion described in this specification.
Canonical XML does not include CDATA end marker information items.
Canonical XML does not include namespace declaration information items.
The process of canonicalizing an XML document depends on its standalone document declaration. If the declaration is present and its value is "yes", then assuming the XML document satisfies the Standalone Document Declaration validity constraint, no external portion of the DTD can contain material which affects its canonical form.
In all other cases, the process of canonicalization requires reading the DTD. The following information from the DTD affects the canonical form of an XML document:
Default attribute values.
Declarations of general entities which are referenced in the document.
Attribute type declarations which affect the process of attribute value normalization.
Note that the process of canonicalization is effectively impossible for a non-standalone document for which some external component of the DTD cannot be retrieved. Implementors of software which is designed to produce Canonical XML should provide an interface to users such that this error condition can be signaled.
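One way an implementation might surface this error condition is through a dedicated exception. The following is a minimal sketch only; the names `ExternalSubsetError` and `load_external_subset`, and the caller-supplied `fetch` callback, are hypothetical and are not defined by this specification.

```python
class ExternalSubsetError(Exception):
    """Raised when an external DTD component cannot be retrieved (hypothetical name)."""

    def __init__(self, system_id, cause):
        super().__init__(f"cannot retrieve external DTD component: {system_id}")
        self.system_id = system_id
        self.cause = cause


def load_external_subset(fetch, system_id):
    """Return the text of an external DTD component.

    `fetch` is any caller-supplied callable that takes a system identifier
    and returns text; it is an assumption of this sketch, not a defined API.
    """
    try:
        return fetch(system_id)
    except OSError as exc:
        # Canonicalization cannot proceed without the declarations, so the
        # condition is reported to the caller rather than silently ignored.
        raise ExternalSubsetError(system_id, exc) from exc
```

A caller can then catch `ExternalSubsetError` and report that the canonical form could not be produced.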
The canonical form of an XML document is standalone.
The canonical form of an XML document contains no general entity references - all such references are expanded so that the canonical form contains only the replacement text. Since it contains no DTD, it also contains no parameter entity references.
Suppose a file named "e1.xml" contains the following text, with no trailing newline (#xA) character.
Hallelujah, I'm a bum!
Then, if the following XML document is stored in a file in the same directory,
<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
<!ENTITY bum SYSTEM "e1.xml">
]>
<d>&lsb;&bum;&rsb;</d>
its canonical form is
<d>[Hallelujah, I'm a bum!]</d>
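The effect of entity expansion can be observed with an ordinary XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module and, as a simplifying assumption, covers only the two internal entities: that module does not retrieve external entities such as "e1.xml", so the text of that file is written inline here.

```python
import xml.etree.ElementTree as ET

# Only internal entities are declared; the external entity "bum" from the
# example above is replaced by literal text for the purposes of this sketch.
SOURCE = """<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
]>
<d>&lsb;Hallelujah, I'm a bum!&rsb;</d>"""

root = ET.fromstring(SOURCE)

# The parsed content holds only replacement text; nothing records that
# "[" and "]" originally came from entity references.
expanded = root.text
```

After parsing, `expanded` holds the same character sequence as the content of the canonical form shown above.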
This section describes the syntax of Canonical XML. This syntax is a proper subset of the syntax of XML 1.0. The canonical form of an XML document is identical to its original form except as described in this section.
Each Canonical XML document must match the production labeled canonXML in the grammar below, where the notation and the semantics of the word "match" are those described in the XML 1.0 specification.
[1]  canonXML   ::= (PI #xA)* element #xA (PI #xA)*
[2]  element    ::= Stag (Datachar | element | PI)* Etag
[3]  Stag       ::= '<' Name NSDecl? (Att NSDecl?)* '>'
[4]  Etag       ::= '</' Name '>'
[5]  NSDecl     ::= #x20 'xmlns:' Prefix '=' '"' Attvalchar* '"'
[6]  Att        ::= #x20 Name '=' '"' Attvalchar* '"'
[7]  Datachar   ::= '&amp;' | '&lt;' | '&gt;' | '&#13;' | (Char - ('&' | '<' | '>' | #xD))
[8]  Attvalchar ::= '&amp;' | '&lt;' | '&quot;' | '&#9;' | '&#10;' | '&#13;' | (Char - ('&' | '<' | '"' | #x9 | #xA | #xD))
[9]  Name       ::= (Prefix ':')? NCName
[10] Prefix     ::= 'n' [1-9] [0-9]*
[11] PI         ::= '<?' PITarget (#x20 (Char+ - (Char* '?>' Char*)))? '?>'
[12] PITarget   ::= NCName - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
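As an illustration of how a single production can be checked mechanically, the following sketch tests a string against the Prefix production with a regular expression. The function names are hypothetical, and the sketch does not attempt to cover the other productions.

```python
import re

# Prefix ::= 'n' [1-9] [0-9]*
PREFIX = re.compile(r"n[1-9][0-9]*")


def is_canonical_prefix(prefix):
    """Return True if the whole string matches the Prefix production."""
    return PREFIX.fullmatch(prefix) is not None


def prefix_index(prefix):
    """Return the positive integer that follows the 'n' in a canonical prefix."""
    if not is_canonical_prefix(prefix):
        raise ValueError(f"not a canonical prefix: {prefix!r}")
    return int(prefix[1:])
```

Under this check "n1" and "n10" are accepted, while "n0", "n01" and "x1" are rejected.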
The remainder of this section expresses additional constraints beyond those expressed in the grammar and provides further explanatory material on key aspects of Canonical XML.
Canonical XML uses UTF-8 as the character encoding.
For example, consider the following small document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<lang>Español</lang>
Since it is encoded in ISO-8859-1 ("ISO Latin"), the character "ñ" is stored as the single byte #xF1. In Canonical XML, however, that character would be stored using UTF-8 in two bytes whose values are #xC3 and #xB1.
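The byte values quoted above can be checked directly. The sketch below re-encodes the element content with Python's built-in codecs; the helper name `to_utf8` is hypothetical.

```python
def to_utf8(raw, declared_encoding):
    """Decode bytes using the declared encoding and re-encode them as UTF-8."""
    return raw.decode(declared_encoding).encode("utf-8")


text = "Español"
latin1_bytes = text.encode("iso-8859-1")   # "ñ" occupies one byte
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")  # "ñ" occupies two bytes
```

Here `latin1_bytes` is `b"Espa\xf1ol"` and `utf8_bytes` is `b"Espa\xc3\xb1ol"`.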
The Unicode standard [Unicode] allows multiple different representations of certain "composed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. The W3C has recommended a normalized form for combining-character representation [CharModel] and further recommended that conversion to this form take place at transmission time, a practice called "early normalization". In the presence of early normalization, two Canonical XML documents with Unicode-equivalent content should not exhibit differences due to combining-character representation choices.
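The "ç" case can be made concrete as follows. This sketch assumes that the recommended normalized form corresponds to Unicode Normalization Form C, as exposed by Python's standard `unicodedata` module; the helper name `early_normalize` is hypothetical.

```python
import unicodedata

decomposed = "c\u0327"    # "c" followed by COMBINING CEDILLA
precomposed = "\u00e7"    # LATIN SMALL LETTER C WITH CEDILLA


def early_normalize(text):
    """Apply Normalization Form C (assumed here to be the recommended form)."""
    return unicodedata.normalize("NFC", text)


# Without normalization the two spellings differ at the byte level.
differs_before = decomposed.encode("utf-8") != precomposed.encode("utf-8")
# After normalization they are byte-for-byte identical.
same_after = (early_normalize(decomposed).encode("utf-8")
              == early_normalize(precomposed).encode("utf-8"))
```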
The XML 1.0 specification requires XML processors to perform certain simple transformations on white-space characters in XML documents, when they serve as line separators and when they appear in attribute values. For each character in the result of the transformation, there will be a character information item as described by the Information Set. For example, in an XML 1.0 document:
Where an element contains two lines separated by CR-NL (#xD, #xA), the information set contains a single NL (#xA) character information item.
Where an element or attribute value contains the string "&#xD;", the information set contains a single CR (#xD) character information item.
Where a CDATA attribute value contains a TAB (#x9) character, the information set contains a single space (#x20) character information item.
Where a non-CDATA attribute value contains a TAB (#x9) character, the information set contains a single space (#x20) character information item if the TAB character immediately followed a non-white-space character, and otherwise contains nothing at all.
Where an attribute value contains the string "&#x9;", the information set contains a TAB character (#x9).
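Several of the cases listed above can be observed with a conforming XML processor. The sketch below uses Python's standard `xml.etree.ElementTree` module; it covers only undeclared (and therefore CDATA) attributes, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_of(xml):
    """Return the character data of the root element as the parser reports it."""
    return ET.fromstring(xml).text


def attr_of(xml):
    """Return the value of attribute 'a' on the root element after normalization."""
    return ET.fromstring(xml).get("a")


crlf = text_of("<e>one\r\ntwo</e>")         # literal CR-NL in content
cr_ref = text_of("<e>one&#xD;two</e>")      # CR written as a character reference
tab = attr_of('<e a="one\ttwo"/>')          # literal TAB in an attribute value
tab_ref = attr_of('<e a="one&#x9;two"/>')   # TAB written as a character reference
```

The literal CR-NL pair is reported as a single #xA and the literal TAB as #x20, whereas the characters written as character references survive unchanged.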
All character information items are represented in a Canonical XML document by their UTF-8 encoding, with the following exceptions:
In character data and attribute values, the character information items "<" and "&" are represented by "&lt;" and "&amp;" respectively.
In character data, the character information item ">" is represented by "&gt;".
In attribute values, the double-quote character information item (") is represented by "&quot;".
In character data, the carriage-return (#xD) character information item is represented by "&#13;".
In attribute values, the character information items TAB (#x9), linefeed (#xA), and carriage-return (#xD) are represented by "&#9;", "&#10;", and "&#13;" respectively.
Canonical-XML documents have a prolog which contains only those Processing Instructions appearing before the start-tag of the root element but not within the Document Type Definition. Each PI is followed by a single newline (#xA) character. These PIs and newline characters make up the whole content of the prolog. If there are no such PIs, the first character is the "<" marking the beginning of the root element's start-tag.
For the following XML document
<!DOCTYPE x PUBLIC "myX" "x.dtd" [
<!ENTITY a "aVal">
]>
<x>y</x>
the canonical form is
<x>y</x>
If PIs are involved
<?t1 t1-body ?>
<!DOCTYPE x PUBLIC "myX" "x.dtd" [
<?t2 t2-body ?>
<!ENTITY a "aVal">
]>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<?rating mostly-harmless?>
<x>y</x><?t3 ?>
the canonical form is
<?t1 t1-body ?>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<?rating mostly-harmless?>
<x>y</x>
<?t3?>
The epilog of all Canonical-XML documents contains a single newline (#xA) character, which immediately follows the ">" marking the end of the root element's end-tag. If the epilog contains Processing Instructions they are preserved in the Canonical-XML epilog, each followed by a newline (#xA) character.
For the following XML document
<x>y</x><?audio stop here ?> <!-- Local variables: mode: xml End: --><?pi?>
the canonical form is
<x>y</x>
<?audio stop here ?>
<?pi?>
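The prolog and epilog rules amount to joining already-canonical pieces with newline characters, following the canonXML production. A minimal sketch, in which the function name `assemble` and its parameters are hypothetical and the PIs and root element are assumed to be in canonical form already:

```python
def assemble(prolog_pis, root_element, epilog_pis):
    """Join canonical pieces as (PI #xA)* element #xA (PI #xA)*."""
    parts = [pi + "\n" for pi in prolog_pis]        # each prolog PI, then #xA
    parts.append(root_element + "\n")               # root element, then #xA
    parts.extend(pi + "\n" for pi in epilog_pis)    # each epilog PI, then #xA
    return "".join(parts)


document = assemble([], "<x>y</x>", ["<?audio stop here ?>", "<?pi?>"])
```

For the example above, `document` consists of three lines, each terminated by a single #xA character.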
In Canonical XML, all elements have a start-tag and an end-tag. For elements which have no content, the end-tag follows the start-tag with no intervening characters.
For the following element
<x> <a n="1"/><b n="2"/> <c n="3"/></x>
the canonical form is
<x> <a n="1"></a><b n="2"></b> <c n="3"></c></x>
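The expansion of empty-element tags can be reproduced with a general-purpose serializer. The sketch below uses the `short_empty_elements` option of Python's standard `xml.etree.ElementTree.tostring`; that module is not a Canonical XML implementation and is used here only to illustrate this one rule.

```python
import xml.etree.ElementTree as ET

source = '<x> <a n="1"/><b n="2"/> <c n="3"/></x>'

# short_empty_elements=False forces an explicit end-tag for every element.
explicit = ET.tostring(
    ET.fromstring(source),
    encoding="unicode",
    short_empty_elements=False,
)
```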
In Canonical XML, for end-tags and start-tags which contain no attributes, the ">" character closing the tag follows the element type immediately with no intervening white space. Any attributes and namespace declarations follow with each attribute and namespace declaration preceded by one space (#x20) character. When the element type and the attribute names do not have namespaces, the attributes are sorted lexicographically by attribute name (based on Unicode character code points); the ordering when namespaces are present is described in [5.9 Namespaces].
The canonical form of an XML document includes all its attributes, whether provided explicitly or by default in the original document.
For the following element
<x a="Earth" ñ="Wind" z="Fire" >!!</x >
the canonical form is
<x a="Earth" z="Fire" ñ="Wind">!!</x>
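The ordering in this example follows from comparing names by Unicode code point: "ñ" (#xF1) sorts after "z" (#x7A). A minimal sketch for attributes without namespaces, where the function name `sort_attributes` is hypothetical:

```python
def sort_attributes(attrs):
    """Return (name, value) pairs ordered by the code points of the names."""
    # Python compares str values code point by code point, which matches
    # the ordering described for attributes that have no namespace.
    return sorted(attrs.items(), key=lambda item: item[0])


ordered = sort_attributes({"a": "Earth", "ñ": "Wind", "z": "Fire"})
names = [name for name, _ in ordered]
```

Note that this is not a locale-aware collation, under which "ñ" would typically sort before "z".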
In the canonical form of an XML document, attribute values are normalized in the fashion required of an XML processor.
In Canonical XML, attribute names and values are separated by a single "=" character and no spaces. All attribute values are delimited by double-quote (") characters. Within attribute values, all occurrences of double-quote are replaced by "&quot;".
For the following start-tag
<x a = '"Don&apos;t!", he cried.' b = "'>'">
the canonical form is
<x a="&quot;Don't!&quot;, he cried." b="'>'">
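The escaping rules for character data and attribute values can be sketched as two small functions. The function names are hypothetical; the replacement strings are those of the Datachar and Attvalchar productions. The ampersand is replaced first so that the references introduced by later replacements are not escaped a second time.

```python
def escape_text(data):
    """Escape character data: &, <, > and carriage return."""
    return (data.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\r", "&#13;"))


def escape_attr(value):
    """Escape an attribute value: &, <, double quote, TAB, linefeed, carriage return."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#9;")
                 .replace("\n", "&#10;")
                 .replace("\r", "&#13;"))


def attribute(name, value):
    """Serialize one attribute: a space, the name, '=', and a double-quoted value."""
    return f' {name}="{escape_attr(value)}"'
```

Note that ">" and the apostrophe are left as they are in attribute values, which is why the value of b in the example above is unchanged.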
In Canonical XML, there is no Document Type Definition and thus no PIs contained in it. PIs which precede and follow the root element are normalized as follows:
The white-space separating the PI Target from the rest of the PI contents is replaced by a single space (#x20) character.
The "?>" sequence which closes the PI is followed by a single newline (#xA) character.
PIs which are contained in the content of an element are normalized as follows:
The white-space separating the PI Target from the rest of the PI contents is replaced by a single space (#x20) character.
For the following XML document
<?pi1 v1 ?><?pi2 v2 ?><root>Hello <?audio bang! ?> he said.</root><?pi3?>
the canonical form is
<?pi1 v1 ?>
<?pi2 v2 ?>
<root>Hello <?audio bang! ?> he said.</root>
<?pi3?>
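Serializing a single PI according to these rules can be sketched as follows. The function name `canonical_pi` is hypothetical, and the sketch assumes the target and data are supplied separately, as an XML parser typically reports them, with the white space after the target already removed from the data.

```python
def canonical_pi(target, data=""):
    """Serialize a PI as '<?' target (#x20 data)? '?>'."""
    if target.lower() == "xml":
        # The PITarget production excludes "xml" in any mix of case.
        raise ValueError("reserved PI target")
    if data:
        # Exactly one space separates the target from the rest of the PI.
        return f"<?{target} {data}?>"
    return f"<?{target}?>"


with_data = canonical_pi("pi1", "v1 ")
without_data = canonical_pi("pi3")
```

White space at the end of the data, as in "v1 ", is part of the PI contents and is preserved.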
In Canonical XML, namespace prefixes always have the form n1, n2 and so on. The positive integer following the n is called the index of the prefix.
A start-tag always contains namespace declarations for exactly those prefixes that are used in the element type and the attribute names occurring in the start-tag. Namespace declarations are never inherited.
NOTE: This approach was chosen so that canonicalization is context-independent: the canonical form of an element is independent of where it occurs in the document.
The default namespace is never used. An attribute name never has the same prefix as the element type or another attribute name. The namespace declaration for a prefix immediately follows the element type or attribute that uses the prefix. Attributes are ordered primarily by the lexicographic order of the namespace URI with which the prefix of the attribute name is associated, and secondarily by the lexicographic order of the local part of the attribute name. A null namespace URI is considered to precede a non-null namespace URI: thus all attributes without prefixes precede all attributes with prefixes.
In the start-tag namespace prefixes occur in order of prefix index.
The index of the first namespace prefix in the start-tag is always 1.
The indices of the prefixes occurring in the start-tag are always consecutive integers. Thus if the element type has a prefix, its prefix will be n1; the prefix of the first attribute name in the start-tag that has a prefix will be n2 if the element type has a prefix, and n1 otherwise; for subsequent attributes, the index of the prefix of the attribute name will be one greater than the index of the prefix of the name of the preceding attribute.
For example, for the following element
<doc xmlns:x="http://w3.org/2" xmlns:y="http://w3.org/1"> <x:e a="a"/> <x:e x:a="x:a"/> <e x:a="x:a"/> <e x:a="x:a" y:a="y:a"/> <e x:a="x:a" a="a"/> <e x:a="x:a" x:b="x:b"/> </doc>
the canonical form is
<doc> <n1:e xmlns:n1="http://w3.org/2" a="a"></n1:e> <n1:e xmlns:n1="http://w3.org/2" n2:a="x:a" xmlns:n2="http://w3.org/2"></n1:e> <e n1:a="x:a" xmlns:n1="http://w3.org/2"></e> <e n1:a="y:a" xmlns:n1="http://w3.org/1" n2:a="x:a" xmlns:n2="http://w3.org/2"></e> <e a="a" n1:a="x:a" xmlns:n1="http://w3.org/2"></e> <e n1:a="x:a" xmlns:n1="http://w3.org/2" n2:b="x:b" xmlns:n2="http://w3.org/2"></e> </doc>
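The prefix-assignment and ordering rules can be sketched as a function that builds one start-tag. The name `start_tag` and its argument layout are hypothetical: the element is given as a namespace URI (or None) and a local name, and the attributes as a mapping from (namespace URI or None, local name) to value.

```python
def escape_attr(value):
    """Escape an attribute value as required in Canonical XML."""
    return (value.replace("&", "&amp;").replace("<", "&lt;")
                 .replace('"', "&quot;").replace("\t", "&#9;")
                 .replace("\n", "&#10;").replace("\r", "&#13;"))


def start_tag(elem_uri, elem_local, attrs):
    """Build a start-tag with freshly numbered prefixes n1, n2, ..."""
    parts = []
    index = 0
    if elem_uri:
        index += 1
        parts.append(f"<n{index}:{elem_local}")
        # The declaration immediately follows the name that uses the prefix.
        parts.append(f' xmlns:n{index}="{escape_attr(elem_uri)}"')
    else:
        parts.append(f"<{elem_local}")
    # Order by namespace URI, then local name; a null URI sorts first.
    ordered = sorted(attrs.items(), key=lambda kv: (kv[0][0] or "", kv[0][1]))
    for (uri, local), value in ordered:
        if uri:
            index += 1  # every prefixed attribute gets its own prefix
            parts.append(f' n{index}:{local}="{escape_attr(value)}"')
            parts.append(f' xmlns:n{index}="{escape_attr(uri)}"')
        else:
            parts.append(f' {local}="{escape_attr(value)}"')
    parts.append(">")
    return "".join(parts)


tag = start_tag(None, "e", {("http://w3.org/2", "a"): "x:a",
                            ("http://w3.org/1", "a"): "y:a"})
```

Because the function looks only at the element and its own attributes, its result does not depend on where the element occurs in the document.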
The work of producing this specification was accomplished by the membership of the W3C XML Syntax Working Group:
Joel Nava, Adobe (Co-chair); Tim Bray, Invited Expert (Co-chair, Co-editor); James Clark, Invited Expert (Co-editor); James Tauber, Invited Expert (Co-editor); Bert Bos, W3C (W3C Liaison); Joseph Reagle, W3C (W3C Liaison); Gary Bisaga, Mitre; Tim Boland, NIST, Invited Expert; Charles Frankston, Microsoft; Paul Grosso, Arbortext; Eduardo Gutentag, Sun Microsystems; Michael Hyman, Microsoft; Murata Makoto, Fuji Xerox; Michael Sperberg-McQueen, U. Ill. and W3C; Steph Tryphonas, Microstar; François Yergeau, Alis