The Extensible Markup Language (XML) provides a set of rules for defining markup languages intended for use in encoding data objects, and specifies behavior for certain software modules that access them.
This document is a private skunkworks and has no official standing of any kind, not having been reviewed by any organization in any way.
This draft was assembled by Tim Bray from text edited by himself, John Cowan, Dave Hollander, Andrew Layman, Eve Maler, Jonathan Marsh, Jean Paoli, C. Michael Sperberg-McQueen, and Richard Tobin, the editors of XML's first and second editions, Namespaces in XML, XML Infoset, and XML Base. There should be no suggestion that anybody other than Tim Bray approves of the content or even the existence of the present document.
The copyright statement above applies to almost all the text assembled for this document, but should not be taken as an indication that the W3C approves of the contents or existence of this document.
This document specifies XML SW. The recipe for the construction of XML SW is as follows: XML 1.0 [XML 2e], minus DTDs (and therefore necessarily entities), plus XML Base [XML Base], plus the XML Information Set [XML Infoset], plus XML Namespaces [XMLNamespaces]. The intent is to avoid introducing any modification to the semantics of any of the ingredient specifications, thus all of the syntax and behavior described in this document should be equivalent to that specified in one W3C Recommendation or another.
2 Documents, Elements, and Attributes
2.1 Start-Tags, End-Tags, and Empty-Element Tags
2.2 XML Namespaces
2.2.1 Declaring Namespaces
2.2.2 Using Qualified Names
2.2.3 Namespace Declaration Scope and Overriding
2.2.4 Namespace Defaulting
2.2.5 Uniqueness of Attributes
2.3 Parent, Child, and Root Elements
2.4 Reserved Attributes
2.4.1 White Space Handling
2.4.2 Language Identification
2.4.3 Base URI Specification
3 Other Markup
3.1 Prolog and Document Type Declaration
3.3 Processing Instructions
3.4 CDATA Sections
4 Characters and Text
4.2 Character References
4.3 Character Data and Markup
4.4 Common Syntactic Constructs
4.5 Character Encoding in XML Documents
5 The Information Set
5.1 Base URI
5.2 "Unknown" and "No Value"
5.3 Synthetic Infosets
5.4 End-of-Line Handling
5.5 Information Items
5.5.1 The Document Information Item
5.5.2 Element Information Items
5.5.3 Attribute Information Items
5.5.4 Processing Instruction Information Items
5.5.5 Character Information Item
5.5.6 Comment Information Items
5.5.7 The Document Type Declaration Information Item
5.5.8 Namespace Information Items
6.1 Syntax Checking
6.2 Use of the XML Information Set by Other Specifications
6.3 XML Processors and the XML Information Set
7 Notation and Terminology
A.1 Normative References
A.2 Other References
B Character Classes
C XML and SGML (Non-Normative)
D Autodetection of Character Encodings (Non-Normative)
D.1 Detection Without External Encoding Information
D.2 Priorities in the Presence of External Encoding Information
E Production Notes (Non-Normative)
Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. By construction, XML documents are conforming Standard Generalized Markup Language (SGML) [ISO 8879] documents.
[Definition: A software module called an XML processor is used to read XML documents and provide access to their content and structure.] [Definition: It is assumed that an XML processor is doing its work on behalf of another module, called the application.] This specification describes the required behavior of an XML processor in terms of how it must read XML data and (in 5 The Information Set) the information it must provide to the application.
The process at the W3C that led to this document was originated in 1996 by Jon Bosak, and involved a very large number of contributors from within and without the W3C. Lists of contributors may be found in the specifications on which this one is based.
This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version SW and construct computer programs to process it.
This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.
[Definition: A data object is an XML document if:]
Taken as a whole, it matches the production labeled document.
It meets the further constraints found in the running text, well-formedness constraints, and normative appendices of this specification.
An example of an XML document:
[Definition: Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications.] Each attribute specification has a name and a value.
|[WFC: Element Type Match]|
The QName in an element's end-tag must match the element type in the start-tag.
An example containing three XML elements:
<Greeting xml:lang="en"><emph>Hello</emph> world! <html:img src="smiley.jpg"/></Greeting>
This specification does not constrain the semantics, use, or (beyond syntax) names of the element types and attributes, except that:
Names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.
Widely-used semantics are assigned to three attributes whose names begin with "xml:".
[Definition: The beginning of every non-empty XML element is marked by a start-tag.]
|||::=||[WFC: Unique Att Spec]|
|[WFC: Prefix Declared]|
|[WFC: Prefix Declared]|
The QName in the start- and end-tags gives the element's type. [Definition: The QName-AttValue pairs are referred to as the attribute specifications of the element], [Definition: with the QName in each pair referred to as the attribute name] and [Definition: the content of the AttValue (the text between the delimiters) as the attribute value.] Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.
No attribute name may appear more than once in the same start-tag or empty-element tag.
An example of a start-tag:
<termdef id="dt-dog" term="dog">
[Definition: The end of every element that begins with a start-tag must be marked by an end-tag containing a name that matches the element's type as given in the start-tag:]
|||::=||[WFC: Prefix Declared]|
An example of an end-tag:
[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag. [Definition: An empty-element tag takes a special form:]
|||::=||[WFC: Unique Att Spec]|
|[WFC: Prefix Declared]|
Empty-element tags may be used for any element which has no content.
Examples of empty elements:
<IMG align="left" src="http://www.example.org/Icons/madonna" /> <br></br> <br/>
The names that appear as element types and attribute names serve as labels for the logical components of an XML document. [Definition: Software modules are often designed to process a particular set of elements and attributes and their content, identifying them using these labels. Let us refer to such a set, understood by some software module, as a markup vocabulary.]
We envision applications of XML where a single XML document may contain elements and attributes from more than one markup vocabulary. One motivation for this is modularity; if such a markup vocabulary exists which is well-understood and for which there is useful software available, it is better to re-use this markup rather than re-invent it.
Such documents, containing multiple markup vocabularies, pose problems of recognition and collision. Software modules need to be able to recognize the elements and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element type or attribute name.
These considerations require that document constructs should have universal names, whose scope extends beyond their containing document. This section describes a mechanism, XML namespaces, which accomplishes this.
[Definition: URI references which identify namespaces are considered identical when they are exactly the same character-for-character.] Note that URI references which are not identical in this sense may in fact be functionally equivalent. Examples include URI references which differ only in case.
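The identity rule above is a plain string comparison. As a minimal sketch (the function name `same_namespace` is an assumption made for illustration, not part of any specification), it might be expressed as:

```python
def same_namespace(uri_a: str, uri_b: str) -> bool:
    """Namespace names are identical only if equal character for character."""
    return uri_a == uri_b


identical = same_namespace("http://example.com/schema", "http://example.com/schema")
# These two differ only in case: they may be functionally equivalent as URIs,
# but they are not identical as namespace names.
differ_in_case = same_namespace("http://example.com/schema", "http://EXAMPLE.com/schema")
print(identical, differ_in_case)
```

No case folding, escaping, or other URI normalization is applied before the comparison.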
Names from XML namespaces may appear as qualified names, which contain a single colon, separating the name into a namespace prefix and a local part. The prefix, which is mapped to a URI reference, selects a namespace. The combination of the universally managed URI namespace and the document's own namespace produces identifiers that are universally unique. Mechanisms are provided for prefix scoping and defaulting.
URI references can contain characters not allowed in names, so cannot be used directly as namespace prefixes. Therefore, the namespace prefix serves as a proxy for a URI reference. An attribute-based syntax described below is used to declare the association of the namespace prefix with a URI reference.
[Definition: A namespace is declared using a family of reserved attributes. Such an attribute's name must either be xmlns or have xmlns: as a prefix. ]
Here is an example namespace declaration, which associates the namespace prefix eg with the namespace name http://example.com/schema:
<x xmlns:eg='http://example.com/schema'> <!-- the "eg" prefix is bound to http://example.com/schema for the "x" element and contents --> </x>
|||::=||[WFC: Leading "XML"]|
[Definition: The attribute's value, a URI reference, is the namespace name identifying the namespace.] The namespace name, to serve its intended purpose, should have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).
[Definition: If the attribute name has "xmlns:" as a prefix, then the NCName that follows gives the namespace prefix, used to associate element and attribute names with the namespace name in the attribute value in the scope of the element to which the declaration is attached.] In such declarations, the namespace name may not be empty.
[Definition: If the attribute name matches DefaultAttName, then the namespace name in the attribute value is that of the default namespace in the scope of the element to which the declaration is attached.] In such a default declaration, the attribute value may be empty. Default namespaces and overriding of declarations are discussed in 2.2.3 Namespace Declaration Scope and Overriding and 2.2.4 Namespace Defaulting.
[Definition: If the qualified name contains a colon, then the portion before the colon, referred to as the namespace prefix (nonterminal Prefix) provides the link to the namespace.] [Definition: While a qualified name may not contain a prefix and colon, it always contains a local part (nonterminal LocalPart) which appears after the colon if there is one, and otherwise makes up the whole of the qualified name.]
If there is a prefix, it must have been associated with a namespace URI reference in a namespace declaration.
An example of a qualified name serving as an element type:
<x xmlns:eg='http://example.com/schema'> <!-- the 'price' element's namespace is http://example.com/schema --> <eg:price units='Euro'>32.18</eg:price> </x>
Note that the prefix functions only as a placeholder for a namespace name. Applications should use the namespace name, not the prefix, in constructing names whose scope extends beyond the containing document.
An example of a qualified name serving as an attribute name:
<x xmlns:eg='http://example.com/schema'> <!-- the 'taxClass' attribute's namespace is http://example.com/schema --> <lineItem eg:taxClass="exempt">Baby food</lineItem> </x>
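The advice to use the namespace name rather than the prefix can be seen in how a namespace-aware parser reports names. The following is a sketch using Python's standard `xml.etree.ElementTree`, assuming an XML 1.0 parser is an acceptable stand-in for an XML SW processor; the `{namespace-name}local-part` notation is ElementTree's own convention, not something this specification defines.

```python
import xml.etree.ElementTree as ET

DOC = """<x xmlns:eg='http://example.com/schema'>
  <eg:price units='Euro'>32.18</eg:price>
  <lineItem eg:taxClass="exempt">Baby food</lineItem>
</x>"""

root = ET.fromstring(DOC)
price = root[0]
line_item = root[1]

# The prefix "eg" does not survive parsing; only the namespace name does.
print(price.tag)         # {http://example.com/schema}price
print(line_item.tag)     # lineItem
print(line_item.attrib)  # {'{http://example.com/schema}taxClass': 'exempt'}
```

Note that the unprefixed `units` attribute and the unprefixed `lineItem` element are reported without any namespace name.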
If the QName has a namespace prefix, that prefix, unless it is "xml" or "xmlns", must have been declared in a namespace declaration attribute in either the start-tag of the element where the prefix is used or in an ancestor element (i.e. an element in whose content the prefixed markup occurs).
The prefix "xml" is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace.
The prefix "xmlns" is used only for namespace bindings and is not itself bound to any namespace name.
The namespace declaration is considered to apply to the element where it is specified and to all elements within the content of that element, unless overridden by another namespace declaration with the same NSAttName part:
<?xml version="SW"?> <!-- all elements here are explicitly in the HTML namespace --> <html:html xmlns:html='http://www.w3.org/1999/xhtml'> <html:head><html:title>Frobnostication</html:title></html:head> <html:body><html:p>Moved to <html:a href='http://frob.com'>here.</html:a></html:p></html:body> </html:html>
Multiple namespace prefixes can be declared as attributes of a single element, as shown in this example:
<?xml version="SW"?> <!-- both namespace prefixes are available throughout --> <bk:book xmlns:bk="http://www.example.com/books/" xmlns:isbn='http://www.example.com/isbn/'> <bk:title>Cheaper by the Dozen</bk:title> <isbn:number>1568491379</isbn:number> </bk:book>
A default namespace (declared with an attribute named just "xmlns") is considered to apply to the element where it is declared (if that element has no namespace prefix), and to all elements with no prefix within the content of that element. If the URI reference in a default namespace declaration is empty, then unprefixed elements in the scope of the declaration are not considered to be in any namespace.
Note that default namespaces do not apply directly to attributes.
<?xml version="SW"?> <!-- elements are in the HTML namespace, in this case by default --> <html xmlns='http://www.w3.org/1999/xhtml'> <head><title>Frobnostication</title></head> <body><p>Moved to <a href='http://example.com/frob/'>here</a>.</p></body> </html>
Defaulted namespaces can mix with those that are explicitly specified:
<?xml version="SW"?> <!-- unprefixed element types are from "books" --> <book xmlns="http://www.example.com/books/" xmlns:isbn="http://www.example.com/isbn/"> <title>Cheaper by the Dozen</title> <isbn:number>1568491379</isbn:number> </book>
A larger example of namespace scoping and defaulting:
<?xml version="SW"?> <!-- initially, the default namespace is "books" --> <book xmlns="http://www.example.com/books/" xmlns:isbn="http://www.example.com/isbn/"> <title>Cheaper by the Dozen</title> <isbn:number>1568491379</isbn:number> <notes> <!-- make HTML the default namespace for some commentary --> <p xmlns='http://www.w3.org/1999/xhtml'> This is a <i>funny</i> book! </p> </notes> </book>
The default namespace can be set to the empty string. This has the same effect, within the scope of the declaration, as there being no default namespace.
<?xml version='SW'?> <Beers> <!-- the default namespace is now that of HTML --> <table xmlns='http://www.w3.org/TR/REC-html40'> <th><td>Name</td><td>Origin</td><td>Description</td></th> <tr> <!-- no default namespace inside table cells --> <td><brandName xmlns="">Huntsman</brandName></td> <td><origin xmlns="">Bath, UK</origin></td> <td> <details xmlns=""><class>Bitter</class><hop>Fuggles</hop> <pro>Wonderful hop, light alcohol, good summer beer</pro> <con>Fragile; excessive variance pub to pub</con> </details> </td> </tr> </table> </Beers>
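The effect of default namespace scoping and of the empty declaration can be checked with a namespace-aware parser. This sketch uses Python's standard `xml.etree.ElementTree` on a cut-down version of the example above, again assuming an XML 1.0 parser behaves the same way for this purpose.

```python
import xml.etree.ElementTree as ET

DOC = """<Beers>
  <table xmlns='http://www.w3.org/TR/REC-html40'>
    <tr>
      <td><brandName xmlns="">Huntsman</brandName></td>
    </tr>
  </table>
</Beers>"""

root = ET.fromstring(DOC)

# Element names in document order, as "{namespace-name}local-part".
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

`Beers` lies outside the scope of any default declaration and `brandName` resets the default to empty, so both are reported without a namespace name, while `table`, `tr`, and `td` carry the HTML namespace name.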
In XML documents conforming to this specification, no tag may contain two attributes which:
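have identical names, or
have qualified names with the same local part and with prefixes which have been bound to namespace names that are identical.
For example, each of the bad start-tags is illegal in the following: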
<!-- http://www.example.com/ is bound to n1 and n2 --> <x xmlns:n1="http://www.example.com/" xmlns:n2="http://www.example.com/" > <bad a="1" a="2" /> <bad n1:a="1" n2:a="2" /> </x>
However, each of the following is legal, the second because the default namespace does not apply to attribute names:
<!-- http://www.example.com/ is bound to n1 and is the default --> <x xmlns:n1="http://www.example.com/" xmlns="http://www.example.com/" > <good a="1" b="2" /> <good a="1" n1:a="2" /> </x>
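Both uniqueness rules can be exercised against a namespace-aware parser. The sketch below relies on Python's standard `xml.etree.ElementTree` (backed by expat) rejecting duplicate attributes; the helper name `is_well_formed` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the stock namespace-aware parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


BOTH = 'xmlns:n1="http://www.example.com/" xmlns:n2="http://www.example.com/"'
DEFAULTED = 'xmlns:n1="http://www.example.com/" xmlns="http://www.example.com/"'

same_name = f'<x {BOTH}><bad a="1" a="2" /></x>'
same_expanded_name = f'<x {BOTH}><bad n1:a="1" n2:a="2" /></x>'
legal = f'<x {DEFAULTED}><good a="1" n1:a="2" /></x>'

for doc in (same_name, same_expanded_name, legal):
    print(is_well_formed(doc))
```

The third document is accepted because the default namespace does not apply to the unprefixed attribute `a`, so `a` and `n1:a` remain distinct names.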
An XML document matches the document production, which implies that:
It contains one or more elements.
[Definition: There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.] For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.
[Definition: As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.]
An example of root, parent, and child elements:
<root>This root element is the parent of the "parent" element. <parent>This parent element is a child of the "root" element and parent of the "child" element. <child>This child element is a child of the "parent" element.</child> </parent> </root>
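Because elements nest properly, every non-root element has exactly one parent. As an illustrative sketch (using Python's standard `xml.etree.ElementTree`; the variable name `parent_of` is an assumption), the parent relation can be recovered in a single pass:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><parent><child/></parent></root>")

# Map each non-root element to the one element whose content directly holds it.
parent_of = {child: parent for parent in root.iter() for child in parent}

names = {child.tag: parent.tag for child, parent in parent_of.items()}
print(names)
```

The root element never appears as a key, since no part of it lies in the content of another element.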
This section describes several attributes whose names begin "xml:", associating them with predefined semantics useful in a wide variety of applications.
A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. This is a common application requirement, for example in poetry and source code.
The allowed values of this attribute are "default" and "preserve". The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.
The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
An example of the use of xml:space:
<div> <p xml:space="default">In this paragraph, line breaks and indentation mean nothing.</p> <p xml:space="preserve">Here, space matters: \o/ | / \ </p></div>
In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.
[IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p> <p xml:lang="en-GB">What colour is it?</p> <p xml:lang="en-US">What color is it?</p> <sp who="Faust" desc='leise' xml:lang="de"> <l>Habe nun, ach! Philosophie,</l> <l>Juristerei, und Medizin</l> <l>und leider auch Theologie</l> <l>durchaus studiert mit heißem Bemüh'n.</l> </sp>
The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.
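The inheritance rule for xml:lang amounts to a top-down walk that carries the nearest declared value. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the function name `effective_languages` is hypothetical, and the attribute key reflects ElementTree's notation for names bound to the "xml" prefix.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(root):
    """Map each element to the xml:lang value in effect for it, or None."""
    result = {}

    def walk(el, inherited):
        # An element's own xml:lang overrides whatever its ancestors declared.
        lang = el.get(XML_LANG, inherited)
        result[el] = lang
        for child in el:
            walk(child, lang)

    walk(root, None)
    return result


doc = ET.fromstring(
    '<div xml:lang="en"><p>Hello</p>'
    '<sp xml:lang="de"><l>Habe nun, ach!</l></sp></div>'
)
pairs = [(el.tag, lang) for el, lang in effective_languages(doc).items()]
print(pairs)
```

Here `p` inherits "en" from `div`, while `l` inherits "de" from the overriding declaration on `sp`.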
This section describes a reserved attribute named xml:base, with semantics similar to that of HTML BASE, for defining base URIs for parts of XML documents.
The terms base URI and relative URI are used in this section as they are defined in [RFC2396].
The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity. The value of this attribute is interpreted as a URI Reference as defined in RFC 2396 [RFC2396], after processing according to Section 3.1.
Here is an example of xml:base in a simple document containing XLinks:
<?xml version="SW"?> <doc xml:base="http://example.org/today/" xmlns:xlink="http://www.w3.org/1999/xlink"> <head> <title>Virtual Library</title> </head> <body> <paragraph>See <link xlink:type="simple" xlink:href="new.xml">what's new</link>!</paragraph> <paragraph>Check out the hot picks of the day!</paragraph> <olist xml:base="/hotpicks/"> <item> <link xlink:type="simple" xlink:href="pick1.xml">Hot Pick #1</link> </item> <item> <link xlink:type="simple" xlink:href="pick2.xml">Hot Pick #2</link> </item> <item> <link xlink:type="simple" xlink:href="pick3.xml">Hot Pick #3</link> </item> </olist> </body> </doc>
The URIs in this example resolve to full URIs as follows:
"what's new" resolves to the URI "http://example.org/today/new.xml"
"Hot Pick #1" resolves to the URI "http://example.org/hotpicks/pick1.xml"
"Hot Pick #2" resolves to the URI "http://example.org/hotpicks/pick2.xml"
"Hot Pick #3" resolves to the URI "http://example.org/hotpicks/pick3.xml"
The set of characters allowed in xml:base attributes is the same as for XML, namely [Unicode]. However, some Unicode characters are disallowed from URI references, and thus processors must encode and escape these characters to obtain a valid URI reference from the attribute value.
The disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [RFC2396], except for the number sign (#) and percent sign (%) characters and the square bracket characters re-allowed in [RFC2732]. Disallowed characters must be escaped as follows:
Each disallowed character is converted to UTF-8 [RFC2279] as one or more bytes.
Any bytes corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
The original character is replaced by the resulting character sequence.
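The escaping steps, together with the resolution of relative URIs against nested xml:base values described earlier in this section, can be sketched as follows. This is an illustration under stated assumptions, not a conforming implementation: the function names are hypothetical, the set of excluded ASCII characters is one reading of RFC 2396, and Python's `urllib.parse.urljoin` stands in for the RFC 2396 resolution algorithm.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Assumed reading of the ASCII characters excluded by RFC 2396, minus
# "#", "%", "[" and "]". Controls and space are caught by the range test.
EXCLUDED_ASCII = set('<>"{}|\\^`')


def escape_uri_reference(value: str) -> str:
    """Escape each disallowed character as %HH over its UTF-8 bytes."""
    out = []
    for ch in value:
        code = ord(ch)
        if code <= 0x20 or code >= 0x7F or ch in EXCLUDED_ASCII:
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def resolve_links(el, base=""):
    """Return {link text: absolute URI}, honoring nested xml:base attributes."""
    base = urljoin(base, escape_uri_reference(el.get(XML_BASE, "")))
    found = {}
    href = el.get(XLINK_HREF)
    if href is not None:
        found[el.text] = urljoin(base, href)
    for child in el:
        found.update(resolve_links(child, base))
    return found


DOC = """<doc xml:base="http://example.org/today/"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <link xlink:type="simple" xlink:href="new.xml">what's new</link>
  <olist xml:base="/hotpicks/">
    <link xlink:type="simple" xlink:href="pick1.xml">Hot Pick #1</link>
  </olist>
</doc>"""

links = resolve_links(ET.fromstring(DOC))
escaped = escape_uri_reference("http://example.org/caf\u00e9 menu#top")
print(links)
print(escaped)
```

The inner `xml:base="/hotpicks/"` is itself resolved against the outer base before being applied to the links inside `olist`, which matches the resolved URIs listed for the example document.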
This section describes markup that can appear in an XML document but does not serve to encode the logical structure of the document.
[Definition: XML documents should begin with an XML declaration which specifies the version of XML being used.] For example, the following is a complete XML document.
<?xml version="SW"?> <greeting>Hello, world!</greeting>
The version number "SW" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "SW" if it does not conform to this version of this specification. Processors may signal an error if they receive documents labeled with versions they do not support.
[Definition: For compatibility with XML 1.0, a document type declaration may appear in an XML document before the first element.]
An example of an XML document with a document type declaration:
<?xml version="SW"?> <!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello, world!</greeting>
An example of a comment:
<!-- declarations for <head> & <body> -->
Note that the grammar does not allow a comment ending in --->. The following example is not well-formed.
<!-- B+, B, or B--->
[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]
PIs are not part of the document's character data, but must be passed through to the application. [Definition: The PI begins with a target (PITarget) used to identify the application to which the instruction is directed.] The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification.
[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>".]
Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.
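The behavior of CDATA sections can be observed with any XML parser. This short sketch uses Python's standard `xml.etree.ElementTree`, assuming an XML 1.0 parser treats CDATA sections the same way; the element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Left angle brackets and ampersands appear literally inside the section.
code = ET.fromstring("<code><![CDATA[if (a < b && b < c) { return; }]]></code>")
print(code.text)

# An escape written inside a CDATA section is not recognized as markup.
literal = ET.fromstring("<c><![CDATA[&lt;]]></c>")
print(literal.text)
```

The second case shows why such characters cannot be escaped inside a CDATA section: the four characters of the would-be escape are delivered as character data.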
[Definition: XML documents contain text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char. The use of "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]), is discouraged.]
|||::=||/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */|
The same encoding must be used for all the characters in an XML document. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.5 Character Encoding in XML Documents.
[Definition: A character reference in an XML document stands for a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]
|[WFC: Legal Character]|
Characters referred to using character references must match the production for Char.
If the character reference begins with "&#x", the digits and letters up to the terminating ";" provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ";" provide a decimal representation of the character's code point.
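The two forms of character reference can be expanded with a small amount of code. The following is a sketch only: the function name `expand_char_refs` is hypothetical, and it does not check the result against the Char production as a conforming processor must.

```python
import re

# "&#x" followed by hexadecimal digits, or "&#" followed by decimal digits,
# in both cases terminated by ";".
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def expand_char_refs(text: str) -> str:
    """Replace each character reference with the character it stands for."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)

    return CHAR_REF.sub(replace, text)


print(expand_char_refs("&#x41;&#66;"))  # AB
print(expand_char_refs("caf&#xE9;"))    # café
```

A code point beyond the Unicode range makes `chr` raise `ValueError`, which a real processor would instead report as a violation of the Legal Character constraint.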
[Definition: For readability, a set of predefined patterns is also provided for the purpose of escaping XML's markup characters: "&lt;", "&gt;", "&amp;", "&apos;", and "&quot;". This has exactly the same effect as using character references: "&#60;", "&#38;", and so on.]
Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, and any white space that is in the document, outside the document element, and not inside any other markup.]
[Definition: All text that is not markup constitutes the character data of the document.]
The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup. In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
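When generating XML, these escaping rules are usually applied by a library routine. As a hedged illustration, Python's standard `xml.sax.saxutils.escape` handles "&", "<", and ">" by default and accepts extra replacements for the two quote characters; the wrapper name `escape_attribute` is an assumption made for this sketch.

```python
from xml.sax.saxutils import escape


def escape_attribute(value: str) -> str:
    """Escape markup delimiters plus both quote characters."""
    return escape(value, {"'": "&apos;", '"': "&quot;"})


content = escape("a < b && c > d")
cdata_close = escape("]]>")
attribute = escape_attribute("it's \"ok\"")
print(content)
print(cdata_close)
print(attribute)
```

Escaping every ">" is more than the rule requires, but it guarantees that the string "]]>" never appears unescaped in content.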
This section defines some symbols used widely in the grammar.
S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full definitions of the specific characters in each class are given in B Character Classes.
[Definition: A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.] Names beginning with the string "xml", or any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.
To support XML namespaces (see 2.2 XML Namespaces), it is necessary to give element types and attribute names as a quoted pair of labels; the Qualified Name (QName) and No-colon Name (NCName) nonterminals support this.
Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the values of attributes (AttValue), and certain components of the document type declaration.
XML documents must contain Unicode characters, but there are a variety of techniques for encoding characters into bytes for storage. All XML processors must be able to read XML documents in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.
XML documents encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
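Using the Byte Order Mark to tell UTF-16 from UTF-8 might look like the sketch below. It is deliberately incomplete: the function name `sniff_utf` is hypothetical, and it ignores encoding declarations, external encoding information, and every encoding other than these two.

```python
import codecs


def sniff_utf(data: bytes) -> str:
    """Guess UTF-8 or UTF-16 from the leading bytes of a document."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    return "utf-8"


utf16_doc = "<greeting>Hello, world!</greeting>".encode("utf-16")
utf8_doc = "<greeting>Hello, world!</greeting>".encode("utf-8")

for raw in (utf16_doc, utf8_doc):
    encoding = sniff_utf(raw)
    # Python's "utf-16" codec consumes the Byte Order Mark while decoding,
    # so the mark does not become part of the document text.
    print(encoding, raw.decode(encoding))
```

Decoding with the detected codec removes the encoding signature, consistent with the statement that it is part of neither the markup nor the character data.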
Although an XML processor is required to read only XML documents in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read XML documents that use them. In the absence of external character encoding information (such as MIME headers), XML documents which are stored in an encoding other than UTF-8 or UTF-16 must begin with an XML declaration containing an encoding declaration:
|||::=||/* Encoding name contains only Latin characters */|
The EncName is the name of the encoding used.
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997.
It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix.
XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
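Case-insensitive matching with an "unknown" fallback could be approximated as in the sketch below. It assumes that the codec registry of the Python runtime is an acceptable stand-in for the set of encodings a processor supports; that registry is not the IANA registry, and the function name `known_encoding` is hypothetical.

```python
import codecs


def known_encoding(enc_name: str):
    """Return a canonical codec name, or None if this runtime does not know it."""
    try:
        return codecs.lookup(enc_name).name
    except LookupError:
        return None


for name in ("UTF-8", "utf-8", "EUC-JP", "x-no-such-encoding"):
    print(name, known_encoding(name))
```

A processor built this way would treat a `None` result as an encoding it is unable to process.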
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error for an XML document including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an XML document which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII XML documents do not strictly need an encoding declaration.
It is a fatal error when an XML processor encounters a data object with an encoding that it is unable to process. It is a fatal error if an XML document is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML document contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
Examples of XML declarations containing encoding declarations:
<?xml version='SW' encoding='UTF-8'?>
<?xml version="SW" encoding='EUC-JP'?>
This section defines an abstract data set called the XML Information Set (Infoset). It exists to provide:
A consistent set of definitions for use in other specifications that need to refer to the information in an XML document.
The contents of the information set for an XML document are designed to convey its structure and content as expressed by its markup and character data. However, there are some items of markup which have no effect on the contents of the information set: examples include CDATA sections and character references.
[Definition: An XML document's information set consists of a number of information items; the information set for any XML document contains at least a document information item and several others.] [Definition: An information item is an abstract description of some part of an XML document: each information item has a set of associated named properties.]
The XML Information Set does not require or favor a specific interface or class of interfaces. This specification presents the information set as a modified tree for the sake of clarity and simplicity, but there is no requirement that the XML Information Set be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces, are also capable of providing information conforming to the XML Information Set.
The terms "information set" and "information item" are similar in meaning to the generic terms "tree" and "node", as they are used in computing. However, the former terms are used in this specification to reduce possible confusion with other specific data models. Information items do not map one-to-one with the nodes of the DOM or the "tree" and "nodes" of the XPath data model.
Several information items have a "base URI" property. These are computed as specified in 2.4.3 Base URI Specification. Note that retrieval of a resource may involve redirection at the parser level (for example, in an entity resolver) or below; in this case the base URI is the final URI used to retrieve the resource after all redirection.
The value of these properties does not reflect any URI escaping that may be required for retrieval of the resource, but it may include escaped characters if these were specified in the XML document, or returned by a server in the case of redirection.
In some cases (such as a document read from a string or a pipe) the rules in 2.4.3 Base URI Specification may result in a base URI being application dependent. In these cases this specification does not define the value of the "base URI" property.
When resolving relative URIs the "base URI" property should be used in preference to the values of xml:base attributes; they may be inconsistent in the case of Synthetic Infosets.
Some properties may sometimes have the value unknown or no value, and it is said that a property value is unknown or that a property has no value respectively. These values are distinct from each other and from all other values. In particular they are distinct from the empty string, the empty set, and the empty list, each of which simply has no members. This specification does not use the term "null" since in some communities it has connotations which may not match those intended here.
This specification describes the information set resulting from parsing an XML document. Information sets may be constructed by other means, for example by use of an API such as the DOM or by transforming an existing information set.
An information set corresponding to a real document will necessarily be consistent in various ways; for example the "in-scope namespaces" property of an element will be consistent with the "namespace attributes" properties of the element and its ancestors. This may not be true of an information set constructed by other means; in such a case there is no XML document corresponding to the information set, and to serialize it will require resolution of the inconsistencies (for example, by outputting namespace declarations that correspond to the namespaces in scope).
XML documents are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).
To simplify the tasks of applications, the information set items corresponding to line ends in the character data and attribute values of an XML document appear in a form "normalized" as follows: all appearances of either the literal two-character sequence "#xD#xA" or a standalone literal #xD are normalized into a single character information item for a #xA character.
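The normalization just described can be sketched in a few lines of Python. The function name is hypothetical; the only point of the example is the order of the two replacements.

```python
def normalize_line_ends(text: str) -> str:
    """Turn every #xD#xA pair and every standalone #xD into a single #xA."""
    # Replace the two-character sequence first; otherwise the #xD of each
    # pair would be converted on its own and yield two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```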
[Definition: There is exactly one document information item in the information set, and all other information items are accessible from the properties of the document information item, either directly or indirectly through the properties of other information items.]
The document information item has the following properties:
children: An ordered list of child information items, in document order. The list contains exactly one element information item, for the document element. The list also contains one processing instruction information item for each processing instruction outside the document element, and one comment information item for each comment outside the document element. If there is a document type declaration, the list also contains a document type declaration information item.
base URI: The base URI of the XML document.
character encoding scheme: The name of the character encoding scheme in which the XML document is expressed (see 4.5 Character Encoding in XML Documents).
version: A string representing the XML version of the XML document. This property is derived from the XML declaration optionally present at the beginning of the document entity, and has no value if there is no XML declaration.
[Definition: There is an element information item for each element appearing in the XML document. One of the element information items is the value of the document element property of the document information item, corresponding to the root of the element tree, and all other element information items are accessible by recursively following its "children" property.]
An element information item has the following properties:
prefix: The namespace prefix part of the element type. If the type is unprefixed, this property has no value. Note that applications should use the namespace name rather than the prefix to identify elements.
children: An ordered list of child information items, in document order. This list contains element, processing instruction, character, and comment information items, one for each element, processing instruction, character, and comment appearing immediately within the element. If the element is empty, this list has no members.
attributes: An unordered set of attribute information items, one for each of the attributes of the element. Namespace declarations do not appear in this set. If the element has no attributes, this set has no members.
namespace attributes: An unordered set of attribute information items, one for each of the namespace declarations attached to this element. A declaration of the form xmlns="", which undeclares the default namespace, counts as a namespace declaration. By definition, all namespace attributes (including those named xmlns, whose "prefix" property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/. If the element has no namespace declarations, this set has no members.
in-scope namespaces: An unordered set of namespace information items, one for each of the namespaces in effect for this element. This set always contains an item with the prefix xml which is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace. It does not contain an item with the prefix xmlns (used for declaring namespaces), since an application can never encounter an element or attribute with that prefix. The set includes namespace items corresponding to all of the members of the "namespace attributes" property, except for any representing a declaration of the form xmlns="", which does not declare a namespace but rather undeclares the default namespace. When resolving the prefixes of qualified names this property should be used in preference to the "namespace attributes" property, which may be inconsistent in the case of Synthetic Infosets.
base URI: The base URI of the element.
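The relationship between an element's namespace declarations and the namespaces in effect for it can be sketched as follows. This is only an illustration: representing namespace items as a dictionary from prefix to namespace name, with None standing for the default namespace, is an assumption of the sketch, as is the function name.

```python
from typing import Dict, Optional

XML_NS = "http://www.w3.org/XML/1998/namespace"

Scope = Dict[Optional[str], str]  # prefix (None = default namespace) -> namespace name


def in_scope_namespaces(parent_scope: Optional[Scope], declarations: Scope) -> Scope:
    """Compute the namespaces in effect for an element.

    parent_scope is the parent's result, or None for the document element.
    declarations holds the element's own namespace declarations, keyed by
    declared prefix, with key None for a plain xmlns attribute.
    """
    scope: Scope = dict(parent_scope) if parent_scope is not None else {}
    scope["xml"] = XML_NS  # always present, bound by definition
    for prefix, name in declarations.items():
        if prefix is None and name == "":
            # xmlns="" declares nothing; it undeclares the default namespace.
            scope.pop(None, None)
        else:
            scope[prefix] = name
    return scope
```

A child element that carries xmlns="" therefore inherits every prefixed binding from its parent but has no default namespace item.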
[Definition: There is an attribute information item for each attribute of each element in an XML document, including those which are namespace declarations. The latter however appear as members of an element's "namespace attributes" property rather than its "attributes" property.]
Attribute values appear in the information set in a "normalized" form, not necessarily identical to the form which appears in the XML document. Normalization is accomplished by applying the algorithm below, or by using some other method that produces the same result.
All line breaks must have been normalized on input to #xA as described in 5.4 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
Begin with a normalized value consisting of the empty string.
For each character in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
For a character reference, append the referenced character to the normalized value.
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
For another character, append the character to the normalized value.
Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value.
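The algorithm above can be sketched in Python as shown here. The sketch handles only decimal and hexadecimal character references, since those are the only references the algorithm mentions; the function name and the regular expression are illustrative assumptions, and no check is made that a referenced code point is a legal XML character.

```python
import re

# Character references of the form &#NNN; (decimal) or &#xHHH; (hexadecimal).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def normalize_attribute_value(raw: str) -> str:
    """Apply the attribute-value normalization steps to an unnormalized value."""
    # Line breaks are normalized to #xA before the algorithm proper starts.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    out = []  # begin with an empty normalized value
    pos = 0
    while pos < len(text):
        ref = _CHAR_REF.match(text, pos)
        if ref:
            # Character reference: append the referenced character itself.
            hex_digits, dec_digits = ref.groups()
            code = int(hex_digits, 16) if hex_digits else int(dec_digits)
            out.append(chr(code))
            pos = ref.end()
        elif text[pos] in " \r\n\t":
            # Literal white space: append a single space character.
            out.append(" ")
            pos += 1
        else:
            # Any other character is copied unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)
```

For instance, a literal tab in the input becomes a space, whereas the reference `&#9;` produces a tab character in the normalized value, which is the contrast the note above draws.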
Following are examples of attribute normalization. The attribute specifications in the left column below would be normalized to the character sequences of the right column.
|Attribute specification||Normalized Sequence|
An attribute information item has the following properties:
prefix: The namespace prefix part of the attribute name. If the name is unprefixed, this property has no value. Note that applications should use the namespace name rather than the prefix to identify attributes.
normalized value: The attribute value, normalized as described above.
owner element: The element information item which contains this information item in its "attributes" property.
A processing instruction information item has the following properties:
content: A string representing the content of the processing instruction, excluding the target and any white space immediately following it. If there is no such content, the value of this property is an empty string.
A character information item has the following properties:
character code: The ISO 10646 character code (in the range 0 to #x10FFFF, though not every value in this range is a legal XML character code) of the character.
parent: The element information item which contains this information item in its "children" property.
[Definition: There is optionally a comment information item for each XML comment in an XML document. XML processors are allowed to ignore comments and are not required to provide comment information items.]
A comment information item has the following properties:
A document type declaration information item has the following properties:
system identifier: The SystemLiteral, if one appears in the document type declaration, without any additional URI escaping applied by the processor.
public identifier: The PubidLiteral in the document type declaration, if one is provided, after being processed by replacing each string of white space with a single space character (#x20), and removing leading and trailing white space.
parent: The document information item.
A namespace information item has the following properties:
prefix: The prefix whose binding this item describes. Syntactically, this is the part of the attribute name following the xmlns: prefix. If the attribute name is simply xmlns, so that the declaration is of the default namespace, this property has no value.
namespace name: The namespace name to which the prefix is bound.
Conforming XML processors must detect and report violations of this specification's grammar and well-formedness constraints in the content of data objects (which, if such violations exist, are by definition not XML documents).
When any such violation or any other fatal error is encountered, the XML processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing - i.e. it must not continue making Infoset items available to the application.
One of the purposes of the Information Set (see 5 The Information Set) is to provide a set of definitions for use by other specifications.
Specifications conformant to this specification, when referring to the Infoset, must:
Indicate the information items and properties that are needed to implement the specification.
Specify how other information items and properties are treated (for example, they might be passed through unchanged).
Note any information required from an XML document that is not defined by the Infoset.
Note any difference in the use of terms defined by the Infoset (this should be avoided).
The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lower case letter. Literal strings are quoted.
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
#xN
where N is a hexadecimal integer, the expression matches the character in ISO/IEC 10646 whose canonical (UCS-4) code value, when interpreted as an unsigned binary number, has the value indicated. The number of leading zeros in the #xN form is insignificant; the number of leading zeros in the corresponding code value is governed by the character encoding in use and is not significant for XML.
[a-zA-Z], [#xN-#xN]
matches any Char with a value in the range(s) indicated (inclusive).
[abc], [#xN#xN#xN]
matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
[^a-z], [^#xN-#xN]
matches any Char with a value outside the range indicated.
[^abc], [^#xN#xN#xN]
matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.
"string"
matches a literal string matching that given inside the double quotes.
'string'
matches a literal string matching that given inside the single quotes.
These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
(expression)
expression is treated as a unit and may be combined as described in this list.
A?
matches A or nothing; optional A.
A B
matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B
matches A or B but not both.
A - B
matches any string that matches A but does not match B.
A+
matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A*
matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
Other notations used in the productions are:
/* ... */
comment.
[ wfc: ... ]
well-formedness constraint; this identifies by name a constraint associated with a grammar production, violation of which is a fatal error.
The terminology used to describe XML documents is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:
may
[Definition: Conforming documents and XML processors are permitted to but need not behave as described.]
must
[Definition: Conforming documents and XML processors are required to behave as described; otherwise they are in error.]
error
[Definition: A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may recover from it.]
at user option
[Definition: Conforming software may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.]
match
[Definition: (Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production.]
for compatibility
[Definition: Marks a sentence describing a feature of XML included solely to ensure that XML remains compatible with SGML.]
Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet), ideographic characters, and combining characters (among others, this class contains most diacritics). Digits and extenders are also distinguished.
The character classes defined here can be derived from the Unicode 2.0 character database as follows:
Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.
Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.
Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.
Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database -- marked by field 5 beginning with a "<") are not allowed.
The following characters are treated as name-start characters rather than name characters, because the property file classifies them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.
Characters #x20DD-#x20E0 are excluded (in accordance with Unicode 2.0, section 5.14).
Character #x00B7 is classified as an extender, because the property list so identifies it.
Character #x0387 is added as a name character, because #x00B7 is its canonical equivalent.
Characters ':' and '_' are allowed as name-start characters.
Characters '-' and '.' are allowed as name characters.
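The derivation rules above can be approximated with Python's unicodedata module, as in the sketch below. It is an approximation only: unicodedata reflects a much later Unicode version than 2.0, so characters added or reclassified since then will be classified differently from the character classes of this specification. The function and constant names are hypothetical, and each function expects a single-character string.

```python
import unicodedata

NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
OTHER_NAME_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}
# Treated as name-start characters although not in the categories above.
PROMOTED_TO_NAME_START = set(range(0x02BB, 0x02C2)) | {0x0559, 0x06E5, 0x06E6}
# Explicitly excluded characters #x20DD-#x20E0.
EXCLUDED = set(range(0x20DD, 0x20E1))


def _excluded(ch: str) -> bool:
    """Apply the exclusions that override the category-based rules."""
    code = ord(ch)
    if 0xF900 < code < 0xFFFE:  # compatibility area
        return True
    if unicodedata.decomposition(ch).startswith("<"):  # font/compatibility decomposition
        return True
    return code in EXCLUDED


def is_name_start_char(ch: str) -> bool:
    if ch in ":_":
        return True
    if _excluded(ch):
        return False
    if ord(ch) in PROMOTED_TO_NAME_START:
        return True
    return unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char(ch: str) -> bool:
    if is_name_start_char(ch) or ch in "-.":
        return True
    if ord(ch) in (0x00B7, 0x0387):  # the extender and its canonical equivalent
        return True
    if _excluded(ch):
        return False
    return unicodedata.category(ch) in OTHER_NAME_CATEGORIES
```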
XML is designed to be a subset of SGML, in that every XML document should also be a conforming SGML document. For a detailed comparison of the additional restrictions that XML places on documents beyond those of SGML, see [Clark].
The XML encoding declaration functions as an internal label on each XML document, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in an XML document in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML document is presented to the processor without, or with, any accompanying (external) information. We consider the first case first.
Because an XML document not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.
With a Byte Order Mark:
|00 00 FE FF|UCS-4, big-endian machine (1234 order)|
|FF FE 00 00|UCS-4, little-endian machine (4321 order)|
|00 00 FF FE|UCS-4, unusual octet order (2143)|
|FE FF 00 00|UCS-4, unusual octet order (3412)|
|FE FF ## ##|UTF-16, big-endian|
|FF FE ## ##|UTF-16, little-endian|
|EF BB BF|UTF-8|
Without a Byte Order Mark:
|00 00 00 3C, 3C 00 00 00, 00 00 3C 00, 00 3C 00 00|UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies.|
|00 3C 00 3F|UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)|
|3C 00 3F 00|UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)|
|3C 3F 78 6D|UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably|
|4C 6F A7 94|EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)|
|Other|UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind|
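The cases above lend themselves to a simple prefix lookup on the first octets of the data stream. The Python sketch below is one way to write it; the octet patterns are those given in the autodetection appendix of XML 1.0 (Second Edition), while the function name and the short labels it returns are assumptions made for the example. The result identifies only a family of encodings; the encoding declaration, if any, still has to be read afterwards.

```python
# (leading octets, family label). Four-octet UCS-4 patterns come first so
# that they are not mistaken for a UTF-16 Byte Order Mark followed by 00 00.
_PATTERNS = [
    (b"\x00\x00\xfe\xff", "ucs4-1234-bom"),
    (b"\xff\xfe\x00\x00", "ucs4-4321-bom"),
    (b"\x00\x00\xff\xfe", "ucs4-2143-bom"),
    (b"\xfe\xff\x00\x00", "ucs4-3412-bom"),
    (b"\xfe\xff", "utf16-be-bom"),
    (b"\xff\xfe", "utf16-le-bom"),
    (b"\xef\xbb\xbf", "utf8-bom"),
    (b"\x00\x00\x00\x3c", "32bit-1234"),
    (b"\x3c\x00\x00\x00", "32bit-4321"),
    (b"\x00\x00\x3c\x00", "32bit-2143"),
    (b"\x00\x3c\x00\x00", "32bit-3412"),
    (b"\x00\x3c\x00\x3f", "16bit-be"),
    (b"\x3c\x00\x3f\x00", "16bit-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]


def sniff_encoding_family(head: bytes) -> str:
    """Guess the encoding family from the first octets of a document."""
    for prefix, label in _PATTERNS:
        if head.startswith(prefix):
            return label
    # UTF-8 without an encoding declaration, or a mislabeled, corrupt,
    # fragmentary, or wrapped data stream.
    return "other"
```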
In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the XML document. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.
This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).
Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.
Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.
Like any self-labeling system, the XML encoding declaration will not work if any software changes the XML document's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the XML document.
The second possible case occurs when the XML document is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 2376] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance.
In the interests of interoperability, however, the following rules are recommended.
If an XML document is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.
If an XML document is delivered with a MIME type of text/xml, then the charset parameter on the MIME type determines the character encoding method; all other heuristics and sources of information are solely for error recovery.
If an XML document is delivered with a MIME type of application/xml, then the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.
These rules apply only in the absence of protocol-level documentation; in particular, when the MIME types text/xml and application/xml are defined, the recommendations of the relevant RFC will supersede these rules.
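The recommended rules can be summarized as a small decision function. The sketch below is illustrative only: the function name, the "file" delivery label, and the returned strings are assumptions, and the case of text/xml without a charset parameter is deliberately left open because the rules above do not address it.

```python
from typing import Optional


def authoritative_encoding_source(delivery: str, charset: Optional[str] = None) -> str:
    """Say which source of information determines the character encoding.

    delivery is "file" for a document read from a file, or the MIME type
    under which the document was delivered.
    """
    if delivery == "text/xml":
        if charset:
            # The charset parameter on the MIME type wins.
            return "charset parameter"
        return "not covered by these rules"
    if delivery in ("file", "application/xml"):
        # Byte Order Mark and encoding declaration, if present.
        return "byte order mark and encoding declaration"
    return "not covered by these rules"
```

In every covered case, any remaining heuristics and sources of information would be consulted solely for error recovery.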