Extensible Markup Language Frequently Asked Questions


What is XML?
XML stands for Extensible Markup Language. XML is a system for defining, validating, and sharing document formats. XML uses tags (for example, <em>emphasis</em> for emphasis) to distinguish document structures, and attributes (for example, in <A HREF="http://www.xml.com/">, HREF is the attribute name and http://www.xml.com/ is the attribute value) to encode extra document information. XML will look very familiar to those who know SGML and HTML.
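As a rough illustration of tags and attributes, the sketch below parses a small sample document with Python's standard xml.etree.ElementTree module and reads back an element name and an attribute value. The sample markup is invented for this example; it uses lower-case names and closes every element, which the HTML-style fragment quoted above does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative.
doc = '<p>See <a href="http://www.xml.com/">this <em>useful</em> site</a>.</p>'

root = ET.fromstring(doc)
link = root.find("a")

print(link.tag)                # the element (tag) name
print(link.attrib)             # attribute names mapped to their values
print(link.get("href"))        # the value of one attribute
print(link.find("em").text)    # text inside the nested <em> element
```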
Yes. XML has been carefully designed with the goal that every valid XML document should also be an SGML document. There are some areas of difference between XML and SGML, but these are minor, should not cause practical problems, and will almost certainly be reconciled with SGML in the near future.
No. An XML processor can read clean, valid HTML, and with a few small changes an HTML browser like Netscape Navigator or Microsoft Internet Explorer would be able to read XML (the designers of XML would really like the authors of Navigator and Internet Explorer to make those changes).
The biggest difference between XML and HTML is that in XML, you can define your own tags for your own purposes, and if you want, share those tags with other users.
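To make "define your own tags" concrete, here is a minimal sketch that builds a document from an invented vocabulary and then reads it back with a generic parser. The recipe, ingredient, and step names are assumptions made up for this example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Invented vocabulary for illustration: <recipe>, <ingredient>, <step>.
recipe = ET.Element("recipe", {"name": "toast"})
ET.SubElement(recipe, "ingredient", {"quantity": "2"}).text = "bread slices"
ET.SubElement(recipe, "step").text = "Toast until golden."

# Serialize to text that could be shared with another user.
xml_text = ET.tostring(recipe, encoding="unicode")
print(xml_text)

# The recipient needs no knowledge of the vocabulary to parse it.
shared = ET.fromstring(xml_text)
ingredients = [(e.get("quantity"), e.text) for e in shared.iter("ingredient")]
print(ingredients)
```

Nothing in the parser is specific to recipes; what the tags mean is an agreement between the people who share them.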
How, exactly, do XML and SGML differ?
First of all, XML leaves out many features of SGML (we'd list them, but the list would only make sense to a real SGML expert, and it appears as an appendix to the XML Specification).
There are a few areas where XML and SGML really differ:
What does "well-formed" mean?
The concept of a well-formed document is something that is really new in XML. A document that is well-formed is easy for a computer program to read, and ready for network delivery.
Specifically, in a well-formed document:
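As a rough practical illustration (not a restatement of the specification's rules), a non-validating parser such as the one in Python's standard library will accept a well-formed document and reject one that is not. The helper name is_well_formed and the sample documents are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<note><to>Reader</to><body>Hello</body></note>"
bad_nesting = "<note><to>Reader</body></to></note>"   # end-tags out of order
unclosed = "<note><to>Reader</to>"                     # <note> is never closed

for sample in (good, bad_nesting, unclosed):
    print(is_well_formed(sample), sample)
```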
What does "valid" mean?
People who know SGML are used to the concept of validation; in XML, validation means exactly the same thing it does in SGML.
For those who are not familiar with this idea, a valid document must have a document type declaration, which is a grammar or set of rules that defines what tags can appear in the document and how they must nest within each other. The document type declaration is also used to declare entities, re-usable chunks of text that can appear many times but only have to be transmitted once. A document is valid when it conforms to the rules in the document type declaration.
Validity is useful because an XML-savvy editor can use the type declaration to help (and in fact require) users to create documents that are valid; such documents are much easier to use and (especially) re-use than those which can contain any old set of tags in any old order.
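The sketch below shows what a document type declaration with element rules and an entity can look like. The memo vocabulary and the company entity are invented for this example. Note the limits of the sketch: Python's standard-library parser is non-validating, so it expands the internal entity but does not check the document against the element rules; that check would need a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an internal document type declaration.
doc = """<!DOCTYPE memo [
  <!ELEMENT memo (from, body)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ENTITY company "Example Widgets Ltd.">
]>
<memo><from>&company;</from><body>Thanks from all of us at &company;.</body></memo>"""

memo = ET.fromstring(doc)

# The entity is declared once and expanded wherever it is referenced.
sender = memo.find("from").text
body = memo.find("body").text
print(sender)
print(body)
```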
What does XML mean to SGML product vendors?
On the technology front, SGML products should be able to read valid XML documents as they sit, as long as they are in 7-bit ASCII. To read internationalized XML documents (for example, in Japanese), SGML software will need modification to handle the ISO 10646 character set, and probably also a few industry-standard (but not ISO-standard) encodings such as JIS and Big5.
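To show why 7-bit ASCII handling is not enough, the following sketch encodes a short Japanese document as UTF-8 (one encoding of the ISO 10646 character set), confirms that the bytes fall outside the 7-bit range, and parses it. The greeting element is an invented example; JIS and Big5 are not demonstrated, since support for them depends on the parser.

```python
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="UTF-8"?><greeting lang="ja">こんにちは</greeting>'
data = text.encode("utf-8")

# A 7-bit-only tool would fail here: some bytes are 128 or above.
is_seven_bit = all(byte < 128 for byte in data)
print(is_seven_bit)

# A parser that understands the encoding recovers the original characters.
greeting = ET.fromstring(data)
print(greeting.text)
print(len(greeting.text), "characters from", len(data), "bytes")
```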
To write XML, SGML products are going to have to be modified to use the special XML syntax for empty elements.
On the business front, a lot depends on whether the Web browsers learn XML. If they do, SGML product vendors should brace for a sudden, dramatic demand for products and services from all the technology innovators who are, at the moment, striving to get their own extensions into HTML, and will (correctly) see XML as the way to make this happen.
If the browsers remain tied to the fixed set of HTML tags, then XML will simply be an easy on-ramp to SGML, important probably more because the spec is short and simple than because of its technical characteristics. This will probably still generate an increase in market size, but not at the insane-seeming rate that would result from the browsers' adoption of XML.
Why are empty elements different in XML?
Because they are easier to parse. Right now, an SGML parser must know how to read and understand markup declarations to tell empty elements from those which have contents and end-tags. In the XML world, empty elements can be spotted simply because they end with the string "/>". This will make the application of style-sheets much easier, and hopefully encourage the world's HTML browsers to learn XML.
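The point about spotting empty elements without reading any markup declarations can be sketched with a simple text scan. The pattern below is deliberately simplified (it assumes no ">" appears inside attribute values), and the sample document and the name EMPTY_TAG are assumptions for this example, not a complete tokenizer.

```python
import re
import xml.etree.ElementTree as ET

doc = '<doc><para>First line<br/>second line</para><img src="fig1.gif"/><para></para></doc>'

# Simplified pattern: a tag that ends with "/>" is an empty element.
# Assumes attribute values never contain ">".
EMPTY_TAG = re.compile(r"<([A-Za-z_][\w.\-]*)(?:\s[^<>]*?)?/>")

empty_names = EMPTY_TAG.findall(doc)
print(empty_names)

# When writing XML, a serializer emits the same syntax for an element
# that has no content.
serialized = ET.tostring(ET.Element("br"), encoding="unicode")
print(serialized)
```

No declarations were consulted: the scan relies only on the "/>" ending. An element written as a start-tag immediately followed by its end-tag, such as the last para above, is not matched by this pattern.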