Features and Rules of ISO 8879
A Summary for Use
In Discussions of the W3C SGML Working Group
And Editorial Review Board
Document W3C-SGML-ERB DD-1996-0002
C. M. Sperberg-McQueen
12 September 1996
This document summarizes some of the salient features of ISO 8879
which may or may not require revision in XML. The
`features' included here are not limited to those
called features by the standard; they include other
characteristics and rules as well.
In virtually all cases, the inclusion of an item in these lists may
be interpreted as a suggestion that the W3C SGML Working Group
and Editorial Review Board will need to
decide explicitly whether XML should (a) work exactly the same way
as SGML, (b) work in only one of the ways allowed by SGML, (c) provide
the same functionality in some other way, or (d) suppress the feature
/ functionality in question. That is, after each statement, the
question "Should this be true in XML as well?" should be
understood, even where it is not stated.
In some cases, an entry is present only as a reminder of some SGML
construct which the compiler or others have found hard to follow or
obscurely expressed in ISO 8879, and which should be described more
clearly in XML documentation.
In addition to mentioning some SGML features and rules, the lists
below indicate whether those features or rules are omitted or modified
by various proposals for subsets or simplifications of SGML. The
proposals thus summarized are:
- Bas (Basic SGML), as defined in ISO 8879, clause
15.1.1. A basic SGML document is one which uses
only:
- the reference concrete syntax
- the reference capacity set
- only the features shorttag and omittag
- LA (Lexical Analyzer for HTML), as defined in
Dan Connolly, "A Lexical Analyzer for HTML and
Basic SGML: W3C Working Draft 15-Jun-96" at
http://www.w3.org/pub/WWW/TR/WD-sgml-lex. As the name of the document suggests, LA is a slight
simplification of Basic SGML; in cases of doubt, therefore, entries
for LA have been made to agree with those for Bas. In some cases,
however, LA prohibits constructs which are legal in Bas. In others,
the LA document divides responsibility for SGML conformance between
the LA processor and the client software; in these cases, the entry
for LA describes what the LA software proper does, with a note
indicating what work is left to the client.
- MGML (Minimal Generalized Markup Language), as
defined in Tim Bray, "MGML - an SGML Application
for Describing Document Markup Languages", unpublished draft
paper for SGML '96 (http://www.textuality.com/mgml/index.html) and attachments, especially the
`reference DSD' (document structure definition) as
of 18 August 1996. Entries have been corrected by Tim Bray, and
in some cases reflect plans for later revision of the proposal.
- Min (Minimal SGML), as defined in ISO 8879, clause
15.1.2. A minimal SGML document is defined as one which
uses:
- the core concrete syntax (i.e. the reference concrete syntax
without any short reference delimiters)
- the reference capacity set
- no features
- NSGML (Normalised SGML), as defined in
Henry Thompson, David McKelvie, and Steve Finch,
"The Normalised SGML Library (NSL)",
NSL Version 1.4.4, Documentation version Fri Aug 2 14:13:40 BST 1996
(http://www.ltg.ed.ac.uk/corpora/nsldoc/nsldoc.html),
in particular the section "Definition of NSGML".
The NSGML entries have been reviewed by Henry Thompson, and in one
case (conref) reflect plans for the next release of the
NSGML Library.
- PSGML (Poor Folks SGML), as defined in C. M.
Sperberg-McQueen, "PSGML: Poor-Folks SGML: A Subset
of SGML for Use in Distributed Applications", Document UIC CC
DB92-10, 8 October 1992 (http://www.uic.edu/~cmsmcq/uic/db92-10.tei or
http://www.uic.edu/~cmsmcq/uic/db92-10.html).
- SL (SGML Lite), as defined in
Bert Bos, "`SGML-Lite' - an
easy to parse subset of SGML", 4 July 1995
(http://grid.let.rug.nl/~bert/Stylesheets/SGML-Lite.html).
The SGML Lite document prescribes the core concrete syntax, but gives
a full syntax declaration for the reference concrete syntax, which
includes the standard short references; the entries below reflect the
core syntax, not the reference syntax. The omittag feature
is not discussed; the entries for SL reflect the apparent intention
of setting OMITTAG NO.
In cases of doubt, the entries for SL agree with those of Bas.
- SO (SGML Online), as defined in Eliot Kimber,
"SGML Revision, Proposal for Minimal SGML Feature
Set" (unfinished, unpublished draft, distributed privately via
email on 1996-06-03, now accessible at
http://www.textuality.com/sgml-erb/kimber/index.html). SO is formulated largely as a proposal
to add new optional features to SGML, to enable
applications to avoid supporting certain constructs which are not now
optional. The entries for SO in the lists below describe the
constructs which would be kept, modified, or suppressed by
applications making use of the proposed new options. In cases where
the SO document is silent, its entries have been made to agree with
those of Min, on the theory that the point of SO is to make possible
the construction of light-weight SGML parsers, and that SO
implementors are likely to make use of all the simplifications already
offered by Min.
- TEI (TEI Interchange Format), as defined in Association for Computers and the Humanities (ACH), Association for
Computational Linguistics (ACL), and Association for Literary and
Linguistic Computing (ALLC), Guidelines for Electronic Text
Encoding and Interchange (TEI P3), ed. C. M. Sperberg-McQueen
and Lou Burnard (Chicago, Oxford: Text Encoding Initiative,
1994), especially section 28.1.3 "TEI
Interchange Format" and chapter 39 "Formal
Grammar for the TEI-Interchange-Format Subset of SGML". The
TEI Guidelines are available on the World-Wide Web at
http://etext.virginia.edu/TEI.html and at
http://dynaweb.ebt.com/usrbooks/teip3/1.toc.
The sections relevant in this context are collected in a single
file (for easier printing) in
http://www-tei.uic.edu/orgs/tei/ml/tif.html.
In preparing this summary, I have also consulted Association for
Computers and the Humanities (ACH), Association for Computational
Linguistics (ACL), and Association for Literary and Linguistic
Computing (ALLC), Guidelines for the Encoding and Interchange
of Machine-Readable Texts (TEI P1), ed. C. M. Sperberg-McQueen
and Lou Burnard (Chicago, Oxford, 1990), section 2.2.
Except as noted, entries have not been reviewed by the authors of the
proposals, and so may be in error. Some of these schemes also include
proposals that do not fit readily into the format of this list, and
are not described here. For full and authoritative information, the
documentation for each proposal should be consulted.
Other schemes may be added if we learn of them and they seem
to be of interest.
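As a rough illustration of how two of the schemes above differ, the definitions of Bas and Min can be written down as small records. This is only a sketch: the `Profile` class, its field names, and the `allows` helper are hypothetical names invented here, and only the values (concrete syntax, capacity set, permitted features) are taken from the descriptions of clauses 15.1.1 and 15.1.2 given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical record for one SGML subset (not defined by ISO 8879)."""
    concrete_syntax: str   # "reference" or "core"
    capacity_set: str      # "reference" for both schemes shown here
    features: frozenset    # optional SGML features the scheme permits


# Values as described in the list of proposals above.
BASIC = Profile("reference", "reference", frozenset({"shorttag", "omittag"}))
MINIMAL = Profile("core", "reference", frozenset())


def allows(profile: Profile, feature: str) -> bool:
    """Return True if the scheme permits the named optional feature."""
    return feature.lower() in profile.features
```

Under these assumptions, `allows(BASIC, "SHORTTAG")` is true and `allows(MINIMAL, "SHORTTAG")` is false, which mirrors the statement that minimal SGML uses no features.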
In summarizing the schemes, the following abbreviations
are used:
- sic (`thus'). The schemes indicated
agree with ISO 8879 on this point.
- om. (`omitted by'). The schemes indicated
omit or suppress the construct, concept, or syntax described.
- mod. (`modified by'). The schemes indicated
modify or restrict the construct, concept, or syntax in some way (which
may be described briefly -- for full information, however, consult
the relevant documentation).
- dna (`does not apply to').
The rule does not apply to the schemes indicated,
because it applies to a situation which cannot arise
in those schemes, normally because they have suppressed
one of the constructs involved. In some cases, there is a
fine line between om. and
dna; the distinction should not be
made to bear too much weight.
- sil. (`passed over in silence by').
The schemes indicated
don't address the issue one way or another, and the compiler
is not confident that this silence indicates agreement with 8879.
If no note is present, it normally means all the schemes
collated to date agree with 8879. In isolated cases, it may
mean the compiler just skipped over the item due to inattention,
boredom, or a conviction that a note was not necessary.
Notes are not normally given for quantity definitions, because
few of the schemes collated propose changes to the default
or minimum quantities; most either leave the quantities as
they are, or propose ignoring or dropping them entirely.
For example, consider the following entry:
- 7.5 defines three forms of minimized end-tags:
- empty end-tag
- unclosed end-tag
- null end-tag
These are allowed if and only if SHORTTAG YES is specified.
[Sic Bas;
om. MGML, Min;
mod. PSGML.
PSGML allows empty end-tags, despite having
SHORTTAG NO, but not null or unclosed
end-tags.]
This means (1) that minimized end-tags are defined in clause 7.5, and
(2) that in basic SGML, all three forms of minimized
end-tags are legal and may be used as described in 8879, while
none may be used in MGML or minimal SGML, and only one in PSGML.
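For readers who want to tabulate the bracketed notes mechanically, the reading of the example entry can be sketched in a few lines of Python. This is a minimal sketch, not a tool used in preparing this summary: the function name `parse_status_note` and the pattern `STATUS` are hypothetical, and the code assumes that status clauses are separated by semicolons and that any free-text commentary follows the first full stop after the scheme names.

```python
import re

# The five status abbreviations defined in this summary, each followed by
# a comma-separated list of scheme names (read up to the next full stop).
STATUS = re.compile(r"\s*(sic|om\.|mod\.|dna|sil\.)\s+([^.]*)", re.IGNORECASE)


def parse_status_note(note: str) -> dict:
    """Map each scheme named in a bracketed note to its status abbreviation."""
    body = note.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    result = {}
    for clause in body.split(";"):
        match = STATUS.match(clause)
        if match is None:
            continue  # not a status clause (e.g. commentary containing ';')
        status = match.group(1).lower()
        for scheme in match.group(2).split(","):
            name = scheme.strip()
            if name:
                result[name] = status
    return result


example = """[Sic Bas;
om. MGML, Min;
mod. PSGML.
PSGML allows empty end-tags, despite having
SHORTTAG NO, but not null or unclosed
end-tags.]"""

print(parse_status_note(example))
```

The commentary after "mod. PSGML." is deliberately discarded; it has to be read by a person, since it explains how the scheme departs from ISO 8879.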
When more than one item is derived from a clause, a letter is
attached to the clause number to keep the items distinct; it is
not part of the formal clause numbering of ISO 8879.