MGML - an SGML Application for Describing Document Markup Languages

by Tim Bray, Textuality.


The Standard Generalized Markup Language is the most fully developed specification of the use of descriptive markup languages for electronic documents. The idea of descriptive markup is simple and powerful, and in fact has proved to be a basic requirement for many advanced information processing applications.

Unfortunately, the adoption of SGML has proved surprisingly difficult, expensive and slow, given that the underlying ideas are simple and self-evidently good. Some of the perceived reasons have included:

  1. The SGML standard itself is large, complex, and difficult to understand.
  2. The standard specifies several optional and advanced markup features, some of which remain unimplemented.
  3. Some of the features of SGML have proven counter-productive in practical use.
  4. Practical use of SGML requires learning several other languages, including the language used to write DTDs, various stylesheeting and formatting languages, and the SGML/Open Entity Catalogue language.
  5. The design of SGML takes little account of the contemporary theory of formal languages and finite automata. One practical result is that SGML parsers are unable to make use of some advanced tools and techniques made possible by that theory. Consequently, they are large and complex pieces of computer software; as such they (a) suffer from reliability problems, (b) have in practice proven difficult to integrate into applications, and (c) change slowly in response to advances in software and document processing technology.

Nonetheless, there remains a consensus that SGML's basic design partition into entities, elements, and attributes is correct and useful. One result is a common tendency, in strategic projects involving SGML, to avoid using many advanced features and operate within the bounds of a highly restricted subset. This approach has generally met with success. However, this restricted subset has been re-invented by each successive group that has attacked the problem.

It is our opinion that SGML exhibits an extreme case of the "80-20 syndrome"; that is to say, 80% of the benefit is gained by applying only 20% of the machinery. It is the goal of this project to formalize the definition of this useful subset, which we call Minimal Generalized Markup Language, MGML.

The design goals are that MGML shall:

  1. be an SGML application, and process a proper subset of SGML documents
  2. provide full support for the basic mechanisms (entities, elements, and attributes) which have made SGML successful
  3. unify the syntax of the meta-language and the generated languages (the DTD and the instances)
  4. be defined by a simple, compact, formal specification that allows the easy implementation of MGML processors by taking advantage of standard formal-language technology
  5. exclude those portions of the SGML design which impair ease of understanding, use, and portability

The Specification of MGML

The syntactic structure of MGML, which enables markup to be distinguished from data, is hardwired; it has been implemented straightforwardly and completely using lex-style regular expressions.
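To illustrate how a fixed set of regular expressions can separate markup from data, here is a minimal sketch in Python rather than lex. It is not the reference implementation: the token patterns, the name syntax, and the function `tokenize` are assumptions modeled on SGML-like delimiters (`<name attr="value">`, `</name>`, `&name;`), not rules taken from the MGML specification.

```python
import re

# Hypothetical lexical rules (assumed, SGML-like); the real MGML rules are
# defined by its specification, not by this sketch.
TOKEN_RE = re.compile(r'''
    (?P<end_tag>    </ (?P<end_name>[A-Za-z][\w.-]*) \s* > )
  | (?P<start_tag>  <  (?P<start_name>[A-Za-z][\w.-]*)
                       (?P<attrs> (?: \s+ [A-Za-z][\w.-]* \s* = \s* "[^"]*" )* )
                       \s* > )
  | (?P<entity_ref> &  (?P<entity_name>[A-Za-z][\w.-]*) ; )
  | (?P<data>       [^<&]+ )
''', re.VERBOSE)

ATTR_RE = re.compile(r'([A-Za-z][\w.-]*)\s*=\s*"([^"]*)"')


def tokenize(text):
    """Split text into a list of markup and data tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # A stray '<' or '&' that starts no recognized token.
            raise ValueError(f"unrecognized markup at offset {pos}")
        if m.group("end_tag") is not None:
            tokens.append(("end", m.group("end_name")))
        elif m.group("start_tag") is not None:
            attrs = dict(ATTR_RE.findall(m.group("attrs")))
            tokens.append(("start", m.group("start_name"), attrs))
        elif m.group("entity_ref") is not None:
            tokens.append(("entity", m.group("entity_name")))
        else:
            tokens.append(("data", m.group("data")))
        pos = m.end()
    return tokens
```

Because the delimiters are fixed, every token can be recognized without reference to any declarations, which is what makes a purely lexical treatment of this kind possible.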

MGML is based on the Document Structure Definition (DSD). A DSD is a set of structure definitions that apply to all documents of a given class. The required content and structure of a DSD are defined by the MGML Reference DSD. The behavior of a conforming MGML processor is defined in the list appearing below in this document, and in commentary text attached to the structure definitions in the MGML Reference DSD. These behavior specifications and the MGML Reference DSD together constitute the sole and complete definition of MGML.

The MGML Reference DSD defines a total of 21 elements and 18 attributes. In printed form, it occupies only 5 pages. An electronic form may be obtained here. To help in understanding, a real SGML DTD for the MGML Reference DSD may be obtained here.

A reference parser for a slightly earlier version, with fairly complete entity processing, was implemented as two lex modules, one C module, and one yacc module, and comprised about 1000 lines of code.

A conforming MGML processor shall:

  1. Optionally, for any DSD, write a corresponding SGML declaration and SGML Document Type Definition which define a class of documents including all those accepted as valid by an MGML processor with respect to the DSD. Thus, every MGML document is an SGML document.
  2. Scan the text of each element's content to distinguish markup and data.
  3. Replace entity references by their entities.
  4. Validate the element and attribute structure against the model described by the DSD.
  5. Supply all defaulted attributes.
  6. Provide to an external processing system (a) complete information about the entity, element, and attribute structure and (b) access to its content.
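Two of the requirements above, replacing entity references and supplying defaulted attributes, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the behavior mandated by the MGML Reference DSD: the `&name;` reference syntax, the nesting of references inside entity text, and the dictionary layouts for `entities` and `attr_defaults` are all hypothetical, as are the function names.

```python
import re

# Assumed reference syntax: &name; (hypothetical, SGML-like).
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def expand_entities(text, entities, _active=()):
    """Replace each &name; reference in text with its entity text.

    Assumes entity text may itself contain references, so expansion is
    recursive; a reference cycle is reported as an error.
    """
    def substitute(match):
        name = match.group(1)
        if name not in entities:
            raise KeyError(f"undeclared entity: {name}")
        if name in _active:
            raise ValueError(f"recursive entity reference: {name}")
        return expand_entities(entities[name], entities, _active + (name,))

    return ENTITY_REF.sub(substitute, text)


def supply_defaults(element_name, attrs, attr_defaults):
    """Return a copy of attrs with declared defaults filled in.

    attr_defaults maps element name -> {attribute name: default value};
    this layout is an assumption made for the sketch.
    """
    merged = dict(attr_defaults.get(element_name, {}))
    merged.update(attrs)  # explicitly specified values win over defaults
    return merged
```

For example, given the hypothetical declarations `{"co": "&name; Inc.", "name": "Textuality"}`, the call `expand_entities("A &co; product", ...)` expands `&co;` and then the nested `&name;`; `supply_defaults` leaves any attribute the document specified untouched and adds only the missing ones.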