An Introduction to XML Processing with Lark and Larval

Lark 5-January-1998

Tim Bray (Textuality) <tbray@textuality.com>

Abstract

Lark is a non-validating XML processor implemented in the Java language; it attempts to achieve good trade-offs among compactness, completeness, and performance. Larval is a validating XML processor built on the same code base as Lark. This report gives an overview of the motivations for, facilities offered by, and usage of, the Lark processor.

This document and the Lark software are copyright 1997 by Tim Bray; all rights reserved. However, Lark is available on the Internet for general public use.

This note applies to the final beta of version 1.0 of Lark, and release 0.8 of Larval, in use in January 1998. In this note, the name Lark refers to both Lark and Larval, unless otherwise noted.


An Introduction to XML Processing with Lark and Larval

Version 1.0 final beta

Table of Contents

1. Why Lark?
    1.1 Motivations
    1.2 Conclusions
2. Lark Feature Set Overview
    2.1 Compactness
    2.2 Performance
    2.3 Completeness
    2.4 API
    2.5 Error Handling
    2.6 Text Segment Management
    2.7 Entity Handling
    2.8 Concurrency
3. Lark: Constructors, Methods, and the Handler
    3.1 The Handler Class
    3.2 The Lark Class
    3.3 Controlling Lark's Behavior
    3.4 Initiating Parsing
4. The XmlInputStream Class
5. Elements, Attributes, and Entities
6. Text Handling and Segmentation
7. Internals
    7.1 The Parsing Engine
    7.2 String and Character Processing
    7.3 Validation
8. To-Do List
9. How to Get Lark

1. Why Lark?

1.1 Motivations

Lark's creation was driven by the following motivations:

Personal Gratification
Writing language processors is fun, particularly when you have the chance to fix the language.
Desire to Learn Java
It's about time, and Java seems like the appropriate language for the job.
Test Compliance with Design Goals
In particular, XML Design Principle #4, which states that "It shall be easy to write programs which process XML documents."
Expore the API Design Space
There is a chance, while XML is young, to make some real progress in the design of processor API's. The design of Lark makes very few assumptions about the user interface; thus Lark should be useful as an experimental testbed in this area.
$$$
Perhaps Lark will turn out to be useful. I have not the slightest desire to start another software company (been there, done that, got the T-shirts), but it would be nice to figure out a way to get paid for the time I've put in writing it.

1.2 Conclusions

Yes, writing Lark was fun. In particular, none of the innocent-looking things in XML turned out, in practice, to be too horribly difficult. And Java is indeed a Happy Hunting Ground for programmers.

On the design-goal-compliance front, the good news is that if you wanted a program to process XML that you knew was well-formed, you could probably bash it out in perl (don't know about Java) in a day or so. On the other hand, if you want to build a general-purpose tool that does all of XML and provides helpful error messages and a useful API, the nominal week is not nearly enough. The development of Lark has consumed about a month at this point in time, stretched over a year's elapsed time.

I do think that Lark will be useful for exploring API designs. Of course, none of this will happen unless there are people out there who want to use an XML processor for something or other. Among other things, Lark currently has no user interface at all; while I don't mind editing the Driver.java file and recompiling to run tests, presumably a UI would be a good thing to have.

As for the financial aspects, I'm kind of gloomy. I think most XML processors are going to be purpose-built for the needs of particular applications, and will thus hide inside them. Which is good; XML's simplicity makes this approach cost-effective. Failing that, processors will be full-dress validating parsers with incremental parsing for authoring support. So I'm not sure that there's all that much need for a standalone processor; but I'd love to be wrong.

Just in case, for the moment I'm going to be giving away all the .class files, and some of the Java source code, but not the source code for the three classes with the hard bits. In any case, they're sure to be buggy at this stage and I wouldn't want to be letting them out of my hands with a bit more polishing. If you can see a way to get a little revenue out of this project, give me a call.

2. Lark Feature Set Overview

2.1 Compactness

(The figures below refer to Lark 0.97; they will be updated for 1.0 when it comes out of final beta status.)

Since an XML processor is often going to run on the client and presumably need to be delivered over the network, it must be compact. At the moment, the total byte count is around 45K, which is not too bad.

There is some more scope for compression, when some useful facilities appear in the Java class libraries that ought to be there; e.g. usable symbol table and better Unicode support.

2.2 Performance

At the moment, Lark, running under the Win95 J++ "Jview" application viewer on an underconfigured P100 notebook, runs at about 200K/second. I am fairly happy with this performance, and doubt whether a full-featured processor implemented in Java can really be made to run much faster.

2.3 Completeness

Lark is a processor only; it does not attempt to validate. It does read the DTD, with parameter entity processing; it processes attribute list declarations (to find default values) and entity declarations. Lark is relatively full-featured; it implements (I think) everything in the XML spec and reports violations of well-formedness.

Lark's error-handling is draconian. After encountering the first well-formedness error, no further internal data structures are built or returned to the application. However, Lark does continue processing the document looking for more syntax errors (and in fact performing some fairly aggressive heuristics on the tag stack in order to figure out what's going on), and calling the doSyntaxError application callback (see below) to report further such errors.

Larval is a full validating XML processor; it reports violations of validity constraints, but does not apply draconian error handling to them.

2.4 API

Lark presents as a set of Java classes. Those named Element, Attribute, and Entity are obvious in their function. One lesson of this activity is that it may be possible for such classes to be shared independent of the parser architecture; it would be very handy if all XML-processing Java apps used the same Element class, at least as a basis for subclassing.

The Text and Segment classes do Lark's character data management; details are below.

From an application's point of view, the Lark and Handler classes are central. Handler has methods such as doPI, doSTag, doEntityReference, doEtag; the application passes a Handler instance to Lark's readXML method, and Lark calls the routines as it recognizes significant objects in the XML document. The base class provides do-nothing methods; the intent is that an application would subclass Handler to provide the appropriate processing semantics.

Along with presenting this event stream to the application, Lark can optionally build a parse tree, and if so doing, can optionally save copies of the XML document's character data, all in parallel with providing the event stream. Lark provides methods to toggle these behaviors. These methods may be used in the Handler callbacks while Lark is running, to build the parse tree or save the text for only a subset of the document.

In building the parse tree, Lark is guaranteed to update only elements which are still open; i.e. for which the end-tag has not been seen. All other sections of the tree, and the entire tree once Lark has hit end-of-file, may be manipulated freely by the application.

An instance of Lark may be initialized with an optional list of element GI's which are to be considered as those of empty elements, whether or not the XML "/>" syntax is used. A typical set might begin: "HR", "BR", "IMG", ....

Another initializer allows a set of entities to be predefined.

Lark also provides the application access to the "entity tree"; there is a method that toggles whether Lark attempts to retrieve and parse external entities.

2.5 Error Handling

Lark makes a serious effort to be robust, by providing useful error messages and continuing to do so after the first error. The error handling is good enough that I now use Lark as my primary tool to debug broken XML files.

2.6 Text Segment Management

Probably due to my background in the indexing and search business, Lark pays more attention than is perhaps strictly necessary to the location of objects within the XML file. Lark informs the application of the containing entity and begin/end offsets of each element. A chunk of character data in an element, which may look contiguous to an application, may in fact map to several different byte ranges in different entities, due to the effect of entity references. Also, the use of encodings such as UTF8 may cause the number of bytes of underlying document to differ from the number of bytes that makes up the characters in a chunk of text. Lark represents Text objects as a vector of Segment objects, each of which gives information about the source offset and length, and the number of characters in the segment. If text-saving has not been turned on, the segments contain no character data, but still contain the offset information.

2.7 Entity Handling

Lark does full XML entity handling. Java provides facilities which make this trivially easy. The application can turn the inclusion of external text entities on and off. Lark makes no use of PUBLIC identifiers, aside from passing them to the callback in the Handler upon recognizing the declaration.

2.8 Concurrency

Lark is thread-safe. Multiple Larks can run in parallel threads, with other threads doing useful processing of under-construction document trees.

3. Lark: Constructors, Methods, and the Handler

3.1 The Handler Class

The Handler class is just a package of methods that act as callback routines for Lark. No arguments are used in constructing a handler.

Handler has a method named element, which Lark uses to construct a new Element object every time this is required (rather than calling new Element() directly). The idea is that this can be subclassed to return, rather than an Element object, an instance of a user-built subclass of Element.

The first argument of all the other Handler methods is an Entity object, which contains information about the location of the object just encountered and the current state of the entity stack. All of the Handler methods return booleans. If any handler method returns true, Lark's readXML method returns control to whatever called it. In the base Handler class, all the methods return false by default, except for the doSyntaxError method, which generates a message (designed to look like output from James Clark's SP) via System.err.println(). The methods are summarized below; the names and types of the arguments should make their meaning self-explanatory.

  // Attlist declaration.  The first string will be the GI,
  //  the next 3N the parts of the attlist; each triple contains
  //  the attr name, its type, and either null, meaning no default value,
  //  or the default value as a Text object
  public boolean doAttlist(Entity e, Object[] parts)
  {
    return false;
  }

  // Doctype declaration
  public boolean doDoctype(Entity e, String rootType, 
    String publicID, String systemID)
  {
    return false;
  }

  // Entity reference detected; e is the containing entity,
  //  not the one that was detected
  public boolean doEntityReference(Entity e, String name)
  {
    return false;
  }

  // End-tag detected; the element should be complete if the tree is being
  //  built; otherwise, the only useful information is the end-offset and
  //  the type.
  public boolean doETag(Entity e, Element element)
  {
    return false;
  }

  // Internal (text) entity declaration
  public boolean doInternalEntity(Entity e, String name, char[] value)
  {
    return false;
  }

  public boolean doPI(Entity e, String target, String remainder)
  {
    return false;
  }

  // Start tag detected; the element will in general be
  //  incomplete, unless element.empty() is true
  public boolean doSTag(Entity e, Element element)
  {
    return false;
  }

  // Error condition detected; the entity e will have the
  //  current locational information filled in
  public boolean doSyntaxError(Entity e, String message, int c)
  {
    System.out.print("Syntax error at line " + 
      e.lineCount() + ":" + e.lineOffset() + ":E: " + message);
    if (c != ' ')
      System.out.print(" - character ("+ (char) c+")");
    System.out.println();

    return false;
  }

  // NDATA entity declaration
  public boolean doSystemBinaryEntity(Entity e, String name,
                                      String extID, String notation)
  {
    return false;
  }

  // External text entity declaration; external ID is a URL
  public boolean doSystemTextEntity(Entity e, String name, String extID)
  {
    return false;
  }

  // Detected a text segment
  public boolean doText(Entity ent, Element el, char[] text, int length)
  {
    return false;
  }

  // detected a validity error
  public boolean doValidityError(Entity ent, String message)
  {
    System.out.println("Lark:" + ent.description() + ":" + 
      ent.lineCount() + ":" + ent.lineOffset() + ":V: " + message);
    return false;
  }

  // A warning from Lark
  public boolean doWarning(Entity e, String message)
  {
    System.out.println("Warning at line " + 
      e.lineCount() + ":" + e.lineOffset() + ":W: " + message);
    return false;
  }

  // used when a new element is needed.  If you want to 
  //  subclass element use your own, here's the place to hook in
  public Element element()
  {
    return new Element();
  }

Notes:

3.2 The Lark Class

The Lark class actually does the parsing. It has the following constructors:

  public Lark() throws Exception
  public Lark(String[] emptyElementTypes) throws Exception
  public Lark(String[] entityNames, String[] entityValues) throws Exception
  public Lark(String[] emptyElementTypes,
              String[] entityNames,
              String[] entityValues) throws Exception

emptyElementTypes is a list of GI's which Lark considers, when encountered, to be empty, whether or not they use the trailing "/>" XML empty-element syntax.

entityNames and entityValues are arrays which respectively give the names and values of entities which Lark is to consider to have been predefined.

3.3 Controlling Lark's Behavior

Lark has three methods for toggling behaviors:

  public void buildTree(boolean build)
  public void saveText(boolean save)
  public void processExternalEntities(boolean process)
  public void setDocumentEntityName(String name)

buildTree instructs Lark whether or not to build a parse tree of elements as they are encountered. The default behavior is to build no tree. The tree is built downward from the element current at the time of invocation. When that element ends, tree-building stops.

saveText instructs Lark whether or not to save, in the parse tree, all the character data as it is encountered. This behavior cannot be turned on unless the tree-building has been.

processExternalEntities instructs Lark whether or not to read and process the contents of external text entities. The default behavior is not to read and process them. External entities include the external subset of the DTD and external parameter entities.

setDocumentEntityName can be used to instruct Lark how the base document should be referred to in error messages.

Each of these methods can be used from within the Handler callbacks; in particular, Lark calls the Handler doSTag and doEntityReference methods before deciding whether to build the tree, save the text, or read the entity, so a Handler can control this behavior upon notification of the element or entity.

In addition, Larval has a method for turning validation on and off:

  public void validate(boolean validate)

Turning validation on, as a side-effect, invokes processExternalEntities(true).

3.4 Initiating Parsing

Lark's readXML methods are used to initiate parsing. There are four:

  // no input stream provided - assume current state is valid
  public Element readXML(Handler handler)
    throws java.io.IOException, Exception, InterruptedException

  // XmlInputStream provided - assume it's new
  public Element readXML(Handler handler, XmlInputStream input)
     throws java.io.IOException, InterruptedException, Exception

  // XmlInputStream provided - assume it's new; also we have a URL
  public Element readXML(Handler handler, XmlInputStream input, String baseURL)
     throws java.io.IOException, InterruptedException, Exception

  // byte-buffer provided - make an inputstream and put it on stack
  public Element readXML(Handler handler, byte[] buffer)
     throws java.io.IOException, InterruptedException, Exception

In each case, the caller must provide a Handler (or more likely an instance of a class subclassed from Handler). Lark can read from an XmlInputStream (subclassed from BufferedInputStream - see below) or from a byte array.

As noted above, Lark returns from readXML whenever one of the Handler methods returns true; in which case it maintains the parsing state and readXML can be re-invoked. It also returns when it encounters end of input on the document entity.

The Element object returned by readXML is not meaningful if tree-building has not been turned on. If tree-building has been turned on, it is the topmost node in the tree built by Lark - this may not be the node for the root element if tree-building was not turned on until later in the tree.

4. The XmlInputStream Class

XmlInputStream is a subclass of BufferedInputStream. It adds one additional method:

  public int getXmlChar() throws IOException

This is intended to retrieve enough bytes from the BufferedInputStream to constitute one Unicode character, and return it. Obviously, the logic is highly dependent on the input encoding.

XmlInputStream has all the same constructors as BufferedInputStream, plus one extra:

  XmlInputStream(char[] charBuf)

which allows getXmlChar() to work out of a character array. Lark uses this facility to support internal text entities.

It is remarkable that Java's built-in BufferedInputStream has no getCharacter() method; with any luck, this oversight will soon be corrected, allowing XmlInputStream to be retired. Java does have facilities for reading characters from a DataInputStream, but their design makes them unusable by XML (or by any other application I can think of). Or perhaps I just didn't understand the documentation.

5. Elements, Attributes, and Entities

These types are self-explanatory, contain essentially no nontrivial logic, and are designed for general use in XML applications. Attribute contains a name and a value, and trivial methods for getting and setting them.

public class Attribute
{
  String mName;
  String mValue;
  Text   mText;

  public Attribute(String name, String value)

  // constructor based on Text
  public Attribute(String name, Text text)

  public String name()
  public void setName(String name)
  
  public String value()
  public void setValue(String value)
  public void setValue(Text text)
}

Element is more complex. Readers are urged to look at the source code, which is heavily commented. A few fields and methods of special interest are excerpted below:

public class Element
{
  Attribute[] mAttrs;   // Attributes
  String mType;

  // Children are a Vector because they might be Elements
  //  and they might be Text objects
  Vector mChildren = new Vector(); 

  // Parent element.  null does not mean this is the root,
  //  it just means this is where Lark started constructing
  //  the tree.
  Element mParent;

  public String type();

  public Attribute[] allAttributes()
  public void setAllAttributes(Attribute[] attributes)

  // get & set attribute by name
  public Attribute attribute(String name)
  public void setAttribute(String name, String value)

  public Vector children()
  public Element parent()
}

Note that Lark only creates one String object for each unique element type. This means that the String objects returned by the type() method may be simply compared for equality to establish whether two elements are the same type.

The Entity class' fields are shown below: it has only trivial get and set methods for these fields.

  // locational info
  int mOffset;
  int mLineCount;
  int mLineOffset; // how far into this line

  boolean mTextIsReal; // not true for internal entities
  boolean mPE = false; // true if this is a parameter entity
  URL mBaseURL = null; // for resolving external URLs

  XmlInputStream mInput; // read characters for this entity
  String mDescription;   // metadata
  Entity mParent;        // entity tree; null means doc entity

Note that Entities differ from Elements in one important respect: Lark always constructs the Entity tree.

6. Text Handling and Segmentation

The Text object exists mostly to serve as an abstraction layer in front of objects which present as Strings, but may be stored in a segmented fashion. This supports lazy evaluation, the String being (expensively) constructed only when it is requested. Here is the complete source code:

class Text
{
  // This is a Text object, which can appear in the Children()
  //  of an element.  The text itself is actually stored in a 
  //  Vector of Segments; this object really only exists to
  //  provide a nice singular object to sit in the parse
  //  tree, and to avoid recalculating the String value
  //  of a multi-segment object.
  Vector mSegments = new Vector();
  String mString;

  public void addSegment(Segment segment)
  {
    mSegments.addElement(segment);
    mString = null; // invalidate string
  }
  public Vector segments() { return mSegments; }

  // construct string only on request
  public String string()
  {
    int i;
    Segment seg;

    if (mString == null)
    {
      mString = new String();
      for (i = 0; i < mSegments.size(); i++)
      {
        // there must be a better way
        seg = (Segment) mSegments.elementAt(i);
        mString = mString.concat(seg.string());
      }
    }
    return mString;
  }
}

The Segment class is much more complex, and tries to provide a means of storing information about character data, and optionally the character data itself, efficiently while keeping track of source offset information. Its implementation, particularly the constructors, is fairly complex, but the methods that an application might use are straightforward:

  public Entity entity()
  public int sourceOffset() 
  public int sourceLength() 
  public int dataOffset() 
  public int dataLength() 
  public String string() 

7. Internals

7.1 The Parsing Engine

Lark's implementation is based on the use of a custom-built Deterministic Finite Automaton. The DFA is attributed (certain transitions trigger actions, e.g. "recognize a tag") and uses a push-down stack to avoid redundancy, for example in recognizing quoted strings or the SYSTEM keyword.

Automata offer obvious performance advantages, but can easily become large in size. In particular, a naive implementation of a DFA to process 16-bit Unicode characters would need 65,535 transitions out of each state. Not only would such a large DFA would be wasteful of memory, but since the compiled form must be transmitted as part of the program, it would be wasteful of bandwidth.

In Lark's case the DFA is made compact by:

  1. The simplicity of XML's grammar; a full XML DFA needs only a couple of hundred states.
  2. Using a 7-bit DFA, and dispatching to special processing logic in the case where the input character is numerically greater than 127. This is practical since all of XML's special syntax characters are ASCII, and most characters greater than 127 cause DFA state changes only in a few DFA states, and then only when they are Name or NameStart characters. Thus the amount of special processing for non-ASCII characters is very modest. Nonetheless, Lark will run slower on predominantly non-ASCII texts.
  3. Efficient packing of the DFA for transmission. Any XML DFA will contain a great number of transitions on space characters, name characters, name-start characters, and on any non-delimiter; such transitions can be compacted very nicely.

In the case of Lark, the DFA is written in a simple special-purpose language; it is processed by a perl program to produce the compiled and compressed version that is included in the file Lark.class. (Arguably, it should be encoded in XML, and processed by Lark, rather than a perl application, and this would be an aesthetically pleasing project for sometime in the future).

As of this final beta, the DFA had 227 states, 8,732 transitions, and 159 unique DFA triggers. The automaton is compiled into a bunch of Java static final variable declarations and combined with 2,040 lines of "real" Java code into the file Lark.java; the corresponding Lark.class file is 37K in size. At run-time, the automaton expands to occupy approximately 55K bytes of memory.

7.2 String and Character Processing

The only other noteworthy characteristic of the Lark implementation is the special care that must be taken in handling strings and characters. XML files are rich in identifiers: element types, attribute names, and entity names. While Java comes equipped with a Hashtable object that might be considered a candidate for a symbol table, its String implementation is sufficiently inefficient that naive recourse to a String constructor and Hashtable lookup for every identifier would result in very poor performance.

Thus Lark (and I suspect every other serious XML parser) makes heavy use of counted arrays of characters, and has a purpose-built symbol table. It is interesting that this table, which relies on a straightforward binary search, is apparently comparable in speed to Java's Hashtable.get() implementation.

7.3 Validation

Larval makes use modules in another package, named textuality.validator; the most important classes are named DTD, Attlist, and Validator.

The implementation is extremely naive; there is ad-hoc code for validating attribute declarations and values, as well as for elements declared to have ANY, EMPTY, or Mixed content.

For element content, the declaration is parsed into a tree of Grammar objects that are more or less isomorphic to the declaration in the DTD. At run-time, a method called probe pokes around in this tree, leaving a trail in a java.util.Vector called crumbs for obvious reasons. This was done purely for reasons of laziness, which produced its usual reward; next time I'll get out the Dragon book and do it right. Larval is not a finished product, but it's already pretty useful.

8. To-Do List

9. How to Get Lark

All of the classes, and all the Java source code, are here: lark.tar.gz.