Tim Bray (Textuality) <tbray@textuality.com>
Lark is a non-validating XML processor implemented in the Java language; it attempts to achieve good trade-offs among compactness, completeness, and performance. Larval is a validating XML processor built on the same code base as Lark. This report gives an overview of the motivations for, facilities offered by, and usage of, the Lark processor.
This document and the Lark software are copyright © 1997 by Tim Bray; all rights reserved. However, Lark is available on the Internet for general public use.
This note applies to the final beta of version 1.0 of Lark, and release 0.8 of Larval, in use in January 1998. In this note, the name Lark refers to both Lark and Larval, unless otherwise noted.
Lark's creation was driven by the following motivations:
Yes, writing Lark was fun. In particular, none of the innocent-looking things in XML turned out, in practice, to be too horribly difficult. And Java is indeed a Happy Hunting Ground for programmers.
On the design-goal-compliance front, the good news is that if you wanted a program to process XML that you knew was well-formed, you could probably bash it out in perl (don't know about Java) in a day or so. On the other hand, if you want to build a general-purpose tool that does all of XML and provides helpful error messages and a useful API, the nominal week is not nearly enough. The development of Lark has consumed about a month at this point in time, stretched over a year's elapsed time.
I do think that Lark will be useful for exploring API designs. Of course, none of this will happen unless there are people out there who want to use an XML processor for something or other. Among other things, Lark currently has no user interface at all; while I don't mind editing the Driver.java file and recompiling to run tests, presumably a UI would be a good thing to have.
As for the financial aspects, I'm kind of gloomy. I think most XML processors are going to be purpose-built for the needs of particular applications, and will thus hide inside them. Which is good; XML's simplicity makes this approach cost-effective. Failing that, processors will be full-dress validating parsers with incremental parsing for authoring support. So I'm not sure that there's all that much need for a standalone processor; but I'd love to be wrong.
Just in case, for the moment I'm going to be giving away all the .class files, and some of the Java source code, but not the source code for the three classes with the hard bits. In any case, they're sure to be buggy at this stage and I wouldn't want to be letting them out of my hands with a bit more polishing. If you can see a way to get a little revenue out of this project, give me a call.
(The figures below refer to Lark 0.97; they will be updated for 1.0 when it comes out of final beta status.)
Since an XML processor is often going to run on the client and presumably need to be delivered over the network, it must be compact. At the moment, the total byte count is around 45K, which is not too bad.
There is some more scope for compression, when some useful facilities appear in the Java class libraries that ought to be there; e.g. usable symbol table and better Unicode support.
At the moment, Lark, running under the Win95 J++ "Jview" application viewer on an underconfigured P100 notebook, runs at about 200K/second. I am fairly happy with this performance, and doubt whether a full-featured processor implemented in Java can really be made to run much faster.
Lark is a processor only; it does not attempt to validate. It does read the DTD, with parameter entity processing; it processes attribute list declarations (to find default values) and entity declarations. Lark is relatively full-featured; it implements (I think) everything in the XML spec and reports violations of well-formedness.
Lark's error-handling is draconian. After encountering the first well-formedness error, no further internal data structures are built or returned to the application. However, Lark does continue processing the document looking for more syntax errors (and in fact performing some fairly aggressive heuristics on the tag stack in order to figure out what's going on), and calling the doSyntaxError application callback (see below) to report further such errors.
Larval is a full validating XML processor; it reports violations of validity constraints, but does not apply draconian error handling to them.
Lark presents as a set of Java classes. Those named
Element
, Attribute
, and Entity
are obvious in their function.
One lesson of this activity is that it may be possible for such
classes to be shared independent of the parser architecture; it would be
very handy if all XML-processing Java apps used the same Element class,
at least as a basis for subclassing.
The Text
and Segment
classes do Lark's
character data management; details are below.
From an application's point of view, the Lark
and
Handler
classes
are central.
Handler has methods such as doPI
, doSTag
,
doEntityReference
, doEtag
; the application
passes a Handler instance to Lark's readXML
method, and Lark
calls the routines as it recognizes significant objects in the XML document.
The base class provides do-nothing methods; the intent is that
an application would subclass Handler to provide the appropriate processing
semantics.
Along with presenting this event stream to the application, Lark can optionally build a parse tree, and if so doing, can optionally save copies of the XML document's character data, all in parallel with providing the event stream. Lark provides methods to toggle these behaviors. These methods may be used in the Handler callbacks while Lark is running, to build the parse tree or save the text for only a subset of the document.
In building the parse tree, Lark is guaranteed to update only elements which are still open; i.e. for which the end-tag has not been seen. All other sections of the tree, and the entire tree once Lark has hit end-of-file, may be manipulated freely by the application.
An instance of Lark may be initialized with an optional list of element
GI's which are to be considered as those of empty elements, whether or not the
XML "/>" syntax is used. A typical set might begin:
"HR", "BR", "IMG", ....
Another initializer allows a set of entities to be predefined.
Lark also provides the application access to the "entity tree"; there is a method that toggles whether Lark attempts to retrieve and parse external entities.
Lark makes a serious effort to be robust, by providing useful error messages and continuing to do so after the first error. The error handling is good enough that I now use Lark as my primary tool to debug broken XML files.
Probably due to my background in the indexing and search business, Lark pays
more attention than is perhaps strictly necessary to the location of objects
within the XML file.
Lark informs the application of the containing entity and begin/end offsets
of each element.
A chunk of character data in an element, which may look contiguous to an
application, may in fact map to several different byte ranges in different
entities, due to the effect of entity references.
Also, the use of encodings such as UTF8 may cause the number of bytes of
underlying document to differ from the number of bytes that makes up the
characters in a chunk of text.
Lark represents Text objects as a vector of Segment
objects,
each of which gives information about the source offset and length, and the
number of characters in the segment.
If text-saving has not been turned on, the segments contain no character
data, but still contain the offset information.
Lark does full XML entity handling. Java provides facilities which make this trivially easy. The application can turn the inclusion of external text entities on and off. Lark makes no use of PUBLIC identifiers, aside from passing them to the callback in the Handler upon recognizing the declaration.
Lark is thread-safe. Multiple Larks can run in parallel threads, with other threads doing useful processing of under-construction document trees.
The Handler class is just a package of methods that act as callback routines for Lark. No arguments are used in constructing a handler.
Handler has a method named element
, which Lark
uses to construct a new Element object every time this is required (rather
than calling new Element()
directly).
The idea is that this can be subclassed to return, rather than an Element
object, an instance of a user-built subclass of Element.
The first argument of all the other Handler methods is an
Entity
object, which contains information about the location of
the object just
encountered and the current state of the entity stack.
All of the Handler methods return booleans. If any handler method returns
true, Lark's readXML
method returns control to whatever
called it.
In the base Handler class, all the methods return false by default, except for
the doSyntaxError method, which generates a message (designed to
look like output from James Clark's SP) via System.err.println()
.
The methods are summarized below; the names and types of the arguments should
make their meaning self-explanatory.
// Attlist declaration. The first string will be the GI, |
Notes:
Title
, there will be only one
instance of the string Title
in Lark; thus element types, entity
names and attribute names can be compared for equality simply with
==
, rather the Java String.equals()
method.doText
, the length argument is provided because the
characters may not occupy the whole text buffer, which is re-usable; thus if
the application wishes to save those characters, it must copy them out and
store them.doText
calls, one for each
text segment thus created.
See the discussion of the Text and Segment classes for a fuller explanation of
the motivation for and usage of text segments.doInternalEntity
, doSystemTextEntity
,
and doSystemBinaryEntity
methods signal the declaration, not
the invocation, of these entities.doSTag
and doETag
is re-used; the same
one passed in each time.
In this scenario, the only function of the Element object is as a container
for data fields.
When tree building is turned on, each element gets its own object, and they are
linked together in the tree.The Lark class actually does the parsing. It has the following constructors:
public Lark() throws Exception |
emptyElementTypes
is a list of GI's which Lark considers,
when encountered, to be empty, whether or not they use the trailing "/>" XML
empty-element syntax.
entityNames
and entityValues
are arrays which
respectively give the names and values of entities which Lark is to consider
to have been predefined.
Lark has three methods for toggling behaviors:
public void buildTree(boolean build) |
buildTree
instructs Lark whether or not to build a parse
tree of elements as they are encountered.
The default behavior is to build no tree.
The tree is built downward from the element current at the time
of invocation.
When that element ends, tree-building stops.
saveText
instructs Lark whether or not to save, in the parse
tree, all the character data as it is encountered.
This behavior cannot be turned on unless the tree-building has been.
processExternalEntities
instructs Lark whether or not to
read and process
the contents of external text entities.
The default behavior is not to read and process them.
External entities include the external subset of the DTD and
external parameter entities.
setDocumentEntityName
can be used to instruct Lark
how the base document should be referred to in error messages.
Each of these methods can be used from within the Handler callbacks;
in particular, Lark calls the Handler doSTag
and
doEntityReference
methods before deciding whether to build the
tree, save the text, or read the entity, so a Handler can control this
behavior upon notification of the element or entity.
In addition, Larval has a method for turning validation on and off:
public void validate(boolean validate) |
Turning validation on, as a side-effect, invokes
processExternalEntities(true)
.
Lark's readXML
methods are used to initiate parsing.
There are four:
// no input stream provided - assume current state is valid |
In each case, the caller must provide a Handler (or more likely an instance of a class subclassed from Handler). Lark can read from an XmlInputStream (subclassed from BufferedInputStream - see below) or from a byte array.
As noted above, Lark returns from readXML whenever one of the Handler methods returns true; in which case it maintains the parsing state and readXML can be re-invoked. It also returns when it encounters end of input on the document entity.
The Element object returned by readXML is not meaningful if tree-building has not been turned on. If tree-building has been turned on, it is the topmost node in the tree built by Lark - this may not be the node for the root element if tree-building was not turned on until later in the tree.
XmlInputStream is a subclass of BufferedInputStream. It adds one additional method:
public int getXmlChar() throws IOException |
This is intended to retrieve enough bytes from the BufferedInputStream to constitute one Unicode character, and return it. Obviously, the logic is highly dependent on the input encoding.
XmlInputStream has all the same constructors as BufferedInputStream, plus one extra:
XmlInputStream(char[] charBuf) |
which allows getXmlChar() to work out of a character array. Lark uses this facility to support internal text entities.
It is remarkable that Java's built-in BufferedInputStream has no
getCharacter() method; with any luck, this oversight will soon be
corrected, allowing XmlInputStream to be retired.
Java does have facilities for reading characters from a
DataInputStream
, but their design makes them unusable by XML
(or by any other application I can think of).
Or perhaps I just didn't understand the documentation.
These types are self-explanatory, contain essentially no nontrivial logic, and are designed for general use in XML applications. Attribute contains a name and a value, and trivial methods for getting and setting them.
public class Attribute |
Element is more complex. Readers are urged to look at the source code, which is heavily commented. A few fields and methods of special interest are excerpted below:
public class Element |
Note that Lark only creates one String object for each unique element type. This means that the String objects returned by the type() method may be simply compared for equality to establish whether two elements are the same type.
The Entity class' fields are shown below: it has only trivial get and set methods for these fields.
// locational info |
Note that Entities differ from Elements in one important respect: Lark always constructs the Entity tree.
The Text object exists mostly to serve as an abstraction layer in front of objects which present as Strings, but may be stored in a segmented fashion. This supports lazy evaluation, the String being (expensively) constructed only when it is requested. Here is the complete source code:
class Text |
The Segment class is much more complex, and tries to provide a means of storing information about character data, and optionally the character data itself, efficiently while keeping track of source offset information. Its implementation, particularly the constructors, is fairly complex, but the methods that an application might use are straightforward:
public Entity entity() |
Lark's implementation is based on the use of a custom-built
Deterministic Finite Automaton. The DFA is attributed (certain transitions
trigger actions, e.g. "recognize a tag") and uses a push-down stack to avoid
redundancy, for example in recognizing quoted strings or the
SYSTEM
keyword.
Automata offer obvious performance advantages, but can easily become large in size. In particular, a naive implementation of a DFA to process 16-bit Unicode characters would need 65,535 transitions out of each state. Not only would such a large DFA would be wasteful of memory, but since the compiled form must be transmitted as part of the program, it would be wasteful of bandwidth.
In Lark's case the DFA is made compact by:
In the case of Lark, the DFA is written in a simple special-purpose
language; it is processed by a perl program to produce the compiled and
compressed version that is included in the file Lark.class
.
(Arguably, it should be encoded in XML, and processed by Lark, rather than a
perl application, and this would be an aesthetically pleasing project for
sometime in the future).
As of this final beta, the DFA had 227 states, 8,732 transitions, and 159
unique DFA triggers.
The automaton is compiled into a bunch of Java static final
variable declarations and
combined with 2,040 lines of "real" Java code into the file
Lark.java
; the corresponding Lark.class
file is 37K
in size.
At run-time, the automaton expands to occupy approximately 55K bytes
of memory.
The only other noteworthy characteristic of the Lark implementation is the special care that must be taken in handling strings and characters. XML files are rich in identifiers: element types, attribute names, and entity names. While Java comes equipped with a Hashtable object that might be considered a candidate for a symbol table, its String implementation is sufficiently inefficient that naive recourse to a String constructor and Hashtable lookup for every identifier would result in very poor performance.
Thus Lark (and I suspect every other serious XML parser) makes heavy use
of counted arrays of characters, and has a purpose-built symbol table.
It is interesting that this table, which relies on a straightforward
binary search, is apparently comparable in speed to Java's
Hashtable.get()
implementation.
Larval makes use modules in another package, named
textuality.validator
; the most important classes are
named DTD
, Attlist
, and Validator
.
The implementation is extremely naive; there is ad-hoc code for
validating attribute declarations and values, as well as for
elements declared to have ANY
, EMPTY
, or
Mixed content.
For element content, the declaration is parsed into a tree of
Grammar
objects that are more or less isomorphic
to the declaration in the DTD.
At run-time, a method called probe
pokes around in this tree,
leaving a trail in a java.util.Vector
called crumbs
for obvious reasons.
This was done purely for reasons of laziness, which produced its usual reward;
next time I'll get out the Dragon book and do it right.
Larval is not a finished product, but it's already pretty useful.
All of the classes, and all the Java source code, are here: lark.tar.gz.