XML and Perl
The Perl Conference 2.0

Road Map
What is markup?
What is XML?
How does Unicode work?
Why should you use XML?
How can you use XML in Perl?

Who Am I?
Failed poet & old database engineer
1986-89: New Oxford English Dictionary project
1989-96: Open Text, co-founder through IPO
Now: Independent, co-editor XML 1.0 spec, technical editor of XML.com, Seybold Fellow
tbray@textuality.com, +1-604-708-9592

Where To Get This Talk
http://www.textuality.com/talks/px

What is Markup?

Three Kinds of Markup
Presentational Markup
Procedural Markup
Descriptive Markup

No Markup
THREEKINDSOFMARKUPPR
ESENTATIONALMARKUPPR
OCEDURALMARKUPDESCRI
PTIVEMARKUP

Procedural Markup
{\f1\fs48 Three Kinds of Markup\par }\pard \nowidctlpar \widctlpar\adjustright {\f1 \par {\pntext\pard\plain\f1 \fs28\cgrid \hich\af1\dbch\af0\loch\f1 1.\tab}}\pard \fi-360\li360\nowidctlpar\widctlpar\jclisttab\tx360 {\*\pn \pnlvlbody\ilvl0\ls1\pnrnot0\pndec\pnstart1 \pnindent360\pnhang{\pntxta .}}\ls1\adjustright {\f1\fs28 Presentational \par {\pntext\pard\plain\f1\fs28 \cgrid \hich\af1\dbch\af0\loch\f1 2.\tab}}\pard \fi-360 \li360\nowidctlpar\widctlpar\jclisttab\tx360{\*\pn \pnlvlbody\ilvl0\ls1\pnrnot0\pndec\pnstart1\pnindent360 \pnhang{\pntxta .}}\ls1\adjustright {\f1\fs28 Procedural \par {\pntext\pard\plain\f1\fs28\cgrid \hich\af1\dbch\af0 \loch\f1 3.\tab}}\pard \fi-360\li360\nowidctlpar\widctlpar \jclisttab\tx360{\*\pn \pnlvlbody\ilvl0\ls1\pnrnot0\pndec \pnstart1\pnindent360\pnhang{\pntxta .}}\ls1\adjustright {\f1\fs28 Descriptive}{ \f1\fs28 \par }}

Descriptive Markup
Three Kinds of Markup
Presentational Markup
Procedural Markup
Descriptive Markup

On Markup
In the old days, (troff/TeX), markup was procedural and overt
In modern WP and DTP, markup is procedural and hidden
WYSIWYG is a lie!

Why Descriptive Markup is Good For You
Can repurpose for multiple uses
Can do clever search & retrieval
You own it, not your vendor

What is XML?

This Isn't XML
Suite in F
This work of J.S.
Bach is written to be played, not on the modern cello, but on the much softer-toned viola da gamba.

An Italian specimen of 1532:

no top-level element
<TITLE> doesn't match </title>
what's &nbsp?
what does <i> mean?
neither <p> nor <IMG> has an end-tag
no quotes on "vdg.jpg"

This is Well-Formed XML
Suite in F
This work of J.S. Bach is written to be played, not on the modern cello, but on the much softer-toned viola da gamba.

An Italian specimen of 1532:

This is Valid XML

Suite in F

This work of J.S. Bach is written to be played, not on the modern cello, but on the much softer-toned viola da gamba.

An Italian specimen of 1532:

Valid XML, but Better

Suite in F

This work of J.S. Bach is written to be played, not on the modern cello, but on the much softer-toned viola da gamba.

An Italian specimen of 1532:

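The gap between the broken sample and the well-formed one can be checked mechanically. What follows is a minimal sketch, assuming the CPAN module XML::Parser is installed. Both strings are hypothetical reconstructions of the samples (the element names doc, title, p, i, and img are assumptions, not the slides' exact markup), and is_well_formed is a helper name invented here.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical reconstruction of the "This Isn't XML" sample: mismatched
# tag case, unclosed elements, an undeclared entity, an unquoted attribute.
my $broken = <<'END';
<TITLE>Suite in F</title>
<p>This work of <i>J.S. Bach</i> is written for the viola da gamba.
<p>An Italian specimen of 1532:&nbsp;<IMG src=vdg.jpg>
END

# Hypothetical reconstruction of the well-formed version.
my $well_formed = <<'END';
<doc>
<title>Suite in F</title>
<p>This work of <i>J.S. Bach</i> is written for the viola da gamba.</p>
<p>An Italian specimen of 1532: <img src="vdg.jpg"/></p>
</doc>
END

# Returns 1 if the parser accepts the text, 0 if it reports an error.
# XML::Parser dies on the first well-formedness error, so eval traps it.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no handlers: only errors matter here
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

printf "broken sample well-formed?      %d\n", is_well_formed($broken);
printf "repaired sample well-formed?    %d\n", is_well_formed($well_formed);
```

A check like this says nothing about validity; that would need a validating processor and a DTD.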
XML: History and Politics
In 1986, ISO approved an international standard for descriptive markup, named SGML
In 1996, HTML was running out of steam...
... it looked like SGML had some of the answers...
... but SGML has technical and political problems...
SGML - (arcane features) + (new acronym) = XML!

XML: Design Goals
XML shall be straightforwardly usable over the Internet
XML shall support a wide variety of applications
XML shall be compatible with SGML
It shall be easy to write programs which process XML documents
The number of optional features in XML is to be kept to the absolute minimum, ideally zero
XML documents should be human-legible and reasonably clear
The XML design should be prepared quickly
The design of XML shall be formal and concise
XML documents shall be easy to create
Terseness in XML markup is of minimal importance

XML From 50,000 Feet
A meta-language for descriptive markup: you invent your own tags
Small spec: < 40 pages
Built-in internationalization via Unicode
Built-in error-handling
Optimized for network operations
Tons of support from "the big boys."

Where to Find Out About XML
http://www.w3.org/xml
http://www.xml.com
http://www.sil.org/sgml/xml.html

XML Terminology 1

A commentary on the &W3C;'s XML spec is at XML.com

Check it out!

home and W3C are entities
&home; and &W3C; are entity references
There are four elements, of three element types: show, link, and p
There is one attribute, whose name is href and whose value is http://www.xml.com/xml/pub/axml
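The terminology above can be checked by letting a parser do the counting. This is a sketch only, assuming XML::Parser from CPAN. The document is a hypothetical reconstruction: the element types, entity names, and the final href value come from the slide, while the two entity replacement texts and the exact layout are assumptions.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical reconstruction of the terminology sample.
# The entity replacement texts below are assumed, not taken from the slide.
my $doc = <<'END';
<!DOCTYPE show [
<!ENTITY home "http://www.xml.com/xml/pub">
<!ENTITY W3C "World Wide Web Consortium">
]>
<show>
<p>A commentary on the &W3C;'s XML spec is at
<link href="&home;/axml">XML.com</link></p>
<p>Check it out!</p>
</show>
END

my $elements = 0;    # number of element instances seen
my %types;           # element type => number of occurrences
my @attributes;      # "name=value" for every attribute seen

my $parser = XML::Parser->new(Handlers => {
    # Start handlers receive the parser object, the element type,
    # then the attributes as a flat name/value list.
    Start => sub {
        my ($expat, $type, %attrs) = @_;
        $elements++;
        $types{$type}++;
        push @attributes, map { join '=', $_, $attrs{$_} } sort keys %attrs;
    },
});
$parser->parse($doc);

print "$elements elements of ", scalar(keys %types), " types: ",
      join(', ', sort keys %types), "\n";
print "attribute: $_\n" for @attributes;
```

The entity reference inside the attribute value is already expanded by the time the Start handler sees it, which is why the value reported is the full URL.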

XML Terminology 2
The XML spec defines XML Document and XML Processor
An XML Document is anything that's "well-formed"
An XML Processor is a piece of software that reads XML on behalf of an application

How Does Unicode Work?

The Unicode Spectrum
Unicode == ISO 10646
38,886 16-bit characters (20,902 CJK)
Every character ever available with a computer
1 million "surrogate" characters

Unicode for Programmers
16-bit formats: UTF-16 and UCS-2 (wchar_t in C, char in Java)
8-bit format: UTF-8 (char in C)
Perl currently uses UTF-8 internally, can read UTF-16, ASCII, ISO-8859-1, and UTF-8
Go to www.unicode.org and buy the book!

Unicode and XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
XML processors required to read UTF-8 and UTF-16!
Unfortunately, there's not much out there...
... but ASCII, EBCDIC, JIS, KOI8-R, Big5, etc. are all full of Unicode characters ...
... so they are legal XML too ...
... but you have to tell the processor!

Why Should You Use XML?

Recently I Set Up a Linux Box
Section "Pointer"
    Protocol "MouseMan"
    Device "/dev/mouse"
# When using XQUEUE, comment out the above two lines,
# and uncomment the following line.
#    Protocol "Xqueue"
# ... parts left out
# ChordMiddle is an option for some 3-button Logitech mice
    ChordMiddle
EndSection
From XFree86Config

Recently I Set Up a Linux Box
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=100
other=/dev/hda1
    label=Win95
    table=/dev/hda
image=/boot/vmlinuz
    label=linux
    root=/dev/hda3
    read-only
lilo.conf

Recently I Set Up a Linux Box
[homes]
    comment = Home Directories
    browseable = yes
    read only = no
    create mode = 0750
[printers]
    comment = All Printers
    browseable = no
    printable = yes
    public = no
From smb.conf

Recently I Set Up a Linux Box
From inetd.conf

Recently I Set Up a Linux Box
From fvwm2rc95

About Syntax
Syntax is boring
Inventing syntax is a waste of time
Writing code to parse your own syntax is a waste of time
Learning a new syntax for each configuration file is a waste of time
So stop wasting time and leave the syntax to XML!

Why is the Web So Slow?
Browser task: render HTML
Server tasks: full-text search, database apps, session management, network interface, template processing, etc. etc. etc.
To make the Web faster, run more code in the browser!

Today's Browser Architecture

The Document Object Model
Replaces "Dynamic HTML"
Language-independent
Browser-independent
OS-independent

The Next-Gen Browser

Metadata Model + XML Syntax = RDF
Commercial MIS systems are largely metadata-driven
The Web has no metadata - hence brute-force web robots
Coming soon from the W3C, Resource Description Framework (RDF): simple data model, XML syntax, let 100 vocabularies bloom

Who Loves XML?
XML's attractiveness to a product vendor is in inverse proportion to their market share
XML's attractiveness to people who just want pretty pages is not very high
XML's attractiveness to people who invest a lot in creating information is overwhelming

How Can You Use XML in Perl?

The expat XML Processor
expat written in C by James Clark
Blindingly fast
Stream-based callback API

The XML::Parser Package
No reliance in principle on expat
Can send raw expat events to your own modules
Comes with a range of prepackaged handlers, invoked by the Style argument
XML::Parser is best seen as a testbed for constructing APIs

Some Test Data
The Old Testament

Source of original ASCII files unknown.

SGML markup by Jon Bosak, 1992-1994.

XML version by Jon Bosak, 1996-1998.

This work may be freely distributed internationally.

The First Book of Moses, Called GENESIS.

In the beginning God created the heaven and the earth.

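Before turning to the prepackaged styles, here is what "sending raw expat events to your own modules" can look like on this test data. It is a minimal sketch, assuming XML::Parser from CPAN and its Handlers option. The markup is a reconstruction from the element names visible in the Debug-style trace (tstmt, ttitle, fm, p, book, bktlong, chapter, v), and the exact whitespace is an assumption.

```perl
use strict;
use warnings;
use XML::Parser;

# Reconstructed test data; whitespace and layout are assumed.
my $sample = <<'END';
<tstmt>
<ttitle>The Old Testament</ttitle>
<fm>
<p>Source of original ASCII files unknown.</p>
<p>SGML markup by Jon Bosak, 1992-1994.</p>
<p>XML version by Jon Bosak, 1996-1998.</p>
<p>This work may be freely distributed internationally.</p>
</fm>
<book title="Genesis">
<bktlong>The First Book of Moses, Called GENESIS.</bktlong>
<chapter n="1">
<v n="1"><p>In the beginning God created the heaven and the earth.</p></v>
</chapter>
</book>
</tstmt>
END

my @paths;    # one entry per start-tag: ancestors/element [attributes]

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $type, %attrs) = @_;
        # Inside a Start handler, context() lists the ancestors only,
        # so the element itself is appended by hand.
        my $path  = join '/', $expat->context, $type;
        my $attrs = join ' ', map { join '=', $_, $attrs{$_} } sort keys %attrs;
        push @paths, length($attrs) ? "$path [$attrs]" : $path;
    },
});
$parser->parse($sample);

print "$_\n" for @paths;
```

The handler only records start-tags; End and Char handlers can be registered the same way when closing tags or text matter.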
The XML::Parser "Debug" Style 'debug'; parsefile $p 'Beginning.xml'; ========================================= tstmt \\ () tstmt ttitle \\ () tstmt ttitle || The Old Testament tstmt ttitle // tstmt || #10; tstmt fm \\ () tstmt fm || #10; tstmt fm p \\ () tstmt fm p || Source of original ASCII files unknown. tstmt fm p // tstmt fm || #10; tstmt fm p \\ () tstmt fm p || SGML markup by Jon Bosak, 1992-1994. tstmt fm p // tstmt fm || #10; tstmt fm p \\ () tstmt fm p || XML version by Jon Bosak, 1996-1998. tstmt fm p // tstmt fm || #10; tstmt fm p \\ () tstmt fm p || This work may be freely distributed internationally. tstmt fm p // tstmt fm || #10; tstmt fm // tstmt || #10; tstmt book \\ (title Genesis) tstmt book || #10; tstmt book bktlong \\ () tstmt book bktlong || The First Book of Moses, Called GENESIS. tstmt book bktlong // tstmt book || #10; tstmt book chapter \\ (n 1) tstmt book chapter || #10; tstmt book chapter v \\ (n 1) tstmt book chapter v p \\ () tstmt book chapter v p || In the beginning God created the heaven and the earth. tstmt book chapter v p || #10; tstmt book chapter v p // tstmt book chapter v // tstmt book chapter || #10; tstmt book chapter // tstmt book || #10; tstmt book // tstmt || #10; tstmt //]]> The XML::Parser "subs" Style For each start-tag <foo>, calls sub foo For each end-tag </foo>, calls sub foo_ For each chunk of text, calls sub characters Maintains element stack in @Context The XML::Parser "subs" Style 'subs'; parsefile $p 'Beginning.xml'; sub p { print "@{$p->{Context}}\n"; } sub characters { print "$_[1]"; } ========================================= Text: The Old TestamentText: Text: Para: tstmt fm Text: Source of original ASCII files unknown. Text: Para: tstmt fm Text: SGML markup by Jon Bosak, 1992-1994. Text: Para: tstmt fm Text: XML version by Jon Bosak, 1996-1998. Text: Para: tstmt fm Text: This work may be freely distributed internationally. Text: Text: Text: Text: The First Book of Moses, Called GENESIS. 
Text: Text: Para: tstmt book chapter v Text: In the beginning God created the heaven and the earth. Text: Text: Text: Text:]]> The XML::Parser "tree" Style 'tree'; parsefile $p 'Beginning.xml'; require 'dumpvar.pl'; dumpvar('main', 'p'); ========================================= $p = XML::Parser=HASH(0x80fb144) 'Parser' => 135366672 'Pkg' => 'main' 'RawEvents' => 'XML::Parser::Tree' 'Style' => 'tree' 'Tree' => ARRAY(0x80cec3c) 0 'tstmt' 1 ARRAY(0x8128690) 0 HASH(0x8128678) empty hash 1 'ttitle' 2 ARRAY(0x81286f0) 0 HASH(0x81286d8) empty hash 1 0 2 'The Old Testament' 3 0 4 ' ' 5 'fm' 6 ARRAY(0x81287c8) 0 HASH(0x81287b0) empty hash 1 0 2 ' ' 3 'p' 4 ARRAY(0x811a11c) 0 HASH(0x811a104) empty hash 1 0 2 'Source of original ASCII files unknown.' 5 0 6 ' ' 7 'p' 8 ARRAY(0x811a1f4) 0 HASH(0x811a1dc) empty hash 1 0 2 'SGML markup by Jon Bosak, 1992-1994.' 9 0 10 ' ' 11 'p' 12 ARRAY(0x811a2cc) 0 HASH(0x811a2b4) empty hash 1 0 2 'XML version by Jon Bosak, 1996-1998.' 13 0 14 ' ' 15 'p' 16 ARRAY(0x811a3a4) 0 HASH(0x811a38c) empty hash 1 0 2 'This work may be freely distributed internationally.' 17 0 18 ' ' 7 0 8 ' ' 9 'book' 10 ARRAY(0x812332c) 0 HASH(0x811a4c4) 'title' => 'Genesis' 1 0 2 ' ' 3 'bktlong' 4 ARRAY(0x81233bc) 0 HASH(0x81233a4) empty hash 1 0 2 'The First Book of Moses, Called GENESIS.' 5 0 6 ' ' 7 'chapter' 8 ARRAY(0x81234b8) 0 HASH(0x8123494) 'n' => 1 1 0 2 ' ' 3 'v' 4 ARRAY(0x812356c) 0 HASH(0x8123548) 'n' => 1 1 'p' 2 ARRAY(0x81235cc) 0 HASH(0x81235b4) empty hash 1 0 2 'In the beginning God created the heaven and the earth.' 3 0 4 ' ' 5 0 6 ' ' 9 0 10 ' ' 11 0 12 ' ' 'Userdata' => 135369360]]> The XML::Parser "stream" Style Calls sub StartTag for each start-tag, sub EndTag for each end-tag Calls sub Text for text $_ is the text that was recognized For StartTag and EndTag, $_[0] is the element type For StartTag, %_ is a hash of attribute values by name Default action for all callbacks is print; Some Larger Test Data The New Testament

Source of original ASCII files unknown.

SGML markup by Jon Bosak, 1992-1994.

XML version by Jon Bosak, 1996-1998.

This work may be freely distributed internationally.

The Gospel According to SAINT MATTHEW. Matthew Chapter 1 1

The book of the generation of Jesus Christ, the son of David, the son of Abraham.

... and so on: 1,170,010 bytes in total
Note that verse-numbers are elements, not attributes
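The Stream-style callbacks can be tried on a small piece of this data. The sketch below makes several assumptions, all hedged. It assumes XML::Parser from CPAN. The markup is a hypothetical reconstruction (only the v, vn, and p element types are attested in the talk's code), and the second verse is Matthew 1:2 from the King James text, added so the count is not trivial. In the XML::Parser releases I know, the style name is spelled 'Stream' and the element type arrives as the second argument to StartTag and EndTag, after the parser object, which differs from the 1998 calling convention shown in this talk.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical reconstruction; verse numbers are elements (vn), not attributes.
my $sample = <<'END';
<tstmt>
<book>
<bktlong>The Gospel According to SAINT MATTHEW.</bktlong>
<chapter>
<v><vn>1</vn>
<p>The book of the generation of Jesus Christ, the son of David, the son of Abraham.</p>
</v>
<v><vn>2</vn>
<p>Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judas and his brethren;</p>
</v>
</chapter>
</book>
</tstmt>
END

my ($verses, $hits, $in_verse, $text) = (0, 0, 0, '');

# Stream style looks these subs up by name in the calling package.
# Defining all three suppresses the default action, which is to print.
sub StartTag {
    my ($expat, $type) = @_;
    if ($type eq 'v') { $verses++; $in_verse = 1; $text = ''; }
}

sub Text {
    $text .= $_ if $in_verse;    # $_ holds the recognized text
}

sub EndTag {
    my ($expat, $type) = @_;
    if ($type eq 'v') {
        $hits++ if $text =~ /Jesus/;
        $in_verse = 0;
    }
}

my $parser = XML::Parser->new(Style => 'Stream');
$parser->parse($sample);

print "$hits of $verses verses mention Jesus\n";
```

Tracking the verse with a flag, rather than inspecting the parser's context stack, sidesteps the question of which element is "current" when a Text callback fires.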
The "Stream" Style: Assignment 1 Turn the verse numbers into attributes 'stream'; parsefile $p $ARGV[0]; sub StartTag { if ($_[0] eq "vn") { print ""; } else { print; } }]]> The "Stream" Style: Assignment 1 But There's More Than One Way To Do It! 'stream'; parsefile $p $ARGV[0]; sub StartTag { if (/"; } else { print; } }]]> The "Stream" Style: Assignment 2 It seems that there's one paragraph per verse (stupid); is this true? 'stream'; parsefile $p $ARGV[0]; sub StartTag { if ($_[0] eq "v") { $pCount = 0; } elsif ($_[0] eq "p" && grep(/^v$/, @{$p->{Context}})) { $pCount++ }; } sub EndTag { if ($pCount > $maxPCount) { $maxPCount = $pCount; } } sub Text { } sub EndDocument { print "Max P's per V: $maxPCount\n"; }]]> The "Stream" Style: Assignment 3 OK, lose the superfluous paragraph tags 'stream'; parsefile $p $ARGV[0]; sub StartTag { unless ($_[0] eq "p" && grep(/^v$/, @{$p->{Context}})) { print }; } sub EndTag { unless ($_[0] eq "p" && grep(/^v$/, @{$p->{Context}})) { print }; }]]> The "Stream" Style: Assignment 3 There's More Than One Way To Do It! 'stream'; parsefile $p $ARGV[0]; sub StartTag { unless (/

/ && $p->{Context}[-1] eq "v") { print }; } sub EndTag { unless (/<.p>/ && $p->{Context}[-1] eq "v") { print }; }]]> The "Stream" Style: Assignment 4 Also, lose the useless trailing newline 'stream'; parsefile $p $ARGV[0]; sub Text { chop if ($p->{Context}[-1] eq "v"); print; }]]> The "Stream" Style: Assignment 5 Find Jesus! 'stream'; parsefile $p $ARGV[0]; sub Text { $J++ if (/Jesus/ && $p->{Context}[-1] eq "v") } sub StartTag { $V++ if ($_[0] eq "v"); } sub EndTag { } # default is to print, remember sub EndDocument { print "$J of $V verses mention Jesus\n"; }]]> The "Stream" Style: Final Assignment Build a glossary of terms in the XML specification It is assumed that an XML processor is doing its work on behalf of another module, called the application.]]> The "Stream" Style: Final Assignment Build a glossary of terms in the XML specification 'stream'; parsefile $p $ARGV[0]; sub StartDocument { print "List of Terms\n"; print "

List of Terms

\n
\n"; } sub StartTag { if (/$_{term}
"; } elsif (//) { print ""; } # some termdefs include grammar productions, sigh elsif (/{Context}})) { print "
"; } } sub EndTag { if (/<.termdef/) { print "
\n"; } elsif (/<.term>/) { print ""; } elsif (/<.prod/ && grep(/^termdef$/, @{$p->{Context}})) { print ""; } elsif (/<.lhs/ && grep(/^termdef$/, @{$p->{Context}})) { print " ::= "; } } sub Text { if (grep(/^termdef$/, @{$p->{Context}}) && !grep(/^head$/, @{$p->{Context}})) { s/&/&/g; s/\n"; }]]> And A Parting Question print if (/perl is terrific/i); What matches this? Match? is terrific]]> Match? terrific ]]> Match? is terrific ]]> Match? not so terrific, said ]]> Match? is terrific, commented Bray ]]> Match? ... perl is &adjective;!]]>
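
One way to explore the parting question is to run the same pattern twice: once against the raw markup, once against the character data a parser hands back. This is a sketch under assumptions. It assumes XML::Parser from CPAN. The three samples are hypothetical stand-ins for the stripped slides (the element names p, lang, and em and the entity value are invented), and text_of is a helper name made up here.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical samples; only the entity name "adjective" appears in the talk.
my %samples = (
    plain  => '<p>Perl is terrific</p>',
    inline => '<p><lang>Perl</lang> is <em>terrific</em></p>',
    entity => q{<!DOCTYPE p [<!ENTITY adjective "terrific">]><p>perl is &adjective;!</p>},
);

# Concatenates all character data in a document, dropping the markup.
# With no Default handler set, internal entities arrive already expanded.
sub text_of {
    my ($xml) = @_;
    my $text = '';
    my $parser = XML::Parser->new(Handlers => {
        Char => sub { $text .= $_[1] },
    });
    $parser->parse($xml);
    return $text;
}

for my $name (sort keys %samples) {
    my $raw    = $samples{$name}          =~ /perl is terrific/i ? 'yes' : 'no';
    my $parsed = text_of($samples{$name}) =~ /perl is terrific/i ? 'yes' : 'no';
    print "$name: raw markup $raw, character data $parsed\n";
}
```

Whether matching across element boundaries is the right answer depends on the application; the sketch only shows that the two views of the document can disagree.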