So you've heard a lot about it lately, but what exactly is "pull parsing"? Simply put, pull parsing is a programming interface for accessing XML document information. The dominant APIs for writing XML applications have been the Document Object Model (DOM) and the Simple API for XML (SAX). However, pull parsing provides an alternative for application developers when the DOM tree-based model and the SAX event-push model are not appropriate. Using the pull-parsing paradigm, the application actively controls the parsing of the XML document by stepping through it one event at a time.
Currently, a number of XML pull-parsing APIs exist as alternatives to the DOM and SAX models. For example, the XMLPull API was created by merging two parallel efforts. In addition, JSR 173 was formed to develop a pull-parsing API for the Java programming language. NekoPull was invented for two reasons: to fix the inadequacies the author sees in other pull-parsing designs, and to add native pull-parsing capability to Xerces2.
The XML pull-parsing API of NekoPull is very simple. The application instantiates a pull parser instance and requests document information events. The parser, in turn, parses a little bit of the XML document and returns an object representing a piece of the document's information. The application then queries the type of the event and casts the object to a specific event type in order to access the information relating to that event. The application continues this process of requesting document events until there are no more events or until the application decides to stop parsing.
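To make the request-and-inspect loop concrete before looking at NekoPull's own classes, here is a minimal sketch of the same idea using StAX (javax.xml.stream), the pull API bundled with the JDK that came out of JSR 173. This is not NekoPull's interface: StAX reports an integer event code and exposes the data on the reader, rather than returning event objects. The class name PullLoopSketch, the trace method, and the one-line-per-event output format are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: uses the JDK's StAX API, not NekoPull.
final class PullLoopSketch {

    /** Pulls events one at a time and writes one line per element or text event. */
    static String trace(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask StAX to merge adjacent text so each text run arrives as one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                // The application, not the parser, decides when the next event is read.
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            out.append('(').append(reader.getLocalName()).append('\n');
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            out.append(')').append(reader.getLocalName()).append('\n');
                            break;
                        case XMLStreamConstants.CHARACTERS:
                            out.append('"').append(reader.getText()).append('\n');
                            break;
                        default:
                            break; // ignore comments, processing instructions, etc.
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(trace("<root>This is <i>really</i> cool!</root>"));
    }
}
```

Because the loop belongs to the application, stopping early is just a matter of leaving the loop; the finally block still closes the reader.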
The core of the NekoPull API is XMLEvent, which can
be found in the org.cyberneko.pull package. This
class is the base class for all document events returned
by a NekoPull parser, which is defined by the XMLPullParser
interface. The specific event objects are defined in the
org.cyberneko.pull.event package. The complete
hierarchy is shown in the following diagram:
org.cyberneko.pull
  XMLEvent
  XMLEventIterator
  XMLPullParser extends XMLEventIterator
org.cyberneko.pull.event
  BoundedEvent extends XMLEvent
  CDATAEvent extends BoundedEvent
  CharactersEvent extends XMLEvent
  CommentEvent extends XMLEvent
  DoctypeDeclEvent extends XMLEvent
  DocumentEvent extends BoundedEvent
  ElementEvent extends BoundedEvent
  GeneralEntityEvent extends BoundedEvent
  PrefixMappingEvent extends BoundedEvent
  ProcessingInstructionEvent extends XMLEvent
  TextDeclEvent extends XMLEvent
Even though it appears that NekoPull contains a lot of classes,
the XMLEvent class and its descendants are merely
object structures with a few public fields. Programmers only
need to concern themselves with those classes that are of direct
importance — typically the ElementEvent and
CharactersEvent classes. But before accessing the
document information, you must first create an XML pull parser.
Currently, there are no "factory" methods for creating a NekoPull
parser instance. (Perhaps one will be added in the future as
the need arises.) Therefore, you must create a new instance of
the NekoPull implementation that you want to use. Currently,
NekoPull comes with only one implementation which is based on
Xerces2. The class is called Xerces2 and can be
found in the org.cyberneko.pull.parsers package.
The following code shows how to instantiate the NekoPull implementation based on Xerces2.
// import org.cyberneko.pull.XMLPullParser;
// import org.cyberneko.pull.parsers.Xerces2;

XMLPullParser parser = new Xerces2();
Calling the Xerces2 constructor with no parameters
will create a NekoPull pull-parser instance using Xerces2's
standard parser configuration. However, the Xerces2
class can also be used with any XNI parser configuration that
implements the XMLPullParserConfiguration interface.
The following example shows the Xerces2 pull-parser using the
NekoHTML parser (version 0.6.6 or higher).
// import org.apache.xerces.xni.parser.XMLPullParserConfiguration;
// import org.cyberneko.html.HTMLConfiguration;
// import org.cyberneko.pull.XMLPullParser;
// import org.cyberneko.pull.parsers.Xerces2;

XMLPullParserConfiguration config = new HTMLConfiguration();
XMLPullParser parser = new Xerces2(config);
Note: NekoPull is an API built on XNI but it does not have to be directly bound to the Xerces2 implementation of XNI. The preferred method would be to implement a "native" XML parser written specifically to the NekoPull API. However, this version of NekoPull only provides this implementation.
After the parser is created, the application can initiate
parsing and iterate document events. In order to start
parsing, create an XNI XMLInputSource object
and call the setInputSource method as shown
below.
// import org.apache.xerces.xni.parser.XMLInputSource;

XMLInputSource source = new XMLInputSource(null, "data/pull/test03.xml", null);
parser.setInputSource(source);
And that's it! You're now ready to iterate over the document events.
The XMLEventIterator interface defines the way
in which the application iterates over document information.
Currently, the XMLEventIterator interface only
contains a single, low-level method to query document events
as shown below.
// import java.io.IOException;
// import org.apache.xerces.xni.XNIException;
// import org.cyberneko.pull.XMLEvent;
/**
 * Returns the next event in the document or null if there are
 * no more events. This method will return one and only one event
 * if it is available; it will never return an event chain (i.e.
 * an event with a non-null next field).
 */
public XMLEvent nextEvent() throws XNIException, IOException;
The XMLPullParser interface extends the
XMLEventIterator interface to allow implementation
optimizations between the document parsing and event propagation
code as well as for convenience. Therefore, the application can
iterate document events directly from the parser instance after
calling the setInputSource method.
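If you prefer the standard java.util.Iterator style over checking for null, a thin adapter is easy to write. The following is a minimal sketch under stated assumptions: EventSource is a hypothetical stand-in for a null-terminated event source (it is not the real XMLEventIterator, which also declares XNIException), and asIterator is a made-up helper, not part of NekoPull.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class EventIterationSketch {

    /** Hypothetical stand-in for a pull-style source; not the real XMLEventIterator. */
    interface EventSource<E> {
        /** Returns the next event, or null when there are no more events. */
        E nextEvent() throws IOException;
    }

    /** Adapts the null-terminated pull style to java.util.Iterator using one event of lookahead. */
    static <E> Iterator<E> asIterator(final EventSource<E> source) {
        return new Iterator<E>() {
            private E lookahead;   // event fetched by hasNext() but not yet handed out
            private boolean done;  // true once the source has returned null

            @Override
            public boolean hasNext() {
                if (lookahead == null && !done) {
                    try {
                        lookahead = source.nextEvent();
                    } catch (IOException e) {
                        // Iterator methods cannot throw checked exceptions.
                        throw new UncheckedIOException(e);
                    }
                    done = (lookahead == null);
                }
                return lookahead != null;
            }

            @Override
            public E next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                E event = lookahead;
                lookahead = null;
                return event;
            }
        };
    }

    public static void main(String[] args) {
        // Canned strings stand in for real event objects.
        final Iterator<String> canned = Arrays.asList("(root", "\"hi", ")root").iterator();
        EventSource<String> source = () -> canned.hasNext() ? canned.next() : null;
        for (Iterator<String> it = asIterator(source); it.hasNext(); ) {
            System.out.println(it.next());
        }
    }
}
```

The adapter pulls at most one event ahead of the caller, so the application still controls how far the underlying source advances.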
Let's put all of the pieces together and take a look at a sample program using NekoPull. The following code parses an XML document and outputs a format similar to NSGMLS. While simplistic, this code will highlight the basics of using the NekoPull API.
// import org.apache.xerces.xni.parser.XMLInputSource;
// import org.cyberneko.pull.XMLEvent;
// import org.cyberneko.pull.XMLPullParser;
// import org.cyberneko.pull.event.CharactersEvent;
// import org.cyberneko.pull.event.ElementEvent;
// import org.cyberneko.pull.parsers.Xerces2;

// create parser and set input source
XMLPullParser parser = new Xerces2();
XMLInputSource source = new XMLInputSource(null, "data/pull/test03.xml", null);
parser.setInputSource(source);

// iterate document events
XMLEvent event;
while ((event = parser.nextEvent()) != null) {
    if (event.type == XMLEvent.ELEMENT) {
        ElementEvent elementEvent = (ElementEvent)event;
        if (elementEvent.start) {
            System.out.println("("+elementEvent.element.rawname);
        }
        else {
            System.out.println(")"+elementEvent.element.rawname);
        }
    }
    else if (event.type == XMLEvent.CHARACTERS) {
        CharactersEvent charsEvent = (CharactersEvent)event;
        System.out.println("\""+charsEvent.text);
    }
}

// free resources
parser.cleanup();
Note:
The source code for this example is provided in the class
called TestPullParser that can be found in the
src/pull/sample/ directory.
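The sample code can be broken down into three parts:
1. creating the parser and setting the input source
2. iterating over the document events
3. freeing resources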
The first step has already been explained so let's take a
look at step two. As already stated, the parser interface
extends the event iterator interface so the application can
iterate document events by calling the nextEvent
method directly on the XMLPullParser instance. The
parser will continue to return XMLEvent objects
from this method until the entire document has been parsed.
Therefore, the primary code in an application using NekoPull
will look like the following:
XMLEvent event;
while ((event = parser.nextEvent()) != null) {
    // do something...
}
The actual data contained in the event object depends on its
type. But we will explain the event hierarchy in more detail
in the section titled Event Objects.
For now, just keep in mind that a polymorphic event object
is returned by the parser each time nextEvent
is called.
The final step, freeing resources, allows the parser to close
any remaining input streams and release additional resources.
Very often, applications using the pull-parser approach to
parse XML documents will parse a little bit of the document,
execute some logic, and stop before the end of the
document is reached. Therefore, it's a good idea to always
call the cleanup method when finished (e.g. in
a finally block):
XMLPullParser parser = new Xerces2();
try {
    // initiate parsing and iterate events
}
finally {
    parser.cleanup();
}
Executing the sample program produces the following output. (The command line should be written on a single line; it is split across multiple lines here for readability.)
> java -cp nekopull.jar;nekopullSamples.jar;xmlParserAPIs.jar;xercesImpl.jar
       sample.TestPullParser
(root
"This is 
(i
"really
)i
" cool!
)root
The event object is a structure of publicly accessible fields
that hold the event's information. Every event object derives
from the base class
XMLEvent
which only contains information common to all events. The
type field specifies the event's type and will
match one of the following constants defined in the
XMLEvent class:
Each event type has a corresponding event object defined in the
org.cyberneko.pull.event package. Each of these
objects holds additional information specific to the event.
The two most important event objects are the
ElementEvent object, which corresponds to the
XMLEvent.ELEMENT type and contains the
following fields:
boolean start (inherited from BoundedEvent)
boolean empty
QName element
XMLAttributes attributes
and the CharactersEvent object, which corresponds
to the XMLEvent.CHARACTERS type and contains:
XMLString text
boolean ignorable
Note:
For performance reasons, a single run of character data may be
split across several characters events returned by sequential
calls to nextEvent. This is similar to the way the characters
method of the SAX ContentHandler is defined. However,
it is assumed that the XMLEventIterator interface
will be expanded over time to include more convenient methods
to iterate the document, or a utility class will be included
to do the same thing.
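Until such a convenience exists, an application that needs whole text runs can merge adjacent characters events itself. The following is a minimal sketch of that idea, not NekoPull code: the nested XMLEvent, ElementEvent, and CharactersEvent classes are simplified, hypothetical stand-ins (plain String fields, arbitrary constant values), and coalesce is a made-up helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class CoalescingSketch {

    // Simplified stand-ins for the NekoPull event classes (hypothetical; the
    // real classes use XNI types such as XMLString and QName, and these
    // constant values are arbitrary).
    static class XMLEvent {
        static final int ELEMENT = 1;
        static final int CHARACTERS = 2;
        final int type;
        XMLEvent(int type) { this.type = type; }
    }

    static class ElementEvent extends XMLEvent {
        final boolean start;
        final String rawname;
        ElementEvent(boolean start, String rawname) {
            super(ELEMENT);
            this.start = start;
            this.rawname = rawname;
        }
    }

    static class CharactersEvent extends XMLEvent {
        final String text;
        CharactersEvent(String text) {
            super(CHARACTERS);
            this.text = text;
        }
    }

    /** Merges each run of adjacent characters events into a single event. */
    static List<XMLEvent> coalesce(Iterator<XMLEvent> events) {
        List<XMLEvent> merged = new ArrayList<>();
        StringBuilder pending = null; // text collected so far, or null if none
        while (events.hasNext()) {
            XMLEvent event = events.next();
            if (event.type == XMLEvent.CHARACTERS) {
                if (pending == null) {
                    pending = new StringBuilder();
                }
                pending.append(((CharactersEvent) event).text);
            } else {
                if (pending != null) {
                    merged.add(new CharactersEvent(pending.toString()));
                    pending = null;
                }
                merged.add(event);
            }
        }
        if (pending != null) {
            merged.add(new CharactersEvent(pending.toString()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<XMLEvent> raw = Arrays.<XMLEvent>asList(
            new ElementEvent(true, "root"),
            new CharactersEvent("This is "),
            new CharactersEvent("cool!"),
            new ElementEvent(false, "root"));
        for (XMLEvent event : coalesce(raw.iterator())) {
            if (event.type == XMLEvent.CHARACTERS) {
                System.out.println("\"" + ((CharactersEvent) event).text);
            }
        }
    }
}
```

This version buffers the whole event list for simplicity; a streaming variant would hold back only the pending text run.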
Regardless of the event type, though, the application needs to
cast the object returned by the parser to the appropriate
class. In the following code snippet, element events are cast
to an ElementEvent object before accessing the
element information. The element event, like some other types
of events, is bounded, meaning that it has a start and an end
event. All of these events extend the BoundedEvent class, which
has a start field that indicates which boundary the event
represents. For example:
if (event.type == XMLEvent.ELEMENT) {
    ElementEvent elementEvent = (ElementEvent)event;
    if (elementEvent.start) {
        System.out.println("("+elementEvent.element.rawname);
    }
    else {
        System.out.println(")"+elementEvent.element.rawname);
    }
}
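Because start and end boundaries arrive as separate events, the application keeps whatever context it needs between calls. As a rough illustration, the sketch below tracks nesting depth from the start field and prints an indented outline of element names. The nested event classes are simplified, hypothetical stand-ins for the NekoPull ones (a String rawname instead of a QName, an arbitrary constant value), and outline is a made-up helper.

```java
import java.util.Arrays;
import java.util.List;

final class DepthSketch {

    // Simplified stand-ins for the NekoPull event classes (hypothetical).
    static class XMLEvent {
        static final int ELEMENT = 1; // arbitrary value for this sketch
        final int type;
        XMLEvent(int type) { this.type = type; }
    }

    static class BoundedEvent extends XMLEvent {
        final boolean start; // true for the opening boundary, false for the closing one
        BoundedEvent(int type, boolean start) {
            super(type);
            this.start = start;
        }
    }

    static class ElementEvent extends BoundedEvent {
        final String rawname;
        ElementEvent(boolean start, String rawname) {
            super(ELEMENT, start);
            this.rawname = rawname;
        }
    }

    /** Returns one line per element, indented two spaces per nesting level. */
    static String outline(List<XMLEvent> events) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (XMLEvent event : events) {
            if (event.type != XMLEvent.ELEMENT) {
                continue; // only element boundaries affect the outline
            }
            ElementEvent element = (ElementEvent) event;
            if (element.start) {
                for (int i = 0; i < depth; i++) {
                    out.append("  ");
                }
                out.append(element.rawname).append('\n');
                depth++;
            } else {
                depth--;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<XMLEvent> events = Arrays.<XMLEvent>asList(
            new ElementEvent(true, "root"),
            new ElementEvent(true, "i"),
            new ElementEvent(false, "i"),
            new ElementEvent(false, "root"));
        System.out.print(outline(events));
    }
}
```

The sketch assumes every start event is matched by an end event; how the real parser reports empty elements should be checked against the JavaDoc.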
Accessing event information is relatively straightforward and even people new to XNI will find it easy to learn as it has many similarities with the SAX API. For complete information regarding the event objects in NekoPull, though, please refer to the JavaDoc.
Additional sample applications are provided to illustrate the
use of the NekoPull parser. The Counter,
DocumentTracer, and Writer samples
(analogous to the samples included with Xerces2) can be found
in the src/pull/sample/ directory of this package.
All samples take document URIs as command line arguments.
Parsing the "data/pull/test03.xml" file with the Counter
class produces the following output: (The command line should
be written on a single line. It is split among multiple lines
for readability.)
> java -cp nekopull.jar;xmlParserAPIs.jar;xercesImpl.jar
       sample.Counter data/pull/test03.xml
data/pull/test03.xml: 0 ms (2 elems, 0 attrs, 0 spaces, 20 chars)
Parsing the "data/pull/test03.xml" file with the DocumentTracer
class produces the following output: (The command line should
be written on a single line. It is split among multiple lines
for readability.)
> java -cp nekopull.jar;xmlParserAPIs.jar;xercesImpl.jar
       sample.DocumentTracer data/pull/test03.xml
event.type=DOCUMENT (org.cyberneko.pull.event.DocumentEvent)
  .start=true
  .locator=org.apache.xerces.impl.XMLEntityManager$EntityScanner@20dcd9
  .encoding="UTF-8"
event.type=COMMENT (org.cyberneko.pull.event.CommentEvent)
  .text=" simple XML document "
event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
  .start=true
  .element=localpart="root",rawname="root"
  .start=true
  .empty=false
event.type=CHARACTERS (org.cyberneko.pull.event.CharactersEvent)
  .text="This is "
  .ignorable=false
event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
  .start=true
  .element=localpart="i",rawname="i"
  .start=true
  .empty=false
event.type=CHARACTERS (org.cyberneko.pull.event.CharactersEvent)
  .text="really"
  .ignorable=false
event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
  .start=false
  .element=localpart="i",rawname="i"
  .start=false
  .empty=false
event.type=CHARACTERS (org.cyberneko.pull.event.CharactersEvent)
  .text=" cool!"
  .ignorable=false
event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
  .start=false
  .element=localpart="root",rawname="root"
  .start=false
  .empty=false
event.type=DOCUMENT (org.cyberneko.pull.event.DocumentEvent)
  .start=false
Parsing the "data/pull/test03.xml" file with the Writer
class produces the following output: (The command line should
be written on a single line. It is split among multiple lines
for readability.)
> java -cp nekopull.jar;xmlParserAPIs.jar;xercesImpl.jar
       sample.Writer data/pull/test03.xml
<!-- simple XML document -->
<root>This is <i>really</i> cool!</root>
The current design and implementation of NekoPull represent my "first guess" at an XML pull-parsing API built on the Xerces Native Interface. I want to continue refining the programming interface in the hope of making it as easy to use and as useful as possible for XML application developers. As such, further experimentation and API changes should be expected as NekoPull progresses.
I am very interested in user feedback. If you are interested in the design of NekoPull or have any suggestions for improving it, please let me know.