Usage Instructions

What is "Pull Parsing"?

So you've heard a lot about it lately, but what exactly is "pull parsing"? Simply put, pull parsing is a programming interface for accessing XML document information. The dominant APIs for writing XML applications have been the Document Object Model (DOM) and the Simple API for XML (SAX). However, pull parsing provides an alternative for application developers when the DOM tree-based model and the SAX event-push model are not appropriate. Using the pull-parsing paradigm, the application actively controls the parsing of the XML document by traversing it one event at a time.

Currently, a number of XML pull-parsing APIs exist as alternatives to the DOM and SAX models. For example, the XMLPull API was created by merging two parallel efforts. In addition, JSR 173 was formed to develop a pull-parsing API for the Java programming language. NekoPull was invented for two reasons: to fix the inadequacies the author sees in other pull-parsing designs; and to add native pull-parsing capability to Xerces2.
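
To make the pull idea concrete without depending on NekoPull, here is a minimal sketch that uses StAX (javax.xml.stream), the API that resulted from JSR 173 and ships with the JDK. It is not NekoPull code, and the class name StaxPullSketch and the inline document are made up for illustration. The point is only that the application drives the loop and asks for one event at a time.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: uses the JDK's StAX API, not NekoPull.
public class StaxPullSketch {

    /** Pulls events one at a time and records them in an NSGMLS-like form. */
    public static List<String> trace(String xml) {
        List<String> lines = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to merge adjacent character data into one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                // The application, not the parser, decides when to advance.
                while (reader.hasNext()) {
                    int type = reader.next();
                    if (type == XMLStreamConstants.START_ELEMENT) {
                        lines.add("(" + reader.getLocalName());
                    } else if (type == XMLStreamConstants.END_ELEMENT) {
                        lines.add(")" + reader.getLocalName());
                    } else if (type == XMLStreamConstants.CHARACTERS) {
                        lines.add("\"" + reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked exception.
            throw new IllegalStateException("could not parse document", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : trace("<root>This is <i>really</i> cool!</root>")) {
            System.out.println(line);
        }
    }
}
```

Because the loop belongs to the application, it could just as easily stop after the first element of interest instead of reading to the end of the document.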

Using NekoPull

The XML pull-parsing API of NekoPull is very simple. The application instantiates a pull parser instance and requests document information events. The parser, in turn, parses a little bit of the XML document and returns an object representing a piece of the document's information. The application then queries the type of the event and casts the object to a specific event type in order to access the information relating to that event. The application continues this process of requesting document events until there are no more events or until the application decides to stop parsing.

Package Hierarchy

The core of the NekoPull API is XMLEvent which can be found in the org.cyberneko.pull package. This class represents the base class for all document events returned by the NekoPull parser, defined by the XMLPullParser interface. The specific event objects are specified in the org.cyberneko.pull.event package. The complete hierarchy is shown in the following diagram:

[Figure: package hierarchy]

Even though it appears that NekoPull contains a lot of classes, the XMLEvent class and its descendants are merely object structures with a few public fields. Programmers only need to concern themselves with those classes that are of direct importance, typically the ElementEvent and CharactersEvent classes. But before accessing the document information, you must first create an XML pull parser.

Instantiating Parser

There are currently no "factory" methods for creating a NekoPull parser instance. (Perhaps one will be added in the future as the need arises.) Therefore, you must directly instantiate the NekoPull implementation that you want to use. At present, NekoPull comes with only one implementation, which is based on Xerces2. The class is called Xerces2 and can be found in the org.cyberneko.pull.parsers package.

The following code shows how to instantiate the NekoPull implementation based on Xerces2.

// import org.cyberneko.pull.XMLPullParser;
// import org.cyberneko.pull.parsers.Xerces2;

XMLPullParser parser = new Xerces2();

Calling the Xerces2 constructor with no parameters will create a NekoPull pull-parser instance using Xerces2's standard parser configuration. However, the Xerces2 class can also be used with any XNI parser configuration that implements the XMLPullParserConfiguration interface. The following example shows the Xerces2 pull-parser using the NekoHTML parser (version 0.6.6 or higher).

// import org.apache.xerces.xni.parser.XMLPullParserConfiguration;
// import org.cyberneko.html.HTMLConfiguration;
// import org.cyberneko.pull.XMLPullParser;
// import org.cyberneko.pull.parsers.Xerces2;

XMLPullParserConfiguration config = new HTMLConfiguration();
XMLPullParser parser = new Xerces2(config);

Note: NekoPull is an API built on XNI but it does not have to be directly bound to the Xerces2 implementation of XNI. The preferred method would be to implement a "native" XML parser written specifically to the NekoPull API. However, this version of NekoPull only provides this implementation.

Initiating Parsing

After the parser is created, the application can initiate parsing and iterate document events. In order to start parsing, create an XNI XMLInputSource object and call the setInputSource method as shown below.

// import org.apache.xerces.xni.parser.XMLInputSource;

XMLInputSource source = new XMLInputSource(null, "data/pull/test03.xml", null);
parser.setInputSource(source);

And that's it! You're now ready to iterate over the document events.

Iterating Events

The XMLEventIterator interface defines the way in which the application iterates over document information. Currently, the XMLEventIterator interface only contains a single, low-level method to query document events as shown below.

// import java.io.IOException;
// import org.apache.xerces.xni.XNIException;
// import org.cyberneko.pull.XMLEvent;

/**
 * Returns the next event in the document or null if there are
 * no more events. This method will return one and only one event
 * if it is available; it will never return an event chain (i.e.
 * an event with a non-null next field).
 */
public XMLEvent nextEvent() throws XNIException, IOException;

The XMLPullParser interface extends the XMLEventIterator interface to allow implementation optimizations between the document parsing and event propagation code as well as for convenience. Therefore, the application can iterate document events directly from the parser instance after calling the setInputSource method.

Sample Code

Let's put all of the pieces together and take a look at a sample program using NekoPull. The following code parses an XML document and outputs a format similar to NSGMLS. While simplistic, this code will highlight the basics of using the NekoPull API.

// import org.apache.xerces.xni.parser.XMLInputSource;
// import org.cyberneko.pull.XMLEvent;
// import org.cyberneko.pull.XMLPullParser;
// import org.cyberneko.pull.event.CharactersEvent;
// import org.cyberneko.pull.event.ElementEvent;
// import org.cyberneko.pull.parsers.Xerces2;

// create parser and set input source
XMLPullParser parser = new Xerces2();
XMLInputSource source = new XMLInputSource(null, "data/pull/test03.xml", null);
parser.setInputSource(source);

// iterate document events
XMLEvent event;
while ((event = parser.nextEvent()) != null) {
    if (event.type == XMLEvent.ELEMENT) {
        ElementEvent elementEvent = (ElementEvent)event;
        if (elementEvent.start) {
            System.out.println("("+elementEvent.element.rawname);
        }
        else {
            System.out.println(")"+elementEvent.element.rawname);
        }
    }
    else if (event.type == XMLEvent.CHARACTERS) {
        CharactersEvent charsEvent = (CharactersEvent)event;
        System.out.println("\""+charsEvent.text);
    }
}

// free resources
parser.cleanup();

Note: The source code for this example is provided in the class called TestPullParser that can be found in the src/pull/sample/ directory.

The sample code can be broken down into three parts:

  1. creating parser and initiating parsing;
  2. iterating document events; and
  3. freeing resources.

The first step has already been explained so let's take a look at step two. As already stated, the parser interface extends the event iterator interface so the application can iterate document events by calling the nextEvent method directly on the XMLPullParser instance. The parser will continue to return XMLEvent objects from this method until the entire document has been parsed. Therefore, the primary code in an application using NekoPull will look like the following:

XMLEvent event;
while ((event = parser.nextEvent()) != null) {
    // do something...
}

The actual data contained in the event object depends on its type. But we will explain the event hierarchy in more detail in the section titled Event Objects. For now, just keep in mind that a polymorphic event object is returned by the parser each time nextEvent is called.

The final step, freeing resources, allows the parser to close any remaining input streams and release additional resources. Very often, applications using the pull-parser approach to parse XML documents will parse a little bit of the document, execute some logic, and stop before the end of the document is reached. Therefore, it's a good idea to always call the cleanup method when finished (e.g. in a finally block):

XMLPullParser parser = new Xerces2();
try {
    // initiate parsing and iterate events
}
finally {
    parser.cleanup();
}

Executing the sample program produces the following output. (The command line should be written on a single line; it is split across multiple lines here for readability.)

> java -cp nekopull.jar;nekopullSamples.jar;xmlParserAPIs.jar;xercesImpl.jar 
        sample.TestPullParser
(root
"This is
(i
"really
)i
" cool!
)root

Event Objects

The event object is a structure of publicly accessible fields that hold the event's information. Every event object derives from the base class XMLEvent which only contains information common to all events. The type field specifies the event's type and will match one of the following constants defined in the XMLEvent class:

Object Contents

Each event type has a corresponding event object defined in the org.cyberneko.pull.event package. Each of these objects holds additional information specific to the event. The two most important event objects are the ElementEvent object that corresponds to the XMLEvent.ELEMENT type and which contains the following fields:

and the CharactersEvent object that corresponds to XMLEvent.CHARACTERS contains:

Note: For performance reasons, the character content of a single text node may be split across several characters events returned by sequential calls to nextEvent. This is similar to the way that the characters method of the SAX ContentHandler is defined. However, it is assumed that the XMLEventIterator interface will be expanded over time to include more convenient methods to iterate the document, or that a utility class will be included to do the same thing.
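
Until such a convenience exists, an application can merge adjacent characters events itself. The following is a minimal sketch of that idea, not part of NekoPull. To stay self-contained it uses hypothetical stand-in classes (Event, Chars, Element) with arbitrary type constants and an Iterable in place of the real event classes and the nextEvent loop, and it assumes the text can be handled as a String.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoalesceSketch {

    // Stand-in event classes for this sketch only. They mimic the "public
    // fields" style described above but are NOT the org.cyberneko.pull types.
    public static class Event {
        public static final int ELEMENT = 1;     // arbitrary values
        public static final int CHARACTERS = 2;
        public final int type;
        public Event(int type) { this.type = type; }
    }

    public static class Chars extends Event {
        public final String text;                // assumed to be a String here
        public Chars(String text) { super(CHARACTERS); this.text = text; }
    }

    public static class Element extends Event {
        public final String rawname;
        public final boolean start;
        public Element(String rawname, boolean start) {
            super(ELEMENT);
            this.rawname = rawname;
            this.start = start;
        }
    }

    /** Joins each run of adjacent characters events into a single string. */
    public static List<String> coalesce(Iterable<? extends Event> events) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = null;
        for (Event event : events) {
            if (event.type == Event.CHARACTERS) {
                if (current == null) {
                    current = new StringBuilder();
                }
                current.append(((Chars) event).text);
            } else if (current != null) {
                // Any other event ends the current run of text.
                runs.add(current.toString());
                current = null;
            }
        }
        if (current != null) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Element("root", true),
            new Chars("This "), new Chars("is "),   // one text node, two events
            new Element("i", true),
            new Chars("really"),
            new Element("i", false),
            new Element("root", false));
        System.out.println(coalesce(events));
    }
}
```

With the real parser, the same buffering logic would sit inside the while loop over nextEvent, flushing the buffer whenever a non-characters event arrives.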

Casting Event Object

Regardless of the event type, the application needs to cast the object returned by the parser to the appropriate class. In the following code snippet, element events are cast to an ElementEvent object before accessing the element information. The element event, like other types of events, is bounded: it has a start and an end event. All of these events extend the BoundedEvent class, which has a start field that indicates which boundary the event represents. For example:

    if (event.type == XMLEvent.ELEMENT) {
        ElementEvent elementEvent = (ElementEvent)event;
        if (elementEvent.start) {
            System.out.println("("+elementEvent.element.rawname);
        }
        else {
            System.out.println(")"+elementEvent.element.rawname);
        }
    }

Accessing event information is relatively straightforward and even people new to XNI will find it easy to learn as it has many similarities with the SAX API. For complete information regarding the event objects in NekoPull, though, please refer to the JavaDoc.
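
As a further illustration of bounded events, the sketch below uses the start flag to track nesting depth and indent each event accordingly, in the spirit of the indented trace that the DocumentTracer sample prints. It is a sketch under stated assumptions, not NekoPull code: Event, Bounded, Element, and Chars are hypothetical stand-ins, and it uses instanceof rather than the type field because the stand-ins define no type constants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoundedEventSketch {

    // Stand-ins for illustration only; NOT the real NekoPull classes.
    public static class Event { }

    public static class Bounded extends Event {
        public final boolean start;              // true = start boundary
        public Bounded(boolean start) { this.start = start; }
    }

    public static class Element extends Bounded {
        public final String rawname;
        public Element(String rawname, boolean start) {
            super(start);
            this.rawname = rawname;
        }
    }

    public static class Chars extends Event {
        public final String text;
        public Chars(String text) { this.text = text; }
    }

    /** Renders each event on its own line, indented by nesting depth. */
    public static List<String> indent(Iterable<? extends Event> events) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        for (Event event : events) {
            boolean bounded = event instanceof Bounded;
            if (bounded && !((Bounded) event).start) {
                depth--;                         // an end boundary closes a level
            }
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append(' ');
            }
            if (event instanceof Element) {
                Element element = (Element) event;
                line.append(element.start ? "(" : ")").append(element.rawname);
            } else if (event instanceof Chars) {
                line.append('"').append(((Chars) event).text);
            }
            lines.add(line.toString());
            if (bounded && ((Bounded) event).start) {
                depth++;                         // a start boundary opens a level
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Element("root", true),
            new Chars("This is "),
            new Element("i", true),
            new Chars("really"),
            new Element("i", false),
            new Chars(" cool!"),
            new Element("root", false));
        for (String line : indent(events)) {
            System.out.println(line);
        }
    }
}
```

The start and end events of an element land at the same indentation, with everything between them one level deeper.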

Samples

Additional sample applications are provided to illustrate the use of the NekoPull parser. The Counter, DocumentTracer, and Writer samples (analogous to the samples included with Xerces2) can be found in the src/pull/sample/ directory of this package. All samples take document URIs as command line arguments.

Counter

Parsing the "data/pull/test03.xml" file with the Counter class produces the following output: (The command line should be written on a single line. It is split among multiple lines for readability.)

> java -cp nekopull.jar;xmlParserAPIs.jar;xercesImpl.jar 
       sample.Counter data/pull/test03.xml
data/pull/test03.xml: 0 ms (2 elems, 0 attrs, 0 spaces, 20 chars)

DocumentTracer

Parsing the "data/pull/test03.xml" file with the DocumentTracer class produces the following output: (The command line should be written on a single line. It is split among multiple lines for readability.)

> java -cp nekopull.jar;xmlParserAPIs.jar;xercesImpl.jar 
       sample.DocumentTracer data/pull/test03.xml
event.type=DOCUMENT (org.cyberneko.pull.event.DocumentEvent)
     .start=true
     .locator=org.apache.xerces.impl.XMLEntityManager$EntityScanner@20dcd9
     .encoding="UTF-8"
 event.type=COMMENT (org.cyberneko.pull.event.CommentEvent)
      .text=" simple XML document "
 event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
      .start=true
      .element=localpart="root",rawname="root"
      .start=true
      .empty=false
  event.type=CHARACTERS (org.cyberneko.pull.event.CharactersEvent)
       .text="This is "
       .ignorable=false
  event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
       .start=true
       .element=localpart="i",rawname="i"
       .start=true
       .empty=false
   event.type=CHARACTERS (org.cyberneko.pull.event.CharactersEvent)
        .text="really"
        .ignorable=false
  event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
       .start=false
       .element=localpart="i",rawname="i"
       .start=false
       .empty=false
  event.type=CHARACTERS (org.cyberneko.pull.event.CharactersEvent)
       .text=" cool!"
       .ignorable=false
 event.type=ELEMENT (org.cyberneko.pull.event.ElementEvent)
      .start=false
      .element=localpart="root",rawname="root"
      .start=false
      .empty=false
event.type=DOCUMENT (org.cyberneko.pull.event.DocumentEvent)
     .start=false

Writer

Parsing the "data/pull/test03.xml" file with the Writer class produces the following output: (The command line should be written on a single line. It is split among multiple lines for readability.)

> java -cp nekopull.jar;xmlParserAPIs.jar;xercesImpl.jar 
       sample.Writer data/pull/test03.xml
<!-- simple XML document -->
<root>This is <i>really</i> cool!</root>

Future Directions

The current design and implementation of NekoPull is my "first guess" at designing an XML pull-parsing API built on the Xerces Native Interface. I want to continue refining the programming interface in the hope of making an API that is the easiest and the most useful to XML application developers. As such, further experimentation and API changes should be expected as NekoPull progresses.

I am very interested in user feedback. If you are interested in the design of NekoPull or have any suggestions for improving it, please let me know.