Pipeline Processing

Overview

NekoStyle defines a document processing framework and includes a library of standard components to help simplify the automation of XML document processing. Although the architecture is simple and the implementation small, the components can be combined to handle a wide range of document processing tasks. And if the bundled processors don't support the features that you need, you can extend the functionality with custom processor components.

At the heart of the NekoStyle framework is the Processor interface located in the org.cyberneko.style package. This interface defines a component that accepts an XML document as input and outputs an XML document. In this way, each component can be viewed as a black box whose implementation is independent of the other components. The documentation includes a complete list of processors that come bundled with NekoStyle.

The Processor interface defines the following methods:

public String getName();
public DocumentNode process(DocumentNode node,
                            ParameterMap params,
                            Context context)
    throws ProcessorException;

The node object wraps the DOM node to be processed; the params object contains a map of the processor parameters; and the context object is a reference to the pipeline's context.

The context contains information that is available to all processors. The context also contains a cache for communicating document references between processors in the pipeline. Each processor can set and query documents contained within the cache using a unique document identifier name. As a general rule, though, processors should consider the cache read-only.

Processors for storing documents in the cache and loading them from it are provided with the NekoStyle package. The DocumentStore and DocumentLoad processors, both part of the org.cyberneko.style.core package, store and load documents, respectively. Moving these operations into separate processors simplifies the implementation of the other processors and lets the user manage the document cache explicitly instead of having to know which processors implicitly write to the cache.
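
As a rough sketch of how these two processors might be used together in a pipeline script (the script syntax is described in the sections that follow), the fragment below parses a document, stores it in the cache, and loads it again at a later stage. The "DocumentStore.id" parameter appears in the samples later in this document; "DocumentLoad.id" is an assumption based on the parameter naming pattern, and the file name and identifier are placeholders. Check the Processors document for the actual parameter names.

```xml
<pipeline>

 <!-- parse a document and store it in the cache as "shared" -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='shared.xml'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='shared'/>
 </processor>

 <!-- later: make the cached document the input of the next stage -->
 <!-- NOTE: the parameter name "DocumentLoad.id" is assumed, not confirmed -->
 <processor classname='org.cyberneko.style.core.DocumentLoad'>
  <param name='DocumentLoad.id' value='shared'/>
 </processor>
 <processor classname='org.cyberneko.style.printers.XMLPrinter'/>

</pipeline>
```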

The Pipeline Processor

While processing components can be used independently of each other, you get the most out of them by connecting a series of processors together to form a document processing pipeline. So that you don't have to implement a pipeline framework yourself, NekoStyle includes its own pipeline processing implementation called, coincidentally enough, Pipeline. This class can be found in the org.cyberneko.style.core package.

In order to use the Pipeline component, you only need to provide a suitable script to drive the pipeline. The following sections describe this script syntax in more detail and explain how it works.

Grammar

The grammar for the script documents, written in XML, is easy to learn and use. In fact, the entire functionality of the pipeline processing is defined by only five elements: <pipeline>, <inline>, <processor>, <param>, and <property>. Most tasks can be accomplished with these five elements in combination with XSLT stylesheets. The grammar of the pipeline script is shown below:

<!ELEMENT pipeline (pipeline|inline|processor|property)*>
<!ATTLIST pipeline export (true|false) "true">

<!ELEMENT inline (pipeline|inline|processor)*>
<!ATTLIST inline export (true|false) "true">

<!ELEMENT processor (param*)>
<!ATTLIST processor classname CDATA #REQUIRED>

<!ELEMENT param EMPTY>
<!ATTLIST param name  NMTOKEN #REQUIRED
                value CDATA   #REQUIRED>

<!ELEMENT property EMPTY>
<!ATTLIST property name  NMTOKEN #REQUIRED
                   value CDATA   #REQUIRED>

The root element of the script should be <pipeline>. This element contains the list of the processes (and sub-processes) that you want executed. As the DTD grammar shows, the elements allowed within the <pipeline> element are: <pipeline>, <inline>, <processor>, and <property>. Each child element, or "stage", is run in sequence by the Pipeline processor, passing the output from each stage to the next stage. Note: The input to the first stage in the pipeline is the script itself. This allows subsequent processors to react based on the contents of the pipeline script.

The <pipeline> and <inline> elements are container elements that can each contain processors and/or other containers in any order. Both of these elements allow an attribute called "export" which defines whether documents added to the cache within the container are exported to the parent context. The default value is "true" which means that any documents added to the context cache within the container override documents with the same identifier in the parent cache. Note: Setting the "export" attribute to "false" allows the document cache in sub-processes to be reclaimed by the JVM when the container process ends.

The <inline> element behaves the same as the <pipeline> element with one crucial difference: the output of the last stage within the container is inlined. This means that the contents of the output are treated as another pipeline and executed within the parent process as if that pipeline had always been there! This allows the pipeline to be dynamically generated on-the-fly based on the contents of the script, settings, or the content of any document processed in the system. Very cool.

The <param> element is only used within the <processor> element in order to configure the process appropriately. A list of the parameters supported by each built-in processor can be found in the Processors document. In addition to the parameters that each processor accepts, the user can define additional parameters to be passed to the process. These user-defined parameters may be used to control other behaviors within the processor. For example, all of the parameters passed to the XSLTProcessor component are, in turn, passed to the stylesheet that it uses. This allows the pipeline script to pass settings to the stylesheet to control its behavior.

With all of the processor and user-defined parameters, name collisions could occur. In order to avoid this situation, the processor parameter names follow the pattern ProcessorName.paramname. For example, the parameter to tell the XMLParser processor which file to parse is "XMLParser.href".
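
The fragment below sketches how the two kinds of parameters might sit side by side. "XSLTProcessor.style" follows the processor naming pattern, while "title" is a hypothetical user-defined parameter; the stylesheet identifier and the parameter value are placeholders as well.

```xml
<processor classname='org.cyberneko.style.processors.XSLTProcessor'>
 <!-- processor parameter: follows the ProcessorName.paramname pattern -->
 <param name='XSLTProcessor.style' value='docbook2html'/>
 <!-- user-defined parameter (hypothetical): passed on to the stylesheet -->
 <param name='title' value='Release Notes'/>
</processor>
```

Assuming the value arrives as an ordinary XSLT parameter, the stylesheet would declare it at the top level:

```xml
<xsl:stylesheet version='1.0'
                xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

 <!-- receives the value set in the pipeline script; the default applies otherwise -->
 <xsl:param name='title' select='"Untitled"'/>

 <xsl:template match='/'>
  <html>
   <head><title><xsl:value-of select='$title'/></title></head>
   <body><xsl:apply-templates/></body>
  </html>
 </xsl:template>

</xsl:stylesheet>
```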

Finally, the <property> element allows you to set property values within the script for the purpose of text replacement. Only the first setting of a property will be stored — this allows the invoker to override default property values within the script. For additional information on properties, refer to the section titled "Replacement Text in Pipeline Scripts".

Running the Pipeline Processor

The Pipeline processor is implemented as a standalone program and is also available as an Ant build task. This section describes how to run pipelines from the command line; for information on how to use the processor in your Ant build, please refer to the Ant Tasks documentation.

The nekostyle.jar file included with NekoStyle is packaged so that the Pipeline processor is invoked if used with Java's "-jar" command-line option. For example:

> java -jar nekostyle.jar (options) file ...

When run without arguments, the program shows the usage information as shown by the following:

> java -jar nekostyle.jar
usage: java -jar nekostyle.jar (options) file ...

options:
  -p name value  Sets a property's replacement text value.
  -v | -q        Turns verbose output on/off.
  -              Parse pipeline from standard input.

The file arguments specified on the command line represent URIs to the pipeline scripts. To read a pipeline script from the standard input, use the dash ("-") argument.

The -p option allows you to set a pipeline property from the command line. To learn what properties are and how to use them, refer to the advanced topic titled "Replacement Text in Pipeline Scripts".

Sample Pipelines

Parsing and Printing

The most common operation to be performed is parsing a document and then serializing, or "printing", that document back to a file. The following pipeline parses an XML file called "document.xml" and prints it to standard out:

<pipeline>

 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='document.xml'/>
 </processor>

 <processor classname='org.cyberneko.style.printers.XMLPrinter'/>

</pipeline>

Transformation

The second most common operation is transforming an input document to some other format. This is typically done using XSLT so NekoStyle provides a JAXP/TrAX-based XSLT processor called, appropriately, XSLTProcessor. This class can be found in the org.cyberneko.style.processors package.

The XSLT processor transforms the input document using the stylesheet referenced by the "XSLTProcessor.style" parameter. The value of this parameter is the identifier of a document that has previously been stored in the context's document cache. Note: If this parameter is missing or if there is no document in the cache with the specified identifier, it is not a fatal error! Instead, a warning is issued and the identity transform is performed.

The following pipeline loads a stylesheet called "docbook2html.xsl"; parses an XML document called "index.xml"; transforms the document using the stylesheet; and writes the output to a file called "index.html".

<pipeline>

 <!-- load stylesheet -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='docbook2html.xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='docbook2html'/>
 </processor>

 <!-- parse, transform, and print document -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='index.xml'/>
 </processor>
 <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
  <param name='XSLTProcessor.style' value='docbook2html'/>
 </processor>
 <processor classname='org.cyberneko.style.printers.HTMLPrinter'>
  <param name='HTMLPrinter.href' value='index.html'/>
 </processor>

</pipeline>

Conserving Memory

As the size of your pipeline process grows, so does the number of documents that you are storing in the document cache of the processor's context. Therefore, it's a good idea to use sub-pipelines in order to allow the JVM a chance to reclaim memory for documents that are no longer needed.

This is often the case when you use a stylesheet to perform a local transformation on one file that is not needed again. The following example uses a sub-pipeline to perform a two-step transformation: from a document using a specific XML grammar into a document in a general format, and then into HTML. The first transformation is specific to the one file whereas the stylesheet used in the second transformation is likely to be used multiple times.

<pipeline>

 <!-- load stylesheet -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='docbook2html.xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='docbook2html'/>
 </processor>

 <!-- parse, transform, and print document -->
 <pipeline export='false'>

  <processor classname='org.cyberneko.style.parsers.XMLParser'>
   <param name='XMLParser.href' value='changes2docbook.xsl'/>
  </processor>
  <processor classname='org.cyberneko.style.core.DocumentStore'>
   <param name='DocumentStore.id' value='changes2docbook'/>
  </processor>

  <processor classname='org.cyberneko.style.parsers.XMLParser'>
   <param name='XMLParser.href' value='changes.xml'/>
  </processor>
  <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
   <param name='XSLTProcessor.style' value='changes2docbook'/>
  </processor>
  <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
   <param name='XSLTProcessor.style' value='docbook2html'/>
  </processor>
  <processor classname='org.cyberneko.style.printers.HTMLPrinter'>
   <param name='HTMLPrinter.href' value='changes.html'/>
  </processor>

 </pipeline>

</pipeline>

Notice the sub-pipeline with the export attribute value set to "false". Any document added to the cache within this pipeline does not propagate to the parent context. Specifically, the "changes2docbook.xsl" file, which is only used for the "changes.xml" file, is eligible for garbage collection after processing the sub-pipeline.

Advanced Features

A lot of useful XML processing can be achieved by following the samples in the previous section to parse documents, transform them with XSLT, and then print them to separate files. However, NekoStyle has a few features that turn it from a simple stylesheet batch processor into an extremely flexible and powerful tool. This section will teach you how to take advantage of these features.

Replacement Text in Pipeline Scripts

The Pipeline processor allows the user to perform text replacement within attributes of the pipeline elements by using properties. A property is simply a pair of name and value strings. When a pipeline is processed, every occurrence of the string "${name}", where name is the property name, is replaced by the property's value, or by the empty string if the property is not set.

An example of properties being used in a pipeline script can be seen in the data/style/transform.xml file included with NekoStyle. This pipeline provides a simple way to execute an XSLT transformation by setting the source document and the stylesheet via properties. The contents of this pipeline script are shown below:

<pipeline>
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='${xsl}'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='${xml}'/>
 </processor>
 <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
  <param name='XSLTProcessor.style' value='xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.printers.XSLTPrinter'/>
</pipeline>

Notice the references to the "xml" and "xsl" properties. By passing values for these properties to the pipeline, you can use this script to generically transform a document using XSLT and display the result on standard out. The following command line shows how to run NekoStyle to perform the same transformation as in the Transformation sample, except that the result is printed to standard out instead of being written to "index.html".

> java -jar nekostyle.jar -p xml index.xml -p xsl docbook2html.xsl transform.xml
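
Because only the first setting of a property is stored, a script can carry its own defaults in <property> elements and still be overridden from the command line. The following variant of the script is a sketch of that idea; the default values are placeholders, and it assumes that properties set with -p are registered before the script's own <property> elements are processed.

```xml
<pipeline>

 <!-- defaults: ignored for any property already set with -p -->
 <property name='xml' value='index.xml'/>
 <property name='xsl' value='docbook2html.xsl'/>

 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='${xsl}'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='${xml}'/>
 </processor>
 <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
  <param name='XSLTProcessor.style' value='xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.printers.XSLTPrinter'/>

</pipeline>
```

Run with no -p options, this version would transform "index.xml" with "docbook2html.xsl". Adding "-p xml changes.xml" to the command line would replace only the source document and leave the default stylesheet in place.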

Reducing the Verbose Syntax

By now you have probably noticed that the pipeline document format is rather verbose. You may be wondering if there's an easier way to write pipelines. And the answer is YES, there is an easier way — using XSLT stylesheets.

The Pipeline processor is written to recognize the "<?xml-stylesheet ...?>" processing instruction that is typically used to provide a stylesheet to render XML documents as HTML in web browsers. You can use this processing instruction in your top-level pipeline script in order to simplify the writing of the pipeline. In fact, this approach is used in the generation of the documentation you are reading now.
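
As an illustration, assume the Pipeline processor applies the referenced stylesheet to the script and then executes the result as the pipeline. A script could then use a compact vocabulary of its own. The <site> and <page> elements and the "site2pipeline.xsl" file name below are hypothetical, as are the document names.

```xml
<?xml-stylesheet type='text/xsl' href='site2pipeline.xsl'?>
<site>
 <page src='index.xml' dst='index.html'/>
 <page src='changes.xml' dst='changes.html'/>
</site>
```

A stylesheet along the following lines would expand each <page> element into a sub-pipeline that parses the source document and prints it as HTML:

```xml
<xsl:stylesheet version='1.0'
                xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

 <!-- the compact script becomes a full pipeline -->
 <xsl:template match='/site'>
  <pipeline>
   <xsl:apply-templates select='page'/>
  </pipeline>
 </xsl:template>

 <!-- one sub-pipeline per page; nothing is exported to the parent cache -->
 <xsl:template match='page'>
  <pipeline export='false'>
   <processor classname='org.cyberneko.style.parsers.XMLParser'>
    <param name='XMLParser.href' value='{@src}'/>
   </processor>
   <processor classname='org.cyberneko.style.printers.HTMLPrinter'>
    <param name='HTMLPrinter.href' value='{@dst}'/>
   </processor>
  </pipeline>
 </xsl:template>

</xsl:stylesheet>
```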

Dynamic Pipelines

One of the most useful features of the Pipeline processor is its ability to dynamically generate pipeline statements. To dynamically add to the pipeline processing, simply create a sub-pipeline using the <inline> element. The document returned by the last stage of this sub-pipeline is then interpreted by the main pipeline.

The following example creates a sub-pipeline that parses another XML document, called "pipeline2.xml".

<pipeline>
 <inline>
  <processor classname='org.cyberneko.style.parsers.XMLParser'>
   <param name='XMLParser.href' value='pipeline2.xml'/>
  </processor>
 </inline>
</pipeline>

Because this statement is contained within an <inline> element, however, the Pipeline processor goes on to execute the pipeline statements contained in the parsed document. Instant dynamic pipeline! The real excitement begins when you start using XSLT stylesheet transformations within the sub-pipeline: the pipeline definition itself can be generated from the output of previous processes.
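
A sketch of that idea, using only processors introduced earlier: the sub-pipeline parses a data document and transforms it, and the result of the transformation is what gets inlined. The file names "files.xml" and "files2pipeline.xsl" and the cache identifier are placeholders, and the stylesheet is assumed to produce a document whose root element is <pipeline>.

```xml
<pipeline>

 <!-- load the stylesheet that turns a file list into pipeline stages -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='files2pipeline.xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='files2pipeline'/>
 </processor>

 <!-- the output of the last stage (the transformation) is executed -->
 <inline>
  <processor classname='org.cyberneko.style.parsers.XMLParser'>
   <param name='XMLParser.href' value='files.xml'/>
  </processor>
  <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
   <param name='XSLTProcessor.style' value='files2pipeline'/>
  </processor>
 </inline>

</pipeline>
```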

NekoStyle includes a processor called DirectoryList that is very useful when used in conjunction with the dynamic pipelines feature. This processor generates a list of directory contents as an XML document. You can then transform the output from this processor in order to dynamically generate a series of new processing stages based on the contents of the directory. Neat.

Accessing Document Cache from Stylesheets

XSLT stylesheet writers frequently need to combine the contents of multiple source documents. This is done using the "document()" function within an XPath expression in the stylesheet. The XSLT processor provided with NekoStyle allows stylesheets to access documents in the cache of the local context using the same mechanism.

In order to access a document in the cache from within a stylesheet, call the "document()" function using a URI with the syntax "nekostyle:///id", where id is the identifier of the document in the cache.

The following example accesses the document with an identifier of "names" from the document cache:

<xsl:value-of select='document("nekostyle:///names")'/>
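
For that expression to resolve, a document must already be stored in the cache under the identifier "names" when the transformation runs. A pipeline along the following lines would arrange that; the file names "names.xml", "report.xsl", and "report.xml" and the "report" identifier are placeholders chosen for illustration.

```xml
<pipeline>

 <!-- put the lookup document into the cache as "names" -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='names.xml'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='names'/>
 </processor>

 <!-- load the stylesheet that calls document("nekostyle:///names") -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='report.xsl'/>
 </processor>
 <processor classname='org.cyberneko.style.core.DocumentStore'>
  <param name='DocumentStore.id' value='report'/>
 </processor>

 <!-- transform the main document; the stylesheet can now reach "names" -->
 <processor classname='org.cyberneko.style.parsers.XMLParser'>
  <param name='XMLParser.href' value='report.xml'/>
 </processor>
 <processor classname='org.cyberneko.style.processors.XSLTProcessor'>
  <param name='XSLTProcessor.style' value='report'/>
 </processor>
 <processor classname='org.cyberneko.style.printers.XMLPrinter'/>

</pipeline>
```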