Automating PDF Objects for Interactive Publishing
An Introduction to pj's Object Framework
The Adobe Portable Document Format (PDF) has become very common for representing documents with rich textual and graphical layouts. Many Web pages are now augmented with PDF files that contain detailed technical specifications and fancy marketing brochures. PDF provides continuity between the traditional office and the Internet, allowing complex documents to be published in unaltered condition both on the Web and on paper. Using PDF it is possible to phase in a new information system to replace existing procedures that are based on printed forms. By merging accurate electronic versions of the original forms on the fly with data that are input through a standard Web browser, users can make the transition from a paper-based process to fully electronic document management gradually, seamlessly, and without loss of data integrity.
Surprisingly, there are few inexpensive software tools for manipulating PDF files comparable to those available for processing HTML. Most of the existing tools are designed for interactive use and aren't well-suited for automated applications like dynamic document generation. This article discusses some of the issues related to understanding PDF documents. We explore the PDF format in the context of an extensible framework for modeling PDF data, rather than concentrating on parsing issues and ad hoc implementations. Finally, we introduce pj, a Java class library and object framework that can be used to make any Web site PDF-enabled.
In a Nutshell
The overall structure of a PDF file is straightforward. It consists of a sequence of objects that contain the page contents followed by a cross-reference table and a trailer (see Figure 1). The cross-reference table lists the byte offset of each object within the file, and the trailer contains some additional information about the document as a whole. Most importantly, it tells us which object is the root of the object hierarchy. The root object is the starting point for interpreting the page structure and contents of the document. At the end of the trailer, just after the word startxref, there's a byte offset that points to the starting position of the cross-reference table.
It should be apparent from this structure that a PDF file is intended to be read in a random-access fashion. A reasonable approach might be something like this:
1. Seek to the end of the file and read the startxref offset value.
2. Seek to the position given by startxref, which is the beginning of the cross-reference table. Read the cross-reference table and trailer.
3. Use the cross-reference table to seek to the root object (identified in the trailer) and read the object.
4. Traverse the object hierarchy and use the cross-reference table to look up offset positions for desired objects.
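The first step above can be sketched in Java. This is a minimal illustration rather than pj's own code: the class and method names are hypothetical, and it assumes the caller has already read the last block of the file (for example, the final kilobyte) into a byte array.

```java
import java.nio.charset.StandardCharsets;

class StartXref {
    /**
     * Returns the byte offset that follows the last "startxref" keyword
     * in the given tail of a PDF file, or -1 if none is found.
     */
    static long findStartXref(byte[] tail) {
        // ISO-8859-1 maps each byte to exactly one char, so nothing is lost in decoding.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        // Skip the whitespace (spaces, CR, LF) between the keyword and the number.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return (i > start) ? Long.parseLong(text.substring(start, i)) : -1;
    }

    public static void main(String[] args) {
        byte[] tail = "trailer\n<< /Size 9 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartXref(tail)); // prints 1234
    }
}
```

The value returned is the position to seek to in order to read the cross-reference table and trailer.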
Why is this preferable to reading the whole file sequentially? To answer this question, we have to examine the document structure of a PDF file.
Figure 2 shows a simplified view of a PDF document, organized by its object hierarchy. The root object refers to a tree of page group objects and page objects. The page objects refer to objects that contain page content, and these may in turn refer to additional objects. A preorder traversal of the page nodes of the tree lets us visit the pages sequentially. The contents of a page are described fully by the objects referenced from the page object. This means that we can process a PDF file one page at a time, or even in page fragments, if that's suitable to the application. Since PDF files can be very large, being able to load one part of a document at a time can conserve a great deal of memory. In some cases, it may save time as well. For example, a PDF viewer can immediately display a desired page rather than first having to load the entire file into memory and parse all the objects ahead of time. This can be of critical importance if the viewer is a Java applet and the PDF objects are being transmitted over a network.
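To make the preorder traversal concrete, here is a small sketch. The Node class is a hypothetical stand-in for page group and page objects and is not part of pj; a real implementation would load each node through the cross-reference table instead of holding the whole tree in memory.

```java
import java.util.ArrayList;
import java.util.List;

class PageTreeDemo {
    /** Hypothetical node: either a page group (has kids) or a page (a leaf). */
    static class Node {
        final String label;
        final boolean isPage;
        final List<Node> kids = new ArrayList<>();

        Node(String label, boolean isPage) {
            this.label = label;
            this.isPage = isPage;
        }

        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Preorder walk: appends the label of each page to out in document order. */
    static void collectPages(Node node, List<String> out) {
        if (node.isPage) {
            out.add(node.label);
            return;
        }
        for (Node kid : node.kids) {
            collectPages(kid, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root", false)
            .add(new Node("group A", false)
                .add(new Node("page 1", true))
                .add(new Node("page 2", true)))
            .add(new Node("page 3", true));
        List<String> pages = new ArrayList<>();
        collectPages(root, pages);
        System.out.println(pages); // prints [page 1, page 2, page 3]
    }
}
```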
There's one more thing to add about the basic file structure. PDF supports a feature called incremental update, which allows a PDF file to be modified any number of times by appending new data to it and without changing any of the original data. A PDF file updated in this way has new objects and cross-reference tables that override and refer to their predecessors. For this reason, we need to insert two additional steps (between steps 2 and 3) into the algorithm for reading a PDF file:
2a. If the trailer contains an offset pointer to a previous cross-reference table, seek to that position and read the table and trailer. Repeat this step until all the tables have been read.
2b. Merge all the tables into a single cross-reference table so that newer entries override older ones.
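Step 2b might look like the following sketch. It is deliberately simplified and the names are hypothetical: each table is reduced to a map from object number to byte offset, and generation numbers and free entries are ignored. The tables are assumed to arrive newest first, which is the order in which step 2a encounters them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class XrefMerge {
    /**
     * Merges cross-reference tables supplied newest first.
     * Each table maps an object number to a byte offset.
     */
    static Map<Integer, Long> merge(List<Map<Integer, Long>> newestFirst) {
        Map<Integer, Long> merged = new HashMap<>();
        for (Map<Integer, Long> table : newestFirst) {
            for (Map.Entry<Integer, Long> e : table.entrySet()) {
                // Keep the first (newest) entry seen for each object number.
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, Long> original = new HashMap<>();
        original.put(1, 15L);
        original.put(2, 80L);
        Map<Integer, Long> update = new HashMap<>();
        update.put(2, 400L); // object 2 was rewritten by an incremental update

        List<Map<Integer, Long>> tables = new ArrayList<>();
        tables.add(update);   // newest table first
        tables.add(original);
        Map<Integer, Long> merged = merge(tables);
        System.out.println(merged.get(1) + " " + merged.get(2)); // prints 15 400
    }
}
```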
Generally, incremental update is useful only if rewriting the original file is prohibitively expensive, such as with WORM hardware or in a very update-intensive environment, or if you want to maintain an audit trail of changes that have been made over time.
We have seen that the contents of a PDF document are represented as a tree of PDF objects. But what is a PDF object? An object is any one of nine object types: array, boolean, dictionary, name, null, number, reference, stream, and string (see Table 1). Five of these are simple types, and the rest (array, dictionary, stream, and reference) are composed of or include other object types. For example, an array contains a list of objects that can be of any type:
[(this is a string)
/ThisIsAName 2 0 R]
This array contains three elements: the string (this is a string); the name /ThisIsAName; and a reference to object number 2 with generation 0. In Figure 1, the reference 2 0 R could be read as, "the object labeled as 2 0 obj."
A dictionary is a sequence of key-value pairs, where the key is always a name object and is unique within the dictionary. For example:
<< /Count 2 /Kids [7 0 R 8 0 R] >>
Here the key name Count identifies the value 2, and the key name Kids identifies the array containing two object references, 7 0 R and 8 0 R.
Algorithmic analysis of the object encoding is outside the scope of this article, but it should be obvious that this format is not too difficult to take apart, even using a handwritten parser. The main complexity in the lexical analysis involves handling state changes for strings and streams, where delimiter characters have to be treated literally; and in the case of streams, they must be read according to a specified length rather than to a delimiting token. Parsing the grammar is easy once the tokens are identified, because data are always terminated by a token that indicates the meaning of the preceding data. (It is no accident that this grammar resembles the Forth language, considering PDF's origins in PostScript; however, PDF is very different from PostScript and much simpler.)
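As an illustration of the state change needed for strings, the following sketch reads one parenthesized string. The class is hypothetical and not taken from pj. It keeps balanced nested parentheses as literal text and treats the character after a backslash literally; as a simplification, it does not decode escapes such as \n or octal character codes.

```java
class LiteralStringReader {
    /**
     * Reads a PDF string that begins with '(' at index start and
     * returns its contents without the enclosing parentheses.
     */
    static String read(String input, int start) {
        if (input.charAt(start) != '(') {
            throw new IllegalArgumentException("expected '(' at index " + start);
        }
        StringBuilder out = new StringBuilder();
        int depth = 1; // number of currently open parentheses
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i++);
            if (c == '\\' && i < input.length()) {
                out.append(input.charAt(i++)); // escaped character, taken literally
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    return out.toString(); // matching close of the string
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        throw new IllegalArgumentException("unterminated string");
    }

    public static void main(String[] args) {
        System.out.println(read("[(this is a (nested) string) /ThisIsAName 2 0 R]", 1));
        // prints: this is a (nested) string
    }
}
```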
For our purposes it's less important to understand the lexical and syntactical analysis than it is to grasp the possibilities for relationships among objects within a PDF document. Almost any place in which an object is called for, an object reference can be used; the contents of the object can then be defined elsewhere. This is relevant to the question of how to model a PDF document in Java, because a page might encapsulate a substantial number of object references, all of which have to be processed and represented in some way.
Finally, we should note that the object type does not tell us what purpose the object serves (apart from whether it's a dictionary, array, and so on). For example, a page group object in the tree and a page object are both represented as dictionaries. A page group is identified by the key-value pair, /Type /Pages, while a page would have, /Type /Page.
Modeling PDF in Java
We have worked our way from a bird's-eye view of PDF to a close-up look at the inner components of PDF objects. As we start to assemble a coherent model for PDF, we can resolve some of the issues that have come up in representing the PDF document with Java objects.
The last point about differing classes of PDF types is interesting because it raises the question of how much knowledge of PDF should be considered part of the initial syntax analysis. The dictionary type (which we define as the value that corresponds to the key name, "Type") could be interpreted by the parser to indicate different types of dictionary objects. In fact, this is desirable because it lets us make our Java classes more intelligent about their own capabilities. Thus our Java Page class can be derived from a dictionary class, and page-specific methods have a natural place to reside.
The simple object types are easy to represent in Java, because they can be defined as directly corresponding to standard Java classes.
For example, a boolean PDF object can be implemented as a wrapper around the Java boolean type. Why not use boolean directly, for the sake of performance; or, if we need a wrapper, why not use the Java Boolean class? For two reasons: First, it's useful to declare our PDF classes in a uniform way, deriving from a common base class, in order to facilitate type checking (for example, by casting to the base class in a try block). Second, we need to define some special methods common to all the PDF classes (such as for PDF streaming), and we would like to be able to call such methods through a reference to the base class.
It's true that performance could be enhanced by using primitive types for objects such as boolean and number, but we would then give up the high-level object features that Java provides. By treating all PDF objects the same, as descendants of a base class, we can make it very simple to write Java programs that manipulate those objects. The benefits of this are significant, especially because the running time of a new method may be small in relation to the time required to implement the method.
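A minimal sketch of this design follows. The class names (PdfObj, PdfBool, PdfName, PdfArr) and the toPdf method are invented for illustration and are not pj's actual API; name escaping and the remaining object types are omitted.

```java
import java.util.ArrayList;
import java.util.List;

class ObjectModelDemo {
    /** Hypothetical common base class: every PDF type can write itself out. */
    static abstract class PdfObj {
        abstract String toPdf();
    }

    /** Wrapper around the Java boolean type. */
    static class PdfBool extends PdfObj {
        final boolean value;
        PdfBool(boolean value) { this.value = value; }
        String toPdf() { return value ? "true" : "false"; }
    }

    static class PdfName extends PdfObj {
        final String name;
        PdfName(String name) { this.name = name; }
        String toPdf() { return "/" + name; }
    }

    /** A complex type: holds other objects through base-class references. */
    static class PdfArr extends PdfObj {
        final List<PdfObj> items = new ArrayList<>();
        PdfArr add(PdfObj o) { items.add(o); return this; }
        String toPdf() {
            StringBuilder sb = new StringBuilder("[");
            for (int i = 0; i < items.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(items.get(i).toPdf()); // dispatched through the base class
            }
            return sb.append(']').toString();
        }
    }

    public static void main(String[] args) {
        PdfObj array = new PdfArr().add(new PdfBool(true)).add(new PdfName("ThisIsAName"));
        System.out.println(array.toPdf()); // prints [true /ThisIsAName]
    }
}
```

Because the array stores its members as base-class references, it can serialize any mixture of types without knowing which ones it contains.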
To represent the complex types, such as dictionary, we need to understand the special status of PDF reference objects. For example, take an array with two members:
[4 0 R 5 0 R]
This array contains indirect references to two objects, but the contents of those objects are not defined within the array. The references, 4 0 R and 5 0 R, are considered PDF objects themselves, and so the clearest way to represent this structure is as a list containing two reference objects. To process the list, we must call a method to resolve each reference by retrieving the associated object. If the entire PDF file wasn't loaded into memory at the start, it may be that the referenced objects still need to be read from disk. By following references in this way, we can load and parse PDF objects on demand.
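Loading on demand can be sketched as a small resolver that reads an object the first time its reference is followed and caches the result. Everything here is hypothetical: the loader function stands in for code that seeks to the object's offset and parses it, and a String stands in for the parsed object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

class LazyResolver {
    // Hypothetical loader: reads and parses the object with the given number.
    private final IntFunction<String> loader;
    private final Map<Integer, String> cache = new HashMap<>();
    int loads = 0; // counts how many times the loader was actually invoked

    LazyResolver(IntFunction<String> loader) {
        this.loader = loader;
    }

    /** Returns the object for the given object number, loading it at most once. */
    String resolve(int objectNumber) {
        String obj = cache.get(objectNumber);
        if (obj == null) {
            obj = loader.apply(objectNumber);
            cache.put(objectNumber, obj);
            loads++;
        }
        return obj;
    }

    public static void main(String[] args) {
        LazyResolver r = new LazyResolver(n -> "contents of object " + n);
        r.resolve(4);
        r.resolve(4); // second lookup is served from the cache
        r.resolve(5);
        System.out.println(r.loads); // prints 2
    }
}
```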
One problem with this approach is that it may significantly increase the number of arbitrary seeks performed on the disk if many objects must be read. In such cases, it would be better to load as many objects as possible at one time. For example, if we want to display some kind of representation of a PDF page on the screen, we know that it will be necessary to load most of the objects that are referenced either directly or indirectly by that page. To accomplish this, we would iteratively descend the object hierarchy, accumulating as many unresolved references at a time as possible and retrieving the objects they reference as a group. In the most extreme case, if we know that we need to process all of the objects within a PDF document, it may make sense to read all of them sequentially from disk in a single pass, if we're able to handle them in an arbitrary order or if we have sufficient memory to store the entire set.
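One possible way to retrieve a group of objects with fewer arbitrary seeks is to order the pending references by byte offset, so the file is read in a single forward sweep. This is our own illustration of the idea rather than a description of pj, and it assumes every pending object number has an entry in the merged cross-reference table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchPlanner {
    /**
     * Orders pending object numbers by their file offset.
     * xref maps object number to byte offset and must cover every pending number.
     */
    static List<Integer> readOrder(Collection<Integer> pending, Map<Integer, Long> xref) {
        List<Integer> order = new ArrayList<>(pending);
        order.sort((a, b) -> Long.compare(xref.get(a), xref.get(b)));
        return order;
    }

    public static void main(String[] args) {
        Map<Integer, Long> xref = new HashMap<>();
        xref.put(4, 900L);
        xref.put(5, 120L);
        xref.put(9, 450L);
        System.out.println(readOrder(Arrays.asList(4, 5, 9), xref)); // prints [5, 9, 4]
    }
}
```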
Up to this point we have referred to the structure of objects in a PDF file as a "hierarchy," and sometimes as a "tree." In fact, the objects may be viewed as being organized in various ways, but they are seen most easily as a directed acyclic graph, which is a network of objects that lacks any recursive references or loops. This closely resembles a tree, except that a child node can have multiple parents. However, there's a tree structure embedded within the graph that represents the pages of the document. So, for purposes of traversing pages, the objects may be processed as a tree. Once we begin to examine contents of the page objects, we have to be aware that there may be multiple references to a single object.
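When an object can have several parents, a traversal needs to remember what it has already seen. The sketch below uses a hypothetical adjacency map (object number to the object numbers it references) and visits each reachable object exactly once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SharedObjectWalk {
    /** Returns every object reachable from start, each listed once. */
    static List<Integer> visitOnce(int start, Map<Integer, List<Integer>> refs) {
        List<Integer> order = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int n = stack.pop();
            if (!seen.add(n)) {
                continue; // already reached through another parent
            }
            order.add(n);
            List<Integer> kids = refs.getOrDefault(n, Collections.emptyList());
            // Push in reverse so children are visited in their listed order.
            for (int i = kids.size() - 1; i >= 0; i--) {
                stack.push(kids.get(i));
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> refs = new HashMap<>();
        refs.put(1, Arrays.asList(2, 3)); // a page group with two pages
        refs.put(2, Arrays.asList(4));    // both pages reference the same object 4
        refs.put(3, Arrays.asList(4));
        System.out.println(visitOnce(1, refs)); // prints [1, 2, 4, 3]
    }
}
```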
The pj Class Library
pj is a Java class library that can be used to parse, manipulate, and write PDF files. The software is distributed freely under the terms of the GNU Library General Public License (LGPL). With an understanding of the techniques covered here, you should be able to make intelligent use of the library. Some of the more advanced subjects involving PDF, such as compression and encryption, are areas of development in pj, and readers are invited to use the source code for learning about PDF and to contribute enhancements or corrections to the public distribution. (Also see "Online")
The pj library provides certain features that can be useful for automated processing of PDF files:
- It can read a PDF file from disk, converting PDF objects to Java objects.
- It provides methods for editing and manipulating the Java objects. The entire PDF document can be represented in memory as an object structure and modified in that context.
- It can write out the Java objects in PDF format.
These functions can be used in combination to modify or extract data within existing PDF files, or to generate new files from input data. The library supports reading files that have been incrementally updated, but PDF objects that have been superseded by newer versions are not read. Since the object space is traversed nonsequentially, the library could be modified to vary or optimize the path it takes, depending on the requirements.
Table 2 shows the constructors for pj's implementation of each type of PDF object. These are essentially wrapper classes derived from a common base class, PjObject. They also know how to write their members to a PDF file through their implementation of
.Hashtable as container classes to hold other pj objects.
The Pdf class encapsulates a document as a set of PjObject descendants. The PjObjects are fully accessible, and their contents can be modified freely. However, it's best to use their class methods wherever possible rather than changing the contents directly, because this will ensure that they're valid according to the PDF standard. It's also easier to use the methods, because they handle much of the encoding within the object.
pj Applications in Java
pj lends itself to many practical applications. The most obvious ones involve automated processing of existing PDF files or creation of new ones. For example, a collection of PDF files may be updated over time, and maintaining a table of contents for the files is a tedious process. A simple program could read a listing of the files and then use pj to create a few pages containing hyperlinks to each file. These pages might then be written to a new PDF file or merged with an existing one.
As another example, it's often useful to generate a periodic report or a large set of reports all at once from an existing database. With pj these could be created as PDF files, or a preformatted document could be populated with fields from the database. In other cases the data fields are already in PDF files, and we want to make modifications or export them to be indexed with a search engine. pj can be used to identify and extract textual elements within a PDF document, since they are contained within string or stream objects. It is not difficult and would be a good exercise to develop a utility that can export all of the text from a PDF document.
Document conversion is a somewhat more difficult problem. PDF is a good intermediate encoding for richly formatted text and graphics, because it has much of the power of PostScript but also maintains some logical organization of text and other content. For example, it's easy for a search engine to directly index a highly formatted, uncompressed PDF document, since individual words are represented in plain text. Converting between complex formats is never trivial, but fortunately many companies provide conversion software for PDF. Converting between PDF and a simple format such as HTML is relatively easy, if format variations are not too critical. The main problem is figuring out where things go on the page, because PDF layouts can be complex and are not easily reproduced or even easily approximated in HTML.
Working with PDF in Java offers excellent potential for Web integration, because of the strong support for Java within popular browsers. The current method for displaying PDF is with a browser plug-in. An interesting alternative would be to use a Java applet to display individual PDF page objects supplied via a Java RMI server. Such a model could also be expanded to include collaborative editing of PDF files, requiring far less network bandwidth than exchanging whole documents. PDF is well-suited to this because of its object-oriented structure, and Java is ideal for combining a distributed object model with applets on the Web.
On the server side, PDF files can be taken apart and reassembled on the fly to provide rich output in response to browser input. A data-entry system could easily be integrated with pj to create nicely formatted PDF forms that are dynamically filled in with data submitted by the user. This model represents a completely automated and distributed path from data entry to paperless documents. (Also see "Sidebar")
Nassib is president of Etymon Systems, a research and development corporation specializing in distributed information retrieval. He can be reached at firstname.lastname@example.org.