Performance Analysis for Apache FOP's new Intermediate Format

2009-02-13, Jeremias Märki

Introduction

During the last few months, I've implemented a new Intermediate Format (IF) for Apache FOP. The main goal was to have a much more performant conversion from the intermediate XML format to the final output format. More information about the motivations and technical documentation can be found on the following Wiki page:

http://wiki.apache.org/xmlgraphics-fop/AreaTreeIntermediateXml/NewDesign

This document visualizes some data gathered to compare the old and the new approach. It also provides some hard numbers on FOP's general performance, although it cannot paint a complete picture of all aspects of Apache FOP. It does give some ideas, though. In short: the goals have been met. So let's look at the results in detail...

Benchmark Environment

The benchmark tests have been run on a workstation with an Intel Core 2 Duo E6600 processor (dual-core, 2.4 GHz), running MS Windows XP Professional 32-bit SP3. The following Java VMs were used:

The 6.0_14 is an early-access version which became available this week. I read that it contains several improvements that partly came out of the OpenJDK project, so I wanted to see the differences. This just serves my curiosity. ;-)

Please note that all tests were run single-threaded. When the Server VM is used, the parallel GC is activated, which benefits the results on a dual-core processor. For the Client VMs, I was seeing slightly over 50% CPU usage. FOP itself is single-threaded for best-effort J2EE compatibility. This underlines the importance of being able to run FOP concurrently to increase throughput in high-volume environments.
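Running FOP concurrently means rendering several documents in parallel, one per thread. Here's a minimal sketch of that pattern using a standard thread pool; renderDocument is a hypothetical stand-in for one real FO-to-output conversion, not FOP's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRendering {

    // Placeholder for a real single-threaded FOP run (FO in, PDF out).
    static String renderDocument(String foName) {
        return foName + ".pdf";
    }

    public static void main(String[] args) throws Exception {
        // One rendering thread per available core.
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> results = new ArrayList<>();
        for (String fo : new String[] {"doc1.fo", "doc2.fo", "doc3.fo"}) {
            // Each document is rendered independently on its own thread.
            results.add(pool.submit(() -> renderDocument(fo)));
        }
        for (Future<String> r : results) {
            System.out.println(r.get()); // doc1.pdf, doc2.pdf, doc3.pdf
        }
        pool.shutdown();
    }
}
```

Each conversion stays single-threaded internally; the throughput gain comes purely from processing independent documents side by side.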

All tests are run 4 times before a measurement is taken, in order to filter out class loading and just-in-time compilation (i.e. the JVM warmup).
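The measurement methodology can be sketched as a small harness: run the task a few times untimed, then time one run and convert it to pages per minute. This is an illustration of the approach described above, not the actual benchmark code; the busy-loop workload is a stand-in for a real FOP conversion.

```java
public class WarmupBenchmark {
    static final int WARMUP_RUNS = 4;

    /** Runs the task WARMUP_RUNS times untimed, then measures pages per minute. */
    static double measurePpm(Runnable conversion, int pages) {
        for (int i = 0; i < WARMUP_RUNS; i++) {
            conversion.run(); // untimed: lets class loading and JIT compilation settle
        }
        long start = System.nanoTime();
        conversion.run();
        double seconds = (System.nanoTime() - start) / 1e9;
        return pages / seconds * 60.0; // pages per minute
    }

    public static void main(String[] args) {
        // Stand-in workload instead of a real FO-to-PDF run.
        double ppm = measurePpm(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
        }, 300);
        System.out.println(ppm > 0);
    }
}
```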

I've only measured AFP, PDF and PostScript output. I've considered adding PCL and TIFF, too, but it proved too time-consuming since both formats are currently much slower than the other three. Contrary to my assumptions, the bad performance for PCL and TIFF doesn't come from serializing and compressing bitmap data, but to a large part from the inefficient use of Java2D font metrics. There's potential for dramatic performance improvements there, but that was out of scope for now, especially since the problem has existed for a long time.

Scenario 1: FOP's readme.fo

FOP's readme.fo example contains 10 pages (9 pages if AFP font metrics are used). In this scenario, the FO is replicated 30 times, resulting in 300 (270 for AFP) pages in total. Given that AFP output has a different number of pages (due to different page breaking decisions), those numbers have to be treated with care. The example contains just text with some lists and some links. The links are ignored by PS and AFP output, so PDF output has slightly more to do.


The values are in pages per minute (ppm), so higher bars mean better performance. Here's what the names in the legend mean:

direct

Direct rendering from FO to the target format using a Renderer.

direct-via-if

Direct rendering from FO to the target format, but going via the new IFDocumentHandler/IFPainter implementations.

from-at

Converting from Area Tree XML to the target format using a Renderer.

from-at-via-if

Converting from Area Tree XML to the target format, but going via the new IFDocumentHandler/IFPainter implementations.

from-if

Converting from the new Intermediate Format to the target format.

to-atxml

Creating Area Tree XML from FO.

to-if

Creating Intermediate Format from FO.

The most important bar here is the green one, of course. It shows that with plain FO content, the new IF is 2 to 3 times faster than the old Area Tree XML (the red bar). The two gray bars are just provided for completeness and don't contribute much to the entire picture. However, it can be seen that the new IF can be produced more quickly since it's less verbose.

The “direct-via-if” and “from-at-via-if” bars show the performance relative to the old Renderer-based implementation. Maintaining two different implementations for the same output format will be a burden in the long run. So this helps decide whether it's worth keeping the old implementations or if they can be retired in favor of the new implementations once they are stable. At least in this example, the new implementations are faster than their Renderer-counterparts for PDF and PS output. Only for AFP output, the performance is slightly worse. For PostScript, there's a particularly high improvement over the old Renderer because I was able to optimize text production.

It's also interesting to see that AFP output is slightly faster than PDF and PostScript output. That can be attributed to its compact record-based format. However, from what I've learned while working on the code, there's still room for improvement. For PostScript and AFP, it needs to be noted that the output file is produced in two passes. AFP writes a second file in parallel with the main file, containing document-level resources. The two files are simply concatenated at the end. PostScript reparses the generated PostScript file (using the DSC parser in XML Graphics Commons) and adds any required resources (fonts, images etc.) to the prolog. This is a bit slower than the mere file concatenation of AFP, but still very fast.
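The AFP-style concatenation step can be sketched in a few lines: write the resource file and the page data separately, then stream one after the other into the final output. File names and the on-disk layout here are illustrative, not FOP's actual temp-file handling.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConcatFiles {

    /** Appends the resource file and the main page-data file into one output. */
    static void concatenate(Path resources, Path main, Path out) throws IOException {
        try (OutputStream os = Files.newOutputStream(out)) {
            Files.copy(resources, os); // document-level resources come first
            Files.copy(main, os);      // then the page data
        }
    }

    public static void main(String[] args) throws IOException {
        Path res = Files.createTempFile("resources", ".afp");
        Path main = Files.createTempFile("main", ".afp");
        Path out = Files.createTempFile("out", ".afp");
        Files.write(res, "RESOURCES;".getBytes());
        Files.write(main, "PAGES;".getBytes());
        concatenate(res, main, out);
        System.out.println(new String(Files.readAllBytes(out))); // RESOURCES;PAGES;
    }
}
```

Because this is a plain byte-level copy, it is much cheaper than the PostScript path, which has to reparse the generated file before it can insert the resources.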

Scenario 2: The XSLT 1.0 specification

In this scenario the XSLT 1.0 specification is read from the original XML file. An XSLT transformation does the conversion to XSL-FO. The document contains mostly text, but also contains one SVG, one PNG and one JPEG image (all three between 2 and 3 KB) on the first page. The document results in 100 pages (for all output formats) in 3 page-sequences. There's a bookmark tree in PDF and a number of links.


The reduced improvement compared to the previous scenario can be explained by the need to fire up Apache Batik and by the image processing as such. We will see more of that below. Still, the new IF provides twice the performance. However, documents like this are not typically used in conjunction with the Intermediate Format; that is mostly the domain of business documents. So let's take a look at another set of examples.

Scenario 3a: “Pension Report” (with charts as SVG)

This scenario uses an FO file resulting in 4 pages of text with bordered and borderless tables and two charts in SVG format (embedded using fo:instream-foreign-object). The FO file is replicated 30 times, so the whole job contains 120 pages in the end.


In this scenario, the SVG files have a larger impact because the amount of text is relatively small and the charts are processed every time they are encountered. The latter is the normal case in personalized business documents. That's why the improvement over the old Area Tree XML melts away to a degree.

I'm not sure why the new implementations are faster in “direct” mode but slower when working off the Area Tree XML. It could be related to the way the test is built.

Scenario 3b: “Pension Report” (with charts as JPEG)

This is the same scenario as 3a but the charts are provided as external JPEG files.


Performance for PDF and PS has improved a lot compared to 3a, as JPEG images can be embedded without decompressing them. The same can be done in AFP, but not all AFP/IPDS environments support embedded JPEG files, which is why compatibility settings are used in this example: all bitmap images are converted to bi-level (1-bit) images. This is currently implemented in a rather inefficient way, which explains the bad performance. There are already TODO entries in the code for that.
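To illustrate the kind of work the bi-level fallback involves, here is a minimal Java2D sketch that redraws a bitmap into a 1-bit image. FOP's actual AFP code path differs; this only shows why converting every image on every page costs noticeably more than passing JPEG bytes through untouched.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class BiLevelConversion {

    /** Redraws an arbitrary image into a 1-bit (bi-level) image. */
    static BufferedImage toBiLevel(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
            src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, null); // Java2D reduces the pixels to 1 bit each
        g.dispose();
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage rgb = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        BufferedImage mono = toBiLevel(rgb);
        System.out.println(mono.getColorModel().getPixelSize()); // 1
    }
}
```

The per-pixel decode-and-redraw step is exactly what a pass-through embedding avoids, which matches the gap between the PDF/PS and AFP bars in this scenario.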

Note: the two charts are embedded in the PDF and PS output exactly once. I didn't invest the time to insert a different chart image for each replication of the document, which would be the more realistic case. The same applies to the following scenario...

Scenario 3c: “Pension Report” (with charts as PNG)

This is the same scenario as 3b but the charts are provided as external PNG files.


The performance for AFP is the same as before, but for PDF and PS output it is apparent that the need to decompress and re-compress the images reduces performance.

File Sizes

I've recently revisited the text production parts of the new output implementations. In particular, PostScript output was very slow and produced big files. Granted, PostScript will always produce bigger files than PDF or AFP: PDF has compression and AFP uses compact binary records. Smaller files usually mean increased performance. So let's see how the file sizes improved.





Of course, the improvements in the PSPainter could also be backported to the PSRenderer but then we're back to the question about maintaining two implementations.

File sizes for AFP output could be improved by implementing compression schemes (such as CCITT Group 4 which should be usable in all environments) for bitmap images. Currently, the images are uncompressed.

JVM Performance

While we're at it, let's take a look at the influence of the JVM in use on the performance. Let's start with scenario 1, looking at direct formatting from FO to the output format using the new implementations.


As you can see, Sun did a very good job improving HotSpot over time, so go with a more modern JVM if possible. And something else: never, never, NEVER forget to turn on the Server VM if you're running FOP on a server. Otherwise you're just giving away a lot of performance. For a single document from the command line, the Client VM is the better choice as it has a faster startup time. The Server VM only gains speed after some time.
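For reference, a server-side invocation along the lines used in these tests might look like this; the classpath and file names are placeholders, and the GC flag reflects the parallel GC setting mentioned earlier:

java -server -XX:+UseParallelGC -cp fop.jar:lib/* org.apache.fop.cli.Main \
     -fo input.fo -pdf output.pdf

The -server switch selects the Server VM, and -XX:+UseParallelGC enables the parallel garbage collector that benefits from the second core.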

The picture looks similar if you look at the “to-atxml” or “to-if” case, so FOP's layout engine profits a lot from a good JVM.

BTW, if you wonder which JVM was used for the previous charts: it was the Sun Java 6.0_11 Server VM.

Let's look at a different use case:


That's a somewhat different picture. The focus here is on generating output content, i.e. more I/O and less processing: the previous example included the layout engine and this one doesn't. Still, a more modern JVM helps a lot.

Note: for this particular chart it might have been better to enlarge the test cases, since execution times per test could drop below 1 second, which might affect accuracy.

Conclusion

Mission accomplished, IMO. The new intermediate format provides the promised performance increase. Of course, the new IF can't improve performance for non-FO content such as images, but there are other ways to improve performance in that area (like using page segments in AFP or PostScript forms). There's also still some room for improvement in the way images are processed (especially for AFP output). If and how quickly we retire the old renderers is subject to discussion in the project. At least this document provides some data for an informed decision.

Another thing that can be taken out of this document: Apache FOP is a good tool even for high-volume environments, but you need to distribute the formatting to different cores and machines. The new Intermediate Format provides an easy way to concatenate the various pre-formatted documents and makes it possible to quickly produce print files for high-volume, personalized output. Of course, you can also concatenate at the level of the final output format (which can be even faster), but if you need to support different output formats from within the same product, the intermediate format only requires one implementation.