Running PDFtk on AWS Lambda

AWS Lambda is a serverless computing service that lets you forget about infrastructure and focus on business logic, which you provide as a single function (hence the name Lambda). Lambda scales virtually without limit and does so cost-efficiently: you pay only for the compute time you use. It is an ideal service for running asynchronous background computations like image thumbnailing.

At Lob, we use Lambda in combination with PDFtk to generate and manipulate hundreds of thousands of PDFs every day. Getting started with PDFtk on Lambda was not straightforward, but we got it working by compiling PDFtk from source and including the binary and the GNU Compiler for Java (GCJ) shared library in our Lambda project. We’d like to share the difficulties we ran into and describe how we overcame them.

AWS Lambda allows users to run virtually any binary by including it in a project’s ZIP file. A user can simply spin up a temporary EC2 instance running Amazon Linux, build the binary, and copy it into their project. Unfortunately, Amazon Linux does not officially support PDFtk or GCJ, one of PDFtk’s dependencies. To build a compatible PDFtk binary, we compiled it on CentOS 6, a close relative of Amazon Linux [1].

Getting Started Running PDFtk on AWS Lambda

We’ve put together an example Lambda function that includes the compiled PDFtk binary so that you can skip the compiling step and get started with Lambda and PDFtk instantly: https://github.com/lob/lambda-pdftk-example.

After firing up a CentOS EC2 instance, we followed the instructions on the PDFtk website for building from source [2].

If you do this part yourself, use an instance with more than 1 GB of RAM. Otherwise you might run into errors while compiling.

Once compilation finishes, copy the resulting pdftk binary and the /usr/lib64/libgcj.so.10 shared library into your Lambda project.

To run PDFtk in Lambda, you can update the PATH and LD_LIBRARY_PATH environment variables to let the system know where to find your binary and its shared library dependency. In our example project, we did this in our project’s entry point, index.js.
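A minimal sketch of that setup, assuming the pdftk binary and libgcj.so.10 were both copied into a bin/ directory at the project root. The directory name and the configurePdftkEnv helper are illustrative and not taken from the example repository. LAMBDA_TASK_ROOT is the variable the Lambda runtime sets to the path of your function code.

```javascript
const path = require('path');

// Hypothetical helper: appends <taskRoot>/bin to PATH and puts it first in
// LD_LIBRARY_PATH so the bundled pdftk and libgcj.so.10 can be found.
// Assumes both files were copied into a bin/ directory in the project.
function configurePdftkEnv(env, taskRoot) {
  // Lambda runs on Linux, so POSIX separators are used explicitly.
  const binDir = path.posix.join(taskRoot, 'bin');
  env.PATH = env.PATH ? `${env.PATH}:${binDir}` : binDir;
  env.LD_LIBRARY_PATH = env.LD_LIBRARY_PATH
    ? `${binDir}:${env.LD_LIBRARY_PATH}`
    : binDir;
  return env;
}

// Run once at the top of index.js, before any child process is spawned.
// Outside Lambda, LAMBDA_TASK_ROOT is unset, so fall back to the working directory.
configurePdftkEnv(process.env, process.env.LAMBDA_TASK_ROOT || process.cwd());
```

Because child processes inherit process.env by default, any later call that spawns pdftk by name should pick up both variables.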

If you’d rather, you can reference the binary and shared library directly, without updating the process-wide environment variables.
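One way to do this, sketched under the assumption that pdftk and libgcj.so.10 sit together in a bin/ directory at the project root, is to pass the binary’s full path to Node’s child_process.execFile and set LD_LIBRARY_PATH only in the child’s environment. The pdftkInvocation and runPdftk helpers are hypothetical names used for illustration.

```javascript
const { execFile } = require('child_process');
const path = require('path');

// Assumed layout: <project root>/bin/pdftk and <project root>/bin/libgcj.so.10.
const BIN_DIR = path.posix.join(
  process.env.LAMBDA_TASK_ROOT || process.cwd(),
  'bin'
);

// Hypothetical helper: builds the arguments for execFile without touching
// process.env. LD_LIBRARY_PATH is overridden for the child process only.
function pdftkInvocation(args, binDir = BIN_DIR) {
  return {
    file: path.posix.join(binDir, 'pdftk'),
    args,
    options: { env: { ...process.env, LD_LIBRARY_PATH: binDir } },
  };
}

// Hypothetical helper: runs pdftk and hands (error, stdout, stderr) to callback.
function runPdftk(args, callback) {
  const { file, args: argv, options } = pdftkInvocation(args);
  return execFile(file, argv, options, callback);
}

// Example call from inside a handler:
//   runPdftk(['--version'], (err, stdout) => { /* inspect err or stdout */ });
```

Using execFile rather than exec avoids going through a shell, so file names passed as arguments are not subject to shell interpretation.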

While this post has focused specifically on running PDFtk on Lambda, you can use the CentOS-built PDFtk binary and libgcj shared library to run PDFtk on an EC2 instance running Amazon Linux.

Don’t forget to check out the example Lambda function, which provides the boilerplate required for getting started with PDFtk: https://github.com/lob/lambda-pdftk-example.

Also, Lob is hiring! If you’d like to help us solve challenging problems, apply here.