Converting Word to PDF Using A Python-Based Lambda

Chris Murphy ・ Updated ・ 7 min read

The Mission

TL;DR or: abort mission

I was recently put on a new assignment that makes heavy use of AWS for, among other things, serverless architecture. My first task was to trigger a Lambda when documents are uploaded to an S3 bucket and convert files of varying formats to PDFs. Among the formats expected to be supported were .doc and .docx. While I knew those files are packed with metadata for use during document editing, I figured I could just scrape the document until I found ASCII characters. That was until I forced VS Code to open the file raw:
(Screenshot: the nightmare that is a raw Word doc)
The horror. Clearly, I was about to have my hands full.

Exploration

I think we can all agree that writing code to solve a problem should be a last resort, so my first thought was to leverage a (hopefully free) service to do the heavy lifting.

How about Google Docs?

I considered using Google Docs as the conversion workhorse, but a coworker who had been on the project longer informed me that Google Docs always dropped certain formatting elements, typically symbols like open parentheses. The ask from the business was that the document formatting be preserved completely, so I couldn't risk an incomplete solution.

Ok, so what else is there?

It turns out a popular strategy for converting Word documents to PDF is to use the CLI capabilities of LibreOffice. In fact, there already exists a JS library that does exactly that!

Oh! So why not use JavaScript instead of Python?

Because I felt like using Python and wanted a challenge? Forget about what I said earlier about avoiding writing code.

Ok then.

The tools

So we've established that I wanted to replicate the functionality of the JavaScript Word-to-PDF conversion library in a Python-based AWS Lambda, for valid and totally non-ego-related reasons. The first step was to pick apart the code of the aforementioned JS library to figure out how the magic happens. Let's take a look at Shelf's description of their AWS-Lambda-ified LibreOffice:

85 MB LibreOffice to fit inside AWS Lambda compressed with brotli

And sure enough, the code proves that out. It uses Google's Brotli compression algorithm to unpack a lo.tar.br file, provided by the LibreOffice Lambda Layer, into a given AWS Lambda Function's /tmp folder.

This sure seems like a lot of effort, though. Why can't we just upload an unpacked copy of what's contained in that LibreOffice Layer ourselves? Well, at this point it's time to take a dive off a technical cliff...

Constraints

It's been pretty well-established that the maximum size of a Lambda deployment package, unzipped and including layers, is 250MB no matter how you upload it. You might see that 85MB number up there and think "what's the problem, exactly?"

You read my mind, what is it?

While 85MB is indeed much smaller than 250MB, that number is a testament to how efficiently the Brotli algorithm packs its contents: uncompressed and unpacked, the package is just north of 300MB! So if we were to upload the package ourselves, we'd still have to do the work of decompressing its contents. And in that case, why not just leverage the existing LibreOffice layer? It keeps our deployment package small and reduces iteration time whenever we upload new code, which is certainly subject to change far more often than our use of LibreOffice.

You've convinced me, but how do we move forward?

As I mentioned before, the JS library unpacks LibreOffice to /tmp, and this is beneficial for two reasons:

  1. The size of /tmp is capped at 512MB, more than enough for a decompressed and unpacked instance of LibreOffice and all the fixins of a given run of a (sane) Lambda Function!
  2. The contents of /tmp are cached between runs, meaning we can add logic to reuse a previously unpacked instance of LibreOffice. Considering my testing showed the initial extraction of the program took between 10 and 12 seconds, this is a critical performance improvement for keeping Lambdas that rely on PDF conversion speedy!

The Approach

Ok finally, we get to come up with an algorithm! First let's recap what we know about how all of these pieces fit together.

  • The LibreOffice Lambda Layer, like all other Lambda Layers, dumps its contents into the /opt folder. So we know we have an /opt/lo.tar.br file with size 85MB that needs decompressing and unpacking.
  • We know that for any given run of a Lambda Function, we have 512MB of space in /tmp, so we're going to want to unpack everything there.
  • We also know that /tmp is cacheable between Lambda runs, so we're going to want to check whether a previous run of the Lambda already did the unpacking for us.
  • Finally, we know that LibreOffice has been compressed with the brotli compression algorithm. I'm going to cut the suspense short and tell you that a Python-specific implementation exists, complete with acceptable levels of documentation.

With all of this in mind, we now have enough context to port the JS code of Shelf's library to Python!

Implementation

Build Tools

Keep in mind all of this decompressing and unpacking needs to be done in the AWS Lambda Function itself, so any external tools we use (like the brotli module) must be bundled in the Function code we send up. I highly recommend checking out the juniper tool for this task: it bundles standalone versions of your dependencies along with all of your source code into a .zip file. From there it's just a matter of uploading your bundled code to AWS.
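If you're curious what a bundler like juniper is doing under the hood, the core of it can be sketched in plain Python: collect your handler plus vendored dependencies into a single zip with Lambda-friendly relative paths. (The function name and layout below are illustrative, not juniper's actual implementation.)

```python
import zipfile
from pathlib import Path

def bundle(source_dir: str, zip_path: str) -> int:
    """Zip every file under source_dir; returns the number of files added."""
    count = 0
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(source_dir).rglob('*')):
            if path.is_file():
                # Store paths relative to the bundle root, as Lambda expects
                # the handler and its dependencies at the top of the archive.
                zf.write(path, path.relative_to(source_dir))
                count += 1
    return count
```

Point it at a directory containing your handler and a `pip install <dep> -t <dir>`-style vendored copy of each dependency, and the resulting .zip is ready to upload.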

Finally, the Code

import os
from io import BytesIO
import tarfile

import brotli

LIBRE_OFFICE_INSTALL_DIR = '/tmp/instdir'

def load_libre_office():
    # Reuse a previously unpacked copy if a warm container left one behind.
    if os.path.exists(LIBRE_OFFICE_INSTALL_DIR) and os.path.isdir(LIBRE_OFFICE_INSTALL_DIR):
        print('We have a cached copy of LibreOffice, skipping extraction')
    else:
        print('No cached copy of LibreOffice exists, extracting tar stream from Brotli file...')
        buffer = BytesIO()
        with open('/opt/lo.tar.br', 'rb') as brotli_file:
            decompressor = brotli.Decompressor()
            # Decompress in 1KB chunks; a short read signals end of file.
            while True:
                chunk = brotli_file.read(1024)
                buffer.write(decompressor.process(chunk))
                if len(chunk) < 1024:
                    break
            # Rewind so tarfile reads from the start of the decompressed stream.
            buffer.seek(0)

        print('Extracting tar stream to /tmp for caching...')
        with tarfile.open(fileobj=buffer) as tar:
            tar.extractall('/tmp')
        print('Done caching LibreOffice!')

    return '{}/program/soffice'.format(LIBRE_OFFICE_INSTALL_DIR)

Breaking it down

There's a little to unpack (sorry) in the module above, so I'm going to call out some of the more interesting chunks of code:

if os.path.exists(LIBRE_OFFICE_INSTALL_DIR) and os.path.isdir(LIBRE_OFFICE_INSTALL_DIR):
    print('We have a cached copy of LibreOffice, skipping extraction')
else:

As I mentioned before, given how long it takes to decompress and unpack LibreOffice, we're going to want to reuse the efforts of previous runs of the Lambda. Each execution environment of our Lambda has /tmp to itself, and a single environment handles one request at a time, so a simple sanity check that instdir (the root of the LibreOffice program after unpacking) exists is sufficient.

buffer.seek(0)

Missing this line led me down a 20-minute rabbit trail trying to figure out why attempting to unpack the .tar file contained in buffer produced no files or folders. Make sure you set the read pointer to the beginning of the buffer if you plan on reading after writing!
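The gotcha is easy to reproduce with a bare BytesIO, no Brotli or tar involved:

```python
from io import BytesIO

buf = BytesIO()
buf.write(b'tar bytes')   # writing advances the stream position to the end
print(buf.read())         # b'' - reading from the end finds nothing,
                          # which is exactly the "no files or folders" symptom
buf.seek(0)               # rewind to the start
print(buf.read())         # b'tar bytes' - now the full contents come back
```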

with tarfile.open(fileobj=buffer) as tar:
    tar.extractall('/tmp')

You'll see here I'm leveraging tarfile's open function with a fileobj. Why not write the decompressed .tar file to the filesystem in /tmp and then open it? Well, it turns out that keeping both the packed and unpacked instances of LibreOffice around exceeds even the 512MB limit of /tmp! If you refer to the source of Shelf's Brotli unpacker library, you'll see it pipes the decompression result through a tar extractor (implying an in-memory operation), so I assume they were working around the same issue.
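You can see the fileobj trick in isolation by round-tripping a tiny tar archive entirely in memory. This is a standalone sketch, nothing Lambda-specific: the archive is written to a BytesIO, rewound, and extracted without a .tar ever touching disk.

```python
import io
import tarfile

# Write a one-file tar archive into an in-memory buffer...
buf = io.BytesIO()
payload = b'hello'
with tarfile.open(fileobj=buf, mode='w') as tar:
    info = tarfile.TarInfo(name='greeting.txt')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# ...then rewind and read it back from the same buffer.
buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    print(tar.getnames())   # ['greeting.txt']
```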

I don't code in Python for my day job too often so I might be missing out on a more pythonic way to express what's essentially the same piping operation, but it certainly gets the job done. As long as you're willing to allocate an appropriate amount of memory for your Lambda this shouldn't be a problem.
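One last piece the module above doesn't show: it only returns the path to the soffice binary, and the actual conversion is a subprocess call. Here's a sketch using LibreOffice's documented --headless and --convert-to flags; the function names are my own for illustration, not from any existing codebase.

```python
import subprocess

def build_convert_command(soffice_path, input_path, out_dir='/tmp'):
    """Assemble the headless LibreOffice conversion command."""
    return [
        soffice_path,
        '--headless',            # run without a GUI
        '--convert-to', 'pdf',   # target output format
        '--outdir', out_dir,     # LibreOffice writes <basename>.pdf here
        input_path,
    ]

def convert_to_pdf(soffice_path, input_path, out_dir='/tmp'):
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(soffice_path, input_path, out_dir),
                   check=True)
```

In the Lambda handler you'd call `convert_to_pdf(load_libre_office(), downloaded_path)` after pulling the source document out of S3, then upload the resulting PDF from /tmp.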

Wrap Up

I didn't formally performance-test this solution, but on average, with 512 MB of memory allocated to the Lambda and assuming the Lambda is using a cached copy of LibreOffice, the function converts a document in about 1 to 1.5 seconds, depending on its size.

Figuring out this approach taught me a lot about the finer points of AWS Lambda, and it ended up being a fun challenge working within the constraints of that ecosystem.

Finally, this is my first post so it should go without saying (but I'll say it anyway) that if you see a way this explanation can be improved, definitely let me know!
