DEV Community

zahaar

Generate PDFs from HTML via Puppeteer on AWS Lambda + API Gateway

“Evil cannot create anything new, they can only corrupt and ruin what good forces have invented or made.” - JRR Tolkien.

Preface

Wouldn't it be great to generate PDF files with the full power of HTML and CSS, without relying on overly complex drivers that depend on a whole bunch of C libraries?

While also supporting all the latest features of HTML5 and CSS3?

Well, we have great news. There is a library called Puppeteer that takes a relatively new Chrome feature and makes it accessible through a Node.js-based API.

Essentially, Puppeteer launches a Chromium browser instance in headless mode (without actually opening a window) and lets us drive the browser through a set of API commands: parsing websites, retrieving images, and generating PDFs as if you had opened the HTML file in the latest browser version.

While we could build a Docker image running Puppeteer and deploy it on ECS or Heroku, creating a stable and optimized image can be quite challenging.

In contrast, running it in AWS Lambda is, IMHO, much simpler in terms of development speed, debugging, and monitoring. Besides, serverless is a nice concept for a POC (you pay only for what you use).

Repo -> End Result

You can see the complete working example in this repo

License: MIT

Generate PDF document via Puppeteer running on AWS Lambda

This repo contains a serverless application that takes an HTML template and returns a PDF as a binary.

Diagram

Requirements

How to Run

  1. Clone this repo: git clone https://github.com/zahaar/generate-pdf-lambda

  2. Import the cURL request into Insomnia (Postman is not recommended, as it can't visualize PDFs).

  3. Run make api-local to start a local API GW.

  4. Send the cURL request via Insomnia.

You can also invoke the Lambda directly, bypassing API GW, by supplying an example event in a file and running make invokation-local. The response will be a base64-encoded PDF binary.

How to Deploy

A configured AWS CLI V2 is a must -> AWS Console Account && API Keys

  1. make deploy

  2. Fetch the URL value from the AWS SAM deploy output, and change the URL in Insomnia from localhost to that value.

Requirements and Prerequisites

1. Set Up a Local AWS SAM Template with the Chrome Lambda Layer

In this step we complete the setup for local SAM execution. Once this is done, we will have a solid reference point.

The end version of this step can be fetched from the 1_local-setup branch.

We can create a basic SAM template by running sam init or by following a guide,

but our end goal should be a more sophisticated structure like this:



├── Makefile
├── VERSION -- for VERSION tracking, helpful for CI
├── envs.json -- to sep envs for local execution ( if necessary )
├── events
│   └── api-gw-event.json -- an example API GW event for local execution
├── src
│   └── app.js -- main source code file
└── template.yaml -- AWS SAM configuration template



app.js contains simple code that returns the same event.body it receives from the example event.



...
...
  var response = {
    statusCode: 200,
    body: event.body,
  }

  return response
...



while template.yaml holds the resource configuration for the API GW service



...
...
  ApiGatewayApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: Staging
      BinaryMediaTypes:
        - application~1pdf  # Note the support for the binary PDF media type
...



and for the Lambda, which, given our goal, is called PdfFunction.
Take note of the Layer used in this config. By using the chrome-aws-lambda layer, we avoid having to declare puppeteer and Chrome as package.json dependencies and get them working on the image that AWS uses to run Lambdas, which can be quite challenging.



...
...
  PdfFunction:
    Type: AWS::Serverless::Function
    Description: Invoked by API Gateway on POST /pdf
    Properties:
      CodeUri: src/
      Handler: app.handler
      Runtime: nodejs12.x
      Timeout: 15
      MemorySize: 3008

      Layers:
        - !Sub 'arn:aws:lambda:${AWS::Region}:764866452798:layer:chrome-aws-lambda:22'
      Environment:
        Variables:
          EXAMPLE_ENV: 'CHANGE_THIS'
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /pdf
            Method: post
            RestApiId:
              Ref: ApiGatewayApi
...



To test that all requirements are met, let's run a local event.



make invokation-local



The output should look essentially the same as the execution logs in CloudWatch:



...
...
Mounting /Users/wparker/Dev/scheduled-website-screenshot-app/.aws-sam/build/PdfFunction as /var/task:ro,delegated inside runtime container
START RequestId: e4d7743d-5be2-4735-84c8-9d5160d9a750 Version: $LATEST
...


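If you want to craft your own example event, its shape can be sketched as below. This is a minimal sketch of an API Gateway REST proxy event; the field values are assumptions based on the /pdf path and post method from template.yaml, the repo's events/api-gw-event.json may well contain more fields, and makeApiGwEvent is a hypothetical helper name.

```javascript
// Sketch of a minimal API Gateway (REST, proxy integration) event.
// Field values are assumptions; the repo's events/api-gw-event.json may differ.
function makeApiGwEvent(html) {
  return {
    resource: '/pdf',
    path: '/pdf',
    httpMethod: 'POST',
    headers: {
      'Content-Type': 'text/html',
      Accept: 'application/pdf',
    },
    queryStringParameters: null,
    body: html,
    isBase64Encoded: false,
  }
}

// Print the event as JSON so it can be saved to a file.
console.log(JSON.stringify(makeApiGwEvent('<h1>Hello PDF</h1>'), null, 2))
```

The printed JSON could be saved to a file and passed to sam local invoke through its --event option, which is presumably what the invokation-local Makefile target wraps.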

2. Configure Puppeteer in Lambda; Supply Template HTML

The next step is to program app.js to start Puppeteer, consume HTML from an API GW event, and return a base64-encoded response that API GW decodes into binary before sending it to the client.

The end version of this step can be fetched from the 2_generate-pdf branch.

We need to change the Lambda handler code to something like this: File (the file is too long to be displayed here).

Key takeaways are:

  1. The browser launch arguments in this example are set specifically for AWS Lambda compatibility.


...
...
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
      ignoreHTTPSErrors: true,
    })


  2. The output format is set to mimic an A4 document.


...
...
    await page.setViewport({
      width: 1080,
      height: 1600,
      deviceScaleFactor: 1,
      isLandscape: true,
    })
    pdf = await page.pdf({
      format: 'a4',
      margin: {
        top: '0px',
        right: '0px',
        bottom: '0px',
        left: '0px',
      },
    })
...


  3. The response headers are set for PDF file transfer. The isBase64Encoded flag is set to true to tell API GW that it needs to decode the body.


...
...
  var response = {
    statusCode: 200,
    headers: {
      'Access-Control-Allow-Origin': '*',
      'Access-Control-Allow-Methods': 'GET, POST',
      'Content-type': 'application/pdf',
      'Content-Disposition': 'attachment; filename="foo.pdf"',
    },
    isBase64Encoded: true,
    body: pdf.toString('base64'),
  }
...


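Putting the three takeaways together, the overall handler flow can be sketched roughly as follows. This is not the repo's actual app.js: createHandler, resolveHtml and buildPdfResponse are hypothetical names, the base64 handling of the incoming body is an assumption about how API GW may deliver it, and the chromium argument is assumed to be the chrome-aws-lambda module provided by the layer.

```javascript
// Sketch of the handler flow; `createHandler`, `resolveHtml` and
// `buildPdfResponse` are hypothetical names, not taken from the repo.

// API Gateway may deliver the request body base64 encoded.
function resolveHtml(event) {
  const body = event.body || ''
  return event.isBase64Encoded
    ? Buffer.from(body, 'base64').toString('utf8')
    : body
}

// Wrap the PDF bytes in a proxy-integration response.
function buildPdfResponse(pdf) {
  return {
    statusCode: 200,
    headers: { 'Content-type': 'application/pdf' },
    isBase64Encoded: true,
    body: Buffer.from(pdf).toString('base64'),
  }
}

// `chromium` is expected to be the chrome-aws-lambda module from the layer.
function createHandler(chromium) {
  return async (event) => {
    let browser = null
    try {
      browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
      })
      const page = await browser.newPage()
      // Render the HTML string instead of navigating to a URL.
      await page.setContent(resolveHtml(event), { waitUntil: 'networkidle0' })
      const pdf = await page.pdf({ format: 'a4' })
      return buildPdfResponse(pdf)
    } finally {
      // Always release Chromium, even when rendering fails.
      if (browser !== null) {
        await browser.close()
      }
    }
  }
}

// In app.js this could be wired up as:
// exports.handler = createHandler(require('chrome-aws-lambda'))
```

Passing the chromium module in as an argument is a design choice made for this sketch: it keeps the pure helpers testable on a machine where the layer is not installed.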

To test this code, an HTML template is needed. We will use this open-source one for demonstration.
The document is sent as the request body with 'Content-Type: text/html'.

Please note the 'Accept: application/pdf' header; this is important.

The resulting cURL request is in this file.
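For those who prefer a script over Insomnia, the same request can be sketched in Node.js. This is a sketch under assumptions: the URL combines the default sam local start-api address with the /pdf path from template.yaml, buildPdfRequest and requestPdf are hypothetical names, and the global fetch requires Node.js 18 or newer on the client side. As far as I understand, API GW looks at the Accept header together with BinaryMediaTypes when deciding whether to decode the base64 body, which is why that header matters.

```javascript
// Sketch of the request; the URL is an assumption based on the default
// `sam local start-api` address and the /pdf path from template.yaml.
const LOCAL_PDF_URL = 'http://127.0.0.1:3000/pdf'

// Build fetch() options that mirror the headers described above.
function buildPdfRequest(html) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'text/html', // the body is the HTML template
      Accept: 'application/pdf', // ask API GW for the binary PDF
    },
    body: html,
  }
}

// Send the HTML and return the PDF bytes (requires Node.js 18+ for fetch).
async function requestPdf(html, url = LOCAL_PDF_URL) {
  const res = await fetch(url, buildPdfRequest(html))
  if (!res.ok) {
    throw new Error(`PDF request failed with status ${res.status}`)
  }
  return Buffer.from(await res.arrayBuffer())
}
```

With the local API running, something like requestPdf('<h1>Hello</h1>').then((pdf) => require('fs').writeFileSync('out.pdf', pdf)) would save the result to disk.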

To test our result, let's start SAM locally in start-api mode (akin to a server, in contrast to a one-time invocation).



make api-local



Import the cURL request into Insomnia.

Result


3. Deploy Lambda + API GW via SAM

Refer to the How to Deploy steps in README.md on the main branch.

Tips

1. Insomnia vs Postman

Instead of playing tricks with Postman's PDF visualization, I highly recommend switching to Insomnia.

Insomnia Visualized Response Postman Visualization Response
2. Error: Error building docker image: pull access denied for

Works fine here. You shouldn't need credentials for Public ECR (you can use auth for specific cases) but if you just want to consume it, remove the existing credentials

docker logout public.ecr.aws

and then try the build again.

That said, if you still want to make use of the…

3. Consider using a newer AWS Lambda feature: Lambda function URLs
4. Be mindful of HTTP headers when dealing with API GW, as they can lead to a lot of confusion.
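
One source of header confusion is casing: HTTP header names are case-insensitive, and clients do not always send them with the same capitalization. A defensive lookup can be sketched as below; getHeader and requestsPdf are hypothetical helpers introduced here, not part of the repo.

```javascript
// Hypothetical helper: look up a header regardless of its casing.
function getHeader(headers, name) {
  const wanted = name.toLowerCase()
  for (const [key, value] of Object.entries(headers || {})) {
    if (key.toLowerCase() === wanted) {
      return value
    }
  }
  return undefined
}

// Hypothetical helper: did the client explicitly ask for a PDF?
function requestsPdf(event) {
  const accept = getHeader(event.headers, 'Accept') || ''
  return accept.split(',').some((part) => {
    // Drop parameters such as ";q=0.9" before comparing.
    const type = part.split(';')[0].trim().toLowerCase()
    return type === 'application/pdf'
  })
}
```

A handler could log a warning when requestsPdf(event) is false, which may save some time when the client ends up showing a base64 string instead of a PDF.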
