DEV Community

aromanarguello

Using AWS Lambdas + headless Chrome to Generate PDF files from HTML

This post assumes that you have basic knowledge of the Serverless Framework and AWS Lambda, and that you have already created an AWS account and initialized a Lambda function. It walks through the code used to generate a PDF file from an HTML template and user input received from a client, and then store it in an AWS S3 bucket.

Make sure that you have added your AWS credentials to serverless. You can do so from the CLI like this:

export AWS_ACCESS_KEY_ID=<your-key-here>
export AWS_SECRET_ACCESS_KEY=<your-secret-key-here>
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are now available for serverless to use
serverless deploy

# 'export' command is valid only for unix shells. In Windows - use 'set' instead of 'export'

First things first, let's create a directory:

mkdir lambdas && cd lambdas

Note:
I have this trusty function in my ~/.zshrc file that I use to create a directory and cd into it:

function mkcd {
    local target=$1
    mkdir -p "$target"
    cd "$target"
}

Let's install our dependencies:
npm install aws-sdk chrome-aws-lambda puppeteer-core

  • aws-sdk will help us connect to AWS and make use of its services, such as Lambda and S3!
  • chrome-aws-lambda & puppeteer-core will allow us to spin up a headless version of Chrome. We will use this instance to create a blank page, add the content, and store it as a PDF

** Make sure you have the serverless framework installed globally **
(npm install -g serverless)

Once inside our directory or your location of choice, we can go ahead and leverage the serverless template scaffolding to create a new handler.js using the nodejs blueprint.
serverless create --template aws-nodejs --name generatePdf

This will create 3 files for us:

  • handler.js
  • serverless.yml
  • .gitignore

The first thing I did was clean up the contents of the serverless.yml file, leaving only what was necessary to perform the desired action.

After some cleanup, my serverless.yml file looks like this:

service: pdf

frameworkVersion: "2"

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1

functions:
  handler:
    handler: handler.handler # 👈 this accesses the handler
    memorySize: 1600 # here we tell AWS how much memory to allocate
    timeout: 30
    events:
      - http:
          path: users/create # 👈 this defines the http path
          method: get
          cors: true

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/**

At the time of writing this, the handler.js file should initially look like this:

'use strict';

module.exports.hello = async event => {
  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: 'Go Serverless v1.0! Your function executed successfully!',
        input: event,
      },
      null,
      2
    ),
  };

  // Use this code if you don't use the http event with the LAMBDA-PROXY integration
  // return { message: 'Go Serverless v1.0! Your function executed successfully!', event };
};


At this point, none of this is relevant to our goal. We will go ahead and clear everything out and get to the nitty-gritty!

If you recall, in our config file we pointed the function at an export named handler that lives in handler.js:

functions:
  handler:
    handler: handler.handler # 👈 this
    memorySize: 1600
    timeout: 30

It really doesn't matter what you call your handler, just make sure that the name is the same in both your config file and your exported module.

module.exports.handler = async event => {
   // .. cleaned up content and changed name of
   // exported module from hello to handler
};

Now, let's import and initialize our S3 instance from the aws-sdk package we installed earlier.

"use strict"

const { S3 } = require("aws-sdk");

const s3 = new S3(); // 👈 initialize our instance

module.exports.handler = async event => {
   // .. cleaned up content and changed name of
   // exported module from hello to handler
};

Let's also add the rest of the utilities from chrome-aws-lambda and add some initial boilerplate code.

"use strict";

const fs = require("fs"); // used later to read the generated PDF from /tmp
const { S3 } = require("aws-sdk");
const { puppeteer, args, defaultViewport, executablePath } = require("chrome-aws-lambda");

const s3 = new S3(); // 👈 initialize our instance

module.exports.handler = async event => {
  let result = null;
  let browser = null;

  const date = new Date().toISOString(); // we will use this
                                         // to create filename

  const filename = `pdf-${date}`; // you can call this whatever you want
                                  // but make it unique or else the file
                                  // will be replaced

  const pdfPath = `/tmp/${filename}.pdf`;
  // I will pause here to further talk about tmp files
};
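A millisecond timestamp is usually unique enough, but two invocations could in principle land on the same millisecond. Here is a minimal sketch of one way to make the name more collision-resistant. The uniquePdfName helper is hypothetical (it is not part of the handler in this post); it swaps the colons and dots in the ISO timestamp for dashes and appends a short random suffix.

```javascript
const crypto = require("crypto");

// Hypothetical helper, not part of the original handler:
// builds a PDF filename that is very unlikely to collide.
function uniquePdfName(prefix = "pdf", now = new Date()) {
  // Replace ":" and "." from the ISO timestamp with "-" to keep the name simple.
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  // Four random bytes (8 hex characters) separate invocations
  // that happen within the same millisecond.
  const suffix = crypto.randomBytes(4).toString("hex");
  return `${prefix}-${stamp}-${suffix}.pdf`;
}

// Usage: the same /tmp path the handler builds, with the safer name.
const name = uniquePdfName();
const pdfPath = `/tmp/${name}`;
```

Whether you need this depends on your traffic; for a low-volume function the plain timestamp from the handler is probably fine.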

I want to make a quick note about /tmp in case you are not familiar with it. As the name implies, it is a directory for temporary files, which applications typically use to store short-lived data.

In a nutshell, /tmp is the only writable location a Lambda function gets, and it is ephemeral: it lives only as long as the execution environment, which may be reused or discarded between invocations, so nothing stored there should be expected to persist. That makes it a good scratch space for creating a PDF file before storing it in S3 (more permanent storage).
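Because a warm execution environment can be reused, files left in /tmp may still be there on the next invocation and count against the storage limit. Below is a minimal sketch of a cleanup pattern, assuming you want each temp file deleted as soon as you are done with it. The withTempFile helper is hypothetical, and os.tmpdir() is used here on the assumption that it resolves to /tmp (the usual case on Linux when TMPDIR is not set).

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: hands `work` a temp file path and
// always deletes that file afterwards, even if `work` throws.
async function withTempFile(name, work) {
  const filePath = path.join(os.tmpdir(), name);
  try {
    return await work(filePath);
  } finally {
    // Remove the file if it exists; a missing file is not an error here.
    await fs.promises.unlink(filePath).catch(err => {
      if (err.code !== "ENOENT") throw err;
    });
  }
}

// Usage sketch (inside the handler, `page` and `s3` come from the post's code):
// await withTempFile(`${filename}.pdf`, async pdfPath => {
//   await page.pdf({ format: "Letter", path: pdfPath });
//   // ...upload pdfPath to S3 here...
// });
```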

Back from our short break, let's start writing some funky logic.

// Note: fs.createReadStream below needs `const fs = require("fs");` with the other imports
module.exports.handler = async event => {
  let result = null;
  let browser = null;

  const date = new Date().toISOString(); // used to build the filename

  const filename = `pdf-${date}`;

  const pdfPath = `/tmp/${filename}.pdf`;

  try {
    console.log("Establishing connection...");
    // Initialize and launch puppeteer
    browser = await puppeteer.launch({
      args,
      defaultViewport,
      executablePath: await executablePath,
      headless: true, // 👈 Very important, remember we want to run headless chrome
      ignoreHTTPSErrors: true,
    });

    console.log("Opening new page...");
    // 👇 create a new headless chrome page
    const page = await browser.newPage();

    console.log("Generating PDF file from HTML template...");

    // 👇 Ignore this line for right now
    await page.setContent('<h1>Hello world!</h1>', { waitUntil: "networkidle2" });

    // 👇 this tells puppeteer to save the webpage as a pdf file
    await page.pdf({ format: "Letter", path: pdfPath });

    const params = {
      Key: `${filename}.pdf`, // object key in the bucket (no /tmp/ prefix)
      Body: fs.createReadStream(pdfPath),
      Bucket: "<yourS3Bucket>",
      ContentType: "application/pdf",
    };

    console.log("Uploading PDF...");

    // 👇 Pretty self explanatory but this is what uploads
    // and stores our PDF or file to S3.
    // We await the promise instead of mixing it with a callback.
    const res = await s3.upload(params).promise();
    console.log("done");
    console.log(res);

    result = await page.title();
  } catch (error) {
    console.log(error);
    throw error; // an async handler reports failure by throwing
  } finally {
    if (browser !== null) {
      console.log("Closing browser...");
      await browser.close();
    }
  }
  return result; // an async handler returns its result instead of calling cb
};

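One thing worth knowing about page.pdf: by default Chrome leaves CSS backgrounds out of the PDF, so a template that sets a background-color (like the one later in this post) may come out plain. Here is a small sketch of the options object with printBackground turned on. The buildPdfOptions helper and the margin values are my own illustration, not something the handler above requires.

```javascript
// Hypothetical helper: collects the options passed to page.pdf().
function buildPdfOptions(pdfPath) {
  return {
    path: pdfPath,         // where Chrome writes the file
    format: "Letter",      // same paper size used in the handler
    printBackground: true, // include CSS backgrounds (off by default)
    // Illustrative margins; adjust or drop them to taste.
    margin: { top: "1in", right: "1in", bottom: "1in", left: "1in" },
  };
}

// Inside the handler this would replace the inline options object:
// await page.pdf(buildPdfOptions(pdfPath));
```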

So! This is pretty much it for the handler itself, but we are still missing something very important! We are missing our HTML template, so go ahead and create a new file inside your directory called template.js (or whatever you want to call it).

Wait! .js file? But I thought we were creating an HTML template!
And yes, you are correct we are creating an HTML template, BUT because we want to be able to add dynamic fields to that template we will return the HTML content from a function. This way we can add variables and interpolate them into our HTML. JS FTW!

From your root directory, run touch template.js and enter the following contents to give you an example of how it would be done:

module.exports.template = ({ someVariable }) => {
  const today = new Date();
  return `
    <!DOCTYPE html>
    <html>
      <head>
        <meta charset="utf-8" />
        <title>PDF Result Template</title>
        <style>
          .container {
            background-color: rebeccapurple;
          }
        </style>
      </head>
      <body>
          <div class="container">
            Hello ${someVariable}!
            Today's date is: ${today}
          </div>
      </body>
    </html>
  `;
};

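Since the values interpolated into the template come from user input, it is probably a good idea to escape them before they end up in the HTML. This is a minimal sketch of that idea; escapeHtml and greeting are hypothetical names, and the escaping covers only the five characters that are significant in HTML text and attribute values.

```javascript
// Hypothetical helper: escapes the characters that have special meaning in HTML.
function escapeHtml(value) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(value).replace(/[&<>"']/g, ch => entities[ch]);
}

// Same idea as template.js, reduced to one line, with the value escaped first.
const greeting = ({ someVariable }) =>
  `<div class="container">Hello ${escapeHtml(someVariable)}!</div>`;
```

In template.js you would wrap each interpolated value the same way, for example ${escapeHtml(someVariable)}.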

Up until right now, our serverless function should be able to generate a mostly blank PDF doc with Hello World in it. That isn't dynamic enough though.

To fix this, we can import our template.js module into our function and call it from page.setContent.

Inside our handler, we had a comment that said to ignore the line below it. Well, let's give it some attention now.

In await page.setContent('<h1>Hello world!</h1>', { waitUntil: "networkidle2" }), replace the first string argument with the imported function and pass it some data.

// ... other imports
const { template } = require("./template");
                  // destructure 👇 data from event argument
module.exports.handler = async ({ data }) => {
    // ...
    await page.setContent(template({ ...data }), { waitUntil: "networkidle2" });
    // ...
};


page.setContent could be considered the main bread and butter since this is what merges our blank page and our template to render the desired outcome.

We can test if this works locally by running the following command:
serverless invoke local --function handler --data '{"data": {"someVariable": "Alejandro"}}'

Finally, you can run serverless deploy!
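Once deployed, the function sits behind the http event from serverless.yml, and the event it receives looks different from the one we pass with invoke local. With the default proxy integration, query-string values of a GET request arrive in event.queryStringParameters (null when there are none), and the function is expected to return an object with statusCode and body, like the scaffolded hello handler did. The two helpers below are a hypothetical sketch of how you might bridge both cases; the names and the wide-open CORS header are my assumptions.

```javascript
// Hypothetical helper: reads template data from a direct/local invoke
// ({ data: {...} }) or from the query string of an HTTP GET.
function extractData(event) {
  if (event && event.data && typeof event.data === "object") {
    return event.data;
  }
  // queryStringParameters is null when the request has no query string.
  return (event && event.queryStringParameters) || {};
}

// Hypothetical helper: wraps a payload in a proxy-integration response.
function toHttpResponse(statusCode, payload) {
  return {
    statusCode,
    // Assumption: any origin may call the endpoint. Tighten this for real use.
    headers: { "Access-Control-Allow-Origin": "*" },
    body: JSON.stringify(payload), // body must be a string
  };
}

// Usage sketch inside the handler:
// const data = extractData(event);
// ...generate and upload the PDF...
// return toHttpResponse(200, { location: res.Location });
```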

Thanks for reading, any feedback and/or improvements are more than welcome :)

Top comments (1)

Snir Cohen • Edited

for some reason, setContent of html with script tags does not wait for network requests to return even when I add waitUntil. any idea why?