Wondering how you can get Puppeteer to work properly on AWS Lambda?
You’re in the right place! In this post, we’ll cover the main challenges you can encounter along the way. But first, let’s start by introducing both Puppeteer and AWS Lambda.
What is Puppeteer?
Simply put, Puppeteer is a tool for controlling a (headless) browser. It’s open-source software developed and supported by Google’s developer tools team. It lets you simulate user interaction with a browser through a simple API, which is very helpful for things like automated tests or web scraping.
A picture’s worth a thousand words. How much is a gif worth? With a little bit of code shown in the gif below, I can log in to a Google account. You simply need to click, enter text, paginate, and scrape all the publicly available data you need.
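As a rough illustration of that API, here is a minimal sketch of a "type, click, scrape" flow. The helper name `searchAndScrape` and every selector passed to it are hypothetical; the `page` argument is assumed to be a Puppeteer `Page` object, and `goto`, `type`, `click`, `waitForSelector`, and `$$eval` are its standard methods.

```javascript
// Hypothetical helper: fills in a search form and collects the result texts.
// `page` is assumed to be a Puppeteer Page; all selectors are placeholders.
async function searchAndScrape(page, { url, inputSelector, submitSelector, resultSelector, query }) {
  await page.goto(url);                        // open the page
  await page.type(inputSelector, query);       // enter text like a user would
  await page.click(submitSelector);            // click the submit button
  await page.waitForSelector(resultSelector);  // wait until results are rendered

  // The callback runs inside the browser: map each matching element to its text.
  return page.$$eval(resultSelector, (nodes) => nodes.map((n) => n.textContent.trim()));
}
```

Taking the page as a parameter keeps the scraping logic separate from how the browser is launched, which matters later when the launch code has to change for Lambda.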
What is AWS Lambda?
AWS Lambda is a service that Amazon describes as a way to “Run code without thinking about servers or clusters.” You simply create a function on Lambda and then execute it. It’s that easy.
Simply put, you can do everything on AWS Lambda. Okay, everything is a strong word, but almost. For example, you can scrape thousands of public web pages every night with AWS Lambda functions and have them insert the results into a database.
Getting started with AWS Lambda is simple and inexpensive. You only pay for what you use, and there is also a generous free tier.
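For a sense of what "creating a function" means in practice, here is a minimal sketch of a Node.js Lambda handler. The `url` field on the event and the shape of the response are assumptions made for illustration; a real function would do its scraping or database work where the comment indicates.

```javascript
// Minimal Lambda-style handler. Lambda calls the exported function with the
// invocation payload (`event`) and uses the resolved value as the result.
const handler = async (event) => {
  // Hypothetical input field: the page this invocation should process.
  const url = (event && event.url) || 'https://example.com';

  // ... scraping or database work would go here ...

  return {
    statusCode: 200,
    body: JSON.stringify({ processed: url }),
  };
};

module.exports = { handler };
```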
Problem #1 – Puppeteer is too big to push to Lambda
AWS Lambda has a 50 MB limit on the zip file you upload directly to it. Because it downloads Chromium on install, the Puppeteer package is significantly larger than that. However, this 50 MB limit doesn’t apply when you load the function from S3! See the documentation here.
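To make the size limits concrete, here is a small sketch that decides how a package has to be deployed. The function name `chooseUploadPath` is hypothetical. It assumes the 50 MB direct-upload limit mentioned above and Lambda's 250 MB limit on the unzipped package; check the current Lambda quotas before relying on either number.

```javascript
const MB = 1024 * 1024;
const DIRECT_UPLOAD_LIMIT = 50 * MB; // zipped package, direct upload only
const UNZIPPED_LIMIT = 250 * MB;     // unzipped package, applies via S3 as well

// Returns 'direct', 's3', or 'too-large' for a package of the given sizes.
function chooseUploadPath(zippedBytes, unzippedBytes) {
  if (unzippedBytes > UNZIPPED_LIMIT) return 'too-large'; // S3 does not help here
  if (zippedBytes > DIRECT_UPLOAD_LIMIT) return 's3';     // too big for direct upload
  return 'direct';
}

console.log(chooseUploadPath(80 * MB, 200 * MB)); // prints "s3"
```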
Lambda also enforces a 250 MB limit on the unzipped package, and that one still applies when you deploy from S3. To get around the 50 MB limit, we create a bucket in S3, use npm scripts to zip the function and upload it to the bucket, and then update our Lambda code from that bucket. The scripts look something like this:
"zip": "npm run build && 7z a -r function.zip ./dist/* node_modules/",
"sendToLambda": "npm run zip && aws s3 cp function.zip s3://chrome-aws && rm function.zip && aws lambda update-function-code --function-name puppeteer-examples --s3-bucket chrome-aws --s3-key function.zip"
Problem #2 – Puppeteer on AWS Lambda doesn’t work
Out of the box, the Linux environment that AWS Lambda runs on doesn’t include the shared libraries that the Chromium bundled with Puppeteer needs to run.
Fortunately, there is already a package of Chromium built for AWS Lambda. You can find it here. You will need to install it, along with puppeteer-core, in the function you are sending to Lambda.
The regular Puppeteer package is not needed and, if installed, counts against your 250 MB limit.
npm i --save chrome-aws-lambda puppeteer-core
Then, after requiring chrome-aws-lambda as chromium, launching a browser from Puppeteer looks like this:
const browser = await chromium.puppeteer.launch({
  args: chromium.args, // launch flags provided by chrome-aws-lambda
  executablePath: await chromium.executablePath,
});
Puppeteer requires more memory than a regular script, so keep an eye on your max memory usage. When using Puppeteer, we recommend configuring at least 512 MB on your AWS Lambda function. Also, don’t forget to run await browser.close() at the end of your script. Otherwise, your function may keep running until it times out for no reason, because the browser is still alive and waiting for commands.
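One way to make sure the browser is always closed is to put the call in a `finally` block. The following is a minimal sketch, not a definitive implementation: `createHandler` is a hypothetical factory, the `url` event field is an assumption, and `chromium` is expected to be the chrome-aws-lambda module (passed in as a parameter so the handler can be tried locally with a stub).

```javascript
// `chromium` is expected to be the chrome-aws-lambda module.
function createHandler(chromium) {
  return async (event) => {
    let browser = null;
    try {
      browser = await chromium.puppeteer.launch({
        args: chromium.args,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
      });
      const page = await browser.newPage();
      await page.goto(event.url); // `url` is a hypothetical event field
      return { title: await page.title() };
    } finally {
      // Runs on success and on error, so the invocation never idles until timeout.
      if (browser !== null) await browser.close();
    }
  };
}
```

In the deployed file, this would be wired up with something like `exports.handler = createHandler(require('chrome-aws-lambda'));`.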