Web scraping with Python and AWS Lambda: A modern approach

#aws #lambda #webscrapping #selenium

In December 2020, AWS added support for packaging Lambda functions as container images, a real breakthrough that lets us deploy far more complex projects with the same pay-only-for-what-you-use pricing and serverless architecture.

Web scraping workloads benefit from this upgrade because Selenium becomes much easier to install.

Let's code!

The Dockerfile below is based on the official Lambda container image for Python 3.8 (it is really awful to create this image from scratch).

# Dockerfile
FROM public.ecr.aws/lambda/python:3.8

RUN yum install -y \
    Xvfb \
    wget \
    unzip

# Install google-chrome-stable
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm && \
    yum localinstall -y google-chrome-stable_current_x86_64.rpm

# Install chromedriver (the release must be compatible with the installed Chrome version)
RUN wget https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip && \
    chmod 775 chromedriver

# Install selenium
RUN pip3 install -U pip selenium

# Copy lambda's main script
COPY app.py .

CMD ["app.lambda_handler"]

The Python script below configures Selenium with headless Chrome. Note the path of the chromedriver in the driver definition: it comes from the working directory of the base image (/var/task), where the Dockerfile unzipped the driver.

# app.py

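from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Run Chrome headless and without the sandbox
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--remote-debugging-port=9222")
chrome_options.add_argument("--no-sandbox")

# Selenium 4 takes the driver path through a Service object and the options through `options`
driver = webdriver.Chrome(service=Service("/var/task/chromedriver"), options=chrome_options)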

def lambda_handler(event, context):
    driver.get("http://www.python.org")
    return {
        "statusCode": 200,
        "body": driver.title
    }
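
The handler above always loads the same page. As a minimal sketch of how the target could instead be passed in with the invocation payload, the variant below reads a hypothetical `url` key from the event and falls back to the original address. The `url` key, the `DEFAULT_URL` constant and the `scrape_title` helper are assumptions made for illustration, not part of the original script; the helper only relies on the `get` method and `title` attribute already used above.

```python
DEFAULT_URL = "http://www.python.org"  # same page as the original handler


def scrape_title(driver, event):
    """Load the page named in the event (hypothetical "url" key) and return its title."""
    # Fall back to the default page when the event is empty or has no "url" key.
    url = (event or {}).get("url", DEFAULT_URL)
    driver.get(url)
    return {"statusCode": 200, "body": driver.title}
```

Inside app.py, `lambda_handler(event, context)` could then simply return `scrape_title(driver, event)`, and the payload sent with `-d` would pick the page to scrape.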

Finally, build and run the container image!

$ docker build -t scrapper:latest .

$ docker run -p 9000:8080 scrapper:latest

To test your new containerized web scraping Lambda function, run the following command.

$ curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

{"statusCode": 200, "body": "Welcome to Python.org"}
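
The same invocation can also be made from Python, which is handy when the test should live next to the scraper code. The sketch below uses only the standard library and the same local URL as the curl call; the names `build_request` and `invoke` and the `timeout` default are hypothetical choices for this example.

```python
import json
import urllib.request

# Local endpoint exposed by `docker run -p 9000:8080` (same URL as the curl call).
INVOKE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_request(event):
    """Build a POST request carrying the event as a JSON body."""
    data = json.dumps(event).encode("utf-8")
    return urllib.request.Request(INVOKE_URL, data=data, method="POST")


def invoke(event=None, timeout=60):
    """Send the event to the running container and decode the JSON response."""
    request = build_request(event or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, calling `invoke()` should return the handler's response as a Python dictionary.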
