DEV Community

Aissa Laribi

How to use Beautiful Soup in AWS Lambda for Web Scraping

The purpose of this post is to show how to use the Beautiful Soup module in AWS Lambda with the Python runtimes. Keep in mind that the Lambda Python runtimes do not ship with every module available for Python, and Beautiful Soup is one of those missing. The approach used here to make it importable is to bundle the Lambda function together with its modules, built in an isolated environment, into a single deployment package.

The idea is similar to containers: we create an isolated environment, install only the dependencies our application needs, write our code, bundle everything, and upload the bundle to the cloud.

1) Install Pipenv

Pipenv is the tool that will let us create an isolated environment for our Lambda function.

For Debian:

sudo apt update -y
sudo apt upgrade -y
sudo apt install -y python3-pip
pip3 install pipenv 

For RPM:

sudo yum update -y
sudo yum install -y python3-pip
pip3 install pipenv

2) Create a Python 3.8 environment

At the time of writing, the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is written for Python 3.8. As a result, we need an isolated Python 3.8 environment. Let's create it:

pipenv --python 3.8

Creating Python Environment

pipenv shell

Activating the virtual environment

3) Install our bs4 dependency

pip install bs4

Installing bs4

4) Write the Lambda function
Create a file named lambda_function.py. The name matters: the default Lambda handler is lambda_function.lambda_handler, so with any other file name the handler will not be found unless you change the handler setting.
Copy the code supplied at the following link, paste it into the lambda_function.py file, and save it: https://github.com/aissa-laribi/bs4-in-lambda/blob/main/lambda_function.py
Lambda function
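
The linked file is the one to use for this walkthrough. For readers who want a feel for the general shape of such a function before opening the link, here is a minimal sketch. It is not the author's code: the helper name extract_links and the sample markup are hypothetical, and it parses an inline HTML string instead of fetching a real page, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; a real function would download the page first.
SAMPLE_HTML = """
<html><body>
  <a href="http://example.com/a">First</a>
  <a href="http://example.com/b">Second</a>
  <a>No href here</a>
</body></html>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no extra dependency
    return [a["href"] for a in soup.find_all("a", href=True)]


def lambda_handler(event, context):
    """Entry point that Lambda calls as lambda_function.lambda_handler."""
    links = extract_links(SAMPLE_HTML)
    for link in links:
        print(link)  # print output goes to the function's logs
    return {"statusCode": 200, "links": links}
```

The handler takes the event and context arguments that Lambda passes to every Python function, even though this sketch ignores both.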

5) Bundle up the lambda function and the dependencies

Now it's time to copy the Lambda function next to the dependencies installed in our environment.

cp lambda_function.py ~/.local/share/virtualenvs/<yourenvname>/lib/python3.8/site-packages
cd ~/.local/share/virtualenvs/<yourenvname>/lib/python3.8/site-packages
ls

You should now see your Lambda function alongside the dependencies.

Lambda function & dependencies

Now we need to zip the whole directory:

zip -r9 bs4_in_lambda.zip *
cp bs4_in_lambda.zip ~/Desktop
cd  ~/Desktop

Zipping The Lambda Function
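
The same bundling step can be scripted. The sketch below is an alternative to the zip command rather than part of the original workflow; it uses Python's standard zipfile module, and the function name build_package and its arguments are hypothetical. It stores paths relative to the source directory, so lambda_function.py ends up at the root of the archive, just as it does when zipping from inside site-packages.

```python
import os
import zipfile


def build_package(source_dir, zip_path):
    """Zip the contents of source_dir so its files sit at the archive root."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                # Relative name keeps lambda_function.py at the top level.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path
```

Calling build_package on the site-packages directory with a target such as bs4_in_lambda.zip would produce an archive with the same layout as the command above.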

6) Upload the Zip File to your Lambda Function

Then create a Lambda function, and make sure that the Python 3.8 runtime is selected.

Creating the Lambda function

Go to Code and, in the top right corner, click on "Upload from".

Lambda function uploading zip file

Then select the zip file we have created.

Lambda function uploading zip file

The Lambda function will show up and we can see on the left side all the pip packages stored in folders.

Then go to Configuration and increase the timeout, because there are 10 pages to be scraped and the default of 3 seconds is too short.

Lambda Configuration

Let's set a 5-minute timeout to make sure it will scrape all the pages.

Lambda Runtime
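
A longer timeout only raises the ceiling; the function itself can also check how much time is left. The following sketch assumes a hypothetical scrape_page callable and a 10-second safety margin, both invented for illustration and not taken from the linked function. get_remaining_time_in_millis() is the method the Lambda context object provides for this purpose.

```python
def scrape_pages(urls, context, scrape_page, margin_ms=10_000):
    """Scrape urls in order, stopping early when the time budget runs low."""
    results = []
    for url in urls:
        # The Lambda context reports the milliseconds left before the timeout.
        if context.get_remaining_time_in_millis() < margin_ms:
            print(f"Stopping before {url}: not enough time left")
            break
        results.append(scrape_page(url))
    return results
```

Stopping early this way lets the function return the pages it did manage to scrape instead of being cut off mid-run.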

Then return to Code > Test.

Leave the test configuration content at its default, enter any name in the Event Name field, and click Save.

Configuration Test

Click on Test.

Sometimes the test will fail with an error message like this one.

Test Failing

The trick is to switch between http and https in our function.

Press Ctrl + F, scroll down, and replace all "https" with "http". Then Deploy and test the function again.

Tweak the function
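
Instead of editing the URLs by hand, the scheme swap could also be done in code. This is a sketch under the assumption that the URLs are plain strings; switch_scheme is a hypothetical helper and does not appear in the linked function.

```python
from urllib.parse import urlsplit, urlunsplit


def switch_scheme(url):
    """Return url with http and https swapped; other schemes are left alone."""
    parts = urlsplit(url)
    swapped = {"http": "https", "https": "http"}.get(parts.scheme, parts.scheme)
    return urlunsplit(parts._replace(scheme=swapped))
```

A scraper could, for example, retry a failed request once with switch_scheme(url) before giving up.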

Et voilà!
We have a list of websites.

Then go to Monitor > Logs and click on the log stream of the first invocation. A new window will open, giving you access to the full list.
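
Anything the function prints or logs is what appears in that log stream. As a sketch of the idea (the log_results name is made up for illustration and is not part of the linked function), the standard logging module can be used in place of print so that each entry carries a level:

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)  # let INFO-level messages through


def log_results(items):
    """Log each scraped item on its own line and return how many were logged."""
    for position, item in enumerate(items, start=1):
        logger.info("%d: %s", position, item)
    return len(items)
```

Numbering the entries makes it easier to confirm in the log stream that every scraped item was written.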

