Marcin K.

Posted on Oct 5, 2019

Serverless web scraper in Ruby - tutorial

#ruby #scraping #aws #serverless

Imagine you have this awesome web app that will make you very rich someday. This app has some end-user tests, and you used Selenium to automate all the manual stuff that requires browser interaction.
As your app gets bigger, the end-user tests take more and more time that could be spent on something else.
You recruit more QA engineers, and they all have to configure Selenium, ChromeDriver, and a proper browser binary. This gets cumbersome and error-prone.

Why not run those tests in parallel? Why not keep them totally separate from our app? Why not keep them, configure them and run them on a separate machine?

We can do it with serverless chrome!
It's just a Chrome binary designed to be used on AWS Lambda (at the time of writing, GCP and Microsoft Azure are not yet supported).
Let's build a very simple web scraping app with it. We are going to write it in Ruby. If you prefer writing in Python, here's an article for you.

Requirements:

AWS account (and some very basic knowledge)
Ruby installed (version 2.5.x)
Serverless chrome (1.0.0-37)
Chromedriver (2.37)
Ruby gems: selenium-webdriver (I used 3.142.4) and bundler (2.0.x)

Note: Your Lambda function and S3 bucket should be created in the same region. IAM roles and users are global, so they are not tied to a region.

Create role

Create an IAM role and attach the existing AWSLambdaFullAccess policy to it. Here is a tutorial for creating roles.

Create a user with programmatic access

Create an AWS user and attach the same policy as above to it.
Here is a tutorial for adding new users.
Write down the access key ID and secret access key that you have obtained. Set them as environment variables (link) and configure them in your AWS profiles.
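
As a quick sanity check before deploying, a few lines of Ruby can confirm that the credentials are visible to your shell. This is only a sketch: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` are the standard variable names read by the AWS CLI, while the `missing_aws_variables` helper is a hypothetical name introduced for this illustration. If you rely on a profile file instead, some of these may legitimately be unset.

```ruby
# Environment variables the AWS CLI reads for credentials and region.
EXPECTED_AWS_VARIABLES = %w[
  AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY
  AWS_DEFAULT_REGION
].freeze

# Hypothetical helper: returns the names of the variables that are unset or blank.
# `env` defaults to the real environment but accepts any Hash-like object.
def missing_aws_variables(env = ENV)
  EXPECTED_AWS_VARIABLES.select { |name| env[name].to_s.strip.empty? }
end

# Usage:
#   missing = missing_aws_variables
#   warn "Missing: #{missing.join(', ')}" unless missing.empty?
```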

Create an S3 bucket

Create an AWS S3 bucket. Here is how to do it.

Create a lambda function

Now let's go to the AWS console again and create our lambda function.

When asked to enter the basic information for your function, enter any name you like and choose the Ruby 2.5 runtime.

Once it's created, go to "Basic settings" in the function view and set the memory to 512 MB and the timeout to 1 min.

Have a look at the template for our function:
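
The starter code that the Lambda console generates for a Ruby function looks roughly like the sketch below. The exact wording may differ between console versions, so treat it as an approximation. The important part is the `lambda_handler(event:, context:)` signature, which the scraper keeps.

```ruby
require 'json'

# Lambda calls this method with the invocation payload (event)
# and runtime information (context) as keyword arguments.
def lambda_handler(event:, context:)
  # TODO: implement
  { statusCode: 200, body: JSON.generate('Hello from Lambda!') }
end
```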

Also, assign the role that we have created earlier as the execution role.

Install chromedriver and serverless chrome

Let's grab serverless chrome:

wget https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip
unzip stable-headless-chromium-amazonlinux-2017-03.zip -d bin/
rm stable-headless-chromium-amazonlinux-2017-03.zip

And the chromedriver:

wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
unzip chromedriver_linux64.zip -d bin/
rm chromedriver_linux64.zip

Install selenium web driver gem

First, we need to create a Gemfile for our project with the following content:
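
A minimal Gemfile along these lines should be enough. The version pin is an assumption based on the version listed in the requirements; drop it if you want Bundler to pick a release for you.

```ruby
# Gemfile
source 'https://rubygems.org'

# Pinned to the version listed in the requirements above.
gem 'selenium-webdriver', '3.142.4'
```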

Once we have the Gemfile, we can install the required gem and its dependencies:

bundle install --path vendor/bundle

Note the location we are installing it to. We will need to include this folder in the package deployed to AWS S3.

Implement the scraper

The first thing we will need is to set up a selenium driver.
Note how we are passing paths to the binaries we have just installed.
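
A minimal sketch of that setup, assuming the binaries were unpacked into `bin/` as in the previous step and selenium-webdriver 3.142.x is installed. The method name `setup_driver` and the two path constants are illustrative choices, not something Lambda requires.

```ruby
# Paths are relative to the function root (/var/task on Lambda).
CHROME_BINARY_PATH = 'bin/headless-chromium'
CHROMEDRIVER_PATH  = 'bin/chromedriver'

# Builds a Chrome driver that uses the bundled binaries.
def setup_driver
  # Required lazily so the file can be loaded before the gem is needed.
  require 'selenium-webdriver'

  options = Selenium::WebDriver::Chrome::Options.new(binary: CHROME_BINARY_PATH)
  service = Selenium::WebDriver::Service.chrome(path: CHROMEDRIVER_PATH)
  Selenium::WebDriver.for(:chrome, service: service, options: options)
end
```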

Next, let's implement the lambda function itself.
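
A minimal sketch of such a handler, assuming a `setup_driver` method that returns a configured Selenium driver as described in the previous step. The optional `driver:` keyword is an addition made for this illustration so that a stub can be injected when testing; Lambda itself only passes `event:` and `context:`.

```ruby
require 'json'

# Searches Google for "Pizza" and returns the resulting page title.
# `driver:` defaults to a real browser; pass a stub to test without Chrome.
def lambda_handler(event:, context:, driver: setup_driver)
  driver.navigate.to 'https://www.google.com'
  search_box = driver.find_element(name: 'q')
  search_box.send_keys 'Pizza'
  search_box.submit

  { statusCode: 200, body: JSON.generate({ title: driver.title }) }
ensure
  # Always shut Chrome down, even if one of the steps above raised.
  driver&.quit
end
```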

Here I am just using the Selenium WebDriver API to type some input into google.com and return the browser title. If you would like to know the details or experiment a little with it, check out the Selenium API docs. Do not forget to tell the driver to quit at the end!

I am just going to add some additional driver options to make it more efficient:
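
One way to sketch this is a list of Chrome flags applied to the options object before the driver is created. The flags below are the ones a reader lists in the comments, plus `--disable-dev-shm-usage`, which another commenter reports was needed to get Chrome to start. Whether each flag is required for your Chrome build is something to verify yourself, and the names `CHROME_ARGUMENTS` and `apply_chrome_arguments` are hypothetical.

```ruby
# Command-line flags passed to headless Chrome.
CHROME_ARGUMENTS = %w[
  --headless
  --disable-gpu
  --window-size=1280x1696
  --disable-application-cache
  --disable-infobars
  --no-sandbox
  --hide-scrollbars
  --enable-logging
  --log-level=0
  --single-process
  --ignore-certificate-errors
  --homedir=/tmp
  --disable-dev-shm-usage
].freeze

# Adds every flag to a Selenium::WebDriver::Chrome::Options instance
# (or any object that responds to `add_argument`) and returns it.
def apply_chrome_arguments(options)
  CHROME_ARGUMENTS.each { |argument| options.add_argument(argument) }
  options
end
```

Call `apply_chrome_arguments(options)` on the options object before handing it to `Selenium::WebDriver.for`.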

Run the scraper code locally

If you're using macOS or Windows, you will need to test your code with Docker, because the binaries we downloaded are built for Linux. The good news is that there is an image that mirrors the Lambda environment, and we can use it directly. We use the --mount flag here to mount /dev/shm as read-only.

docker run --rm -v "$PWD":/var/task --mount type=tmpfs,target=/dev/shm,readonly=true lambci/lambda:ruby2.5 lambda_function.lambda_handler

Upload to lambda

Run these commands to zip our code and its dependencies, upload the archive to S3, and update our function from there.
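
The three steps can be sketched as a small Ruby script that shells out to `zip` and the AWS CLI. This is an illustration under assumptions: the handler file is named `lambda_function.rb` (as in the Docker command above), the archive name `deploy.zip` is arbitrary, and the function and bucket names are placeholders for your own. The same three commands can also be typed directly into a terminal.

```ruby
# Returns the three shell commands (as argument arrays) for a deployment.
# The keyword values are placeholders you must replace with your own.
def deploy_commands(function_name:, bucket:, archive: 'deploy.zip')
  [
    # 1. Zip the handler, the binaries and the vendored gems.
    ['zip', '-r', archive, 'lambda_function.rb', 'bin', 'vendor'],
    # 2. Upload the archive to the S3 bucket.
    ['aws', 's3', 'cp', archive, "s3://#{bucket}/#{archive}"],
    # 3. Point the function at the uploaded archive.
    ['aws', 'lambda', 'update-function-code',
     '--function-name', function_name,
     '--s3-bucket', bucket,
     '--s3-key', archive]
  ]
end

# Runs each command in order and stops at the first failure.
def deploy!(**settings)
  deploy_commands(**settings).each do |command|
    system(*command) || abort("Command failed: #{command.join(' ')}")
  end
end

# Usage (hypothetical names):
#   deploy!(function_name: 'my-scraper', bucket: 'my-scraper-bucket')
```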

We are ready to invoke our function! Do it with aws lambda invoke --function-name your_function_name output_file in your terminal, or use the "Test" button in the function view in the AWS console.

Happy scraping!

Top comments (16)

Kronos35 • Mar 11 '20

I am getting this error my dude:

Function<Selenium::WebDriver::Error::UnknownError>","errorMessage":"unknown error: Chrome failed to start: exited abnormally\n  (chrome not reachable)\n  (The process started from chrome location bin/headless-chromium is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

Marcin K. • Mar 13 '20

Hmmm I have just tried re-doing this tutorial on a new lambda function but I was not able to replicate this issue. So in Docker it works fine, but the issue appears only after upload to lambda?
What is the exact selenium-webdriver gem version that you're using?

Kronos35 • Mar 13 '20 • Edited

Hey, I used the exact same version used in this tutorial, but it looks like they changed headless chrome a little bit. Anyways whatever the case I managed to make it work by adding --disable-dev-shm-usage to the Selenium Chrome options.

You should update the tutorial to include this option.

Marcin K. • Mar 14 '20

Ok, I will. Thanks for your comment

Henry Miguel Guzmán Escorcia • May 15 '20 • Edited

My code does not work... :(

require 'json'
require 'selenium-webdriver'

def lambda_handler(event:, context:)
  setup_driver
  # driver.navigate.to 'http://www.google.com'
  # element = driver.find_element(name: 'q')
  # element.send_keys 'Pizza'
  # element.submit
  # title = driver.title
  # driver.quit
  { statusCode: 200, body: JSON.generate("Hola mundo") }
end

def setup_driver
  options = Selenium::WebDriver::Chrome::Options.new(binary: 'bin/headless-chromium')
  options.add_argument('--headless')
  options.add_argument('--disable-gpu')
  options.add_argument('--window-size=1280x1696')
  options.add_argument('--disable-application-cache')
  options.add_argument('--disable-infobars')
  options.add_argument('--no-sandbox')
  options.add_argument('--hide-scrollbars')
  options.add_argument('--enable-logging')
  options.add_argument('--log-level=0')
  options.add_argument('--single-process')
  options.add_argument('--ignore-certificate-errors')
  options.add_argument('--homedir=/tmp')
  service = Selenium::WebDriver::Service.chrome(path: 'bin/chromedriver')
  @driver = Selenium::WebDriver.for :chrome, service: service, options: options
  # @driver.manage.timeouts.implicit_wait = 30
end

START RequestId: c6fca6cd-e2e6-4782-8ca6-e10197058471 Version: $LATEST
Error raised from handler method
{
  "errorMessage": "unable to connect to chromedriver 127.0.0.1:9515",
  "errorType": "Function<Selenium::WebDriver::Error::WebDriverError>",
  "stackTrace": [
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:200:in `connect_until_stable'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:111:in `block in start'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/socket_lock.rb:41:in `locked'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:108:in `start'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:303:in `service_url'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/chrome/driver.rb:40:in `initialize'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `new'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `for'",
    "/var/task/vendor/bundle/ruby/2.7.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'",
    "/var/task/prueba.rb:30:in `setup_driver'",
    "/var/task/prueba.rb:5:in `lambda_handler'"
  ]
}
END RequestId: c6fca6cd-e2e6-4782-8ca6-e10197058471
REPORT RequestId: c6fca6cd-e2e6-4782-8ca6-e10197058471  Duration: 20236.86 ms   Billed Duration: 20300 ms   Memory Size: 512 MB Max Memory Used: 62 MB  Init Duration: 303.13 ms

source 'https://rubygems.org'
gem 'selenium-webdriver'

Chrome 81.0.4044.138
stable-headless-chromium-amazonlinux-2017-03.zip

Lambda AWS

Please help me.

Marcin K. • May 21 '20

Hi,

Sorry for the late reply.
Your code and Gemfile are ok.
It looks like you're running it on Ruby 2.7 in Lambda, and it's not compatible with this chromedriver version.
Unfortunately, chromedriver must be compatible with both your serverless chrome and your Ruby version, and it's not easy to find a match.
The easiest solution, for now, would be to downgrade to Ruby 2.5 in Lambda: just create a new Lambda function with this version.

Hoonki • Jun 10 '21

Hello. AWS plans to deprecate Ruby 2.5,
so we have to migrate from Ruby 2.5 to 2.7.
How can I find a chromedriver version compatible with Ruby 2.7? Can you send me a reference? Thank you.

activklaus • Jun 4 '21

Hi Marcin,
AWS will stop supporting Ruby 2.5 in a few weeks. Do you have any update on chromedriver compatible with Ruby 2.7?

Your article was the most helpful source for creating a scraper with selenium and ruby for AWS lambda (great work btw!).

So I was hoping you have some news about how to build the scraper with Ruby 2.7
Thanks

Kronos35 • Jun 4 '21

I am working on that as well. If you find a way to do it, send me a message, and I'll share the info I gather with you too.

Hoonki • Jun 17 '21

Hello, Kronos. Did you find the solution? I am working on that, but I don't have any solutions so far. If you find one, could you tell me about it? Thank you.

Kronos35 • Jul 22 '21

As a matter of fact, I did. I posted a short answer to a question on Stack Overflow with some guidance.
You can check my solution here:

stackoverflow.com/questions/678419...

Asha E • Jul 30 '20 • Edited

When I run this command,
docker run --rm -v "$PWD":/var/task --mount type=tmpfs,target=/dev/shm,readonly=true lambci/lambda:ruby2.5 lambda_function.lambda_handler

Init error when loading handler lambda_function.lambda_handler

"errorMessage": "Could not find childprocess-3.0.0 in any of the sources",
"errorType": "InitBundler::GemNotFound",

Used the same code and gem versions as yours

Asha E • Jul 30 '20

I had to create ruby layers
stackoverflow.com/questions/536342...

Thanks for this post

Kronos35 • Jun 2 '21 • Edited

Now that ruby 2.5 is being deprecated by the end of July it'd be useful to update this tutorial to include a compatible chromedriver binary.

Otherwise, this tutorial and all projects inspired by it would be rendered useless.

Angel Buzany • Feb 13 '21

Hi Marcin, I have a question

In the step "Install chromedriver and serverless chrome" where should I run the commands?

Kronos35 • Jun 2 '21

In your bash console. I assume this was developed on Linux, so to open your Linux console press Ctrl+Alt+T. There you could use the cd command to change to the directory you're working in and download the drivers directly there.
