Raphael Noriode

Web scraping with Ruby and Watir

Ruby is a popular object-oriented programming language that is well suited to scripting as well as building complex web applications, but in this article we will focus on writing scripts, specifically web scrapers.
A quick Google search for web scraping with Ruby will bring up the Nokogiri gem, so it obviously has to be the best tool out there, right? Well, no.

Nokogiri is great, but it has a big limitation: it doesn't fill forms, which means things like authentication cannot be done with it. If you are going to scrape a dynamic website, you will most likely need to authenticate.
Enter Watir.

You can check out Watir here: https://watir.com

How to scrape Facebook :XD

  • Create a ruby file

Basically, just create any file and end its name with .rb; that is pretty much a Ruby file. Ideally, this should live in a folder of its own. Then open it in your favourite text editor. (Go VS Code)

In that folder:

  • Create a Gemfile
    Just create a file with the name 'Gemfile'

  • Copy the below code into the Gemfile

    source 'https://rubygems.org/'

    gem 'watir', '~> 6.16', '>= 6.16.5'
    gem 'headless', '~> 2.2', '>= 2.2.3'
    gem 'webdrivers', '~> 4.2'
    gem 'dotenv', '~> 2.7' # used later to load credentials from a .env file
  • Now run bundle install. This should install all of the dependencies listed above; I will explain why 'headless' and 'webdrivers' are needed.

In your Ruby file, require the needed gems:

    require 'watir'
    require 'webdrivers'

I really like to modularize my code, so we will create another Ruby file; call it login.rb.
This file will hold our authentication code.

class Login
  attr_reader :email, :password

  def initialize(user)
    @email = user[:email]
    @password = user[:pass]
    # A global browser session, so the entry-point script can keep using it
    # after logging in; headless: true runs Chrome in the background.
    $browser = Watir::Browser.new :chrome, headless: true

    $browser.goto 'http://facebook.com/'
  end

  def auth
    # Find Facebook's login form and fill in the credentials
    form = $browser.form(id: 'login_form')
    form.text_field(name: 'email').set(email)
    form.text_field(name: 'pass').set(password)
    form.button(text: 'Log In').click

    sleep(2) # give the page a moment to finish logging in
  end
end

A couple of things to note from the above code: I hid my Facebook login credentials in environment variables because, well, you are not supposed to know them, LOL.
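If you want to do the same, here is a minimal sketch of that setup; the values are placeholders, and the dotenv gem (loaded in the entry-point file below) reads a file named .env in the project root by default.

# Contents of a .env file in the project root (do not commit this file):
#   email=you@example.com
#   password=your_facebook_password

require 'dotenv'
Dotenv.load # copies the key/value pairs from .env into ENV

ENV['email']    # => "you@example.com"
ENV['password'] # => "your_facebook_password"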

Breaking down the code:
I created a Login class that initializes Watir on a Chrome instance and makes it run in the background; this is the point of headless: true. Because Watir already has its own way to start a headless browser, we don't actually need the headless gem, so feel free to delete it from the Gemfile.
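One practical aside, not part of the original setup: since the only difference is the headless: true option, you can drop it while you are still working out your selectors and watch Chrome click through the page, then put it back for the real runs.

# Visible Chrome window -- handy while debugging selectors
browser = Watir::Browser.new :chrome

# Background run, as used in the Login class above
# browser = Watir::Browser.new :chrome, headless: true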

The auth method finds the form on facebook.com whose id is login_form and the input fields in that form whose names are email and pass respectively, setting them to the email and password you passed in when creating the object. Clicking Log In then authenticates and logs your scraper into Facebook.

Now go back to the first Ruby file you created; this will be your entry point. Put in this code:

require 'watir'
require 'webdrivers'
require 'dotenv'
Dotenv.load

require_relative 'login.rb'

# Log in to Facebook with credentials loaded from the environment
user = Login.new(:pass => ENV['password'], :email => ENV['email'])
user.auth

# Collect timeline posts by their (obfuscated) Facebook class names
timeline = $browser.spans(class: ['oi732d6d', 'ik7dh3pa', 'd2edcug0', 'qv66sw1b', 'c1et5uql', 'a8c37x1j', 'muag1w35', 'enqfppq2', 'jq4qci2q', 'a3bd9o3v', 'knj5qynh', 'oo9gr5id', 'hzawbc8m'])

# Print the raw HTML inside each matching element
timeline.each do |event|
  p event.children[0].inner_html
  puts ' '
end

$browser.close

Here, we are not doing anything fancy, just scraping through our unread notification(s) and timeline and spitting out whatever we can find in plain sight. The output will still contain the HTML tags, but this is to show you what the scraper does and the trick behind web scraping: you need to study how the elements on the site you are scraping are named and classed, because that is how you interact with the website.
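To make that concrete, here is a rough illustration of Watir's locators; the id, class and CSS selector below are made up for the example, not real Facebook markup.

# Hypothetical selectors, purely for illustration
$browser.div(id: 'some_id').text                   # a single element, by id
$browser.links(class: 'some-class').map(&:href)    # URLs of every matching link
$browser.element(css: 'div.post > span').text      # a CSS selector, if you prefer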

(output screenshot)

What you choose to do with this information afterwards is up to you, but you could access each div and neatly output just the text inside it; that would be a good test.
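As a rough sketch of that test, reusing the same $browser session (the class list is shortened here for readability), Watir's text method strips the tags out for you:

timeline = $browser.spans(class: ['oi732d6d', 'ik7dh3pa']) # same idea as above, list shortened

timeline.each do |event|
  text = event.children[0].text.strip # plain text instead of raw HTML
  puts text unless text.empty?
end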
