Build a CLI for Web scraping in Ruby

#ruby #beginners #tutorial

Hello everyone,

This week, we will be building a CLI (command line interface) to scrape data from dev.to. We will filter by hashtag and return the title and author of the articles related to that hashtag. We are going to do this with just Ruby. Here is a summary of how the CLI will work:

You run it and it shows the welcome message
It asks you to enter a hashtag
Then the result is displayed to you

Time to get our hands dirty..

First, create a Ruby file. I called mine dev_to_web_scraper.rb. Open it with your code editor.

We are going to create a simple skeleton just to get our CLI going before scraping. We will create a module for the instructions and a class to handle the scraping.

# instruction module

module Instructions
  def introductions
    puts 'Welcome to dev.to webscraper. This CLI tool gathers articles based on the hashtag provided'
    puts 'If you want to quit, simply type (q) the next time you are prompted to enter a value'
    puts 'Please provide a hashtag to continue..'
    puts ''
  end

  def quit_message
    puts 'You have quit the scraper'
  end

  def invalid_entry
    puts 'Invalid entry, try again'
  end
end

# scraper class

class Scraper
  extend Instructions

  def self.get_input
    user_input = gets.chomp
    get_hashtag(user_input)
  end

  def self.get_hashtag(user_input)
    if user_input == 'q'
      quit_message
    elsif user_input.empty?
      invalid_entry
      get_input
    else
      scrape_data(user_input.to_s)
    end
  end

  def self.scrape_data(hashtag)
    puts "Scraped data for #{hashtag}"
    get_input
  end
end

Let me explain a little. I think the instruction module is pretty straightforward. We just created 3 methods that display instructions for using the scraper.

For the class, we extend it with the Instructions module, which makes the module's methods available as class methods. The class also has 3 methods. The first, get_input, reads the input from the user and passes it to the next method, get_hashtag, which decides what to do based on that input.
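To make the difference between extend and include concrete, here is a minimal sketch. The Greeter module and the two classes are hypothetical names used only for illustration; they are not part of the scraper.

```ruby
# Hypothetical module and classes, used only to contrast extend with include.
module Greeter
  def greeting
    'hello'
  end
end

class WithExtend
  extend Greeter    # greeting becomes a class method
end

class WithInclude
  include Greeter   # greeting becomes an instance method
end

puts WithExtend.greeting                  # called on the class itself
puts WithInclude.new.greeting             # called on an instance
puts WithInclude.respond_to?(:greeting)   # false: the class has no such method
```

Because Scraper only uses class methods (def self.…), extend is the one that fits here.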

As the instructions say, when the user enters q, they quit the CLI and a message is displayed. If the user enters an empty string, the invalid_entry message is displayed and they are prompted for another input. Otherwise, we pass the input to the scrape_data method. The to_s call is only a safeguard, since gets already returns a string.

This is where the action will happen, but for now it simply prints a string containing the user input and prompts for another input.
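In this skeleton the methods call each other, so every prompt adds another frame to the call stack. Also, gets returns nil at end of input (for example when the user presses Ctrl-D), and calling chomp on nil raises an error. As a possible alternative, here is a minimal sketch of the same decisions written as a loop. The read_hashtags helper and its input and output parameters are hypothetical, introduced so the logic can be exercised without a keyboard; it only collects the hashtags instead of scraping.

```ruby
require 'stringio'

# Hypothetical helper: reads lines from `input` until the user quits.
# Returns the hashtags that would have been passed on to the scraper.
def read_hashtags(input: $stdin, output: $stdout)
  hashtags = []
  loop do
    line = input.gets
    break if line.nil?             # end of input, nothing more to read
    entry = line.chomp.strip
    break if entry == 'q'          # the user asked to quit
    if entry.empty?
      output.puts 'Invalid entry, try again'
      next
    end
    hashtags << entry
  end
  output.puts 'You have quit the scraper'
  hashtags
end

# Simulated session: a tag, an empty line, another tag, then q.
fake_input = StringIO.new("ruby\n\ncareer\nq\nignored\n")
p read_hashtags(input: fake_input)
```

Passing the input and output streams as arguments is a design choice made here for testability; with the defaults the helper reads from the keyboard and writes to the terminal.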

To get it working, we need to call the introductions and the get_input.

All this is in one file. So the full file will resemble this:

# dev_to_web_scraper.rb

module Instructions
  def introductions
    puts 'Welcome to dev.to webscraper. This CLI tool gathers articles based on the hashtag provided'
    puts 'If you want to quit, simply type (q) the next time you are prompted to enter a value'
    puts 'Please provide a hashtag to continue..'
    puts ''
  end

  def quit_message
    puts 'You have quit the scraper'
  end

  def invalid_entry
    puts 'Invalid entry, try again'
  end
end


class Scraper
  extend Instructions

  def self.get_input
    user_input = gets.chomp
    get_hashtag(user_input)
  end

  def self.get_hashtag(user_input)
    if user_input == 'q'
      quit_message
    elsif user_input.empty?
      invalid_entry
      get_input
    else
      scrape_data(user_input.to_s)
    end
  end

  def self.scrape_data(hashtag)
    puts "Scraped data for #{hashtag}"
    get_input
  end
end


Scraper.introductions
Scraper.get_input

Time to run it in our console. I saved my file in the Desktop folder, so I need to cd into that folder to run my code.

$ cd Desktop
$ ruby dev_to_web_scraper.rb

And you should have this beautiful goodness

On to the not-so-hard part. We need to install some gems:

$ gem install httparty #HTTP request gem
$ gem install nokogiri #parsing gem

We also need the URL from dev.to that lets us search by hashtag. The URL is https://dev.to/t/career, where we can replace career with whatever we get from the user.
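Since the hashtag comes straight from the user, it may help to tidy it before building the URL. Here is a minimal sketch, assuming users might type a leading # or stray spaces. The tag_url helper is hypothetical and not part of the Scraper class; it uses ERB::Util.url_encode from the standard library to escape characters that are not safe in a URL path.

```ruby
require 'erb'

# Hypothetical helper: builds the dev.to tag URL from raw user input.
def tag_url(raw_input)
  tag = raw_input.strip.delete_prefix('#')   # drop spaces and a leading '#'
  "https://dev.to/t/#{ERB::Util.url_encode(tag)}"
end

puts tag_url('career')
puts tag_url(' #ruby ')
```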

Updating our file to require httparty and updating our scrape_data method:

require 'httparty' # lowercase: the gem's file name, which matters on case-sensitive file systems

...

  def self.scrape_data(hashtag)
    url = "https://dev.to/t/#{hashtag}"
    html = HTTParty.get(url)
    puts "Scraped data for #{hashtag}"
    puts html
    get_input
  end
...

Running the above will display a bunch of html. Time to turn it into something meaningful.

We are going to gather the title and author of each article into an array and print it. This is where nokogiri comes into play. To identify the title and author of an article, we need to use the browser's dev tools to find the element and its class or id, anything we can use to identify that information.

For the article title, the CSS selector I identified is h2.crayons-story__title a, and for the author it is div.crayons-story__top p. Each article is wrapped in a parent div whose CSS class is .crayons-story__body.

Next, we require nokogiri and use it to parse our HTML. Our updated code for scrape_data should be:

require 'nokogiri'

...

  def self.scrape_data(hashtag)
    url = "https://dev.to/t/#{hashtag}"
    puts 'getting data ....'
    html = HTTParty.get(url)
    response = Nokogiri::HTML(html)
    info = []
    response.css('.crayons-story__body').each do |section|
      title_and_author = section.search('h2.crayons-story__title a', 'div.crayons-story__top p')
      info.push({
          title: title_and_author[0].text.gsub(/\n/, '').strip.gsub(/\s+/, ' '), 
          author: title_and_author[1].text.gsub(/\n/, '').strip.gsub(/\s+/, ' ')
        })
    end
    puts info
    get_input
  end

First we parse the response of our HTTP call with nokogiri and save the result to the response variable. Then we create an empty array to hold our hashes.

Then we use the css method to find all the elements whose class matches .crayons-story__body. We loop through them and, within each one, search for the h2.crayons-story__title a and div.crayons-story__top p elements. The search returns a node set that we can index like an array. We call the text method on each of the 2 results, clean up the newlines and the extra spaces around and within the string, and push the result into the array. Finally, we print the array to the console.

Go ahead and run the code. We should have an array of hashes displayed in the console.
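To see what the cleanup does in isolation, here is a minimal sketch. The clean_text helper is a hypothetical name that simply wraps the same chain of calls used in scrape_data, and the input strings are made up.

```ruby
# Hypothetical helper wrapping the same chain of calls used in scrape_data.
def clean_text(raw)
  raw.gsub(/\n/, '').strip.gsub(/\s+/, ' ')
end

p clean_text("\n      Build a   CLI\n    ")   # => "Build a CLI"
p clean_text("Jane\nDoe")                     # => "JaneDoe"

# Possible alternative: split with no arguments breaks on any run of whitespace.
p "Jane\nDoe".split.join(' ')                 # => "Jane Doe"
```

Note the second case: removing a newline outright joins the words on either side of it. That only matters if the page has a line break between two words with no other space around it.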

We should be done here, but I think we still have one small issue to resolve. I would like us to handle the case where there are no articles for a hashtag. We do this by checking whether the set of matched articles is empty. If it is, we tell the user; otherwise we loop through it and display the result. Our updated code should be:

...
  def self.scrape_data(hashtag)
    url = "https://dev.to/t/#{hashtag}"
    puts 'getting data ....'
    html = HTTParty.get(url)
    response = Nokogiri::HTML(html)
    info = []
    articles = response.css('.crayons-story__body')
    if articles.empty?
      puts "No articles for hashtag: #{hashtag}"
    else
      articles.each do |section|
        title_and_author = section.search('h2.crayons-story__title a', 'div.crayons-story__top p')
        info.push({
          title: title_and_author[0].text.gsub(/\n/, '').strip.gsub(/\s+/, ' '), 
          author: title_and_author[1].text.gsub(/\n/, '').strip.gsub(/\s+/, ' ')
        })
      end
    end
    puts info
    get_input
  end
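
The final puts info prints each hash in its raw form. As a possible refinement, here is a minimal sketch that turns the array into numbered lines and covers the empty case in the same place. The format_results helper and the sample data are hypothetical.

```ruby
# Hypothetical helper: turns the info array into lines ready to print.
def format_results(info, hashtag)
  return ["No articles for hashtag: #{hashtag}"] if info.empty?

  info.each_with_index.map do |article, index|
    "#{index + 1}. #{article[:title]} (by #{article[:author]})"
  end
end

# Made-up data in the same shape that scrape_data builds.
sample = [
  { title: 'A sample post', author: 'Jane Doe' },
  { title: 'Another post', author: 'John Roe' }
]

puts format_results(sample, 'ruby')
puts format_results([], 'nosuchtag')
```

Inside scrape_data, this could stand in for puts info, for example as puts format_results(info, hashtag).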