Ben Greenberg

Posted on Apr 30, 2021 • Edited on May 2, 2021

49 Days of Ruby: Day 35 - Web Scraping

#ruby

Welcome to day 35 of the 49 Days of Ruby! 🎉

Now that we know a bit about HTTP and making HTTP requests in Ruby, today we'll discuss how to use that knowledge to scrape the web!

Web scraping means writing code that fetches a page from the web and extracts the content you want from it. It is an alternative to using APIs (more about that tomorrow) and is often used when there is no API available.

tl;dr Today's resources come from this excellent blog post by Sylwia, a DEV community member, and friend:

Sylwia Vargas


Making the HTTP Request

If you recall from yesterday, we made HTTP requests using the net/http library. Today, we will use open-uri, which is also part of Ruby's standard library:

require "open-uri"

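html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")

The above example looks a lot like our fetching of the blog post yesterday, except even more condensed. Note that we call URI.open rather than the bare open method, which no longer accepts URLs in Ruby 3.0 and later. The variable html now holds an IO-like object containing the HTML of the Wikipedia page for Douglas Adams. Call html.read if you need it as a string.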
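Requests can fail, so it can help to wrap the fetch in a small method. The following is a minimal sketch, not part of the original post: fetch_html is a hypothetical helper name, and returning nil on failure is just one possible design choice.

```ruby
require "open-uri"

# Hypothetical helper (an assumption for illustration): fetch a page and
# return its body as a String, or nil if the request fails.
def fetch_html(url)
  # URI.open yields an IO-like object; &:read returns its contents as a
  # String, and the block form closes the object afterwards.
  URI.open(url, &:read)
rescue OpenURI::HTTPError => e
  # Raised for error responses from the server, such as 404 or 500.
  warn "Request failed: #{e.message}"
  nil
rescue SocketError, SystemCallError, Timeout::Error => e
  # DNS failures, refused connections, timeouts and similar problems.
  warn "Network error: #{e.message}"
  nil
end
```

Calling fetch_html("https://en.wikipedia.org/wiki/Douglas_Adams") would give us the page as a string, or nil if something went wrong along the way.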

Our next step is to parse that HTML.

Parsing the HTML

A popular gem for parsing HTML is Nokogiri. The gem is very powerful, and because of that, it can get complex quickly as you build more intricate applications.

In our case, we will try to pare down our usage of it:

require "nokogiri"
response = Nokogiri::HTML(html)

The response variable now contains a Nokogiri::HTML::Document object. It represents the page as a tree of nested nodes (elements, attributes, and text) that we can search and traverse.

We now have our HTML in a structure that we can scrape some data from.

Scrape Away

For our example, we'll get just the main body text for Douglas Adams.

We do that by finding some kind of identifier on the Wikipedia page that we can utilize. HTML is the language that one creates websites in. Another language, which we are not discussing but need to mention, is CSS. CSS is the language that one styles websites in. CSS targets parts of a page with selectors (a tag name, a class, or an id), and we can use those same selectors to identify the part we want to scrape.

In the case of the Wikipedia page, the body text lives inside p (paragraph) tags. We can call the Nokogiri #css method with "p" as the selector, and then #text on the result, to get just the text:

text = response.css("p").text

Now, if you inspect text, you will see it contains the text of every paragraph on the Douglas Adams page, joined into one long string. You've successfully scraped a site!

If you want to read more about this, I highly recommend Sylwia's post. She goes into far more detail than our format allows. Continue to share your learnings with the community using the hashtag #49daysofruby!

Come back tomorrow for the next installment of 49 Days of Ruby! You can join the conversation on Twitter with the hashtag #49daysofruby.
