Using Nokogiri and RegEx to Scan a Webpage in Ruby

#ruby #regex #scraping #nokogiri

TL;DR

We are going to use the Nokogiri gem and a regular expression to search a web article for a topic and produce a list of snippets that mention it.

Read this article I wrote to get an introduction to Nokogiri.

I'm a casual baseball fan. ESPN has lots of MLB articles. What do they talk about? It's a mystery! We need to scan the page with Nokogiri to find out. Are you ready?

Get that Article!

➜  ~ irb --simple-prompt
>> require 'nokogiri'
=> true
>> require 'open-uri'
=> true

# I'm going to scan the article at this URL.  Feel free to 
# find one of your own, since this will probably be outdated 
# by the time you read this.

>> url = 'https://www.espn.com/mlb/story/_/id/33855507/mlb-power-rankings-week-4-new-team-surged-our-no-1-spot'
=> "https://www.espn.com/mlb/story/_/id/33855507/mlb-power-rankings-week-4-...
>> web_page = Nokogiri::HTML5(URI(url))
=>
#(Document:0x5b4 {
...

Now we need to find the body of this article so we only scan the relevant portions. We'll use the #css method with a selector built from the div tag and its class name.

>> body = web_page.css('div.article-body')
=> [#<Nokogiri::XML::Element:0x5730 name="div" attributes=[#<Nokogiri::XML:...
>> text = body.text
=> "7:00 AM ETESPNFacebookTwitterFacebook MessengerPinterestEmailprintFour ...

Now for the fun part.

Regular Expression Scan

Ruby has a String#scan method that takes a Regular Expression as an argument and returns an array of all occurrences.

>> "don't do that dog!".scan(/do/)
=> ["do", "do", "do"]

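As a small aside, the pattern can do more than match a fixed fragment. Here is a quick sketch, on a made-up sentence rather than the ESPN article, of using the i flag and \w* so that scan returns each whole word:

```ruby
# Hypothetical sample sentence, not taken from the article we fetched.
sentence = "Pitchers threw 90 pitches before the pitch clock ran out."

# /i ignores case; \w* extends each match to the end of the word.
words = sentence.scan(/pitch\w*/i)
p words # => ["Pitchers", "pitches", "pitch"]
```
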
But we don't just want a list of occurrences. We want the surrounding context so that we can see what the article is saying about the topic. To get it, we need the index of each occurrence, as if our string were an array of characters. Then we'll slice from well before that index to well after it. This brings up a (maybe lesser-known) method called #to_enum (to enumerator).

The to_enum method returns an Enumerator that calls the method we name (with an optional argument) when we iterate over it. Here's an example where we get the byte value of each ASCII character in a string and print each one in binary using to_s(2).

>> "abcdefg".to_enum(:each_byte).each { |b| p b.to_s(2) }
"1100001"
"1100010"
"1100011"
"1100100"
"1100101"
"1100110"
"1100111"

For our purposes, we will pass the :scan method with the argument being our Regular Expression. Then we will map each occurrence with Regexp.last_match.begin(0) to get the beginning index for the occurrence. This is how it works.

# remember text holds the text of the article body
# each index will go into the indices variable
# we can search for whatever we want, let's search for pitch
# this will work for pitch, pitchers, pitches, etc.
>> indices = text.to_enum(:scan, /pitch/i).map do
?>   Regexp.last_match.begin(0)
>> end
=>
[1825,
...
>> indices
=>
[1825,
 3699,
 4727,
 10007,
 10127,
 10846,
 11016,
 12617,
 13734,
 14060,
 14585,
 14927,
 16019,
 17835,
 18858]

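If you don't have the article loaded, the same trick can be tried on a short string. This is only a sketch that reuses the earlier example sentence; it also shows that handing a block straight to scan appears to give the same indices, so to_enum is a convenience for chaining map rather than a requirement.

```ruby
sentence = "don't do that dog!"

# to_enum(:scan, /do/) wraps the scan call in an Enumerator.
# Inside the block, Regexp.last_match is the MatchData for the current hit,
# and begin(0) is the index where the whole match starts.
positions = sentence.to_enum(:scan, /do/).map { Regexp.last_match.begin(0) }
p positions # => [0, 6, 14]

# Passing a block directly to scan collects the same indices.
direct = []
sentence.scan(/do/) { direct << Regexp.last_match.begin(0) }
p direct # => [0, 6, 14]
```
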
Great! This list of indices tells us where to slice to get our data. For each index, we'll start the slice 30 characters before the match and make it 70 characters long. We'll push these snippets of text into an array.

>> snippets = []
=> []
>> indices.each do |i|
?>   snippets << text.slice(i - 30, 70)
>> end

>> snippets
=>
["n-differential in the majors. Pitching has mostly carried them, but th",
 "st year, Milwaukee's starting pitching was basically three deep. That ",
 "rt envious: Too many starting pitchers. Clevinger is the sixth member ",
 " allowed. While he has a five-pitch repertoire, one thing he's done th",
 "eup combo. He threw those two pitches a combined 64.3% of the time las",
 "ause his swing rate and first-pitch swing rate in particular are up. H",
 "nd him he's going to get more pitches to hit. -- Schoenfield17. Chicag",
 "2 batting line this year. The pitching staff has been one of the brigh",
 "ice start. Good, right-handed pitching will stymie them this year, tho",
 "le against both hard and soft pitching, despite dominating the league ",
 " ranks among some of the best pitchers in all of baseball in WAR. -- L",
 " back to .500. Their starting pitchers have lifted them, with Zac Gall",
 ". The Rangers did have better pitching last week, moving them up the l",
 "r nine innings in 11⅓ innings pitched. -- Lee29. Washington NationalsR",
 " Colorado will do that -- but pitching was a big problem. The Reds com"]

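One thing to watch for: if a match sits within the first 30 characters of the text, i - 30 goes negative, and slice then counts back from the end of the string. That didn't happen with this article (the first index is 1825), but here is a minimal sketch of one way to guard against it. The helper name snippet_around and the sample string are made up for illustration.

```ruby
# Hypothetical helper: clamp the start of the slice at 0 so an early match
# never turns into a negative (count-from-the-end) index.
def snippet_around(text, index, before: 30, length: 70)
  start = [index - before, 0].max
  text.slice(start, length)
end

# Made-up sample where the keyword is the very first word.
sample = "Pitching wins games, and good pitching travels well in October."

# Unclamped: 0 - 30 is -30, so the slice begins 30 characters from the end.
unclamped = sample.slice(0 - 30, 70)
# Clamped: the slice begins at the start of the string.
clamped = snippet_around(sample, 0)

puts unclamped
puts clamped
```
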
We did it! Now let's clean them up so they start and end with full words. We'll take each snippet, split it apart by whitespace, remove the first and last words (which are usually cut off), then paste it back together with spaces.

>> snippets.map do |snippet|
?>   words = snippet.split(" ")
?>   words.pop
?>   words.shift
?>   words.join(" ")
>> end
=>
["in the majors. Pitching has mostly carried them, but",
 "year, Milwaukee's starting pitching was basically three deep.",
 "envious: Too many starting pitchers. Clevinger is the sixth",
 "While he has a five-pitch repertoire, one thing he's done",
 "combo. He threw those two pitches a combined 64.3% of the time",
 "his swing rate and first-pitch swing rate in particular are up.",
 "him he's going to get more pitches to hit. -- Schoenfield17.",
 "batting line this year. The pitching staff has been one of the",
 "start. Good, right-handed pitching will stymie them this year,",
 "against both hard and soft pitching, despite dominating the",
 "among some of the best pitchers in all of baseball in WAR. --",
 "to .500. Their starting pitchers have lifted them, with Zac",
 "The Rangers did have better pitching last week, moving them up the",
 "nine innings in 11⅓ innings pitched. -- Lee29. Washington",
 "will do that -- but pitching was a big problem. The Reds"]

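To recap, the steps above (find the indices, slice around each one, trim the edge words) can be rolled into one method. This is a rough sketch rather than a polished implementation: the name keyword_snippets, its keyword arguments, and the sample text are all assumptions of mine, and it clamps the slice start at 0, which the irb session did not do.

```ruby
# Hypothetical helper combining the three steps from this article.
def keyword_snippets(text, pattern, before: 30, length: 70)
  # Step 1: starting index of every match.
  indices = text.to_enum(:scan, pattern).map { Regexp.last_match.begin(0) }

  indices.map do |i|
    # Step 2: slice around the match, never starting before index 0.
    snippet = text.slice([i - before, 0].max, length)

    # Step 3: drop the first and last words, which are usually cut off.
    words = snippet.split(" ")
    words.pop
    words.shift
    words.join(" ")
  end
end

# Made-up sample text; small window sizes keep the output short.
sample = "The starting pitchers were sharp. Relief pitching was shaky late."
results = keyword_snippets(sample, /pitch/i, before: 10, length: 25)
puts results
```

With the real article, the call would look like keyword_snippets(text, /pitch/i), where text is the article body from earlier.
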
There you have it! Hopefully this article revealed just some of the cool features of Ruby. Don't be afraid to explore new gems and look up new methods on the Ruby-Doc website.
