Jesse vB

Using Nokogiri and RegEx to Scan a Webpage in Ruby

TL;DR

We are going to use the Nokogiri gem and a regular expression to search a web article for a topic and produce a list of snippets.

Read this article I wrote to get an introduction to Nokogiri.

I'm a casual baseball fan. ESPN has lots of MLB articles. What do they talk about? It's a mystery! We need to scan the page with Nokogiri to find out. Are you ready?

Get that Article!

  ~ irb --simple-prompt
>> require 'nokogiri'
=> true
>> require 'open-uri'
=> true

# I'm going to scan the article at this URL.  Feel free to 
# find one of your own, since this will probably be outdated 
# by the time you read this.

>> url = 'https://www.espn.com/mlb/story/_/id/33855507/mlb-power-rankings-week-4-new-team-surged-our-no-1-spot'
=> "https://www.espn.com/mlb/story/_/id/33855507/mlb-power-rankings-week-4-...
>> web_page = Nokogiri::HTML5(URI(url))
=>
#(Document:0x5b4 {
...

Now we need to find the body of the article so that we only scan the relevant portion. We'll call the #css method with a CSS selector made of the tag (div) and the class name (article-body).

>> body = web_page.css('div.article-body')
=> [#<Nokogiri::XML::Element:0x5730 name="div" attributes=[#<Nokogiri::XML:...
>> text = body.text
=> "7:00 AM ETESPNFacebookTwitterFacebook MessengerPinterestEmailprintFour ...

Now for the fun part.

Regular Expression Scan

Ruby has a String#scan method that takes a Regular Expression as an argument and returns an array of all occurrences.

>> "don't do that dog!".scan(/do/)
=> ["do", "do", "do"]
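
A few variations may help show what String#scan gives back. This is only a small sketch; the sample sentence is made up for illustration and is not from the ESPN article.

```ruby
# Sample sentence invented for this example.
sentence = "Pitchers pitch; the best pitches win. PITCHING matters."

# Without the i flag the match is case-sensitive.
case_sensitive = sentence.scan(/pitch/)

# With the i flag, upper- and mixed-case occurrences are included too.
case_insensitive = sentence.scan(/pitch/i)

# \w* extends each match to the end of the word it starts.
whole_words = sentence.scan(/pitch\w*/i)

p case_sensitive   # ["pitch", "pitch"]
p case_insensitive # ["Pitch", "pitch", "pitch", "PITCH"]
p whole_words      # ["Pitchers", "pitch", "pitches", "PITCHING"]
```

Notice that scan only hands back the matched text. It says nothing about where in the string each match was found.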

But we don't want just a list of occurrences. We want the surrounding context so that we can see what the article is saying. To get it, we need the index of each occurrence, as if our string were an array of characters. Then we will slice from a little before that index to a little after it. This brings up a (maybe lesser-known) method called #to_enum (to enumerator).

The to_enum method returns an Enumerator that will call the method we name (plus any optional arguments) on the String. Here's an example where we get the byte value of each ASCII character in a string and print each one in binary using to_s(2).

>> "abcdefg".to_enum(:each_byte).each { |b| p b.to_s(2) }
"1100001"
"1100010"
"1100011"
"1100100"
"1100101"
"1100110"
"1100111"
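
To see how the optional argument is passed along, here is a minimal sketch with short strings chosen just for illustration. The second call hands "," to each_line as its separator.

```ruby
# to_enum wraps a method call in an Enumerator without running it yet.
bytes = "abc".to_enum(:each_byte)

# Extra arguments are forwarded to the named method; here each_line
# receives "," as its line separator.
fields = "a,b,c".to_enum(:each_line, ",")

p bytes.class  # Enumerator
p bytes.to_a   # [97, 98, 99]
p fields.to_a  # ["a,", "b,", "c"]

# Enumerators respond to map, select, and the other Enumerable methods.
binary = bytes.map { |b| b.to_s(2) }
p binary       # ["1100001", "1100010", "1100011"]
```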

For our purposes, we will pass the :scan method with the argument being our Regular Expression. Then we will map each occurrence with Regexp.last_match.begin(0) to get the beginning index for the occurrence. This is how it works.

# remember text holds the text of the article body
# each index will go into the indices variable
# we can search for whatever we want, let's search for pitch
# this will work for pitch, pitchers, pitches, etc.
>> indices = text.to_enum(:scan, /pitch/i).map do |pitch|
     Regexp.last_match.begin(0)
>> end
=>
[1825,
...
>> indices
=>
[1825,
 3699,
 4727,
 10007,
 10127,
 10846,
 11016,
 12617,
 13734,
 14060,
 14585,
 14927,
 16019,
 17835,
 18858]
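
If you don't have the article handy, the same idea can be tried on a short string. This is a sketch with a made-up sentence, so the indices are easy to check by hand.

```ruby
# Made-up sentence standing in for the article text.
text = "The pitcher threw a pitch. Pitching wins."

# to_enum(:scan, pattern) yields each match in turn. Inside the block,
# Regexp.last_match is the MatchData for the match being yielded, and
# begin(0) is the index where the whole match starts.
indices = text.to_enum(:scan, /pitch/i).map { Regexp.last_match.begin(0) }
p indices  # [4, 20, 27]

# Each index points at the start of a match in the original string.
matched = indices.map { |i| text[i, 5] }
p matched  # ["pitch", "pitch", "Pitch"]
```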

Great! This list of indices tells us where to slice to get our data. Each slice will start 30 characters before the match and be 70 characters long. We'll push these snippets of text into an array.

>> snippets = []
=> []
?> indices.each do |i|
?>   snippets << text.slice(i - 30, 70)
>> end

>> snippets
=>
["n-differential in the majors. Pitching has mostly carried them, but th",
 "st year, Milwaukee's starting pitching was basically three deep. That ",
 "rt envious: Too many starting pitchers. Clevinger is the sixth member ",
 " allowed. While he has a five-pitch repertoire, one thing he's done th",
 "eup combo. He threw those two pitches a combined 64.3% of the time las",
 "ause his swing rate and first-pitch swing rate in particular are up. H",
 "nd him he's going to get more pitches to hit. -- Schoenfield17. Chicag",
 "2 batting line this year. The pitching staff has been one of the brigh",
 "ice start. Good, right-handed pitching will stymie them this year, tho",
 "le against both hard and soft pitching, despite dominating the league ",
 " ranks among some of the best pitchers in all of baseball in WAR. -- L",
 " back to .500. Their starting pitchers have lifted them, with Zac Gall",
 ". The Rangers did have better pitching last week, moving them up the l",
 "r nine innings in 11⅓ innings pitched. -- Lee29. Washington NationalsR",
 " Colorado will do that -- but pitching was a big problem. The Reds com"]
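
One caveat with slicing at i - 30: if a match sits within the first 30 characters, the start index goes negative, and String#slice counts a negative start from the end of the string. That didn't happen with this article (the first index is 1825), but here is a sketch of one way to guard against it. The helper name snippet_at and the sample sentence are hypothetical.

```ruby
# Hypothetical helper: returns up to `length` characters of context,
# starting at most `before` characters ahead of index i.
def snippet_at(text, i, before: 30, length: 70)
  start = [i - before, 0].max  # clamp so the start never goes negative
  text.slice(start, length)
end

# Made-up sentence whose match is at the very beginning.
text = "Pitching wins games, or so the old saying goes in every clubhouse."
i = text.index(/pitch/i)       # 0

wrapped = text.slice(i - 30, 70)  # negative start: taken from the end
clamped = snippet_at(text, i)     # starts at the beginning, as intended

p wrapped
p clamped
```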

We did it! Now let's clean them up so they start and end with full words. We'll take each snippet, split it apart by whitespace, remove the first and last partial words, then paste it back together with spaces.

# map returns a new array; snippets itself is left unchanged
?> snippets.map do |snippet|
?>   words = snippet.split(" ")
?>   words.pop
?>   words.shift
?>   words.join(" ")
>> end
=>
["in the majors. Pitching has mostly carried them, but",
 "year, Milwaukee's starting pitching was basically three deep.",
 "envious: Too many starting pitchers. Clevinger is the sixth",
 "While he has a five-pitch repertoire, one thing he's done",
 "combo. He threw those two pitches a combined 64.3% of the time",
 "his swing rate and first-pitch swing rate in particular are up.",
 "him he's going to get more pitches to hit. -- Schoenfield17.",
 "batting line this year. The pitching staff has been one of the",
 "start. Good, right-handed pitching will stymie them this year,",
 "against both hard and soft pitching, despite dominating the",
 "among some of the best pitchers in all of baseball in WAR. --",
 "to .500. Their starting pitchers have lifted them, with Zac",
 "The Rangers did have better pitching last week, moving them up the",
 "nine innings in 11⅓ innings pitched. -- Lee29. Washington",
 "will do that -- but pitching was a big problem. The Reds"]
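
The steps above could be rolled into a single method. What follows is only a sketch: the name context_snippets, its keyword arguments, and the sample sentence are assumptions made for illustration, and the start index is clamped at zero so that a match near the beginning of the text doesn't produce a negative slice start.

```ruby
# Hypothetical helper combining the steps: find each match, slice out
# its surrounding context, then drop the first and last words of each
# slice, since they may have been cut in half.
def context_snippets(text, pattern, before: 30, length: 70)
  indices = text.to_enum(:scan, pattern).map { Regexp.last_match.begin(0) }
  indices.map do |i|
    start = [i - before, 0].max  # avoid a negative start index
    words = text.slice(start, length).split(" ")
    words.shift                  # drop the first word
    words.pop                    # drop the last word
    words.join(" ")
  end
end

# Made-up sentence; a smaller window keeps the output short.
sample = "Last year the starting pitching was thin, but this year the pitchers are deep."
result = context_snippets(sample, /pitch/i, before: 15, length: 35)
p result  # ["the starting pitching was thin,", "year the pitchers are"]
```

Note that the first and last words are dropped even when they happen to be complete, which is why "this" and "deep." are missing from the second snippet.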

There you have it! Hopefully this article revealed just a few of the cool features of Ruby. Don't be afraid to explore new gems and look up new methods on the Ruby-Doc website.
