Adam Smolenski

Web Scraping with Ruby Part Two

This is part two of the walkthrough.

Last time
Last time we covered some of the simpler categories. You can see part one at https://dev.to/amsmo/web-scrape-episode-information-walk-through-part-one-1423


Here we are going more in depth, but we will start off simple to get back into it. The plot summary at the top of the page requires following a link one level deeper, but if you scroll down you see the full synopsis without having to load another page. When scraping many episodes this saves us time, so let's use that one.

<div class="inline canwrap">
            <p>
                <span>    The Pinky and The Brain who live in the future travel back in time to give their present-dwelling counterparts a book on how to take over the world. The present pair travel to the future and find themselves in a world where cockroaches and are ruling the planet. After being captured, they escape to their capsule, which takes them safely back to the lab. But when Pinky leaves the book in the capsule, the craft vanishes, leaving the jet-lagged mice pondering how to take over the world yet again.</span>
                <em class="nobr">Written by
<a href="/search/title?plot_author=Anonymous&amp;view=simple&amp;sort=alpha&amp;ref_=tt_stry_pl">Anonymous</a></em>            </p>
        </div>

Highlighting the area, we see that it sits in a class called canwrap. Let's try that. Again, to select a class you put "." before its name.

narf.get_text(parsed_episode.css('.canwrap'))
Hmm, the results it gives are the following:

=> ["The Pinky and The Brain who live in the future travel back in time to give their present-dwelling counterparts a book on how to take over the world. The present pair travel to the future and find themselves in a world where cockroaches and are ruling the planet. After being captured, they escape to their capsule, which takes them safely back to the lab. But when Pinky leaves the book in the capsule, the craft vanishes, leaving the jet-lagged mice pondering how to take over the world yet again.\n                Written by\nAnonymous",
 "Plot Keywords:\n friendship\n                        |\n friends who live together\n            | See All (2) »",
 "Genres:\n Animation |\n Comedy |\n Family |\n Sci-Fi"]

A bit too much. Let's be more specific and take one step further in. You can do that with the ">" CSS selector (the child combinator) followed by the child you are looking for. Let's try .canwrap > p. Well, that still gives us the "Written by" line. One more selector: .canwrap > p > span. That worked, giving us an array of one item.

We can then try something different to expand our practice of Nokogiri. When you know exactly where you want to be, you can use .at instead of .css; it returns the first matching element rather than an array. So here we use parsed_episode.at('.canwrap > p > span').text. It returns the string with leading whitespace: " The Pinky and The Brain who...". Throw a .strip on the end and you get what we want. We won't use our get_text function because that is meant for arrays, where we are unsure whether we want more info. So the function will look something like this:

def plot
   parsed_episode.at('.canwrap > p  > span').text.strip
end

the plot thickens
Let's start tackling some harder things now. Let's look at the cast. All the data is in a table, let's see what we get from that.

Looking at the CSS selector td > a gives the result " Maurice LaMarche\nThe BrainQueen Roach's Aid Rob Paulsen\nPinky Tress MacNeille\n Frank Welker\n"... We lose non-recurring characters since they don't have a link. So that's not helpful. Also, how can we figure out where an actor ends and a character begins? Let's try something more specific, maybe one query for actors and another for characters...
Let's have fun with this one. Let's guide ourselves off the photos, gather the characters to the right as a separate array, and join them later. The photos carry the actor's name in their title attribute. Let's navigate to the photos with parsed_episode.css('.primary_photo > a > img')
That will give us an array, so let's look at the first item, which gives us the following information:

=> #(Element:0x3ff3fd8e31c8 {
  name = "img",
  attributes = [
    #(Attr:0x3ff3fd8e3178 { name = "height", value = "44" }),
    #(Attr:0x3ff3fd8e3164 { name = "width", value = "32" }),
    #(Attr:0x3ff3fd8e3150 { name = "alt", value = "Maurice LaMarche" }),
    #(Attr:0x3ff3fd8e313c { name = "title", value = "Maurice LaMarche" }),

There's more, but we've got the actor's name in it! Great. Since this is a Nokogiri element instance, the top two items (name, attributes) can be obtained with a method call.
So if we call .attributes we step into a hash that looks like this:

=> {"height"=>#(Attr:0x3ff3fd8e3178 { name = "height", value = "44" }),
 "width"=>#(Attr:0x3ff3fd8e3164 { name = "width", value = "32" }),
 "alt"=>#(Attr:0x3ff3fd8e3150 { name = "alt", value = "Maurice LaMarche" }),
 "title"=>#(Attr:0x3ff3fd8e313c { name = "title", value = "Maurice LaMarche" }),

Now we just need to get into title, and look there... another Attribute! So let's put the whole chain together: parsed_episode.css('.primary_photo > a > img')[0].attributes["title"].value gives us just "Maurice LaMarche". Let's write a function that gets all of them.

def actors
    parsed_episode.css('.primary_photo > a  > img').map {|inner| inner.attributes["title"].value}
end

That gives an array of actors, great. That's what we want, so we can join them with the characters in order.


So let's try td.character, that seems reasonable. The first item in the array results in "\n The Brain / \n Queen Roach's Aid \n \n \n (voice)\n \n \n " Wow, that's a mess. Let's try stripping...
"The Brain / \n Queen Roach's Aid \n \n \n (voice)" Still not great, and we want to keep this as a single element of the array for now so we can pair the characters with the actors. Right... Ruby has regex built in. Let's change all those line breaks and runs of spaces into a single space. You can do that with .gsub(/[[:space:]]+/, " " ).strip. So putting everything together you get:

def characters
    chars = parsed_episode.css('td.character').map {|char| char.text.strip}
    chars.map { |char| char.gsub(/[[:space:]]+/, " " ).strip}
end

This results in

=> ["The Brain / Queen Roach's Aid (voice)",
 "Pinky (voice)",
 "Queen Roach / First Lady / Time Machine Computer (voice)",
 "The President (voice)"]

It will all line up!
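Here is the whitespace cleanup on its own, using the messy string from above as input. The second half is an extra illustration (the non-breaking space input is an assumption, not something taken from the page): [[:space:]] also matches Unicode whitespace such as a non-breaking space, which \s does not.

```ruby
# The raw text of the first td.character cell, as shown in the walkthrough.
raw = "\n The Brain / \n Queen Roach's Aid \n \n \n (voice)\n \n \n "

# Collapse every run of whitespace into one space, then trim the ends.
clean = raw.gsub(/[[:space:]]+/, ' ').strip
puts clean

# Hypothetical input containing a non-breaking space (U+00A0).
nbsp = "Pinky\u00A0(voice)"
with_posix = nbsp.gsub(/[[:space:]]+/, ' ')  # non-breaking space is replaced
with_slash_s = nbsp.gsub(/\s+/, ' ')         # \s is ASCII-only, so nothing changes

puts with_posix == "Pinky (voice)"
puts with_slash_s == nbsp
```

Scraped HTML often contains non-breaking spaces, so the POSIX bracket form is a reasonable default here.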
Mr Burns Excellent


Let's line them up in a hash, that seems best: actor as the key and an array of characters as the value. Remember the characters are separated with a slash, so I think it's a simple function to draw up.

def cast_assesment
    i = 0
    cast = {}
    while i < actors.length
        cast[actors[i]] = characters[i].split(" / ")
        i+=1
    end

    cast
end
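The same pairing can also be written without an index counter. This is only an alternative sketch, not the code in the repo: it uses plain arrays (copied from the outputs above) in place of the actors and characters methods, and the name cast is just for illustration.

```ruby
# Stand-ins for the arrays returned by the actors and characters methods.
actors = ["Maurice LaMarche", "Rob Paulsen"]
characters = ["The Brain / Queen Roach's Aid (voice)", "Pinky (voice)"]

# zip pairs the arrays element by element: [[actor, roles], ...].
# Each pair becomes [key, value], and to_h turns the pairs into a hash.
cast = actors.zip(characters)
             .map { |actor, roles| [actor, roles.split(' / ')] }
             .to_h

puts cast.inspect
```

One design point: inside the class, every call to actors or characters runs the CSS query again, so a while loop that calls them on each pass repeats that work. Calling each method once, as zip does here, avoids that.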

Remember, we are doing this in a class, so actors and characters are method calls. We are just iterating through the arrays that those methods return. Let's see what we get.

[1] pry(main)> narf.cast_assesment
=> {"Maurice LaMarche"=>["The Brain", "Queen Roach's Aid (voice)"],
 "Rob Paulsen"=>["Pinky (voice)"],
 "Tress MacNeille"=>["Queen Roach", "First Lady", "Time Machine Computer (voice)"],
 "Frank Welker"=>["The President (voice)"]}

PERFECT!!!! Alright, next time we will tackle the crew: writers and directors, because that gets a little more complicated. We will also write to JSON.

90's cartoon commercial break


You can find the code here: https://github.com/AmSmo/webscraper_narf (part one is on a separate branch).
