meikay

Posted on Apr 5, 2019 • Edited on Oct 29, 2019

Ruby Web Scraper with Nokogiri

#beginners #ruby #webscraper #webdev

I was interested in remote full stack positions, so I decided to build my portfolio project around them. I built a command line application that uses object-oriented programming to scrape remoteok.io for its top 100 full stack developer jobs. The application can perform a second-level scrape to retrieve a job's description, and it can be installed as a gem in your local environment. My project uses the Nokogiri gem, an open source library for parsing HTML and XML.

I will be going through how I built my CLI Data Gem project in the following steps:

Step 1 Install Bundler gem

In your terminal, install Bundler and then use it to generate the gem skeleton:

gem install bundler
bundle gem remote_jobs

Step 2 Add Necessary Files

I had to map out the models I would be using throughout my program. Inside lib, I created a folder named remote_jobs, and within that folder I added three files to hold my classes.

Step 3 Setup File Environment

Then, I required the files in remote_jobs.rb to set up my project’s environment. At this point, I had to figure out what objects to build and what their attributes were going to be.
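
To make the idea of objects and their attributes concrete, here is a minimal sketch of what a job model could look like. The attributes come from the data described in this post (name, company, url, description); the RemoteJobs module name, the keyword arguments, and the class-level list are assumptions for illustration, not necessarily how the gem is laid out.

```ruby
module RemoteJobs
  # One job listing. The class layout here is an assumption for illustration.
  class Job
    attr_accessor :name, :company, :url, :description

    # Class-level list that remembers every job created.
    @all = []

    class << self
      attr_reader :all
    end

    def initialize(name:, company:, url:)
      @name = name
      @company = company
      @url = url
      @description = nil # filled in later by the second-level scrape
      self.class.all << self
    end
  end
end

# Sample data, not a real listing.
job = RemoteJobs::Job.new(name: "Full Stack Developer", company: "Acme", url: "/jobs/1")
puts "#{job.name} at #{job.company}"
puts RemoteJobs::Job.all.size
```

Keeping every instance in Job.all means the CLI can list jobs without scraping again.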

Step 4 Add Dependencies

In my remote_jobs.gemspec file, I added the necessary dependencies, such as Nokogiri, colorize, and pry. I also added information about my gem, such as its name, version, description, authors, and homepage.
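
As a rough sketch, a gemspec with those dependencies might look like the following. The version, author, summary, description, and homepage values are placeholders, and treating pry as a development-only dependency is my assumption. The spec is assigned to a variable here only so it can be inspected afterwards.

```ruby
require "rubygems"

# Sketch of remote_jobs.gemspec. All metadata values below are placeholders,
# not the real gem's metadata.
gemspec = Gem::Specification.new do |spec|
  spec.name          = "remote_jobs"
  spec.version       = "0.1.0"
  spec.authors       = ["Your Name"]
  spec.summary       = "CLI that lists remote full stack developer jobs"
  spec.description   = "Scrapes job listings and shows them in the terminal."
  spec.homepage      = "https://example.com/remote_jobs"
  spec.files         = Dir["lib/**/*.rb"]
  spec.require_paths = ["lib"]

  # Needed when the gem runs.
  spec.add_dependency "nokogiri"
  spec.add_dependency "colorize"

  # Assumption: pry is only needed while developing and debugging.
  spec.add_development_dependency "pry"
end

puts gemspec.runtime_dependencies.map(&:name).inspect
```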

Step 5 Pseudocode

I imagined how my CLI would behave and wrote down the steps. Then, I thought about which methods I would need to make my CLI behave that way. I also needed to decide what data I wanted to scrape.
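
Turning that kind of pseudocode into Ruby could look something like this sketch: list the jobs, read a choice, show the chosen job, and repeat until the user types exit. The run_cli method, the prompts, and the sample jobs are all hypothetical. Input and output are passed in so the loop can be exercised without a real terminal.

```ruby
require "stringio"

# Hypothetical CLI loop: list jobs, read a number, show that job,
# and repeat until the user types "exit" (or input ends).
def run_cli(jobs, input: $stdin, output: $stdout)
  output.puts "Top remote full stack jobs:"
  jobs.each.with_index(1) do |job, i|
    output.puts "#{i}. #{job[:name]} at #{job[:company]}"
  end

  loop do
    output.puts "Enter a job number for details, or 'exit' to quit:"
    choice = input.gets&.strip
    break if choice.nil? || choice == "exit"

    index = Integer(choice, exception: false) # nil when not a number
    if index && index.between?(1, jobs.size)
      output.puts jobs[index - 1][:url]
    else
      output.puts "Please enter a number between 1 and #{jobs.size}."
    end
  end
  output.puts "Goodbye!"
end

# Sample data and scripted input, not real listings.
jobs = [
  { name: "Full Stack Developer", company: "Acme", url: "/jobs/1" },
  { name: "Rails Engineer", company: "Globex", url: "/jobs/2" }
]
out = StringIO.new
run_cli(jobs, input: StringIO.new("2\nnope\nexit\n"), output: out)
puts out.string
```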

Step 6 Scrape

I looked up the CSS selectors for the data so that I could reach each piece through its parent element's children. I decided to scrape the first 100 full stack jobs from remoteok.io and iterated over each job element to grab the job name, company name, and URL. On the second-level scrape, I scraped the description of each job. However, instead of scraping many times and potentially being locked out of the website, I only scrape a description when the job the user chose does not already have one.

Problems I Ran Into & How I (We) Solved Them

Inevitably, I ran into difficulty. During this project I was not allowed to reach out to technical coaches for help, so I relied on my tried and trusted friend Google and asked my peers for advice. When you have to code a project from scratch, you start to discover your weak points. I realized that my understanding of object orientation was not as strong as I had thought, so I re-watched some videos on the subject and applied what I learned to my project.

As you can see in the heading above, I wrote "I (We)". This is meant to encourage you to ask for help when you get stuck. If you have tried googling everything, searched Stack Overflow, and still can't find a solution, don't be afraid to ask.

That brings me to the next issue I bumped into: the data from the second scrape was not displaying the way I expected. When I ran my program, the text {linebreak} appeared all over the description. I asked a peer to take a look and see what he thought was causing the issue. We played around with the code in pry (a debugging tool) and came up with using gsub to fix it. This method replaces every occurrence of its first argument with its second argument, which let us replace {linebreak} with a literal line break everywhere it appears in the description.

clean_description = description.gsub("{linebreak}", "\n")
puts "#{clean_description}".colorize(:blue)

All in all, I am grateful to have gone through this experience and can’t wait till my next project!
