Andrew (he/him)
Web Scraping Walkthrough with Python

First Steps

Web scraping is the process of extracting data from a web page's source code, rather than through some API exposed by the owner(s) of that page. It can be a bit tricky at first, but it allows you to easily pull and organise lots of information from the web, without having to manually copy and paste anything.

To do some basic web scraping today, I'll use the Python library BeautifulSoup. If you haven't used this package before, you'll need to install it. The easiest way to do that is with the Python package manager pip. First, check whether you have pip on your machine by trying to install the library with it:

$ pip install beautifulsoup4

If you have Python but don't have pip (i.e. if the above throws an error), install pip on its own by following the official pip installation instructions. macOS and most Linux distributions come with Python by default, but if you're on Windows and need to install Python, try the official website (python.org).

Python 2.7 is deprecated as of 1 January 2020, so it might be better to just get Python 3 (if you don't have it yet). I don't have Python 3 yet (because I factory reset my Mac not too long ago), so I'm installing it first via Homebrew, which essentially boils down to:

$ brew install python

Now, we can check that both Python 2 and Python 3 are installed, and that pip was installed alongside Python 3:

$ python --version
Python 2.7.10

$ python3 --version
Python 3.7.2

$ pip --version
-bash: pip: command not found

$ pip3 --version
pip 19.0.2 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)

Finally, let's get BeautifulSoup using pip3:

$ pip3 install beautifulsoup4

Note that, at this point, you could use the "normal" Python interpreter with the python3 command, or you could use the more feature-rich IPython, which you can install with:

$ pip3 install ipython

Throughout this tutorial, I'll be using IPython.

Preliminary Research

My motivation for this project was that I wanted to create an "average profile" of a developer at a given level in a given area, based on job postings on Indeed and similar websites. While doing something like that is a bit involved and might require some regex, a good place to start would be to simply see how often a given technology is mentioned in job postings: more mentions == more important, right?

BeautifulSoup lets you access a page's XML / HTML tags by their type, id, class, and more. You can pull all <a> tags, for instance, or get the text of all <p> tags with a particular class. So to pull data out in a regular way, we need to dissect the structure of the pages we want to scrape.
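For instance (a toy sketch with made-up HTML, just to illustrate the API):

from bs4 import BeautifulSoup

html = '<p class="intro">Hello</p><a href="/jobs">Jobs</a><a href="/about">About</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("a"))                    # every <a> tag in the document
print(soup.select("p.intro")[0].get_text())  # text of the <p> with class "intro" -> Hello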

Let's start by doing a search for JavaScript developers in New York City, and note the URL of the results page:

https://www.indeed.com/jobs?q=javascript+developer&l=New+York+City

If we go to the second page of results, it changes to:

https://www.indeed.com/jobs?q=javascript+developer&l=New+York+City&start=10

...and the third page of results:

https://www.indeed.com/jobs?q=javascript+developer&l=New+York+City&start=20

Right, so there are 10 results per page and each page after the first has an additional parameter in the URL: &start=..., where ... is a positive multiple of 10. (As it turns out, we can append &start=0 to the URL of the first page and it returns the same results.) Okay, so we know how to access pages of results... what's next? How about we inspect the structure of the first results page:

One thing I notice is that the links to each job ad seem to have an onmousedown attribute which changes predictably. The first one is

onmousedown="return rclk(this,jobmap[0],0);"

...the second is

onmousedown="return rclk(this,jobmap[1],0);"

...and so on. I would bet that we can pull all <a> tags with an onmousedown containing "return rclk(this,jobmap[" and that would give us all the links to all the jobs listed on this page. Let's put that in our back pocket for now and open one of these ads -- let's see if we can figure out where the job specifications are within these pages:

It looks like the main body of the ad is contained in a <div> with class="jobsearch-JobComponent-description". That sounds like a pretty specific div. I'll just go ahead and assume that's the same on every page, but you can check if you like. So now that we know the structure of the URLs we want to visit, how to find links to job ads on those pages, and where the text of the ad is contained in those subpages, we can build a web scraping script!

Building the Scraper

Let's start by just looping over search pages. Our URL will look something like:

https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=

...but we need to append a non-negative multiple of 10 to the end. An easy way to do this in Python is to loop over a range:

In [91]: for pageno in range(0,10): 
    ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
    ...:     print(search) 
    ...:                                                                                                                                           
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=0
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=10
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=20
...
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=90

That looks good! Note that we had to convert the integer to a string with the built-in str() function.
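As a side note, the standard library can also build these query strings for you. Here's a sketch using urllib.parse.urlencode, though the rest of this walkthrough sticks with plain string concatenation:

from urllib.parse import urlencode

for pageno in range(0, 3):
    # the parameter names (q, l, start) come from the Indeed URLs we saw above
    params = {"q": "javascript", "l": "New York City", "start": 10 * pageno}
    print("https://www.indeed.com/jobs?" + urlencode(params))

urlencode() stringifies the integer and escapes the spaces in "New York City" as + for us.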

What we really want to do is actually visit these pages and extract their content. We can do that with Python's urllib module -- specifically urllib.request.urlopen() (Python 3 only). We can then parse the page by passing the response straight to the BeautifulSoup constructor. (If you don't name a parser explicitly -- e.g. BeautifulSoup(page, "html.parser") -- BeautifulSoup guesses one and prints a warning; the guess works fine here, so I'll leave it.) To test this, let's temporarily reduce our loop range to just one page and print the contents of the page with soup.prettify():

In [99]: import urllib.request 
    ...: from bs4 import BeautifulSoup 

In [100]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:     print(soup.prettify()[:500]) 
     ...: 
<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="/s/a3599cf/en_US.js" type="text/javascript">
  </script>
  <link href="/s/97464e7/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://rss.indeed.com/rss?q=javascript&amp;l=New+York+City" rel="alternate" title="Javascript Jobs, Employment in New York, NY" type="application/rss+xml"/>
  <link href="/m/jobs?q=javascript&amp;l=New+York+City" m

I trimmed the output using string slicing, limiting it to 500 characters (the source code of this page is pretty long). Even in that short snippet, though, you can see our original search: q=javascript&amp;l=New+York+City.

Great! So this seems to work. Let's use select() now to grab all of the job ad links on this page. Remember that we're looking for all of the <a> tags with an onmousedown containing "return rclk(this,jobmap[". select() accepts CSS selectors, so we can use the attribute-substring syntax ([attr*="value"]) to match them:

In [102]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="return rclk(this,jobmap["]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         print(subURL) 
     ...:                                                                                                                                          
https://www.indeed.com/rc/clk?jk=43837af9ab727a8b&fccid=927356efef1f3075&vjs=3
https://www.indeed.com/rc/clk?jk=6511fae8b53360f1&fccid=f057e04c37cca134&vjs=3
https://www.indeed.com/company/Transport-Learning/jobs/React-HTML-Javascript-Developer-ca898e4825aa3f36?fccid=6b6d25caa00a7d0a&vjs=3
...
https://www.indeed.com/rc/clk?jk=9a3a9b4a4cbb3f28&fccid=101a2d7616184cc8&vjs=3

We append "https://www.indeed.com" to the beginning of each link because, in the source code of the page, all the hrefs are relative. If we grab one of these links (say the third one) and paste it into the browser, we should hopefully get a job ad:

...looking good! Okay, what's next? Well, we want to, again, open these subpages with BeautifulSoup and parse the source code. But this time, we want to look for <div>s with a class that contains jobsearch-JobComponent-description. So let's use string slicing again and print the first, say, 50 characters of each page, just to make sure that all of these URLs are working:

In [103]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:         print(subSOUP.prettify()[:50]) 
     ...:                                                                                                                                          
<html dir="ltr" lang="en">
 <head>
  <title>
   Ne
<html dir="ltr" lang="en">
 <head>
  <title>
   Re
<html dir="ltr" lang="en">
 <head>
  <title>
   Re
...
<html dir="ltr" lang="en">
 <head>
  <title>
   Ni

Again, great! Everything's working so far. The next thing to do would be to try to extract the text of the main body of each ad. Let's use the same *= syntax in select() that we used previously to find <div>s in these subpages which have a class attribute which contains jobsearch-JobComponent-description:

In [106]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:          
     ...:         for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'): 
     ...:             print(desc.get_text()[:50]) 
     ...:                                                                                                                                          
Impact

Ever wondered how Amazon offers the Earth'
Mobile & Web Engineering is looking for talented w
Job Description

We are looking for a talented Fro
$75,000 - $95,000 a yearYour first few months:We c
Michael Kors is always interested in hearing from 
Facebook's mission is to give people the power to 
$70,000 - $80,000 a yearWe Make Websites are the g
InternshipApplications are due by June 27, 2019 at
Job Overview:

UI Developer should have a very goo
* THIS IS A REMOTE POSITION *

At Dental Intellige

BeautifulSoup.select() returns the HTML / XML tags which match the search parameters we provide. We can pull attributes from those tags with bracket notation (as in adlink['href']), and we can pull the text contained within opening and closing tags (for instance, between <p> and </p>) with get_text(), as we did above. The subSOUP.select() statement returns a list of <div> tags whose class attributes contain the substring "jobsearch-JobComponent-description"; we then use a for ... in loop to visit each <div> in that list (there's only one per page) and print its text with get_text().
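To see those three operations in isolation, here's a toy sketch with made-up HTML:

from bs4 import BeautifulSoup

snippet = '<div class="ad"><a href="/job/1" onmousedown="track(1)">Job 1</a><p>Great job!</p></div>'
soup = BeautifulSoup(snippet, "html.parser")

link = soup.select('a[onmousedown*="track"]')[0]      # CSS attribute-substring selector
print(link['href'])                                   # bracket notation -> /job/1
print(soup.select('div[class*="ad"]')[0].get_text())  # -> Job 1Great job!

Note how get_text() runs "Job 1" and "Great job!" together with no space between them; this same behavior is the source of some run-together words we'll bump into shortly.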

The result of the big loop above is that list of jumbled text. It doesn't make much sense because we cut each description off after only 50 characters. But now we have a fully-functional Indeed job ad scraper! We just need to figure out what to do with these results to complete our task.

Organizing Your Web Scrapings

The easiest thing to do is to come up with a list of keywords we're interested in. Let's look at the popularity of various JavaScript frameworks. How about:

frameworks = ['angular', 'react', 'vue', 'ember', 'meteor', 'mithril', 'node', 'polymer', 'aurelia', 'backbone']

...that's probably a good start. If you're familiar with processing text data like this, you'll know a few things already: we have to convert everything to lowercase to avoid ambiguity between things like "React" and "react"; we have to remove punctuation so we don't count "Angular" and "Angular," as two separate things; and we can easily split the text into tokens on spaces using split(). Let's first split the text of each ad, convert each word to lowercase, and see what our list of words looks like:

In [110]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:          
     ...:         for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'): 
     ...:             words = desc.get_text().lower().split()[:50] 
     ...:             for word in words: 
     ...:                 print(word) 
     ...:                                                                                                                                          
mobile
&
web
engineering
is
looking
for
talented
web
developers
to
join
the
digital
acquisitions
engineering
group.
...

...and so on. Let's pick out some weird ones:

group.
role,
summary:
recoded:you'd
limitless.we
react.within

...right, so we'll have to split on spaces as well as ., ,, and :. Elsewhere in the list, we have:

2.0-enabled

which will, of course, be damaged by splitting on ., but I think the benefits outweigh the costs here. We also have lots of hyphenated words like

blue-chip
data-driven,
hyper-personalized,
go-to
team-based
e-commerce

...so we probably shouldn't split on hyphens or dashes. We do, however, have one or two

trends/development
qa/qc

...so we'll want to split on / as well. Finally, there are some run-together words that we can't do much about; they mostly appear where get_text() joined adjacent HTML elements without any whitespace between them:

analystabout
part-timeat
contractlocation:
yearyour

...at the moment, so we'll have to leave those as-is. To make this solution a bit more robust, we want to split on multiple separators, not just the space character. So we need Python's regular expression library re:

In [110]: import re

In [111]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:          
     ...:         for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'): 
     ...:             words = re.split("[ ,.:/]", desc.get_text().lower())[:50] 
     ...:             for word in words: 
     ...:                 print(word) 
     ...:                                                                                                                                          
impact

ever
wondered
how
amazon
offers
the
earth's
biggest
selection
and
still
...

Right. So now what weirdos do we have?

earth's

customers?
$75
000
-
$95
000
(both
ios
and
android)
facebook's
$70
000
-
$80
000
11
59pm
*

So, still a few edge cases. The easy-to-fix ones include removing trailing 's from words and adding ?, (, and ) to the list of separator characters (as well as whitespace characters like \n, \t, and \r). (One more quick scan reveals that we should obviously add ! to the list as well.) We can also ignore any word that is a single character long or empty. Fixing the problems with times (11:59pm) and salaries ($70,000 - $80,000) is a bit more involved and won't be covered here; for now, we'll just ignore those. So let's check out our improved scraper:

In [121]: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:          
     ...:         for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'): 
     ...:             words = re.split("[ ,.:/?!()\n\t\r]", desc.get_text().lower())[:50] 
     ...:             for word in words: 
     ...:                 word = word.strip() 
     ...:                 if word.endswith("'s"): 
     ...:                     word = word[:-2] 
     ...:                 if len(word) < 2: 
     ...:                     continue 
     ...:                 print(word) 
     ...:                       

Beautiful! Now, what can we do with it?

Insights

Instead of simply printing a list of words, let's add them to a dictionary. Every time we encounter a new word, we can add it to our dictionary with an initial value of 1, and every time we encounter a word we've seen before, we can increment its counter:

In [123]: counts = {} 
     ...:  
     ...: for pageno in range(0,1): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         subURL  = "https://www.indeed.com" + adlink['href'] 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:         print("Scraping: " + subURL + "...") 
     ...:          
     ...:         for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'): 
     ...:             words = re.split("[ ,.:/?!()\n\t\r]", desc.get_text().lower())[:50] 
     ...:             for word in words: 
     ...:                 word = word.strip() 
     ...:                 if word.endswith("'s"): 
     ...:                     word = word[:-2] 
     ...:                 if len(word) < 2: 
     ...:                     continue 
     ...:                 if word in counts: 
     ...:                     counts[word] += 1 
     ...:                 else: 
     ...:                     counts[word] = 1 
     ...:  
     ...: print(counts) 
     ...:                                                                                                                                          
Scraping: https://www.indeed.com/company/CypressG/jobs/Newer-Javascript-Framework-Developer-5a17b0475e76de26?fccid=dc16349e968c035d&vjs=3...
Scraping: https://www.indeed.com/company/Transport-Learning/jobs/React-HTML-Javascript-Developer-ca898e4825aa3f36?fccid=6b6d25caa00a7d0a&vjs=3...
Scraping: https://www.indeed.com/rc/clk?jk=a0727d28799f1dff&fccid=5d5fde8e5925b19a&vjs=3...
...
Scraping: https://www.indeed.com/rc/clk?jk=b084048e6a1b2727&fccid=5d5fde8e5925b19a&vjs=3...
{'$80': 1, '000': 8, '$250': 1, 'yeari': 1,...

I added a "Scraping" echo so we can see that the script is progressing. Note that the resulting dictionary is not ordered! If we want to order it by value, there are a few different ways to do that, but the easiest is probably to turn it into a list of tuples, flipping the keys and values so that the count comes first and we can sort by the number of occurrences of each word:

word_freq = []

for key, value in counts.items():
    word_freq.append((value, key))

word_freq.sort(reverse=True)

We sort with reverse=True so the list is ordered high-to-low and the most common words are at the top. Let's see the result:

[(19, 'to'), (13, 'and'), (12, 'the'), (11, 'for'), (9, 'of'), (9, 'is'), (6, 'we'), (6, 'in'), (6, '000'), (5, 'you')]
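As an aside, Python's standard library can do this counting and sorting in one go. Here's a minimal sketch using collections.Counter, equivalent to the dictionary-plus-flip-plus-sort approach above (the word list here is made up for illustration):

from collections import Counter

# a made-up token list standing in for the words scraped from the ads
words = ["react", "angular", "react", "node", "react", "angular"]

counts = Counter(words)       # a dict subclass that counts hashable items
print(counts.most_common())   # [('react', 3), ('angular', 2), ('node', 1)] -- sorted high-to-low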

Of course, the reason we want to pick specific words out (like "angular", "react", etc.) is because we'll get a bunch of useless filler words (like "to", "and", etc.) otherwise. Let's define a list of "good" words, check our word against the list, and only count ones that we care about. Finally, I'll also get rid of the [:50] slice which we used for debugging, and expand my search to the first 100 pages of results. Here is the final script:

In [127]: counts = {} 
     ...: frameworks = ['angular', 'react', 'vue', 'ember', 'meteor', 'mithril', 'node', 'polymer', 'aurelia', 'backbone'] 
     ...: max_pages = 100 
     ...: ads_per_page = 10 
     ...: max_ads = max_pages * ads_per_page 
     ...:  
     ...: for pageno in range(0, max_pages): 
     ...:     search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(ads_per_page * pageno) 
     ...:     url = urllib.request.urlopen(search) 
     ...:     soup = BeautifulSoup(url) 
     ...:     this_page_ad_counter = 0 
     ...:      
     ...:     for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'): 
     ...:         href = adlink['href'] 
     ...:         subURL  = "https://www.indeed.com" + href 
     ...:         subSOUP = BeautifulSoup(urllib.request.urlopen(subURL)) 
     ...:         ad_index = this_page_ad_counter + pageno*ads_per_page 
     ...:         print("Scraping (" + str(ad_index + 1) + "/" + str(max_ads) + "): " + href + "...") 
     ...:         this_page_ad_counter += 1 
     ...:          
     ...:         for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'): 
     ...:             words = re.split("[ ,.:/?!()\n\t\r]", desc.get_text().lower()) 
     ...:             for word in words: 
     ...:                 word = word.strip() 
     ...:                 if word.endswith("'s"): 
     ...:                     word = word[:-2] 
     ...:                 if word.endswith(".js"): 
     ...:                     word = word[:-3] 
     ...:                 if word.endswith("js"): 
     ...:                     word = word[:-2] 
     ...:                 if len(word) < 2: 
     ...:                     continue 
     ...:                 if word not in frameworks: 
     ...:                     continue 
     ...:                 if word in counts: 
     ...:                     counts[word] += 1 
     ...:                 else: 
     ...:                     counts[word] = 1 
     ...:  
     ...: word_freq = []  
     ...:   
     ...: for key, value in counts.items():  
     ...:     word_freq.append((value,key))  
     ...:       
     ...: word_freq.sort(reverse=True) 
     ...:  
     ...: print(word_freq) 
     ...:                                                                                                                                          
Scraping (1/1000): /rc/clk?jk=72b4ac2da9ecb39d&fccid=f057e04c37cca134&vjs=3...
Scraping (2/1000): /company/Transport-Learning/jobs/React-HTML-Javascript-Developer-ca898e4825aa3f36?fccid=6b6d25caa00a7d0a&vjs=3...
Scraping (3/1000): /rc/clk?jk=9a3a9b4a4cbb3f28&fccid=101a2d7616184cc8&vjs=3...
...

I made some small aesthetic changes... can you see where they are? I also made sure to remove ".js" or "js" from the end of any framework names so they're not counted as separate things. I removed the "magic number" 10 from the script and put it in a descriptive variable (ads_per_page). Also, I created a variable (max_pages) limiting the search to 100 pages of results, so in total I'll look at the 1,000 most recent "javascript" ads posted on Indeed in the NYC area.
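As a quick sanity check, here's how that suffix-stripping logic behaves on some hypothetical tokens (a toy sketch, separate from the script above):

# made-up tokens to exercise the suffix-stripping rules
for word in ["node.js", "nodejs", "react", "ember's"]:
    if word.endswith("'s"):    # strip possessives
        word = word[:-2]
    if word.endswith(".js"):   # "node.js" -> "node"
        word = word[:-3]
    if word.endswith("js"):    # "nodejs" -> "node"
        word = word[:-2]
    print(word)                # node, node, react, ember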

This is going to take a while, so I'll go grab some coffee and come back...


...so, what does the result look like?

[(556, 'react'), (313, 'angular'), (272, 'node'), (105, 'vue'), (45, 'backbone'), (36, 'ember'), (4, 'polymer')]

So, out of 1000 ads scraped, 556 mentioned "react", 313 mentioned "angular", and so on. Quite a bit of insight from a quick script!

Applications

With some more work, this could be turned into a website / app where developers (or anyone) looking for a job could find out what the average requirements are ("...56% of ads requested experience with React..."), what the average salary is ("...$55,000 +/- $2,000..."), and benchmark themselves against those averages. Such a tool would be really useful in salary negotiations, or when trying to decide what new technologies / languages to learn to advance your career. Data could be kept current by tracking ad posting dates and throwing out stale information (older than, say, a week).

This information would also be useful to employers, giving them a better idea of where to set salaries for certain positions, levels of experience, and so on. Indeed was just the first step, but this scraping could easily be expanded to multiple job posting websites.

This prototype only took a few hours' work for one person with limited Python experience. I would imagine that a small team of people could get this app up and running in just a few weeks. Thoughts? Does anyone know of anything similar?

Top comments (7)

rhymes

Nice idea, though scraping is always dependent on the website structure and/or copyright issues (they might block your user agent or IP if they don't allow scraping). In the case of Indeed, they explicitly forbid it:

You are not permitted to use Indeed’s Site or its content other than for non-commercial purposes. Use of any automated system or software, whether operated by a third party or otherwise, to extract data from the Site (such as screen scraping or crawling) is prohibited. Indeed reserves the right to take such action as it considers necessary, including issuing legal proceedings without further notice, in relation to any unauthorized use of the Site.

😏

This is going to take a while, so I'll go grab some coffee and come back...

Ahah. If you want to actually build a scraping tool, I would consider Scrapy, a framework with async concurrency built in for building crawlers that scrape data.

It's definitely more complicated than BeautifulSoup, which is only a parsing library. Scrapy contains it all: downloaders, parsers, streaming processors, concurrency, hooks, logging, statistics. You can use BeautifulSoup as the parser instead of the default one. It even allows you to choose either breadth-first or depth-first order when crawling.

Andrew (he/him)

Oh jeez let's hope I don't get permabanned from Indeed.

rhymes

There's an Indeed API on Mashape, don't know how flexible that is: rapidapi.com/indeed/api/indeed

Jay Westerdal

You can always work around them banning your IP by using spider.com. They have millions of IPs and allow you to crawl anything and not get blocked.

Terms of service are not the law; there is nothing illegal about scraping a website. Read: eff.org/deeplinks/2018/01/ninth-ci...

Anthony Bouvier

It is not illegal (in the US, but keep in mind not everyone on this site is US-based nor are the companies that might get targeted by a spider written here).

However, it may be unethical.

"Please don't do this to our site and our property and our data."

"Yeah, well, screw you. I'm doing it anyway."

Juan Carlos

Try Faster Than Requests:
5× faster than the stdlib urllib.
15× faster than Requests.
2× faster than PyCurl.
