DEV Community

Web Scraping Walkthrough with Python

Andrew (he/him) on February 18, 2019

First Steps: Web scraping is the process of extracting data from a web page's source code, rather than through some API exposed by the ow...


rhymes • Feb 18 '19 • Edited

Nice idea, though scraping always depends on the website's structure and can run into copyright issues (sites might block your user agent or IP if they don't allow scraping). In the case of Indeed, they explicitly forbid it:

You are not permitted to use Indeed’s Site or its content other than for non-commercial purposes. Use of any automated system or software, whether operated by a third party or otherwise, to extract data from the Site (such as screen scraping or crawling) is prohibited. Indeed reserves the right to take such action as it considers necessary, including issuing legal proceedings without further notice, in relation to any unauthorized use of the Site.

😏
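Terms of service like the one quoted above are written for humans, but many sites also publish a machine-readable robots.txt file. As a minimal sketch of checking it before crawling, assuming a made-up robots.txt (the rules, the `my-bot` user agent, and the example.com URLs below are hypothetical and not taken from Indeed), Python's standard library can answer whether a URL may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /jobs
Allow: /about
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/jobs/1", "/about"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "my-bot", url))
```

Passing this check says nothing about a site's terms of service; it only covers what robots.txt declares.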

This is going to take a while, so I'll go grab some coffee and come back...

Ahah, if you want to actually build a scraping tool, I would consider Scrapy, a framework for building crawlers and scrapers with async concurrency built in.

It's definitely more complicated than BeautifulSoup, which is only a parsing library. Scrapy contains it all: downloaders, parsers, streaming processors, concurrency, hooks, logging, statistics. You can use BeautifulSoup as the parser, instead of the default one. It even allows you to choose either breadth first order or depth first order in crawling.
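To make the breadth-first versus depth-first distinction concrete, here is a small standard-library sketch. It does not use Scrapy (which selects the order through scheduler settings such as `DEPTH_PRIORITY` and its queue classes); the `LINKS` graph and the `crawl_order` function are hypothetical names invented for illustration, and the "site" is just an in-memory dictionary.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
LINKS = {
    "/": ["/jobs", "/about"],
    "/jobs": ["/jobs/1", "/jobs/2"],
    "/about": ["/team"],
    "/jobs/1": [],
    "/jobs/2": [],
    "/team": [],
}


def crawl_order(start, links, breadth_first=True):
    """Return pages in the order a simple crawler would visit them."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Taking from the left (FIFO) gives breadth-first order;
        # taking from the right (LIFO) gives depth-first order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in links.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


if __name__ == "__main__":
    print("breadth-first:", crawl_order("/", LINKS, breadth_first=True))
    print("depth-first:  ", crawl_order("/", LINKS, breadth_first=False))
```

The only difference between the two modes is which end of the frontier the next page is taken from, which is roughly why a crawler can offer the choice as a configuration switch.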

Andrew (he/him) • Feb 18 '19

Oh jeez let's hope I don't get permabanned from Indeed.

rhymes • Feb 19 '19

There's an Indeed API on Mashape; I don't know how flexible it is: rapidapi.com/indeed/api/indeed

Jay Westerdal • Feb 19 '19 • Edited

You can always work around them banning your IP by using spider.com. They have millions of IPs and allow you to crawl anything and not get blocked.

Terms of service are not the law; there is nothing illegal about scraping a website. Read: eff.org/deeplinks/2018/01/ninth-ci...

Anthony Bouvier • Feb 19 '19

It is not illegal in the US, but keep in mind that not everyone on this site is US-based, nor are the companies that might get targeted by a spider written here.

However, it may be unethical.

"Please don't do this to our site and our property and our data."

"Yeah, well, screw you. I'm doing it anyway."

Juan Carlos • Feb 20 '19

Try Faster Than Requests:
5 times faster than the standard library's urllib.
15 times faster than Requests.
2 times faster than PyCurl.