DEV Community

Web Scraping Walkthrough with Python

Andrew (he/him) on February 18, 2019

First Steps Web scraping is the process of extracting data from a web page's source code, rather than through some API exposed by the ow...
 
rhymes • Edited

Nice idea, though scraping always depends on the website's structure and may raise copyright issues (a site might block your user agent or IP if it doesn't allow scraping). In the case of Indeed, they explicitly forbid it:

You are not permitted to use Indeed’s Site or its content other than for non-commercial purposes. Use of any automated system or software, whether operated by a third party or otherwise, to extract data from the Site (such as screen scraping or crawling) is prohibited. Indeed reserves the right to take such action as it considers necessary, including issuing legal proceedings without further notice, in relation to any unauthorized use of the Site.

😏
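Terms like these are written for humans, but many sites also publish a machine-readable robots.txt that a scraper can check before fetching a page. Below is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt contents, the `my-bot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration and are not Indeed's actual rules. Passing a robots.txt check does not mean the terms of service allow scraping.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example
# (NOT the real file of any site mentioned in this thread).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /jobs
Allow: /about
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/jobs?q=python", "https://example.com/about"):
        print(url, is_allowed(SAMPLE_ROBOTS, "my-bot", url))
```

Against a live site you would typically call `set_url(...)` and `read()` on the parser instead of `parse()`, so that it downloads the site's own robots.txt.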

This is going to take a while, so I'll go grab some coffee and come back...

Ahah, if you want to actually build a scraping tool, I would consider Scrapy, a framework with async concurrency built in for building crawlers that scrape data.

It's definitely more complicated than BeautifulSoup, which is only a parsing library. Scrapy contains it all: downloaders, parsers, streaming processors, concurrency, hooks, logging, statistics. You can use BeautifulSoup as the parser instead of the default one. It even lets you choose between breadth-first and depth-first crawl order.
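To make the breadth-first versus depth-first distinction concrete, here is a small standard-library sketch. It does not use Scrapy's API; it only illustrates the general idea that the crawl order follows from whether the queue of pending pages is treated as FIFO or LIFO. The `LINKS` site map and the `crawl_order` function are hypothetical names invented for this example.

```python
from collections import deque

# Hypothetical site map: page -> links found on that page (not a real site).
LINKS = {
    "/": ["/jobs", "/about"],
    "/jobs": ["/jobs/1", "/jobs/2"],
    "/about": ["/team"],
    "/jobs/1": [],
    "/jobs/2": [],
    "/team": [],
}


def crawl_order(start, breadth_first):
    """Return the order in which pages would be visited from start."""
    frontier = deque([start])  # pages discovered but not yet visited
    seen = {start}             # avoid queueing the same page twice
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


if __name__ == "__main__":
    print("breadth-first:", crawl_order("/", breadth_first=True))
    print("depth-first:  ", crawl_order("/", breadth_first=False))
```

Breadth-first visits every page one click from the start before going deeper, which suits shallow listings; depth-first follows one branch to its end first and tends to keep the pending queue smaller.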

Andrew (he/him)

Oh jeez let's hope I don't get permabanned from Indeed.

rhymes

There's an Indeed API on Mashape, don't know how flexible that is: rapidapi.com/indeed/api/indeed

Jay Westerdal • Edited

You can always work around them banning your IP by using spider.com. They have millions of IPs and allow you to crawl anything and not get blocked.

A terms of service is not the law; there is nothing illegal about scraping a website. Read: eff.org/deeplinks/2018/01/ninth-ci...

Anthony Bouvier

It is not illegal in the US, but keep in mind that not everyone on this site is US-based, nor are the companies that might get targeted by a spider written here.

However, it may be unethical.

"Please don't do this to our site and our property and our data."

"Yeah, well, screw you. I'm doing it anyway."

Juan Carlos

Try Faster Than Requests:
5 times faster than the standard library's urllib.
15 times faster than Requests.
2 times faster than PyCurl.