First Steps
Web scraping is the process of extracting data from a web page's source code, rather than through some API exposed by the ow...
Nice idea, though scraping always depends on the website's structure and may raise copyright issues (the site might block your user agent or IP if it doesn't allow scraping). In the case of Indeed, they explicitly forbid it:
😏
Ahah, if you want to actually build a scraping tool, I would consider Scrapy, a framework for building crawlers and scrapers with asynchronous concurrency built in.
It's definitely more complicated than BeautifulSoup, which is only a parsing library. Scrapy contains it all: downloaders, parsers, streaming processors, concurrency, hooks, logging, statistics. You can use BeautifulSoup as the parser instead of the default one. It even lets you choose between breadth-first and depth-first crawl order.
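To make the breadth-first versus depth-first distinction concrete, here is a minimal sketch using only the standard library. It is not Scrapy's scheduler: the `LINKS` graph and the `crawl_order` helper are made up for illustration, and a real crawler would discover links by downloading pages.

```python
from collections import deque

# Hypothetical link graph standing in for a website: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def crawl_order(start, links, breadth_first=True):
    """Return the order in which pages would be visited."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Taking from the front (FIFO) gives breadth-first order;
        # taking from the back (LIFO) gives depth-first order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


bfs = crawl_order("/", LINKS, breadth_first=True)
dfs = crawl_order("/", LINKS, breadth_first=False)
print(bfs)
print(dfs)
```

The only difference between the two orders is which end of the pending queue the next page is taken from, which is why a framework can expose the choice as configuration rather than as a different crawler.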
Oh jeez, let's hope I don't get permabanned from Indeed.
There's an Indeed API on Mashape; I don't know how flexible it is: rapidapi.com/indeed/api/indeed
You can always work around them banning your IP by using spider.com. They have millions of IPs and allow you to crawl anything and not get blocked.
Terms of service are not the law; there is nothing illegal about scraping a website. Read: eff.org/deeplinks/2018/01/ninth-ci...
It is not illegal in the US, but keep in mind that not everyone on this site is US-based, and neither are the companies that might get targeted by a spider written here.
However, it may be unethical.
"Please don't do this to our site and our property and our data."
"Yeah, well, screw you. I'm doing it anyway."
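One machine-readable place where a site can say "please don't" is its robots.txt file. Here is a minimal sketch, assuming Python's standard `urllib.robotparser`, of checking that file before fetching a page. The robots.txt content, the `example.com` URLs, and the `my-scraper` user agent are all made up for illustration; they are not Indeed's actual rules, and robots.txt is separate from a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration. A real crawler would load the
# site's own file instead, e.g. with RobotFileParser.set_url() and read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /jobs
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-scraper" is a hypothetical user agent name.
allowed_about = parser.can_fetch("my-scraper", "https://example.com/about")
allowed_jobs = parser.can_fetch("my-scraper", "https://example.com/jobs?q=python")
print(allowed_about, allowed_jobs)
```

A polite scraper would skip any URL for which `can_fetch` returns `False`.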
Try Faster Than Requests
5 times faster than the standard library's urllib.
15 times faster than Requests.
2 times faster than PyCurl.