
Discussion on: How we failed at web scraping

István Lantos • Edited

Yep, CORS is a b*tch, but there's a good reason for it: imagine an API that sends an Access-Control-Allow-Origin: * header and has no rate limiting. You should consider this as a factor too, and limit your request speed if needed.
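If you do scrape, the simplest way to throttle yourself is to sleep between requests. A rough sketch in Python (the URLs, user agent string, and one-second delay are made-up examples, not values from the article):

```python
import time
import requests

URLS = [
    "https://example.com/page/1",  # placeholder URLs
    "https://example.com/page/2",
]
DELAY_SECONDS = 1.0  # arbitrary politeness delay; tune it to the site's limits

for url in URLS:
    response = requests.get(url, headers={"User-Agent": "my-hobby-scraper"})
    print(url, response.status_code)
    time.sleep(DELAY_SECONDS)  # wait between requests instead of hammering the server
```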

More importantly, always respect robots.txt! That includes which URLs are not allowed to be crawled and what the minimum time between requests is (Crawl-delay). For example, IMDb uses these settings in its robots.txt:

User-agent: ScoutJet
Crawl-delay: 3
User-agent: Slurp
Crawl-delay: .1
User-agent: *
Disallow: /ads/
Disallow: /ap/
Disallow: /tvschedule
Disallow: /mymovies/
Disallow: /OnThisDay
Disallow: /r/
Disallow: /register
Disallow: /updates
Disallow: /registration/
Disallow: /tr/
Disallow: /name/nm*/mediaviewer/rm*/tr
Disallow: /title/tt*/mediaviewer/rm*/tr
Disallow: /gallery/rg*/mediaviewer/rm*/tr
Disallow: /*/rg*/mediaviewer/rm*/tr
Disallow: /*/*/rg*/mediaviewer/rm*/tr
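Python's standard library can read these rules for you. A minimal sketch using urllib.robotparser (the user agent string and title URL are just hypothetical examples):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.imdb.com/robots.txt")
rp.read()  # download and parse the robots.txt

user_agent = "my-hobby-scraper"  # hypothetical user agent
url = "https://www.imdb.com/title/tt0111161/"

# Check the Disallow rules before fetching anything
print(rp.can_fetch(user_agent, url))

# Crawl-delay, if declared for this agent, is the minimum wait between requests;
# returns None when no Crawl-delay applies (as for * in the file above)
print(rp.crawl_delay(user_agent))
```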

An API/website with a * wildcard in its CORS header and no rate limiting is a free fountain for anybody to mirror its entire database and use it (and its resources) from a client-side application on their own site. Imagine the extra load that all the new users on those copycat sites generate on the origin servers.

That's why some services offer database dumps instead. You have to deal with hosting the database and parsing/updating the fields in your own stack. They still allow you to GET the images and videos from their origin servers, but not much more.

John Paul Ada

That makes sense.