Thanks, that's a good overview of the basics.
With scraping things can get big and serious fast and the codebase would get very big quickly. The majority of the work would be maintaining different scrapers/parsers for different websites that are always changing etc.
There's an excellent library/framework for creating scrapers (spiders) in Python: Scrapy. It takes a bit of a learning and setup but it's really really powerful once you master the concepts. There are daemons like scrapyd, web admin interface like Spiderkeeper etc and these work quite nicely together. If you're serious about scraping then you'd need a proxy solution also. I've had really good experience with Luminati. They're expensive but the best. And the comes cracking the captchas and other advanced topics. So scraping is a big world on its own. Happy scraping! :)
Thanks, that's a good overview of the basics.
With scraping things can get big and serious fast and the codebase would get very big quickly. The majority of the work would be maintaining different scrapers/parsers for different websites that are always changing etc.
There's an excellent library/framework for creating scrapers (spiders) in Python: Scrapy. It takes a bit of a learning and setup but it's really really powerful once you master the concepts. There are daemons like scrapyd, web admin interface like Spiderkeeper etc and these work quite nicely together. If you're serious about scraping then you'd need a proxy solution also. I've had really good experience with Luminati. They're expensive but the best. And the comes cracking the captchas and other advanced topics. So scraping is a big world on its own. Happy scraping! :)
Exactly!... things get complicated quickly but there are excellent libraries out there to help. Thanks for the helpful references too