
If you needed to scrape many different websites nowadays, which tool/language combo would you pick?

Mario Davchevski (davcevski)

Basically, I want to crawl simple blogs and extract their blog posts. The biggest challenge here would probably be parsing the data and understanding the different content parts within a blog post.

Discussion

crimsonmed profile image
Médéric Burlet

It would depend on the type of scraping.

If you need to interact with the page as a human would, then Puppeteer with JS / TS would be good: github.com/puppeteer/puppeteer

If you just need to parse data, I really like to use cheerio with JS / TS: github.com/cheeriojs/cheerio
It lets you access webpage information with jQuery syntax, which can be quite practical.
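
To make the "just parse the data" case concrete, here is a minimal sketch of the same idea in Python rather than cheerio, using only the standard library's html.parser. It collects the links from a blog index page so the posts can be fetched later. The names LinkCollector and extract_links and the sample markup are made up for illustration; this is not cheerio's API.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, link text) pair found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical index page, for illustration only.
SAMPLE_INDEX = (
    '<ul><li><a href="/post-1">First post</a></li>'
    '<li><a href="/post-2">Second <b>post</b></a></li></ul>'
)
links = extract_links(SAMPLE_INDEX)
```

A selector-based library such as cheerio or Beautiful Soup would express this in a line or two; the standard-library version is only meant to show what the parsing step involves.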

davcevski profile image
Mario Davchevski Author

Thanks for the response!

I do not need to interact as a human, just collect news articles from different websites, at scale. Looking at cheerio, it seems like a very decent option. Thanks!

patarapolw profile image
Pacharapol Withayasakpunt

Node.js +/- Puppeteer would probably be the first natural choice, although I am not that accustomed to Puppeteer.

I used to use the Selenium API with Python when I needed to scrape dynamic websites. But async in Python does not seem to be as natural as in Node.js.

I don't know much about Golang. How often is it used for web scraping?
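
For reference, this is roughly what concurrent fetching can look like with Python's asyncio, as a minimal sketch using only the standard library. The function names (fetch_blocking, crawl) and the max_concurrent parameter are assumptions for illustration, and asyncio.to_thread needs Python 3.9 or newer. The blocking download runs in worker threads, and a semaphore caps how many run at once.

```python
import asyncio
import urllib.request


def fetch_blocking(url, timeout=10):
    """Download one page with the standard library (blocking call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


async def crawl(urls, fetch=fetch_blocking, max_concurrent=5):
    """Fetch all urls, with at most max_concurrent downloads in flight.

    Returns a dict mapping each URL to its body, or to the exception
    that was raised while fetching it.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        async with semaphore:
            try:
                # Run the blocking fetch in a worker thread.
                return url, await asyncio.to_thread(fetch, url)
            except Exception as exc:
                # One failing site should not abort the whole crawl.
                return url, exc

    pairs = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(pairs)
```

Usage would be something like `pages = asyncio.run(crawl(list_of_urls))`. Passing a different `fetch` callable makes it easy to try the function without touching the network.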

davcevski profile image
Mario Davchevski Author

But async in Python does not seem to be as natural as in Node.js

This is one of the reasons I listed Go in the tags. I am still learning it, but it feels like well-thought-out concurrent code can go a long way when scraping at scale.

Basically, I want to crawl simple blogs and extract their blog posts. The biggest challenge here would probably be parsing the data and understanding the different content parts within a blog post.
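
As a rough illustration of the "content parts" problem, here is a minimal Python sketch that pulls a title and body paragraphs out of a post. It assumes the post is wrapped in an <article> element with one <h1> and properly closed <p> tags. That is a hypothetical layout, and real blogs vary a lot, so treat it as a starting heuristic and not a general solution. PostExtractor and extract_post are made-up names.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Pulls the first <h1> and all non-empty <p> texts out of <article>."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.paragraphs = []
        self._article_depth = 0   # > 0 while inside an <article>
        self._capture = None      # "h1" or "p" while collecting text
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._article_depth += 1
        elif self._article_depth and tag in ("h1", "p") and self._capture is None:
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth:
            self._article_depth -= 1
        elif tag == self._capture:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if tag == "h1" and self.title is None:
                self.title = text
            elif tag == "p" and text:
                self.paragraphs.append(text)
            self._capture = None


def extract_post(html):
    """Return {'title': ..., 'paragraphs': [...]} for one post page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "paragraphs": parser.paragraphs}
```

Text outside the <article> (navigation, footer) is ignored, which is the main point of the heuristic. Sites without an <article> element would need a per-site rule or a dedicated content-extraction library.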

talha131 profile image
Talha Mansoor

I do not need to interact as a human, but just collect news articles from different websites, at scale.

If it is scale you are looking for, then the best option would be Scrapy (scrapy.org/) with Scrapy Cloud. You can also run multiple Scrapy spiders in a single process.

jcsvveiga profile image
João Veiga

Elixir + Floki

jengfad profile image
Jennifer Fadriquela

I'm also a beginner at web scraping. The Scrapy framework is a good tool, but it has a steeper learning curve than just using libraries (Selenium, Beautiful Soup, Requests).