Basically I want to crawl simple blogs and extract their blog posts. The biggest challenge here would probably be parsing the data and understanding the different content parts within a blog post.
Top comments (6)
Node.js, possibly with Puppeteer, would probably be the first natural choice, although I'm not that accustomed to Puppeteer.
I used to use the Selenium API with Python when I needed to scrape dynamic websites, but async in Python doesn't feel as natural as it does in Node.js.
I don't know much about Golang. How often is it used for web scraping?
It would depend on the type of scraping.
If you need to interact like a human, then Puppeteer with JS / TS would be good: github.com/puppeteer/puppeteer
If you just need to parse data, I really like cheerio with JS / TS: github.com/cheeriojs/cheerio
It lets you access webpage information with jQuery syntax, which can be quite practical.
Thanks for the response!
I don't need to interact like a human, just collect news articles from different websites, at scale. Looking at cheerio, it seems like a very decent option. Thanks!
If it's scale you're looking for, then the best option would be Scrapy (scrapy.org) with Scrapy Cloud. You can also run multiple Scrapy spiders in a single process.
Elixir + Floki
I'm also a beginner at web scraping. The Scrapy framework is a good tool but has a steeper learning curve than just using libraries (Selenium, BeautifulSoup, Requests).