BeautifulSoup or Scrapy?

#discuss #python #html #webdev

Which do you prefer?

Top comments (6)

Jean-Michel Plourde • Jun 15 '19 • Edited

It really depends on your needs. Both work with HTML, but they differ in scope and capabilities.

BeautifulSoup is a library that parses HTML with very little effort. It does not fetch pages itself: you hand it HTML you already have (for example, downloaded with requests) and it stops there. You could add some automation around it, but other tools already do that. It gives you access to the data without any hassle.

Scrapy is a full-fledged framework for fetching the HTML of many pages within a set of domains. You specify constraints and it fetches everything it can within the limits you set.

It boils down to a library vs a framework.

I'm currently working on a project where I need to fetch some data from a website with requests and then parse the HTML with BeautifulSoup. It's simple, surface-level parsing.
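A minimal sketch of that requests-then-BeautifulSoup flow, in case it helps. The function names `fetch_html` and `extract_links` are made up for illustration and are not from my project; the point is only that requests does the downloading and BeautifulSoup does the parsing.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML text (requests does the fetching)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list[tuple[str, str]]:
    """Parse HTML and return (text, href) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

You would chain the two as `extract_links(fetch_html(url))` for whatever URL you care about. Keeping them separate means the parsing half can be tried on any HTML string without touching the network.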

There is another project where a bot crawls many websites, collects all the data, then sends it to a neural network to work on. In this case Scrapy is the best option because you just define some rules and let it do its job automatically.

eluzix • Jun 16 '19 • Edited

They are not the same thing.

Beautiful Soup lets you build a navigable tree from HTML and XML sources (a file, a string, or a stream; for a URL you download the content first). After building the tree, you can search it, modify it, or pull data out.
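A small sketch of those three operations (search, navigate, modify) on an inline HTML snippet. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
  <h1 id="title">Old title</h1>
  <ul>
    <li class="item">alpha</li>
    <li class="item">beta</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Search: find every <li> with class "item" and pull out its text.
items = [li.get_text() for li in soup.find_all("li", class_="item")]

# Navigate: step from a node up to its parent element.
first_item = soup.find("li")
parent_name = first_item.parent.name

# Modify: replace the heading text in place.
soup.find(id="title").string = "New title"
new_title = soup.h1.get_text()
```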

Scrapy is a framework for crawling and scraping content from websites. For each page crawled you get access to its DOM, so you can extract the information you need. This part is much like BS, so if you are looking for a comparison, that's where you should look.

To give a real-life example, I built a system that crawls a website for its historical content, then extracts and saves the data. After that, it periodically checks the site's content via its RSS feed.

For the initial crawl, I used Scrapy to easily navigate through the site content. For the RSS stage, I used BS4 to parse each new URL I got from the RSS feed.

Edit:
When working with Scrapy, you can use BS to extract information from the HTML you get; see docs.scrapy.org/en/latest/topics/s...

Kamaraj • Jun 16 '19

BeautifulSoup is best.
Use the requests module to get the HTML page,
then pass it to BeautifulSoup and scrape it.
This is a good combination for scraping websites.