DEV Community

Discussion on: Scraping Reddit with Python and Beautiful Soup

Kyle Jones • Edited

One consideration when doing this is duplication. Reddit has a pretty big cross-posting culture and you'd likely not want similar/identical posts being caught by your scraper.
Some decent ways around this would be to store hashes of post titles and skip any title you've already seen, or to use something like MinHash or SimHash to catch near-duplicates as well.
Implementing something like this can significantly reduce the storage used.
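
To make that a bit more concrete, here's a rough sketch of both ideas using only the standard library. Everything in it is my own assumption rather than anything from the article: the helper names (`normalize`, `title_hash`, `minhash`, `TitleDeduper`), the 3-word shingles, the 64 hash functions and the 0.8 similarity threshold are all illustrative and would need tuning on real data.

```python
import hashlib
import re


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", title.lower()).split())


def title_hash(title):
    """Exact-duplicate key: SHA-256 of the normalized title."""
    return hashlib.sha256(normalize(title).encode("utf-8")).hexdigest()


def shingles(title, k=3):
    """Set of k-word shingles (the whole title if it has k words or fewer)."""
    words = normalize(title).split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(title, num_perm=64):
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    parts = shingles(title)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in parts
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class TitleDeduper:
    """Hypothetical helper: remembers titles and flags exact or near duplicates."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off, tune for your data
        self.seen_hashes = set()
        self.signatures = []

    def is_duplicate(self, title):
        """Return True if the title was seen before; otherwise record it."""
        key = title_hash(title)
        if key in self.seen_hashes:
            return True
        sig = minhash(title)
        # Linear scan over stored signatures; fine for a small scraper only.
        if any(estimated_similarity(sig, old) >= self.threshold
               for old in self.signatures):
            return True
        self.seen_hashes.add(key)
        self.signatures.append(sig)
        return False
```

In the scraping loop you'd call `is_duplicate(title)` before saving a post and skip it when it returns `True`. The linear scan over signatures won't scale to a large crawl; at that point an LSH index (or a library built for this) is probably the better route.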