DEV Community

Bartude

Building URLs to crawl based on other websites

The idea is that a user, upon registration, selects a number of preferences (min/max price, region, etc.). Based on those preferences, I'm gonna crawl/scrape other websites to find matches. On all of these websites, matches are found using query parameters.

Just a side note: this is not to steal other people's content. The idea is merely to gather the matches from the different websites and list them all on each user's authenticated page.

I'm not sure what the best course of action is. This is what I've thought of so far:

  • Manually make a map for each website from the preference values to the corresponding query parameters, and build the URLs using that website's map
  • Use a browser automation tool like Selenium to automatically browse and build the URL each time the preferences are saved (although I'm not sure these are safe to use in production)

Any tips on the best way to move forward with this? (I am open to using APIs that would help me do this.)

Top comments (2)

Peter Jausovec

I'd create a map of preferences to the names of query parameters for each website you want to scrape (assuming there is no way to get the query parameters dynamically from each site).

For example, for each website you'd store the name of your preference (e.g. maxPrice) and the name of the matching query parameter on the website you are trying to scrape:
{
  "some-website1.com": {
    "maxPrice": "max_price",
    "region": "region", ....
  }
}

This should work fine assuming all the websites you are scraping use query parameters. Using the information from the map, you will be able to build the URL, make a request, and get the data back. If you need to parse the data (assuming you're not getting JSON/XML back), I'd look into something like BeautifulSoup (a Python library), which makes it easy to parse HTML.
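As a rough illustration of the map-based approach, here is a minimal Python sketch that turns a user's preferences into a search URL. The `SITE_PARAM_MAPS` name, the `build_search_url` helper, the `/search` path, and the example values are all assumptions made for illustration; a real site will have its own path and parameter names.

```python
from urllib.parse import urlencode

# Hypothetical per-site map: our preference name -> that site's query parameter name.
SITE_PARAM_MAPS = {
    "some-website1.com": {
        "maxPrice": "max_price",
        "region": "region",
    },
}


def build_search_url(site, preferences, path="/search"):
    """Translate user preferences into a search URL for one site.

    Preferences the site has no mapping for (or that are unset) are skipped.
    The "/search" default path is an assumption, not a real endpoint.
    """
    param_map = SITE_PARAM_MAPS[site]
    query = {
        param_map[name]: value
        for name, value in preferences.items()
        if name in param_map and value is not None
    }
    # urlencode handles escaping of special characters in the values.
    return f"https://{site}{path}?{urlencode(query)}"


url = build_search_url("some-website1.com", {"maxPrice": 500, "region": "Lisbon"})
print(url)
```

Keeping the map as plain data means that supporting another site, or following a renamed parameter, should only require editing the map rather than the URL-building code.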

Bartude

Yeah, that seems to be the way to go, since there's no other visible solution. I'm gonna have to build in safeguards in case those websites decide to change their query parameter names. Thanks!
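One possible shape for such a safeguard, sketched in Python under a big assumption: that a site which no longer recognizes a parameter drops it when it redirects. Many sites will not behave that way, so this is only one heuristic among several (checking for an unexpectedly empty result list is another). The `dropped_params` helper is hypothetical; the final URL would come from whatever HTTP client is used, after redirects have been followed.

```python
from urllib.parse import parse_qs, urlsplit


def dropped_params(requested_url, final_url):
    """Return parameter names present in the requested URL but missing from the final one.

    A non-empty result hints that the site may have renamed or removed a parameter.
    This is only a heuristic and assumes the site strips parameters it does not know.
    """
    requested = parse_qs(urlsplit(requested_url).query)
    final = parse_qs(urlsplit(final_url).query)
    return sorted(set(requested) - set(final))


missing = dropped_params(
    "https://some-website1.com/search?max_price=500&region=Lisbon",
    "https://some-website1.com/search?region=Lisbon",
)
if missing:
    print(f"Possible parameter change, review the site map for: {missing}")
```

When the check fires, flagging the site for manual review is probably safer than silently showing users unfiltered results.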