Building web scrapers can be frustrating and repetitive.
You must dig into the target webpage's HTML, search for the elements you want to extract, and match them with CSS selectors or XPaths.
You want to ensure your scraper consistently matches the desired elements, but choosing a suitable selector or path is difficult; a wrong choice can lead to failed scrapes.
The challenge is that modern websites generate HTML dynamically -- a scraper that works on one page may fail on a similar page, and one that works today may fail tomorrow.
Figuring out what works consistently is primarily a process of trial and error.
Choosing good selectors
To uniquely and consistently match an element, the best choice is a unique ID attribute. Unfortunately, most of the time, one will not be available.
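When an ID is available, though, matching is trivial. A minimal sketch using Python and BeautifulSoup; the HTML fragment and the product-price ID are invented for illustration:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical page fragment; the id attribute is invented for illustration.
html = '<div><span id="product-price">19.99</span></div>'

soup = BeautifulSoup(html, "html.parser")
# An id must be unique within a valid document, so this match is unambiguous.
price = soup.select_one("#product-price")
print(price.text)  # -> 19.99
```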
How about XPaths? Those are unique - but inconsistent.
Let's say, for example, you want to match this span element: body/div[2]/span.
What if, on another page or at a later time, a div is added before the div at index 2? In that case, the scraper will match the wrong element, and the path that matches the desired span becomes body/div[3]/span.
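To make the failure concrete, here is a minimal sketch with Python's lxml; both page fragments are invented:

```python
from lxml import html  # pip install lxml

# Two hypothetical revisions of the same page: v2 inserts an extra div.
v1 = "<html><body><div>ad</div><div><span>target</span></div></body></html>"
v2 = "<html><body><div>ad</div><div>nav</div><div><span>target</span></div></body></html>"

for page in (v1, v2):
    tree = html.fromstring(page)
    # The index-based path matches on v1 but silently returns nothing on v2.
    print(tree.xpath("body/div[2]/span/text()"))  # -> ['target'], then []
```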
CSS selectors are meant for applying styles to elements, but they are also useful for scrapers. The problem is that the same style can be applied to many different elements, and you may not want to match all of them. The only way to know is to search the entire page for matches.
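A quick illustration of that ambiguity, again with invented markup:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Invented fragment: the same class styles both a product price and a shipping fee.
page = """
<span class="price">19.99</span>
<span class="price">4.99 shipping</span>
"""

soup = BeautifulSoup(page, "html.parser")
# One selector, two matches -- only scanning the whole page reveals the ambiguity.
print(len(soup.select(".price")))  # -> 2
```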
For both XPaths and CSS selectors, shorter is usually better. A long path or selector is exposed to changes in every element along it; a short one limits that exposure and is therefore more resistant to change.
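As a sketch of that trade-off (the markup and class names are made up): a long selector breaks when an intermediate wrapper disappears, while a short one survives.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Invented markup: a later redesign removes the wrapper div around the span.
before = '<div class="content"><div class="wrap"><span class="price">19.99</span></div></div>'
after = '<div class="content"><span class="price">19.99</span></div>'

for page in (before, after):
    soup = BeautifulSoup(page, "html.parser")
    long_hit = soup.select_one("div.content > div.wrap > span.price")
    short_hit = soup.select_one("span.price")
    # The long selector depends on every element along the path;
    # the short one only on the target itself.
    print(bool(long_hit), bool(short_hit))  # before: True True / after: False True
```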
These are just some of the considerations to keep in mind. In practice, you must make choices, try them out on different pages, and repeat the process until you get good results. After that, you would still need to monitor your scraper and be ready to make changes.
Solutions
Luckily, there are some tools and libraries that can help.
One excellent project, for example, generates CSS selectors that are stable and robust. It searches the entire DOM tree for unique selectors, assigns them scores, and returns the best choice.
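As a rough sketch of that idea (this is not the project's actual code): generate candidate selectors for the target element, keep only those that match it uniquely, and prefer the shortest.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def best_selector(soup, target):
    """Toy scorer: collect candidate selectors, keep the ones that match
    only the target, and return the shortest. Real generators weigh many
    more signals (attribute stability, depth, and so on)."""
    candidates = []
    if target.get("id"):
        candidates.append(f'#{target["id"]}')
    for cls in target.get("class", []):
        candidates.append(f'{target.name}.{cls}')
    candidates.append(target.name)
    # A candidate is usable only if it matches exactly the target element.
    unique = [c for c in candidates if soup.select(c) == [target]]
    return min(unique, key=len) if unique else None

# Invented example page.
page = '<div><span class="price">19.99</span><span class="note">hi</span></div>'
soup = BeautifulSoup(page, "html.parser")
print(best_selector(soup, soup.find("span", class_="price")))  # -> span.price
```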
Some tools use machine learning to scrape web pages. These are trained on specific types of pages, for example, e-commerce product pages, and extract commonly used data -- product titles, prices, images, etc.
These are useful in some cases, but you cannot extract any additional data you may want, you cannot configure the format of the results, and sometimes they simply don't work, with little you can do to fix them.
AutoBrowser.io
AutoBrowser is a web application I've created to solve some of the problems mentioned above. Unlike other apps, it does not require installing software or shady browser extensions.
Just enter a URL of a website you want to scrape; AutoBrowser will take a snapshot of the page and let you select the elements you want to extract directly from the web UI. Clicking an element finds the best selector for it, which you can copy and use.
You can also use a template to create a configurable API, extract the same elements from similar pages, or set schedules for receiving notifications on updates. The selector extraction tool is free to use. API calls and schedules require purchasing usage credits.
AutoBrowser is a work in progress; feel free to contact me with questions, and I'd appreciate any feedback!