The Beginner's Guide to Web Extraction and XPATH
XPATH is a query language used for selecting nodes and computing values from an XML or HTML document. For some of you reading this, that sentence alone was a lot of information. You're probably here because you've fallen down the rabbit hole of web extraction, or data scraping as it's widely called, and at this point you're just clicking around. Stop. You've arrived at the beginner's guide to all things web extraction and XPATH. Let's branch out and discuss what web extraction is and the two things you need in order to extract the most data: the markup language of the web and XPATH.
Web extraction, also known as web scraping or web harvesting, is the practice of extracting data from websites. Scraping can be done manually or through automated processes carried out by web crawlers and bots. The data is gathered from across the internet and copied into a central local database or spreadsheet. Google and Amazon Web Services (AWS) also make some public data and related tools available free of charge. While scraping may sound somewhat invasive, there are many everyday uses such as:
- online price change monitoring
- price comparison
- real estate listings
- weather data monitoring
- product reviews
Still, websites keep finding new methods to prevent web crawlers and bots from extracting data. Data scraping systems have evolved in response, but there are times when a human touch is required. That means the user must be able to both read and write the language of the web.
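To make the "gather the data and copy it into a spreadsheet" step concrete, here is a minimal sketch for the price monitoring use case. Everything specific in it is an assumption made for illustration: the page fragment, the `name` and `price` class names, and the helper `scrape_to_csv` are hypothetical, and a real scraper would first have to download the page. It uses only Python's standard library.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this text first.
PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) rows from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # which field the current <span> holds, if any
        self._current = {}   # fields collected for the product being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            # One <li> is one product, so closing it completes a row.
            self.rows.append((self._current["name"], float(self._current["price"])))
            self._current = {}


def scrape_to_csv(html_text):
    """Return the extracted rows and the same rows rendered as CSV text."""
    parser = ProductParser()
    parser.feed(html_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "price"])
    writer.writerows(parser.rows)
    return parser.rows, out.getvalue()


rows, csv_text = scrape_to_csv(PAGE)
```

The CSV text is kept in memory here so the sketch stays self-contained. Writing it to a file would give you something a spreadsheet program can open.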
The Language of the Web
Web pages are built using text-based markup languages such as HTML and XHTML. When web browsers receive HTML documents, they render them into multimedia web pages. HTML describes the structure of a web page semantically. Because the document is plain text, it holds a wealth of information that can be parsed offline. Understanding the language of the web gives you a foundation for getting the most out of web scraping tools, bots and web crawlers. It also lets you read, write and troubleshoot what you're looking for yourself. HTML and XHTML are not the only markup languages. To name a few others:
- BNML (Business Narrative Markup Language)
- CFML (ColdFusion Markup Language)
- XBEL (XML Bookmark Exchange Language)
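Coming back to HTML: the claim that it describes a page's structure as plain text can be seen with a short sketch. The one below walks a made-up page and records how its tags nest. The sample page and the names `OutlineParser` and `outline` are hypothetical, chosen only for this illustration, and the depth tracking assumes every tag is explicitly closed (void tags such as `<br>` would throw it off).

```python
from html.parser import HTMLParser

# Hypothetical sample page, used only for illustration.
SAMPLE = (
    "<html><head><title>Listings</title></head>"
    "<body><h1>Homes</h1><p>Three bedrooms</p></body></html>"
)


class OutlineParser(HTMLParser):
    """Records each tag and text chunk, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


def outline(html_text):
    """Return the indented outline of an HTML string as a list of lines."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.lines


page_outline = outline(SAMPLE)
print("\n".join(page_outline))
```

The indentation in the printed outline mirrors which tags sit inside which, and that nesting is the structure a scraper relies on.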
Roots, Trees and XPATH
In the introduction of Using Xpath: An Xpath Tutorial, it's established that XPATH is another language of the web. XPATH is based on a tree representation of the XML document. Because an XML document is hierarchical, it can be read as a tree. An XML tree must have exactly one root element, and that root is the ancestor of every other element. Each element can have sub-elements. In XPATH, the parts of the tree are called nodes: elements are nodes, and so are attributes and pieces of text. XPATH lets the user navigate around the tree and select specific nodes. Its ability to both select and compute is essential, as some sub-nodes may contain values, conditions, text or even another data structure.
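The root, node and selection ideas above can be tried out with a small sketch. The catalog document, its tag names and its prices are invented for illustration. The sketch uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPATH, so the "compute" half is done in Python on the selected nodes rather than inside the XPATH expression itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog document, invented for illustration.
CATALOG = """<catalog>
  <book category="web">
    <title>Tree Basics</title>
    <price>39.95</price>
  </book>
  <book category="cooking">
    <title>Simple Suppers</title>
    <price>30.00</price>
  </book>
  <book category="web">
    <title>Node Navigation</title>
    <price>49.99</price>
  </book>
</catalog>"""

root = ET.fromstring(CATALOG)  # <catalog> is the root element

# Select every <title> node that sits under a <book> child of the root.
titles = [node.text for node in root.findall("./book/title")]

# Select only the books whose category attribute equals "web".
web_books = root.findall("./book[@category='web']")
web_titles = [book.findtext("title") for book in web_books]

# ElementTree cannot evaluate XPATH functions such as sum() or count(),
# so the computed values are worked out in Python from the selected nodes.
web_total = sum(float(book.findtext("price")) for book in web_books)
book_count = len(root.findall("./book"))
```

A full XPATH 1.0 engine, such as the third-party lxml library, can do the computing inside the expression, for example with `sum(//book[@category='web']/price)`. The standard library version is used here only to keep the sketch free of extra installs.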
If you plan on working with the web, learning the language is an essential step. As website developers find new ways to block bots, knowing the language will allow you to find a workaround. Knowing and understanding what you're looking for is crucial when data scraping. With XPATH, when you find nodes that contain values, you can compute with them right away. Building your foundation in the language of the web will lead to a deeper understanding of the tree structure, and that deeper understanding will have you scraping heaps of data in no time.