DEV Community

Anjan Kant

What is Web Scraping | Data Mining

Web scraping is a general term for the methods used to extract data from websites and gather information across the Internet. It is usually done with dedicated software that simulates a person browsing the web and collects specific pieces of information from different sites.

Purpose of web scraping

With web scraping programs, professionals and businesses can gather web data to sell to other companies or users, or to use for promotional purposes. Web scraping is also known as screen scraping, data mining, web harvesting, or web data extraction.

Web scraping as data mining

Used as a data mining technique, web scraping helps collect weather reports, auction information, market prices for products, or any other listing of information that can be captured from a page. Many websites restrict scraping for data mining purposes, but it is still widely used to aggregate data from private and government sources despite the legal challenges.

Types of data mining

Developers use several techniques to mine data from the web. Four approaches are described below.

1. Text pattern fetching

A simple yet powerful way to extract text from HTML pages is to use the UNIX grep command or the regular-expression matching facilities of a programming language (for instance, Perl or Python).
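As a rough illustration of text pattern matching, the sketch below uses Python's `re` module to pull prices out of an HTML fragment. The fragment, the `price` class name, and the `extract_prices` helper are assumptions made up for this example; a real page would need its own pattern.

```python
import re
from typing import List

# Hypothetical HTML fragment, used only for illustration.
SAMPLE_HTML = """
<ul>
  <li class="item">Laptop <span class="price">$899.00</span></li>
  <li class="item">Phone <span class="price">$499.50</span></li>
</ul>
"""

# Capture the number inside <span class="price">$...</span>.
PRICE_PATTERN = re.compile(r'<span class="price">\$([0-9]+\.[0-9]{2})</span>')


def extract_prices(html: str) -> List[float]:
    """Return every price matched by the pattern, in document order."""
    return [float(match) for match in PRICE_PATTERN.findall(html)]


prices = extract_prices(SAMPLE_HTML)
print(prices)
```

Patterns like this are quick to write but brittle: a small change in the markup (an extra attribute, different quoting) can stop them matching, which is one reason the approaches below exist.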

2. HTML parsing (Wrapping)

In this method, a wrapper extracts information from web pages whose data is generated dynamically from a template. The key feature of a wrapper is that it detects such templates in a particular information source, extracts the content, and translates it into a structured form. Wrapper generation algorithms assume that the input pages of a wrapper induction system conform to a common template and can be easily identified by a common URL scheme.[3] In addition, semi-structured data query languages such as HTQL and XQuery can be used to parse HTML pages and to retrieve and transform their content.
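The idea of a wrapper can be sketched with Python's standard `html.parser` module. This is a hand-written wrapper for one assumed template, not a wrapper induction algorithm: the `ProductWrapper` class, the `wrap` helper, the `title` and `price` class names, and the two sample pages are all hypothetical.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Collect the text of elements whose class attribute names a field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.record = {}
        self._current = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.fields:
            self._current = cls
            self.record[cls] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so nested markup inside a field is not supported.
        self._current = None


def wrap(html, fields):
    """Translate one templated page into a dict of field -> text."""
    parser = ProductWrapper(fields)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.record.items()}


# Two hypothetical pages that share the same template.
PAGE_A = '<html><body><h1 class="title">Blue Kettle</h1><p class="price">24.99</p></body></html>'
PAGE_B = '<html><body><h1 class="title">Red Toaster</h1><p class="price">39.50</p></body></html>'

records = [wrap(page, ["title", "price"]) for page in (PAGE_A, PAGE_B)]
print(records)
```

Because both pages follow the same template, the same wrapper turns each of them into a record with the same fields, which is the structured form the text describes.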

3. HTTP programming

Static and dynamic web pages can be retrieved by sending HTTP requests to the remote web server using socket programming.
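A minimal sketch of this approach, assuming plain HTTP (no TLS) and a server that closes the connection after responding, might look like the following. The helper names `build_request`, `split_response`, and `fetch` are invented for this example, and `example.com` is only a placeholder host.

```python
import socket


def build_request(host, path="/"):
    """Build the bytes of a bare HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw response into (status code, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    status_code = int(status_line.split(" ", 2)[1])
    return status_code, body


def fetch(host, port=80, path="/", timeout=10.0):
    """Send the request over a TCP socket and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))


if __name__ == "__main__":
    # Placeholder host; requires network access.
    status, body = fetch("example.com")
    print(status, len(body))
```

In practice most scrapers use a higher-level HTTP client, which also handles HTTPS, redirects, and chunked responses; the raw socket version is shown only to make the request and response visible.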

4. DOM (Document Object Model) parsing

By embedding a full-fledged web browser, such as Internet Explorer, Chrome, or the Mozilla browser control, an application can retrieve the dynamic content generated by client-side scripts. These browsers also parse web pages into a DOM tree, from which a web scraping application can retrieve parts of the pages.
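Driving a real browser needs an external automation tool, so the sketch below only illustrates the second half of the idea: walking a DOM tree to pick out parts of a page. It uses Python's standard `xml.dom.minidom` on a small, well-formed fragment that stands in for markup a browser has already rendered. The fragment, the `product` class name, and the `product_links` helper are assumptions for this example.

```python
from xml.dom import minidom

# Hypothetical, well-formed markup standing in for a rendered page.
RENDERED = """<html><body>
<div id="listing">
  <a class="product" href="/p/1">Blue Kettle</a>
  <a class="product" href="/p/2">Red Toaster</a>
  <a class="nav" href="/about">About</a>
</div>
</body></html>"""


def product_links(markup):
    """Return (text, href) for every <a class="product"> in the DOM tree."""
    dom = minidom.parseString(markup)
    links = []
    for anchor in dom.getElementsByTagName("a"):
        if anchor.getAttribute("class") != "product":
            continue
        # Join the direct text-node children of the anchor.
        text = "".join(
            node.data
            for node in anchor.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        links.append((text.strip(), anchor.getAttribute("href")))
    return links


print(product_links(RENDERED))
```

Note that `minidom` is an XML parser and will reject the loosely formed HTML found on many real sites; a browser's own DOM, or a tolerant HTML parser, would be needed there.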