Daniel Musembi

Building Web Scrapers with Puppeteer

Table Of Contents

  • Introduction to Web Scraping with Puppeteer

  • Getting Started with Puppeteer

  • Selecting and Extracting Data

  • Interacting with Dynamic Web Content

  • Handling Forms and User Input

  • Dealing with Pagination and Multiple Pages

  • Overcoming Web Scraping Challenges

  • Ensuring Ethical and Responsible Web Scraping

  • Conclusion

Introduction to Web Scraping with Puppeteer

Overview of web scraping and its importance

The term "web scraping" refers to the method of automatically gathering structured data from the World Wide Web. Data scraping or web data extraction are other common names for this practice.

The capacity to glean useful information from websites and other online resources is what gives web scraping its significance. Some of the most compelling arguments in favour of web scraping include:

  • Web scraping is a method for collecting and analyzing data from the vast amounts of information available on the Internet. This data makes market research, competition analysis, sentiment analysis, trend tracking, and other insight-driven tasks possible.

  • As a sort of business intelligence, web scraping provides companies with valuable information about their rivals, market trends, prices, products, and client feedback. Market research, strategic planning, and the discovery of fresh prospects are just some of the many uses to which this information can be put.

  • For lead generation and sales prospecting, web scraping allows the collection of data from websites and directories, including contact details, customer information, and potential leads, which can then feed focused marketing initiatives.

  • Web scraping enables businesses to monitor and track changes to web pages in real time. This is especially helpful for businesses in the e-commerce, financial services, and media sectors, where keeping up with the latest pricing, stock availability, market trends, and news stories is essential.

  • When it comes to content gathering and curation, web scraping is a crucial tool. Businesses may build superior databases, news aggregators, comparison sites, and curated content platforms by collecting and arranging data from numerous sources.

  • Web scraping is an excellent way to gather raw material for artificial intelligence and machine learning projects. Large-scale data collection enables businesses to build NLP, image recognition, sentiment analysis, and other AI-powered models.

  • Academic researchers frequently resort to web scraping to gather information for studies, polls, and experiments. Online resources provide researchers with a wealth of information that may be used to study user behaviour, public opinion, and developing trends.

  • Government agencies and regulatory bodies use web scraping to monitor and enforce compliance with laws and regulations. By scraping information from websites, these organizations can detect illegal behaviour, monitor market trends, and compile evidence for legal cases.

  • Web scraping plays a crucial role in price-comparison and monitoring websites and platforms, making it easy to find the best prices and discounts across multiple online retailers.

Introduction to the Puppeteer Library

Puppeteer automates and streamlines front-end development and testing. It was developed by Google and is a free, open-source library built on Node.js.

Puppeteer provides APIs for controlling and interacting with Chromium or Chrome in headless mode (using the DevTools Protocol). Browsers such as Chrome, Chromium, Edge, and Firefox can also be driven in non-headless (headed) mode.

Most user-interface testing, keystrokes, mouse clicks, and so on can be automated with Puppeteer. Angular and AngularJS applications can be tested with it, and it can also carry out tasks such as web page crawling and scraping.

Puppeteer works much like a developer tool: it can handle many of the same tasks developers perform in the browser's DevTools, such as managing requests and responses, finding elements, and monitoring network traffic and performance.
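As a rough illustration, here is a minimal sketch of monitoring requests and responses while a page loads; the URL is just a placeholder and the logging is purely illustrative:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  // Enable request interception so each request can be inspected
  // (and, if desired, modified or blocked) before it is sent.
  await page.setRequestInterception(true);

  page.on('request', (request) => {
    console.log('request:', request.method(), request.url());
    request.continue(); // Let the request through unchanged.
  });

  page.on('response', (response) => {
    console.log('response:', response.status(), response.url());
  });

  await page.goto('https://example.com'); // Placeholder URL.
  await browser.close();
})();

Request interception also lets you block or modify requests (for example, skipping images to speed up scraping) by calling request.abort() instead of request.continue().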

If we look at the npm trends for Puppeteer downloads over the past few years, we can see that Puppeteer's popularity has been on the rise.


Let's look at Puppeteer Architecture

Puppeteer is a Node.js library that provides a high-level API for controlling Chromium or Chrome browsers. It does this over the DevTools Protocol (a small example of using that protocol directly follows the hierarchy below).

Puppeteer has the following hierarchy:

  • Browser: the browser engine (Chromium or Chrome, with or without headless mode) that actually does the work.

  • Chrome DevTools Protocol (CDP): the protocol over which the browser is controlled. Chromium is the browser engine used by both Microsoft Edge and Chrome.

  • Puppeteer: a Node.js package that drives the browser through CDP.

  • Automation code (Node.js level): the JavaScript that the user writes to automate testing or scraping tasks.
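To make the CDP layer concrete, here is a small sketch (assuming a recent Puppeteer version) that opens a raw CDP session from a page and sends a DevTools command directly; example.com is a placeholder:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Puppeteer itself talks to the browser over CDP; createCDPSession()
  // exposes that same protocol so we can send DevTools commands directly.
  const client = await page.target().createCDPSession();
  await client.send('Performance.enable');
  const { metrics } = await client.send('Performance.getMetrics');
  console.log(metrics);

  await browser.close();
})();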

Use cases and benefits of using Puppeteer for web scraping

Here are some of the use cases for Puppeteer for web scraping:

  • Extracting data from dynamic websites: Puppeteer can be used to extract data from websites that use JavaScript to dynamically generate their content. This can be difficult to do with other web scraping tools, as they often rely on static HTML to extract data.

  • Crawling websites: Puppeteer can be used to crawl websites, which means visiting each page on a website and extracting data from it. This can be useful for tasks such as creating a sitemap or gathering data for market research.

  • Automating tasks on websites: Puppeteer can be used to automate tasks on websites, such as logging in, filling out forms, or clicking buttons. This can be useful for tasks such as testing websites or automating customer support workflows.

Here are some of the benefits of using Puppeteer for web scraping:

  • Powerful: Puppeteer is a powerful tool that can be used to extract data from a wide variety of websites.

  • Flexible: Puppeteer is a flexible tool that can be used for a variety of tasks, such as extracting data, crawling websites, and automating tasks.

  • Reliable: Puppeteer is maintained alongside Chrome and downloads a compatible browser build by default, so scripts are unlikely to break because of browser version mismatches.

  • Easy to use: Puppeteer is easy to use, even for beginners.

Getting Started with Puppeteer

Installing Puppeteer and setting up the development environment

  • Make sure you have Node.js installed on your machine. You can download it from the official Node.js website (https://nodejs.org) and follow the installation instructions for your operating system.

Open a terminal or command prompt and create a new directory for your Puppeteer project. You can do this by running the following command:

mkdir puppeteer-project

  • Navigate into the newly created directory using the cd command:
  cd puppeteer-project

  • Initialize a new Node.js project by running the following command and following the prompts:
npm init

  • Once the project is initialized, you need to install Puppeteer as a dependency. Run the following command to install Puppeteer using npm:
npm install puppeteer

  • Verify that Puppeteer is successfully installed by creating a new JavaScript file, for example index.js, and adding the following code:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  const title = await page.title();
  console.log(title);
  await browser.close();
})();

  • Run the script by executing the following command in your terminal:
node index.js


This will execute the Puppeteer script. The output of the command should be the title of the Google homepage.


Creating a basic Puppeteer script

Creating a basic Puppeteer script involves using Puppeteer's API to launch a headless Chrome browser, navigate to a web page, and perform various actions such as extracting data or taking screenshots. Here's a brief description of the steps involved in creating a basic Puppeteer script:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  // Create a new page instance.
  const page = await browser.newPage();

  // Navigate to the Google homepage.
  await page.goto('https://www.google.com');

  // Take a screenshot of the page.
  await page.screenshot({ path: 'google.png' });

  // Close the browser instance.
  await browser.close();
})();

A new instance of Chrome will be launched, directed to the Google site, and a screenshot will be taken automatically. The captured image can then be seen in your computer's file browser.


Selecting and Extracting Data

Using CSS selectors to target elements on web pages

Let's see, step by step, how to use CSS selectors to target elements on web pages:

  • Import the Puppeteer library.
const puppeteer = require('puppeteer');

  • Create a new browser instance.
  const browser = await puppeteer.launch({ headless: 'new' });

  • Create a new page instance.
const page = await browser.newPage();

  • Navigate to the web page.
await page.goto('https://www.google.com');

  • Use the page.$() method (Puppeteer's equivalent of document.querySelector) to select the element you want to target.
const element = await page.$('.g-h-title');

  • Do whatever you want with the element.
// Get the text content of the element.
const title = await element.evaluate(el => el.textContent);

// Print the title to the console.
console.log(title);

  • Close the browser instance.
await browser.close();


Here is an example of how to use CSS selectors to target elements on web pages using Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });

  // Create a new page instance.
  const page = await browser.newPage();

  // Navigate to the Google homepage.
  await page.goto('https://www.google.com');

  // Select the element (page.$ runs document.querySelector in the page).
  const titleElement = await page.$('.g-h-title');

  // Get the text content of the element.
  const title = await titleElement.evaluate(el => el.textContent);

  // Print the title to the console.
  console.log(title);

  // Close the browser instance.
  await browser.close();
})();

This code will print the text content of the element matched by the .g-h-title selector to the console (assuming such an element exists on the page).
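A single element is often not enough; here is a minimal sketch using page.$$eval to pull a whole collection in one pass (the URL is a placeholder, and the 'a' selector simply grabs every link on the page):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // page.$$eval runs document.querySelectorAll in the page and passes the
  // matched elements to the callback, which executes inside the browser.
  const links = await page.$$eval('a', (anchors) =>
    anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }))
  );

  console.log(links);
  await browser.close();
})();

Note that the callback passed to page.$$eval runs inside the browser, so only serializable data (strings, numbers, plain objects) can be returned to Node.js.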

Interacting with Dynamic Web Content

Interacting with dynamic web content is an essential aspect of web scraping and automation using Puppeteer. Here's an overview of how you can wait for page loads and AJAX requests, interact with JavaScript-driven elements, and execute custom scripts on web pages:

  • Waiting for page loads: Puppeteer provides several ways to wait for a page to load, including waitForNavigation, waitForSelector, and waitForFunction, as well as the waitUntil option on page.goto (for example 'domcontentloaded' or 'networkidle2'). These can be used to ensure that the page has finished loading before proceeding with any further actions.

  • AJAX requests: Puppeteer can interact with AJAX traffic using the waitForRequest and waitForResponse methods together with evaluate. waitForRequest waits for a matching request to be issued, waitForResponse waits for its response to arrive, and evaluate can then run JavaScript in the page to read the data the request produced.

  • JavaScript-driven elements: Puppeteer can be used to interact with JavaScript-driven elements by using the click, type, and focus methods. These methods can be used to simulate user interactions with elements that are driven by JavaScript.

  • Custom scripts: Puppeteer can be used to execute custom scripts on web pages by using the evaluate method. The evaluate method can be used to execute any JavaScript code on the page, which can be useful for tasks such as data extraction and form filling.

// Wait for the page to load
await page.waitForNavigation();

// Wait for a specific AJAX request's response to arrive
await page.waitForResponse('https://example.com/api/get-data');

// Get the results of an AJAX request
const data = await page.evaluate(() => {
  return window.data;
});

// Click an element
await page.click('#button');

// Type text into an input element
await page.type('#input', 'some text');

// Focus on an element
await page.focus('#input');

// Execute a custom script
await page.evaluate(() => {
  // Do something with the page
});

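Putting several of these together, here is a sketch that waits for JavaScript-rendered content before extracting it; the URL and the .result selector are hypothetical stand-ins for your target page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  // Wait until network activity settles so JavaScript-rendered content is in place.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Also wait for a specific element that the page renders via JavaScript.
  await page.waitForSelector('.result');

  const items = await page.$$eval('.result', (nodes) =>
    nodes.map((node) => node.textContent.trim())
  );

  console.log(items);
  await browser.close();
})();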

Handling Forms and User Input

  • Filling out and submitting web forms: To fill out a web form programmatically, you can use the type and click methods. The type method types text into an input element, while the click method clicks an element. For example, to fill in a username and password and submit the form, you would use the following code:
await page.type('#username', 'username');
await page.type('#password', 'password');
await page.click('#submit');

  • Handling CAPTCHAs and authentication challenges: CAPTCHAs and authentication challenges are designed to stop automated bots from accessing web pages, so there is no reliable way to script around them. In practice, scrapers either hand the challenge off to a CAPTCHA-solving service or a human, or use evaluate to extract the challenge details from the page; waitForNavigation can then wait for the page to load once the challenge has been completed. For example:
const challenge = await page.evaluate(() => {
  // Extract the CAPTCHA details here and hand them off for solving.
});

// Wait for the page to load after the challenge has been completed.
await page.waitForNavigation();

  • Simulating user interactions like clicks and typing: To simulate user interactions, you can use the click and type methods. The click method clicks on an element, while the type method types text into an input element. For example, to simulate a user clicking a button and typing some text into an input element, you would use the following code:
await page.click('#button');
await page.type('#input', 'some text');


If you follow these guidelines, you'll have no trouble utilizing Puppeteer to manage forms and user input programmatically.
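Tying these pieces together, a complete login flow might look like the following sketch; the URL and the #username, #password, and #submit selectors are assumptions about the target form:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://example.com/login');

  // Fill in the credentials (selectors are hypothetical).
  await page.type('#username', 'my-user');
  await page.type('#password', 'my-secret');

  // Click submit and wait for the resulting navigation in parallel,
  // so the navigation event is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click('#submit'),
  ]);

  console.log('Logged in, now at:', page.url());
  await browser.close();
})();

Wrapping click and waitForNavigation in Promise.all ensures the navigation triggered by the click is not missed.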

Dealing with Pagination and Multiple Pages

Pagination is a standard feature of websites, allowing visitors to flip between pages of content. When scraping a paginated site, you need to decide how to walk through those pages so that you capture all of the desired data.

There are a few ways to handle pagination when scraping. One option is to use a web scraping framework with pagination support: Scrapy, for example, can follow links from page to page, and parsers such as BeautifulSoup can be combined with a simple request loop to do the same. With these tools you describe how to reach the next page (or how many pages to fetch), and the framework takes care of the rest.

Let's see steps on how to handle pagination when scraping a website using Scrapy:

  • Install Scrapy.

  • Create a Scrapy project.

  • Define a spider.

  • In the spider, specify the number of pages you want to scrape.

  • Use the yield statement to yield each page of data.

  • Run the spider.

Here is an example of a Scrapy spider that scrapes the first 10 pages of results from Google:

import scrapy

class GoogleSpider(scrapy.Spider):
    name = "google"

    def start_requests(self):
        for i in range(10):
            url = "https://www.google.com/search?q=python&start={}".format(i * 10)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for result in response.css(".g-card"):
            title = result.css(".g-title::text").get()
            url = result.css(".g-link::attr(href)").get()
            print(title, url)


When this spider is run, it will scrape the first 10 pages of results from Google and print the title and URL of each result.

Here is another Scrapy spider, this one crawling an entire paginated website. It shows how the site's own internal "next page" links can be used to navigate from page to page.

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        url = "https://www.example.com/"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract the data you need from the current page here.

        # Follow the "next page" link if there is one; Scrapy will call
        # parse() again on each new page until no link is found.
        next_page_url = response.css(".next_page::attr(href)").get()
        if next_page_url is not None:
            yield response.follow(next_page_url, callback=self.parse)


When run, this spider follows the chain of "next page" links and can extract data from every page it visits.
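Back in Puppeteer, the same follow-the-next-link pattern can be written as a simple loop. This is a sketch that assumes the site exposes a link matching the .next_page selector used above:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://www.example.com/');

  while (true) {
    // Extract data from the current page here.

    // Look for a "next page" link; stop when there isn't one.
    const nextUrl = await page.$eval('.next_page', (a) => a.href).catch(() => null);
    if (!nextUrl) break;

    await page.goto(nextUrl);
  }

  await browser.close();
})();

In a real scraper you would also cap the number of pages visited to guard against misbehaving sites.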

Overcoming Web Scraping Challenges

Let's look at five common challenges in web scraping and how to overcome them.


  1. Rate limiting: There may be a cap on the number of requests you can make to a website in a certain period. This is to protect against automated spam bots and other forms of abuse. Web scrapers have several options for getting around rate limits, including the use of proxies, IP address rotation, and reducing the rate at which they scrape.
  • Use a proxy: A proxy is a server that acts as an intermediary between your computer and the website you are scraping. This can help to hide your IP address and prevent the website from detecting that you are using a bot.


  • Rotate IP addresses: Spread your requests across a pool of IP addresses so that no single address exceeds the site's rate limit. Rotating addresses also makes it harder for the site to recognize the traffic as coming from a single bot.


  • Slow down the scraping process: Too many requests to the site too quickly could lead to you being blocked. You can prevent this by increasing the amount of time between scraping requests.
  2. CAPTCHAs: Websites may use CAPTCHAs to prevent automated bots from accessing their content. CAPTCHAs are tests that require human interaction to complete. To overcome them, web scrapers can rely on CAPTCHA-solving services or hand the challenge off to a human.

  3. Dynamic content: Websites may use JavaScript to dynamically generate content. This can make it difficult for web scrapers to extract the data they need. To overcome this challenge, web scrapers can use a variety of techniques, such as using a headless browser, using a proxy, or using a screen scraping library.

  • Using a headless browser: A headless browser is a web browser that doesn't have a graphical user interface (GUI). This means that it can be used to scrape websites without actually displaying the website in a browser window. Headless browsers can be used with a variety of web scraping tools to extract data from dynamic websites.

  • Using a screen scraping library: A screen scraping library is a software library that can be used to extract data from websites that are not designed to be scraped. Screen scraping libraries can be used with a variety of programming languages, such as Python, JavaScript, and Java.

  4. Content blockers: Websites may block or hide content from clients they suspect are automated, for example by fingerprinting the browser or user agent. To get around this, web scrapers can use a proxy, a different user agent, or a different browser.

  5. Anti-scraping measures: Websites may combine several of the above defences (rate limiting, CAPTCHAs, dynamic content, and content blocking) to stop bots from scraping their content. To overcome these measures, scrapers combine the corresponding countermeasures: proxies, rotating IP addresses, slowing down the scraping process, CAPTCHA-solving by a human or a service, headless browsers, and alternative user agents. A sketch combining several of these mitigations in Puppeteer follows this list.
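As mentioned above, here is a sketch that combines several of these mitigations in Puppeteer: routing traffic through a proxy, setting a custom User-Agent, and pausing between requests. The proxy address, User-Agent string, and URLs are placeholders, and whether such measures are appropriate depends on the target site's terms of service:

const puppeteer = require('puppeteer');

// Simple helper to pause between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  // Route traffic through a proxy (placeholder address).
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--proxy-server=http://my-proxy.example:8080'],
  });

  const page = await browser.newPage();

  // Identify the scraper with a custom User-Agent header.
  await page.setUserAgent('my-scraper-bot/1.0 (contact@example.com)');

  const urls = ['https://example.com/page/1', 'https://example.com/page/2'];
  for (const url of urls) {
    await page.goto(url);
    // Extract data here...

    // Slow down between requests to stay under rate limits.
    await sleep(2000 + Math.random() * 3000);
  }

  await browser.close();
})();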

Ensuring Ethical and Responsible Web Scraping

Ethical and responsible web scraping is essential to keeping data extraction sustainable and lawful. Here are some rules to stick to:

  • Website Policies: Before scraping data, review the website's terms, conditions, and scraping standards. Follow the website's scraping rules. Check scraping frequency, API usage, and data usage restrictions.

  • Check robots.txt: The robots.txt file tells web crawlers and scrapers which parts of a site they may visit. Respect its directives, such as disallowed paths and crawl-delay settings, and avoid scraping parts of the website it restricts.

  • Avoid High Request Rates: Limit how many requests you send to the target website. Excessive scraping requests strain servers and slow down websites. To conserve website resources, use a sensible scraping rate and add delays between requests.

  • Identify Your Bot: Set an HTTP User-Agent header that identifies your scraper when making requests. This helps webmasters distinguish your traffic from malicious bots and contact you if there is a problem.

  • Scrape Only Publicly Accessible Data: Avoid confidential or sensitive data, such as login credentials or other private personal information.

  • Get Consent Where Needed: If a website requires user login or consent, get approval before scraping it. Respect user privacy and data protection requirements when handling scraped personally identifiable information (PII).

Keep in mind that web scraping should be done ethically and legally. Protecting user privacy and personal information is of the utmost importance, as is respecting the rights of website owners. Following these rules will help keep your scraping ethical.

Conclusion

Puppeteer is a sophisticated tool for web scraping and browser automation. It simplifies browsing, data extraction, and interaction with dynamic content. This article covered Puppeteer web scraper development, from setting up the development environment through advanced web page interaction.

Installing Puppeteer and setting up the development environment ensured we had all the dependencies in place. We then created a basic Puppeteer script to launch a browser, navigate to a web page, and capture a screenshot.

We learned how to use CSS selectors to target page elements and extract their text, attributes, and other information. Waiting for page loads, handling AJAX requests, and running custom scripts on web pages were also covered.

We addressed dynamic content, JavaScript rendering, and waiting for specific events and interactions before extracting data, and looked at automating login and authentication with Puppeteer.

Finally, we discussed web scraping ethics, including following website policies, respecting robots.txt files, and being mindful of data usage and privacy.

With Puppeteer's rich capabilities and the approaches and best practices in this article, you can develop powerful web scrapers for a wide variety of scraping jobs.

In conclusion, Puppeteer lets web developers automate browser interactions and gather useful data from websites, opening the door to data analysis, research, and automation.
