Web scraping can get challenging when dealing with dynamically loaded content. Modern websites often use infinite scroll or “Load more” buttons to load additional content as you scroll. While this enhances user experience, it complicates data extraction using conventional methods.
In this tutorial, you’ll learn how to scrape data from a website that employs an infinite scroll mechanism. We’ll go step-by-step to fetch, parse, and save the data, eventually exporting it to a CSV file.
By the end of this tutorial, you will have learned how to:
- Fetch HTML content from a web page.
- Simulate clicking the “Load more” button to load additional content.
- Parse and extract specific data from the HTML.
- Save the extracted data to a CSV file.
Let’s get started!
Step 1: Prerequisites
To develop this web scraper, we'll use Node.js along with a few open-source libraries and the ZenRows API to handle anti-scraping mechanisms. Below is a list of tools and libraries that will be used:
- Node.js: A runtime environment for executing JavaScript code on the server side. You can download it from nodejs.org.
- Axios: A library for making HTTP requests.
- Cheerio: A library for parsing HTML.
- csv-writer: A library for writing data to a CSV file.
- ZenRows API: A service to bypass anti-scraping mechanisms.
First, create a new Node.js project named web-scraper-tool. Then, install the required libraries by running the following command in your terminal:
npm install axios cheerio csv-writer
With the basic setup in place, you are ready to start building your web scraper.
Step 2: Fetch HTML Content
The first task is to fetch the HTML content of the page. This involves sending a request to the target URL using the ZenRows API. The response will contain the raw HTML of the page, initially displaying 12 products.
Create a fetchHtml function that retrieves the HTML content from a target URL using the ZenRows API. This function should handle HTTP requests and errors, returning the HTML data for further processing.
const axios = require('axios');
const cheerio = require('cheerio');
const csv = require('csv-writer').createObjectCsvWriter;

const apiKey = 'ZENROWS_API_KEY'; // Replace with your ZenRows API key
const pageUrl = 'https://www.scrapingcourse.com/button-click'; // Page to be scraped

// Function to fetch HTML content from a URL through the ZenRows API
async function fetchHtml(url) {
  try {
    const response = await axios.get('https://api.zenrows.com/v1/', {
      // ZenRows reads the key from the lowercase `apikey` query parameter
      params: { url, apikey: apiKey }
    });
    return response.data; // Return the HTML content
  } catch (error) {
    console.error(`Error fetching ${url}: ${error.message}`);
    return null; // Return null if an error occurs
  }
}
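To see what such a request looks like on the wire, the following sketch builds the final request URL by hand, without sending anything. It assumes the ZenRows endpoint reads the target page from a `url` query parameter and the key from `apikey`; the helper name `buildZenRowsUrl` is made up for this illustration.

```javascript
// Sketch: build the URL that a GET request to the ZenRows endpoint would use.
// Assumption: the API expects `url` (target page) and `apikey` (your key).
function buildZenRowsUrl(targetUrl, apiKey) {
  const endpoint = new URL('https://api.zenrows.com/v1/');
  endpoint.searchParams.set('url', targetUrl); // value is percent-encoded for us
  endpoint.searchParams.set('apikey', apiKey);
  return endpoint.toString();
}

const exampleUrl = buildZenRowsUrl(
  'https://www.scrapingcourse.com/button-click',
  'ZENROWS_API_KEY'
);
console.log(exampleUrl);
```

The target URL has to be percent-encoded because it travels inside another URL's query string. Passing a `params` object to axios takes care of that encoding in the same way.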
To test the fetchHtml function, create a main function to execute the logic and print the fetched HTML.
// Main function to test and execute all the logic
async function main() {
  const html = await fetchHtml(pageUrl); // Fetch HTML content of the initial page
  if (html) {
    console.log(html);
  }
}

main();
Save the code in a file named index.js and run it with the command node index.js. The output in your terminal should display the entire raw HTML for the page. This HTML will be the foundation for the data extraction process.
Step 3: Load More Products
Once you have fetched the initial set of products, the next step is to simulate clicking the “Load more” button multiple times to load all the remaining pages. This step ensures that all products beyond the initial set displayed on the home page are fetched.
Create a fetchAllProducts function that simulates clicking the “Load more” button by sending requests to the AJAX endpoint. This function should continue to load more products until a specified number of products have been fetched.
const ajaxUrl = 'https://www.scrapingcourse.com/ajax/products'; // AJAX URL to load more products

// Function to fetch all products by simulating the "Load more" button
async function fetchAllProducts() {
  const productsHtml = [];
  let offset = 0;
  while (productsHtml.length < 48) {
    // fetchHtml accepts a single URL, so the offset goes into the query string
    const newHtml = await fetchHtml(`${ajaxUrl}?offset=${offset}`);
    if (!newHtml) break; // Stop if no HTML is returned
    const $ = cheerio.load(newHtml);
    const products = $('div.product-item').map((_, element) => {
      return $(element).html();
    }).get();
    if (products.length === 0) break; // Stop when the endpoint has no more products
    productsHtml.push(...products); // Collect the HTML content of the products
    offset += 12; // Increment offset to load the next set of products
    console.log(`Fetched ${productsHtml.length} products so far...`);
  }
  return productsHtml.join('\n'); // Join the HTML snippets into a single string
}
Update the main function to test the fetchAllProducts function.
// Main function to test and execute all the logic
async function main() {
  const productsHtml = await fetchAllProducts();
  console.log(productsHtml); // Log the fetched products
}
When you run the code, your terminal should display the message "Fetched X products so far..." after each request, followed by the raw HTML of the products.
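The stopping logic of an offset-based loop like this one can be checked in isolation. The sketch below swaps the network call for a hypothetical synchronous stub, `fetchPageStub`, that pretends the site holds only 30 products. In the real scraper the fetch would be awaited, and the page size of 12 is taken from this tutorial's target page.

```javascript
// Sketch of offset-based pagination against a stubbed page source (no network).
const PAGE_SIZE = 12;
const TOTAL_ITEMS = 30; // pretend the site only has 30 products

// Hypothetical stand-in for the AJAX request: returns up to PAGE_SIZE items.
function fetchPageStub(offset) {
  const count = Math.max(0, Math.min(PAGE_SIZE, TOTAL_ITEMS - offset));
  return Array.from({ length: count }, (_, i) => `product-${offset + i + 1}`);
}

function collectItems(limit, fetchPage) {
  const items = [];
  const requestedOffsets = [];
  let offset = 0;
  while (items.length < limit) {
    requestedOffsets.push(offset);
    const page = fetchPage(offset);
    if (page.length === 0) break; // source exhausted: avoid looping forever
    items.push(...page);
    offset += PAGE_SIZE;
  }
  // Trim any overshoot from the last page so the caller gets at most `limit`.
  return { items: items.slice(0, limit), requestedOffsets };
}

const result = collectItems(48, fetchPageStub);
console.log(result.items.length, result.requestedOffsets);
```

Two exit conditions matter here: reaching the requested limit, and receiving an empty page. Without the second check, a loop that asks for more items than the site has would never terminate.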
Step 4: Parse Product Information
With the raw HTML content of at least 48 products fetched, the next step is to parse this HTML to extract specific product information like title, price, image URL, and product URL.
Create a parseProducts function that uses the Cheerio library to navigate the fetched HTML and extract these fields for each product.
// Function to parse product information from HTML
function parseProducts(html) {
  const $ = cheerio.load(html);
  return $('a[href*="/ecommerce/product/"]').map((_, item) => ({
    title: $(item).find('span.product-name').text().trim(),
    price: $(item).find('span.product-price').text().trim(),
    image: $(item).find('img').attr('src') || 'N/A',
    url: $(item).attr('href')
  })).get();
}
Update the main function to run the parseProducts function and log the output.
// Main function to test and execute all the logic
async function main() {
  const productsHtml = await fetchAllProducts();
  const products = parseProducts(productsHtml);
  console.log(products); // Log the parsed product information to the console
}
Run the code. Instead of the raw HTML from the previous step, your terminal should now show an array of objects, each representing a product with its title, price, image URL, and product URL.
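Cheerio's text() returns an empty string when a selector matches nothing, so a changed class name on the site produces blank fields rather than an error. A quick sanity check over the parsed array can catch that early. The sketch below is one possible check, not part of the original scraper; the helper `findIncompleteProducts` and the sample records are invented for illustration.

```javascript
// Sketch: flag parsed records that are missing a required field.
const REQUIRED_FIELDS = ['title', 'price', 'url'];

function findIncompleteProducts(products) {
  return products.filter((product) =>
    // A field counts as missing if it is empty, undefined, or the 'N/A' marker.
    REQUIRED_FIELDS.some((field) => !product[field] || product[field] === 'N/A')
  );
}

// Made-up sample data shaped like the output of a product parser.
const sampleProducts = [
  { title: 'Sample Jacket', price: '$59.00', image: 'N/A', url: '/ecommerce/product/sample-jacket' },
  { title: '', price: '$12.00', image: 'N/A', url: '/ecommerce/product/unnamed' }
];

const incomplete = findIncompleteProducts(sampleProducts);
console.log(`${incomplete.length} of ${sampleProducts.length} records are incomplete`);
```

If most records come back incomplete, the selectors probably no longer match the page structure and should be rechecked before exporting anything.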
Step 5: Export Product Information to CSV
After successfully parsing the data, the next task is to save it in a structured format for further analysis. In this step, the parsed data will be written to a CSV file, which is a popular choice for storing tabular data due to its simplicity and wide support.
Create an exportProductsToCSV function that writes the parsed product data to a CSV file. Use the csv-writer library to define the file structure and save the data.
// Function to export products to a CSV file
async function exportProductsToCSV(products) {
  const csvWriter = csv({
    path: 'products.csv',
    header: [
      { id: 'title', title: 'Title' },
      { id: 'price', title: 'Price' },
      { id: 'image', title: 'Image URL' },
      { id: 'url', title: 'Product URL' }
    ]
  });
  await csvWriter.writeRecords(products);
  console.log('CSV file has been created.');
}
Update the main function to run the exportProductsToCSV function.
// Main function to test and execute all the logic
async function main() {
  const productsHtml = await fetchAllProducts();
  const products = parseProducts(productsHtml);
  await exportProductsToCSV(products); // Export products to CSV
}
After running the code, you should see a products.csv file in your working directory with the parsed product information. You will also see a message in the terminal confirming that the CSV file has been created.
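Product titles and descriptions often contain commas or quotation marks, which is the main reason to use a CSV library rather than joining strings by hand. The sketch below illustrates the general RFC 4180 quoting rule that such libraries apply. It is a simplified illustration with hypothetical helper names, not csv-writer's actual implementation.

```javascript
// Sketch of RFC 4180-style quoting for a single CSV row.
function escapeCsvField(value) {
  const text = String(value ?? ''); // treat null/undefined as an empty field
  // Quote fields containing a comma, quote, or line break; double inner quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsvRow(fields) {
  return fields.map(escapeCsvField).join(',');
}

const row = toCsvRow(['Jacket, blue', '$59.00', 'A 24" strap', undefined]);
console.log(row);
```

A naive `fields.join(',')` would split 'Jacket, blue' into two columns when the file is opened in a spreadsheet, which is why the quoting step matters.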
Step 6: Get Extra Data for Top Products
In the final step, the focus will be on refining the scraping process by fetching additional details for the top five highest-priced products. This involves visiting each product’s page to extract the needed information, such as product descriptions and SKU codes.
Create a getProductDetails function that fetches a product's individual page and extracts its description and SKU code.
// Function to fetch additional product details from the product page
async function getProductDetails(url) {
  const html = await fetchHtml(url);
  if (!html) return { description: 'N/A', sku: 'N/A' };
  const $ = cheerio.load(html);
  return {
    description: $('div.woocommerce-Tabs-panel--description p')
      .map((_, p) => $(p).text().trim()).get().join(' ') || 'N/A',
    sku: $('.product_meta .sku').text().trim() || 'N/A'
  };
}
Next, update the exportProductsToCSV function to add columns for the new data collected for the top five highest-priced products.
// Function to export products to a CSV file
async function exportProductsToCSV(products) {
  const csvWriter = csv({
    path: 'products.csv',
    header: [
      { id: 'title', title: 'Title' },
      { id: 'price', title: 'Price' },
      { id: 'image', title: 'Image URL' },
      { id: 'url', title: 'Product URL' },
      { id: 'description', title: 'Description' },
      { id: 'sku', title: 'SKU' }
    ]
  });
  await csvWriter.writeRecords(products);
  console.log('CSV file with additional product details has been created.');
}
Finally, update the main function to fetch the additional details and export the enriched product data to the CSV file.
// Main function to test and execute all the logic
async function main() {
  const productsHtml = await fetchAllProducts();
  const products = parseProducts(productsHtml);

  // Sort products by price in descending order
  const toNumber = (price) => parseFloat(price.replace(/[^0-9.-]+/g, ''));
  products.sort((a, b) => toNumber(b.price) - toNumber(a.price));

  // Fetch additional details for the top 5 highest-priced products
  for (let i = 0; i < Math.min(5, products.length); i++) {
    const details = await getProductDetails(products[i].url);
    products[i] = { ...products[i], ...details };
  }

  // exportProductsToCSV logs its own confirmation message
  await exportProductsToCSV(products);
}
Once you run the code, products.csv will contain all the scraped products sorted by price in descending order. The five highest-priced products also include a description and SKU code, while those two columns are left blank for the remaining products.
Note: The number of products fetched (48) is based on the earlier limit set in the fetchAllProducts function. You can adjust this limit if you want to scrape more products before identifying the top five.
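The price-based selection can also be written as a small standalone helper, which makes it easy to test on sample data. The sketch below is one way to do it, with two assumptions: prices are strings such as "$1,299.50", and an unparseable price is treated as 0 so the comparator never returns NaN. The names `parsePrice` and `topByPrice` are illustrative and do not appear in the scraper above.

```javascript
// Sketch: pick the N most expensive products without mutating the input array.
function parsePrice(priceText) {
  const value = parseFloat(String(priceText).replace(/[^0-9.-]+/g, ''));
  return Number.isNaN(value) ? 0 : value; // assumption: unparseable prices sort last
}

function topByPrice(products, n) {
  return [...products] // copy first, because Array.prototype.sort works in place
    .sort((a, b) => parsePrice(b.price) - parsePrice(a.price))
    .slice(0, n);
}

// Made-up sample data for illustration.
const catalog = [
  { title: 'A', price: '$24.00' },
  { title: 'B', price: '$1,299.50' },
  { title: 'C', price: 'N/A' },
  { title: 'D', price: '$89.99' }
];

const topTwo = topByPrice(catalog, 2);
console.log(topTwo.map((product) => product.title)); // [ 'B', 'D' ]
```

Guarding against NaN matters because a comparator that returns NaN leaves the sort order unspecified, so a single product with a missing price could scramble the ranking.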
Conclusion
By following these steps, you’ve successfully built a web scraper capable of handling dynamic web pages with an infinite scroll or “Load more” button. The key to effective web scraping lies in understanding the structure of the target website and using tools to navigate and bypass anti-scraping measures.
To further enhance your web scraping skills, consider implementing the following:
- Use rotating proxies to avoid IP bans.
- Explore techniques for handling CAPTCHA challenges.
- Scrape more complex sites with additional AJAX calls or nested "Load more" buttons.
This tutorial provided a strong foundation for scraping dynamic content, and you can now apply these principles to other web scraping projects.