oteri

Posted on Jul 6, 2023 • Originally published at hackernoon.com

How to Scrape Large Datasets at Scale

#node #programming #webscraping #codenewbie

Large organizations invest heavily in building their applications and often frown on developers extracting data from their websites. That is why many of them publish rules, addressed to user agents, that let you know what automated clients are permitted to access.

On most sites, you can find these rules in the robots.txt file served from the root of the domain, as in the link below:

https://www.amazon.com/robots.txt
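To make the idea concrete, here is a minimal sketch of how a script could read such rules before scraping. It is a simplified reader, not a complete robots.txt implementation: it only handles User-agent, Allow, and Disallow lines with plain prefix matching, and the function names and sample rules are made up for this example.

```javascript
// Simplified robots.txt reader: groups Allow/Disallow rules by user agent.
// Wildcards, Sitemap lines, and other extensions are ignored on purpose.
function parseRobotsTxt(text) {
  const groups = new Map(); // user agent -> { allow: [], disallow: [] }
  let currentAgents = [];   // agents named by the group being read
  let readingRules = false; // true once the current group has a rule line

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;

    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      // A User-agent line that follows rule lines starts a new group.
      if (readingRules) {
        currentAgents = [];
        readingRules = false;
      }
      const agent = value.toLowerCase();
      currentAgents.push(agent);
      if (!groups.has(agent)) groups.set(agent, { allow: [], disallow: [] });
    } else if (field === "allow" || field === "disallow") {
      readingRules = true;
      if (value === "") continue; // an empty Disallow blocks nothing
      for (const agent of currentAgents) groups.get(agent)[field].push(value);
    }
  }
  return groups;
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isPathAllowed(groups, userAgent, path) {
  const rules = groups.get(userAgent.toLowerCase()) ?? groups.get("*");
  if (!rules) return true;
  const longest = (prefixes) =>
    Math.max(-1, ...prefixes.filter((p) => path.startsWith(p)).map((p) => p.length));
  return longest(rules.allow) >= longest(rules.disallow);
}

// Hypothetical rules, not copied from any real site.
const sampleRules = parseRobotsTxt(
  "User-agent: *\nDisallow: /private/\nAllow: /private/help\n"
);
console.log(isPathAllowed(sampleRules, "my-bot", "/private/data")); // false
console.log(isPathAllowed(sampleRules, "my-bot", "/private/help/faq")); // true
```

A real crawler would fetch the file first and would normally rely on a maintained parser, since production robots.txt files use patterns this sketch does not cover.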

In this article, you’ll learn about using Bright Data’s Web Scraper integrated development environment (IDE) to scrape datasets at scale using its ready-made functions and coding templates.

Benefits of the Web Scraper IDE

These are some of the benefits of using Bright Data’s Web Scraper IDE:

The IDE is accessible from within the platform
Built on Bright Data’s proxy infrastructure, it offers scalability and accuracy in web scraping
Its code templates help to speed up development
It incorporates the Web Unlocker capability to avoid CAPTCHAs and blocking

What is Bright Data?

Bright Data is a proxy network that helps you turn websites into structured data. To get started with the platform, create an account.

Check out this resource to learn more about Bright Data.

Working with Web Scraper IDE

On your account dashboard, click the Datasets and Web Scraper IDE icon, and afterward, select the Get started button to open the template window.

The new window opens a dialog box where you can select a pre-existing dataset template to work with or create one from scratch.

Select the eBay discovery and PDP options, and the page should look something like this with the collector code.

Now scroll down the page and, under the Input tab, enter the name of the product whose data you want to extract and analyze. Once done, click the Preview button to start the extraction.

PS: You can also enter your own scripts in the Interaction code section.

After the preview runs, the Output tab shows the result from the eBay website organized into fields such as product_url, title, image, price, and so on.

Saving the collector
To save the collector, click on the Finish editing button to open the configuration page as seen below:

Initiate the collector by API
Under the My Scrapers tab, let’s initiate this project and work with the scripts provided by clicking the Initiate by API button.

Creating authorization token
Authorization determines what a user is allowed to access. Here, the API token identifies you as the account's rightful owner on every request.

Click on the Account settings menu at the bottom left of the window to create an API token.

After adding the API token, you will receive a verification code; enter it to confirm.

Once that is done, copy your API token key immediately, as you won’t be able to retrieve it later and would have to create a new one.
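Because the token cannot be retrieved later, it helps to keep it somewhere safer than a script file. The following is a minimal sketch, assuming you store the token in an environment variable; the name BRIGHTDATA_API_TOKEN and both helper functions are made up for this example and are not required by the platform.

```javascript
// Reads the API token from an environment variable instead of the source code.
// BRIGHTDATA_API_TOKEN is a name chosen for this sketch, not a platform setting.
function getApiToken(env = process.env) {
  const token = (env.BRIGHTDATA_API_TOKEN ?? "").trim();
  if (token === "") {
    throw new Error("Set BRIGHTDATA_API_TOKEN before running this script");
  }
  return token;
}

// Builds the header used by the API requests: "Authorization: Bearer <token>".
function authorizationHeader(env = process.env) {
  return { Authorization: `Bearer ${getApiToken(env)}` };
}
```

On macOS or Linux you could set the variable for the current terminal session with export BRIGHTDATA_API_TOKEN=your-token before running the script.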

Return to the New collector page, and copy the script for your operating system (OS) into your terminal. Make sure to replace API_TOKEN, which follows the word Bearer, with the key you copied in the previous section.

In your command line interface or terminal, the trigger request should look something like this:

    curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -d '[{"keyword":"ralph lauren polo shirt","count":10,"location":"","condition":"New unused"}]' "https://api.brightdata.com/dca/trigger?collector=c_liopmjh61f3o3lz7dz&queue_next=1"
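If you would rather send the same request from Node than from curl, it could be sketched as follows. The URL, headers, and body mirror the curl command above; the function names are made up for this example, the built-in fetch() requires Node 18 or later, and treating the response as JSON is an assumption, since the response body is not shown here.

```javascript
const TRIGGER_URL = "https://api.brightdata.com/dca/trigger";

// Builds the same request the curl command sends, without sending it.
function buildTriggerRequest(apiToken, collectorId, inputs) {
  const url = new URL(TRIGGER_URL);
  url.searchParams.set("collector", collectorId);
  url.searchParams.set("queue_next", "1");
  return {
    url: url.toString(),
    options: {
      method: "POST", // curl switches to POST when -d is used
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(inputs),
    },
  };
}

// Sends the request with the fetch() built into Node 18 and later.
async function triggerCollector(apiToken, collectorId, inputs) {
  const { url, options } = buildTriggerRequest(apiToken, collectorId, inputs);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Trigger failed with HTTP ${response.status}`);
  }
  return response.json(); // assumed to be JSON; shape not documented here
}
```

Keeping the request builder separate from the network call makes it easy to log or inspect exactly what will be sent before spending a collector run on it.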

Sending the trigger request activates the code in the Result API section of the New collector dashboard page. Copy that code and paste it into the CLI tool as well.

PS: Remember to put your API token key in place of the value API_TOKEN.

    curl "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod" -H "Authorization: Bearer API_TOKEN"

Run the script in the CLI. While the dataset is still being collected, the response is an object whose status reads building, together with a message.

If you keep getting that response, wait a moment and send the request again. When the collection has finished, you should see the result object.
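Retrying by hand gets tedious, so the wait can be automated. Below is a minimal sketch of a polling loop that repeats the Result API request until the status is no longer building. The function names, the five-second interval, and the attempt limit are assumptions made for this example, and fetch() requires Node 18 or later.

```javascript
// Calls fetchResult() until the response stops reporting status "building".
// fetchResult is any async function returning the parsed Result API response.
async function waitForDataset(fetchResult, { intervalMs = 5000, maxAttempts = 20 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await fetchResult();
    // An array of records has no status property, so it ends the loop.
    if (result?.status !== "building") return result;
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Dataset still building after ${maxAttempts} attempts`);
}

// The Result API request shown in the curl command, sent from Node.
function fetchDataset(apiToken, datasetId) {
  const url = new URL("https://api.brightdata.com/dca/dataset");
  url.searchParams.set("id", datasetId);
  return fetch(url, { headers: { Authorization: `Bearer ${apiToken}` } })
    .then((response) => response.json());
}

// Usage (not executed here):
// waitForDataset(() => fetchDataset("API_TOKEN", "j_liosdy1cdutdi7sod"))
//   .then((records) => console.log(records));
```

Passing the request in as a function keeps the loop independent of the network, so the retry logic can be tried out with a stub before pointing it at the real endpoint.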

Using Postman

Let’s use Postman to fetch the same Result API response shown above.

If you do not have Postman, download it here. Postman is an API platform for building, publishing, monitoring, testing, and documenting APIs. Check this resource article to learn more about Postman and its use.

Open the Postman app and input these values:

In the request section, select the GET method and enter the Result API URL
Click the Authorization tab, select Bearer Token from the Type dropdown, and pass in your token value
Click the Send button to send the request
If the request is successful, you should see a status code of 200 in the response section and an array of objects containing the scraped eBay data

Creating a Node Server

Node is a JavaScript runtime environment that allows the execution of JavaScript code outside the web browser, enabling developers to build server-side applications and command-line tools.

Let’s create a web server. Initializing your project in the terminal requires the package manager, npm, which is installed automatically along with Node.js on your local machine. Confirm that Node is installed using this command:

    node --version

It displays the current version of Node.

Create a new directory (named datasets for this project), change into it, and initialize the project with these commands:

    mkdir datasets
    cd datasets
    npm init -y

The -y flag accepts the defaults and generates a package.json file that looks like this:

package.json

    {
      "name": "datasets",
      "version": "1.0.0",
      "description": "",
      "main": "index.mjs",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      ...
    }

Install the following packages:

    npm install -D nodemon

Nodemon monitors your source code for changes and automatically restarts the server.

    npm install csv-parse

The csv-parse package is a parser that converts CSV text input into arrays or objects.

Now, update the scripts section in the package.json file to this:

    {
      "name": "datasets",
      "version": "1.0.0",
      "description": "",
      "main": "index.mjs",
      "scripts": {
        "start": "node index.mjs",
        "start:dev": "nodemon index.mjs"
      },
      ...
    }

Next, create a new file in the root directory with the command:

    touch index.mjs

To test this file, write a basic JavaScript script and run the server with the following command:

    npm run start:dev
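Any short script is enough for this check. As a minimal sketch, index.mjs could contain something like the following; the describeRuntime helper is made up for this smoke test and is replaced later in the article.

```javascript
// index.mjs: placeholder script to confirm that nodemon runs the file.
// describeRuntime is a hypothetical helper used only for this check.
const describeRuntime = (projectName) =>
  `${projectName} is running on Node ${process.version}`;

console.log(describeRuntime("datasets"));
```

Edit the message and save the file; nodemon should restart and print the new text without you rerunning the command.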

Social Media Data from Bright Data

Scraping large datasets yourself with technologies like Node or Python takes a lot of effort. One way around this is to use a platform like Bright Data, which provides ready-made datasets so you can get results sooner.

Let’s get an Instagram dataset from Bright Data with these steps:

Sign up for a Bright Data account.
Go to https://brightdata.com/cp/datasets/ or select the Dataset Marketplace on the Datasets & Web Scraper IDE.

Open the Dataset Marketplace, and under Categories, select Instagram.com from the Social media dropdown.

Click on View dataset and download the sample dataset in CSV format.

Make sure to save the dataset in the root directory of the Node web server.

Your folder structure should look something like this:

    .
    └── datasets
        ├── node_modules
        ├── instagram.csv
        ├── package-lock.json
        ├── package.json
        └── index.mjs

Reading CSV datasets in Node.js

In this section, Node reads the comma-separated values (CSV) dataset downloaded from Bright Data.

Update the index.mjs file with the code:

index.mjs

    import { parse } from "csv-parse";
    import { createReadStream } from "node:fs";

    // Rows that pass the filter below are collected here.
    const instagramAccount = [];

    // csv-parse returns every field as a string, so the numeric
    // columns are converted before they are compared.
    const isInstagramAccount = (info) => {
      return (
        Number(info["posts_count"]) > 300 &&
        Number(info["followers"]) > 6000 &&
        info["biography"] !== "" &&
        info["posts"] !== ""
      );
    };

    createReadStream("instagram.csv")
      .pipe(
        parse({
          columns: true, // use the header row as the object keys
        })
      )
      .on("data", (data) => {
        // Called once per parsed row.
        if (isInstagramAccount(data)) {
          instagramAccount.push(data);
        }
      })
      .on("error", (err) => {
        console.error("error", err);
      })
      .on("end", () => {
        console.log(`${instagramAccount.length} accounts are live`);
        console.log("done");
      });

The code above does the following:

createReadStream(): Opens the file as a readable stream, so the data is read piece by piece
isInstagramAccount: A callback function used to filter the rows from the CSV file
pipe: Connects two streams, passing data from a readable source into a writable destination, which here is the parser returned by parse()
columns: true: Returns each row in the CSV file as a JavaScript object with key-value pairs rather than just an array of values
.on: Chained event handlers. The data handler pushes each matching row into the empty array, instagramAccount; the error handler displays any error; the end handler shows the number of matching Instagram accounts and prints done when the script finishes
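To see what the columns option changes, the sketch below imitates its effect on a few in-memory rows. It is only an illustration and not how csv-parse is implemented; the rowsToObjects helper, the account column, and the sample values are made up, while followers and posts_count are column names used in the script above.

```javascript
// Imitates the effect of `columns: true`: the first row supplies the keys.
// This is an illustration, not csv-parse's implementation.
function rowsToObjects(rows) {
  const [header, ...records] = rows;
  return records.map((record) =>
    Object.fromEntries(header.map((key, i) => [key, record[i]]))
  );
}

// Made-up rows; the real file has many more columns.
const sampleRows = [
  ["account", "followers", "posts_count"],
  ["example_one", "7200", "412"],
  ["example_two", "150", "12"],
];

const sampleAccounts = rowsToObjects(sampleRows);
// Every value is still a string, so convert before comparing numbers.
const popular = sampleAccounts.filter((a) => Number(a.followers) > 6000);
console.log(popular);
// [ { account: 'example_one', followers: '7200', posts_count: '412' } ]
```

With objects, the filter can refer to fields by name, which is why the script reads info["followers"] instead of relying on a column position.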

Running the script with the command npm run start:dev should display a result like this in the terminal:

    643 accounts are live
    done

Conclusion

Web scraping is an integral part of data extraction used in data science. The Web Scraper IDE by Bright Data does all the heavy lifting in the background, presenting only the relevant data for your use.

This article walked you through using the Web Scraper IDE and building a custom script to query large datasets, without fear of being blocked by the bot protection that companies put in place to guard their data.

Resources

CSV Parser for Node.js

Top comments (3)

Chinedu Ogadi • Jul 9 '23

This is nice, but from all indications, the IDE only allows JavaScript.
Is there any such IDE that supports Python?
Also, does it allow you to use already existing scraping libraries?

oteri • Jul 9 '23

Hi Chinedu,
You can check brightdata.com to see the available options for Python. It is also worth noting that Bright Data lets you bypass website blocks.

The libraries available for you to use are Puppeteer and Playwright. If you need any other assistance, kindly let me know.

Chinedu Ogadi • Jul 10 '23 • Edited

Thanks Teri, for the info.

I'll definitely check it out.
