DEV Community

oteri

Posted on • Originally published at hackernoon.com

How to Scrape Large Datasets at Scale

Large organizations invest heavily in building their applications and often frown on developers extracting data from their websites. That is why many of them publish rules, grouped by user agent, that tell crawlers what is permitted.

On most sites, you can find these rules in the robots.txt file at the root of the live URL, as in this link:

https://www.amazon.com/robots.txt
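
To see how the per-agent rules in a robots.txt file are organized, you can group them with a few lines of JavaScript. This is a minimal sketch: the function name `parseRobotsTxt` and the sample text are made up for illustration, and the parser ignores wildcards, Sitemap lines, and other parts of the protocol.

```javascript
// Minimal robots.txt parser: groups Allow/Disallow rules by user agent.
// Illustrative only; it does not implement the full Robots Exclusion Protocol.
function parseRobotsTxt(text) {
  const groups = {};        // user agent -> { allow: [], disallow: [] }
  let currentAgents = [];   // agents the following rules apply to
  let readingAgents = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const separator = line.indexOf(":");
    if (separator === -1) continue;

    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share the same group of rules.
      if (!readingAgents) currentAgents = [];
      readingAgents = true;
      currentAgents.push(value);
      groups[value] ??= { allow: [], disallow: [] };
    } else if (field === "allow" || field === "disallow") {
      readingAgents = false;
      for (const agent of currentAgents) groups[agent][field].push(value);
    }
  }
  return groups;
}

// Hypothetical sample, not the contents of any real site's robots.txt.
const sample = `
User-agent: *
Disallow: /private/
Allow: /public/

User-agent: ExampleBot
Disallow: /
`;

const rules = parseRobotsTxt(sample);
console.log(rules["*"].disallow);          // [ '/private/' ]
console.log(rules["ExampleBot"].disallow); // [ '/' ]
```

Here every crawler is asked to stay out of /private/, while the hypothetical ExampleBot is asked to stay out of the whole site.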

In this article, you’ll learn about using Bright Data’s Web Scraper integrated development environment (IDE) to scrape datasets at scale using its ready-made functions and coding templates.

Benefits of the Web Scraper IDE

These are some of the benefits of using Bright Data’s Web Scraper IDE:

  • The IDE is accessible from within the platform
  • It runs on Bright Data’s proxy infrastructure, which helps with scalability and accuracy in web scraping
  • Its code templates help to speed up development
  • It incorporates the Web Unlocker capability through the IDE to avoid CAPTCHAs and blocking

What is Bright Data?

Bright Data is a proxy network that helps you turn websites into structured data. To get started with the platform, create an account.

Check out this resource to learn more about Bright Data.

Working with Web Scraper IDE

On your account dashboard, click the Datasets and Web Scraper IDE icon, and afterward, select the Get started button to open the template window.

The new window opens a dialog box where you can select one of the existing dataset templates or create one from scratch.

Select the eBay discovery and PDP (product detail page) options. The page then shows the collector code.

Now scroll down the page and, under the Input tab, enter the name of the product whose data you want to extract and analyze. Once done, click the Preview button to run the preview and start the extraction.

PS: You can also enter your own scripts in the Interaction code section.

After the preview runs, the Output tab shows the result from the eBay website, organized into fields such as product_url, title, image, price, and so on.

Saving the collector
To save the collector, click on the Finish editing button to open the configuration page.

Initiate the collector by API
Under the My Scrapers tab, let’s initiate this project and work with the scripts provided by clicking the Initiate by API button.

Creating authorization token
An API token authorizes your requests and identifies you as the rightful owner of the account.

Click on the Account settings menu at the bottom left of the window to create an API token.

When you add the API token, you will receive a secret code for verification; enter it to confirm.

Once that is done, copy your API token right away. You won’t be able to retrieve it later and would have to create a new one.

Return to the New collector page, and copy the script for your operating system (OS) into your terminal. Make sure to replace API_TOKEN, which follows the word Bearer, with the key you copied in the previous section.

In your command line interface or terminal, the command should look something like this:

    curl -H "Authorization: Bearer API_TOKEN" -H "Content-Type: application/json" -d '[{"keyword":"ralph lauren polo shirt","count":10,"location":"","condition":"New unused"}]' "https://api.brightdata.com/dca/trigger?collector=c_liopmjh61f3o3lz7dz&queue_next=1"
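
If you would rather trigger the collector from Node than from curl, the same request can be sketched with `fetch`. The endpoint, query parameters, and input fields are copied from the curl command; the function name `buildTriggerRequest` and the `BRIGHTDATA_API_TOKEN` environment variable are hypothetical names chosen for this example.

```javascript
// Builds the fetch() arguments for the trigger request shown in the curl command.
// buildTriggerRequest and BRIGHTDATA_API_TOKEN are hypothetical names.
function buildTriggerRequest(apiToken, collectorId, inputs) {
  const url = new URL("https://api.brightdata.com/dca/trigger");
  url.searchParams.set("collector", collectorId);
  url.searchParams.set("queue_next", "1");

  return {
    url: url.toString(),
    options: {
      method: "POST", // curl sends a POST request when -d is used
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(inputs),
    },
  };
}

const request = buildTriggerRequest(
  process.env.BRIGHTDATA_API_TOKEN ?? "API_TOKEN", // keep the real token out of source code
  "c_liopmjh61f3o3lz7dz",
  [{ keyword: "ralph lauren polo shirt", count: 10, location: "", condition: "New unused" }]
);

console.log(request.url);

// To actually send it (Node 18+ ships a global fetch):
// const response = await fetch(request.url, request.options);
// console.log(await response.json());
```

Reading the token from an environment variable is a design choice for this sketch, so that the key never ends up in version control.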

Sending the trigger request activates the code in the Result API section of the New collector dashboard page. Copy that code and paste it into your CLI tool as well.

PS: Remember to put your API token key in place of the value API_TOKEN.

    curl "https://api.brightdata.com/dca/dataset?id=j_liosdy1cdutdi7sod" -H "Authorization: Bearer API_TOKEN"

Run the script in the CLI. While the dataset is still being prepared, the response is an object whose status reads building, along with a message.

If that response keeps showing, wait a moment and send the request again. When it succeeds, you should see the result object.
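
Retrying by hand gets tedious, so the wait-and-retry step can be automated. The sketch below assumes, based on the response described above, that an unfinished dataset returns an object with `status` set to `building`; the function names, the retry interval, and the attempt limit are arbitrary choices for illustration.

```javascript
// Polls until the response no longer reports a "building" status.
// fetchOnce is any async function that returns the parsed JSON response.
async function pollUntilReady(fetchOnce, { intervalMs = 5000, maxAttempts = 12 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await fetchOnce();
    const stillBuilding =
      result !== null &&
      typeof result === "object" &&
      !Array.isArray(result) &&
      result.status === "building";
    if (!stillBuilding) return result;
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Dataset still building after ${maxAttempts} attempts`);
}

// Fetcher for the Result API URL used in the curl command (requires Node 18+).
const fetchDataset = (datasetId, apiToken) => async () => {
  const response = await fetch(`https://api.brightdata.com/dca/dataset?id=${datasetId}`, {
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
  return response.json();
};

// Usage (not run here because it needs a valid token):
// const data = await pollUntilReady(
//   fetchDataset("j_liosdy1cdutdi7sod", process.env.BRIGHTDATA_API_TOKEN)
// );
```

Passing the fetcher in as an argument keeps the retry logic separate from the network call, which also makes it easy to try out with a fake response.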

Using Postman

Let’s also use Postman to get the same response from the Result API.

If you do not have Postman, download it here. Postman is an API platform for building, publishing, monitoring, testing, and documenting APIs. Check this resource article to learn more about Postman and its use.

Open the Postman app and input these values:

  • In the request section, select the GET method and enter the Result API URL
  • Click the Authorization tab, select Bearer Token from the Type dropdown, and pass in your token value
  • Click the Send button to send the request
  • If the request is successful, you should see a status code of 200 in the response section and an array of objects containing the scraped eBay data

Creating a Node Server

Node is a JavaScript runtime environment that executes JavaScript code outside the web browser, enabling developers to build server-side applications and command-line tools.

Let’s create a web server. Initializing the project in the terminal requires the package manager, npm, which is installed automatically with Node.js on your local machine. Check that Node is installed using this command:

    node --version

It displays the current version of Node.

  • Create a new directory. For this project, it is named datasets.
  • Change into that directory and initialize the project with the commands:

    mkdir datasets
    cd datasets
    npm init -y

The -y flag accepts the defaults and generates a package.json file. With main set to index.mjs, the entry file for this project, it looks like this:

package.json

    {
      "name": "datasets",
      "version": "1.0.0",
      "description": "",
      "main": "index.mjs",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      ...
    }
  • Install the following packages:
    npm install -D nodemon

Nodemon monitors your source code for changes and automatically restarts the server.

    npm install csv-parse

The csv-parse package is a parser that converts CSV text input into arrays or objects.

Now, update the scripts section in the package.json file to this:

    {
      "name": "datasets",
      "version": "1.0.0",
      "description": "",
      "main": "index.mjs",
      "scripts": {
        "start": "node index.mjs",
        "start:dev": "nodemon index.mjs"
      },
      ...
    }
  • Next, create a new file in the root directory with the command:
    touch index.mjs

To test this file, write a basic JavaScript script and run the server with the following command:

    npm run start:dev
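
Any small script will do for this test. As a minimal sketch, the following could go in index.mjs; the `describeRuntime` helper is a made-up name for this example.

```javascript
// index.mjs: a throwaway script to confirm that nodemon picks up changes.
const describeRuntime = () => `Running on Node ${process.version}`;

console.log(describeRuntime());
```

Edit the message and save the file, and nodemon should restart the script and print the new text.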

Social Media Data from Bright Data

Scraping large datasets yourself with technologies like Node or Python takes a lot of effort. One way around this is to use a platform like Bright Data to obtain the information you need and get your results sooner.

Let’s get a dataset for the social media platform Instagram from Bright Data with these steps:

  • Open the Dataset Marketplace, and under Categories, select Instagram.com from the Social media dropdown.

  • Click on View dataset and download the sample dataset in CSV format.

Make sure to save the dataset in the root directory of the Node web server.

Your folder structure should look something like this:

.
    └── datasets
        ├── node_modules
        ├── instagram.csv
        ├── package-lock.json
        ├── package.json
        └── index.mjs

Reading CSV datasets in Node.js

In this section, Node reads the comma-separated values (CSV) dataset downloaded from Bright Data.

Update the index.mjs file with the code:

index.mjs

    import { parse } from "csv-parse";
    import { createReadStream } from "node:fs";

    // Rows that pass the filter below are collected here.
    const instagramAccount = [];

    // CSV values arrive as strings, so convert the counts before comparing.
    const isInstagramAccount = (info) => {
      return (
        Number(info["posts_count"]) > 300 &&
        Number(info["followers"]) > 6000 &&
        info["biography"] !== "" &&
        info["posts"] !== ""
      );
    };

    createReadStream("instagram.csv")
      .pipe(
        parse({
          columns: true, // emit each row as an object keyed by the header row
        })
      )
      .on("data", (data) => {
        if (isInstagramAccount(data)) {
          instagramAccount.push(data);
        }
      })
      .on("error", (err) => {
        console.error("error", err);
      })
      .on("end", () => {
        console.log(`${instagramAccount.length} accounts are live`);
        console.log("done");
      });

The code above does the following:

  • createReadStream(): Opens the file as a readable stream and reads the data in it
  • isInstagramAccount: A predicate function used to filter the rows of the CSV file
  • pipe: Connects two streams, here the readable source stream to the writable destination, parse()
  • columns: true: Returns each row of the CSV file as a JavaScript object with key-value pairs rather than just an array of values
  • .on: Chained event handlers that push each matching row into the empty array, instagramAccount, log any error, and, when the script finishes, show the number of matching Instagram accounts and print done
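
To make the effect of columns: true and the filter more concrete, here is a small sketch that runs without the CSV file. The sample rows are invented for illustration, and the mapping step only imitates the shape of the objects that csv-parse produces; it is not how the library is implemented.

```javascript
// Hypothetical sample rows; the real dataset has many more columns.
const header = ["posts_count", "followers", "biography", "posts"];
const rows = [
  ["512", "8200", "Travel photos", "sample post data"],
  ["12", "150", "", ""],
];

// Roughly the shape that `columns: true` yields: one object per row,
// keyed by the header names. Values stay strings because CSV is text.
const records = rows.map((row) =>
  Object.fromEntries(header.map((name, index) => [name, row[index]]))
);

// Same thresholds as isInstagramAccount in index.mjs.
const isInstagramAccount = (info) =>
  Number(info.posts_count) > 300 &&
  Number(info.followers) > 6000 &&
  info.biography !== "" &&
  info.posts !== "";

const matching = records.filter(isInstagramAccount);

console.log(typeof records[0].followers); // string
console.log(matching.length);             // 1
```

Only the first row passes, since the second has too few posts and followers and an empty biography.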

Running the script with the command npm run start:dev should display a result like this in the terminal:

    643 accounts are live
    done

Conclusion

Web scraping is an integral part of data extraction used in data science. The Web Scraper IDE by Bright Data does all the heavy lifting in the background, presenting only the relevant data for your use.

This article walked you through using the Web Scraper IDE and building a custom script to query large datasets, without fear of being blocked by the bot protection that companies use to guard their data.

Top comments (3)

Chinedu Ogadi

This is nice, but from all indications, the IDE only allows JavaScript.
Is there any such IDE that supports Python?
Again, does it allow you to use some already existing scraping libraries?

oteri

Hi Chinedu,
You can check brightdata.com to see the available options for Python. It is also worth noting that Bright Data lets you bypass website blocks.

The libraries available for you to use are Puppeteer and Playwright. If you need any other assistance, kindly let me know.

Chinedu Ogadi

Thanks Teri, for the info.

I'll definitely check it out.