oteri

Posted on Aug 4, 2023 • Originally published at hackernoon.com

Using Scraping Browser and GPT for Actionable Product Insights

#chatgpt #javascript #react #codenewbie

Automated web scraping with tools such as Puppeteer and Playwright is how data scientists, software developers, and research analysts gather information for competitive analysis, compare prices on e-commerce websites, and build apps that send email notifications when prices change, as in the travel sector.

Using Bright Data Scraping Browser and GPT (Generative Pre-trained Transformer) to collect feedback about products, whether yours or a competitor's, yields actionable insights that help meet customers' needs and boost sales. Both negative and positive feedback are useful for this analysis. As an example, we will demonstrate how GPT can turn reviews posted by users on the Udemy learning platform into helpful suggestions.

Leveraging this technique serves more than just individuals; brands or companies can use it to understand what people say about their products.

Everything you will learn in this article is intended for ethical use. Bright Data is used here to turn websites into structured data that is meaningful to any user, without getting blocked or rate limited and without relying on APIs (application programming interfaces).

Let’s get started!

GitHub

Find the source code in this repo. Fork and clone it to test it yourself.

Note that the repo contains two projects: the React frontend in a folder called reviews, which displays the reviews from Udemy and the suggestions from GPT, and a Node server in a folder called headless-web-scraping, which saves the scraped data in a JSON (JavaScript Object Notation) file.

Demo

For a practical demonstration of the client-side app, check it out here.

Prerequisites

Before building or writing a line of code, check the following requirements:

Node.js >= 16, which comes with the npm package manager
Knowledge of JavaScript and React
A code editor like VS Code or any other integrated development environment (IDE)
Basic understanding of CSS

Set up Bright Data Scraping Browser

The Scraping Browser is compatible with Puppeteer and Playwright and comes with built-in website unblocking.

To begin, sign up for free on the Bright Data website. It offers a $20/GB “no commitment” plan.

Some of the great benefits of using Bright Data architecture are:

Quick
Flexible
Cost-efficient

Discover how to leverage web scraping to your advantage.

After signup, go to your dashboard and click on the Proxies and Scraping Infrastructure icon on the window's left pane.

Next, click on the Add button dropdown and select Scraping Browser. Give the proxy a name under the Solution name field and click the Add button to continue.

The next screen will display the host, username, and password values used to connect to the Scraping Browser.

Let’s get the project running by installing the boilerplate.

Installation

In this section, you will set up the boilerplate for both parts of the project using Node.js and Vite. The Node.js web scraper will handle the scripts for retrieving and storing the web data, while the React UI (user interface) will display the data from the server and GPT.

Create a parent folder that holds both the frontend and backend code, like this:

.
└── Bright_data
    ├── headless-web-scraping
    └── reviews

Node.js
To set up a Node project, first create a new directory from the terminal:

    mkdir headless-web-scraping

Next, change into the directory:

    cd headless-web-scraping

Initialize the project:

    npm init -y

The -y flag accepts all the defaults and skips the interactive prompts that would otherwise ask for the project details stored in the package.json file.

Install the following dependencies, which npm records in package.json:

    npm install dotenv puppeteer-core

dotenv: This library loads environment variables from the .env file into process.env
puppeteer-core: The Puppeteer automation library without the browser itself

Now, create the index.js file in the root directory and copy-paste this code:

index.js

    console.log("Hello world!")

Before running this script, head to the package.json file and update the scripts section as follows:

    {
      "name": "headless-web-scraping",
      ...
      "scripts": {
        "start": "node index.js"
      },
      ...
    }

Run the script:

    npm run start

This should print:

    Hello world!

React
The UI folder for this app is called reviews. Run this command within the directory reviews to scaffold a new Vite React project.

    npm create vite@latest ./

The ./ signifies that all the files and folders should be created within the current folder. The command asks a few questions in the terminal. Choose the React and JavaScript options, since the rest of this article uses React.

With the setup complete, follow the instructions in the terminal to install the dependencies and start the development server:

    npm install

    npm run dev

Open http://localhost:5173 in your browser to see the UI served by the development server.

Next, add Tailwind CSS, a utility-first CSS framework whose classes are applied directly in the JSX to build modern websites.

Check out this guide and follow the instructions on installing Tailwind CSS in a Vite project.

Creating a JavaScript Web Scraper in Node.js

Return to the Access parameters tab on your created zone and copy the host, username, and password values.

Creating Environment Variables
Environment variables keep sensitive data like secret keys and credentials out of the source code, away from unauthorized access.

Create a .env file in the root folder and paste these values into it. AUTH holds the username and password in the form username:password, and HOST holds the host value:

.env

    AUTH="<AUTH>"
    HOST="<HOST>"
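If either value is missing from .env, the connection string would quietly contain the text undefined. A minimal sketch of a guard that fails fast instead is shown here. The helper name requireEnv is hypothetical and is not part of dotenv or of the original script.

```javascript
// Hypothetical helper (not part of dotenv): read a required variable
// from an environment object and fail fast when it is missing or empty.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example with an in-memory object standing in for process.env.
const sampleEnv = { AUTH: "<AUTH>", HOST: "" };
console.log(requireEnv("AUTH", sampleEnv)); // prints "<AUTH>"

try {
  requireEnv("HOST", sampleEnv);
} catch (err) {
  console.log(err.message); // prints "Missing required environment variable: HOST"
}
```

In index.js, a call such as requireEnv("AUTH") could stand in for the direct process.env lookup once dotenv has loaded the file.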

To load these credentials, update the index.js with the following:

index.js

    const puppeteer = require("puppeteer-core");
    require("dotenv").config();
    const fs = require("fs");

    const auth = process.env.AUTH;
    const host = process.env.HOST;

    async function run() {
      let browser;

      try {
        browser = await puppeteer.connect({
          browserWSEndpoint: `wss://${auth}@${host}`,
        });
        const page = await browser.newPage();
        page.setDefaultNavigationTimeout(2 * 60 * 1000);
        await page.goto(
          "https://www.udemy.com/course/nodejs-express-mongodb-bootcamp/"
        );
        const reviews = await page.evaluate(() =>
          Array.from(
            document.querySelectorAll(
              ".reviews--reviews-desktop--3cOLE .review--review-container--knyTv"
            ),
            (e) => ({
              // Optional chaining avoids a crash when an element is missing
              reviewerName: e.querySelector(".ud-heading-md")?.innerText ?? "",
              reviewerText: e.querySelector(".ud-text-md span")?.innerText ?? "",
              id: Math.floor(Math.random() * 100),
            })
          )
        );

        const outputFilename = "reviews.json";

        // Await the write so a failure is handled by the catch block below
        await fs.promises.writeFile(
          outputFilename,
          JSON.stringify(reviews, null, 2)
        );
        console.log("file saved");
      } catch (e) {
        console.error("run failed", e);
      } finally {
        await browser?.close();
      }
    }

    if (require.main === module) run();

Some things to note in the code above:

The imported modules are puppeteer-core, dotenv, and the file system (fs) module
Within the run() function, the puppeteer.connect() method connects to a remote browser (the Bright Data Scraping Browser)
The browserWSEndpoint property is the WebSocket URL where the remote browser is running. The values interpolated into the template literal come from the Bright Data dashboard and are stored in .env: AUTH holds the username and password, and HOST holds the host
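To make the shape of that WebSocket URL concrete, here is a minimal sketch that assembles it outside of Puppeteer. The helper name buildEndpoint and the placeholder credentials are hypothetical, and the check for a colon assumes AUTH is stored as username:password.

```javascript
// Hypothetical helper: assemble the WebSocket endpoint that
// puppeteer.connect() receives as browserWSEndpoint.
function buildEndpoint(auth, host) {
  // Assumption: AUTH is stored as "username:password".
  if (!auth || !auth.includes(":")) {
    throw new Error('AUTH should look like "username:password"');
  }
  if (!host) {
    throw new Error("HOST is empty");
  }
  return `wss://${auth}@${host}`;
}

// Placeholder values only; the real ones come from the dashboard.
const endpoint = buildEndpoint("my-username:my-password", "example-host:1234");
console.log(endpoint); // prints "wss://my-username:my-password@example-host:1234"

// The standard URL class can confirm that the string parses as a URL.
const parsed = new URL(endpoint);
console.log(parsed.protocol, parsed.username, parsed.host);
```

Validating the two parts before connecting turns a vague connection failure into a clear message about which value is wrong.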

The other details in index.js are standard Puppeteer code:

Open a new page
Set the default navigation timeout to 2 minutes
Go to the course page on Udemy
Query the page's HTML with the page.evaluate() method, which loops through the matching elements in the DOM to get each reviewer's name and review text
Use Math.floor() with Math.random() to generate a random id from 0 to 99
Save the result in JSON format using the fs module
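Because Math.random() can return the same number twice, two reviews may end up with the same id, which matters later when the id is used as a React key. A minimal sketch of an alternative follows: it trims the text, drops empty entries, and numbers the rest by position. The post-processing step normalizeReviews is hypothetical and is not in the original script, and the sample reviews are made up for illustration.

```javascript
// Hypothetical post-processing step for the array returned by page.evaluate().
function normalizeReviews(rawReviews) {
  return rawReviews
    .map((review) => ({
      reviewerName: (review.reviewerName ?? "").trim(),
      reviewerText: (review.reviewerText ?? "").trim(),
    }))
    // Skip entries where no review text was found on the page.
    .filter((review) => review.reviewerText !== "")
    // Position-based ids are unique within one scrape.
    .map((review, index) => ({ ...review, id: index + 1 }));
}

// Made-up sample input with a duplicate id and an empty review.
const sample = [
  { reviewerName: " A. Reviewer ", reviewerText: "Clear explanations.", id: 7 },
  { reviewerName: "B. Reviewer", reviewerText: "   ", id: 7 },
  { reviewerName: "C. Reviewer", reviewerText: "Some parts felt outdated.", id: 42 },
];

console.log(normalizeReviews(sample));
```

The function would be called on the reviews array just before it is written to reviews.json.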

Run the script:

    npm run start

The output is saved within the headless-web-scraping folder as reviews.json and should look like this:

    [
      {
        "reviewerName": "Yash U.",
        "reviewerText": "This was a very intensive course covering almost all backend stuff. A huge thanks to the instructor - Jonas and also to the community. A lot of bugs and problems were already posted in the Q&A section and it helped a lot. Towards the end of the course, there were a few things that were outdated and a lot of people were disappointed in the comments but for me these things helped a lot. You learn to search and find solutions on your own and this is what is required in real world. Hence, despite these issues towards the end, I would absolutely recommend this course to anyone who wants to start learning backend development.",
        "id": 11
      },
      {
        "reviewerName": "Shyam Nath R S.",
        "reviewerText": "As always with Jonas's other courses like JS, HTML and CSS I understood"
      },
      ...
    ]
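Before handing the file to the UI, it can help to confirm that every entry has the fields the later steps rely on. A minimal sketch follows, built around a hypothetical checker named findIncompleteReviews. The JSON string is a shortened, made-up stand-in for the contents of reviews.json.

```javascript
const REQUIRED_KEYS = ["reviewerName", "reviewerText", "id"];

// Hypothetical checker: list which required keys each entry lacks.
function findIncompleteReviews(reviews) {
  return reviews
    .map((review, index) => ({
      index,
      missing: REQUIRED_KEYS.filter((key) => !(key in review)),
    }))
    .filter((entry) => entry.missing.length > 0);
}

// Shortened, made-up stand-in for the contents of reviews.json.
const json = `[
  { "reviewerName": "First R.", "reviewerText": "Great course.", "id": 11 },
  { "reviewerName": "Second R.", "reviewerText": "Well explained." }
]`;

// Reports the entry at index 1, which has no "id" key.
console.log(findIncompleteReviews(JSON.parse(json)));
```

An empty array means every entry is complete.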

Using GPT

If you don’t have a ChatGPT account, sign up and create one.

Copy one of the reviewerText values from reviews.json and paste it into ChatGPT, asking for suggestions on how to improve the course.

You should get something similar to this:

The suggestions or improvements:
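The copy-and-paste step can also be captured as a small helper that builds the text to paste into ChatGPT, so every review is submitted with the same instruction. This is a minimal sketch: the helper name buildPrompt is hypothetical, and both the instruction wording and the sample review are assumptions rather than the prompt used for this article.

```javascript
// Hypothetical helper: wrap one scraped review in an instruction for GPT.
// The wording of the instruction is an assumption, not the author's prompt.
function buildPrompt(review) {
  return [
    "Below is a learner's review of an online course.",
    "List concrete suggestions the instructor could act on to improve the course.",
    "",
    `Review by ${review.reviewerName}:`,
    `"""${review.reviewerText}"""`,
  ].join("\n");
}

// Made-up review for illustration.
const prompt = buildPrompt({
  reviewerName: "Example R.",
  reviewerText: "Good content, but a few later sections felt outdated.",
});
console.log(prompt);
```

Keeping the instruction fixed makes the suggestions for different reviews easier to compare.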

Creating the UI in React

React is a JavaScript library used by developers for building user interfaces with reusable components.

Now that we have the reviews and suggestions, let’s create the UI to display the data.

In the reviews project, create a new folder called components in the src directory with the following files:

.
└── reviews
    └── src
        └── components
            ├── Footer.jsx
            ├── ImproveSuggestion.jsx
            ├── ReviewImprovementSuggestions.jsx
            ├── Reviews.jsx
            └── Text.jsx

Also, create a file called reviews.js in a folder named data. It holds the responses from GPT as an array of objects, as shown:

src/data/reviews.js

.
└── reviews
    └── src
        └── data
            └── reviews.js

Get the entire data in this gist.
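The gist is the source of truth for this file. As a rough guide to its shape, the props that App.jsx reads suggest that each entry carries id, reviewerName, reviewText, and improvementSuggestions. A minimal sketch follows of how one scraped entry could be mapped into that shape. The helper toUiReview is hypothetical, and the review and suggestion text are made up.

```javascript
// Hypothetical helper: combine one scraped review with the suggestions
// GPT returned for it. The key names on the result are inferred from the
// props used in App.jsx; check the gist for the real data.
function toUiReview(scraped, suggestions) {
  return {
    id: scraped.id,
    reviewerName: scraped.reviewerName,
    reviewText: scraped.reviewerText, // note the rename from reviewerText
    improvementSuggestions: suggestions,
  };
}

// Made-up review and suggestion for illustration.
const entry = toUiReview(
  { id: 1, reviewerName: "Example R.", reviewerText: "Later sections felt outdated." },
  ["Refresh the sections that reference outdated tooling."]
);
console.log(entry);
```

The scraper stores the review text under reviewerText while the UI reads reviewText, so a mapping step like this (or consistent naming in both places) keeps the two in sync.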

Let’s update the code in the project accordingly:

Footer.jsx

    const Footer = () => {
      return (
        <>
          <footer className='mt-auto'>
            <div className='mt-5 text-center text-gray-500'>
              <address>
                Built by{" "}
                <span className='text-blue-600'>
                  <a
                    href='https://twitter.com/terieyenike'
                    target='_blank'
                    rel='noopener noreferrer'>
                    Teri
                  </a>
                </span>{" "}
                &copy; 2023
              </address>
              <div>
                <p>
                  Fork, clone, and star this
                  <a
                    href='https://github.com/Terieyenike/'
                    target='_blank'
                    rel='noopener noreferrer'
                    className='text-blue-600'>
                    <span> repo</span>
                  </a>
                </p>
              </div>
              <p className='text-sm'>Bright Data ．GPT ．React ．Tailwind CSS</p>
            </div>
          </footer>
        </>
      );
    };

    export default Footer;

Change the values in the JSX if you so desire.

ImproveSuggestion.jsx

    const ImproveSuggestion = ({ suggestion }) => {
      // Return the <li> directly so it is a valid child of the parent <ul>
      return <li className='mt-2'>{suggestion}</li>;
    };

    export default ImproveSuggestion;

ReviewImprovementSuggestions.jsx

    import ImproveSuggestion from "./ImproveSuggestion";

    const ReviewImprovementSuggestions = ({ suggestions }) => {
      return (
        <div>
          <h3 className='text-xl font-bold mt-3'>Improvement Suggestions:</h3>
          <ul className='list-disc'>
            {suggestions.map((suggestion, index) => (
              <ImproveSuggestion key={index} suggestion={suggestion} />
            ))}
          </ul>
        </div>
      );
    };

    export default ReviewImprovementSuggestions;

Reviews.jsx

    import ReviewImprovementSuggestions from "./ReviewImprovementSuggestions";

    const Reviews = ({ reviewerName, reviewText, improvementSuggestions }) => {
      return (
        <div className='mb-8'>
          <h3 className='text-xl font-bold'>
            <span>Reviewer name:</span>
          </h3>
          <p className='mb-3'>{reviewerName}</p>
          <h3 className='text-xl font-bold'>
            <span>Review:</span>
          </h3>
          <p>{reviewText}</p>
          {improvementSuggestions && (
            <ReviewImprovementSuggestions suggestions={improvementSuggestions} />
          )}
        </div>
      );
    };

    export default Reviews;

Text.jsx

    const Text = () => {
      return (
        <>
          <div className='bg-emerald-800 text-slate-50 p-5 mb-10'>
            <h1 className='text-2xl font-bold md:text-4xl'>
              Using Scraping Browser and GPT for actionable product insights.
            </h1>
            <p className='text-sm mt-3 md:text-xl'>
              Extract reviews from a specific product page{" "}
              <span className='font-bold'>Udemy</span> using Bright Data, Scraping
              Browser and GPT to analyze them to offer business insights.
            </p>
          </div>
        </>
      );
    };

    export default Text;

Several of the components above receive their data through props passed down from a parent component. Check out the React documentation to learn more.

The React UI will still display the default boilerplate template in the browser. To show the current changes made to the files in the components, let’s update the entry point of the project, App.jsx, with this code:

src/App.jsx

    import Reviews from "./components/Reviews";
    import Text from "./components/Text";
    import Footer from "./components/Footer";

    import { reviews } from "./data/reviews";

    import "./App.css";

    function App() {
      return (
        <>
          <div className='flex flex-col container mx-auto max-w-6xl w-4/5 py-8 min-h-screen'>
            <Text />
            {reviews.map((review) => (
              <Reviews
                key={review.id}
                reviewerName={review.reviewerName}
                reviewText={review.reviewText}
                improvementSuggestions={review.improvementSuggestions}
              />
            ))}
            <Footer />
          </div>
        </>
      );
    }

    export default App;

Starting the development server will display the project like this:

Conclusion

Because it avoids website bans and works seamlessly with libraries like Puppeteer, Bright Data Scraping Browser is an excellent option for developers who need to deliver high-quality scraped data.

Scraping the web presents difficulties because websites use preventive measures like CAPTCHAs and other techniques to safeguard user data, so automated requests to a company's endpoints may be blocked.

In this article, you learned how to inspect a webpage's elements and extract the necessary data using Node.js, gathering user reviews from Udemy and storing them in a JSON file. The project's final steps were using GPT to generate suggestions from the reviews and showing the outcome in a user interface.

Finally, these services and tools can help brands, companies, or individuals align their products with customer expectations. For the Udemy case study, GPT provided ways to improve the course and make it more suitable for learners. Websites are encouraged to allow reviews from actual product users, which gives GPT real feedback to analyze.

Try the Scraping Browser today!
