
Daniel Cerverizzo


Unlocking the World of International Job Listings: A Node.js and Puppeteer Web Scraping Project🚀

Introduction

Job hunting can be a daunting task, especially when excellent opportunities are scattered across countless job platforms. Faced with this challenge, I decided to streamline my job search by consolidating the websites I visited most frequently into a single, accessible resource.

But how did I go about it? My solution was to create a web scraper using cutting-edge technologies such as Puppeteer, Node.js, and MongoDB. This blog post takes you on a journey through the structure and development of this simple yet powerful project.

The Quest Begins

The first step in my mission to simplify the job search process was to leverage web scraping. Web scraping allowed me to extract data from multiple job websites, collate it, and present it in a user-friendly format.

For this, I chose Puppeteer, a Node.js library for driving a headless Chrome browser, running on Node.js, a powerful JavaScript runtime. These technologies worked in tandem to retrieve job listings and relevant details. With the data collected, I stored it efficiently using MongoDB, a document-based NoSQL database.

The Building Blocks

To commence the project, I built the web scraper with Puppeteer. This gave me programmatic access to web pages, from which I could extract the crucial job listing data.

Node.js played a vital role in orchestrating this process. By utilizing JavaScript, I could craft functions to navigate web pages, retrieve job descriptions, and compile the data into structured information.
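
Before diving into the full project, here is a minimal sketch of that flow. The URL and selector below are placeholders, not the project's real configuration:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chrome instance and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to a page (placeholder URL)
  await page.goto('https://example.com/jobs');

  // Collect the trimmed text of every element matching a placeholder selector
  const titles = await page.$$eval('.job-title', (elements) =>
    elements.map((element) => element.textContent.trim())
  );

  console.log(titles);
  await browser.close();
})();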

Storing Data for Easy Access

MongoDB, known for its flexibility in handling unstructured data, proved invaluable. It served as the perfect repository for the job listings gathered from web scraping.

The NoSQL database stored each job listing as a document, making it easier to organize, retrieve, and display the data in a user-friendly manner.
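
Concretely, each scraped listing is stored as one document with the same fields the scraper extracts; the values below are purely illustrative:

{
  title: 'Senior Node.js Developer',
  company: 'Acme Corp',
  location: 'Remote',
  link: 'https://example.com/jobs/123'
}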

Project structure

The project's root directory is where all your project files and subdirectories reside.

src/

The src/ directory contains the source code of your project.

  • scripts/: This directory houses the core logic of your web scraping and database operations.

    • scraper.js: The main script for web scraping using Puppeteer and Node.js.
    • database.js: Script for handling MongoDB database operations.
  • server.js: Your main Node.js application file to serve the scraped data to a frontend.

models/

The models/ directory contains the data models, schemas, or structures for your project.

  • jobSchema.js: Defines the schema for job listings to be stored in your MongoDB database (see the sketch just below).
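
For illustration, a minimal jobSchema.js could look like this. The sketch assumes Mongoose is used for the data model; the field names mirror what the scraper collects, but the real file may differ:

const mongoose = require('mongoose');

// One document per job listing, with the fields the scraper collects
const jobSchema = new mongoose.Schema({
  title: { type: String, required: true },
  company: String,
  location: String,
  link: { type: String, unique: true }, // unique index guards against duplicate listings
});

module.exports = mongoose.model('Job', jobSchema);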

utils/

The utils/ directory contains utility files, configurations, and other miscellaneous scripts.

  • sites.js: A configuration file listing the websites to scrape, including selectors for job details.

  • config.js: Configuration settings for your database connection.

node_modules/

This directory contains the Node.js modules and packages that your project depends on. You don't need to manage this directory manually.

.gitignore

The .gitignore file specifies which files or directories should be ignored when you push your project to a version control system like Git. Commonly, it includes the node_modules/ directory.
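
At its simplest, the .gitignore for a project like this can be a single line:

node_modules/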

package.json

The package.json file lists project metadata and dependencies. It's also where you specify your project's main entry point and various scripts.
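
For reference, a package.json for this project might look roughly like the following; the version numbers and script names are assumptions:

{
  "name": "web-scraping-jobs",
  "version": "1.0.0",
  "main": "src/server.js",
  "scripts": {
    "start": "node src/server.js",
    "scrape": "node src/scripts/scraper.js"
  },
  "dependencies": {
    "mongodb": "^6.0.0",
    "puppeteer": "^21.0.0"
  }
}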

README.md

The README file provides essential information about your project, including how to set it up, run it, and any other necessary documentation.

This structured approach keeps your project organized, making it easier to manage and collaborate with others. The main logic for web scraping and database operations is separated, ensuring a clean and maintainable codebase. You can customize this structure based on your specific project needs.

Code

The Scraper Class

Our scraper will be encapsulated in a class for modularity and maintainability. Here's what it looks like:

// Import necessary modules and configuration
const puppeteer = require('puppeteer');
const sites = require('../utils/sites');
const database = require('./database'); // MongoDB helpers: connect, clearData, saveDataToMongoDB

class Scraper {
  async scrapeData(site) {
    // Create a headless Chromium browser using Puppeteer
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
      // Navigate to the specified website
      await page.goto(site.url, { timeout: 600000 });

      // Select all job listings on the page using a provided selector
      const jobList = await page.$$(site.selectors.list);
      const jobData = [];

      // Loop through the job listings and extract relevant information
      for (const job of jobList) {
        const title = await job.$eval(site.selectors.title, (element) => element.textContent.trim());
        const company = await job.$eval(site.selectors.company, (element) => element.textContent.trim());
        const location = await job.$eval(site.selectors.location, (element) => element.textContent.trim());
        const link = await job.$eval(site.selectors.link, (element) => element.href);
        jobData.push({ title, company, location, link });
      }

      return jobData;
    } finally {
      // Close the browser after scraping is complete
      await browser.close();
    }
  }

  async init() {
    try {
      // Connect to the MongoDB database
      await database.connect();
      // Clear existing data in the database
      await database.clearData();
      // Scrape each configured site in parallel; Promise.all yields one array of jobs per site
      const scrapedData = await Promise.all(sites.map((site) => this.scrapeData(site)));
      // Flatten into a single list of job documents and save it to MongoDB
      await database.saveDataToMongoDB(scrapedData.flat());
    } catch (error) {
      console.error('Error in app:', error);
    } finally {
      console.log('Finish!');
    }
  }
}

module.exports = Scraper;
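
With the class exported, the entry point (for example server.js, or a standalone script run via npm) can trigger a full scrape-and-store cycle:

const Scraper = require('./scripts/scraper');

// Scrape every configured site and persist the results to MongoDB
const scraper = new Scraper();
scraper.init();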
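
The scraper is driven by utils/sites.js, the configuration file that lists each website to scrape along with the CSS selectors for its job details:
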
// Configuration for websites to scrape
const sites = [
  {
    name: 'Remotive',
    url: 'https://remotive.com/remote-jobs/software-dev',
    selectors: {
      list: 'li.tw-cursor-pointer',
      title: 'a.tw-block > span',
      company: 'span.tw-block',
      location: 'span.job-tile-location',
      link: 'a.tw-block',
    },
  },
];

module.exports = sites;
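
The database helpers that scraper.js relies on live in scripts/database.js, which this post does not show. Below is a minimal sketch of what connect, clearData, and saveDataToMongoDB could look like using the official mongodb driver; the database and collection names, and the shape of utils/config.js, are assumptions:

const { MongoClient } = require('mongodb');

// Assumed shape of utils/config.js: module.exports = { mongoUri: 'mongodb://...' };
const config = require('../utils/config');

const client = new MongoClient(config.mongoUri);
const jobs = () => client.db('jobsdb').collection('jobs');

module.exports = {
  // Open the connection once before scraping starts
  async connect() {
    await client.connect();
  },

  // Remove stale listings so each run starts from a clean slate
  async clearData() {
    await jobs().deleteMany({});
  },

  // Persist the flattened list of job documents
  async saveDataToMongoDB(jobData) {
    if (jobData.length > 0) {
      await jobs().insertMany(jobData);
    }
  },
};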

Conclusion

The fusion of Puppeteer, Node.js, and MongoDB created a comprehensive solution to simplify job searches. With this project, I centralized data from various websites, making it easier for jobseekers to access the most relevant listings. By sharing this experience, I hope to inspire others to embark on similar projects, harnessing the power of web scraping and innovative technologies. The journey to streamline your job search begins here!

You can access this project online here:

https://jobs-one-drab.vercel.app/

https://github.com/Dcerverizzo/web-scraping-jobs
