Strapi, another use case: Build your own API from any website with Puppeteer

ELABBASSI Hicham (hichamelbsi) ・5 min read

The objective of this tutorial is to build a simple job search API with Strapi and Puppeteer. Strapi is an open-source headless CMS written in Node.js, and Puppeteer is an open-source Node.js API for controlling a headless Chrome browser.

It seems the time is ripe for headless tools... 😆 (That said, there is no direct link between Strapi and Puppeteer apart from the word "headless".)

Strapi

Strapi lets you build powerful APIs with little effort. Among its features is CRON task configuration, which is useful here because we will use it to schedule the Puppeteer script.

1. Strapi installation

Well, let's start this tutorial. The first thing we need to do is to install Strapi.

yarn create strapi-app job-api --quickstart

If you don't want to use yarn, there are other possibilities to install Strapi in the documentation.

2. Strapi admin user

This command should install Strapi and open your browser. Then, you will be able to create your admin user.
Strapi admin user creation

3. Job Collection type

In the Strapi admin home page, click on the blue button CREATE YOUR FIRST CONTENT-TYPE.
Strapi admin home page
You will be redirected to the collection type creation form.
Strapi collection type creation

After that, you will be able to add fields to the Job collection type.
Strapi fields list
Strapi field form

For our basic example, we will need to create five text fields (title, linkedinUrl, companyName, descriptionSnippet, and timeFromNow).
Strapi Job fields
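
For reference, a Job entry with these five text fields can be pictured as a plain object. The sketch below is only illustrative: the sample values and the hasAllJobFields helper are hypothetical and are not part of Strapi.

```javascript
"use strict";

// The five text fields defined on the Job collection type.
const JOB_FIELDS = [
  "title",
  "linkedinUrl",
  "companyName",
  "descriptionSnippet",
  "timeFromNow",
];

// Hypothetical helper: true when every field is a non-empty string.
function hasAllJobFields(job) {
  return JOB_FIELDS.every(
    (field) => typeof job[field] === "string" && job[field].trim() !== ""
  );
}

// Made-up sample entry, for illustration only.
const sampleJob = {
  title: "React.js Developer",
  linkedinUrl: "https://www.linkedin.com/jobs/view/example",
  companyName: "Example Corp",
  descriptionSnippet: "Build user interfaces with React.",
  timeFromNow: "2 hours ago",
};

console.log(hasAllJobFields(sampleJob)); // true
console.log(hasAllJobFields({ title: "Only a title" })); // false
```

A check like this could be run on each scraped job before it is stored, so that incomplete entries are skipped.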

Don't forget to click the Save button, which restarts the Strapi server.

Strapi server restart
After that, we can put the Strapi admin aside for the moment and open the Strapi repository in an editor.

Strapi CRON task

First, we need to enable CRON in the Strapi server configuration.
Open the config/environments/development/server.json file and set cron.enabled to true:

{
  "host": "localhost",
  "port": 1337,
  "proxy": {
    "enabled": false
  },
  "cron": {
    "enabled": true
  },
  "admin": {
    "autoOpen": false
  }
}

Then let's create the CRON task. Open the ~/job-api/config/functions/cron.js file and replace its content with the following:

"use strict";
module.exports = {
  // The cron should display "{date} : My super cron task!" at every minute.
  "*/1 * * * *": (date) => {
    console.log(`${date} : My super cron task!\n`);
  },
};

Now, restart the Strapi server and let's see if our cron task is running properly.

yarn develop
yarn run v1.21.1
$ strapi develop

 Project information

┌────────────────────┬──────────────────────────────────────────────────┐
│ Time               │ Thu Apr 16 2020 01:40:49 GMT+0200 (GMT+02:00)    │
│ Launched in        │ 1647 ms                                          │
│ Environment        │ development                                      │
│ Process PID        │ 20988                                            │
│ Version            │ 3.0.0-beta.18.7 (node v10.16.0)                  │
└────────────────────┴──────────────────────────────────────────────────┘

 Actions available

Welcome back!
To manage your project 🚀, go to the administration panel at:
http://localhost:1337/admin

To access the server ⚡️, go to:
http://localhost:1337

Thu Apr 16 2020 01:41:00 GMT+0200 (GMT+02:00) : My super cron task!

Thu Apr 16 2020 01:42:00 GMT+0200 (GMT+02:00) : My super cron task!

Thu Apr 16 2020 01:43:00 GMT+0200 (GMT+02:00) : My super cron task!

...

We can see that "{date} : My super cron task!" is printed in the terminal every minute.
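
If the cron expression itself looks cryptic, it can help to name its fields. The sketch below assumes the standard five-field layout (minute, hour, day of month, month, day of week); the describeCron helper is hypothetical and only splits the string, it does not schedule anything.

```javascript
"use strict";

// Names of the five fields of a standard cron expression, in order.
const CRON_FIELDS = ["minute", "hour", "dayOfMonth", "month", "dayOfWeek"];

// Hypothetical helper: split a five-field cron expression into named parts.
function describeCron(expression) {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== CRON_FIELDS.length) {
    throw new Error(`Expected 5 fields, got ${parts.length}`);
  }
  const described = {};
  CRON_FIELDS.forEach((name, index) => {
    described[name] = parts[index];
  });
  return described;
}

// "*/1" in the minute field means "every minute"; the other fields are "*".
console.log(describeCron("*/1 * * * *").minute); // */1
```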

Puppeteer

Puppeteer can automate most actions you can perform in the browser. You can use it to automate flows, take screenshots, and generate PDFs. In this tutorial, we will use Puppeteer to get the list of ReactJS jobs from LinkedIn. We will also use Cheerio to select the data in the received markup.

Now that the CRON task is working well, we will install Puppeteer and Cheerio in the Strapi project.

cd job-api
yarn add puppeteer cheerio 

Let's adapt the CRON task to get the list of ReactJS jobs published on LinkedIn in the last 24 hours in the San Francisco Bay Area.

In the ~/job-api/config/functions/cron.js file:

"use strict";
// Require the puppeteer module.
const puppeteer = require("puppeteer");

module.exports = {
  // Execute this script every 24 hours. (If you need to change the cron
  // expression, you can use an online cron expression editor such as
  // https://crontab.guru.)
  "0 */24 * * *": async (date) => {
    // 1 - Create a new browser.
    const browser = await puppeteer.launch({
      args: ["--no-sandbox", "--disable-setuid-sandbox", "--lang=fr-FR"],
    });

    // 2 - Open a new page on that browser.
    const page = await browser.newPage();

    // 3 - Navigate to the linkedin url with the right filters.
    await page.goto(
      "https://fr.linkedin.com/jobs/search?keywords=React.js&location=R%C3%A9gion%20de%20la%20baie%20de%20San%20Francisco&trk=guest_job_search_jobs-search-bar_search-submit&redirect=false&position=1&pageNum=0&f_TP=1"
    );

    // 4 - Get the content of the page.
    let content = await page.content();
  },
};
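
The long search URL is easier to adjust when it is assembled from named parameters. Here is a minimal sketch using Node's built-in URL class. The buildJobSearchUrl helper is hypothetical, the tracking parameter (trk) is left out on the assumption that it is optional, and f_TP=1 is assumed to be the "past 24 hours" filter because that is how it is used in the URL above.

```javascript
"use strict";

// Hypothetical helper: build the job search URL from readable parameters.
function buildJobSearchUrl({ keywords, location, pastDay = true }) {
  const url = new URL("https://fr.linkedin.com/jobs/search");
  url.searchParams.set("keywords", keywords);
  url.searchParams.set("location", location);
  url.searchParams.set("redirect", "false");
  url.searchParams.set("position", "1");
  url.searchParams.set("pageNum", "0");
  if (pastDay) {
    // Assumption: f_TP=1 is the "past 24 hours" filter, as in the URL above.
    url.searchParams.set("f_TP", "1");
  }
  return url.toString();
}

const searchUrl = buildJobSearchUrl({
  keywords: "React.js",
  location: "Région de la baie de San Francisco",
});
console.log(searchUrl);
```

The resulting string can be passed to page.goto() in place of the hard-coded URL.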

Next, parse the HTML content with Cheerio and store each job through the global strapi object.

"use strict";
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

module.exports = {
  "0 */24 * * *": async (date) => {
    const browser = await puppeteer.launch({
      args: ["--no-sandbox", "--disable-setuid-sandbox", "--lang=fr-FR"],
    });
    const page = await browser.newPage();
    await page.goto(
      "https://fr.linkedin.com/jobs/search?keywords=React.js&location=R%C3%A9gion%20de%20la%20baie%20de%20San%20Francisco&trk=guest_job_search_jobs-search-bar_search-submit&redirect=false&position=1&pageNum=0&f_TP=1"
    );
    let content = await page.content();

    // 1 - Load the HTML
    const $ = cheerio.load(content);

    // 2 - Select the HTML element you need
    // For the tutorial case, we need to select the list of jobs and for each element, we will
    // create a new job object to store it in the database with Strapi.
    $("li.result-card.job-result-card").each((i, el) => {
      if (Array.isArray(el.children)) {
        const job = {
          title: el.children[0].children[0].children[0].data,
          linkedinUrl: el.children[0].attribs.href,
          companyName:
            el.children[2].children[1].children[0].data ||
            el.children[2].children[1].children[0].children[0].data,
          descriptionSnippet:
            el.children[2].children[2].children[1].children[0].data,
          timeFromNow: el.children[2].children[2].children[2].children[0].data,
        };

        // 3 - Store the job with the Strapi global.
        strapi.services.job.create(job);
      }
    });

    // 4 - Close the browser
    await browser.close();
  },
};
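
The chained el.children[...] lookups throw a TypeError as soon as the page markup changes shape. One way to soften that is to walk the path step by step and return null when a step is missing. This is only a sketch: the textAt helper is hypothetical, and the fake node below merely imitates the shape used in the script above (children arrays, with text nodes carrying a data property).

```javascript
"use strict";

// Hypothetical helper: follow a list of child indexes from a node and return
// the text found there, or null when any step of the path is missing.
function textAt(node, path) {
  let current = node;
  for (const index of path) {
    if (!current || !Array.isArray(current.children)) {
      return null;
    }
    current = current.children[index];
  }
  return current && typeof current.data === "string" ? current.data : null;
}

// Minimal stand-in for a parsed <li>, for illustration only.
const fakeListItem = {
  children: [{ children: [{ children: [{ data: "React.js Developer" }] }] }],
};

console.log(textAt(fakeListItem, [0, 0, 0])); // React.js Developer
console.log(textAt(fakeListItem, [2, 1, 0])); // null
```

In the script, title could then be read as textAt(el, [0, 0, 0]), and a job with a null field could be skipped instead of crashing the CRON task.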

Restart the Strapi server and go back to the admin at http://localhost:1337/admin.
Once the CRON task has run, you should see the data from LinkedIn in the Job content manager.
Strapi content manager list view
Strapi content manager details view

Good job! You've just built an API from another website in a few minutes 😄
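
To read the stored jobs back over HTTP, something like the sketch below could work. It rests on assumptions about your setup: that Strapi exposes the Job collection type at GET /jobs (the default REST route in Strapi v3), that the find permission has been enabled for the Public role in the admin, and that _limit is accepted as a query parameter. The buildJobsUrl and fetchJobs helpers are hypothetical names.

```javascript
"use strict";
const http = require("http");

// Hypothetical helper: build the URL of the jobs endpoint.
function buildJobsUrl(baseUrl, params = {}) {
  const url = new URL("/jobs", baseUrl);
  Object.entries(params).forEach(([key, value]) => {
    url.searchParams.set(key, String(value));
  });
  return url.toString();
}

// Fetch the jobs and resolve with the parsed JSON body.
function fetchJobs(url) {
  return new Promise((resolve, reject) => {
    http
      .get(url, (res) => {
        let body = "";
        res.setEncoding("utf8");
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => {
          if (res.statusCode !== 200) {
            reject(new Error(`Unexpected status ${res.statusCode}`));
            return;
          }
          try {
            resolve(JSON.parse(body));
          } catch (err) {
            reject(err);
          }
        });
      })
      .on("error", reject);
  });
}

// Usage (requires the Strapi server to be running):
// fetchJobs(buildJobsUrl("http://localhost:1337", { _limit: 5 }))
//   .then((jobs) => jobs.forEach((job) => console.log(job.title)))
//   .catch(console.error);
```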

Discussion


Great article, testing it out and new to JS.

I noticed that "li.result-card.job-result-card" is no longer working.
Can you please update it or point me to what I should look for?

 

If the selector doesn't work, you can open the LinkedIn job search in your browser and copy the li selector from the dev tools.

 

Dude this is great man, I got it to work and thanks for teaching me.

Here is the updated li selector:
li.result-card.job-result-card.result-card--with-hover-state.job-card__contents--active

Thank you, Sif!

Well, it seems that the selector in the tutorial, li.result-card.job-result-card, still works. Be careful: the selector in your reply matches only the active list item (.job-card__contents--active is the CSS class applied to the selected list item). We need all the list items, not just the selected one, so you need a more generic selector.

Thanks for that, I will give that a try

 

I followed the steps to the letter, but don't see jobs turning up in my admin page. I see GETs ongoing in my terminal, so I suppose that means that the data is being fetched? If I am not mistaken?

 

Hello Dushyant,

Can you see your content type in the Strapi admin page? Can you share your CRON task script please?

 

Thanks for the reply, sir.

Yes, I can see the content-type, Jobs, in the admin page.

Here is my CRON script(functions/cron.js)
gist.github.com/dkp1903/d598e143ea...

You're welcome.

Well, it should work. Can you confirm that CRON is enabled (set to true) in config/environments/development/server.json?

Also, keep in mind that the CRON task in this example runs every 24 hours. Did you wait 24 hours to test it? You could change the CRON expression to run your script every minute, just to check that the script works.

...
"*/1 * * * *": (date) => {
...

Don't forget to stop the server after the test :D

It works, sir. Forgot about the 24 hour thing. Switched it to a minute and it works right as rain.

Thanks a million!

 

Nice!

Tip: /Users/helabbassi/perso/ should be replaceable with ~.

 

Oh thank you Dan !

 

Don't mean to be a killjoy, but the LinkedIn part seems to be a violation of the LinkedIn ToS

linkedin.com/help/linkedin/answer/...

LinkedIn has banned users for seemingly harmless apps in the past. Can you update the article to use another site as an example?

 

Great tutorial! Keep it up!

 

Thank you, Victor.

 

Thanks for this! I was already building out a job board using Strapi and was manually inputting some of the things. This was a huge help to get some other data.

 

What is the best way to manage authentication with Puppeteer, to get access to our own data on LinkedIn?

 

Hi Lucas,

I haven't had time to test this solution (and I don't think it's the best way to do it), but I think you will need to sign in to LinkedIn with your browser (to start a session) and find the li_at cookie in the dev tools. Then you can set this cookie before navigating to LinkedIn (just before the await page.goto(...)).

await page.setCookie({
      'name': 'li_at',
      'value': YOUR_COOKIE_VALUE,
      'domain': '.www.linkedin.com'
})

I really recommend creating a simple function to check whether you are logged in. Something like:

const checkIfLoggedIn = async (page) => {
     const isAuthenticated = await page.$('.sign-in-card') === null;
     return isAuthenticated;
}

I think this function needs to be called after the setCookie and the navigation, because your LinkedIn session may have expired.
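
Putting the two snippets together could look like the sketch below. It is untested against LinkedIn: the openWithSession name is made up, page is expected to be a Puppeteer Page, and the '.sign-in-card' selector is the same guess as above.

```javascript
"use strict";

// Sketch only: `page` is expected to be a Puppeteer Page. The function name,
// the cookie value, and the '.sign-in-card' selector are assumptions.
async function openWithSession(page, url, cookieValue) {
  // Set the session cookie before navigating.
  await page.setCookie({
    name: "li_at",
    value: cookieValue,
    domain: ".www.linkedin.com",
  });
  await page.goto(url);

  // If the sign-in card is present, the session is missing or has expired.
  const signInCard = await page.$(".sign-in-card");
  if (signInCard !== null) {
    throw new Error("Not logged in: the li_at cookie may have expired");
  }
  return page;
}
```

Throwing when the sign-in card shows up makes an expired cookie visible in the server logs instead of silently storing nothing.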

Feel free to add some additional information about this solution or suggest a better way to do that.