DEV Community

Cover image for Puppeteer in Node.js: Common Mistakes to Avoid
Greg Gorlen for AppSignal

Posted on • Edited on • Originally published at blog.appsignal.com

Puppeteer in Node.js: Common Mistakes to Avoid

Puppeteer is a powerful Node.js browser automation library for integration testing and web scraping. However, like any complex software, it comes with plenty of potential pitfalls.

In this article, I'll discuss a variety of common Puppeteer mistakes I've encountered in personal and consulting projects, as well as when monitoring the Puppeteer tag on Stack Overflow. Once you're aware of these problematic patterns, you can write more robust scraping and testing code, while spending less time debugging and wading through arcane Puppeteer errors.

Pre-requisites

The article was written using Node 18, Puppeteer 19.4.1, Chrome 108.0.5359.125, and Firefox 108.0.1.

We will assume you are familiar with ES6 JavaScript syntax, browser development tools, the browser DOM, and Node, and have previously written some Puppeteer scripts.

Now let's examine the pitfalls of Puppeteer.

Common Pitfalls in Puppeteer for Node.js

Attempting to Return Objects and DOM Elements from evaluate Callbacks

The following snippet should be a familiar pattern to those who've previously used Puppeteer:

const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://www.example.com";
  await page.goto(url, { waitUntil: "domcontentloaded" });

  const element = await page.evaluate(() => {
    // executed in the browser
    return document.querySelector("h1");
  });
  console.log(element); // => always an empty object, {}
})()
  .catch((err) => console.error(err))
  .finally(() => browser?.close());
Enter fullscreen mode Exit fullscreen mode

Note: I'll skip the IIFE, import, and error-handling boilerplate in the remainder of this article.

The code above attempts to return a DOM element from the browser context back to Node for further processing (clicking it, typing into it, extracting its text content, etc.). However, the evaluate call, which runs code in the browser context, resolves to an empty object in Node. DOM elements are complex structures with circular references and cannot be readily serialized and deserialized. These elements can't be decoupled from the browser environment in a meaningful way.

This behavior isn't specific to Puppeteer. Running JSON.stringify(document.querySelector("h1")) on a page with a header element should return '{}' on a Chromium-based browser. Firefox gives 'null', also indicating a serialization failure.

One solution is to use page.$("h1") (or the more general page.evaluateHandle) to create a Puppeteer ElementHandle that exposes an interface to the DOM element. Alternatively, you can use JSHandle which exposes an interface to the JS object. These interfaces enable you to run code in the browser context on the element, possibly to extract serializable data such as text content or element properties, or issue trusted events. Be sure to dispose of these handles when you no longer need them to avoid memory leaks.

I've used evaluate here as it's the most general way to run code in the browser, but the same behavior applies to $eval and $$eval as well. These methods are shorthands for the common case when document.querySelector or document.querySelectorAll is the first line in the evaluate callback.

Trying to Access Variables from an evaluate Callback

At times, variables may appear to be in scope from an evaluate callback in Puppeteer when they aren't. The following example is a bit contrived because we could use $eval, but it nevertheless illustrates the pattern succinctly:

const selector = "h1";
const text = await page.evaluate(() => {
  return document.querySelector(selector).textContent;
});
Enter fullscreen mode Exit fullscreen mode

Here, selector seems like it should be in scope of the evaluate callback, but it doesn't exist when the callback runs in the browser, throwing Evaluation failed: ReferenceError: selector is not defined. Yet again, serialization is the culprit: the browser is a completely separate process from Node that doesn't have your Node variables in scope.

Using the string version of evaluate makes the situation clearer:

const selector = "h1";
const text = await page.evaluate(`
  document.querySelector("${selector}").textContent
`);
console.log(text); // => Example Domain
Enter fullscreen mode Exit fullscreen mode

However, building a string can lead to quoting problems. This motivates the more general approach β€” passing data to the browser by including extra arguments to evaluate:

const selector = "h1";
const text = await page.evaluate((selector) => {
  return document.querySelector(selector).textContent;
}, selector);
Enter fullscreen mode Exit fullscreen mode

This parameter-passing pattern also applies to other important Puppeteer calls, such as page.waitForFunction. A subtle difference is that waitForFunction's second argument is a configuration options object, followed by the variable parameter arguments:

page.waitForFunction((arg1, arg2) => {...}, options, arg1, arg2);
Enter fullscreen mode Exit fullscreen mode

This Stack Overflow post offers tips on passing complex arguments to evaluate calls.

Assuming Browser DevTools Is the Same as Node

Puppeteer programmers stuck on a bug often claim their selectors work in browser developer tools, but fail in Puppeteer. Unfortunately, there's no guarantee that code that works in browser developer tools will also work in Puppeteer. Here are some reasons why:

  • Developer tools exposes iframe and shadow root subtrees. Puppeteer requires these trees to be explicitly expanded. It’s worth noting that Microsoft's Playwright library has locators that expand the shadow DOM by default.
  • By the time you get around to interacting with developer tools, the page has typically loaded its resources and executed its JS scripts. In Puppeteer, you can use waitForSelector calls to ensure JS-injected elements are available before interaction.
  • When working in an unautomated browser's developer tools, the website's server trusts you and delivers a full experience. In Node, Puppeteer scripts are often detected as bots, so they are blocked outright or served a restricted version of a page. Adding console.log(await page.content()) is an easy way to verify that your HTML structure in Puppeteer is what you expect.

Recognizing developer tools and Node as distinct environments goes a long way to ensuring smooth translations from your developer tools exploration code to the final Puppeteer script.

Assuming Puppeteer's Headless Mode Works the Same as Headful

Just as DevTools is distinct from Node, it's a mistake to assume that Puppeteer's headless mode works the same as headful mode. Websites have a much easier time detecting scripts as bots when in headless mode than in headful mode.

As with the above tip, console.log(await page.content()) is a great way to ensure a document is what you expect. If a selector you see in the developer tools or in headful mode isn't in the headless log, there's a good chance you've been blocked.

Not Using Promise.all When Triggering Navigation with a Click

Navigation is a common point of failure in Puppeteer scripts. The following code is unsafe:

await page.click("#submit"); // trigger a navigation
await page.waitForNavigation();
Enter fullscreen mode Exit fullscreen mode

In fact, this is a race condition. If the navigation resolves before waitForNavigation has the chance to run, the script may throw a timeout error. The correct pattern is:

await Promise.all([
  page.waitForNavigation(),
  page.click("#submit"), // trigger a navigation
]);
Enter fullscreen mode Exit fullscreen mode

Or:

const navigationPromise = page.waitForNavigation();
await page.click("#submit"); // trigger a navigation
await navigationPromise;
Enter fullscreen mode Exit fullscreen mode

In these examples, the navigation wait promise is set before the navigation is triggered, ensuring that it will resolve as intended.

Making Unnecessary Calls to waitForNetworkIdle or waitForNavigation

Another navigation-related mistake is making spurious calls to waitForNetworkIdle or waitForNavigation. For example:

await page.goto("some url");
await page.waitForNavigation();
Enter fullscreen mode Exit fullscreen mode

This is logical: we want to trigger a navigation with goto, then wait for that navigation to settle. But goto already waits for navigation, so the second waitForNavigation is waiting for a navigation that's already occurred, causing a timeout.

It's a similar story when waiting for an idle network state, either with a waitForNetworkIdle call or page.goto(url, {waitUntil: "networkidle0"}). In fact, waiting for an idle network can make a script hang forever if the automated page keeps enough long-running connections open.

networkidle2 is usually safer since it tolerates two long-running connections. However, it is often used out of laziness or lack of awareness in place of the clear-cut waitForSelector and waitForFunction predicates.

Using Infinite Timeouts

It's a mistake to set any timeout to 0 β€” for example, with page.setDefaultTimeout(0). Infinite timeouts introduce the potential for the script to block forever when encountering an unexpected state, without giving a clear error message. Under most circumstances, when a script hangs on a selector or navigation for more than a few minutes, it should log an error and either exit so its maintainer can fix the problem, or restart itself if it should keep attempting to do something.

Usually, when I come across infinite timeouts in Puppeteer scripts, it's an artifact of attempting to fix a deeper issue, like the script being detected as a bot and blocked. But the infinite timeout makes these errors harder to detect and resolve by stifling them and causing a silent hang.

Forgetting to Await a Puppeteer Call

Almost all Puppeteer API calls are asynchronous. The reason for the asynchronous interface is that the browser runs in a separate process from Node. Puppeteer's methods send and receive data and wait for the browser process to respond, much like networking or file system operations. The Node process can use this time to perform CPU-bound work.

A common mistake is forgetting to await a promise returned by a Puppeteer call. This can lead to confusing and non-deterministic errors and race conditions.

For example, omitting await on a page.goto call may result in a Protocol error (Page.navigate): Target closed or Execution context was destroyed, most likely because of a navigation.

Awaiting Synchronous Puppeteer Calls

One solution to the missing await problem described above is to await everything, but this can lead to confusion as well.

For example, I see the following pattern often:

await page.on("request", (request) => {
  /* handle the request */
});
// do stuff after the request has been handled
Enter fullscreen mode Exit fullscreen mode

Since page.on doesn't return a promise, it's easy to forget that // do stuff after the request has been handled runs before the request handler callback. The callback is in a different promise chain.

To solve this, use Puppeteer's page.waitForRequest (or page.waitForResponse) instead of page.on, which acts as a shorthand for manually promisifying the page.on callback.

Creating Node.js Memory Leaks with page.on Listeners

Consider the following code:

for (;;) {
  page.on("request", () => {});
  await setTimeout(10); // from "timers/promises"
}
Enter fullscreen mode Exit fullscreen mode

This code reinstalls an event listener over and over again, slowly eating memory. The problem seems obvious in this minimal example, but I've seen it buried in the midst of moderately-complex, long-running jobs that eventually crash.

Misunderstanding page.exposeFunction

page.exposeFunction enables your browser code to trigger Node code. As with evaluate, you can't pass DOM elements or other non-serializable structures as parameters, so a typical use case is passing serialized data like JSON or text for periodic processing. In the common case, you'll use evaluate to extract data rather than exposeFunction.

Doing Too Much Work in Parallel

Another obvious pattern when seen in isolation is:

const urlsToScrape = [
  /* large array */
];

const browser = await puppeteer.launch();
const results = await Promise.all(
  urlsToScrape.map(async (url) => {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return page.$eval("h1", (element) => element.textContent);
  })
);
Enter fullscreen mode Exit fullscreen mode

If urlsToScrape happens to be large enough, the memory and processor load from spawning dozens or hundreds of pages can bring a system down quickly. Consider puppeteer-cluster.

Wrapping Up

In this article, we've seen a variety of mistakes and gotchas that every Puppeteer programmer should be aware of.

I hope these tips will save you from making the same mistakes I've made over the years, so you can keep your tests and scripts running smoothly.

P.S. If you liked this post, subscribe to our JavaScript Sorcery list for a monthly deep dive into more magical JavaScript tips and tricks.

P.P.S. If you need an APM for your Node.js app, go and check out the AppSignal APM for Node.js.

Top comments (0)