Greg Gorlen for AppSignal

Originally published at blog.appsignal.com

Puppeteer in Node.js: More Antipatterns to Avoid

Puppeteer is a powerful browser automation library for web scraping and integration testing. However, the asynchronous, real-time API leaves plenty of room for gotchas and antipatterns to arise.

This article is part of a series, starting with Avoiding Puppeteer Antipatterns and Puppeteer in Node.js: Common Mistakes to Avoid. In this post, we'll add another dozen antipatterns to the list. There will be no overlap with previous installments, so you may wish to start with those.

While these antipatterns aren't quite full-fledged mistakes, weeding them out of your scripts (or being judicious when employing them) will increase the reliability of your Puppeteer code.

Let's begin.

Prerequisites

We will assume you are familiar with ES6 JavaScript syntax, promises, the browser DOM, and Node, and have written a few Puppeteer scripts already.

At the time of writing, the version of Puppeteer used was 20.3.0.

Onward to the antipatterns!

Antipatterns to Avoid in Puppeteer for Node.js

Underusing page.goto

I often see scraping scripts automating a search by:

  • Navigating to a website's landing page.
  • Accepting a cookie banner.
  • Typing a search term into an input box.
  • Clicking a button to submit the query.
  • Waiting for the second navigation to complete.

While this may make sense for testing, in scraping contexts these steps can often be bypassed by putting the search term in a query parameter, such as https://www.some-site.com/search?q=search+term, and calling page.goto(searchResultURL) directly. Skipping the intermediate page speeds up the script, requires less code, and typically improves reliability.
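As a minimal sketch of this shortcut, the helpers below build the results URL with the standard URL API and navigate straight to it. The function names, the /search path, and the q parameter are assumptions taken from the example URL above; a real site will likely use different ones.

```javascript
// Hypothetical helper: the "/search" path and "q" parameter mirror the
// example URL above and will differ from site to site.
function buildSearchUrl(origin, term) {
  const url = new URL("/search", origin);
  // searchParams handles the encoding, so spaces and symbols are safe
  url.searchParams.set("q", term);
  return url.href;
}

// Navigate directly to the results, skipping the landing page and the form.
async function gotoSearchResults(page, origin, term) {
  await page.goto(buildSearchUrl(origin, term), {
    waitUntil: "domcontentloaded",
  });
}
```

Building the URL with searchParams rather than string concatenation avoids hand-rolled encoding bugs when the search term contains characters like & or #.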

The same is often true for automation involving iframes. In many cases, the frame source URL can be navigated to directly, bypassing the hassle of working with the parent document. If the frame source isn't known in advance, you can extract it and strip off the outer document with a goto:

const frame = await page.waitForSelector("iframe");
const src = await frame.evaluate((el) => el.src);
await page.goto(src, { waitUntil: "domcontentloaded" });

The code simplification may be worth the cost of the extra (probably cached) load.

Sometimes clicking links causes unpredictable, fussy navigation behavior, such as unwanted popups. In such cases, consider following the pattern in the above snippet, extracting the link's href property and plugging it into page.goto.

Using page.on Vs. page.waitForRequest or page.waitForResponse

page.on("request", handler) and page.on("response", handler) are callback-based event listeners. These are useful for intercepting and processing all requests or responses, but can be awkward to use with asynchronous control flow.

When you're waiting for one or more specific responses to arrive, instead of chaining your dependent code from the callback or promisifying page.on, consider using page.waitForRequest or page.waitForResponse. These handy methods are essentially promisified page.on() handlers.
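Here is a rough sketch of the page.waitForResponse approach, assuming the response is triggered by a click. The helper name, the selector, and the URL fragment are placeholders rather than anything from a real site.

```javascript
// Sketch: click something and wait for a specific response it triggers.
// `selector` and `urlPart` are placeholders for your own page.
async function clickAndWaitForJson(page, selector, urlPart, timeout = 30_000) {
  const [response] = await Promise.all([
    // Start waiting before clicking so a fast response isn't missed
    page.waitForResponse(
      (res) => res.url().includes(urlPart) && res.ok(),
      { timeout }
    ),
    page.click(selector),
  ]);
  return response.json();
}
```

Unlike a bare page.on handler, the wait has a built-in timeout and its result flows through ordinary await control flow.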

For dialogs, promisification is unavoidable, as Puppeteer doesn't offer a page.waitForDialog wrapper at the present time. However, page.once is a handy way to avoid having to remove the listener once you've intercepted the dialog:

const dialogDismissed = new Promise((resolve, reject) => {
  page.once("dialog", async (dialog) => {
    await dialog.dismiss();
    resolve(dialog.message());
  });
});

/* take action to trigger the dialog */

const msg = await dialogDismissed;

The above code has a subtle issue: there's no timeout, so your script can silently hang forever. You can add a timeout as follows:

const timeout = 30_000;
const dialogDismissed = new Promise((resolve, reject) => {
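  // Reject with a descriptive error rather than undefined
  const timeoutId = setTimeout(
    () => reject(new Error(`No dialog appeared within ${timeout} ms`)),
    timeout
  );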
  page.once("dialog", async (dialog) => {
    clearTimeout(timeoutId);
    await dialog.dismiss();
    resolve(dialog.message());
  });
});

/* take action to trigger the dialog */

const msg = await dialogDismissed;

Since requests and responses are liable to occur in any order at any time, page.once is less useful for those events than it is for dialogs, which are usually predictable.

Note that, in general, you won't need to use new Promise much in Puppeteer. Promisifying a promise-based API is known as the explicit promise constructor antipattern and usually appears when programmers aren't accustomed to working with promises.
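To illustrate the explicit promise constructor antipattern, here is a small before-and-after sketch. Both function names and the .results selector are made up for the example.

```javascript
// Antipattern: wrapping an API that already returns a promise.
function waitForResultsWrapped(page) {
  return new Promise((resolve, reject) => {
    page.waitForSelector(".results").then(resolve).catch(reject);
  });
}

// Equivalent and simpler: return (or await) the promise directly.
function waitForResults(page) {
  return page.waitForSelector(".results");
}
```

The two behave the same, but the wrapped version adds noise and makes it easy to forget the reject path, which would swallow timeouts.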

Not Using Specific Wait or Evaluate Methods

As with request and response handlers, many Puppeteer methods have a hierarchy of generality. Here are the evaluate-family calls (roughly), from general to specific:

  1. page.evaluate() can do just about anything any other Puppeteer API call can do in the browser. It's powerful but not specific.
  2. page.$eval() and page.$$eval() are shorthands for common-case page.evaluate() calls that immediately run a document.querySelector() or document.querySelectorAll() as their callback's first step.
  3. page.waitForFunction() is shorthand for a page.evaluate() that registers a MutationObserver or requestAnimationFrame loop that repeatedly checks a condition, then returns when the condition becomes true.
  4. page.waitForSelector() is shorthand for a page.waitForFunction() that blocks until a specific selector matches an element in the DOM.

It's an antipattern to use a general method when a specific one exists that's tailored for the job. For example:

await page.evaluate(() => {
  const elements = document.querySelectorAll(".foo-bar");
  return [...elements].map((el) => el.textContent.trim());
});

Versus:

await page.$$eval(".foo-bar", (elements) => {
  return elements.map((el) => el.textContent.trim());
});
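The same reasoning applies to the wait family. The sketch below, with hypothetical helper names, uses page.waitForSelector for plain element presence and reserves page.waitForFunction for a condition that a selector alone can't express (here, a minimum element count).

```javascript
// Presence of an element: the specific method states the intent.
async function waitForFirstItem(page, selector) {
  return page.waitForSelector(selector);
}

// A condition no selector expresses (here, a minimum count) is where the
// more general waitForFunction earns its place.
async function waitForItemCount(page, selector, count, timeout = 30_000) {
  await page.waitForFunction(
    (sel, n) => document.querySelectorAll(sel).length >= n,
    { timeout },
    selector,
    count
  );
}
```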

Not Reusing Browsers

Launching browsers is a heavy undertaking. It's a good idea to clear browser state after each run to maintain idempotency when testing, and in web applications that use Puppeteer with Express to perform tasks. But, in many cases, a browser can be reused safely, relying on pages to encapsulate tasks.

When business logic and safety can accommodate it, browser (or even page) reuse can provide dramatic efficiency gains.
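One possible shape for this, sketched under the assumption that each task only needs a fresh page: the caller launches a single browser and passes it to a hypothetical collectTitles helper, which opens and closes one page per URL.

```javascript
// Sketch: one browser for the whole batch, one short-lived page per task.
async function collectTitles(browser, urls) {
  const titles = [];
  for (const url of urls) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      titles.push(await page.title());
    } finally {
      // Close the page even if navigation fails; the browser stays open
      await page.close();
    }
  }
  return titles;
}
```

Whether pages from the same browser are isolated enough depends on the task, since they share cookies and storage by default.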

Scraping the DOM Rather than Responses

Many web applications rely on information from JSON data, either embedded inside a <script> element or as XHR response payloads. Instead of figuring out how to extract the data from the DOM, it's often useful to intercept the responses or pull the information out of a <script>. Removing the fickle presentation layer from the process can make your code more reliable.

While raw data is likely more stable than the DOM, in some cases JSON payload structures are subject to change or may be harder to identify and parse than the DOM. Although responses aren't always a useful way to scrape data, it's worth popping open the network tab to try to find the response that has the data in it. Occasionally striking gold makes it worth the effort.
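For the embedded <script> case, a minimal sketch might look like the following. The helper name and the default selector are hypothetical; inspect the page source to find where the site actually keeps its data.

```javascript
// Sketch: read JSON that the site ships inside a <script> element.
// The default selector is a hypothetical placeholder.
async function readEmbeddedJson(
  page,
  selector = 'script[type="application/json"]'
) {
  const raw = await page.$eval(selector, (el) => el.textContent);
  // Parse in Node so a malformed payload throws here with a clear stack
  return JSON.parse(raw);
}
```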

Using XPath Instead of CSS Selectors

Since XPath tends to be more verbose and trickier to write correctly than CSS selectors, CSS selectors should be preferred over XPath when possible.

Puppeteer 19.7.1 introduced a ::-p-text selector, which covers XPath's most common use case in Puppeteer: selecting elements by text.

Using Attribute CSS Syntax for Classes

CSS selectors have special and useful syntax for selecting elements by their attributes. For example:

<label for="username"></label>

You can select this with page.$('[for="username"]'), which is fine, but problems arise when applying this syntax to classes:

<div class="row align-items-center"></div>

Here, there's good reason to prefer .row.align-items-center over [class="row align-items-center"]. The dot syntax is easier to read and write, and is agnostic of ordering and additional attributes. If the class list changes to:

<div class="align-items-center row"></div>

Or:

<div class="row align-items-center p-2"></div>

Then the attribute selector fails. To make the two approaches interchangeable, you could use the ~= attribute operator, which matches a single whitespace-separated word: [class~="row"][class~="align-items-center"]. The verbosity makes the antipattern obvious.

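There are situations suited to attribute selectors on classes, for example selecting elements whose class name has a consistent prefix and a generated suffix: [class^="p-"]. Note that ^= tests the start of the whole attribute value, so this only matches when the prefixed class comes first in the list.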

Adding Premature Abstractions

In scraping and testing scenarios, I often see programmers writing classes and functions before they've correctly written their scraping or testing logic. In many cases, these abstractions make their scripts harder to debug than they would otherwise, introducing promise-management issues, memory leaks, and other sources of confusion.

I typically follow the Grug-Brained Dev's advice:

...one thing grug come to believe: not factor your application too early!

Once the scraper or testing block has been written correctly, then consider breaking the code into logical chunks.

Even if this is done correctly, unnecessary abstractions can still be problematic. In the case of tests, seeing the imperative steps in a longer test can make for easier maintenance than a series of nested helper functions (at the expense of a bit of repetition).

Not Cleaning Up Browser and Page Handles with finally

Even if you take care to call browser.close(), it's easy to forget that an error thrown earlier in the script can skip that call and leave the browser process running.

Add a finally block that calls browser.close() for every browser.launch() call to ensure proper resource cleanup.
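A minimal sketch of the pattern follows. The withBrowser name is hypothetical, and launch is assumed to be any function that returns a browser, such as () => puppeteer.launch().

```javascript
// Sketch: `launch` is any function returning a browser, for example
// () => puppeteer.launch(). Passing it in keeps the helper easy to test.
async function withBrowser(launch, task) {
  let browser;
  try {
    browser = await launch();
    return await task(browser);
  } finally {
    // Runs on success and on error; optional chaining covers a failed launch
    await browser?.close();
  }
}
```

A call might look like withBrowser(() => puppeteer.launch(), async (browser) => { /* scrape */ }), assuming puppeteer is already imported. The same try/finally shape works for individual pages.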

Not Using Built-in Selectors

As mentioned above, Puppeteer offers a ::-p-text selector, along with ::-p-aria and ::-p-xpath. Prefer using these over hand-rolled alternatives.

Here's an example of clicking a button based on its text content:

await page.setContent(`<button>Click me</button>`);
const btn = await page.waitForSelector("button::-p-text(Click)");
await btn.click();

The above code determines a match based on a substring of the text content, case-sensitively.

The >>> and >>>> combinators let you traverse shadow roots, with >>> traversing deep shadow roots and >>>> exploring one root deep.

The usual selector-based methods, such as page.$, page.$$, page.$eval, page.$$eval, and page.waitForSelector, all work with built-in selectors. Additionally, Puppeteer has deprecated page.waitForXPath and page.$x, unifying the selector API.
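As a rough sketch of how these selectors slot into the ordinary query methods, consider the two helpers below. The helper names, the "Submit" accessible name, and the my-widget host element are all placeholders, and names containing parentheses or quotes would need extra care.

```javascript
// Sketch of built-in selectors with the ordinary query methods.
// The button name and the "my-widget" host element are placeholders.
async function clickSubmit(page) {
  // ::-p-aria matches by accessible name
  const button = await page.waitForSelector("::-p-aria(Submit)");
  await button.click();
}

async function readShadowText(page) {
  // >>> descends through shadow roots at any depth below my-widget
  return page.$eval("my-widget >>> p", (el) => el.textContent.trim());
}
```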

Not Using userDataDir

Logins can be tricky to automate. Two-factor auth, input fields with complex asynchronous validation and masks, redirects, and iframes abound.

Instead of the hassle, you can set a userDataDir and log in manually in an idling Chromium browser launched by Puppeteer. As long as the session persists, you can go about your automation task directly from the site's dashboard.

Even in cases when you decide to automate login, persisting the session offers performance improvements.
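A minimal sketch of such a launch, assuming puppeteer is passed in by the caller and that ./user-data is an acceptable place for the profile (both the helper name and the directory are arbitrary choices):

```javascript
// Sketch: `puppeteer` is passed in (e.g. require("puppeteer")) and the
// profile directory name is an arbitrary choice.
async function launchWithProfile(puppeteer, userDataDir = "./user-data") {
  return puppeteer.launch({
    headless: false, // visible window so you can log in by hand the first time
    userDataDir, // cookies and local storage persist here between runs
  });
}
```

On the first run, log in manually in the window that opens. Later runs that point at the same directory should start with the session already in place, for as long as the site keeps it valid.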

Not Using Playwright for User-Facing Testing in Node.js

Microsoft's newer Playwright library offers a different element selection philosophy than Puppeteer. Puppeteer scripts tend to rely on CSS selectors and XPath. In contrast, Playwright's approach prioritizes user-facing attributes such as accessible roles, text, and titles.

Third-party testing packages like expect-puppeteer and pptr-testing-library attempt to bring the user-facing philosophy to Puppeteer. However, Playwright offers this style of testing out of the box. Playwright is opinionated and discourages non-user-facing selection methods as well as its inherited Puppeteer-style API, which it has mostly deprecated.

For most web scraping tasks, however, it's natural to use Puppeteer-style CSS selectors. I haven't seen evidence of any benefits in adhering to user-facing principles when web scraping. Puppeteer's simpler API gets out of the way a bit more. I appreciate using it as a thin, unopinionated wrapper on already-working browser code, adding controlled events and waits when necessary.

Wrapping Up

In this article, we covered a variety of antipatterns that can degrade the quality of Puppeteer automation scripts.

A central theme that helps avoid these antipatterns is to use the most precise tool for the job. Choose focused, high-level API methods to avoid the complexities of more powerful, low-level methods that are less idiomatic and expose unnecessary details.

Additionally, I've advocated for treating Puppeteer's evaluation and wait APIs as simple wrappers on the browser console, leaving user-facing testing principles to Playwright.

Happy coding!

P.S. If you liked this post, subscribe to our JavaScript Sorcery list for a monthly deep dive into more magical JavaScript tips and tricks.

P.P.S. If you need an APM for your Node.js app, go and check out the AppSignal APM for Node.js.
