Jordan Hansen

Originally published at javascriptwebscrapingguy.com

Avoid being blocked with puppeteer

Demo code here

One of the main questions I see on forums and Reddit about web scraping is “how do I avoid being blocked?” It’s a problem I have certainly had to address, and the best solution I’ve found is puppeteer together with some of the great tools in puppeteer-extra. It’s also important to mention that any web scraping should be done with care. While I feel that anything public is fine to scrape, you shouldn’t do anything that puts undue burden on the target site. Feel free to take a look at the post I wrote on ethical web scraping.

Officially this is going to be part of the Learn to Web Scrape series but this isn’t targeted towards beginners. While I don’t feel it is very difficult to start using the puppeteer-extra plugins, I’m not going to go into the depth that a complete beginner to programming would need.

To the trials!

We are going to use Zillow as a test target today. I have a simple bit of puppeteer code visiting a random address in Ohio on Zillow. I perform the action five times, waiting 1.5 seconds between each new attempt. Check the code:



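    // Runs inside an async function.
    const puppeteer = require('puppeteer');

    const browser = await puppeteer.launch({ headless: false });

    const url = 'https://www.zillow.com/homes/%0913905--ROYAL-BOULEVARD-cleveland-ohio_rb/33601155_zpid/';

    for (let i = 0; i < 5; i++) {
        const page = await browser.newPage();

        await page.goto(url);

        // Pause 1.5 seconds. The original used page.waitFor(1500), which
        // newer Puppeteer releases no longer provide.
        await new Promise((resolve) => setTimeout(resolve, 1500));

        await page.close();
    }

    await browser.close();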



I was blocked on my third attempt. Zillow let me visit the page twice and then:

[Screenshot: puppeteer being blocked by Zillow]

Ouch. That is some pretty impressive and swift blocking. I tried adding a human-looking user agent.

    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');

Two visits and then blocked again. Good for Zillow. I honestly applaud websites taking measures to slow down behavior they don’t want. The more friction there is, the less likely people are to bother scraping the site.

Stealth mode

It’s time for the great stuff. Berstend has made some really powerful tools that come with something called puppeteer-extra. There is a large list of the tools here, with some cool ones like adblocker, flash, and... stealth.

It’s extremely easy to set up. We import the packages with require since there aren’t TypeScript definition files yet.



const puppeteerExtra = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');



Then we just set up puppeteer from puppeteer-extra.



    puppeteerExtra.use(pluginStealth());
    const browser = await puppeteerExtra.launch({ headless: false });

    // Normal browser from normal puppeteer
    // const browser = await puppeteer.launch({ headless: false });

    const url = 'https://www.zillow.com/homes/%0913905--ROYAL-BOULEVARD-cleveland-ohio_rb/33601155_zpid/';

    for (let i = 0; i < 5; i++) {
        console.log('starting attempt:', i);
        const page = await browser.newPage();

        await page.goto(url);

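        // Pause 1.5 seconds. The original used page.waitFor(1500), which
        // newer Puppeteer releases no longer provide.
        await new Promise((resolve) => setTimeout(resolve, 1500));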

        await page.close();
    }

    await browser.close();



Now, back to Zillow. Out of my five attempts…none were blocked. Let’s try 20.

[Screenshot: avoiding getting blocked with puppeteer on Zillow]

20 attempts. No reCaptchas. That easy. It’s the best package and tool I’ve seen for avoiding blocks while web scraping, with puppeteer or any other package for that matter.
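
If you want to repeat this kind of experiment with different attempt counts, a small counter helps. The following is only a sketch and is not part of the demo code: `countBlockedAttempts` and the `visit` callback are hypothetical names, and `visit` is assumed to resolve to `true` whenever the block page shows up for that attempt.

```javascript
// Sketch: repeat a visit N times and record which attempts were blocked.
// `visit` is a hypothetical callback (not part of puppeteer); it should
// resolve to true when the block page was shown for that attempt.
async function countBlockedAttempts(visit, attempts, pauseMs = 1500) {
    const blockedAt = [];

    for (let i = 0; i < attempts; i++) {
        if (await visit(i)) {
            blockedAt.push(i);
        }

        // Pause between attempts, as in the loops above.
        await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }

    return { attempts, blocked: blockedAt.length, blockedAt };
}
```

With puppeteer, `visit` could open a new page, go to the URL, and report whether `page.$('.error-content-block')` found anything (that selector is the one used in the reCaptcha handling code further down).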

Now, let’s try with 100 attempts. Eventually Zillow catches the stealth plugin and throws a reCaptcha.

So, avoiding reCaptchas entirely isn’t quite possible. Let’s talk about reCaptchas.

reCaptcha land

reCaptchas are tough to deal with but not impossible. Berstend comes to our rescue once again with puppeteer-extra-plugin-recaptcha. The thing about reCaptchas, though, is that they can’t really be beaten with pure automation. At least, I haven’t found a way.

The plugin works by leveraging services that beat reCaptchas. One of these services is 2Captcha (this is an affiliate link, but I use the product myself and really like it: easy to use, very inexpensive, and it works great). You have to pay to use it, and the plugin uses this integration to beat reCaptchas. But it’s not a program doing the solving. It’s actual humans. When I investigated a little more, it turned out that 2Captcha hires people to solve the reCaptchas.

[Screenshot: 2Captcha hiring]
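
Because a person has to do the solving, an answer can’t come back instantly. One plausible shape for that exchange is "submit, then poll until a token exists". The sketch below only illustrates that shape: `waitForSolution`, `submitChallenge`, and `fetchResult` are hypothetical names, not the 2Captcha API and not anything exported by puppeteer-extra-plugin-recaptcha.

```javascript
// Sketch of a submit-then-poll exchange with a human-backed solving service.
// `submitChallenge` and `fetchResult` are hypothetical callbacks, not the
// 2Captcha API or anything exported by puppeteer-extra-plugin-recaptcha.
async function waitForSolution(submitChallenge, fetchResult, options = {}) {
    const { pollMs = 5000, maxPolls = 24 } = options;

    // Step 1: hand the challenge over and keep the job id that comes back.
    const jobId = await submitChallenge();

    // Step 2: a person needs time to solve it, so ask again until a token exists.
    for (let poll = 0; poll < maxPolls; poll++) {
        await new Promise((resolve) => setTimeout(resolve, pollMs));

        const token = await fetchResult(jobId);
        if (token) {
            return token;
        }
    }

    throw new Error(`no solution for job ${jobId} after ${maxPolls} polls`);
}
```

None of this is needed when using the plugin, which handles the exchange with the provider internally.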

So what it does (or at least, what I assume it does) is send the reCaptcha to 2Captcha, where someone solves it right away and sends back the completed token. Here’s the code to set up the plugin:



    // Use the reCaptcha plugin
    const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');

    puppeteerExtra.use(
        RecaptchaPlugin({
            provider: { id: '2captcha', token: process.env.captchaToken },
            visualFeedback: true // colorize reCAPTCHAs (violet = detected, green = solved)
        })
    );




You’ll get your captchaToken from 2Captcha and place it there. In this package I’m using a .env file. I’ve included a .sample.env file to which you can add a token and just rename to .env.
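
A missing token is easy to overlook until the first reCaptcha shows up, so it can be worth failing early. Here is a minimal sketch of such a check. It assumes that something (for example the dotenv package) has already loaded the .env file into `process.env`; `readCaptchaToken` is a hypothetical helper name, not part of the demo code.

```javascript
// Sketch: read the 2Captcha key from the environment and fail early if absent.
// Assumption: .env has already been loaded into process.env, for example
// with require('dotenv').config(). `readCaptchaToken` is a hypothetical name.
function readCaptchaToken(env = process.env) {
    // `captchaToken` is the variable name used in the plugin setup above.
    const token = (env.captchaToken || '').trim();

    if (!token) {
        throw new Error('captchaToken is not set; add it to your .env file');
    }

    return token;
}
```

The returned value could then be passed as the `token` in the plugin’s `provider` option instead of reading `process.env.captchaToken` directly.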



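        // Handle the reCaptcha
        await page.goto(url);

        try {
            // If this element shows up within 750 ms, we are on the block page
            await page.waitForSelector('.error-content-block', { timeout: 750 });

            // Wait 5 seconds before solving. The original used page.waitFor(5000),
            // which newer Puppeteer releases no longer provide.
            await new Promise((resolve) => setTimeout(resolve, 5000));
            // TypeScript cast: the plugin adds solveRecaptchas() to the page at runtime
            await (<any>page).solveRecaptchas();
            await Promise.all([
                page.waitForNavigation(),
                page.click('[type="submit"]')
            ]);
            console.log('we found a recaptcha on attempt:', i);
        }
        catch (e) {
            // Also reached if solving or the submit click fails, not only on timeout
            console.log('no recaptcha found');
        }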
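
One thing to be aware of in the snippet above: the single `catch` logs "no recaptcha found" for any error, including a failed solve. The sketch below is one possible way to separate the two cases. It is an assumption-laden variation, not the demo code: `solveIfBlocked` is a hypothetical helper name, and `page` is assumed to be a puppeteer-extra page with the recaptcha plugin registered.

```javascript
// Sketch: tell "no block page" apart from "block page, but solving failed".
// `page` is expected to behave like a puppeteer-extra page with the recaptcha
// plugin registered; `solveIfBlocked` is a hypothetical helper name.
async function solveIfBlocked(page, options = {}) {
    const { blockSelector = '.error-content-block', settleMs = 5000 } = options;

    try {
        await page.waitForSelector(blockSelector, { timeout: 750 });
    } catch (e) {
        // waitForSelector rejected (typically a timeout): treat as not blocked.
        return false;
    }

    // Same pause as the snippet above before asking the plugin to solve.
    await new Promise((resolve) => setTimeout(resolve, settleMs));

    // Errors from here on (solver failure, missing submit button) propagate
    // to the caller instead of being reported as "no recaptcha found".
    await page.solveRecaptchas();
    await Promise.all([
        page.waitForNavigation(),
        page.click('[type="submit"]')
    ]);

    return true;
}
```

Inside the loop this could be called as `const solved = await solveIfBlocked(page);` right after `page.goto(url)`, logging the attempt number when `solved` is `true`.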



Bam, that’s all. Now when a reCaptcha pops up, the code detects it and then solves it. Easy. I was going to record a gif of it being solved, but after it solved one, my IP must have been flagged as good, because I now hardly ever get prompted to solve reCaptchas. I started another 100-attempt check WITHOUT the stealth plugin and it didn’t prompt for a reCaptcha until attempt number 75; it then solved it and continued on.

[Screenshot: avoid being blocked with puppeteer and handling reCaptchas]

Pretty awesome, right?

Conclusion

The star of the show is puppeteer-extra. Combine it with its stealth and recaptcha plugins plus 2Captcha and you can avoid, or handle, almost any blocking. Happy scraping!

Demo code here

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!

Top comments (1)

Muhammad Nasir Ayub

Hi,
I am using the Puppeteer library in Node.js for runtime PDF file generation. It works fine on my local system, but when I deploy my app on a cPanel-based CentOS server, it throws an error. Any solution would be appreciated.