Scraping HTML with PHP Node and Puppeteer

#softwaredevelopment #howto #php #scrape

Scraping in 2018

Interestingly enough, I still receive a decent number of hits on an earlier blog post about web scraping. Not much has changed, except that PhantomJS is no longer the most common tool for the job. Since the Google Chrome team released headless Chrome, Puppeteer and similar tools have come along and provide a better experience. I personally do not use PHP as much as I did in the past, but a lot of people still use it.

Today I started by spinning up an Ubuntu Linux virtual machine in Azure and running the command below to install everything headless Chrome requires.

sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

Then I installed PHP, Composer, and Node.js. For Node.js, I recommend going to the Node.js website and following their installation steps.

sudo apt install -y php composer php-mbstring

I found a nice wrapper for headless Chrome called PuPHPeteer, a PHP bridge to Puppeteer, and ran composer require nesk/puphpeteer followed by npm install @nesk/puphpeteer.

Then I wrote my script against a website I made with Vue.js that renders list elements from a JSON blob.

<?php
require 'vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;
use Nesk\Rialto\Data\JsFunction;
// HtmlDomParser is not part of PuPHPeteer; it comes from a separate Composer package.
use Sunra\PhpSimple\HtmlDomParser;

// Launch headless Chrome through the Node.js Puppeteer process.
$puppeteer = new Puppeteer;
$browser = $puppeteer->launch();

$page = $browser->newPage();
$page->goto('https://alanmbarr.github.io/HackMidWestTimeline/');

// Grab the HTML after Vue has rendered, then hand it to the PHP DOM parser.
$data = $page->evaluate(JsFunction::createWithBody('return document.documentElement.outerHTML'));
$dom = HtmlDomParser::str_get_html($data);
$browser->close();

// Print the text of every rendered <span>.
foreach ($dom->find('span') as $element) {
    echo $element->plaintext . "\n";
}

// Free the memory held by the parsed DOM.
$dom->clear();
?>

Personally, I would rather do most of this in Node.js, but if you are more comfortable with PHP than JavaScript, this should be a workable solution.

Top comments (6)

sticklight • Oct 26 '18

Great article, found it really helpful :)
I was wondering, though: how could I use what you've shown here with a website where the data is loaded through infinite scrolling? Do you have any ideas?

Alan Barr • Oct 26 '18 • Edited

The easiest thing to do would be to first check whether you can use their API, if it is exposed in some way, even if you have to log in with authentication, and then use the API to page through the data.

If for some reason you cannot, for example because the data is rendered from multiple APIs, you would need to run more JavaScript on the page, basically simulating scrolling as a user would. There may be some element at the bottom that triggers the next page load; you would scroll to that spot to force the next load, then repeat.

sticklight • Oct 27 '18

Thanks for the quick response!
Yeah, that could work - I found which element triggers the loading but how can I possibly simulate scrolling?

Alan Barr • Oct 27 '18

Let's take this conversation offline. Feel free to reach out to me.

Felix Eve • Jan 9 '21

I too am interested in how to simulate scrolling to the bottom of the page. Was there a simple solution?

Alan Barr • Jan 13 '21

I don't have a great simple solution for this. Something I've done is find an element on the page that triggers pagination when scrolled to, and scroll to it with a JavaScript scrollTo call. I haven't spent a ton of time on this in PHP; there might be better options today, or a different wrapper for Puppeteer might work better.
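
The approach described in this thread, scrolling to the element that triggers the next load and repeating, might be sketched in PHP roughly as follows. This is only a sketch under assumptions: scrollUntilStable, the $runJs callable, and the .load-more selector are hypothetical names rather than part of PuPHPeteer, and the loop assumes the page grows taller each time more items load, which will not hold for sites that recycle their list elements.

```php
<?php
/**
 * Hypothetical helper: keep scrolling until the page height stops growing.
 *
 * $runJs is any callable that runs a JavaScript function body in the page
 * and returns its result. With the $page object from the article's script,
 * it could be wired up like this (an assumption, not verified on every version):
 *
 *   $runJs = function (string $body) use ($page) {
 *       return $page->evaluate(JsFunction::createWithBody($body));
 *   };
 */
function scrollUntilStable(callable $runJs, string $triggerSelector, int $maxRounds = 20, int $pauseMs = 1000): int
{
    // json_encode turns the selector into a quoted, escaped JavaScript string literal.
    $selector = json_encode($triggerSelector);

    // Scroll the trigger element into view if it exists, otherwise jump to the bottom.
    $scroll = "var el = document.querySelector($selector);"
        . " if (el) { el.scrollIntoView(); }"
        . " else { window.scrollTo(0, document.body.scrollHeight); }";
    $measure = 'return document.body.scrollHeight;';

    $height = (int) $runJs($measure);
    for ($round = 0; $round < $maxRounds; $round++) {
        $runJs($scroll);
        usleep($pauseMs * 1000); // crude fixed wait for the next batch to load
        $newHeight = (int) $runJs($measure);
        if ($newHeight === $height) {
            break; // nothing new appeared, so assume the end of the list
        }
        $height = $newHeight;
    }
    return $height;
}
```

A call such as scrollUntilStable($runJs, '.load-more') would go before the outerHTML is read, so that the parsed DOM includes the extra items. The fixed pause is the simplest option; slow pages may need a larger $pauseMs, and $maxRounds keeps a truly endless feed from looping forever.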