
Scraping HTML with PHP, Node, and Puppeteer

By Alan Barr. Originally published at alanmbarr.com.

Scraping in 2018

Interestingly enough, I receive a decent number of hits on an earlier blog post about web scraping. Not much has changed, except that PhantomJS is no longer the most common tool for the job. Since the Google Chrome team created headless Chrome, Puppeteer and similar tools have come along to provide a better experience. I personally do not use PHP as much as I did in the past, but a lot of people still use it.

Today I started by spinning up an Ubuntu Linux virtual machine in Azure and running the command below to install everything headless Chrome requires.

sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

Then I installed PHP, Composer, and Node.js. For Node.js, I recommend going to the Node.js website and following their install steps.

sudo apt install -y php composer php-mbstring

I found a nice PHP wrapper around Puppeteer called PuPHPeteer and installed it with Composer and npm:

composer require nesk/puphpeteer
npm install @nesk/puphpeteer

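Then I wrote my script against a website I made with Vue.js that renders list elements from a JSON blob. The script also uses HtmlDomParser to parse the rendered HTML; going by its namespace, that comes from the sunra/php-simple-html-dom-parser Composer package, which needs to be installed the same way.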

<?php
// Load Composer's autoloader for PuPHPeteer and the HTML parser.
require 'vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;
use Nesk\Rialto\Data\JsFunction;
use Sunra\PhpSimple\HtmlDomParser;

// Launch headless Chrome and open a new tab.
$puppeteer = new Puppeteer;
$browser = $puppeteer->launch();

$page = $browser->newPage();
$page->goto('https://alanmbarr.github.io/HackMidWestTimeline/');

// Grab the HTML of the page as rendered in the browser, then close the browser.
$data = $page->evaluate(JsFunction::createWithBody('return document.documentElement.outerHTML'));
$browser->close();

// Parse the HTML and print the text of every span.
$dom = HtmlDomParser::str_get_html($data);
foreach ($dom->find('span') as $element) {
    echo $element->plaintext . "\n";
}

// Free the parser's memory.
$dom->clear();

Personally, I would rather do most of this in Node.js, but if you're used to PHP and not JavaScript, this should be a workable solution.


Discussion


Great article, found it really helpful :)
I was wondering, though: how could I use what you've shown here with a website where the data is loaded through infinite scrolling? Do you have any ideas?

 

The easier thing to do would be to check first whether you can use their API, if it is exposed in some way, even if you have to log in with authentication, and then use the API to page through the data.

If for some reason you cannot, for example because the data is rendered from multiple APIs, you would potentially need to make more JavaScript calls on the page, basically simulating scrolling as a user would. Maybe there is some element at the bottom that triggers the next page load; you would need to scroll to that spot and force the next load. Repeat.
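Here is a minimal sketch of that scroll-and-repeat loop in PHP, reusing the same PuPHPeteer calls as the script in the post. The helper names (scrollUntilStable, loadAllItems), the item selector, the one-second pause, and the cap of 20 rounds are all assumptions for illustration, not part of PuPHPeteer. It scrolls to the bottom of the page rather than to a specific trigger element, and it stops once the number of matching items no longer grows.

```php
<?php
use Nesk\Rialto\Data\JsFunction;

/**
 * Call $scrollOnce repeatedly until $countItems stops growing
 * or $maxRounds is reached. Returns the highest item count observed.
 * (Hypothetical helper; the pause is a crude stand-in for a real wait.)
 */
function scrollUntilStable(callable $scrollOnce, callable $countItems, int $maxRounds = 20, int $pauseMs = 1000): int
{
    $previous = $countItems();
    for ($round = 0; $round < $maxRounds; $round++) {
        $scrollOnce();
        usleep($pauseMs * 1000); // give the next batch time to load
        $current = $countItems();
        if ($current <= $previous) {
            break; // nothing new appeared, assume we reached the end
        }
        $previous = $current;
    }
    return $previous;
}

/**
 * Wire the loop to a PuPHPeteer page (hypothetical helper).
 * $itemSelector is a CSS selector matching one loaded item.
 */
function loadAllItems($page, string $itemSelector): int
{
    // Scroll the window to the bottom, as a user would.
    $scrollOnce = function () use ($page) {
        $page->evaluate(JsFunction::createWithBody(
            'window.scrollTo(0, document.body.scrollHeight);'
        ));
    };
    // Count how many items are currently in the DOM.
    $countItems = function () use ($page, $itemSelector) {
        return (int) $page->evaluate(JsFunction::createWithBody(
            'return document.querySelectorAll(' . json_encode($itemSelector) . ').length;'
        ));
    };
    return scrollUntilStable($scrollOnce, $countItems);
}
```

You would call loadAllItems($page, 'li') (or whatever selector fits the site) after $page->goto(...) and before grabbing document.documentElement.outerHTML, so that the parsed HTML contains every batch that was loaded. A fixed pause is fragile on slow connections, so tune it or replace it with a proper wait for your site.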

 

Thanks for the quick response!
Yeah, that could work - I found which element triggers the loading but how can I possibly simulate scrolling?

Let's take this conversation off here; feel free to reach out to me.