DEV Community

Sayo Paul

Web Scraping in PHP using Goutte - part 2

In the last article, we were introduced to web scraping and looked at Goutte, a wonderful PHP web scraping library. In this article, we will put that knowledge into practice by scraping the website of the Punch. To be more specific, we will scrape the Punch's news page (https://punchng.com/topics/news) to get the latest news headlines 😎.

Let's get right into it 💪!

NB: This is for testing purposes only. I do not in any way intend to reproduce the material obtained from the Punch, and I do not advise you to do so, as that would be copyright infringement.

First things first, we set up Composer autoloading, import the Goutte Client class and instantiate a new Client:

    require "vendor/autoload.php";
    use Goutte\Client;
    $client = new Client();

The next step is to send a request via the $client object. Its request() method returns a Crawler instance, and it is this instance that we apply our filters to.

    $crawler = $client->request('GET', "https://punchng.com/topics/news");

The Punch news page is made up of article boxes. Each article has its own box and a heading (the headline) with the class "seg-title". We want to select all the headlines (.seg-title) on the page and then handle them one by one. We do it with this:

     $crawler->filter('.seg-title')->each(function ($node){


     });

Notice the each() method? It lets us iterate over the current selection (the node list) when it contains more than one node. As mentioned above, we are selecting every headline (.seg-title), so we have more than one node and want to iterate through them. Under the hood, each() accepts a closure (an anonymous function), loops through the current node list, and calls the closure once per iteration with the current node wrapped in its own Crawler, which is how we get access to the current node ($node) inside the closure. This is how each() is implemented in Symfony's DomCrawler component, which Goutte builds on:

     public function each(\Closure $closure)
     {
         $data = array();
         foreach ($this->nodes as $i => $node) {
             $data[] = $closure($this->createSubCrawler($node), $i);
         }

         return $data;
     }
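To make the pattern easier to see, here is a minimal sketch of the same idea in plain PHP, with no Goutte involved. The TinyNodeList class and its sample strings are hypothetical and exist only for illustration; they are not part of Goutte or Symfony. Note that each() passes the index as a second argument and returns an array of whatever the closure returns.

```php
<?php
// Hypothetical stand-in for a crawler's node list; not part of Goutte or Symfony.
final class TinyNodeList
{
    /** @var array */
    private $nodes;

    public function __construct(array $nodes)
    {
        $this->nodes = $nodes;
    }

    // Calls $closure once per node, passing the node and its index,
    // and collects whatever the closure returns.
    public function each(\Closure $closure): array
    {
        $data = [];
        foreach ($this->nodes as $i => $node) {
            $data[] = $closure($node, $i);
        }

        return $data;
    }
}

// Sample data standing in for real DOM nodes (assumption).
$list = new TinyNodeList(['first headline', 'second headline']);

// The closure receives the current node and its position in the list.
$result = $list->each(function (string $node, int $i): string {
    return ($i + 1) . '. ' . ucfirst($node);
});

print_r($result);
```

In the real Crawler, the closure receives a sub-crawler for the node rather than a plain string, but the control flow is the same.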

Alright, the next thing we want to do is extract the text from the current node .

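     $crawler->filter('.seg-title')->each(function ($node){
         // text() returns the text content of the current headline node
         $headline = $node->text();
         // print each headline on its own line
         echo $headline . PHP_EOL;
     });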

We get the text content of the node by calling its text() method, then print the headline, and there we have it! Whenever we run this script, the latest 10 news headlines on the Punch are printed out. As I said in the previous article, when it comes to scraping, almost anything is possible (even logging in and filling in forms). The limit is your mind 😊. I honestly wish we could go deeper, but sadly that's all for now 😅.
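Since each() returns an array of the closure's return values, the headlines can also be collected rather than echoed. The following is a rough sketch, assuming the same Composer setup as above (Goutte pulls in Symfony's DomCrawler and CssSelector components). The inline HTML is made-up sample markup standing in for the Punch page, and the h3 tags are an assumption; only the .seg-title class comes from the real page.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// Made-up sample markup standing in for the real page (assumption).
$html = <<<HTML
<html><body>
  <h3 class="seg-title"> First headline </h3>
  <h3 class="seg-title">Second headline</h3>
</body></html>
HTML;

// A Crawler can be built directly from an HTML string, which is handy for
// trying out selectors without sending a request.
$crawler = new Crawler($html);

// Return the text from the closure; each() gathers the values into an array.
$headlines = $crawler->filter('.seg-title')->each(function (Crawler $node) {
    return trim($node->text());
});

print_r($headlines);
```

With the crawler returned by $client->request(), the filter()->each() call would be the same; having the headlines in an array makes it easier to store or post-process them later.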

For more information, please read the docs of DomCrawler, CssSelector and Goutte.

Do you have any web scraping needs? You can hire me to help you out here

Comments (3)

Marco Hernandez

What about scraping single-page apps like Angular or React apps? Does Goutte support this? Is this even possible using PHP? Is there anything that can do this? I've been looking for info on scraping client-side rendered pages, but there is little information.

Sayo Paul

Yes, it is in fact possible with PHP. The tools used for this are called headless browsers. Headless browsers act as regular browsers (running JavaScript, etc.), so JavaScript-rendered pages can be scraped with them. We combine Goutte's crawler with the response from a headless browser such as Selenium or PhantomJS, and we are able to use all of Goutte's crawling functions. This is personally what I use for scraping those types of sites.

Peter Rauscher

At scale, you're almost always better off avoiding headless browsers. Try using plain HTTP requests and parsing the HTML; the data in SPAs is usually loaded from a JSON object in a tag somewhere. I wrote this extension that extracts the data for you: https://chromewebstore.google.com/detail/kjlhnflincmlpkgahnidgebbngieobod