Web Scraping in PHP using Goutte
Today I'll be talking about something very common: web scraping. Depending on your needs or a client's needs, situations may arise where you need to extract data from a webpage.
What is Web Scraping?
According to WebHarvy, Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting, etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script. Alright, let's move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles on web scraping in PHP and found that there were few, and most of them were pretty outdated.
However, in my research, I came across Goutte, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony's components (God bless Fabien): BrowserKit, CssSelector and DomCrawler. It is important for us to understand what each of these components does, as it helps us appreciate just how powerful Goutte is.
BrowserKit
Simply put, the BrowserKit component simulates the behaviour of a real browser. It is the foundational element of Goutte.
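To see what that means in practice, here is a minimal sketch of fetching a page with Goutte's Client, which builds on BrowserKit (the URL is just a placeholder):

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

// The Client simulates a browser: it sends the request,
// follows redirects and keeps cookies between requests.
$client = new Client();

// request() fetches the page and returns a DomCrawler
// instance wrapping the response.
$crawler = $client->request('GET', 'https://example.com');

echo $crawler->filter('title')->text();
```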
DomCrawler
The DomCrawler component eases the navigation of the DOM (Document Object Model). The DomCrawler allows us to navigate the DOM like this:
$crawler = $crawler->filter('body > p');
We can also traverse through nodes on the DOM using some of the methods that it provides. For example, if we want to get the first paragraph in the body of the page, we could do this:
$crawler->filter('body > p')->eq(0);
The eq() method is zero-indexed and takes a number specifying the position of the element we want to access.
There are other methods such as siblings(), first() [an alias of eq(0); under the hood it just calls eq(0)], last(), etc., some of which are shown in the sketch below.
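Here is a short sketch of those traversal methods in action, assuming $crawler already wraps a fetched page with a few paragraphs in its body:

```php
<?php
// first() returns the first node (it calls eq(0) under the hood)
$firstParagraph = $crawler->filter('body > p')->first()->text();

// last() returns the last node in the current selection
$lastParagraph = $crawler->filter('body > p')->last()->text();

// siblings() returns the other nodes at the same DOM level
$siblings = $crawler->filter('body > p')->first()->siblings();

// each() runs a callback on every node and collects the results
$allParagraphs = $crawler->filter('body > p')->each(function ($node) {
    return $node->text();
});
```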
CssSelector
The CssSelector is a wonderful component that allows us to select elements via their CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. So, for example, say we wanted to select an element with a class called "fire", we could do this:
$crawler->filter('.fire');
The CssSelector component is so amazing that it even supports selectors such as:
$crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');
The above means that we are looking for a div element whose inline style attribute contains "max-height:175px; overflow: hidden;".
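If you are curious about the conversion that happens behind the scenes, the component exposes it directly through CssSelectorConverter; this little sketch (my own illustration, not part of Goutte itself) prints the XPath that a CSS selector is turned into:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// Prints the XPath equivalent of the CSS selector, roughly:
// descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' fire ')]
echo $converter->toXPath('.fire');
```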
For more information, please do well to read the docs of DomCrawler, CssSelector and Goutte.
Alright, now that we have a bit of an idea about the three major components, it is time for us to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no laid-down way to do it. You are free to explore and try out as many ways as you like to get your data. The only limit you have is your creativity. There are times when I have had to combine the CssSelector and DomCrawler in order to get what I want [actually, a lot of times], as in the sketch below.
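As an illustration of that kind of combination, here is a hedged end-to-end sketch: it fetches a page with Goutte, scopes the crawl with a CSS selector, and then uses each() to pull the text and href out of every link. The URL and selector are placeholders; adapt them to whatever you are scraping:

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');

// CssSelector narrows the DOM down to the links we care about,
// then DomCrawler's each() extracts data from every match.
$links = $crawler->filter('body a')->each(function ($node) {
    return [
        'text' => trim($node->text()),
        'href' => $node->attr('href'),
    ];
});

print_r($links);
```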
In the next post, we are going to put everything that we have learnt so far into play by scraping the website of the Punch.
Top comments (7)
I also had to use Goutte for web scraping lately; it is doing the job just fine.
Though the documentation needs to be more elaborate.
I agree with you. The documentation isn't very detailed and could be better. I had to test and browse multiple sources to discover some of its features.
It's great to find another developer who is using Goutte!
I managed to do my task with Goutte so far, but in case I face any difficulties, I'd like to ask for your help, if that's okay with you.
Happy New Year!
Definitely, it is okay.
Happy New Year to you too.
Thank you!
Very useful, thanks.
Your post looks good and nicely explains Goutte. But I have personally used "PHP Simple HTML DOM Parser" and the traditional file_get_contents with regex. I have also explained how you can create scraping scripts. You can check the link below.
postnidea.com/php-data-scraping-te...
That would be much easier, but it's limited to just extracting data and saving it in a format. No automation, etc. The majority of the time, my scraping needs require scripting.