Hey there! Welcome to my extraordinary, completely outside-the-box article about web scraping with a cool twist, using the greatest PHP framework 🐘, Laravel! In this guide, we're not just scraping the net for any random information—no sir! Today, you and I are embarking on a journey like no other. We are catching Pokémon by hurling scrape poké balls straight at Bulbapedia! Yep, you read that right. Through this piece of educational yet comedic writing, you're about to reconnect with the '90s kid hidden within you and start to Catch 'Em All. Get ready to dive into the nostalgic world of Pokémon, armed with the modern power of Laravel and some seriously savvy scraping strategies. Let's set off on this wild adventure together!
Table of Contents for the impatient reader
- Introducing spatie/crawler
- Extracting Pokémon Generations from Bulbapedia
- Wrapping up and Future Challenges
For this project, you simply need to start with a new, empty Laravel project. You can follow the official documentation to set this up. I won’t cover the setup process here, as it’s thoroughly detailed in the Laravel 11 Docs. Once you have your Laravel environment ready, you'll be all set to follow along with this guide.
Introducing spatie/crawler
The spatie/crawler package is a powerful tool developed by Spatie, a web development agency known for creating high-quality, open-source packages for the Laravel community. This crawler is designed to simplify the process of building web scrapers and bots in PHP, particularly within the Laravel framework. It provides a flexible and easy-to-use interface to crawl websites and extract the data you need efficiently. If you love spatie/crawler, go give them a star on GitHub; they are the reason this article exists.
Installation
This package can be easily installed via Composer:
composer require spatie/crawler
Now your composer.json should contain the line "spatie/crawler": "^8.2".
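For reference, the relevant part of composer.json should now look roughly like this (the framework entry is a placeholder for whatever your project already requires):

```json
{
    "require": {
        "laravel/framework": "^11.0",
        "spatie/crawler": "^8.2"
    }
}
```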
Scraper setup
The package documentation proposes that the crawler should be instantiated like so:
use Spatie\Crawler\Crawler;
Crawler::create()
->setCrawlObserver(new OurScraperObserver())
->startCrawling($url);
- new OurScraperObserver(): the argument passed to setCrawlObserver() must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class.
Let's create our first scraper using the appropriate PHP artisan command.
php artisan make:observer Pokemon/PokemonGenerationScraperObserver
The observer we just created should look like this:
<?php
namespace App\Observers\Pokemon;
use GuzzleHttp\Exception\RequestException;
use Illuminate\Support\Facades\Log;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;
class PokemonGenerationScraperObserver extends CrawlObserver
{
private $content;
public function __construct()
{
$this->content = null;
}
/*
* Called when the crawler will crawl the url.
*/
public function willCrawl(UriInterface $url, ?string $linkText): void
{
Log::info('willCrawl', ['url' => $url]);
}
/*
* Called when the crawler has crawled the given url successfully.
*/
public function crawled(
UriInterface $url,
ResponseInterface $response,
?UriInterface $foundOnUrl = null,
?string $linkText = null,
): void {
Log::info("Crawled: {$url}");
}
/*
* Called when the crawler had a problem crawling the given url.
*/
public function crawlFailed(
UriInterface $url,
RequestException $requestException,
?UriInterface $foundOnUrl = null,
?string $linkText = null,
): void {
Log::error("Failed: {$url}");
}
/*
* Called when the crawl has ended.
*/
public function finishedCrawling(): void
{
Log::info("Finished crawling");
}
}
Calling the scraper and displaying results
Now that everything is set up, including the observer class required by the package, it's time to "call" the scraper and see it in action! For this, we need to create an invokable class whose sole purpose is to initiate the scraper by triggering the observer. Additionally, we'll establish a route that links to this class with a URL, enabling easy access and execution of the scraping process. Let’s proceed to set this up and display the results.
php artisan make:controller Pokemon/PokemonGenerationScraperController --invokable
See how I named the controller almost the same as the Observer?
💡 By using identical names for closely related classes, you minimize the mental effort required to understand or recall how different parts of the application interact. This streamlined naming convention effectively simplifies navigation through the code, thereby enhancing maintainability and reducing the chance of errors during development.
This is how our new controller should look:
<?php
namespace App\Http\Controllers\Pokemon;
use App\Http\Controllers\Controller;
use Illuminate\Http\Request;
class PokemonGenerationScraperController extends Controller
{
public function __invoke(Request $request)
{
dd("I am ready to catch them all!");
}
}
To link the controller with a URL, open routes/web.php and add the following lines:

use App\Http\Controllers\Pokemon\PokemonGenerationScraperController;

Route::get('/pokemon/generation', PokemonGenerationScraperController::class)
    ->name('pokemon.generation');
If we now visit ourLaravelScraper.test/pokemon/generation, we'll see the message from the controller.
❗️ This approach could certainly be more dynamic, but since this is just an introduction, I'm aiming to keep things as relaxed and straightforward as possible. The goal here is not to create an overly complex 'monster' of an article. It's already quite comprehensive as it stands.
To bring everything together and dive into the most exciting part of our article, let's now activate the observer class we created earlier. This is done by invoking the observer within our controller, which will initiate the scraping process. Here’s how we set up the PokemonGenerationScraperController to handle the scraping:
use App\Observers\Pokemon\PokemonGenerationScraperObserver;
use Spatie\Crawler\Crawler;

class PokemonGenerationScraperController extends Controller
{
    public function __invoke(Request $request)
    {
        $url = "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number";

        Crawler::create()
            ->setCrawlObserver(new PokemonGenerationScraperObserver())
            ->setMaximumDepth(0)
            ->setTotalCrawlLimit(1)
            ->startCrawling($url);
    }
}
You might have noticed that we used some extra options here to make the scrape work the way we want. Let me break them down a bit.
- setCrawlObserver(): registers our observer so its hooks fire during the crawl.
- setMaximumDepth(0): by default, the crawler continues until it has crawled every page reachable from the supplied URL. We only want to scrape the single page we give it, therefore 0.
- setTotalCrawlLimit(1): defines the maximum number of URLs to crawl.
- startCrawling($url): "plays Crawling by Linkin Park." Kidding, it runs our entire scraping/crawling process.
Now, if we visit the link that triggers our scraper again, we should see a blank white page, but our laravel.log should now contain the following result:
[2024-04-19 18:28:46] local.INFO: willCrawl {"url":{"GuzzleHttp\\Psr7\\Uri":"https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"}}
[2024-04-19 18:28:46] local.INFO: Crawled: https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number
[2024-04-19 18:28:47] local.INFO: Finished crawling
This means that the crawler worked perfectly and logged everything in the right order! We are ready to start digging for poké data!
Extracting Pokémon Generations from Bulbapedia
Within the PokemonGenerationScraperObserver, the crawled method, which accepts a ResponseInterface $response parameter, is particularly critical. This method receives the full response from the website we are targeting for scraping. It is here that we should process the data, extracting and returning the specific results we need.
The tricky part here is that every website is a unique snowflake in terms of its structure, so we need to channel our inner Sherlock Holmes to closely examine the HTML code we're scraping. It's essential to understand its intricate layout before we decide on the best strategy to unearth the data treasures hidden within.
💡 Engaging in web scraping isn't just about collecting data—it's also an excellent exercise in honing your coding skills. Regularly scraping websites trains your eye to look at code more critically and focused, enhancing your ability to quickly identify relevant patterns and structures. It's like a workout for your developer brain, making you sharper and more adept at navigating complex code environments.
Let's take a look at the part of the code that interests us.
<h3><span class="mw-headline" id="Generation_I"><a href="/wiki/Generation_I" title="Generation I">Generation I</a></span></h3>
<table class="roundy" style="margin:auto; border: 2px solid #E72838; background: #E72838">
<tbody>
<tr>
<th style="background: #71C671; border-top-left-radius: 5px; -moz-border-radius-topleft: 5px; -webkit-border-top-left-radius: 5px; -khtml-border-top-left-radius: 5px; -icab-border-top-left-radius: 5px; -o-border-top-left-radius: 5px;">Ndex
</th>
<th style="background: #71C671">MS
</th>
<th style="background: #71C671">Pokémon
</th>
<th style="background: #71C671; border-top-right-radius: 5px; -moz-border-radius-topright: 5px; -webkit-border-top-right-radius: 5px; -khtml-border-top-right-radius: 5px; -icab-border-top-right-radius: 5px; -o-border-top-right-radius: 5px;" colspan="2">Type
</th>
</tr>
<tr style="background:#FFF">
<td rowspan="1" style="font-family:monospace,monospace">#0001</td>
<td><a href="/wiki/Bulbasaur_(Pok%C3%A9mon)" title="Bulbasaur"><img alt="Bulbasaur" src="https://archives.bulbagarden.net/media/upload/thumb/f/fb/0001Bulbasaur.png/70px-0001Bulbasaur.png" decoding="async" loading="lazy" width="70" height="70" srcset="https://archives.bulbagarden.net/media/upload/thumb/f/fb/0001Bulbasaur.png/105px-0001Bulbasaur.png 1.5x, https://archives.bulbagarden.net/media/upload/thumb/f/fb/0001Bulbasaur.png/140px-0001Bulbasaur.png 2x" /></a></td>
<td><a href="/wiki/Bulbasaur_(Pok%C3%A9mon)" title="Bulbasaur (Pokémon)">Bulbasaur</a><br/><small></small></td>
<td style="background:#3FA129" align="center" colspan="1" rowspan="1"><a href="/wiki/Grass_(type)" title="Grass (type)"><span style="color:#FFFFFF">Grass</span></a></td>
<td style="background:#9141CB" align="center" colspan="1" rowspan="1"><a href="/wiki/Poison_(type)" title="Poison (type)"><span style="color:#FFFFFF">Poison</span></a></td>
</tr>
...
...
</table>
In the structure we're examining, each Pokémon generation begins with an H3 tag bearing its name, followed by a table containing all the data for that generation's Pokémon. From the code snippet provided, you can spot the first Pokémon of the first generation—none other than the mighty Bulbasaur, arguably the best strategic starter in the games. 🤓
First Step: Fetch the entire generation's data
That structure should give us an idea. We can aim to select the H3 tag that specifically contains the text "Generation I," and then focus on extracting the first table that follows this heading. This approach ensures we're accurately pinpointing and retrieving data specifically related to Generation I Pokémon. How would we do that?
public function crawled(
    UriInterface $url,
    ResponseInterface $response,
    ?UriInterface $foundOnUrl = null,
    ?string $linkText = null,
): void {
    Log::info("Crawled: {$url}");

    // Note: this Crawler is Symfony's DomCrawler, which ships with spatie/crawler.
    // Add `use Symfony\Component\DomCrawler\Crawler;` at the top of the observer.
    $crawler = new Crawler((string) $response->getBody());

    $tableHtml = $crawler->filter('h3')->reduce(function (Crawler $node) {
        return str_contains($node->text(), 'Generation I');
    })->nextAll()->filter('table')->first()->html();

    echo $tableHtml;
}
What did we do here exactly?
Creating a Crawler Instance
- $crawler = new Crawler((string) $response->getBody());: initializes a new instance of Symfony\Component\DomCrawler\Crawler with the HTML content of the crawled page. This Crawler object allows you to navigate and search through the HTML structure of the page.

Extracting Specific Data
- $tableHtml = $crawler->filter('h3')...: this chain of methods is used to find and extract specific data from the page:
  - .filter('h3'): filters the HTML elements to only include H3 tags.
  - .reduce(function (Crawler $node) { ... }): further filters these H3 tags to only keep those that contain the text 'Generation I'.
  - ->nextAll()->filter('table')->first(): selects all sibling elements following the filtered H3, narrows them down to 'table' tags, and picks the first table.
  - ->html(): retrieves the HTML content of this first table.
Second Step: Create a Collection with the data of each Pokémon
Our next step is to organize the fetched data into a structured collection. This process involves parsing the raw data to extract individual Pokémon details and then systematically grouping these details into a manageable format. By creating a collection, we facilitate easier access and manipulation of the data, which is essential for any further analysis, display, or integration into applications.
$genTableCrawler = $crawler->filter('h3')->reduce(function (Crawler $node) {
return str_contains($node->text(), 'Generation I');
})->nextAll()->filter('table')->first();
$pokemonData = collect($genTableCrawler->filter('tr')->each(function (Crawler $tr, $i) {
if (!$tr->filter('th')->count()) {
return (object) [
'name' => $tr->filter('td')->eq(2)->text(),
'image' => $tr->filter('td img')->attr('src')
];
}
return null;
}))->filter()->values();
We have made some adjustments here. Firstly, we removed the ->html() method because we need to keep working with the crawler object to apply additional functions, which will help us build our collection of Pokémon. Additionally, we renamed the variable from $tableHtml to $genTableCrawler to more accurately reflect its content as a crawler object, not just HTML text.
- $genTableCrawler->filter('tr')->each: loops through every tr element found.
- if (!$tr->filter('th')->count()): skips the header rows (the ones containing <th> elements).
- Then we return an object with the current Pokémon's name and image.
We can now use dd($pokemonData->first()) to inspect the initial item in our Generation I Pokémon collection. Additionally, we can directly check the name or image properties by using dd($pokemonData->first()->name) or dd($pokemonData->first()->image), ensuring that we have successfully retrieved and structured the Pokémon data.
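If you want to experiment with this row-filtering idea outside of Laravel, here is a minimal standalone sketch using only PHP's built-in DOM extension. The tiny inline table and its values are made up for illustration; it is not Bulbapedia's real markup, just enough structure to show the "skip header rows, collect name and image" logic.

```php
<?php

// A toy table mimicking the shape of the Bulbapedia generation tables:
// one header row with <th> cells, then one data row per Pokémon.
$html = <<<HTML
<table>
  <tr><th>Ndex</th><th>MS</th><th>Pokemon</th><th>Type</th></tr>
  <tr>
    <td>#0001</td>
    <td><img src="bulbasaur.png"></td>
    <td><a href="/wiki/Bulbasaur_(Pok%C3%A9mon)">Bulbasaur</a></td>
    <td>Grass</td>
  </tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$pokemon = [];
// Select only <tr> elements that contain no <th> children, i.e. data rows.
foreach ($xpath->query('//tr[not(th)]') as $tr) {
    $tds = $tr->getElementsByTagName('td');
    $img = $tr->getElementsByTagName('img')->item(0);
    $pokemon[] = [
        'name'  => trim($tds->item(2)->textContent), // third cell holds the name
        'image' => $img?->getAttribute('src'),       // sprite URL, if present
    ];
}

print_r($pokemon);
```

The same "reject rows that contain header cells" trick is what our !$tr->filter('th')->count() check does in the observer, just expressed with XPath instead of Symfony's fluent filters.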
Realmer! You have officially caught all the Gen 1 Pokémon! Congrats! And as Prof. Oak would say: "You've finally done it! You've finally completed the National Pokédex! This is better than meeting any exotic Pokémon for the first time! I feel blessed to have become friends with a Trainer like you! Sincerely, I thank you from the bottom of my heart!"
Wrapping up and Future Challenges
As we wrap up this detailed exploration into Laravel Pokémon scraping, I recognize that our discussion has been extensive, but I believe the insights and techniques shared here are invaluable. They serve not only to guide you through your own scraping projects but also to enhance your understanding of Laravel's capabilities.
While there is always more to learn and explore, I'll conclude this article by suggesting a few improvements and coding challenges. These are designed to not only test your newfound skills but also to further refine them and spark your creativity. Whether it's optimizing performance, expanding the data extracted, or integrating more complex data handling features, these challenges will help you advance your development expertise.
- Try including in the poké collection the types of each Pokemon. It is a bit tricky and it will hone your scraping skills.
- Try including in the poké collection the link directing to each Pokemon's details page in Bulbapedia.
- Try extracting the scraping logic to functions in a different class and fetch the Pokemon data like so:
$pokemonData = (new PokemonScraperHelper())->fromGeneration('Generation I')->fetchAll();
This way you will hone your Laravel SOLID skills. (Note: this is not the correct answer this is just a way of thinking example, to lead you to the correct path. You can also make your function accept a GenerationEnum... just something more for you to explore).
- Create a new feature fetching only one Pokemon, or by some searching functionality.
- Save every Pokemon to the database by creating a Pokemon model.
- Make the URI dynamic accepting generations like:
ourLaravelScraper.test/pokemon/{generation}
- Go crazy with it.
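To give you a head start on the helper-class challenge, here is one possible shape for it. The class name, namespace, and method bodies below are purely hypothetical, my own sketch of a direction rather than the answer:

```php
<?php

namespace App\Helpers;

use Illuminate\Support\Collection;

// Hypothetical helper class: the name, namespace, and API shape are just a
// suggestion for the challenge above, not something this article's code defines.
class PokemonScraperHelper
{
    private string $generation;

    // Fluent setter so calls read like a sentence: fromGeneration(...)->fetchAll()
    public function fromGeneration(string $generation): self
    {
        $this->generation = $generation;

        return $this;
    }

    public function fetchAll(): Collection
    {
        // 1. Run spatie/crawler against the Bulbapedia URL (or reuse a cached
        //    response body so you don't hammer the site while developing).
        // 2. Find the <h3> whose text matches $this->generation and grab the
        //    first <table> that follows it, as we did in crawled().
        // 3. Map the rows into objects (name, image, and, for the other
        //    challenges, types and detail-page links) and return collect(...).
        throw new \LogicException('Left as an exercise for the reader.');
    }
}
```

The payoff of this shape is that the observer stays a thin logging shell while the parsing logic becomes independently testable, which is where the SOLID practice comes in.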
Grab the Project Repo
Visit the original post on Sudorealm to find the repo. Sneaky, I know 🤪
A note to the rockstar reader
If you've made it this far, you are truly dedicated to mastering the craft of coding. Your commitment to learning and growth is commendable. Every line of code you decipher, every new technique you master, and every challenge you overcome not only improves your skills but also brings your ideas to life. Keep up the great work! 💪