
Vasco Abelha

Originally published at vascoabelha.com

How to still use Crawlers in Client-Side Websites

This was originally published on my blog; you can find the original publication at vascoabelha.com.

If you want to discuss anything, feel free to reach out to me on Twitter.

Introduction

In this post, I will describe a solution I built for an existing React client-side platform whose users wanted to be able to share specific content on their social feeds.

This publication is useful for developers who:

  • already have a client-side website built (it doesn't need to be solely React);
  • want to understand how to interact with different crawlers.

Technologies used:

  • VPS, where the project was hosted;
  • Nginx;
  • ExpressJS (it doesn't matter what you are using);
  • ReactJS;
  • Facebook SDK - OpenGraph.

Contextualization

Whenever you share a link to a website on Facebook, Twitter, or any other social platform, they spawn a crawler that scrapes your website looking for meta tags that help them understand what they are looking at and how they can share it: app, card, summary, large card, etcetera.

One of the biggest problems of a React client-side website is that everything is rendered through JavaScript. If you use a browser or a crawler that doesn't process JS, you will just be presented with a blank page: "You need to enable JavaScript to run this app." This applies to the Facebook and Twitter crawlers.

Example of a blank page

In the end, if you share a URL from your website on one of these social platforms, you won't get any type of card or information from your website.

Note: You can use https://cards-dev.twitter.com/validator to verify and test this yourself.

Twitter Card Validator

On the left, we have a React client-side website. On the right, we have a static website.

Both websites use React Helmet (which allows modifications to the document head), yet the left side still shows no meta tags fetched by the crawlers, because it requires JavaScript to render.
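For context, this is roughly what the React Helmet usage looks like on the client side; a minimal sketch with a hypothetical NewsPage component and props, not the platform's actual code:

//NewsPage.jsx - meta tags injected client-side with React Helmet (hypothetical component)
import React from "react";
import { Helmet } from "react-helmet";

const NewsPage = ({ news }) => (
  <>
    <Helmet>
      {/* These tags only exist after JavaScript runs, so non-JS crawlers never see them */}
      <title>{news.title}</title>
      <meta property="og:title" content={news.title} />
      <meta property="og:description" content={news.description} />
      <meta name="twitter:card" content="summary" />
    </Helmet>
    <article>{news.description}</article>
  </>
);

export default NewsPage;

The tags themselves are correct, but they are added to the DOM at runtime, which is exactly why the crawler on the left never sees them.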

Show what the Crawlers want to see

If we are hosting the website on a typical Virtual Private Server, then there is a good chance that we are using a web server like Apache, Nginx, or lighttpd to process the incoming HTTP requests.
A web server like Nginx is therefore the perfect place to "trick" the crawler and proxy it to a rendered HTML page containing the information we want it to see.
For this we need:

  • to know which requests come from the crawlers;
  • a service that renders dynamic HTML content;
  • to update Nginx to route crawlers to the new service.

Crawlers Identification

After researching the Facebook and Twitter documentation, we can identify the crawlers by the following user-agent strings:

  • facebookexternalhit/1.1 (Facebook)
  • Twitterbot (Twitter)
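If you also want to detect these crawlers in application code (not only in Nginx), a simple regex over the User-Agent header is enough; a minimal sketch of a hypothetical Express middleware:

//crawlerDetection.js - flags requests coming from the Facebook/Twitter crawlers (hypothetical helper)
const CRAWLER_USER_AGENTS = /facebookexternalhit|Twitterbot/i;

function detectCrawler(req, res, next) {
  // req.get("User-Agent") returns the raw User-Agent header (or undefined)
  req.isCrawler = CRAWLER_USER_AGENTS.test(req.get("User-Agent") || "");
  next();
}

module.exports = { detectCrawler };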

Service to render Dynamic HTML

There are other types of solutions; you can use pretty much anything that renders an HTML webpage.

In this case, I already had an established set of services available through ExpressJS, so I stuck with it and created one endpoint that takes params (in this case, a news publication identifier) and returns an HTML page with all the head and meta tags I wanted the crawlers to scrape.

Note: The URL must match the one where the news publication is viewed on the website.

Example of the service:

//(routes/social.js -> socialRoutes)
const express = require("express");
const router = express.Router();
// getNews is the project's existing data-access helper that fetches a news publication by id
...

router.get("/news/:id", async (req, res) => {
  const { id } = req.params;
  const { news } = await getNews(id);

  // Return a minimal, JavaScript-free HTML page containing only the tags the crawlers need
  res.set("Content-Type", "text/html");
  res.send(`<!DOCTYPE html>
  <html>
    <head>
      <link rel="canonical" href="${news.url}" />
      <meta property="og:title" content="${news.title}" />
      <meta property="og:description" content="${news.description}" />
      <meta property="og:image" content="${news.cover_image}" />
      <meta property="og:url" content="${news.url}" />
      <meta name="twitter:title" content="${news.title}" />
      <meta name="twitter:description" content="${news.description}" />
      <meta name="twitter:image" content="${news.cover_image}" />
      <meta name="twitter:url" content="${news.url}" />
      <meta name="twitter:card" content="summary" />
    </head>
  </html>
`);
});

module.exports = router;

//server.js
const express = require("express");
const socialRoutes = require("./routes/social");

const app = express();
app.use("/social", socialRoutes);

app.listen(3500, () => {
  console.log("started at localhost:3500");
});
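Before touching Nginx, you can sanity-check the service by requesting the endpoint directly; a quick sketch assuming Node 18+ (global fetch), the service running on port 3500, and a hypothetical publication id:

// quick local check of the social endpoint (id "123" is just a placeholder)
fetch("http://localhost:3500/social/news/123")
  .then((res) => res.text())
  .then((html) => console.log(html)); // should print the head/meta-only HTML page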

Update NGINX and Send Crawlers to our Service

Knowing the user-agent strings of the crawlers and having already defined our service to generate JavaScript-free HTML pages, we can now "trick" the crawlers with the help of Nginx and send them to our service instead of the real webpage.
Usually, if you are serving a React app behind Nginx, your default.conf file will look similar to this:

server{
    root /var/www/html;

    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    server_name www.example.com example.com;

    location / {
        try_files $uri /index.html; 
    }
}

Nevertheless, this isn't enough, because the crawlers will still be served the files located in root and will only see blank pages, since rendering requires JavaScript.

Therefore, we need to add a condition that checks the user-agent before sending requests to our project folder.

server{
    root /var/www/html;

    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    server_name www.example.com example.com;

    location / {
        # Proxy the request to our API when the user-agent matches either of these patterns
        if ($http_user_agent ~ facebookexternalhit|Twitterbot) {
            proxy_pass http://localhost:3500/social$uri$is_args$args;
        }
        try_files $uri /index.html; 
    }
}
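After reloading Nginx, you can verify the routing by requesting the same public URL with and without a crawler user agent; a minimal sketch assuming Node 18+ (global fetch) and a hypothetical URL:

//verify-proxy.js - compare what a crawler receives vs. a regular browser
const url = "https://www.example.com/news/123"; // placeholder URL

async function check(userAgent) {
  const res = await fetch(url, { headers: { "User-Agent": userAgent } });
  const html = await res.text();
  console.log(`${userAgent} -> og:title present: ${html.includes("og:title")}`);
}

check("Twitterbot");                      // should be proxied to the Express service
check("Mozilla/5.0 (a regular browser)"); // should get the normal React index.html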

Conclusion

Every time we have a new request that matches the user agents of Facebook and Twitter, we will proxy it to our service for HTML rendering, which in turn allows the crawlers to process our "not-so-real" webpage as the real one and fetch the meta tags needed to share our website.

As long as you have some kind of middleware that can act as a reverse proxy, you can still allow client-side web applications to be scraped by crawlers that don't execute JavaScript.
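For example, if you can't touch the Nginx config, the same user-agent check could live in a small Node front server; a rough sketch under that assumption (it redirects crawlers to the social endpoint, since both the Facebook and Twitter crawlers follow redirects), not the exact setup described above:

//server.js (alternative sketch without Nginx)
const express = require("express");
const path = require("path");
const socialRoutes = require("./routes/social"); // the service shown earlier

const app = express();
const CRAWLERS = /facebookexternalhit|Twitterbot/i;

app.use("/social", socialRoutes);

// Crawlers asking for a news page get sent to the meta-tag-only HTML instead
app.get("/news/:id", (req, res, next) => {
  if (CRAWLERS.test(req.get("User-Agent") || "")) {
    return res.redirect(`/social/news/${req.params.id}`);
  }
  next();
});

// Everyone else gets the client-side React build
app.use(express.static(path.join(__dirname, "build")));
app.get("*", (req, res) => {
  res.sendFile(path.join(__dirname, "build", "index.html"));
});

app.listen(3000, () => console.log("listening on 3000"));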

Nevertheless, if possible, you should take a look at Static Site Generators or Server-Side Rendering frameworks.

This publication is only meant to shed some light on how you can interact with crawlers, and to possibly guide or help anyone working on something similar.

Top comments (2)

Pacharapol Withayasakpunt

Example of the service:

Probably using EJS or Cheerio to edit Meta and Title tags would be more ideal, but is it performant?

What did people really do in production for SPAs?

Vasco Abelha

Thanks for the questions!

Both would be fine.

I just ended up going this route because it was the "simplest" solution for me at the time, since Express was already 99% used for RESTful web services.
In terms of performance, I think the difference is negligible.

Second Question:
This is not meant for SPAs. This would just be Server-Side Rendering or SSGs.

Briefly said:
If you don't need SEO (Googlebot now runs JS), you are most likely building tools, and it doesn't matter how long your webpage takes to load and become responsive, then an SPA is fine.

If you are working with:

  • e-commerce, showcasing, marketing;
  • anything where speed and responsiveness matter;
  • SEO;

then SSR or SSG are the best options.