πŸ” Parsing Schema Data with node-html-parser

Did you know that there's a whole vocabulary of JSON objects (Schema.org structured data, usually embedded as JSON-LD) for providing machine-readable information about the contents of your website? Google uses the data in these objects to fill out search results and build rich snippets.
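For example, a recipe page might embed something like this in its markup (a trimmed-down, made-up snippet):

<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Example Pancakes",
    "recipeIngredient": ["2 cups flour", "1 cup milk", "1 egg"]
  }
</script>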

Here's a secret: it can power other stuff too. For example, I'm building a Node.js web app that lets you plug in a recipe URL and get back a list of that recipe's ingredients.

Want to start parsing data yourself? Read on!

Challenges

  • Fetching the raw HTML
  • Making the Raw HTML Parse-able
  • Finding the right Schema object out of all of the ones on the page
  • Grabbing the right data out of that schema object

Fetching the raw HTML

First things first β€” we want to be able to fetch the HTML code of whatever link we end up pasting into our app.

There are a lot of ways to do this in Node.js. For this tutorial, we'll be using the built-in fetch API, which is available natively in Node.js 18 and later.

With that in mind, here's how to make fetch happen:

// Use an async function so we can wait for the fetch to complete
async function getHtmlStringFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      // responseHtml is a huge string containing the entire web page HTML.
      // In the next section, we'll process it into something we can work with
      return responseHtml;
    })
  );
}

Making the Raw HTML Parse-able

When we first fetch a URL and grab the response body, it's one enormous text string. There's HTML in there, but we can't really work with it yet. We need to plug this string into an HTML parser that will let us use DOM selectors to pick out the useful bits.

node-html-parser is my personal choice for this. It lets us use all the usual JavaScript DOM selector methods, and it's pretty fast too. Add it to your project with this terminal command:

yarn add node-html-parser
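Or, if your project uses npm instead of yarn:

npm install node-html-parser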

Then import the parse function from the package into the JS file where you'll be using it:

import { parse } from "node-html-parser";

Now we can take the response body string, plug it into our parser, and get to the real fun:

import { parse } from "node-html-parser";

async function getHtmlDocumentFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      // Parse the HTML string into a DOM-like object we can navigate
      const document = parse(responseHtml);
      return document;
    })
  );
}

That's all we need to get the HTML into something we can sift through! The returned object supports the usual DOM traversal methods, such as querySelector, getElementById, and so on.
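For instance, you can poke around the parsed page much like you would in the browser (a quick hypothetical sketch, using node-html-parser's .text property to read an element's text content):

// Assuming `document` is the object returned by parse(responseHtml)
// and that the page actually has a <title> tag
const pageTitle = document.querySelector("title").text;
// Count the h2 headings on the page
const headingCount = document.querySelectorAll("h2").length;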

So, how do we work it to find the structured data we're looking for?

Finding the right Schema object(s)

The nice thing about working with structured data is that you can make some assumptions about the data you're processing, because it has to be structured in a way that web crawlers can understand to be useful.

The structured data Schema objects we're looking for live inside script tags with the type application/ld+json. Now that we've DOMified the HTML, we can run queries on it like this:

import { parse } from "node-html-parser";

async function getSchemaNodeListFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      const document = parse(responseHtml);
      // Create a NodeList of elements containing the page's structured data JSON. So close to useful!
      const structuredData = document.querySelectorAll('script[type="application/ld+json"]');
      return structuredData;
    })
  );
}

That will give us a NodeList of all the matching elements. That's close to perfect, but it's not a true array and could give us errors if we try to treat it like one (which we will soon). So let's turn it into an array:

import { parse } from "node-html-parser";

async function getSchemaArrayFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      const document = parse(responseHtml);
      // Create an ARRAY of elements containing the page's structured data JSON. Just one more step!
      const structuredData = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
      return structuredData;
    })
  );
}


Now we have an array of structured data nodes. In a way, we're back to square one with data that is so close to being useful. To make it useful, we need to grab the innerHTML of each node, which will come out as a big string. Then we can parse that into ✨real JSON!✨


import { parse } from "node-html-parser";

async function getJsonFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      const document = parse(responseHtml);
      const structuredData = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
      // Get an array containing the contents of each structured data element on the page. This is the ✨useful stuff✨
      // We also flatten the array with .flat() to handle how some sites structure their schema data. See epilogue for more info
      const structuredDataJson = structuredData.map((node) => JSON.parse(node.innerHTML)).flat();
      return structuredDataJson;
    })
  );
}

Whoa, look at us. We've got real, actual JSON now. If you log structuredDataJson to your console, you'll see an array of structured data objects! Huzzah πŸŽ‰
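For a recipe page, that logged array might look something like this (an entirely made-up illustration, trimmed way down):

[
  { "@type": "Organization", "name": "Example Cooking Site" },
  { "@type": "BreadcrumbList" },
  { "@type": "Recipe", "name": "Example Pancakes", "recipeIngredient": ["2 cups flour", "1 cup milk", "1 egg"] }
]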

But of course, we're not done yet! There is likely to be a ton of data you don't need in this array, in addition to whatever you're actually looking for.

Grabbing the right data out of that schema object

You're looking for some specific piece of data in these objects. In my case, I want the list of ingredients inside the Recipe object. So, now that we have actual JSON, we can inspect each object's properties and whittle our array down to the single, useful piece of data:

import { parse } from "node-html-parser";

async function getIngredientsFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      const document = parse(responseHtml);
      const structuredData = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
      const structuredDataJson = structuredData.map((node) => JSON.parse(node.innerHTML)).flat();
      // Look for a Recipe schema and return its ingredients if it exists
      const recipeData = structuredDataJson.find((schema) => schema["@type"] === "Recipe");
      if (recipeData) {
        return recipeData.recipeIngredient;
      } else {
        return null;
      }
    })
  );
}

If one of the structured data objects is for a Recipe, we'll get the array of ingredients we're looking for. If not, the function will return null so we know it failed to find what we were looking for.
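Here's what calling the finished function might look like (the URL is just a placeholder):

// Hypothetical usage in an ES module, where top-level await is allowed
const ingredients = await getIngredientsFromUrl("https://example.com/pancake-recipe");
console.log(ingredients);
// => ["2 cups flour", "1 cup milk", "1 egg"] if a Recipe schema was found, or null if not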

That's it! We've parsed the HTML into JSON into the actual thing we need πŸŽ‰

Conclusion

At this point, you have a function that takes a URL and returns an array of whatever information you're looking for. This general process can be used to do a whole lot of interesting stuff depending on what you're grabbing. Here's an example I put together to grab the ingredients within a recipe page.

Here are some of the most common schemas out there for inspiration. In my case, I'm parsing recipe ingredients so I can see if they're in my pantry, and add them to my shopping list if they're not.

How about you? If you end up using this process to parse website data in your web app, let me know what you're doing!

Epilogue: Handling Edge Cases with the flat() method

As mentioned earlier, structured data has to be readable by web crawlers to be useful, so we can make some assumptions about what it will look like. Still, we're ultimately trusting people to build their websites according to a certain convention, so you still might run into some issues across different websites and pages.

When I was testing my recipe parser, I ran into a few websites that structured their data in non-standard ways, which caused some trouble early on. The most common issue: some sites wrap their schema JSON in an extra array, which prevented my .find() call from seeing any of the data inside the nested array.
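To make that concrete, here's a toy illustration of the problem (the data is made up):

// Made-up data: the second script tag wrapped its schema object in an array
const parsed = [
  { "@type": "WebSite" },
  [{ "@type": "Recipe", "recipeIngredient": ["2 cups flour"] }],
];

// Without .flat(), .find() never sees the nested Recipe object
parsed.find((schema) => schema["@type"] === "Recipe"); // undefined

// With .flat(), the nesting is removed and the Recipe turns up
parsed.flat().find((schema) => schema["@type"] === "Recipe"); // the Recipe object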

In my production code, I handle this by flattening the parsed JSON to remove any nested arrays before I start looking for specific data. Here's what that looks like with the example code we've been using:

import { parse } from "node-html-parser";

async function getIngredientsFromUrl(url) {
  return await fetch(url).then((response) =>
    response.text().then((responseHtml) => {
      const document = parse(responseHtml);
      const structuredData = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
      // Adding .flat() to the line below handles the most common edge cases I've found so far!
      const structuredDataJson = structuredData.map((node) => JSON.parse(node.innerHTML)).flat();
      const recipeData = structuredDataJson.find((schema) => schema["@type"] === "Recipe");
      if (recipeData) {
        return recipeData.recipeIngredient;
      } else {
        return null;
      }
    })
  );
}
