DEV Community

Josh Pollock

Parsing WordPress Block Data

Before WordPress 5.0, content for posts was stored as a string. It may or may not have contained HTML; line breaks and other formatting were meaningful, and so were shortcodes. WordPress 5.0 introduced a block-based editor. It stores content as a string of HTML, with extra annotations as well as semantic HTML markup and attributes representing the data model. That HTML is parsed server-side -- removing the extra annotations and replacing dynamic content -- before the final HTML is output.

Some developers would prefer that the block attributes, including content, were stored in separate, queryable columns in the database or were presented as an object. Maybe that would have been better -- I disagree later in this post -- but it's not how it works. More importantly, you can parse the blocks into a structured object with PHP or JavaScript. That's the best of both worlds: we get the interoperability of using HTML to represent HTML, and developers can work with content as an object to modify the content or how it is rendered as HTML.

I recently published a helpful tool for using the block parser in React apps. It helps when you want to replace WordPress' default block parsing with your own React components. In this article, I'll get to why I like WordPress' very imperfect approach and how the tool I'm working on makes it easier to modify. I'll also look at how it compares to markdown-based content in Jamstack-style sites. I am a big fan of both approaches and this is not about debating one versus the other. Both are good, and either may be the better choice, depending on the circumstances.

Shelob9 / block-content

Renders "raw" post content with WordPress block markup in it using React components, which you optionally provide.

Block Content Renderer

Renders "raw" post content with WordPress block markup in it using React components you optionally provide. Uses @wordpress/block-serialization-default-parser.

This works with the "raw" value returned by WordPress REST API for post title, content, excerpt, etc. You must request with ?context=edit which requires permission to edit the post.

BETA Probably don't use. An experiment by Josh Pollock.

Why / Status

WordPress parses block-based content to HTML before displaying it in a front-end theme. This HTML is also returned by the REST API and WPGraphQL. With a JavaScript front-end, in a headless site or what not, you may want to treat the block content as an object for several reasons.

  • Change the markup -- add classes to paragraphs, change element types, etc.
  • Sanitize content
  • Re-order or reformat content.

WordPress' block parser converts blocks to objects. These objects have block attributes and the inner HTML. This library will…

Content Can Be An Object In A String

First, everything is a string in the database. A JSON column is a string with special annotations for translating it into an object. Relational databases like MySQL are great for putting it all back together. And if every block were its own row in a table, we could query for all blocks of a certain type. That and similar queries would make a GraphQL API for WordPress even cooler.

It is common when developing a Gatsby site to have a directory of markdown files stored in the same git repo as the code. When Gatsby generates the site, it parses the markdown in those files to an abstract syntax tree and then uses that tree to render HTML. Generally MDX is used to provide React components for each HTML element.

Gatsby provides APIs to hook in while that's happening so you can add business logic like "always add this class to paragraphs" or "make a fancy blockquote markup" or "insert an ad between sections."

I'm overgeneralizing a lot here. But the point is that minimal markup is stored with the content. The markup is generated at build-time, by treating the string of markdown as an object.

Back To WordPress

When content is edited by the block editor, there is a lot of HTML markup in the post_content field of the database. It's semantic HTML, making heavy use of comments and data attributes. The extra annotations, the Gutenberg grammar, are removed before sending the content to the browser in most settings.

The WordPress REST API returns an object for post content. It contains one or more properties. It should always return a "rendered" property. That is the same HTML as we get in the front-end. If you have permission to edit posts and append ?context=edit you will have a "raw" property in the content object.

That has the unparsed content. You can do the same thing WordPress does with it: use a PHP or JavaScript parser to convert it into an array of block objects and then walk that array to generate HTML.
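As a rough illustration of "walk that array to generate HTML", here is a minimal sketch. The `blocks` array is hand-written sample data in the shape the JavaScript parser returns (`blockName`, `attrs`, `innerBlocks`, `innerHTML`, `innerContent`), and `renderBlock` and `renderBlocks` are hypothetical helper names, not part of WordPress. It only stitches static markup back together; it does not handle dynamic blocks, which WordPress renders with PHP.

```javascript
// Hand-written sample: a group block wrapping one paragraph.
// Real parser output may differ in whitespace.
const blocks = [
  {
    blockName: "core/group",
    attrs: {},
    innerBlocks: [
      {
        blockName: "core/paragraph",
        attrs: {},
        innerBlocks: [],
        innerHTML: "<p>Hi Roy</p>",
        innerContent: ["<p>Hi Roy</p>"],
      },
    ],
    innerHTML: '<div class="wp-block-group"></div>',
    innerContent: ['<div class="wp-block-group">', null, "</div>"],
  },
];

// innerContent interleaves HTML strings with null placeholders.
// Each null marks the spot where the next inner block belongs.
function renderBlock(block) {
  let next = 0;
  return block.innerContent
    .map((chunk) =>
      chunk === null ? renderBlock(block.innerBlocks[next++]) : chunk
    )
    .join("");
}

// Render a list of top-level blocks to one HTML string.
function renderBlocks(list) {
  return list.map(renderBlock).join("");
}

console.log(renderBlocks(blocks));
// prints: <div class="wp-block-group"><p>Hi Roy</p></div>
```

The recursion is the place to hook in your own logic, for example returning different markup when `block.blockName` matches a block you want to customize.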

This article covers parsing with JavaScript. Micah Wood wrote a good post on using the PHP parser and exposing it on a REST API endpoint. I also recommend this explanation of how dynamic block parsing works server-side by default. You may also want to look at Roy Sivan's Gutenberg Object Plugin, which copies block data to a separate table and exposes it on a REST API endpoint.

Why This Matters

The rendered HTML returned by the REST API can be rendered with React, using dangerouslySetInnerHTML:

const PostContent = ({rendered}) => {
  function createMarkup() {
    return { __html: rendered };
  }
  return <div dangerouslySetInnerHTML={createMarkup()} />;
}

This is not the best solution, because you are opening yourself up to XSS attacks by having React inject unsanitized HTML like that. And what if you have React components you want to use for rendering the content, for consistency with the rest of the site?

In these situations, using a block parser may be helpful. For example, you can parse out links and replace them with React components, such as Gatsby's Link component.

Customizing Block Rendering

As I said earlier, I'm working on a helper for working with the parser in React apps for headless WordPress sites. WordPress always returns post content with a "rendered" property, which contains the already-rendered HTML. If you request a post with the context=edit query param and have permission to edit it, you also get back a "raw" property. That's what we're working with here.
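Here is a minimal sketch of requesting that "raw" value. It assumes WordPress is served from the site root with the REST API at /wp-json, and that the request authenticates with an application password over HTTP Basic auth; the site URL, user name and password are placeholders, and `buildPostUrl`, `basicAuthHeader` and `fetchRawContent` are hypothetical helper names. Because it sends credentials, something like this belongs in build-time or server-side code, not in the browser.

```javascript
// Build the REST URL for one post, asking for the edit context.
function buildPostUrl(siteUrl, postId) {
  const url = new URL(`/wp-json/wp/v2/posts/${postId}`, siteUrl);
  url.searchParams.set("context", "edit");
  return url.toString();
}

// Application passwords are sent with HTTP Basic auth.
function basicAuthHeader(username, appPassword) {
  return "Basic " + btoa(`${username}:${appPassword}`);
}

// Resolves to the raw block markup for a post, or throws on HTTP errors.
async function fetchRawContent(siteUrl, postId, username, appPassword) {
  const response = await fetch(buildPostUrl(siteUrl, postId), {
    headers: { Authorization: basicAuthHeader(username, appPassword) },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const post = await response.json();
  return post.content.raw;
}

console.log(buildPostUrl("https://example.com", 1));
// prints: https://example.com/wp-json/wp/v2/posts/1?context=edit
```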

WordPress' block parser works with that raw string, like this:

import { parse } from "@wordpress/block-serialization-default-parser";
const blocks = parse( `<!-- wp:paragraph -->
<p>Hi Roy</p>
<!-- /wp:paragraph -->`);

That returns an array of block objects, some of which have blocks inside them. I'm working on a utility that makes it easier to use this parser to render content using components supplied by the developer.
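Because blocks can nest, finding every block of one type means walking innerBlocks recursively. The sketch below uses hand-written sample data in the parser's shape, trimmed to the properties it needs, and `flattenBlocks` and `findBlocksByName` are hypothetical helper names. Note that the parser gives content between blocks, such as the line breaks separating them, a blockName of null.

```javascript
// Hand-written sample: a columns block with one column holding a paragraph,
// then the whitespace between blocks, then a top-level paragraph.
const blocks = [
  {
    blockName: "core/columns",
    attrs: {},
    innerBlocks: [
      {
        blockName: "core/column",
        attrs: {},
        innerBlocks: [
          {
            blockName: "core/paragraph",
            attrs: {},
            innerBlocks: [],
            innerHTML: "<p>Hi Roy</p>",
          },
        ],
        innerHTML: "",
      },
    ],
    innerHTML: "",
  },
  { blockName: null, attrs: {}, innerBlocks: [], innerHTML: "\n\n" },
  {
    blockName: "core/paragraph",
    attrs: {},
    innerBlocks: [],
    innerHTML: "<p>Hi Josh</p>",
  },
];

// Depth-first list of every block, parents before their children.
function flattenBlocks(list) {
  return list.flatMap((block) => [block, ...flattenBlocks(block.innerBlocks)]);
}

// All blocks with the given name, at any nesting depth.
function findBlocksByName(list, name) {
  return flattenBlocks(list).filter((block) => block.blockName === name);
}

const paragraphs = findBlocksByName(blocks, "core/paragraph");
console.log(paragraphs.map((block) => block.innerHTML));
// prints: [ '<p>Hi Roy</p>', '<p>Hi Josh</p>' ]
```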

The block-content library includes a component called BlockContent, which renders raw block content by parsing the blocks, sanitizing the content and converting it to React elements. Remember, the request for the raw content must be made by a user with permission to edit the post, and with the context query param set to edit.

Here's an example of a Post component that uses it:

import {BlockContent} from "@shelob9/block-content";
export const Post = ({post}) => {
  return (
    <article>
      <BlockContent rawContent={post.content.raw} />
    </article>
  )
}

That's cool, but it doesn't help customize what React components are used to render the block content or to add business logic to the rendering. To do that, you need to set up a provider and supply it with components.

Here is an example of the components you could use. In this example, all "a" elements in post content are replaced with Gatsby's Link component:

import {Link} from "gatsby";

const components = {
  //replace links with Gatsby's Link component.
  a:({children,className,href}) => (
    <Link className={className} to={href}>{children}</Link>
  ),
}

In this example, we add an additional class name to all paragraphs:

const components = {
  //Add a custom class to paragraphs, skipping className when it is not set
  p: ({children,className}) => (
    <p className={[className, "custom-class"].filter(Boolean).join(" ")}>{children}</p>
  ),
}

There is no need to supply all elements. If, for example, no component for p elements is provided, a generic one is generated.

These components are passed to the ThemeProvider component. That provider needs to go around all elements that use BlockContent:

import {ThemeProvider} from "@shelob9/block-content";
import {Link} from "gatsby";
import {Post} from "your/post/component";
import components from "your/components";

//Mock data
let raw = `<!-- wp:paragraph -->
<p>Hi Roy</p>
<!-- /wp:paragraph -->`;
let post = {
    content: {
        raw,
        rendered: '<p>Hi Roy</p>'
    }
};

//Wrap everything in the theme provider
const App = () => {
    return(
        <ThemeProvider components={components}>
            <Post post={post} />
        </ThemeProvider>
    )
}

Try It And Let Me Know What You Think

This is a new project. If you have a chance to use it, let me know what you think, in the comments or on Twitter. I will add more control over sanitizing content and attributes next, but would be super happy to know what you wish this could do.

yarn add @shelob9/block-content

npm install @shelob9/block-content

I Think This Is Good

Yes, a table structure for block data would make it easier to run MySQL queries based on blocks. I love to think about an alternate reality or possible future where blocks can be used as a graph database of some sort.

In the strange world we do live in, post content is a string and I think that's good. With a table-based system, you would need MySQL and PHP to convert the content -- what site owners care about -- to HTML.

Storing Gutenberg markup in HTML makes parsing optional, and the parsing can be done without PHP and MySQL. There are JS and PHP parsers. Also, it's a spec you could implement in Go, because you're Chris Wiegman or whatever.

That's why I think this tradeoff makes sense. But, if querying against block attributes is a requirement, then those block attributes should be saved in post meta, so queries can be done based on those meta fields. I recommend this post Helen Hou-Sandí wrote about working with meta fields in the block editor if you want to learn more about how to do that.

I know this may be a contrarian opinion, but using strings of HTML is not a bad way to represent content blocks. It is way more human readable and interoperable than JSON or storing in MySQL. And with the parsers available, when the rendered HTML doesn't fit our needs, we can customize how the rendering works.

Sane defaults and plenty of ways to modify core behavior. Yes, it's a little messy, but it works and is very extensible when it needs to be. That's the vibe that makes WordPress so useful, right?

Featured Image by Joeri Römer on Unsplash
