llermaly
Body parsing in App Search

A feature some people (including me) miss from Swiftype Site Search is being able to parse content using CSS selectors.

The body parsing problem

The App Search crawler extracts all the content from the website you specify and distributes it into fields based on the HTML tags it finds.

Text within title tags becomes the title field; anchor tags become links; and everything else is parsed into one giant body field.

What if a website has a custom structure we want to capture? For example, product pages. We want to capture things like color, size, and price in specific fields and not as part of a single body field.

App Search allows you to add Meta Tags to your website to create custom fields, but sometimes making changes on the website is too complicated (red tape) or just not possible.

The body parsing solution

This post explores a way of placing a proxy between the crawler and the actual website. The proxy does the extraction, creates the needed meta tags, and injects them into the response so App Search can pick them up.

To keep this short, I will not be doing a step-by-step guide and only showing the key parts of the script.

So, what are we doing?

We are going to create a NodeJS server that hosts a product page, and a proxy that stands in front of it. The proxy receives the crawler's request, hits the product page, injects the meta tags, and returns the modified page to the crawler.

(Diagram: App Search crawler → proxy → product page)

This is how we add custom fields using meta tags:

<head>
  <meta class="elastic" name="product_price" content="99.99">
</head>
<body>
  <h1 data-elastic-name="product_name">Printer</h1>
</body>

Tools

  • Node.js (to create the example page and proxy)
  • Ngrok (to expose the local proxy to the internet)
  • App Search (to crawl the page)

The page

Following the example, we will serve a page that emulates a product page for a printer.

index.html

<html>
  <head>
    <title>Printer Page</title>
  </head>
  <body>
    <h1>Printer</h1>
    <div class="price-container">
      <div class="title">Price</div>
      <div class="value">2.99</div>
    </div>
  </body>
</html>

server.js

const express = require("express");
const app = express();

// Serve the example product page.
app.get("/", (req, res) => {
  res.sendFile(__dirname + "/index.html");
});

app.listen(1337, () => {
  console.log("Application started and listening on port 1337");
});

Afterwards, we use App Search to crawl the page and, as expected, the data we want as fields (like the price) just ends up inside the body content field:

(Screenshot: crawled document with the price inside the body content field)

Next, we will create a proxy capable of recognizing this data and injecting a data-elastic-name attribute into the response, so App Search recognizes it as a field.

proxy.js

const http = require("http"),
  connect = require("connect"),
  httpProxy = require("http-proxy");

const app = connect();

// Proxy every request to the product page server.
const proxy = httpProxy.createProxyServer({
  target: "http://localhost:1337",
});

// Intercept the response body and tag the price element so the
// App Search crawler picks it up as a field. Note: this simple
// version assumes the markup we are rewriting arrives in a single chunk.
app.use(function (req, res, next) {
  const _write = res.write;
  res.write = function (data) {
    return _write.call(
      res,
      data
        .toString()
        .replace('class="value"', 'class="value" data-elastic-name="price"')
    );
  };
  next();
});

app.use(function (req, res) {
  proxy.web(req, res);
});

http.createServer(app).listen(8013);

console.log("http proxy server started on port 8013");

Finally, we start the server and the proxy, expose the proxy to the internet with Ngrok, and use the Ngrok URL as the domain in App Search.
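On the command line, those steps look roughly like this (assuming the files are named as in this post and that Ngrok is installed):

```shell
# Start the product page and the proxy in the background
node server.js &
node proxy.js &

# Expose the proxy (port 8013) to the internet so App Search can reach it
ngrok http 8013
```

Ngrok prints a public URL; that URL is what you enter as the domain when configuring the crawler in App Search.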

Voilà! We now have the price as a separate field:

(Screenshot: the price shown as a separate field in App Search)

We can go as fancy as we want with the middleware that transforms the response body, adding meta tags based not only on existing classes but also on the content itself.
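For instance, the middleware could read the price out of the markup itself and inject a matching meta tag into the head. This is a minimal sketch; the helper name, the regex, and the product_price field name are assumptions for this example, not part of the proxy above:

```javascript
// Sketch: extract the price from the page content and inject a
// matching App Search meta tag into the <head>.
function injectPriceMeta(html) {
  // The selector/regex is an assumption based on the example page markup.
  const match = html.match(/<div class="value">([\d.]+)<\/div>/);
  if (!match) return html; // nothing to inject, pass the page through

  const meta = `<meta class="elastic" name="product_price" content="${match[1]}">`;
  return html.replace("</head>", `  ${meta}\n</head>`);
}
```

A transform like this could replace the string substitution in the proxy's res.write override, so the injected field comes from the content rather than from a hard-coded class name.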

This is a simple example, but I hope you get the gist.
