A feature some people (including me) miss from Swiftype Site Search is being able to parse content using CSS selectors.
The body parsing problem
The App Search crawler extracts all the content from the website you specify and spreads it across fields depending on the HTML tags it finds.
Text within title tags becomes the title field, anchor tags become links, and everything else is parsed into one giant body field.
What if a website has a custom structure we want to capture? Take product pages, for example: we want to capture things like color, size, and price in dedicated fields, not as part of a single body field.
App Search allows you to add meta tags to your website to create custom fields, but sometimes making changes to the website is too complicated (red tape) or just not possible.
The body parsing solution
This post explores a way of creating a proxy between the crawler and the actual website: the proxy does the extraction, creates the needed meta tags, and injects them into the response so App Search can pick them up and use them.
To keep this short, I will skip the step-by-step guide and only show the key parts of the script.
So, what are we doing?
We are going to create a Node.js server that hosts a product page, plus a proxy that stands in front of it. The proxy will receive the crawler's request, hit the product page, inject the meta tags, and return the modified page to the crawler.
This is how we add custom fields, using meta tags in the head or data attributes on body elements:
<head>
  <meta class="elastic" name="product_price" content="99.99">
</head>
<body>
  <h1 data-elastic-name="product_name">Printer</h1>
</body>
Tools
- Node.js (to create the example page and the proxy)
- ngrok (to expose the local proxy to the internet)
- App Search (to crawl the page)
The page
Following the example, we will serve a page that emulates a product page for a printer.
index.html
<html>
  <head>
    <title>Printer Page</title>
  </head>
  <body>
    <h1>Printer</h1>
    <div class="price-container">
      <div class="title">Price</div>
      <div class="value">2.99</div>
    </div>
  </body>
</html>
server.js
const express = require("express");
const app = express();

// Serve the example product page
app.get("/", (req, res) => {
  res.sendFile(__dirname + "/index.html");
});

app.listen(1337, () => {
  console.log("Application started and listening on port 1337");
});
Afterwards, we use App Search to crawl the page and, as expected, the data we want as fields (like the price) ends up inside the body content field:
Next, we will create a proxy capable of recognizing this data and injecting a data-elastic-name attribute into the response, so App Search recognizes it as a field.
proxy.js
const http = require("http"),
  connect = require("connect"),
  app = connect(),
  httpProxy = require("http-proxy");

// Proxy that forwards every request to the local product page
const proxy = httpProxy.createProxyServer({
  target: "http://localhost:1337",
});

app.use(function (req, res, next) {
  const _writeHead = res.writeHead;
  const _write = res.write;

  // The injected attribute makes the body longer, so drop the stale
  // Content-Length coming from the target before headers are sent
  res.writeHead = function (...args) {
    res.removeHeader("content-length");
    return _writeHead.apply(res, args);
  };

  // Rewrite each chunk, tagging the price element so App Search
  // extracts it as a field (this assumes the matched string is not
  // split across chunks, which is fine for a small page)
  res.write = function (data) {
    return _write.call(
      res,
      data
        .toString()
        .replace('class="value"', 'class="value" data-elastic-name="price"')
    );
  };

  next();
});

app.use(function (req, res) {
  proxy.web(req, res);
});

http.createServer(app).listen(8013);
console.log("http proxy server started on port 8013");
Finally, we start the server and the proxy, expose the proxy with ngrok, and use the resulting URL in App Search.
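Assuming both scripts live in the same folder and ngrok is installed, that boils down to something like:

node server.js
node proxy.js
ngrok http 8013

Point the App Search crawler at the forwarding URL ngrok prints.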
Voilà! We now have the price as a separate field:
We can get as fancy as we want with the middleware that transforms the response body, adding meta tags based not only on existing classes but also on the content itself.
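For instance, here is a minimal sketch of a content-based variant of the rewrite middleware: instead of tagging an element, it pulls the price out of the markup and injects a <meta class="elastic"> tag into the head. The regex and the product_price field name are illustrative assumptions, not something App Search requires:

app.use(function (req, res, next) {
  const _write = res.write;
  res.write = function (data) {
    let html = data.toString();
    // Illustrative: grab the price from the page content itself...
    const match = html.match(/<div class="value">([\d.]+)<\/div>/);
    if (match) {
      // ...and surface it as an App Search meta tag in the head
      html = html.replace(
        "</head>",
        '<meta class="elastic" name="product_price" content="' + match[1] + '"></head>'
      );
    }
    return _write.call(res, html);
  };
  next();
});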
This is a simple example, but I hope you get the gist.