A feature some people (including me) miss from Swiftype Site Search is being able to parse content using CSS selectors.
The body parsing problem
The App Search crawler extracts all the content from the website you specify and spreads it across fields depending on the HTML tags it finds.
Text within title tags becomes the title field, anchor tags become links, and everything else is parsed into one giant body field.
What if a website has a custom structure we want to capture? Take product pages, for example: we want to capture things like color, size, and price in dedicated fields, not as part of a single body field.
App Search allows you to add meta tags to your website to create custom fields, but sometimes making changes to the website is too complicated (red tape) or just not possible.
The body parsing solution
This post explores a way of creating a proxy between the crawler and the actual website: the proxy does the extraction, creates the needed meta tags, and injects them into the response so App Search can pick them up and use them.
To keep this short, I will skip the step-by-step guide and only show the key parts of the script.
So, what are we doing?
We are going to create a Node.js server that hosts a product page, plus a proxy that stands in front of it. The proxy will receive the crawler's request, hit the product page, inject the meta tags, and return the modified page to the crawler.
This is how we add custom fields, using meta tags in the head or data attributes on body elements:
<head>
  <meta class="elastic" name="product_price" content="99.99">
</head>
<body>
  <h1 data-elastic-name="product_name">Printer</h1>
</body>
Tools
- Node.js (to create the example page and the proxy)
- ngrok (to expose the local proxy to the internet)
- App Search (to crawl the page)
The page
Following the example, we will serve a page that emulates a product page for a printer.
index.html
<html>
  <head>
    <title>Printer Page</title>
  </head>
  <body>
    <h1>Printer</h1>
    <div class="price-container">
      <div class="title">Price</div>
      <div class="value">2.99</div>
    </div>
  </body>
</html>
server.js
const express = require("express");
const app = express();

// Serve the example product page
app.get("/", (req, res) => {
  res.sendFile(__dirname + "/index.html");
});

app.listen(1337, () => {
  console.log("Application started and listening on port 1337");
});
Afterwards, we use App Search to crawl the page and, as expected, the data we want as fields (like the price) ends up inside the body content field:
Next, we will create a proxy capable of recognizing this data and injecting a data-elastic-name attribute into the response, so App Search recognizes it as a field.
proxy.js
const http = require("http"),
  connect = require("connect"),
  app = connect(),
  httpProxy = require("http-proxy");

// Proxy that forwards every request to the local product page
const proxy = httpProxy.createProxyServer({
  target: "http://localhost:1337",
});

app.use(function (req, res, next) {
  const _writeHead = res.writeHead;
  const _write = res.write;

  // The injected attribute makes the body longer, so drop the stale
  // Content-Length coming from the target before headers are sent
  res.writeHead = function (...args) {
    res.removeHeader("content-length");
    return _writeHead.apply(res, args);
  };

  // Rewrite each chunk, tagging the price element so App Search
  // extracts it as a field (this assumes the matched string is not
  // split across chunks, which is fine for a small page)
  res.write = function (data) {
    return _write.call(
      res,
      data
        .toString()
        .replace('class="value"', 'class="value" data-elastic-name="price"')
    );
  };

  next();
});

app.use(function (req, res) {
  proxy.web(req, res);
});

http.createServer(app).listen(8013);
console.log("http proxy server started on port 8013");
Finally, we start the server and the proxy, expose the proxy with ngrok, and use the resulting URL in App Search.
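Assuming both scripts live in the same folder and ngrok is installed, that boils down to something like:

node server.js
node proxy.js
ngrok http 8013

Point the App Search crawler at the forwarding URL ngrok prints.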
Voilà! We now have the price as a separate field:
We can get as fancy as we want with the middleware that transforms the response body, adding meta tags based not only on existing classes but also on the content itself.
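For instance, here is a minimal sketch of a content-based variant of the rewrite middleware: instead of tagging an element, it pulls the price out of the markup and injects a <meta class="elastic"> tag into the head. The regex and the product_price field name are illustrative assumptions, not something App Search requires:

app.use(function (req, res, next) {
  const _write = res.write;
  res.write = function (data) {
    let html = data.toString();
    // Illustrative: grab the price from the page content itself...
    const match = html.match(/<div class="value">([\d.]+)<\/div>/);
    if (match) {
      // ...and surface it as an App Search meta tag in the head
      html = html.replace(
        "</head>",
        '<meta class="elastic" name="product_price" content="' + match[1] + '"></head>'
      );
    }
    return _write.call(res, html);
  };
  next();
});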
This is a simple example, but I hope you get the gist.