DEV Community

David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Add semantic search to Elasticsearch

Part 2 and Part 3 of this series showed how to index and search data in txtai. Part 2 indexed and searched a Hugging Face Dataset, while Part 3 indexed and searched an external data source.

txtai is modular in design, and its components can be used individually. txtai has a similarity function that works on lists of text. This function can be integrated with any external search service, such as a REST API, a SQL query or anything else that returns text search results.

In this article, we'll take the same Hugging Face Dataset used in Part 2, index it in Elasticsearch and rank the search results using a semantic similarity function from txtai.

Install dependencies

Install txtai, datasets and Elasticsearch.

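# Install txtai, datasets and elasticsearch python client
# The client is pinned to the server version below; newer major versions of the
# client are not guaranteed to work with a 7.x server
pip install txtai datasets "elasticsearch==7.10.1"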

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1

Start an instance of Elasticsearch.

import os
import time
from subprocess import Popen, PIPE, STDOUT

# Start the server as uid 1 (the daemon user that owns the files after the chown above);
# Elasticsearch refuses to run as root
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))

# Give the server time to start before connecting
time.sleep(30)
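A fixed 30 second wait can be too short on a slow machine and wasteful on a fast one. As a rough sketch of an alternative, the helper below polls the server's HTTP port until it answers. The function name `wait_for_http` and its parameters are made up for this illustration and are not part of txtai or the Elasticsearch client.

```python
import http.client
import time
import urllib.error
import urllib.request

# Ignore any proxy settings, since the server runs on the local machine
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll url until it answers with HTTP 200 or timeout seconds elapse.

    Hypothetical helper for illustration. Returns True if the server
    answered in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Server is not accepting connections yet, try again
            pass
        time.sleep(interval)
    return False
```

With the instance started above, calling `wait_for_http("http://localhost:9200")` would return as soon as Elasticsearch responds, assuming it listens on the default port.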

Load data into Elasticsearch

The following block loads the dataset into Elasticsearch.

from datasets import load_dataset

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Load HF dataset
dataset = load_dataset("ag_news", split="train")["text"][:50000]

# Elasticsearch bulk buffer
buffer = []
rows = 0

for x, text in enumerate(dataset):
  # Article record
  article = {"_id": x, "_index": "articles", "title": text}

  # Buffer article
  buffer.append(article)

  # Increment number of articles processed
  rows += 1

  # Bulk load every 1000 records
  if rows % 1000 == 0:
    helpers.bulk(es, buffer)
    buffer = []

    print("Inserted {} articles".format(rows), end="\r")

if buffer:
  helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))
Total articles inserted: 50000
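The buffering logic in the loader can also be written as a small generator that yields fixed-size batches. This is only a sketch of the same idea: `chunks` and `actions` are hypothetical helper names, and the record layout simply mirrors the article dictionary used above.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most size items from iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def actions(texts, index="articles"):
    """Build one bulk action per text, using the row position as the id."""
    for x, text in enumerate(texts):
        yield {"_id": x, "_index": index, "title": text}


# Usage with the objects defined in the loader (not executed here):
# for batch in chunks(actions(dataset), 1000):
#     helpers.bulk(es, batch)
```

Note that helpers.bulk accepts any iterable of actions and splits it internally according to its chunk_size argument, so manual batching is mainly useful for progress reporting.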

Query data with Elasticsearch

Elasticsearch is a token-based search system. Queries and documents are parsed into tokens and the most relevant query-document matches are calculated using a scoring algorithm. The default scoring algorithm is BM25. Powerful queries can be built using a rich query syntax and Query DSL.
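To make the scoring step more concrete, here is a minimal sketch of the textbook BM25 formula with the default parameters k1=1.2 and b=0.75. It assumes naive whitespace tokenization and a single in-memory list of documents, so the numbers will not match what Elasticsearch reports (analyzers, shards and implementation details all differ). The function names are illustrative only.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one textbook BM25 score per document for the query."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []

    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["yankees lose again", "red sox win", "yankees win"]
print(bm25_scores("yankees lose", docs))
```

The first document contains both query tokens and scores highest, the third contains only one, and the second contains neither and scores zero. Nothing in the formula looks at what the words mean.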

The following section runs a query against Elasticsearch, finds the top 3 matches and returns the corresponding documents associated with each match.

from IPython.display import display, HTML

def table(category, query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 900px;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 10px;
        font: 15px Oswald;
    }
    </style>
    """

    html += "<h3>[%s] %s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>" % (category, query)
    for score, text in rows:
        html += "<tr><td>%.4f</td><td>%s</td></tr>" % (score, text)
    html += "</table>"

    display(HTML(html))

def search(query, limit):
  # Build a query_string request that returns at most `limit` hits
  body = {
      "size": limit,
      "query": {
          "query_string": {"query": query}
      }
  }

  results = []
  for result in es.search(index="articles", body=body)["hits"]["hits"]:
    source = result["_source"]

    # Cap the raw score at 18 and scale it to [0, 1] for display
    results.append((min(result["_score"], 18) / 18, source["title"]))

  return results

limit = 3
query = "+yankees lose"
table("Elasticsearch", query, search(query, limit))

[Elasticsearch] +yankees lose

Score Text
0.5817 El Duque adds to gloomy NY forecast The Yankees #39; staff infection has spread to the one man the team can #39;t afford to lose. Orlando Hernandez was scratched from last night #39;s scheduled start because
0.5697 Rangers Derail Red Sox The Red Sox lose for the first time in 11 games, falling to the Rangers 8-6 Saturday and missing a chance to pull within 1 1/2 games of the Yankees in the AL East.
0.5069 Rout leaves Yanks #39; lead at 3 Royals gain control with 10-run 5th Against a nothing-to-lose team such as the Kansas City Royals, the Yankees #39; manager wanted his team to put down the hammer early and not let baseball #39;s second worst team believe it had a chance.

The table above shows the results for the query +yankees lose. The + prefix makes the token yankees mandatory, while lose is optional and only raises the score when it is present. The search doesn't understand the meaning of the query; it returns the documents that score highest on those two tokens.
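The behaviour of this query can be imitated with a toy matcher. The sketch below is not how Elasticsearch parses query strings; it only handles the + prefix and plain tokens, and the function names and sample sentences (shortened paraphrases of the articles) are made up for the illustration.

```python
import re


def parse(query):
    """Split a query into required (+token) and optional tokens."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional


def token_hits(query, text):
    """Count matched query tokens, or return None if a required token is missing."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    required, optional = parse(query)
    if not required <= tokens:
        return None
    return len(required) + len(optional & tokens)


query = "+yankees lose"
print(token_hits(query, "The Yankees cannot afford to lose him"))  # 2
print(token_hits(query, "Yankees sustained their first defeat"))   # 1
print(token_hits(query, "The Red Sox lose"))                       # None
```

The first sentence matches both tokens even though it is about an injury, while the second, which actually describes a Yankees loss, matches only one.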

In this case, the results don't capture the meaning of the search. Let's try adding semantic similarity to the search!

Ranking search results with txtai

txtai has a similarity module that computes the similarity between a query and a list of strings. Of course, txtai can also build a full index, as shown in the previous articles, but in this case we'll just use the ad-hoc similarity function.

The code below creates a Similarity instance and defines a ranking function to order search results based on the computed similarity.

ranksearch queries Elasticsearch for a larger set of results (10 times the requested limit), re-ranks them using the similarity instance and returns the top n results.

from txtai.pipeline import Similarity

# Create similarity instance for re-ranking
similarity = Similarity("valhalla/distilbart-mnli-12-3")

def ranksearch(query, limit):
  # Fetch 10x the requested number of candidates from Elasticsearch
  results = [text for _, text in search(query, limit * 10)]

  # similarity returns (index, score) pairs sorted by descending score
  return [(score, results[x]) for x, score in similarity(query, results)][:limit]
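The over-fetch and re-rank pattern does not depend on Elasticsearch or on a particular model. The sketch below restates it with the search and similarity functions passed in as arguments, plus a guard for empty result sets. Everything here is hypothetical: `rerank` is not a txtai function, and `overlap_similarity` is a crude token-overlap stand-in that only copies the (index, score) return shape of the Similarity instance.

```python
def overlap_similarity(query, texts):
    """Stand-in scorer: Jaccard overlap of whitespace tokens.

    Returns (index, score) pairs sorted by descending score.
    """
    q = set(query.lower().split())
    scores = []
    for i, text in enumerate(texts):
        t = set(text.lower().split())
        union = q | t
        scores.append((i, len(q & t) / len(union) if union else 0.0))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def rerank(query, search, similarity, limit, candidates=10):
    """Fetch limit * candidates results, re-score them and keep the top limit."""
    texts = [text for _, text in search(query, limit * candidates)]
    if not texts:
        # Nothing to rank, skip the similarity call entirely
        return []
    ranked = similarity(query, texts)
    return [(score, texts[i]) for i, score in ranked[:limit]]


def fake_search(query, limit):
    """Stand-in for a search backend returning (score, text) tuples."""
    hits = [(1.0, "red sox lose"), (0.9, "yankees lose"), (0.8, "weather report")]
    return hits[:limit]


print(rerank("yankees lose", fake_search, overlap_similarity, limit=2))
```

Passing the backend and the scorer as arguments keeps the ranking step easy to test without a running server or a downloaded model.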

Now let's re-run the previous search.

# Run the search
table("Elasticsearch + txtai", query, ranksearch(query, limit))

[Elasticsearch + txtai] +yankees lose

Score Text
0.9929 Ouch! Yankees hit new low INDIANS 22, YANKEES 0---At New York, Omar Vizquel went 6-for-7 to tie the American League record for hits as Cleveland handed the Yankees the largest loss in their history last night.
0.9874 Vazquez and Yankees Buckle Early Because Javier Vazquez fizzled while Brad Radke flourished, the Yankees sustained their first regular-season defeat by the Minnesota Twins since 2001.
0.9542 Slide of the Yankees: Pinstripes Punished George Steinbrenner watched from his box as his Yankees suffered the most one-sided loss in the franchise's long history.

The results above do a much better job of matching the meaning of the query. Instead of just finding documents that contain the tokens yankees and lose, the search finds documents where the Yankees lose.

This combination is effective and powerful. It takes advantage of the high performance of Elasticsearch while adding a semantic search capability. We may already have a large Elasticsearch cluster with terabytes (or petabytes) of data and years of engineering investment that solves most use cases. Semantically ranking the search results it returns is a practical approach.

More examples

Now for some more examples comparing the results from Elasticsearch vs Elasticsearch + txtai.

for query in ["good news +economy", "bad news +economy"]:
  table("Elasticsearch", query, search(query, limit))
  table("Elasticsearch + txtai", query, ranksearch(query, limit))

[Elasticsearch] good news +economy

Score Text
0.8756 Surprise drop US wholesale prices is mixed news for economy (AFP) AFP - A surprise drop in US wholesale prices in August showed inflation apparently in check, but analysts said this was good and bad news for the US economy.
0.7379 China investment slows Good news for officials who are trying to cool an overheated economy; austerity measures to remain. BEIJING (Reuters) - China reported a marked slowdown in investment and money supply growth Monday, but stubbornly
0.7145 Spending Rebounds, Good News for Growth WASHINGTON (Reuters) - U.S. consumer spending rebounded sharply July, government data showed on Monday, erasing the disappointment of June and bolstering hopes that the U.S. economy has recovered from its recent soft spot.

[Elasticsearch + txtai] good news +economy

Score Text
0.9996 Spending Rebounds, Good News for Growth WASHINGTON (Reuters) - U.S. consumer spending rebounded sharply in July, the government said on Monday, erasing the disappointment of June and bolstering hopes that the U.S. economy has recovered from its recent soft spot.
0.9996 Spending Rebounds, Good News for Growth WASHINGTON (Reuters) - U.S. consumer spending rebounded sharply July, government data showed on Monday, erasing the disappointment of June and bolstering hopes that the U.S. economy has recovered from its recent soft spot.
0.9993 Home building surges Housing construction in August jumped to its highest level in five months, a dose of encouraging news for the economy #39;s expansion.

[Elasticsearch] bad news +economy

Score Text
0.9228 Surprise drop US wholesale prices is mixed news for economy (AFP) AFP - A surprise drop in US wholesale prices in August showed inflation apparently in check, but analysts said this was good and bad news for the US economy.
0.6405 Field Poll: Californians liking economy Bee Staff Writer. Californians are slowly growing more optimistic about the health of the economy, but a majority still feels the state is in bad economic times, according to a new Field Poll.
0.6188 ADB says China should raise rates to cool economy China should raise interest rates to cool the economy and prevent a future buildup of bad loans in the banking system, the Asian Development Bank #39;s (ADB) Bei-jing representative Bruce Murray said.

[Elasticsearch + txtai] bad news +economy

Score Text
0.9977 Aging society hits Japan #39;s economy Japan #39;s economy will be the most severely affected among industrialized nations by population aging, Kyodo News said Thursday.
0.9963 Funds: Fund Mergers Can Hurt Investors (Reuters) Reuters - Mergers and acquisitions have\played an enormous role in the U.S. economy during the past\several decades, but sometimes the results have been bad for\consumers. Similarly, consolidation in the mutual fund\business has sometimes hurt fund investors.
0.9958 Signs of listless economy persist In a sign of persistent weakness in the US economy, a widely watched measure of business activity declined in August for the third consecutive month.

Once again, while Elasticsearch usually returns quality results, it occasionally matches results that aren't semantically relevant. The power of semantic search is that it finds not only direct matches but also matches with the same meaning.
