
David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Building an efficient sparse keyword index in Python

Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.

While semantic search adds amazing capabilities, sparse keyword indexes can still add value. There may be cases where finding an exact match is important or we just want a fast index to quickly do an initial scan of a dataset.

Unfortunately, there aren't many great options for a local, Python-based keyword index library. Most of the available options don't scale, are highly inefficient or are designed only for simple situations.

Given that Python is an interpreted language, it often gets a bad rap from a performance standpoint. In some cases, that's justified: Python can be memory hungry and has a global interpreter lock (GIL) that restricts execution to a single thread. But it is possible to write performant Python that is on par with other languages.

This article will explore how to build an efficient sparse keyword index in Python and compare the results with other approaches.

Install dependencies

Install txtai and all dependencies.

# Install txtai and benchmark dependencies
pip install txtai pytrec_eval rank-bm25 elasticsearch==7.10.1

Introducing the problem

At a high level, keyword indexes work by tokenizing text into lists of tokens per document. These tokens are aggregated into frequencies per document and stored in term frequency sparse arrays.
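As a quick illustration, here's a toy example of tokenizing documents and computing per-document token frequencies.

from collections import Counter

# Tokenize two documents and count token frequencies per document
docs = ["the quick brown fox", "the lazy brown dog"]
frequencies = [Counter(doc.split()) for doc in docs]
print(frequencies[0])  # Counter({'the': 1, 'quick': 1, 'brown': 1, 'fox': 1})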

The term frequency arrays are sparse given that they only store a frequency when the token exists in a document. For example, if a token exists in 1 of 1000 documents, the sparse array only has a single entry. A dense array stores 1000 entries, all zeros except for one.

One simple approach to storing a term frequency sparse array in Python is keeping a dictionary of {id: frequency} mappings per token. The problem with this approach is that Python has significant per-object overhead.
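For example, a toy index with this layout might look like this (the tokens and ids are made up for illustration).

# One {id: frequency} dictionary per token
index = {
    "python": {0: 2, 7: 1},  # "python" appears twice in doc 0, once in doc 7
    "sparse": {0: 1}         # "sparse" appears once in doc 0
}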

Let's inspect the size used for a single number.

import sys

a = 100
sys.getsizeof(a)
28

28 bytes for a single integer. Compared to a native int/long, which takes 4 or 8 bytes, this is quite wasteful. Imagine having thousands of id: frequency mappings; memory usage will grow fast.

Let's demonstrate. The code below runs a self-contained Python process that creates a list of 10 million numbers.

Running it as a separate process helps calculate more accurate memory usage statistics.

import psutil

results = []
for x in range(int(1e7)):
  results.append(x)

print(f"MEMORY USAGE = {psutil.Process().memory_info().rss / (1024 * 1024)} MB")
MEMORY USAGE = 394.640625 MB

Approximately 395 MB of memory is used for this list. That seems high.

Efficient numeric arrays in Python

Fortunately, Python has a built-in array module for building efficient arrays of numeric values. This module stores every element as the same native C type.

Let's try doing that with a long long type, which takes 8 bytes.

from array import array

import psutil

results = array("q")
for x in range(int(1e7)):
  results.append(x)

print(f"MEMORY USAGE = {psutil.Process().memory_info().rss / (1024 * 1024)} MB")
MEMORY USAGE = 88.54296875 MB

As we can see, memory usage went from 395 MB to 89 MB. That's a 4x reduction, in line with the earlier calculation of 28 bytes per number vs 8 bytes per number.
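The array module's itemsize attribute confirms the fixed per-element storage.

from array import array
import sys

# Each "q" (signed long long) element is stored as a native 8-byte value
print(array("q").itemsize)  # 8

# Compared to 28 bytes for a small Python int object
print(sys.getsizeof(100))   # 28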

Efficient processing of numeric data

Large computations in pure Python can also be painfully slow. Luckily, there is a robust landscape of options for numeric processing. The most popular framework is NumPy. There are also PyTorch and other GPU-based tensor processing frameworks.

Below is a simple example that sorts an array in Python vs NumPy to demonstrate.

import random
import time

data = [random.randint(1, 500) for x in range(1000000)]

start = time.time()
sorted(data, reverse=True)
print(time.time() - start)
0.33922290802001953
import numpy as np

data = np.array(data)

start = time.time()
np.sort(data)[::-1]
print(time.time() - start)
0.10296249389648438

As we can see, sorting an array in NumPy is significantly faster. It might not seem like a lot, but this adds up when run in bulk.
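In search workloads, typically only the best scoring results matter. Here's a minimal sketch of a related NumPy trick: np.argpartition pulls out the top k values without fully sorting the array.

import numpy as np

data = np.random.randint(1, 500, 1000000)

# Find the indices of the top 10 values without a full sort
indices = np.argpartition(data, -10)[-10:]

# Then sort only those 10, descending
indices = indices[np.argsort(-data[indices])]
print(data[indices])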

Sparse keyword indexes in txtai

Now that we've discussed the key performance concepts, let's talk about how to apply this to building sparse keyword indexes.

Going back to the original approach for a term frequency sparse array, we saw that using the Python array module is more efficient. In txtai, this method is used to build term frequency arrays for each token. This results in near native speed and memory usage.
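To make that concrete, below is a minimal sketch of array-backed term frequency postings. It assumes a {token: (ids, frequencies)} layout for illustration; txtai's actual implementation differs, see the GitHub link below.

from array import array
from collections import Counter

postings = {}

def add(docid, tokens):
  # Append this document's id and frequency to each token's parallel arrays
  for token, freq in Counter(tokens).items():
    ids, freqs = postings.setdefault(token, (array("q"), array("q")))
    ids.append(docid)
    freqs.append(freq)

add(0, ["sparse", "keyword", "index", "keyword"])
add(1, ["semantic", "search"])
print(postings["keyword"])  # (array('q', [0]), array('q', [2]))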

The search method uses a number of NumPy methods to efficiently calculate query term matches. Each query is tokenized and the term frequency arrays for those tokens are retrieved to calculate query scores. These NumPy methods are all written in C and often release the GIL. So once again, near native speed and the ability to use multithreading.
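As an illustration of why this is fast, here's a minimal sketch of vectorized BM25-style scoring with NumPy. All names and values are hypothetical; this is not txtai's actual code.

import numpy as np

k1, b = 1.2, 0.75                          # standard BM25 parameters
idf = 2.1                                  # precomputed idf for one query token
avgdl = 250.0                              # average document length
ids = np.array([0, 5, 42])                 # documents containing the token
freqs = np.array([3.0, 1.0, 7.0])          # term frequencies in those documents
lengths = np.array([300.0, 120.0, 500.0])  # lengths of those documents

# Score every matching document in a single vectorized operation
scores = idf * (freqs * (k1 + 1)) / (freqs + k1 * (1 - b + b * lengths / avgdl))
print(dict(zip(ids.tolist(), scores.round(4).tolist())))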

Read the full implementation on GitHub to learn more.

Evaluating performance

First, a review of the landscape. As mentioned in the introduction, there aren't many good options. Apache Lucene is by far the best traditional search index from a speed, performance and functionality standpoint. It's the base for Elasticsearch/OpenSearch and many other projects. But it requires Java.

Here are the options we'll explore.

  • Rank-BM25 project, the top result when searching for python bm25.

  • SQLite FTS5 extension. This extension builds a sparse keyword index right in SQLite.

We'll use a subset of the BEIR datasets. We'll also use a benchmarks script from the txtai project, which has methods to work with the BEIR data.

A couple of important caveats on the benchmarks script:

  • For the SQLite FTS implementation, each query token is joined together with an OR clause, since SQLite FTS implicitly joins tokens with AND by default. By contrast, Lucene's default operator is OR (see the sketch after this list).
  • The Elasticsearch implementation uses 7.x as it's simpler to instantiate in a notebook.
  • All methods except Elasticsearch use txtai's unicode tokenizer to tokenize text for consistency.
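To make the first caveat concrete, here's a minimal sketch of an OR-joined FTS5 query, assuming your SQLite build includes the FTS5 extension. It's an illustration, not the benchmarks script itself.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(text)")
con.executemany("INSERT INTO docs VALUES (?)", [("sparse keyword index",), ("semantic search",)])

# Join query tokens with OR, mirroring Lucene's default operator
query = " OR ".join(["sparse", "search"])
for rowid, score in con.execute("SELECT rowid, bm25(docs) FROM docs WHERE docs MATCH ?", (query,)):
  print(rowid, score)

With that covered, the code below downloads the benchmarks script and a subset of the BEIR datasets.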
import os

# Get benchmarks script
os.system("wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py")

# Create output directory
os.makedirs("beir", exist_ok=True)

# Download subset of BEIR datasets
datasets = ["trec-covid", "nfcorpus", "webis-touche2020", "scidocs", "scifact"]
for dataset in datasets:
  url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
  os.system(f"wget {url}")
  os.system(f"mv {dataset}.zip beir")
  os.system(f"unzip -d beir beir/{dataset}.zip")

# Remove existing benchmark data
if os.path.exists("benchmarks.json"):
  os.remove("benchmarks.json")

Now let's run the benchmarks.

# Remove existing benchmark data
if os.path.exists("benchmarks.json"):
  os.remove("benchmarks.json")

# Runs benchmark evaluation
def evaluate(method):
  for dataset in datasets:
    command = f"python benchmarks.py beir {dataset} {method}"
    print(command)
    os.system(command)

# Calculate benchmarks
for method in ["bm25", "rank", "sqlite"]:
  evaluate(method)
import json
import pandas as pd

def benchmarks():
  # Read JSON lines data
  with open("benchmarks.json") as f:
    data = f.read()

  df = pd.read_json(data, lines=True).sort_values(by=["source", "search"])
  return df[["source", "method", "index", "memory", "search", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10"]].reset_index(drop=True)

# Load benchmarks dataframe
df = benchmarks()
df[df.source == "trec-covid"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
trec-covid bm25 101.96 997 0.28 0.58119 0.01247 0.01545 0.618
trec-covid sqlite 60.16 880 23.09 0.56778 0.01190 0.01519 0.610
trec-covid rank 61.75 3245 75.49 0.57773 0.01210 0.01550 0.632
df[df.source == "nfcorpus"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
nfcorpus bm25 2.64 648 1.08 0.30639 0.11728 0.14891 0.21734
nfcorpus sqlite 1.50 630 12.73 0.30695 0.11785 0.14871 0.21641
nfcorpus rank 2.75 700 23.78 0.30692 0.11711 0.15320 0.21889
df[df.source == "webis-touche2020"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
webis-touche2020 bm25 374.66 1137 0.37 0.36920 0.14588 0.22736 0.34694
webis-touche2020 sqlite 220.46 1416 34.61 0.37194 0.14812 0.22890 0.35102
webis-touche2020 rank 224.07 10347 81.22 0.39861 0.16492 0.23770 0.36122
df[df.source == "scidocs"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
scidocs bm25 17.95 717 1.64 0.15063 0.08756 0.15637 0.0772
scidocs sqlite 17.85 670 56.64 0.15156 0.08822 0.15717 0.0776
scidocs rank 13.11 1056 162.99 0.14932 0.08670 0.15408 0.0761
df[df.source == "scifact"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
scifact bm25 5.51 653 1.07 0.66324 0.61764 0.78761 0.087
scifact sqlite 1.85 631 20.28 0.66630 0.61966 0.79494 0.088
scifact rank 1.85 724 42.22 0.65618 0.61204 0.77400 0.085

The sections above show the metrics per source and method.

The table headers list the source (dataset), method, index time (s), memory usage (MB), search time (s) and the NDCG@10/MAP@10/RECALL@10/P@10 accuracy metrics. The tables are sorted by search time.

As we can see, txtai's implementation has the fastest search times across the board, though it is often slower when it comes to index time. The accuracy metrics vary slightly but are all about the same per method.

Memory usage stands out. SQLite and txtai both have around the same usage per source. Rank-BM25 memory usage can get out of hand fast. For example, webis-touche2020, which is only ~400K records, uses over 10 GB of memory compared to roughly 1.1 - 1.4 GB for the other implementations.

Compare with Elasticsearch

Now that we've reviewed methods to build keyword indexes in Python, let's see how txtai's sparse keyword index compares to Elasticsearch.

We'll spin up an instance inline in the notebook and run the same evaluations.

# Download and extract elasticsearch
os.system("wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("chown -R daemon:daemon elasticsearch-7.10.1")
from subprocess import Popen, PIPE, STDOUT

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
# Add benchmark evaluations for Elasticsearch
evaluate("es")

# Reload benchmarks dataframe
df = benchmarks()
df[df.source == "trec-covid"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
trec-covid bm25 101.96 997 0.28 0.58119 0.01247 0.01545 0.618
trec-covid es 71.24 636 2.09 0.59215 0.01261 0.01590 0.636
trec-covid sqlite 60.16 880 23.09 0.56778 0.01190 0.01519 0.610
trec-covid rank 61.75 3245 75.49 0.57773 0.01210 0.01550 0.632
df[df.source == "nfcorpus"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
nfcorpus bm25 2.64 648 1.08 0.30639 0.11728 0.14891 0.21734
nfcorpus es 3.95 627 11.47 0.30676 0.11761 0.14894 0.21610
nfcorpus sqlite 1.50 630 12.73 0.30695 0.11785 0.14871 0.21641
nfcorpus rank 2.75 700 23.78 0.30692 0.11711 0.15320 0.21889
df[df.source == "webis-touche2020"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
webis-touche2020 bm25 374.66 1137 0.37 0.36920 0.14588 0.22736 0.34694
webis-touche2020 es 168.28 629 0.62 0.37519 0.14819 0.22889 0.35102
webis-touche2020 sqlite 220.46 1416 34.61 0.37194 0.14812 0.22890 0.35102
webis-touche2020 rank 224.07 10347 81.22 0.39861 0.16492 0.23770 0.36122
df[df.source == "scidocs"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
scidocs bm25 17.95 717 1.64 0.15063 0.08756 0.15637 0.0772
scidocs es 11.07 632 10.25 0.14924 0.08671 0.15497 0.0765
scidocs sqlite 17.85 670 56.64 0.15156 0.08822 0.15717 0.0776
scidocs rank 13.11 1056 162.99 0.14932 0.08670 0.15408 0.0761
df[df.source == "scifact"].reset_index(drop=True)
source method index memory search ndcg_cut_10 map_cut_10 recall_10 P_10
scifact bm25 5.51 653 1.07 0.66324 0.61764 0.78761 0.08700
scifact es 2.90 625 9.62 0.66058 0.61518 0.78428 0.08667
scifact sqlite 1.85 631 20.28 0.66630 0.61966 0.79494 0.08800
scifact rank 1.85 724 42.22 0.65618 0.61204 0.77400 0.08500

Once again, txtai's implementation compares well with Elasticsearch. The accuracy metrics vary but are all about the same.

It's important to note that in internal testing with solid state storage, Elasticsearch and txtai search speeds are about the same. The slightly slower Elasticsearch times here are a product of running in a Google Colab environment.

Wrapping up

This notebook showed how to build an efficient sparse keyword index in Python. The benchmarks show that txtai provides a strong implementation both from an accuracy and speed standpoint, on par with Apache Lucene.

This keyword index can be used as a standalone index in Python or in combination with dense vector indexes to form a hybrid index.
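As a closing illustration, here's a minimal sketch of both setups, assuming txtai's keyword and hybrid configuration settings and default models; check the txtai documentation for the options available in your version.

from txtai.embeddings import Embeddings

data = ["sparse keyword index in python", "semantic search with embeddings"]

# Standalone sparse keyword index
keyword = Embeddings(keyword=True, content=True)
keyword.index((uid, text, None) for uid, text in enumerate(data))
print(keyword.search("keyword index", 1))

# Hybrid index combining sparse keyword and dense vector scores
hybrid = Embeddings(hybrid=True, content=True)
hybrid.index((uid, text, None) for uid, text in enumerate(data))
print(hybrid.search("keyword index", 1))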
