
David Mezzetti for NeuML


All about vector quantization

txtai supports a number of approximate nearest neighbor (ANN) libraries for vector storage. This includes Faiss, Hnswlib, Annoy, NumPy and PyTorch. Custom implementations can also be added.

The default ANN for txtai is Faiss. Faiss has by far the largest array of configurable options in building an ANN index. This article will cover quantization and different approaches that are possible along with the tradeoffs.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai pytrec_eval rank-bm25 elasticsearch psutil

Preparing the datasets

First, let's download a subset of the datasets from the BEIR evaluation framework. We'll also retrieve the standard txtai benchmark script. These will be used to help judge the accuracy of quantization methods.

import os

# Get benchmarks script

# Create output directory
os.makedirs("beir", exist_ok=True)

# Remove results from any previous run
if os.path.exists("benchmarks.json"):
  os.remove("benchmarks.json")

# Download subset of BEIR datasets
datasets = ["nfcorpus", "arguana", "scifact"]
for dataset in datasets:
  url = f"{dataset}.zip"
  os.system(f"wget {url}")
  os.system(f"mv {dataset}.zip beir")
  os.system(f"unzip -d beir beir/{dataset}.zip")


Next, we'll set up the scaffolding to run evaluations.

import pandas as pd
import yaml

def writeconfig(dataset, quantize):
  sources = {"arguana": "IVF11", "nfcorpus": "IDMap", "scifact": "IVF6"}
  config = {
    "embeddings": {
      "batch": 8192,
      "encodebatch": 128,
      "faiss": {
          "sample": 0.05
      }
    }
  }

  if quantize and quantize[-1].isdigit() and int(quantize[-1]) < 4:
    # Use vector quantization for 1, 2 and 3 bit quantization
    config["embeddings"]["quantize"] = int(quantize[-1])
  elif quantize:
    # Use Faiss quantization for other forms of quantization
    config["embeddings"]["faiss"]["components"] = f"{sources[dataset]},{quantize}"

  # Derive name
  name = quantize if quantize else "baseline"

  # Derive config path and write output
  path = f"{dataset}_{name}.yml"
  with open(path, "w") as f:
    yaml.dump(config, f)

  return name, path

def benchmarks():
  # Read JSON lines data
  with open("benchmarks.json") as f:
    # Let pandas parse the open file as JSON lines
    df = pd.read_json(f, lines=True).sort_values(by=["source", "ndcg_cut_10"], ascending=[True, False])

  return df[["source", "name", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10", "disk"]].reset_index(drop=True)

# Runs benchmark evaluation
def evaluate(quantize=None):
  for dataset in datasets:
    # Build config based on requested quantization
    name, config = writeconfig(dataset, quantize)

    command = f"python -d beir -s {dataset} -m embeddings -c \"{config}\" -n \"{name}\""

    # Run the benchmark command
    os.system(command)

Establish a baseline

Before introducing vector quantization, let's establish a baseline of accuracy per source without quantization. The following table shows accuracy metrics along with the disk storage size in KB.

source    name       ndcg_cut_10  map_cut_10  recall_10  P_10     disk
arguana   baseline   0.47886      0.38931     0.76600    0.07660  13416
nfcorpus  baseline   0.30893      0.10789     0.15315    0.23622  5517
scifact   baseline   0.65273      0.60386     0.78972    0.08867  7878


The two main types of vector quantization are scalar quantization and product quantization.

Scalar quantization maps floating point data to a series of integers. For example, 8-bit quantization splits the range of floats into 256 buckets, one per possible byte value. This cuts data storage by a factor of 4 when working with 32-bit floats, since each dimension now stores 1 byte instead of 4. A more dramatic version of this is binary or 1-bit quantization, where the floating point range is cut in half and each dimension is stored as 0 or 1. The trade-off, as one would expect, is accuracy.
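
To make the bucket idea concrete, here is a minimal pure-Python sketch of 8-bit and 1-bit scalar quantization. It's an illustration only, not how txtai or Faiss implement it: the helper names (`quantize8`, `dequantize8`, `binarize`), the fixed value range and the sign-based binary rule are all assumptions made for this example.

```python
def quantize8(vector, low, high):
    """Map floats in [low, high] to integer codes 0..255 (1 byte per dimension)."""
    scale = (high - low) / 255  # distance between two adjacent codes
    codes = []
    for x in vector:
        x = min(max(x, low), high)  # clamp values outside the range
        codes.append(round((x - low) / scale))
    return bytes(codes)


def dequantize8(codes, low, high):
    """Approximate the original floats from the 8-bit codes."""
    scale = (high - low) / 255
    return [low + code * scale for code in codes]


def binarize(vector):
    """1-bit quantization: keep only the sign of each dimension, 8 dimensions per byte."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


vector = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, -0.1]
codes = quantize8(vector, -1.0, 1.0)
restored = dequantize8(codes, -1.0, 1.0)

# 8 bytes of codes instead of 32 bytes as float32
print(len(codes))

# Largest rounding error introduced by quantization
print(max(abs(a - b) for a, b in zip(vector, restored)))

# All 8 dimensions packed into a single byte
print(binarize(vector).hex())
```

The rounding error is bounded by half a step, which is why 8-bit codes lose so little, while the binary version keeps only one bit of information per dimension.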

Product quantization is similar in that the process bins a floating point range into codes, but it's more complex. This method splits each vector into subvectors, runs those subvectors through a clustering algorithm and stores the ID of the nearest cluster centroid in place of each subvector. As with scalar quantization, this can substantially reduce data storage at the expense of accuracy. The Faiss documentation links to a number of great papers with more information on this method.
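
The split-then-cluster idea can be sketched in a few lines of pure Python. This is a toy version for intuition, assuming a very small k-means and random data; the function names (`kmeans`, `train_pq`, `encode`) are hypothetical and this is not the Faiss implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10, seed=0):
    """Tiny Lloyd's k-means, just enough for illustration."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_pq(vectors, m, bits):
    """Train one codebook of 2**bits centroids for each of the m subvector slices."""
    width = len(vectors[0]) // m
    codebooks = []
    for j in range(m):
        slices = [v[j * width:(j + 1) * width] for v in vectors]
        codebooks.append(kmeans(slices, 2 ** bits))
    return codebooks


def encode(vector, codebooks):
    """Replace each subvector with the id of its nearest centroid."""
    width = len(vector) // len(codebooks)
    return [nearest(vector[j * width:(j + 1) * width], book) for j, book in enumerate(codebooks)]


rng = random.Random(42)
vectors = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(200)]

# 4 subvectors per vector, 16 centroids (4 bits) per subvector
codebooks = train_pq(vectors, m=4, bits=4)
codes = encode(vectors[0], codebooks)

# Each 8-dimension vector is now 4 small integers, each storable in 4 bits
print(codes)
```

Only the codes are stored per vector. The codebooks are shared across the whole index, which is where the storage savings come from.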

Quantization is available at the vector processing and datastore levels in txtai. In both cases, it requires an ANN backend that can support integer vectors. Currently, only Faiss, NumPy and PyTorch are supported.
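
These two levels correspond to the two branches of the `writeconfig` helper defined earlier: a top-level `quantize` integer for vector-level quantization and a Faiss `components` string for datastore-level quantization. The sketch below pulls out just that branching as a standalone function. The function name `quantization_config` is hypothetical, and the default `IVF11` prefix is borrowed from the `sources` mapping in `writeconfig`.

```python
def quantization_config(quantize, index="IVF11"):
    """Return the embeddings config fragment for a quantization setting."""
    config = {"faiss": {"sample": 0.05}}

    if quantize and quantize[-1].isdigit() and int(quantize[-1]) < 4:
        # Vector level: 1, 2 and 3 bit quantization via the quantize setting
        config["quantize"] = int(quantize[-1])
    elif quantize:
        # Datastore level: other forms of quantization via Faiss components
        config["faiss"]["components"] = f"{index},{quantize}"

    return config


print(quantization_config("SQ1"))
print(quantization_config("SQ8"))
print(quantization_config("PQ96x4fs"))
```

Note that with this rule `SQ1` is handled by the `quantize` setting, while `SQ4`, `SQ8` and the PQ settings are passed through to Faiss.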

Let's benchmark a variety of quantization methods.

# Evaluate quantization methods
for quantize in ["SQ1", "SQ4", "SQ8", "PQ48x4fs", "PQ96x4fs", "PQ192x4fs"]:
  evaluate(quantize)

# Show benchmarks
benchmarks()

source    name       ndcg_cut_10  map_cut_10  recall_10  P_10     disk
arguana   baseline   0.47886      0.38931     0.76600    0.07660  13416
arguana   SQ8        0.47781      0.38781     0.76671    0.07667  3660
arguana   SQ4        0.47771      0.38915     0.76174    0.07617  2034
arguana   PQ192x4fs  0.46322      0.37341     0.75391    0.07539  1260
arguana   PQ96x4fs   0.43744      0.35052     0.71906    0.07191  844
arguana   SQ1        0.42604      0.33997     0.70555    0.07055  795
arguana   PQ48x4fs   0.40220      0.31653     0.67852    0.06785  637
nfcorpus  SQ4        0.31028      0.10758     0.15417    0.23839  751
nfcorpus  SQ8        0.30917      0.10810     0.15327    0.23591  1433
nfcorpus  baseline   0.30893      0.10789     0.15315    0.23622  5517
nfcorpus  PQ192x4fs  0.30722      0.10678     0.15168    0.23467  433
nfcorpus  PQ96x4fs   0.29594      0.09929     0.13996    0.22693  262
nfcorpus  SQ1        0.26582      0.08579     0.12658    0.19907  237
nfcorpus  PQ48x4fs   0.25874      0.08100     0.11912    0.19567  177
scifact   SQ4        0.65299      0.60328     0.79139    0.08867  1078
scifact   baseline   0.65273      0.60386     0.78972    0.08867  7878
scifact   SQ8        0.65149      0.60193     0.78972    0.08867  2050
scifact   PQ192x4fs  0.64046      0.58823     0.78933    0.08867  622
scifact   PQ96x4fs   0.62256      0.57773     0.74861    0.08400  375
scifact   SQ1        0.58724      0.53418     0.73989    0.08267  338
scifact   PQ48x4fs   0.52292      0.46611     0.68744    0.07700  251


Each of the sources above was run through a series of scalar and product quantization settings. The accuracy vs disk space trade-off is clear to see.

A couple of key points to highlight:

  • The vector model outputs vectors with 384 dimensions
  • Scalar quantization (SQ) was evaluated for 1-bit (binary), 4 and 8 bits
  • 1-bit (binary) quantization stores vectors in binary indexes
  • For product quantization (PQ), three settings were tested, storing 48, 96 and 192 codes per vector respectively, each code 4 bits wide
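
The settings in the list above translate into rough per-vector code sizes. The sketch below is back-of-the-envelope arithmetic only: `bytes_per_vector` is a hypothetical helper, and it ignores codebooks, ids and other index overhead, so real disk sizes will be larger than these numbers suggest.

```python
def bytes_per_vector(dimensions, method):
    """Approximate code size in bytes for one vector, ignoring index overhead."""
    if method == "float32":
        return dimensions * 4
    if method.startswith("SQ"):
        # "SQ8" -> 8 bits per dimension
        bits = int(method[2:])
        return dimensions * bits // 8
    if method.startswith("PQ"):
        # "PQ96x4fs" -> 96 codes per vector, 4 bits per code
        count, bits = method[2:].replace("fs", "").split("x")
        return int(count) * int(bits) // 8
    raise ValueError(f"Unknown method: {method}")


for method in ["float32", "SQ8", "SQ4", "SQ1", "PQ192x4fs", "PQ96x4fs", "PQ48x4fs"]:
    print(method, bytes_per_vector(384, method))
```

By this estimate, SQ1 and PQ96x4fs both come to 48 bytes per vector at 384 dimensions, which lines up with their similar disk sizes in the benchmark table.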

In general, the larger the index size, the better the scores. There are a few exceptions to this, but the differences are minimal in those cases. The smallest scalar and product quantization indexes (SQ1 and PQ48x4fs) are roughly 17 to 31 times smaller than the baseline.
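
The size reduction can be read straight off the disk column of the benchmark table. Here is a small sketch that computes the ratios for the two smallest settings; the numbers are copied from the table, while the `disk` dictionary and `ratio` helper are just illustrative names.

```python
# Disk sizes in KB, copied from the benchmark table
disk = {
    "arguana": {"baseline": 13416, "SQ1": 795, "PQ48x4fs": 637},
    "nfcorpus": {"baseline": 5517, "SQ1": 237, "PQ48x4fs": 177},
    "scifact": {"baseline": 7878, "SQ1": 338, "PQ48x4fs": 251},
}


def ratio(source, name):
    """How many times smaller the quantized index is than the baseline."""
    return disk[source]["baseline"] / disk[source][name]


for source in disk:
    for name in ("SQ1", "PQ48x4fs"):
        print(f"{source} {name}: {ratio(source, name):.1f}x smaller")
```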

It's important to note that the smaller scalar methods typically need more dimensions to perform competitively. That said, even at 384 dimensions, binary quantization still does OK. txtai supports scalar quantization precisions from 1 through 8 bits.

This is just a subset of the quantization methods available in Faiss. More details can be found in the Faiss documentation.

Wrapping up

This article evaluated a variety of vector quantization methods. Quantization is an option to reduce storage costs at the expense of accuracy. Larger vector models (1024+ dimensions) will retain accuracy better with more aggressive quantization methods. As always, results will vary depending on your data.
