David Mezzetti for NeuML

Posted on Oct 31, 2023 • Edited on Apr 25 • Originally published at neuml.hashnode.dev

All about vector quantization

#ai #llm #rag #vectordatabase

txtai supports a number of approximate nearest neighbor (ANN) libraries for vector storage. This includes Faiss, Hnswlib, Annoy, NumPy and PyTorch. Custom implementations can also be added.

The default ANN for txtai is Faiss. Faiss has by far the largest array of configurable options for building an ANN index. This article covers vector quantization, the different approaches available and their tradeoffs.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai pytrec_eval rank-bm25 elasticsearch psutil

Preparing the datasets

First, let's download a subset of the datasets from the BEIR evaluation framework. We'll also retrieve the standard txtai benchmark script. These will be used to help judge the accuracy of quantization methods.

import os

# Get benchmarks script
os.system("wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py")

# Create output directory
os.makedirs("beir", exist_ok=True)

if os.path.exists("benchmarks.json"):
  os.remove("benchmarks.json")

# Download subset of BEIR datasets
datasets = ["nfcorpus", "arguana", "scifact"]
for dataset in datasets:
  url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
  os.system(f"wget {url}")
  os.system(f"mv {dataset}.zip beir")
  os.system(f"unzip -d beir beir/{dataset}.zip")

Evaluation

Next, we'll set up the scaffolding to run evaluations.

import pandas as pd
import yaml

def writeconfig(dataset, quantize):
  sources = {"arguana": "IVF11", "nfcorpus": "IDMap", "scifact": "IVF6"}
  config = {
    "embeddings": {
      "batch": 8192,
      "encodebatch": 128,
      "faiss": {
          "sample": 0.05
      }
    }
  }

  if quantize and quantize[-1].isdigit() and int(quantize[-1]) < 4:
    # Use vector quantization for 1, 2 and 3 bit quantization
    config["embeddings"]["quantize"] = int(quantize[-1])
  elif quantize:
    # Use Faiss quantization for other forms of quantization
    config["embeddings"]["faiss"]["components"] = f"{sources[dataset]},{quantize}"

  # Derive name
  name = quantize if quantize else "baseline"

  # Derive config path and write output
  path = f"{dataset}_{name}.yml"
  with open(path, "w") as f:
    yaml.dump(config, f)

  return name, path

def benchmarks():
  # Read JSON lines data directly from the file (newer pandas versions deprecate passing a raw JSON string)
  df = pd.read_json("benchmarks.json", lines=True).sort_values(by=["source", "ndcg_cut_10"], ascending=[True, False])
  return df[["source", "name", "ndcg_cut_10", "map_cut_10", "recall_10", "P_10", "disk"]].reset_index(drop=True)

# Runs benchmark evaluation
def evaluate(quantize=None):
  for dataset in datasets:
    # Build config based on requested quantization
    name, config = writeconfig(dataset, quantize)

    command = f"python benchmarks.py -d beir -s {dataset} -m embeddings -c \"{config}\" -n \"{name}\""
    os.system(command)

Establish a baseline

Before introducing vector quantization, let's establish a baseline of accuracy per source without quantization. The following table shows accuracy metrics along with the disk storage size in KB.

evaluate()
benchmarks()

source	name	ndcg_cut_10	map_cut_10	recall_10	P_10	disk
arguana	baseline	0.47886	0.38931	0.76600	0.07660	13416
nfcorpus	baseline	0.30893	0.10789	0.15315	0.23622	5517
scifact	baseline	0.65273	0.60386	0.78972	0.08867	7878

Quantization

The two main types of vector quantization are scalar quantization and product quantization.

Scalar quantization maps floating point data to a series of integers. For example, 8-bit quantization splits the range of floats into 256 buckets. This cuts data storage by a factor of 4 when working with 32-bit floats, since each dimension now stores 1 byte instead of 4. A more dramatic version of this is binary or 1-bit quantization, where the floating point range is split into two halves and each value is stored as 0 or 1. The tradeoff, as one would expect, is accuracy.
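To make the idea concrete, here is a minimal NumPy sketch of 8-bit scalar quantization. It is an illustration only: the single global min/max range and the function names `quantize8` and `dequantize8` are assumptions made for this example, not how txtai or Faiss implement it.

```python
import numpy as np

def quantize8(vectors):
    """Maps float32 values to uint8 codes using one global min/max range (an assumption)."""
    lo, hi = float(vectors.min()), float(vectors.max())
    scale = (hi - lo) / 255.0  # 256 buckets, codes 0..255
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize8(codes, lo, scale):
    """Approximately reconstructs float32 values from uint8 codes."""
    return codes.astype(np.float32) * scale + lo

# Random stand-in data with the same width as the vectors used in this article
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 384)).astype(np.float32)

codes, lo, scale = quantize8(vectors)
restored = dequantize8(codes, lo, scale)

print(vectors.nbytes // codes.nbytes)           # storage ratio between float32 and uint8
print(float(np.abs(vectors - restored).max()))  # worst-case rounding error
```

The rounding error per value is bounded by about half a bucket width, which is why 8-bit quantization tends to cost very little accuracy.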

Product quantization is similar in that it bins floating point values into codes, but it's more complex. This method splits each vector across its dimensions into subvectors and runs those subvectors through a clustering algorithm. Each subvector is then stored as the id of its nearest cluster centroid. As with scalar quantization, this can lead to a substantial reduction in data storage at the expense of accuracy. The Faiss documentation links to a number of great papers with more information on this method.
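The following is a rough NumPy sketch of product quantization with 48 subvectors and 4-bit codes (16 centroids per subvector). The tiny k-means loop and the helper names `kmeans`, `train`, `encode` and `decode` are written for this example only. Faiss uses its own, far more optimized implementation.

```python
import numpy as np

def kmeans(data, k, iterations=10, seed=0):
    """Very small k-means: returns k centroids for the rows of data."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iterations):
        # Assign each row to its nearest centroid
        distances = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned rows
        for i in range(k):
            members = data[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return centroids

def train(vectors, m, bits):
    """Learns one codebook of 2**bits centroids per subvector."""
    return [kmeans(sub, 2 ** bits) for sub in np.split(vectors, m, axis=1)]

def encode(vectors, codebooks):
    """Replaces each subvector with the id of its nearest centroid."""
    codes = []
    for sub, centroids in zip(np.split(vectors, len(codebooks), axis=1), codebooks):
        distances = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        codes.append(distances.argmin(axis=1))
    return np.stack(codes, axis=1).astype(np.uint8)

def decode(codes, codebooks):
    """Rebuilds approximate vectors by concatenating the selected centroids."""
    return np.hstack([centroids[codes[:, i]] for i, centroids in enumerate(codebooks)])

# Random stand-in data: 2000 vectors with 384 dimensions
rng = np.random.default_rng(0)
vectors = rng.standard_normal((2000, 384)).astype(np.float32)

codebooks = train(vectors, m=48, bits=4)
codes = encode(vectors, codebooks)
restored = decode(codes, codebooks)

print(codes.shape)  # one code id per subvector
print(48 * 4 // 8)  # bytes per vector once the 4-bit codes are packed
```

For simplicity, this sketch stores each 4-bit code in a full byte. A real index would pack two codes per byte.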

Quantization is available at the vector processing and datastore levels in txtai. In both cases, it requires an ANN backend that can support integer vectors. Currently, only the Faiss, NumPy and PyTorch backends are supported.
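The two levels map onto the two branches of `writeconfig` shown earlier. As a small sketch, the hypothetical helper below (`quantization_settings` is not part of txtai) returns the embeddings settings each level would use, following the same rule as the benchmark scaffolding.

```python
def quantization_settings(quantize, index="IDMap"):
    """Hypothetical helper: builds the embeddings settings for a quantization method."""
    if quantize[-1].isdigit() and int(quantize[-1]) < 4:
        # Vector processing level: 1 to 3 bit scalar quantization via the quantize setting
        return {"quantize": int(quantize[-1])}

    # Datastore level: quantization is part of the Faiss components string
    return {"faiss": {"components": f"{index},{quantize}"}}

print(quantization_settings("SQ1"))
print(quantization_settings("PQ96x4fs", index="IVF11"))
```

The `index` default and the example values are taken from the `sources` mapping in `writeconfig`. Other datasets would likely need a different Faiss index type.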

Let's benchmark a variety of quantization methods.

# Evaluate quantization methods
for quantize in ["SQ1", "SQ4", "SQ8", "PQ48x4fs", "PQ96x4fs", "PQ192x4fs"]:
  evaluate(quantize)

# Show benchmarks
benchmarks()

source	name	ndcg_cut_10	map_cut_10	recall_10	P_10	disk
arguana	baseline	0.47886	0.38931	0.76600	0.07660	13416
arguana	SQ8	0.47781	0.38781	0.76671	0.07667	3660
arguana	SQ4	0.47771	0.38915	0.76174	0.07617	2034
arguana	PQ192x4fs	0.46322	0.37341	0.75391	0.07539	1260
arguana	PQ96x4fs	0.43744	0.35052	0.71906	0.07191	844
arguana	SQ1	0.42604	0.33997	0.70555	0.07055	795
arguana	PQ48x4fs	0.40220	0.31653	0.67852	0.06785	637
nfcorpus	SQ4	0.31028	0.10758	0.15417	0.23839	751
nfcorpus	SQ8	0.30917	0.10810	0.15327	0.23591	1433
nfcorpus	baseline	0.30893	0.10789	0.15315	0.23622	5517
nfcorpus	PQ192x4fs	0.30722	0.10678	0.15168	0.23467	433
nfcorpus	PQ96x4fs	0.29594	0.09929	0.13996	0.22693	262
nfcorpus	SQ1	0.26582	0.08579	0.12658	0.19907	237
nfcorpus	PQ48x4fs	0.25874	0.08100	0.11912	0.19567	177
scifact	SQ4	0.65299	0.60328	0.79139	0.08867	1078
scifact	baseline	0.65273	0.60386	0.78972	0.08867	7878
scifact	SQ8	0.65149	0.60193	0.78972	0.08867	2050
scifact	PQ192x4fs	0.64046	0.58823	0.78933	0.08867	622
scifact	PQ96x4fs	0.62256	0.57773	0.74861	0.08400	375
scifact	SQ1	0.58724	0.53418	0.73989	0.08267	338
scifact	PQ48x4fs	0.52292	0.46611	0.68744	0.07700	251

Review

Each of the sources above was run through a series of scalar and product quantization settings. The accuracy vs disk space tradeoff is clear to see.

A couple of key points to highlight:

The vector model outputs vectors with 384 dimensions
Scalar quantization (SQ) was evaluated for 1-bit (binary), 4 and 8 bits
1-bit (binary) quantization stores vectors in binary indexes
For product quantization (PQ), three settings were tested: 48, 96 and 192 codes per vector, all using 4-bit codes
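As a back-of-the-envelope check on the points above, the raw storage per vector can be worked out from the bit widths. This sketch assumes the baseline stores 32-bit floats and ignores index overhead such as ids, IVF centroids and PQ codebooks, so the numbers will not match the disk column exactly.

```python
DIMENSIONS = 384

def scalar_bytes(bits, dimensions=DIMENSIONS):
    """Bytes per vector when every dimension is stored with the given number of bits."""
    return dimensions * bits // 8

def product_bytes(codes, bits=4):
    """Bytes per vector when the vector is stored as a fixed number of codes."""
    return codes * bits // 8

sizes = {
    "baseline": scalar_bytes(32),
    "SQ8": scalar_bytes(8),
    "SQ4": scalar_bytes(4),
    "SQ1": scalar_bytes(1),
    "PQ192x4fs": product_bytes(192),
    "PQ96x4fs": product_bytes(96),
    "PQ48x4fs": product_bytes(48),
}

for name, size in sizes.items():
    print(f"{name:10} {size:5} bytes per vector")
```

Under these assumptions, SQ1 and PQ96x4fs both come out at 48 bytes per vector, which lines up with the two landing close together in the disk column for every dataset.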

In general, the larger the index size, the better the scores. There are a few exceptions to this but the differences are minimal in those cases. The smallest scalar and product quantization indexes (SQ1 and PQ48x4fs) are roughly 17 to 31 times smaller than the baseline.
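The size reduction can be checked against the disk column of the benchmark table. The short sketch below copies the baseline, SQ1 and PQ48x4fs sizes (in KB) from that table and computes how many times smaller each quantized index is.

```python
# Disk sizes in KB copied from the benchmark table above
disk = {
    "arguana": {"baseline": 13416, "SQ1": 795, "PQ48x4fs": 637},
    "nfcorpus": {"baseline": 5517, "SQ1": 237, "PQ48x4fs": 177},
    "scifact": {"baseline": 7878, "SQ1": 338, "PQ48x4fs": 251},
}

# Ratio of baseline size to quantized size, per source and method
ratios = {
    source: {
        name: round(sizes["baseline"] / size, 1)
        for name, size in sizes.items()
        if name != "baseline"
    }
    for source, sizes in disk.items()
}

for source, values in ratios.items():
    print(source, values)
```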

It's important to note that the smaller scalar methods typically need a larger number of dimensions to perform competitively. With that being said, even at 384 dimensions, binary quantization still does OK. txtai supports scalar quantization precisions from 1 through 8 bits.
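For intuition on why binary quantization can still work, here is a small NumPy sketch that keeps only the sign of each dimension and compares vectors by Hamming distance. Thresholding at zero and the helper names `binarize` and `hamming` are assumptions for illustration, not necessarily what txtai or Faiss do internally.

```python
import numpy as np

def binarize(vectors):
    """Keeps only the sign of each dimension and packs 8 dimensions into each byte."""
    return np.packbits(vectors > 0, axis=1)

def hamming(query, codes):
    """Counts differing bits between one packed query and every packed row."""
    return np.unpackbits(query ^ codes, axis=1).sum(axis=1)

# Random stand-in data: 1000 vectors with 384 dimensions
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 384)).astype(np.float32)

codes = binarize(vectors)

# A slightly perturbed copy of row 42 should still be closest to row 42
noise = 0.05 * rng.standard_normal((1, 384)).astype(np.float32)
query = binarize(vectors[42:43] + noise)
distances = hamming(query, codes)

print(codes.shape)              # 384 bits packed into bytes, one row per vector
print(int(distances.argmin()))  # index of the nearest stored vector
```

With more dimensions there are more bits to tell vectors apart, which is one way to think about why wider models tend to tolerate binary quantization better.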

The methods benchmarked here are just a subset of the quantization methods available in Faiss. More details can be found in the Faiss documentation.

Wrapping up

This article evaluated a variety of vector quantization methods. Quantization is an option to reduce storage costs at the expense of accuracy. Larger vector models (1024+ dimensions) tend to retain accuracy better with more aggressive quantization methods. As always, results will vary depending on your data.
