My curiosity about databases and their internals led me to look under the hood of chromadb and understand what it was doing.
But what is a vector in the first place? A vector is a high-dimensional representation of a data point in a vector space. A vector can have anywhere from a few dimensions to a very large number of them, depending on the complexity of the data. Because similar data points end up close to each other in that space, vectors make it easy to measure the similarity between pieces of data.
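To make "similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare two vectors. The three vectors and their names are made up for illustration and do not come from any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, chosen by hand for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: the vectors point the same way
print(cosine_similarity(cat, car))     # close to 0: the vectors are nearly orthogonal
```

A value near 1 means the two vectors point in almost the same direction, which is what "similar" means under this measure.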
Vector databases give us an easy way to store and retrieve vectors. But how do they work?
Vector databases work by creating an index of all the vectors in the database. This index is organised around how close the vectors are to one another. When a query comes in, the database searches the index for the vectors most similar to the query vector and returns them as results. Because the search does not have to compare the query against every stored vector, retrieval stays fast and efficient even in large and complex databases.
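For contrast, this is roughly what a search looks like without an index: a brute-force scan that measures the distance from the query to every stored vector. It is a sketch of the baseline an index is meant to avoid, not of how any particular database works. The function names and the sample data are my own.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of the same length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(query, vectors, k):
    """Return the ids of the k stored vectors closest to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with the number of vectors in the collection.
    """
    return heapq.nsmallest(
        k, vectors, key=lambda vec_id: squared_l2(query, vectors[vec_id])
    )


# Hypothetical 2-dimensional vectors keyed by id.
vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 1.0],
    "c": [5.0, 5.0],
    "d": [0.9, 1.2],
}

nearest = brute_force_knn([1.0, 1.0], vectors, k=2)
print(nearest)  # ids of the two closest vectors, nearest first
```

An approximate nearest neighbour index gives up the guarantee of always finding the exact closest vectors in exchange for looking at far fewer of them per query.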
This is what chromadb is doing, as per my reading of the code:
- For the in-memory version, chromadb uses SQLite to store vectors. SQLite is a file-based relational database that does not have vector support out of the box.
- When a document is being added to a collection, chromadb uses a default embedding function to create the vectors for it.
- For each collection, an index is created using the hnswlib Python library (an implementation of the HNSW approximate nearest neighbour search algorithm).
- When a text string is queried on the collection, chromadb creates a vector for the string using the same embedding function as before. Then it searches the index for the k nearest neighbours, where k can be specified in the query.
- The index contains the UUIDs of the documents, and these are used to look up and return the actual matching text strings.
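The flow in the list above can be sketched with nothing but the Python standard library. This is a toy illustration under heavy assumptions, not chromadb's code: `ToyCollection` and `toy_embed` are hypothetical names, the "embedding" is just a count of letters, a plain dictionary scanned by brute force stands in for the hnswlib index, and the table layout is invented for the example.

```python
import heapq
import sqlite3
import string
import uuid


def toy_embed(text):
    """Hypothetical embedding: how often each letter a-z occurs (26 dimensions)."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in string.ascii_lowercase]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyCollection:
    """Toy stand-in for a collection; not how chromadb is implemented."""

    def __init__(self):
        # In-memory SQLite database holding the document text by UUID.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
        # UUID -> vector. A real system would use an HNSW index here.
        self.index = {}

    def add(self, text):
        doc_id = str(uuid.uuid4())
        self.db.execute(
            "INSERT INTO documents (id, text) VALUES (?, ?)", (doc_id, text)
        )
        self.index[doc_id] = toy_embed(text)
        return doc_id

    def query(self, text, k=1):
        # Embed the query with the same function used when adding documents.
        query_vector = toy_embed(text)
        # Brute-force search over the toy index for the k nearest UUIDs.
        nearest_ids = heapq.nsmallest(
            k, self.index, key=lambda i: squared_l2(query_vector, self.index[i])
        )
        # Resolve the UUIDs back to the stored text.
        results = []
        for doc_id in nearest_ids:
            row = self.db.execute(
                "SELECT text FROM documents WHERE id = ?", (doc_id,)
            ).fetchone()
            results.append(row[0])
        return results


collection = ToyCollection()
for doc in [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a dog barked",
]:
    collection.add(doc)

matches = collection.query("the cat sat on the mat", k=2)
print(matches)
```

The key detail the sketch preserves is that the same embedding function is used for both adding and querying, and that the search returns identifiers which are then resolved to text in a separate lookup.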
It is interesting to see that chromadb has utilised existing technologies to create a vector database. I am interested to read more about the vector search algorithms.