Vector Databases for Data-Centric AI (Part 2)

#programming #nlp #machinelearning #opensource

Building applications with QDrant, Hugging Face and Streamlit.

This article also lives here:

https://medium.com/@george.pearse

QDrant have created an excellent vector database, and I suspect ML Engineers are only beginning to scratch the surface of its potential applications.

Vector databases support hybrid similarity search and provide a CRUD API for updating your datasets. They are a significant improvement on first-wave Approximate Nearest Neighbour tools like Faiss and Annoy, which enable very high-performance in-memory vector search but offer little support for update flows or metadata filters.

Hybrid search is vector or "semantic" search combined with attribute filtering.
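
To make the idea concrete, here is a minimal pure-Python sketch of hybrid search: filter on a payload attribute first, then rank what is left by cosine similarity. The toy points, the `label` attribute and the `hybrid_search` helper are all made up for illustration, and a real vector database would use an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hybrid_search(points, query_vector, must_match, top=3):
    """Keep points whose payload matches every filter, then rank by similarity."""
    candidates = [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in must_match.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda p: cosine_similarity(p["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top]


# Toy dataset (illustrative only): 2-d vectors with a single payload attribute.
POINTS = [
    {"id": 0, "vector": [1.0, 0.0], "payload": {"label": "sport"}},
    {"id": 1, "vector": [0.9, 0.1], "payload": {"label": "business"}},
    {"id": 2, "vector": [0.0, 1.0], "payload": {"label": "sport"}},
    {"id": 3, "vector": [0.8, 0.2], "payload": {"label": "sport"}},
]

results = hybrid_search(POINTS, query_vector=[1.0, 0.0], must_match={"label": "sport"}, top=2)
print([p["id"] for p in results])  # [0, 3]
```

Point 1 is the second-closest vector overall, but the attribute filter removes it before ranking, which is exactly the behaviour the in-memory ANN libraries leave you to build yourself.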

The semantic search implemented by QDrant takes a list of positive examples and a list of negative examples. Each positive datapoint is an example of what you want the responses to be similar to; each negative datapoint is an example of what you want the responses to be different from.

This allows you to build up arbitrarily complex decision boundaries within your feature space.
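
As a rough intuition for how positives and negatives can steer a search, here is a small sketch. It is an assumption on my part, not a description of QDrant's internals: it averages the positives, pushes that average away from the average negative, and ranks candidates by cosine similarity to the result. The vectors and the `recommend_target` helper are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def recommend_target(positives, negatives):
    """Illustrative heuristic: avg_positive + (avg_positive - avg_negative)."""
    avg_pos = mean_vector(positives)
    if not negatives:
        return avg_pos
    avg_neg = mean_vector(negatives)
    return [2 * p - n for p, n in zip(avg_pos, avg_neg)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up 2-d embeddings, purely for illustration.
positives = [[1.0, 0.0], [0.8, 0.2]]
negatives = [[0.0, 1.0]]
candidates = {"a": [1.0, 0.1], "b": [0.5, 0.5], "c": [0.1, 1.0]}

target = recommend_target(positives, negatives)
ranked = sorted(candidates, key=lambda name: cosine(candidates[name], target), reverse=True)
print(ranked)  # ['a', 'b', 'c']
```

Every extra positive or negative moves the target, so the region of feature space you pull results from changes shape as you label.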

An example QDrant query:

{
  "positive": [0],
  "negative": [1],
  "top": 10,
  "with_payload": true
}

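If you would rather build that request body from Python, a minimal sketch using only the standard library could look like this. The field names mirror the query in this article (newer QDrant versions may name them differently). The `send_query` helper assumes a QDrant instance on its default local port and a collection called `ag_news`, both of which are my assumptions rather than anything the app requires. It is defined but not called here.

```python
import json
import urllib.request


def build_query(positive, negative, top=10, with_payload=True):
    """Build the request body: lists of point IDs to move towards or away from."""
    return {
        "positive": list(positive),
        "negative": list(negative),
        "top": top,
        "with_payload": with_payload,
    }


def send_query(body, base_url="http://localhost:6333", collection="ag_news"):
    """POST the body to the recommend endpoint (URL and collection are assumptions)."""
    url = f"{base_url}/collections/{collection}/points/recommend"
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_query(positive=[0], negative=[1])
print(json.dumps(query))
# {"positive": [0], "negative": [1], "top": 10, "with_payload": true}
```
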
This enables the interactive definition of classes:

Start with a single positive datapoint.
Look through the responses.
Add to the list of positives those responses you consider similar.
Add to the list of negatives those responses you consider different.
Run that new query, and repeat.
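
The loop above can be sketched as a small piece of state that accumulates your judgements and emits the next query. This is a hypothetical helper, not the code from the Streamlit app. The point IDs are invented, and the query shape simply mirrors the example in this article.

```python
from dataclasses import dataclass, field


@dataclass
class LabellingSession:
    """Tracks which point IDs have been judged similar or different so far."""

    positives: list = field(default_factory=list)
    negatives: list = field(default_factory=list)

    def label(self, point_id, similar):
        """Record a judgement; a later judgement overrides an earlier one."""
        target, other = (
            (self.positives, self.negatives) if similar else (self.negatives, self.positives)
        )
        if point_id in other:
            other.remove(point_id)
        if point_id not in target:
            target.append(point_id)

    def to_query(self, top=10):
        """Build the next query from everything labelled so far."""
        return {
            "positive": list(self.positives),
            "negative": list(self.negatives),
            "top": top,
            "with_payload": True,
        }


# Start with a single positive datapoint.
session = LabellingSession(positives=[0])

# Pretend the first round of responses contained points 4, 7 and 9.
session.label(4, similar=True)
session.label(7, similar=False)
session.label(9, similar=True)
session.label(9, similar=False)  # changed my mind about this one

print(session.to_query())
# {'positive': [0, 4], 'negative': [7, 9], 'top': 10, 'with_payload': True}
```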

After a batch of labelling, you would also be in a position to improve your embeddings and continue the process with a better-separated dataset (I'll be experimenting more with this next).

I've built a mini Streamlit application to support this flow and enable you to save each query once complete along with a CSV containing its results.
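
For anyone who wants the same save step outside of Streamlit, here is a minimal standard-library sketch that writes a query to JSON and its results to CSV. The file layout, the column names (`id`, `score`, `text`) and the example rows are assumptions made up for illustration, not the format the app actually uses.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_query_and_results(name, query, results, out_dir):
    """Write <name>.json (the query) and <name>.csv (its results) into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    query_path = out_dir / f"{name}.json"
    query_path.write_text(json.dumps(query, indent=2), encoding="utf-8")

    csv_path = out_dir / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "score", "text"])
        writer.writeheader()
        writer.writerows(results)

    return query_path, csv_path


# Invented example data, only to show the shape of the files.
example_query = {"positive": [0, 4], "negative": [7], "top": 10, "with_payload": True}
example_results = [
    {"id": 12, "score": 0.91, "text": "Example headline one"},
    {"id": 31, "score": 0.88, "text": "Example headline two"},
]

query_path, csv_path = save_query_and_results(
    "my_class", example_query, example_results, tempfile.mkdtemp()
)
print(query_path.name, csv_path.name)  # my_class.json my_class.csv
```

Keeping the query next to its results means a class definition can be re-run later against a dataset with updated embeddings.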

QDrant-NLP. A short demo

How to Run

Just clone the QDrant-NLP repo and run:

docker-compose up

I would like to increase the number of datasets this can be tried on, either with GPU-backed lambda functions or by saving many example datasets to S3. So far I've only made a 6K subset of ag_news (hosted on Hugging Face Datasets) available.

The embeddings were generated via Hugging Face.

Where to Use

Shout out to both Kern.AI (an excellent open-source NLP labelling tool)
https://github.com/code-kern-ai/refinery
and Voxel51 (an excellent open-source Computer Vision analysis tool)
https://github.com/voxel51/fiftyone
for being early adopters of the technology in their platforms, though I don't believe either has yet made use of all the value it can provide.
