alakkadshaw

Posted on Aug 23, 2023 • Originally published at deadsimplechat.com

Implementing Vector Database for AI

#beginners #programming #ai #database

What are vector databases
Why are vector databases important to AI
Core concepts of a vector database
Factors to consider when choosing a vector database
Popular Vector Databases for your consideration
Step by Step guide to implementing a Vector database
Step 1: Installing Milvus
Step 2: Creating a Milvus Client
Step 3: Create a collection
Step 4: Inserting data into the Collection
Step 5: Create an Index
Step 6: Sample searching for similar vectors
Bonus: How to prepare your data for Vector database
Conclusion

What are vector databases

Like any database, a vector database stores and retrieves data, but it is designed specifically for high-dimensional vector data: mathematical representations of the features or attributes of an object.

The core operation of a vector database is similarity search: given a query vector, find the stored vectors that are most similar to it.

Similarity search is made fast by specialized indexing algorithms that avoid comparing the query against every stored vector, which reduces time and space costs compared with running the same search on a traditional relational (SQL) database.
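To make the idea concrete, here is a minimal sketch of the exhaustive (brute-force) search that a vector index is meant to avoid. The helper names and the toy data are made up for illustration; a real vector database does not work this way internally, it only has to return comparable results faster.

```javascript
// Euclidean (L2) distance between two equal-length vectors.
function l2Distance(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Exhaustive search: score every stored vector, sort, keep the k closest.
function bruteForceSearch(items, query, k) {
  return items
    .map((item) => ({ id: item.id, distance: l2Distance(item.vector, query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Hypothetical toy data: three 2-dimensional vectors.
const items = [
  { id: "a", vector: [0, 0] },
  { id: "b", vector: [1, 1] },
  { id: "c", vector: [5, 5] },
];

const nearest = bruteForceSearch(items, [0.9, 1.2], 2);
console.log(nearest.map((r) => r.id)); // ids of the two closest items: b, then a
```

Every query here touches every stored vector, so the cost grows linearly with the size of the collection. That is fine for a few thousand vectors and becomes the bottleneck at millions.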

Why Vector Databases are important in AI

Vector databases matter in AI because workloads such as large-scale multimedia processing, natural language processing and neural networks all produce and consume vector data.
Vector databases enable resource-efficient and time-efficient storage and retrieval of high-dimensional vector data.
High-dimensional vector data includes feature vectors and embeddings, which capture complex patterns and relationships in the data.
AI applications rely on operations such as nearest neighbor search, clustering and classification. These operations are resource-intensive on large datasets, which is why specialized databases exist for them.
Vector databases provide fast and accurate similarity search, which improves the performance and scalability of AI apps.

Core Concepts of a Vector Database

As we have already seen, a vector database handles high-dimensional vector data. Its core concepts are the following

Indexing: Vector databases use k-d trees, ball trees and similar index structures to speed up high-dimensional searches such as nearest neighbor or clustering queries
Scalability: These databases are designed to handle huge amounts of data, and they can be scaled out across multiple machines running in parallel
Distance metrics: Vector databases use measures such as cosine similarity, Euclidean distance and Manhattan distance to determine how close vectors are to one another, which also lets them group similar vectors together
Performance Optimization: Low query latency and efficient memory usage are critical for AI applications, and vector databases are designed with both in mind
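The three distance metrics named above can be sketched in a few lines of plain JavaScript. The function names are only illustrative, and the sketch assumes both vectors have the same length and are not all zeros (cosine similarity is undefined for a zero vector).

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Euclidean (L2) distance: straight-line distance between the points.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Manhattan (L1) distance: sum of absolute coordinate differences.
function manhattanDistance(a, b) {
  return a.reduce((sum, x, i) => sum + Math.abs(x - b[i]), 0);
}

console.log(cosineSimilarity([1, 0], [0, 1]));  // 0 (orthogonal vectors)
console.log(euclideanDistance([0, 0], [3, 4])); // 5
console.log(manhattanDistance([0, 0], [3, 4])); // 7
```

Note that cosine similarity is a similarity (higher is closer), while the other two are distances (lower is closer), so they sort results in opposite directions.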

Factors to consider when choosing a Vector Database

When choosing a vector database, start with the requirements of your project.

There are many vector databases available on the market today, ranging from lightweight libraries to high-performance, scalable databases.

Some are free and others are paid. Some you host yourself, while others you can simply purchase as a SaaS product.

Here are the 4 factors to consider when choosing the right vector database for your project

Scalability
Performance
Community Support
Compatibility

Let us look at these individually

Scalability

How much data are you dealing with? If your dataset is small, a lightweight database may be all you need.

If you have a large dataset, check whether the database can be scaled across multiple machines.

If the project is large-scale, also consider whether the database can be scaled across multiple data centers.

Performance

Performance can be measured with the following metrics

Query latency
memory usage
indexing time

Hardware acceleration can also matter: some vector databases can use GPUs instead of CPUs, which can give a significant boost to performance.

In many cases you can trade accuracy for speed, so choose the database and configuration that gives you the performance you need per dollar spent.

Compatibility

Different databases work well with different programming languages, and this is especially true of vector databases.

Check whether the database you are considering works with the programming language used in your project.

Which distance metrics and indexing techniques does your project use? Does the database support them?

Does the database offer APIs, libraries and connectors that integrate with your project?

Community Support

How strong is the community around a particular database? This matters because an active community gives the developer a lot of support: answers on Stack Overflow, tutorials, detailed docs, and articles on how to achieve something or set up the database for a particular purpose.

Databases with large communities also tend to be well maintained and to receive regular bug fixes, new features and security updates.

Popular Vector Databases for your consideration

Here are 4 of the most popular vector search tools available today

FAISS: Developed by Facebook, FAISS is a library for large-scale vector similarity search rather than a full database. It is popular for its performance and flexibility in AI applications. It also supports GPU acceleration, which is a great add-on. It is written in C++ and is most commonly used through its Python bindings

Milvus: Advertised as the most popular vector database for enterprise users. Milvus is open source and can be used in applications like computer vision, machine learning and natural language processing. It supports multiple indexing techniques and also offers GPU hardware acceleration and distributed deployment

Annoy: An open-source, lightweight C++ library developed by Spotify. It searches for points in space that are close to a given query point

Weaviate: Weaviate is an open-source vector database that uses HNSW (Hierarchical Navigable Small World), a graph-based indexing technique often used in vector databases. HNSW offers a balance between accuracy and speed, and you can tune it toward whichever matters more to you. This technique may require more RAM than other techniques

Step by Step guide to implementing a Vector database

For this guide, we will be using one of the most popular vector databases out there: Milvus

Step 1: Installing Milvus

You can install Milvus in a Docker container, which is what we will do here. Milvus has some minimum hardware requirements; you can check them on the Milvus website

To install, download the Milvus Docker Compose YAML file

$ wget https://github.com/milvus-io/milvus/releases/download/v2.2.13/milvus-standalone-docker-compose.yml -O docker-compose.yml

After downloading the YAML file, start Milvus with the command below

sudo docker-compose up -d

Creating milvus-etcd  ... done
Creating milvus-minio ... done
Creating milvus-standalone ... done

Then you can check whether Milvus is up and running with

$ sudo docker-compose ps

You will get something like

      Name                     Command                  State                            Ports
--------------------------------------------------------------------------------------------------------------------
milvus-etcd         etcd -advertise-client-url ...   Up             2379/tcp, 2380/tcp
milvus-minio        /usr/bin/docker-entrypoint ...   Up (healthy)   9000/tcp
milvus-standalone   /tini -- milvus run standalone   Up             0.0.0.0:19530->19530/tcp, 0.0.0.0:9091->9091/tcp

Connect to Milvus

Check which local port Milvus is listening on. If you gave the container a custom name, use that name in place of milvus-standalone

docker port milvus-standalone 19530/tcp

This command returns a local IP address and port number, which is the address you connect to

Stop Milvus

You can stop Milvus using the following command

sudo docker-compose down

Creating the NodeJs Project

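Let us create a new directory and cd into it

mkdir milvus-nodejs
cd milvus-nodejs

Next, let us initialize the project like so

npm init -y

Then we will install the Milvus Node.js SDK, which is published on npm as @zilliz/milvus2-sdk-node. Ideally, install a release that matches your Milvus server version (2.2.x in this guide)

npm install @zilliz/milvus2-sdk-node --save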

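Step 2: Creating a Milvus Client

a. Create a new file named index.js, import the Milvus SDK and create a client. The address below assumes Milvus is reachable on the default port from the Docker setup in Step 1; replace it with the address you got from the docker port command if it differs

const { MilvusClient } = require("@zilliz/milvus2-sdk-node");
const milvusClient = new MilvusClient("localhost:19530");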

Step 3: Create a collection

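Now, let us define a collection schema that lists the fields and their data types. The field property names below follow the 2.2.x Node.js SDK; check the SDK reference if you are on a different version

const { DataType } = require("@zilliz/milvus2-sdk-node");

const collectionSchema = {
  collection_name: "test_collection",
  fields: [
    {
      // 128-dimensional float vector field
      name: "vector",
      data_type: DataType.FloatVector,
      type_params: {
        dim: 128,
      },
    },
    {
      // Primary key; autoID is off because we supply our own ids in Step 4
      name: "id",
      data_type: DataType.Int64,
      is_primary_key: true,
      autoID: false,
    },
  ],
};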

Now, let us create a collection using the Milvus client

async function createCollection() {
  const response = await milvusClient.createCollection(collectionSchema);
  console.log("Collection has been created:", response);
}

createCollection();

Step 4: Inserting Data into the Collection

a. Preparing the data to be inserted

const vectors = [
  {
    id: 1,
    vector: Array.from({ length: 128 }, () => Math.random()),
  },
  {
    id: 2,
    vector: Array.from({ length: 128 }, () => Math.random()),
  },
];

b. Inserting the data into the collection

async function insertData() {
  const response = await milvusClient.insert({
    collection_name: "test_collection",
    fields_data: vectors,
  });
  console.log("Data has been added to the collection:", response);
}

insertData();
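Before sending rows to Milvus, it can help to check that every vector matches the dimension declared in the schema (128 in this guide), since a mismatch is a common cause of rejected inserts. The helper below is a hypothetical sketch and not part of the Milvus SDK; it only assumes that each row has a vector property, as in the data prepared above.

```javascript
// Hypothetical helper: returns a list of problems found in the rows.
// An empty list means every vector has the expected dimension.
function validateRows(rows, dim) {
  const errors = [];
  rows.forEach((row, index) => {
    if (!Array.isArray(row.vector)) {
      errors.push(`row ${index}: "vector" is not an array`);
    } else if (row.vector.length !== dim) {
      errors.push(`row ${index}: expected ${dim} values, got ${row.vector.length}`);
    } else if (!row.vector.every((x) => typeof x === "number" && Number.isFinite(x))) {
      errors.push(`row ${index}: vector contains a non-finite value`);
    }
  });
  return errors;
}

// Two well-formed rows and one with the wrong dimension.
const sampleRows = [
  { id: 1, vector: Array.from({ length: 128 }, () => Math.random()) },
  { id: 2, vector: Array.from({ length: 128 }, () => Math.random()) },
  { id: 3, vector: [0.1, 0.2, 0.3] },
];

const problems = validateRows(sampleRows, 128);
console.log(problems.length === 0 ? "rows look fine" : problems);
```

You could call a check like this at the top of insertData() and skip the insert when the returned list is not empty.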

Step 5: Create an Index

Let us define the index parameters, such as the index type and the metric type. The parameter names below follow the 2.2.x Node.js SDK

const indexParams = {
  collection_name: "test_collection",
  field_name: "vector",
  index_type: "IVF_FLAT",
  metric_type: "L2",
  // nlist is the number of clusters the vectors are partitioned into
  params: {
    nlist: 1024,
  },
};

Now, using the Milvus client, we will create an index

async function createIndex() {
  const response = await milvusClient.createIndex(indexParams);
  console.log("A new index has been created:", response);
}

createIndex();
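To give a feel for what an IVF-style index such as IVF_FLAT does, here is a toy sketch of the inverted-file idea: vectors are grouped by their nearest centroid, and a query only scans the groups whose centroids are closest to it. This is an illustration under simplifying assumptions, not how Milvus is implemented. In particular, the centroids here are hand-picked, whereas a real index learns them from the data (typically with k-means), and all function names are made up.

```javascript
function squaredL2(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

// Indexes of the `count` centroids closest to `vector`.
function nearestCentroids(centroids, vector, count) {
  return centroids
    .map((c, index) => ({ index, distance: squaredL2(c, vector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, count)
    .map((entry) => entry.index);
}

// Build one inverted list per centroid; each item goes to its closest centroid.
function buildIvfIndex(centroids, items) {
  const lists = centroids.map(() => []);
  for (const item of items) {
    const [closest] = nearestCentroids(centroids, item.vector, 1);
    lists[closest].push(item);
  }
  return { centroids, lists };
}

// Scan only the `nprobe` lists whose centroids are closest to the query.
function ivfSearch(index, query, k, nprobe) {
  const probed = nearestCentroids(index.centroids, query, nprobe);
  const candidates = probed.flatMap((listIndex) => index.lists[listIndex]);
  return candidates
    .map((item) => ({ id: item.id, distance: Math.sqrt(squaredL2(item.vector, query)) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Hypothetical toy data: two hand-picked centroids, i.e. nlist = 2.
const centroids = [[0, 0], [10, 10]];
const items = [
  { id: 1, vector: [0, 1] },
  { id: 2, vector: [1, 0] },
  { id: 3, vector: [9, 10] },
  { id: 4, vector: [10, 9] },
];

const index = buildIvfIndex(centroids, items);
const hits = ivfSearch(index, [9.5, 9.5], 2, 1); // nprobe = 1: scan one list only
console.log(hits.map((h) => h.id));
```

With nprobe set to 1 the query above never looks at items 1 and 2. Raising nprobe scans more lists, which costs time but lowers the chance of missing a true neighbor that landed in an adjacent cluster. This is the speed versus accuracy trade-off mentioned earlier.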

Step 6: Sample searching for similar vectors

const searchParams = {
  collection_name: "test_collection",
  // A single random 128-dimensional query vector
  vector: Array.from({ length: 128 }, () => Math.random()),
  // Return the 5 nearest neighbours (top-k)
  limit: 5,
  metric_type: "L2",
  // nprobe is the number of IVF clusters to scan per query
  params: {
    nprobe: 16,
  },
};

Now let us run a sample search against our database. The search parameters defined above specify which collection to query, the query vector, the distance metric and how many results to return (top-k)

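async function search() {
  // A collection must be loaded into memory before it can be searched
  await milvusClient.loadCollectionSync({ collection_name: "test_collection" });
  const response = await milvusClient.search(searchParams);
  console.log("Results:", response);
}

search();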

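Thus we have implemented the Milvus client in our Node.js project. Note that every step above is asynchronous, so in a real script you should await the steps in order (create the collection, insert the data, create the index, then search) rather than calling the functions back to back.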
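One way to make sure the steps run strictly one after another is a small sequencing helper. The sketch below uses stand-in functions instead of real Milvus calls so that it runs without a server; runInOrder and fakeStep are hypothetical names, and in your own script you would pass createCollection, insertData, createIndex and search instead (and remove the standalone calls after each function definition).

```javascript
// Runs async steps one after another. If a step throws, later steps never run.
async function runInOrder(steps) {
  const completed = [];
  for (const [name, step] of steps) {
    await step();
    completed.push(name);
  }
  return completed;
}

// Stand-ins for the tutorial's functions. Later steps finish faster on purpose,
// to show that ordering does not depend on how long each step takes.
const log = [];
const fakeStep = (name, delayMs) => async () => {
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  log.push(name);
};

const done = runInOrder([
  ["createCollection", fakeStep("createCollection", 20)],
  ["insertData", fakeStep("insertData", 15)],
  ["createIndex", fakeStep("createIndex", 10)],
  ["search", fakeStep("search", 5)],
]);

done.then((names) => console.log(names.join(" -> ")));
```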

Bonus: How to prepare your Data for the Vector database

Data Pre-Processing and Feature extraction in Vector databases

There are 3 methods commonly used to preprocess data before it goes into a vector database

Normalization
Dimensionality reduction
Feature selection

Let us consider all these in detail

Normalization

Normalization means adjusting the values in the dataset so that they are on a common scale. We do this because we do not want a single feature to dominate the model due to differences in the magnitude of its values

Steps involved

The features that need to be normalized are numerical features that have different scales and units of measurement

There are a number of methods used for this

Min-Max scaling: Scales all the values into the range [0, 1].

Z-score standardization: In this method, we scale the values using their statistics. Each value is expressed in terms of the mean (average) and the standard deviation (how far the values spread from the mean), so that the scaled feature has a mean of 0 and a standard deviation of 1

Apply whichever method you prefer, but compute the scaling parameters on the training set and reuse the same parameters for the test set to avoid data leakage
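Both scaling methods can be sketched in plain JavaScript for a single numeric feature. The function names are illustrative. The sketch separates "fit" (compute the parameters on the training values) from "apply" (reuse those parameters on any other values), which is the point made above about avoiding data leakage.

```javascript
// Min-max scaling: learn min and max from the training values.
function fitMinMax(values) {
  return { min: Math.min(...values), max: Math.max(...values) };
}

function applyMinMax(values, { min, max }) {
  const range = max - min;
  // A constant feature has no spread; map it to 0 to avoid dividing by zero.
  return values.map((v) => (range === 0 ? 0 : (v - min) / range));
}

// Z-score standardization: learn mean and (population) standard deviation.
function fitZScore(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return { mean, std: Math.sqrt(variance) };
}

function applyZScore(values, { mean, std }) {
  return values.map((v) => (std === 0 ? 0 : (v - mean) / std));
}

const train = [2, 4, 4, 4, 5, 5, 7, 9]; // mean 5, standard deviation 2
const test = [5, 11];

const zParams = fitZScore(train);
console.log(applyZScore(test, zParams)); // [ 0, 3 ]

const mmParams = fitMinMax([10, 20, 30]);
console.log(applyMinMax([10, 20, 30], mmParams)); // [ 0, 0.5, 1 ]
```

Because the parameters come from the training values only, a test value outside the training range can land outside [0, 1] under min-max scaling. That is expected and is usually preferable to refitting on the test data.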

Dimensionality Reduction

This involves reducing the number of features in our dataset while retaining the important information.

With fewer dimensions, the model runs faster, computational complexity decreases and the data is simplified.

Steps involved

Decide how many dimensions, or how much of the variance, you want to keep in the reduced dataset, then choose a reduction technique accordingly

Principal Component Analysis: A linear method that projects the data into a lower-dimensional space along the directions of maximum variance

t-Distributed Stochastic Neighbor Embedding: A nonlinear method of reducing the dimensionality of a dataset that preserves local structure in the data
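As a rough illustration of the PCA idea, the sketch below finds only the first principal component (the direction of maximum variance) using power iteration on the covariance matrix, then projects the data onto it. This is a simplified teaching example under stated assumptions: it needs at least two rows, it fails on data with zero variance, and for real work you would use a numerical library with a proper eigen or SVD solver. All function names are made up.

```javascript
// Subtract the per-feature mean so every column is centered at zero.
function centerColumns(rows) {
  const dim = rows[0].length;
  const means = Array.from({ length: dim }, (_, j) =>
    rows.reduce((sum, row) => sum + row[j], 0) / rows.length
  );
  return rows.map((row) => row.map((value, j) => value - means[j]));
}

// Sample covariance matrix of already-centered rows.
function covarianceMatrix(centered) {
  const dim = centered[0].length;
  const cov = Array.from({ length: dim }, () => new Array(dim).fill(0));
  for (const row of centered) {
    for (let i = 0; i < dim; i++) {
      for (let j = 0; j < dim; j++) {
        cov[i][j] += (row[i] * row[j]) / (centered.length - 1);
      }
    }
  }
  return cov;
}

// Power iteration: repeatedly multiply by the matrix and renormalize.
// This converges toward the eigenvector with the largest eigenvalue.
function topEigenvector(matrix, iterations = 200) {
  let v = matrix.map((_, i) => 1 / (i + 1)); // arbitrary non-zero start
  for (let step = 0; step < iterations; step++) {
    const next = matrix.map((row) => row.reduce((s, x, j) => s + x * v[j], 0));
    const norm = Math.sqrt(next.reduce((s, x) => s + x * x, 0));
    v = next.map((x) => x / norm);
  }
  return v;
}

// Reduce each row to a single number: its coordinate along the first component.
function projectOntoFirstComponent(rows) {
  const centered = centerColumns(rows);
  const component = topEigenvector(covarianceMatrix(centered));
  const projected = centered.map((row) =>
    row.reduce((s, x, j) => s + x * component[j], 0)
  );
  return { component, projected };
}

// Toy data lying exactly on the line y = x, so one dimension captures everything.
const points = [[1, 1], [2, 2], [3, 3], [4, 4]];
const { component, projected } = projectOntoFirstComponent(points);
console.log(component, projected);
```

For this toy data the component points along the diagonal, and the four 2-dimensional points are reduced to four numbers without losing any information, because all of the variance lies along that one direction.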

Feature Selection

In feature selection, we select the most relevant features and discard the ones that are not relevant to our use case. This helps to reduce noise, makes the model more interpretable and decreases the training time

Steps involved

Filter methods: These rank the features by a criterion such as correlation or mutual information with the target, and then select the top k features

Wrapper methods: These use a specific machine learning model to evaluate features and iteratively remove the ones that are not relevant to our use case

Embedded methods: These perform feature selection as part of model training. Methods such as LASSO and Ridge regression apply regularization, which reduces the impact of features that are not important; LASSO can shrink their coefficients all the way to zero, which effectively removes them

Apply whichever method you deem best for your dataset, keep the top k features and discard the rest
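The filter method is the simplest of the three to sketch. The example below ranks features by the absolute value of their Pearson correlation with the target and keeps the top k. The feature names and values are invented for illustration, and correlation is only one possible ranking criterion.

```javascript
// Pearson correlation between two equal-length numeric arrays.
function pearson(x, y) {
  const n = x.length;
  const meanX = x.reduce((s, v) => s + v, 0) / n;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  // A constant feature has zero variance; treat it as uncorrelated.
  return denom === 0 ? 0 : cov / denom;
}

// Rank features by |correlation with the target| and keep the k best names.
function selectTopK(features, target, k) {
  return Object.entries(features)
    .map(([name, values]) => ({ name, score: Math.abs(pearson(values, target)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((f) => f.name);
}

// Hypothetical dataset: "size" tracks the target, "noise" barely does,
// and "constant" carries no information at all.
const features = {
  size: [1, 2, 3, 4, 5],
  noise: [3, 1, 4, 1, 5],
  constant: [7, 7, 7, 7, 7],
};
const target = [2, 4, 6, 8, 10];

console.log(selectTopK(features, target, 2));
```

A filter like this looks at each feature in isolation, so it is fast but can miss features that are only useful in combination. That is the gap wrapper and embedded methods try to close.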