
Chloe Williams for Zilliz

Originally published at zilliz.com

An Introduction to Vector Embeddings: What They Are and How to Use Them

Understand vector embeddings and when and how to use them. Explore real-world applications with Milvus and Zilliz Cloud vector databases.

Vector embeddings are numerical representations of data points that make unstructured data easier to search. These embeddings are stored in specialized databases like Milvus and Zilliz Cloud (fully managed Milvus), which use advanced algorithms and indexing techniques for fast retrieval.

Modern artificial intelligence (AI) models, like Large Language Models (LLMs), use text vector embeddings to understand natural language and generate relevant responses. Moreover, advanced versions of LLMs use Retrieval Augmented Generation (RAG) to retrieve information from external vector stores for task-specific applications.

In this blog post, we will unpack the concept of vector embeddings and explore their applications, best practices, and tools for working with them.

What are Vector Embeddings?

A vector embedding is a list of numbers, each representing a feature of the data. These embeddings are obtained by analyzing connections within a dataset: data points whose embeddings lie close together in the vector space are semantically similar.

The embeddings are produced by deep learning models trained to map data into a high-dimensional vector space. Popular embedding models like BERT and Data2Vec form the basis of many modern deep-learning applications.

Moreover, vector embeddings are widely used in natural language processing (NLP) and computer vision (CV) applications due to their efficiency.

Types of Vector Embeddings

There are three main types of embeddings, distinguished by how they represent values: dense, sparse, and binary embeddings. Here’s how they differ in characteristics and use:

1. Dense Embeddings

Dense embeddings are vectors in which most elements are non-zero. They capture finer detail because every dimension carries information, but storing a value for every dimension makes them less storage-efficient.

Word2Vec, GloVe, CLIP, and BERT are models that generate dense vector embeddings from input data.
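As a quick illustration, here’s a minimal sketch using gensim’s Word2Vec (the same library used later in this post) on a toy corpus; real models train on millions of sentences:

from gensim.models import Word2Vec

# toy corpus; real training data is far larger
sentences = [["ice", "is", "cold"], ["water", "is", "wet"], ["ice", "and", "water"]]
model = Word2Vec(sentences, vector_size=8, min_count=1, seed=42)

# 8 floats, nearly all non-zero: a dense embedding
print(model.wv["ice"])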

2. Sparse Embeddings

Sparse vector embeddings are high-dimensional vectors in which most elements are zero. The non-zero values represent the relative importance of features, such as terms in a corpus. Sparse embeddings require less memory and storage and suit high-dimensional sparse data like word frequencies.

TF-IDF and SPLADE are popular methods of generating sparse vector embeddings.
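Here’s a minimal TF-IDF sketch using scikit-learn (one common implementation, assumed here for illustration); note how only the non-zero weights are stored:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ice is cold", "water is wet", "cold water is refreshing"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # a SciPy sparse matrix

# most entries are zero, so only non-zero weights take up memory
print(tfidf.shape, "-", tfidf.nnz, "non-zero entries")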

3. Binary Embeddings

A binary embedding stores each dimension as a single bit, 1 or 0. This form of storage is substantially more compact than 32-bit floating-point numbers and speeds up retrieval. However, it does lead to information loss, since precision is reduced.

Regardless, binary embeddings are popular in use cases where speed matters more than a slight loss of accuracy.
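As a rough sketch of the idea, floating-point vectors can be binarized by sign and packed into bytes; the Hamming distance (XOR plus popcount) then serves as a fast similarity measure:

import numpy as np

# hypothetical 32-bit float embeddings
dense = np.random.randn(2, 64).astype(np.float32)

# binarize: each dimension becomes a single bit (1 if positive, else 0)
bits = (dense > 0).astype(np.uint8)
packed = np.packbits(bits, axis=1)  # 64 floats (256 bytes) -> 8 bytes per vector

# Hamming distance: count differing bits between the two vectors
hamming = np.unpackbits(packed[0] ^ packed[1]).sum()
print(hamming)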

How are Vector Embeddings Created?

Sophisticated deep learning models and statistical methods help create vector embeddings. These models identify patterns and connections in input data to learn the difference between data points. Models generate vector embeddings in an n-dimensional space based on their understanding of underlying connections.

An n-dimensional space goes beyond our 3-dimensional intuition and captures data from many perspectives at once. High-dimensional vector embeddings capture finer details from data points, resulting in more accurate outputs.

For example, with textual data, a high-dimensional space captures subtle differences in word meanings. A 2-dimensional space might place the words “tired” and “exhausted” at nearly the same point, while an n-dimensional space can separate them along dimensions that capture their difference in emotional intensity. Mathematically, a vector v in n-dimensional space is written as:

v = [v₁, v₂, …, vₙ]
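Concretely, here’s a toy sketch with made-up 5-dimensional vectors; the small Euclidean distance between them reflects their closeness in the space (real embedding models produce vectors with hundreds of dimensions):

import numpy as np

# hypothetical 5-dimensional embeddings for two related words
tired = np.array([0.8, 0.1, 0.3, 0.5, 0.2])
exhausted = np.array([0.7, 0.2, 0.3, 0.6, 0.4])

# Euclidean distance: small values mean the vectors are close in the space
distance = np.linalg.norm(tired - exhausted)
print(distance)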

The two popular techniques for creating vector embeddings are:

Neural Networks

Neural networks, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), excel at learning data complexities. For example, BERT analyzes a word's neighboring terms to understand its meaning and generate embeddings.
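As an illustrative sketch (using the Hugging Face transformers library, an assumption here rather than something this post depends on), BERT returns one contextual embedding per token, so a word’s vector is shaped by its neighbors:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" gets an embedding informed by its neighbors ("river", "muddy")
inputs = tokenizer("The river bank was muddy", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)  # (number of tokens, 768 dimensions)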

Matrix Factorization

Unlike neural networks, matrix factorization is a simpler embedding technique. It treats the training data as a matrix and factorizes it into two lower-rank matrices of embeddings. Matrix factorization is widely used in recommendation systems, where the input is a user rating matrix: rows represent users and columns represent items (e.g., movies). Multiplying the user embedding matrix by the transpose of the item embedding matrix then reconstructs an approximation of the original matrix.
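Here’s a minimal NumPy sketch of that idea, with a toy rating matrix and plain stochastic gradient descent (production recommenders add regularization and bias terms):

import numpy as np

# toy user-item rating matrix (0 = unrated); rows: users, columns: movies
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

k = 2                                     # embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, k))    # user embeddings
V = rng.normal(scale=0.1, size=(4, k))    # item embeddings

lr, epochs = 0.01, 2000
for _ in range(epochs):
    for i, j in zip(*R.nonzero()):        # train only on observed ratings
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * err * V[j]
        V[j] += lr * err * U[i]

# U @ V.T approximates R; the zeros become predicted ratings
print(np.round(U @ V.T, 1))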

Various tools and libraries simplify the process of generating embeddings from input data. The most popular libraries include TensorFlow, PyTorch, and Hugging Face. These open-source libraries and tools offer user-friendly documentation for creating embedding models.

The following table lists different embedding models, their descriptions, and links to the official documentation:

| Model | Description | Link |
| --- | --- | --- |
| Neural Networks | Neural networks like CNNs and RNNs effectively identify data patterns, which is useful for generating vector embeddings. Word2Vec is one example. | https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/video-lecture |
| Matrix Factorization | Matrix factorization is suitable for filtering tasks like recommendation systems. It captures user preferences by factorizing input matrices. | https://developers.google.com/machine-learning/recommendation/collaborative/matrix |
| GloVe | GloVe generates a single static embedding for each word, rather than context-dependent embeddings. | https://nlp.stanford.edu/projects/glove/ |
| BERT | BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model that analyzes textual data bidirectionally. | https://zilliz.com/learn/bge-m3-and-splade-two-machine-learning-models-for-generating-sparse-embeddings#BERT-The-Foundation-Model-for-BGE-M3-and-Splade |
| ColBERT | A token-level embedding and ranking model. | https://zilliz.com/learn/explore-colbert-token-level-embedding-and-ranking-model-for-similarity-search |
| SPLADE | An advanced embedding model for generating sparse embeddings. | https://zilliz.com/learn/bge-m3-and-splade-two-machine-learning-models-for-generating-sparse-embeddings#SPLADE |
| BGE-M3 | BGE-M3 is an advanced machine-learning model that extends BERT's capabilities. | https://zilliz.com/learn/bge-m3-and-splade-two-machine-learning-models-for-generating-sparse-embeddings#BGE-M3 |

What are Vector Embeddings Used for?

Vector embeddings are widely used in various modern search and AI tasks. Some of these tasks include:

  • Similarity Search: Similarity search is a technique to find the closest data points in a high-dimensional space. It works by measuring the distance between vector embeddings with similarity measures like Euclidean distance or Jaccard similarity. Modern search engines use similarity search to retrieve relevant web pages for user queries (see the sketch after this list).

  • Recommendation Systems: Recommendation systems rely on vectorized data to cluster similar items. Elements from the same cluster are then used as recommendations for the users. The systems create clusters on various levels, such as groups of users based on demographics and preferences and a group of products. All this information is stored as vector embeddings for efficient and accurate retrieval at runtime.

  • Retrieval Augmented Generation (RAG): RAG is a popular technique for alleviating the hallucinatory issues of large language models and providing them with additional knowledge. Embedding models transform external knowledge and user queries into vector embeddings. A vector database stores the embeddings and conducts a similarity search for the most relevant results to the user query. The LLM generates the final answers based on the retrieved contextual information.
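As promised above, here’s a minimal sketch of similarity search over hypothetical document embeddings, ranking documents by cosine similarity to a query vector:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical document embeddings and a query embedding
docs = {"doc_a": np.array([0.9, 0.1, 0.0]),
        "doc_b": np.array([0.1, 0.8, 0.3]),
        "doc_c": np.array([0.7, 0.2, 0.1])}
query = np.array([0.8, 0.15, 0.05])

# rank documents from most to least similar to the query
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)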

Storing, Indexing, and Retrieving Vector Embeddings with Milvus

Milvus offers everything you need to store, index, and search vector embeddings. Here’s a step-by-step approach using the PyMilvus library:

1. Install Libraries and Set up a Milvus Database

Install pymilvus and gensim, where PyMilvus is the Python SDK for Milvus and gensim is a Python library for NLP. Run the following code to install the libraries:

!pip install -U pymilvus gensim

In this tutorial, we’re running Milvus with Docker, so make sure you have Docker installed on your system. Run the following commands in your terminal to download and start Milvus:

> curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
> bash standalone_embed.sh start

Now the Milvus service has started and you’re ready to use the Milvus database. The rest of this tutorial uses PyMilvus's ORM interface, so connect to the running instance with connections.connect. (Alternatively, creating a MilvusClient with a local filename such as milvus_demo.db stores everything in a file-backed Milvus Lite database, no Docker required.)

from pymilvus import connections

# connect to the Milvus standalone instance started with Docker
connections.connect("default", host="localhost", port="19530")


2. Generate Vector Embeddings

The following code creates a collection to store the embeddings, loads a pre-trained model from gensim, and generates embeddings for simple words like “ice” and “water”:

import gensim.downloader as api
from pymilvus import Collection, FieldSchema, CollectionSchema, DataType

# create a collection
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="words", dtype=DataType.VARCHAR, max_length=50),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=50),
]
schema = CollectionSchema(fields, "Demo to store and retrieve embeddings")
demo_milvus = Collection("milvus_demo", schema)

# load the pre-trained model from gensim (50-dimensional GloVe vectors)
model = api.load("glove-wiki-gigaword-50")

# generate embeddings
ice = model['ice']
water = model['water']
cold = model['cold']
tree = model['tree']
man = model['man']
woman = model['woman']
child = model['child']
female = model['female']


3. Store Vector Embeddings

Store the vector embeddings generated in the previous step in the demo_milvus collection we created above:

# insert data into the collection
data = [
    [1, 2, 3, 4, 5, 6, 7, 8],                                             # field pk
    ['ice', 'water', 'cold', 'tree', 'man', 'woman', 'child', 'female'],  # field words
    [ice, water, cold, tree, man, woman, child, female],                  # field embeddings
]
insert_result = demo_milvus.insert(data)

# after the final entity is inserted, call flush so no growing segments are left in memory
demo_milvus.flush()


4. Create Indexes on Entries

Indexes make vector search faster. The following code uses the IVF_FLAT index type, the L2 (Euclidean distance) metric, and an nlist of 128 clusters to create an index on the embeddings field:

index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
demo_milvus.create_index("embeddings", index)


5. Search Vector Embeddings

To search the vector embeddings, load the Milvus collection into memory using the .load() method and run a vector similarity search:

demo_milvus.load()

# perform a vector similarity search
data = [cold]
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}
result = demo_milvus.search(data, "embeddings", search_params, limit=4, output_fields=["words"])
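Each query vector returns a list of hits. A quick way to inspect them is to print each matched word and its distance; since “cold” itself is stored in the collection, it should come back as the closest match:

for hits in result:
    for hit in hits:
        print(hit.entity.get("words"), hit.distance)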


Best Practices for Using Vector Embeddings

Obtaining optimal results with vector embeddings requires careful use of embedding models. The best practices for using vector embeddings are:

1. Selecting the Right Embedding Model

Different embedding models are suitable for different tasks. For example, CLIP is designed for multimodal tasks, and GloVe is designed for NLP tasks. Selecting embedding models based on data needs and computational limitations results in better outputs.

2. Optimizing Embedding Performance

Pre-trained models like BERT and CLIP offer a good starting point. However, these can be optimized for improved performance.

Hyperparameter tuning also helps find the combination of settings, such as learning rate and embedding dimensionality, that delivers optimal performance. Data augmentation is another way to improve embedding model performance: it artificially increases the size and variety of training data, which helps for tasks with limited data.

3. Monitoring Embedding Models

Continuous monitoring of embedding models tracks their performance over time. This offers insights into model degradation, allowing you to fine-tune them for accurate results.

4. Considering Evolving Needs

Evolving data needs, such as growing volume or changing formats, may decrease accuracy. Retraining and fine-tuning models as data evolves keeps model performance precise.

Common Pitfalls and How to Avoid Them

Change in Model Architecture

Fine-tuning and hyperparameter tuning can modify the underlying model architecture. Since the model generates the vector embeddings, significant architectural changes can produce entirely different, incompatible embeddings.

To improve performance without changing the model wholesale, avoid retraining from scratch. Instead, fine-tune pre-trained models like Word2Vec and BERT for your specific task.

Data Drift

Data drift happens when data changes from what the model was trained on. This might result in inaccurate vector embeddings. Continuous monitoring of data ensures it stays consistent with model requirements.

Misleading Evaluation Metrics

Different evaluation metrics suit different tasks. Choosing metrics at random can produce a misleading analysis that hides the model's true performance.

Carefully pick the evaluation metrics suitable for your task: for example, cosine similarity for semantic similarity tasks and BLEU score for translation tasks.

Further Resources

The best way to build a deeper understanding of vector embeddings is by studying relevant resources, practicing, and engaging with industry professionals. Below are ways you can explore vector embeddings in depth:

Community Engagement

Join our Discord community to connect with GenAI developers from various industries and discuss everything related to vector embeddings, vector databases, and AI. Follow relevant discussions on Stack Overflow, Reddit, and GitHub to learn potential issues you might encounter when working with embeddings and improve your debugging skills.

Staying up-to-date with resources and engaging with the community ensures that your skills grow as technology advances, which offers you a competitive advantage in the AI industry.
