TL;DR: Build a powerful semantic search system in Rust using Rig and LanceDB. We'll guide you step-by-step through creating, storing, and searching vector embeddings efficiently with hands-on examples. Perfect for building RAG systems, semantic search engines, and more.
Introduction
Semantic search is transforming the way we find and understand information. Unlike traditional keyword search, it captures the intent behind your queries, offering a more nuanced retrieval process. However, building these systems can feel daunting, often involving complex embeddings, vector databases, and similarity search algorithms.
That's where LanceDB comes in.
Why LanceDB?
LanceDB is an open-source vector database tailored for AI applications and vector search. It provides:
- Embedded Database: Works directly in your application without needing external servers.
- High Performance: Leverages Arrow format for efficient data storage and retrieval.
- Scalable: Handles terabyte-scale datasets efficiently.
- Vector Indexing: Supports both exact and approximate nearest neighbor searches out of the box.
Combined with Rig's embedding and LLM capabilities, you can create a powerful, efficient semantic search solution with minimal code.
Let's dive in!
You can find the full source code for this project in our GitHub repo.
Prerequisites
Before we begin, make sure you have:
- Rust installed (rust-lang.org)
- An OpenAI API key (platform.openai.com)
- Basic familiarity with Rust and asynchronous programming
Project Setup
To start, create a new Rust project:
cargo new vector_search
cd vector_search
Update your Cargo.toml to add the necessary dependencies:
[dependencies]
rig-core = "0.4.0"
rig-lancedb = "0.1.1"
lancedb = "0.10.0"
tokio = { version = "1.40.0", features = ["full"] }
anyhow = "1.0.89"
futures = "0.3.30"
serde = { version = "1.0.210", features = ["derive"] }
serde_json = "1.0.128"
arrow-array = "52.2.0"
Here’s a quick overview of each dependency:
- rig-core and rig-lancedb: The core libraries for embedding generation and vector search.
- lancedb: The embedded vector database.
- tokio: Asynchronous runtime support.
- arrow-array: To work with Arrow's columnar format, which LanceDB uses internally.
- Others for error handling, serialization, and futures support.
Now, create a .env file to store your OpenAI API key:
echo "OPENAI_API_KEY=your_key_here" > .env
Building the Search System
We’ll break this into manageable steps. First, let’s create a utility function to handle data conversion between Rig's embeddings and LanceDB's format.
Create src/utils.rs:
use std::sync::Arc;
use arrow_array::{
types::Float64Type, ArrayRef, FixedSizeListArray,
RecordBatch, StringArray
};
use lancedb::arrow::arrow_schema::{DataType, Field, Fields, Schema};
use rig::embeddings::DocumentEmbeddings;
// Define the schema for our LanceDB table
pub fn schema(dims: usize) -> Schema {
Schema::new(Fields::from(vec![
Field::new("id", DataType::Utf8, false),
Field::new("content", DataType::Utf8, false),
Field::new(
"embedding",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float64, true)),
dims as i32,
),
false,
),
]))
}
This schema function defines the structure of our table:
- id: A unique identifier for each document.
- content: The text content of the document.
- embedding: The vector representation of the content.
- dims parameter: Represents the size of the embedding vectors (e.g., 1536 for OpenAI's ada-002 model).
Next, add the conversion function that turns DocumentEmbeddings into a RecordBatch for LanceDB:
pub fn as_record_batch(
records: Vec<DocumentEmbeddings>,
dims: usize,
) -> Result<RecordBatch, lancedb::arrow::arrow_schema::ArrowError> {
let id = StringArray::from_iter_values(
records
.iter()
.flat_map(|record| (0..record.embeddings.len())
.map(|i| format!("{}-{i}", record.id)))
.collect::<Vec<_>>(),
);
let content = StringArray::from_iter_values(
records
.iter()
.flat_map(|record| {
record
.embeddings
.iter()
.map(|embedding| embedding.document.clone())
})
.collect::<Vec<_>>(),
);
let embedding = FixedSizeListArray::from_iter_primitive::<Float64Type, _, _>(
records
.into_iter()
.flat_map(|record| {
record
.embeddings
.into_iter()
.map(|embedding| embedding.vec.into_iter().map(Some).collect::<Vec<_>>())
.map(Some)
.collect::<Vec<_>>()
})
.collect::<Vec<_>>(),
dims as i32,
);
RecordBatch::try_from_iter(vec![
("id", Arc::new(id) as ArrayRef),
("content", Arc::new(content) as ArrayRef),
("embedding", Arc::new(embedding) as ArrayRef),
])
}
This function is crucial as it converts our Rust data structures into Arrow's columnar format, which LanceDB uses internally:
- Creates string arrays for IDs and content.
- Converts embeddings into fixed-size lists.
- Assembles everything into a RecordBatch.
With our utility functions ready, let's build the main search functionality in src/main.rs. We'll implement this step-by-step, explaining each part along the way.
Setting Up Dependencies
First, let’s import the required libraries:
use anyhow::Result;
use arrow_array::RecordBatchIterator;
use lancedb::{index::vector::IvfPqIndexBuilder, DistanceType};
use rig::{
embeddings::{DocumentEmbeddings, EmbeddingModel, EmbeddingsBuilder},
providers::openai::{Client, TEXT_EMBEDDING_ADA_002},
vector_store::VectorStoreIndex,
};
use rig_lancedb::{LanceDbVectorStore, SearchParams};
use serde::Deserialize;
use std::{env, sync::Arc};
mod utils;
use utils::{as_record_batch, schema};
These imports bring in:
- Rig’s embedding and vector storage tools.
- LanceDB’s database capabilities.
- Arrow data structures for efficient processing.
- Utilities for serialization, error handling, and async programming.
Defining Data Structures
We’ll create a simple struct to represent our search results:
#[derive(Debug, Deserialize)]
struct SearchResult {
content: String,
}
This struct maps to database records, representing the content we want to retrieve.
Generating Embeddings
Generating document embeddings is a core part of our system. Let’s implement this function:
async fn create_embeddings(client: &Client) -> Result<Vec<DocumentEmbeddings>> {
let model = client.embedding_model(TEXT_EMBEDDING_ADA_002);
// Set up dummy data to meet the 256 row requirement for IVF-PQ indexing
let dummy_doc = "Let there be light".to_string();
let dummy_docs = vec![dummy_doc; 256];
// Generate embeddings for the data
let embeddings = EmbeddingsBuilder::new(model)
// First add our real documents
.simple_document(
"doc1",
"Rust provides zero-cost abstractions and memory safety without garbage collection.",
)
.simple_document(
"doc2",
"Python emphasizes code readability with significant whitespace.",
)
// Add dummy documents to meet minimum requirement using enumerate to generate unique IDs
.simple_documents(
dummy_docs
.into_iter()
.enumerate()
.map(|(i, doc)| (format!("doc{}", i + 3), doc))
.collect(),
)
.build()
.await?;
Ok(embeddings)
}
This function handles:
- Initializing the OpenAI embedding model.
- Creating embeddings for our real documents.
- Adding dummy data to meet LanceDB’s indexing requirements.
Configuring the Vector Store
Now, let’s set up LanceDB and configure it with appropriate indexing and search parameters:
async fn setup_vector_store<M: EmbeddingModel>(
embeddings: Vec<DocumentEmbeddings>,
model: M,
) -> Result<LanceDbVectorStore<M>> {
// Initialize LanceDB
let db = lancedb::connect("data/lancedb-store").execute().await?;
// Drop the existing table if it exists - important for development
if db
.table_names()
.execute()
.await?
.contains(&"documents".to_string())
{
db.drop_table("documents").await?;
}
// Create table with embeddings
let record_batch = as_record_batch(embeddings, model.ndims())?;
let table = db
.create_table(
"documents",
RecordBatchIterator::new(vec![Ok(record_batch)], Arc::new(schema(model.ndims()))),
)
.execute()
.await?;
// Create an optimized vector index using IVF-PQ
table
.create_index(
&["embedding"],
lancedb::index::Index::IvfPq(
IvfPqIndexBuilder::default().distance_type(DistanceType::Cosine),
),
)
.execute()
.await?;
// Configure search parameters
let search_params = SearchParams::default().distance_type(DistanceType::Cosine);
// Create and return vector store
Ok(LanceDbVectorStore::new(table, model, "id", search_params).await?)
}
This setup function:
- Connects to the LanceDB database.
- Manages table creation and deletion.
- Sets up IVF-PQ vector indexing for efficient similarity search (a tuning sketch follows below).
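The index above uses LanceDB's defaults for everything except the distance metric. If you want to trade recall against speed, the IVF-PQ builder can usually be tuned further. The sketch below is an assumption: the num_partitions and num_sub_vectors builder methods exist in recent lancedb releases, but check the version pinned in your Cargo.toml before relying on them.

// Sketch only: tune IVF-PQ clustering and compression (method names assumed).
table
    .create_index(
        &["embedding"],
        lancedb::index::Index::IvfPq(
            IvfPqIndexBuilder::default()
                .distance_type(DistanceType::Cosine)
                // More partitions means finer clustering; a common starting
                // point is roughly the square root of the row count.
                .num_partitions(16)
                // Sub-vectors control compression; the embedding dimension
                // (1536 for ada-002) should be divisible by this value.
                .num_sub_vectors(96),
        ),
    )
    .execute()
    .await?;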
Putting It All Together
Finally, the main function orchestrates the entire process:
#[tokio::main]
async fn main() -> Result<()> {
// Initialize OpenAI client
let openai_api_key = env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");
let openai_client = Client::new(&openai_api_key);
let model = openai_client.embedding_model(TEXT_EMBEDDING_ADA_002);
// Create embeddings (includes both real and dummy documents)
let embeddings = create_embeddings(&openai_client).await?;
println!("Created embeddings for {} documents", embeddings.len());
// Set up vector store
let store = setup_vector_store(embeddings, model).await?;
println!("Vector store initialized successfully");
// Perform a semantic search
let query = "Tell me about safe programming languages";
let results = store.top_n::<SearchResult>(query, 2).await?;
println!("\nSearch Results for: {}\n", query);
for (score, id, result) in results {
println!(
"Score: {:.4}\nID: {}\nContent: {}\n",
score, id, result.content
);
}
Ok(())
}
Understanding Vector Search Methods
Vector search systems need to balance accuracy and performance, especially as datasets grow. LanceDB provides two approaches to handle this: Exact Nearest Neighbor (ENN) and Approximate Nearest Neighbor (ANN) searches.
ENN vs ANN
- Exact Nearest Neighbor (ENN):
  - Searches exhaustively across all vectors.
  - Guarantees finding the true nearest neighbors.
  - Works well for small datasets.
  - No minimum data requirement.
  - Slower, but more accurate.
- Approximate Nearest Neighbor (ANN):
  - Uses indexing (like IVF-PQ) to speed up searches.
  - Returns approximate results.
  - Suited for larger datasets.
  - Faster, but slightly less accurate.
Choosing the Right Approach
Use ENN when:
- Dataset is small (< 1,000 vectors).
- Exact matches are crucial.
- Performance isn’t a major concern.
Use ANN when:
- Dataset is larger.
- You can tolerate minor approximations.
- Fast search speed is needed.
In this tutorial, we use ANN for scalability; for smaller datasets, ENN is often the better fit.
Tip: Start with ENN during development and transition to ANN as your data and performance needs grow. Check out the ENN example, or see the sketch below.
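For comparison, here is a minimal ENN sketch built from the same utilities. It assumes LanceDB's default behavior of scanning the whole table when no vector index exists, so the only change from setup_vector_store is that we skip create_index (the table path here is illustrative):

async fn setup_exact_store<M: EmbeddingModel>(
    embeddings: Vec<DocumentEmbeddings>,
    model: M,
) -> Result<LanceDbVectorStore<M>> {
    let db = lancedb::connect("data/lancedb-store-exact").execute().await?;
    let record_batch = as_record_batch(embeddings, model.ndims())?;
    let table = db
        .create_table(
            "documents",
            RecordBatchIterator::new(vec![Ok(record_batch)], Arc::new(schema(model.ndims()))),
        )
        .execute()
        .await?;
    // No create_index call: queries scan every vector, giving exact results
    // at the cost of speed on large tables.
    let search_params = SearchParams::default().distance_type(DistanceType::Cosine);
    Ok(LanceDbVectorStore::new(table, model, "id", search_params).await?)
}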
Running the System
To run the project:
cargo run
Expected output:
Created embeddings for 258 documents
Vector store initialized successfully
Search Results for: Tell me about safe programming languages
Score: 0.3982
ID: doc2-0
Content: Python emphasizes code readability with significant whitespace.
Score: 0.4369
ID: doc1-0
Content: Rust provides zero-cost abstractions and memory safety without garbage collection.
Next Steps
If you’re ready to build more with Rig, here are some practical examples:
1. Build a RAG System
Want to give your LLM access to custom knowledge? Check out our tutorial on Building a RAG System with Rig in Under 100 Lines of Code.
2. Create an AI Agent
Ready to build more interactive AI applications? See our Agent Example.
3. Join the Community
Stay Connected
I’m always excited to hear from developers! If you’re interested in Rust, LLMs, or building intelligent assistants, join our Discord. Let’s build something amazing together!
And don’t forget: Build something with Rig, share your feedback, and get a chance to win $100.
Ad Astra,
Tachi
Co-Founder @ Playgrounds Analytics
This tutorial is part of our "Build with Rig" series. Follow our Website for more.