AWS Bedrock, Claude 3, Serverless RAG, Rust

szymon-szym for AWS Community Builders

What a funny time to live in. It is quite challenging to craft an informative blog post title that won't contain only buzzwords!

Introduction

I had wanted to start exploring the Amazon Bedrock service for quite a while. I firmly believe that offering multiple LLMs as a serverless service is a huge step toward democratizing access to the current "AI revolution".

A while ago I heard about LanceDB - an open-source vector database written in Rust. This is an amazing project with a bunch of cool features, but for me, the selling point was that I could use a local file system or S3 as storage and move computation to Lambda. Additionally, because LanceDB is written in Rust, I could use Rust to work with it.

Then I stumbled upon an amazing post from Giuseppe Battista and Kevin Shaffer-Morrison about creating a serverless RAG with LanceDB: Serverless Retrieval Augmented Generation (RAG) on AWS. I treated that post as my starting point.

Project

The code for this blog post is available in the repository.

The GenAI ecosystem is a bit overwhelming. What works for me is dividing complex problems into smaller tasks and tackling them one by one.

In general, text generation with a vector database used for RAG can be broken down into the following steps:

1 Create a knowledge base

  • read input documents
  • transform them into embeddings
  • store in the vector database

2 Generate text based on the user's request

  • transform user's query to embeddings
  • get related data from the vector database
  • construct a prompt for LLM using context
  • invoke LLM model

In this post, I'll focus on the second point.

Prepare the knowledge base

As I mentioned above, I won't focus on this part. I just take the ready-to-use code from Giuseppe's blog.

I create a new folder and initialize an AWS CDK project:

cdk init --language typescript

At this point, the only resource I need is an S3 bucket.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';

export class BedrockRustStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // create s3 bucket for vector db
    const vectorDbBucket = new s3.Bucket(this, 'lancedb-vector-bucket', {
      versioned: true,
    });

    new cdk.CfnOutput(this, 'vector-bucket-name', {
      value: vectorDbBucket.bucketName,
    });

  }
}
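Deploying the stack creates the bucket and prints its name (the CfnOutput above), which I'll need in the next steps:

cdk deploy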

Document processor

The code for processing documents is available in the repo from Giuseppe's blog. I just want to run it manually from my local machine, so I simplify it a bit.

// document-processor/main.ts
import { BedrockEmbeddings } from "langchain/embeddings/bedrock";
import { CharacterTextSplitter } from "langchain/text_splitter";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { LanceDB } from "langchain/vectorstores/lancedb";

import { connect } from "vectordb"; // LanceDB

import dotenv from "dotenv";

dotenv.config();

(async () => {
  const dir = process.env.LANCEDB_BUCKET || "missing_s3_folder";
  const lanceDbTable = process.env.LANCEDB_TABLE || "missing_table_name";
  const awsRegion = process.env.AWS_REGION;

  console.log("lanceDbSrc", dir);
  console.log("lanceDbTable", lanceDbTable);
  console.log("awsRegion", awsRegion);

  const path = `documents/poradnik_bezpiecznego_wypoczynku.pdf`;

  const splitter = new CharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  const embeddings = new BedrockEmbeddings({
    region: awsRegion,
    model: "amazon.titan-embed-text-v1",
  });

  const loader = new PDFLoader(path, {
    splitPages: false,
  });

  const documents = await loader.loadAndSplit(splitter);

  const db = await connect(dir);

  console.log("connected")

  const table = await db.openTable(lanceDbTable).catch((_) => {
    console.log("creating new table", lanceDbTable)
    return db.createTable(lanceDbTable, [
        { 
          vector: Array(1536), 
          text: 'sample',
        },
      ])
  })

  const preparedDocs = documents.map(doc => ({
    pageContent: doc.pageContent,
    metadata: {}
  }))

  await LanceDB.fromDocuments(preparedDocs, embeddings, { table })

})();

Now I run the script from the document-processor/ directory (if you want to use an AWS profile other than the default one, it needs to be configured as an environment variable).

The easiest way to configure environment variables is to put them into a .env file, which is loaded by the dotenv library. LANCEDB_BUCKET is expected to be the full S3 URI of the database folder (the bucket name plus a prefix), and LANCEDB_TABLE is the table name.
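For reference, a minimal .env sketch for the document processor (the bucket name is a placeholder; as perpil points out in the comments, LANCEDB_BUCKET includes the s3:// scheme and the folder suffix):

LANCEDB_BUCKET=s3://<my-s3-bucket>/lance_db
LANCEDB_TABLE=embeddings
AWS_REGION=us-east-1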

npx ts-node main.ts

A cross-check in S3 - everything looks good.


The important thing is to align the installed lancedb library with the platform you are developing on. In my case, this is @lancedb/vectordb-linux-x64-gnu, but it will differ on other machines. Thank you perpil for catching this!

Input data

Sometimes it might be tricky to build GenAI solutions for non-English languages. In my case, I plan to generate texts in Polish based on the Polish knowledge base.

Luckily, the Titan Embeddings model is multilingual and supports Polish. That's why I can use it out of the box with the LangChain integration.

Next time I would like to spend more time on this step, especially on preparing the document chunks. For now, splitting everything into fixed-size pieces will do.

Generate text based on the user's query

OK, now I can create the main part.

In the root directory, I add a lambdas folder and create a new lambda with cargo lambda

cargo lambda new text_generator

Before I start, I create a config.rs file next to main.rs to keep environment variables in one place. I add clap and dotenv to manage them:

cargo add clap -F derive,env
cargo add dotenv

// config.rs
#[derive(clap::Parser, Debug)]
pub struct Config {
    #[clap(long, env)]
    pub(crate) bucket_name: String,
    #[clap(long, env)]
    pub(crate) prefix: String,
    #[clap(long, env)]
    pub(crate) table_name: String,
}

Now I can read the configuration at the beginning of the execution.

// main.rs
// note: Config::parse() below requires `use clap::Parser;` among the file's imports
#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing::init_default_subscriber();

    info!("starting lambda");

    dotenv::dotenv().ok();
    let env_config = Config::parse();
// ...

Before defining the function handler, I would like to prepare everything that can be defined outside of a specific request's context. The SDK client and the LanceDB connection are obvious candidates.

// main.rs
// ...

    // set up aws sdk config
    let region_provider = RegionProviderChain::default_provider().or_else("us-east-1");
    let config = aws_config::defaults(BehaviorVersion::latest())
        .region(region_provider)
        .load()
        .await;

    // initialize sdk clients
    let bedrock_client = aws_sdk_bedrockruntime::Client::new(&config);

    info!("sdk clients initialized");
// ...

When I started working on this blog post, the LanceDB SDK for Rust didn't support connecting directly to S3, so I needed to implement logic for downloading Lance files from S3 to a local directory. That workaround is no longer needed.

I initialize LanceDB with the S3 bucket URI:

// ...

    let bucket_name = env_config.bucket_name;
    let prefix = env_config.prefix;
    let table_name = env_config.table_name;

    let start_time_lance = std::time::Instant::now();

    let s3_db = format!("s3://{}/{}/", bucket_name, prefix);

    info!("bucket string {}", s3_db);

    // set AWS_DEFAULT_REGION env 

    std::env::set_var("AWS_DEFAULT_REGION", "us-east-1");

    let db = connect(&s3_db).execute().await?;

    info!("connected to db {:?}", db.table_names().execute().await);

    let table = db.open_table(&table_name).execute().await?;

    info!("connected to db in {}", Duration::from(start_time_lance.elapsed()).as_secs_f32());

Finally, I initialize the handler with the "injected" DB table and Bedrock client:

//...

    run(service_fn(|event: LambdaEvent<Request>| {
        function_handler(&table, &bedrock_client, event)
    }))
    .await

Function handler

The Lambda function's input and output are pretty straightforward:

#[derive(Deserialize)]
struct Request {
    prompt: String,
}

#[derive(Serialize)]
struct Response {
    req_id: String,
    msg: String,
}

The handler signature looks like this:

#[instrument(skip_all)]
async fn function_handler(
    table: &Table,
    client: &aws_sdk_bedrockruntime::Client,
    event: LambdaEvent<Request>,
) -> Result<Response, Error> {

//...
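The prompt itself is taken from the event payload at the top of the handler. That line isn't shown in the snippets above, but it presumably looks like this:

    // assumption: extract the user's prompt from the deserialized request payload
    let prompt = event.payload.prompt;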

Transform the query with Amazon Titan

The first task is to send the received prompt to the Bedrock Titan Embeddings model. According to the documentation, the input for the model and the response from it are pretty simple:

{
    "inputText": string
}

{
    "embedding": [float, float, ...],
    "inputTextTokenCount": int
}

To be able to parse the response, I create a struct:

#[derive(Debug, serde::Deserialize)]
#[serde(rename_all = "camelCase")]
struct TitanResponse {
    embedding: Vec<f32>,
    input_text_token_count: i128,
}

And I use the SDK to invoke the model:

// ...
 // transform prompt to embeddings
    let embeddings_prompt = format!(
        r#"{{
        "inputText": "{}"
    }}"#,
        prompt
    );

    info!("invoking embeddings model with: {}", embeddings_prompt);

    let invocation = client
        .invoke_model()
        .content_type("application/json")
        .model_id("amazon.titan-embed-text-v1")
        .body(Blob::new(embeddings_prompt.as_bytes().to_vec()))
        .send()
        .await
        .unwrap();

    let titan_response =
        serde_json::from_slice::<TitanResponse>(&invocation.body().clone().into_inner()).unwrap();

    let embeddings = titan_response.embedding;

    info!("got embeddings for prompt from model");
//...
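A small caveat: building JSON with format! breaks if the prompt contains quotes or newlines. A more robust variant (just a sketch, assuming serde_json is added as a dependency) lets serde_json do the escaping; the same applies to the Claude request body built later:

    // sketch: serde_json escapes quotes and newlines in the prompt automatically
    let embeddings_prompt = serde_json::json!({ "inputText": prompt.as_str() }).to_string();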

Look up related documents in LanceDB

Once we have the query transformed into embeddings, we can utilize vector database magic. I query our knowledge base to find related content.

// ...
    let result: Vec<RecordBatch> = table
        .search(&embeddings)
        .limit(1)
        .execute_stream()
        .await
        .unwrap()
        .try_collect::<Vec<_>>()
        .await
        .unwrap();

    let items = result
        .iter()
        .map(|batch| {
            let text_batch = batch.column(1);
            let texts = as_string_array(text_batch);
            texts
        })
        .flatten()
        .collect::<Vec<_>>();

    info!("items {:?}", &items);

    let context = items
        .first()
        .unwrap()
        .unwrap_or("")
        .replace("\u{a0}", "")
        .replace("\n", " ")
        .replace("\t", " ");
// ...

Let's unpack what's going on here. LanceDB uses Arrow as its in-memory data format. The search query returns a vector of RecordBatch values, an Arrow type.

To get the content from the RecordBatches, I map them to items of type Vec<Option<&str>>. To be honest, I don't like using a hardcoded column index to get the data I want (batch.column(1)), but so far I haven't found a more declarative way.
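As a side note, a slightly more declarative variant (a sketch, assuming the text column is called "text", as in the table created by the document processor) looks the column up by name instead of by index:

    // sketch: resolve the column by name instead of a hardcoded index
    let items = result
        .iter()
        .flat_map(|batch| {
            let text_column = batch
                .column_by_name("text")
                .expect("no `text` column in the record batch");
            as_string_array(text_column).iter()
        })
        .collect::<Vec<_>>();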

As a last step, I sanitize the received text - otherwise, it won't work as input for the LLM.

Invoke Claude 3

Finally, the most exciting part. I didn't try any advanced prompt-engineering techniques, so my prompt is a simple one:

//...
let prompt_for_llm = format!(
        r#"{{
        "system": "Respond only in Polish. Informative style. Information focused on health and safety for kids during vacations. Keep it short and use max 500 words. Please use examples from the following document in Polish: {}",
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": [
            {{
                "role": "user",
                "content": [{{
                    "type": "text",
                    "text": "{}"
                }}]
            }}
        ]
    }}"#,
        context, prompt
    );
// ...

Calling the model is the same as for embeddings. I needed different structs to parse the answer, but the flow is the same.
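The ClaudeResponse struct isn't shown in the snippets. A minimal sketch that matches the shape of the Anthropic messages response returned by Bedrock (only the fields used here; serde ignores the remaining fields) might look like this:

#[derive(Debug, serde::Deserialize)]
struct ClaudeResponse {
    content: Vec<ClaudeContent>,
}

#[derive(Debug, serde::Deserialize)]
struct ClaudeContent {
    #[serde(rename = "type")]
    content_type: String,
    text: String,
}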

//...
let generate_invocation = client
        .invoke_model()
        .content_type("application/json")
        .model_id("anthropic.claude-3-sonnet-20240229-v1:0")
        .body(Blob::new(prompt_for_llm.as_bytes().to_vec()))
        .send()
        .await
        .unwrap();

    let raw_response = generate_invocation.body().clone().into_inner();

    let generated_response = serde_json::from_slice::<ClaudeResponse>(&raw_response).unwrap();

    println!("{:?}", generated_response.content);

    // Prepare the response
    let resp = Response {
        req_id: event.context.request_id,
        msg: format!("Response {:?}.", generated_response),
    };

    // Return `Response` (it will be serialized to JSON automatically by the runtime)
    Ok(resp)

Test

Testing is the fun part. First, let's run the Lambda locally with cargo lambda. I've prepared a JSON file with the prompt in events/prompt.json:

{
    "prompt": "jakie kompetencje powinni mieć opiekunowie"
}

And a .env file in the function's root directory:

BUCKET_NAME=xxx
PREFIX=lance_db
TABLE_NAME=embeddings

The prompt is about what skills supervisors need to have. The document I've used as a knowledge base is a brochure prepared by the Polish Ministry of Education with general health and safety rules during holidays.

I run ...

cargo lambda watch --env-file .env

... and in the second terminal

cargo lambda invoke --data-file events/prompt.json

For this prompt, the context found in LanceDB is relevant.


I won't translate the answer, but the point is that it looks reasonable, and I can see that the injected context was included in the answer.


The answer for the same query, just without context, is still pretty good, but generic.


I've experimented with different queries, and not all of them returned relevant context from the vector database. Preparing the knowledge base and tuning the embeddings are things I would like to explore further.

Deployment

I use the RustFunction CDK construct to define the Lambda:

//lib/bedrock_rust-stack.ts

//....

const textGeneratorLambda = new RustFunction(this, "text-generator", {
      manifestPath: "lambdas/text_generator/Cargo.toml",
      environment: {
        BUCKET_NAME: vectorDbBucket.bucketName,
        PREFIX: "lance_db",
        TABLE_NAME: "embeddings",
      },
      memorySize: 512,
      timeout: cdk.Duration.seconds(30),
    });

    vectorDbBucket.grantRead(textGeneratorLambda);

    // add policy to allow calling bedrock
    textGeneratorLambda.addToRolePolicy(
      new iam.PolicyStatement({
        actions: ["bedrock:InvokeModel"],
        resources: [
          "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v1",
          "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        ],
        effect: iam.Effect.ALLOW,
      })
    );
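After cdk deploy, the function can also be exercised in the cloud, for example with cargo lambda (the function name is a placeholder for the one created by the stack):

cargo lambda invoke --remote --data-file events/prompt.json <deployed-function-name>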

Cold start penalty

Rust is so famous for its speed that it would be disappointing to see bad cold start results. In my case, the init duration was stable at around 300-400 ms. Pretty neat, since I am not "only" initializing SDK clients, but also "connecting" LanceDB to S3.

Overall performance

I didn't run full-blown performance tests, so don't treat my benchmarks too seriously.

Getting embeddings for the user's prompt - 200-300 ms.

Getting context from LanceDB - I observed a maximum of ~400 ms, but it depends on the query. S3 is the slowest (and cheapest) storage option for LanceDB. In my case, this is a fair tradeoff. Other serverless options are listed in the documentation.

The rest is invoking the Claude 3 model, which takes around 10 seconds to generate a full answer to my prompt.

Summary

LanceDB is an open-source vector database that separates storage from compute. Thanks to that, I was able to create a simple RAG pipeline using Lambda and S3.

AWS Bedrock offers multiple models as a service. The multilingual Amazon Titan embeddings model lets you create embeddings in various languages, including Polish. Claude 3 Sonnet is a new LLM with outstanding capabilities.

LanceDB and AWS Bedrock provide SDKs for Rust, which is very pleasant to work with. Additionally, thanks to Rust, the hit from cold starts is minimal.

I am aware that this is only scratching the surface. I plan to spend more time playing with LanceDB and AWS Bedrock. It feels like the sky is the limit.

Top comments (10)

perpil

Excellent tutorial. A couple of things came up when I was going through it.

  1. Depending on the platform you run this on, you might need a different dependency for vectordb when doing document processing. When I ran it on a Mac M1 Pro, I needed to make this change to the package.json

    -    "@lancedb/vectordb-linux-x64-gnu": "^0.3.5",
    +    "@lancedb/vectordb-darwin-arm64": "^0.4.16",
    
  2. In order for it to successfully upload the document to s3, when you set LANCEDB_BUCKET in document-processor/.env set the bucket with s3:// as a prefix and /lance_db as the suffix so future steps work. I set it like this: LANCEDB_BUCKET=s3://<my-s3-bucket>/lance_db

  3. Couple minor typos 😉: search for SKDs, Cloude and Sonet

szymon-szym

Thank you so much for your comments!

I've added a comment regarding different architectures and aligned the shape of environment variables so they are now the same in both services.
I should've double-checked the names of the technologies I am referring to :)

Timon Vonk

Really nice! I've been working on a RAG framework called Swiftide, which would save a lot of code in this tutorial and open a lot of doors. I would love to hear your feedback <3

szymon-szym

I will definitely take a look at your project. Thanks!

Connie Leung

Good post. I can apply this technique to other platforms and models.

szymon-szym

Thank you Connie!
Absolutely. LanceDB is vendor agnostic, and the models are available via different APIs.

Connie Leung

Is Bedrock the OpenAI of Amazon? Can anyone sign up for AWS and use the service to build a generative AI application?

szymon-szym

Bedrock is a service that provides access to multiple models. Some of them are created by Amazon, others by different vendors. With an AWS account, you can use the service to build apps, which is especially handy if the rest of your solution is already on AWS, but that is not a requirement.

Amazon partners with Anthropic - an AI company that creates foundation models. You might say that in some ways it is similar to the relationship between Microsoft and OpenAI, but that would probably be an oversimplification.

Giuseppe

Hey! Thanks for quoting my article <3 I wrote it with Kevin Shaffer-Morrison from my team at AWS. It was good fun! Currently working on a front-end that we can publish :P

szymon-szym

Hi Giuseppe! Thanks for commenting! It is so cool.
I've added Kevin's name to credit him too :)
I can't wait to check out the front end you are preparing.