Jihun Lim

Providing a caching layer for LLM with Langchain in AWS

Intro

In LLM-based apps, applying a caching layer can save money by reducing the number of API calls, and it can speed up responses by serving answers from the cache instead of waiting on model inference. In this post, let's take a look at how you can use the Redis offerings from AWS as a caching layer, including Vector search for Amazon MemoryDB for Redis, which was recently released in preview.

👇 Architecture with caching for LLM in AWS

[Image: cache_architecture]

LLM Caching integrations in đŸĻœī¸đŸ”— Langchain include In Memory, SQLite, Redis, GPTCache, Cassandra, and more.


Caching in đŸĻœī¸đŸ”—

Currently, Langchain offers two major caching methods, plus the option to enable or disable caching per model.

  • Standard Cache: a cache hit requires a prompt to exactly match a previously seen prompt.
  • Semantic Cache: a cache hit occurs when a prompt is semantically similar to a previously cached one.
  • Optional Caching: lets you enable or disable caching per LLM instance, as sketched below.
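
For instance, the per-model switch is just the cache flag on the LLM constructor. Here is a minimal sketch; the model and region mirror the setup used later in this post, and the local Redis URL is illustrative.

from langchain.globals import set_llm_cache
from langchain.cache import RedisCache
from langchain.llms.bedrock import Bedrock
from redis import Redis

# Global cache for every LLM in the process...
set_llm_cache(RedisCache(Redis.from_url("redis://localhost:6379")))

# ...while this particular instance opts out via the per-model flag.
llm_no_cache = Bedrock(
    model_id="anthropic.claude-v2:1",
    region_name="us-west-2",
    cache=False,  # skip the global cache for this model only
)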

Let's see how to use the RedisCache integrations provided by Langchain with Redis installed directly on EC2, ElastiCache for Redis, and MemoryDB for Redis.

✅ Testing is conducted with the Claude 2.1 model through Amazon Bedrock in a SageMaker Notebook Instance environment.


đŸŗ Redis Stack on EC2

This is how to install Redis directly on EC2 and use it as a VectorDB. To use Redis's Vector Search feature, you need the Redis Stack, which extends the core features of Redis OSS. I deployed the redis-stack image on EC2 via Docker, as shown below.

👇 Installing the Redis Stack with Docker



$ sudo yum update -y
$ sudo yum install docker -y
$ sudo service docker start
$ sudo docker run -d --name redis-stack -p 6379:6379 redis/redis-stack:latest
$ sudo docker ps
$ sudo docker logs -f redis-stack



💡 Use redis-cli to check for connection
$ redis-cli -c -h {$Cluster_Endpoint} -p {$PORT}

Once Redis is ready, install langchain, redis, and boto3 for using Amazon Bedrock.

$ pip install langchain redis boto3 --quiet

Standard Cache

Next, import the libraries required for the Standard Cache.



from langchain.globals import set_llm_cache
from langchain.llms.bedrock import Bedrock
from langchain.cache import RedisCache
from redis import Redis



Write the code that invokes the LLM as follows, providing the caching layer via the set_llm_cache() function.



ec2_redis = "redis://{EC2_Endpoint}:6379"
cache = RedisCache(Redis.from_url(ec2_redis))

llm = Bedrock(model_id="anthropic.claude-v2:1", region_name='us-west-2')
set_llm_cache(cache)



When measuring with Jupyter's built-in %%time magic, the Wall time drops significantly, from 7.82 s to 97.7 ms.
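
For reference, the measurement is simply a Jupyter cell run twice with the same prompt; the prompt below is illustrative.

%%time
# The second run of this exact cell is served from the Redis cache.
llm.invoke("Where is Las Vegas located?")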

[Image: redisStandard]

Semantic Cache

The Redis Stack Docker image I used includes RediSearch, Redis's vector similarity search module. To provide a Semantic Cache layer, import the libraries as follows.



from langchain.globals import set_llm_cache
from langchain.cache import RedisSemanticCache
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings



Unlike the Standard Cache, the Semantic Cache uses an embedding model to find answers with close semantic similarity, so we'll use the Amazon Titan Embeddings model.



llm = Bedrock(model_id="anthropic.claude-v2:1", region_name='us-west-2')
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name='us-west-2')
set_llm_cache(RedisSemanticCache(redis_url=ec2_redis, embedding=bedrock_embeddings))



When we queried for the location of Las Vegas and then made a second query about Vegas, which is semantically similar, we got a cache hit and the Wall time dropped dramatically, from 4.6 s to 532 ms.
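
As a sketch, the two cells look like this; the prompts are illustrative.

%%time
llm.invoke("Where is Las Vegas located?")  # cache miss: calls Bedrock and stores the result

%%time
llm.invoke("Where is Vegas?")  # semantically similar prompt: served from the cache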

[Image: redisSemantic]


☁ī¸ Amazon ElastiCache(Serverless) for Redis

Amazon ElastiCache is a fully managed, Redis-compatible service. Simply swapping the ElastiCache endpoint into the same code used for Redis on EC2 yields the following results.

❗ī¸ If you are using ElastiCache Serverless, which was announced on 11/27/2023, there are some differences. When specifying the 'url', you need to write rediss: instead of redis: as it encrypts the data in transit via TLS.

⚡ī¸ How to enable TLS with redis-cli on Amazon Linux 2

Build the redis-cli utility with the TLS option enabled:



$ sudo yum -y install openssl-devel gcc
$ wget http://download.redis.io/redis-stable.tar.gz
$ tar xvzf redis-stable.tar.gz
$ cd redis-stable
$ make distclean
$ make redis-cli BUILD_TLS=yes
$ sudo install -m 755 src/redis-cli /usr/local/bin/



Connectivity : $ redis-cli -c -h {$Cluster_Endpoint} --tls -p {$PORT}

Standard Cache

Because the Standard Cache stores no separate embedding values, LLM caching works on ElastiCache, which is compatible with Redis OSS. For the same question, the Wall time dropped significantly between the two iterations, from 45.4 ms to 2.76 ms.

[Image: ecStandard]

Semantic Cache

On the other hand, ElastiCache does not support Vector Search, so for the Semantic Cache the same code as above produces the following error message: ResponseError: unknown command 'module', with args beginning with: LIST. The error occurs because ElastiCache does not expose the MODULE LIST command, which Langchain uses to look for RediSearch. In other words, ElastiCache doesn't provide Vector Search, so you can't use the Semantic Cache.
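
You can verify this yourself by asking the endpoint which modules it exposes; a quick redis-py check, with an illustrative endpoint placeholder:

from redis import Redis

r = Redis.from_url("rediss://{ElastiCache_Endpoint}:6379")
# Redis Stack returns a list that includes the 'search' module;
# ElastiCache raises ResponseError: unknown command 'module'.
print(r.execute_command("MODULE", "LIST"))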


⛅ī¸ Amazon MemoryDB for Redis

MemoryDB is another in-memory database service from AWS, offering Redis compatibility plus durability. Again, it works well with the Standard Cache, which stores no embedding values, but the Semantic Cache returns the same error message as on ElastiCache, because MemoryDB (without the Vector search preview) doesn't support Vector Search either.

❗ī¸ Note that MemoryDB also uses TLS by default, just like ElastiCache Serverless.

Standard Cache

As this MemoryDB cluster does not support Vector search, I will only cover the Standard Cache case here. For the same question, the Wall time dropped from 6.67 s to 38.2 ms between iterations.

[Image: mmrStandard]


🌩ī¸ Vector search for Amazon MemoryDB for Redis

Finally, it's time for MemoryDB with Vector search. This newly launched capability, available in public preview, is part of MemoryDB itself: you enable Vector search when creating the cluster, and the setting cannot be changed after the cluster is created.

❗ī¸ The content is based on testing during the 'public preview' stage and the results may vary in the future.

Standard Cache

For the same question, it can be observed that the Wall time for each iteration has reduced from 14.8s to 2.13ms.

[Image: vmmrStandard]

Semantic Cache

Before running this test, I actually expected the same results as the Redis Stack, since Vector search is supported. However, I got the same error message as with the Redis offerings that do not support Vector Search.

Of course, the missing Langchain Cache support doesn't mean that this update lacks Vector search. I'll clarify this in the next section.


Redis as a Vector Database

If you check the Langchain MemoryDB sample on the aws-samples GitHub, you can find example code that uses Redis as a VectorStore. If you monkey-patch Langchain based on that code, you can use MemoryDB as a VectorDB, as shown below.

[Image: vmmrSemantic]

The example above implements the cache using the Foundation Model (FM) Buffer Memory pattern introduced in the AWS documentation: MemoryDB serves as a buffer memory for the language model, returning a cached answer whenever a semantic search hit occurs.
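
I won't reproduce the patched code here, but the underlying pattern is simple. Below is a minimal sketch of the FM Buffer Memory idea on top of a Langchain vector store; cached_answer, the score threshold, and the index name are my own illustrative choices, and the aws-samples monkey patch is assumed so that the Redis vector store can talk to MemoryDB.

from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock
from langchain.vectorstores.redis import Redis as RedisVectorStore

MEMORYDB_URL = "rediss://{MemoryDB_Endpoint}:6379"  # illustrative endpoint
SCORE_THRESHOLD = 0.2  # illustrative distance cutoff (lower = more similar)

llm = Bedrock(model_id="anthropic.claude-v2:1", region_name="us-west-2")
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name="us-west-2")

# Seed the index; assumes the patched vector store works against MemoryDB.
store = RedisVectorStore.from_texts(
    ["warmup"], embeddings, redis_url=MEMORYDB_URL, index_name="llm_cache",
    metadatas=[{"answer": ""}],
)

def cached_answer(question: str) -> str:
    """Return a buffered answer for a semantically similar question, else ask the LLM."""
    hits = store.similarity_search_with_score(question, k=1)
    if hits and hits[0][1] <= SCORE_THRESHOLD and hits[0][0].metadata.get("answer"):
        return hits[0][0].metadata["answer"]  # semantic hit: answer straight from MemoryDB
    answer = llm.invoke(question)             # miss: call Bedrock
    store.add_texts([question], metadatas=[{"answer": answer}])
    return answer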

❗ī¸ This example is only possible on MemoryDB with Vector search enabled. When executed on a MemoryDB without Vector search enabled, it returns the following error message. ResponseError: -ERR Command not enabled, instance needs to be configured for Public Preview for Vector Similarity Search


Outro

The test results so far are tabulated as follows.

Langchain Cache Test Results

| Cache/DB | Redis Stack on EC2 | ElastiCache (Serverless) | MemoryDB | Vector search for MemoryDB (Preview) |
| --- | --- | --- | --- | --- |
| Standard | O | O | O | O |
| Semantic | O | X | X | Partial support (expected to be available in the future) |

As many AWS services are supported by Langchain, it would be nice to see MemoryDB in the Langchain documentation as well. I originally planned to test only MemoryDB with Vector search, but out of curiosity I ended up adding more test targets. Nevertheless, it was fun to explore the different Redis-compatible services on AWS, whether or not they require TLS, and the other subtle differences in their Redis feature support.

Thanks for taking the time to read this, and please point out any errors! 😃
