In recent years, the rapid development of large language models (LLMs) has brought significant increases in context window size. The context window is the amount of information a model can process in a single request, and use cases such as Retrieval-Augmented Generation (RAG) and video or image inputs mean that prompts fill more and more of that window. Longer contexts let models handle more complex tasks and a wider range of information.
In response, major providers have introduced "prompt caching" to handle long prompts efficiently. Prompt caching stores the already-processed beginning (prefix) of a prompt so that later requests starting with the same content do not have to process it again. This shortens processing time and reduces cost.
In this article, we compare the prompt caching features of three key LLM providers, OpenAI, Anthropic, and Google (Gemini), focusing on their specifications and differences.
Models Supporting Prompt Caching
Prompt caching is available only on relatively recent models from each provider.
OpenAI
- gpt-4o
- gpt-4o-mini
- o1-preview
- o1-mini
Anthropic
- Claude 3.5 Sonnet
- Claude 3 Opus
- Claude 3 Haiku
Gemini
- Stable versions of Gemini 1.5 Flash (e.g., gemini-1.5-flash-001)
- Stable versions of Gemini 1.5 Pro (e.g., gemini-1.5-pro-001)
Time to Live (TTL) for Cache Storage
OpenAI
The cache is typically cleared after 5 to 10 minutes of inactivity, but it can persist for up to an hour during off-peak times.
Anthropic
By default, the cache is stored for 5 minutes, and the timer is refreshed each time the cached content is used.
Gemini
The default TTL is 1 hour, but you can specify a custom TTL (additional charges apply if extended).
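To make the TTL figures above concrete, here is a minimal sketch that checks whether a cache would still be alive at a given moment. The `DEFAULT_TTL` table and the `is_cache_alive` helper are illustrative names, not part of any provider SDK, and the OpenAI entry assumes the lower bound of its 5 to 10 minute range.

```python
from datetime import datetime, timedelta

# Default TTLs as quoted in this article. Assumption: OpenAI is modelled
# with the conservative lower bound of its 5-10 minute range.
DEFAULT_TTL = {
    "openai": timedelta(minutes=5),
    "anthropic": timedelta(minutes=5),
    "gemini": timedelta(hours=1),
}


def is_cache_alive(provider: str, since: datetime, now: datetime) -> bool:
    """Return True if a cache whose TTL clock started at `since` is still valid.

    `since` is the moment the TTL started counting (creation or last use,
    depending on the provider).
    """
    return now - since < DEFAULT_TTL[provider]


started = datetime(2024, 10, 1, 12, 0, 0)
seven_minutes_later = started + timedelta(minutes=7)
for provider in DEFAULT_TTL:
    print(provider, is_cache_alive(provider, started, seven_minutes_later))
```

Under these assumptions, a gap of seven minutes between requests already falls outside the 5-minute defaults, while the Gemini cache is still valid.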
Pricing
OpenAI
Input tokens served from the cache are discounted by 50% on all supported models, while output token costs remain the same.
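As a rough illustration of this discount, the sketch below computes the input cost of one request, assuming the 50% discount applies only to the tokens served from the cache. The function name and the price used in the example are hypothetical; check the current price list for real figures.

```python
def openai_input_cost(prompt_tokens: int, cached_tokens: int,
                      price_per_million: float) -> float:
    """Input cost in dollars when cached tokens are billed at 50% off."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    # Cached tokens count for half; uncached tokens are billed in full.
    billable_tokens = uncached_tokens + 0.5 * cached_tokens
    return billable_tokens * price_per_million / 1_000_000


# Hypothetical price: $2.50 per million input tokens.
print(openai_input_cost(10_000, 0, 2.50))      # no cache hit
print(openai_input_cost(10_000, 8_192, 2.50))  # 8,192 tokens read from cache
```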
Anthropic
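Reading from the cache is discounted relative to regular input tokens, while writing to the cache costs more than regular input. Output token prices are unchanged.
- Claude 3.5 Sonnet: cache reads 90% off the input price, cache writes 25% more
- Claude 3 Opus: cache reads 90% off the input price, cache writes 25% more
- Claude 3 Haiku: cache reads 88% off the input price, cache writes 20% more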
Gemini
Gemini has a complex pricing structure with costs including:
- Regular input/output costs when the cache is missed
- 75% discount on input costs when the cache is used
- Cache storage costs
Unlike OpenAI and Anthropic, Gemini charges for cache storage. For details and an example cost calculation, refer to the Gemini pricing documentation.
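The three cost components listed above can be combined into a rough estimate. This is a minimal sketch with hypothetical function names and placeholder prices. It covers only the components named in this article and ignores output tokens and any charge for creating the cache.

```python
def gemini_cached_cost(cached_tokens: int, new_tokens: int, requests: int,
                       hours_stored: float, input_price: float,
                       storage_price: float) -> float:
    """Input-side cost in dollars for `requests` calls that share one cache.

    input_price   -- dollars per million input tokens (placeholder value)
    storage_price -- dollars per million cached tokens per hour (placeholder)
    """
    # Cached tokens are billed at 25% of the input price (75% discount).
    per_request = (0.25 * cached_tokens + new_tokens) * input_price / 1_000_000
    storage = cached_tokens * storage_price * hours_stored / 1_000_000
    return requests * per_request + storage


def gemini_uncached_cost(shared_tokens: int, new_tokens: int, requests: int,
                         input_price: float) -> float:
    """Cost of sending the same shared content at full price on every call."""
    return requests * (shared_tokens + new_tokens) * input_price / 1_000_000


# Placeholder prices of $1.00 per million tokens, chosen for easy arithmetic.
with_cache = gemini_cached_cost(100_000, 200, requests=20, hours_stored=1,
                                input_price=1.0, storage_price=1.0)
without_cache = gemini_uncached_cost(100_000, 200, requests=20, input_price=1.0)
print(f"with cache: ${with_cache:.3f}, without cache: ${without_cache:.3f}")
```

With few requests per hour, the storage term can outweigh the discount, which is why the number of reuses matters when deciding whether to cache.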
How to Use Prompt Caching
OpenAI
No code changes are necessary.
Once a prompt reaches 1,024 tokens, its prefix is cached automatically. Cache hits occur in 128-token increments starting at 1,024 tokens (e.g., 1,024, 1,152, 1,280...).
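The increment rule can be expressed as a small calculation. The helper below is a sketch, not an OpenAI API. It returns the largest prefix length that could be served from the cache for a prompt of a given size, assuming the whole prompt matches a previously cached prefix.

```python
def cacheable_prefix_length(prompt_tokens: int) -> int:
    """Largest prefix of the form 1,024 + 128 * k that fits in the prompt."""
    if prompt_tokens < 1024:
        return 0  # below the minimum, nothing is cached
    extra_blocks = (prompt_tokens - 1024) // 128
    return 1024 + extra_blocks * 128


for n in (900, 1024, 1100, 1300, 5000):
    print(n, cacheable_prefix_length(n))
```

For example, a 1,300-token prompt can reuse at most 1,280 cached tokens, and the remaining 20 tokens are processed as usual.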
Anthropic
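Caching is opt-in. You mark the content to cache with a cache_control field in the request, and the example below uses the beta prompt caching endpoint of the Python SDK.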
import anthropic

client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
        },
        {
            # The large, static block is the one marked for caching.
            "type": "text",
            "text": "<the entire contents of 'Pride and Prejudice'>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Analyze the major themes in 'Pride and Prejudice'."}],
)
print(response)
Minimum tokens for cache usage:
- Claude 3.5 Sonnet and Claude 3 Opus: 1,024 tokens
- Claude 3 Haiku: 2,048 tokens
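One way to confirm that a prompt met the minimum and was actually cached is to inspect the `usage` section of the response, which reports `cache_creation_input_tokens` and `cache_read_input_tokens`. The helper below is an illustrative sketch that takes those two counts as plain integers, and its name and return labels are not part of the SDK.

```python
def classify_cache_usage(cache_creation_input_tokens: int,
                         cache_read_input_tokens: int) -> str:
    """Summarize what happened to the cache on a single request."""
    if cache_read_input_tokens > 0:
        # At least part of the prompt was served from an existing cache.
        return "hit"
    if cache_creation_input_tokens > 0:
        # Nothing was read, but a new cache entry was written.
        return "write"
    # Neither counter moved, e.g. the prompt was below the minimum length.
    return "not cached"


print(classify_cache_usage(0, 0))
print(classify_cache_usage(120_000, 0))
print(classify_cache_usage(0, 120_000))
```

A typical pattern is "write" on the first request and "hit" on follow-up requests sent within the TTL.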
Gemini
For Gemini, you must first create a cache using CachedContent.create, and then specify it when defining the model.
import os
import google.generativeai as genai
from google.generativeai import caching
import datetime
import time

# Get your API key from https://aistudio.google.com/app/apikey
genai.configure(api_key=os.environ['API_KEY'])

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4
path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
    print('Waiting for video to be processed.')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5-minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie',
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)
The minimum token count for cache usage in Gemini is 32,768.
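Taken together, the minimum sizes quoted in this section can be kept in a small lookup table. This is only a sketch. The keys are shorthand labels for the model families named in this article rather than exact API model IDs, and `can_cache` is a hypothetical helper.

```python
# Minimum prompt size (in tokens) for caching, as listed in this article.
MIN_CACHEABLE_TOKENS = {
    "gpt-4o": 1024,
    "gpt-4o-mini": 1024,
    "o1-preview": 1024,
    "o1-mini": 1024,
    "claude-3-5-sonnet": 1024,
    "claude-3-opus": 1024,
    "claude-3-haiku": 2048,
    "gemini-1.5-flash": 32768,
    "gemini-1.5-pro": 32768,
}


def can_cache(model: str, prompt_tokens: int) -> bool:
    """Return True if a prompt of this size is large enough to be cached."""
    return prompt_tokens >= MIN_CACHEABLE_TOKENS[model]


for model in ("gpt-4o", "claude-3-haiku", "gemini-1.5-pro"):
    print(model, can_cache(model, 2_000))
```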
Best Practices for Cache Usage
Place static content at the beginning of the prompt to maximize cache hit rates, because cache matching starts from the beginning of the prompt.
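The effect of ordering can be illustrated with a toy prefix comparison. In this sketch each list element stands in for a run of tokens, and `shared_prefix_len` is a hypothetical helper that counts how many leading segments two prompts have in common, which is a simplified stand-in for what a prefix cache can reuse.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading segments that two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


static = ["<system instructions>", "<tool definitions>", "<long document>"]

# Static content first: only the final segment differs between requests.
static_first = (static + ["question A"], static + ["question B"])
# Dynamic content first: the prompts diverge at the very first segment.
dynamic_first = (["question A"] + static, ["question B"] + static)

print(shared_prefix_len(*static_first))   # 3
print(shared_prefix_len(*dynamic_first))  # 0
```

With the static content first, all three static segments are shared between the two requests. With the question first, nothing is shared even though most of the prompt is identical.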
With Anthropic, caching must be requested explicitly and the TTL is a short 5 minutes, so it is best to cache elements that are reused frequently, such as system instructions, tool definitions, and RAG contexts.
Gemini offers longer TTLs, but because cache storage incurs costs, caching is best reserved for large-scale content such as code repositories, long videos, or extensive documents.
Thank you for reading. I hope this article was helpful. If you notice any inaccuracies, feel free to reach out.