In the world of AI, efficiency and cost-effectiveness are key factors in building successful applications. The Gemini API, a powerful tool for AI developers, has introduced a new feature called context caching that aims to streamline AI workflows and reduce operational costs.

This article will explore the concept of context caching, its benefits, and how it can be utilized effectively.

What is context caching?

Context caching is an innovative feature introduced by the Gemini API that revolutionizes the way developers interact with AI models. It allows developers to store frequently used input tokens in a dedicated cache and reference them for subsequent requests, eliminating the need to repeatedly pass the same set of tokens to a model. This approach offers several significant benefits.

Firstly, context caching can lead to substantial cost savings. In a typical AI workflow, developers often pass the same input tokens multiple times to a model, which can be expensive, especially when dealing with large volumes of data. By caching these tokens once and referring to them as needed, developers can reduce the number of tokens sent to the model, thereby lowering the overall operational costs.

Secondly, context caching can improve latency and performance. When input tokens are cached, subsequent requests that reference those tokens can be processed faster, as the model doesn’t need to process the same tokens repeatedly. This can result in quicker response times and a more efficient AI workflow, particularly when dealing with complex and data-intensive tasks.

Moreover, context caching is particularly well-suited for scenarios where a substantial initial context is referenced repeatedly by shorter requests, such as a chatbot with long system instructions or repeated queries against the same video or document set. By leveraging caching in such scenarios, developers can optimize their AI workflows and achieve better performance while reducing costs.

Below is a diagram illustrating a practical implementation of context caching:

[Figure: an example of context caching with an LLM]

How does context caching work?

Context caching in the Gemini API is a straightforward process that allows developers to have fine-grained control over the caching mechanism. When using context caching, developers can specify how long they want the cached tokens to persist before being automatically deleted. This duration is known as the time to live (TTL) and can be set according to the specific needs of the application.

The TTL plays a crucial role in determining the cost of caching. The longer the TTL, the higher the cost, as the cached tokens will occupy storage space for a longer period. Developers can optimize their caching strategy by carefully considering the appropriate TTL for their use case, balancing the benefits of caching with the associated costs.
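To make the TTL trade-off concrete, the sketch below compares the storage cost of the same cache under two TTLs. The per-token-hour rate is a hypothetical placeholder, not actual Gemini pricing; consult the official pricing page for real figures.

```python
# Hypothetical storage rate -- NOT actual Gemini pricing.
STORAGE_RATE_PER_MTOK_HOUR = 1.00  # $ per million cached tokens per hour

def cache_storage_cost(cached_tokens: int, ttl_hours: float) -> float:
    """Cost of keeping `cached_tokens` in the cache for `ttl_hours`."""
    return cached_tokens / 1_000_000 * ttl_hours * STORAGE_RATE_PER_MTOK_HOUR

# A ~700k-token video cache held for 5 minutes vs. 1 hour:
short_ttl = cache_storage_cost(700_000, 5 / 60)
long_ttl = cache_storage_cost(700_000, 1.0)
print(f"5-minute TTL: ${short_ttl:.4f}, 1-hour TTL: ${long_ttl:.4f}")
```

The linear relationship means a TTL twice as long costs twice as much in storage, so picking the shortest TTL that still covers your burst of requests is usually the right call.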

The cost of caching also depends on the size of the input tokens being cached. The Gemini API charges based on the number of tokens stored in the cache, so developers should be mindful of the token count when deciding what content to cache. It’s essential to strike a balance between caching frequently used tokens and avoiding unnecessary caching of rarely accessed content.
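A rough pre-flight check can help decide whether a prefix is worth caching at all. The heuristic below is illustrative: the ~4-characters-per-token estimate is a crude English-text rule of thumb (for exact counts use the SDK's `model.count_tokens`), and the reuse threshold is an arbitrary assumption, while the 32,768 figure is the documented minimum cached token count.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English).
    For exact counts, use the SDK's model.count_tokens() instead."""
    return max(1, len(text) // 4)

def worth_caching(prompt_tokens: int, expected_reuses: int,
                  min_reuses: int = 3) -> bool:
    """Crude rule of thumb: cache only large prefixes (at or above the
    32,768-token minimum) that will be reused several times; small or
    one-off prompts are cheaper sent directly."""
    return prompt_tokens >= 32_768 and expected_reuses >= min_reuses
```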

The Gemini API supports context caching for both its Gemini 1.5 Pro and Gemini 1.5 Flash models, providing flexibility for developers working with different model variants. However, it’s important to note that context caching is only available for stable models with fixed versions. This ensures that the cached tokens remain compatible with the model version being used, preventing any potential inconsistencies or errors.

Use cases for context caching

Context caching is particularly useful in scenarios where a substantial initial context is referenced repeatedly by shorter requests. Some common use cases include:

  1. Chatbots with extensive system instructions
  2. Repetitive analysis of lengthy video files
  3. Recurring queries against large document sets
  4. Frequent code repository analysis or bug fixing

In these situations, context caching can significantly reduce the overall operational costs and improve the performance of AI applications.

Cost reduction

The billing for context caching is based on several factors, including the number of cached tokens, the storage duration (TTL), and other charges such as non-cached input tokens and output tokens. By leveraging context caching, developers can take advantage of reduced rates for cached tokens, leading to overall cost savings. It’s important to refer to the Gemini API pricing page for up-to-date pricing details.
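The billing components can be sketched as a simple calculation. All rates below are hypothetical placeholders (see the Gemini API pricing page for real figures); the token counts are taken from the example output later in this article, where 696,189 of the 696,226 prompt tokens came from the cache.

```python
# Hypothetical rates -- NOT actual Gemini pricing.
RATES = {
    "cached_input": 0.25,   # $ per million cached input tokens
    "regular_input": 1.00,  # $ per million non-cached input tokens
    "output": 2.00,         # $ per million output tokens
}

def request_cost(cached_tokens: int, fresh_tokens: int,
                 output_tokens: int) -> float:
    """Total cost of one request, split by token category."""
    return (cached_tokens / 1e6 * RATES["cached_input"]
            + fresh_tokens / 1e6 * RATES["regular_input"]
            + output_tokens / 1e6 * RATES["output"])

# 696,189 cached tokens, 37 fresh prompt tokens, 351 output tokens:
with_cache = request_cost(696_189, 37, 351)
without_cache = request_cost(0, 696_226, 351)
print(f"with cache: ${with_cache:.4f}, without: ${without_cache:.4f}")
```

Because cached tokens are billed at a discounted rate, the savings per request grow with the share of the prompt served from the cache, though storage (TTL) charges must be added on top.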

Implementing context caching

To utilize context caching, developers need to install the Gemini SDK and configure an API key. The process involves uploading the content to be cached, creating a cache with a specified TTL, and constructing a GenerativeModel that uses the created cache. Once the cache is set up, developers can query the model with their prompts, and the cached content will be used as a prefix to the prompt, seamlessly integrating the cached tokens into the AI workflow.

The following snippet shows how to generate content using cached content and a prompt.

import datetime
import time

import google.generativeai as genai
from google.generativeai import caching

# Download video file
# !wget

video_file_name = "Sherlock_Jr_FullMovie.mp4"

# Upload the video using the Files API
video_file = genai.upload_file(path=video_file_name)

# Wait for the file to finish processing
while video_file.state.name == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5 minute TTL
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="sherlock jr movie",  # used to identify the cache
    system_instruction=(
        "You are an expert video analyzer, and your job is to answer "
        "user's query based on the video file you have access to."
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([
    "Introduce different characters in the movie by describing their "
    "personality, looks, and names. Also list the timestamps they were "
    "introduced for the first time."
])

print(response.usage_metadata)

# The output should look something like this:
# prompt_token_count: 696226
# candidates_token_count: 351
# total_token_count: 696577
# cached_content_token_count: 696189

Additional considerations

When using context caching, developers should keep some important considerations in mind:

  1. The minimum input token count for context caching is 32,768, and the maximum is the same as the maximum for the given model.
  2. The TTL defaults to 1 hour if not set explicitly.
  3. Content can be removed from the cache manually using the delete operation provided by the caching service.
  4. There are no special rate or usage limits on context caching for the paid tier, but there are some limitations for the free-of-charge tier.
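Managing an existing cache can be sketched with the SDK's caching module. This assumes the google-generativeai Python SDK and an already-configured API key; it lists active caches, extends a cache's TTL, and deletes it when it is no longer needed.

```python
import datetime

from google.generativeai import caching

# Inspect all active caches for this project.
for c in caching.CachedContent.list():
    print(c.display_name, c.expire_time)

# Extend the lifetime of a cache you still need
# (`cache` is a CachedContent object created earlier).
cache.update(ttl=datetime.timedelta(hours=2))

# Remove a cache early instead of waiting for the TTL to expire,
# stopping its storage charges immediately.
cache.delete()
```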


Context caching is a powerful feature introduced by the Gemini API that can significantly improve the efficiency and cost-effectiveness of AI workflows. By caching frequently used input tokens and referencing them for subsequent requests, developers can reduce operational costs and improve the performance of their AI applications. With its support for both Gemini 1.5 Pro and Gemini 1.5 Flash models, context caching is a valuable tool for any developer looking to optimize their AI workflows.


Categorized in:

Deep Learning, LLMs, Programming

Last Update: 19/06/2024