In the world of ML systems, performance and latency are critical factors that can make or break an application. As models grow larger and more complex, efficient caching strategies become increasingly important. This article looks at the main caching techniques used in ML systems, with a focus on their implementation, benefits, and potential drawbacks.

Caching is perhaps the most underrated component of an AI platform. It can significantly reduce an application’s latency and cost, making it a crucial aspect of ML system design. In this article, we’ll explore four main caching techniques used in ML inference:

  1. Key-Value (KV) Cache
  2. Prompt Cache
  3. Exact Cache
  4. Semantic Cache

Each of these techniques offers unique advantages and is suited for different scenarios. Let’s dive into the details of each approach.

Key-Value (KV) cache

While our main focus will be on higher-level caching techniques, it’s important to understand the foundational KV Cache used in many language models.

How KV cache works

KV Cache is a technique used to speed up autoregressive token generation in large language models (LLMs). Here’s how it operates:

  1. In naive autoregressive token generation, attention states are fully recalculated at every step.
  2. KV Cache initially computes attention states for the input and caches them in memory.
  3. For subsequent steps, the model reuses the cached values to compute the attention state of the new token.
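
To make the loop above concrete, here is a minimal single-head attention sketch in NumPy. It is an illustrative toy rather than a production implementation: the KVCache class, the toy dimensions, and the random weights are all assumptions made for the example, but it shows how each step projects only the newest token and reuses the cached keys and values for everything before it.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Toy single-head KV cache: keys/values are appended once, never recomputed."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token; earlier tokens stay cached.
        q = x @ Wq
        self.keys = np.vstack([self.keys, x @ Wk])
        self.values = np.vstack([self.values, x @ Wv])
        # Attention of the new token over all cached keys/values.
        scores = softmax(q @ self.keys.T / np.sqrt(q.shape[0]))
        return scores @ self.values

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = rng.standard_normal((3, d, d))
cache = KVCache(d)
for token_embedding in rng.standard_normal((5, d)):  # five decoding steps
    out = cache.step(token_embedding, Wq, Wk, Wv)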

Benefits of KV cache

The primary benefit of KV cache is a significant reduction in computation required for self-attention. Specifically:

  • Without KV cache: ~6nd² + 4n²d FLOPs
  • With KV cache: ~6d² + 4nd FLOPs

Where n is the number of tokens and d is the hidden dimension size.
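
For example, with n = 1,000 tokens and d = 4,096, the formulas above work out to roughly 117 billion FLOPs per generation step without the cache versus roughly 117 million with it, an approximately n-fold (here ~1,000×) reduction.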

This reduction in computation translates to faster inference times, especially for longer sequences.

Prompt Cache

Prompt Cache is a novel technique introduced by Gim et al. in November 2023. It’s designed to reuse attention states across different LLM prompts, leveraging the fact that many prompts have overlapping text segments.

[Figure: the Prompt Cache technique. Comparison of LLM token generation methods, each showing three steps (1 to 3); each box indicates a token.]

How Prompt Cache works

  1. Prompt Cache uses a Prompt Markup Language (PML) to define reusable text segments called “prompt modules”.
  2. These prompt modules are precomputed and stored in memory.
  3. When a prompt containing cached segments is received, Prompt Cache retrieves the precomputed attention states instead of recomputing them.
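
As a rough illustration of this lookup-instead-of-recompute flow, here is a hedged Python sketch. The encode_states function is a hypothetical stand-in for a model forward pass that returns per-token attention (KV) states, and the sketch ignores details the real technique must handle, such as module positions within the prompt and parameterized segments.

module_cache = {}  # module name -> precomputed attention (KV) states

def precompute_module(name, text, encode_states):
    # Done once, offline or at startup, for each reusable prompt module.
    module_cache[name] = encode_states(text)

def assemble_prompt_states(segments, encode_states):
    """segments: ordered list of ("module", name) or ("text", raw_string)."""
    states = []
    for kind, value in segments:
        if kind == "module" and value in module_cache:
            states.append(module_cache[value])   # reuse precomputed states
        else:
            states.append(encode_states(value))  # compute only for new text
    return states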

Key components of Prompt Cache

  1. Schema: A document that defines prompt modules and their relative positions.
  2. Prompt Modules: Reusable text segments with precomputed attention states.
  3. Parameterization: Allows customization of prompt modules while maintaining caching benefits.

Benefits of Prompt Cache

  • Significant reduction in time-to-first-token (TTFT) latency:
    • Up to 8× reduction for GPU-based inference
    • Up to 60× reduction for CPU-based inference
  • Maintains output accuracy without model parameter modifications
  • Particularly effective for long-context applications (e.g., legal analysis, healthcare, education)

Implementation example

Here’s a simplified example of how a schema and a prompt that uses it might be defined in PML:


<schema name="cities">
  <module name="city-info">...</module>
  <module name="trip-plan">
    <param name="duration" len="2"/>
    ...
  </module>
  <module name="tokyo">...</module>
  <module name="miami">...</module>
</schema>
<prompt schema="cities">
  <trip-plan duration="3 days" />
  <miami/>
  Highlight the surf spots.
</prompt>

In this example, the trip-plan and miami modules would use precomputed attention states, while only the “Highlight the surf spots” portion would require new computation.

[Figure: the reuse mechanism in Prompt Cache. PML makes reusable prompt modules explicit in both the schema and the prompt.]

Memory considerations

The memory overhead of Prompt Cache depends on the model size:

Model        Memory per Token
Falcon 1B    0.18 MB
Llama 7B     0.50 MB
Llama 70B    2.5 MB
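
These figures are roughly what you get from per-token KV-state size ≈ 2 × num_layers × hidden_dim × bytes per value, assuming fp16 and full multi-head attention: for Llama 7B that is 2 × 32 × 4,096 × 2 bytes ≈ 0.5 MB per token, so caching a 1,000-token prompt module takes on the order of 0.5 GB.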

For larger models, CPU memory might be the only viable option for storing prompt modules.

Exact cache

Exact cache is a more general caching technique that stores processed items for reuse when identical items are requested again.

How exact cache works

  1. The system stores processed items (e.g., summaries, vector search results) in a cache.
  2. When a new request comes in, the system checks if an identical request exists in the cache.
  3. If found, the cached result is returned; if not, the request is processed and the result is cached for future use.
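
A minimal in-process sketch of this flow might look like the following, where run_pipeline is a hypothetical stand-in for whatever expensive step is being memoized (summarization, vector search, SQL execution, and so on):

import hashlib
import json

_cache = {}  # request hash -> cached result

def _key(request):
    # Canonicalize the request so identical requests always hash to the same key.
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

def cached_call(request, run_pipeline):
    k = _key(request)
    if k in _cache:                 # hit: return the stored result
        return _cache[k]
    result = run_pipeline(request)  # miss: do the expensive work
    _cache[k] = result              # store for future identical requests
    return result

In production you would typically back this with Redis or PostgreSQL and bound its size with an eviction policy, as discussed next.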

Implementation considerations

  • Can be implemented using in-memory storage for fast retrieval
  • For larger caches, databases like PostgreSQL or Redis can be used
  • Requires an eviction policy (e.g., LRU, LFU, FIFO) to manage cache size
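
As one example of the eviction point above, here is a small LRU cache sketch built on collections.OrderedDict; the capacity of 10,000 entries is an arbitrary illustrative choice:

from collections import OrderedDict

class LRUCache:
    def __init__(self, max_items=10_000):   # capacity is an illustrative choice
        self.max_items = max_items
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)   # evict the least recently used entry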

Use cases

Exact cache is particularly useful for:

  • Queries requiring multiple steps (e.g., chain-of-thought reasoning)
  • Time-consuming actions (e.g., retrieval, SQL execution, web search)
  • Embedding-based retrieval to avoid redundant vector search

Caching duration

The decision to cache a query and for how long depends on several factors:

  • Likelihood of the query being called again;
  • User-specificity of the query (e.g., “What’s the status of my recent order?”);
  • Time-sensitivity of the query (e.g., “How’s the weather?”).

Some teams use a small classifier to predict whether a query should be cached.
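
As a hedged illustration of that last point, here is a toy “should we cache this?” classifier; the example queries, labels, and scikit-learn pipeline are all assumptions made for the sketch, and a real system would train on features derived from its own query logs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = worth caching, 0 = skip (user-specific or time-sensitive).
queries = [
    "what is your refund policy",
    "how do i reset my password",
    "what's the status of my recent order",
    "how's the weather right now",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(queries, labels)

# With real log-derived training data, this prediction would gate what enters the cache.
print(clf.predict(["what is the refund policy for returns"]))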

Semantic cache

Semantic cache extends the idea of exact cache by allowing the reuse of similar, but not identical, queries.

How semantic cache works

  1. Generate an embedding for each query using an embedding model.
  2. Use vector search to find the closest cached embedding to the current query embedding.
  3. If the similarity score exceeds a set threshold, return the cached results; otherwise, process the new query and cache it.
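
To make those three steps concrete, here is a minimal in-memory sketch. The embed and generate callables are hypothetical stand-ins for an embedding model and the downstream LLM call, the plain Python list stands in for a real vector database, and the 0.9 threshold is an arbitrary illustrative value that would need tuning on real traffic:

import numpy as np

SIM_THRESHOLD = 0.9      # illustrative; tune on your own traffic
_cached = []             # list of (normalized embedding, cached response) pairs

def semantic_cached_call(query, embed, generate):
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    best_sim, best_resp = -1.0, None
    for emb, resp in _cached:
        sim = float(q @ emb)            # cosine similarity (embeddings pre-normalized)
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    if best_sim >= SIM_THRESHOLD:       # similar enough: reuse the cached result
        return best_resp
    resp = generate(query)              # otherwise process the query...
    _cached.append((q, resp))           # ...and cache it for future lookups
    return resp

A production system would replace the linear scan with a vector database or an approximate nearest-neighbor index.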

Implementation requirements

  • Embedding model for query vectorization;
  • Vector database to store embeddings of cached queries;
  • Similarity metric and threshold for determining query similarity.

Pros and Cons

Pros:

  • Can potentially reuse results for semantically similar queries;
  • May reduce computation for frequently asked variations of the same question.

Cons:

  • Relies on high-quality embeddings and functional vector search;
  • Setting the right similarity threshold can be challenging;
  • Risk of returning incorrect responses if queries are mistakenly considered similar;
  • Vector search can be time-consuming and compute-intensive.

When to use Semantic cache

Consider semantic cache if:

  • Your application receives many semantically similar queries;
  • You have a reliable way to determine query similarity;
  • The potential cache hit rate is high enough to justify the added complexity.

Comparison of caching techniques

Let’s compare the four caching techniques we’ve discussed:

Technique        Scope              Complexity  Memory Usage  Latency Reduction
KV Cache         Single prompt      Low         Medium        High
Prompt Cache     Across prompts     Medium      High          Very High
Exact Cache      Identical queries  Low         Varies        High
Semantic Cache   Similar queries    High        High          Medium to High

Implementation considerations

When implementing caching in your ML system, consider the following:

  1. Memory Management: Balance between speed (in-memory) and capacity (disk-based storage).
  2. Cache Eviction: Implement appropriate eviction policies to manage cache size.
  3. Accuracy vs. Speed: For semantic cache, carefully tune similarity thresholds.
  4. Monitoring: Implement logging and monitoring to track cache hit rates and performance improvements.
  5. Updates: Design a system to update cached results when underlying data or models change.
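
For the monitoring point, even a very small counter can be useful; the sketch below is a hypothetical example of tracking hit rate, not a prescribed tool:

from collections import Counter

cache_stats = Counter()

def record(hit):
    # Call this on every cache lookup with hit=True or hit=False.
    cache_stats["hits" if hit else "misses"] += 1

def hit_rate():
    total = cache_stats["hits"] + cache_stats["misses"]
    return cache_stats["hits"] / total if total else 0.0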

Future trends

As ML systems continue to evolve, we can expect to see:

  1. More sophisticated caching strategies tailored to specific ML architectures.
  2. Integration of caching techniques into ML frameworks and platforms.
  3. Advancements in embedding techniques to improve semantic caching accuracy.
  4. Development of hybrid caching approaches that combine multiple techniques.

Conclusion

Caching is a powerful tool in the ML system designer’s toolkit. By intelligently implementing techniques like KV Cache, Prompt Cache, Exact Cache, and Semantic Cache, developers can significantly reduce latency and improve the performance of their ML applications.

Each caching technique offers unique benefits and is suited for different scenarios. KV Cache provides foundational performance improvements for language models. Prompt Cache shows promising results for reusing computation across similar prompts. Exact Cache offers a straightforward solution for frequently repeated queries. Semantic Cache, while more complex, can potentially handle variations of similar queries.

As you design and optimize your ML systems, carefully consider which caching techniques align best with your specific use cases, performance requirements, and resource constraints. Remember that the effectiveness of caching can vary greatly depending on your application’s characteristics, so always measure and validate the impact of your caching strategies.