As LLMs continue to grow in size and complexity, understanding their GPU memory requirements becomes crucial for efficient deployment and serving. This article dives deep into the intricacies of GPU memory usage for LLMs, providing a detailed breakdown of components, calculations, and optimization techniques.

💡 If you're interested in learning how to run even smaller LLMs efficiently on CPU, check out my article "How to run LLMs on CPU-based systems" for detailed instructions and optimization tips.

Introduction to LLM serving

Model serving is the process of deploying a trained machine learning model into production to make predictions on new data. For LLMs, this means making the model available to answer questions, generate text, or perform other tasks based on user input.

In essence:

Serving = Prompt IN, Answer OUT

The importance of GPU VRAM for LLMs

Large language models are computationally intensive and require substantial memory for storing model parameters and performing intermediate calculations during inference. While system RAM is abundant, it’s not ideal for LLM serving due to its relatively slower speed compared to GPU memory.

A deeper look at VRAM

During use, the GPU first looks for data in the L1 cache inside each streaming multiprocessor (SM); if the data is found there, there's no need to touch the L2 cache. If it isn't found in L1, that's a "cache miss" and the search continues in the L2 cache. On an L2 cache hit, the data is delivered to L1 and then to the processing cores; on an L2 miss, the request falls through to VRAM itself.

GPU memory, commonly called VRAM (Video RAM) and typically implemented with GDDR (Graphics DDR) memory, is designed for high-performance computing tasks like deep learning. It offers:

  • Higher bandwidth
  • Lower latency
  • Faster data transfer between memory and processing units

These characteristics make GPU memory crucial for efficiently running large language models without encountering speed bottlenecks.

Components of GPU memory usage in LLM serving

To accurately estimate GPU memory requirements, it’s essential to understand the main components that consume memory during LLM serving:

Model parameters (Weights)

Model parameters are the learned values that define how the LLM processes input data. The memory required for storing these parameters is directly proportional to the model size.

LLM model parameters

Memory calculation:

  • Each parameter typically requires 2 bytes when using half-precision (FP16) format.
  • For a model with P parameters: Memory = P * 2 bytes

Examples:

  • Small LLM (345 million parameters): 345 million * 2 bytes = 690 MB
  • LLaMA-2 13B: 13 billion * 2 bytes = 26 GB
  • GPT-3 (175 billion parameters): 175 billion * 2 bytes = 350 GB
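
These figures are simple multiplications, and a few lines of Python make the arithmetic explicit. The snippet below is a minimal sketch of the weights-only estimate, assuming FP16 storage (2 bytes per parameter) and decimal gigabytes:

```python
# Weights-only memory estimate: number of parameters * bytes per parameter.
def weight_memory_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), assuming FP16 by default."""
    return num_params * bytes_per_param / 1e9

for name, params in [("345M model", 345e6), ("LLaMA-2 13B", 13e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: {weight_memory_gb(params):.2f} GB")
# 345M model: 0.69 GB, LLaMA-2 13B: 26.00 GB, GPT-3 175B: 350.00 GB
```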

Key-Value (KV) cache memory

The KV cache stores intermediate representations needed for generating each token in a sequence. It’s crucial for maintaining context and enabling efficient attention to past tokens.

KV cache

Calculation of KV cache size per token:

  • Components: one key vector and one value vector per layer, each with dimension equal to the hidden size (H)
  • Memory per token: 2 * Number of Layers (L) * Hidden Size (H) * bytes per element

Example for LLaMA-2 13B:

  • Layers (L): 40
  • Hidden Size (H): 5120
  • Key Vectors: 40 * 5120 * 2 bytes = 400 KB
  • Value Vectors: 40 * 5120 * 2 bytes = 400 KB
  • Total KV Cache per Token: 800 KB

For an 8192-token sequence with 10 concurrent requests: 800 KB/token * 8192 tokens * 10 requests ≈ 65.5 GB
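
The same arithmetic is easy to script. The sketch below assumes FP16 values and decimal gigabytes; depending on whether you count in 1000-based or 1024-based units, the result lands between roughly 62 and 67 GB, which this article rounds to about 66 GB:

```python
# KV cache: one key vector and one value vector per layer, for every token in the sequence.
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * hidden_size * bytes_per_elem  # factor 2 = key + value

def kv_cache_gb(num_layers: int, hidden_size: int, seq_len: int, num_requests: int) -> float:
    per_token = kv_cache_bytes_per_token(num_layers, hidden_size)
    return per_token * seq_len * num_requests / 1e9

# LLaMA-2 13B: 40 layers, hidden size 5120, 8192-token sequences, 10 concurrent requests
print(f"{kv_cache_gb(40, 5120, 8192, 10):.1f} GB")  # ≈ 67.1 GB in decimal units
```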

Activations and temporary buffers

Activations are the outputs of neural network layers during inference, while temporary buffers are used for intermediate computations. They typically consume about 5-10% of the total GPU memory.

Memory overheads

Additional memory usage arises from inefficiencies in allocation and fragmentation:

  • Internal fragmentation: Allocated memory blocks not fully utilized
  • External fragmentation: Free memory split into small, non-contiguous blocks
  • Intermediate computations: Temporary tensors created during operations like matrix multiplications

Calculating GPU memory requirements

To estimate the total GPU memory required for serving an LLM, we need to account for all the components mentioned above. Here’s a step-by-step calculation:

Total Memory Required = Weights + KV Cache + Activations and Overhead

Let’s use the LLaMA-2 13B model as an example, assuming a maximum sequence length of 8192 tokens and 10 concurrent requests:

  1. Weights = 13 Billion * 2 Bytes = 26 GB
  2. Total KV cache memory = 800 KB * 8192 Tokens * 10 Concurrent Requests ≈ 66 GB
  3. Activations and Overhead = 0.1 * (26 GB + 66 GB) = 9.2 GB

Total memory required: 26 GB + 66 GB + 9.2 GB = 101.2 GB

This calculation shows that serving a LLaMA-2 13B model with these parameters would require at least three A100 40GB GPUs.
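
Putting the pieces together, a short helper can reproduce this estimate and translate it into a GPU count. This is a rough sketch that simply applies the same flat 10% overhead factor used above:

```python
import math

def total_serving_memory_gb(weights_gb: float, kv_cache_gb: float, overhead: float = 0.10) -> float:
    """Weights + KV cache, plus a flat overhead for activations, buffers, and fragmentation."""
    return (weights_gb + kv_cache_gb) * (1 + overhead)

total = total_serving_memory_gb(weights_gb=26, kv_cache_gb=66)
print(f"Total memory required: {total:.1f} GB")           # ≈ 101.2 GB
print(f"A100 40GB GPUs needed: {math.ceil(total / 40)}")  # 3
```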

Memory requirements for various LLM sizes

To provide a comprehensive overview, let’s look at the memory requirements for different model sizes and token lengths:

Table 1: LLM Serving for 1 Concurrent Request (Memory Required in GB)

| Model | 4k Tokens | 8k Tokens | 32k Tokens | 128k Tokens |
|-------|-----------|-----------|------------|-------------|
| 7B    | 17.6 GB   | 19.8 GB   | 33.0 GB    | 85.8 GB     |
| 13B   | 32.12 GB  | 35.64 GB  | 56.76 GB   | 141.24 GB   |
| 30B   | 72.05 GB  | 78.14 GB  | 114.47 GB  | 259.74 GB   |
| 66B   | 155.58 GB | 165.98 GB | 228.23 GB  | 478.00 GB   |
| 70B   | 165.55 GB | 177.07 GB | 244.11 GB  | 523.25 GB   |
| 175B  | 405.77 GB | 426.53 GB | 551.03 GB  | 1,049.58 GB |

Table 2: LLM Serving for 10 Concurrent Requests (Memory Required in GB)

| Model | 4k Tokens | 8k Tokens | 32k Tokens | 128k Tokens |
|-------|-----------|-----------|------------|-------------|
| 7B    | 37.4 GB   | 59.4 GB   | 191.4 GB   | 719.4 GB    |
| 13B   | 63.8 GB   | 99.0 GB   | 303.6 GB   | 1,128.6 GB  |
| 30B   | 126.5 GB  | 181.5 GB  | 528.0 GB   | 1,914.0 GB  |
| 66B   | 244.2 GB  | 343.2 GB  | 937.2 GB   | 3,313.2 GB  |
| 70B   | 264.0 GB  | 374.0 GB  | 1,034.0 GB | 3,674.0 GB  |
| 175B  | 583.0 GB  | 781.0 GB  | 1,969.0 GB | 6,721.0 GB  |

These tables illustrate the dramatic increase in memory consumption as model size and concurrent requests grow. The KV cache is a significant contributor to this increase, especially for longer sequences and multiple concurrent requests.

Challenges in GPU memory management for LLMs

Memory fragmentation and over-allocation

Static allocation of KV cache memory often leads to over-allocation, as the system reserves space for the maximum possible sequence length. This results in wasted memory for shorter sequences.

Both internal and external fragmentation reduce the effective usable memory, limiting the number of requests that can be processed simultaneously.

Advanced decoding algorithms

Techniques like Beam Search and Parallel Sampling introduce additional memory challenges:

  • Beam search: Generates multiple candidate sequences, each requiring its own KV cache
  • Parallel sampling: Produces multiple independent outputs simultaneously, multiplying memory requirements

These methods can lead to unpredictable memory demands and may exceed GPU capacity, potentially requiring data offloading to slower CPU memory or disk.
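
To make the beam-search case concrete, here is a tiny sketch that treats each beam as carrying its own full KV cache. The per-sequence figure reuses the LLaMA-2 13B numbers from earlier (800 KB/token * 8192 tokens ≈ 6.55 GB), and the result is only an upper bound, since sharing common prefixes across beams (discussed below) can reduce it:

```python
# Naive upper bound: each live candidate sequence keeps its own KV cache.
def kv_cache_with_beams_gb(kv_per_sequence_gb: float, beam_width: int) -> float:
    return kv_per_sequence_gb * beam_width

# LLaMA-2 13B, 8192-token sequences (≈ 6.55 GB of KV cache each), beam width 4
print(f"{kv_cache_with_beams_gb(6.55, 4):.1f} GB")  # ≈ 26.2 GB for a single request
```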

Optimization techniques for efficient GPU memory usage

PagedAttention

Inspired by operating system memory management, PagedAttention applies virtual memory paging to the KV cache.

Illustration of how PagedAttention stores attention key and value vectors as non-contiguous blocks in memory

Benefits include:

  • Non-contiguous memory storage
  • Dynamic memory allocation
  • Reduced fragmentation
  • Improved memory utilization
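
The core idea is easy to sketch: the KV cache is carved into fixed-size blocks, and each sequence keeps a small block table mapping its logical token positions to whatever physical blocks happen to be free. The toy example below illustrates the bookkeeping only; it is not vLLM's actual implementation:

```python
# Toy sketch of PagedAttention-style block allocation (illustrative only).
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block IDs

    def allocate(self) -> int:
        return self.free_blocks.pop()  # any free block will do; no contiguity required

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full: grab a new one
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(seq.block_table)  # e.g. [1023, 1022, 1021]: three blocks for 40 tokens, allocated on demand
```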

vLLM (Efficient LLM Serving System)

Easy, fast, and cheap LLM serving for everyone

vLLM is a high-throughput LLM serving system built on top of PagedAttention.

Key features:

  • Near-zero memory waste through dynamic allocation and non-contiguous storage
  • Supports sharing of KV cache data within and across requests
  • Enables larger batches and more concurrent requests
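
In practice, using vLLM looks roughly like the snippet below, adapted from its quickstart; treat the model name and arguments as illustrative and check the current vLLM documentation for the exact API:

```python
# Minimal vLLM usage sketch (verify arguments against the vLLM docs for your version).
from vllm import LLM, SamplingParams

# The engine pre-allocates a pool of KV cache blocks managed by PagedAttention;
# gpu_memory_utilization controls how much GPU memory it is allowed to claim.
llm = LLM(model="meta-llama/Llama-2-13b-hf", gpu_memory_utilization=0.90)
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)

outputs = llm.generate(["Explain the KV cache in one sentence."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```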

Swapping and Recomputation

When GPU memory is insufficient, vLLM employs two strategies:

a) Swapping KV cache to CPU Memory:

  • Temporarily moves KV cache data to CPU memory
  • Frees up GPU memory for new requests
  • Trade-off: Increased latency due to slower CPU memory access

b) Recomputation:

  • Instead of storing all KV cache data, recomputes it on-demand
  • Reduces memory usage but increases computation
  • Trade-off: Potential latency impact due to extra computation

Comparison of swapping and recomputation:

| Aspect | Swapping | Recomputation |
|--------|----------|---------------|
| Memory usage | Frees GPU memory by offloading to CPU | Reduces memory by not storing data |
| Performance | Potential latency due to data transfer | Potential latency due to extra computation |
| Complexity | Requires managing data movement between CPU and GPU | Requires mechanisms to recompute data when needed |
| Use cases | Suitable when CPU memory is ample and data transfer is acceptable | Suitable when compute resources are available and recomputation is faster than data transfer |

Quantization: Reducing memory footprint

Quantization is a powerful technique to reduce GPU memory requirements by lowering the precision of model parameters.

Quantization in LLMs

Common quantization formats include:

  • FP32 (32-bit floating point): 4 bytes per parameter
  • FP16 (16-bit floating point): 2 bytes per parameter
  • INT8 (8-bit integer): 1 byte per parameter
  • INT4 (4-bit integer): 0.5 bytes per parameter

While quantization can significantly reduce memory usage, it’s crucial to evaluate the impact on model accuracy. Lower precision formats like INT8 or INT4 may lead to more noticeable drops in performance compared to FP16.
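
The effect on weight memory is easy to quantify. Here is a small sketch, using LLaMA-2 13B as the example and decimal gigabytes (weights only; KV cache and overhead excluded):

```python
# Weight memory at different precisions.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def quantized_weight_memory_gb(num_params: float, fmt: str) -> float:
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"LLaMA-2 13B in {fmt}: {quantized_weight_memory_gb(13e9, fmt):.1f} GB")
# FP32: 52.0 GB, FP16: 26.0 GB, INT8: 13.0 GB, INT4: 6.5 GB
```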

Quantization and dequantization

When the values are dequantized back to FP32, some precision is lost, and values that were originally distinct may no longer be distinguishable.

Estimating GPU memory requirements: A practical formula

For a quick estimation of GPU memory requirements, you can use the following formula:

M = (P * 4B) / (32/Q) * 1.2

Where:

  • M: GPU memory expressed in Gigabytes
  • P: Number of parameters in the model (in billions)
  • 4B: 4 bytes, expressing the bytes used for each parameter
  • Q: The number of bits used for loading the model (e.g., 16, 8, or 4)
  • 1.2: Represents a 20% overhead for loading additional things in GPU memory

Example calculation for a 70B parameter model using 8-bit quantization:

M = (70 * 4) / (32/8) * 1.2 = 84 GB

This formula provides a rough estimate, considering model size, quantization, and overhead. However, for precise calculations, consider all components discussed earlier in this article.
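
The formula translates directly into code. A quick sketch:

```python
def estimate_gpu_memory_gb(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    """M = (P * 4 bytes) / (32 / Q) * 1.2, with P in billions of parameters."""
    return (params_billions * 4) / (32 / quant_bits) * overhead

print(f"{estimate_gpu_memory_gb(70, 8):.1f} GB")   # 84.0 GB for a 70B model in 8-bit
print(f"{estimate_gpu_memory_gb(13, 16):.1f} GB")  # 31.2 GB for a 13B model in 16-bit
```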

Conclusion

The efficient serving of large language models is a complex challenge that demands a deep understanding of GPU memory usage patterns and advanced optimization techniques. Throughout this article, we’ve explored the critical components that contribute to GPU memory consumption when serving LLMs:

  1. Model parameters: The foundation of LLMs, directly impacting memory requirements based on model size.
  2. Key-value cache: A significant memory consumer that grows with sequence length and concurrent requests.
  3. Activations and temporary buffers: Essential for model computations, typically consuming 5-10% of total memory.
  4. Memory overheads: Including fragmentation and inefficiencies in memory allocation.

We’ve seen how these components interact and scale with model size, from smaller 7B parameter models to massive 175B parameter behemoths. The dramatic increase in memory requirements for larger models and longer sequences underscores the importance of efficient memory management strategies.

To address these challenges, we’ve discussed several cutting-edge optimization techniques:

  • PagedAttention: Applying virtual memory concepts to reduce fragmentation and improve utilization.
  • vLLM: A high-throughput serving system that minimizes memory waste and enables efficient scaling.
  • Swapping and Recomputation: Strategies to manage memory constraints by leveraging CPU memory or trading memory for computation.
  • Quantization: Reducing memory footprint by lowering parameter precision, with careful consideration of accuracy trade-offs.

As LLMs continue to grow in size and capability, mastering these techniques becomes increasingly crucial. Efficient GPU memory management is not just about cost-effectiveness; it’s about enabling the deployment of more powerful models, handling longer contexts, and serving more users concurrently.

The future of LLM serving will likely see further innovations in memory optimization, hardware-software co-design, and distributed computing strategies. Staying informed about these developments and applying them judiciously will be key to unlocking the full potential of LLMs in real-world applications.
