Researchers from Google have developed an attention technique called Infini-attention that allows Transformer-based Large Language Models (LLMs) to process infinitely long input sequences with bounded memory and computation. The approach makes only small changes to the standard attention mechanism while adding long-context processing capabilities.

Key aspects of Infini-attention

  • Compressive memory integration: Infini-attention incorporates a compressive memory into the vanilla attention mechanism. This memory stores and retrieves long-term contextual information, allowing the model to maintain an unbounded context window with bounded memory footprint.

Figure 1: Infini-attention has an additional compressive memory with linear attention for processing infinitely long contexts.

  • Dual attention mechanisms: The approach builds in both masked local attention and long-term linear attention within a single Transformer block. The local attention captures fine-grained within-segment context, while the linear attention retrieves relevant information from the compressive memory. This dual attention architecture efficiently models both short-range and long-range dependencies.
  • Attention state reuse: Infini-attention reuses the attention key, value, and query states from the standard attention computation for memory consolidation and retrieval. This state sharing not only enables efficient plug-and-play long-context adaptation but also speeds up training and inference. The approach introduces minimal additional parameters, making it highly memory-efficient.
  • Streaming inference: By processing extremely long sequences in a segment-wise streaming fashion, Infini-attention enables fast inference for LLMs. The bounded memory footprint and incremental memory updates allow for efficient computation even on infinitely long input sequences.
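
The mechanism described in the list above can be sketched for a single attention head as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names (`infini_attention_segment`, `stream_segments`), the ELU+1 feature map, the purely additive memory update, the small `eps` constant, and the single scalar `gate_logit` are choices made for this sketch based on the description in the preprint.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map ELU(x) + 1; keeps features strictly positive (assumed choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def infini_attention_segment(q, k, v, memory, norm, gate_logit, eps=1e-6):
    """Process one segment for one head.

    q, k: (n, d_key); v: (n, d_value)
    memory: (d_key, d_value) compressive memory; norm: (d_key,) normalizer.
    """
    n, d_key = q.shape

    # 1. Masked (causal) local attention within the segment.
    scores = q @ k.T / np.sqrt(d_key)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    local_out = softmax(np.where(future, -np.inf, scores)) @ v

    # 2. Linear-attention retrieval from the memory of previous segments,
    #    reusing the same query states.
    sq = elu_plus_one(q)
    mem_out = (sq @ memory) / ((sq @ norm)[:, None] + eps)

    # 3. Mix the two outputs with a gate in (0, 1).
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    out = gate * mem_out + (1.0 - gate) * local_out

    # 4. Fold this segment's keys and values into the fixed-size memory.
    sk = elu_plus_one(k)
    return out, memory + sk.T @ v, norm + sk.sum(axis=0)

def stream_segments(q, k, v, segment_len, gate_logit=0.0):
    # Segment-wise streaming: only (memory, norm) is carried between segments.
    memory = np.zeros((k.shape[1], v.shape[1]))
    norm = np.zeros(k.shape[1])
    outputs = []
    for start in range(0, q.shape[0], segment_len):
        s = slice(start, start + segment_len)
        out, memory, norm = infini_attention_segment(
            q[s], k[s], v[s], memory, norm, gate_logit)
        outputs.append(out)
    return np.concatenate(outputs), memory, norm

rng = np.random.default_rng(0)
q, k = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
v = rng.normal(size=(64, 4))
out, memory, norm = stream_segments(q, k, v, segment_len=16)
print(out.shape, memory.shape, norm.shape)
```

In this sketch the state carried across segments has the same shape no matter how many segments are processed, which is the property the bounded-memory claim rests on.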

Experimental evaluation

The researchers evaluated Infini-Transformer models on various challenging long-context tasks:

  1. Language modeling benchmarks: On the PG19 and Arxiv-math datasets, Infini-Transformer outperformed strong baselines such as Transformer-XL and Memorizing Transformers while using 114 times less memory for stored context. The model’s perplexity improved further when it was trained on sequences of 100K tokens, showing that it can make use of long-range context.
  2. 1M passkey retrieval: In this task, a passkey is hidden within an extremely long input sequence, and the model must retrieve it. A 1-billion-parameter LLM equipped with Infini-attention solved the task on sequences up to 1 million tokens long after fine-tuning on inputs of only 5K tokens. The models achieved retrieval accuracies ranging from 96% to 100%.
  3. 500K book summarization: The researchers scaled up the approach to an 8-billion-parameter Infini-Transformer LLM and evaluated it on the BookSum dataset, which involves summarizing entire books. After continual pre-training and fine-tuning, the model reached new state-of-the-art results, obtaining ROUGE-1, ROUGE-2, and ROUGE-L scores of 40.0, 8.8, and 17.9, respectively. Performance also improved as more book text was provided as input, indicating that the model makes effective use of long-range context.
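
The memory savings reported in these experiments follow from the compressive memory having a fixed size. The back-of-the-envelope sketch below compares that fixed size with a conventional key-value cache that grows with sequence length. It assumes each head stores a `d_key × d_value` matrix plus a `d_key` normalization vector; the function names and the layer, head, and dimension values are illustrative assumptions, not the configurations used in the paper.

```python
def compressive_memory_entries(n_layers, n_heads, d_key, d_value):
    # One (d_key x d_value) matrix plus one d_key vector per head, per layer.
    # Independent of how many tokens have been processed.
    return n_layers * n_heads * (d_key * d_value + d_key)

def kv_cache_entries(n_layers, n_heads, d_key, d_value, seq_len):
    # A standard cache keeps one key and one value vector per token.
    return n_layers * n_heads * seq_len * (d_key + d_value)

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=12, n_heads=8, d_key=128, d_value=128)
fixed = compressive_memory_entries(**shape)

for seq_len in (5_000, 100_000, 1_000_000):
    cache = kv_cache_entries(seq_len=seq_len, **shape)
    print(f"{seq_len:>9} tokens: cache/memory ratio = {cache / fixed:.1f}")
```

The ratio grows linearly with sequence length because only the cache depends on `seq_len`.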

Infini-attention head analysis

An analysis of the trained Infini-attention heads revealed two distinct types:

  • Specialized heads: These heads focus on either performing local attention within the current segment or retrieving information from the compressive memory. The specialization allows for efficient processing of both short-range and long-range dependencies.
  • Mixer heads: Mixer heads aggregate both the current contextual information and the retrieved long-term memory content into a single output. These heads facilitate the integration of local and global context, enabling the model to capture complex dependencies spanning multiple segments.
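
One way to picture the two head types is through a per-head gate that mixes memory retrieval with local attention, which is how the preprint describes the combination. The sketch below labels heads from their gate values. The function name `classify_heads`, the thresholds `low` and `high`, and the example logits are hypothetical and chosen only to illustrate the idea; a gate near 1 is taken to mean the head relies on memory, and a gate near 0 that it relies on local attention.

```python
import numpy as np

def classify_heads(gate_logits, low=0.1, high=0.9):
    # Gate = sigmoid(logit); thresholds are illustrative assumptions.
    gates = 1.0 / (1.0 + np.exp(-np.asarray(gate_logits, dtype=float)))
    labels = np.where(gates <= low, "local",
                      np.where(gates >= high, "memory", "mixer"))
    return gates, labels

# Made-up logits for four heads, not values from a trained model.
gates, labels = classify_heads([-5.0, 0.0, 0.3, 5.0])
for gate, label in zip(gates, labels):
    print(f"gate={gate:.3f} -> {label}")
```

Under this reading, "local" and "memory" correspond to the specialized heads and "mixer" to heads that blend both sources.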


Infini-attention represents a powerful and practical approach for efficiently scaling Transformer LLMs to extremely long context lengths. By introducing a compressive memory and dual attention mechanisms, the approach achieves bounded memory footprint and streaming computation while outperforming prior techniques on multiple long-sequence benchmarks. The plug-and-play nature of Infini-attention allows for easy adaptation of existing LLMs, making it a promising direction for future research and applications involving long-context processing.

The development of Infini-attention marks a significant step towards enabling LLMs to process and reason over virtually unlimited context lengths, opening up new possibilities for complex language understanding and generation tasks. As research in this area continues to advance, we can expect to see further improvements in the efficiency and effectiveness of long-context processing techniques, ultimately leading to more powerful and versatile language models.

Original preprint: Munkhdalai, T., Faruqui, M. and Gopal, S., 2024. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint arXiv:2404.07143.


Last Update: 14/04/2024