In a significant development for the AI community, UnslothAI has introduced a method to dramatically improve the finetuning process for Google’s Gemma 2 language models. This breakthrough promises to make working with large language models more accessible and efficient for researchers and developers alike.
For hands-on testing, check out their Colab notebook (which runs on a free Tesla T4 Google Colab instance). Alternatively, keep reading to explore the technical innovations behind this breakthrough.
Gemma 2 support is here!
Unsloth supports 50K context lengths for Gemma 2 (9B) on a 80GB GPU – 5x longer than HF+FA2.
QLoRA finetuning Gemma 2 (27B) is 1.9x faster, uses 53% less VRAM & Gemma 2 (9B) is 2x faster, 63% less VRAM + fits in a 8GB GPU!
Blog: https://t.co/cu7ktWQ5C1 pic.twitter.com/E8evWJoRAH
— Unsloth AI (@UnslothAI) July 3, 2024
Before diving into UnslothAI’s improvements, it’s worth understanding the model they’ve optimized. Let’s take a quick look at Google’s Gemma 2 and its key features.
Gemma 2: Google’s Latest Language Model
Gemma 2 represents Google’s latest advancement in language models, released in June 2024. This new iteration builds upon the success of the original Gemma models, offering significant improvements in performance and efficiency. Gemma 2 is available in two sizes:
- Gemma 2 (9B): A 9 billion parameter model (Kaggle link)
- Gemma 2 (27B): A 27 billion parameter model (Kaggle link)
Key features of Gemma 2 include:
- Enhanced performance: The 27B model offers competitive performance compared to models more than twice its size. The 9B model outperforms other open models in its size category, including Llama 3 8B.
- Efficiency: Designed for efficient inference, Gemma 2 can run at full precision on a single Google Cloud TPU host, NVIDIA A100 80GB, or H100 Tensor Core GPU. This design choice significantly reduces deployment costs.
- Speed: Optimized for fast inference across various hardware, from high-end desktops to cloud-based setups.
- Accessibility: Available under a commercially-friendly license, allowing developers and researchers to share and commercialize their innovations.
- Framework compatibility: Works with major AI frameworks like Hugging Face Transformers, JAX, PyTorch, and TensorFlow.
- Responsible AI development: Google implemented robust safety processes during training, including data filtering and comprehensive testing to mitigate potential biases and risks.
- Upcoming developments: Google plans to release a 2.6B parameter Gemma 2 model, aiming to balance lightweight accessibility with powerful performance.
The release of Gemma 2 marks a significant step in making advanced language models more accessible to researchers and developers, potentially accelerating AI innovation across various fields.
Also check out our article about the Phi-3 June update.
UnslothAI’s breakthrough

2x faster Gemma 2 finetuning + 63% less VRAM. Source: https://unsloth.ai/blog/gemma2
UnslothAI has developed a method that significantly enhances the finetuning process for Gemma 2 models:
- Speed improvement:
  - Gemma 2 (9B): 2x faster finetuning
  - Gemma 2 (27B): 1.9x faster finetuning
- VRAM reduction:
  - Gemma 2 (9B): 63.2% less VRAM usage
  - Gemma 2 (27B): 51% less VRAM usage
These improvements are substantial, potentially halving the time required for finetuning while significantly reducing the hardware requirements.
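To give a concrete sense of what this looks like in code, here is a minimal QLoRA setup sketch using Unsloth’s FastLanguageModel. The model name, LoRA rank, and target modules below are illustrative assumptions, not UnslothAI’s exact configuration; the linked Colab notebook contains the precise settings.

```python
# Minimal QLoRA setup with Unsloth (illustrative; see the official Colab for exact settings).
from unsloth import FastLanguageModel

max_seq_length = 4096  # placeholder value; Unsloth supports much longer contexts here

# Load Gemma 2 (9B) in 4-bit to keep VRAM low (model name assumed from Unsloth's hub naming).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters; "unsloth" gradient checkpointing is the offloaded variant
# discussed later in this article.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
```

Loading in 4-bit is what keeps the 9B model within a small consumer GPU, while the LoRA adapters keep the number of trainable parameters tiny compared to full finetuning.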
Longer context lengths

Gemma 2 benchmarks. Source: https://unsloth.ai/blog/gemma2
One of the most impressive aspects of UnslothAI’s approach is the ability to handle longer context lengths during finetuning:
- Gemma 2 (9B): 4-5x longer context lengths
- Gemma 2 (27B): 3x longer context lengths
This improvement allows for more comprehensive and nuanced finetuning, as models can process larger chunks of text at once.
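In practice, taking advantage of a longer context window mostly means choosing a larger max_seq_length when loading the model and passing it through to the trainer. Continuing the sketch above, a hedged example with TRL’s SFTTrainer might look like the following; the dataset path and hyperparameters are placeholders, not the article’s exact setup.

```python
# Continuing the sketch above: finetune on long sequences with TRL's SFTTrainer.
# Dataset path and hyperparameters are placeholders.
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Placeholder dataset: any dataset with a "text" column of formatted prompts.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,  # sequences are packed/truncated to this length
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,  # the free Tesla T4 does not support bf16
        output_dir="outputs",
    ),
)
trainer.train()
```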
Technical innovations
UnslothAI achieved these improvements through several technical innovations:
- Softcapping mechanism: UnslothAI implemented softcapping for both the attention scores and the final logits in the Gemma 2 models. This technique helps stabilize training and improve performance. As shown in the image below, softcapping is crucial for both the 9B and 27B models, with the 27B model being particularly sensitive to it (a minimal sketch of the operation follows the figure and discussion below).
- Gradient checkpointing: By using offloaded gradient checkpointing, UnslothAI managed to reduce VRAM usage by 30% with only a 2% slowdown in training time.
- Optimized matrix operations: UnslothAI fine-tuned the matrix multiplication operations, particularly for the LoRA (Low-Rank Adaptation) weights, leading to significant performance gains.
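The LoRA point is easiest to see in isolation: because the adapter matrices are low-rank, evaluating (x @ A) @ B keeps the intermediate tensor tiny, whereas materializing the full A @ B update is as large as the base weight itself. The snippet below is a generic PyTorch illustration of that ordering, not UnslothAI’s actual kernel code.

```python
# Generic illustration of the LoRA forward pass and why multiplication order matters.
# This is not UnslothAI's kernel code, just the underlying idea in plain PyTorch.
import torch

batch, d_in, d_out, rank = 4, 4096, 4096, 16
x = torch.randn(batch, d_in)
W = torch.randn(d_in, d_out)          # frozen base weight
A = torch.randn(d_in, rank) * 0.01    # LoRA "A" (down-projection)
B = torch.randn(rank, d_out) * 0.01   # LoRA "B" (real setups initialize this to zero)
scaling = 1.0                         # alpha / rank in real implementations

# Efficient: the intermediate (x @ A) is only (batch, rank).
y_fast = x @ W + (x @ A) @ B * scaling

# Mathematically equivalent but wasteful: A @ B materializes a full (d_in, d_out) matrix.
y_slow = x @ (W + A @ B * scaling)

# Allow for float32 rounding differences between the two accumulation orders.
assert torch.allclose(y_fast, y_slow, atol=1e-2)
```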

Softcapping investigations.
Their blog article walks through experiments with different softcapping combinations (attention softcapping, logit softcapping, both enabled, only one, or neither); it is worth reading directly if you are interested in the details.
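For intuition, softcapping squashes scores smoothly into a bounded range with a scaled tanh, i.e. cap * tanh(x / cap). The sketch below shows the general form of the operation in plain PyTorch; the cap values match Gemma 2’s published defaults but should be treated as illustrative here, and this is not UnslothAI’s fused implementation.

```python
# Illustration of softcapping as used in Gemma 2: scores are squashed smoothly
# into (-cap, +cap) with a scaled tanh. Not UnslothAI's fused implementation.
import torch

def softcap(x: torch.Tensor, cap: float) -> torch.Tensor:
    """Smoothly bound x to the range (-cap, +cap)."""
    return cap * torch.tanh(x / cap)

attn_scores = torch.randn(2, 8, 128, 128) * 100.0  # dummy attention scores
logits = torch.randn(2, 128, 1000) * 100.0          # dummy logits (real vocab is ~256k)

# Cap values taken from Gemma 2's published config: 50.0 for attention scores,
# 30.0 for the final logits (illustrative here).
attn_scores = softcap(attn_scores, 50.0)
logits = softcap(logits, 30.0)
```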
Practical implications
The improvements brought by UnslothAI have several practical implications:
- Democratization of AI: Lower VRAM requirements mean that more researchers and developers can work with these powerful models on less expensive hardware.
- Faster iteration: Quicker finetuning allows for more rapid experimentation and development cycles.
- Enhanced model capabilities: Longer context lengths enable the models to understand and generate more coherent long-form content.
The image below provides a striking visualization of the VRAM usage difference between UnslothAI’s method and the standard Hugging Face + Flash Attention 2 approach. The graph clearly shows that UnslothAI’s method (green line) consistently uses less memory across various context lengths than the standard approach (yellow line).

VRAM usage difference. Source: https://unsloth.ai/blog/gemma2
You can test the finetuning yourself on a free Tesla T4 Google Colab instance using the Gemma 2 (9B) finetuning notebook.
Conclusion
UnslothAI’s advancements in Gemma 2 finetuning represent a significant step forward in making large language models more accessible and efficient. By reducing both the time and computational resources required for finetuning, UnslothAI is helping to accelerate AI research and development. As the field of AI continues to evolve rapidly, innovations like these play a crucial role in pushing the boundaries of what’s possible with language models.
While these improvements are impressive, it’s important to note that the field of AI is constantly evolving. UnslothAI’s work represents a breakthrough today, but further advances are likely to push the efficiency and capabilities of language models even further.