The introduction of BERT (Bidirectional Encoder Representations from Transformers) in 2018 signaled a paradigm shift in Natural Language Processing (NLP).

As the first Transformer-based model to learn deeply bidirectional text representations, BERT significantly improved performance on a range of NLP tasks, including sentiment analysis, named entity recognition, question answering, and text classification. However, despite its revolutionary impact, BERT had limitations, such as a restricted context length (512 tokens), high computational resource demands, and a lack of code awareness.

ModernBERT, a recent advancement in encoder-only models, emerges as a major Pareto improvement over older encoders, offering both enhanced performance and increased efficiency.

ModernBERT improvements

ModernBERT Pareto efficiency. Source: https://huggingface.co/blog/modernbert

This article covers the key concepts and improvements of ModernBERT, compares it with previous BERT models, and provides practical examples and implementation details.

Here you can find a notebook for fine-tuning ModernBERT.

Key concepts and improvements

ModernBERT distinguishes itself from its predecessors through several architectural enhancements and training methodologies. It is specifically designed as an encoder-only model, making it leaner and more efficient for tasks that don’t require text generation.

Some of the key improvements include:

  • Extended context length: ModernBERT supports a sequence length of 8,192 tokens, a significant leap from the 512-token limit of the original BERT. This extended capacity allows it to handle much longer documents, opening doors to use cases like full-document retrieval and large-scale code analysis (see the tokenization sketch after this list).
  • Improved architecture: ModernBERT incorporates several advancements in transformer architecture:
    1. Rotary Positional Embeddings (RoPE): This replaces the older positional encoding mechanism, leading to a better understanding of token positions and enabling longer sequence lengths.
    2. GeGLU layers: ModernBERT replaces the original MLP layers with GeGLU layers, a gated variant of the GeLU activation used in the original BERT, for enhanced performance.
    3. Streamlined architecture: By removing unnecessary bias terms, ModernBERT streamlines its architecture, allowing for a more efficient allocation of the parameter budget.
    4. Additional Normalization Layer: An extra normalization layer after embeddings contributes to the stabilization of training.
    5. FlexBERT: ModernBERT introduces FlexBERT, a modular approach to encoder building blocks, further enhancing its architectural flexibility and adaptability.
  • Flash attention and unpadding: ModernBERT employs Flash Attention and unpadding techniques to accelerate both training and inference. Flash Attention, a highly efficient attention mechanism, reduces the computational overhead associated with processing long sequences. Unpadding, on the other hand, eliminates unnecessary padding tokens during the computation process, optimizing memory usage and speeding up operations. ModernBERT leverages Flash Attention 2’s speed improvements, building upon existing research to maximize efficiency.
  • Alternating attention: ModernBERT uses an alternating pattern of global and local attention, drawing inspiration from how a human reader processes a novel. Every third layer attends to the full context, while the remaining layers focus on a local window, striking a balance between efficiency and performance.
  • Hardware-aware model design: ModernBERT is designed with hardware in mind, ensuring efficient execution on a variety of GPUs and maximizing performance on commonly used hardware.
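
To get a feel for the extended context window, here is a minimal tokenization sketch (the document text is a placeholder, and 8,192 simply mirrors the limit described above):

from transformers import AutoTokenizer

# Tokenize a long document with ModernBERT's 8,192-token window
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

long_document = "A very long report... " * 2000  # placeholder standing in for a real document

encoded = tokenizer(long_document,
                    truncation=True,
                    max_length=8192,  # ModernBERT's extended context window
                    return_tensors="pt")

print(encoded["input_ids"].shape)  # up to (1, 8192); the original BERT caps out at 512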

Training data and process

ModernBERT was trained on a massive dataset comprising over 2 trillion tokens from a diverse range of sources, including:

  • Web documents: This includes a wide array of online content, such as articles from websites like Wikipedia, news articles from sources like the New York Times, blog posts from platforms like Medium, and other text-rich web resources.
  • Code: ModernBERT incorporates code from large codebases hosted on platforms like GitHub, GitLab, and Bitbucket, covering a variety of programming languages and software projects.
  • Scientific papers: The training data includes research papers from diverse scientific disciplines, sourced from platforms like arXiv, PubMed, and JSTOR, encompassing a wide range of topics and writing styles.

This diverse dataset empowers ModernBERT to excel in tasks that demand specialized knowledge, such as code retrieval, programming assistance, and technical document understanding.

Training process: ModernBERT’s training process is divided into three distinct phases:

  • Phase 1: Initial training on 1.7 trillion tokens with a sequence length of 1,024.
  • Phase 2: Long-context adaptation, training on 250 billion tokens with a sequence length of 8,192.
  • Phase 3: Annealing with 50 billion tokens, focusing on fine-tuning the model to optimize its performance on long-context tasks.

This phased approach allows the model to progressively enhance its general language understanding while ensuring its proficiency in handling long documents effectively.

Performance

ModernBERT demonstrates significant improvements in efficiency and performance compared to its predecessors. This is achieved through a combination of architectural innovations, training methodologies, and a focus on hardware optimization.

  • Architectural efficiency: ModernBERT incorporates architectural elements like Flash Attention, RoPE embeddings, and alternating attention, all contributing to faster processing and reduced memory consumption.
  • Training efficiency: The three-phase training process, with its focus on long-context adaptation and annealing, ensures that ModernBERT is trained effectively and efficiently for a wide range of tasks.
  • Hardware optimization: ModernBERT is designed to run efficiently on a variety of GPUs, maximizing performance on commonly used hardware and minimizing latency.

These combined efforts result in a model that is not only more accurate but also faster and more memory-efficient, making it suitable for inference on common GPUs.

But why do we need a newer BERT?

ModernBERT offers several advantages over previous BERT models:

  • Improved performance: ModernBERT consistently outperforms models like RoBERTa and DeBERTa across a variety of NLP tasks.
  • Increased efficiency: ModernBERT’s architectural improvements and training methodologies result in faster processing and reduced memory consumption. It is the most speed- and memory-efficient encoder among its peers.
  • Longer context length: The extended context length allows ModernBERT to handle longer documents and excel in long-context tasks.
  • Code awareness: ModernBERT’s training data includes a substantial amount of code, enabling it to perform well on code-related tasks.
ModernBERT performance

ModernBERT is the only model that is a top scorer in every category, which makes it the one model you can use for all your encoder-based tasks.

ModernBERT efficiency

Memory (max batch size, BS) and inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 for ModernBERT and other encoder models.

Implementing ModernBERT

ModernBERT can be readily implemented using popular NLP libraries like Hugging Face Transformers.

Here’s a basic example of how to use ModernBERT for a fill-mask task:


# installation from GitHub, as ModernBERT is not yet included in the current stable Transformers release
pip install git+https://github.com/huggingface/transformers.git

And then the Python code:


import torch
from transformers import pipeline
from pprint import pprint

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

This code snippet demonstrates how to use the fill-mask pipeline with the answerdotai/ModernBERT-base model to predict the masked word in a sentence.
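
If you prefer to work below the pipeline abstraction, the same prediction can be sketched with the lower-level AutoTokenizer / AutoModelForMaskedLM classes (a minimal sketch; taking the single highest-scoring token is just one way to read out the prediction):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("He walked to the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring candidate token
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))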

For more complex tasks like classification or retrieval, you can fine-tune ModernBERT using standard BERT fine-tuning recipes. The Hugging Face Model Hub provides pre-trained ModernBERT models and various resources to help you get started.

ModernBERT fine-tuning

ModernBERT’s architecture makes it particularly well-suited for fine-tuning on specific tasks. Let’s explore how to fine-tune ModernBERT using a practical example of sentiment analysis on Bulgarian text data.

The fine-tuning process adapts the pre-trained model to a specific downstream task while preserving its learned language understanding. This is especially effective because ModernBERT’s pre-training on diverse datasets provides a strong foundation for various NLP tasks.

Let’s walk through a complete example using a Bulgarian sentiment analysis dataset containing 7,923 rows of text with sentiment labels. You can find the implementation in this Colab notebook.

The implementation starts with preparing the data using a custom Dataset class that handles tokenization and creates appropriate tensor formats:

import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        # Tokenize all texts up front and keep the results as PyTorch tensors
        self.encodings = tokenizer(data['text'], truncation=True, padding=True,
                                   max_length=max_length, return_tensors='pt')
        self.labels = torch.tensor(data['label'])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one sample as a dict that the model accepts directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item
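
With the dataset class in place, the usual PyTorch DataLoader wiring might look like this (the train_data and val_data splits and the batch size are illustrative assumptions; tokenizer is the ModernBERT tokenizer loaded in the next snippet):

from torch.utils.data import DataLoader

# Illustrative: train_data / val_data are dicts with 'text' and 'label' lists
train_dataset = TextDataset(train_data, tokenizer)
val_dataset = TextDataset(val_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)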

We then set up the model using the base ModernBERT architecture with a classification head and configure the training process with carefully chosen hyperparameters. The training loop includes both training and validation phases, tracking key metrics throughout the process:


import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)  # example hyperparameter; tune as needed
num_epochs = 3                                  # example value; tune as needed

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move tensors to the GPU/CPU
        optimizer.zero_grad()
        outputs = model(**batch)  # loss is computed because 'labels' is in the batch
        loss = outputs.loss
        loss.backward()
        optimizer.step()
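
The validation phase mentioned above is not shown in the snippet; a minimal sketch, using the val_loader built earlier and accuracy as an illustrative metric, could sit inside the same epoch loop:

    # Validation phase (continues the epoch loop above, after the inner training loop)
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            preds = model(**batch).logits.argmax(dim=-1)
            correct += (preds == batch['labels']).sum().item()
            total += batch['labels'].size(0)
    print(f"Epoch {epoch + 1}: validation accuracy = {correct / total:.3f}")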

After training, we evaluate the model on a test set and save both the model and tokenizer for future use:


# Persist the fine-tuned model and tokenizer for later reuse
model.save_pretrained('./modernbert_bulgarian_sa')
tokenizer.save_pretrained('./modernbert_bulgarian_sa')
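
Once saved, the fine-tuned model can be reloaded later for inference; a short sketch (the example sentence and the 0/1 label mapping are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reload the fine-tuned model and tokenizer from disk
tokenizer = AutoTokenizer.from_pretrained('./modernbert_bulgarian_sa')
model = AutoModelForSequenceClassification.from_pretrained('./modernbert_bulgarian_sa')

inputs = tokenizer("Много добър продукт, препоръчвам го.", return_tensors='pt')  # "Very good product, I recommend it."
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1).item()
print(pred)  # 0 or 1, depending on the label mapping used during fine-tuning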

This example demonstrates ModernBERT’s capability to adapt to specific languages and tasks while maintaining efficient training characteristics. You can modify this template for other text classification tasks or different languages with minimal changes to the core architecture.

For hands-on experimentation with this implementation, visit the complete Colab notebook, which provides an interactive environment for running and modifying the fine-tuning process.

This example can be enhanced with several modern optimization approaches that have shown promising results in recent research. Some key improvements could include implementing curriculum learning by gradually decreasing the MLM masking probability during training, which helps the model build understanding progressively. The ADOPT optimizer could replace traditional optimizers for better convergence characteristics, especially in large-scale language model training. Additionally, integrating FlashAttention 2 can significantly accelerate training on compatible GPUs by optimizing attention computation patterns.
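
As one illustration of the curriculum idea, a masking-probability schedule can be sketched with Hugging Face's DataCollatorForLanguageModeling, recreating the collator with a lower mlm_probability each epoch. Note that this applies to MLM-style continued pre-training rather than the classification loop above, and the schedule values and the raw_train_dataset name are arbitrary examples:

from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

# Example schedule: start with aggressive masking, relax it over training
masking_schedule = [0.30, 0.25, 0.20, 0.15]

for epoch, mlm_probability in enumerate(masking_schedule):
    # Recreate the collator so each epoch masks tokens with a different probability
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                    mlm=True,
                                                    mlm_probability=mlm_probability)
    train_loader = DataLoader(raw_train_dataset, batch_size=16,
                              shuffle=True, collate_fn=data_collator)
    # ... run one epoch of MLM training with train_loader here ...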

The training process could also benefit from dynamic batching with a custom DataCollator to handle varying sequence lengths more efficiently, and gradient accumulation to simulate larger batch sizes without increasing memory requirements.
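
Both ideas are straightforward to sketch: DataCollatorWithPadding pads each batch only to its longest sequence, and gradient accumulation spreads an effective batch over several forward passes. This assumes a dataset variant (unpadded_train_dataset, a hypothetical name) that tokenizes without up-front padding, unlike the TextDataset above, and the accumulation_steps value is an arbitrary example:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Dynamic padding: each batch is padded only to its own longest sequence
collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_loader = DataLoader(unpadded_train_dataset, batch_size=8,
                          shuffle=True, collate_fn=collator)

accumulation_steps = 4  # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss / accumulation_steps  # scale so accumulated gradients average out
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()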

These techniques, combined with careful hyperparameter tuning and regular evaluation steps, can lead to more robust and efficient fine-tuning of ModernBERT for specific domains and tasks.

Let me know if you’d like to explore a more advanced notebook that implements these optimization techniques; I’d be happy to walk through a more involved implementation with curriculum learning, custom data collation, and advanced training monitoring.

Practical examples of ModernBERT

ModernBERT’s enhanced capabilities make it suitable for a wide range of NLP applications.

Some practical examples include:

  • Retrieval Augmented Generation (RAG): ModernBERT’s extended context length and efficient processing make it ideal for RAG pipelines, where it can effectively retrieve and process relevant information from large knowledge bases to augment the generation capabilities of large language models.
  • Semantic search: ModernBERT can power semantic search engines, enabling more accurate and relevant results by understanding the meaning and context of both queries and documents (see the embedding sketch after this list).
  • Code retrieval: ModernBERT excels in code retrieval tasks, achieving strong scores on the StackOverflow-QA (SQA) dataset. This capability can be used to develop AI-powered IDEs and enterprise-wide code indexing solutions.
  • Classification: ModernBERT can be fine-tuned for various classification tasks, such as sentiment analysis, topic classification, and spam detection, with improved performance compared to previous BERT models.
  • Question answering: ModernBERT can be used in question-answering systems, providing accurate and comprehensive answers by effectively understanding the context and meaning of questions and relevant documents.
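
As a small illustration of the semantic search use case, sentence embeddings can be built from ModernBERT's hidden states with simple mean pooling (a minimal sketch using the base model; a production system would typically fine-tune a dedicated embedding model on top of it):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = ["ModernBERT extends the context window to 8,192 tokens.",
        "The weather in Sofia is sunny today."]
scores = F.cosine_similarity(embed(["How long is ModernBERT's context?"]), embed(docs))
print(scores)  # higher similarity for the more relevant document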

Conclusion

ModernBERT represents a step forward in the evolution of encoder-only models. By incorporating modern architectural improvements, efficient training methodologies, and a diverse training dataset, ModernBERT not only addresses the limitations of previous models but also offers enhanced performance and capabilities. Its extended context length, improved efficiency, and code awareness make it a versatile tool for various NLP applications, including semantic search, classification, code retrieval, and RAG pipelines.

The impact of ModernBERT extends beyond just improved benchmarks. Its design for real-world performance with variable length inputs makes it a practical and valuable tool for various industries and applications. Whether it’s enhancing search engine accuracy, powering code retrieval systems, or improving the efficiency of NLP pipelines, ModernBERT has the potential to significantly impact how we interact with and utilize language data.

Last Update: 26/12/2024