In October 2024, Rhymes AI introduced ARIA, the first open-source multimodal native Mixture-of-Experts (MoE) model. This article provides a detailed examination of ARIA’s architecture, training process, performance, and potential applications.
Links:
- Demo: https://rhymes.ai/
- Blog article: https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model
- Paper: https://arxiv.org/pdf/2410.05993
- GitHub: https://github.com/rhymes-ai/Aria
What is ARIA?
ARIA is a multimodal-native language model designed to process and understand multiple input modalities, including text, images, video, and code. It uses a mixture-of-experts architecture to handle these diverse data types efficiently within a single model.
Key features of ARIA include:
- Multimodal native performance: Strong results across various tasks involving text, images, video, and code.
- Lightweight design: 3.9 billion activated parameters per token out of a total of 25.3 billion parameters.
- Long context window: Supports up to 64,000 tokens of multimodal input.
- Open-source availability: Released under the Apache 2.0 license for both academic and commercial use.
Architecture
ARIA’s architecture consists of two main components: a vision encoder and a mixture-of-experts decoder.
Vision encoder
The vision encoder in ARIA is designed to process visual inputs efficiently:
- Converts images and video frames into visual tokens
- Operates in three resolution modes:
  - Medium: 128 visual tokens for images with the longer edge resized to 490 pixels
  - High: 256 visual tokens for images with the longer edge resized to 980 pixels
  - Ultra-high: dynamically decomposes the image into multiple high-resolution sub-images
The encoder uses a Vision Transformer (ViT) initialized with weights from the SigLIP-SO400M model, followed by a projection module comprising a cross-attention layer and a feed-forward network (FFN) layer.
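To make the resolution modes concrete, here is a minimal sketch of the preprocessing step they imply: resize an image so its longer edge matches the selected mode and look up the corresponding visual-token budget. It is an illustration only, not Rhymes AI's preprocessing code; the aspect-ratio-preserving resize and the interpolation filter are assumptions.
from PIL import Image

# Mode -> (longer-edge pixels, visual tokens), using the figures quoted above.
RESOLUTION_MODES = {"medium": (490, 128), "high": (980, 256)}

def resize_for_mode(image: Image.Image, mode: str = "medium"):
    """Resize so the longer edge matches the mode's target size; return image and token budget."""
    target_edge, num_tokens = RESOLUTION_MODES[mode]
    width, height = image.size
    scale = target_edge / max(width, height)
    resized = image.resize((round(width * scale), round(height * scale)), Image.BICUBIC)
    return resized, num_tokens

# Ultra-high resolution instead tiles the image into several high-resolution sub-images,
# each encoded with its own 256-token budget.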
Mixture-of-Experts decoder
The MoE decoder is the core of ARIA’s multimodal capabilities:
- 66 experts in each MoE layer
- 2 experts shared among all inputs to capture common knowledge
- 6 additional experts activated for each token by a router module
- Total of 24.9 billion parameters in the MoE decoder, with 3.5 billion activated per text token
This architecture allows ARIA to efficiently allocate computational resources based on the input’s complexity and modality.
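The routing pattern described above (2 always-on shared experts plus a top-6 selection over the remaining 64) can be sketched in a few lines of PyTorch. This is a minimal, forward-only illustration with assumed placeholder dimensions, not ARIA's actual implementation; the softmax-over-top-k weighting is also an assumption.
import torch
import torch.nn as nn

class MoELayerSketch(nn.Module):
    """Illustrative MoE layer: 2 shared experts plus top-6 of 64 routed experts."""
    def __init__(self, d_model=1024, d_ff=4096, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                                      # x: (num_tokens, d_model)
        shared_out = sum(expert(x) for expert in self.shared)  # shared experts see every token
        scores = self.router(x)                                # (num_tokens, n_routed)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        routed_out = torch.zeros_like(x)
        for t in range(x.shape[0]):                            # per-token loop for clarity, not speed
            for w, i in zip(weights[t], indices[t]):
                routed_out[t] = routed_out[t] + w * self.routed[int(i)](x[t])
        return shared_out + routed_out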
Training process
ARIA’s training follows a 4-stage pipeline designed to progressively enhance the model’s capabilities:
Language pre-training
- Corpus: 6.4 trillion tokens of curated language data
- Objective: Next-token prediction
- Context window: 8,000 tokens
Multimodal pre-training
- Data: Mixture of 1 trillion high-quality language tokens and 400 billion multimodal tokens
- Multimodal data categories:
  - Interleaved image-text web data (190 billion tokens)
  - Synthetic image captions (70 billion tokens)
  - Document transcriptions and QA (102 billion tokens)
  - Video captions and QA (35 billion tokens)
Multimodal long-context pre-training
- Extended context window to 64,000 tokens
- Data: 12 billion language tokens and 21 billion multimodal tokens (69% long sequences)
- Increased the RoPE base frequency from 100,000 to 5,000,000 (illustrated in the sketch below)
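Raising the RoPE base stretches the wavelength of every rotary frequency band except the fastest, which spreads relative positions in a 64,000-token window over much gentler phase changes. The snippet below illustrates this with the standard RoPE wavelength formula; it is generic rotary-embedding math, not ARIA's training code, and the head dimension of 128 is a placeholder assumption.
import numpy as np

def rope_wavelengths(base: float, head_dim: int = 128) -> np.ndarray:
    """Wavelength (in tokens) of each rotary band: 2 * pi * base ** (2i / head_dim)."""
    return 2 * np.pi * base ** (np.arange(0, head_dim, 2) / head_dim)

for base in (100_000, 5_000_000):
    wl = rope_wavelengths(base)
    # Fastest, middle, and slowest bands; the larger base lengthens all but the first.
    print(f"base={base:>9,}: wavelengths ≈ {wl[0]:.1f}, {wl[32]:,.0f}, {wl[-1]:,.0f} tokens")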
Multimodal post-training
- Focus: Improving question-answering and instruction-following capabilities
- Data: 20 billion tokens from high-quality open-source and human-annotated datasets
Performance benchmarks
ARIA’s performance has been evaluated across a wide range of tasks and compared to both open-source and proprietary models. Here’s a detailed breakdown of its performance:
Multimodal tasks
Category | Benchmark | ARIA | Pixtral 12B | Llama3.2 11B | GPT-4o mini | Gemini-1.5 Flash |
---|---|---|---|---|---|---|
Knowledge (Multimodal) | MMMU | 54.9 | 52.5 | 50.7 | 59.4 | 56.1 |
Math (Multimodal) | MathVista | 66.1 | 58.0 | 51.5 | – | 58.4 |
Document | DocVQA | 92.6 | 90.7 | 84.4 | – | 89.9 |
Chart | ChartQA | 86.4 | 81.8 | 83.4 | – | 85.4 |
Scene Text | TextVQA | 81.1 | – | – | – | 78.7 |
General Visual QA | MMBench-1.1 | 80.3 | – | – | 76.0 | – |
Video Understanding | LongVideoBench | 65.3 | 47.4 | 45.7 | 58.8 | 62.4 |
Language and coding tasks
Category | Benchmark | ARIA | Pixtral 12B | Llama3.2 11B | GPT-4o mini | Gemini-1.5 Flash |
---|---|---|---|---|---|---|
Knowledge (Language) | MMLU (5-shot) | 73.3 | 69.2 | 69.4 | – | 78.9 |
Math (Language) | MATH | 50.8 | 48.1 | 51.9 | 70.2 | – |
Reasoning (Language) | ARC Challenge | 91.0 | – | 83.4 | 96.4 | – |
Coding | HumanEval | 73.2 | 72.0 | 72.6 | 87.2 | 74.3 |
Long-context multimodal understanding
Model | LongVideoBench (test) | VideoMME (w subs) | MMLongBench-Doc (acc) |
---|---|---|---|
ARIA | 65.3 | 72.1 | 28.3 |
Qwen2-VL-7B | 56.8 | 69.0 | 21.3 |
Idefics2 | 49.4 | – | 7.0 |
MiniCPM-V-2.6 | 55.7 | 63.7 | 11.5 |
Llama3.2-11B | 45.7 | 49.5 | 13.8 |
Pixtral-12B | 47.4 | 47.5 | 6.4 |
Gemini-1.5-Flash | 62.6 | 75.0 | 27.0 |
GPT-4o mini | 58.8 | 68.9 | 29.0 |
Instruction following capabilities
Benchmark | ARIA | Phi-3 Vision | Qwen2-VL-7B | Pixtral-12B | GPT-4o |
---|---|---|---|---|---|
MIA-Bench (Multimodal) | 8.76 | 7.60 | 8.07 | 8.43 | 8.86 |
MT-Bench (Language) | 8.53 | 6.27 | 6.41 | 7.68 | – |
These benchmarks demonstrate ARIA’s competitive performance across a wide range of tasks, often surpassing other open-source models and approaching the performance of proprietary models in several areas.
Expert modality specialization
An analysis of ARIA’s expert activation behavior reveals interesting patterns of specialization:
- Visual-specialized experts emerge in most layers
- Some layers (4, 5, 14, 15, 16, 17, 20) have a single visual expert specialized across all three visual domains (natural image, video, PDF-format image)
- The ratio of visual token activation (Rv) to text token activation (Rt) for each expert indicates the level of visual specialization
This specialization demonstrates ARIA’s ability to efficiently allocate its resources for processing different types of input modalities, despite its modality-generic architecture.
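A small sketch makes the Rv/Rt metric concrete: given the router's per-token expert choices, count how often each expert fires on visual tokens versus text tokens and take the ratio. This is an illustration of the idea, assuming you have logged routing decisions; it is not code from the paper.
import numpy as np

def visual_specialization_ratio(expert_ids: np.ndarray, is_visual: np.ndarray, n_experts: int = 64):
    """
    expert_ids: (num_tokens, top_k) indices of the routed experts chosen for each token
    is_visual:  (num_tokens,) True for visual tokens, False for text tokens
    Returns Rv / Rt per expert: activation rate on visual tokens over activation rate on text tokens.
    """
    rates = {}
    for modality in (True, False):
        ids = expert_ids[is_visual == modality]
        counts = np.bincount(ids.ravel(), minlength=n_experts)
        rates[modality] = counts / max(len(ids), 1)       # activations per token of this modality
    return rates[True] / np.maximum(rates[False], 1e-9)   # values >> 1 indicate visual-specialized experts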
Practical applications
ARIA’s capabilities extend to various practical applications. Here are some examples:
Weather forecast information extraction
When presented with a 14-day forecast image, ARIA can:
- Extract temperature and humidity data for specific dates
- Generate Python code to visualize the data
Example output:
import matplotlib.pyplot as plt
# Data for even dates
days = ['20 Sep', '22 Sep', '24 Sep', '26 Sep', '28 Sep', '30 Sep']
high_temps = [33, 33, 32, 29, 27, 27]
low_temps = [26, 26, 26, 24, 26, 26]
humidity = [62, 63, 70, 76, 69, 66]
# Line chart for temperatures
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.plot(days, high_temps, 'r-', label='Highest Temperature', marker='o')
plt.plot(days, low_temps, 'g-', label='Lowest Temperature', marker='o')
plt.xlabel('Days')
plt.ylabel('Temperature (°C)')
plt.title('Temperature Variation')
plt.legend()
plt.grid(True)
# Bar chart for humidity
plt.subplot(1, 2, 2)
plt.bar(days, humidity, color='blue', alpha=0.7)
plt.xlabel('Days')
plt.ylabel('Humidity (%)')
plt.title('Humidity')
plt.ylim(0, 100)
plt.grid(axis='y')
plt.tight_layout()
plt.show()
The code snippet generated by ARIA runs as-is and produces the corresponding temperature and humidity charts.
Financial report analysis
When presented with a financial statement image, ARIA can:
- Calculate profit margins for different years
- Analyze factors contributing to changes in profit margins

Income statement document
Aria response:
To determine how the company’s profit margin changed from 2020 to 2021, we need to follow these steps:
- Calculate the profit margins for both years:
  - 2020:
    - Net income: $57,411 million
    - Total net sales: $274,515 million
    - Profit margin = (Net income / Total net sales) * 100 = (57,411 / 274,515) * 100 ≈ 20.91%
  - 2021:
    - Net income: $94,680 million
    - Total net sales: $365,817 million
    - Profit margin = (Net income / Total net sales) * 100 = (94,680 / 365,817) * 100 ≈ 25.88%
- Analyze the factors causing the change:
  - Increase in Net Income: The net income increased from $57,411 million in 2020 to $94,680 million in 2021, showing a significant rise of $37,269 million.
  - Sales Growth: Total net sales also increased from $274,515 million in 2020 to $365,817 million in 2021, indicating strong growth in revenue.
  - Cost Management: The cost of sales increased from $151,286 million in 2020 to $192,266 million in 2021, but the percentage increase in sales outpaced the increase in costs, leading to a higher profit margin.
  - Operating Expenses: Operating expenses grew from $38,668 million in 2020 to $43,887 million in 2021, but the rise in sales was more substantial, positively impacting the profit margin.
Therefore, the company’s profit margin improved from approximately 20.91% in 2020 to approximately 25.88% in 2021. This improvement is primarily due to a substantial increase in net income and effective cost management relative to the growth in total net sales.
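The margin arithmetic in the answer above is easy to double-check with a couple of lines of Python (figures in millions of dollars, as quoted in the response):
for year, net_income, net_sales in [(2020, 57_411, 274_515), (2021, 94_680, 365_817)]:
    print(f"{year}: profit margin ≈ {net_income / net_sales * 100:.2f}%")  # 20.91% and 25.88%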
Long video understanding
Example:
When processing a video about Michelangelo’s David, ARIA can:
- Split the video into detailed scenes
- Provide start and end times for each scene
- Generate titles and descriptions for each segment
Prompt:
Please split this video into scenes, providing start time, end time, a title and detailed descriptions for each scene.
Example output: ARIA returns a scene-by-scene breakdown of the video, giving each scene a start time, end time, title, and detailed description.
Development and usage
ARIA is designed to be developer-friendly, offering extensive support and flexibility for downstream applications. The provided codebase includes:
- Fast and easy inference with Hugging Face Transformers or vLLM
- Cookbooks and best practices for using ARIA
- Fine-tuning on a variety of dataset formats, using as little as a single GPU
Example usage:
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
model_id_or_path = "rhymes-ai/Aria"
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
image = Image.open(requests.get(image_path, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)
print(result)
Future directions and potential impact
The release of ARIA as an open-source model presents numerous opportunities for advancement in the field of multimodal AI:
- Research and development: Researchers can build upon ARIA’s architecture to develop more advanced multimodal models.
- Industry applications: Businesses can leverage ARIA for a wide range of applications, from content analysis to automated customer service.
- Education and training: ARIA can serve as a valuable tool for teaching AI concepts and multimodal processing techniques.
- Accessibility: The open-source nature of ARIA democratizes access to advanced AI capabilities, potentially leading to innovative applications in various domains.
- Benchmarking: ARIA sets a new standard for open-source multimodal models, encouraging healthy competition and rapid advancement in the field.
Conclusion
ARIA represents a significant milestone in the development of open-source multimodal AI models. Its ability to process and understand multiple input modalities within a single, efficient architecture opens up new possibilities for AI applications across various domains. As researchers and developers continue to build upon and refine ARIA’s capabilities, we can expect to see further advancements in multimodal AI in the near future.