Mistral AI and NVIDIA have released Mistral NeMo 12B, a state-of-the-art language model that promises to redefine the landscape of enterprise AI applications.
This article provides an in-depth exploration of its features, capabilities, and potential impact on the AI industry.
Important links:
- Mistral NeMo release blog;
- NVIDIA blog article;
- Hugging Face Mistral model repo;
- Ollama implementation (v0.2.8).
Key contributions
Mistral NeMo 12B stands out in the AI landscape for several reasons:
- Open-source model with a 128k context window: Large context window open-source models have been rare, with popular models like the Llama 3 series limited to an 8k context window. NeMo’s 128k context is crucial for RAG workloads that require processing large documents. This feature, combined with its smaller size, should offer fast input and output processing speeds.
- Successor to Mistral 7B: As a follow-up to the popular Mistral 7B model, NeMo is designed as a drop-in replacement, making it easy to use in existing systems. This compatibility, along with its potential for fine-tuning, makes it an attractive option for various use cases.
- Competitive pricing for long context processing: Priced at $0.3 per million input and output tokens on Mistral AI’s La Plateforme, NeMo offers a cost-effective solution compared to alternatives. For context, GPT-4 is priced at $5 per million input tokens, while Mixtral 8x22B (with a 65k context window) costs $1.2 per million input tokens. This pricing aligns NeMo closely with models like Gemini 1.5 Flash ($0.35) and Claude 3 Haiku ($0.25).
These features position Mistral NeMo 12B as a powerful and accessible tool for a wide range of AI applications, from natural language processing to complex document analysis and generation tasks.
Model overview
Mistral NeMo 12B is a pretrained generative text model jointly developed by Mistral AI and NVIDIA. It boasts 12 billion parameters and significantly outperforms existing models of similar or smaller size.
Key features
- Released under the Apache 2.0 license
- Available in base (pre-trained) and instruction-tuned versions
- 128K context window
- Trained on extensive multilingual and code data
- Drop-in replacement for Mistral 7B
Model architecture
Mistral NeMo is built on a transformer architecture with the following specifications (a rough parameter-count check based on these numbers follows the list):
- Layers: 40
- Dimension: 5,120
- Head dimension: 128
- Hidden dimension: 14,336
- Activation function: SwiGLU
- Number of heads: 32
- Number of kv-heads: 8 (Grouped Query Attention)
- Vocabulary size: 2^17 ≈ 128K
- Rotary embeddings (theta = 1M)
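As a sanity check, these specifications are consistent with the "12B" label. The sketch below is a back-of-the-envelope parameter count derived purely from the numbers above; it assumes untied input/output embeddings and ignores normalization layers, so treat it as an approximation rather than an official figure:

```python
# Rough parameter count from the listed architecture specs.
# Assumptions (not official): untied embeddings, norm layers ignored.
n_layers, dim, head_dim = 40, 5120, 128
n_heads, n_kv_heads = 32, 8
hidden_dim, vocab_size = 14336, 2**17

# Attention: Wq projects to n_heads*head_dim, but Wk/Wv only to
# n_kv_heads*head_dim (grouped-query attention: 32 query heads share
# 8 key/value heads), plus the output projection Wo.
attn = (dim * n_heads * head_dim            # Wq
        + 2 * dim * n_kv_heads * head_dim   # Wk, Wv
        + n_heads * head_dim * dim)         # Wo

# SwiGLU feed-forward uses three projections:
# gate and up (dim -> hidden), down (hidden -> dim).
mlp = 3 * dim * hidden_dim

embeddings = 2 * vocab_size * dim  # input embedding + output head

total = n_layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # prints ~12.2B
```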
Performance benchmarks
Main benchmarks
| Benchmark | Score |
|---|---|
| HellaSwag (0-shot) | 83.5% |
| Winogrande (0-shot) | 76.8% |
| OpenBookQA (0-shot) | 60.6% |
| CommonSenseQA (0-shot) | 70.4% |
| TruthfulQA (0-shot) | 50.3% |
| MMLU (5-shot) | 68.0% |
| TriviaQA (5-shot) | 73.8% |
| NaturalQuestions (5-shot) | 31.2% |
Multilingual benchmarks (MMLU)
| Language | Score |
|---|---|
| French | 62.3% |
| German | 62.7% |
| Spanish | 64.6% |
| Italian | 61.3% |
| Portuguese | 63.3% |
| Russian | 59.2% |
| Chinese | 59.0% |
| Japanese | 59.0% |
These benchmarks demonstrate Mistral NeMo’s strong performance across a range of tasks and languages, positioning it as a versatile tool for diverse AI applications.
Tekken: advanced tokenization
Mistral NeMo introduces a new tokenizer called Tekken, which offers significant improvements in text compression efficiency (a quick way to test this on your own text is sketched after the list):
- Trained on over 100 languages
- ~30% more efficient at compressing source code and many natural languages than the SentencePiece tokenizer used by previous Mistral models
- 2x more efficient for Korean and 3x for Arabic
- Outperforms the Llama 3 tokenizer in compressing text for about 85% of all languages
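If you want to verify these compression claims on your own text, a minimal sketch is below. It assumes both tokenizers can be downloaded from the Hugging Face Hub and loaded with a recent transformers install (see the installation note in the next section); fewer tokens for the same text means better compression, and cheaper, faster inference:

```python
from transformers import AutoTokenizer

# Compare how many tokens each tokenizer needs for the same text.
text = "Your sample text here, ideally in the language you care about."

for model_id in ["mistralai/Mistral-Nemo-Base-2407",  # Tekken
                 "mistralai/Mistral-7B-v0.1"]:         # SentencePiece-based
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: {len(tok.encode(text))} tokens")
```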
Usage and implementation
Mistral NeMo can be used with three different frameworks:
- mistral_inference
- transformers
- NeMo
Installation with mistral_inference
pip install mistral_inference
Downloading the model
from huggingface_hub import snapshot_download
from pathlib import Path

# Create a local directory for the model files
mistral_models_path = Path.home().joinpath('mistral_models', 'Nemo-v0.1')
mistral_models_path.mkdir(parents=True, exist_ok=True)

# Fetch only the weights, config, and Tekken tokenizer from the Hugging Face Hub
snapshot_download(
    repo_id="mistralai/Mistral-Nemo-Base-2407",
    allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
    local_dir=mistral_models_path,
)
Demo usage
After installation, you can use the mistral-demo CLI command:
mistral-demo $HOME/mistral_models/Nemo-v0.1
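Beyond the demo, you can also generate text programmatically with mistral_inference. The sketch below follows the usage published on the Mistral NeMo model cards for the instruction-tuned variant (Mistral-Nemo-Instruct-2407, downloaded the same way as the base model above, into its own folder); treat it as illustrative, since the API may change between mistral_inference releases:

```python
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# mistral_models_path is the folder populated by snapshot_download above
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path)

# Build a chat-style prompt and encode it with the Tekken tokenizer
request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain RAG in one sentence.")])
tokens = tokenizer.encode_chat_completion(request).tokens

# Note the low temperature, in line with the recommendation further below
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.3,
                         eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
print(tokenizer.decode(out_tokens[0]))
```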
Using with Hugging Face Transformers
Note: As of the current release, you need to install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
Example usage:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize a prompt, generate up to 20 new tokens, and decode the result
inputs = tokenizer("Hello my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Important: Unlike previous Mistral models, Mistral NeMo works best with lower sampling temperatures; a temperature of 0.3 is recommended for optimal performance.
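Applied to the transformers example above, that means enabling sampling and passing the temperature explicitly (with greedy decoding, the default, the temperature has no effect):

```python
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.3)
```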
Enterprise-grade features
Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering several benefits for enterprise deployment:
- Performance-optimized inference with NVIDIA TensorRT-LLM engines
- Containerized format for easy deployment
- Enterprise-grade software as part of NVIDIA AI Enterprise
- Dedicated feature branches and rigorous validation processes
- Enhanced security and support
- Comprehensive support with direct access to NVIDIA AI experts
- Defined service-level agreements
Hardware compatibility and efficiency
Mistral NeMo is designed to fit in the memory of a single:
- NVIDIA L40S GPU
- NVIDIA GeForce RTX 4090 GPU
- NVIDIA RTX 4500 GPU
This compatibility ensures high efficiency, low compute cost, and enhanced security and privacy for enterprise deployments.
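A rough back-of-the-envelope calculation explains why. Mistral NeMo was trained with quantization awareness, enabling FP8 inference without performance loss; at one byte per weight the model fits comfortably on a 24 GB card, while FP16 is already tight. The sketch below covers weights only and ignores the KV cache, which grows with context length:

```python
params = 12.2e9  # approximate parameter count
for precision, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    gib = params * bytes_per_param / 2**30
    print(f"{precision}: ~{gib:.0f} GiB of weights")
# FP16: ~23 GiB -> tight on a 24 GB RTX 4090 once activations/KV cache are added
# FP8:  ~11 GiB -> leaves room for a long-context KV cache on the same card
```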
Pricing and availability
Mistral is offering NeMo at $0.3 per 1 million input and output tokens on La Plateforme. This competitive pricing positions NeMo favorably against other models with large context windows (a quick per-request cost estimate follows the list):
- GPT-4 (32k context): $5/1M input tokens (over 15x more expensive than NeMo)
- Mixtral 8x22B (65k context): $1.2/1M input tokens (4x more expensive than NeMo)
- Gemini 1.5 Flash: $0.35/1M tokens
- Claude 3 Haiku: $0.25/1M tokens
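To make the pricing concrete, here is a quick cost estimate for a typical long-context RAG request: a 100k-token input with a 1k-token answer, using the prices quoted above. Note the comparison is indicative only, since the other providers price output tokens separately and the figures here apply their input price to all tokens:

```python
prices_per_million = {"Mistral NeMo": 0.30, "Mixtral 8x22B": 1.20, "GPT-4": 5.00}
input_tokens, output_tokens = 100_000, 1_000

for model, usd in prices_per_million.items():
    cost = (input_tokens + output_tokens) / 1e6 * usd
    print(f"{model}: ${cost:.2f} per request")
# Mistral NeMo: $0.03, Mixtral 8x22B: $0.12, GPT-4: $0.51
```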
The model is currently available as an NVIDIA NIM via ai.nvidia.com, with a downloadable NIM coming soon.
Conclusion
Mistral NeMo 12B represents a significant step forward for enterprise AI. Its combination of a large context window, strong benchmark performance, and flexible deployment options makes it a compelling choice for businesses looking to leverage cutting-edge AI technology. As the AI landscape continues to evolve, Mistral NeMo 12B stands out as a versatile solution for a wide range of applications, from natural language processing to code generation and beyond.