Mistral AI and NVIDIA have released Mistral NeMo 12B, a state-of-the-art language model that promises to redefine the landscape of enterprise AI applications.

This article provides an in-depth exploration of its features, capabilities, and potential impact on the AI industry.

Key contributions

Mistral NeMo 12B stands out in the AI landscape for several reasons:

  1. Open-source model with a 128k context window: Open-source models with large context windows have been rare; popular models like the Llama 3 series are limited to an 8k context window. NeMo’s 128k context is crucial for RAG workloads that process large documents, and combined with the model’s relatively small size it should offer fast input and output processing.
  2. Successor to Mistral 7B: As a follow-up to the popular Mistral 7B model, NeMo is designed as a drop-in replacement, making it easy to use in existing systems. This compatibility, along with its potential for fine-tuning, makes it an attractive option for various use cases.
  3. Competitive pricing for long context processing: Priced at $0.3 per million input and output tokens on Mistral AI’s La Plateforme, NeMo offers a cost-effective solution compared to alternatives. For context, GPT-4 is priced at $5 per million input tokens, while Mixtral 8x22B (with a 65k context window) costs $1.2 per million input tokens. This pricing aligns NeMo closely with models like Gemini 1.5 Flash ($0.35) and Claude 3 Haiku ($0.25).

These features position Mistral NeMo 12B as a powerful and accessible tool for a wide range of AI applications, from natural language processing to complex document analysis and generation tasks.

Model overview

Mistral NeMo 12B is a pretrained generative text model jointly developed by Mistral AI and NVIDIA. It boasts 12 billion parameters and significantly outperforms existing models of similar or smaller size.

Key features

  • Released under the Apache 2.0 license
  • Available in base (pre-trained) and instruction-tuned versions
  • 128K context window
  • Trained on extensive multilingual and code data
  • Drop-in replacement for Mistral 7B

Figure: Mistral NeMo base model performance compared to Gemma 2 9B and Llama 3 8B. Source: https://mistral.ai/news/mistral-nemo/

Model architecture

Mistral NeMo is built on a transformer architecture with the following specifications (see the config sketch after this list):

  • Layers: 40
  • Dimension: 5,120
  • Head dimension: 128
  • Hidden dimension: 14,336
  • Activation Function: SwiGLU
  • Number of heads: 32
  • Number of kv-heads: 8 (Grouped Query Attention)
  • Vocabulary size: 2^17 ≈ 128K
  • Rotary embeddings (theta = 1M)
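
To make these specs concrete, here is how they map onto Hugging Face's MistralConfig. This is a minimal illustrative sketch, not the official config file; the head_dim field requires a recent transformers version (added alongside NeMo support), and the field names are assumptions based on HF's Mistral implementation:

from transformers import MistralConfig

# Illustrative reconstruction of the published specs -- not the official config.
# head_dim must be set explicitly: 5120 / 32 heads would otherwise imply 160, not 128.
config = MistralConfig(
    num_hidden_layers=40,
    hidden_size=5120,
    head_dim=128,
    intermediate_size=14336,       # hidden (feed-forward) dimension
    hidden_act="silu",             # SwiGLU = SiLU-gated linear unit in HF's implementation
    num_attention_heads=32,
    num_key_value_heads=8,         # grouped-query attention
    vocab_size=131072,             # 2^17
    rope_theta=1_000_000.0,        # rotary embeddings, theta = 1M
    max_position_embeddings=128_000,  # 128k context window (assumed mapping)
)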

Performance benchmarks

Main benchmarks

Benchmark                   Score
HellaSwag (0-shot)          83.5%
Winogrande (0-shot)         76.8%
OpenBookQA (0-shot)         60.6%
CommonSenseQA (0-shot)      70.4%
TruthfulQA (0-shot)         50.3%
MMLU (5-shot)               68.0%
TriviaQA (5-shot)           73.8%
NaturalQuestions (5-shot)   31.2%

Multilingual benchmarks (MMLU)

Language     Score
French       62.3%
German       62.7%
Spanish      64.6%
Italian      61.3%
Portuguese   63.3%
Russian      59.2%
Chinese      59.0%
Japanese     59.0%

These benchmarks show strong, consistent performance across tasks and languages, especially for a 12-billion-parameter model, making Mistral NeMo well suited to diverse AI applications.

Tekken: advanced tokenization

Mistral NeMo introduces a new tokenizer called Tekken, which offers significant improvements in text compression efficiency (a quick comparison sketch follows the list):

  • Trained on over 100 languages
  • ~30% more efficient than SentencePiece at compressing source code and many major languages, including Chinese, French, German, Spanish, and Russian
  • 2x more efficient for Korean and 3x for Arabic
  • Outperforms the Llama 3 tokenizer in compressing text for about 85% of all languages
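
You can observe the compression difference directly by counting tokens. This is a minimal sketch: both repos are gated on Hugging Face (accept the terms and log in first), and Mistral 7B's tokenizer stands in here as the SentencePiece baseline:

from transformers import AutoTokenizer

# Tekken (Mistral NeMo) vs. the SentencePiece-based Mistral 7B tokenizer
tekken = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
sentencepiece = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

sample = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"

# Fewer tokens for the same text means better compression
for name, tok in [("Tekken", tekken), ("SentencePiece (Mistral 7B)", sentencepiece)]:
    print(f"{name}: {len(tok.encode(sample))} tokens")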

Usage and implementation

Mistral NeMo can be used with three different frameworks:

  1. mistral_inference
  2. transformers
  3. NeMo

Installation with mistral_inference


pip install mistral_inference

Downloading the model


from huggingface_hub import snapshot_download
from pathlib import Path

# Local directory for the downloaded weights
mistral_models_path = Path.home().joinpath('mistral_models', 'Nemo-v0.1')
mistral_models_path.mkdir(parents=True, exist_ok=True)

# Fetch only the files mistral_inference needs: config, weights, and the Tekken tokenizer
snapshot_download(repo_id="mistralai/Mistral-Nemo-Base-2407",
                  allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
                  local_dir=mistral_models_path)

Demo usage

After installation, you can use the mistral-demo CLI command:


mistral-demo $HOME/mistral_models/Nemo-v0.1
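
Beyond the demo CLI, you can run completions programmatically. The sketch below follows the mistral_inference and mistral_common APIs as documented on the model card at the time of writing; exact module paths and signatures may differ between releases:

from pathlib import Path

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

model_path = Path.home() / 'mistral_models' / 'Nemo-v0.1'  # downloaded above

# Load the Tekken tokenizer and the model weights
tokenizer = MistralTokenizer.from_file(str(model_path / 'tekken.json'))
model = Transformer.from_folder(str(model_path))

# Base model: encode a raw prompt directly, no chat template
raw = tokenizer.instruct_tokenizer.tokenizer
tokens = raw.encode("The capital of France is", bos=True, eos=False)

out_tokens, _ = generate([tokens], model, max_tokens=32, temperature=0.3,
                         eos_id=raw.eos_id)
print(tokenizer.decode(out_tokens[0]))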

Using with Hugging Face Transformers

Note: As of the current release, you need to install transformers from source:


pip install git+https://github.com/huggingface/transformers.git

Example usage:


from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id)
inputs = tokenizer("Hello my name is", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Important: Unlike previous Mistral models, Mistral NeMo performs best with lower sampling temperatures; a temperature of 0.3 is recommended.
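
Applied to the Transformers example above, the recommendation looks like this. Note that do_sample=True is required: with greedy decoding (the default), the temperature setting has no effect.

# Sampling must be enabled for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))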

Enterprise-grade features

Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering several benefits for enterprise deployment:

  • Performance-optimized inference with NVIDIA TensorRT-LLM engines
  • Containerized format for easy deployment
  • Enterprise-grade software as part of NVIDIA AI Enterprise
  • Dedicated feature branches and rigorous validation processes
  • Enhanced security and support
  • Comprehensive support with direct access to NVIDIA AI experts
  • Defined service-level agreements

Hardware compatibility and efficiency

Mistral NeMo is designed to fit in the memory of a single:

  • NVIDIA L40S GPU
  • NVIDIA GeForce RTX 4090 GPU
  • NVIDIA RTX 4500 GPU

This compatibility ensures high efficiency, low compute cost, and enhanced security and privacy for enterprise deployments.
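
A back-of-the-envelope estimate shows why a single 24 GB card is feasible. This counts weights only; the KV cache for long contexts comes on top, and Mistral AI notes the model was trained with quantization awareness, enabling FP8 inference:

# Rough weight-memory estimate for 12B parameters (excludes KV cache and activations)
params = 12e9
print(f"bf16/fp16: ~{params * 2 / 1e9:.0f} GB")  # ~24 GB -- the full capacity of an RTX 4090
print(f"fp8/int8:  ~{params * 1 / 1e9:.0f} GB")  # ~12 GB -- leaves room for a long-context KV cache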

Pricing and availability

Mistral is offering NeMo at $0.3 per 1 million input & output tokens on La Plateforme. This competitive pricing positions NeMo favorably against other models with large context windows:

  • GPT-4 (32k context): $5/1M input tokens (over 15x more expensive than NeMo)
  • Mixtral 8x22B (65k context): $1.2/1M input tokens (4x more expensive than NeMo)
  • Gemini 1.5 Flash: $0.35/1M tokens
  • Claude 3 Haiku: $0.25/1M tokens
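
To make the rates concrete, here is the input-side cost of processing a single long document at the prices quoted above (an illustrative calculation with a hypothetical document size, not an official quote):

# Illustrative input cost for one 100k-token document (USD per 1M input tokens)
prices_per_1m = {"Mistral NeMo": 0.30, "Mixtral 8x22B": 1.20, "GPT-4": 5.00}
doc_tokens = 100_000  # fits comfortably in NeMo's 128k window
for name, price in prices_per_1m.items():
    print(f"{name}: ${price * doc_tokens / 1e6:.2f} per document")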

The model is currently available as an NVIDIA NIM via ai.nvidia.com, with a downloadable NIM coming soon.


Conclusion

Mistral NeMo 12B represents a significant leap forward in the field of enterprise AI. Its combination of advanced features, impressive performance across various benchmarks, and flexible deployment options make it a powerful tool for businesses looking to leverage cutting-edge AI technology. As the AI landscape continues to evolve, Mistral NeMo 12B stands out as a versatile and powerful solution for a wide range of applications, from natural language processing to code generation and beyond.


Last Update: 23/07/2024