In the rapidly evolving world of artificial intelligence, Microsoft has made a significant leap forward with the release of Phi-3.5-mini, a lightweight yet remarkably capable language model. This article provides an in-depth look at Phi-3.5-mini, exploring its architecture, capabilities, benchmarks, and potential applications. We’ll dive into the technical details that make this model stand out in the crowded field of AI language models.
For those in a hurry: Yes, Phi-3.5-mini is readily available through Ollama. You can access it at: https://ollama.com/library/phi3.5
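If you already have Ollama installed, a quick way to call the model from Python is through Ollama's local REST API. This is a minimal sketch: it assumes Ollama is serving on its default port 11434 and that you have pulled the model first with `ollama pull phi3.5`.

```python
import requests

# Send a single chat turn to a locally running Ollama instance.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "phi3.5",
        "messages": [
            {"role": "user", "content": "Explain what a decoder-only Transformer is in two sentences."}
        ],
        "stream": False,  # return one complete JSON response instead of a token stream
    },
)
print(response.json()["message"]["content"])
```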
Model overview
Phi-3.5-mini is a state-of-the-art open model with the following key characteristics:
- Parameters: 3.8 billion
- Architecture: Dense decoder-only Transformer
- Context length: 128K tokens
- Training data: 3.4T tokens
- Training time: 10 days on 512 H100-80G GPUs
- Release date: August 2024
What sets Phi-3.5-mini apart is its ability to outperform similarly sized and even larger models, making it an attractive option for developers working with constrained resources or latency-sensitive applications.
Model architecture and training
Phi-3.5-mini belongs to the Phi-3 model family and builds upon datasets used for Phi-3. The training data includes:
- Filtered publicly available documents
- High-quality educational data
- Code repositories
- Synthetic “textbook-like” data covering math, coding, common sense reasoning, and general knowledge
- High-quality chat format supervised data
The model underwent a rigorous enhancement process, incorporating:
- Supervised fine-tuning
- Proximal policy optimization
- Direct preference optimization
This comprehensive approach ensures precise instruction adherence and robust safety measures.
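Of these three steps, direct preference optimization is the easiest to illustrate: it trains the model to widen the gap between a preferred and a rejected response, relative to a frozen reference model. The sketch below shows only the core loss term and is a simplified illustration; the `beta` value and the exact setup are generic textbook choices, not Microsoft's actual training configuration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss.

    Each argument is a batch of summed log-probabilities of a response
    under either the trainable policy or the frozen reference model.
    """
    # How strongly each model prefers the chosen over the rejected response.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Reward the policy for preferring the chosen response more than the reference does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```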
Multilingual capabilities
Phi-3.5-mini supports an impressive array of languages:
Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, and Ukrainian.
Let’s look at its performance on multilingual benchmarks:
| Benchmark | Phi-3.5 Mini-Ins | GPT-4o-mini-2024-07-18 (Chat) |
| --- | --- | --- |
| Multilingual MMLU | 55.4 | 72.9 |
| Multilingual MMLU-Pro | 30.9 | 53.2 |
| MGSM | 47.9 | 81.7 |
| MEGA MLQA | 61.7 | 70.0 |
| MEGA TyDi QA | 62.2 | 81.8 |
| MEGA UDPOS | 46.5 | 66.0 |
| MEGA XCOPA | 63.1 | 90.3 |
| MEGA XStoryCloze | 73.5 | 96.6 |
| **Average** | **55.2** | **76.6** |
While Phi-3.5-mini doesn't match GPT-4o-mini on these multilingual benchmarks, its results are impressive for a 3.8 billion parameter model.
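If you want a quick feel for these multilingual abilities, one sanity check is to send the same question in several languages and compare the answers. This sketch uses the one-line transformers pipeline setup (the more explicit setup appears later in this article); the prompts are illustrative.

```python
from transformers import pipeline

# One-line pipeline setup; downloads the model on first run.
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompts = {
    "English": "Explain photosynthesis in one sentence.",
    "German": "Erkläre die Photosynthese in einem Satz.",
    "Japanese": "光合成を一文で説明してください。",
}
for language, prompt in prompts.items():
    out = pipe([{"role": "user", "content": prompt}],
               max_new_tokens=80, do_sample=False, return_full_text=False)
    print(f"{language}: {out[0]['generated_text']}\n")
```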
Long context understanding
One of Phi-3.5-mini’s standout features is its ability to handle long contexts up to 128K tokens. This makes it suitable for tasks such as:
- Long document summarization
- Meeting summarization
- Long document Q&A
- Long document information retrieval
Let’s examine its performance on long context benchmarks:
| Benchmark | Phi-3.5-mini-instruct | GPT-4o-mini-2024-07-18 (Chat) |
| --- | --- | --- |
| GovReport | 25.9 | 24.8 |
| QMSum | 21.3 | 21.7 |
| Qasper | 41.9 | 39.8 |
| SQuALITY | 25.3 | 23.8 |
| SummScreenFD | 16.0 | 17.0 |
| **Average** | **26.1** | **25.4** |
Remarkably, Phi-3.5-mini edges out GPT-4o-mini on several of these long context benchmarks, demonstrating its efficiency in handling extended sequences of text.
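As a concrete illustration, here is a minimal long-document summarization sketch using the Hugging Face transformers pipeline. The file name report.txt is a placeholder; any plain-text document that fits within the 128K-token window will do.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3.5-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Load the document to summarize; replace with your own long text file.
with open("report.txt", encoding="utf-8") as f:
    document = f.read()

messages = [
    {"role": "system", "content": "You summarize long documents faithfully."},
    {"role": "user", "content": f"Summarize this document in five bullet points:\n\n{document}"},
]
summary = pipe(messages, max_new_tokens=300, do_sample=False, return_full_text=False)
print(summary[0]["generated_text"])
```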
Reasoning and problem-solving capabilities
Phi-3.5-mini exhibits strong reasoning abilities across various domains. Here’s a snapshot of its performance on key benchmarks:
| Category | Benchmark | Phi-3.5 Mini-Ins | GPT-4o-mini-2024-07-18 (Chat) |
| --- | --- | --- | --- |
| Reasoning | ARC Challenge (10-shot) | 84.6 | 93.5 |
| | BoolQ (2-shot) | 78.0 | 88.7 |
| | GPQA (0-shot, CoT) | 30.4 | 41.1 |
| | HellaSwag (5-shot) | 69.4 | 87.1 |
| | OpenBookQA (10-shot) | 79.2 | 90.0 |
| | PIQA (5-shot) | 81.0 | 88.7 |
| | Social IQA (5-shot) | 74.7 | 82.9 |
| | TruthfulQA (MC2) (10-shot) | 64.0 | 78.2 |
| | WinoGrande (5-shot) | 68.5 | 76.9 |
| Math | GSM8K (8-shot, CoT) | 86.2 | 91.3 |
| | MATH (0-shot, CoT) | 48.5 | 70.2 |
| Code Generation | HumanEval (0-shot) | 62.8 | 86.6 |
| | MBPP (3-shot) | 69.6 | 84.1 |
These results show that Phi-3.5-mini performs admirably across a wide range of tasks, often coming close to larger models such as GPT-4o-mini.
Using Phi-3.5-mini
To run Phi-3.5-mini with Hugging Face transformers, install the following pinned packages:
- flash_attn==2.5.8
- torch==2.3.1
- accelerate==0.31.0
- transformers==4.43.0
Here’s a code snippet to get you started with Phi-3.5-mini:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed for reproducibility.
torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# Chat history in the standard role/content format; the pipeline applies
# the model's chat template automatically.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
```
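A note on the generation settings: `do_sample=False` selects greedy decoding, so the output is deterministic and `temperature=0.0` is effectively redundant; this is a sensible default for factual Q&A and math. For more varied, creative output, set `do_sample=True` and raise the temperature.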
Responsible AI considerations
While Phi-3.5-mini represents a significant advancement in AI language models, it’s crucial to consider its limitations and potential risks:
- Quality of service: Performance may vary across languages, with non-English languages experiencing worse performance.
- Multilingual performance and safety gaps: Developers should test for performance or safety gaps in their specific linguistic and cultural contexts.
- Representation of harms & perpetuation of stereotypes: The model may over- or under-represent certain groups or reinforce stereotypes.
- Inappropriate or offensive content: Additional mitigations may be necessary for sensitive contexts.
- Information reliability: The model can generate nonsensical or fabricated content.
- Limited scope for code: While proficient in Python, the model’s capabilities with other languages or less common packages may be limited.
- Long conversation: The model may generate repetitive or inconsistent responses in very long chat sessions.
Developers should apply responsible AI best practices, including risk assessment and mitigation strategies appropriate for their specific use cases.
Conclusion
Microsoft’s Phi-3.5-mini represents a significant step forward in the development of compact, efficient language models. Its ability to match or even outperform larger models in certain tasks, combined with its long context understanding and multilingual capabilities, makes it a versatile tool for a wide range of applications.
As AI continues to evolve, models like Phi-3.5-mini demonstrate that significant improvements in performance don’t always require exponential increases in model size. This trend towards more efficient, targeted models could lead to broader adoption of AI technologies across various industries and use cases.
However, as with any AI technology, it’s crucial to approach its use responsibly, considering both its capabilities and limitations. By doing so, developers can harness the power of Phi-3.5-mini to create innovative, efficient, and ethically sound AI applications.