In the rapidly evolving world of artificial intelligence, Microsoft has made a significant leap forward with the release of Phi-3.5-mini, a lightweight yet remarkably capable language model. This article provides an in-depth look at Phi-3.5-mini, exploring its architecture, capabilities, benchmarks, and potential applications. We’ll dive into the technical details that make this model stand out in the crowded field of AI language models.

For those in a hurry: Yes, Phi-3.5-mini is readily available through Ollama. You can access it at: https://ollama.com/library/phi3.5
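If you already have Ollama running locally, here is a minimal sketch using the official ollama Python client (pip install ollama). The prompt is only an illustration, and the model must have been pulled beforehand with the tag from the page above:

import ollama

# Assumes the Ollama server is running locally and the model was
# pulled beforehand (ollama pull phi3.5).
response = ollama.chat(
    model="phi3.5",
    messages=[{"role": "user", "content": "In one sentence, what is Phi-3.5-mini?"}],
)
print(response["message"]["content"])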

Figure: Phi-3.5 availability on the Ollama library page

Model overview

Phi-3.5-mini is a state-of-the-art open model with the following key characteristics:

  • Parameters: 3.8 billion
  • Architecture: Dense decoder-only Transformer
  • Context length: 128K tokens
  • Training data: 3.4T tokens
  • Training time: 10 days on 512 H100-80G GPUs
  • Release date: August 2024

What sets Phi-3.5-mini apart is its ability to outperform similarly sized and even larger models, making it an attractive option for developers working with constrained resources or latency-sensitive applications.

Model architecture and training

Phi-3.5-mini belongs to the Phi-3 model family and builds on the datasets used to train the earlier Phi-3 models. The training data includes:

  1. Filtered publicly available documents
  2. High-quality educational data
  3. Code repositories
  4. Synthetic “textbook-like” data covering math, coding, common sense reasoning, and general knowledge
  5. High-quality chat format supervised data

The model underwent a rigorous enhancement process, incorporating:

  • Supervised fine-tuning
  • Proximal policy optimization
  • Direct preference optimization

This post-training pipeline is designed to sharpen instruction adherence and strengthen the model's safety behavior.
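To make the last of these steps concrete, here is a minimal PyTorch sketch of the direct preference optimization (DPO) objective. This is the textbook formulation, not Microsoft's actual training code; the beta value and the log-probability inputs are assumptions for illustration.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred and the rejected response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()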

Figure: The Phi-3 model family and its variants

Multilingual capabilities

Phi-3.5-mini supports an impressive array of languages:

Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, and Ukrainian.

Let’s look at its performance on multilingual benchmarks:

Benchmark               Phi-3.5-mini-instruct   GPT-4o-mini-2024-07-18 (Chat)
Multilingual MMLU       55.4                    72.9
Multilingual MMLU-Pro   30.9                    53.2
MGSM                    47.9                    81.7
MEGA MLQA               61.7                    70.0
MEGA TyDi QA            62.2                    81.8
MEGA UDPOS              46.5                    66.0
MEGA XCOPA              63.1                    90.3
MEGA XStoryCloze        73.5                    96.6
Average                 55.2                    76.6

While Phi-3.5-mini doesn't match GPT-4o mini on these multilingual benchmarks, it delivers impressive results considering its compact size.

Long context understanding

One of Phi-3.5-mini’s standout features is its ability to handle long contexts up to 128K tokens. This makes it suitable for tasks such as:

  • Long document summarization
  • Meeting summarization
  • Long document Q&A
  • Long document information retrieval

Let’s examine its performance on long context benchmarks:

Benchmark       Phi-3.5-mini-instruct   GPT-4o-mini-2024-07-18 (Chat)
GovReport       25.9                    24.8
QMSum           21.3                    21.7
Qasper          41.9                    39.8
SQuALITY        25.3                    23.8
SummScreenFD    16.0                    17.0
Average         26.1                    25.4

Remarkably, Phi-3.5-mini outperforms GPT-4o mini on several of these long context tasks, demonstrating its efficiency in handling extended sequences of text.

Reasoning and problem-solving capabilities

Phi-3.5-mini exhibits strong reasoning abilities across various domains. Here’s a snapshot of its performance on key benchmarks:

Category         Benchmark                    Phi-3.5-mini-instruct   GPT-4o-mini-2024-07-18 (Chat)
Reasoning        ARC Challenge (10-shot)      84.6                    93.5
Reasoning        BoolQ (2-shot)               78.0                    88.7
Reasoning        GPQA (0-shot, CoT)           30.4                    41.1
Reasoning        HellaSwag (5-shot)           69.4                    87.1
Reasoning        OpenBookQA (10-shot)         79.2                    90.0
Reasoning        PIQA (5-shot)                81.0                    88.7
Reasoning        Social IQA (5-shot)          74.7                    82.9
Reasoning        TruthfulQA (MC2) (10-shot)   64.0                    78.2
Reasoning        WinoGrande (5-shot)          68.5                    76.9
Math             GSM8K (8-shot, CoT)          86.2                    91.3
Math             MATH (0-shot, CoT)           48.5                    70.2
Code Generation  HumanEval (0-shot)           62.8                    86.6
Code Generation  MBPP (3-shot)                69.6                    84.1

These results show that Phi-3.5-mini performs admirably across a wide range of tasks, often coming close to the larger GPT-4o mini.
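To make the "CoT" (chain-of-thought) notation in the table concrete, the sketch below prompts the model to reason step by step before answering. This is an illustration, not the evaluation harness behind the numbers above; the arithmetic question is invented, and the full setup with pinned requirements follows in the next section.

from transformers import pipeline

# Minimal chain-of-thought style prompt; device_map="auto" requires accelerate.
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{
    "role": "user",
    "content": "A train covers 60 km in 45 minutes. What is its average speed "
               "in km/h? Think step by step before giving the final answer.",
}]

output = pipe(messages, max_new_tokens=256, do_sample=False, return_full_text=False)
print(output[0]["generated_text"])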

Using Phi-3.5-mini

To run Phi-3.5-mini with Hugging Face Transformers, you'll need the following pinned package versions:

  • flash_attn==2.5.8
  • torch==2.3.1
  • accelerate==0.31.0
  • transformers==4.43.0

You can install them in one step with pip install flash_attn==2.5.8 torch==2.3.1 accelerate==0.31.0 transformers==4.43.0.

Here’s a code snippet to get you started with Phi-3.5-mini:


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the seed for reproducible generations.
torch.random.manual_seed(0)

# Load the instruct variant onto the GPU, letting Transformers
# pick an appropriate dtype.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# A short multi-turn conversation; the final user turn is what the
# model will answer.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding (do_sample=False) for deterministic output.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
Note: If you want to use flash attention, call AutoModelForCausalLM.from_pretrained() with attn_implementation="flash_attention_2".
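Building on the snippet above, here is a sketch of how the 128K-token context window can be put to work for long-document summarization. The file name report.txt is a placeholder, and you should verify that your document actually fits within the context limit:

# Reuses the `pipe` object created above. `report.txt` stands in for
# any long document (up to roughly 128K tokens).
with open("report.txt", encoding="utf-8") as f:
    long_document = f.read()

summary_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Summarize the key points of the following document:\n\n" + long_document},
]

summary = pipe(summary_messages, max_new_tokens=500, return_full_text=False, do_sample=False)
print(summary[0]["generated_text"])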

Responsible AI considerations

While Phi-3.5-mini represents a significant advancement in AI language models, it’s crucial to consider its limitations and potential risks:

  1. Quality of service: Performance may vary across languages, with non-English languages experiencing worse performance.
  2. Multilingual performance and safety gaps: Developers should test for performance or safety gaps in their specific linguistic and cultural contexts.
  3. Representation of harms & perpetuation of stereotypes: The model may over- or under-represent certain groups or reinforce stereotypes.
  4. Inappropriate or offensive content: Additional mitigations may be necessary for sensitive contexts.
  5. Information reliability: The model can generate nonsensical or fabricated content.
  6. Limited scope for code: While proficient in Python, the model’s capabilities with other languages or less common packages may be limited.
  7. Long conversation: The model may generate repetitive or inconsistent responses in very long chat sessions.

Developers should apply responsible AI best practices, including risk assessment and mitigation strategies appropriate for their specific use cases.

Conclusion

Microsoft’s Phi-3.5-mini represents a significant step forward in the development of compact, efficient language models. Its ability to match or even outperform larger models in certain tasks, combined with its long context understanding and multilingual capabilities, makes it a versatile tool for a wide range of applications.

As AI continues to evolve, models like Phi-3.5-mini demonstrate that significant improvements in performance don’t always require exponential increases in model size. This trend towards more efficient, targeted models could lead to broader adoption of AI technologies across various industries and use cases.

However, as with any AI technology, it’s crucial to approach its use responsibly, considering both its capabilities and limitations. By doing so, developers can harness the power of Phi-3.5-mini to create innovative, efficient, and ethically sound AI applications.

Last Update: 21/08/2024