On July 2, 2024, Microsoft released a significant update to its Phi-3 mini language models, covering both the 4K and 128K context versions.

This “June Update” improves code understanding, structured output, instruction following, multi-turn conversation, reasoning, and long-context handling, making the models more capable across a wide range of applications.

[Figure: Phi-3 mini June Update benchmarks]

Links to the updated models:

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

Release notes:

This is an update over the original instruction-tuned Phi-3-mini release based on valuable customer feedback. The model used additional post-training data leading to substantial gains on long-context understanding, instruction following, and structure output. We also improve multi-turn conversation quality, explicitly support <|system|> tag, and significantly improve reasoning capability. We believe most use cases will benefit from this release, but we encourage users to test in their particular AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family, and continue to welcome all feedback from the community.

Enhanced code understanding

The update significantly improves the models’ ability to comprehend and work with code. On RepoQA, a benchmark for long-context code understanding, scores rose sharply in all five languages reported:

  • Python: 85% (up from 27%)
  • C++: 63% (up from 29%)
  • Rust: 72% (up from 40%)
  • Java: 93% (up from 33%)
  • TypeScript: 72% (up from 33%)

[Figure: Phi-3 mini results on RepoQA, a benchmark for long-context code understanding]

These improvements make Phi-3 mini models much more effective for tasks involving code analysis, generation, and understanding. This could be particularly beneficial for developers and in automated coding assistance scenarios.
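To put the per-language figures in perspective, the short sketch below computes the gain in percentage points for each language and the average gain. It uses only the before and after scores quoted in the list above; the variable and function names are illustrative, not part of any Microsoft tooling.

```python
# Scores (percent) quoted in the list above, as (before, after) pairs.
scores = {
    "Python": (27, 85),
    "C++": (29, 63),
    "Rust": (40, 72),
    "Java": (33, 93),
    "TypeScript": (33, 72),
}


def gains(before_after):
    """Return the absolute gain in percentage points for each language."""
    return {lang: after - before for lang, (before, after) in before_after.items()}


point_gains = gains(scores)
average_gain = sum(point_gains.values()) / len(point_gains)

# Print languages from largest to smallest gain.
for lang, gain in sorted(point_gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: +{gain} points")
print(f"Average gain: +{average_gain:.1f} points")
```

Java and Python show the largest jumps (60 and 58 points), while Rust shows the smallest (32 points).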

Improved structured output

One of the most significant enhancements is in the models’ ability to generate structured data formats:

  • JSON output accuracy increased dramatically from 1.9% to 60.1%
  • XML output accuracy improved from 47.8% to 52.9%

This improvement is crucial for applications that require the model to produce machine-readable outputs, such as in data processing pipelines or API integrations.
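Even with the improved accuracy, a pipeline that consumes model output should still validate it before use. The following is a minimal sketch of such a check using only Python's standard `json` module. The helper names `parse_json_reply` and `valid_json_rate` are hypothetical, and this is not the method Microsoft used to measure the accuracy figures above.

```python
import json

# Built programmatically so the marker for a Markdown code fence is explicit.
FENCE = "`" * 3


def parse_json_reply(reply):
    """Try to parse a model reply as JSON; return (ok, value_or_error_message)."""
    text = reply.strip()
    # Models sometimes wrap JSON in a Markdown code fence; strip it if present.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, str(exc)


def valid_json_rate(replies):
    """Fraction of replies that parse as JSON (0.0 for an empty list)."""
    if not replies:
        return 0.0
    return sum(1 for reply in replies if parse_json_reply(reply)[0]) / len(replies)
```

A caller can retry the request or fall back to a default when `parse_json_reply` reports a failure, rather than passing malformed text downstream.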

Enhanced instruction following and multi-turn conversations

The update brings notable improvements in the models’ ability to follow complex instructions and engage in multi-turn conversations. This enhancement makes the models more responsive and accurate in scenarios that require ongoing dialogue or intricate task completion.

Support for <|system|> tag

A new feature in this update is explicit support for the <|system|> tag. This allows better control over the model’s behavior: developers can supply a system prompt whose instructions shape all of the model’s subsequent responses.
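As an illustration, the sketch below renders a conversation, including a system message, into a single prompt string. It assumes the chat format shown on the model cards, where each turn is a role tag followed by the content and an <|end|> marker; the function name `build_phi3_prompt` is hypothetical. In real use, the chat template bundled with the model's tokenizer (for example via `apply_chat_template` in Hugging Face Transformers) is the safer choice.

```python
def build_phi3_prompt(messages):
    """Render a list of {"role", "content"} dicts into a Phi-3 style prompt.

    Assumed format per turn: <|role|>\\ncontent<|end|>\\n
    """
    allowed = {"system", "user", "assistant"}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in allowed:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # Finish with the assistant tag so the model generates the next reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You answer in one short sentence."},
    {"role": "user", "content": "What is a context window?"},
    {"role": "assistant", "content": "The amount of text a model can read at once."},
    {"role": "user", "content": "How large is it for Phi-3 mini?"},
]
prompt = build_phi3_prompt(conversation)
print(prompt)
```

The same function covers multi-turn conversations: earlier user and assistant turns are simply appended in order before the final assistant tag.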

Superior reasoning capabilities

The update showcases significant improvements in the models’ reasoning abilities across various benchmarks:

  • GPQA (Graduate-Level Google-Proof Q&A) score increased from 25.9% to 29.7%
  • MMLU (Massive Multitask Language Understanding) score improved from 68.1% to 69.7%

These improvements indicate enhanced capabilities in tasks requiring complex reasoning and broad knowledge application.

Long-context understanding

For the 128K context model, there are remarkable improvements in handling long documents:

  • Performance on 128K context tasks improved from 43.3% to 65.6%
  • The model now shows consistently high performance across context lengths from 4K to 128K

[Figure: Phi-3 mini results on RULER, a retrieval-based benchmark for long-context understanding]

This enhancement is particularly valuable for applications involving the analysis or generation of long-form content, such as document summarization or extended Q&A sessions.
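Retrieval-style long-context tests can be approximated locally with a simple "needle in a haystack" check: hide one fact inside a long filler text and ask the model to recall it. The sketch below builds such a prompt and scores an answer. It is a simplified illustration rather than the RULER benchmark itself, and all names and example strings are hypothetical.

```python
def build_needle_prompt(filler_sentence, needle, question, total_sentences, depth):
    """Place `needle` at a relative `depth` (0.0 to 1.0) inside repeated filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler_sentence] * total_sentences
    # depth 0.0 puts the needle first, depth 1.0 puts it last.
    position = int(depth * total_sentences)
    sentences.insert(position, needle)
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def is_retrieved(model_answer, expected):
    """Case-insensitive check that the expected fact appears in the answer."""
    return expected.lower() in model_answer.lower()


example_prompt = build_needle_prompt(
    filler_sentence="The sky is blue.",
    needle="The secret code is 7421.",
    question="What is the secret code?",
    total_sentences=10,
    depth=0.5,
)
```

Raising `total_sentences` until the prompt approaches the model's context limit, and varying `depth`, gives a rough picture of how recall changes with length and position.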

Benchmarks and comparisons

The updated Phi-3 mini models show competitive performance against larger models in several areas:

  • On the AGIEval benchmark (5-shot), Phi-3 mini achieves 39.5%, behind Gemma-7B (42.1%) but ahead of Mistral-7B (35.1%)
  • On the MMLU benchmark (5-shot), Phi-3 mini scores 69.7%, outperforming both Gemma-7B (63.6%) and Mistral-7B (61.7%)

These results demonstrate that the compact Phi-3 mini models can compete with larger models in certain tasks, offering efficient performance with a smaller computational footprint.

Practical applications

The improvements across various domains open up new possibilities for practical applications:

  • Enhanced code understanding makes the models more suitable for coding assistance and code review tools;
  • Improved structured output capabilities benefit data processing and API integration scenarios;
  • Better long-context understanding enhances document analysis and summarization tasks;
  • Superior reasoning abilities make the models more effective for complex problem-solving applications.


Conclusion

Microsoft’s June Update to the Phi-3 mini models represents a significant step forward in the capabilities of compact language models. By improving performance across code understanding, structured output, reasoning, and long-context tasks, Microsoft has made these models more versatile and powerful.

The company encourages users to test the updated models in their specific applications, as it believes most use cases will benefit from these enhancements. Microsoft continues to welcome feedback from the community as it works on further improvements to the Phi-3 model family.

As language models continue to evolve, updates like this demonstrate the ongoing potential for significant improvements even in smaller, more efficient models. This could lead to wider adoption and new applications of AI in scenarios where computational resources are limited.

Last Update: 03/07/2024