The field of multimodal language models has seen rapid advancements in recent years, with models like GPT-4o and Gemini Pro showcasing impressive capabilities in understanding both text and images. However, most cutting-edge multimodal models have been proprietary and unavailable to the wider community.

Meet MiniCPM-Llama3-V 2.5, the latest model in the open-source MiniCPM-V series developed by researchers from Tsinghua University and ModelBest. With 8.5 billion parameters, this model punches well above its weight, achieving performance on par with or even surpassing much larger proprietary models on a range of multimodal benchmarks. Let’s dive in and explore what makes MiniCPM-Llama3-V 2.5 so compelling.

An online demo is available.

Leading Multimodal Performance

MiniCPM-Llama3-V 2.5 has achieved outstanding results on OpenCompass, a comprehensive evaluation suite covering 11 popular multimodal benchmarks. The model attains an impressive average score of 65.1, outperforming industry heavyweights like GPT-4V-1106 (63.5), Gemini Pro (62.9), Claude 3 and Qwen-VL-Max despite having a fraction of their parameters.

Figure: results on TextVQA, DocVQA, OCRBench, OpenCompass MultiModal Avg, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, and Object HalBench.

Its performance is particularly remarkable on benchmarks measuring OCR and scene text understanding capabilities. On OCRBench, MiniCPM-Llama3-V 2.5 scores over 700 points, surpassing models like GPT-4 and Gemini Pro. It achieves 76.6% accuracy on TextVQA and 84.8% on DocVQA, setting a new bar for open-source models.

Some key benchmark results:

  • OpenCompass average: 65.1
  • OCRBench: 725
  • TextVQA: 76.6%
  • DocVQA: 84.8%
  • LLaVA Bench: 86.7%

Figure: LLaVA Bench comparison.

Advanced OCR and reasoning abilities

One of MiniCPM-Llama3-V 2.5’s most impressive features is its ability to understand text in images, even at high resolutions and arbitrary aspect ratios. The model can process images up to 1.8 million pixels, allowing it to perceive fine-grained details critical for demanding real-world applications.
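As a back-of-envelope illustration, a pipeline might pre-resize oversized inputs to stay within that ~1.8-megapixel budget while preserving the arbitrary aspect ratio (the helper below is a hypothetical sketch, not part of the model's API):

```python
import math

MAX_PIXELS = 1_800_000  # the model's ~1.8-megapixel input budget

def max_size(width: int, height: int) -> tuple[int, int]:
    """Scale (width, height) down to fit the pixel budget, keeping aspect ratio."""
    if width * height <= MAX_PIXELS:
        return width, height
    scale = math.sqrt(MAX_PIXELS / (width * height))
    return int(width * scale), int(height * scale)

print(max_size(4032, 3024))  # a 12-megapixel phone photo gets scaled down
```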

Recent enhancements have further bolstered MiniCPM-Llama3-V 2.5’s document understanding capabilities. It now supports full-text OCR extraction, table-to-markdown conversion, and exhibits strong instruction-following and reasoning skills over extracted information. This empowers the model to assist with complex analytical tasks involving multi-page documents and large tables.

Trustworthy behavior

Large language models can sometimes hallucinate facts or generate deceptive outputs. To address this issue, the MiniCPM team leveraged the RLAIF-V technique to align MiniCPM-Llama3-V 2.5’s behavior toward more truthful, well-grounded generation.

As a result, the model achieves a remarkably low hallucination rate of 10.3% on the challenging Object HalBench dataset, outperforming even GPT-4V-1106 (13.6%). This trustworthiness is essential for deploying the model in high-stakes applications where reliability is paramount.

Efficiency through optimizations

Despite its considerable size, MiniCPM-Llama3-V 2.5 has been engineered for efficient deployment on resource-constrained devices through an array of optimizations:

  • 4-bit quantization reduces memory usage while maintaining quality
  • Compilation and kernel optimizations massively accelerate CPU inference
  • Integration with Qualcomm’s QNN framework unlocks NPU acceleration on mobile chipsets
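A quick weights-only estimate shows why 4-bit quantization matters on constrained hardware (a rough sketch; real footprints also include activations and the KV cache):

```python
# Weights-only memory estimate for the 8.5B-parameter model at different
# precisions (activations and the KV cache add to this in practice).
PARAMS = 8.5e9

def weight_memory_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"fp16: {weight_memory_gb(16):.1f} GB")   # 17.0 GB
print(f"4-bit: {weight_memory_gb(4):.2f} GB")   # 4.25 GB
```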

On a Qualcomm Snapdragon 8 Gen 3 mobile platform, MiniCPM-Llama3-V 2.5 achieves a staggering 150x speed-up on image encoding and 3x faster language decoding compared to unoptimized models. The model can be served smoothly on iOS and Android devices via integration with the popular llama.cpp framework.
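For instance, a GGUF-converted model can be driven through llama.cpp's MiniCPM-V example program. This is a sketch only: the binary name, flags, and file names below are assumptions that vary across llama.cpp versions and conversions.

```shell
# Hypothetical invocation; check your llama.cpp build for exact names.
./llama-minicpmv-cli \
  -m MiniCPM-Llama3-V-2_5/ggml-model-Q4_K_M.gguf \
  --mmproj MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf \
  --image photo.jpg \
  -p "What is the main object shown in the image?"
```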

Multilingual capabilities

While most open-source multimodal models only support English or Chinese, MiniCPM-Llama3-V 2.5 comes with out-of-the-box support for over 30 languages, including German, French, Spanish, Italian, and Korean. This is enabled by cross-lingual transfer techniques that help generalize the model’s multimodal understanding across languages.

On the multilingual LLaVA Bench evaluation, MiniCPM-Llama3-V 2.5 attains consistently high performance across all covered languages, outperforming even GPT-4V. This multilingual support greatly expands the model’s potential applications and userbase.

Easy fine-tuning and deployment

The MiniCPM team has prioritized making MiniCPM-Llama3-V 2.5 accessible to developers and researchers:

  • LoRA-based fine-tuning allows adapting the model to custom use cases with just 2 commodity GPUs
  • Gradio and Streamlit demos enable interactive development and showcasing
  • HuggingFace transformers integration and a comprehensive inference API simplify productionization
  • A llama.cpp conversion and server support let developers integrate the model into existing pipelines
  • Published GGML format enables exploring further compression and optimization
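To get a feel for why LoRA fine-tuning fits on a couple of commodity GPUs, here is a rough trainable-parameter estimate. The shapes are illustrative assumptions (Llama-3-8B-like: hidden size 4096, 32 layers, rank-16 adapters on the query/value projections), not the team's published recipe:

```python
# Rough LoRA trainable-parameter count, assuming Llama-3-8B-like shapes:
# hidden size 4096, 32 layers, adapters of rank 16 on the q/v projections.
HIDDEN, LAYERS, RANK = 4096, 32, 16

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

total = LAYERS * 2 * lora_params(HIDDEN, HIDDEN, RANK)  # q_proj + v_proj per layer
print(total, f"= {total / 8.5e9:.3%} of 8.5B parameters")
```

Only about 8.4 million parameters (roughly 0.1% of the model) receive gradients, which is why adapter training stays within commodity-GPU memory.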

Thanks to these efforts, MiniCPM-Llama3-V 2.5 has a low barrier to entry despite its impressive capabilities. It’s an ideal foundation model for startups and researchers looking to build multimodal applications without massive computational resources.

Example usage

Let’s see MiniCPM-Llama3-V 2.5 in action with a simple example using the HuggingFace transformers library:

from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch

# trust_remote_code is needed because the model ships custom modeling code
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)

image = Image.open("path/to/image.jpg").convert("RGB")
prompt = "What is the main object shown in the image?"

msgs = [{"role": "user", "content": prompt}]
response = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.8,
)
print(response)
This code snippet loads the model and tokenizer, passes an image and question to the model’s chat function, and prints out the generated response. We can easily modify the prompt to engage in multi-turn conversations or adjust the generation hyperparameters for more diverse outputs.

MiniCPM-Llama3-V 2.5 also supports streaming outputs and user-provided system prompts to dynamically steer its behavior. The model is available via ollama as well, enabling interactive exploration without any setup.
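Multi-turn conversation is just a matter of growing the message list between calls (a sketch of the chat message structure; the example answer string is made up):

```python
# Build a multi-turn history for model.chat(); role/content fields follow
# the MiniCPM-V chat format. The assistant answer here is hypothetical.
msgs = [{"role": "user", "content": "What is the main object shown in the image?"}]

# Append the model's reply, then ask a follow-up question in context.
msgs.append({"role": "assistant", "content": "A red bicycle leaning against a wall."})
msgs.append({"role": "user", "content": "What color is it?"})

print([m["role"] for m in msgs])  # ['user', 'assistant', 'user']
```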

Conclusion

MiniCPM-Llama3-V 2.5 represents a major milestone in democratizing access to powerful multimodal language models. With its impressive performance, wide-ranging capabilities, and efficient implementation, this model enables researchers and developers to build applications that were previously limited to a few well-resourced labs.

The model’s multilingual support and instruction-following skills expand its potential even further, unlocking applications across diverse languages and use cases. MiniCPM-Llama3-V 2.5’s performance demonstrates that with thoughtful architecture and engineering, open-source models can match or even exceed proprietary counterparts in this fast-moving field.

As the MiniCPM team continues to refine and expand the model’s abilities, an exciting era of open multimodal innovation lies ahead. MiniCPM-Llama3-V 2.5 is poised to become a go-to foundation model for research and applications, and I can’t wait to see what the community creates with it. Huge kudos to the team at ModelBest, THUNLP, and Zhihu for this outstanding contribution!


Last Update: 16/06/2024