Alibaba’s research team has released Qwen2-VL, the latest iteration of vision-language models based on the Qwen2 architecture. This release marks a significant advancement in multimodal AI, building upon the foundations laid by its predecessor, Qwen-VL. After a year of intensive development, Qwen2-VL introduces a range of new capabilities that expand the horizons of visual understanding and interaction.
Key features and improvements
- Enhanced visual understanding
Qwen2-VL demonstrates state-of-the-art performance across various visual understanding benchmarks:
- College-level problem solving (MMMU)
- Mathematical reasoning (MathVista, MATH-Vision)
- Document and diagram comprehension (DocVQA, ChartQA, OCRBench)
- Multilingual text-image understanding (MTVQA)
- General visual question answering (RealWorldQA, MMStar, MMVet)
- Extended video processing
The model now excels at understanding videos over 20 minutes in length, enabling:
- High-quality video-based question answering
- Dialogue systems based on video content
- Video content creation and summarization
- Agent-like capabilities
Qwen2-VL can be integrated with external devices, such as mobile phones and robots, for:
- Automatic operation based on visual input
- Complex reasoning and decision-making in visual environments
- Multilingual support
Beyond English and Chinese, Qwen2-VL now understands text in various languages within images, including:
- Most European languages
- Japanese
- Korean
- Arabic
- Vietnamese
Model variants and performance
Qwen2-VL is available in three main variants:
Qwen2-VL-72B: The flagship model, demonstrating top-tier performance across most metrics:
- Often surpasses closed-source models like GPT-4o and Claude 3.5-Sonnet
- Excels particularly in document understanding tasks
Qwen2-VL-7B: A more cost-effective option that maintains competitive performance:
- Supports image, multi-image, and video inputs
- Achieves state-of-the-art results on the DocVQA and MTVQA benchmarks
Qwen2-VL-2B: A compact model optimized for potential mobile deployment:
- Strong performance in image, video, and multilingual comprehension
- Outperforms similar-sized models in video-related tasks, document understanding, and general-scenario question answering
The full model collection is available on Hugging Face.
Performance comparison
Let’s take a closer look at the performance of each model variant across various benchmarks:
Qwen2-VL-2B performance
| Benchmark Category | Benchmark Name | Qwen2-VL-2B Score | MiniCPM-V 2.0 Score | InternVL2-2B Score |
|---|---|---|---|---|
| College-level Problems | MMMU | 41.1 | 38.2 | 36.3 |
| Mathematical Reasoning | MathVista | 43.0 | – | 46.0 |
| | MATH-Vision | 12.4 | – | – |
| Document and Diagram Reading | DocVQA | 90.1 | – | 86.9 |
| | ChartQA | 73.5 | – | 76.2 |
| | OCRBench | 794 | 605 | 781 |
| | MTVQA | 20.0 | – | – |
| | InfoVQA | 65.5 | – | 58.9 |
| | TextVQA | 79.7 | – | 73.4 |
| General Visual Question Answering | RealWorldQA | 62.9 | 55.8 | 57.3 |
| | MMStar | 48.0 | 39.1 | 49.8 |
| | MMVet | 49.5 | 41.0 | 39.7 |
| | MMT-Bench | 54.5 | – | – |
| | MMBench-1.1 | 72.2 | 65.8 | 69.6 |
| | MME | 1872.0 | 1808.6 | 1876.8 |
| | HallBench | 41.7 | 36.1 | 38.0 |
| Video Understanding | MVBench | 63.2 | – | 60.2 |
| | EgoSchema | 53.9 | – | – |
| | PerceptionTest | 54.9 | – | – |
| | Video-MME w/o subs | 55.6 | – | 45.0 |
| | Video-MME w/ subs | 60.4 | – | 47.3 |
Qwen2-VL-7B performance
| Benchmark Category | Benchmark Name | Qwen2-VL-7B Score | GPT-4o-mini Score | Other Best Open-source VLM Score |
|---|---|---|---|---|
| College-level Problems | MMMU | 54.1 | 60.0 | 51.8 (InternVL2-8B) |
| Mathematical Reasoning | MathVista | 58.2 | 52.4 | 60.6 (MiniCPM-2.4) |
| | MATH-Vision | 16.3 | – | 14.5 (InternLM-XComposer2-VL) |
| Document and Diagram Reading | DocVQA | 94.5 | – | 91.6 (InternVL2-8B) |
| | ChartQA | 83.0 | – | 83.3 (InternVL2-8B) |
| | OCRBench | 845 | 785 | 852 (MiniCPM-2.4) |
| | MTVQA | 26.3 | – | 17.3 (MiniCPM-2.4) |
| | InfoVQA | 76.5 | – | 74.8 (InternVL2-8B) |
| | TextVQA | 84.3 | – | 80.1 (InternVL2-8B) |
| General Visual Question Answering | RealWorldQA | 70.1 | 67.1 | 64.4 (InternVL2-8B) |
| | MMStar | 60.7 | 54.8 | 61.5 (InternVL2-8B) |
| | MMVet | 62.0 | 66.9 | 60.0 (MiniCPM-2.4) |
| | MMT-Bench | 63.7 | – | 55.7 (InternLM-XComposer2-VL) |
| | MMBench-1.1 | 80.7 | 76.0 | 79.4 (InternVL2-8B) |
| | MME | 2326.8 | 2003.4 | 2348.4 (MiniCPM-2.4) |
| | HallBench | 50.6 | 46.1 | 48.1 (MiniCPM-2.4) |
| Video Understanding | MVBench | 67.0 | – | 66.4 (InternVL2-8B) |
| | EgoSchema | 66.7 | – | 60.1 (LLaVA-OneVision-7B) |
| | PerceptionTest | 62.3 | – | 57.1 (LLaVA-OneVision-7B) |
| | Video-MME w/o subs | 63.3 | 64.8 | 60.9 (MiniCPM-2.4) |
| | Video-MME w/ subs | 69.0 | 68.9 | 63.7 (MiniCPM-2.4) |
Qwen2-VL-72B performance
| Benchmark Category | Benchmark Name | Qwen2-VL-72B Score | GPT-4o-0513 Score | Claude 3.5-Sonnet Score | Other Best Model Score |
|---|---|---|---|---|---|
| College-level Problems | MMMU | 64.5 | 69.2 | 68.3 | 66.1 (Grok-2) |
| Mathematical Reasoning | MathVista | 70.5 | 63.8 | 67.7 | 69.0 (Grok-2) |
| | MATH-Vision | 25.9 | 30.4 | – | 30.3 (GPT-4 Turbo) |
| Document and Diagram Reading | DocVQA | 96.5 | 92.8 | 95.2 | 94.1 (InternVL2-74B) |
| | ChartQA | 88.3 | 85.7 | 90.8 | 88.4 (InternVL2-74B) |
| | OCRBench | 855 | 736 | 788 | 852 (MiniCPM-2.4) |
| | MTVQA | 32.6 | 27.8 | 25.7 | 23.2 (Gemini 1.0 Ultra) |
| | InfoVQA | 84.5 | – | – | 82.0 (InternVL2-74B) |
| | TextVQA | 85.5 | – | – | 84.4 (InternVL2-74B) |
| General Visual Question Answering | RealWorldQA | 77.8 | 75.4 | 60.1 | 72.2 (InternVL2-74B) |
| | MMStar | 68.3 | 63.9 | 62.2 | 67.1 (InternVL2-74B) |
| | MMVet | 74.0 | 69.1 | 66.0 | 67.5 (GPT-4V) |
| | MMT-Bench | 71.7 | 65.5 | – | 69.4 (InternVL2-1.2) |
| | MMBench-1.1 | 85.9 | 82.2 | 78.5 | 85.5 (InternVL2-74B) |
| | MME | 2482.7 | 2328.7 | 1920.0 | 2414.7 (InternVL2-74B) |
| | HallBench | 58.1 | 55.0 | 49.9 | 55.2 (InternVL2-74B) |
| Video Understanding | MVBench | 73.6 | – | – | 69.6 (InternVL2-74B) |
| | EgoSchema | 77.9 | 72.2 | – | 72.2 (Gemini 1.5 Pro) |
| | PerceptionTest | 68.0 | – | – | 66.9 (LLaVA-OneVision-72B) |
| | Video-MME w/o subs | 71.2 | 71.9 | 60.0 | 75.0 (Gemini 1.5 Pro) |
| | Video-MME w/ subs | 77.8 | 77.2 | – | 81.3 (Gemini 1.5 Pro) |
| Visual Agent | FnCall | 93.1 | 90.2 | – | – |
| | AITZ | 89.6 | 70.0 | – | 83.0 (AITZ) |
| | Gym Cards | 61.7 | 53.6 | – | 45.5 (BLAVAM) |
| | ALFRED | 67.8 | – | – | 67.7 (Think-Bot) |
The 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet. Notably, it demonstrates a significant edge in document understanding.
Architecture and technical improvements
Qwen2-VL builds upon the architecture of its predecessor, Qwen-VL, incorporating several key enhancements:
- Vision Transformer (ViT) Integration
- Uses a ViT with approximately 600M parameters
- Designed to handle both image and video inputs seamlessly
- Naive dynamic resolution support
- Allows processing of arbitrary image resolutions
- Maps inputs into a dynamic number of visual tokens
- Ensures consistency between model input and inherent image information
- More closely mimics human visual perception
- Multimodal rotary position embedding (M-RoPE), sketched in the example after this list
- Deconstructs original rotary embedding into three parts:
- Temporal information
- Spatial height information
- Spatial width information
- Enables concurrent capture and integration of:
- 1D textual information
- 2D visual information
- 3D video positional information
These architectural improvements contribute to Qwen2-VL’s enhanced performance across various tasks and input modalities.
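To make the M-RoPE idea concrete, here is a minimal, hypothetical sketch of how the three position components (temporal, height, width) could be assigned to text, image, and video tokens. It only illustrates the scheme described above; the helper functions are illustrative names, not Qwen2-VL's actual implementation, which lives in the Transformers modeling code.

```python
# Conceptual sketch of M-RoPE position IDs as (temporal, height, width) triples.
# This is an illustration of the idea, not Qwen2-VL's real code.

def text_position_ids(start, length):
    """Text tokens: all three components share the ordinary 1D index."""
    return [(start + i, start + i, start + i) for i in range(length)]

def image_position_ids(start, grid_h, grid_w):
    """Image patches: one temporal step; height/width follow the patch grid."""
    return [(start, start + h, start + w) for h in range(grid_h) for w in range(grid_w)]

def video_position_ids(start, num_frames, grid_h, grid_w):
    """Video patches: temporal advances per frame, spatial parts per patch."""
    return [
        (start + t, start + h, start + w)
        for t in range(num_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]

# Example: a prompt of 5 text tokens followed by an image split into a 2x3 patch grid.
ids = text_position_ids(0, 5) + image_position_ids(5, 2, 3)
print(ids)
```

Note that for pure text the three components are identical, so M-RoPE reduces to ordinary 1D rotary embedding, while image and video tokens receive genuinely 2D and 3D positional information.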
Developing with Qwen2-VL
Qwen2-VL is available in multiple formats to suit different development needs:
- Qwen2-VL-72B API
Access the largest model through the official API:
```python
from openai import OpenAI
import os
import base64


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "dog_and_girl.jpeg"

# Getting the base64 string
base64_image = encode_image(image_path)


def get_response():
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-max-0809",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        top_p=0.8,
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in completion:
        print(chunk.model_dump_json())


if __name__ == "__main__":
    get_response()
```
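The loop above prints each raw streaming chunk as JSON. If you only want the assembled answer, a small variation inside get_response() could accumulate the streamed deltas instead; this relies on the standard OpenAI SDK streaming fields, and with include_usage enabled the final chunk carries usage data and no choices.

```python
# Variation: accumulate the streamed reply into plain text instead of dumping JSON.
# With stream_options={"include_usage": True}, the final chunk reports token usage
# and has an empty `choices` list, hence the guard below.
reply = ""
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        reply += chunk.choices[0].delta.content
print(reply)
```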
- Open-Source Models (2B and 7B)
The 2B and 7B variants are open-sourced and available on Hugging Face and ModelScope. Here’s an example of using the 7B model with Hugging Face Transformers:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", device_map="auto"
)

# For better performance with flash attention 2
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Optional: Adjust token range for balancing speed and memory usage
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
This code demonstrates how to load the model, process inputs, and generate outputs using the Qwen2-VL-7B model.
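Because long-video understanding is one of Qwen2-VL’s headline features, here is a hedged sketch of a video prompt using the same model and processor loaded above. The "type": "video" content entry and the fps/max_pixels fields follow the examples shipped with qwen_vl_utils in the Qwen2-VL repository; treat the exact keys and the file path as illustrative and check the official README.

```python
# Sketch: video input, reusing the model and processor loaded above.
# The "video" content type and the fps/max_pixels fields follow the qwen_vl_utils
# examples in the Qwen2-VL repository; the file path is a placeholder.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder local path
                "max_pixels": 360 * 420,                # cap per-frame resolution to control memory
                "fps": 1.0,                             # frame sampling rate
            },
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
)
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True))
```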
Integration with open-source ecosystem
To facilitate seamless integration and use of Qwen2-VL models, the developers have ensured compatibility with various tools and frameworks in the open-source ecosystem:
- Quantization
- AutoGPTQ: https://github.com/AutoGPTQ/AutoGPTQ
- AutoAWQ: https://github.com/casper-hansen/AutoAWQ
- Deployment
- Fine-tuning
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
These integrations allow developers to optimize, deploy, and customize Qwen2-VL models for specific use cases and hardware configurations.
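As a concrete example of the quantization path, a GPTQ- or AWQ-quantized checkpoint can be loaded with the same Transformers API used earlier. The repository id below ("Qwen/Qwen2-VL-7B-Instruct-AWQ") is an assumption based on Qwen’s usual naming; check the Hugging Face hub for the quantized variants that were actually published.

```python
# Sketch: loading an AWQ-quantized Qwen2-VL checkpoint (repo id assumed; verify on the hub).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ",  # assumed name of the AWQ-quantized weights
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-AWQ")
# Inference then follows the same pattern as the full-precision example above.
```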
Licensing
Both the Qwen2-VL-2B and Qwen2-VL-7B models are released under the Apache 2.0 license, allowing for broad use in both academic and commercial applications.
Future directions
The Qwen team has outlined several areas for future development:
- Continued improvement of vision-language models based on upcoming versions of the core language models.
- Integration of additional modalities, moving towards a more comprehensive “omni-model” approach.
- Further enhancements to video understanding capabilities, potentially including audio processing.
- Expansion of multilingual support to cover an even broader range of languages and scripts.
- Refinement of agent-like capabilities for more complex reasoning and decision-making tasks in visual environments.
Limitations and considerations
While Qwen2-VL represents a significant advancement in vision-language AI, it’s important to note some limitations:
- Audio Processing: The model cannot extract or process audio from videos.
- Knowledge Cutoff: The model’s knowledge is current only up to June 2023.
- Complex Instructions: There may be accuracy issues when processing very complex instructions or scenarios.
- Specific Task Weaknesses: The model shows relative weakness in tasks involving counting, character recognition, and 3D spatial awareness.
Conclusion
Qwen2-VL marks a significant leap forward in the field of vision-language AI. With its ability to handle a wide range of visual inputs, from static images to long-form videos, and its impressive performance across numerous benchmarks, it opens up new possibilities for multimodal AI applications.
The model’s strengths in document understanding, multilingual text recognition, and video comprehension make it particularly well-suited for tasks such as:
- Advanced visual question-answering systems
- Automated document analysis and data extraction
- Cross-lingual visual information processing
- Long-form video content analysis and summarization
- Visual-based agent systems for robotics and automation
As the field of AI continues to evolve rapidly, Qwen2-VL stands out as a versatile and powerful tool for researchers and developers working at the intersection of computer vision and natural language processing. Its open-source variants and integration with popular frameworks ensure that it will play a significant role in driving innovation in multimodal AI applications.