Alibaba’s research team has released Qwen2-VL, the latest iteration of vision-language models based on the Qwen2 architecture. This release marks a significant advancement in multimodal AI, building upon the foundations laid by its predecessor, Qwen-VL. After a year of intensive development, Qwen2-VL introduces a range of new capabilities that expand the horizons of visual understanding and interaction.

Key features and improvements

  1. Enhanced visual understanding

Qwen2-VL demonstrates state-of-the-art performance across various visual understanding benchmarks:

  • College-level problem solving (MMMU)
  • Mathematical reasoning (MathVista, MATH-Vision)
  • Document and diagram comprehension (DocVQA, ChartQA, OCRBench)
  • Multilingual text-image understanding (MTVQA)
  • General visual question answering (RealWorldQA, MMStar, MMVet)
  2. Extended video processing

The model now excels at understanding videos over 20 minutes in length, enabling:

  • High-quality video-based question answering
  • Dialogue systems based on video content
  • Video content creation and summarization
  3. Agent-like capabilities

Qwen2-VL can be integrated with external devices, such as mobile phones and robots, for:

  • Automatic operation based on visual input
  • Complex reasoning and decision-making in visual environments (a minimal agent-loop sketch follows this feature list)
  4. Multilingual support

Beyond English and Chinese, Qwen2-VL now understands text in various languages within images, including:

  • Most European languages
  • Japanese
  • Korean
  • Arabic
  • Vietnamese
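
As a minimal illustration of the agent-style loop mentioned in point 3, the sketch below asks the model (via the OpenAI-compatible API shown later in this post) to return a structured action for a device screenshot. The JSON action schema and the screenshot URL are invented for this example and are not an official interface; a production agent would validate the model's reply before acting on it:

import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Hypothetical screenshot of a phone UI; replace with a real capture.
screenshot_url = "https://example.com/phone_screenshot.png"

completion = client.chat.completions.create(
    model="qwen-vl-max-0809",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": screenshot_url}},
                {
                    "type": "text",
                    "text": (
                        "You control this phone. Reply ONLY with JSON of the form "
                        '{"action": "tap" | "type" | "scroll", "target": "<ui element>", "text": "<optional>"} '
                        "to open the settings app."
                    ),
                },
            ],
        }
    ],
)

# Parse the structured action so a controller could execute it on the device.
action = json.loads(completion.choices[0].message.content)
print(action)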

Model variants and performance

Qwen2-VL is available in three main variants:

  1. Qwen2-VL-72B

The flagship model, demonstrating top-tier performance across most metrics:

  • Surpasses closed-source models such as GPT-4o and Claude 3.5 Sonnet on several benchmarks
  • Excels particularly in document understanding tasks
  2. Qwen2-VL-7B

A more cost-effective option that maintains competitive performance:

  • Supports image, multi-image, and video inputs
  • Achieves state-of-the-art results on DocVQA and MTVQA benchmarks
  3. Qwen2-VL-2B

A compact model optimized for potential mobile deployment:

  • Strong performance in image, video, and multilingual comprehension
  • Excels in video-related tasks, document understanding, and general scenario question-answering compared to similar-sized models
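
For the resource-constrained settings hinted at above, the quantized checkpoints released alongside the model are a natural starting point. The following is a minimal loading sketch; the Qwen/Qwen2-VL-2B-Instruct-AWQ repository name and the reduced pixel budget are assumptions to verify against the model card, and loading AWQ weights requires the autoawq package:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Assumed quantized 2B checkpoint from the release collection; check the model card.
model_id = "Qwen/Qwen2-VL-2B-Instruct-AWQ"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the quantized checkpoint's stored precision
    device_map="auto",    # place weights on whatever accelerator is available
)

# A tighter pixel budget further limits memory use on small devices.
processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=256 * 28 * 28,
    max_pixels=768 * 28 * 28,
)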

The full collection of models can be found on Hugging Face.

Performance comparison

Let’s take a closer look at the performance of each model variant across various benchmarks:

Qwen2-VL-2B performance

| Benchmark Category | Benchmark Name | Qwen2-VL-2B | MiniCPM-V 2.0 | InternVL2-2B |
|---|---|---|---|---|
| College-level Problems | MMMU | 41.1 | 38.2 | 36.3 |
| Mathematical Reasoning | MathVista | 43.0 | – | 46.0 |
| | MATH-Vision | 12.4 | – | – |
| Document and Diagrams Reading | DocVQA | 90.1 | – | 86.9 |
| | ChartQA | 73.5 | – | 76.2 |
| | OCRBench | 794 | 605 | 781 |
| | MTVQA | 20.0 | – | – |
| | InfoVQA | 65.5 | – | 58.9 |
| | TextVQA | 79.7 | – | 73.4 |
| General Visual Question Answering | RealWorldQA | 62.9 | 55.8 | 57.3 |
| | MMStar | 48.0 | 39.1 | 49.8 |
| | MMVet | 49.5 | 41.0 | 39.7 |
| | MMT-Bench | 54.5 | – | – |
| | MMbench-1.1 | 72.2 | 65.8 | 69.6 |
| | MME | 1872.0 | 1808.6 | 1876.8 |
| | HallBench | 41.7 | 36.1 | 38.0 |
| Video Understanding | MVBench | 63.2 | – | 60.2 |
| | EgoSchema | 53.9 | – | – |
| | PerceptionTest | 54.9 | – | – |
| | Video-MME (w/o subs) | 55.6 | – | 45.0 |
| | Video-MME (w/ subs) | 60.4 | – | 47.3 |

Here and in the tables below, "–" marks scores that were not reported in the comparison.

Qwen2-VL-7B performance

| Benchmark Category | Benchmark Name | Qwen2-VL-7B | GPT-4o-mini | Best Open-source VLM |
|---|---|---|---|---|
| College-level Problems | MMMU | 54.1 | 60.0 | 51.8 (InternVL2-8b) |
| Mathematical Reasoning | MathVista | 58.2 | 52.4 | 60.6 (MiniCPM-2.4) |
| | MATH-Vision | 16.3 | – | 14.5 (InternLM-XComposer2-VL) |
| Document and Diagrams Reading | DocVQA | 94.5 | – | 91.6 (InternVL2-8b) |
| | ChartQA | 83.0 | – | 83.3 (InternVL2-8b) |
| | OCRBench | 845 | 785 | 852 (MiniCPM-2.4) |
| | MTVQA | 26.3 | – | 17.3 (MiniCPM-2.4) |
| | InfoVQA | 76.5 | – | 74.8 (InternVL2-8b) |
| | TextVQA | 84.3 | – | 80.1 (InternVL2-8b) |
| General Visual Question Answering | RealWorldQA | 70.1 | 67.1 | 64.4 (InternVL2-8b) |
| | MMStar | 60.7 | 54.8 | 61.5 (InternVL2-8b) |
| | MMVet | 62.0 | 66.9 | 60.0 (MiniCPM-2.4) |
| | MMT-Bench | 63.7 | – | 55.7 (InternLM-XComposer2-VL) |
| | MMbench-1.1 | 80.7 | 76.0 | 79.4 (InternVL2-8b) |
| | MME | 2326.8 | 2003.4 | 2348.4 (MiniCPM-2.4) |
| | HallBench | 50.6 | 46.1 | 48.1 (MiniCPM-2.4) |
| Video Understanding | MVBench | 67.0 | – | 66.4 (InternVL2-8b) |
| | EgoSchema | 66.7 | – | 60.1 (LLaVA-OneVision-7B) |
| | PerceptionTest | 62.3 | – | 57.1 (LLaVA-OneVision-7B) |
| | Video-MME (w/o subs) | 63.3 | 64.8 | 60.9 (MiniCPM-2.4) |
| | Video-MME (w/ subs) | 69.0 | 68.9 | 63.7 (MiniCPM-2.4) |

Qwen2-VL-72B performance

| Benchmark Category | Benchmark Name | Qwen2-VL-72B | GPT-4o-0513 | Claude 3.5 Sonnet | Other Best Model |
|---|---|---|---|---|---|
| College-level Problems | MMMU | 64.5 | 69.2 | 68.3 | 66.1 (Grok-2) |
| Mathematical Reasoning | MathVista | 70.5 | 63.8 | 67.7 | 69.0 (Grok-2) |
| | MATH-Vision | 25.9 | 30.4 | – | 30.3 (GPT-4 turbo) |
| Document and Diagrams Reading | DocVQA | 96.5 | 92.8 | 95.2 | 94.1 (InternVL2-74B) |
| | ChartQA | 88.3 | 85.7 | 90.8 | 88.4 (InternVL2-74B) |
| | OCRBench | 855 | 736 | 788 | 852 (MiniCPM-2.4) |
| | MTVQA | 32.6 | 27.8 | 25.7 | 23.2 (Gemini 1.0 Ultra) |
| | InfoVQA | 84.5 | – | – | 82.0 (InternVL2-74B) |
| | TextVQA | 85.5 | – | – | 84.4 (InternVL2-74B) |
| General Visual Question Answering | RealWorldQA | 77.8 | 75.4 | 60.1 | 72.2 (InternVL2-74B) |
| | MMStar | 68.3 | 63.9 | 62.2 | 67.1 (InternVL2-74B) |
| | MMVet | 74.0 | 69.1 | 66.0 | 67.5 (GPT-4v) |
| | MMT-Bench | 71.7 | 65.5 | – | 69.4 (InternVL2-1.2) |
| | MMbench-1.1 | 85.9 | 82.2 | 78.5 | 85.5 (InternVL2-74B) |
| | MME | 2482.7 | 2328.7 | 1920.0 | 2414.7 (InternVL2-74B) |
| | HallBench | 58.1 | 55.0 | 49.9 | 55.2 (InternVL2-74B) |
| Video Understanding | MVBench | 73.6 | – | – | 69.6 (InternVL2-74B) |
| | EgoSchema | 77.9 | 72.2 | – | 72.2 (Gemini 1.5 Pro) |
| | PerceptionTest | 68.0 | – | – | 66.9 (LLaVA-OneVision-73M) |
| | Video-MME (w/o subs) | 71.2 | 71.9 | 60.0 | 75.0 (Gemini 1.5 Pro) |
| | Video-MME (w/ subs) | 77.8 | 77.2 | – | 81.3 (Gemini 1.5 Pro) |
| Visual Agent | FnCall | 93.1 | 90.2 | – | – |
| | AITZ | 89.6 | 70.0 | – | 83.0 (AITZ) |
| | Gym Cards | 61.7 | 53.6 | – | 45.5 (BLAVAM) |
| | ALFRED | 67.8 | – | – | 67.7 (Think-Bot) |

The 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5 Sonnet. Notably, it demonstrates a significant edge in document understanding.

Architecture and technical improvements

Qwen2-VL builds upon the architecture of its predecessor, Qwen-VL, incorporating several key enhancements:

  1. Vision Transformer (ViT) Integration
  • Uses a ViT with approximately 600M parameters
  • Designed to handle both image and video inputs seamlessly
  2. Naive dynamic resolution support
  • Allows processing of arbitrary image resolutions
  • Maps inputs into a dynamic number of visual tokens
  • Ensures consistency between model input and inherent image information
  • More closely mimics human visual perception (a rough token-budget sketch follows this list)
  3. Multimodal rotary position embedding (M-ROPE)
  • Deconstructs original rotary embedding into three parts:
    • Temporal information
    • Spatial height information
    • Spatial width information
  • Enables concurrent capture and integration of:
    • 1D textual information
    • 2D visual information
    • 3D video positional information
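
To give a rough sense of how dynamic resolution turns an image into a token budget, here is an illustrative sketch. It is a simplification, not the processor's exact algorithm: it assumes one visual token per 28×28-pixel area (the granularity implied by the min_pixels/max_pixels settings shown later, 256*28*28 to 1280*28*28) and rescales the image when its area falls outside that range:

import math

def estimate_visual_tokens(height, width,
                           min_pixels=256 * 28 * 28,
                           max_pixels=1280 * 28 * 28):
    """Illustrative estimate of the visual-token count for a single image."""
    grid = 28  # assumed: one token per 28x28 area (2x2-merged 14x14 ViT patches)
    # Snap the image dimensions to the 28-pixel grid.
    h = max(grid, round(height / grid) * grid)
    w = max(grid, round(width / grid) * grid)
    if h * w > max_pixels:            # too many pixels: scale down
        scale = math.sqrt(max_pixels / (h * w))
        h = max(grid, math.floor(h * scale / grid) * grid)
        w = max(grid, math.floor(w * scale / grid) * grid)
    elif h * w < min_pixels:          # too few pixels: scale up
        scale = math.sqrt(min_pixels / (h * w))
        h = math.ceil(h * scale / grid) * grid
        w = math.ceil(w * scale / grid) * grid
    return (h // grid) * (w // grid)

print(estimate_visual_tokens(1080, 1920))  # about 1,200 tokens, clamped near the cap
print(estimate_visual_tokens(224, 224))    # 256 tokens, scaled up to the floor

The point of the exercise: a large image is represented by many more tokens than a thumbnail, rather than both being squashed to a fixed resolution.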

These architectural improvements contribute to Qwen2-VL’s enhanced performance across various tasks and input modalities.
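
To make the M-ROPE decomposition above a bit more concrete, the toy sketch below builds the three-component position indices for a short text–image–text sequence. It is a conceptual illustration of the (temporal, height, width) split, not the model's actual implementation, and the sequence lengths and grid sizes are made up for the example:

def build_mrope_positions(prefix_len, image_grid_h, image_grid_w, suffix_len):
    """Toy illustration of M-ROPE position IDs for a [text, image, text] sequence.

    Each token gets a (temporal, height, width) triple:
      - text tokens use the same index on all three axes (behaving like 1D RoPE),
      - image tokens share one temporal index while height/width enumerate the grid,
      - later text resumes counting after the largest index used so far.
    """
    positions = []
    # Leading text: identical indices on every axis.
    for i in range(prefix_len):
        positions.append((i, i, i))
    # Image: one temporal step, 2D enumeration over the visual-token grid.
    t = prefix_len
    for h in range(image_grid_h):
        for w in range(image_grid_w):
            positions.append((t, t + h, t + w))
    # Trailing text resumes after the maximum index consumed by the image.
    start = max(max(p) for p in positions) + 1
    for i in range(suffix_len):
        positions.append((start + i,) * 3)
    return positions

for pos in build_mrope_positions(prefix_len=2, image_grid_h=2, image_grid_w=3, suffix_len=2):
    print(pos)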

Qwen2-VL architecture. Source: https://qwenlm.github.io/blog/qwen2-vl/

Developing with Qwen2-VL

Qwen2-VL is available in multiple formats to suit different development needs:

  1. Qwen2-VL-72B API

Access the largest model through the official API:


from openai import OpenAI
import os
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "dog_and_girl.jpeg"

# Getting the base64 string
base64_image = encode_image(image_path)

def get_response():
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-max-0809",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        top_p=0.8,
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in completion:
        print(chunk.model_dump_json())

if __name__ == "__main__":
    get_response()
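
The loop above prints each raw chunk as JSON. If you would rather assemble the streamed reply into a single string, a variation like the following should work with any OpenAI-compatible endpoint; it guards against the final usage-only chunk (which has an empty choices list) and could replace the for-loop at the end of get_response():

full_reply = ""
for chunk in completion:
    # The final chunk (with include_usage) carries only usage stats and no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        full_reply += chunk.choices[0].delta.content
print(full_reply)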

  2. Open-Source Models (2B and 7B)

The 2B and 7B variants are open-sourced and available on Hugging Face and ModelScope. Here’s an example of using the 7B model with Hugging Face Transformers:


import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model (torch_dtype="auto" selects the checkpoint's native precision)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# For better performance with flash attention 2
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Optional: Adjust token range for balancing speed and memory usage
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # move input tensors onto the model's device

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

This code demonstrates how to load the model, process inputs, and generate outputs using the Qwen2-VL-7B model.
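
Video inputs follow the same pattern. The snippet below is a sketch based on the message format accepted by qwen_vl_utils; the local file path is a placeholder, and the fps/max_pixels sampling keys are options to double-check against the repository README:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A video is passed as another content item; path and sampling settings are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])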

Integration with open-source ecosystem

To facilitate seamless integration and use of Qwen2-VL models, the developers have ensured compatibility with various tools and frameworks in the open-source ecosystem:

  1. Quantization (e.g., AutoGPTQ and AutoAWQ)
  2. Deployment (e.g., vLLM; see the sketch below)
  3. Fine-tuning (e.g., Llama-Factory)

These integrations allow developers to optimize, deploy, and customize Qwen2-VL models for specific use cases and hardware configurations.
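
As one example of the deployment path, a vLLM-served instance exposes an OpenAI-compatible endpoint that can be queried with the same client code used for the official API. The sketch below assumes a sufficiently recent vLLM build with Qwen2-VL support and a server started with `vllm serve Qwen/Qwen2-VL-7B-Instruct`; details may differ, so check the vLLM and Qwen documentation:

from openai import OpenAI

# Assumes a local vLLM server started with:
#   vllm serve Qwen/Qwen2-VL-7B-Instruct
# (Qwen2-VL support requires a recent vLLM version.)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)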

Licensing

Both the Qwen2-VL-2B and Qwen2-VL-7B models are released under the Apache 2.0 license, allowing for broad use in both academic and commercial applications.

Future directions

The Qwen team has outlined several areas for future development:

  1. Continued improvement of vision-language models based on upcoming versions of the core language models.
  2. Integration of additional modalities, moving towards a more comprehensive “omni-model” approach.
  3. Further enhancements to video understanding capabilities, potentially including audio processing.
  4. Expansion of multilingual support to cover an even broader range of languages and scripts.
  5. Refinement of agent-like capabilities for more complex reasoning and decision-making tasks in visual environments.

Limitations and considerations

While Qwen2-VL represents a significant advancement in vision-language AI, it’s important to note some limitations:

  1. Audio Processing: The model cannot extract or process audio from videos.
  2. Knowledge Cutoff: The model’s knowledge is current only up to June 2023.
  3. Complex Instructions: There may be accuracy issues when processing very complex instructions or scenarios.
  4. Specific Task Weaknesses: The model shows relative weakness in tasks involving counting, character recognition, and 3D spatial awareness.

Conclusion

Qwen2-VL marks a significant leap forward in the field of vision-language AI. With its ability to handle a wide range of visual inputs, from static images to long-form videos, and its impressive performance across numerous benchmarks, it opens up new possibilities for multimodal AI applications.

The model’s strengths in document understanding, multilingual text recognition, and video comprehension make it particularly well-suited for tasks such as:

  • Advanced visual question-answering systems
  • Automated document analysis and data extraction
  • Cross-lingual visual information processing
  • Long-form video content analysis and summarization
  • Visual-based agent systems for robotics and automation

As the field of AI continues to evolve rapidly, Qwen2-VL stands out as a versatile and powerful tool for researchers and developers working at the intersection of computer vision and natural language processing. Its open-source variants and integration with popular frameworks ensure that it will play a significant role in driving innovation in multimodal AI applications.
