Multimodal models that can process and understand both visual and textual information have become increasingly important in artificial intelligence. LLaVA-OneVision, a family of open large multimodal models, is a significant advance in this area.

Developed by consolidating insights from the LLaVA-NeXT blog series, LLaVA-OneVision is the first single model to simultaneously push performance boundaries across three crucial computer vision scenarios: single-image, multi-image, and video.

This article covers the architecture, capabilities, and innovations of LLaVA-OneVision, and explores how strong transfer learning across modalities and scenarios gives rise to emerging capabilities. We’ll examine its performance on various benchmarks, discuss its open-source nature, and highlight the model’s ability to generalize and tackle real-world computer vision tasks.

Model architecture

At the heart of LLaVA-OneVision is a carefully designed architecture that combines state-of-the-art components for both visual and language processing. Let’s break down the key elements:

  1. Language model: Qwen-2
    • Serves as the core language processing unit
    • Handles text generation and understanding
  2. Vision encoder: SigLIP
    • Processes visual inputs (images and video frames)
    • Extracts meaningful features from visual data
  3. Projection layer: 2-Layer MLP
    • Bridges the gap between visual and language representations
    • Maps visual features to a space compatible with the language model
Figure: LLaVA-OneVision network architecture. Left: the current model instantiation. Right: the general form of the LLaVA architecture, extended to support more visual signals. Source: https://arxiv.org/pdf/2408.03326

This architecture allows LLaVA-OneVision to process various types of visual inputs, including single images, multiple images, and video frames, while maintaining a consistent interface with the language model.
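The data flow described above (vision encoder features, then a 2-layer MLP projector, then the language model) can be illustrated with a toy sketch. It uses plain Python lists instead of tensors, tiny made-up dimensions, random weights, and a GELU activation between the two layers; all of these are assumptions for illustration, not the model's actual implementation or sizes.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU; the activation choice is an assumption for this sketch."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def random_linear(d_in: int, d_out: int, rng: random.Random):
    """Create a (weight, bias) pair for a dense layer with toy random weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def linear(tokens, weight, bias):
    """Apply y = Wx + b to every token vector."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def project(visual_tokens, layer1, layer2):
    """2-layer MLP: linear -> GELU -> linear."""
    hidden = linear(visual_tokens, *layer1)
    hidden = [[gelu(v) for v in tok] for tok in hidden]
    return linear(hidden, *layer2)


rng = random.Random(0)
# Hypothetical toy sizes; the real encoder and LM widths are much larger.
D_VISION, D_LM, N_VISUAL, N_TEXT = 8, 12, 5, 3

layer1 = random_linear(D_VISION, D_LM, rng)
layer2 = random_linear(D_LM, D_LM, rng)

# Stand-ins for vision-encoder outputs and text token embeddings.
visual_tokens = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(N_VISUAL)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_LM)] for _ in range(N_TEXT)]

projected = project(visual_tokens, layer1, layer2)

# After projection, visual tokens share the LM embedding width and can be
# placed in the same sequence as the text embeddings.
lm_input = projected + text_embeddings
print(len(lm_input), len(lm_input[0]))
```

The point of the sketch is the shape contract: whatever the visual input is, the projector emits vectors of the language model's embedding width, which is what keeps the interface to the language model consistent.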

Visual representation strategy

One of the key innovations in LLaVA-OneVision is its visual representation strategy, which enables processing of different visual modalities. The strategy is designed to balance the representation of single images, multiple images, and video frames, ensuring that the model can effectively transfer knowledge across these modalities.

Here’s a breakdown of the token strategy for each modality:

  1. Single-Image:
    • Uses AnyResMax-9 strategy
    • Each input image or grid is processed into 729 visual tokens
    • Maximum tokens: (1 + 9) * 729 = 7290 tokens
    • Allows for high-resolution image processing
  2. Multi-Image:
    • Simple padding strategy
    • Each image is resized to fit within a 384×384 frame
    • Zero-padding is removed after processing
    • Up to 12 images per instance
    • Maximum tokens: 12 * 729 = 8748 tokens
  3. Video:
    • Each frame is processed and undergoes 2×2 bilinear interpolation
    • Results in 196 tokens per frame
    • Up to 32 frames per video
    • Maximum tokens: 32 * 196 = 6272 tokens

This balanced approach ensures that the maximum number of tokens across different modalities is approximately equal, allowing for more equitable representation and better transfer learning capabilities.
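The token arithmetic above can be checked with a few lines of Python. This is a minimal sketch that only restates the figures quoted in this section; the function and constant names are hypothetical, and reading "(1 + 9)" as one base view plus up to nine grid crops is an interpretation of the AnyResMax-9 bullet.

```python
TOKENS_PER_IMAGE = 729        # visual tokens per image or grid crop
TOKENS_PER_VIDEO_FRAME = 196  # visual tokens per frame after interpolation


def single_image_budget(max_crops: int = 9) -> int:
    """One base view plus up to `max_crops` grid crops (AnyResMax-9)."""
    return (1 + max_crops) * TOKENS_PER_IMAGE


def multi_image_budget(num_images: int = 12) -> int:
    """Each image contributes one full set of tokens."""
    return num_images * TOKENS_PER_IMAGE


def video_budget(num_frames: int = 32) -> int:
    """Each frame contributes a reduced set of tokens."""
    return num_frames * TOKENS_PER_VIDEO_FRAME


budgets = {
    "single-image": single_image_budget(),
    "multi-image": multi_image_budget(),
    "video": video_budget(),
}

for name, tokens in budgets.items():
    print(f"{name:>12}: {tokens} tokens")

# Ratio between the largest and smallest budget, as a rough balance measure.
spread = max(budgets.values()) / min(budgets.values())
print(f"max/min ratio: {spread:.2f}")
```

The three maxima land within a factor of about 1.4 of each other, which is the sense in which the budgets are "approximately equal".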

Training process

The training of LLaVA-OneVision follows a carefully designed curriculum to build up its multimodal capabilities:

  1. Pretraining stage:
    • Dataset: LCS-558K
    • Duration: 1 epoch
    • Trainable components: Projector only
  2. Mid stage:
    • Dataset: 4.7M high-quality synthetic data
    • Duration: 1 epoch
    • Trainable components: Full model
  3. Final-Image stage:
    • Dataset: 3.6M single-image data
    • Duration: 1 epoch
    • Trainable components: Full model
  4. OneVision stage:
    • Dataset: 1.6M mixed single-image/multi-image/video data
    • Duration: 1 epoch
    • Trainable components: Full model

Training was conducted in bfloat16 precision on a cluster of 256 NVIDIA A100 GPUs, using the Hugging Face Trainer for orchestration and PyTorch for neural network computations.
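The four-stage curriculum can be written down as a small data structure, which makes it easy to see what is trained when. This is an illustrative sketch only: the `Stage` class and field names are hypothetical, and the sample counts are the rounded figures quoted above (with LCS-558K taken as 558,000 samples based on its name).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    samples: int     # rounded counts as quoted in this article
    epochs: int
    trainable: str   # which components receive gradient updates


CURRICULUM = [
    Stage("Pretraining", "LCS-558K", 558_000, 1, "projector"),
    Stage("Mid", "high-quality synthetic data", 4_700_000, 1, "full model"),
    Stage("Final-Image", "single-image data", 3_600_000, 1, "full model"),
    Stage("OneVision", "mixed single-image/multi-image/video data", 1_600_000, 1, "full model"),
]

for stage in CURRICULUM:
    print(f"{stage.name:<12} {stage.samples:>10,} samples  trainable: {stage.trainable}")

# Approximate number of training samples seen across the whole curriculum.
total_seen = sum(stage.samples * stage.epochs for stage in CURRICULUM)
print(f"total: {total_seen:,} samples")
```

Only the first stage freezes everything except the projector; every later stage updates the full model.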

Data curation

The success of LLaVA-OneVision heavily relies on its carefully curated datasets. The data curation process followed these key principles:

  1. Quality over quantity
  2. Balancing different modalities and tasks
  3. Incorporating high-quality synthetic data
  4. Continuous exposure to new, high-quality data

The datasets used in training include:

  1. Single-Image data (3.2M samples):
    • General: 1.14M samples (36.1%)
    • Doc/Chart/Screen: 647K samples (20.6%)
    • Math/Reasoning: 632K samples (20.1%)
    • General OCR: 281K samples (8.9%)
    • Pure Language: 450K samples (14.3%)
  2. OneVision data (1.6M samples):
    • Single-Image: 500K samples (31.2%)
    • Multi-Image: 688K samples (43.0%)
    • Video: 414K samples (25.9%)

The data curation process involved careful selection, formatting, and quality control to ensure optimal performance across various tasks and modalities.
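As a sanity check on the mixture figures above, the percentage shares can be recomputed from the sample counts. This sketch uses only the rounded counts listed in this section (in thousands), so small differences of about 0.1 percentage points from the quoted shares are expected; the dictionary and function names are hypothetical.

```python
# Sample counts in thousands, as listed above.
SINGLE_IMAGE = {
    "General": 1140,
    "Doc/Chart/Screen": 647,
    "Math/Reasoning": 632,
    "General OCR": 281,
    "Pure Language": 450,
}
ONEVISION = {
    "Single-Image": 500,
    "Multi-Image": 688,
    "Video": 414,
}


def shares(counts: dict) -> dict:
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


single_image_shares = shares(SINGLE_IMAGE)
onevision_shares = shares(ONEVISION)

for title, result in (("Single-Image", single_image_shares), ("OneVision", onevision_shares)):
    print(title)
    for name, pct in result.items():
        print(f"  {name:<18} {pct:5.1f}%")
```

The same shares could serve as sampling weights if you wanted to reproduce a similarly balanced mixture on your own data.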

Performance and benchmarks

LLaVA-OneVision demonstrates impressive performance across a wide range of benchmarks, often surpassing or matching the capabilities of advanced commercial models like GPT-4V. Here’s a summary of its performance on key benchmarks:

  1. Single-Image tasks:
    • AI2D (Science Diagrams): 85.6% (vs. GPT-4V: 78.2%)
    • ChartQA: 83.7% (vs. GPT-4V: 78.5%)
    • DocVQA: 91.3% (vs. GPT-4V: 88.4%)
    • MathVista: 67.5% (vs. GPT-4V: 49.9%)
    • MMBench: 85.9% (vs. GPT-4V: 75.0%)
  2. Multi-Image tasks:
    • LLaVA-Interleave: 79.9% (vs. GPT-4V: 60.3%)
    • MuirBench: 54.8% (vs. GPT-4V: 62.3%)
    • Mantis: 77.6% (vs. GPT-4V: 62.7%)
  3. Video tasks:
    • ActivityNetQA: 62.3% (vs. GPT-4V: 57.0%)
    • MLVU: 68.0% (vs. GPT-4V: 49.2%)
    • MVBench: 59.4% (vs. GPT-4V: 43.5%)

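These results show LLaVA-OneVision outperforming GPT-4V on 10 of the 11 benchmarks listed above; the exception is MuirBench, where GPT-4V leads 62.3% to 54.8%. The largest margins are on LLaVA-Interleave (19.6 points) and MLVU (18.8 points), both in the multi-image and video categories.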
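To make the comparison easier to scan, the scores listed above can be turned into per-benchmark margins. The sketch below copies the numbers from this section verbatim and adds nothing beyond the subtraction; the variable names are hypothetical.

```python
# (LLaVA-OneVision score, GPT-4V score), in percent, as listed above.
SCORES = {
    "AI2D": (85.6, 78.2),
    "ChartQA": (83.7, 78.5),
    "DocVQA": (91.3, 88.4),
    "MathVista": (67.5, 49.9),
    "MMBench": (85.9, 75.0),
    "LLaVA-Interleave": (79.9, 60.3),
    "MuirBench": (54.8, 62.3),
    "Mantis": (77.6, 62.7),
    "ActivityNetQA": (62.3, 57.0),
    "MLVU": (68.0, 49.2),
    "MVBench": (59.4, 43.5),
}

# Positive margin: LLaVA-OneVision leads; negative: GPT-4V leads.
margins = {name: round(ov - gpt, 1) for name, (ov, gpt) in SCORES.items()}

wins = [name for name, m in margins.items() if m > 0]
losses = [name for name, m in margins.items() if m < 0]

for name, m in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {m:+.1f}")

print(f"LLaVA-OneVision leads on {len(wins)} of {len(SCORES)} benchmarks")
```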

LLaVA-OneVision usage

The code snippet below demonstrates how to load the LLaVA-OneVision model, process an image, and generate a response to a question about the image using the model’s multimodal capabilities.


from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

# Load model
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map="auto")

model.eval()

# Prepare image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [img.to(dtype=torch.float16, device=device) for img in image_tensor]

# Prepare prompt
conv_template = "qwen_1_5"
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

output = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)

To use a different model size, replace the `pretrained` value with another checkpoint from the LLaVA-OneVision collection on Hugging Face.
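For video inputs, the 32-frame cap described in the visual representation section means a clip usually has to be subsampled before its frames are passed to the image processor. A minimal sketch, assuming uniform sampling across the clip (a common choice, not something this article confirms for LLaVA-OneVision); `sample_frame_indices` is a hypothetical helper, not part of the `llava` package.

```python
def sample_frame_indices(total_frames: int, max_frames: int = 32) -> list:
    """Pick up to `max_frames` evenly spaced frame indices from a clip."""
    if total_frames <= 0 or max_frames <= 0:
        raise ValueError("total_frames and max_frames must be positive")
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Spread indices from the first frame to the last frame inclusive.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


indices = sample_frame_indices(total_frames=100)
print(len(indices), indices[0], indices[-1])
```

The selected indices would then be used to read frames with whatever video decoder you prefer, and the resulting frames processed in place of the single image in the snippet above.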

Emerging capabilities

One of the most exciting aspects of LLaVA-OneVision is its ability to exhibit emerging capabilities through task transfer and composition. These capabilities showcase the model’s potential to generalize and tackle complex real-world computer vision tasks. Some notable emerging capabilities include:

  1. Joint understanding of diagram and chart
    • Transfers ability from single-image to multi-image scenarios
    • Interprets multiple images coherently
  2. GUI for multi-modal agent
    • Recognizes and interacts with mobile UI screenshots
    • Provides operational instructions for automating tasks
  3. Set-of-mark prompting
    • Refers to numerical labels when answering questions about images
    • Demonstrates fine-grained visual content comprehension
  4. Image-to-video editing instruction
    • Generates detailed video creation prompts based on static images
    • Generalizes image-to-image editing capabilities to video scenarios
  5. Video-to-video difference analysis
    • Analyzes differences between videos with the same starting frame but different endings
    • Compares videos with similar backgrounds but different foreground objects
  6. Multi-camera video understanding in self-driving
    • Analyzes and interprets multi-camera video footage in self-driving contexts
    • Combines multi-panel comprehension, video description, and spatial-temporal reasoning
  7. Composed sub-video understanding
    • Understands and describes content and layout of composed sub-videos
    • Combines single-image analysis, multi-image sequence comprehension, and contextual reasoning
  8. Visual prompting in video
    • Understands highlighted areas and text in videos without specific training on video data with visual prompts
  9. Visual referring in image in video understanding
    • Refers to image queries when answering questions about videos
    • Demonstrates strong base single-image training for cross-modal capabilities

These emerging capabilities highlight LLaVA-OneVision’s potential to transfer knowledge across modalities and generalize to complex, real-world tasks.

Open-source release

To facilitate further development and research in the field of large multimodal models, the LLaVA-OneVision team has made several key resources available to the community:

  1. Training code: Available on GitHub, allowing researchers to reproduce and build upon the model.
  2. Model checkpoints: Pre-trained model checkpoints for 0.5B, 7B, and 72B parameter versions are accessible through Hugging Face.
  3. Datasets: The LLaVA-OneVision datasets for both Single-Image and OneVision stages are available for exploration and use.
  4. Live demo: An interactive demo allows users to experience the model’s capabilities firsthand.

This open-source approach encourages collaboration and accelerates progress in the field of multimodal AI.

Conclusion

LLaVA-OneVision represents a significant advancement in the field of large multimodal models. By consolidating insights from previous research and implementing innovative strategies for visual representation and task transfer, it achieves state-of-the-art performance across single-image, multi-image, and video tasks.

The model’s ability to exhibit emerging capabilities through task transfer and composition demonstrates its potential for tackling complex, real-world computer vision problems. As an open-source project, LLaVA-OneVision paves the way for further research and development in multimodal AI, potentially leading to even more capable and versatile models in the future.

For developers and researchers working on multimodal AI projects, LLaVA-OneVision offers a powerful foundation to build upon, with its strong performance, flexible architecture, and well-documented training process. As the field continues to evolve, models like LLaVA-OneVision will likely play a crucial role in shaping the future of AI-powered visual understanding and generation tasks.

Last Update: 12/08/2024