UnslothAI has expanded its capabilities by adding vision fine-tuning support in November 2024. This major update transforms how developers can work with vision language models (VLMs), making complex tasks like medical image analysis and mathematical formula recognition more accessible and efficient.

Core implementation

At the heart of this update is the new FastVisionModel class. Loading and configuring a vision model takes only a few lines:


from unsloth import FastVisionModel
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",
    load_in_4bit = True,  # Enable memory-efficient 4-bit quantization
    use_gradient_checkpointing = "unsloth"  # Optimize for longer contexts
)

Multiple vision language models are supported, spanning a range of sizes and use cases.

Llama 3.2 Vision comes in 11B and 90B variants for high-performance tasks, while Pixtral 12B offers a balance of capability and efficiency. The Qwen2 VL family (2B, 7B, 72B) provides options across the performance spectrum.
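
Switching between them only requires changing the model name passed to from_pretrained. The identifiers below follow Unsloth's usual naming on the Hugging Face Hub and are shown as an illustration; verify the exact repository names against the model catalog:


# Illustrative alternatives; verify exact repository names on the Hugging Face Hub
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",  # or "unsloth/Pixtral-12B-2409",
                                              # "unsloth/Qwen2-VL-2B-Instruct", ...
    load_in_4bit = True,
    use_gradient_checkpointing = "unsloth"
)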

Memory management and component control

A key advancement in this release is the granular control over model components during fine-tuning. This allows developers to target specific parts of the model architecture:


model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,     # Control vision processing
    finetune_language_layers = True,    # Control language processing
    finetune_attention_modules = True,  # Control attention mechanisms
    finetune_mlp_modules = True,       # Control MLP components
    r = 16,                            # LoRA rank for accuracy
    lora_alpha = 16                    # LoRA scaling
)

This selective fine-tuning capability becomes particularly valuable when working with specialized tasks.

For example, when fine-tuning for medical imaging analysis, you might focus on vision layers while preserving the model’s language capabilities:


# Medical imaging specialized configuration
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,      # Focus on vision processing
    finetune_language_layers = False,   # Preserve language capabilities
    finetune_attention_modules = True,  # Update cross-attention
    r = 8                              # Lower rank for efficiency
)
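
The inverse configuration is equally possible. As an illustrative sketch (not a prescribed recipe), a task where the output format matters more than new visual features, such as the LaTeX transcription example later in this post, might keep the vision encoder frozen and adapt the language side instead:


# Illustrative counterpart: adapt language layers, keep the vision encoder frozen
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = False,     # Keep the pretrained vision encoder as-is
    finetune_language_layers = True,    # Adapt text generation to the target notation
    finetune_attention_modules = True,  # Update attention mechanisms
    finetune_mlp_modules = True,        # Update MLP components
    r = 16,
    lora_alpha = 16
)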

Data processing and training

The system requires a specific data format for vision fine-tuning tasks. Here’s how to structure your training data:


def prepare_vision_data(image, instruction):
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image", "image": image}
                ]
            }
        ]
    }

More generally, each training example should be a complete conversation that pairs the user's instruction and image with the assistant's answer:


[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": image},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": answer},
        ],
    },
]
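
A minimal helper can map an existing image-caption dataset into this structure. The sketch below assumes a Hugging Face dataset with image and text columns; the unsloth/LaTeX_OCR dataset and its column names are used here purely as an illustrative assumption:


from datasets import load_dataset

def convert_to_conversation(sample):
    instruction = "Write the LaTeX representation for this image."
    return {"messages": [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["text"]},
        ]},
    ]}

# Placeholder dataset; replace with your own image/text pairs
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
converted_dataset = [convert_to_conversation(sample) for sample in dataset]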

Training configuration uses Hugging Face's TRL library together with Unsloth's custom data collator for vision tasks:


from trl import SFTTrainer, SFTConfig
from unsloth.trainer import UnslothVisionDataCollator

FastVisionModel.for_training(model)  # Switch the model into training mode

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),  # Required for vision fine-tuning
    train_dataset = converted_dataset,  # Conversations in the format shown above
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        max_seq_length = 2048,
        warmup_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        # Settings needed when training on image-text conversations:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
    )
)
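
With the trainer configured, a single call starts fine-tuning; the returned statistics include the total runtime and pair naturally with the memory check in the next section:


trainer_stats = trainer.train()
print(f"Training took {trainer_stats.metrics['train_runtime']} seconds.")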

Memory usage patterns

Testing on Tesla T4 hardware reveals specific memory requirements:


# Memory monitoring example
gpu_stats = torch.cuda.get_device_properties(0)
max_memory = round(gpu_stats.total_memory / 1024**3, 3)
print(f"GPU: {gpu_stats.name} with {max_memory} GB of memory")

# After loading the model, before training
base_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"Base model memory: {base_memory} GB")

# After training
total_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
training_overhead = round(total_memory - base_memory, 3)
print(f"Training memory overhead: {training_overhead} GB")

These measurements show Qwen2 VL 7B operating within 8-9 GB of total memory, while Pixtral 12B requires 12-13 GB. The efficiency comes from careful optimization and the use of 4-bit quantization.
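
The same counters can also be expressed relative to the card's total capacity, reusing gpu_stats, max_memory, and the figures computed in the snippet above:


# Peak usage as a share of total GPU memory
used_percentage = round(total_memory / max_memory * 100, 3)
overhead_percentage = round(training_overhead / max_memory * 100, 3)
print(f"Peak reserved memory: {total_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"Of which training overhead: {training_overhead} GB ({overhead_percentage}%)")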

Export options

The system provides multiple export paths for different deployment scenarios:


# Local LoRA adapter save
model.save_pretrained("vision_model_lora")
tokenizer.save_pretrained("vision_model_lora")

# Full model 16-bit export
model.save_pretrained_merged("vision_model_full", tokenizer)

# Hugging Face Hub upload
model.push_to_hub_merged("username/vision_model", tokenizer, token="...")
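
The locally saved adapter directory can later be reloaded for inference in the same way as a Hub checkpoint. A small sketch, assuming the vision_model_lora path saved above:


# Reload the saved LoRA adapter for inference
model, tokenizer = FastVisionModel.from_pretrained(
    "vision_model_lora",
    load_in_4bit = True
)
FastVisionModel.for_inference(model)  # Enable inference mode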

Practical applications

Let’s examine a practical mathematical formula recognition implementation:


# Load the vision model (swap in your fine-tuned math OCR checkpoint here)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct",
    load_in_4bit = True
)

# Prepare inference
def process_math_formula(image):
    instruction = "Write the LaTeX representation for this image."
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]
    
    input_text = tokenizer.apply_chat_template(
        messages, 
        add_generation_prompt = True
    )
    
    return tokenizer(
        image,
        input_text,
        add_special_tokens = False,
        return_tensors = "pt"
    ).to("cuda")

# Generate LaTeX output
FastVisionModel.for_inference(model)           # Enable inference mode
inputs = process_math_formula(formula_image)   # formula_image: placeholder PIL image of a formula
outputs = model.generate(
    **inputs,
    max_new_tokens = 128,
    temperature = 1.5,
    min_p = 0.1
)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])

This implementation demonstrates how UnslothAI’s vision fine-tuning can be applied to specific, high-value tasks while maintaining efficient resource usage and high accuracy.

Notebooks with examples

The notebooks linked below run on free Google Colab instances with Tesla T4 GPUs, making them accessible for testing and development without specialized hardware. Each notebook includes complete implementation details, from data preparation through training to inference, and can serve as a practical starting point for similar applications.

Conclusion

The combination of efficient memory usage, selective component fine-tuning, and support for multiple model architectures marks a significant advancement in making vision AI more accessible. These implementations run on consumer hardware while maintaining professional-grade performance, suggesting a new phase where sophisticated vision models become standard development tools rather than specialized research projects.
