In the field of computer vision, it has been a long-standing belief that bigger models lead to better performance. Researchers and practitioners have been racing to develop larger and larger vision models, hoping to achieve state-of-the-art results in various tasks such as image classification, object detection, and semantic segmentation.

However, a recent study by a group of researchers from UC Berkeley and Microsoft Research challenges this notion and proposes a novel approach called “Scaling on Scales” (S²) that can enable smaller models to outperform their larger counterparts.

The core idea

The key idea behind S² is to keep the model size fixed and feed the same model images at several scales. Instead of increasing the number of parameters, S² extracts more information from each input image by processing it at multiple resolutions.

The researchers demonstrate that by applying S² to a smaller pre-trained vision model, such as ViT-B or ViT-L, they can achieve better performance than larger models like ViT-H or ViT-G in a wide range of tasks, including image classification, segmentation, depth estimation, and even complex tasks like visual question answering and robotic manipulation.

S² Wrapper

S² Wrapper is a simple mechanism that extends any pre-trained vision model to multiple image scales in a parameter-free manner.

S² technique

The S² technique works by first taking a pre-trained model that was originally trained on images at a single scale (e.g., 224×224 pixels) and applying it to images at multiple scales (e.g., 224×224, 448×448, and 672×672 pixels). The model processes each scale independently by splitting the larger images into smaller patches that match the original training size.
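To make the cost of the extra scales concrete, here is a small arithmetic sketch. It assumes square images, scales that are integer multiples of the base size, and non-overlapping sub-images; the function name `tiles_per_scale` is hypothetical and is not taken from the paper's code.

```python
from typing import Dict, Sequence


def tiles_per_scale(base: int, scales: Sequence[int]) -> Dict[int, int]:
    """Count the base-sized sub-images needed to cover a square image at each scale."""
    counts = {}
    for size in scales:
        if size % base != 0:
            # Assumption of this sketch: every scale is an integer multiple of the base size.
            raise ValueError(f"scale {size} is not a multiple of base size {base}")
        per_side = size // base
        counts[size] = per_side * per_side
    return counts


counts = tiles_per_scale(224, [224, 448, 672])
print(counts)                # {224: 1, 448: 4, 672: 9}
print(sum(counts.values()))  # 14
```

Under these assumptions, the three scales in the example require 14 forward passes of the same model in total. The parameter count stays unchanged; only the amount of computation per image grows.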

The feature maps from the sub-images of each scale are stitched back together, pooled down to the spatial size of the base-scale feature map, and then concatenated across scales to form the final multi-scale representation. This approach allows the model to capture both high-level semantics and low-level details from the input images, leading to a more comprehensive understanding of the visual content.
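The steps described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_backbone` is a hypothetical stand-in for a frozen pre-trained model, the sizes `BASE` and `PATCH` are shrunk so the example runs instantly, upsampling is nearest-neighbour, and pooling is plain averaging.

```python
import numpy as np

BASE = 4   # toy "training resolution" (stands in for 224)
PATCH = 2  # toy patch size of the stand-in backbone


def toy_backbone(img: np.ndarray) -> np.ndarray:
    """Hypothetical frozen model: averages each PATCH x PATCH block of a
    (BASE, BASE, C) image, giving a (BASE//PATCH, BASE//PATCH, C) feature map."""
    h, w, c = img.shape
    assert h == BASE and w == BASE, "backbone only accepts base-sized inputs"
    g = BASE // PATCH
    return img.reshape(g, PATCH, g, PATCH, c).mean(axis=(1, 3))


def upsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling; a real pipeline would likely interpolate."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)


def avg_pool(feat: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (H, W, C) feature map by an integer factor."""
    h, w, c = feat.shape
    return feat.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))


def multiscale_features(img: np.ndarray, factors=(1, 2, 3)) -> np.ndarray:
    per_scale = []
    for f in factors:
        scaled = upsample(img, f)  # (f*BASE, f*BASE, C)
        rows = []
        for i in range(f):
            # Run the same backbone on each base-sized sub-image of this scale.
            row = [
                toy_backbone(scaled[i * BASE:(i + 1) * BASE, j * BASE:(j + 1) * BASE])
                for j in range(f)
            ]
            rows.append(np.concatenate(row, axis=1))
        stitched = np.concatenate(rows, axis=0)  # feature map of the whole scaled image
        per_scale.append(avg_pool(stitched, f))  # back to the base feature-map size
    # Concatenate the scales along the channel axis.
    return np.concatenate(per_scale, axis=-1)


img = np.random.default_rng(0).random((BASE, BASE, 3))
feats = multiscale_features(img)
print(feats.shape)  # (2, 2, 9)
```

The output keeps the spatial size of the base-scale feature map, while the channel dimension grows with the number of scales (3 channels × 3 scales here). Because the toy backbone is a simple average, the extra scales carry no new information in this example; the point is only to show the data flow.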


One of the most impressive findings of this study is that S² can help smaller models achieve state-of-the-art performance in tasks that require a detailed understanding of the image. For example, in the visual question answering task on the challenging V* benchmark, a ViT-B model with S² surpassed the performance of even large proprietary models such as GPT-4V.

This is a remarkable achievement, considering that the S²-enhanced model uses significantly fewer parameters than its larger counterparts.

The researchers also conducted experiments to understand the conditions under which S² is preferable to scaling up the model size. They found that while larger models have an advantage in handling harder examples and rare cases, the features learned by these models can be well approximated by the features from multi-scale smaller models.

Comparison of S² scaling and model size scaling

This suggests that smaller models have the capacity to learn representations similar to those of larger models when trained with images at multiple scales.

Pre-training smaller models

Furthermore, the study shows that pre-training the smaller models with S² from scratch can lead to even better performance. By exposing the models to images at different scales during the pre-training phase, the models can learn more robust and generalizable features. The authors demonstrate that a ViT-B model pre-trained with S² can match or even outperform a ViT-L model pre-trained with only a single image scale.


The implications of this research are significant for the development of efficient and powerful vision models. By leveraging the S² technique, practitioners can achieve state-of-the-art performance with smaller models, reducing the parameter count and the memory needed to store the weights, although each additional scale adds forward passes at inference time. This is particularly important for deploying vision models on resource-constrained devices like mobile phones or embedded systems.

Moreover, the S² approach opens up new possibilities for further research in multi-scale representation learning. The study shows that processing images at different scales can lead to a more comprehensive understanding of the visual content, and this idea can be extended to other domains like video analysis or 3D vision.


The “Scaling on Scales” technique challenges the conventional wisdom that bigger is always better in the world of vision models. By enabling smaller models to process images at multiple scales, S² achieves impressive performance in a wide range of tasks while being more parameter-efficient. This research paves the way for developing powerful yet compact vision models that can be deployed in various real-world applications. As the field of computer vision continues to evolve, techniques like S² will play a crucial role in pushing the boundaries of what is possible with smaller and more efficient models.

Citation of the paper: Shi, B., Wu, Z., Mao, M., Wang, X. and Darrell, T., 2024. When Do We Not Need Larger Vision Models? arXiv preprint arXiv:2403.13043.

Last Update: 24/03/2024