Meta has released SAM 2, an advanced foundation model for promptable visual segmentation in both images and videos. This article provides an in-depth look at SAM 2’s capabilities, architecture, and performance, as well as the new SA-V dataset used to train it.
SAM 2 builds upon the success of the original Segment Anything Model (SAM), extending its capabilities to handle video segmentation while improving performance on image tasks. The key advancements include:
- A unified architecture for both image and video segmentation
- Real-time processing of video frames using a streaming memory mechanism
- State-of-the-art performance on various segmentation benchmarks
- The new SA-V dataset, containing over 600,000 masklet annotations across 51,000 videos
Let’s dive into the details of SAM 2’s architecture, the SA-V dataset, and its performance compared to existing methods.
SAM 2 architecture
SAM 2 introduces several key components to enable efficient video segmentation while maintaining strong performance on image tasks:
Image encoder
The image encoder is based on a pre-trained Hiera model, which provides a hierarchical representation of the input frame. This allows for multi-scale feature extraction, which is crucial for handling objects of various sizes.
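To make the idea of multi-scale features concrete, the tiny sketch below shows the kind of feature pyramid a hierarchical encoder exposes. The strides, channel width, and 1024-pixel input are illustrative assumptions for this sketch, not the exact Hiera stage shapes.

```python
# Illustrative only: what a hierarchical (Hiera-style) encoder provides, namely
# feature maps at several strides so objects of different sizes are well represented.
import torch

frame = torch.randn(1, 3, 1024, 1024)                  # one RGB video frame (assumed resolution)
strides, dim = (4, 8, 16, 32), 256                     # illustrative strides and channel width
features = [torch.randn(1, dim, 1024 // s, 1024 // s) for s in strides]
for s, f in zip(strides, features):
    print(f"stride {s:2d}: {tuple(f.shape)}")          # e.g. stride 16: (1, 256, 64, 64)
```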
Memory attention
To process video frames efficiently, SAM 2 employs a memory attention mechanism consisting of:
- Memory encoder
- Memory bank
- Memory attention module
This design allows the model to store information about previously processed frames and use it to inform segmentation decisions on the current frame.
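As a minimal, hypothetical sketch (not Meta's implementation), the streaming-memory idea can be expressed as the current frame's tokens cross-attending to a fixed-size bank of encoded past frames; all module names and sizes below are assumptions for illustration.

```python
# Hypothetical sketch of streaming memory attention: current-frame tokens
# cross-attend to a fixed-size FIFO bank of encoded past frames.
from collections import deque

import torch
import torch.nn as nn

class StreamingMemory(nn.Module):
    def __init__(self, dim=256, num_heads=8, max_memories=7):
        super().__init__()
        self.memory_encoder = nn.Linear(dim, dim)      # stand-in for the real memory encoder
        self.memory_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bank = deque(maxlen=max_memories)         # memory bank: oldest frames are evicted

    def forward(self, frame_tokens):                   # frame_tokens: (B, N, dim) for the current frame
        if self.bank:
            memories = torch.cat(list(self.bank), dim=1)            # (B, N * k, dim)
            frame_tokens, _ = self.memory_attention(
                query=frame_tokens, key=memories, value=memories)   # condition on past frames
        self.bank.append(self.memory_encoder(frame_tokens))         # store this frame for later frames
        return frame_tokens
```

The fixed-size bank is what keeps memory usage constant no matter how long the video is.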
Prompt encoder and mask decoder
Similar to the original SAM, SAM 2 can accept various types of prompts:
- Points (positive or negative)
- Bounding boxes
- Masks
The mask decoder uses a “two-way” transformer architecture to update both prompt and frame embeddings before producing the final segmentation mask.
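For reference, prompted image segmentation looks roughly like the snippet below, following the usage pattern published in the facebookresearch/sam2 repository. The config and checkpoint names, and the exact predictor interface, are assumptions here and may differ between releases.

```python
# Prompted image segmentation, based on the public sam2 repository's usage pattern
# (config/checkpoint names and the predictor API are assumptions, not guarantees).
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("frame.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),   # one positive click
        point_labels=np.array([1]),            # 1 = positive, 0 = negative
        box=np.array([425, 300, 700, 875]),    # optional bounding-box prompt
        multimask_output=False,
    )
```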
Occlusion handling
SAM 2 introduces an “occlusion head” to predict whether the object of interest is present in the current frame. This allows the model to handle scenarios where objects become temporarily occluded or disappear from view.
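Conceptually, such a head can be as simple as a small MLP that reads an object token from the decoder and outputs a presence score; the sketch below is a hypothetical illustration, not the actual SAM 2 module.

```python
# Hypothetical "occlusion head": predicts whether the target object is visible
# in the current frame from a decoder object token.
import torch
import torch.nn as nn

class OcclusionHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, object_token):                   # object_token: (B, dim)
        return self.mlp(object_token).squeeze(-1)      # presence logit; sigmoid > 0.5 => visible
```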
Streaming architecture
SAM 2 processes video frames one at a time, storing information about segmented objects in memory. This streaming approach allows for real-time processing of arbitrarily long videos, which is crucial for practical applications.
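A typical streaming workflow, again following the pattern shown in the facebookresearch/sam2 repository (function, config, and checkpoint names are assumptions and may vary by release), prompts an object once and then propagates the masklet frame by frame:

```python
# Streaming video segmentation sketch; names follow the public sam2 repository
# but are assumptions here and may differ between releases.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="videos/clip_frames/")  # directory of JPEG frames

    # Prompt one object with a single positive click on the first frame.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),           # 1 = positive click
    )

    # Frames are then processed one at a time, conditioned on the memory bank.
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()       # per-object masks for this frame
```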
The SA-V dataset
To train SAM 2, Meta created the Segment Anything Video (SA-V) dataset, which is significantly larger and more diverse than existing video segmentation datasets. Key features of SA-V include:
- Over 600,000 masklet annotations on approximately 51,000 videos
- Videos featuring geographically diverse, real-world scenarios from 47 countries
- Annotations covering whole objects, object parts, and challenging instances with occlusions and reappearances
Data collection process
The SA-V dataset was collected using a three-phase data engine:
1. SAM per frame: Using the original SAM model to annotate each frame individually
2. SAM + SAM 2 Mask: Combining SAM for spatial masks with an early version of SAM 2 for temporal propagation
3. SAM 2: Utilizing the full SAM 2 model for interactive annotation
This iterative process allowed for increasingly efficient annotation, with the final phase being 8.4 times faster than the initial per-frame annotation method.
Dataset composition
The SA-V dataset comprises:
- 50.9K videos
- 196.0 hours of video content
- 190.9K manual masklet annotations
- 451.7K automatic masklets
Compared to existing datasets like DAVIS 2017, YouTube-VOS, and MOSE, SA-V provides an order of magnitude more annotations and a wider variety of objects and scenarios.
Performance and benchmarks
SAM 2 demonstrates significant improvements over existing methods in both image and video segmentation tasks. Let’s look at some key performance metrics:
Video Object Segmentation (VOS)
SAM 2 outperforms state-of-the-art methods on various VOS benchmarks:
| Method | MOSE val | DAVIS 2017 val | LVOS val | SA-V val | SA-V test | YTVOS 2019 val |
|---|---|---|---|---|---|---|
| Cutie-base+ | 71.7 | 88.1 | – | 61.3 | 62.8 | 87.5 |
| SAM 2 (Hiera-B+) | 75.8 | 90.9 | 74.9 | 73.6 | 74.1 | 88.4 |
| SAM 2 (Hiera-L) | 77.2 | 91.6 | 76.1 | 75.6 | 77.6 | 89.1 |
These results show that SAM 2 achieves better accuracy across various datasets, with particularly significant improvements on the SA-V val and test sets.
Image segmentation
SAM 2 also improves upon the original SAM in image segmentation tasks:
| Model | Data | SA-23 All (1-click mIoU) | FPS |
|---|---|---|---|
| SAM | SA-1B | 58.1 | 21.7 |
| SAM 2 | SA-1B | 58.9 | 130.1 |
| SAM 2 | Mixed | 61.4 | 130.1 |
SAM 2 not only achieves higher accuracy but also runs 6 times faster than the original SAM model.
Interactive video segmentation
In interactive video segmentation scenarios, SAM 2 demonstrates superior performance compared to baselines like SAM+XMem++ and SAM+Cutie:
- SAM 2 achieves better segmentation accuracy while requiring 3 times fewer interactions
- Consistent performance improvements across various prompting scenarios (1-click, 3-click, 5-click, bounding box, and ground-truth mask)
Ablation studies and model insights
The researchers conducted extensive ablation studies to understand the impact of various design choices and training data compositions. Here are some key findings:
Data mix impact
Training SAM 2 on a combination of VOS datasets, SA-1B, and the new SA-V dataset yields the best overall performance across various benchmarks. This highlights the importance of diverse training data for achieving strong generalization.
Model capacity
Increasing model capacity through higher input resolution, larger image encoders, and more memory attention blocks generally leads to improved performance. However, there’s a trade-off between accuracy and inference speed.
Memory architecture
The study found that using a simple memory bank without recurrent GRU units is sufficient and more efficient. Additionally, cross-attending to object pointers from previous frames significantly boosts performance on challenging long-term video segmentation tasks.
Relative positional encoding
Removing relative positional biases (RPB) from the image encoder and using 2D rotary positional encoding (RoPE) in memory attention leads to improved performance and faster inference, especially at higher resolutions.
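For readers unfamiliar with it, axial 2D RoPE rotates the query and key channels by angles derived from a token's x and y grid positions before the attention dot product. The sketch below is a generic illustration of that idea, not the exact SAM 2 implementation.

```python
# Generic sketch of axial 2D rotary positional encoding (RoPE), applied to
# queries/keys before attention. Not SAM 2's exact code.
import torch

def rope_1d(x, pos, base=10000.0):
    # x: (..., n, d) with d even; pos: (n,) integer positions along one axis.
    d = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))   # (d/2,)
    angles = pos[:, None].float() * freqs[None, :]                             # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # rotate each channel pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, xs, ys):
    # Axial 2D RoPE: first half of the channels encodes x position, second half y.
    # Requires the last dimension to be divisible by 4.
    d = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d], xs), rope_1d(x[..., d:], ys)], dim=-1)

# Example: queries from a 64x64 feature map flattened into 4096 tokens.
h = w = 64
q = torch.randn(1, 8, h * w, 64)                # (batch, heads, tokens, head_dim)
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
q_rot = rope_2d(q, xs.flatten(), ys.flatten())
```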
Practical applications and future directions
SAM 2’s capabilities open up numerous possibilities for real-world applications:
- Video editing and effects: SAM 2 can be used to create new video effects by segmenting and manipulating objects in video frames.
- Augmented reality: The model could help identify and track objects in real-time for AR applications.
- Autonomous vehicles: SAM 2 could assist in real-time object detection and tracking for self-driving cars.
- Medical imaging: The model’s ability to segment objects in videos could be applied to analyzing medical imaging sequences.
- Scientific research: SAM 2 could aid in tracking and analyzing objects in scientific videos, such as cell movement in microscopy.
Future research directions may include:
- Improving the model’s performance on fast-moving objects and fine details
- Enhancing the efficiency of multi-object segmentation
- Exploring ways to incorporate inter-object contextual information
- Further automating the data annotation process
Conclusion
SAM 2 represents a significant advancement in the field of visual segmentation, offering a unified approach to handling both images and videos. Its improved accuracy, faster inference speed, and ability to handle challenging scenarios make it a powerful tool for a wide range of applications.
The release of the SA-V dataset, along with the open-sourcing of SAM 2 under permissive licenses, provides researchers and developers with valuable resources to build upon and further advance the state of the art in visual perception tasks.
As the AI community continues to explore and build upon SAM 2, we can expect to see even more innovative applications and improvements in the field of computer vision and video analysis.