Introduction

On July 29th, 2024, Meta unveiled Segment Anything Model 2 (SAM 2), a real-time image and video segmentation model that represents a major leap in computer vision technology and promises to change how we interact with images and videos. This new model builds upon the success of its predecessor, SAM (released in April 2023), by unifying image and video segmentation capabilities into a single, powerful system. SAM was already capable of performing complex image segmentation tasks with unprecedented precision and versatility.

Segmentation in computer vision refers to the process by which software identifies and groups the pixels in an image or video, determining which pixels belong to which objects. This helps software understand and analyze the content of the image or video.

According to Meta, SAM 2 follows a simple transformer architecture with streaming memory for real-time video processing. It can segment videos more accurately while using 3x fewer interactions than prior approaches, and in image segmentation it is 6x faster than the original SAM.

Source: Meta

The SAM 2 architecture processes video frame by frame, using the current prompt and memories of previous frames for segmentation. An image encoder embeds each incoming frame, the embedding is conditioned on memories of past frames and prompts, and the mask decoder predicts the segmentation mask. A memory encoder then stores the new prediction so it can inform future frames.
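To make the streaming design more concrete, here is a minimal sketch of the workflow using the sam2 package that Meta released alongside the model. The function names follow Meta's example code at the time of release (later versions may differ), and the checkpoint path, config name, frame directory, and click coordinates are all placeholders.

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; checkpoints and configs are available in Meta's sam2 repository.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state precomputes frame embeddings and sets up the memory state
    # for a directory of extracted video frames.
    state = predictor.init_state(video_path="./my_video_frames")

    # Prompt the object once: a single foreground click on frame 0.
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),  # (x, y) click location
        labels=np.array([1], dtype=np.int32),              # 1 = foreground, 0 = background
    )

    # Propagate the prompt through the video: each frame is segmented once,
    # conditioned on memories of previously processed frames.
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # binary masks, one per tracked object

In Meta's examples, additional clicks can be added to the same state on later frames to correct the result, which is the interactive refinement workflow described under "Refined Prompting Mechanisms" below.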

Why SAM 2 Is a Big Deal

SAM 2's real-time video segmentation is a huge technical leap because it showcases how AI can process moving images and distinguish between the elements on screen even as they move around, leave the frame, and re-enter it. Software that leverages AI to generate videos could soon make processing and editing complicated videos and images easier and cheaper.

The earlier model, SAM, was already performing well in several practical image-processing applications. For example, it has been used in marine science to segment sonar images and analyze coral reefs, in satellite imagery analysis for disaster relief, and in the medical field to segment cellular images and aid in detecting skin cancer.


Meta’s Modified SAM Model Detects Objects In Sonar Images With Box Prompts.
(Source: When SAM Meets Sonar Images)

SAM also performed significantly better than other methods, such as ViTDet.


Source: Segment Anything (SA) project, Meta AI Research, FAIR (Figure 16)

Now, SAM 2 extends these capabilities to video, which is no small feat and would not have been feasible until very recently. It does so remarkably well even when objects are barely distinguishable by color, when the footage is low quality, and when the model has not been trained on similar cases.

https://x.com/i/status/1818121963958603869

Source: Alex Volkov (Thursd/AI)

SAM Segmentation Capabilities (Source: Meta)

SAM 2 also carries forward several key features introduced by the original Segment Anything Model. For example, SAM offered a multi-prompt interface: it could be guided by different prompts, such as clicks, boxes drawn around objects, or text, to create segmentation masks (a minimal prompting sketch appears after the list of improvements below). It could also understand new tasks and image types without additional training, a capability known as zero-shot transfer. SAM 2 builds on these with several exciting improvements:

Enhanced Accuracy and Speed: SAM 2 offers better segmentation accuracy and faster processing times (approximately 44 frames per second), making it great for real-time use.

Video Segmentation: SAM 2 can now segment videos, consistently tracking and segmenting objects across video frames. It also allows for real-time interaction by utilizing a streaming memory architecture that processes video frames one at a time.

Refined Prompting Mechanisms: The new model supports more advanced prompting techniques, giving users more control over the segmentation process. For example, SAM 2 accepts prompts on any video frame to create a "masklet" that identifies an object and extends its mask across all frames. Users can refine this masklet by providing additional prompts at any time, repeating the process until the segmentation is accurate.

The evolution of the architecture from SAM to SAM 2 (Source: Meta)

Expanded Dataset: SAM 2 is trained on an even larger and more diverse dataset, improving its ability to handle different image and video types:


Zero-shot accuracy on the Segment Anything (SA) task across 37 datasets. The table shows the average 1- and 5-click mIoU of SAM 2 compared to SAM by domains (image/video). (Source: Meta)

As part of SAM 2's debut, Meta has also shared a dataset of over 50,000 videos created to train the model. That's on top of roughly 100,000 other videos Meta says it also used during training.

Open Source: In line with its commitment to open-source AI, Meta has made the code and model weights for SAM 2 available under a permissive Apache 2.0 license. Meta has also shared a web demo that lets users interactively segment short videos in real time and apply video effects to the model's predictions:


Source: Meta SAM2 Demo Page
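And here is the prompting sketch promised earlier, showing the kind of multi-prompt interface carried over from SAM on a single image. It uses the image predictor from the released sam2 package; the names again follow Meta's release-time examples, and the image path, click coordinates, and box are illustrative placeholders.

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint; see Meta's sam2 repository for downloads.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("frame.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Combine a foreground click with a rough box around the same object.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[420, 310]], dtype=np.float32),
        point_labels=np.array([1], dtype=np.int32),             # 1 = foreground
        box=np.array([350, 200, 600, 450], dtype=np.float32),   # x0, y0, x1, y1
        multimask_output=False,
    )

mask = masks[0]  # a single mask with the same height and width as the image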

Meta has cemented its status in the open AI space with tools like PyTorch and models such as Llama. SAM 2 is the latest addition to its portfolio, and it promises to boost computer vision and image segmentation capabilities.

User Reactions

On social media, users have expressed excitement about SAM 2’s new capabilities. Many acknowledge the significant improvements in speed and accuracy and the ease with which the model can track objects with just one click on a paused video frame. 

 


Source: A.I.Warper

Here is another X thread that shows how well the model handles re-identification (ReID) across multiple camera views:

 


Source: SkalskiP

The overall sentiment is positive, with excitement about SAM 2's capabilities, open-source nature, and potential applications across various fields, despite some concerns about regional access restrictions and privacy (for example, the demo doesn't work unless you accept cookies).

A Reddit user praising Meta for its open-source contributions

What Could SAM 2 Mean for Digital Media Publishers

Generative AI-based applications are already speeding up publisher workflows. Aeon, for example, is a Generative AI-based text-to-video service that helps publishers turn their written content into video.

Meta's new AI model, SAM 2, can supercharge video editing and VFX workflows by automatically identifying and tracking subjects in a video, making it easier to emphasize key elements in media. For instance, publishers could convert text-based articles into monetizable videos and use VFX to highlight only the main subject or draw an object-tracking pattern around it. The entire process could be completed using prompts, in near real time.

Emulating the same visual effects with legacy manual video editing requires far more effort, professional-grade skills, and advanced software, making the process expensive and time-consuming. SAM 2 simplifies the process to a large extent by automating it. The shift is similar to background removal, which was extremely difficult before AI but has become easy with readily available tools like remove.bg.
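As a rough illustration of the effect described above, the sketch below assumes you already have a per-frame subject mask (for example, from SAM 2's video predictor) and simply dims everything outside it; the function name and the dummy data are made up for the example.

import numpy as np

def highlight_subject(frame: np.ndarray, mask: np.ndarray, dim: float = 0.35) -> np.ndarray:
    """Keep the masked subject at full brightness and dim the rest of the frame.

    frame: H x W x 3 uint8 image; mask: H x W boolean array marking the subject.
    """
    out = (frame.astype(np.float32) * dim).astype(np.uint8)  # dimmed copy of the frame
    out[mask] = frame[mask]                                   # restore the subject pixels
    return out

# Dummy data standing in for a real video frame and a SAM 2 mask:
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:400] = True
highlighted = highlight_subject(frame, mask)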

Publishers could also use SAM 2 to catalog and label their image and video inventory, detect anomalies, and enhance search functionality, making their content management more efficient and accurate.

Meta's family of apps, including Instagram and Facebook, will likely soon feature SAM 2-powered tools. Instagram's Backdrop and Cutouts features, which previously used the original SAM, are likely to be upgraded to SAM 2. Users may soon be able to edit videos directly within Instagram using AI, although Meta has not confirmed details yet.

A Bit of Background on Discriminative Models

While most of our blogs discuss developments around Generative AI, SAM belongs to a class of AI known as Discriminative AI. We've therefore included some basic technical information for those interested.

There are two main classes of models in enterprise AI: discriminative and generative. While generative models are used to create new data, discriminative models are used to classify data or predict outcomes. Generative AI has undoubtedly revolutionized the AI landscape with its ability to create new content, generating significant excitement along the way. However, Discriminative AI, though less popular, is equally fascinating.

As explained in this Turing blog, think of them as the “Topper Twins” - Zed (Generative) and Zack (Discriminative) are twin brothers. Both are child prodigies and jointly hold the topper’s position in their class. 

Zed can learn everything about a given topic. He goes in-depth and understands every little detail of a subject. Once he grasps something, he never forgets it. But this approach is cumbersome, especially when there is a lot to learn, so he has to start preparing for his exams much sooner than his brother.

On the other hand, Zack studies by creating a mind map. He gets the general idea of a topic and then learns the differences and patterns between the subtopics. He can also apply what he has learned from one subject to another, which gives him a lot more flexibility in his thinking process. In a way, Zack learns by spotting the differences.

Generative classifiers predict outcomes by first understanding the likelihood of different outcomes [P(Y)] and how features relate to each outcome [P(X|Y)]. They then use Bayes’ theorem to determine the probability of an outcome given the features [P(Y|X)]. On the other hand, discriminative classifiers directly estimate the probability of an outcome based on the features [P(Y|X)] without considering the likelihood of the outcomes separately.

Here, P(Y) represents the probability of the outcome, P(X|Y) represents the probability of features given the outcome, and P(Y|X) represents the probability of the outcome given the features. For example, if the probability of having a disease (Y) in the general population is 5% [P(Y)], then given that a person has the disease (Y), the probability that they exhibit a certain symptom, like a high fever (X), might be 80% [P(X|Y)].


Source: Daily Dose of Data Science

A generative model would first determine the overall probability of having the disease, then the probability of observing the symptom (high fever) given the disease. It would also determine the overall probability of observing the symptom in the population and combine these quantities to estimate the probability of having the disease given the symptom. The generative model essentially builds a full probabilistic model of the features (symptoms) and outcomes (disease) and then uses it to make predictions.

In contrast, a discriminative model directly estimates the probability of having the disease given the symptom. It focuses on learning the boundary or relationship between the symptom and the disease from the data.
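To make the arithmetic concrete, here is a small worked version of the disease/fever example. The 5% and 80% figures come from the example above, while the 10% chance of a fever without the disease is an assumed number added purely for illustration.

# P(Y): prior probability of having the disease
p_disease = 0.05
# P(X|Y): probability of a high fever given the disease
p_fever_given_disease = 0.80
# P(X|not Y): assumed probability of a fever without the disease (illustrative)
p_fever_given_healthy = 0.10

# Generative route: model P(Y) and P(X|Y), then apply Bayes' theorem.
p_fever = (p_fever_given_disease * p_disease
           + p_fever_given_healthy * (1 - p_disease))                 # P(X) = 0.135
p_disease_given_fever = p_fever_given_disease * p_disease / p_fever   # P(Y|X)

print(f"P(disease | fever) = {p_disease_given_fever:.3f}")  # ~ 0.296

# A discriminative model skips these intermediate quantities and learns
# P(Y|X) directly from labeled (symptom, outcome) examples.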

Conclusion

A lot is happening at the intersection of AI and video. Models like SAM 2 could be a ChatGPT moment for publishers, enabling them to leverage AI and harness the power of video at scale. Open releases like this keep pushing AI toward democratization, making advanced technology accessible to everyone.