Table of Contents
- SAM 3 for Video: Concept-Aware Segmentation and Object Tracking
- Configuring Your Development Environment
- Setup and Imports
- Text-Prompt Video Tracking
- Load the SAM3 Video Model
- Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs
- Main Pipeline: Running the Full Video Segmentation and Tracking Workflow
- Launch the Gradio Application
- Output: Text-Prompt Video Segmentation and Tracking Results
- Real-Time Text-Prompt Tracking (Webcam)
- Helper Function: Stable Color Overlays for Real-Time Video Tracking
- Streaming Inference Function: Maintaining Temporal Memory Across Live Video Frames
- Launch the Gradio Application
- Output: Real-Time Webcam Video Segmentation Results
- Single-Click Object Tracking
- Load the SAM3 Tracker Video Model
- Extract First Frame: Preparing the Initial Frame for Object Selection
- Tracking Object Function: Propagating a Single Object Mask Across Video Frames
- Launch the Gradio Application
- Output: Single-Click Video Object Tracking Results
- Multi-Click Object Tracking
- Initialize Few Colors: Defining a Color Palette for Multi-Object Tracking Visualization
- Extract First Frame: Preparing the First Frame for Multi-Object Selection
- Tracking Object Function: Tracking Multiple Objects with Unique IDs Across Video Frames
- Launch the Gradio Application
- Output: Multi-Object Video Segmentation and Tracking Results
- Summary
SAM 3 for Video: Concept-Aware Segmentation and Object Tracking
In Part 1 of this series, we introduced Segment Anything Model 3 (SAM 3) and saw how it moves beyond geometric prompts to concept-based visual understanding. We learned how the model can segment all instances of a concept using natural language and visual examples.
In Part 2, we went one step further. We explored multi-modal prompting and interactive workflows. We combined text, bounding boxes, and point clicks to build precise and controllable segmentation pipelines on images.
So far, however, everything we have done lives in a static world.
Images do not move. Objects do not disappear. There is no notion of time.
Video changes everything.
In videos, segmentation alone is not enough. We also need temporal consistency. If a person appears in frame 1 and walks across the scene, the model must not only segment that person — it must also remember that it is the same person in frame 200.
This is where SAM3 becomes fundamentally different from previous systems.
SAM3 does not treat video as a bag of independent images. Instead, it maintains a streaming memory and a tracking state that allows it to propagate object identities across frames. Detection, segmentation, and tracking are no longer separate steps. They are part of a single, unified pipeline.
In other words, SAM3 does not just answer: “Where is the object in this frame?”
It answers: “Where is this concept over time?”
In this tutorial, we will focus entirely on using SAM3 with video. We will build several practical pipelines that combine detection, segmentation, and tracking into one coherent workflow.
Specifically, we will implement 4 tasks:
- Video Detection, Segmentation, and Tracking Using a Text Prompt
Here, we use a text prompt such as "person" and let SAM3 detect, segment, and track all instances of that concept throughout the video.
- Real-Time Detection, Segmentation, and Tracking Using a Text Prompt via Webcam
This is the same idea, but running in real time on a live camera stream.
- Detection, Segmentation, and Tracking Using a Single Click on an Object
Here, we do not use text. We simply click on an object in the first frame and let SAM3 track it.
- Detection, Segmentation, and Tracking Using Multiple Clicks on Objects
In this case, we select multiple objects interactively and track all of them at once.
Across these examples, we will see the same core idea again and again:
- First, SAM3 recognizes what to track.
- Then, it segments it.
- Finally, it remembers and propagates it through time.
This lesson is the 3rd of a 4-part series on SAM 3:
- SAM 3: Concept-Based Visual Understanding and Segmentation
- Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation
- SAM 3 for Video: Concept-Aware Segmentation and Object Tracking (this tutorial)
- Lesson 4
To learn how to build SAM3-powered video segmentation and tracking pipelines using text prompts, real-time webcam streams, and interactive multi-object tracking inside a dynamic Gradio interface, just keep reading.
Configuring Your Development Environment
To follow this guide, you need to have the following libraries installed on your system.
!pip install --q git+https://github.com/huggingface/transformers av gradio
First, we install the latest development version of transformers directly from GitHub.
We do this because SAM3 is very new. Its video processors, tracker models, and streaming APIs are not yet available in older stable releases. Installing from GitHub ensures we get access to:
- Sam3VideoProcessor and Sam3VideoModel for concept-based segmentation
- Sam3TrackerVideoProcessor and Sam3TrackerVideoModel for video tracking
- The video session and streaming inference utilities
In short, this gives us all SAM3 image and video capabilities in one place.
Next, we install av. This is a Python binding for FFmpeg. We use it to decode video files, read frames efficiently, and handle video streams coming from disk or webcam. Without av, working with video frames would be slow and unreliable.
Finally, we install gradio. We use Gradio to build interactive demos, run webcam-based real-time segmentation, create simple UI components for clicking on objects and visualizing results. This allows us to turn SAM3 into a live, interactive video application, not just a notebook script.
We also pass the --q flag to keep the installation output quiet. This keeps our notebook clean and easy to read.
Setup and Imports
Once the dependencies are installed, we can import the libraries we need for video processing, model execution, and interactive demos.
import cv2
import torch
import numpy as np
import gradio as gr
from PIL import Image
from accelerate import Accelerator
from transformers.video_utils import load_video
from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers import Sam3TrackerVideoModel, Sam3TrackerVideoProcessor
First, we import OpenCV (cv2). We use OpenCV to read frames from videos and webcams, convert between different image formats, and perform basic image and video I/O operations. This forms the low-level video input layer of our system.
Next, we import PyTorch (torch). PyTorch is the backend that runs SAM3. We use it to move models and tensors to GPU or CPU, run inference efficiently, and control precision and memory usage. All heavy computation in this tutorial happens inside PyTorch.
We also import NumPy (numpy). NumPy is used for simple array manipulations, converting between OpenCV images, PIL images, and tensors, and handling masks and frame buffers in a lightweight way.
Then, we import Gradio (gradio). We use Gradio to build interactive demos, create webcam-based real-time segmentation apps, and handle mouse clicks for point-based object selection. This turns our video pipeline into a live, interactive application instead of a static script.
Next, we import PIL’s Image. This is used to convert frames into PIL format when required by the processor, handle RGB image conversions cleanly, and bridge between OpenCV and the Transformers preprocessing pipeline.
Then, we import Accelerator from Hugging Face Accelerate. The Accelerator class helps us automatically place the model on CPU or GPU, write device-agnostic code, and scale to different hardware setups without changing logic. This keeps the code clean and portable.
Next, we import load_video from transformers.video_utils. This utility function loads a video file from disk, decodes it into a list (or generator) of frames, and handles resizing and format conversion in a consistent way. We use this for file-based video experiments.
Finally, we import the SAM3 video models and processors. These are the core components of this tutorial.
- Sam3VideoModel and Sam3VideoProcessor are used for:
  - Text-prompted video segmentation
  - Concept detection on video frames
- Sam3TrackerVideoModel and Sam3TrackerVideoProcessor are used for:
  - Point-based prompting
  - Interactive object selection
  - Multi-object tracking with memory across frames
Together, these 2 model families allow us to cover all 4 workflows:
- Text-prompted tracking
- Webcam-based tracking
- Single-click object tracking
- Multi-click object tracking
Text-Prompt Video Tracking
In this section, we use a natural language prompt such as "person" or "car" to detect, segment, and track objects across an entire video. SAM3 maintains temporal memory, ensuring that object identities remain consistent from the first frame to the last. The result is a fully annotated video with masks, bounding boxes, and tracking IDs propagated over time.
Load the SAM3 Video Model
Before we process any video, we need to load the SAM3 model and its processor.
Since these models are large, we load them once and reuse them across all videos and frames.
# ------------------------------------------------
# Model setup (loaded once)
# ------------------------------------------------
accelerator = Accelerator()
device = accelerator.device
model = Sam3VideoModel.from_pretrained(
"facebook/sam3"
).to(device, dtype=torch.bfloat16)
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
First, we create an Accelerator object. The Accelerator automatically:
- Detects whether a GPU is available
- Chooses the best device (CPU or GPU)
- Handles device placement in a clean and consistent way
By using this, we avoid hardcoding "cuda" or "cpu" anywhere in our code. The device variable now represents where the model will run.
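As a rough mental model, on a single machine Accelerator().device resolves to the GPU when one is available and falls back to the CPU otherwise. The sketch below illustrates that behavior with a hypothetical helper (pick_device is not an Accelerate API; Accelerate also handles multi-GPU and other backends that this one-liner ignores):

```python
import torch

# Illustrative single-machine device selection, similar in spirit to
# what Accelerator().device gives us (pick_device is a hypothetical
# helper, not part of Accelerate).
def pick_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"

dev = pick_device()
print(dev)
```

The real Accelerator object does more (distributed setups, mixed precision), but for this tutorial, device placement is the part we rely on.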
Next, we load the SAM3 Video model. This does 3 important things:
- It downloads the pretrained SAM3 weights from Hugging Face
- It constructs the full video-capable segmentation model
- It moves the model to the selected device (GPU or CPU)
We also explicitly set dtype=torch.bfloat16. This tells PyTorch to run the model in bfloat16 precision. Using bfloat16 reduces memory usage significantly, speeds up inference on modern GPUs, and has almost no impact on segmentation quality. This is especially important because SAM3 is a very large model.
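The memory saving is easy to verify: bfloat16 stores each value in 2 bytes versus 4 bytes for float32, so weights and activations take half the space. A quick illustrative check:

```python
import torch

# bfloat16 uses 2 bytes per element; float32 uses 4.
x32 = torch.zeros(1024, dtype=torch.float32)
x16 = torch.zeros(1024, dtype=torch.bfloat16)

print(x32.element_size())  # 4
print(x16.element_size())  # 2
```

For a multi-billion-parameter model such as SAM3, this halving is often the difference between fitting on a single GPU and not fitting at all.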
Then, we load the processor. The processor is responsible for preprocessing video frames (resizing, normalization, padding), encoding text prompts, formatting inputs for the model, and post-processing outputs (resizing masks back to the original resolution).
At this point, we have:
- A fully loaded SAM3 video model on the correct device
- A video processor that knows how to prepare inputs and decode outputs
Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs
Before we start running video inference, we define a small helper function to visualize segmentation and tracking results.
This function overlays:
- Segmentation masks
- Bounding boxes
- Object IDs
- Confidence scores
directly on top of each video frame.
# ------------------------------------------------
# Visualization helper
# ------------------------------------------------
def overlay_masks_boxes(image, masks, boxes, scores, object_ids, alpha=0.5):
    image = image.copy()
    h, w = image.shape[:2]

    for i, mask in enumerate(masks):
        color = np.random.randint(0, 255, (3,), dtype=np.uint8)

        if mask.shape[-2:] != (h, w):
            mask = cv2.resize(mask.astype(np.uint8), (w, h)) > 0

        colored = np.zeros_like(image)
        colored[mask] = color
        image = cv2.addWeighted(image, 1.0, colored, alpha, 0)

        x1, y1, x2, y2 = boxes[i].astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), color.tolist(), 2)

        label = f"ID {int(object_ids[i])} | {scores[i]:.2f}"
        cv2.putText(
            image,
            label,
            (x1, max(y1 - 5, 15)),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color.tolist(),
            1,
            cv2.LINE_AA,
        )

    return image
First, we create a copy of the input image and read its spatial resolution. We do this to avoid modifying the original frame and ensure all masks are resized to the same height and width.
Next, we loop over each detected object. Each iteration corresponds to one tracked instance in the frame. Inside the loop, we generate a random color for that object. This makes each object visually distinct in the overlay.
Then, we ensure the mask matches the image resolution. Because masks are produced at a lower resolution, we resize them to the full frame size and convert them to a Boolean mask. This ensures perfect alignment with the video frame.
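The helper uses cv2.resize for this; conceptually, upsampling a low-resolution mask is just replicating each mask cell over the corresponding block of frame pixels. A pure-NumPy nearest-neighbor sketch (assuming an integer scale factor, which the real resize does not require):

```python
import numpy as np

# Nearest-neighbor upsampling of a 2x2 boolean mask to 4x4 by
# repeating rows and columns (illustrative; the tutorial uses
# cv2.resize on a uint8 mask followed by a > 0 threshold).
mask_lo = np.array([[1, 0],
                    [0, 1]], dtype=bool)
scale = 2
mask_hi = np.repeat(np.repeat(mask_lo, scale, axis=0), scale, axis=1)

print(mask_hi.shape)  # (4, 4)
```

Either way, the result is a boolean mask with exactly the frame's height and width, so it can index frame pixels directly.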
Next, we create a colored overlay and blend it with the image. This:
- Paints the object region with the selected color
- Blends it with transparency (alpha)
- Keeps the original image visible underneath
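The blend itself is a weighted sum: cv2.addWeighted(image, 1.0, colored, alpha, 0) computes clip(1.0 * image + alpha * colored), saturating at 255. A small NumPy equivalent makes the arithmetic explicit (blend_overlay is a hypothetical helper for illustration):

```python
import numpy as np

# NumPy sketch of cv2.addWeighted(image, 1.0, colored, alpha, 0):
# weighted sum, rounded, then saturated back to uint8.
def blend_overlay(image: np.ndarray, colored: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    out = image.astype(np.float32) + alpha * colored.astype(np.float32)
    return np.clip(np.round(out), 0, 255).astype(np.uint8)

frame = np.full((2, 2, 3), 200, dtype=np.uint8)
overlay = np.zeros_like(frame)
overlay[0, 0] = (0, 255, 0)  # the "mask" covers a single pixel

blended = blend_overlay(frame, overlay, alpha=0.5)
print(blended[0, 0].tolist())  # [200, 255, 200]: masked pixel shifts toward green
print(blended[1, 1].tolist())  # [200, 200, 200]: unmasked pixel is unchanged
```

Because only masked pixels receive a nonzero overlay, the background stays untouched while object regions get a translucent tint.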
Then, we draw the bounding box for the same object. This helps us visually confirm the detection region and the spatial extent of each tracked object.
Next, we prepare a label string. This shows the tracking ID assigned by SAM3 and the confidence score of the detection.
We then render this text near the bounding box. We slightly shift the text upward to avoid overlapping with the box. Finally, after all objects are drawn, we return the annotated frame.
Main Pipeline: Running the Full Video Segmentation and Tracking Workflow
Now we build the main function that runs SAM3 on a video using a text prompt and produces an annotated output video.
This function does 4 things:
- Loads the video frames
- Initializes a SAM3 video session with memory
- Propagates segmentation and tracking across frames
- Writes an annotated video to disk
Let us walk through the full pipeline step by step.
# ------------------------------------------------
# Main pipeline
# ------------------------------------------------
def run_sam3_video(video_path, text_prompt):
    # Load frames for SAM3
    video_frames, _ = load_video(video_path)

    # Reliable FPS extraction (OpenCV)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    if fps is None or fps <= 0:
        fps = 25.0
    fps = float(fps)

    # Init SAM3 session
    inference_session = processor.init_video_session(
        video=video_frames,
        inference_device=device,
        processing_device="cpu",
        video_storage_device="cpu",
        dtype=torch.bfloat16,
    )
    inference_session = processor.add_text_prompt(
        inference_session=inference_session,
        text=text_prompt,
    )

    outputs_per_frame = {}
    for model_outputs in model.propagate_in_video_iterator(
        inference_session=inference_session,
        max_frame_num_to_track=len(video_frames),
    ):
        processed = processor.postprocess_outputs(
            inference_session,
            model_outputs,
        )
        outputs_per_frame[model_outputs.frame_idx] = processed

    # Prepare output video
    h, w = video_frames[0].shape[:2]
    out_path = "sam3_annotated.mp4"
    writer = cv2.VideoWriter(
        out_path,
        cv2.VideoWriter_fourcc(*"mp4v"),
        fps,
        (w, h),
    )

    for idx, frame in enumerate(video_frames):
        outputs = outputs_per_frame.get(idx)
        if outputs and len(outputs["object_ids"]) > 0:
            frame = overlay_masks_boxes(
                frame,
                outputs["masks"].cpu().numpy(),
                outputs["boxes"].cpu().numpy(),
                outputs["scores"].cpu().numpy(),
                outputs["object_ids"].cpu().numpy(),
            )
        # OpenCV expects BGR
        writer.write(frame[:, :, ::-1])

    writer.release()
    return out_path, out_path
First, we load the video frames using the Transformers utility. This returns:
- A list of RGB frames as NumPy arrays
- (Optionally) audio, which we ignore here
Next, we extract the frame rate using OpenCV. We do this because:
- We want the output video to play at the same speed as the input
- Some video loaders do not always return reliable FPS metadata
If FPS is missing or invalid, we fall back to a default value.
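Isolated from the pipeline, the guard looks like this (safe_fps is a hypothetical name; the tutorial inlines the same logic):

```python
# FPS guard: use the reported frame rate only if it is a positive
# number; otherwise fall back to a sane default.
def safe_fps(raw_fps, default=25.0):
    if raw_fps is None or raw_fps <= 0:
        return float(default)
    return float(raw_fps)

print(safe_fps(29.97))  # 29.97
print(safe_fps(0))      # 25.0
print(safe_fps(None))   # 25.0
```

Note that the None check must come first: comparing None with <= would raise a TypeError in Python 3.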
Instead of processing each frame independently, SAM3 uses a video session that maintains the following:
- Temporal memory
- Object identities
- Propagation state
We initialize the session using processor.init_video_session(...). We pass:
- The full list of frames
- The device for model inference (GPU or CPU)
- CPU for processing and storage (to save GPU memory)
- dtype=torch.bfloat16 for efficient computation
This session object now holds the entire video, memory, and tracking state.
Next, we tell SAM3 what concept we want to track using processor.add_text_prompt(...).
For example:
"person""car""player in red jersey"
From this point on, the session is configured to: detect, segment, and track all instances of this concept across the video.
The real work happens when we call model.propagate_in_video_iterator(...).
This function:
- Processes frames sequentially
- Uses memory to propagate masks and identities
- Emits results for each frame
For each frame:
- We post-process raw model outputs
- Convert them into usable masks, boxes, scores, and IDs
- Store them in a dictionary indexed by frame number
At the end of this loop, we have a full timeline of segmentation and tracking results.
We now prepare an OpenCV video writer.
- Same resolution as input
- Same FPS
- MP4 format
We now loop over each frame. If SAM3 produced results for that frame: we overlay masks, boxes, IDs, and scores using our helper function. Then we write the frame to the output video. We convert RGB → BGR because OpenCV expects BGR format.
We close the video writer and return the path to the annotated video. At this point, we have a complete pipeline that: takes a video + text prompt → returns a fully segmented and tracked video.
Launch the Gradio Application
Now that our video pipeline is ready, we wrap everything into a simple Gradio web interface.
This allows us to:
- Upload a video
- Enter a text prompt (e.g., "person")
- Run SAM3 segmentation and tracking
- Preview the result
- Download the annotated video
# ------------------------------------------------
# Gradio UI
# ------------------------------------------------
with gr.Blocks() as demo:
    gr.Markdown("# 🎥 SAM3 Video Segmentation & Tracking")

    with gr.Row():
        video_input = gr.Video(label="Upload Video")
        prompt = gr.Textbox(
            label="Text Prompt",
            placeholder="e.g. person, chair, bed",
            value="person",
        )

    run_btn = gr.Button("Run Segmentation")
    video_out = gr.Video(label="Annotated Output")
    download = gr.File(label="Download Video")

    run_btn.click(
        fn=run_sam3_video,
        inputs=[video_input, prompt],
        outputs=[video_out, download],
    )

demo.launch(debug=True)
gr.Blocks() lets us build a custom UI layout instead of a simple one-function demo. Everything inside this block becomes part of our web interface.
This is just a header that appears at the top of the page.
Here, we place 2 widgets side by side:
- A video uploader
- A text box for the concept prompt
The default value is "person", so the app works immediately without typing anything.
This button will trigger the SAM3 pipeline.
We create 2 outputs:
- One to preview the annotated video
- One to download the result file
It tells Gradio: When the button is clicked, call run_sam3_video(video, prompt) and display its outputs.
Recall that our function returns 2 values.
So:
- The first output goes to the video preview
- The second output goes to the download widget
This starts a local web server and opens the interface in the browser.
The debug=True flag helps:
- Show errors in the console
- Make debugging easier during development
At this point, we have a complete application:
- upload a video
- type a concept
- get a fully segmented and tracked output video
This completes the text-prompt video segmentation and tracking pipeline.
Output: Text-Prompt Video Segmentation and Tracking Results
Real-Time Text-Prompt Tracking (Webcam)
Here, we extend text-prompt tracking to a live webcam stream. Instead of processing a preloaded video, frames arrive continuously, and SAM3 updates its tracking state in real time. This enables live, concept-aware segmentation with stable object identities across streaming frames.
Helper Function: Stable Color Overlays for Real-Time Video Tracking
In the previous pipeline, we used a visualization helper to draw masks and bounding boxes on each frame. For streaming and webcam scenarios, we slightly modify this helper so that each tracked object keeps a stable color across frames.
This small change makes tracking much easier to follow visually, especially when objects move across the scene.
Here is the updated helper:
# ------------------------------------------------
# Visualization helper
# ------------------------------------------------
def overlay_masks_boxes(image, masks, boxes, scores, object_ids, alpha=0.5):
    image = image.copy()
    h, w = image.shape[:2]

    for i, mask in enumerate(masks):
        # Stable color per object id
        rng = np.random.default_rng(int(object_ids[i]))
        color = rng.integers(0, 255, size=3, dtype=np.uint8)

        if mask.shape[-2:] != (h, w):
            mask = cv2.resize(mask.astype(np.uint8), (w, h)) > 0

        colored = np.zeros_like(image)
        colored[mask] = color
        image = cv2.addWeighted(image, 1.0, colored, alpha, 0)

        x1, y1, x2, y2 = boxes[i].astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), color.tolist(), 2)

        label = f"ID {int(object_ids[i])} | {scores[i]:.2f}"
        cv2.putText(
            image,
            label,
            (x1, max(y1 - 5, 15)),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color.tolist(),
            1,
            cv2.LINE_AA,
        )

    return image
Let us walk through what changed and why it matters.
In streaming inference, objects appear across many frames. If colors change every frame, it becomes hard to follow which object is which.
To solve this, we generate colors based on the object ID:
Here, on Lines 10 and 11:
- The object ID is used as the random seed.
- The same object always produces the same color.
- Tracking becomes visually consistent across frames.
So if object ID 3 appears in 200 frames, it will always use the same color.
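We can confirm this determinism directly: seeding NumPy's Generator with the object ID makes the color a pure function of the ID (id_to_color is a hypothetical helper name; the tutorial inlines these two lines):

```python
import numpy as np

# The same seed always yields the same RGB triple, so a tracked
# object keeps one color for its entire lifetime.
def id_to_color(object_id: int) -> np.ndarray:
    rng = np.random.default_rng(int(object_id))
    return rng.integers(0, 255, size=3, dtype=np.uint8)

c1 = id_to_color(3)
c2 = id_to_color(3)
print(np.array_equal(c1, c2))  # True
```

Different IDs get independently drawn colors, so in practice, distinct objects almost always look visually distinct as well.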
Sometimes masks are produced at a lower resolution, so we resize them to match the frame. Lines 13 and 14 ensure masks perfectly align with the streaming frame.
Next, we create a colored mask and blend it with the original frame on Lines 16-18. The transparency factor alpha controls how strongly the mask appears.
We then draw bounding boxes for each tracked object. Line 21 helps confirm detection regions visually.
Finally, we render the object ID and confidence score and draw it above the box. This allows us to verify identity consistency and inspect tracking confidence per frame (Lines 23-33).
Streaming Inference Function: Maintaining Temporal Memory Across Live Video Frames
So far, we processed videos by loading all frames first and then propagating segmentation across the entire clip.
For webcam or live-stream scenarios, this approach does not work because frames arrive one at a time. Instead, we need a streaming pipeline that:
- Maintains tracking memory across frames
- Processes each incoming frame independently
- Updates segmentation and tracking state continuously
To achieve this, we maintain a persistent session and reuse it across frames.
# ------------------------------------------------
# Global streaming state (kept across frames)
# ------------------------------------------------
STREAM_STATE = {
    "session": None,
    "prompt": None,
}
First, we create a small global structure that keeps track of the active inference session.
This dictionary stores:
- The active SAM3 inference session
- The currently used text prompt
This allows us to reuse the same session across frames instead of recreating it repeatedly.
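The reuse-or-reset rule is simple enough to sketch with a stub session factory (make_session here is a stand-in for illustration; the real code calls processor.init_video_session and add_text_prompt):

```python
# Session cache: reuse the session while the prompt is unchanged,
# rebuild it when the prompt changes.
STATE = {"session": None, "prompt": None}

def make_session(prompt):
    # Stand-in for processor.init_video_session + add_text_prompt.
    return {"prompt": prompt, "memory": []}

def get_session(prompt):
    if STATE["session"] is None or STATE["prompt"] != prompt:
        STATE["session"] = make_session(prompt)
        STATE["prompt"] = prompt
    return STATE["session"]

s1 = get_session("person")
s2 = get_session("person")   # same prompt -> same session object
s3 = get_session("car")      # new prompt -> fresh session
print(s1 is s2, s1 is s3)    # True False
```

Resetting on a prompt change matters: tracking memory built for "person" is meaningless for "car", so stale state must be discarded rather than reused.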
# ------------------------------------------------
# Streaming inference function
# ------------------------------------------------
def process_webcam_frame(frame, text_prompt):
    global STREAM_STATE
    if frame is None:
        return None

    # Initialize session if needed or if prompt changed
    if (
        STREAM_STATE["session"] is None
        or STREAM_STATE["prompt"] != text_prompt
    ):
        session = processor.init_video_session(
            inference_device=device,
            processing_device="cpu",
            video_storage_device="cpu",
            dtype=torch.bfloat16,
        )
        session = processor.add_text_prompt(
            inference_session=session,
            text=text_prompt,
        )
        STREAM_STATE["session"] = session
        STREAM_STATE["prompt"] = text_prompt

    session = STREAM_STATE["session"]

    # Preprocess frame
    inputs = processor(images=frame, device=device, return_tensors="pt")

    # Streaming forward pass
    with torch.no_grad():
        model_outputs = model(
            inference_session=session,
            frame=inputs.pixel_values[0],
            reverse=False,
        )

    # Postprocess to original resolution
    outputs = processor.postprocess_outputs(
        session,
        model_outputs,
        original_sizes=inputs.original_sizes,
    )

    # Visualize
    if outputs and len(outputs["object_ids"]) > 0:
        frame = overlay_masks_boxes(
            frame,
            outputs["masks"].cpu().numpy(),
            outputs["boxes"].cpu().numpy(),
            outputs["scores"].cpu().numpy(),
            outputs["object_ids"].cpu().numpy(),
        )

    return frame
Now we define the function that processes each incoming webcam frame. We reference the global stream state defined earlier so that the session persists between calls. If no frame is available, we simply return None (Lines 14 and 15).
Next, we check whether a session already exists or whether the user has changed the prompt. If either condition is true, we create a new session. Unlike offline video processing, we do not pass frames during initialization because frames arrive one at a time. We then attach the prompt and store the session globally (Lines 18-33).
This session (Line 35) now contains:
- Tracking memory
- Object identity history
- Propagation state
across previous frames.
To convert each frame into tensors before inference, we call the processor, which handles resizing, normalization, tensor formatting, and device placement (Line 38).
We now run inference for the current frame (Lines 41-46). Key points:
torch.no_grad()disables gradients, improving speed.- The frame is processed using the existing session.
- SAM3 updates tracking memory internally.
So segmentation and identities propagate automatically.
Model outputs are resized back to original resolution. This produces masks, bounding boxes, scores, and object IDs for the current frame (Lines 49-53).
If objects are detected, we overlay results on the frame. This produces the annotated frame. The processed frame is returned to the UI or video stream (Lines 56-65).
Launch the Gradio Application
Now we connect our streaming inference pipeline to a live webcam interface using Gradio.
This allows us to:
- Capture frames directly from a webcam
- Run SAM3 segmentation and tracking in real time
- Visualize masks and tracked objects continuously
# ------------------------------------------------
# Gradio UI
# ------------------------------------------------
with gr.Blocks() as demo:
    gr.Markdown("# 📷 SAM3 Live Webcam Segmentation & Tracking")

    with gr.Row():
        webcam = gr.Image(
            sources=["webcam"],
            streaming=True,
            label="Webcam",
            type="numpy",
        )
        output = gr.Image(
            label="Live Segmentation",
            type="numpy",
        )

    prompt = gr.Textbox(
        label="Text Prompt",
        value="person",
        placeholder="e.g. person, face, chair, bottle",
    )

    webcam.stream(
        fn=process_webcam_frame,
        inputs=[webcam, prompt],
        outputs=output,
    )

demo.launch(debug=True)
Line 1 creates the Gradio app layout where we can combine multiple UI components. Line 2 displays a header explaining what the demo does.
We place input and output side by side. Inside gr.Row, we define 2 components:
First component:
- Captures frames from the webcam
- Streams frames continuously
- Sends frames as NumPy arrays to our function
The streaming=True flag enables real-time frame delivery.
The second component displays the processed frame returned by our streaming pipeline.
Next, we create a textbox for specifying the concept to track. Users can change this dynamically while the webcam runs. If the prompt changes, our pipeline automatically resets the session.
The key connection happens on Lines 26-30. This tells Gradio:
- Send every webcam frame to
process_webcam_frame - Pass along the current text prompt
- Display the returned frame in the output panel
This loop runs continuously while the webcam is active.
Finally, we launch the interface. This starts a local server and opens the demo in a browser. The debug=True flag helps diagnose errors during development.
Output: Real-Time Webcam Video Segmentation Results
Single-Click Object Tracking
In this workflow, we remove text prompts and select an object by clicking on it in the first frame. SAM3 segments the clicked object and propagates its mask throughout the video using its tracking memory. With just one foreground point, we obtain consistent object tracking across the full sequence.
Load the SAM3 Tracker Video Model
So far, we used text prompts to detect and track concepts across videos and live streams.
Now we move to a different workflow: interactive tracking, where we manually select objects and let SAM3 track them across frames.
To enable this, we switch from the text-prompt video model to the tracker-specific video model.
Here is how we load it:
# Initialize model
device = Accelerator().device
model = Sam3TrackerVideoModel.from_pretrained("facebook/sam3").to(device, dtype=torch.bfloat16)
processor = Sam3TrackerVideoProcessor.from_pretrained("facebook/sam3")
The Accelerator().device automatically selects the available hardware:
- GPU if available
- CPU otherwise
This keeps our code portable across machines.
We load the SAM3 tracker model, which is optimized for:
- Point-based object prompting
- Interactive tracking
- Multi-object identity propagation
- Frame-to-frame tracking consistency
Unlike the previous model, this one does not require text prompts. Instead, it expects clicks or point annotations.
We also move the model to the selected device and run it in bfloat16 precision, reducing memory usage and speeding up inference.
We load the processor which prepares inputs and postprocesses outputs specifically for tracking workflows. It handles frame preprocessing, prompt encoding (clicks or points), mask decoding, and identity propagation formatting.
Extract First Frame: Preparing the Initial Frame for Object Selection
Before we start tracking objects in a video, we need a way to select them.
In our workflow, we select objects by clicking on them in the first frame. So, the first step is to extract that frame from the video.
Here is a small helper function that does exactly that:
def extract_first_frame(video_path):
    """Extract first frame from video for point selection"""
    cap = cv2.VideoCapture(video_path)
    ret, frame = cap.read()
    cap.release()
    if ret:
        return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    return None
First, OpenCV opens the video file and prepares it for frame reading. Then it attempts to read the next frame from the video.
- ret indicates whether reading succeeded.
- frame contains the actual image data.
Since we just opened the video, this call returns the first frame.
We release the video resource immediately after reading the frame. This is important because:
- It frees system resources
- It prevents file locks
- It keeps later video processing clean
OpenCV loads images in BGR format, but most visualization and processing pipelines expect RGB. So we convert BGR to RGB before returning the frame.
If the frame cannot be read, the function safely returns None.
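For a 3-channel image, the BGR-to-RGB conversion is just a reversal of the channel axis; cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) produces the same pixel values as the pure-NumPy slice below (illustrative equivalence for 3-channel images; cvtColor also handles other layouts):

```python
import numpy as np

# One pure-blue pixel in BGR order: B=255, G=0, R=0.
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the last axis swaps B and R, giving RGB order.
rgb = bgr[:, :, ::-1]
print(rgb[0, 0].tolist())  # [0, 0, 255]: still blue, now in the B slot
```

The slice view is cheap (no copy), but note that some OpenCV functions want contiguous arrays; np.ascontiguousarray fixes that when needed.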
Tracking Object Function: Propagating a Single Object Mask Across Video Frames
We now build the core function that allows us to track an object through an entire video using a single click.
The workflow is simple:
- The user uploads a video.
- The user clicks on the object in the first frame.
- SAM3 segments that object.
- The tracker propagates the object mask across all frames.
- A new annotated video is generated.
Let us walk through the implementation step by step:
def track_object(video_path, point_coords):
    """Track object through video based on clicked point"""
    if video_path is None:
        return None, "Please upload a video first"
    if point_coords is None:
        return None, "Please click on the first frame to select an object"
    try:
        # Load video
        video_frames, _ = load_video(video_path)
        # Get click coordinates
        x, y = int(point_coords[0]), int(point_coords[1])
        # Initialize session
        inference_session = processor.init_video_session(
            video=video_frames,
            inference_device=device,
            dtype=torch.bfloat16,
        )
        # Add point annotation
        points = [[[[x, y]]]]
        labels = [[[1]]]  # 1 for foreground point
        processor.add_inputs_to_inference_session(
            inference_session=inference_session,
            frame_idx=0,
            obj_ids=1,
            input_points=points,
            input_labels=labels,
        )
        # First, segment the object on the first frame
        outputs = model(
            inference_session=inference_session,
            frame_idx=0,
        )
        first_frame_masks = processor.post_process_masks(
            [outputs.pred_masks],
            original_sizes=[[inference_session.video_height, inference_session.video_width]],
            binarize=False
        )[0]
        # Propagate through video
        video_segments = {0: first_frame_masks}
        for sam3_tracker_video_output in model.propagate_in_video_iterator(inference_session):
            video_res_masks = processor.post_process_masks(
                [sam3_tracker_video_output.pred_masks],
                original_sizes=[[inference_session.video_height, inference_session.video_width]],
                binarize=False
            )[0]
            video_segments[sam3_tracker_video_output.frame_idx] = video_res_masks
        # Create output video with masks
        output_path = "/tmp/output_tracked.mp4"
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        height, width = video_frames[0].shape[:2]
        out = cv2.VideoWriter(output_path, fourcc, 30.0, (width, height))
        for idx in range(len(video_frames)):
            frame = video_frames[idx].copy().astype(np.uint8)
            if idx in video_segments:
                masks = video_segments[idx]
                # Convert mask to float32 first, then to boolean
                mask = masks[0, 0].float().cpu().numpy() > 0.0
                # Overlay red mask
                overlay = frame.copy()
                overlay[mask] = [255, 0, 0]
                frame = cv2.addWeighted(frame.astype(np.float32), 0.6, overlay.astype(np.float32), 0.4, 0).astype(np.uint8)
            out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        out.release()
        status = f"✅ Successfully tracked object through {len(video_segments)} frames at point ({x}, {y})"
        return output_path, status
    except Exception as e:
        return None, f"❌ Error: {str(e)}"
The track_object() function accepts:
- video_path: Path to the uploaded video file.
- point_coords: (x, y) coordinates of the user’s click on the first frame.
The goal is simple: “Given a video and one clicked point, track that object through the entire video.”
If no video is uploaded, tracking cannot begin. The function returns:
- None: No output video
- A message explaining the issue
Likewise, tracking requires a foreground prompt. SAM3 needs at least one positive point to know: which object to segment, and where that object exists in the first frame. Without it, tracking is undefined.
Inside a try block, the load_video() function reads the video file, extracts all frames into memory, and returns them as a list of NumPy arrays.
Why load all frames?
Because SAM3 tracking requires:
- Access to the entire temporal sequence
- Mask propagation across frames
- Internal memory consistency
Each frame shape is typically: (height, width, 3). The UI provides floating-point coordinates. We convert them to integers because:
- Pixel indices must be integers
- Mask indexing requires integer positions
This (x, y) now represents a foreground location inside frame 0.
Next, we initialize a video session which creates an internal tracking session, stores all video frames, model memory state, and object tracking buffers. We also set computation device (either CPU or GPU) and use bfloat16 for faster inference and lower memory usage. This prepares SAM3’s brain to process a full video.
Then, we prepare a foreground prompt. Since SAM3 expects inputs in batch format, points = [[[[x, y]]]] means:
- Batch size = 1
- One object
- One point for that object
- The point’s (x, y) coordinates
In labels = [[[1]]], the 1 marks a foreground point; a 0 would mark a background point. So this tells SAM3: “This pixel belongs to the object.”
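To make the nesting concrete, here is a small sketch (the click coordinates are arbitrary) that unpacks the [batch][object][point][xy] layout:

```python
# Hypothetical click at (320, 240); the nesting mirrors SAM3's batch format
x, y = 320, 240
points = [[[[x, y]]]]  # [batch][object][point][xy]
labels = [[[1]]]       # [batch][object][point]; 1 = foreground, 0 = background

assert len(points) == 1               # batch size 1
assert len(points[0]) == 1            # one object in the batch
assert len(points[0][0]) == 1         # one point for that object
assert points[0][0][0] == [320, 240]  # the (x, y) coordinates
assert labels[0][0][0] == 1           # foreground label
```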
Then processor.add_inputs_to_inference_session() injects the prompt into the tracking session.
- frame_idx=0: Object exists in the first frame
- obj_ids=1: This is object ID 1
- input_points: Where the object is
- input_labels: Foreground signal
At this point, the model knows that object 1 is located at (x, y) in frame 0.
We explicitly run segmentation on frame 0. This produces:
- Raw mask logits
- Low-resolution mask predictions
Then processor.post_process_masks():
- Resizes masks to original resolution
- Converts internal representation into full-size masks
- Keeps them as float probabilities (not binarized)
We now have a full-resolution mask for frame 0.
We store the frame 0 results first. Then model.propagate_in_video_iterator() runs SAM3’s tracking mechanism.
What happens internally:
- It uses memory from frame 0
- Matches object appearance across frames
- Predicts masks for each new frame
For each frame, processor.post_process_masks(...) resizes the masks, and we store them in a dictionary keyed by frame index.
Final structure:
video_segments = {
    0: mask0,
    1: mask1,
    2: mask2,
    ...
}
Now we have segmentation results for the full video.
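The propagation loop can be sketched with a stand-in iterator. Here, fake_propagate is a hypothetical stub that mimics the per-frame outputs (frame_idx and pred_masks) of model.propagate_in_video_iterator(); the point is the dictionary-building pattern, not the model itself:

```python
from types import SimpleNamespace

def fake_propagate(num_frames=3):
    """Hypothetical stub yielding per-frame outputs like the real iterator."""
    for i in range(1, num_frames + 1):
        yield SimpleNamespace(frame_idx=i, pred_masks=f"mask_{i}")

# Frame 0 is segmented explicitly; the remaining frames come from propagation
video_segments = {0: "mask_0"}
for output in fake_propagate():
    video_segments[output.frame_idx] = output.pred_masks

print(sorted(video_segments))  # [0, 1, 2, 3]
```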
Next, we define the output video path and set the codec to mp4v. We read the resolution from the first frame so the output video matches the input resolution. We then initialize the OpenCV writer with cv2.VideoWriter(), passing the output path, the codec, 30 FPS, and the frame size as (width, height).
We then iterate over each frame and create a copy to avoid modifying the original frames. We move the output tensor to the CPU, convert it to a NumPy array, and threshold it at zero to obtain a Boolean mask. The resulting mask is True where the object exists and False elsewhere.
We then overlay a red mask and blend it using cv2.addWeighted(), producing a frame with 60% original content and 40% red overlay for smooth visualization. Because OpenCV expects BGR format, we convert the frame using cv2.cvtColor() with the cv2.COLOR_RGB2BGR flag.
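The overlay-and-blend step can be reproduced with plain NumPy. In this sketch (with arbitrary example data), the weighted sum matches what cv2.addWeighted computes with weights 0.6 and 0.4:

```python
import numpy as np

# A 4x4 gray frame and a mask covering its center (arbitrary example data)
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

# Paint masked pixels red on a copy, then blend 60/40 like cv2.addWeighted
overlay = frame.copy()
overlay[mask] = [255, 0, 0]
blended = (0.6 * frame.astype(np.float32) + 0.4 * overlay.astype(np.float32)).astype(np.uint8)

print(blended[0, 0].tolist())  # [200, 200, 200] -> background is untouched
print(blended[1, 1].tolist())  # [222, 120, 120] -> reddish tint on the object
```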
out.release() finalizes the file, writes any remaining buffers, and closes the video properly. Without this call, the output file may become corrupted.
Finally, we return the path to the saved video and a success message. If an error occurs:
- The function safely returns an error message
- The application does not crash
Launch the Gradio Application
We now connect our tracking pipeline to an interactive Gradio interface. This interface allows users to upload a video, click on an object in the first frame, and automatically track that object across the entire clip.
Here is the full interface code:
# Create Gradio interface with blocks for better control
with gr.Blocks(title="SAM3 Video Tracker") as demo:
    gr.Markdown("# 🎯 SAM3 Video Object Tracker")
    gr.Markdown("Upload a video and click on an object in the first frame to track it throughout the video")
    with gr.Row():
        with gr.Column():
            video_input = gr.Video(label="Upload Video")
            first_frame = gr.Image(label="Click on object to track", type="numpy")
            point_display = gr.Textbox(label="Selected Point", interactive=False)
            track_btn = gr.Button("Track Object", variant="primary")
        with gr.Column():
            video_output = gr.Video(label="Tracked Video")
            status_output = gr.Textbox(label="Status")
    # Store clicked point
    clicked_point = gr.State(None)
    # Extract first frame when video is uploaded
    def on_video_upload(video):
        if video:
            frame = extract_first_frame(video)
            return frame, None, "Upload complete. Click on the object you want to track."
        return None, None, ""
    video_input.change(
        on_video_upload,
        inputs=[video_input],
        outputs=[first_frame, clicked_point, status_output]
    )
    # Handle click on first frame
    def on_click(img, evt: gr.SelectData):
        x, y = evt.index[0], evt.index[1]
        # Draw a circle on the clicked point
        img_copy = img.copy()
        cv2.circle(img_copy, (x, y), 5, (255, 0, 0), -1)
        return img_copy, (x, y), f"Point selected: ({x}, {y})"
    first_frame.select(
        on_click,
        inputs=[first_frame],
        outputs=[first_frame, clicked_point, point_display]
    )
    # Track button
    track_btn.click(
        track_object,
        inputs=[video_input, clicked_point],
        outputs=[video_output, status_output]
    )

# Launch
demo.launch(debug=True)
The Gradio interface is built using gr.Blocks, which gives full control over layout, components, and event handling. The goal is simple: “Allow a user to upload a video, click on a single object in the first frame, and track that object throughout the entire video.”
At the top of the interface, we display two gr.Markdown() sections. The first acts as the main heading so users immediately understand what the application does. The second provides short instructions explaining the workflow: upload a video, click on an object in the first frame, and then track it.
Next, we structure the layout using a gr.Row(). Inside that row, we create two columns. The left column contains all inputs and interactions. The right column displays outputs.
In the left column, we first add a video upload component. This allows the user to upload a video file from their system. Once uploaded, the backend receives the file path. That file path is later used to extract frames and run tracking.
Below the video upload, we place an image component. This image will display the first frame of the uploaded video. We set its type to NumPy so that the backend receives the frame as a NumPy array. This is important because we draw visual markers on the frame using OpenCV when the user clicks.
Below the image, we add a textbox labeled "Selected Point". This textbox is non-interactive, meaning the user cannot manually edit it. It simply displays the coordinates of the selected point so the user can confirm their click.
Under that, we add a "Track Object" button. This button is styled as primary so it visually stands out as the main action. When clicked, it triggers the tracking pipeline.
In the right column, we create a video output component. This will display the processed video after tracking is complete. Below it, we add a status textbox. This displays messages such as upload confirmation, tracking success, or error details.
To maintain interaction state, we use a Gradio State variable called clicked_point. This variable stores the coordinates of the selected object. Initially, it is set to None. State is important because the click event and the tracking event happen at different times, and we need a way to remember which point the user selected.
When a video is uploaded, a function on_video_upload() is triggered. This function checks whether a valid video exists. If it does, we extract the first frame using a helper function. That first frame is returned to the image component so the user can see it. We also reset the stored clicked point to None, ensuring that any previous selection is cleared. Finally, we return a status message informing the user that the upload is complete and they should click on an object.
If no video is uploaded, the function returns empty values, keeping the interface clean.
When the user clicks on the first frame image, another function on_click() handles the event. The click event provides the pixel coordinates of the selected location through evt.index. We extract the x and y coordinates from that event data.
Next, we create a copy of the displayed image. This is important because we do not want to modify the original image directly. On this copy, we draw a small filled circle at the clicked location using OpenCV. The circle visually marks the selected object so the user knows exactly where they clicked.
After drawing the marker, we return 3 things:
- The updated image with the circle drawn
- The tuple (x, y) stored in the state variable
- A formatted string such as "Point selected: (x, y)" displayed in the textbox
This ensures the UI updates immediately and the selected point is stored for later use.
The Track Object button is connected to the backend tracking function. When pressed, it sends 2 inputs:
- The uploaded video path
- The stored clicked point
The tracking function then performs segmentation and mask propagation across the entire video. Once processing is complete, it returns:
- The path to the output tracked video
- A status message indicating success or failure
These outputs are displayed in the video output component and the status textbox, respectively.
Finally, the application is launched with debug mode enabled. Debug mode prints detailed logs in case of errors, which is helpful during development and testing.
The complete flow works as follows:
- The user uploads a video.
- The first frame is extracted and displayed.
- The user clicks on an object.
- A visual marker appears and the coordinates are stored.
- The user presses Track Object.
- The backend processes the video and returns the tracked result.
- The output video and status message are displayed.
This design keeps the interface simple, intuitive, and focused on a single-click tracking workflow while properly managing state and user interaction.
Output: Single-Click Video Object Tracking Results
Multi-Click Object Tracking
In this final setup, we select multiple objects by clicking different locations in the first frame. Each click initializes a unique object ID, and SAM3 tracks all selected objects simultaneously. The output video shows multiple masks with distinct colors, preserving identity consistency across frames.
Initialize Few Colors: Defining a Color Palette for Multi-Object Tracking Visualization
When tracking multiple objects at the same time, visualization becomes very important. If all objects share the same mask color, it becomes difficult to understand which mask corresponds to which object.
To solve this, we assign different colors to different tracked objects.
Here is a small color palette we use:
# Different colors for different objects
COLORS = [
    [255, 0, 0],    # Red
    [0, 255, 0],    # Green
    [0, 0, 255],    # Blue
    [255, 255, 0],  # Yellow
    [255, 0, 255],  # Magenta
    [0, 255, 255],  # Cyan
    [255, 128, 0],  # Orange
    [128, 0, 255],  # Purple
]
Each entry in this list represents an RGB color used to render masks and overlays for a tracked object.
For example:
- Object 1 may appear in red
- Object 2 in green
- Object 3 in blue
- and so on.
During visualization, we typically assign colors based on object index or object ID, cycling through the list if the number of objects exceeds the available colors.
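The cycling itself is a simple modulo lookup. A quick sketch (with the palette abbreviated to three entries for brevity):

```python
# Abbreviated palette for the sketch; the full tutorial palette has 8 colors
COLORS = [
    [255, 0, 0],  # Red
    [0, 255, 0],  # Green
    [0, 0, 255],  # Blue
]

def color_for(obj_idx):
    """Cycle through the palette when objects outnumber colors."""
    return COLORS[obj_idx % len(COLORS)]

print(color_for(0), color_for(2), color_for(3))  # index 3 wraps back to red
```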
Extract First Frame: Preparing the First Frame for Multi-Object Selection
For multi-object tracking, we again begin by extracting the first frame of the video. This frame is used as the interaction surface where users click on multiple objects they want to track.
The helper function below reads the first frame from the uploaded video.
def extract_first_frame(video_path):
    """Extract first frame from video for point selection"""
    cap = cv2.VideoCapture(video_path)
    ret, frame = cap.read()
    cap.release()
    if ret:
        return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    return None
First, we open the video using OpenCV’s VideoCapture. Immediately after opening, we read a single frame. Since the video has just been opened, this corresponds to the very first frame.
Next, we release the video handle to free system resources and avoid locking the file for later processing steps.
OpenCV loads images in BGR format, but our visualization and model pipelines expect RGB images. Therefore, we convert the frame from BGR to RGB before returning it.
If frame extraction fails, the function safely returns None, allowing the application to handle the error gracefully.
This function now allows users to click on multiple objects in the first frame, which we will use as prompts for tracking several objects simultaneously in the next step.
Tracking Object Function: Tracking Multiple Objects with Unique IDs Across Video Frames
def track_objects(video_path, points_list):
    """Track multiple objects through video based on clicked points"""
    if video_path is None:
        return None, "Please upload a video first"
    if not points_list or len(points_list) == 0:
        return None, "Please click on at least one object in the first frame"
    try:
        # Load video
        video_frames, _ = load_video(video_path)
        # Initialize session
        inference_session = processor.init_video_session(
            video=video_frames,
            inference_device=device,
            dtype=torch.bfloat16,
        )
        # Prepare points for all objects
        obj_ids = list(range(1, len(points_list) + 1))
        input_points = [[[[int(x), int(y)]] for x, y in points_list]]
        input_labels = [[[1] for _ in points_list]]  # All are foreground points
        # Add all objects to inference session
        processor.add_inputs_to_inference_session(
            inference_session=inference_session,
            frame_idx=0,
            obj_ids=obj_ids,
            input_points=input_points,
            input_labels=input_labels,
        )
        # First, segment objects on the first frame
        outputs = model(
            inference_session=inference_session,
            frame_idx=0,
        )
        first_frame_masks = processor.post_process_masks(
            [outputs.pred_masks],
            original_sizes=[[inference_session.video_height, inference_session.video_width]],
            binarize=False
        )[0]
        # Initialize video segments with first frame
        video_segments = {0: {
            obj_id: first_frame_masks[i]
            for i, obj_id in enumerate(inference_session.obj_ids)
        }}
        # Propagate through video
        for sam3_tracker_video_output in model.propagate_in_video_iterator(inference_session):
            video_res_masks = processor.post_process_masks(
                [sam3_tracker_video_output.pred_masks],
                original_sizes=[[inference_session.video_height, inference_session.video_width]],
                binarize=False
            )[0]
            video_segments[sam3_tracker_video_output.frame_idx] = {
                obj_id: video_res_masks[i]
                for i, obj_id in enumerate(inference_session.obj_ids)
            }
        # Create output video with masks
        output_path = "/tmp/output_tracked.mp4"
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        height, width = video_frames[0].shape[:2]
        out = cv2.VideoWriter(output_path, fourcc, 30.0, (width, height))
        for idx in range(len(video_frames)):
            frame = video_frames[idx].copy().astype(np.uint8)
            if idx in video_segments:
                # Create overlay for all objects
                overlay = frame.copy().astype(np.float32)
                for obj_idx, (obj_id, masks) in enumerate(video_segments[idx].items()):
                    # Convert mask to float32 first, then to boolean
                    mask = masks[0].float().cpu().numpy() > 0.0
                    # Use different color for each object
                    color = COLORS[obj_idx % len(COLORS)]
                    overlay[mask] = np.array(color, dtype=np.float32)
                # Blend overlay with original frame
                frame = cv2.addWeighted(frame.astype(np.float32), 0.6, overlay, 0.4, 0).astype(np.uint8)
            out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        out.release()
        status = f"✅ Successfully tracked {len(points_list)} object(s) through {len(video_segments)} frames"
        return output_path, status
    except Exception as e:
        import traceback
        return None, f"❌ Error: {str(e)}\n{traceback.format_exc()}"
The track_objects() function accepts:
- video_path: Path to the uploaded video file.
- points_list: A list of (x, y) coordinates where the user clicked on the first frame (one click per object).
The goal is simple: “Given a video and multiple clicked points, track all selected objects through the entire video.”
If no video is uploaded, tracking cannot begin. The function returns:
- None: No output video
- A message explaining the issue
Likewise, tracking requires at least one foreground prompt. SAM3 needs one or more positive points to know:
- Which objects to segment
- Where those objects exist in the first frame
If points_list is empty, tracking is undefined.
Inside a try block, the load_video() function reads the video file, extracts all frames into memory, and returns them as a list of NumPy arrays.
Why load all frames?
Because SAM3 tracking requires:
- Access to the entire temporal sequence
- Mask propagation across frames
- Internal memory consistency
Each frame shape is typically: (height, width, 3). Next, we initialize a video session using processor.init_video_session(). This creates an internal tracking session that:
- Stores all video frames
- Maintains model memory state
- Manages object tracking buffers
We also:
- Set the computation device (CPU or GPU)
- Use bfloat16 for faster inference and lower memory usage
This step prepares SAM3’s internal tracking mechanism to process the full video.
Now we prepare inputs for multiple objects. If the user clicked 3 points, obj_ids becomes:
[1, 2, 3]
Each clicked point is treated as a separate object with its own ID.
Then we structure the coordinates:
input_points = [[[[int(x), int(y)]] for x, y in points_list]]
SAM3 expects batch format:
[batch][object][points][coordinates]
So this structure means:
- Batch size = 1
- Multiple objects
- One point per object
- Each point has (x, y) coordinates
We convert coordinates to integers because:
- Pixel indices must be integers
- Mask indexing requires integer positions
Each (x, y) now represents a foreground location for a different object in frame 0.
Next, we define labels:
input_labels = [[[1] for _ in points_list]]
Here, 1 marks a foreground point; a 0 would mark a background point.
So this tells SAM3: “Each of these clicked pixels belongs to a separate object.”
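Putting the multi-object prompt construction together, here is a small sketch with three hypothetical clicks (the Gradio UI reports fractional pixel coordinates, hence the int() conversion):

```python
# Three hypothetical clicks, one per object, as the UI might report them
points_list = [(100.4, 150.9), (320.0, 240.0), (500.2, 80.7)]

obj_ids = list(range(1, len(points_list) + 1))
input_points = [[[[int(x), int(y)]] for x, y in points_list]]
input_labels = [[[1] for _ in points_list]]  # every click is a foreground point

print(obj_ids)             # [1, 2, 3]
print(input_points[0][0])  # [[100, 150]] -> one integer point for object 1
print(input_labels[0])     # [[1], [1], [1]]
```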
Then we inject everything into the inference session using processor.add_inputs_to_inference_session():
Here:
frame_idx=0: Objects exist in the first frameobj_ids: Multiple object identifiersinput_points: Click locationsinput_labels: Foreground signals
At this point, the model knows that multiple objects are present at the selected coordinates in frame 0.
Next, we explicitly run segmentation on frame 0. This produces:
- Raw mask logits
- Low-resolution mask predictions for all objects
Then we post-process the masks using processor.post_process_masks(...). This step:
- Resizes masks to original resolution
- Converts internal representation into full-size masks
- Keeps them as float probabilities (not binarized)
We now have: Full-resolution masks for all selected objects in frame 0.
Now we initialize storage for the first frame. This means:
- Frame 0 results are stored first
- Each object ID maps to its own mask
Structure becomes:
video_segments = {
    0: {
        1: mask_for_object_1,
        2: mask_for_object_2,
        ...
    }
}
Next, we propagate through the video using model.propagate_in_video_iterator(). This runs SAM3’s tracking mechanism.
What happens internally:
- It uses memory from frame 0
- Matches object appearance across frames
- Maintains identity consistency for each object
- Predicts masks for each new frame
For every frame:
- We resize masks
- Store them in a dictionary per object
Final structure:
video_segments = {
    0: {1: mask0_1, 2: mask0_2},
    1: {1: mask1_1, 2: mask1_2},
    2: {1: mask2_1, 2: mask2_2},
    ...
}
Now we have segmentation results for all objects across the full video.
Next, we define:
- Output video path
- Codec format (mp4v)
We get the resolution from the first frame to ensure the output video matches the input resolution. Then we initialize cv2.VideoWriter() which sets:
- Output path
- Codec
- 30 FPS
- Frame dimensions
Now we iterate through each frame. We copy each frame to avoid modifying the original.
If masks exist for that frame:
- We create an overlay
- For each object:
- Move the mask tensor to the CPU
- Convert it to NumPy
- Threshold it into a Boolean mask
Now:
- True: Object pixels
- False: Background
Unlike the single-object version, here we:
- Assign a different color for each object
- Use COLORS[obj_idx % len(COLORS)]
Then we blend using cv2.addWeighted(). This results in:
- 60% original frame
- 40% colored overlay
- Smooth visualization
OpenCV expects BGR format, so we convert using cv2.COLOR_RGB2BGR.
Finally, out.release() finalizes the file, writes any remaining buffers, and properly closes the video. Without this step, the output video may become corrupted.
At the end, we return:
- Path to the saved video
- A success message indicating how many objects were tracked
If anything fails:
- The function safely returns the error
- The application does not crash
- The traceback is included for debugging
Launch the Gradio Application
# Create Gradio interface with blocks for better control
with gr.Blocks(title="SAM3 Multi-Object Video Tracker") as demo:
    gr.Markdown("# 🎯 SAM3 Multi-Object Video Tracker")
    gr.Markdown("Upload a video and click on multiple objects in the first frame to track them. Each object gets a different color!")
    with gr.Row():
        with gr.Column():
            video_input = gr.Video(label="Upload Video")
            first_frame = gr.Image(label="Click on objects to track (multiple clicks supported)", type="numpy")
            with gr.Row():
                clear_points_btn = gr.Button("Clear Points", variant="secondary")
                track_btn = gr.Button("Track Objects", variant="primary")
            points_display = gr.Textbox(label="Selected Points", interactive=False, lines=5)
        with gr.Column():
            video_output = gr.Video(label="Tracked Video")
            status_output = gr.Textbox(label="Status")
            gr.Markdown("""
            ### Color Legend:
            - 🔴 Red - Object 1
            - 🟢 Green - Object 2
            - 🔵 Blue - Object 3
            - 🟡 Yellow - Object 4
            - 🩷 Magenta - Object 5
            - 🩵 Cyan - Object 6
            - 🟠 Orange - Object 7
            - 🟣 Purple - Object 8
            """)
    # Store clicked points and original frame
    clicked_points = gr.State([])
    original_frame = gr.State(None)
    # Extract first frame when video is uploaded
    def on_video_upload(video):
        if video:
            frame = extract_first_frame(video)
            return frame, frame, [], "Upload complete. Click on objects you want to track."
        return None, None, [], ""
    video_input.change(
        on_video_upload,
        inputs=[video_input],
        outputs=[first_frame, original_frame, clicked_points, status_output]
    )
    # Handle click on first frame
    def on_click(img, orig_frame, points, evt: gr.SelectData):
        if orig_frame is None:
            return img, points, "Please upload a video first"
        x, y = evt.index[0], evt.index[1]
        # Add point to list
        points.append((x, y))
        # Draw all points on the image
        img_copy = orig_frame.copy()
        for i, (px, py) in enumerate(points):
            color = COLORS[i % len(COLORS)]
            cv2.circle(img_copy, (px, py), 8, tuple(color), -1)
            cv2.circle(img_copy, (px, py), 10, (255, 255, 255), 2)
            # Add number label
            cv2.putText(img_copy, str(i+1), (px+15, py+5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, tuple(color), 2)
        points_text = "\n".join([f"Object {i+1}: ({x}, {y})" for i, (x, y) in enumerate(points)])
        return img_copy, points, points_text
    first_frame.select(
        on_click,
        inputs=[first_frame, original_frame, clicked_points],
        outputs=[first_frame, clicked_points, points_display]
    )
    # Clear points button
    def clear_points(orig_frame):
        return orig_frame, [], ""
    clear_points_btn.click(
        clear_points,
        inputs=[original_frame],
        outputs=[first_frame, clicked_points, points_display]
    )
    # Track button
    track_btn.click(
        track_objects,
        inputs=[video_input, clicked_points],
        outputs=[video_output, status_output]
    )

# Launch
demo.launch(debug=True)
The Gradio interface is built using gr.Blocks, which allows us to design a structured, interactive layout with full control over components and events. The goal is simple: “Create an interactive UI where a user uploads a video, clicks on multiple objects in the first frame, and then tracks them across the entire video.”
At the top of the interface, we display a title and short instructions using gr.Markdown(). This helps users immediately understand what the application does and what steps they need to follow. Clear instructions reduce confusion and improve usability.
Next, we organize the layout using a gr.Row(). Inside that row, we create 2 columns using gr.Column(). The left column handles inputs and interactions. The right column displays outputs and tracking results. This separation keeps the workflow intuitive and clean.
In the left column, we first create a video upload component. This allows the user to upload a video file from their system. Once a video is uploaded, the backend receives the file path, which is later used to extract frames and perform tracking.
Below the video upload, we place an image component that displays the first frame of the uploaded video. This image is interactive and supports click events. We set its type to NumPy so the backend receives the image as a NumPy array. This is important because we draw circles and labels on the frame using OpenCV.
Under the image, we add 2 buttons side by side. The first button clears selected points. The second button triggers the tracking pipeline. The track button is styled as the primary button so it visually stands out as the main action.
Below the buttons, we add a textbox that displays selected points. This textbox is non-interactive, meaning users cannot edit it manually. It simply shows a formatted list of selected objects and their coordinates. This helps users confirm that they clicked the correct locations before starting tracking.
In the right column, we create a video output component. This will display the processed video returned by the tracking function. Below that, we add a status textbox to show messages such as upload confirmation, tracking success, or error details. Finally, we display a color legend using Markdown so users understand which color corresponds to which object during visualization.
To manage interaction data across events, we use Gradio’s State component. One state variable stores the list of clicked points. This list grows as the user clicks on multiple objects. Another state variable stores the original first frame. This is important because every time a new click occurs, we redraw all points on a clean copy of the original frame instead of repeatedly drawing over an already modified image. Without this, the markers would stack incorrectly and distort the visualization.
When a video is uploaded, a function on_video_upload() is triggered. This function extracts the first frame from the video and returns 4 values:
- The extracted first frame for display
- The same frame stored as the original clean frame
- An empty list of clicked points
- A status message confirming upload completion
This ensures that each new upload resets the application state properly.
When the user clicks on the first frame image, another function on_click() handles the event. The click event provides the pixel coordinates of the selected location. First, we check whether a video has been uploaded. If not, we return a message asking the user to upload one.
If a frame exists, we extract the x and y coordinates from the click event. We then append this coordinate pair to the stored list of points. After updating the list, we redraw the image. We copy the original clean frame and loop over all stored points. For each point:
- We select a color from the predefined COLORS list
- We draw a white border around the circle for better visibility
- We add a numeric label next to the point indicating Object 1, Object 2, and so on
This ensures each selected object is visually distinct and clearly labeled.
We also generate formatted text listing all selected objects and their coordinates. This text is displayed in the textbox so the user can verify selections.
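That formatted listing is a simple join over the stored points. For example, with two hypothetical clicks:

```python
# Two hypothetical selected points stored in the Gradio state
points = [(100, 150), (320, 240)]

points_text = "\n".join(
    f"Object {i+1}: ({x}, {y})" for i, (x, y) in enumerate(points)
)

print(points_text)
# Object 1: (100, 150)
# Object 2: (320, 240)
```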
The Clear Points button is connected to a function that resets the interface. It restores the original clean frame, empties the clicked points list, and clears the points display textbox. This allows the user to start fresh without reloading the video.
The Track Objects button is connected to the tracking function. When clicked, it sends:
- The uploaded video
- The stored list of clicked points
to the backend tracking pipeline. The tracking function processes the video, segments and propagates masks for all selected objects, and returns:
- The path to the processed output video
- A status message
These are then displayed in the output video component and the status textbox.
Finally, the application is launched with debug mode enabled. Debug mode provides detailed logs in case errors occur during development, making troubleshooting easier.
Overall, the interface follows this flow:
- The user uploads a video.
- The first frame is extracted and displayed.
- The user clicks multiple objects.
- Points are stored and visualized with colors and labels.
- The user presses Track Objects.
- The backend processes the video and returns the tracked result.
- The output video and status message are displayed.
State management ensures smooth interaction across multiple events, and the 2-column layout keeps inputs and outputs clearly separated.
Output: Multi-Object Video Segmentation and Tracking Results
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: March 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this tutorial, we extended SAM 3 from image-based segmentation workflows to full video understanding and tracking. We first built pipelines that detect, segment, and track concepts across videos using simple text prompts, enabling automatic tracking of objects such as people or vehicles without manual annotation.
Next, we moved to streaming inference, where SAM 3 processes frames continuously from a webcam while maintaining tracking memory across time. This allowed us to build real-time segmentation and tracking systems that operate on live video streams.
We then explored interactive tracking workflows, where users select objects directly using click prompts. Starting from single-object tracking, we progressed to multi-object tracking, enabling several objects to be tracked simultaneously with consistent identity and color-coded visualization.
By the end of this tutorial, we developed complete end-to-end systems that combine detection, segmentation, tracking, and interactive workflows into practical applications using Gradio interfaces. Together with the previous parts of this series, we now have a full understanding of how SAM 3 enables concept-aware segmentation and tracking across both images and videos, opening the door to intelligent video editing, annotation, and analysis workflows.
Citation Information
Thakur, P. “SAM 3 for Video: Concept-Aware Segmentation and Object Tracking,” PyImageSearch, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/luxfd
@incollection{Thakur_2026_sam-3-sam3-for-video-concept-aware-segmentation-and-object-tracking,
author = {Piyush Thakur},
title = {{SAM 3 for Video: Concept-Aware Segmentation and Object Tracking}},
booktitle = {PyImageSearch},
editor = {Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
year = {2026},
url = {https://pyimg.co/luxfd},
}