Table of Contents
- SAM 3 for Video: Concept-Aware Segmentation and Object Tracking
- Configuring Your Development Environment
- Setup and Imports
- Text-Prompt Video Tracking
- Load the SAM3 Video Model
- Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs
- Main Pipeline: Running the Full Video Segmentation and Tracking Workflow
- Launch the Gradio Application
- Output: Text-Prompt Video Segmentation and Tracking Results
- Real-Time Text-Prompt Tracking (Webcam)
- Helper Function: Stable Color Overlays for Real-Time Video Tracking
- Streaming Inference Function: Maintaining Temporal Memory Across Live Video Frames
- Launch the Gradio Application
- Output: Real-Time Webcam Video Segmentation Results
- Single-Click Object Tracking
- Load the SAM3 Tracker Video Model
- Extract First Frame: Preparing the Initial Frame for Object Selection
- Tracking Object Function: Propagating a Single Object Mask Across Video Frames
- Launch the Gradio Application
- Output: Single-Click Video Object Tracking Results
- Multi-Click Object Tracking
- Initialize Few Colors: Defining a Color Palette for Multi-Object Tracking Visualization
- Extract First Frame: Preparing the First Frame for Multi-Object Selection
- Tracking Object Function: Tracking Multiple Objects with Unique IDs Across Video Frames
- Launch the Gradio Application
- Output: Multi-Object Video Segmentation and Tracking Results
- Summary
SAM 3 for Video: Concept-Aware Segmentation and Object Tracking
In Part 1 of this series, we introduced Segment Anything Model 3 (SAM 3) and saw how it moves beyond geometric prompts to concept-based visual understanding. We learned how the model can segment all instances of a concept using natural language and visual examples.
In Part 2, we went one step further. We explored multi-modal prompting and interactive workflows. We combined text, bounding boxes, and point clicks to build precise and controllable segmentation pipelines on images.
So far, however, everything we have done lives in a static world.
Images do not move. Objects do not disappear. There is no notion of time.
Video changes everything.
In videos, segmentation alone is not enough. We also need temporal consistency. If a person appears in frame 1 and walks across the scene, the model must not only segment that person — it must also remember that it is the same person in frame 200.
This is where SAM3 becomes fundamentally different from previous systems.
SAM3 does not treat video as a bag of independent images. Instead, it maintains a streaming memory and a tracking state that allows it to propagate object identities across frames. Detection, segmentation, and tracking are no longer separate steps. They are part of a single, unified pipeline.
In other words, SAM3 does not just answer: “Where is the object in this frame?”
It answers: “Where is this concept over time?”
In this tutorial, we will focus entirely on using SAM3 with video. We will build several practical pipelines that combine detection, segmentation, and tracking into one coherent workflow.
Specifically, we will implement 4 tasks:
- Video Detection, Segmentation, and Tracking Using a Text Prompt
Here, we use a text prompt such as "person" and let SAM3 detect, segment, and track all instances of that concept throughout the video.
- Real-Time Detection, Segmentation, and Tracking Using a Text Prompt via Webcam
This is the same idea, but running in real time on a live camera stream.
- Detection, Segmentation, and Tracking Using a Single Click on an Object
Here, we do not use text. We simply click on an object in the first frame and let SAM3 track it.
- Detection, Segmentation, and Tracking Using Multiple Clicks on Objects
In this case, we select multiple objects interactively and track all of them at once.
Across these examples, we will see the same core idea again and again:
- First, SAM3 recognizes what to track.
- Then, it segments it.
- Finally, it remembers and propagates it through time.
This lesson is the 3rd of a 4-part series on SAM 3:
- SAM 3: Concept-Based Visual Understanding and Segmentation
- Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation
- SAM 3 for Video: Concept-Aware Segmentation and Object Tracking (this tutorial)
- Lesson 4
To learn how to build SAM3-powered video segmentation and tracking pipelines using text prompts, real-time webcam streams, and interactive multi-object tracking inside a dynamic Gradio interface, just keep reading.
Configuring Your Development Environment
To follow this guide, you need to have the following libraries installed on your system.
!pip install --q git+https://github.com/huggingface/transformers av gradio
First, we install the latest development version of transformers directly from GitHub.
We do this because SAM3 is very new. Its video processors, tracker models, and streaming APIs are not yet available in older stable releases. Installing from GitHub ensures we get access to:
- Sam3VideoProcessor and Sam3VideoModel for concept-based segmentation
- Sam3TrackerVideoProcessor and Sam3TrackerVideoModel for video tracking
- The video session and streaming inference utilities
In short, this gives us all SAM3 image and video capabilities in one place.
Next, we install av. This is a Python binding for FFmpeg. We use it to decode video files, read frames efficiently, and handle video streams coming from disk or webcam. Without av, working with video frames would be slow and unreliable.
Finally, we install gradio. We use Gradio to build interactive demos, run webcam-based real-time segmentation, create simple UI components for clicking on objects and visualizing results. This allows us to turn SAM3 into a live, interactive video application, not just a notebook script.
We also pass the --q flag to keep the installation output quiet. This keeps our notebook clean and easy to read.
Setup and Imports
Once the dependencies are installed, we can import the libraries we need for video processing, model execution, and interactive demos.
import cv2
import torch
import numpy as np
import gradio as gr
from PIL import Image
from accelerate import Accelerator
from transformers.video_utils import load_video
from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers import Sam3TrackerVideoModel, Sam3TrackerVideoProcessor
First, we import OpenCV (cv2). We use OpenCV to read frames from videos and webcams, convert between different image formats, and perform basic image and video I/O operations. This forms the low-level video input layer of our system.
Next, we import PyTorch (torch). PyTorch is the backend that runs SAM3. We use it to move models and tensors to GPU or CPU, run inference efficiently, and control precision and memory usage. All heavy computation in this tutorial happens inside PyTorch.
We also import NumPy (numpy). NumPy is used for simple array manipulations, converting between OpenCV images, PIL images, and tensors, and handling masks and frame buffers in a lightweight way.
Then, we import Gradio (gradio). We use Gradio to build interactive demos, create webcam-based real-time segmentation apps, and handle mouse clicks for point-based object selection. This turns our video pipeline into a live, interactive application instead of a static script.
Next, we import PIL’s Image. This is used to convert frames into PIL format when required by the processor, handle RGB image conversions cleanly, and bridge between OpenCV and the Transformers preprocessing pipeline.
Then, we import Accelerator from Hugging Face Accelerate. The Accelerator class helps us automatically place the model on CPU or GPU, write device-agnostic code, and scale to different hardware setups without changing logic. This keeps the code clean and portable.
Next, we import load_video from transformers.video_utils. This utility function loads a video file from disk, decodes it into a list (or generator) of frames, and handles resizing and format conversion in a consistent way. We use this for file-based video experiments.
Finally, we import the SAM3 video models and processors. These are the core components of this tutorial.
- Sam3VideoModel and Sam3VideoProcessor are used for:
  - Text-prompted video segmentation
  - Concept detection on video frames
- Sam3TrackerVideoModel and Sam3TrackerVideoProcessor are used for:
  - Point-based prompting
  - Interactive object selection
  - Multi-object tracking with memory across frames
Together, these 2 model families allow us to cover all 4 workflows:
- Text-prompted tracking
- Webcam-based tracking
- Single-click object tracking
- Multi-click object tracking
Text-Prompt Video Tracking
In this section, we use a natural language prompt such as "person" or "car" to detect, segment, and track objects across an entire video. SAM3 maintains temporal memory, ensuring that object identities remain consistent from the first frame to the last. The result is a fully annotated video with masks, bounding boxes, and tracking IDs propagated over time.
Load the SAM3 Video Model
Before we process any video, we need to load the SAM3 model and its processor.
Since these models are large, we load them once and reuse them across all videos and frames.
# ------------------------------------------------
# Model setup (loaded once)
# ------------------------------------------------
accelerator = Accelerator()
device = accelerator.device
model = Sam3VideoModel.from_pretrained(
"facebook/sam3"
).to(device, dtype=torch.bfloat16)
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
First, we create an Accelerator object. The Accelerator automatically:
- Detects whether a GPU is available
- Chooses the best device (CPU or GPU)
- Handles device placement in a clean and consistent way
By using this, we avoid hardcoding "cuda" or "cpu" anywhere in our code. The device variable now represents where the model will run.
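As a rough mental model, on a single machine Accelerator().device resolves to the GPU when one is available and falls back to the CPU otherwise. The sketch below illustrates that behavior with a hypothetical helper (pick_device is not an Accelerate API; Accelerate also handles multi-GPU and other backends that this one-liner ignores):

```python
import torch

# Illustrative single-machine device selection, similar in spirit to
# what Accelerator().device gives us (pick_device is a hypothetical
# helper, not part of Accelerate).
def pick_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"

dev = pick_device()
print(dev)
```

The real Accelerator object does more (distributed setups, mixed precision), but for this tutorial, device placement is the part we rely on.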
Next, we load the SAM3 Video model. This does 3 important things:
- It downloads the pretrained SAM3 weights from Hugging Face
- It constructs the full video-capable segmentation model
- It moves the model to the selected device (GPU or CPU)
We also explicitly set dtype=torch.bfloat16. This tells PyTorch to run the model in bfloat16 precision. Using bfloat16 reduces memory usage significantly, speeds up inference on modern GPUs, and has almost no impact on segmentation quality. This is especially important because SAM3 is a very large model.
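The memory saving is easy to verify: bfloat16 stores each value in 2 bytes versus 4 bytes for float32, so weights and activations take half the space. A quick illustrative check:

```python
import torch

# bfloat16 uses 2 bytes per element; float32 uses 4.
x32 = torch.zeros(1024, dtype=torch.float32)
x16 = torch.zeros(1024, dtype=torch.bfloat16)

print(x32.element_size())  # 4
print(x16.element_size())  # 2
```

For a multi-billion-parameter model such as SAM3, this halving is often the difference between fitting on a single GPU and not fitting at all.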
Then, we load the processor. The processor is responsible for preprocessing video frames (resizing, normalization, padding), encoding text prompts, formatting inputs for the model, and post-processing outputs (resizing masks back to the original resolution).
At this point, we have:
- A fully loaded SAM3 video model on the correct device
- A video processor that knows how to prepare inputs and decode outputs
Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs
Before we start running video inference, we define a small helper function to visualize segmentation and tracking results.
This function overlays:
- Segmentation masks
- Bounding boxes
- Object IDs
- Confidence scores
directly on top of each video frame.
# ------------------------------------------------
# Visualization helper
# ------------------------------------------------
def overlay_masks_boxes(image, masks, boxes, scores, object_ids, alpha=0.5):
    image = image.copy()
    h, w = image.shape[:2]

    for i, mask in enumerate(masks):
        color = np.random.randint(0, 255, (3,), dtype=np.uint8)

        if mask.shape[-2:] != (h, w):
            mask = cv2.resize(mask.astype(np.uint8), (w, h)) > 0

        colored = np.zeros_like(image)
        colored[mask] = color
        image = cv2.addWeighted(image, 1.0, colored, alpha, 0)

        x1, y1, x2, y2 = boxes[i].astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), color.tolist(), 2)

        label = f"ID {int(object_ids[i])} | {scores[i]:.2f}"
        cv2.putText(
            image,
            label,
            (x1, max(y1 - 5, 15)),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color.tolist(),
            1,
            cv2.LINE_AA,
        )

    return image
First, we create a copy of the input image and read its spatial resolution. We do this to avoid modifying the original frame and ensure all masks are resized to the same height and width.
Next, we loop over each detected object. Each iteration corresponds to one tracked instance in the frame. Inside the loop, we generate a random color for that object. This makes each object visually distinct in the overlay.
Then, we ensure the mask matches the image resolution. Because masks are produced at a lower resolution, we resize them to the full frame size and convert them to a Boolean mask. This ensures perfect alignment with the video frame.
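The helper uses cv2.resize for this; conceptually, upsampling a low-resolution mask is just replicating each mask cell over the corresponding block of frame pixels. A pure-NumPy nearest-neighbor sketch (assuming an integer scale factor, which the real resize does not require):

```python
import numpy as np

# Nearest-neighbor upsampling of a 2x2 boolean mask to 4x4 by
# repeating rows and columns (illustrative; the tutorial uses
# cv2.resize on a uint8 mask followed by a > 0 threshold).
mask_lo = np.array([[1, 0],
                    [0, 1]], dtype=bool)
scale = 2
mask_hi = np.repeat(np.repeat(mask_lo, scale, axis=0), scale, axis=1)

print(mask_hi.shape)  # (4, 4)
```

Either way, the result is a boolean mask with exactly the frame's height and width, so it can index frame pixels directly.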
Next, we create a colored overlay and blend it with the image. This:
- Paints the object region with the selected color
- Blends it with transparency (alpha)
- Keeps the original image visible underneath
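The blend itself is a weighted sum: cv2.addWeighted(image, 1.0, colored, alpha, 0) computes clip(1.0 * image + alpha * colored), saturating at 255. A small NumPy equivalent makes the arithmetic explicit (blend_overlay is a hypothetical helper for illustration):

```python
import numpy as np

# NumPy sketch of cv2.addWeighted(image, 1.0, colored, alpha, 0):
# weighted sum, rounded, then saturated back to uint8.
def blend_overlay(image: np.ndarray, colored: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    out = image.astype(np.float32) + alpha * colored.astype(np.float32)
    return np.clip(np.round(out), 0, 255).astype(np.uint8)

frame = np.full((2, 2, 3), 200, dtype=np.uint8)
overlay = np.zeros_like(frame)
overlay[0, 0] = (0, 255, 0)  # the "mask" covers a single pixel

blended = blend_overlay(frame, overlay, alpha=0.5)
print(blended[0, 0].tolist())  # [200, 255, 200]: masked pixel shifts toward green
print(blended[1, 1].tolist())  # [200, 200, 200]: unmasked pixel is unchanged
```

Because only masked pixels receive a nonzero overlay, the background stays untouched while object regions get a translucent tint.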
Then, we draw the bounding box for the same object. This helps us visually confirm the detection region and the spatial extent of each tracked object.
Next, we prepare a label string. This shows the tracking ID assigned by SAM3 and the confidence score of the detection.
We then render this text near the bounding box. We slightly shift the text upward to avoid overlapping with the box. Finally, after all objects are drawn, we return the annotated frame.
Main Pipeline: Running the Full Video Segmentation and Tracking Workflow
Now we build the main function that runs SAM3 on a video using a text prompt and produces an annotated output video.
This function does 4 things:
- Loads the video frames
- Initializes a SAM3 video session with memory
- Propagates segmentation and tracking across frames
- Writes an annotated video to disk
Let us walk through the full pipeline step by step.
# ------------------------------------------------
# Main pipeline
# ------------------------------------------------
def run_sam3_video(video_path, text_prompt):
    # Load frames for SAM3
    video_frames, _ = load_video(video_path)

    # Reliable FPS extraction (OpenCV)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    if fps is None or fps <= 0:
        fps = 25.0
    fps = float(fps)

    # Init SAM3 session
    inference_session = processor.init_video_session(
        video=video_frames,
        inference_device=device,
        processing_device="cpu",
        video_storage_device="cpu",
        dtype=torch.bfloat16,
    )
    inference_session = processor.add_text_prompt(
        inference_session=inference_session,
        text=text_prompt,
    )

    outputs_per_frame = {}
    for model_outputs in model.propagate_in_video_iterator(
        inference_session=inference_session,
        max_frame_num_to_track=len(video_frames),
    ):
        processed = processor.postprocess_outputs(
            inference_session,
            model_outputs,
        )
        outputs_per_frame[model_outputs.frame_idx] = processed

    # Prepare output video
    h, w = video_frames[0].shape[:2]
    out_path = "sam3_annotated.mp4"
    writer = cv2.VideoWriter(
        out_path,
        cv2.VideoWriter_fourcc(*"mp4v"),
        fps,
        (w, h),
    )

    for idx, frame in enumerate(video_frames):
        outputs = outputs_per_frame.get(idx)
        if outputs and len(outputs["object_ids"]) > 0:
            frame = overlay_masks_boxes(
                frame,
                outputs["masks"].cpu().numpy(),
                outputs["boxes"].cpu().numpy(),
                outputs["scores"].cpu().numpy(),
                outputs["object_ids"].cpu().numpy(),
            )
        # OpenCV expects BGR
        writer.write(frame[:, :, ::-1])

    writer.release()
    return out_path, out_path
First, we load the video frames using the Transformers utility. This returns:
- A list of RGB frames as NumPy arrays
- (Optionally) audio, which we ignore here
Next, we extract the frame rate using OpenCV. We do this because:
- We want the output video to play at the same speed as the input
- Some video loaders do not always return reliable FPS metadata
If FPS is missing or invalid, we fall back to a default value.
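Isolated from the pipeline, the guard looks like this (safe_fps is a hypothetical name; the tutorial inlines the same logic):

```python
# FPS guard: use the reported frame rate only if it is a positive
# number; otherwise fall back to a sane default.
def safe_fps(raw_fps, default=25.0):
    if raw_fps is None or raw_fps <= 0:
        return float(default)
    return float(raw_fps)

print(safe_fps(29.97))  # 29.97
print(safe_fps(0))      # 25.0
print(safe_fps(None))   # 25.0
```

Note that the None check must come first: comparing None with <= would raise a TypeError in Python 3.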
Instead of processing each frame independently, SAM3 uses a video session that maintains the following:
- Temporal memory
- Object identities
- Propagation state
We initialize the session using processor.init_video_session(...). We pass:
- The full list of frames
- The device for model inference (GPU or CPU)
- CPU for processing and storage (to save GPU memory)
- dtype=torch.bfloat16 for efficient computation
This session object now holds the entire video, memory, and tracking state.
Next, we tell SAM3 what concept we want to track using processor.add_text_prompt(...).
For example:
"person""car""player in red jersey"
From this point on, the session is configured to: detect, segment, and track all instances of this concept across the video.
The real work happens when we call model.propagate_in_video_iterator(...).
This function:
- Processes frames sequentially
- Uses memory to propagate masks and identities
- Emits results for each frame
For each frame:
- We post-process raw model outputs
- Convert them into usable masks, boxes, scores, and IDs
- Store them in a dictionary indexed by frame number
At the end of this loop, we have a full timeline of segmentation and tracking results.
We now prepare an OpenCV video writer.
- Same resolution as input
- Same FPS
- MP4 format
We now loop over each frame. If SAM3 produced results for that frame: we overlay masks, boxes, IDs, and scores using our helper function. Then we write the frame to the output video. We convert RGB → BGR because OpenCV expects BGR format.
We close the video writer and return the path to the annotated video. At this point, we have a complete pipeline that: takes a video + text prompt → returns a fully segmented and tracked video.
Launch the Gradio Application
Now that our video pipeline is ready, we wrap everything into a simple Gradio web interface.
This allows us to:
- Upload a video
- Enter a text prompt (e.g., "person")
- Run SAM3 segmentation and tracking
- Preview the result
- Download the annotated video
# ------------------------------------------------
# Gradio UI
# ------------------------------------------------
with gr.Blocks() as demo:
    gr.Markdown("# 🎥 SAM3 Video Segmentation & Tracking")

    with gr.Row():
        video_input = gr.Video(label="Upload Video")
        prompt = gr.Textbox(
            label="Text Prompt",
            placeholder="e.g. person, chair, bed",
            value="person",
        )

    run_btn = gr.Button("Run Segmentation")
    video_out = gr.Video(label="Annotated Output")
    download = gr.File(label="Download Video")

    run_btn.click(
        fn=run_sam3_video,
        inputs=[video_input, prompt],
        outputs=[video_out, download],
    )

demo.launch(debug=True)
gr.Blocks() lets us build a custom UI layout instead of a simple one-function demo. Everything inside this block becomes part of our web interface.
This is just a header that appears at the top of the page.
Here, we place 2 widgets side by side:
- A video uploader
- A text box for the concept prompt
The default value is "person", so the app works immediately without typing anything.
This button will trigger the SAM3 pipeline.
We create 2 outputs:
- One to preview the annotated video
- One to download the result file
It tells Gradio: When the button is clicked, call run_sam3_video(video, prompt) and display its outputs.
Recall that our function returns 2 values.
So:
- The first output goes to the video preview
- The second output goes to the download widget
This starts a local web server and opens the interface in the browser.
The debug=True flag helps:
- Show errors in the console
- Make debugging easier during development
At this point, we have a complete application:
- upload a video
- type a concept
- get a fully segmented and tracked output video
This completes the text-prompt video segmentation and tracking pipeline.
Output: Text-Prompt Video Segmentation and Tracking Results
Real-Time Text-Prompt Tracking (Webcam)
Here, we extend text-prompt tracking to a live webcam stream. Instead of processing a preloaded video, frames arrive continuously, and SAM3 updates its tracking state in real time. This enables live, concept-aware segmentation with stable object identities across streaming frames.
Helper Function: Stable Color Overlays for Real-Time Video Tracking
In the previous pipeline, we used a visualization helper to draw masks and bounding boxes on each frame. For streaming and webcam scenarios, we slightly modify this helper so that each tracked object keeps a stable color across frames.
This small change makes tracking much easier to follow visually, especially when objects move across the scene.
Here is the updated helper:
# ------------------------------------------------
# Visualization helper
# ------------------------------------------------
def overlay_masks_boxes(image, masks, boxes, scores, object_ids, alpha=0.5):
    image = image.copy()
    h, w = image.shape[:2]

    for i, mask in enumerate(masks):
        # Stable color per object id
        rng = np.random.default_rng(int(object_ids[i]))
        color = rng.integers(0, 255, size=3, dtype=np.uint8)

        if mask.shape[-2:] != (h, w):
            mask = cv2.resize(mask.astype(np.uint8), (w, h)) > 0

        colored = np.zeros_like(image)
        colored[mask] = color
        image = cv2.addWeighted(image, 1.0, colored, alpha, 0)

        x1, y1, x2, y2 = boxes[i].astype(int)
        cv2.rectangle(image, (x1, y1), (x2, y2), color.tolist(), 2)

        label = f"ID {int(object_ids[i])} | {scores[i]:.2f}"
        cv2.putText(
            image,
            label,
            (x1, max(y1 - 5, 15)),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color.tolist(),
            1,
            cv2.LINE_AA,
        )

    return image
Let us walk through what changed and why it matters.
In streaming inference, objects appear across many frames. If colors change every frame, it becomes hard to follow which object is which.
To solve this, we generate colors based on the object ID:
Here, on Lines 10 and 11:
- The object ID is used as the random seed.
- The same object always produces the same color.
- Tracking becomes visually consistent across frames.
So if object ID 3 appears in 200 frames, it will always use the same color.
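We can confirm this determinism directly: seeding NumPy's Generator with the object ID makes the color a pure function of the ID (id_to_color is a hypothetical helper name; the tutorial inlines these two lines):

```python
import numpy as np

# The same seed always yields the same RGB triple, so a tracked
# object keeps one color for its entire lifetime.
def id_to_color(object_id: int) -> np.ndarray:
    rng = np.random.default_rng(int(object_id))
    return rng.integers(0, 255, size=3, dtype=np.uint8)

c1 = id_to_color(3)
c2 = id_to_color(3)
print(np.array_equal(c1, c2))  # True
```

Different IDs get independently drawn colors, so in practice, distinct objects almost always look visually distinct as well.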
Sometimes masks are produced at a lower resolution, so we resize them to match the frame. Lines 13 and 14 ensure masks perfectly align with the streaming frame.
Next, we create a colored mask and blend it with the original frame on Lines 16-18. The transparency factor alpha controls how strongly the mask appears.
We then draw bounding boxes for each tracked object. Line 21 helps confirm detection regions visually.
Finally, we render the object ID and confidence score and draw it above the box. This allows us to verify identity consistency and inspect tracking confidence per frame (Lines 23-33).
Streaming Inference Function: Maintaining Temporal Memory Across Live Video Frames
So far, we processed videos by loading all frames first and then propagating segmentation across the entire clip.
For webcam or live-stream scenarios, this approach does not work because frames arrive one at a time. Instead, we need a streaming pipeline that:
- Maintains tracking memory across frames
- Processes each incoming frame independently
- Updates segmentation and tracking state continuously
To achieve this, we maintain a persistent session and reuse it across frames.
# ------------------------------------------------
# Global streaming state (kept across frames)
# ------------------------------------------------
STREAM_STATE = {
    "session": None,
    "prompt": None,
}
First, we create a small global structure that keeps track of the active inference session.
This dictionary stores:
- The active SAM3 inference session
- The currently used text prompt
This allows us to reuse the same session across frames instead of recreating it repeatedly.
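The reuse-or-reset rule is simple enough to sketch with a stub session factory (make_session here is a stand-in for illustration; the real code calls processor.init_video_session and add_text_prompt):

```python
# Session cache: reuse the session while the prompt is unchanged,
# rebuild it when the prompt changes.
STATE = {"session": None, "prompt": None}

def make_session(prompt):
    # Stand-in for processor.init_video_session + add_text_prompt.
    return {"prompt": prompt, "memory": []}

def get_session(prompt):
    if STATE["session"] is None or STATE["prompt"] != prompt:
        STATE["session"] = make_session(prompt)
        STATE["prompt"] = prompt
    return STATE["session"]

s1 = get_session("person")
s2 = get_session("person")   # same prompt -> same session object
s3 = get_session("car")      # new prompt -> fresh session
print(s1 is s2, s1 is s3)    # True False
```

Resetting on a prompt change matters: tracking memory built for "person" is meaningless for "car", so stale state must be discarded rather than reused.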
# ------------------------------------------------
# Streaming inference function
# ------------------------------------------------
def process_webcam_frame(frame, text_prompt):
    global STREAM_STATE
    if frame is None:
        return None

    # Initialize session if needed or if prompt changed
    if (
        STREAM_STATE["session"] is None
        or STREAM_STATE["prompt"] != text_prompt
    ):
        session = processor.init_video_session(
            inference_device=device,
            processing_device="cpu",
            video_storage_device="cpu",
            dtype=torch.bfloat16,
        )
        session = processor.add_text_prompt(
            inference_session=session,
            text=text_prompt,
        )
        STREAM_STATE["session"] = session
        STREAM_STATE["prompt"] = text_prompt

    session = STREAM_STATE["session"]

    # Preprocess frame
    inputs = processor(images=frame, device=device, return_tensors="pt")

    # Streaming forward pass
    with torch.no_grad():
        model_outputs = model(
            inference_session=session,
            frame=inputs.pixel_values[0],
            reverse=False,
        )

    # Postprocess to original resolution
    outputs = processor.postprocess_outputs(
        session,
        model_outputs,
        original_sizes=inputs.original_sizes,
    )

    # Visualize
    if outputs and len(outputs["object_ids"]) > 0:
        frame = overlay_masks_boxes(
            frame,
            outputs["masks"].cpu().numpy(),
            outputs["boxes"].cpu().numpy(),
            outputs["scores"].cpu().numpy(),
            outputs["object_ids"].cpu().numpy(),
        )

    return frame
Now we define the function that processes each incoming webcam frame. We reference the global stream state defined earlier so that the session persists between calls. If no frame is available, we simply return None (Lines 14 and 15).
Next, we check whether a session already exists or whether the user has changed the prompt. If either condition is true, we create a new session. Unlike offline video processing, we do not pass frames during initialization because frames arrive one at a time. We then attach the prompt and store the session globally (Lines 18-33).
This session (Line 35) now contains:
- Tracking memory
- Object identity history
- Propagation state
across previous frames.
To convert each frame into tensors before inference, we call the processor, which handles resizing, normalization, tensor formatting, and device placement (Line 38).
We now run inference for the current frame (Lines 41-46). Key points:
torch.no_grad()disables gradients, improving speed.- The frame is processed using the existing session.
- SAM3 updates tracking memory internally.
So segmentation and identities propagate automatically.
Model outputs are resized back to original resolution. This produces masks, bounding boxes, scores, and object IDs for the current frame (Lines 49-53).
If objects are detected, we overlay results on the frame. This produces the annotated frame. The processed frame is returned to the UI or video stream (Lines 56-65).
Launch the Gradio Application
Now we connect our streaming inference pipeline to a live webcam interface using Gradio.
This allows us to:
- Capture frames directly from a webcam
- Run SAM3 segmentation and tracking in real time
- Visualize masks and tracked objects continuously
# ------------------------------------------------
# Gradio UI
# ------------------------------------------------
with gr.Blocks() as demo:
    gr.Markdown("# 📷 SAM3 Live Webcam Segmentation & Tracking")

    with gr.Row():
        webcam = gr.Image(
            sources=["webcam"],
            streaming=True,
            label="Webcam",
            type="numpy",
        )
        output = gr.Image(
            label="Live Segmentation",
            type="numpy",
        )

    prompt = gr.Textbox(
        label="Text Prompt",
        value="person",
        placeholder="e.g. person, face, chair, bottle",
    )

    webcam.stream(
        fn=process_webcam_frame,
        inputs=[webcam, prompt],
        outputs=output,
    )

demo.launch(debug=True)
Line 1 creates the Gradio app layout where we can combine multiple UI components. Line 2 displays a header explaining what the demo does.
We place input and output side by side. Inside gr.Row, we define 2 components:
First component:
- Captures frames from the webcam
- Streams frames continuously
- Sends frames as NumPy arrays to our function
The streaming=True flag enables real-time frame delivery.
The second component displays the processed frame returned by our streaming pipeline.
Next, we create a textbox for specifying the concept to track. Users can change this dynamically while the webcam runs. If the prompt changes, our pipeline automatically resets the session.
The key connection happens on Lines 26-30. This tells Gradio:
- Send every webcam frame to
process_webcam_frame - Pass along the current text prompt
- Display the returned frame in the output panel
This loop runs continuously while the webcam is active.
Finally, we launch the interface. This starts a local server and opens the demo in a browser. The debug=True flag helps diagnose errors during development.
Output: Real-Time Webcam Video Segmentation Results
Single-Click Object Tracking
In this workflow, we remove text prompts and select an object by clicking on it in the first frame. SAM3 segments the clicked object and propagates its mask throughout the video using its tracking memory. With just one foreground point, we obtain consistent object tracking across the full sequence.
Load the SAM3 Tracker Video Model
So far, we used text prompts to detect and track concepts across videos and live streams.
Now we move to a different workflow: interactive tracking, where we manually select objects and let SAM3 track them across frames.
To enable this, we switch from the text-prompt video model to the tracker-specific video model.
Here is how we load it:
# Initialize model
device = Accelerator().device
model = Sam3TrackerVideoModel.from_pretrained("facebook/sam3").to(device, dtype=torch.bfloat16)
processor = Sam3TrackerVideoProcessor.from_pretrained("facebook/sam3")
The Accelerator().device automatically selects the available hardware:
- GPU if available
- CPU otherwise
This keeps our code portable across machines.
We load the SAM3 tracker model, which is optimized for:
- Point-based object prompting
- Interactive tracking
- Multi-object identity propagation
- Frame-to-frame tracking consistency
Unlike the previous model, this one does not require text prompts. Instead, it expects clicks or point annotations.
We also move the model to the selected device and run it in bfloat16 precision, reducing memory usage and speeding up inference.
We load the processor which prepares inputs and postprocesses outputs specifically for tracking workflows. It handles frame preprocessing, prompt encoding (clicks or points), mask decoding, and identity propagation formatting.
Extract First Frame: Preparing the Initial Frame for Object Selection
Before we start tracking objects in a video, we need a way to select them.
In our workflow, we select objects by clicking on them in the first frame. So, the first step is to extract that frame from the video.
Here is a small helper function that does exactly that:
def extract_first_frame(video_path):
    """Extract first frame from video for point selection"""
    cap = cv2.VideoCapture(video_path)
    ret, frame = cap.read()
    cap.release()
    if ret:
        return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    return None
First, OpenCV opens the video file and prepares it for frame reading. Then it attempts to read the next frame from the video.
- ret indicates whether reading succeeded.
- frame contains the actual image data.
Since we just opened the video, this call returns the first frame.
We release the video resource immediately after reading the frame. This is important because:
- It frees system resources
- It prevents file locks
- It keeps later video processing clean
OpenCV loads images in BGR format, but most visualization and processing pipelines expect RGB. So we convert BGR to RGB before returning the frame.
If the frame cannot be read, the function safely returns None.
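For a 3-channel image, the BGR-to-RGB conversion is just a reversal of the channel axis; cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) produces the same pixel values as the pure-NumPy slice below (illustrative equivalence for 3-channel images; cvtColor also handles other layouts):

```python
import numpy as np

# One pure-blue pixel in BGR order: B=255, G=0, R=0.
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the last axis swaps B and R, giving RGB order.
rgb = bgr[:, :, ::-1]
print(rgb[0, 0].tolist())  # [0, 0, 255]: still blue, now in the B slot
```

The slice view is cheap (no copy), but note that some OpenCV functions want contiguous arrays; np.ascontiguousarray fixes that when needed.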
Tracking Object Function: Propagating a Single Object Mask Across Video Frames
We now build the core function that allows us to track an object through an entire video using a single click.
The workflow is simple:
- The user uploads a video.
- The user clicks on the object in the first frame.
- SAM3 segments that object.
- The tracker propagates the object mask across all frames.
- A new annotated video is generated.
Let us walk through the implementation step by step:
def track_object(video_path, point_coords):
    """Track object through video based on clicked point"""
    if video_path is None:
        return None, "Please upload a video first"
    if point_coords is None:
        return None, "Please click on the first frame to select an object"
    try:
        # Load video
        video_frames, _ = load_video(video_path)
        # Get click coordinates
        x, y = int(point_coords[0]), int(point_coords[1])
        # Initialize session
        inference_session = processor.init_video_session(
            video=video_frames,
            inference_device=device,
            dtype=torch.bfloat16,
        )
        # Add point annotation
        points = [[[[x, y]]]]
        labels = [[[1]]]  # 1 for foreground point
        processor.add_inputs_to_inference_session(
            inference_session=inference_session,
            frame_idx=0,
            obj_ids=1,
            input_points=points,
            input_labels=labels,
        )
        # First, segment the object on the first frame
        outputs = model(
            inference_session=inference_session,
            frame_idx=0,
        )
        first_frame_masks = processor.post_process_masks(
            [outputs.pred_masks],
            original_sizes=[[inference_session.video_height, inference_session.video_width]],
            binarize=False
        )[0]
        # Propagate through video
        video_segments = {0: first_frame_masks}
        for sam3_tracker_video_output in model.propagate_in_video_iterator(inference_session):
            video_res_masks = processor.post_process_masks(
                [sam3_tracker_video_output.pred_masks],
                original_sizes=[[inference_session.video_height, inference_session.video_width]],
                binarize=False
            )[0]
            video_segments[sam3_tracker_video_output.frame_idx] = video_res_masks
        # Create output video with masks
        output_path = "/tmp/output_tracked.mp4"
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        height, width = video_frames[0].shape[:2]
        out = cv2.VideoWriter(output_path, fourcc, 30.0, (width, height))
        for idx in range(len(video_frames)):
            frame = video_frames[idx].copy().astype(np.uint8)
            if idx in video_segments:
                masks = video_segments[idx]
                # Convert mask to float32 first, then to boolean
                mask = masks[0, 0].float().cpu().numpy() > 0.0
                # Overlay red mask
                overlay = frame.copy()
                overlay[mask] = [255, 0, 0]
                frame = cv2.addWeighted(frame.astype(np.float32), 0.6, overlay.astype(np.float32), 0.4, 0).astype(np.uint8)
            out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        out.release()
        status = f"✅ Successfully tracked object through {len(video_segments)} frames at point ({x}, {y})"
        return output_path, status
    except Exception as e:
        return None, f"❌ Error: {str(e)}"
The track_object() function accepts:
- video_path: Path to the uploaded video file.
- point_coords: (x, y) coordinates of the user’s click on the first frame.
The goal is simple: “Given a video and one clicked point, track that object through the entire video.”
If no video is uploaded, tracking cannot begin. The function returns:
- None: No output video
- A message explaining the issue
Likewise, tracking requires a foreground prompt. SAM3 needs at least one positive point to know: which object to segment, and where that object exists in the first frame. Without it, tracking is undefined.
Inside a try block, the load_video() function reads the video file, extracts all frames into memory, and returns them as a list of NumPy arrays.
Why load all frames?
Because SAM3 tracking requires:
- Access to the entire temporal sequence
- Mask propagation across frames
- Internal memory consistency
Each frame shape is typically: (height, width, 3). The UI provides floating-point coordinates. We convert them to integers because:
- Pixel indices must be integers
- Mask indexing requires integer positions
This (x, y) now represents a foreground location inside frame 0.
Next, we initialize a video session which creates an internal tracking session, stores all video frames, model memory state, and object tracking buffers. We also set computation device (either CPU or GPU) and use bfloat16 for faster inference and lower memory usage. This prepares SAM3’s brain to process a full video.
Then, we prepare a foreground prompt. Since SAM3 expects inputs in batch format, points = [[[[x, y]]]] means:
- Batch size = 1
- One object
- One point for that object
- The point’s (x, y) coordinates
In labels = [[[1]]], the 1 marks a foreground point; a 0 would mark a background point. So this tells SAM3: “This pixel belongs to the object.”
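To make the nesting concrete, here is a small sketch (the click coordinates are arbitrary) that unpacks the [batch][object][point][xy] layout:

```python
# Hypothetical click at (320, 240); the nesting mirrors SAM3's batch format
x, y = 320, 240
points = [[[[x, y]]]]  # [batch][object][point][xy]
labels = [[[1]]]       # [batch][object][point]; 1 = foreground, 0 = background

assert len(points) == 1               # batch size 1
assert len(points[0]) == 1            # one object in the batch
assert len(points[0][0]) == 1         # one point for that object
assert points[0][0][0] == [320, 240]  # the (x, y) coordinates
assert labels[0][0][0] == 1           # foreground label
```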
Then processor.add_inputs_to_inference_session() injects the prompt into the tracking session.
- frame_idx=0: Object exists in the first frame
- obj_ids=1: This is object ID 1
- input_points: Where the object is
- input_labels: Foreground signal
At this point, the model knows that object 1 is located at (x, y) in frame 0.
We explicitly run segmentation on frame 0. This produces:
- Raw mask logits
- Low-resolution mask predictions
Then processor.post_process_masks():
- Resizes masks to original resolution
- Converts internal representation into full-size masks
- Keeps them as float probabilities (not binarized)
We now have a full-resolution mask for frame 0.
We store the frame 0 results first. Then model.propagate_in_video_iterator() runs SAM3’s tracking mechanism.
What happens internally:
- It uses memory from frame 0
- Matches object appearance across frames
- Predicts masks for each new frame
For each frame, processor.post_process_masks(...) resizes the masks, and we store them in a dictionary keyed by frame index.
Final structure:
video_segments = {
    0: mask0,
    1: mask1,
    2: mask2,
    ...
}
Now we have segmentation results for the full video.
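The propagation loop can be sketched with a stand-in iterator. Here, fake_propagate is a hypothetical stub that mimics the per-frame outputs (frame_idx and pred_masks) of model.propagate_in_video_iterator(); the point is the dictionary-building pattern, not the model itself:

```python
from types import SimpleNamespace

def fake_propagate(num_frames=3):
    """Hypothetical stub yielding per-frame outputs like the real iterator."""
    for i in range(1, num_frames + 1):
        yield SimpleNamespace(frame_idx=i, pred_masks=f"mask_{i}")

# Frame 0 is segmented explicitly; the remaining frames come from propagation
video_segments = {0: "mask_0"}
for output in fake_propagate():
    video_segments[output.frame_idx] = output.pred_masks

print(sorted(video_segments))  # [0, 1, 2, 3]
```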
Next, we define the output video path and set the codec to mp4v. We read the resolution from the first frame so the output video matches the input resolution. We then initialize the OpenCV writer with cv2.VideoWriter(), passing the output path, the codec, 30 FPS, and the frame size as (width, height).
We then iterate over each frame and create a copy to avoid modifying the original frames. We move the output tensor to the CPU, convert it to a NumPy array, and threshold it at zero to obtain a Boolean mask. The resulting mask is True where the object exists and False elsewhere.
We then overlay a red mask and blend it using cv2.addWeighted(), producing a frame with 60% original content and 40% red overlay for smooth visualization. Because OpenCV expects BGR format, we convert the frame using cv2.cvtColor() with the cv2.COLOR_RGB2BGR flag.
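The overlay-and-blend step can be reproduced with plain NumPy. In this sketch (with arbitrary example data), the weighted sum matches what cv2.addWeighted computes with weights 0.6 and 0.4:

```python
import numpy as np

# A 4x4 gray frame and a mask covering its center (arbitrary example data)
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

# Paint masked pixels red on a copy, then blend 60/40 like cv2.addWeighted
overlay = frame.copy()
overlay[mask] = [255, 0, 0]
blended = (0.6 * frame.astype(np.float32) + 0.4 * overlay.astype(np.float32)).astype(np.uint8)

print(blended[0, 0].tolist())  # [200, 200, 200] -> background is untouched
print(blended[1, 1].tolist())  # [222, 120, 120] -> reddish tint on the object
```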
out.release() finalizes the file, writes any remaining buffers, and closes the video properly. Without this call, the output file may become corrupted.
Finally, we return the path to the saved video and a success message. If an error occurs:
- The function safely returns an error message
- The application does not crash
Launch the Gradio Application
We now connect our tracking pipeline to an interactive Gradio interface. This interface allows users to upload a video, click on an object in the first frame, and automatically track that object across the entire clip.
Here is the full interface code:
# Create Gradio interface with blocks for better control
with gr.Blocks(title="SAM3 Video Tracker") as demo:
    gr.Markdown("# 🎯 SAM3 Video Object Tracker")
    gr.Markdown("Upload a video and click on an object in the first frame to track it throughout the video")
    with gr.Row():
        with gr.Column():
            video_input = gr.Video(label="Upload Video")
            first_frame = gr.Image(label="Click on object to track", type="numpy")
            point_display = gr.Textbox(label="Selected Point", interactive=False)
            track_btn = gr.Button("Track Object", variant="primary")
        with gr.Column():
            video_output = gr.Video(label="Tracked Video")
            status_output = gr.Textbox(label="Status")
    # Store clicked point
    clicked_point = gr.State(None)
    # Extract first frame when video is uploaded
    def on_video_upload(video):
        if video:
            frame = extract_first_frame(video)
            return frame, None, "Upload complete. Click on the object you want to track."
        return None, None, ""
    video_input.change(
        on_video_upload,
        inputs=[video_input],
        outputs=[first_frame, clicked_point, status_output]
    )
    # Handle click on first frame
    def on_click(img, evt: gr.SelectData):
        x, y = evt.index[0], evt.index[1]
        # Draw a circle on the clicked point
        img_copy = img.copy()
        cv2.circle(img_copy, (x, y), 5, (255, 0, 0), -1)
        return img_copy, (x, y), f"Point selected: ({x}, {y})"
    first_frame.select(
        on_click,
        inputs=[first_frame],
        outputs=[first_frame, clicked_point, point_display]
    )
    # Track button
    track_btn.click(
        track_object,
        inputs=[video_input, clicked_point],
        outputs=[video_output, status_output]
    )

# Launch
demo.launch(debug=True)
The Gradio interface is built using gr.Blocks, which gives full control over layout, components, and event handling. The goal is simple: “Allow a user to upload a video, click on a single object in the first frame, and track that object throughout the entire video.”
At the top of the interface, we display two gr.Markdown() sections. The first acts as the main heading so users immediately understand what the application does. The second provides short instructions explaining the workflow: upload a video, click on an object in the first frame, and then track it.
Next, we structure the layout using a gr.Row(). Inside that row, we create two columns. The left column contains all inputs and interactions. The right column displays outputs.
In the left column, we first add a video upload component. This allows the user to upload a video file from their system. Once uploaded, the backend receives the file path. That file path is later used to extract frames and run tracking.
Below the video upload, we place an image component. This image will display the first frame of the uploaded video. We set its type to NumPy so that the backend receives the frame as a NumPy array. This is important because we draw visual markers on the frame using OpenCV when the user clicks.
Below the image, we add a textbox labeled "Selected Point". This textbox is non-interactive, meaning the user cannot manually edit it. It simply displays the coordinates of the selected point so the user can confirm their click.
Under that, we add a "Track Object" button. This button is styled as primary so it visually stands out as the main action. When clicked, it triggers the tracking pipeline.
In the right column, we create a video output component. This will display the processed video after tracking is complete. Below it, we add a status textbox. This displays messages such as upload confirmation, tracking success, or error details.
To maintain interaction state, we use a Gradio State variable called clicked_point. This variable stores the coordinates of the selected object. Initially, it is set to None. State is important because the click event and the tracking event happen at different times, and we need a way to remember which point the user selected.
When a video is uploaded, a function on_video_upload() is triggered. This function checks whether a valid video exists. If it does, we extract the first frame using a helper function. That first frame is returned to the image component so the user can see it. We also reset the stored clicked point to None, ensuring that any previous selection is cleared. Finally, we return a status message informing the user that the upload is complete and they should click on an object.
If no video is uploaded, the function returns empty values, keeping the interface clean.
When the user clicks on the first frame image, another function on_click() handles the event. The click event provides the pixel coordinates of the selected location through evt.index. We extract the x and y coordinates from that event data.
Next, we create a copy of the displayed image. This is important because we do not want to modify the original image directly. On this copy, we draw a small filled circle at the clicked location using OpenCV. The circle visually marks the selected object so the user knows exactly where they clicked.
After drawing the marker, we return 3 things:
- The updated image with the circle drawn
- The tuple (x, y) stored in the state variable
- A formatted string such as "Point selected: (x, y)" displayed in the textbox
This ensures the UI updates immediately and the selected point is stored for later use.
The Track Object button is connected to the backend tracking function. When pressed, it sends 2 inputs:
- The uploaded video path
- The stored clicked point
The tracking function then performs segmentation and mask propagation across the entire video. Once processing is complete, it returns:
- The path to the output tracked video
- A status message indicating success or failure
These outputs are displayed in the video output component and the status textbox, respectively.
Finally, the application is launched with debug mode enabled. Debug mode prints detailed logs in case of errors, which is helpful during development and testing.
The complete flow works as follows:
- The user uploads a video.
- The first frame is extracted and displayed.
- The user clicks on an object.
- A visual marker appears and the coordinates are stored.
- The user presses Track Object.
- The backend processes the video and returns the tracked result.
- The output video and status message are displayed.
This design keeps the interface simple, intuitive, and focused on a single-click tracking workflow while properly managing state and user interaction.
Output: Single-Click Video Object Tracking Results
Multi-Click Object Tracking
In this final setup, we select multiple objects by clicking different locations in the first frame. Each click initializes a unique object ID, and SAM3 tracks all selected objects simultaneously. The output video shows multiple masks with distinct colors, preserving identity consistency across frames.
Initialize Few Colors: Defining a Color Palette for Multi-Object Tracking Visualization
When tracking multiple objects at the same time, visualization becomes very important. If all objects share the same mask color, it becomes difficult to understand which mask corresponds to which object.
To solve this, we assign different colors to different tracked objects.
Here is a small color palette we use:
# Different colors for different objects
COLORS = [
    [255, 0, 0],    # Red
    [0, 255, 0],    # Green
    [0, 0, 255],    # Blue
    [255, 255, 0],  # Yellow
    [255, 0, 255],  # Magenta
    [0, 255, 255],  # Cyan
    [255, 128, 0],  # Orange
    [128, 0, 255],  # Purple
]
Each entry in this list represents an RGB color used to render masks and overlays for a tracked object.
For example:
- Object 1 may appear in red
- Object 2 in green
- Object 3 in blue
- and so on.
During visualization, we typically assign colors based on object index or object ID, cycling through the list if the number of objects exceeds the available colors.
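The cycling itself is a simple modulo lookup. A quick sketch (with the palette abbreviated to three entries for brevity):

```python
# Abbreviated palette for the sketch; the full tutorial palette has 8 colors
COLORS = [
    [255, 0, 0],  # Red
    [0, 255, 0],  # Green
    [0, 0, 255],  # Blue
]

def color_for(obj_idx):
    """Cycle through the palette when objects outnumber colors."""
    return COLORS[obj_idx % len(COLORS)]

print(color_for(0), color_for(2), color_for(3))  # index 3 wraps back to red
```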
Extract First Frame: Preparing the First Frame for Multi-Object Selection
For multi-object tracking, we again begin by extracting the first frame of the video. This frame is used as the interaction surface where users click on multiple objects they want to track.
The helper function below reads the first frame from the uploaded video.
def extract_first_frame(video_path):
    """Extract first frame from video for point selection"""
    cap = cv2.VideoCapture(video_path)
    ret, frame = cap.read()
    cap.release()
    if ret:
        return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    return None
First, we open the video using OpenCV’s VideoCapture. Immediately after opening, we read a single frame. Since the video has just been opened, this corresponds to the very first frame.
Next, we release the video handle to free system resources and avoid locking the file for later processing steps.
OpenCV loads images in BGR format, but our visualization and model pipelines expect RGB images. Therefore, we convert the frame from BGR to RGB before returning it.
If frame extraction fails, the function safely returns None, allowing the application to handle the error gracefully.
This function now allows users to click on multiple objects in the first frame, which we will use as prompts for tracking several objects simultaneously in the next step.
Tracking Object Function: Tracking Multiple Objects with Unique IDs Across Video Frames
def track_objects(video_path, points_list):
    """Track multiple objects through video based on clicked points"""
    if video_path is None:
        return None, "Please upload a video first"
    if not points_list or len(points_list) == 0:
        return None, "Please click on at least one object in the first frame"
    try:
        # Load video
        video_frames, _ = load_video(video_path)
        # Initialize session
        inference_session = processor.init_video_session(
            video=video_frames,
            inference_device=device,
            dtype=torch.bfloat16,
        )
        # Prepare points for all objects
        obj_ids = list(range(1, len(points_list) + 1))
        input_points = [[[[int(x), int(y)]] for x, y in points_list]]
        input_labels = [[[1] for _ in points_list]]  # All are foreground points
        # Add all objects to inference session
        processor.add_inputs_to_inference_session(
            inference_session=inference_session,
            frame_idx=0,
            obj_ids=obj_ids,
            input_points=input_points,
            input_labels=input_labels,
        )
        # First, segment objects on the first frame
        outputs = model(
            inference_session=inference_session,
            frame_idx=0,
        )
        first_frame_masks = processor.post_process_masks(
            [outputs.pred_masks],
            original_sizes=[[inference_session.video_height, inference_session.video_width]],
            binarize=False
        )[0]
        # Initialize video segments with first frame
        video_segments = {0: {
            obj_id: first_frame_masks[i]
            for i, obj_id in enumerate(inference_session.obj_ids)
        }}
        # Propagate through video
        for sam3_tracker_video_output in model.propagate_in_video_iterator(inference_session):
            video_res_masks = processor.post_process_masks(
                [sam3_tracker_video_output.pred_masks],
                original_sizes=[[inference_session.video_height, inference_session.video_width]],
                binarize=False
            )[0]
            video_segments[sam3_tracker_video_output.frame_idx] = {
                obj_id: video_res_masks[i]
                for i, obj_id in enumerate(inference_session.obj_ids)
            }
        # Create output video with masks
        output_path = "/tmp/output_tracked.mp4"
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        height, width = video_frames[0].shape[:2]
        out = cv2.VideoWriter(output_path, fourcc, 30.0, (width, height))
        for idx in range(len(video_frames)):
            frame = video_frames[idx].copy().astype(np.uint8)
            if idx in video_segments:
                # Create overlay for all objects
                overlay = frame.copy().astype(np.float32)
                for obj_idx, (obj_id, masks) in enumerate(video_segments[idx].items()):
                    # Convert mask to float32 first, then to boolean
                    mask = masks[0].float().cpu().numpy() > 0.0
                    # Use different color for each object
                    color = COLORS[obj_idx % len(COLORS)]
                    overlay[mask] = np.array(color, dtype=np.float32)
                # Blend overlay with original frame
                frame = cv2.addWeighted(frame.astype(np.float32), 0.6, overlay, 0.4, 0).astype(np.uint8)
            out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        out.release()
        status = f"✅ Successfully tracked {len(points_list)} object(s) through {len(video_segments)} frames"
        return output_path, status
    except Exception as e:
        import traceback
        return None, f"❌ Error: {str(e)}\n{traceback.format_exc()}"
The track_objects() function accepts:
- video_path: Path to the uploaded video file.
- points_list: A list of (x, y) coordinates where the user clicked on the first frame (one click per object).
The goal is simple: “Given a video and multiple clicked points, track all selected objects through the entire video.”
If no video is uploaded, tracking cannot begin. The function returns:
- None: No output video
- A message explaining the issue
Likewise, tracking requires at least one foreground prompt. SAM3 needs one or more positive points to know:
- Which objects to segment
- Where those objects exist in the first frame
If points_list is empty, tracking is undefined.
Inside a try block, the load_video() function reads the video file, extracts all frames into memory, and returns them as a list of NumPy arrays.
Why load all frames?
Because SAM3 tracking requires:
- Access to the entire temporal sequence
- Mask propagation across frames
- Internal memory consistency
Each frame shape is typically: (height, width, 3). Next, we initialize a video session using processor.init_video_session(). This creates an internal tracking session that:
- Stores all video frames
- Maintains model memory state
- Manages object tracking buffers
We also:
- Set the computation device (CPU or GPU)
- Use bfloat16 for faster inference and lower memory usage
This step prepares SAM3’s internal tracking mechanism to process the full video.
Now we prepare inputs for multiple objects. If the user clicked 3 points, obj_ids becomes:
[1, 2, 3]
Each clicked point is treated as a separate object with its own ID.
Then we structure the coordinates:
input_points = [[[[int(x), int(y)]] for x, y in points_list]]
SAM3 expects batch format:
[batch][object][points][coordinates]
So this structure means:
- Batch size = 1
- Multiple objects
- One point per object
- Each point has (x, y) coordinates
We convert coordinates to integers because:
- Pixel indices must be integers
- Mask indexing requires integer positions
Each (x, y) now represents a foreground location for a different object in frame 0.
Next, we define labels:
input_labels = [[[1] for _ in points_list]]
Here, 1 marks a foreground point; a 0 would mark a background point.
So this tells SAM3: “Each of these clicked pixels belongs to a separate object.”
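Putting the multi-object prompt construction together, here is a small sketch with three hypothetical clicks (the Gradio UI reports fractional pixel coordinates, hence the int() conversion):

```python
# Three hypothetical clicks, one per object, as the UI might report them
points_list = [(100.4, 150.9), (320.0, 240.0), (500.2, 80.7)]

obj_ids = list(range(1, len(points_list) + 1))
input_points = [[[[int(x), int(y)]] for x, y in points_list]]
input_labels = [[[1] for _ in points_list]]  # every click is a foreground point

print(obj_ids)             # [1, 2, 3]
print(input_points[0][0])  # [[100, 150]] -> one integer point for object 1
print(input_labels[0])     # [[1], [1], [1]]
```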
Then we inject everything into the inference session using processor.add_inputs_to_inference_session():
Here:
frame_idx=0: Objects exist in the first frameobj_ids: Multiple object identifiersinput_points: Click locationsinput_labels: Foreground signals
At this point, the model knows that multiple objects are present at the selected coordinates in frame 0.
Next, we explicitly run segmentation on frame 0. This produces:
- Raw mask logits
- Low-resolution mask predictions for all objects
Then we post-process the masks using processor.post_process_masks(...). This step:
- Resizes masks to original resolution
- Converts internal representation into full-size masks
- Keeps them as float probabilities (not binarized)
We now have: Full-resolution masks for all selected objects in frame 0.
Now we initialize storage for the first frame. This means:
- Frame 0 results are stored first
- Each object ID maps to its own mask
Structure becomes:
video_segments = {
    0: {
        1: mask_for_object_1,
        2: mask_for_object_2,
        ...
    }
}
Next, we propagate through the video using model.propagate_in_video_iterator(). This runs SAM3’s tracking mechanism.
What happens internally:
- It uses memory from frame 0
- Matches object appearance across frames
- Maintains identity consistency for each object
- Predicts masks for each new frame
For every frame:
- We resize masks
- Store them in a dictionary per object
Final structure:
video_segments = {
    0: {1: mask0_1, 2: mask0_2},
    1: {1: mask1_1, 2: mask1_2},
    2: {1: mask2_1, 2: mask2_2},
    ...
}
Now we have segmentation results for all objects across the full video.
Next, we define:
- Output video path
- Codec format (mp4v)
We get the resolution from the first frame to ensure the output video matches the input resolution. Then we initialize cv2.VideoWriter() which sets:
- Output path
- Codec
- 30 FPS
- Frame dimensions
Now we iterate through each frame. We copy each frame to avoid modifying the original.
If masks exist for that frame:
- We create an overlay
- For each object:
- Move the mask tensor to the CPU
- Convert it to NumPy
- Threshold it into a Boolean mask
Now:
- True: Object pixels
- False: Background
Unlike the single-object version, here we:
- Assign a different color for each object
- Use COLORS[obj_idx % len(COLORS)]
Then we blend using cv2.addWeighted(). This results in:
- 60% original frame
- 40% colored overlay
- Smooth visualization
OpenCV expects BGR format, so we convert using cv2.COLOR_RGB2BGR.
Finally, out.release() finalizes the file, writes any remaining buffers, and properly closes the video. Without this step, the output video may become corrupted.
At the end, we return:
- Path to the saved video
- A success message indicating how many objects were tracked
If anything fails:
- The function safely returns the error
- The application does not crash
- The traceback is included for debugging
Launch the Gradio Application
# Create Gradio interface with blocks for better control
with gr.Blocks(title="SAM3 Multi-Object Video Tracker") as demo:
    gr.Markdown("# 🎯 SAM3 Multi-Object Video Tracker")
    gr.Markdown("Upload a video and click on multiple objects in the first frame to track them. Each object gets a different color!")
    with gr.Row():
        with gr.Column():
            video_input = gr.Video(label="Upload Video")
            first_frame = gr.Image(label="Click on objects to track (multiple clicks supported)", type="numpy")
            with gr.Row():
                clear_points_btn = gr.Button("Clear Points", variant="secondary")
                track_btn = gr.Button("Track Objects", variant="primary")
            points_display = gr.Textbox(label="Selected Points", interactive=False, lines=5)
        with gr.Column():
            video_output = gr.Video(label="Tracked Video")
            status_output = gr.Textbox(label="Status")
            gr.Markdown("""
            ### Color Legend:
            - 🔴 Red - Object 1
            - 🟢 Green - Object 2
            - 🔵 Blue - Object 3
            - 🟡 Yellow - Object 4
            - 🩷 Magenta - Object 5
            - 🩵 Cyan - Object 6
            - 🟠 Orange - Object 7
            - 🟣 Purple - Object 8
            """)
    # Store clicked points and original frame
    clicked_points = gr.State([])
    original_frame = gr.State(None)
    # Extract first frame when video is uploaded
    def on_video_upload(video):
        if video:
            frame = extract_first_frame(video)
            return frame, frame, [], "Upload complete. Click on objects you want to track."
        return None, None, [], ""
    video_input.change(
        on_video_upload,
        inputs=[video_input],
        outputs=[first_frame, original_frame, clicked_points, status_output]
    )
    # Handle click on first frame
    def on_click(img, orig_frame, points, evt: gr.SelectData):
        if orig_frame is None:
            return img, points, "Please upload a video first"
        x, y = evt.index[0], evt.index[1]
        # Add point to list
        points.append((x, y))
        # Draw all points on the image
        img_copy = orig_frame.copy()
        for i, (px, py) in enumerate(points):
            color = COLORS[i % len(COLORS)]
            cv2.circle(img_copy, (px, py), 8, tuple(color), -1)
            cv2.circle(img_copy, (px, py), 10, (255, 255, 255), 2)
            # Add number label
            cv2.putText(img_copy, str(i+1), (px+15, py+5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, tuple(color), 2)
        points_text = "\n".join([f"Object {i+1}: ({x}, {y})" for i, (x, y) in enumerate(points)])
        return img_copy, points, points_text
    first_frame.select(
        on_click,
        inputs=[first_frame, original_frame, clicked_points],
        outputs=[first_frame, clicked_points, points_display]
    )
    # Clear points button
    def clear_points(orig_frame):
        return orig_frame, [], ""
    clear_points_btn.click(
        clear_points,
        inputs=[original_frame],
        outputs=[first_frame, clicked_points, points_display]
    )
    # Track button
    track_btn.click(
        track_objects,
        inputs=[video_input, clicked_points],
        outputs=[video_output, status_output]
    )

# Launch
demo.launch(debug=True)
The Gradio interface is built using gr.Blocks, which allows us to design a structured, interactive layout with full control over components and events. The goal is simple: “Create an interactive UI where a user uploads a video, clicks on multiple objects in the first frame, and then tracks them across the entire video.”
At the top of the interface, we display a title and short instructions using gr.Markdown(). This helps users immediately understand what the application does and what steps they need to follow. Clear instructions reduce confusion and improve usability.
Next, we organize the layout using a gr.Row(). Inside that row, we create 2 columns using gr.Column(). The left column handles inputs and interactions. The right column displays outputs and tracking results. This separation keeps the workflow intuitive and clean.
In the left column, we first create a video upload component. This allows the user to upload a video file from their system. Once a video is uploaded, the backend receives the file path, which is later used to extract frames and perform tracking.
Below the video upload, we place an image component that displays the first frame of the uploaded video. This image is interactive and supports click events. We set its type to NumPy so the backend receives the image as a NumPy array. This is important because we draw circles and labels on the frame using OpenCV.
Under the image, we add 2 buttons side by side. The first button clears selected points. The second button triggers the tracking pipeline. The track button is styled as the primary button so it visually stands out as the main action.
Below the buttons, we add a textbox that displays selected points. This textbox is non-interactive, meaning users cannot edit it manually. It simply shows a formatted list of selected objects and their coordinates. This helps users confirm that they clicked the correct locations before starting tracking.
In the right column, we create a video output component. This will display the processed video returned by the tracking function. Below that, we add a status textbox to show messages such as upload confirmation, tracking success, or error details. Finally, we display a color legend using Markdown so users understand which color corresponds to which object during visualization.
To manage interaction data across events, we use Gradio’s State component. One state variable stores the list of clicked points. This list grows as the user clicks on multiple objects. Another state variable stores the original first frame. This is important because every time a new click occurs, we redraw all points on a clean copy of the original frame instead of repeatedly drawing over an already modified image. Without this, the markers would stack incorrectly and distort the visualization.
When a video is uploaded, a function on_video_upload() is triggered. This function extracts the first frame from the video and returns 4 values:
- The extracted first frame for display
- The same frame stored as the original clean frame
- An empty list of clicked points
- A status message confirming upload completion
This ensures that each new upload resets the application state properly.
When the user clicks on the first frame image, another function on_click() handles the event. The click event provides the pixel coordinates of the selected location. First, we check whether a video has been uploaded. If not, we return a message asking the user to upload one.
If a frame exists, we extract the x and y coordinates from the click event. We then append this coordinate pair to the stored list of points. After updating the list, we redraw the image. We copy the original clean frame and loop over all stored points. For each point:
- We select a color from the predefined COLORS list
- We draw a white border around the circle for better visibility
- We add a numeric label next to the point indicating Object 1, Object 2, and so on
This ensures each selected object is visually distinct and clearly labeled.
We also generate formatted text listing all selected objects and their coordinates. This text is displayed in the textbox so the user can verify selections.
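That formatted listing is a simple join over the stored points. For example, with two hypothetical clicks:

```python
# Two hypothetical selected points stored in the Gradio state
points = [(100, 150), (320, 240)]

points_text = "\n".join(
    f"Object {i+1}: ({x}, {y})" for i, (x, y) in enumerate(points)
)

print(points_text)
# Object 1: (100, 150)
# Object 2: (320, 240)
```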
The Clear Points button is connected to a function that resets the interface. It restores the original clean frame, empties the clicked points list, and clears the points display textbox. This allows the user to start fresh without reloading the video.
The Track Objects button is connected to the tracking function. When clicked, it sends:
- The uploaded video
- The stored list of clicked points
to the backend tracking pipeline. The tracking function processes the video, segments and propagates masks for all selected objects, and returns:
- The path to the processed output video
- A status message
These are then displayed in the output video component and the status textbox.
Finally, the application is launched with debug mode enabled. Debug mode provides detailed logs in case errors occur during development, making troubleshooting easier.
Overall, the interface follows this flow:
- The user uploads a video.
- The first frame is extracted and displayed.
- The user clicks multiple objects.
- Points are stored and visualized with colors and labels.
- The user presses Track Objects.
- The backend processes the video and returns the tracked result.
- The output video and status message are displayed.
State management ensures smooth interaction across multiple events, and the 2-column layout keeps inputs and outputs clearly separated.
Output: Multi-Object Video Segmentation and Tracking Results
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: March 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this tutorial, we extended SAM 3 from image-based segmentation workflows to full video understanding and tracking. We first built pipelines that detect, segment, and track concepts across videos using simple text prompts, enabling automatic tracking of objects such as people or vehicles without manual annotation.
Next, we moved to streaming inference, where SAM 3 processes frames continuously from a webcam while maintaining tracking memory across time. This allowed us to build real-time segmentation and tracking systems that operate on live video streams.
We then explored interactive tracking workflows, where users select objects directly using click prompts. Starting from single-object tracking, we progressed to multi-object tracking, enabling several objects to be tracked simultaneously with consistent identity and color-coded visualization.
By the end of this tutorial, we developed complete end-to-end systems that combine detection, segmentation, tracking, and interactive workflows into practical applications using Gradio interfaces. Together with the previous parts of this series, we now have a full understanding of how SAM 3 enables concept-aware segmentation and tracking across both images and videos, opening the door to intelligent video editing, annotation, and analysis workflows.
Citation Information
Thakur, P. “SAM 3 for Video: Concept-Aware Segmentation and Object Tracking,” PyImageSearch, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/luxfd
@incollection{Thakur_2026_sam-3-sam3-for-video-concept-aware-segmentation-and-object-tracking,
author = {Piyush Thakur},
title = {{SAM 3 for Video: Concept-Aware Segmentation and Object Tracking}},
booktitle = {PyImageSearch},
editor = {Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
year = {2026},
url = {https://pyimg.co/luxfd},
}