Table of Contents
- Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation
- Configuring Your Development Environment
- Setup and Imports
- Loading the SAM 3 Model
- Downloading a Few Images
- Multi-Text Prompts on a Single Image
- Batched Inference Using Multiple Text Prompts Across Multiple Images
- Single Bounding Box Prompt
- Multiple Bounding Box Prompts on a Single Image (Dual Positive Foreground Regions)
- Multiple Bounding Box Prompts on a Single Image (Positive Foreground and Negative Background Control)
- Combining Text and Visual Prompts for Selective Segmentation (Excluding the Undesired Regions)
- Batched Mixed-Prompt Segmentation Across Two Images (Text and Bounding Box Guidance)
- Interactive Segmentation Using Bounding Box Refinement (Draw to Segment)
- Interactive Segmentation Using Point-Based Refinement (Click to Guide the Model)
- Summary
Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation
Welcome to Part 2 of our SAM 3 tutorial. In Part 1, we explored the theoretical foundations of SAM 3 and demonstrated basic text-based segmentation. Now, we unlock its full potential by mastering advanced prompting techniques and interactive workflows.
SAM 3’s true power lies in its flexibility; it does not just accept text prompts. It can process multiple text queries simultaneously, interpret bounding box coordinates, combine text with visual cues, and respond to interactive point-based guidance. This multi-modal approach enables sophisticated segmentation workflows that were previously impractical with traditional models.
In Part 2, we will cover:
- Multi-prompt Segmentation: Query multiple concepts in a single image
- Batched Inference: Process multiple images with different prompts efficiently
- Bounding Box Guidance: Use spatial hints for precise localization
- Positive and Negative Prompts: Include desired regions while excluding unwanted areas
- Hybrid Prompting: Combine text and visual cues for selective segmentation
- Interactive Refinement: Draw bounding boxes and click points for real-time segmentation control
Each technique is demonstrated with complete code examples and visual outputs, providing production-ready workflows for data annotation, video editing, scientific research, and more.
This lesson is the 2nd of a 4-part series on SAM 3:
- SAM 3: Concept-Based Visual Understanding and Segmentation
- Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation (this tutorial)
- Lesson 3
- Lesson 4
To learn how to perform advanced multi-modal prompting and interactive segmentation with SAM 3, just keep reading.
Would you like immediate access to 3,457 images curated and labeled with hand gestures to train, explore, and experiment with … for free? Head over to Roboflow and get a free account to grab these hand gesture images.
Configuring Your Development Environment
To follow this guide, you need to have the following libraries installed on your system.
!pip install --q git+https://github.com/huggingface/transformers supervision jupyter_bbox_widget
We install the transformers library to load the SAM 3 model and processor, the supervision library for annotation, drawing, and inspection (which we use later to visualize bounding boxes and segmentation outputs). Additionally, we install jupyter_bbox_widget, an interactive widget that runs inside a notebook, enabling us to click on the image to add points or draw bounding boxes.
We also pass the --q flag to hide installation logs. This keeps notebook output clean.
Need Help Configuring Your Development Environment?

All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code immediately on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Setup and Imports
Once installed, we proceed to import the required libraries.
import io
import torch
import base64
import requests
import matplotlib
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from google.colab import output
from accelerate import Accelerator
from IPython.display import display
from jupyter_bbox_widget import BBoxWidget
from PIL import Image, ImageDraw, ImageFont
from transformers import Sam3Processor, Sam3Model, Sam3TrackerProcessor, Sam3TrackerModel
We import the following:
- io: Python’s built-in module for handling in-memory image buffers when converting PIL images to base64 format
- torch: used to run the SAM 3 model, send tensors to the GPU, and work with model outputs
- base64: used to convert our images into base64 strings so that the BBox widget can display them in the notebook
- requests: a library to download images directly from a URL; this keeps our workflow simple and avoids manual file uploads
We also import several helper libraries:
- matplotlib.pyplot: helps us visualize masks and overlays
- numpy: gives us fast array operations
- ipywidgets: enables interactive elements inside the notebook
We import the output utility from Colab, which we later use to enable interactive widgets. Without this step, our bounding box widget will not render. We also import Accelerator from Hugging Face to run the model efficiently on either the CPU or GPU using the same code. It also simplifies device placement.
We import the display function to render images and widgets directly in notebook cells, and BBoxWidget serves as the core interactive tool, allowing us to click and draw bounding boxes or points on an image. We use this as our prompt input system.
We also import 3 classes from Pillow:
- Image: loads RGB images
- ImageDraw: helps us draw shapes on images
- ImageFont: gives us text rendering support for overlays
Finally, we import our SAM 3 tools from transformers:
- Sam3Processor: prepares inputs for the segmentation model
- Sam3Model: performs segmentation from text and box prompts
- Sam3TrackerProcessor: prepares inputs for point-based or tracking prompts
- Sam3TrackerModel: runs point-based segmentation and masking
Loading the SAM 3 Model
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Sam3Processor.from_pretrained("facebook/sam3")
model = Sam3Model.from_pretrained("facebook/sam3").to(device)
First, we check if a GPU is available in the environment. If PyTorch detects CUDA (Compute Unified Device Architecture) support, then we use the GPU for faster inference. Otherwise, we fall back to the CPU. This check ensures our code runs efficiently on any machine (Line 1).
Next, we load the Sam3Processor. The processor is responsible for preparing all inputs before they reach the model. It handles image preprocessing, bounding box formatting, text prompts, and tensor conversion. In short, it makes our raw images compatible with the model (Line 3).
Finally, we load the Sam3Model from Hugging Face. This model takes the processed inputs and generates segmentation masks. We immediately move the model to the selected device (GPU or CPU) for inference (Line 4).
Downloading a Few Images
!wget -q https://media.roboflow.com/notebooks/examples/birds.jpg
!wget -q https://media.roboflow.com/notebooks/examples/traffic_jam.jpg
!wget -q https://media.roboflow.com/notebooks/examples/basketball_game.jpg
!wget -q https://media.roboflow.com/notebooks/examples/dog-2.jpeg
Here, we download a few images from the Roboflow media server using the wget command and use the -q flag to suppress output and keep the notebook clean.
Multi-Text Prompts on a Single Image
In this example, we apply two different text prompts to the same image: player in white and player in blue. Instead of running SAM 3 once, we loop over both prompts, and each text query produces a new set of instance masks. We then merge all detections into a single result and visualize them together.
prompts = ["player in white", "player in blue"]
IMAGE_PATH = "/content/basketball_game.jpg"
# Load image
image = Image.open(IMAGE_PATH).convert("RGB")
all_masks = []
all_boxes = []
all_scores = []
total_objects = 0
for prompt in prompts:
inputs = processor(
images=image,
text=prompt,
return_tensors="pt"
).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
num_objects = len(results["masks"])
total_objects += num_objects
print(f"Found {num_objects} objects for prompt: '{prompt}'")
all_masks.append(results["masks"])
all_boxes.append(results["boxes"])
all_scores.append(results["scores"])
results = {
"masks": torch.cat(all_masks, dim=0),
"boxes": torch.cat(all_boxes, dim=0),
"scores": torch.cat(all_scores, dim=0),
}
print(f"\nTotal objects found across all prompts: {total_objects}")
First, we define our two text prompts. Each describes a different visual concept in the image (Line 1). We also set the path to our basketball game image (Line 2). We load the image and convert it to RGB. This ensures the colors are consistent before sending it to the model (Line 5).
Next, we initialize empty lists to store masks, bounding boxes, and confidence scores for each prompt. We also track the total number of detections (Lines 7-11).
We run inference without tracking gradients. This is more efficient and uses less memory. After inference, we post-process the outputs. We apply thresholds, convert logits to binary masks, and resize them to match the original image (Lines 13-28).
We count the number of objects detected for the current prompt, update the running total, and print the result. We store the current prompt’s masks, boxes, and scores in their respective lists (Lines 30-37).
Once the loop is finished, we concatenate all masks, bounding boxes, and scores into a single results dictionary. This allows us to visualize all objects together, regardless of which prompt produced them. We print the total number of detections across all prompts (Lines 39-45).
Below are the numbers of objects detected for each prompt, as well as the total number of objects detected.
Found 5 objects for prompt: 'player in white'
Found 6 objects for prompt: 'player in blue'
Total objects found across all prompts: 11
Output
labels = []
for prompt, scores in zip(prompts, all_scores):
    labels.extend([prompt] * len(scores))

overlay_masks_boxes_scores(
    image=image,
    masks=results["masks"],
    boxes=results["boxes"],
    scores=results["scores"],
    labels=labels,
    score_threshold=0.5,
    alpha=0.45,
)
Now, to visualize the output, we generate a list of text labels. Each label matches the prompt that produced the detection (Lines 1-3).
Finally, we visualize everything at once using overlay_masks_boxes_scores. The output image (Figure 1) shows masks, bounding boxes, and confidence scores for players in white and players in blue — cleanly layered on top of the original frame (Lines 5-13).
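Note that overlay_masks_boxes_scores is the visualization helper from Part 1 of this series, and its definition is not reproduced in this post. If you are following along without that notebook, the minimal sketch below (an approximation written for this walkthrough, not the original Part 1 implementation) reproduces the behavior we rely on here: it alpha-blends each mask over the image, draws the corresponding box, and writes the label and score for every detection above score_threshold. It uses the numpy and Pillow imports from the Setup section.

# Minimal stand-in for the Part 1 helper (an assumption, not the original code).
# Expects masks [N, H, W], boxes [N, 4] in xyxy format, and scores [N] as torch tensors.
def overlay_masks_boxes_scores(image, masks, boxes, scores, labels,
                               score_threshold=0.5, alpha=0.45):
    img = np.array(image.convert("RGB")).astype(np.float32)
    rng = np.random.default_rng(0)  # reproducible per-instance colors
    annotated_boxes = []
    for mask, box, score, label in zip(masks, boxes, scores, labels):
        if score < score_threshold:
            continue  # skip low-confidence detections
        color = rng.integers(0, 256, size=3).astype(np.float32)
        m = mask.detach().cpu().numpy() > 0.5
        img[m] = (1 - alpha) * img[m] + alpha * color  # alpha-blend the mask
        annotated_boxes.append((box.tolist(), f"{label} {score:.2f}", tuple(int(c) for c in color)))
    vis = Image.fromarray(img.astype(np.uint8))
    draw = ImageDraw.Draw(vis)
    for (x1, y1, x2, y2), text, color in annotated_boxes:
        draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=3)
        draw.text((x1, max(0, y1 - 12)), text, fill=color)
    return vis  # a PIL image, so it renders directly in a notebook cell

Because the sketch returns a PIL image, calling it as the last expression in a cell (or passing the result to display) renders the annotated frame, which matches how the helper is used throughout this tutorial.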

Batched Inference Using Multiple Text Prompts Across Multiple Images
In this example, we run SAM 3 on two images at once and provide a separate text prompt for each. This gives us a clean, parallel workflow: one batch, two prompts, two images, two sets of segmentation results.
cat_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
images = [
Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
]
text_prompts = ["ear", "dial"]
inputs = processor(images=images, text=text_prompts, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model(**inputs)
# Post-process results for both images
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs.get("original_sizes").tolist()
)
print(f"Image 1: {len(results[0]['masks'])} objects found")
print(f"Image 2: {len(results[1]['masks'])} objects found")
First, we define two URLs. The first points to a cat image. The second points to a kitchen scene from COCO (Lines 1 and 2).
Next, we download the two images, load them into memory, and convert them to RGB. We store both images in a list. This allows us to batch them later. Then, we define one prompt per image. The first prompt searches for a cat’s ear. The second prompt looks for a dial in the kitchen scene (Lines 3-8).
We batch the images and batch the prompts into a single input structure. This gives SAM 3 two parallel vision-language tasks, packed into one tensor (Line 10).
We disable gradient computation and run the model in inference mode. The outputs contain segmentation predictions for both images. We post-process the raw logits. SAM 3 returns results as a list: one entry per image. Each entry contains instance masks, bounding boxes, and confidence scores (Lines 12-21).
We count the number of objects detected for each prompt. This gives us a simple, semantic summary of model performance (Lines 23 and 24).
Below is the total number of objects detected in each image provided for each text prompt.
Image 1: 2 objects found
Image 2: 7 objects found
Output
for image, result, prompt in zip(images, results, text_prompts):
    labels = [prompt] * len(result["scores"])
    vis = overlay_masks_boxes_scores(image, result["masks"], result["boxes"], result["scores"], labels)
    display(vis)
To visualize the output, we pair each image with its corresponding prompt and result. For each batch entry, we do the following (Line 1):
- create a label per detected object (Line 2)
- visualize the masks, boxes, and scores using our overlay helper (Line 3)
- display the annotated result in the notebook (Line 4)
This approach shows how SAM 3 handles multiple text prompts and images simultaneously, without writing separate inference loops.
In Figure 2, we can see the object (ear) detected in the image.

In Figure 3, we can see the object (dial) detected in the image.

Single Bounding Box Prompt
In this example, we perform segmentation using a bounding box instead of a text prompt. We provide the model with a spatial hint that says: “focus here.” SAM 3 then segments the instances of the concept that this spatial hint exemplifies.
# Load image
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
# Box in xyxy format: [x1, y1, x2, y2]
box_xyxy = [100, 150, 500, 450]
input_boxes = [[box_xyxy]]
input_boxes_labels = [[1]] # 1 = positive (foreground) box
def draw_input_box(image, box, color="red", width=3):
img = image.copy().convert("RGB")
draw = ImageDraw.Draw(img)
x1, y1, x2, y2 = box
draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
return img
input_box_vis = draw_input_box(image, box_xyxy)
input_box_vis
First, we load an example COCO image directly from a URL. We read the raw bytes, open them with Pillow, and convert them to RGB (Lines 2 and 3).
Next, we define a bounding box around the region to be segmented. The coordinates follow the xyxy format (Line 6).
- (x1, y1): top-left corner
- (x2, y2): bottom-right corner
We prepare the box for the processor.
- The outer list indicates a batch size of 1. The inner list holds the single bounding box (Line 8).
- We set the label to 1, meaning this is a positive box, and SAM 3 should focus on this region (Line 9).
Then, we define a helper to visualize the prompt box. The function draws a colored rectangle over the image, making the prompt easy to verify before segmentation (Lines 11-16).
We display the input box overlay. This confirms our prompt is correct before running the model (Lines 18 and 19).
Figure 4 shows the bounding box prompt overlaid on the input image.

inputs = processor(
images=image,
input_boxes=input_boxes,
input_boxes_labels=input_boxes_labels,
return_tensors="pt"
).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"Found {len(results['masks'])} objects")
Now, we prepare the final inputs for the model. Instead of passing text, we pass bounding box prompts. The processor handles resizing, padding, normalization, and tensor conversion. We then move everything to the selected device (GPU or CPU) (Lines 1-6).
We run SAM 3 in inference mode. The torch.no_grad() function disables gradient computation, reducing memory usage and improving speed (Lines 8 and 9).
After inference, we reshape and threshold the predicted masks. We resize them back to their original sizes so they align perfectly. We index [0] because we are working with a single image (Lines 11-16).
We print the number of foreground objects that SAM 3 detected within the bounding box (Line 18).
Found 1 objects
Output
labels = ["box-prompted object"] * len(results["scores"]) overlay_masks_boxes_scores( image=image, masks=results["masks"], boxes=results["boxes"], scores=results["scores"], labels=labels, score_threshold=0.5, alpha=0.45, )
To visualize the results, we create a label string "box-prompted object" for each detected instance to keep the overlay looking clean (Line 1).
Finally, we call our overlay helper. It blends the segmentation mask, draws the bounding box, and shows confidence scores on top of the original image (Lines 3-11).
Figure 5 shows the segmented object.

Multiple Bounding Box Prompts on a Single Image (Dual Positive Foreground Regions)
In this example, we guide SAM 3 using two positive bounding boxes. Each box marks a small region of interest inside the image: one around the oven dial and one around a nearby button. Both boxes act as foreground signals. SAM 3 then segments all detected objects within these marked regions.
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
requests.get(kitchen_url, stream=True).raw
).convert("RGB")
box1_xyxy = [59, 144, 76, 163] # Dial
box2_xyxy = [87, 148, 104, 159] # Button
input_boxes = [[box1_xyxy, box2_xyxy]]
input_boxes_labels = [[1, 1]] # 1 = positive (foreground)
def draw_input_boxes(image, boxes, color="red", width=3):
img = image.copy().convert("RGB")
draw = ImageDraw.Draw(img)
for box in boxes:
x1, y1, x2, y2 = box
draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
return img
input_box_vis = draw_input_boxes(
kitchen_image,
[box1_xyxy, box2_xyxy]
)
input_box_vis
First, we load the kitchen image from COCO. We download the raw image bytes, open them with Pillow, and convert the image to RGB. Next, we define two bounding boxes. Both follow the xyxy format. The first box highlights the oven dial. The second box highlights the oven button (Lines 1-7).
We pack both bounding boxes into a single list, since we are working with a single image. We assign a value of 1 to both boxes, indicating that both are positive prompts. We define a helper function to visualize the bounding box prompts. For each box, we draw a red rectangle overlay on a copy of the image (Lines 9-20).
We draw both boxes and display the result. This gives us a visual confirmation of our bounding box prompts before running the model (Lines 22-27).
Figure 6 shows the two positive bounding boxes superimposed on the input image.

inputs = processor(
images=kitchen_image,
input_boxes=input_boxes,
input_boxes_labels=input_boxes_labels,
return_tensors="pt"
).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"Found {len(results['masks'])} objects")
Now, we prepare the image and the bounding box prompts using the processor. We then send the tensors to the CPU or GPU. We run SAM 3 in inference mode. We disable gradient tracking to improve memory and speed (Lines 1-9).
Next, we post-process the raw outputs. We resize masks back to their original shape, and we filter low-confidence results. We print the number of detected objects that fall inside our two positive bounding box prompts (Lines 11-18).
Below is the total number of objects detected in the image.
Found 7 objects
Output
labels = ["box-prompted object"] * len(results["scores"]) overlay_masks_boxes_scores( image=kitchen_image, masks=results["masks"], boxes=results["boxes"], scores=results["scores"], labels=labels, )
We generate a label for visualization. Finally, we overlay the segmented objects on the image using the overlay_masks_boxes_scores function (Lines 1-9).
Here, Figure 7 displays all segmented objects.

Multiple Bounding Box Prompts on a Single Image (Positive Foreground and Negative Background Control)
In this example, we guide SAM 3 using two bounding boxes: one positive and one negative. The positive box highlights the region we want to segment, while the negative box tells the model to ignore a nearby region. This combination gives us fine control over the segmentation result.
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
requests.get(kitchen_url, stream=True).raw
).convert("RGB")
box1_xyxy = [59, 144, 76, 163] # Dial
box2_xyxy = [87, 148, 104, 159] # Button
input_boxes = [[box1_xyxy, box2_xyxy]]
input_boxes_labels = [[1, 0]]
def draw_input_boxes(image, boxes, labels, width=3):
"""
boxes : list of [x1, y1, x2, y2]
labels : list of ints (1 = positive, 0 = negative)
"""
img = image.copy().convert("RGB")
draw = ImageDraw.Draw(img)
for box, label in zip(boxes, labels):
x1, y1, x2, y2 = box
# Color by label
color = "green" if label == 1 else "red"
draw.rectangle(
[(x1, y1), (x2, y2)],
outline=color,
width=width,
)
return img
input_box_vis = draw_input_boxes(
kitchen_image,
boxes=[box1_xyxy, box2_xyxy],
labels=[1, 0], # 1 = positive, 0 = negative
)
input_box_vis
First, we load our kitchen image from the COCO dataset. We fetch the bytes from the URL and convert them to RGB (Lines 1-4).
Next, we define two bounding boxes. Both follow the xyxy coordinate format (Lines 6 and 7):
- first box: surrounds the oven dial
- second box: surrounds a nearby oven button
We pack the two boxes into a single list because we are working with a single image. We set labels [1, 0], meaning (Lines 9 and 10):
- dial box: positive (foreground to include)
- button box: negative (area to exclude)
We define a helper function that draws bounding boxes in different colors. Positive prompts are drawn in green. Negative prompts are drawn in red (Lines 12-32).
We visualize the bounding box prompts overlaid on the image. This gives us a clear understanding of how we are instructing SAM 3 (Lines 34-40).
Figure 8 shows the positive and negative box prompts superimposed on the input image.

inputs = processor(
images=kitchen_image,
input_boxes=input_boxes,
input_boxes_labels=input_boxes_labels,
return_tensors="pt"
).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"Found {len(results['masks'])} objects")
We prepare the inputs for SAM 3. The processor handles preprocessing and tensor conversion. We perform inference. Gradients are disabled to reduce memory usage. Next, we post-process the results. SAM 3 returns instance masks filtered by confidence and resized to the original resolution (Lines 1-16).
We print the number of objects segmented using this foreground-background combination (Line 18).
Below is the total number of objects detected in the image.
Found 6 objects
Output
labels = ["box-prompted object"] * len(results["scores"]) overlay_masks_boxes_scores( image=kitchen_image, masks=results["masks"], boxes=results["boxes"], scores=results["scores"], labels=labels, )
We assign labels to detections to ensure the overlay displays meaningful text. Finally, we visualize the segmentation (Lines 1-9).
In Figure 9, the positive prompt guides SAM 3 to segment the dial, while the negative prompt suppresses the nearby button.

Combining Text and Visual Prompts for Selective Segmentation (Excluding the Undesired Regions)
In this example, we use two different prompt types at the same time:
- text prompt: to search for "handle"
- negative bounding box: to exclude the oven handle region
This provides selective control, allowing SAM 3 to focus on handles in the scene while ignoring a specific area.
kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
kitchen_image = Image.open(
requests.get(kitchen_url, stream=True).raw
).convert("RGB")
# Segment "handle" but exclude the oven handle using a negative box
text = "handle"
# Negative box covering oven handle area (xyxy): [40, 183, 318, 204]
oven_handle_box = [40, 183, 318, 204]
input_boxes = [[oven_handle_box]]
def draw_negative_box(image, box, width=3):
img = image.copy().convert("RGB")
draw = ImageDraw.Draw(img)
x1, y1, x2, y2 = box
draw.rectangle(
[(x1, y1), (x2, y2)],
outline="red", # red = negative
width=width,
)
return img
neg_box_vis = draw_negative_box(
kitchen_image,
oven_handle_box
)
neg_box_vis
First, we load the kitchen image from the COCO dataset. We read the file from the URL, open it as a Pillow image, and convert it to RGB (Lines 1-4).
Next, we define the structure of our prompt. We want to segment handles in the kitchen, but exclude the large oven handle. We describe the concept using text ("handle") and draw a bounding box over the oven handle region (Lines 7-10).
We write a helper function to visualize our negative region. We draw a red bounding box to show that this area should be excluded. We display the negative prompt overlay. This helps confirm that the region is positioned correctly (Lines 12-30).
Figure 10 shows the bounding box prompt to exclude the oven handle region.

inputs = processor(
images=kitchen_image,
text="handle",
input_boxes=[[oven_handle_box]],
input_boxes_labels=[[0]], # negative box
return_tensors="pt"
).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"Found {len(results['masks'])} objects")
Here, we prepare the inputs for SAM 3. We combine text and bounding box prompts. We mark the bounding box with a 0 label, meaning it is a negative region that the model must ignore (Lines 1-7).
We run the model in inference mode. This yields raw segmentation predictions based on both prompt types. We post-process the results by converting logits into binary masks, filtering low-confidence predictions, and resizing the masks back to the original resolution (Lines 9-17).
We report below the number of handle-like objects remaining after excluding the oven handle (Line 19).
Found 3 objects
Output
labels = ["handle (excluding oven)"] * len(results["scores"]) final_vis = overlay_masks_boxes_scores( image=kitchen_image, masks=results["masks"], boxes=results["boxes"], scores=results["scores"], labels=labels, score_threshold=0.5, alpha=0.45, ) final_vis
We assign meaningful labels for visualization. Finally, we draw masks, bounding boxes, labels, and scores on the image (Lines 1-13).
In Figure 11, the result shows only handles outside the negative region.

"handle" segmentation while excluding the oven handle via a negative box (source: visualization by the author)Batched Mixed-Prompt Segmentation Across Two Images (Text and Bounding Box Guidance)
In this example, we demonstrate how SAM 3 can handle multiple prompt types in a single batch. The first image receives a text prompt ("laptop"), while the second image receives a visual prompt (positive bounding box). Both images are processed together in a single forward pass.
text=["laptop", None]
input_boxes=[None, [box2_xyxy]]
input_boxes_labels=[None, [1]]
def draw_input_box(image, box, color="green", width=3):
img = image.copy().convert("RGB")
draw = ImageDraw.Draw(img)
x1, y1, x2, y2 = box
draw.rectangle([(x1, y1), (x2, y2)], outline=color, width=width)
return img
input_vis_1 = images[0] # text prompt → no box
input_vis_2 = draw_input_box(images[1], box2_xyxy)
First, we define 3 parallel prompt lists:
- 1 for text
- 1 for bounding boxes
- 1 for bounding box labels
We set the first entry in each list to None for the first image because we only want to use natural language there (laptop). For the second image, we supply a bounding box and label it as positive (1) (Lines 1-3).
We define a small helper function to draw a bounding box on an image. This helps us visualize the prompt region before inference. Here, we prepare two preview images (Lines 5-13):
- first image: shows no box, since it will use text only
- second image: is rendered with its bounding box prompt
input_vis_1
Figure 12 shows no box over the image, since it uses a text prompt for segmentation.

input_vis_2
Figure 13 shows a bounding box over the image because it uses a box prompt for segmentation.

inputs = processor(
    images=images,
    text=["laptop", None],
    input_boxes=[None, [box2_xyxy]],
    input_boxes_labels=[None, [1]],
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs["original_sizes"].tolist()
)
Next, we assemble everything into a single batched input. This gives SAM 3:
- 2 images
- 2 prompt types
- 1 forward pass
We run SAM 3 inference without computing gradients. This produces segmentation predictions for both images simultaneously (Lines 1-10).
We post-process the model outputs for both images. The result is a two-element list (Lines 12-17):
- entry [0]: corresponds to the "laptop" text query
- entry [1]: corresponds to the bounding box query
Output 1: Text Prompt Segmentation
labels_1 = ["laptop"] * len(results[0]["scores"]) overlay_masks_boxes_scores( image=images[0], masks=results[0]["masks"], boxes=results[0]["boxes"], scores=results[0]["scores"], labels=labels_1, score_threshold=0.5, )
We apply a label to each detected object in the first image. We visualize the segmentation results overlaid on the first image (Lines 1-10).
In Figure 14, we observe detections guided by the text prompt "laptop".

"laptop" in Image 1 (source: visualization by the author)Output 2: Bounding Box Prompt Segmentation
labels_2 = ["box-prompted object"] * len(results[1]["scores"]) overlay_masks_boxes_scores( image=images[1], masks=results[1]["masks"], boxes=results[1]["boxes"], scores=results[1]["scores"], labels=labels_2, score_threshold=0.5, )
We create labels for the second image. These detections are from the bounding box prompt. Finally, we visualize the bounding box guided segmentation on the second image (Lines 1-10).
In Figure 15, we can see the detections guided by the bounding box prompt.
Interactive Segmentation Using Bounding Box Refinement (Draw to Segment)
In this example, we turn segmentation into a fully interactive workflow. We draw bounding boxes directly over the image using a widget UI. Each drawn box becomes a prompt signal for SAM 3:
- green (positive) boxes: identify areas we want to segment
- red (negative) boxes: exclude areas we want the model to ignore
After drawing, we convert the widget output into proper box coordinates and run SAM 3 to produce refined segmentation masks.
output.enable_custom_widget_manager()
# Load image
url = "http://images.cocodataset.org/val2017/000000136466.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# Convert to base64
def pil_to_base64(img):
buffer = io.BytesIO()
img.save(buffer, format="PNG")
return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
# Create widget
widget = BBoxWidget(
image=pil_to_base64(image),
classes=["positive", "negative"]
)
widget
We enable custom widget support in Colab to ensure the bounding box UI renders properly. We download the kitchen image, load it into memory, and convert it to RGB format (Lines 1-5).
Before sending the image into the widget, we convert it into a base64 PNG buffer. This encoding step makes the image displayable in the browser UI (Lines 8-11).
We create an interactive drawing widget. It displays the image and allows the user to add labeled boxes. Each box is tagged as either "positive" or "negative" (Lines 14-17).
We render the widget in the notebook. At this point, the user can draw, move, resize, and delete bounding boxes (Line 19).
In Figure 16, we can see the positive and negative bounding boxes drawn by the user. The blue box indicates regions that belong to the object of interest, while the orange box marks background areas that should be ignored. These annotations serve as interactive guidance signals for refining the segmentation output.

print(widget.bboxes)
The widget.bboxes object stores metadata for every annotation drawn by the user on the image. Each entry corresponds to a single box created in the interactive widget.
A typical output looks like this:
[{'x': 58, 'y': 147, 'width': 18, 'height': 18, 'label': 'positive'}, {'x': 88, 'y': 149, 'width': 18, 'height': 8, 'label': 'negative'}]
Each dictionary represents a single user annotation:
- x and y: indicate the top-left corner of the drawn box in pixel coordinates
- width and height: describe the size of the box
- label: tells us whether the annotation is a 'positive' point (object) or a 'negative' point (background)
def widget_to_sam_boxes(widget):
boxes = []
labels = []
for ann in widget.bboxes:
x = int(ann["x"])
y = int(ann["y"])
w = int(ann["width"])
h = int(ann["height"])
x1 = x
y1 = y
x2 = x + w
y2 = y + h
label = ann.get("label") or ann.get("class")
boxes.append([x1, y1, x2, y2])
labels.append(1 if label == "positive" else 0)
return boxes, labels
boxes, box_labels = widget_to_sam_boxes(widget)
print("Boxes:", boxes)
print("Labels:", box_labels)
We define a helper function to translate widget data into SAM-compatible xyxy coordinates. The widget gives us x/y + width/height. We convert to SAM’s xyxy format.
We encode labels into SAM 3 format:
- 1: positive region
- 0: negative region
The function returns valid box lists ready for inference. We extract the interactive box prompts (Lines 23-45).
Below are the Boxes and Labels in the required format.
Boxes: [[58, 147, 76, 165], [88, 149, 106, 157]]
Labels: [1, 0]
inputs = processor(
images=image,
input_boxes=[boxes], # batch size = 1
input_boxes_labels=[box_labels],
return_tensors="pt"
).to(device)
with torch.no_grad():
outputs = model(**inputs)
results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs["original_sizes"].tolist()
)[0]
print(f"Found {len(results['masks'])} objects")
We pass the image and interactive box prompts into the processor. We run inference without tracking gradients. We convert logits into final mask predictions. We print the number of detected regions matching the interactive prompts (Lines 49-66).
Below is the number of objects detected by the model.
Found 6 objects
Output
labels = ["interactive object"] * len(results["scores"]) overlay_masks_boxes_scores( image=image, masks=results["masks"], boxes=results["boxes"], scores=results["scores"], labels=labels, alpha=0.45, )
We assign simple labels to each detected region and overlay masks, bounding boxes, and scores on the original image (Lines 1-10).
This workflow demonstrates an effective use case: human-guided refinement through live drawing tools. With just a few annotations, SAM 3 adapts the segmentation output, giving us precision control and immediate visual feedback.
In Figure 17, we can see the segmented regions according to the positive and negative bounding box prompts annotated by the user over the input image.

Interactive Segmentation Using Point-Based Refinement (Click to Guide the Model)
In this example, we segment using point prompts rather than text or bounding boxes. We click on the image to mark positive and negative points. The center of each clicked point becomes a guiding coordinate, and SAM 3 uses those coordinates to refine segmentation. This workflow provides fine-grained, pixel-level control, well suited for interactive editing or correction.
# Setup device
device = Accelerator().device
# Load model and processor
print("Loading SAM3 model...")
model = Sam3TrackerModel.from_pretrained("facebook/sam3").to(device)
processor = Sam3TrackerProcessor.from_pretrained("facebook/sam3")
print("Model loaded successfully!")
# Load image
IMAGE_PATH = "/content/dog-2.jpeg"
raw_image = Image.open(IMAGE_PATH).convert("RGB")
def pil_to_base64(img):
"""Convert PIL image to base64 for BBoxWidget"""
buffer = io.BytesIO()
img.save(buffer, format="PNG")
return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
We set up our compute device using the Accelerator() class. This automatically detects the GPU if available. We load the SAM 3 tracking model and processor. This variant supports point-based refinement and multi-mask output (Lines 2-7).
We load the dog image into memory and convert it to RGB format. The BBoxWidget expects image data in base64 format. We write a helper function to convert a PIL image to base64 (Lines 11-18).
def get_points_from_widget(widget):
"""Extract point coordinates from widget bboxes"""
positive_points = []
negative_points = []
for ann in widget.bboxes:
x = int(ann["x"])
y = int(ann["y"])
w = int(ann["width"])
h = int(ann["height"])
# Get center point of the bbox
center_x = x + w // 2
center_y = y + h // 2
label = ann.get("label") or ann.get("class")
if label == "positive":
positive_points.append([center_x, center_y])
elif label == "negative":
negative_points.append([center_x, center_y])
return positive_points, negative_points
We loop over bounding boxes drawn on the widget and convert them into point coordinates. Each tiny bounding box becomes a center point. We split them into (Lines 20-42):
- positive points: object
- negative points: background
def segment_from_widget(b=None):
"""Run segmentation with points from widget"""
positive_points, negative_points = get_points_from_widget(widget)
if not positive_points and not negative_points:
print("⚠️ Please add at least one point (draw small boxes on the image)!")
return
# Combine points and labels
all_points = positive_points + negative_points
all_labels = [1] * len(positive_points) + [0] * len(negative_points)
print(f"\n🔄 Running segmentation...")
print(f" • {len(positive_points)} positive points: {positive_points}")
print(f" • {len(negative_points)} negative points: {negative_points}")
# Prepare inputs (4D for points, 3D for labels)
input_points = [[all_points]] # [batch, object, points, xy]
input_labels = [[all_labels]] # [batch, object, labels]
inputs = processor(
images=raw_image,
input_points=input_points,
input_labels=input_labels,
return_tensors="pt"
).to(device)
# Run inference
with torch.no_grad():
outputs = model(**inputs)
# Post-process masks
masks = processor.post_process_masks(
outputs.pred_masks.cpu(),
inputs["original_sizes"]
)[0]
print(f"✅ Generated {masks.shape[1]} masks with shape {masks.shape}")
# Visualize results
visualize_results(masks, positive_points, negative_points)
This segment_from_widget function handles (Lines 44-83):
- reading positive + negative points (Lines 46-58)
- building SAM 3 inputs (Lines 60-68)
- running inference (Lines 71 and 72)
- post-processing masks (Lines 75-78)
- visualizing results (Line 83)
We pack points and labels into the correct model format. The model generates multiple ranked masks. Better quality masks appear at index 0.
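If you also want to keep a copy of the best mask (e.g., as a cutout for later editing), a few extra lines at the end of segment_from_widget would do it. The snippet below is a sketch with a hypothetical output filename; it relies on the mask shape printed above and on the ranking described here (index 0 = top-quality mask).

# Optional: add at the end of segment_from_widget(), after visualize_results().
# Assumes `masks` has shape [num_objects, num_masks, H, W] and that index 0
# holds the highest-quality mask, as described above.
best_mask = masks[0, 0].numpy() > 0
cutout = Image.fromarray(best_mask.astype(np.uint8) * 255)  # binary mask as a grayscale image
cutout.save("best_mask.png")  # hypothetical output filename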
def visualize_results(masks, positive_points, negative_points):
"""Display segmentation results"""
n_masks = masks.shape[1]
# Create figure with subplots
fig, axes = plt.subplots(1, min(n_masks, 3), figsize=(15, 5))
if n_masks == 1:
axes = [axes]
for idx in range(min(n_masks, 3)):
mask = masks[0, idx].numpy()
# Overlay mask on image
img_array = np.array(raw_image)
colored_mask = np.zeros_like(img_array)
colored_mask[mask > 0] = [0, 255, 0] # Green mask
overlay = img_array.copy()
overlay[mask > 0] = (img_array[mask > 0] * 0.5 + colored_mask[mask > 0] * 0.5).astype(np.uint8)
axes[idx].imshow(overlay)
axes[idx].set_title(f"Mask {idx + 1} (Quality Ranked)", fontsize=12, fontweight='bold')
axes[idx].axis('off')
# Plot points on each mask
for px, py in positive_points:
axes[idx].plot(px, py, 'go', markersize=12, markeredgecolor='white', markeredgewidth=2.5)
for nx, ny in negative_points:
axes[idx].plot(nx, ny, 'ro', markersize=12, markeredgecolor='white', markeredgewidth=2.5)
plt.tight_layout()
plt.show()
We overlay segmentation masks over the original image. Positive points are displayed as green dots. Negative points are shown in red (Lines 85-116).
def reset_widget(b=None):
"""Clear all annotations"""
widget.bboxes = []
print("🔄 Reset! All points cleared.")
This clears previously selected points so we can start fresh (Lines 118-121).
# Create widget for point selection
widget = BBoxWidget(
    image=pil_to_base64(raw_image),
    classes=["positive", "negative"]
)
Users can click to add points anywhere on the image. The widget captures both position and label (Lines 124-127).
# Create UI buttons
segment_button = widgets.Button(
    description='🎯 Segment',
    button_style='success',
    tooltip='Run segmentation with marked points',
    icon='check',
    layout=widgets.Layout(width='150px', height='40px')
)
segment_button.on_click(segment_from_widget)

reset_button = widgets.Button(
    description='🔄 Reset',
    button_style='warning',
    tooltip='Clear all points',
    icon='refresh',
    layout=widgets.Layout(width='150px', height='40px')
)
reset_button.on_click(reset_widget)
We create UI buttons for:
- running segmentation (Lines 130-137)
- clearing annotations (Lines 139-146)
# Display UI
print("=" * 70)
print("🎨 INTERACTIVE SAM3 SEGMENTATION WITH BOUNDING BOX WIDGET")
print("=" * 70)
print("\n📋 Instructions:")
print(" 1. Draw SMALL boxes on the image where you want to mark points")
print(" 2. Label them as 'positive' (object) or 'negative' (background)")
print(" 3. The CENTER of each box will be used as a point coordinate")
print(" 4. Click 'Segment' button to run SAM3")
print(" 5. Click 'Reset' to clear all points and start over")
print("\n💡 Tips:")
print(" • Draw tiny boxes - just big enough to see")
print(" • Positive points = parts of the object you want")
print(" • Negative points = background areas to exclude")
print("\n" + "=" * 70 + "\n")
display(widgets.HBox([segment_button, reset_button]))
display(widget)
We render the interface side-by-side. The user can now:
- click positive points
- click negative points
- run segmentation live
- reset anytime
Output
In Figure 18, we can see the whole point-based segmentation process.
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: February 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In Part 2 of this tutorial, we explored the advanced capabilities of SAM 3, transforming it from a powerful segmentation tool into a flexible, interactive visual query system. We demonstrated how to leverage multiple prompt types (text, bounding boxes, and points) both individually and in combination to achieve precise, context-aware segmentation results.
We covered sophisticated workflows, including:
- Segmenting multiple concepts simultaneously in the same image
- Processing batches of images with different prompts efficiently
- Using positive bounding boxes to focus on regions of interest
- Employing negative prompts to exclude unwanted areas
- Combining text and visual prompts for selective, fine-grained control
- Building fully interactive segmentation interfaces where users can draw boxes or click points and see results in real-time
These techniques showcase SAM 3’s versatility for real-world applications. Whether you’re building large-scale data annotation pipelines, developing intelligent video editing tools, creating AR experiences, or conducting scientific research, the multi-modal prompting capabilities we explored give you pixel-perfect control over segmentation outputs.
Citation Information
Thakur, P. “Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation,” PyImageSearch, P. Chugh, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/5c4ag
@incollection{Thakur_2026_advanced-sam-3-multi-modal-prompting-and-interactive-segmentation,
author = {Piyush Thakur},
title = {{Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation}},
booktitle = {PyImageSearch},
editor = {Puneet Chugh and Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
year = {2026},
url = {https://pyimg.co/5c4ag},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!

Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Comment section
Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.
At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.
Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.
If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.
Click here to browse my full catalog.