SAM 3 enables powerful image segmentation using both natural language text prompts and visual prompts like bounding boxes. This guide covers the basics of running inference on images.

Setup

1. Import dependencies

import os
import matplotlib.pyplot as plt
import numpy as np
import sam3
from PIL import Image
from sam3 import build_sam3_image_model
from sam3.model.box_ops import box_xywh_to_cxcywh
from sam3.model.sam3_image_processor import Sam3Processor
from sam3.visualization_utils import draw_box_on_image, normalize_bbox, plot_results

sam3_root = os.path.join(os.path.dirname(sam3.__file__), "..")

2. Configure PyTorch for optimal performance

import torch

# Enable TF32 (TensorFloat-32) matmuls on Ampere and newer GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use bfloat16 for the entire notebook
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
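Calling `__enter__()` directly keeps the autocast context active for the whole notebook session. In a standalone script, the conventional pattern is to scope it explicitly; a minimal sketch:

```python
import torch

# Alternative for scripts: scope autocast explicitly rather than
# entering it globally with __enter__() as the notebook does.
with torch.autocast("cuda", dtype=torch.bfloat16):
    ...  # model building and inference calls go here
```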

3. Build the model

bpe_path = f"{sam3_root}/assets/bpe_simple_vocab_16e6.txt.gz"
model = build_sam3_image_model(bpe_path=bpe_path)

4. Load and process image

image_path = f"{sam3_root}/assets/images/test_image.jpg"
image = Image.open(image_path)
width, height = image.size
processor = Sam3Processor(model, confidence_threshold=0.5)
inference_state = processor.set_image(image)
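The `confidence_threshold=0.5` argument discards low-confidence detections from the results. Conceptually it behaves like a score filter; a sketch with hypothetical names (not the sam3 API):

```python
# Conceptual sketch of a confidence threshold: keep only detections
# whose score clears the cutoff. Names here are hypothetical, not
# part of the sam3 API.
def filter_by_confidence(detections, threshold=0.5):
    """detections: list of dicts with a 'score' key."""
    return [d for d in detections if d["score"] >= threshold]

detections = [
    {"label": "shoe", "score": 0.92},
    {"label": "shoe", "score": 0.41},
]
filter_by_confidence(detections)  # keeps only the 0.92 detection
```

Raising the threshold trades recall for precision: fewer, more confident masks.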

Text Prompts

Segment objects using natural language descriptions:
processor.reset_all_prompts(inference_state)
inference_state = processor.set_text_prompt(state=inference_state, prompt="shoe")

img0 = Image.open(image_path)
plot_results(img0, inference_state)
Text prompts work best with specific, concrete object names like “person”, “shoe”, “cat” rather than abstract descriptions.

Visual Prompts with Bounding Boxes

Single Box Prompt

Use a bounding box to specify which object to segment:
# Box in (x, y, w, h) format, where (x, y) is the top-left corner
box_input_xywh = torch.tensor([480.0, 290.0, 110.0, 360.0]).view(-1, 4)
box_input_cxcywh = box_xywh_to_cxcywh(box_input_xywh)

norm_box_cxcywh = normalize_bbox(box_input_cxcywh, width, height).flatten().tolist()
print("Normalized box input:", norm_box_cxcywh)

processor.reset_all_prompts(inference_state)
inference_state = processor.add_geometric_prompt(
    state=inference_state, box=norm_box_cxcywh, label=True
)

plot_results(img0, inference_state)
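The two helper calls above perform a standard coordinate transform: shift the top-left corner to the box center, then divide by the image dimensions. A pure-Python equivalent, assuming `box_xywh_to_cxcywh` and `normalize_bbox` implement those standard transforms (the image size below is illustrative, not the tutorial image's actual dimensions):

```python
# Pure-Python equivalent of box_xywh_to_cxcywh + normalize_bbox,
# assuming they implement the standard transforms:
# (x, y, w, h) with top-left corner -> (cx, cy, w, h) -> divide by image size.
def xywh_to_norm_cxcywh(box, img_w, img_h):
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2          # shift corner to center
    return [cx / img_w, cy / img_h, w / img_w, h / img_h]

# Illustrative image size
print(xywh_to_norm_cxcywh([480.0, 290.0, 110.0, 360.0], 1200, 800))
```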

Multi-Box Prompting with Positive and Negative Boxes

Refine segmentation using both positive (include) and negative (exclude) boxes:
box_input_xywh = [[480.0, 290.0, 110.0, 360.0], [370.0, 280.0, 115.0, 375.0]]
box_input_cxcywh = box_xywh_to_cxcywh(torch.tensor(box_input_xywh).view(-1, 4))
norm_boxes_cxcywh = normalize_bbox(box_input_cxcywh, width, height).tolist()

box_labels = [True, False]  # True = positive, False = negative

processor.reset_all_prompts(inference_state)

for box, label in zip(norm_boxes_cxcywh, box_labels):
    inference_state = processor.add_geometric_prompt(
        state=inference_state, box=box, label=label
    )

plot_results(img0, inference_state)
Boxes must be normalized to the image dimensions. The format is [center_x, center_y, width, height] where all values are in the range [0, 1].
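A common mistake is passing pixel coordinates where normalized ones are expected. A quick sanity check (a hypothetical helper, not part of sam3) can catch this before the prompt is added:

```python
# Optional sanity check (hypothetical helper, not part of sam3):
# every coordinate of a normalized cxcywh box must lie in [0, 1].
def is_normalized(box):
    return all(0.0 <= v <= 1.0 for v in box)

assert is_normalized([0.49, 0.59, 0.09, 0.47])
assert not is_normalized([480.0, 290.0, 110.0, 360.0])  # raw pixel coords
```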

Visualizing Results

The plot_results utility function displays:
  • Segmentation masks (colored overlays)
  • Bounding boxes around detected objects
  • Confidence scores
from sam3.visualization_utils import plot_results

plot_results(image, inference_state)

Box Coordinate Formats

SAM 3's helpers accept boxes as (x, y, w, h) in pixels, where (x, y) is the top-left corner; the model itself consumes normalized [center_x, center_y, width, height] coordinates in [0, 1]:
# Top-left corner + width/height in pixels
box_xywh = [x, y, w, h]

# Converted to the model's normalized, center-based format
box_cxcywh = box_xywh_to_cxcywh(torch.tensor(box_xywh).view(-1, 4))
norm_box = normalize_bbox(box_cxcywh, width, height)

Next Steps

Video Inference

Learn how to segment and track objects in videos

Batched Inference

Process multiple images efficiently in batches

Interactive Refinement

Refine segmentations interactively with additional prompts

SAM 3 Agent

Use complex natural language queries with MLLM integration
