SAM 3 enables powerful image segmentation using both natural language text prompts and visual prompts like bounding boxes. This guide covers the basics of running inference on images.

Setup

1. Import dependencies

import os
import matplotlib.pyplot as plt
import numpy as np
import sam3
from PIL import Image
from sam3 import build_sam3_image_model
from sam3.model.box_ops import box_xywh_to_cxcywh
from sam3.model.sam3_image_processor import Sam3Processor
from sam3.visualization_utils import draw_box_on_image, normalize_bbox, plot_results

sam3_root = os.path.join(os.path.dirname(sam3.__file__), "..")

2. Configure PyTorch for optimal performance

import torch

# Enable TF32 (TensorFloat-32) matmuls on Ampere and newer GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use bfloat16 for the entire notebook
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
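Calling `__enter__()` directly keeps the autocast context active for the whole notebook session. In a standalone script, the conventional pattern is to scope it explicitly; a minimal sketch:

```python
import torch

# Alternative for scripts: scope autocast explicitly rather than
# entering it globally with __enter__() as the notebook does.
with torch.autocast("cuda", dtype=torch.bfloat16):
    ...  # model building and inference calls go here
```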

3. Build the model

bpe_path = f"{sam3_root}/assets/bpe_simple_vocab_16e6.txt.gz"
model = build_sam3_image_model(bpe_path=bpe_path)

4. Load and process image

image_path = f"{sam3_root}/assets/images/test_image.jpg"
image = Image.open(image_path)
width, height = image.size
processor = Sam3Processor(model, confidence_threshold=0.5)
inference_state = processor.set_image(image)
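The `confidence_threshold=0.5` argument discards low-confidence detections from the results. Conceptually it behaves like a score filter; a sketch with hypothetical names (not the sam3 API):

```python
# Conceptual sketch of a confidence threshold: keep only detections
# whose score clears the cutoff. Names here are hypothetical, not
# part of the sam3 API.
def filter_by_confidence(detections, threshold=0.5):
    """detections: list of dicts with a 'score' key."""
    return [d for d in detections if d["score"] >= threshold]

detections = [
    {"label": "shoe", "score": 0.92},
    {"label": "shoe", "score": 0.41},
]
filter_by_confidence(detections)  # keeps only the 0.92 detection
```

Raising the threshold trades recall for precision: fewer, more confident masks.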

Text Prompts

Segment objects using natural language descriptions:
processor.reset_all_prompts(inference_state)
inference_state = processor.set_text_prompt(state=inference_state, prompt="shoe")

img0 = Image.open(image_path)
plot_results(img0, inference_state)
Text prompts work best with specific, concrete object names like “person”, “shoe”, “cat” rather than abstract descriptions.

Visual Prompts with Bounding Boxes

Single Box Prompt

Use a bounding box to specify which object to segment:
# Box in (x, y, w, h) format, where (x, y) is the top-left corner
box_input_xywh = torch.tensor([480.0, 290.0, 110.0, 360.0]).view(-1, 4)
box_input_cxcywh = box_xywh_to_cxcywh(box_input_xywh)

norm_box_cxcywh = normalize_bbox(box_input_cxcywh, width, height).flatten().tolist()
print("Normalized box input:", norm_box_cxcywh)

processor.reset_all_prompts(inference_state)
inference_state = processor.add_geometric_prompt(
    state=inference_state, box=norm_box_cxcywh, label=True
)

plot_results(img0, inference_state)
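The two helper calls above perform a standard coordinate transform: shift the top-left corner to the box center, then divide by the image dimensions. A pure-Python equivalent, assuming `box_xywh_to_cxcywh` and `normalize_bbox` implement those standard transforms (the image size below is illustrative, not the tutorial image's actual dimensions):

```python
# Pure-Python equivalent of box_xywh_to_cxcywh + normalize_bbox,
# assuming they implement the standard transforms:
# (x, y, w, h) with top-left corner -> (cx, cy, w, h) -> divide by image size.
def xywh_to_norm_cxcywh(box, img_w, img_h):
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2          # shift corner to center
    return [cx / img_w, cy / img_h, w / img_w, h / img_h]

# Illustrative image size
print(xywh_to_norm_cxcywh([480.0, 290.0, 110.0, 360.0], 1200, 800))
```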

Multi-Box Prompting with Positive and Negative Boxes

Refine segmentation using both positive (include) and negative (exclude) boxes:
box_input_xywh = [[480.0, 290.0, 110.0, 360.0], [370.0, 280.0, 115.0, 375.0]]
box_input_cxcywh = box_xywh_to_cxcywh(torch.tensor(box_input_xywh).view(-1, 4))
norm_boxes_cxcywh = normalize_bbox(box_input_cxcywh, width, height).tolist()

box_labels = [True, False]  # True = positive, False = negative

processor.reset_all_prompts(inference_state)

for box, label in zip(norm_boxes_cxcywh, box_labels):
    inference_state = processor.add_geometric_prompt(
        state=inference_state, box=box, label=label
    )

plot_results(img0, inference_state)
Boxes must be normalized to the image dimensions. The format is [center_x, center_y, width, height] where all values are in the range [0, 1].
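A common mistake is passing pixel coordinates where normalized ones are expected. A quick sanity check (a hypothetical helper, not part of sam3) can catch this before the prompt is added:

```python
# Optional sanity check (hypothetical helper, not part of sam3):
# every coordinate of a normalized cxcywh box must lie in [0, 1].
def is_normalized(box):
    return all(0.0 <= v <= 1.0 for v in box)

assert is_normalized([0.49, 0.59, 0.09, 0.47])
assert not is_normalized([480.0, 290.0, 110.0, 360.0])  # raw pixel coords
```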

Visualizing Results

The plot_results utility function displays:
  • Segmentation masks (colored overlays)
  • Bounding boxes around detected objects
  • Confidence scores
from sam3.visualization_utils import plot_results

plot_results(image, inference_state)

Box Coordinate Formats

SAM 3's helpers accept boxes as (x, y, w, h) in pixels, where (x, y) is the top-left corner; the model itself consumes normalized [center_x, center_y, width, height] coordinates in [0, 1]:
# Top-left corner + width/height in pixels
box_xywh = [x, y, w, h]

# Converted to the model's normalized, center-based format
box_cxcywh = box_xywh_to_cxcywh(torch.tensor(box_xywh).view(-1, 4))
norm_box = normalize_bbox(box_cxcywh, width, height)

Next Steps

Video Inference

Learn how to segment and track objects in videos

Batched Inference

Process multiple images efficiently in batches

Interactive Refinement

Refine segmentations interactively with additional prompts

SAM 3 Agent

Use complex natural language queries with MLLM integration
