
Overview

The Sam3Processor class provides a high-level interface for using SAM 3 on images with text and geometric prompts. It handles image preprocessing, prompt encoding, and result post-processing.

Class Initialization

from sam3.model.sam3_image_processor import Sam3Processor

processor = Sam3Processor(
    model,
    resolution=1008,
    device="cuda",
    confidence_threshold=0.5
)

Parameters

• model (Sam3Image, required): The SAM 3 image model instance.
• resolution (int, default: 1008): Input image resolution; images are resized to resolution × resolution.
• device (str, default: "cuda"): Device to run inference on.
• confidence_threshold (float, default: 0.5): Confidence threshold for filtering predictions.

Methods

set_image

Sets the image for inference and computes image embeddings.
state = processor.set_image(image, state=None)
Parameters:
• image (PIL.Image.Image | torch.Tensor | np.ndarray, required): Input image in RGB format. Can be a PIL Image, a PyTorch tensor, or a NumPy array.
• state (dict | None, default: None): Optional state dictionary. If None, a new state is created.

Returns:
• state (dict): Updated state containing image embeddings and metadata:
  • original_height: Original image height
  • original_width: Original image width
  • backbone_out: Backbone feature maps

set_image_batch

Sets a batch of images for inference.
state = processor.set_image_batch(images, state=None)
Parameters:
• images (list[PIL.Image.Image], required): List of PIL images to process.
• state (dict | None, default: None): Optional state dictionary. If None, a new state is created.

Returns:
• state (dict): State containing:
  • original_heights: List of original image heights
  • original_widths: List of original image widths
  • backbone_out: Batch backbone features

set_text_prompt

Sets text prompt and runs inference.
state = processor.set_text_prompt(prompt, state)
Parameters:
• prompt (str, required): Text description of the objects to segment (e.g., "person", "dog").
• state (dict, required): State dictionary from set_image(). Must contain image embeddings.

Returns:
• state (dict): Updated state with segmentation results:
  • masks: Binary masks (bool tensor)
  • masks_logits: Mask logits (float tensor)
  • boxes: Bounding boxes in [x0, y0, x1, y1] format
  • scores: Confidence scores
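Predictions are parallel arrays: scores[i] scores masks[i] and boxes[i]. A minimal sketch of index-based post-processing, using plain Python lists with hypothetical values in place of the real result tensors:

```python
# Stand-ins for state["boxes"] and state["scores"] after set_text_prompt
# (hypothetical values; the real entries are tensors, one row per prediction).
boxes = [[10.0, 20.0, 110.0, 220.0], [50.0, 60.0, 90.0, 160.0]]
scores = [0.92, 0.61]

# Predictions share an index, so the highest-scoring box is found by
# picking the index of the largest score.
best = max(range(len(scores)), key=lambda i: scores[i])
best_box, best_score = boxes[best], scores[best]
print(best_box, best_score)  # -> [10.0, 20.0, 110.0, 220.0] 0.92
```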

add_geometric_prompt

Adds a box prompt and runs inference.
state = processor.add_geometric_prompt(box, label, state)
Parameters:
• box (list[float], required): Box in [center_x, center_y, width, height] format, normalized to [0, 1].
• label (bool, required): True for a positive box (include), False for a negative box (exclude).
• state (dict, required): State dictionary with image embeddings.

Returns:
• state (dict): Updated state with new segmentation results.
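Note the format mismatch: prompts use normalized [center_x, center_y, width, height], while result boxes use [x0, y0, x1, y1]. A hypothetical helper (not part of the library) for converting a pixel-space corner box into the prompt format:

```python
def xyxy_to_normalized_cxcywh(box, image_width, image_height):
    """Convert a pixel-space [x0, y0, x1, y1] box to the normalized
    [center_x, center_y, width, height] format add_geometric_prompt expects."""
    x0, y0, x1, y1 = box
    return [
        (x0 + x1) / 2 / image_width,   # center_x in [0, 1]
        (y0 + y1) / 2 / image_height,  # center_y in [0, 1]
        (x1 - x0) / image_width,       # width in [0, 1]
        (y1 - y0) / image_height,      # height in [0, 1]
    ]

# A 200x300 pixel box centered in a 1000x1000 image
print(xyxy_to_normalized_cxcywh([400, 350, 600, 650], 1000, 1000))
# -> [0.5, 0.5, 0.2, 0.3]
```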

reset_all_prompts

Removes all prompts and results from the state.
processor.reset_all_prompts(state)
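Conceptually, this drops the accumulated prompts and their results from the state while keeping the cached image embeddings, so new prompts can be issued without recomputing backbone_out. A rough sketch of that idea on a plain dict (an assumption about the implementation, not the library's actual code):

```python
def reset_all_prompts_sketch(state):
    """Drop prompt and result entries; keep cached image data."""
    for key in ("geometric_prompt", "masks", "masks_logits", "boxes", "scores"):
        state.pop(key, None)
    return state

# Hypothetical state after one round of prompting
state = {"backbone_out": "cached features", "masks": ["m0"], "scores": [0.9]}
reset_all_prompts_sketch(state)
print(sorted(state))  # -> ['backbone_out']
```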

set_confidence_threshold

Updates the confidence threshold and re-filters results.
state = processor.set_confidence_threshold(threshold, state=None)
Parameters:
• threshold (float, required): New confidence threshold (0.0 to 1.0).
• state (dict | None, default: None): State containing existing predictions to re-filter.
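The re-filtering amounts to keeping only the predictions whose score clears the new threshold. A rough sketch of that logic in plain Python (the processor operates on tensors, and whether the comparison is strict is an implementation detail):

```python
def filter_by_confidence(scores, boxes, threshold):
    """Keep only predictions whose score is at least `threshold`."""
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return [scores[i] for i in keep], [boxes[i] for i in keep]

# Hypothetical predictions
scores = [0.95, 0.72, 0.41]
boxes = [[0, 0, 10, 10], [5, 5, 20, 20], [8, 2, 12, 9]]
kept_scores, kept_boxes = filter_by_confidence(scores, boxes, 0.8)
print(kept_scores)  # -> [0.95]
```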

Example Usage

Basic Text Prompting

from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model and create processor
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load image
image = Image.open("image.jpg")

# Set image and text prompt
state = processor.set_image(image)
state = processor.set_text_prompt("person", state)

# Access results
masks = state["masks"]  # Binary masks
boxes = state["boxes"]  # Bounding boxes
scores = state["scores"]  # Confidence scores

Adding Box Prompts

# Set image
state = processor.set_image(image)

# Add positive box (normalized coordinates)
box = [0.5, 0.5, 0.3, 0.4]  # center_x, center_y, width, height
state = processor.add_geometric_prompt(box, label=True, state=state)

# Add negative box to exclude region
exclude_box = [0.7, 0.3, 0.2, 0.2]
state = processor.add_geometric_prompt(exclude_box, label=False, state=state)

Adjusting Confidence Threshold

# Initial inference
state = processor.set_image(image)
state = processor.set_text_prompt("car", state)

print(f"Found {len(state['scores'])} masks")

# Increase threshold to get fewer, higher-confidence results
state = processor.set_confidence_threshold(0.8, state)
print(f"After filtering: {len(state['scores'])} masks")

Batch Processing

# Load multiple images
images = [Image.open(f"image_{i}.jpg") for i in range(5)]

# Process batch
state = processor.set_image_batch(images)
state = processor.set_text_prompt("dog", state)

# Results contain predictions for all images

State Dictionary Structure

The state dictionary contains:
  • original_height / original_heights: Original image dimensions
  • original_width / original_widths: Original image dimensions
  • backbone_out: Cached backbone features
  • geometric_prompt: Current geometric prompts
  • masks: Binary segmentation masks (H, W)
  • masks_logits: Mask logits before thresholding
  • boxes: Bounding boxes in [x0, y0, x1, y1] format
  • scores: Confidence scores for each prediction
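Since the state keeps both masks and masks_logits, custom thresholding is possible. Binary masks are conventionally the logits thresholded at zero, i.e. sigmoid probability ≥ 0.5; this sketch illustrates that convention, which is an assumption about this implementation rather than confirmed behavior:

```python
import math

def logits_to_mask(logits):
    """Threshold mask logits at 0, equivalent to sigmoid(logit) >= 0.5."""
    return [[value > 0 for value in row] for row in logits]

logits = [[2.0, -1.5], [-0.1, 0.3]]  # a tiny hypothetical 2x2 logit map
print(logits_to_mask(logits))  # -> [[True, False], [False, True]]

# Sanity check: a positive logit maps to a probability of at least 0.5
assert 1 / (1 + math.exp(-2.0)) >= 0.5
```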

Notes

  • Call set_image() before adding any prompts
  • Text prompts work best with simple noun phrases
  • Box coordinates are normalized to [0, 1] range
  • Geometric prompts are accumulated (multiple boxes/points)
  • Use reset_all_prompts() to start fresh
