Overview
The Sam3Processor class provides a high-level interface for using SAM 3 on images with text and geometric prompts. It handles image preprocessing, prompt encoding, and result post-processing.
Class Initialization
Parameters
The SAM 3 image model instance.
Input image resolution (images are resized to resolution × resolution).
Device to run inference on.
Confidence threshold for filtering predictions.
Methods
set_image
Sets the image for inference and computes image embeddings.
Input image in RGB format. Can be a PIL image, PyTorch tensor, or NumPy array.
Optional state dictionary. If None, creates a new state.
Updated state containing image embeddings and metadata:
- original_height: Original image height
- original_width: Original image width
- backbone_out: Backbone feature maps
set_image_batch
Sets a batch of images for inference.
List of PIL images to process.
State containing:
- original_heights: List of original heights
- original_widths: List of original widths
- backbone_out: Batch backbone features
set_text_prompt
Sets text prompt and runs inference.
Text description of objects to segment (e.g., “person”, “dog”).
State dictionary from set_image(). Must contain image embeddings.
Updated state with segmentation results:
- masks: Binary masks (bool tensor)
- masks_logits: Mask logits (float tensor)
- boxes: Bounding boxes in [x0, y0, x1, y1] format
- scores: Confidence scores
add_geometric_prompt
Adds a box prompt and runs inference.
Box in [center_x, center_y, width, height] format, normalized to [0, 1].
True for positive box (include), False for negative box (exclude).
State dictionary with image embeddings.
Updated state with new segmentation results.
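Boxes are often available as pixel corner coordinates, while this method expects a normalized center format. The following is a minimal sketch of that conversion, assuming the input box is [x0, y0, x1, y1] in pixels. The helper name xyxy_to_normalized_cxcywh is hypothetical and is not part of SAM 3.

```python
def xyxy_to_normalized_cxcywh(box, image_width, image_height):
    """Convert a pixel [x0, y0, x1, y1] box to normalized [cx, cy, w, h].

    Hypothetical helper for illustration; not part of the SAM 3 API.
    """
    x0, y0, x1, y1 = box
    if x1 < x0 or y1 < y0:
        raise ValueError("expected x0 <= x1 and y0 <= y1")
    # Center coordinates, divided by the image size to land in [0, 1].
    cx = (x0 + x1) / 2 / image_width
    cy = (y0 + y1) / 2 / image_height
    # Width and height as fractions of the image size.
    w = (x1 - x0) / image_width
    h = (y1 - y0) / image_height
    return [cx, cy, w, h]


# Example: a 200 x 100 pixel box at the top-left of a 400 x 400 image.
normalized_box = xyxy_to_normalized_cxcywh([0, 0, 200, 100], 400, 400)
```

The resulting list can then be passed as the box argument together with True or False for the positive or negative label.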
reset_all_prompts
Removes all prompts and results from the state.
set_confidence_threshold
Updates the confidence threshold and re-filters results.
New confidence threshold (0.0 to 1.0).
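Conceptually, re-filtering keeps only the predictions whose score clears the threshold. The sketch below illustrates that idea on plain Python lists. It is not the library's implementation: filter_by_score is a hypothetical helper, and the use of a strict comparison (score > threshold) is an assumption.

```python
def filter_by_score(scores, items, threshold):
    """Return the (score, item) pairs whose score exceeds the threshold.

    Hypothetical illustration of confidence filtering; the comparison
    used inside SAM 3 may differ (for example, >= instead of >).
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if len(scores) != len(items):
        raise ValueError("scores and items must have the same length")
    return [(score, item) for score, item in zip(scores, items) if score > threshold]


# Example: keep only the predictions scoring above 0.5.
kept = filter_by_score([0.9, 0.4, 0.6], ["mask_a", "mask_b", "mask_c"], 0.5)
```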
Example Usage
Basic Text Prompting
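A minimal sketch of the text-prompt flow, assuming processor is an already constructed Sam3Processor. The wrapper segment_with_text is hypothetical, and the keyword names prompt and state are assumptions based on the parameter descriptions above.

```python
def segment_with_text(processor, image, prompt):
    """Embed one image, apply a text prompt, and return the documented outputs.

    `processor` is expected to behave like Sam3Processor. The keyword
    names `prompt` and `state` are assumptions, not confirmed signatures.
    """
    # Compute and cache the image embeddings in a fresh state.
    state = processor.set_image(image)
    # Run inference for the text prompt against the cached embeddings.
    state = processor.set_text_prompt(prompt=prompt, state=state)
    # These keys are the ones listed for set_text_prompt above.
    return state["masks"], state["boxes"], state["scores"]
```

Typical use would be to open an RGB image, call segment_with_text(processor, image, "dog"), and then iterate over the returned masks, boxes, and scores together.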
Adding Box Prompts
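A minimal sketch of box prompting, assuming processor is an already constructed Sam3Processor and that add_geometric_prompt takes the box, the label, and the state in that order (the order in which they are described above). The wrapper segment_with_boxes is hypothetical.

```python
def segment_with_boxes(processor, image, positive_boxes, negative_boxes=()):
    """Embed one image, then add positive and negative box prompts.

    Each box is [center_x, center_y, width, height], normalized to [0, 1].
    The positional order (box, label, state) is an assumption.
    """
    state = processor.set_image(image)
    # Positive boxes mark regions to include.
    for box in positive_boxes:
        state = processor.add_geometric_prompt(box, True, state)
    # Negative boxes mark regions to exclude.
    for box in negative_boxes:
        state = processor.add_geometric_prompt(box, False, state)
    return state
```

Because each call runs inference, the returned state reflects the results after the last box was added.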
Adjusting Confidence Threshold
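A minimal sketch of comparing thresholds, assuming processor is an already constructed Sam3Processor. Only the threshold parameter of set_confidence_threshold is described above, so this sketch re-runs the text prompt after each change instead of relying on in-place re-filtering. The wrapper count_detections is hypothetical, and the keyword names prompt and state are assumptions.

```python
def count_detections(processor, image, prompt, thresholds):
    """Count the predictions kept at each confidence threshold.

    Re-runs the text prompt after each threshold change; this is a
    conservative choice, not necessarily the most efficient one.
    """
    state = processor.set_image(image)
    counts = {}
    for threshold in thresholds:
        # Only the threshold argument is documented, so only it is passed.
        processor.set_confidence_threshold(threshold)
        state = processor.set_text_prompt(prompt=prompt, state=state)
        counts[threshold] = len(state["scores"])
    return counts
```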
Batch Processing
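A minimal sketch of batch embedding, assuming processor is an already constructed Sam3Processor and images is a list of PIL images. The wrapper embed_batch is hypothetical; it reads only the state keys documented for set_image_batch.

```python
def embed_batch(processor, images):
    """Embed a list of images and return the state plus per-image sizes.

    Sizes are (height, width) pairs taken from the documented
    `original_heights` and `original_widths` keys.
    """
    state = processor.set_image_batch(images)
    sizes = list(zip(state["original_heights"], state["original_widths"]))
    return state, sizes
```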
State Dictionary Structure
The state dictionary contains:
- original_height / original_heights: Original image height (single image) or list of heights (batch)
- original_width / original_widths: Original image width (single image) or list of widths (batch)
- backbone_out: Cached backbone features
- geometric_prompt: Current geometric prompts
- masks: Binary segmentation masks (H, W)
- masks_logits: Mask logits before thresholding
- boxes: Bounding boxes in [x0, y0, x1, y1] format
- scores: Confidence scores for each prediction
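Since the state is a plain dictionary, it can be inspected with ordinary dictionary operations. The sketch below reports which of the keys listed above are present. The helpers describe_state and has_results are hypothetical, and the sketch assumes that result keys are absent until a prompt has been run.

```python
# Keys listed in this section; single-image and batch variants both included.
DOCUMENTED_KEYS = (
    "original_height", "original_heights",
    "original_width", "original_widths",
    "backbone_out", "geometric_prompt",
    "masks", "masks_logits", "boxes", "scores",
)


def describe_state(state):
    """Map each documented key to whether it is present in the state."""
    return {key: key in state for key in DOCUMENTED_KEYS}


def has_results(state):
    """Return True if the state holds segmentation results."""
    return all(key in state for key in ("masks", "boxes", "scores"))


# Example with a hand-built state that has embeddings but no results yet.
example_state = {"original_height": 480, "original_width": 640, "backbone_out": None}
summary = describe_state(example_state)
```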
Notes
- Call set_image() before adding any prompts
- Text prompts work best with simple noun phrases
- Box coordinates are normalized to [0, 1] range
- Geometric prompts are accumulated (multiple boxes/points)
- Use reset_all_prompts() to start fresh