Overview

The SAM 3 Agent combines SAM 3 with a multimodal large language model (MLLM) to provide an agentic interface for interactive segmentation. The agent iteratively refines segmentation results through tool calls and multi-turn reasoning.

Function

agent_inference

Perform agentic segmentation with iterative refinement.
from sam3.agent.agent_core import agent_inference
from sam3.agent.client_llm import send_generate_request
from sam3.agent.client_sam3 import call_sam_service

messages, final_outputs, rendered_image = agent_inference(
    img_path,
    initial_text_prompt,
    debug=False,
    send_generate_request=send_generate_request,
    call_sam_service=call_sam_service,
    max_generations=100,
    output_dir="./sam3_agent_out"
)

Parameters

img_path
str
required
Path to the input image file.
initial_text_prompt
str
required
Initial text prompt describing what to segment (e.g., “the dog in the image”).
debug
bool
default:"False"
Enable debug mode to save conversation history.
send_generate_request
callable
default:"send_generate_request"
Function to send requests to the multimodal LLM. Defaults to the built-in implementation.
call_sam_service
callable
default:"call_sam_service"
Function to call SAM 3 segmentation service. Defaults to the built-in implementation.
max_generations
int
default:"100"
Maximum number of MLLM generation rounds allowed.
output_dir
str
default:"'../../sam3_agent_out'"
Directory to save SAM 3 outputs and debug information.

Returns

messages
list[dict]
Conversation history between user and agent.
final_outputs
dict
Final segmentation results containing:
  • original_image_path: Path to original image
  • orig_img_h: Original image height
  • orig_img_w: Original image width
  • pred_boxes: List of bounding boxes
  • pred_scores: List of confidence scores
  • pred_masks: List of segmentation masks (RLE format)
rendered_image
PIL.Image.Image
Visualization with all selected masks rendered.

Agent Tools

The agent has access to four tools:

segment_phrase

Call SAM 3 with a text prompt to generate segmentation masks. Parameters:
  • text_prompt (str): Simple noun phrase describing objects to segment
Usage:
<tool>{"name": "segment_phrase", "parameters": {"text_prompt": "person"}}</tool>

examine_each_mask

Examine each generated mask individually using the MLLM to filter out incorrect predictions.
Parameters: None
Usage:
<tool>{"name": "examine_each_mask", "parameters": {}}</tool>

select_masks_and_return

Select final masks to return as the answer. Parameters:
  • final_answer_masks (list[int]): Mask indices to return (1-indexed)
Usage:
<tool>{"name": "select_masks_and_return", "parameters": {"final_answer_masks": [1, 3]}}</tool>

report_no_mask

Report that no valid masks exist for the query.
Parameters: None
Usage:
<tool>{"name": "report_no_mask", "parameters": {}}</tool>
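All four tools share the same `<tool>{...}</tool>` JSON wire format. A minimal parser for this format might look like the following sketch (illustrative only; the actual parsing logic in `sam3.agent.agent_core` may differ):

```python
import json
import re

def parse_tool_call(text: str):
    """Extract the first <tool>...</tool> JSON payload from MLLM output.

    Returns (name, parameters) or None if no well-formed call is found.
    """
    match = re.search(r"<tool>(.*?)</tool>", text, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return call.get("name"), call.get("parameters", {})

# Example: parse a segment_phrase call
name, params = parse_tool_call(
    '<tool>{"name": "segment_phrase", "parameters": {"text_prompt": "person"}}</tool>'
)
```

Malformed JSON or missing tags yield `None`, so a dispatcher can fall back to re-prompting the MLLM instead of crashing.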

Example Usage

Basic Usage

from sam3.agent.agent_core import agent_inference

# Run agent on an image
messages, outputs, rendered = agent_inference(
    img_path="image.jpg",
    initial_text_prompt="the red car on the left"
)

# Access results
print(f"Found {len(outputs['pred_masks'])} masks")
rendered.save("result.jpg")

With Debug Mode

# Enable debug to save conversation history
messages, outputs, rendered = agent_inference(
    img_path="complex_scene.jpg",
    initial_text_prompt="all people wearing hats",
    debug=True,
    output_dir="./debug_output"
)

# Debug files saved to ./debug_output/agent_debug_out/

Handling Multiple Objects

# Query for multiple objects
messages, outputs, rendered = agent_inference(
    img_path="street.jpg",
    initial_text_prompt="cars and trucks in the scene"
)

# Process each detected object
for i, (mask, box, score) in enumerate(
    zip(outputs['pred_masks'], outputs['pred_boxes'], outputs['pred_scores'])
):
    print(f"Object {i+1}: score={score:.3f}, box={box}")

Custom Output Directory

# Organize outputs by task
messages, outputs, rendered = agent_inference(
    img_path="image.jpg",
    initial_text_prompt="the main subject",
    output_dir="./experiments/run_001"
)

# Outputs saved to:
# - ./experiments/run_001/sam_out/ (SAM 3 visualizations)
# - ./experiments/run_001/agent_debug_out/ (debug info)

Agent Workflow

The agent follows this workflow:
  1. Initial Segmentation: Calls segment_phrase with a text prompt
  2. Evaluation: MLLM examines generated masks
  3. Refinement (if needed):
    • Call examine_each_mask to filter masks
    • Or call segment_phrase with different prompt
  4. Selection: Call select_masks_and_return with chosen masks
Example agent conversation:
User: "the brown dog"

Agent: <tool>{"name": "segment_phrase", "parameters": {"text_prompt": "dog"}}</tool>

System: Generated 3 masks

Agent: <tool>{"name": "examine_each_mask", "parameters": {}}</tool>

System: After examination, 2 masks remain (original mask 1 rejected; the remaining masks are re-indexed)

Agent: <tool>{"name": "select_masks_and_return", "parameters": {"final_answer_masks": [1]}}</tool>

Output Structure

final_outputs Dictionary

final_outputs = {
    "original_image_path": str,  # Path to input image
    "orig_img_h": int,  # Original height in pixels
    "orig_img_w": int,  # Original width in pixels
    "pred_boxes": List[List[float]],  # Bounding boxes (XYWH format)
    "pred_scores": List[float],  # Confidence scores
    "pred_masks": List[Dict],  # RLE-encoded masks
}
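Since `pred_boxes` use XYWH format, converting to corner coordinates is a common first step before drawing or IoU computation. A small sketch (the field names follow the dictionary above; the conversion itself is standard):

```python
def xywh_to_xyxy(box):
    """Convert an [x, y, w, h] box to [x1, y1, x2, y2] corner format."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def iter_detections(final_outputs):
    """Yield (box, score, mask) triples; the three lists are parallel,
    one entry per detected object."""
    yield from zip(
        final_outputs["pred_boxes"],
        final_outputs["pred_scores"],
        final_outputs["pred_masks"],
    )

corners = xywh_to_xyxy([10.0, 20.0, 30.0, 40.0])
```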

Mask Format

Masks are stored in RLE (Run-Length Encoding) format:
import pycocotools.mask as mask_utils

# Decode RLE mask to binary array
for rle_mask in final_outputs["pred_masks"]:
    binary_mask = mask_utils.decode(rle_mask)  # (H, W) bool array
    # Use the mask
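If `pycocotools` is unavailable, uncompressed COCO-style RLE (a `{"size": [h, w], "counts": [...]}` dict with integer run lengths) can be decoded by hand: runs alternate between 0 and 1 starting with zeros, laid out column-major. A minimal sketch for that uncompressed case (compressed string-encoded counts still require `pycocotools`):

```python
def decode_uncompressed_rle(rle):
    """Decode a COCO-style uncompressed RLE dict into a nested-list binary mask.

    Runs alternate between 0 and 1 (starting with 0) and are laid out
    column-major, matching pycocotools' convention.
    """
    h, w = rle["size"]
    flat = []
    value = 0
    for count in rle["counts"]:
        flat.extend([value] * count)
        value = 1 - value
    # Column-major layout: element (row, col) lives at flat[col * h + row]
    return [[flat[col * h + row] for col in range(w)] for row in range(h)]

# 2x2 mask: first column all zeros, second column all ones
mask = decode_uncompressed_rle({"size": [2, 2], "counts": [2, 2]})
```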

System Prompts

The agent uses two system prompts:
  1. Main System Prompt (system_prompt.txt): Guides tool selection and reasoning
  2. Iterative Checking Prompt (system_prompt_iterative_checking.txt): Used by examine_each_mask for mask filtering
These prompts are located in sam3/agent/system_prompts/.

Advanced Features

Iterative Mask Examination

The examine_each_mask tool uses a separate MLLM call for each mask:
# For each mask, MLLM sees:
# - Original image
# - Image with mask overlay
# - Zoomed-in mask region
# 
# MLLM returns: <verdict>Accept</verdict> or <verdict>Reject</verdict>
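Extracting the verdict from the examiner's response can be done with a small helper like this sketch (the tag format follows the comment above; the actual extraction code may differ):

```python
import re

def parse_verdict(response: str) -> bool:
    """Return True for <verdict>Accept</verdict>, False for <verdict>Reject</verdict>.

    Raises ValueError if no verdict tag is present in the response.
    """
    match = re.search(
        r"<verdict>\s*(Accept|Reject)\s*</verdict>", response, re.IGNORECASE
    )
    if match is None:
        raise ValueError("no verdict tag in response")
    return match.group(1).lower() == "accept"

keep = parse_verdict("The mask covers the brown dog. <verdict>Accept</verdict>")
```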

Prompt Deduplication

The agent tracks used text prompts and prevents reusing them:
# If agent tries "dog" twice:
System: "You have previously used 'dog'. Please try a different prompt."
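One way to implement this tracking is a set check before each `segment_phrase` call, as in this sketch (the `check_prompt` helper is illustrative, not part of the library's API):

```python
used_prompts = set()

def check_prompt(text_prompt):
    """Return a warning string if the prompt was already used; otherwise
    record it and return None so the call can proceed."""
    normalized = text_prompt.strip().lower()
    if normalized in used_prompts:
        return (
            f"You have previously used '{text_prompt}'. "
            "Please try a different prompt."
        )
    used_prompts.add(normalized)
    return None

first = check_prompt("dog")   # None: first use is allowed
second = check_prompt("dog")  # warning string on reuse
```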

Context Pruning

To manage context length, the conversation history is pruned:
  • Always keeps: First 2 messages (system + initial user message)
  • Always keeps: Latest segment_phrase tool call and subsequent messages
  • Adds warnings about previously failed prompts
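The pruning rule above can be sketched as a function over a list of role/content message dicts (the message shape and detection logic here are assumptions for illustration):

```python
def prune_history(messages):
    """Keep the first two messages (system + initial user message) plus
    everything from the latest segment_phrase tool call onward."""
    last_call = None
    for i, msg in enumerate(messages):
        if "segment_phrase" in str(msg.get("content", "")):
            last_call = i
    if last_call is None or last_call < 2:
        return messages
    return messages[:2] + messages[last_call:]

history = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "the brown dog"},
    {"role": "assistant", "content": '<tool>{"name": "segment_phrase"}</tool>'},
    {"role": "system", "content": "Generated 3 masks"},
    {"role": "assistant", "content": '<tool>{"name": "segment_phrase"}</tool>'},
    {"role": "system", "content": "Generated 1 mask"},
]
pruned = prune_history(history)  # drops the first tool call and its result
```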

Error Handling

No Masks Found

# If SAM 3 returns no masks:
# Agent can either:
# 1. Try different text prompt
# 2. Call report_no_mask

messages, outputs, rendered = agent_inference(
    img_path="image.jpg",
    initial_text_prompt="unicorn"  # Not in image
)

# outputs will have empty lists:
assert len(outputs["pred_masks"]) == 0

Maximum Generations Exceeded

try:
    messages, outputs, rendered = agent_inference(
        img_path="complex.jpg",
        initial_text_prompt="find the specific object",
        max_generations=10  # Low limit
    )
except ValueError as e:
    print(f"Agent exceeded generation limit: {e}")

Requirements

  • SAM 3 Service: Running SAM 3 segmentation service
  • MLLM Service: Multimodal LLM with vision capabilities (e.g., Qwen-VL)
  • Client Functions:
    • send_generate_request(): Send prompts to MLLM
    • call_sam_service(): Call SAM 3 for segmentation

Notes

  • The agent uses simple noun phrases for best SAM 3 performance
  • MLLM must support vision and tool calling
  • Debug mode saves full conversation history for analysis
  • Agent autonomously decides when to stop iterating
  • Supports both single and multiple object queries
