Overview

The SAM 3 Agent combines SAM 3 with a multimodal large language model (MLLM) to provide an agentic interface for interactive segmentation. The agent iteratively refines segmentation results through tool calls and multi-turn reasoning.

Function

agent_inference

Perform agentic segmentation with iterative refinement.
from sam3.agent.agent_core import agent_inference
from sam3.agent.client_llm import send_generate_request
from sam3.agent.client_sam3 import call_sam_service

messages, final_outputs, rendered_image = agent_inference(
    img_path,
    initial_text_prompt,
    debug=False,
    send_generate_request=send_generate_request,
    call_sam_service=call_sam_service,
    max_generations=100,
    output_dir="./sam3_agent_out"
)

Parameters

img_path
str
required
Path to the input image file.
initial_text_prompt
str
required
Initial text prompt describing what to segment (e.g., “the dog in the image”).
debug
bool
default:"False"
Enable debug mode to save conversation history.
send_generate_request
callable
default:"send_generate_request"
Function to send requests to the multimodal LLM. Defaults to the built-in implementation.
call_sam_service
callable
default:"call_sam_service"
Function to call SAM 3 segmentation service. Defaults to the built-in implementation.
max_generations
int
default:"100"
Maximum number of MLLM generation rounds allowed.
output_dir
str
default:"'../../sam3_agent_out'"
Directory to save SAM 3 outputs and debug information.

Returns

messages
list[dict]
Conversation history between user and agent.
final_outputs
dict
Final segmentation results containing:
  • original_image_path: Path to original image
  • orig_img_h: Original image height
  • orig_img_w: Original image width
  • pred_boxes: List of bounding boxes
  • pred_scores: List of confidence scores
  • pred_masks: List of segmentation masks (RLE format)
rendered_image
PIL.Image.Image
Visualization with all selected masks rendered.

Agent Tools

The agent has access to four tools:

segment_phrase

Call SAM 3 with a text prompt to generate segmentation masks. Parameters:
  • text_prompt (str): Simple noun phrase describing objects to segment
Usage:
<tool>{"name": "segment_phrase", "parameters": {"text_prompt": "person"}}</tool>

examine_each_mask

Examine each generated mask individually using the MLLM to filter out incorrect predictions.
Parameters: None
Usage:
<tool>{"name": "examine_each_mask", "parameters": {}}</tool>

select_masks_and_return

Select final masks to return as the answer. Parameters:
  • final_answer_masks (list[int]): Mask indices to return (1-indexed)
Usage:
<tool>{"name": "select_masks_and_return", "parameters": {"final_answer_masks": [1, 3]}}</tool>

report_no_mask

Report that no valid masks exist for the query.
Parameters: None
Usage:
<tool>{"name": "report_no_mask", "parameters": {}}</tool>
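All four tools share the same `<tool>{...}</tool>` JSON wire format. A minimal parser for this format might look like the following sketch (illustrative only; the actual parsing logic in `sam3.agent.agent_core` may differ):

```python
import json
import re

def parse_tool_call(text: str):
    """Extract the first <tool>...</tool> JSON payload from MLLM output.

    Returns (name, parameters) or None if no well-formed call is found.
    """
    match = re.search(r"<tool>(.*?)</tool>", text, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return call.get("name"), call.get("parameters", {})

# Example: parse a segment_phrase call
name, params = parse_tool_call(
    '<tool>{"name": "segment_phrase", "parameters": {"text_prompt": "person"}}</tool>'
)
```

Malformed JSON or missing tags yield `None`, so a dispatcher can fall back to re-prompting the MLLM instead of crashing.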

Example Usage

Basic Usage

from sam3.agent.agent_core import agent_inference

# Run agent on an image
messages, outputs, rendered = agent_inference(
    img_path="image.jpg",
    initial_text_prompt="the red car on the left"
)

# Access results
print(f"Found {len(outputs['pred_masks'])} masks")
rendered.save("result.jpg")

With Debug Mode

# Enable debug to save conversation history
messages, outputs, rendered = agent_inference(
    img_path="complex_scene.jpg",
    initial_text_prompt="all people wearing hats",
    debug=True,
    output_dir="./debug_output"
)

# Debug files saved to ./debug_output/agent_debug_out/

Handling Multiple Objects

# Query for multiple objects
messages, outputs, rendered = agent_inference(
    img_path="street.jpg",
    initial_text_prompt="cars and trucks in the scene"
)

# Process each detected object
for i, (mask, box, score) in enumerate(
    zip(outputs['pred_masks'], outputs['pred_boxes'], outputs['pred_scores'])
):
    print(f"Object {i+1}: score={score:.3f}, box={box}")

Custom Output Directory

# Organize outputs by task
messages, outputs, rendered = agent_inference(
    img_path="image.jpg",
    initial_text_prompt="the main subject",
    output_dir="./experiments/run_001"
)

# Outputs saved to:
# - ./experiments/run_001/sam_out/ (SAM 3 visualizations)
# - ./experiments/run_001/agent_debug_out/ (debug info)

Agent Workflow

The agent follows this workflow:
  1. Initial Segmentation: Calls segment_phrase with a text prompt
  2. Evaluation: MLLM examines generated masks
  3. Refinement (if needed):
    • Call examine_each_mask to filter masks
    • Or call segment_phrase with different prompt
  4. Selection: Call select_masks_and_return with chosen masks
Example agent conversation:
User: "the brown dog"

Agent: <tool>{"name": "segment_phrase", "parameters": {"text_prompt": "dog"}}</tool>

System: Generated 3 masks

Agent: <tool>{"name": "examine_each_mask", "parameters": {}}</tool>

System: After examination, 2 masks remain (original mask 1 rejected; the remaining masks are re-indexed)

Agent: <tool>{"name": "select_masks_and_return", "parameters": {"final_answer_masks": [1]}}</tool>

Output Structure

final_outputs Dictionary

final_outputs = {
    "original_image_path": str,  # Path to input image
    "orig_img_h": int,  # Original height in pixels
    "orig_img_w": int,  # Original width in pixels
    "pred_boxes": List[List[float]],  # Bounding boxes (XYWH format)
    "pred_scores": List[float],  # Confidence scores
    "pred_masks": List[Dict],  # RLE-encoded masks
}
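Since `pred_boxes` use XYWH format, converting to corner coordinates is a common first step before drawing or IoU computation. A small sketch (the field names follow the dictionary above; the conversion itself is standard):

```python
def xywh_to_xyxy(box):
    """Convert an [x, y, w, h] box to [x1, y1, x2, y2] corner format."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def iter_detections(final_outputs):
    """Yield (box, score, mask) triples; the three lists are parallel,
    one entry per detected object."""
    yield from zip(
        final_outputs["pred_boxes"],
        final_outputs["pred_scores"],
        final_outputs["pred_masks"],
    )

corners = xywh_to_xyxy([10.0, 20.0, 30.0, 40.0])
```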

Mask Format

Masks are stored in RLE (Run-Length Encoding) format:
import pycocotools.mask as mask_utils

# Decode RLE mask to binary array
for rle_mask in final_outputs["pred_masks"]:
    binary_mask = mask_utils.decode(rle_mask)  # (H, W) bool array
    # Use the mask
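If `pycocotools` is unavailable, uncompressed COCO-style RLE (a `{"size": [h, w], "counts": [...]}` dict with integer run lengths) can be decoded by hand: runs alternate between 0 and 1 starting with zeros, laid out column-major. A minimal sketch for that uncompressed case (compressed string-encoded counts still require `pycocotools`):

```python
def decode_uncompressed_rle(rle):
    """Decode a COCO-style uncompressed RLE dict into a nested-list binary mask.

    Runs alternate between 0 and 1 (starting with 0) and are laid out
    column-major, matching pycocotools' convention.
    """
    h, w = rle["size"]
    flat = []
    value = 0
    for count in rle["counts"]:
        flat.extend([value] * count)
        value = 1 - value
    # Column-major layout: element (row, col) lives at flat[col * h + row]
    return [[flat[col * h + row] for col in range(w)] for row in range(h)]

# 2x2 mask: first column all zeros, second column all ones
mask = decode_uncompressed_rle({"size": [2, 2], "counts": [2, 2]})
```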

System Prompts

The agent uses two system prompts:
  1. Main System Prompt (system_prompt.txt): Guides tool selection and reasoning
  2. Iterative Checking Prompt (system_prompt_iterative_checking.txt): Used by examine_each_mask for mask filtering
These prompts are located in sam3/agent/system_prompts/.

Advanced Features

Iterative Mask Examination

The examine_each_mask tool uses a separate MLLM call for each mask:
# For each mask, MLLM sees:
# - Original image
# - Image with mask overlay
# - Zoomed-in mask region
# 
# MLLM returns: <verdict>Accept</verdict> or <verdict>Reject</verdict>
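Extracting the verdict from the examiner's response can be done with a small helper like this sketch (the tag format follows the comment above; the actual extraction code may differ):

```python
import re

def parse_verdict(response: str) -> bool:
    """Return True for <verdict>Accept</verdict>, False for <verdict>Reject</verdict>.

    Raises ValueError if no verdict tag is present in the response.
    """
    match = re.search(
        r"<verdict>\s*(Accept|Reject)\s*</verdict>", response, re.IGNORECASE
    )
    if match is None:
        raise ValueError("no verdict tag in response")
    return match.group(1).lower() == "accept"

keep = parse_verdict("The mask covers the brown dog. <verdict>Accept</verdict>")
```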

Prompt Deduplication

The agent tracks used text prompts and prevents reusing them:
# If agent tries "dog" twice:
System: "You have previously used 'dog'. Please try a different prompt."
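One way to implement this tracking is a set check before each `segment_phrase` call, as in this sketch (the `check_prompt` helper is illustrative, not part of the library's API):

```python
used_prompts = set()

def check_prompt(text_prompt):
    """Return a warning string if the prompt was already used; otherwise
    record it and return None so the call can proceed."""
    normalized = text_prompt.strip().lower()
    if normalized in used_prompts:
        return (
            f"You have previously used '{text_prompt}'. "
            "Please try a different prompt."
        )
    used_prompts.add(normalized)
    return None

first = check_prompt("dog")   # None: first use is allowed
second = check_prompt("dog")  # warning string on reuse
```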

Context Pruning

To manage context length, the conversation history is pruned:
  • Always keeps: First 2 messages (system + initial user message)
  • Always keeps: Latest segment_phrase tool call and subsequent messages
  • Adds warnings about previously failed prompts
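The pruning rule above can be sketched as a function over a list of role/content message dicts (the message shape and detection logic here are assumptions for illustration):

```python
def prune_history(messages):
    """Keep the first two messages (system + initial user message) plus
    everything from the latest segment_phrase tool call onward."""
    last_call = None
    for i, msg in enumerate(messages):
        if "segment_phrase" in str(msg.get("content", "")):
            last_call = i
    if last_call is None or last_call < 2:
        return messages
    return messages[:2] + messages[last_call:]

history = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "the brown dog"},
    {"role": "assistant", "content": '<tool>{"name": "segment_phrase"}</tool>'},
    {"role": "system", "content": "Generated 3 masks"},
    {"role": "assistant", "content": '<tool>{"name": "segment_phrase"}</tool>'},
    {"role": "system", "content": "Generated 1 mask"},
]
pruned = prune_history(history)  # drops the first tool call and its result
```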

Error Handling

No Masks Found

# If SAM 3 returns no masks:
# Agent can either:
# 1. Try different text prompt
# 2. Call report_no_mask

messages, outputs, rendered = agent_inference(
    img_path="image.jpg",
    initial_text_prompt="unicorn"  # Not in image
)

# outputs will have empty lists:
assert len(outputs["pred_masks"]) == 0

Maximum Generations Exceeded

try:
    messages, outputs, rendered = agent_inference(
        img_path="complex.jpg",
        initial_text_prompt="find the specific object",
        max_generations=10  # Low limit
    )
except ValueError as e:
    print(f"Agent exceeded generation limit: {e}")

Requirements

  • SAM 3 Service: Running SAM 3 segmentation service
  • MLLM Service: Multimodal LLM with vision capabilities (e.g., Qwen-VL)
  • Client Functions:
    • send_generate_request(): Send prompts to MLLM
    • call_sam_service(): Call SAM 3 for segmentation

Notes

  • The agent uses simple noun phrases for best SAM 3 performance
  • MLLM must support vision and tool calling
  • Debug mode saves full conversation history for analysis
  • Agent autonomously decides when to stop iterating
  • Supports both single and multiple object queries
