Overview
The SAM 3 Agent combines SAM 3 with a multimodal large language model (MLLM) to provide an agentic interface for interactive segmentation. The agent iteratively refines segmentation results through tool calls and multi-turn reasoning.Function
agent_inference
Perform agentic segmentation with iterative refinement.Parameters
Path to the input image file.
Initial text prompt describing what to segment (e.g., “the dog in the image”).
Enable debug mode to save conversation history.
Function to send requests to the multimodal LLM. Defaults to the built-in implementation.
Function to call SAM 3 segmentation service. Defaults to the built-in implementation.
Maximum number of MLLM generation rounds allowed.
Directory to save SAM 3 outputs and debug information.
Returns
Conversation history between user and agent.
Final segmentation results containing:
original_image_path: Path to original imageorig_img_h: Original image heightorig_img_w: Original image widthpred_boxes: List of bounding boxespred_scores: List of confidence scorespred_masks: List of segmentation masks (RLE format)
Visualization with all selected masks rendered.
Agent Tools
The agent has access to four tools:segment_phrase
Call SAM 3 with a text prompt to generate segmentation masks. Parameters:text_prompt(str): Simple noun phrase describing objects to segment
examine_each_mask
Examine each generated mask individually using MLLM to filter out incorrect predictions. Parameters: None Usage:select_masks_and_return
Select final masks to return as the answer. Parameters:final_answer_masks(list[int]): Mask indices to return (1-indexed)
report_no_mask
Report that no valid masks exist for the query. Parameters: None Usage:Example Usage
Basic Usage
With Debug Mode
Handling Multiple Objects
Custom Output Directory
Agent Workflow
The agent follows this workflow:- Initial Segmentation: Calls
segment_phrasewith a text prompt - Evaluation: MLLM examines generated masks
- Refinement (if needed):
- Call
examine_each_maskto filter masks - Or call
segment_phrasewith different prompt
- Call
- Selection: Call
select_masks_and_returnwith chosen masks
Output Structure
final_outputs Dictionary
Mask Format
Masks are stored in RLE (Run-Length Encoding) format:System Prompts
The agent uses two system prompts:- Main System Prompt (
system_prompt.txt): Guides tool selection and reasoning - Iterative Checking Prompt (
system_prompt_iterative_checking.txt): Used byexamine_each_maskfor mask filtering
sam3/agent/system_prompts/.
Advanced Features
Iterative Mask Examination
Theexamine_each_mask tool uses a separate MLLM call for each mask:
Prompt Deduplication
The agent tracks used text prompts and prevents reusing them:Context Pruning
To manage context length, the conversation history is pruned:- Always keeps: First 2 messages (system + initial user message)
- Always keeps: Latest segment_phrase tool call and subsequent messages
- Adds warnings about previously failed prompts
Error Handling
No Masks Found
Maximum Generations Exceeded
Requirements
- SAM 3 Service: Running SAM 3 segmentation service
- MLLM Service: Multimodal LLM with vision capabilities (e.g., Qwen-VL)
- Client Functions:
send_generate_request(): Send prompts to MLLMcall_sam_service(): Call SAM 3 for segmentation
Notes
- The agent uses simple noun phrases for best SAM 3 performance
- MLLM must support vision and tool calling
- Debug mode saves full conversation history for analysis
- Agent autonomously decides when to stop iterating
- Supports both single and multiple object queries