SAM 3 Agent enables complex segmentation queries by integrating Multi-modal Large Language Models (MLLMs) as a reasoning layer. The MLLM breaks down complex prompts into simpler queries that SAM 3 can process.

What is SAM 3 Agent?

SAM 3 Agent allows you to use natural, complex language to describe objects:
  • ❌ Simple: “person”, “blue vest” (plain SAM 3 handles these directly)
  • ✅ Complex: “the leftmost child wearing a blue vest”
  • ✅ Relational: “the person standing behind the dog”
  • ✅ Descriptive: “the tallest building in the background”
The agent workflow:
  1. MLLM analyzes the image and your complex query
  2. MLLM generates simpler prompts for SAM 3 (text/box)
  3. SAM 3 performs the actual segmentation
  4. Results are returned with visual overlays
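
The four steps above can be sketched as a minimal orchestration loop. The names `mllm.decompose`, `sam3.segment`, and `mllm.select` are illustrative stand-ins, not the actual sam3.agent API:

```python
# Hypothetical sketch of the agent loop; "mllm" and "sam3" stand in for the
# real MLLM client and SAM 3 processor, whose interfaces differ.
def agent_segment(image, query, mllm, sam3):
    # Steps 1-2: the MLLM turns the complex query into simpler prompts.
    simple_prompts = mllm.decompose(image, query)
    # Step 3: SAM 3 segments each simple prompt independently.
    candidates = [m for p in simple_prompts for m in sam3.segment(image, p)]
    # Step 4: the MLLM filters the candidates against the original query.
    return mllm.select(image, query, candidates)
```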

Setup

1. Install SAM 3

Follow the installation instructions in the repository.

2. Configure PyTorch

import torch

# Enable TF32 (TensorFloat-32) on Ampere and newer GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use bfloat16 for the entire notebook
torch.autocast("cuda", dtype=torch.bfloat16).__enter__()

# Inference mode for the whole notebook
torch.inference_mode().__enter__()

3. Build SAM 3 model

import os
import sam3
from sam3 import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

sam3_root = os.path.dirname(sam3.__file__)
bpe_path = f"{sam3_root}/assets/bpe_simple_vocab_16e6.txt.gz"
model = build_sam3_image_model(bpe_path=bpe_path)
processor = Sam3Processor(model, confidence_threshold=0.5)

MLLM Configuration

SAM 3 Agent supports various MLLMs. You can use either:
  • vLLM-served models (self-hosted)
  • External APIs (Gemini, GPT, Claude, etc.)

Option 1: vLLM (Self-Hosted)

LLM_CONFIGS = {
    "qwen3_vl_8b_thinking": {
        "provider": "vllm",
        "model": "Qwen/Qwen3-VL-8B-Thinking",
    },
}

# Use a distinct variable name so the SAM 3 model built above is not overwritten
llm_name = "qwen3_vl_8b_thinking"
LLM_API_KEY = "DUMMY_API_KEY"  # Not used for vLLM, but the client expects a value
LLM_SERVER_URL = "http://0.0.0.0:8001/v1"

llm_config = LLM_CONFIGS[llm_name]
llm_config["api_key"] = LLM_API_KEY
llm_config["name"] = llm_name
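
For this option you also need a running vLLM server. A typical launch command looks like the following (a sketch assuming a recent vLLM release; adjust the port and media path to your setup):

```shell
# Serve the model with an OpenAI-compatible API on port 8001.
# --allowed-local-media-path lets the server read local image files.
vllm serve Qwen/Qwen3-VL-8B-Thinking \
  --port 8001 \
  --allowed-local-media-path /path/to/your/images
```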

Option 2: External API

LLM_CONFIGS = {
    "gemini_flash": {
        "provider": "google",
        "model": "gemini-2.0-flash-exp",
        "base_url": "https://generativelanguage.googleapis.com/v1beta/",
    },
}

# Use a distinct variable name so the SAM 3 model built above is not overwritten
llm_name = "gemini_flash"
LLM_API_KEY = "your-api-key-here"  # Set your actual API key
LLM_SERVER_URL = LLM_CONFIGS[llm_name]["base_url"]

llm_config = LLM_CONFIGS[llm_name]
llm_config["api_key"] = LLM_API_KEY
llm_config["name"] = llm_name

Never commit API keys to version control. Use environment variables instead:

import os
LLM_API_KEY = os.getenv("GEMINI_API_KEY")

Running Agent Inference

from functools import partial
from sam3.agent.client_llm import send_generate_request as send_generate_request_orig
from sam3.agent.client_sam3 import call_sam_service as call_sam_service_orig
from sam3.agent.inference import run_single_image_inference

# Prepare input
image = "assets/images/test_image.jpg"
prompt = "the leftmost child wearing a blue vest"
image = os.path.abspath(image)

# Create service clients
send_generate_request = partial(
    send_generate_request_orig,
    server_url=LLM_SERVER_URL,
    model=llm_config["model"],
    api_key=llm_config["api_key"]
)
call_sam_service = partial(call_sam_service_orig, sam3_processor=processor)

# Run inference
output_image_path = run_single_image_inference(
    image,
    prompt,
    llm_config,
    send_generate_request,
    call_sam_service,
    debug=True,
    output_dir="agent_output"
)

# Display result
if output_image_path is not None:
    from IPython.display import display, Image
    display(Image(filename=output_image_path))

How It Works

1. Query Understanding

The MLLM analyzes your complex prompt:
  • Identifies spatial relationships (“leftmost”, “behind”)
  • Extracts visual attributes (“blue vest”, “wearing”)
  • Understands context and object relationships

2. Prompt Decomposition

The MLLM generates structured prompts for SAM 3, for example:
{
  "text_prompts": ["child", "blue vest"],
  "spatial_filter": "leftmost",
  "relationship": "wearing"
}

3. SAM 3 Segmentation

SAM 3 processes the simplified prompts:
  • Segments all children in the image
  • Segments all blue vests
  • Returns candidates with confidence scores

4. Result Filtering

The MLLM filters and ranks results:
  • Applies spatial constraints (“leftmost”)
  • Verifies relationships (“wearing”)
  • Returns the best match
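
As a concrete illustration of the spatial-constraint step, a “leftmost” filter over candidate boxes could look like the sketch below. The `(box, score)` candidate format is an assumption for illustration; in the real agent this reasoning is delegated to the MLLM:

```python
def pick_leftmost(candidates):
    """Return the candidate whose box starts furthest left.

    Each candidate is assumed to be a (box, score) pair, with box given
    as (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    return min(candidates, key=lambda c: c[0][0])
```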

Debugging Output

Enable debug mode to see the agent’s reasoning:
output_image_path = run_single_image_inference(
    image, prompt, llm_config,
    send_generate_request, call_sam_service,
    debug=True,  # Enable debug output
    output_dir="agent_output"
)
Debug output shows:
  • MLLM’s interpretation of your query
  • Generated SAM 3 prompts
  • Intermediate segmentation results
  • Final filtering decisions

Example Queries

# Directional
"the rightmost person"
"the object in the top-left corner"
"the car furthest from the camera"

# Positional
"the person standing behind the table"
"the object between the two chairs"
"the animal closest to the door"

Supported MLLMs

Tested models (add your own to LLM_CONFIGS):
| Provider  | Model                     | Best For                    |
|-----------|---------------------------|-----------------------------|
| vLLM      | Qwen/Qwen3-VL-8B-Thinking | Self-hosted, good reasoning |
| Google    | gemini-2.0-flash-exp      | Fast, API-based             |
| OpenAI    | gpt-4-vision-preview      | High accuracy               |
| Anthropic | claude-3-opus-20240229    | Complex reasoning           |

Tips for Best Results

Be specific but natural:
  • ✅ “the leftmost child wearing a blue vest”
  • ❌ “child blue vest left” (too terse)
  • ❌ “I want to segment the child who is positioned on the left side and is currently wearing clothing that appears to be blue and vest-like” (too verbose)
Use relative positions:
  • ✅ “the person on the right”
  • ✅ “the second from the left”
  • ❌ “the person at pixel coordinates (450, 230)” (use box prompts instead)
Combine multiple cues:
  • ✅ “the red car behind the truck”
  • ❌ “the thing” (too vague)

Troubleshooting

Agent returns wrong or no results:
  • Check if your query is too ambiguous
  • Try breaking complex queries into simpler parts
  • Verify the MLLM can see the image (check debug output)
  • Add more specific attributes to your query
  • Use spatial relationships to disambiguate
  • Check SAM 3’s confidence threshold (lower it if needed)

vLLM server problems:
  • Ensure the server is running: curl http://localhost:8001/health
  • Check GPU memory availability
  • Verify --allowed-local-media-path includes your image directory

API rate limits:
  • Implement exponential backoff for retries
  • Use local vLLM for high-volume processing
  • Cache MLLM responses for repeated queries
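
For rate-limited APIs, a generic exponential-backoff wrapper (a sketch, not part of sam3) can be placed around the MLLM request:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(); on failure, retry with exponentially growing delays plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage: result = with_backoff(lambda: send_generate_request(...))
```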

Next Steps

Image Inference

Learn direct SAM 3 prompting without MLLMs

Interactive Refinement

Combine agent results with interactive refinement
