What is SAM 3 Agent?
SAM 3 Agent allows you to use natural, complex language to describe objects:
- ❌ Simple: “person”, “blue vest”
- ✅ Complex: “the leftmost child wearing blue vest”
- ✅ Relational: “the person standing behind the dog”
- ✅ Descriptive: “the tallest building in the background”
The agent pairs a multimodal large language model (MLLM) with SAM 3:
- The MLLM analyzes the image and your complex query
- The MLLM generates simpler prompts for SAM 3 (text/box)
- SAM 3 performs the actual segmentation
- Results are returned with visual overlays
Setup
Install SAM 3
Follow the installation instructions in the repository.
MLLM Configuration
SAM 3 Agent supports various MLLMs. You can use either:
- vLLM-served models (self-hosted)
- External APIs (Gemini, GPT, Claude, etc.)
Option 1: vLLM (Self-Hosted)
- Configuration
- Installation
- Start Server
Option 2: External API
Running Agent Inference
How It Works
Query Understanding
The MLLM analyzes your complex prompt:
- Identifies spatial relationships (“leftmost”, “behind”)
- Extracts visual attributes (“blue vest”, “wearing”)
- Understands context and object relationships
SAM 3 Segmentation
SAM 3 processes the simplified prompts:
- Segments all children in the image
- Segments all blue vests
- Returns candidates with confidence scores
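The candidate-filtering idea above can be sketched in a few lines. This is an illustration only, not the agent's actual implementation: the `Candidate` class, the `overlap_fraction` helper, and the `min_score` / `min_overlap` thresholds are hypothetical names introduced here, and bounding boxes stand in for the masks that SAM 3 would return. It resolves “the leftmost child wearing blue vest” from candidates produced by the simple prompts “child” and “blue vest”.

```python
from dataclasses import dataclass

# (x_min, y_min, x_max, y_max) in pixels; a stand-in for a real mask.
Box = tuple[float, float, float, float]


@dataclass
class Candidate:
    """Hypothetical container for one segmentation candidate."""
    label: str    # the simple prompt that produced it, e.g. "child"
    box: Box
    score: float  # confidence score attached to the candidate


def overlap_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    width = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    height = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (width * height) / area if area > 0 else 0.0


def leftmost_child_with_vest(candidates, min_score=0.5, min_overlap=0.5):
    """Pick the leftmost confident child whose box contains a confident vest."""
    children = [c for c in candidates
                if c.label == "child" and c.score >= min_score]
    vests = [c for c in candidates
             if c.label == "blue vest" and c.score >= min_score]
    wearing = [
        child for child in children
        if any(overlap_fraction(v.box, child.box) >= min_overlap for v in vests)
    ]
    # "Leftmost" is taken to mean the smallest x_min; None if nothing matches.
    return min(wearing, key=lambda c: c.box[0], default=None)


candidates = [
    Candidate("child", (300, 100, 400, 300), 0.92),
    Candidate("child", (50, 120, 140, 310), 0.88),      # leftmost, no confident vest
    Candidate("child", (180, 110, 270, 305), 0.81),
    Candidate("blue vest", (310, 150, 390, 230), 0.77),
    Candidate("blue vest", (190, 160, 260, 235), 0.74),
    Candidate("blue vest", (55, 170, 135, 240), 0.30),  # below min_score
]
best = leftmost_child_with_vest(candidates)
```

In this made-up data the child at x = 50 is the leftmost overall, but its vest candidate falls below the score threshold, so the child at x = 180 is selected. Lowering `min_score` changes the answer, which is why the confidence threshold is worth checking when the wrong object is segmented.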
Debugging Output
Enable debug mode to see the agent’s reasoning:
- MLLM’s interpretation of your query
- Generated SAM 3 prompts
- Intermediate segmentation results
- Final filtering decisions
Example Queries
- Spatial Relations
- Visual Attributes
- Actions and States
- Complex Combinations
Supported MLLMs
Tested models (add your own to LLM_CONFIGS):
| Provider | Model | Best For |
|---|---|---|
| vLLM | Qwen/Qwen3-VL-8B-Thinking | Self-hosted, good reasoning |
| Google | gemini-2.0-flash-exp | Fast, API-based |
| OpenAI | gpt-4-vision-preview | High accuracy |
| Anthropic | claude-3-opus-20240229 | Complex reasoning |
Tips for Best Results
Troubleshooting
MLLM returns no results
- Check if your query is too ambiguous
- Try breaking complex queries into simpler parts
- Verify the MLLM can see the image (check debug output)
Wrong object segmented
- Add more specific attributes to your query
- Use spatial relationships to disambiguate
- Check SAM 3’s confidence threshold (lower if needed)
vLLM server errors
- Ensure server is running: curl http://localhost:8001/health
- Check GPU memory availability
- Verify --allowed-local-media-path includes your image directory
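The same health check can be run from Python before submitting agent queries. The sketch below uses only the standard library and assumes the server listens on http://localhost:8001, as in the curl command above; the function name `vllm_is_healthy` is made up for illustration and is not part of SAM 3 or vLLM.

```python
import http.client
import urllib.error
import urllib.request


def vllm_is_healthy(base_url="http://localhost:8001", timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or malformed response.
        return False


if __name__ == "__main__":
    if vllm_is_healthy():
        print("vLLM server is reachable")
    else:
        print("vLLM server is not responding; check that it is running "
              "and that enough GPU memory is available")
```

A passing check only shows that the server process is up. It does not confirm that the model has access to your image directory, so the media-path setting still needs to be verified separately.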
API rate limits
- Implement exponential backoff for retries
- Use local vLLM for high-volume processing
- Cache MLLM responses for repeated queries
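Two of these suggestions, exponential backoff and response caching, can be combined in a small wrapper. The sketch below is provider-agnostic and makes several assumptions: `RateLimitError` is a placeholder for whatever exception your API client raises on a rate limit, `mllm_call` is any function you supply that takes an image path and a query, and the in-memory dictionary cache is only suitable for a single process.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for the rate-limit exception raised by your API client."""


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


_response_cache = {}


def cached_query(mllm_call, image_path, query, **backoff_options):
    """Return a cached response for (image_path, query), calling the MLLM once."""
    key = (image_path, query)
    if key not in _response_cache:
        _response_cache[key] = call_with_backoff(
            mllm_call, image_path, query, **backoff_options
        )
    return _response_cache[key]
```

With the defaults, the waits between attempts are roughly 1, 2, 4, 8, and 16 seconds. The cache key uses the image path rather than the image content, so it should be cleared if a file is overwritten in place.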
Next Steps
Image Inference
Learn direct SAM 3 prompting without MLLMs
Interactive Refinement
Combine agent results with interactive refinement