Overview
The `build_sam3_video_predictor()` function creates the SAM 3 video predictor, which combines per-frame detection with memory-based tracking for video instance segmentation and tracking tasks.
Function Signature
Parameters
- Optional path to checkpoint file. If `None` and `load_from_HF=True`, downloads from Hugging Face.
- Whether to automatically download the pretrained checkpoint from Hugging Face (`facebook/sam3`).
- Path to BPE tokenizer file. If `None`, uses the default tokenizer.
- Whether the model includes a presence token for object existence prediction.
- Whether the geometry encoder uses image cross-attention.
- Whether to enforce strict state dict loading (all keys must match).
- Whether to apply temporal disambiguation heuristics for improved tracking quality.
- Device to load the model on.
- Whether to compile the model for improved performance.
- List of GPU device IDs to use for multi-GPU inference. If `None`, uses the current device only.
Returns
The video predictor wrapper that supports multi-GPU inference and session management.
Example Usage
Basic Usage
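A minimal sketch of building the predictor with defaults. The import path is an assumption; this page only documents the builder function itself.

```python
# Sketch only: the import path is assumed, not confirmed by this page.
from sam3 import build_sam3_video_predictor

# With load_from_HF=True (per the parameter list above), the pretrained
# facebook/sam3 checkpoint is downloaded from Hugging Face automatically.
predictor = build_sam3_video_predictor(load_from_HF=True, device="cuda")
```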
Multi-GPU Setup
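A sketch of spreading inference across several GPUs via the device-ID list described in the parameter section; the keyword name `gpus` is an assumption inferred from that description.

```python
# Sketch only: import path and the `gpus` keyword name are assumptions.
from sam3 import build_sam3_video_predictor

# Frames are processed in parallel and load-balanced across the listed
# devices, with NCCL-based synchronization (see Multi-GPU Inference below).
predictor = build_sam3_video_predictor(
    load_from_HF=True,
    gpus=[0, 1, 2, 3],  # GPU device IDs; None would use the current device only
)
```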
Custom Configuration
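A sketch of loading a local checkpoint and tokenizer instead of downloading from Hugging Face. The keyword names and file paths here are illustrative assumptions based on the parameter descriptions above.

```python
# Sketch only: keyword names and paths are hypothetical illustrations.
from sam3 import build_sam3_video_predictor

predictor = build_sam3_video_predictor(
    checkpoint_path="checkpoints/sam3.pt",   # local checkpoint file
    load_from_HF=False,                      # skip the Hugging Face download
    tokenizer_path="tokenizers/bpe_vocab.txt",  # custom BPE tokenizer file
    device="cuda",
    compile=True,  # compile the model for improved performance
)
```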
Without Temporal Disambiguation
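A sketch of disabling the temporal disambiguation heuristics (hotstart delay, duplicate suppression, reconditioning, occlusion handling) described in the section below, trading tracking quality for lower-latency output. `apply_temporal_disambiguation` is the flag this page documents; the import path is an assumption.

```python
# Sketch only: the import path is assumed.
from sam3 import build_sam3_video_predictor

# Outputs are yielded without the 15-frame hotstart delay or periodic
# reconditioning, at the cost of tracking quality.
predictor = build_sam3_video_predictor(
    load_from_HF=True,
    apply_temporal_disambiguation=False,
)
```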
Model Architecture
The video predictor consists of two main components:
- Detector: SAM 3 image model for per-frame detection
  - Vision-language backbone
  - Transformer encoder-decoder
  - Segmentation head
- Tracker: Memory-based tracking module
  - Memory encoder for mask features
  - Cross-attention transformer
  - Temporal association
Temporal Disambiguation
When `apply_temporal_disambiguation=True`, the model applies:
- Hotstart delay: 15-frame delay before yielding outputs
- Duplicate suppression: Remove duplicates within hotstart period
- Reconditioning: Re-initialize tracks every 16 frames
- Occlusion handling: Suppress overlapping masks (IoU > 0.7)
Multi-GPU Inference
The predictor supports distributed inference across multiple GPUs:
- Each GPU processes different frames in parallel
- Automatic load balancing across devices
- NCCL-based synchronization
See Also
- `Sam3VideoPredictor` - API documentation
- `handle_request` - Request handling