
Overview

The build_sam3_video_predictor() function creates the SAM 3 video predictor, which combines per-frame detection with memory-based tracking for video instance segmentation.

Function Signature

from sam3.model_builder import build_sam3_video_predictor

predictor = build_sam3_video_predictor(
    checkpoint_path=None,
    load_from_HF=True,
    bpe_path=None,
    has_presence_token=True,
    geo_encoder_use_img_cross_attn=True,
    strict_state_dict_loading=True,
    apply_temporal_disambiguation=True,
    device="cuda",
    compile=False,
    gpus_to_use=None
)

Parameters

checkpoint_path
str | None
default:"None"
Optional path to a checkpoint file. If None and load_from_HF=True, the checkpoint is downloaded from Hugging Face.
load_from_HF
bool
default:"True"
Whether to automatically download the pretrained checkpoint from Hugging Face (facebook/sam3).
bpe_path
str | None
default:"None"
Path to BPE tokenizer file. If None, uses default tokenizer.
has_presence_token
bool
default:"True"
Whether the model includes a presence token for object existence prediction.
geo_encoder_use_img_cross_attn
bool
default:"True"
Whether geometry encoder uses image cross-attention.
strict_state_dict_loading
bool
default:"True"
Whether to enforce strict state dict loading (all keys must match).
apply_temporal_disambiguation
bool
default:"True"
Whether to apply temporal disambiguation heuristics for improved tracking quality.
device
str
default:"'cuda'"
Device to load the model on ('cuda' or 'cpu').
compile
bool
default:"False"
Whether to compile the model for improved performance.
gpus_to_use
list[int] | None
default:"None"
List of GPU device IDs to use for multi-GPU inference. If None, uses current device only.

Returns

predictor
Sam3VideoPredictorMultiGPU
The video predictor wrapper that supports multi-GPU inference and session management.

Example Usage

Basic Usage

from sam3.model_builder import build_sam3_video_predictor

# Build video predictor with default settings
predictor = build_sam3_video_predictor()

Multi-GPU Setup

# Use multiple GPUs for parallel frame processing
predictor = build_sam3_video_predictor(
    gpus_to_use=[0, 1, 2, 3],  # Use 4 GPUs
    apply_temporal_disambiguation=True
)

Custom Configuration

# Build with custom settings
predictor = build_sam3_video_predictor(
    checkpoint_path="/path/to/checkpoint.pt",
    load_from_HF=False,
    apply_temporal_disambiguation=True,
    compile=True,
    device="cuda"
)

Without Temporal Disambiguation

# Disable temporal disambiguation for ablation studies
predictor = build_sam3_video_predictor(
    apply_temporal_disambiguation=False
)

Model Architecture

The video predictor consists of two main components:
  1. Detector: SAM 3 image model for per-frame detection
    • Vision-language backbone
    • Transformer encoder-decoder
    • Segmentation head
  2. Tracker: Memory-based tracking module
    • Memory encoder for mask features
    • Cross-attention transformer
    • Temporal association
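The detector-then-tracker design above can be pictured as a simple per-frame loop: the detector proposes instances on each frame, and the tracker matches them against its memory to keep object identities stable. The sketch below uses illustrative stub classes (StubDetector, StubTracker, run_video are stand-ins for explanation, not the sam3 package API):

```python
# Schematic of the detect-then-track loop (illustrative stand-ins,
# not the sam3 API): a per-frame detector proposes instances, and a
# memory-based tracker assigns stable IDs across frames.
class StubDetector:
    def detect(self, frame):
        # Pretend each frame yields one detection per object label.
        return [{"label": obj} for obj in frame]

class StubTracker:
    def __init__(self):
        self.memory = {}   # label -> track id (stands in for mask memory)
        self.next_id = 0

    def associate(self, detections):
        tracks = []
        for det in detections:
            label = det["label"]
            if label not in self.memory:   # unseen object: start a new track
                self.memory[label] = self.next_id
                self.next_id += 1
            tracks.append((self.memory[label], label))
        return tracks

def run_video(frames, detector, tracker):
    # Detect on each frame, then associate with tracker memory.
    return [tracker.associate(detector.detect(f)) for f in frames]

frames = [["cat"], ["cat", "dog"], ["dog"]]
print(run_video(frames, StubDetector(), StubTracker()))
# [[(0, 'cat')], [(0, 'cat'), (1, 'dog')], [(1, 'dog')]]
```

The point of the split is that detection is stateless per frame, while the tracker carries state (its memory) across frames so the same object keeps the same ID.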

Temporal Disambiguation

When apply_temporal_disambiguation=True, the model applies:
  • Hotstart delay: 15-frame delay before yielding outputs
  • Duplicate suppression: Remove duplicates within hotstart period
  • Reconditioning: Re-initialize tracks every 16 frames
  • Occlusion handling: Suppress overlapping masks (IoU > 0.7)
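The occlusion-handling step can be sketched as greedy IoU suppression: keep masks in descending score order and drop any mask that overlaps a kept one above the threshold. This is a minimal illustration using flat 0/1 mask lists (the helper names are illustrative, not the library's API):

```python
def mask_iou(a, b):
    """IoU between two binary masks given as flat lists of 0/1."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def suppress_overlaps(masks, scores, iou_thresh=0.7):
    """Greedy suppression: keep higher-scoring masks first and drop any
    mask whose IoU with an already-kept mask exceeds iou_thresh."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return sorted(kept)

a = [1, 1, 1, 0]
b = [1, 1, 0, 0]  # IoU(a, b) = 2/3, below 0.7 -> both survive
c = [1, 1, 1, 1]  # IoU(c, a) = 3/4, above 0.7 -> suppressed by a
print(suppress_overlaps([a, b, c], [0.9, 0.8, 0.5]))  # [0, 1]
```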

Multi-GPU Inference

The predictor supports distributed inference across multiple GPUs:
  • Each GPU processes different frames in parallel
  • Automatic load balancing across devices
  • NCCL-based synchronization
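The per-frame parallelism can be pictured as a partition of frame indices across devices. The round-robin sketch below is a simplified model of the idea (the actual scheduling inside Sam3VideoPredictorMultiGPU may differ):

```python
def assign_frames(num_frames, gpus_to_use):
    """Round-robin assignment of frame indices to GPU ids: frame i goes
    to gpus_to_use[i % len(gpus_to_use)], giving a near-even split."""
    assignment = {gpu: [] for gpu in gpus_to_use}
    for frame_idx in range(num_frames):
        gpu = gpus_to_use[frame_idx % len(gpus_to_use)]
        assignment[gpu].append(frame_idx)
    return assignment

# 10 frames spread over the 4 GPUs from the multi-GPU example above
print(assign_frames(10, [0, 1, 2, 3]))
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```

After the parallel per-frame work, results still have to be gathered and synchronized in frame order, which is where the NCCL-based synchronization comes in.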
