Overview
The build_sam3_image_model() function creates and initializes the SAM 3 image model for segmentation tasks. It assembles all model components, including the vision-language backbone, the transformer encoder-decoder, and the segmentation head.
Function Signature
Parameters
The function accepts the following parameters:
- Path to the BPE tokenizer vocabulary file. If None, the default tokenizer included with SAM 3 is used.
- Device to load the model on: "cuda" or "cpu".
- Whether to set the model to evaluation mode: True for inference, False for training.
- Optional path to a model checkpoint file. If provided together with load_from_HF=False, weights are loaded from this path.
- load_from_HF: whether to automatically download and load the pretrained checkpoint from the facebook/sam3 repository on Hugging Face.
- Whether to enable the segmentation head for mask prediction.
- Whether to enable instance-level interactivity for SAM 1-style interactive segmentation tasks.
- Whether to compile the model using PyTorch 2.0+ compilation (torch.compile) for improved performance.
Returns
The initialized SAM 3 image model with loaded weights (if specified).
Example Usage
Basic Usage
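A minimal sketch of the default path. The import path is an assumption; only `build_sam3_image_model()` and `load_from_HF` are confirmed by this document.

```python
# Sketch only: the import path below is an assumption, not confirmed API.
from sam3.model_builder import build_sam3_image_model

# Download the pretrained checkpoint from the facebook/sam3 repository on
# Hugging Face and load the model on GPU in evaluation mode.
model = build_sam3_image_model(device="cuda", load_from_HF=True)
```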
Custom Checkpoint
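A sketch of loading a local checkpoint instead of downloading from Hugging Face. `checkpoint_path` is a placeholder parameter name; `load_from_HF=False` comes from the parameter description above.

```python
from sam3.model_builder import build_sam3_image_model  # import path assumed

# Load weights from a local checkpoint rather than Hugging Face;
# `checkpoint_path` is a placeholder name for the checkpoint parameter.
model = build_sam3_image_model(
    checkpoint_path="/path/to/sam3_checkpoint.pt",
    load_from_HF=False,
    device="cuda",
)
```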
Training Mode
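A sketch of building the model for training. `eval_mode` is a placeholder name for the evaluation-mode flag described in the parameter list; setting it to False keeps training-time behavior (e.g., dropout) active.

```python
from sam3.model_builder import build_sam3_image_model  # import path assumed

# `eval_mode` is a placeholder name for the evaluation-mode flag;
# False leaves the model in training mode.
model = build_sam3_image_model(
    eval_mode=False,
    load_from_HF=True,
    device="cuda",
)
assert model.training  # standard torch.nn.Module attribute
```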
With Compilation
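A sketch of enabling the PyTorch 2.0+ compilation option. `compile_model` is a placeholder name for the compile flag; note that torch.compile traces the model on the first forward pass, so the first inference is slower than subsequent ones.

```python
from sam3.model_builder import build_sam3_image_model  # import path assumed

# `compile_model` is a placeholder name for the flag that wraps the model
# with torch.compile (requires PyTorch 2.0+).
model = build_sam3_image_model(device="cuda", compile_model=True)
```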
Model Components
The function assembles the following components:
- Vision Encoder: ViT-based backbone (1024 embed dim, 32 layers)
- Text Encoder: Transformer-based language encoder (1024 width, 24 layers)
- Transformer: 6-layer encoder-decoder with cross-attention
- Segmentation Head: Pixel decoder with 3 upsampling stages
- Geometry Encoder: Processes point and box prompts
Notes
- By default, the model downloads the pretrained checkpoint from Hugging Face (facebook/sam3)
- The model uses a resolution of 1008×1008 pixels
- Supports both CUDA and CPU inference
- Instance interactivity enables SAM 1-style point/box prompting for interactive segmentation