
Overview

The build_sam3_image_model() function creates and initializes the SAM 3 image model for segmentation tasks. It assembles all model components, including the vision-language backbone, the transformer encoder-decoder, and the segmentation head.

Function Signature

from sam3.model_builder import build_sam3_image_model

model = build_sam3_image_model(
    bpe_path=None,
    device="cuda",
    eval_mode=True,
    checkpoint_path=None,
    load_from_HF=True,
    enable_segmentation=True,
    enable_inst_interactivity=False,
    compile=False
)

Parameters

bpe_path (str | None, default: None)
Path to the BPE tokenizer vocabulary file. If None, the default tokenizer included with SAM 3 is used.

device (str, default: "cuda" if available, else "cpu")
Device to load the model on. Options: "cuda" or "cpu".

eval_mode (bool, default: True)
Whether to set the model to evaluation mode. Set to True for inference, False for training.

checkpoint_path (str | None, default: None)
Optional path to a model checkpoint file. If provided together with load_from_HF=False, weights are loaded from this path.

load_from_HF (bool, default: True)
Whether to automatically download and load the pretrained checkpoint from the facebook/sam3 repository on Hugging Face.

enable_segmentation (bool, default: True)
Whether to enable the segmentation head for mask prediction.

enable_inst_interactivity (bool, default: False)
Whether to enable instance-level interactivity for SAM 1-style interactive segmentation tasks.

compile (bool, default: False)
Whether to compile the model using PyTorch 2.0+ compilation (torch.compile) for improved performance.
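The interplay between checkpoint_path and load_from_HF can be summarized as: the Hugging Face download takes precedence when enabled, a local path is used only when load_from_HF=False, and otherwise the model keeps its initial weights. A minimal sketch of that precedence, using a hypothetical resolve_checkpoint helper that is not part of the SAM 3 API:

```python
def resolve_checkpoint(checkpoint_path=None, load_from_HF=True):
    """Sketch of how the two checkpoint options interact, based on the
    parameter descriptions above (hypothetical helper, not SAM 3 API)."""
    if load_from_HF:
        # Pretrained weights are fetched from the facebook/sam3 repo.
        return ("huggingface", "facebook/sam3")
    if checkpoint_path is not None:
        # Local checkpoint is used only when the HF download is disabled.
        return ("local", checkpoint_path)
    # Neither source given: the model keeps its initial weights.
    return ("random_init", None)

print(resolve_checkpoint())
print(resolve_checkpoint("/tmp/ckpt.pt", load_from_HF=False))
```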

Returns

model (Sam3Image)
The initialized SAM 3 image model, with pretrained weights loaded when a checkpoint source is specified.

Example Usage

Basic Usage

from sam3.model_builder import build_sam3_image_model

# Build model with default settings (loads from Hugging Face)
model = build_sam3_image_model()

# eval_mode=True by default, so the model is already in evaluation mode
model.eval()  # redundant here, shown for clarity

Custom Checkpoint

# Load from a custom checkpoint
model = build_sam3_image_model(
    checkpoint_path="/path/to/checkpoint.pt",
    load_from_HF=False,
    device="cuda"
)

Training Mode

# Build model for training
model = build_sam3_image_model(
    eval_mode=False,
    enable_segmentation=True,
    device="cuda"
)

With Compilation

# Build with PyTorch compilation for faster inference
model = build_sam3_image_model(
    compile=True,
    device="cuda"
)

Model Components

The function assembles the following components:
  • Vision Encoder: ViT-based backbone (1024 embed dim, 32 layers)
  • Text Encoder: Transformer-based language encoder (1024 width, 24 layers)
  • Transformer: 6-layer encoder-decoder with cross-attention
  • Segmentation Head: Pixel decoder with 3 upsampling stages
  • Geometry Encoder: Processes point and box prompts
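The listed dimensions allow a rough back-of-envelope size estimate for the two largest components, using the standard rule of thumb of ~12·d² parameters per transformer layer (4·d² for the attention projections plus 8·d² for a 4×-wide MLP). This ignores embeddings, patch projection, biases, norms, and heads, so it is an order-of-magnitude sketch only:

```python
def transformer_params(embed_dim, num_layers):
    # ~12 * d^2 parameters per transformer layer:
    # 4*d^2 for Q/K/V/output projections + 8*d^2 for the 4x-wide MLP.
    # Embeddings, biases, norms, and task heads are ignored.
    return 12 * embed_dim ** 2 * num_layers

vision = transformer_params(1024, 32)  # ViT backbone: 1024 dim, 32 layers
text = transformer_params(1024, 24)    # text encoder: 1024 width, 24 layers
print(f"vision ~{vision / 1e6:.0f}M params, text ~{text / 1e6:.0f}M params")
```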

Notes

  • By default, the model downloads the pretrained checkpoint from Hugging Face (facebook/sam3)
  • The model uses a resolution of 1008×1008 pixels
  • Supports both CUDA and CPU inference
  • Instance interactivity enables SAM 1-style point/box prompting for interactive segmentation
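Since the model expects 1008×1008 inputs, images generally need resizing before inference. A dependency-light sketch using nearest-neighbour indexing in NumPy; a real pipeline would typically use torchvision or PIL with bilinear interpolation plus normalization, and the exact preprocessing SAM 3 applies is not specified here:

```python
import numpy as np

def resize_to_model_input(image, size=1008):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C).
    Illustrative only; SAM 3's actual preprocessing may differ."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return image[rows[:, None], cols]

img = np.zeros((480, 640, 3), dtype=np.uint8)
print(resize_to_model_input(img).shape)  # (1008, 1008, 3)
```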
