Overview
The build_sam3_image_model() function creates and initializes the SAM 3 image model for segmentation tasks. It assembles all model components, including the vision-language backbone, the transformer encoder-decoder, and the segmentation head.
Function Signature
Parameters
The function accepts the following parameters:
- Path to the BPE tokenizer vocabulary file. If None, the default tokenizer included with SAM 3 is used.
- Device to load the model on: "cuda" or "cpu".
- Whether to set the model to evaluation mode: True for inference, False for training.
- Optional path to a model checkpoint file. If provided together with load_from_HF=False, weights are loaded from this path.
- load_from_HF: whether to automatically download and load the pretrained checkpoint from the facebook/sam3 repository on Hugging Face.
- Whether to enable the segmentation head for mask prediction.
- Whether to enable instance-level interactivity for SAM 1-style interactive segmentation tasks.
- Whether to compile the model using PyTorch 2.0+ compilation (torch.compile) for improved performance.
Returns
The initialized SAM 3 image model with loaded weights (if specified).
Example Usage
Basic Usage
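A minimal sketch of the default path. The import path is an assumption; only `build_sam3_image_model()` and `load_from_HF` are confirmed by this document.

```python
# Sketch only: the import path below is an assumption, not confirmed API.
from sam3.model_builder import build_sam3_image_model

# Download the pretrained checkpoint from the facebook/sam3 repository on
# Hugging Face and load the model on GPU in evaluation mode.
model = build_sam3_image_model(device="cuda", load_from_HF=True)
```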
Custom Checkpoint
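A sketch of loading a local checkpoint instead of downloading from Hugging Face. `checkpoint_path` is a placeholder parameter name; `load_from_HF=False` comes from the parameter description above.

```python
from sam3.model_builder import build_sam3_image_model  # import path assumed

# Load weights from a local checkpoint rather than Hugging Face;
# `checkpoint_path` is a placeholder name for the checkpoint parameter.
model = build_sam3_image_model(
    checkpoint_path="/path/to/sam3_checkpoint.pt",
    load_from_HF=False,
    device="cuda",
)
```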
Training Mode
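A sketch of building the model for training. `eval_mode` is a placeholder name for the evaluation-mode flag described in the parameter list; setting it to False keeps training-time behavior (e.g., dropout) active.

```python
from sam3.model_builder import build_sam3_image_model  # import path assumed

# `eval_mode` is a placeholder name for the evaluation-mode flag;
# False leaves the model in training mode.
model = build_sam3_image_model(
    eval_mode=False,
    load_from_HF=True,
    device="cuda",
)
assert model.training  # standard torch.nn.Module attribute
```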
With Compilation
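A sketch of enabling the PyTorch 2.0+ compilation option. `compile_model` is a placeholder name for the compile flag; note that torch.compile traces the model on the first forward pass, so the first inference is slower than subsequent ones.

```python
from sam3.model_builder import build_sam3_image_model  # import path assumed

# `compile_model` is a placeholder name for the flag that wraps the model
# with torch.compile (requires PyTorch 2.0+).
model = build_sam3_image_model(device="cuda", compile_model=True)
```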
Model Components
The function assembles the following components:
- Vision Encoder: ViT-based backbone (1024 embed dim, 32 layers)
- Text Encoder: Transformer-based language encoder (1024 width, 24 layers)
- Transformer: 6-layer encoder-decoder with cross-attention
- Segmentation Head: Pixel decoder with 3 upsampling stages
- Geometry Encoder: Processes point and box prompts
Notes
- By default, the model downloads the pretrained checkpoint from Hugging Face (facebook/sam3)
- The model uses a resolution of 1008×1008 pixels
- Supports both CUDA and CPU inference
- Instance interactivity enables SAM 1-style point/box prompting for interactive segmentation