Overview

The Sam3VideoPredictor class provides a session-based API for video instance segmentation and tracking. It manages inference states across video frames and supports text, point, and box prompts.

Class Initialization

from sam3.model.sam3_video_predictor import Sam3VideoPredictor

predictor = Sam3VideoPredictor(
    checkpoint_path=None,
    bpe_path=None,
    has_presence_token=True,
    geo_encoder_use_img_cross_attn=True,
    strict_state_dict_loading=True,
    async_loading_frames=False,
    video_loader_type="cv2",
    apply_temporal_disambiguation=True,
    compile=False
)

Parameters

  • checkpoint_path (str | None, default None): Path to model checkpoint. If None, loads from Hugging Face.
  • bpe_path (str | None, default None): Path to the BPE tokenizer file.
  • has_presence_token (bool, default True): Whether to use a presence token for object detection.
  • geo_encoder_use_img_cross_attn (bool, default True): Whether the geometry encoder uses image cross-attention.
  • strict_state_dict_loading (bool, default True): Whether to enforce strict checkpoint loading.
  • async_loading_frames (bool, default False): Whether to load video frames asynchronously.
  • video_loader_type (str, default "cv2"): Video loader backend: "cv2" or "pyav".
  • apply_temporal_disambiguation (bool, default True): Whether to apply temporal disambiguation heuristics.
  • compile (bool, default False): Whether to compile the model for better performance.

Methods

handle_request

Dispatch a request to the predictor.
response = predictor.handle_request(request)
Parameters:
  • request (dict, required): Request dictionary with a "type" field and type-specific parameters.
Returns:
  • response (dict): Response dictionary with request-specific fields.

handle_stream_request

Dispatch a streaming request, yielding one response per processed frame.
for response in predictor.handle_stream_request(request):
    # Process each frame's results
    process(response)

Request Types

start_session

Start a new inference session on a video or image.
request = {
    "type": "start_session",
    "resource_path": "/path/to/video.mp4",
    "session_id": "optional-session-id"  # Auto-generated if not provided
}

response = predictor.handle_request(request)
session_id = response["session_id"]
Parameters:
  • resource_path (str): Path to a video file (MP4), a directory of JPEG frames, or an image file
  • session_id (str, optional): Session identifier (auto-generated if not provided)
Returns:
  • session_id (str): The session identifier

add_prompt

Add text, point, or box prompt on a specific frame.
request = {
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",  # Optional
    "points": [[500, 375]],  # Optional: list of [x, y]
    "point_labels": [1],  # Optional: 1=foreground, 0=background
    "bounding_boxes": [[0.3, 0.4, 0.2, 0.3]],  # Optional: [cx, cy, w, h] normalized
    "bounding_box_labels": [1],  # Optional: 1=positive, 0=negative
    "obj_id": None  # Optional: assign to existing object ID
}

response = predictor.handle_request(request)
frame_idx = response["frame_index"]
outputs = response["outputs"]  # Segmentation results
Parameters:
  • session_id (str, required): Session identifier
  • frame_index (int, required): Frame index to add prompt on
  • text (str, optional): Text description of object
  • points (list, optional): List of [x, y] point coordinates
  • point_labels (list, optional): List of point labels (1 or 0)
  • bounding_boxes (list, optional): List of boxes in [cx, cy, w, h] normalized format
  • bounding_box_labels (list, optional): List of box labels (1 or 0)
  • obj_id (int, optional): Object ID to assign prompt to
Returns:
  • frame_index (int): The frame index
  • outputs (dict): Segmentation masks and metadata
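Note that points are given in pixel space while boxes are normalized [cx, cy, w, h]. If your boxes come from a detector in pixel-space [x0, y0, x1, y1] format, a small conversion helper is handy. A minimal sketch (the helper name is ours, not part of the API):

```python
def xyxy_to_normalized_cxcywh(box, frame_width, frame_height):
    """Convert a pixel-space [x0, y0, x1, y1] box to the normalized
    [cx, cy, w, h] format expected by add_prompt."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0 / frame_width
    cy = (y0 + y1) / 2.0 / frame_height
    w = (x1 - x0) / frame_width
    h = (y1 - y0) / frame_height
    return [cx, cy, w, h]
```

The result can be passed directly as an element of bounding_boxes.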

propagate_in_video

Propagate prompts across video frames (streaming).
request = {
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",  # "forward", "backward", or "both"
    "start_frame_index": 0,  # Optional: default is first prompted frame
    "max_frame_num_to_track": None  # Optional: default is all frames
}

for response in predictor.handle_stream_request(request):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    # Process masks for this frame
    process_frame(frame_idx, outputs)
Parameters:
  • session_id (str, required): Session identifier
  • propagation_direction (str): Direction to propagate: "forward", "backward", or "both"
  • start_frame_index (int, optional): Starting frame index
  • max_frame_num_to_track (int, optional): Maximum number of frames to track
Yields:
  • frame_index (int): Current frame index
  • outputs (dict): Segmentation outputs for the frame
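The streamed per-frame responses can be folded into per-object track histories. A sketch that takes any iterable of response dicts, so it works directly on the generator returned by handle_stream_request:

```python
from collections import defaultdict

def collect_tracks(responses):
    """Group streamed propagation responses by object ID.

    Returns {obj_id: {frame_index: per_object_output}}.
    """
    tracks = defaultdict(dict)
    for response in responses:
        frame_idx = response["frame_index"]
        for obj_id, obj_out in response["outputs"].items():
            tracks[obj_id][frame_idx] = obj_out
    return dict(tracks)
```

Usage: tracks = collect_tracks(predictor.handle_stream_request(request))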

remove_object

Remove an object from tracking.
request = {
    "type": "remove_object",
    "session_id": session_id,
    "obj_id": 1,
    "is_user_action": True
}

response = predictor.handle_request(request)
Parameters:
  • session_id (str, required): Session identifier
  • obj_id (int, required): Object ID to remove
  • is_user_action (bool): Whether this is a user-initiated action

reset_session

Reset session to initial state.
request = {
    "type": "reset_session",
    "session_id": session_id
}

response = predictor.handle_request(request)

close_session

Close and clean up a session.
request = {
    "type": "close_session",
    "session_id": session_id
}

response = predictor.handle_request(request)

Example Usage

Basic Video Segmentation

from sam3.model.sam3_video_predictor import Sam3VideoPredictor

# Initialize predictor
predictor = Sam3VideoPredictor()

# Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]

# Add text prompt on first frame
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person"
})

# Propagate through video
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward"
}):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    print(f"Frame {frame_idx}: {len(outputs)} objects tracked")

# Clean up
predictor.handle_request({
    "type": "close_session",
    "session_id": session_id
})

Interactive Tracking with Points

# Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]

# Add point prompt
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": [[640, 360]],  # Click at center
    "point_labels": [1]  # Foreground
})

# Get object ID from response
obj_id = list(response["outputs"].keys())[0]

# Propagate
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id
}):
    # Process results
    pass

Box Prompting

# Add box prompt (normalized coordinates)
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "bounding_boxes": [[0.5, 0.5, 0.3, 0.4]],  # cx, cy, w, h
    "bounding_box_labels": [1]  # Positive box
})

Bidirectional Propagation

# Add prompt on middle frame
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 50,
    "text": "car"
})

# Propagate both forward and backward
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",
    "start_frame_index": 50
}):
    process(response)

Output Format

The outputs dictionary from add_prompt and propagate_in_video contains:
outputs = {
    obj_id: {
        "mask": binary_mask,  # bool array, (H, W)
        "score": confidence_score,  # float
        "bbox": [x0, y0, x1, y1],  # pixel coordinates
    },
    # ... more objects
}
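For visualization, each object's boolean mask can be alpha-blended onto the frame. A NumPy sketch (the helper name, color choice, and blend factor are ours):

```python
import numpy as np

def overlay_masks(frame, outputs, color=(0, 255, 0), alpha=0.5):
    """Blend each object's boolean (H, W) mask onto an (H, W, 3) uint8 frame."""
    out = frame.astype(np.float32)
    for obj in outputs.values():
        mask = obj["mask"]
        # Mix the original pixels with the overlay color where the mask is True.
        out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.astype(np.uint8)
```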

Notes

  • Each session manages its own state independently
  • Sessions should be closed to free GPU memory
  • Point/box coordinates for add_prompt are in pixel space
  • Bounding box format is [center_x, center_y, width, height], normalized to [0, 1]
  • The predictor supports both images and videos
  • Use async_loading_frames=True for faster video loading
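Since sessions must be closed explicitly to free GPU memory, a context manager can guarantee cleanup even when processing raises. A sketch that works with any object exposing handle_request (the helper name is ours, not part of the API):

```python
from contextlib import contextmanager

@contextmanager
def open_session(predictor, resource_path):
    """Start a session, yield its ID, and always close it on exit."""
    response = predictor.handle_request({
        "type": "start_session",
        "resource_path": resource_path,
    })
    session_id = response["session_id"]
    try:
        yield session_id
    finally:
        predictor.handle_request({
            "type": "close_session",
            "session_id": session_id,
        })
```

Usage: with open_session(predictor, "video.mp4") as session_id: ...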
