Overview

The Sam3VideoPredictor class provides a session-based API for video instance segmentation and tracking. It manages inference states across video frames and supports text, point, and box prompts.

Class Initialization

from sam3.model.sam3_video_predictor import Sam3VideoPredictor

predictor = Sam3VideoPredictor(
    checkpoint_path=None,
    bpe_path=None,
    has_presence_token=True,
    geo_encoder_use_img_cross_attn=True,
    strict_state_dict_loading=True,
    async_loading_frames=False,
    video_loader_type="cv2",
    apply_temporal_disambiguation=True,
    compile=False
)

Parameters

  • checkpoint_path (str | None, default None): Path to model checkpoint. If None, loads from Hugging Face.
  • bpe_path (str | None, default None): Path to the BPE tokenizer file.
  • has_presence_token (bool, default True): Whether to use a presence token for object detection.
  • geo_encoder_use_img_cross_attn (bool, default True): Whether the geometry encoder uses image cross-attention.
  • strict_state_dict_loading (bool, default True): Whether to enforce strict checkpoint loading.
  • async_loading_frames (bool, default False): Whether to load video frames asynchronously.
  • video_loader_type (str, default "cv2"): Video loader backend: "cv2" or "pyav".
  • apply_temporal_disambiguation (bool, default True): Whether to apply temporal disambiguation heuristics.
  • compile (bool, default False): Whether to compile the model for better performance.

Methods

handle_request

Dispatch a request to the predictor.
response = predictor.handle_request(request)
Parameters:
  • request (dict, required): Request dictionary with a "type" field and type-specific parameters.
Returns:
  • response (dict): Response dictionary with request-specific fields.

handle_stream_request

Dispatch a streaming request, yielding one response per processed frame.
for response in predictor.handle_stream_request(request):
    # Process each frame's results
    process(response)

Request Types

start_session

Start a new inference session on a video or image.
request = {
    "type": "start_session",
    "resource_path": "/path/to/video.mp4",
    "session_id": "optional-session-id"  # Auto-generated if not provided
}

response = predictor.handle_request(request)
session_id = response["session_id"]
Parameters:
  • resource_path (str): Path to a video file (MP4), a directory of JPEG frames, or an image file
  • session_id (str, optional): Session identifier (auto-generated if not provided)
Returns:
  • session_id (str): The session identifier

add_prompt

Add text, point, or box prompt on a specific frame.
request = {
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",  # Optional
    "points": [[500, 375]],  # Optional: list of [x, y]
    "point_labels": [1],  # Optional: 1=foreground, 0=background
    "bounding_boxes": [[0.3, 0.4, 0.2, 0.3]],  # Optional: [cx, cy, w, h] normalized
    "bounding_box_labels": [1],  # Optional: 1=positive, 0=negative
    "obj_id": None  # Optional: assign to existing object ID
}

response = predictor.handle_request(request)
frame_idx = response["frame_index"]
outputs = response["outputs"]  # Segmentation results
Parameters:
  • session_id (str, required): Session identifier
  • frame_index (int, required): Frame index to add prompt on
  • text (str, optional): Text description of object
  • points (list, optional): List of [x, y] point coordinates
  • point_labels (list, optional): List of point labels (1 or 0)
  • bounding_boxes (list, optional): List of boxes in [cx, cy, w, h] normalized format
  • bounding_box_labels (list, optional): List of box labels (1 or 0)
  • obj_id (int, optional): Object ID to assign prompt to
Returns:
  • frame_index (int): The frame index
  • outputs (dict): Segmentation masks and metadata
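Note that points are given in pixel space while boxes are normalized [cx, cy, w, h]. If your boxes come from a detector in pixel-space [x0, y0, x1, y1] format, a small conversion helper is handy. A minimal sketch (the helper name is ours, not part of the API):

```python
def xyxy_to_normalized_cxcywh(box, frame_width, frame_height):
    """Convert a pixel-space [x0, y0, x1, y1] box to the normalized
    [cx, cy, w, h] format expected by add_prompt."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0 / frame_width
    cy = (y0 + y1) / 2.0 / frame_height
    w = (x1 - x0) / frame_width
    h = (y1 - y0) / frame_height
    return [cx, cy, w, h]
```

The result can be passed directly as an element of bounding_boxes.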

propagate_in_video

Propagate prompts across video frames (streaming).
request = {
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",  # "forward", "backward", or "both"
    "start_frame_index": 0,  # Optional: default is first prompted frame
    "max_frame_num_to_track": None  # Optional: default is all frames
}

for response in predictor.handle_stream_request(request):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    # Process masks for this frame
    process_frame(frame_idx, outputs)
Parameters:
  • session_id (str, required): Session identifier
  • propagation_direction (str): Direction to propagate: "forward", "backward", or "both"
  • start_frame_index (int, optional): Starting frame index
  • max_frame_num_to_track (int, optional): Maximum number of frames to track
Yields:
  • frame_index (int): Current frame index
  • outputs (dict): Segmentation outputs for the frame
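The streamed per-frame responses can be folded into per-object track histories. A sketch that takes any iterable of response dicts, so it works directly on the generator returned by handle_stream_request:

```python
from collections import defaultdict

def collect_tracks(responses):
    """Group streamed propagation responses by object ID.

    Returns {obj_id: {frame_index: per_object_output}}.
    """
    tracks = defaultdict(dict)
    for response in responses:
        frame_idx = response["frame_index"]
        for obj_id, obj_out in response["outputs"].items():
            tracks[obj_id][frame_idx] = obj_out
    return dict(tracks)
```

Usage: tracks = collect_tracks(predictor.handle_stream_request(request))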

remove_object

Remove an object from tracking.
request = {
    "type": "remove_object",
    "session_id": session_id,
    "obj_id": 1,
    "is_user_action": True
}

response = predictor.handle_request(request)
Parameters:
  • session_id (str, required): Session identifier
  • obj_id (int, required): Object ID to remove
  • is_user_action (bool): Whether this is a user-initiated action

reset_session

Reset session to initial state.
request = {
    "type": "reset_session",
    "session_id": session_id
}

response = predictor.handle_request(request)

close_session

Close and clean up a session.
request = {
    "type": "close_session",
    "session_id": session_id
}

response = predictor.handle_request(request)

Example Usage

Basic Video Segmentation

from sam3.model.sam3_video_predictor import Sam3VideoPredictor

# Initialize predictor
predictor = Sam3VideoPredictor()

# Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]

# Add text prompt on first frame
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person"
})

# Propagate through video
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward"
}):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    print(f"Frame {frame_idx}: {len(outputs)} objects tracked")

# Clean up
predictor.handle_request({
    "type": "close_session",
    "session_id": session_id
})

Interactive Tracking with Points

# Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]

# Add point prompt
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": [[640, 360]],  # Click at center
    "point_labels": [1]  # Foreground
})

# Get object ID from response
obj_id = list(response["outputs"].keys())[0]

# Propagate
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id
}):
    # Process results
    pass

Box Prompting

# Add box prompt (normalized coordinates)
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "bounding_boxes": [[0.5, 0.5, 0.3, 0.4]],  # cx, cy, w, h
    "bounding_box_labels": [1]  # Positive box
})

Bidirectional Propagation

# Add prompt on middle frame
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 50,
    "text": "car"
})

# Propagate both forward and backward
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",
    "start_frame_index": 50
}):
    process(response)

Output Format

The outputs dictionary from add_prompt and propagate_in_video contains:
outputs = {
    obj_id: {
        "mask": binary_mask,  # bool array, (H, W)
        "score": confidence_score,  # float
        "bbox": [x0, y0, x1, y1],  # pixel coordinates
    },
    # ... more objects
}
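For visualization, each object's boolean mask can be alpha-blended onto the frame. A NumPy sketch (the helper name, color choice, and blend factor are ours):

```python
import numpy as np

def overlay_masks(frame, outputs, color=(0, 255, 0), alpha=0.5):
    """Blend each object's boolean (H, W) mask onto an (H, W, 3) uint8 frame."""
    out = frame.astype(np.float32)
    for obj in outputs.values():
        mask = obj["mask"]
        # Mix the original pixels with the overlay color where the mask is True.
        out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=np.float32)
    return out.astype(np.uint8)
```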

Notes

  • Each session manages its own state independently
  • Sessions should be closed to free GPU memory
  • Point/box coordinates for add_prompt are in pixel space
  • Bounding box format is [center_x, center_y, width, height], normalized to [0, 1]
  • The predictor supports both images and videos
  • Use async_loading_frames=True for faster video loading
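Since sessions must be closed explicitly to free GPU memory, a context manager can guarantee cleanup even when processing raises. A sketch that works with any object exposing handle_request (the helper name is ours, not part of the API):

```python
from contextlib import contextmanager

@contextmanager
def open_session(predictor, resource_path):
    """Start a session, yield its ID, and always close it on exit."""
    response = predictor.handle_request({
        "type": "start_session",
        "resource_path": resource_path,
    })
    session_id = response["session_id"]
    try:
        yield session_id
    finally:
        predictor.handle_request({
            "type": "close_session",
            "session_id": session_id,
        })
```

Usage: with open_session(predictor, "video.mp4") as session_id: ...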
