Overview
The Sam3VideoPredictor class provides a session-based API for video instance segmentation and tracking. It manages per-session inference state across video frames and supports text, point, and box prompts.
Class Initialization
from sam3.model.sam3_video_predictor import Sam3VideoPredictor
predictor = Sam3VideoPredictor(
checkpoint_path=None,
bpe_path=None,
has_presence_token=True,
geo_encoder_use_img_cross_attn=True,
strict_state_dict_loading=True,
async_loading_frames=False,
video_loader_type="cv2",
apply_temporal_disambiguation=True,
compile=False
)
Parameters
checkpoint_path
Path to model checkpoint. If None, loads from Hugging Face.
bpe_path
Path to BPE tokenizer file.
has_presence_token
Whether to use presence token for object detection.
geo_encoder_use_img_cross_attn
Whether geometry encoder uses image cross-attention.
strict_state_dict_loading
Whether to enforce strict checkpoint loading.
async_loading_frames
Whether to load video frames asynchronously.
video_loader_type
Video loader backend: "cv2" or "pyav".
apply_temporal_disambiguation
Whether to apply temporal disambiguation heuristics.
compile
Whether to compile the model for better performance.
Methods
handle_request
Dispatch a request to the predictor.
response = predictor.handle_request(request)
Parameters:
request (dict): Request dictionary with a type field and type-specific parameters.
Returns:
response (dict): Response dictionary with request-specific fields.
handle_stream_request
Dispatch a streaming request (yields results).
for response in predictor.handle_stream_request(request):
# Process each frame's results
process(response)
Request Types
start_session
Start a new inference session on a video or image.
request = {
"type": "start_session",
"resource_path": "/path/to/video.mp4",
"session_id": "optional-session-id" # Auto-generated if not provided
}
response = predictor.handle_request(request)
session_id = response["session_id"]
Parameters:
resource_path (str): Path to a video file (MP4), a directory of JPEG frames, or an image file
session_id (str, optional): Session identifier (auto-generated if not provided)
Returns:
session_id (str): The session identifier
add_prompt
Add text, point, or box prompt on a specific frame.
request = {
"type": "add_prompt",
"session_id": session_id,
"frame_index": 0,
"text": "person", # Optional
"points": [[500, 375]], # Optional: list of [x, y]
"point_labels": [1], # Optional: 1=foreground, 0=background
"bounding_boxes": [[0.3, 0.4, 0.2, 0.3]], # Optional: [cx, cy, w, h] normalized
"bounding_box_labels": [1], # Optional: 1=positive, 0=negative
"obj_id": None # Optional: assign to existing object ID
}
response = predictor.handle_request(request)
frame_idx = response["frame_index"]
outputs = response["outputs"] # Segmentation results
Parameters:
session_id (str, required): Session identifier
frame_index (int, required): Frame index to add prompt on
text (str, optional): Text description of object
points (list, optional): List of [x, y] point coordinates
point_labels (list, optional): List of point labels (1 or 0)
bounding_boxes (list, optional): List of boxes in [cx, cy, w, h] normalized format
bounding_box_labels (list, optional): List of box labels (1 or 0)
obj_id (int, optional): Object ID to assign prompt to
Returns:
frame_index (int): The frame index
outputs (dict): Segmentation masks and metadata
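Note that points are given in pixel space while bounding boxes use normalized [cx, cy, w, h] coordinates. If your boxes come from a UI or detector in pixel-space corner format, a small conversion helper (a hypothetical sketch, not part of the sam3 API) may be useful:

```python
def corner_box_to_normalized_cxcywh(box, image_width, image_height):
    """Convert a pixel-space [x0, y0, x1, y1] corner box to the
    normalized [cx, cy, w, h] format expected by add_prompt."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0 / image_width
    cy = (y0 + y1) / 2.0 / image_height
    w = (x1 - x0) / image_width
    h = (y1 - y0) / image_height
    return [cx, cy, w, h]

# A 1280x720 frame with a box covering its left half:
print(corner_box_to_normalized_cxcywh([0, 0, 640, 720], 1280, 720))
# -> [0.25, 0.5, 0.5, 1.0]
```

The result can be passed directly as an entry in the "bounding_boxes" list.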
propagate_in_video
Propagate prompts across video frames (streaming).
request = {
"type": "propagate_in_video",
"session_id": session_id,
"propagation_direction": "both", # "forward", "backward", or "both"
"start_frame_index": 0, # Optional: default is first prompted frame
"max_frame_num_to_track": None # Optional: default is all frames
}
for response in predictor.handle_stream_request(request):
frame_idx = response["frame_index"]
outputs = response["outputs"]
# Process masks for this frame
process_frame(frame_idx, outputs)
Parameters:
session_id (str, required): Session identifier
propagation_direction (str): Direction to propagate: "forward", "backward", or "both"
start_frame_index (int, optional): Starting frame index
max_frame_num_to_track (int, optional): Maximum number of frames to track
Yields:
frame_index (int): Current frame index
outputs (dict): Segmentation outputs for the frame
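Because propagation streams one response per frame, it is often convenient to gather the results into a single mapping before post-processing. A minimal sketch (the helper name is ours; it works with any iterable of response dicts shaped like those yielded by handle_stream_request):

```python
def collect_propagation_results(responses):
    """Gather streamed propagation responses into a dict mapping
    frame_index -> outputs."""
    results = {}
    for response in responses:
        results[response["frame_index"]] = response["outputs"]
    return results

# With a stubbed stream standing in for predictor.handle_stream_request:
fake_stream = [
    {"frame_index": 0, "outputs": {1: {"score": 0.9}}},
    {"frame_index": 1, "outputs": {1: {"score": 0.8}}},
]
by_frame = collect_propagation_results(fake_stream)
print(sorted(by_frame))  # -> [0, 1]
```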
remove_object
Remove an object from tracking.
request = {
"type": "remove_object",
"session_id": session_id,
"obj_id": 1,
"is_user_action": True
}
response = predictor.handle_request(request)
Parameters:
session_id (str, required): Session identifier
obj_id (int, required): Object ID to remove
is_user_action (bool): Whether this is a user-initiated action
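A common reason to remove an object is low tracking confidence. Assuming the per-object "score" field documented in the outputs structure on this page, a hypothetical helper for picking the weakest track might look like:

```python
def lowest_score_obj_id(outputs):
    """Return the obj_id with the lowest confidence score, or None
    if the outputs dict is empty. Assumes each entry carries a
    "score" field, as in the outputs format on this page."""
    if not outputs:
        return None
    return min(outputs, key=lambda obj_id: outputs[obj_id]["score"])

outputs = {1: {"score": 0.92}, 2: {"score": 0.41}, 3: {"score": 0.77}}
print(lowest_score_obj_id(outputs))  # -> 2
```

The returned obj_id can then be passed in a remove_object request.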
reset_session
Reset session to initial state.
request = {
"type": "reset_session",
"session_id": session_id
}
response = predictor.handle_request(request)
close_session
Close and clean up a session.
request = {
"type": "close_session",
"session_id": session_id
}
response = predictor.handle_request(request)
Example Usage
Basic Video Segmentation
from sam3.model.sam3_video_predictor import Sam3VideoPredictor
# Initialize predictor
predictor = Sam3VideoPredictor()
# Start session
response = predictor.handle_request({
"type": "start_session",
"resource_path": "video.mp4"
})
session_id = response["session_id"]
# Add text prompt on first frame
predictor.handle_request({
"type": "add_prompt",
"session_id": session_id,
"frame_index": 0,
"text": "person"
})
# Propagate through video
for response in predictor.handle_stream_request({
"type": "propagate_in_video",
"session_id": session_id,
"propagation_direction": "forward"
}):
frame_idx = response["frame_index"]
outputs = response["outputs"]
print(f"Frame {frame_idx}: {len(outputs)} objects tracked")
# Clean up
predictor.handle_request({
"type": "close_session",
"session_id": session_id
})
Interactive Tracking with Points
# Start session
response = predictor.handle_request({
"type": "start_session",
"resource_path": "video.mp4"
})
session_id = response["session_id"]
# Add point prompt
response = predictor.handle_request({
"type": "add_prompt",
"session_id": session_id,
"frame_index": 0,
"points": [[640, 360]], # Click at center
"point_labels": [1] # Foreground
})
# Get object ID from response
obj_id = next(iter(response["outputs"]))
# Propagate
for response in predictor.handle_stream_request({
"type": "propagate_in_video",
"session_id": session_id
}):
# Process results
pass
Box Prompting
# Add box prompt (normalized coordinates)
predictor.handle_request({
"type": "add_prompt",
"session_id": session_id,
"frame_index": 0,
"bounding_boxes": [[0.5, 0.5, 0.3, 0.4]], # cx, cy, w, h
"bounding_box_labels": [1] # Positive box
})
Bidirectional Propagation
# Add prompt on middle frame
predictor.handle_request({
"type": "add_prompt",
"session_id": session_id,
"frame_index": 50,
"text": "car"
})
# Propagate both forward and backward
for response in predictor.handle_stream_request({
"type": "propagate_in_video",
"session_id": session_id,
"propagation_direction": "both",
"start_frame_index": 50
}):
process(response)
Output Format
The outputs dictionary returned by add_prompt and yielded by propagate_in_video contains:
outputs = {
obj_id: {
"mask": binary_mask, # bool array, (H, W)
"score": confidence_score, # float
"bbox": [x0, y0, x1, y1], # pixel coordinates
},
# ... more objects
}
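Given this structure, downstream code can filter out low-confidence objects before rendering or export. A small convenience sketch (the helper and threshold are ours; the key names follow the outputs structure shown above):

```python
def filter_by_score(outputs, min_score=0.5):
    """Keep only the objects whose confidence meets the threshold."""
    return {
        obj_id: data
        for obj_id, data in outputs.items()
        if data["score"] >= min_score
    }

outputs = {
    1: {"mask": None, "score": 0.95, "bbox": [10, 20, 100, 200]},
    2: {"mask": None, "score": 0.30, "bbox": [5, 5, 50, 50]},
}
print(list(filter_by_score(outputs, min_score=0.5)))  # -> [1]
```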
Notes
- Each session manages its own state independently
- Sessions should be closed to free GPU memory
- Point/box coordinates for add_prompt are in pixel space
- Bounding box format is [center_x, center_y, width, height], normalized to [0, 1]
- The predictor supports both images and videos
- Use async_loading_frames=True for faster video loading
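Since sessions must be closed to free GPU memory, a context manager can guarantee cleanup even when an error is raised mid-tracking. This is a hypothetical convenience wrapper around the request API documented above, not part of sam3 itself; the demonstration uses a stub in place of a real predictor:

```python
from contextlib import contextmanager

@contextmanager
def open_session(predictor, resource_path):
    """Start a session and guarantee close_session runs on exit,
    even if an exception is raised inside the with-block."""
    response = predictor.handle_request({
        "type": "start_session",
        "resource_path": resource_path,
    })
    session_id = response["session_id"]
    try:
        yield session_id
    finally:
        predictor.handle_request({
            "type": "close_session",
            "session_id": session_id,
        })

# Demonstration with a minimal stub standing in for Sam3VideoPredictor:
class _StubPredictor:
    def __init__(self):
        self.requests = []
    def handle_request(self, request):
        self.requests.append(request)
        return {"session_id": "demo-session"}

stub = _StubPredictor()
with open_session(stub, "video.mp4") as session_id:
    pass  # add_prompt / propagate_in_video calls would go here
print(stub.requests[-1]["type"])  # -> close_session
```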