Overview

The SAM 3 video API uses a request-response pattern for all operations. This page documents the request and response formats for each operation type.

Request Structure

All requests are Python dictionaries with a type field:
request = {
    "type": "request_type",
    # ... type-specific parameters
}

Session Management

start_session

Start a new inference session on a video or image. Request:
{
    "type": "start_session",
    "resource_path": str,  # Path to video/image file or JPEG frame directory
    "session_id": Optional[str]  # Optional session ID (auto-generated if omitted)
}
Response:
{
    "session_id": str  # Session identifier for subsequent requests
}
Example:
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "/path/to/video.mp4"
})
session_id = response["session_id"]

reset_session

Reset session to its initial state (removes all prompts and results). Request:
{
    "type": "reset_session",
    "session_id": str
}
Response:
{
    "is_success": bool
}

close_session

Close and clean up a session (frees GPU memory). Request:
{
    "type": "close_session",
    "session_id": str
}
Response:
{
    "is_success": bool
}

Prompting

add_prompt

Add a text, point, or box prompt on a specific video frame. Request:
{
    "type": "add_prompt",
    "session_id": str,
    "frame_index": int,  # 0-based frame index
    
    # Optional: text prompt
    "text": Optional[str],
    
    # Optional: point prompts
    "points": Optional[List[List[float]]],  # [[x1, y1], [x2, y2], ...]
    "point_labels": Optional[List[int]],  # [1, 0, ...] (1=foreground, 0=background)
    
    # Optional: box prompts
    "bounding_boxes": Optional[List[List[float]]],  # [[cx, cy, w, h], ...] (normalized)
    "bounding_box_labels": Optional[List[int]],  # [1, 0, ...] (1=positive, 0=negative)
    
    # Optional: object assignment
    "obj_id": Optional[int]  # Assign prompt to existing object
}
Response:
{
    "frame_index": int,
    "outputs": Dict[int, Dict]  # Object ID -> segmentation result
}
Output Format:
outputs = {
    obj_id: {
        "mask": np.ndarray,  # Binary mask (H, W), dtype=bool
        "score": float,  # Confidence score
        "bbox": List[float],  # [x0, y0, x1, y1] in pixels
    }
}
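As an illustration of consuming this format, the sketch below picks the highest-scoring object and reports its mask area. Only the output schema above is assumed; best_object is a hypothetical helper, not part of the API:

```python
import numpy as np

def best_object(outputs):
    # Pick the object ID with the highest confidence score and
    # report its mask area in pixels (True entries in the bool mask).
    best_id = max(outputs, key=lambda obj_id: outputs[obj_id]["score"])
    area = int(outputs[best_id]["mask"].sum())
    return best_id, area
```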
Examples:

Text prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person"
})
Point prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": [[640, 360], [700, 400]],
    "point_labels": [1, 1]  # Both foreground
})
Box prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "bounding_boxes": [[0.5, 0.5, 0.3, 0.4]],  # center_x, center_y, width, height (0-1)
    "bounding_box_labels": [1]  # Positive box
})
Combined prompts:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "dog",
    "points": [[500, 300]],
    "point_labels": [1],
    "bounding_boxes": [[0.4, 0.3, 0.2, 0.3]],
    "bounding_box_labels": [1]
})

remove_object

Remove an object from tracking. Request:
{
    "type": "remove_object",
    "session_id": str,
    "obj_id": int,
    "is_user_action": bool  # Whether this is a user-initiated removal
}
Response:
{
    "is_success": bool
}
Example:
predictor.handle_request({
    "type": "remove_object",
    "session_id": session_id,
    "obj_id": 1,
    "is_user_action": True
})

Propagation

propagate_in_video

Propagate prompts to get segmentation results across video frames. Request:
{
    "type": "propagate_in_video",
    "session_id": str,
    "propagation_direction": str,  # "forward", "backward", or "both"
    "start_frame_index": Optional[int],  # Starting frame (default: first prompted frame)
    "max_frame_num_to_track": Optional[int]  # Max frames to track (default: all)
}
Response (streaming): Unlike the other operations, this request is handled with handle_stream_request, which yields one response per processed frame:
for response in predictor.handle_stream_request(request):
    # response format:
    {
        "frame_index": int,
        "outputs": Dict[int, Dict]  # Same format as add_prompt outputs
    }
Examples:

Forward propagation:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward",
    "start_frame_index": 0
}):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    # Process frame
Backward propagation:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "backward",
    "start_frame_index": 100
}):
    # Process frames 99, 98, 97, ...
    pass
Bidirectional:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",
    "start_frame_index": 50,
    "max_frame_num_to_track": 100
}):
    # Processes frames 50->149, then 49->0
    pass
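Streamed results are often easiest to work with after collecting them into a frame-indexed dictionary for random access. A minimal sketch assuming only the streaming response format above (collect_propagation is a hypothetical helper):

```python
def collect_propagation(responses):
    # Map frame_index -> outputs for every streamed response.
    results = {}
    for response in responses:
        results[response["frame_index"]] = response["outputs"]
    return results
```

Usage: `results = collect_propagation(predictor.handle_stream_request(request))`.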

Coordinate Systems

Points

Points are in pixel coordinates (x, y):
  • x: horizontal position (0 to image_width)
  • y: vertical position (0 to image_height)
"points": [[320, 240]]  # x=320 pixels, y=240 pixels

Bounding Boxes

Boxes in requests use normalized center-width-height format:
  • center_x: horizontal center (0.0 to 1.0)
  • center_y: vertical center (0.0 to 1.0)
  • width: box width (0.0 to 1.0)
  • height: box height (0.0 to 1.0)
"bounding_boxes": [[0.5, 0.5, 0.3, 0.4]]  # Center at 50%, 50%, size 30%x40%
Boxes in responses use pixel XYXY format:
  • [x0, y0, x1, y1]: top-left and bottom-right corners in pixels
outputs[obj_id]["bbox"] = [100, 150, 300, 400]  # x0, y0, x1, y1
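Because requests take normalized center-width-height boxes while responses return pixel XYXY boxes, a pair of conversion helpers is often handy. A sketch under the conventions above (the function names are illustrative, not part of the API):

```python
def cxcywh_to_xyxy(box, width, height):
    # Normalized [cx, cy, w, h] -> pixel [x0, y0, x1, y1].
    cx, cy, w, h = box
    return [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    ]

def xyxy_to_cxcywh(box, width, height):
    # Pixel [x0, y0, x1, y1] -> normalized [cx, cy, w, h].
    x0, y0, x1, y1 = box
    return [
        (x0 + x1) / 2 / width,
        (y0 + y1) / 2 / height,
        (x1 - x0) / width,
        (y1 - y0) / height,
    ]
```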

Label Conventions

Point Labels

  • 1: Foreground point (include this region)
  • 0: Background point (exclude this region)

Box Labels

  • 1: Positive box (include objects in this box)
  • 0: Negative box (exclude objects in this box)
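Mismatched coordinate/label lengths are a common source of invalid add_prompt requests, so a small pre-flight check can catch them before the call. A hedged sketch (validate_prompt_labels is a hypothetical helper, not part of the API):

```python
def validate_prompt_labels(coords, labels, name):
    # Ensure every point/box has exactly one 0-or-1 label.
    if (coords is None) != (labels is None):
        raise ValueError(f"{name}: coordinates and labels must be given together")
    if coords is None:
        return
    if len(coords) != len(labels):
        raise ValueError(f"{name}: {len(coords)} entries but {len(labels)} labels")
    if any(label not in (0, 1) for label in labels):
        raise ValueError(f"{name}: labels must be 0 or 1")
```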

Error Handling

Invalid requests raise RuntimeError:
try:
    response = predictor.handle_request(request)
except RuntimeError as e:
    print(f"Request failed: {e}")
Common errors:
  • Session not found: Invalid or expired session_id
  • Invalid frame index: frame_index out of range
  • Missing prompts: Propagation before adding any prompts
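When requests are driven by external input (for example, a UI), it can help to centralize this handling so callers get a result-or-error pair instead of an exception. A minimal sketch assuming only that failures raise RuntimeError (try_request is a hypothetical helper):

```python
def try_request(handler, request):
    # Return (response, None) on success, (None, error message) on failure.
    try:
        return handler(request), None
    except RuntimeError as e:
        return None, str(e)
```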

Complete Workflow Example

from sam3.model.sam3_video_predictor import Sam3VideoPredictor

predictor = Sam3VideoPredictor()

# 1. Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]

# 2. Add prompt on first frame
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
    "points": [[640, 360]],
    "point_labels": [1]
})

# 3. Propagate through video
results = []
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward"
}):
    results.append(response)

# 4. Process results
for result in results:
    frame_idx = result["frame_index"]
    for obj_id, obj_data in result["outputs"].items():
        mask = obj_data["mask"]
        score = obj_data["score"]
        bbox = obj_data["bbox"]
        # Save or visualize

# 5. Clean up
predictor.handle_request({
    "type": "close_session",
    "session_id": session_id
})