Overview

The SAM 3 video API uses a request-response pattern for all operations. This page documents the request and response formats for each operation type.

Request Structure

All requests are Python dictionaries with a type field:
request = {
    "type": "request_type",
    # ... type-specific parameters
}

Session Management

start_session

Start a new inference session on a video or image. Request:
{
    "type": "start_session",
    "resource_path": str,  # Path to video/image file or JPEG frame directory
    "session_id": Optional[str]  # Optional session ID (auto-generated if omitted)
}
Response:
{
    "session_id": str  # Session identifier for subsequent requests
}
Example:
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "/path/to/video.mp4"
})
session_id = response["session_id"]

reset_session

Reset session to its initial state (removes all prompts and results). Request:
{
    "type": "reset_session",
    "session_id": str
}
Response:
{
    "is_success": bool
}

close_session

Close and clean up a session (frees GPU memory). Request:
{
    "type": "close_session",
    "session_id": str
}
Response:
{
    "is_success": bool
}

Prompting

add_prompt

Add a text, point, or box prompt on a specific video frame. Request:
{
    "type": "add_prompt",
    "session_id": str,
    "frame_index": int,  # 0-based frame index
    
    # Optional: text prompt
    "text": Optional[str],
    
    # Optional: point prompts
    "points": Optional[List[List[float]]],  # [[x1, y1], [x2, y2], ...]
    "point_labels": Optional[List[int]],  # [1, 0, ...] (1=foreground, 0=background)
    
    # Optional: box prompts
    "bounding_boxes": Optional[List[List[float]]],  # [[cx, cy, w, h], ...] (normalized)
    "bounding_box_labels": Optional[List[int]],  # [1, 0, ...] (1=positive, 0=negative)
    
    # Optional: object assignment
    "obj_id": Optional[int]  # Assign prompt to existing object
}
Response:
{
    "frame_index": int,
    "outputs": Dict[int, Dict]  # Object ID -> segmentation result
}
Output Format:
outputs = {
    obj_id: {
        "mask": np.ndarray,  # Binary mask (H, W), dtype=bool
        "score": float,  # Confidence score
        "bbox": List[float],  # [x0, y0, x1, y1] in pixels
    }
}
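As an illustration of consuming this format, the sketch below picks the highest-scoring object and reports its mask area. Only the output schema above is assumed; best_object is a hypothetical helper, not part of the API:

```python
import numpy as np

def best_object(outputs):
    # Pick the object ID with the highest confidence score and
    # report its mask area in pixels (True entries in the bool mask).
    best_id = max(outputs, key=lambda obj_id: outputs[obj_id]["score"])
    area = int(outputs[best_id]["mask"].sum())
    return best_id, area
```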
Examples:

Text prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person"
})
Point prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "points": [[640, 360], [700, 400]],
    "point_labels": [1, 1]  # Both foreground
})
Box prompt:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "bounding_boxes": [[0.5, 0.5, 0.3, 0.4]],  # center_x, center_y, width, height (0-1)
    "bounding_box_labels": [1]  # Positive box
})
Combined prompts:
predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "dog",
    "points": [[500, 300]],
    "point_labels": [1],
    "bounding_boxes": [[0.4, 0.3, 0.2, 0.3]],
    "bounding_box_labels": [1]
})

remove_object

Remove an object from tracking. Request:
{
    "type": "remove_object",
    "session_id": str,
    "obj_id": int,
    "is_user_action": bool  # Whether this is a user-initiated removal
}
Response:
{
    "is_success": bool
}
Example:
predictor.handle_request({
    "type": "remove_object",
    "session_id": session_id,
    "obj_id": 1,
    "is_user_action": True
})

Propagation

propagate_in_video

Propagate prompts to get segmentation results across video frames. Request:
{
    "type": "propagate_in_video",
    "session_id": str,
    "propagation_direction": str,  # "forward", "backward", or "both"
    "start_frame_index": Optional[int],  # Starting frame (default: first prompted frame)
    "max_frame_num_to_track": Optional[int]  # Max frames to track (default: all)
}
Response (streaming): Unlike the other operations, this request is handled with handle_stream_request, which yields one response per processed frame:
for response in predictor.handle_stream_request(request):
    # response format:
    {
        "frame_index": int,
        "outputs": Dict[int, Dict]  # Same format as add_prompt outputs
    }
Examples:

Forward propagation:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward",
    "start_frame_index": 0
}):
    frame_idx = response["frame_index"]
    outputs = response["outputs"]
    # Process frame
Backward propagation:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "backward",
    "start_frame_index": 100
}):
    # Process frames 99, 98, 97, ...
    pass
Bidirectional:
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "both",
    "start_frame_index": 50,
    "max_frame_num_to_track": 100
}):
    # Processes frames 50->149, then 49->0
    pass
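Streamed results are often easiest to work with after collecting them into a frame-indexed dictionary for random access. A minimal sketch assuming only the streaming response format above (collect_propagation is a hypothetical helper):

```python
def collect_propagation(responses):
    # Map frame_index -> outputs for every streamed response.
    results = {}
    for response in responses:
        results[response["frame_index"]] = response["outputs"]
    return results
```

Usage: `results = collect_propagation(predictor.handle_stream_request(request))`.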

Coordinate Systems

Points

Points are in pixel coordinates (x, y):
  • x: horizontal position (0 to image_width)
  • y: vertical position (0 to image_height)
"points": [[320, 240]]  # x=320 pixels, y=240 pixels

Bounding Boxes

Boxes in requests use normalized center-width-height format:
  • center_x: horizontal center (0.0 to 1.0)
  • center_y: vertical center (0.0 to 1.0)
  • width: box width (0.0 to 1.0)
  • height: box height (0.0 to 1.0)
"bounding_boxes": [[0.5, 0.5, 0.3, 0.4]]  # Center at 50%, 50%, size 30%x40%
Boxes in responses use pixel XYXY format:
  • [x0, y0, x1, y1]: top-left and bottom-right corners in pixels
outputs[obj_id]["bbox"] = [100, 150, 300, 400]  # x0, y0, x1, y1
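Because requests take normalized center-width-height boxes while responses return pixel XYXY boxes, a pair of conversion helpers is often handy. A sketch under the conventions above (the function names are illustrative, not part of the API):

```python
def cxcywh_to_xyxy(box, width, height):
    # Normalized [cx, cy, w, h] -> pixel [x0, y0, x1, y1].
    cx, cy, w, h = box
    return [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    ]

def xyxy_to_cxcywh(box, width, height):
    # Pixel [x0, y0, x1, y1] -> normalized [cx, cy, w, h].
    x0, y0, x1, y1 = box
    return [
        (x0 + x1) / 2 / width,
        (y0 + y1) / 2 / height,
        (x1 - x0) / width,
        (y1 - y0) / height,
    ]
```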

Label Conventions

Point Labels

  • 1: Foreground point (include this region)
  • 0: Background point (exclude this region)

Box Labels

  • 1: Positive box (include objects in this box)
  • 0: Negative box (exclude objects in this box)
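Mismatched coordinate/label lengths are a common source of invalid add_prompt requests, so a small pre-flight check can catch them before the call. A hedged sketch (validate_prompt_labels is a hypothetical helper, not part of the API):

```python
def validate_prompt_labels(coords, labels, name):
    # Ensure every point/box has exactly one 0-or-1 label.
    if (coords is None) != (labels is None):
        raise ValueError(f"{name}: coordinates and labels must be given together")
    if coords is None:
        return
    if len(coords) != len(labels):
        raise ValueError(f"{name}: {len(coords)} entries but {len(labels)} labels")
    if any(label not in (0, 1) for label in labels):
        raise ValueError(f"{name}: labels must be 0 or 1")
```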

Error Handling

Invalid requests raise RuntimeError:
try:
    response = predictor.handle_request(request)
except RuntimeError as e:
    print(f"Request failed: {e}")
Common errors:
  • Session not found: Invalid or expired session_id
  • Invalid frame index: frame_index out of range
  • Missing prompts: Propagation before adding any prompts
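When requests are driven by external input (for example, a UI), it can help to centralize this handling so callers get a result-or-error pair instead of an exception. A minimal sketch assuming only that failures raise RuntimeError (try_request is a hypothetical helper):

```python
def try_request(handler, request):
    # Return (response, None) on success, (None, error message) on failure.
    try:
        return handler(request), None
    except RuntimeError as e:
        return None, str(e)
```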

Complete Workflow Example

from sam3.model.sam3_video_predictor import Sam3VideoPredictor

predictor = Sam3VideoPredictor()

# 1. Start session
response = predictor.handle_request({
    "type": "start_session",
    "resource_path": "video.mp4"
})
session_id = response["session_id"]

# 2. Add prompt on first frame
response = predictor.handle_request({
    "type": "add_prompt",
    "session_id": session_id,
    "frame_index": 0,
    "text": "person",
    "points": [[640, 360]],
    "point_labels": [1]
})

# 3. Propagate through video
results = []
for response in predictor.handle_stream_request({
    "type": "propagate_in_video",
    "session_id": session_id,
    "propagation_direction": "forward"
}):
    results.append(response)

# 4. Process results
for result in results:
    frame_idx = result["frame_index"]
    for obj_id, obj_data in result["outputs"].items():
        mask = obj_data["mask"]
        score = obj_data["score"]
        bbox = obj_data["bbox"]
        # Save or visualize

# 5. Clean up
predictor.handle_request({
    "type": "close_session",
    "session_id": session_id
})