Overview

CVAT’s serverless architecture allows you to integrate custom AI models by packaging them as Nuclio functions. This guide covers creating functions for detectors, interactors, trackers, and ReID models.

Function Structure

Each serverless function consists of:
my-custom-function/
├── nuclio/
│   ├── function.yaml       # Nuclio function configuration
│   ├── main.py            # Handler and request processing
│   ├── model_handler.py   # Model loading and inference
│   └── requirements.txt   # Python dependencies (optional)
└── README.md              # Documentation (optional)

Creating a Detector Function

Step 1: Define Function Metadata

Create function.yaml to define the function's metadata, labels, and build configuration:
metadata:
  name: my-custom-detector
  namespace: cvat
  annotations:
    name: My Custom Detector
    type: detector
    version: 1
    spec: |
      [
        { "id": 0, "name": "class1", "type": "rectangle" },
        { "id": 1, "name": "class2", "type": "polygon" },
        { "id": 2, "name": "class3", "type": "mask" }
      ]

spec:
  description: Custom object detector
  runtime: 'python:3.10'
  handler: main:handler
  eventTimeout: 30s
  
  build:
    image: cvat.custom.detector
    baseImage: ubuntu:22.04
    directives:
      preCopy:
        - kind: RUN
          value: apt-get update && apt-get install -y python3-pip
        - kind: WORKDIR
          value: /opt/nuclio
        - kind: RUN
          value: pip install torch torchvision opencv-python-headless pillow numpy

  triggers:
    myHttpTrigger:
      numWorkers: 2
      kind: 'http'
      workerAvailabilityTimeoutMilliseconds: 10000
      attributes:
        maxRequestBodySize: 33554432 # 32MB

  platform:
    attributes:
      restartPolicy:
        name: always
        maximumRetryCount: 3

Step 2: Implement Model Handler

Create model_handler.py to load and run your model:
import torch
import cv2
import numpy as np
from typing import List, Dict

class ModelHandler:
    def __init__(self):
        """Initialize and load the model."""
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = self._load_model()
        self.model.eval()
        
    def _load_model(self):
        """Load your custom model."""
        # YourCustomModel is a placeholder for your own architecture
        model = YourCustomModel()
        model.load_state_dict(torch.load('model_weights.pth', map_location=self.device))
        model.to(self.device)
        return model
    
    def preprocess(self, image: np.ndarray) -> torch.Tensor:
        """Preprocess image for model input."""
        # Resize, normalize, convert to tensor
        # Note: cv2.imdecode produces BGR; add cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # here if your model expects RGB input
        image = cv2.resize(image, (640, 640))
        image = image.astype(np.float32) / 255.0
        image = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0)
        return image.to(self.device)
    
    def infer(self, image: np.ndarray, threshold: float = 0.5) -> List[Dict]:
        """Run inference and return detections."""
        preprocessed = self.preprocess(image)
        
        with torch.no_grad():
            predictions = self.model(preprocessed)
        
        return self._postprocess(predictions, image.shape, threshold)
    
    def _postprocess(self, predictions, original_shape, threshold):
        """Convert model output to CVAT format."""
        detections = []
        
        for pred in predictions:
            if pred['score'] < threshold:
                continue
                
            detection = {
                'label': self._get_label_name(pred['class_id']),
                'confidence': float(pred['score']),
                'type': 'rectangle',  # or 'polygon', 'mask'
                'points': self._convert_bbox(pred['bbox'], original_shape)
            }
            detections.append(detection)
        
        return detections
    
    def _get_label_name(self, class_id: int) -> str:
        """Map class ID to label name."""
        labels = ['class1', 'class2', 'class3']
        return labels[class_id]
    
    def _convert_bbox(self, bbox, original_shape):
        """Convert bbox to CVAT format [xtl, ytl, xbr, ybr]."""
        # Assumes the model outputs normalized [0, 1] coordinates;
        # scale them back to the original image size
        h, w = original_shape[:2]
        x1, y1, x2, y2 = bbox
        return [x1 * w, y1 * h, x2 * w, y2 * h]

Step 3: Create Request Handler

Create main.py to handle HTTP requests:
import json
import base64
import cv2
import numpy as np
from model_handler import ModelHandler

def init_context(context):
    """Initialize function context and load model."""
    context.logger.info("Initializing model...")
    context.user_data.model = ModelHandler()
    context.logger.info("Model initialized successfully")

def handler(context, event):
    """Handle inference requests."""
    try:
        # Parse request
        data = event.body
        image_data = base64.b64decode(data['image'])
        threshold = data.get('threshold', 0.5)
        
        # Decode image
        nparr = np.frombuffer(image_data, np.uint8)
        image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
        
        # Run inference
        results = context.user_data.model.infer(image, threshold)
        
        # Return response
        return context.Response(
            body=json.dumps(results),
            headers={},
            content_type='application/json',
            status_code=200
        )
    
    except Exception as e:
        context.logger.error(f"Error processing request: {str(e)}")
        return context.Response(
            body=json.dumps({'error': str(e)}),
            headers={},
            content_type='application/json',
            status_code=500
        )

Creating an Interactor Function

Interactors receive user input (points, boxes) for guided segmentation:

Function Metadata

metadata:
  name: my-custom-interactor
  namespace: cvat
  annotations:
    name: My Custom Interactor
    type: interactor
    version: 1
    spec: |
      [
        { "name": "object", "type": "polygon" }
      ]
    min_pos_points: 1
    min_neg_points: 0
    startswith_box: false
    startswith_box_optional: true
    help_message: Click points inside the object to segment it

Handler Implementation

def handler(context, event):
    """Handle interactive segmentation requests."""
    data = event.body
    
    # Extract inputs
    image_data = base64.b64decode(data['image'])
    pos_points = np.array(data['pos_points'])  # [[x1, y1], [x2, y2], ...]
    neg_points = np.array(data['neg_points'])  # [[x1, y1], ...]
    obj_bbox = data.get('obj_bbox')  # Optional [xtl, ytl, xbr, ybr]
    
    # Decode image
    nparr = np.frombuffer(image_data, np.uint8)
    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    
    # Run interactive segmentation
    mask = context.user_data.model.segment(
        image, 
        pos_points, 
        neg_points, 
        obj_bbox
    )
    
    # Convert mask to polygon (mask_to_polygon is a helper you provide)
    polygon = mask_to_polygon(mask)
    
    return context.Response(
        body=json.dumps([{
            'label': 'object',
            'type': 'polygon',
            'points': polygon.flatten().tolist()
        }]),
        headers={},
        content_type='application/json',
        status_code=200
    )

Creating a Tracker Function

Trackers maintain object state across frames:

Function Metadata

metadata:
  name: my-custom-tracker
  namespace: cvat
  annotations:
    name: My Custom Tracker
    type: tracker
    version: 1
    spec:
    supported_shape_types: rectangle,polygon

Handler Implementation

def handler(context, event):
    """Handle tracking requests."""
    data = event.body
    
    # Extract inputs
    image_data = base64.b64decode(data['image'])
    shapes = data['shapes']  # Initial shapes or None for continuation
    states = data['states']  # Tracking state or [] for initialization
    
    # Decode image
    nparr = np.frombuffer(image_data, np.uint8)
    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    
    if not states:
        # Initialize tracking
        new_states = []
        for shape in shapes:
            state = context.user_data.model.init_tracker(
                image,
                shape['points'],
                shape['type']
            )
            new_states.append(state)
    else:
        # Continue tracking
        new_states = []
        for state in states:
            updated_state = context.user_data.model.track(
                image,
                state
            )
            new_states.append(updated_state)
    
    # Extract updated shapes
    updated_shapes = [
        context.user_data.model.get_shape(state) 
        for state in new_states
    ]
    
    return context.Response(
        body=json.dumps({
            'shapes': updated_shapes,
            'states': new_states  # Opaque state to pass to next frame
        }),
        headers={},
        content_type='application/json',
        status_code=200
    )
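Because the `states` list round-trips through `json.dumps` and comes back with the next frame, everything you put in a state must be JSON-serializable. A small helper (hypothetical, not part of the CVAT or Nuclio API) that converts numpy values before building the response:

```python
import numpy as np

def to_serializable(state):
    """Recursively convert numpy containers/scalars so state survives json.dumps."""
    if isinstance(state, dict):
        return {k: to_serializable(v) for k, v in state.items()}
    if isinstance(state, (list, tuple)):
        return [to_serializable(v) for v in state]
    if isinstance(state, np.ndarray):
        return state.tolist()
    if isinstance(state, np.generic):
        return state.item()
    return state
```

Apply it to each state before returning, e.g. `new_states = [to_serializable(s) for s in new_states]`.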

Output Formats

Rectangle

{
    'label': 'car',
    'type': 'rectangle',
    'points': [x1, y1, x2, y2],  # [xtl, ytl, xbr, ybr]
    'confidence': 0.95,
    'rotation': 0.0  # Optional
}

Polygon

{
    'label': 'person',
    'type': 'polygon',
    'points': [x1, y1, x2, y2, x3, y3, ...],  # Flat list of coordinates
    'confidence': 0.88
}

Mask

{
    'label': 'dog',
    'type': 'mask',
    'points': [x1, y1, x2, y2, ..., xtl, ytl, xbr, ybr],  # RLE + bbox
    'confidence': 0.92
}
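The run-length encoding above can be produced with a short helper. This is a sketch assuming CVAT's convention that runs alternate starting with background (0) pixels, computed over the mask cropped to its bounding box:

```python
import numpy as np

def mask_to_cvat_rle(mask: np.ndarray) -> list:
    """Encode a binary mask as alternating run lengths (background run first)
    over the bounding-box crop, followed by [xtl, ytl, xbr, ybr]."""
    ys, xs = np.nonzero(mask)
    xtl, ytl, xbr, ybr = int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
    flat = (mask[ytl:ybr + 1, xtl:xbr + 1] > 0).astype(np.uint8).ravel()
    rle, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            rle.append(run)
            prev, run = v, 1
    rle.append(run)
    return rle + [xtl, ytl, xbr, ybr]
```

The resulting list goes directly into the `points` field of a `mask`-type detection.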

Skeleton (with Elements)

{
    'label': 'person',
    'type': 'skeleton',
    'points': [],
    'elements': [
        {'label': 'nose', 'type': 'points', 'points': [x, y]},
        {'label': 'left_eye', 'type': 'points', 'points': [x, y]},
        # ... more keypoints
    ]
}

Advanced Features

Attributes

Add attributes to detections:
{
    'label': 'car',
    'type': 'rectangle',
    'points': [100, 200, 300, 400],
    'attributes': [
        {'name': 'color', 'value': 'red'},
        {'name': 'occluded', 'value': 'false'}
    ]
}
Define attributes in function.yaml:
spec: |
  [
    {
      "name": "car",
      "type": "rectangle",
      "attributes": [
        {
          "name": "color",
          "input_type": "select",
          "values": ["red", "blue", "green", "white", "black"]
        },
        {
          "name": "occluded",
          "input_type": "checkbox",
          "values": ["true", "false"]
        }
      ]
    }
  ]

Group Annotations

Group related objects:
[
    {'label': 'person', 'points': [...], 'group_id': 0},
    {'label': 'bicycle', 'points': [...], 'group_id': 0},  # Grouped with the person above
    {'label': 'person', 'points': [...], 'group_id': 1},
    {'label': 'car', 'points': [...], 'group_id': 1}  # A separate group
]

Deployment

Build and Deploy

# Deploy function
nuctl deploy --project-name cvat \
  --path ./my-custom-function/nuclio \
  --file ./my-custom-function/nuclio/function.yaml \
  --platform local

# Verify deployment
nuctl get function my-custom-detector --platform local

Test Function

import requests
import base64
import json

# Read and encode image
with open('test_image.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Call function
response = requests.post(
    'http://localhost:8080',  # Function port
    json={'image': image_b64, 'threshold': 0.5}
)

print(json.dumps(response.json(), indent=2))

Best Practices

Performance Optimization

  1. Load Once: Initialize the model in init_context, not per-request in handler
  2. Batch Processing: Process multiple requests efficiently
  3. GPU Utilization: Use GPU when available for faster inference
  4. Model Optimization: Use ONNX, TensorRT, or quantization
  5. Caching: Cache preprocessed data when possible
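The caching idea (point 5) can be as simple as keying preprocessed results on a digest of the raw image bytes, so repeated requests for the same frame skip the decode/resize work. `preprocess_fn` here is a hypothetical callable standing in for your own preprocessing step:

```python
import hashlib

# Bounded cache of preprocessed results, keyed on a digest of the image bytes
_CACHE: dict = {}
_MAX_ENTRIES = 32

def preprocess_cached(image_bytes: bytes, preprocess_fn):
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _CACHE:
        if len(_CACHE) >= _MAX_ENTRIES:
            # Evict the oldest entry (dicts preserve insertion order)
            _CACHE.pop(next(iter(_CACHE)))
        _CACHE[key] = preprocess_fn(image_bytes)
    return _CACHE[key]
```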

Error Handling

def handler(context, event):
    try:
        # Validate input
        if 'image' not in event.body:
            raise ValueError("Missing 'image' field")
        
        # Process request
        results = process_image(event.body)
        
        return context.Response(
            body=json.dumps(results),
            status_code=200
        )
    
    except ValueError as e:
        context.logger.warning(f"Invalid input: {e}")
        return context.Response(
            body=json.dumps({'error': str(e)}),
            status_code=400
        )
    
    except Exception as e:
        context.logger.error(f"Unexpected error: {e}", exc_info=True)
        return context.Response(
            body=json.dumps({'error': 'Internal server error'}),
            status_code=500
        )

Logging

def handler(context, event):
    context.logger.info("Processing request")
    context.logger.debug(f"Threshold: {event.body.get('threshold')}")
    
    try:
        result = process(event.body)
        context.logger.info(f"Found {len(result)} objects")
        return context.Response(body=json.dumps(result), status_code=200)
    except Exception as e:
        context.logger.error(f"Error: {e}", exc_info=True)
        raise

Resource Management

# Limit function resources
platform:
  attributes:
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"
        nvidia.com/gpu: "1"  # Request GPU

Testing

Unit Tests

import pytest
import numpy as np
from model_handler import ModelHandler

def test_model_loading():
    handler = ModelHandler()
    assert handler.model is not None

def test_inference():
    handler = ModelHandler()
    image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
    results = handler.infer(image, threshold=0.5)
    assert isinstance(results, list)

def test_output_format():
    handler = ModelHandler()
    image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
    results = handler.infer(image)
    
    for detection in results:
        assert 'label' in detection
        assert 'type' in detection
        assert 'points' in detection
        assert detection['type'] in ['rectangle', 'polygon', 'mask']

Integration Tests

# Deploy function locally
nuctl deploy --project-name cvat --path ./nuclio --file ./nuclio/function.yaml --platform local

# Get function port
FUNC_PORT=$(nuctl get function my-custom-detector --platform local -o json | jq -r '.status.httpPort')

# Test with sample image
python test_function.py --url "http://localhost:$FUNC_PORT" --image test.jpg

Troubleshooting

Common Issues

Model Not Loading:
  • Check model file path in container
  • Verify dependencies in build directives
  • Increase memory limits
Slow Inference:
  • Use GPU-optimized function variant
  • Optimize model (ONNX, quantization)
  • Adjust numWorkers in function.yaml
Invalid Output Format:
  • Validate against CVAT expected format
  • Check coordinate scaling
  • Test with small dataset first
Memory Errors:
  • Increase container memory limits
  • Reduce batch size
  • Optimize image preprocessing

Examples

Explore existing functions in the CVAT repository:
  • SAM Interactor: serverless/pytorch/facebookresearch/sam/
  • YOLO Detector: serverless/onnx/WongKinYiu/yolov7/
  • TransT Tracker: serverless/pytorch/dschoerk/transt/
  • Mask R-CNN: serverless/openvino/omz/public/mask_rcnn_inception_resnet_v2_atrous_coco/

Next Steps

  • Deployment Guide: Learn how to deploy serverless functions
  • Overview: Understand serverless function types