CVAT provides powerful auto-annotation capabilities using state-of-the-art deep learning models. These tools can dramatically accelerate annotation workflows by automatically detecting, segmenting, and tracking objects.

Overview

CVAT supports multiple approaches to automatic annotation:

Interactive Tools

Click-based segmentation with SAM2 and other interactive models

Detector Models

Automatic object detection with YOLO, Detectron2, and Transformers

Tracking Models

Multi-frame tracking with SAM2 tracker and other temporal models

Custom Functions

Deploy your own models as auto-annotation functions

Interactive Segmentation

AI Tools (SAM2)

The AI Tools button in the controls sidebar provides access to interactive segmentation models. To use interactive segmentation:
  1. Click the AI Tools button in the left sidebar
  2. Select a label for the annotation
  3. Choose the interaction mode:
    • Positive points: Click inside the object you want to segment
    • Negative points: Click outside to exclude regions
    • Bounding box: Draw a box around the object
  4. The model generates a mask or polygon in real-time
  5. Adjust by adding more positive/negative points
  6. Click Done to create the annotation
  7. Press N to repeat with the same settings
For best results with SAM2, start with a single click in the center of the object. Add positive points in missed regions and negative points in incorrectly included areas.
Available Models:
  • SAM2 (Segment Anything Model 2) - Meta’s foundation model for promptable segmentation
  • IOG (Inside-Outside Guidance) - Efficient interactive segmentation
  • Custom deployed interactive models
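The refinement loop in steps 4-6 boils down to accumulating point prompts and re-running the model after each click. A minimal sketch of that bookkeeping (the `PromptState` class and its field names are our own illustration, not part of CVAT or SAM2):

```python
from dataclasses import dataclass, field

@dataclass
class PromptState:
    """Accumulates SAM2-style point prompts; names are illustrative only."""
    points: list = field(default_factory=list)  # (x, y) click coordinates
    labels: list = field(default_factory=list)  # 1 = positive, 0 = negative

    def add_positive(self, x: int, y: int) -> None:
        self.points.append((x, y))
        self.labels.append(1)

    def add_negative(self, x: int, y: int) -> None:
        self.points.append((x, y))
        self.labels.append(0)

state = PromptState()
state.add_positive(120, 80)  # click inside the object
state.add_negative(40, 200)  # exclude a wrongly included region
# Each change would trigger a new mask prediction, conceptually:
# mask = predictor.predict(point_coords=state.points, point_labels=state.labels)
print(state.points, state.labels)
```

Every added point re-prompts the model, which is why a single well-placed center click followed by a few corrective points usually converges fastest.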

OpenCV Tools

The OpenCV Tools provide classical computer vision algorithms.
Intelligent Scissors
  1. Click the OpenCV Tools button
  2. Select “Intelligent Scissors”
  3. Click along the object boundary
  4. The tool automatically finds edges between clicks
  5. Close the polygon to finish
GrabCut
  1. Draw a rough rectangle around the object
  2. The algorithm segments the foreground
  3. Refine with additional markers
OpenCV tools work entirely in the browser and don’t require a server connection, but are less accurate than deep learning models.

Automatic Detection

Automatic detection runs models over entire frames or videos to detect all objects of specific classes.

Using Auto-Annotation

From the Task Page:
  1. Navigate to your task
  2. Click Actions → Automatic annotation
  3. Select a model:
    • YOLO models (v5, v8, v11, v12) - Fast, accurate object detection
    • Detectron2 models - Research-grade detection and segmentation
    • Transformers models - Hugging Face model hub integration
  4. Configure detection parameters:
    • Threshold: Minimum confidence score (0.0-1.0)
    • Labels mapping: Map model classes to your task labels
  5. Click Annotate to start
  6. Monitor progress in the task details page
Auto-annotation can take considerable time for large videos or datasets. The task will be locked during processing.

YOLO Models

YOLO (You Only Look Once) models provide fast, accurate detection.
Supported Tasks:
  • Object detection (bounding boxes)
  • Instance segmentation (polygons)
  • Pose estimation (skeletons)
  • Oriented object detection (rotated boxes)
  • Classification
Available YOLO Versions:
  • YOLOv5 - Lightweight, fast
  • YOLOv8 - Improved accuracy
  • YOLOv11 - Latest performance
  • YOLO12 - State-of-the-art results
Configuration:
# Using CVAT CLI
cvat-cli auto-annotate create \
  --function-file /path/to/yolo/func.py \
  -p model=str:yolo12n.pt \
  -p device=str:cuda \
  <task-id>
  • Nano models (n): Fastest, lowest accuracy; for real-time or CPU inference
  • Small models (s): Balanced speed and accuracy
  • Medium models (m): Good accuracy, moderate speed
  • Large models (l): High accuracy, slower
  • Extra-large models (x): Best accuracy, slowest; for maximum quality
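The size suffix is just part of the weights filename passed via `-p model=...`. A throwaway helper makes the tradeoff explicit (`pick_yolo_weights` is our own illustration, not a CVAT or Ultralytics API):

```python
# Map a speed/accuracy preference to an Ultralytics-style weights filename.
# The suffixes (n/s/m/l/x) follow the convention described above.
_SUFFIXES = {
    "fastest": "n",   # nano: real-time or CPU inference
    "balanced": "s",  # small: balanced speed and accuracy
    "accurate": "l",  # large: high accuracy, slower
    "best": "x",      # extra-large: maximum quality
}

def pick_yolo_weights(preference: str, version: str = "yolo12") -> str:
    """Hypothetical helper: build a weights filename like 'yolo12n.pt'."""
    return f"{version}{_SUFFIXES[preference]}.pt"

print(pick_yolo_weights("fastest"))  # yolo12n.pt
```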

Detectron2 Models

Facebook Research's Detectron2 provides state-of-the-art detection and segmentation.
Available Architectures:
  • Faster R-CNN - Two-stage detection
  • RetinaNet - Single-stage dense detection
  • Mask R-CNN - Instance segmentation
  • Cascade R-CNN - Multi-stage refinement
Pretrained Datasets:
  • COCO (80 object classes)
  • LVIS (1000+ classes)
  • Custom trained models

Transformers Models

Hugging Face Transformers integration provides access to thousands of models.
Supported Tasks:
  • object-detection - Bounding box detection (DETR, YOLOS, etc.)
  • image-segmentation - Semantic and instance segmentation
  • image-classification - Frame-level classification tags
Configuration:
# Using a Hugging Face model
cvat-cli auto-annotate create \
  --function-file /path/to/transformers/func.py \
  -p model=str:facebook/detr-resnet-50 \
  -p task=str:object-detection \
  -p device=str:cuda \
  <task-id>
Browse models at huggingface.co/models and use the model ID with the Transformers function.
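A Hugging Face `object-detection` pipeline returns a list of dicts with `score`, `label`, and a `box` of pixel corners. A sketch of converting that output into CVAT-style rectangle points, mocked here so it runs without `transformers` installed (`to_rectangles` is our own helper, not part of either library):

```python
def to_rectangles(pipeline_output, threshold=0.5):
    """Convert object-detection pipeline results to (label, [x1, y1, x2, y2])."""
    shapes = []
    for det in pipeline_output:
        if det["score"] < threshold:
            continue  # drop low-confidence detections
        box = det["box"]
        shapes.append(
            (det["label"], [box["xmin"], box["ymin"], box["xmax"], box["ymax"]])
        )
    return shapes

# Mock result in the format produced by transformers.pipeline("object-detection"):
mock = [
    {"score": 0.97, "label": "cat",
     "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 180}},
    {"score": 0.31, "label": "dog",
     "box": {"xmin": 5, "ymin": 5, "xmax": 50, "ymax": 60}},
]
print(to_rectangles(mock))  # the low-score 'dog' detection is dropped
```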

Automatic Tracking

SAM2 Tracker

SAM2 can track segmented objects across video frames.
Workflow:
  1. Annotate objects on the first frame (manually or with AI tools)
  2. Run SAM2 tracker auto-annotation function
  3. The model propagates masks/polygons across subsequent frames
  4. Review and correct as needed
Configuration:
cvat-cli auto-annotate create \
  --function-file /path/to/sam2/func.py \
  -p model_id=str:facebook/sam2.1-hiera-large \
  -p device=str:cuda \
  <task-id>
Model Options:
  • facebook/sam2.1-hiera-tiny - Fastest, 38.9M parameters
  • facebook/sam2.1-hiera-small - Balanced, 46M parameters
  • facebook/sam2.1-hiera-base-plus - High quality, 80.8M parameters
  • facebook/sam2.1-hiera-large - Best quality, 224.4M parameters
SAM2 tracking is particularly effective for:
  • Objects with complex boundaries
  • Partially occluded objects
  • Non-rigid deformations
  • Variable camera motion

TransT Tracker

TransT provides transformer-based object tracking:
  1. Draw initial bounding box on first frame
  2. Run TransT tracker
  3. The model tracks the object through the video
  4. Outputs interpolated tracks

Advanced Configuration

Model Parameters

Most auto-annotation functions accept parameters.
Common Parameters:

Parameter | Type | Description
model | string | Path or identifier for the model
device | string | PyTorch device: cpu, cuda, cuda:0, etc.
threshold | float | Confidence threshold (0.0-1.0)
labels_mapping | dict | Map model classes to CVAT labels

YOLO-Specific:

Parameter | Type | Description
keypoint_names_path | string | Path to keypoint names file (pose models)
imgsz | int | Input image size (default: 640)
conf | float | Object confidence threshold
iou | float | NMS IoU threshold

Transformers-Specific:

Parameter | Type | Description
task | string | Model task: object-detection, image-segmentation, etc.
threshold | float | Detection threshold
top_k | int | Maximum detections per image
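The `-p` flags used throughout this page follow a `name=type:value` convention. A small parser sketch shows how such strings decompose; this is our own illustration of the syntax, not cvat-cli's actual implementation (note the single quotes around dict values in the CLI examples are shell quoting, not part of the value):

```python
import json

# Supported type prefixes and their Python conversions.
_CASTS = {"str": str, "int": int, "float": float, "dict": json.loads}

def parse_param(arg: str):
    """Parse a 'name=type:value' string like 'threshold=float:0.7'."""
    name, rest = arg.split("=", 1)
    type_name, value = rest.split(":", 1)
    return name, _CASTS[type_name](value)

print(parse_param("threshold=float:0.7"))   # ('threshold', 0.7)
print(parse_param("model=str:yolo12n.pt"))  # ('model', 'yolo12n.pt')
print(parse_param('labels_mapping=dict:{"car": "vehicle"}'))
```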

Label Mapping

When model classes don’t match your task labels:
-p labels_mapping=dict:'{"car": "vehicle", "truck": "vehicle", "person": "pedestrian"}'
Unmapped classes are ignored.

Filtering Results

By Confidence: Set a higher threshold to reduce false positives:
-p threshold=float:0.7  # Only keep detections with >70% confidence
By Class: Use label mapping to filter specific classes:
-p labels_mapping=dict:'{"person": "person"}'  # Only detect persons
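The two filters compose: the threshold drops low-confidence detections, and the mapping both renames classes and discards anything unmapped. A runnable sketch on mock detections (the `filter_detections` helper and the `(label, score)` pair format are our own illustration):

```python
def filter_detections(detections, threshold=0.5, labels_mapping=None):
    """Apply confidence and class filters as the CLI flags above describe.

    `detections` is a mock list of (label, score) pairs for illustration.
    """
    kept = []
    for label, score in detections:
        if score < threshold:
            continue  # below the confidence threshold
        if labels_mapping is not None:
            if label not in labels_mapping:
                continue  # unmapped classes are ignored
            label = labels_mapping[label]
        kept.append((label, score))
    return kept

detections = [("car", 0.9), ("truck", 0.8), ("person", 0.6), ("dog", 0.95)]
mapping = {"car": "vehicle", "truck": "vehicle", "person": "pedestrian"}
print(filter_detections(detections, threshold=0.7, labels_mapping=mapping))
# 'person' falls below the threshold; 'dog' is unmapped and dropped
```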

Post-Processing

After auto-annotation:
  1. Review results: Navigate through frames to check annotations
  2. Adjust boundaries: Refine automatically generated shapes
  3. Remove false positives: Delete incorrect detections
  4. Add missed objects: Manually annotate objects the model missed
  5. Set attributes: Auto-annotation doesn't set attributes; add them manually
  6. Verify tracks: Check that objects maintain consistent IDs across frames
Efficient Review Workflow:
  1. Use arrow keys to quickly navigate between frames
  2. Press H to hide correct objects and focus on errors
  3. Use Del to quickly remove false positives
  4. Press N to quickly add missed objects with repeat drawing

Custom Models

Deploying Custom Functions

You can deploy your own models as CVAT auto-annotation functions.
Requirements:
  1. Python function implementing the auto-annotation interface
  2. Model weights and dependencies
  3. Deployment environment (local, cloud, or Nuclio)
Example Function Structure:
import cvat_sdk.auto_annotation as cvataa
import cvat_sdk.models as models
import PIL.Image

# Declare the labels this function can produce (example labels shown).
spec = cvataa.DetectionFunctionSpec(
    labels=[
        cvataa.label_spec("car", 0),
        cvataa.label_spec("person", 1),
    ],
)

# CVAT calls detect() once per frame and collects the returned shapes.
def detect(
    context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
) -> list[models.LabeledShapeRequest]:
    # Run inference (run_inference is a placeholder for your model code).
    detections = run_inference(image)

    # Convert results to CVAT shapes using the SDK helpers.
    return [
        cvataa.rectangle(det.label_id, [det.x1, det.y1, det.x2, det.y2])
        for det in detections
    ]
See the Auto-annotation API reference for complete documentation.

Using Serverless Functions

Deploy functions to Nuclio serverless platform:
  1. Package function:
    nuctl deploy --path /path/to/function
    
  2. Register with CVAT:
    • Navigate to Models page
    • Click “Create Model”
    • Enter function URL and details
  3. Use from UI:
    • Functions appear in auto-annotation model list
    • Configure and run like built-in models
Serverless deployment allows:
  • GPU acceleration
  • Concurrent processing
  • Shared models across users
  • No local compute requirements
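Each serverless function is described by a `function.yaml` next to its handler code; CVAT reads the annotations to list the function as a model. A trimmed sketch of the usual shape (the name, labels, runtime, and handler here are placeholders; use your function's real values):

```yaml
metadata:
  name: my-detector            # placeholder function name
  namespace: cvat
  annotations:
    name: My detector          # display name on the CVAT Models page
    type: detector             # detector | interactor | tracker
    spec: |
      [{ "id": 0, "name": "person", "type": "rectangle" }]
spec:
  description: Custom detection model
  runtime: python:3.9
  handler: main:handler
  eventTimeout: 30s
```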

Performance Tips

Use device=str:cuda whenever a GPU is available; GPU inference can be 10-100x faster than CPU for deep learning models.
Auto-annotate multiple tasks together to amortize model loading time.
  • Use smaller models (e.g., YOLO nano) for initial annotation
  • Use larger models for final pass or difficult cases
  • Consider speed vs. accuracy tradeoff
  • Start with lower threshold (0.3-0.5) to catch all objects
  • Remove false positives manually
  • Higher threshold (0.7-0.9) for high-precision applications

Model Comparison

Model | Speed | Accuracy | Best For
YOLO12n | Very Fast | Good | Real-time, quick annotation
YOLO12l | Medium | Excellent | Balanced production use
SAM2 (interactive) | Medium | Excellent | Complex shapes, pixel accuracy
SAM2 (tracking) | Slow | Excellent | Video segmentation
Detectron2 | Slow | Excellent | Research, maximum accuracy
Transformers | Varies | Varies | Specialized models

Troubleshooting

If a model fails to run, check:
  • The model is properly deployed and accessible
  • Sufficient memory/VRAM is available
  • Network connectivity to the model server
  • Task labels match model output classes
  • Server logs for detailed error messages
If no annotations are produced:
  • The threshold may be too high; try lowering it to 0.3
  • Check the label mapping configuration
  • Verify the model is appropriate for your data (e.g., trained on similar objects)
If results are low quality:
  • Use a larger or better model
  • Adjust the threshold
  • Ensure input images are high quality
  • Consider fine-tuning the model on your specific domain
If processing is slow:
  • Enable GPU acceleration
  • Use a smaller model
  • Process shorter video segments
  • Check whether other processes are using the GPU

Next Steps

Manual Annotation

Refine auto-annotation results manually

Advanced Tools

Use propagation and interpolation

SDK Reference

Build custom auto-annotation functions

Serverless Models

Explore available models
