CVAT supports over 30 dataset formats for import and export, making it compatible with most computer vision workflows and frameworks. All format conversions are powered by the Datumaro framework.

Format Overview

The table below shows all supported formats with import/export capabilities:
Format | Import | Export | Use Case
CVAT for images | ✔️ | ✔️ | CVAT native format for image annotation
CVAT for video | ✔️ | ✔️ | CVAT native format for video annotation
Datumaro | ✔️ | ✔️ | Universal format with full feature support
PASCAL VOC | ✔️ | ✔️ | Object detection, classification
PASCAL VOC Segmentation | ✔️ | ✔️ | Semantic segmentation
YOLO | ✔️ | ✔️ | Darknet YOLO object detection
MS COCO Object Detection | ✔️ | ✔️ | Object detection with MS COCO
MS COCO Keypoints | ✔️ | ✔️ | Keypoint/pose estimation
Cityscapes | ✔️ | ✔️ | Urban scene segmentation
MOT | ✔️ | ✔️ | Multi-object tracking
MOTS PNG | ✔️ | ✔️ | Multi-object tracking with segmentation
LabelMe | ✔️ | ✔️ | General-purpose annotation
ImageNet | ✔️ | ✔️ | Image classification
CamVid | ✔️ | ✔️ | Semantic segmentation for autonomous driving
WIDER Face | ✔️ | ✔️ | Face detection
VGGFace2 | ✔️ | ✔️ | Face recognition
Market-1501 | ✔️ | ✔️ | Person re-identification
ICDAR13/15 | ✔️ | ✔️ | Text detection and recognition
Open Images V6 | ✔️ | ✔️ | Large-scale object detection
KITTI | ✔️ | ✔️ | Autonomous driving benchmarks
KITTI Raw | ✔️ | ✔️ | Raw KITTI sensor data
LFW | ✔️ | ✔️ | Face verification
Supervisely Point Cloud | ✔️ | ✔️ | 3D point cloud annotation
Ultralytics YOLO Detection | ✔️ | ✔️ | YOLOv8+ object detection
Ultralytics YOLO OBB | ✔️ | ✔️ | Oriented bounding boxes
Ultralytics YOLO Segmentation | ✔️ | ✔️ | Instance segmentation
Ultralytics YOLO Pose | ✔️ | ✔️ | Pose estimation
Ultralytics YOLO Classification | ✔️ | ✔️ | Image classification

Format Details

CVAT Formats

CVAT for images 1.1 and CVAT for video 1.1 are CVAT’s native XML-based formats. Features:
  • Full support for all CVAT annotation types
  • Preserves all metadata and attributes
  • Tracks, shapes, tags, and skeleton annotations
  • Best for backup and transfer between CVAT instances
When to use:
  • Backing up CVAT projects
  • Migrating tasks between CVAT servers
  • When you need complete annotation preservation
Export example:
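from cvat_sdk import make_client
# Host, credentials, and task ID below are placeholders
with make_client(host="http://localhost:8080", credentials=("user", "password")) as client:
    task = client.tasks.retrieve(123)
    task.export_dataset(
        format_name="CVAT for images 1.1",
        filename="cvat_backup.zip",
    )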
See the CVAT XML format specification in the CVAT GitHub repository for details.
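To get a feel for what the exported XML holds, the following is a minimal sketch that reads rectangle annotations from a CVAT for images 1.1 file. The sample document is a simplified, invented fragment, and `read_cvat_boxes` is a hypothetical helper. A real export also carries a `meta` section, attributes, and other shape types that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Simplified, invented sample; real exports contain more elements and attributes.
SAMPLE_XML = """<annotations>
  <version>1.1</version>
  <image id="0" name="image1.jpg" width="640" height="480">
    <box label="person" occluded="0" xtl="100.0" ytl="50.0" xbr="130.0" ybr="90.0"/>
  </image>
</annotations>"""


def read_cvat_boxes(xml_text):
    """Return (image name, label, xtl, ytl, xbr, ybr) for every <box> element."""
    rows = []
    for image in ET.fromstring(xml_text).iter("image"):
        for box in image.iter("box"):
            corners = tuple(float(box.get(key)) for key in ("xtl", "ytl", "xbr", "ybr"))
            rows.append((image.get("name"), box.get("label"), *corners))
    return rows


cvat_boxes = read_cvat_boxes(SAMPLE_XML)
print(cvat_boxes)
```

Boxes are stored as top-left and bottom-right corners in pixel coordinates, which is why no image size is needed to read them.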

Datumaro

Datumaro is a universal dataset framework providing lossless conversion between formats. Features:
  • Supports all CVAT annotation types
  • Python-based dataset manipulation
  • Format conversion and validation
  • Dataset versioning and comparison
When to use:
  • Complex dataset transformations
  • Converting between incompatible formats
  • Dataset analysis and statistics
  • Building custom ML pipelines
Export example:
task.export_dataset(
    format_name="Datumaro 1.0",
    filename="datumaro_dataset.zip",
    include_images=True
)
Learn more: Datumaro documentation

PASCAL VOC

PASCAL VOC is a classic format for object detection and segmentation. Features:
  • XML annotation files
  • Bounding boxes with classes
  • Segmentation masks (separate format)
  • Attributes stored in XML
When to use:
  • Object detection with classic frameworks
  • Academic research and benchmarks
  • Simple detection tasks
Structure:
dataset/
├── Annotations/
│   ├── image1.xml
│   └── image2.xml
├── JPEGImages/
│   ├── image1.jpg
│   └── image2.jpg
└── ImageSets/
    └── Main/
        └── train.txt
Export example:
cvat-cli task export-dataset \
  --format "PASCAL VOC 1.1" \
  123 \
  voc_dataset.zip
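Once the archive is unpacked, each file under Annotations/ can be read with a standard XML parser. The sketch below assumes the usual PASCAL VOC element names (`object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`). The sample XML is invented for illustration, and `read_voc_boxes` is a hypothetical helper, not part of CVAT.

```python
import xml.etree.ElementTree as ET

# Invented sample following the usual PASCAL VOC annotation layout.
SAMPLE_XML = """<annotation>
  <filename>image1.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>person</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>130</xmax><ymax>90</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_boxes(xml_text):
    """Return (class name, xmin, ymin, xmax, ymax) for every <object> element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Coordinates may be written as floats, so parse via float first.
        coords = tuple(
            int(float(bndbox.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), *coords))
    return boxes


voc_boxes = read_voc_boxes(SAMPLE_XML)
print(voc_boxes)
```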

YOLO Formats

CVAT supports multiple YOLO format variants:

YOLO 1.1 (Darknet)

Original YOLO format with text-based annotations. Features:
  • One .txt file per image
  • Normalized bounding box coordinates
  • Class ID, center_x, center_y, width, height
  • Requires obj.names and obj.data files
Format:
# image1.txt
0 0.5 0.5 0.3 0.4
1 0.2 0.3 0.1 0.2
Export example:
task.export_dataset(
    format_name="YOLO 1.1",
    filename="yolo_dataset.zip"
)
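To make the normalized coordinates concrete, here is a minimal sketch that turns a pixel-space box into a YOLO 1.1 label line. The helper name `to_yolo_line` and the corner-based input convention are assumptions made for illustration.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space corner box into a 'class cx cy w h' line.

    All four box values are divided by the image size, so they fall in [0, 1].
    """
    center_x = (xmin + xmax) / 2 / img_w
    center_y = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {center_x:.6f} {center_y:.6f} {width:.6f} {height:.6f}"


# A 192x192 px box centered in a 640x480 image.
yolo_line = to_yolo_line(0, 224, 144, 416, 336, img_w=640, img_h=480)
print(yolo_line)  # 0 0.500000 0.500000 0.300000 0.400000
```

The result corresponds to the first line of the image1.txt sample above.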

Ultralytics YOLO

Modern YOLO format (YOLOv8+) with YAML configuration. Variants:
  • Detection: Bounding boxes for object detection
  • Segmentation: Polygon annotations for instance segmentation
  • OBB: Oriented bounding boxes for rotated objects
  • Pose: Keypoint annotations for pose estimation
  • Classification: Image-level labels
Structure:
dataset/
├── data.yaml
├── train/
│   ├── images/
│   └── labels/
└── valid/
    ├── images/
    └── labels/
Export example:
task.export_dataset(
    format_name="Ultralytics YOLO Detection 1.0",
    filename="yolov8_dataset.zip"
)
Learn more: Ultralytics YOLO formats
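A quick sanity check on an unpacked export is to pair every image with its label file. The sketch below assumes that a label file shares its image's base name and lives in the sibling labels/ folder. `find_unlabeled_images` is a hypothetical helper, and the temporary directory only stands in for a real dataset.

```python
import tempfile
from pathlib import Path


def find_unlabeled_images(dataset_root, split="train"):
    """Return sorted image file names in a split that have no matching .txt label file."""
    images_dir = Path(dataset_root) / split / "images"
    labels_dir = Path(dataset_root) / split / "labels"
    missing = []
    for image_path in sorted(images_dir.iterdir()):
        if not (labels_dir / f"{image_path.stem}.txt").exists():
            missing.append(image_path.name)
    return missing


# Build a tiny stand-in dataset: two images, one label file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train" / "images").mkdir(parents=True)
    (root / "train" / "labels").mkdir(parents=True)
    (root / "train" / "images" / "a.jpg").touch()
    (root / "train" / "images" / "b.jpg").touch()
    (root / "train" / "labels" / "a.txt").write_text("0 0.5 0.5 0.3 0.4\n")
    unlabeled = find_unlabeled_images(root)

print(unlabeled)  # ['b.jpg']
```

A missing label file is not necessarily an error: it may simply mean the image contains no annotated objects.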

COCO Formats

MS COCO is an industry-standard format for complex annotations.

COCO Object Detection

Features:
  • JSON-based annotations
  • Bounding boxes with categories
  • Image metadata (size, file name)
  • Supports attributes as custom fields
  • Crowd annotations for groups
Structure:
{
  "images": [{"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [x, y, w, h]}],
  "categories": [{"id": 1, "name": "person", "supercategory": "human"}]
}
When to use:
  • Training modern detection models (Faster R-CNN, YOLO, etc.)
  • Complex multi-class detection tasks
  • Integration with popular ML frameworks
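The three top-level lists in the COCO structure are linked by IDs, so reading a file usually means joining them. The following sketch groups boxes by file name and converts each `[x, y, w, h]` box to corner coordinates. The sample values and the `boxes_by_file` helper are illustrative assumptions.

```python
from collections import defaultdict

# Same shape as the structure shown above, with invented box values.
coco = {
    "images": [{"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [100, 50, 30, 40]}],
    "categories": [{"id": 1, "name": "person", "supercategory": "human"}],
}


def boxes_by_file(coco_dict):
    """Map file name -> list of (category name, xmin, ymin, xmax, ymax)."""
    names = {cat["id"]: cat["name"] for cat in coco_dict["categories"]}
    files = {img["id"]: img["file_name"] for img in coco_dict["images"]}
    grouped = defaultdict(list)
    for ann in coco_dict["annotations"]:
        x, y, w, h = ann["bbox"]  # top-left corner plus width and height
        grouped[files[ann["image_id"]]].append(
            (names[ann["category_id"]], x, y, x + w, y + h)
        )
    return dict(grouped)


grouped = boxes_by_file(coco)
print(grouped)
```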

COCO Keypoints

Features:
  • Keypoint annotations for pose estimation
  • Skeleton definitions
  • Visibility flags per keypoint
  • Compatible with pose estimation models
Export example:
task.export_dataset(
    format_name="COCO Keypoints 1.0",
    filename="coco_keypoints.zip"
)
Learn more: COCO format specification
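In the COCO keypoint convention, each annotation stores its keypoints as a flat list of `x, y, v` triplets, where `v` is the visibility flag (0: not labeled, 1: labeled but not visible, 2: labeled and visible). A minimal sketch of unpacking that list, with an invented sample and a hypothetical `split_keypoints` helper:

```python
VISIBILITY = {0: "not labeled", 1: "labeled, not visible", 2: "labeled, visible"}


def split_keypoints(flat):
    """Split a flat [x1, y1, v1, x2, y2, v2, ...] list into (x, y, visibility) tuples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints list length must be a multiple of 3")
    return [
        (flat[i], flat[i + 1], VISIBILITY[flat[i + 2]])
        for i in range(0, len(flat), 3)
    ]


keypoints = split_keypoints([120, 80, 2, 0, 0, 0, 140, 95, 1])
print(keypoints)
```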

Cityscapes

Cityscapes format for urban scene understanding. Features:
  • Semantic segmentation masks
  • Instance segmentation annotations
  • 19 classes used in the standard Cityscapes evaluation of street scenes
  • Polygons and pixel-level masks
When to use:
  • Autonomous driving applications
  • Urban scene segmentation
  • Street-level computer vision
Export example:
task.export_dataset(
    format_name="Cityscapes 1.0",
    filename="cityscapes_dataset.zip"
)
Learn more: Cityscapes dataset

MOT Formats

MOT (Multiple Object Tracking) formats for tracking tasks.

MOT 1.1

Features:
  • Track annotations over time
  • CSV-based format
  • Frame number, track ID, bounding box
  • Compatible with MOT challenge
Format:
# frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility
1, 1, 100, 50, 30, 40, 1, 1, 1
2, 1, 105, 52, 30, 40, 1, 1, 1
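Because every row carries both a frame number and a track ID, rebuilding tracks is a matter of grouping rows by ID. The sketch below parses rows in the column order shown above. `read_tracks` is a hypothetical helper, and it skips the `#` header line, which is included here only as documentation.

```python
import csv
import io
from collections import defaultdict

MOT_TEXT = """# frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility
1, 1, 100, 50, 30, 40, 1, 1, 1
2, 1, 105, 52, 30, 40, 1, 1, 1
"""


def read_tracks(text):
    """Map track ID -> list of (frame, left, top, width, height), in file order."""
    tracks = defaultdict(list)
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the comment header
        frame, track_id = int(row[0]), int(row[1])
        left, top, width, height = (float(value) for value in row[2:6])
        tracks[track_id].append((frame, left, top, width, height))
    return dict(tracks)


tracks = read_tracks(MOT_TEXT)
print(tracks)
```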

MOTS PNG

Features:
  • Instance segmentation tracks
  • PNG masks for each frame
  • Pixel-level tracking annotations
Export example (MOT 1.1):
cvat-cli task export-dataset \
  --format "MOT 1.1" \
  456 \
  mot_tracks.zip

LabelMe

LabelMe 3.0 format for general-purpose annotation. Features:
  • JSON annotations per image
  • Polygons, rectangles, and points
  • Flexible attribute system
  • Compatible with the web-based LabelMe annotation tool
When to use:
  • General object detection and segmentation
  • Research projects
  • Legacy LabelMe tool compatibility
Export example:
task.export_dataset(
    format_name="LabelMe 3.0",
    filename="labelme_annotations.zip"
)

ImageNet

ImageNet format for image classification. Features:
  • Directory-based class organization
  • Image-level labels only
  • Standard classification dataset structure
Structure:
dataset/
├── train/
│   ├── class1/
│   │   ├── img1.jpg
│   │   └── img2.jpg
│   └── class2/
│       └── img3.jpg
└── val/
    └── ...
When to use:
  • Image classification tasks
  • Transfer learning
  • Training classification networks
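Since the class label is encoded in the folder name, a loader only needs to walk the directory tree. A minimal sketch, assuming the layout shown above; `list_samples` is a hypothetical helper, and the temporary directory stands in for a real export.

```python
import tempfile
from pathlib import Path


def list_samples(split_dir):
    """Return (image file name, class name) pairs, one per image under split_dir."""
    samples = []
    class_dirs = sorted(p for p in Path(split_dir).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for image_path in sorted(class_dir.iterdir()):
            samples.append((image_path.name, class_dir.name))
    return samples


# Recreate the train/ part of the structure above with empty files.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for class_name, file_name in [
        ("class1", "img1.jpg"),
        ("class1", "img2.jpg"),
        ("class2", "img3.jpg"),
    ]:
        (train_dir / class_name).mkdir(parents=True, exist_ok=True)
        (train_dir / class_name / file_name).touch()
    samples = list_samples(train_dir)

print(samples)
```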

CamVid

CamVid format for video segmentation. Features:
  • Semantic segmentation for video
  • 11 or 32 predefined classes
  • Per-frame segmentation masks
When to use:
  • Video segmentation tasks
  • Autonomous driving research
  • Sequential scene understanding

WIDER Face

WIDER Face format for face detection. Features:
  • Face bounding boxes
  • Occlusion and pose attributes
  • Specialized for face detection benchmarks
When to use:
  • Face detection model training
  • Benchmarking face detectors
  • Large-scale face detection datasets

VGGFace2

VGGFace2 format for face recognition. Features:
  • Face identity labels
  • Bounding boxes and landmarks
  • Identity-based organization
When to use:
  • Face recognition training
  • Face verification tasks
  • Identity classification

Market-1501

Market-1501 format for person re-identification. Features:
  • Person bounding boxes
  • Identity labels across cameras
  • Track IDs for same person
When to use:
  • Person re-identification
  • Multi-camera tracking
  • Pedestrian recognition

ICDAR

ICDAR13/15 formats for text detection. Features:
  • Text region bounding boxes or polygons
  • Transcription labels
  • Oriented text support
When to use:
  • Scene text detection
  • OCR training data
  • Text recognition tasks

Open Images

Open Images V6 format for large-scale detection. Features:
  • Hierarchical label taxonomy
  • Image-level and object-level labels
  • Relationship annotations
  • Attributes and groups
When to use:
  • Large-scale detection tasks
  • Multi-label classification
  • Complex object relationships

KITTI Formats

KITTI formats for autonomous driving.

KITTI Detection

Features:
  • 3D bounding boxes
  • Object detection in driving scenes
  • Occlusion and truncation flags

KITTI Raw

Features:
  • Raw sensor data format
  • Calibration information
  • Multi-modal data (camera, LiDAR)
When to use:
  • Autonomous driving research
  • 3D object detection
  • Sensor fusion tasks

LFW

Labeled Faces in the Wild format. Features:
  • Face verification pairs
  • Identity labels
  • Standard face recognition benchmark
When to use:
  • Face verification
  • Face recognition benchmarking

Supervisely Point Cloud

Supervisely Point Cloud Format for 3D annotation. Features:
  • 3D bounding boxes
  • Point cloud annotations
  • 3D object detection
When to use:
  • LiDAR data annotation
  • 3D object detection
  • Autonomous driving 3D perception

Ultralytics YOLO

See YOLO Formats above for detailed information on Ultralytics YOLO variants.

Choosing the Right Format

Consider these factors when selecting a format:

By Task Type

  • Object Detection: COCO, YOLO, PASCAL VOC, Open Images
  • Instance Segmentation: COCO, YOLO Segmentation, MOTS PNG
  • Semantic Segmentation: Cityscapes, CamVid, PASCAL VOC Segmentation
  • Classification: ImageNet, YOLO Classification
  • Pose/Keypoints: COCO Keypoints, YOLO Pose
  • Tracking: MOT, MOTS PNG
  • 3D Detection: KITTI, Supervisely Point Cloud
  • Face Tasks: WIDER Face, VGGFace2, LFW
  • Text Detection: ICDAR

By Framework

  • PyTorch: COCO, YOLO, ImageNet
  • TensorFlow: COCO, PASCAL VOC
  • Darknet: YOLO 1.1
  • Ultralytics: Ultralytics YOLO variants
  • MMDetection: COCO
  • Detectron2: COCO
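The two lists can be combined by intersecting them: formats that fit both the task and the framework are the natural candidates. The sketch below encodes a subset of the lists as sets. The `candidates` helper is hypothetical, and matching is by exact name, so entries such as "YOLO 1.1" and "YOLO" are treated as different formats.

```python
# Subset of the "By Task Type" list above.
BY_TASK = {
    "Object Detection": {"COCO", "YOLO", "PASCAL VOC", "Open Images"},
    "Classification": {"ImageNet", "YOLO Classification"},
    "Pose/Keypoints": {"COCO Keypoints", "YOLO Pose"},
    "Tracking": {"MOT", "MOTS PNG"},
}

# Subset of the "By Framework" list above.
BY_FRAMEWORK = {
    "PyTorch": {"COCO", "YOLO", "ImageNet"},
    "TensorFlow": {"COCO", "PASCAL VOC"},
    "MMDetection": {"COCO"},
    "Detectron2": {"COCO"},
}


def candidates(task, framework):
    """Return the formats listed for both the task type and the framework, sorted by name."""
    return sorted(BY_TASK[task] & BY_FRAMEWORK[framework])


detection_tf = candidates("Object Detection", "TensorFlow")
tracking_torch = candidates("Tracking", "PyTorch")
print(detection_tf)
print(tracking_torch)
```

An empty result, as for tracking with PyTorch here, only means the lists have no overlap. It does not mean the framework cannot consume the format after conversion.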

By Complexity

  • Simple: ImageNet, YOLO
  • Moderate: PASCAL VOC, LabelMe
  • Complex: COCO, Open Images, Datumaro

Format Limitations

Some formats have specific limitations:
  • YOLO: Each variant stores only its own annotation types (bounding boxes in YOLO 1.1; boxes, polygons, oriented boxes, keypoints, or image-level labels in the Ultralytics variants)
  • ImageNet: Only supports image-level classification
  • PASCAL VOC: Limited attribute support compared to CVAT
  • COCO: Segmentation is polygon-based; shapes such as ellipses need conversion first
  • MOT: Primarily for tracking, limited object attributes
For format-specific conversions and workarounds, see Format Conversion.
