Format Overview
The table below shows all supported formats with import/export capabilities:

| Format | Import | Export | Use Case |
|---|---|---|---|
| CVAT for images | ✔️ | ✔️ | CVAT native format for image annotation |
| CVAT for video | ✔️ | ✔️ | CVAT native format for video annotation |
| Datumaro | ✔️ | ✔️ | Universal format with full feature support |
| PASCAL VOC | ✔️ | ✔️ | Object detection, classification |
| PASCAL VOC Segmentation | ✔️ | ✔️ | Semantic segmentation |
| YOLO | ✔️ | ✔️ | Darknet YOLO object detection |
| MS COCO Object Detection | ✔️ | ✔️ | Object detection with MS COCO |
| MS COCO Keypoints | ✔️ | ✔️ | Keypoint/pose estimation |
| Cityscapes | ✔️ | ✔️ | Urban scene segmentation |
| MOT | ✔️ | ✔️ | Multi-object tracking |
| MOTS PNG | ✔️ | ✔️ | Multi-object tracking with segmentation |
| LabelMe | ✔️ | ✔️ | General-purpose annotation |
| ImageNet | ✔️ | ✔️ | Image classification |
| CamVid | ✔️ | ✔️ | Semantic segmentation for autonomous driving |
| WIDER Face | ✔️ | ✔️ | Face detection |
| VGGFace2 | ✔️ | ✔️ | Face recognition |
| Market-1501 | ✔️ | ✔️ | Person re-identification |
| ICDAR13/15 | ✔️ | ✔️ | Text detection and recognition |
| Open Images V6 | ✔️ | ✔️ | Large-scale object detection |
| KITTI | ✔️ | ✔️ | Autonomous driving benchmarks |
| KITTI Raw | ✔️ | ✔️ | Raw KITTI sensor data |
| LFW | ✔️ | ✔️ | Face verification |
| Supervisely Point Cloud | ✔️ | ✔️ | 3D point cloud annotation |
| Ultralytics YOLO Detection | ✔️ | ✔️ | YOLOv8+ object detection |
| Ultralytics YOLO OBB | ✔️ | ✔️ | Oriented bounding boxes |
| Ultralytics YOLO Segmentation | ✔️ | ✔️ | Instance segmentation |
| Ultralytics YOLO Pose | ✔️ | ✔️ | Pose estimation |
| Ultralytics YOLO Classification | ✔️ | ✔️ | Image classification |
Format Details
CVAT Formats
CVAT for images 1.1 and CVAT for video 1.1 are CVAT’s native XML-based formats.
Features:
- Full support for all CVAT annotation types
- Preserves all metadata and attributes
- Tracks, shapes, tags, and skeleton annotations
- Best for backup and transfer between CVAT instances
Use cases:
- Backing up CVAT projects
- Migrating tasks between CVAT servers
- When you need complete annotation preservation
Datumaro
Datumaro is a universal dataset framework providing lossless conversion between formats.
Features:
- Supports all CVAT annotation types
- Python-based dataset manipulation
- Format conversion and validation
- Dataset versioning and comparison
Use cases:
- Complex dataset transformations
- Converting between incompatible formats
- Dataset analysis and statistics
- Building custom ML pipelines
PASCAL VOC
PASCAL VOC is a classic format for object detection and segmentation.
Features:
- XML annotation files
- Bounding boxes with classes
- Segmentation masks (separate format)
- Attributes stored in XML
Use cases:
- Object detection with classic frameworks
- Academic research and benchmarks
- Simple detection tasks
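
To make the XML layout concrete, here is a minimal sketch of reading bounding boxes from a PASCAL VOC annotation. The sample document and the helper name `read_voc_boxes` are hypothetical; the element names (`object`, `name`, `bndbox`, `xmin`, and so on) follow the public PASCAL VOC convention, and a file exported by CVAT may carry additional elements that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample annotation following the public PASCAL VOC layout.
VOC_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>470</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_boxes(xml_text):
    """Return (class name, xmin, ymin, xmax, ymax) for every <object>."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Corner coordinates are stored in pixels as element text.
        corners = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *corners))
    return boxes


boxes = read_voc_boxes(VOC_XML)
for box in boxes:
    print(box)
```

Each object keeps its class name next to absolute pixel corners, which is the main difference from the normalized center-based coordinates used by YOLO.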
YOLO Formats
CVAT supports multiple YOLO format variants:
YOLO 1.1 (Darknet)
Original YOLO format with text-based annotations.
Features:
- One .txt file per image
- Normalized bounding box coordinates
- Class ID, center_x, center_y, width, height
- Requires obj.names and obj.data files
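
The normalized line layout listed above can be sketched as a small conversion from pixel corners. The function name `to_yolo_line` and the sample numbers are hypothetical, and the six-decimal precision is an arbitrary choice for this illustration rather than something the format requires.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Format one box as 'class_id center_x center_y width height'.

    All four geometry values are divided by the image size so they
    fall in the range 0..1, as the YOLO text format expects.
    """
    center_x = (xmin + xmax) / 2 / img_w
    center_y = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {center_x:.6f} {center_y:.6f} {width:.6f} {height:.6f}"


# A 320x240 box centered in a 640x480 image, class index 0.
line = to_yolo_line(0, 160, 120, 480, 360, 640, 480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```

One such line is written per object into the .txt file that shares its base name with the image.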
Ultralytics YOLO
Modern YOLO format (YOLOv8+) with YAML configuration.
Variants:
- Detection: Bounding boxes for object detection
- Segmentation: Polygon annotations for instance segmentation
- OBB: Oriented bounding boxes for rotated objects
- Pose: Keypoint annotations for pose estimation
- Classification: Image-level labels
COCO Formats
MS COCO is an industry-standard format for complex annotations.
COCO Object Detection
Features:
- JSON-based annotations
- Bounding boxes with categories
- Image metadata (size, file name)
- Supports attributes as custom fields
- Crowd annotations for groups
Use cases:
- Training modern detection models (Faster R-CNN, YOLO, etc.)
- Complex multi-class detection tasks
- Integration with popular ML frameworks
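
As an illustration of the JSON structure, the sketch below builds a tiny dataset in the public COCO detection layout and groups its boxes by image. The sample values and the helper name `boxes_by_file` are hypothetical; the keys (`images`, `categories`, `annotations`, `bbox`, `iscrowd`) follow the published COCO convention, where `bbox` is `[x, y, width, height]` in pixels measured from the top-left corner. A real export may include further fields such as `segmentation` or `attributes`.

```python
import json

# Hypothetical minimal dataset using the public COCO detection layout.
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "dog"}, {"id": 2, "name": "person"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [48, 240, 147, 131], "area": 147 * 131, "iscrowd": 0},
        # iscrowd=1 marks a region that covers a group of objects.
        {"id": 2, "image_id": 1, "category_id": 2,
         "bbox": [8, 12, 344, 458], "area": 344 * 458, "iscrowd": 1},
    ],
}


def boxes_by_file(dataset):
    """Group (category name, bbox, is_crowd) tuples by image file name."""
    files = {img["id"]: img["file_name"] for img in dataset["images"]}
    names = {cat["id"]: cat["name"] for cat in dataset["categories"]}
    grouped = {file_name: [] for file_name in files.values()}
    for ann in dataset["annotations"]:
        grouped[files[ann["image_id"]]].append(
            (names[ann["category_id"]], ann["bbox"], bool(ann["iscrowd"]))
        )
    return grouped


# Round-trip through JSON text, as the data would be stored on disk.
result = boxes_by_file(json.loads(json.dumps(coco)))
print(result)
```

Unlike the per-image files of PASCAL VOC and YOLO, a single JSON document holds every image, category, and annotation, linked by integer ids.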
COCO Keypoints
Features:
- Keypoint annotations for pose estimation
- Skeleton definitions
- Visibility flags per keypoint
- Compatible with pose estimation models
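
The per-keypoint visibility flags can be sketched as follows. In the public COCO keypoints layout, each annotation stores a flat `keypoints` list of `x, y, v` triples, where `v` is 0 for a point that is not labeled, 1 for labeled but not visible, and 2 for labeled and visible. The helper name `split_keypoints` and the sample coordinates are hypothetical.

```python
def split_keypoints(flat):
    """Split a flat [x1, y1, v1, x2, y2, v2, ...] list into (x, y, v) triples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints list length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Three keypoints: visible, labeled but occluded, and not labeled.
flat = [320, 100, 2, 300, 150, 1, 0, 0, 0]
points = split_keypoints(flat)

# Count the points that carry a label, whether visible or not.
num_labeled = sum(1 for _, _, v in points if v > 0)
print(points, num_labeled)
```

The order of the triples matches the keypoint names declared in the category, so the skeleton definition can refer to points by index.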
Cityscapes
Cityscapes format for urban scene understanding.
Features:
- Semantic segmentation masks
- Instance segmentation annotations
- 19 standard classes for street scenes
- Polygons and pixel-level masks
Use cases:
- Autonomous driving applications
- Urban scene segmentation
- Street-level computer vision
MOT Formats
MOT (Multiple Object Tracking) formats for tracking tasks.
MOT 1.1
Features:
- Track annotations over time
- CSV-based format
- Frame number, track ID, bounding box
- Compatible with MOT challenge
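
A minimal sketch of reading the CSV layout is shown below. It assumes the MOT challenge column order, in which the first six columns are frame number, track ID, left, top, width, and height; any trailing columns (which vary between MOT releases) are ignored here. The sample rows and the helper name `read_tracks` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical ground-truth rows; only the first six columns are used.
MOT_TEXT = """1,1,100,200,50,80,1,1,1.0
2,1,104,202,50,80,1,1,1.0
1,2,300,150,40,90,1,1,0.8
"""


def read_tracks(text):
    """Group (frame, left, top, width, height) tuples by track ID."""
    tracks = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        frame, track_id = int(row[0]), int(row[1])
        left, top, width, height = (float(value) for value in row[2:6])
        tracks[track_id].append((frame, left, top, width, height))
    return dict(tracks)


tracks = read_tracks(MOT_TEXT)
for track_id, boxes in tracks.items():
    print(track_id, boxes)
```

Grouping rows by track ID recovers each object's trajectory across frames, which is the information a tracking format adds on top of plain detection boxes.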
MOTS PNG
Features:
- Instance segmentation tracks
- PNG masks for each frame
- Pixel-level tracking annotations
LabelMe
LabelMe 3.0 format for general-purpose annotation.
Features:
- JSON annotations per image
- Polygons, rectangles, and points
- Flexible attribute system
- Web-based annotation tool compatible
Use cases:
- General object detection and segmentation
- Research projects
- Legacy LabelMe tool compatibility
ImageNet
ImageNet format for image classification.
Features:
- Directory-based class organization
- Image-level labels only
- Standard classification dataset structure
Use cases:
- Image classification tasks
- Transfer learning
- Training classification networks
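
Directory-based class organization means the label of an image is simply the name of the folder that contains it. The sketch below derives labels from paths without touching the file system; the paths and the helper name `labels_from_paths` are hypothetical, and the exact folder layout of a real export may differ.

```python
from pathlib import PurePosixPath

# Hypothetical relative paths: one sub-directory per class.
paths = [
    "dataset/cat/img_001.jpg",
    "dataset/dog/img_002.jpg",
    "dataset/cat/img_003.jpg",
]


def labels_from_paths(image_paths):
    """Map each image file name to the name of its parent directory."""
    return {
        PurePosixPath(path).name: PurePosixPath(path).parent.name
        for path in image_paths
    }


labels = labels_from_paths(paths)
classes = sorted(set(labels.values()))
print(labels)
print(classes)
```

Because the label lives in the path, this layout cannot express boxes, masks, or per-object attributes, only one class per image.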
CamVid
CamVid format for video segmentation.
Features:
- Semantic segmentation for video
- 11 or 32 predefined classes
- Per-frame segmentation masks
Use cases:
- Video segmentation tasks
- Autonomous driving research
- Sequential scene understanding
WIDER Face
WIDER Face format for face detection.
Features:
- Face bounding boxes
- Occlusion and pose attributes
- Specialized for face detection benchmarks
Use cases:
- Face detection model training
- Benchmarking face detectors
- Large-scale face recognition
VGGFace2
VGGFace2 format for face recognition.
Features:
- Face identity labels
- Bounding boxes and landmarks
- Identity-based organization
Use cases:
- Face recognition training
- Face verification tasks
- Identity classification
Market-1501
Market-1501 format for person re-identification.
Features:
- Person bounding boxes
- Identity labels across cameras
- Track IDs for same person
Use cases:
- Person re-identification
- Multi-camera tracking
- Pedestrian recognition
ICDAR
ICDAR13/15 formats for text detection.
Features:
- Text region bounding boxes or polygons
- Transcription labels
- Oriented text support
Use cases:
- Scene text detection
- OCR training data
- Text recognition tasks
Open Images
Open Images V6 format for large-scale detection.
Features:
- Hierarchical label taxonomy
- Image-level and object-level labels
- Relationship annotations
- Attributes and groups
Use cases:
- Large-scale detection tasks
- Multi-label classification
- Complex object relationships
KITTI Formats
KITTI formats for autonomous driving.
KITTI Detection
Features:
- 3D bounding boxes
- Object detection in driving scenes
- Occlusion and truncation flags
KITTI Raw
Features:
- Raw sensor data format
- Calibration information
- Multi-modal data (camera, LiDAR)
Use cases:
- Autonomous driving research
- 3D object detection
- Sensor fusion tasks
LFW
Labeled Faces in the Wild format.
Features:
- Face verification pairs
- Identity labels
- Standard face recognition benchmark
Use cases:
- Face verification
- Face recognition benchmarking
Supervisely Point Cloud
Supervisely Point Cloud Format for 3D annotation.
Features:
- 3D bounding boxes
- Point cloud annotations
- 3D object detection
Use cases:
- LiDAR data annotation
- 3D object detection
- Autonomous driving 3D perception
Ultralytics YOLO
See YOLO Formats above for detailed information on Ultralytics YOLO variants.
Choosing the Right Format
Consider these factors when selecting a format:
By Task Type
- Object Detection: COCO, YOLO, PASCAL VOC, Open Images
- Instance Segmentation: COCO, Ultralytics YOLO Segmentation, MOTS PNG
- Semantic Segmentation: Cityscapes, CamVid, PASCAL VOC Segmentation
- Classification: ImageNet, Ultralytics YOLO Classification
- Pose/Keypoints: COCO Keypoints, Ultralytics YOLO Pose
- Tracking: MOT, MOTS PNG
- 3D Detection: KITTI, Supervisely Point Cloud
- Face Tasks: WIDER Face, VGGFace2, LFW
- Text Detection: ICDAR
By Framework
- PyTorch: COCO, YOLO, ImageNet
- TensorFlow: COCO, PASCAL VOC
- Darknet: YOLO 1.1
- Ultralytics: Ultralytics YOLO variants
- MMDetection: COCO
- Detectron2: COCO
By Complexity
- Simple: ImageNet, YOLO
- Moderate: PASCAL VOC, LabelMe
- Complex: COCO, Open Images, Datumaro
Format Limitations
Some formats have specific limitations:
- YOLO: Only supports bounding boxes or polygons (depending on variant)
- ImageNet: Only supports image-level classification
- PASCAL VOC: Limited attribute support compared to CVAT
- COCO: Polygons only (no ellipses without conversion)
- MOT: Primarily for tracking, limited object attributes
Next Steps
- Import & Export - Learn how to use these formats
- Format Conversion - Convert between formats
- Datumaro Documentation - Advanced dataset operations