Panlabel provides powerful commands for analyzing, comparing, and manipulating your annotation datasets.

Dataset Statistics

The stats command provides comprehensive dataset analysis:
panlabel stats --format coco dataset.json
Output:
Dataset Statistics

Overview:
  Images:       1,000
  Annotations:  4,523
  Categories:   5
  Avg. annotations/image: 4.52

Category Distribution:
  person        2,145 (47.4%) ████████████████████
  car           1,234 (27.3%) ███████████
  bicycle         678 (15.0%) ██████
  truck           321  (7.1%) ███
  motorcycle      145  (3.2%) █

Top Co-occurring Pairs:
  person + car           456 images
  person + bicycle       234 images
  car + truck            123 images

Bounding Box Statistics:
  Width:  mean=125.3px, median=98.2px, std=67.4px
  Height: mean=156.7px, median=132.1px, std=89.2px
  Area:   mean=19,637px², median=12,894px², std=15,234px²

Quality Checks:
  ✓ No out-of-bounds boxes
  ✓ No degenerate boxes (zero area)
  ⚠ 12 images with no annotations

Auto-Detection

Omit --format to auto-detect the format:
panlabel stats dataset.json
If auto-detection fails for a JSON file, stats falls back to reading it as ir-json.

Customizing Output

Control the number of top items displayed:
panlabel stats --format coco --top 20 dataset.json
Adjust out-of-bounds tolerance (in pixels):
panlabel stats --format coco --tolerance 1.0 dataset.json

Output Formats

JSON Output

Get machine-readable statistics for automation:
panlabel stats --format coco --output json dataset.json
Output:
{
  "overview": {
    "images": 1000,
    "annotations": 4523,
    "categories": 5,
    "avg_annotations_per_image": 4.52
  },
  "categories": [
    {"name": "person", "count": 2145, "percentage": 47.4},
    {"name": "car", "count": 1234, "percentage": 27.3}
  ],
  "bbox_stats": {
    "width": {"mean": 125.3, "median": 98.2, "std": 67.4},
    "height": {"mean": 156.7, "median": 132.1, "std": 89.2}
  },
  "quality": {
    "out_of_bounds": 0,
    "degenerate_boxes": 0,
    "images_without_annotations": 12
  }
}
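The JSON output is easy to wire into quality gates in CI. A minimal sketch that reads the schema shown above (the 5% policy and the embedded sample are illustrative, not part of panlabel):

```python
import json

# Stats as emitted by `panlabel stats --output json` (schema from the example above).
stats = json.loads("""
{
  "overview": {"images": 1000, "annotations": 4523, "categories": 5,
               "avg_annotations_per_image": 4.52},
  "quality": {"out_of_bounds": 0, "degenerate_boxes": 0,
              "images_without_annotations": 12}
}
""")

quality = stats["quality"]
# Hard failures: structural problems should block the pipeline.
errors = [k for k in ("out_of_bounds", "degenerate_boxes") if quality[k] > 0]
# Soft signal: empty images may be intentional (background-only frames).
empty_ratio = quality["images_without_annotations"] / stats["overview"]["images"]

print("errors:", errors)                  # errors: []
print(f"empty images: {empty_ratio:.1%}")  # empty images: 1.2%
```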

HTML Reports

Generate self-contained HTML reports for sharing:
panlabel stats --format coco --output html dataset.json > report.html
The HTML report includes:
  • Interactive charts and visualizations
  • Sortable tables
  • Embedded styling (no external dependencies)
HTML reports are well suited to documentation, stakeholder presentations, and dataset release notes.

Dataset Comparison

The diff command semantically compares two datasets:
panlabel diff dataset_v1.json dataset_v2.json
Output:
Dataset Diff: dataset_v1.json vs dataset_v2.json

Summary:
  Images:       1000 → 1050 (+50)
  Annotations:  4523 → 4789 (+266)
  Categories:   5 → 6 (+1)

Image Changes:
  Added:    50 images
  Removed:  0 images
  Modified: 23 images (annotation changes)

Annotation Changes:
  Added:    289 annotations
  Removed:  23 annotations
  Modified: 45 annotations (category or bbox changes)

Category Changes:
  Added: "bus"

Format Auto-Detection

Diff automatically detects input formats:
panlabel diff ./yolo_dataset annotations.json
Or specify formats explicitly:
panlabel diff \
  --format-a yolo \
  --format-b coco \
  dataset_a dataset_b.json

Matching Strategies

Match by ID (Default)

Match annotations by their ID fields:
panlabel diff --match-by id dataset_a.json dataset_b.json
Best for:
  • Comparing versions of the same dataset
  • Tracking changes in annotation pipelines
  • When IDs are stable across versions

Match by IoU

Match annotations by spatial overlap (Intersection over Union):
panlabel diff --match-by iou --iou-threshold 0.5 dataset_a.json dataset_b.json
Best for:
  • Comparing datasets from different labeling tools
  • When IDs are not preserved
  • Evaluating inter-annotator agreement
  • Comparing predictions vs. ground truth
IoU threshold must be in (0.0, 1.0]. A threshold of 0.5 means annotations must overlap by at least 50% to be considered a match.
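To make the threshold concrete, here is the standard IoU formula for two axis-aligned boxes in [x, y, width, height] form (the layout COCO uses); this is the textbook definition, not panlabel's internal code:

```python
def iou(a, b):
    """Intersection over Union for two [x, y, width, height] boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Overlap rectangle (zero width/height if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou([100, 100, 50, 50], [100, 100, 50, 50]))  # 1.0  (identical boxes)
print(iou([100, 100, 50, 50], [125, 100, 50, 50]))  # 0.333... (half-width shift)
```

Note how quickly IoU falls off: shifting a box by half its width already drops the score to 1/3, below the default 0.5 threshold.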

Detailed Diff Output

Include item-level details in the report:
panlabel diff --detail dataset_a.json dataset_b.json
Output:
Dataset Diff: dataset_a.json vs dataset_b.json

Summary:
  Images:       1000 → 1050 (+50)
  Annotations:  4523 → 4789 (+266)

Detailed Changes:

Added Images (showing first 20):
  • img_1001.jpg
  • img_1002.jpg
  • img_1003.jpg
  ...

Modified Images (showing first 20):
  • img_0042.jpg: 5 annotations → 6 annotations
  • img_0123.jpg: category distribution changed
  • img_0234.jpg: 2 annotations removed, 1 added
  ...

Modified Annotations (showing first 20):
  • ann_123: bbox changed from [100, 100, 50, 50] to [102, 98, 52, 51]
  • ann_456: category changed from "car" to "truck"
  • ann_789: bbox changed (IoU with original: 0.87)
  ...

JSON Output

For programmatic consumption:
panlabel diff --output json dataset_a.json dataset_b.json
Output:
{
  "summary": {
    "images": {"before": 1000, "after": 1050, "delta": 50},
    "annotations": {"before": 4523, "after": 4789, "delta": 266},
    "categories": {"before": 5, "after": 6, "delta": 1}
  },
  "changes": {
    "images_added": 50,
    "images_removed": 0,
    "images_modified": 23,
    "annotations_added": 289,
    "annotations_removed": 23,
    "annotations_modified": 45
  }
}
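The diff JSON is convenient for gating releases, for example flagging a dataset version that silently drops annotations. A sketch against the schema shown above (the 5% policy is illustrative):

```python
import json

# Diff as emitted by `panlabel diff --output json` (schema from the example above).
diff = json.loads("""
{
  "summary": {"annotations": {"before": 4523, "after": 4789, "delta": 266}},
  "changes": {"annotations_added": 289, "annotations_removed": 23,
              "annotations_modified": 45}
}
""")

removed = diff["changes"]["annotations_removed"]
before = diff["summary"]["annotations"]["before"]
# Flag the release if more than 5% of existing annotations disappeared.
ok = removed <= 0.05 * before
print(f"removed {removed}/{before} annotations -> {'OK' if ok else 'REVIEW'}")
```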

Dataset Sampling

The sample command creates subsets of your dataset:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --n 100
This creates a random sample of 100 images.

Sampling Strategies

Random Sampling (Default)

Uniform random sampling:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --n 100 \
  --strategy random

Stratified Sampling

Category-aware sampling that preserves class distribution:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --n 100 \
  --strategy stratified
Stratified sampling is better for:
  • Creating representative train/val/test splits
  • Maintaining class balance in small samples
  • Avoiding over-representation of common classes
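The idea behind stratification can be sketched as proportional per-class draws (a simplification for single-label items; panlabel's actual handling of multi-label images may differ):

```python
import random
from collections import Counter

def stratified_sample(items, labels, n, seed=42):
    """Draw ~n items so each label's share of the sample matches
    its share of the population."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)
    sample = []
    for label, group in by_label.items():
        k = round(n * len(group) / len(items))  # proportional quota per class
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

# 80/20 population -> the 100-item sample keeps the same 80/20 split.
population = ["person"] * 800 + ["truck"] * 200
picked = stratified_sample(population, population, n=100)
print(Counter(picked))  # Counter({'person': 80, 'truck': 20})
```

A plain random draw of 100 would only approximate the 80/20 ratio; the quota per class makes it exact.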

Sample Size Options

Fixed Count

panlabel sample --input dataset.json --output sample.json --n 100

Fraction

Sample 20% of the dataset:
panlabel sample --input dataset.json --output sample.json --fraction 0.2

Deterministic Sampling

Use a random seed for reproducible samples:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --n 100 \
  --seed 42
Always use --seed for train/val/test splits to ensure reproducibility in experiments.
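Seeding matters because an unseeded run draws a different subset each time, while a fixed seed makes the draw fully reproducible (illustrated with Python's random module, not panlabel internals):

```python
import random

ids = list(range(1000))
first = random.Random(42).sample(ids, 100)   # seeded draw of 100 "images"
second = random.Random(42).sample(ids, 100)  # same seed, fresh generator
print(first == second)  # True: same seed, same 100-image subset
```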

Category Filtering

Filter by Categories

Sample only images containing specific categories:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --n 100 \
  --categories "person,car,bicycle"

Category Filter Modes

Image Mode (Default): Keep whole images that contain at least one selected category:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --categories "person" \
  --category-mode images
An image with both “person” and “car” annotations will keep all annotations.

Annotation Mode: Keep only annotations of selected categories:
panlabel sample \
  --input dataset.json \
  --output sample.json \
  --categories "person" \
  --category-mode annotations
An image with “person” and “car” annotations will only keep “person” annotations, dropping “car”.
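The two modes can be expressed as filters over (image, annotations) pairs. A sketch assuming a simple list-of-dicts layout (not panlabel's internal data model):

```python
def filter_images_mode(dataset, keep):
    """Image mode: keep whole images containing any selected category."""
    return [img for img in dataset
            if any(a["category"] in keep for a in img["annotations"])]

def filter_annotations_mode(dataset, keep):
    """Annotation mode: keep only annotations of the selected categories,
    dropping images left with none."""
    out = []
    for img in dataset:
        anns = [a for a in img["annotations"] if a["category"] in keep]
        if anns:
            out.append({**img, "annotations": anns})
    return out

dataset = [{"file": "img_0001.jpg",
            "annotations": [{"category": "person"}, {"category": "car"}]}]
print(filter_images_mode(dataset, {"person"})[0]["annotations"])       # both kept
print(filter_annotations_mode(dataset, {"person"})[0]["annotations"])  # only person
```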

Format Conversion During Sampling

Combine sampling with format conversion:
panlabel sample \
  --from coco \
  --to yolo \
  --input dataset.json \
  --output ./yolo_sample \
  --n 100 \
  --allow-lossy
If --to is omitted, the output format matches the input format (or defaults to ir-json for auto-detected inputs).

Practical Use Cases

Dataset Quality Analysis

Combine commands to assess dataset quality:
# 1. Validate structure
panlabel validate --format coco dataset.json

# 2. Analyze statistics
panlabel stats --format coco dataset.json > stats.txt

# 3. Create QA sample
panlabel sample --input dataset.json --output qa_sample.json --n 50 --seed 42

# 4. Review the sample manually or with visualization tools

Create Train/Val/Test Splits

# Training set: 70%
panlabel sample \
  --input dataset.json \
  --output train.json \
  --fraction 0.7 \
  --seed 42 \
  --strategy stratified

# Validation set: 15%
panlabel sample \
  --input dataset.json \
  --output val.json \
  --fraction 0.15 \
  --seed 43 \
  --strategy stratified

# Test set: 15%
panlabel sample \
  --input dataset.json \
  --output test.json \
  --fraction 0.15 \
  --seed 44 \
  --strategy stratified
This approach may result in overlapping splits. For guaranteed non-overlapping splits, use a custom script that tracks which images have been assigned to each split.
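Such a script can be as small as one seeded shuffle followed by a partition, which guarantees disjoint 70/15/15 splits (a sketch over image IDs; integrate with your own ID extraction):

```python
import random

def split(ids, fractions=(0.7, 0.15, 0.15), seed=42):
    """Partition ids into disjoint train/val/test via one seeded shuffle."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
# Disjoint by construction: each id lands in exactly one slice.
assert set(train).isdisjoint(val) and set(val).isdisjoint(test)
```

Because every ID lands in exactly one slice of the shuffled list, overlap is impossible, unlike three independent `panlabel sample` runs.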

Compare Annotation Versions

# Compare before and after cleanup
panlabel diff \
  --detail \
  dataset_raw.json \
  dataset_cleaned.json > changes.txt

# Get statistics on both
panlabel stats dataset_raw.json > stats_before.txt
panlabel stats dataset_cleaned.json > stats_after.txt

Evaluate Inter-Annotator Agreement

# Compare two annotators' work on the same images
panlabel diff \
  --match-by iou \
  --iou-threshold 0.5 \
  --detail \
  annotator_a.json \
  annotator_b.json
Low match rates indicate disagreement and may require:
  • Clearer annotation guidelines
  • Additional training for annotators
  • Review and adjudication process

Subset Creation for Development

# Create a small subset for fast iteration
panlabel sample \
  --input large_dataset.json \
  --output dev_subset.json \
  --n 50 \
  --seed 42

# Convert to multiple formats for testing
panlabel convert --from coco --to yolo \
  --input dev_subset.json \
  --output ./dev_yolo \
  --allow-lossy

panlabel convert --from coco --to voc \
  --input dev_subset.json \
  --output ./dev_voc \
  --allow-lossy

Category-Specific Analysis

# Extract only "person" annotations
panlabel sample \
  --input dataset.json \
  --output person_only.json \
  --categories "person" \
  --category-mode annotations

# Analyze the subset
panlabel stats person_only.json
