Most distributional metrics require manual installation from their respective code sources. These metrics typically require a reference dataset for comparison.
Overview
Distributional metrics compare the statistical properties of two audio datasets (e.g., generated vs. real audio). They are particularly valuable for:
- Evaluating generative models (GANs, diffusion models, etc.)
- Detecting mode collapse in generation
- Assessing dataset diversity and coverage
- Comparing distributions across different audio corpora
These metrics are currently under verification in VERSA. They require careful setup and typically need large datasets for meaningful evaluation.
Distance-Based Metrics
Metrics that measure the distance between embedding distributions.

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
|---|---|---|---|---|---|---|
| 1 | | Frechet Audio Distance (FAD) | fad | fad | fadtk | paper |
| 2 | | Kullback-Leibler Divergence on Embedding Distribution | kl_embedding | kl_embedding | Stability-AI | |
| 5 | | KID: Kernel Distance Metric for Audio/Music Quality | - | - | KID | Paper |
Density & Coverage Metrics
Metrics that assess the density and coverage of generated audio distributions.

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
|---|---|---|---|---|---|---|
| 3 | | Audio Density Score | audio_density_coverage | audio_density | Sony-audio-metrics | paper |
| 4 | | Audio Coverage Score | audio_density_coverage | audio_coverage | Sony-audio-metrics | paper |
Detailed Metric Descriptions
Frechet Audio Distance (FAD)
Frechet Audio Distance measures the distance between feature distributions of real and generated audio.
- Inspired by Frechet Inception Distance (FID) for images
- Uses pre-trained audio embeddings (typically from VGGish or similar models)
- Lower FAD scores indicate better similarity to real audio
- Widely used for evaluating audio generation models
Installation: `pip install fadtk`

Use cases:
- Evaluating text-to-audio models
- Comparing different generative model architectures
- Tracking training progress of generative models
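Under the hood, FAD reduces to a closed-form distance between two Gaussians fit to the embedding sets. The sketch below shows that formula in NumPy/SciPy; `compute_fad` is an illustrative name, not the fadtk API, and fadtk should be used in practice.

```python
# Illustrative FAD computation between two embedding sets (not the fadtk API).
import numpy as np
from scipy import linalg

def compute_fad(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n_samples, dim) sets."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; discard tiny
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
same = rng.normal(size=(500, 8))
shifted = rng.normal(loc=2.0, size=(500, 8))
print(compute_fad(same, same))     # near 0: identical sets
print(compute_fad(same, shifted))  # larger: shifted distribution
```

In practice the embeddings would come from a pre-trained extractor (VGGish, CLAP, etc.) rather than random arrays.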
KL Divergence on Embeddings
Kullback-Leibler Divergence measures the difference between two probability distributions.
- Computes divergence between embedding distributions
- Asymmetric metric (KL(P||Q) ≠ KL(Q||P))
- Lower values indicate more similar distributions
- Sensitive to distribution outliers
Use cases:
- Fine-grained distribution comparison
- Detecting distribution shift
- Quality control for audio datasets
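One common simplification is to fit a diagonal Gaussian to each embedding set and use the closed-form Gaussian KL. This is an illustrative sketch only; the Stability-AI implementation may compute the divergence differently.

```python
# Illustrative KL divergence between embedding sets, modelling each
# set as a diagonal Gaussian (a simplification for demonstration).
import numpy as np

def kl_embedding(emb_p: np.ndarray, emb_q: np.ndarray) -> float:
    """KL(P || Q) for diagonal Gaussians fit to (n_samples, dim) embeddings."""
    mu_p, var_p = emb_p.mean(0), emb_p.var(0) + 1e-8
    mu_q, var_q = emb_q.mean(0), emb_q.var(0) + 1e-8
    return float(0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    ))

rng = np.random.default_rng(0)
p = rng.normal(size=(1000, 16))
q = rng.normal(loc=0.5, size=(1000, 16))
print(kl_embedding(p, p))                     # 0 for identical sets
# Asymmetric: KL(P||Q) and KL(Q||P) generally differ.
print(kl_embedding(p, q), kl_embedding(q, p))
```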
Audio Density Score
Audio Density measures how often generated samples fall within densely populated regions of the real data's feature space.
- High density: Generated samples lie in high-probability regions of the real data (high fidelity)
- Low density: Generated samples fall outside the real distribution's dense regions
- Helps detect mode collapse
Use cases:
- Detecting mode collapse in GANs
- Evaluating diversity of generated audio
- Balancing quality vs. diversity trade-offs
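The underlying idea follows the density formulation of Naeem et al. (2020): count how often generated samples fall inside the k-nearest-neighbour balls of real samples. The sketch below is illustrative; the Sony-audio-metrics code may differ in details.

```python
# Illustrative density metric (Naeem et al., 2020 style); not the
# Sony-audio-metrics implementation.
import numpy as np

def audio_density(real: np.ndarray, fake: np.ndarray, k: int = 5) -> float:
    """Density: how often generated samples fall inside the k-NN
    neighbourhoods of real samples (a fidelity signal)."""
    # k-th nearest-neighbour radius for each real sample.
    d_rr = np.linalg.norm(real[:, None] - real[None, :], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]  # column 0 is the self-distance (0)
    # Distances from each real sample to each generated sample.
    d_rf = np.linalg.norm(real[:, None] - fake[None, :], axis=-1)
    inside = d_rf < radii[:, None]       # real i's ball contains fake j
    return float(inside.sum() / (k * fake.shape[0]))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
matched = rng.normal(size=(200, 4))
mismatched = rng.normal(loc=5.0, size=(200, 4))
print(audio_density(real, matched))     # near 1: matched distributions
print(audio_density(real, mismatched))  # near 0: distribution mismatch
```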
Audio Coverage Score
Audio Coverage measures how well the generated distribution covers the real distribution.
- High coverage: Generated samples cover most modes in real data
- Low coverage: Generated samples miss important modes
- Complementary to density metrics
Use cases:
- Ensuring comprehensive dataset generation
- Detecting missing modes in generation
- Evaluating conditional generation models
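Coverage can be sketched in the same neighbourhood style (again following Naeem et al., 2020; the actual implementation may differ): the fraction of real samples whose k-NN ball contains at least one generated sample.

```python
# Illustrative coverage metric (Naeem et al., 2020 style); not the
# Sony-audio-metrics implementation.
import numpy as np

def audio_coverage(real: np.ndarray, fake: np.ndarray, k: int = 5) -> float:
    """Coverage: fraction of real samples whose k-NN neighbourhood
    contains at least one generated sample (a diversity signal)."""
    d_rr = np.linalg.norm(real[:, None] - real[None, :], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]
    d_rf = np.linalg.norm(real[:, None] - fake[None, :], axis=-1)
    covered = d_rf.min(axis=1) < radii   # nearest fake inside real i's ball
    return float(covered.mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
matched = rng.normal(size=(200, 4))
# Mode collapse: every generated sample lands at one point.
collapsed = np.full((200, 4), 0.1)
print(audio_coverage(real, matched))    # high: modes covered
print(audio_coverage(real, collapsed))  # low: most modes missed
```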
Kernel Inception Distance (KID)
KID is an alternative to FAD that uses kernel methods.
- More robust to outliers than FAD
- Unbiased estimator of distance between distributions
- Computationally efficient
Use cases:
- Alternative to FAD for smaller datasets
- When robustness to outliers is important
- Evaluating music generation quality
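The kernel method behind KID is an unbiased squared-MMD estimate, typically with a cubic polynomial kernel. This is a sketch of that estimator, not the packaged KID code.

```python
# Illustrative KID: unbiased MMD^2 with the cubic polynomial kernel
# k(a, b) = (a.b / dim + 1)^3 commonly used for KID.
import numpy as np

def kid(emb_x: np.ndarray, emb_y: np.ndarray) -> float:
    """Unbiased squared MMD between two (n_samples, dim) embedding sets."""
    d = emb_x.shape[1]
    k_xx = (emb_x @ emb_x.T / d + 1.0) ** 3
    k_yy = (emb_y @ emb_y.T / d + 1.0) ** 3
    k_xy = (emb_x @ emb_y.T / d + 1.0) ** 3
    n, m = len(emb_x), len(emb_y)
    # Drop diagonal terms so the estimate is unbiased.
    sum_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    sum_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return float(sum_xx + sum_yy - 2.0 * k_xy.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16))
y = rng.normal(size=(500, 16))
z = rng.normal(loc=1.0, size=(500, 16))
print(kid(x, y))  # near 0: same distribution (can be slightly negative)
print(kid(x, z))  # larger: shifted distribution
```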
Metric Interpretation
FAD Scores
- < 1.0: Excellent - Generated audio very similar to real
- 1.0 - 5.0: Good - Noticeable but acceptable differences
- 5.0 - 15.0: Fair - Significant differences, room for improvement
- > 15.0: Poor - Large distribution gap
Density vs. Coverage Trade-off
Ideal generative models achieve both high density (quality) and high coverage (diversity). Monitoring both metrics together helps identify:
- High density, low coverage: Mode collapse - good quality but limited diversity
- Low density, high coverage: Poor quality but good diversity
- High density, high coverage: Ideal - good quality and diversity
- Low density, low coverage: Poor performance overall
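The four regimes above can be captured in a small helper; the 0.8 threshold is illustrative only, and appropriate cut-offs are task-dependent.

```python
# Map (density, coverage) readings to the four regimes described above.
# The threshold is an illustrative default, not a standard value.
def diagnose(density: float, coverage: float, threshold: float = 0.8) -> str:
    high_d, high_c = density >= threshold, coverage >= threshold
    if high_d and not high_c:
        return "mode collapse: good quality but limited diversity"
    if not high_d and high_c:
        return "poor quality but good diversity"
    if high_d and high_c:
        return "ideal: good quality and diversity"
    return "poor performance overall"

print(diagnose(0.95, 0.40))  # mode-collapse regime
```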
Usage Recommendations
Evaluating Generative Models
For comprehensive evaluation of audio generation models:
- Primary metric: Use FAD as the main benchmark
- Diversity check: Monitor density and coverage scores
- Robustness: Also compute KID for comparison
Dataset Requirements
Distributional metrics require sufficient data:
- Minimum: 100+ samples per distribution
- Recommended: 1000+ samples for stable estimates
- Optimal: 10,000+ samples for fine-grained comparison
Feature Extractors
Choice of feature extractor significantly impacts results:
- VGGish: General audio classification features
- PANNs: Pre-trained audio neural networks
- CLAP: Joint audio-language embeddings
- Domain-specific: Custom embeddings for specialized tasks
Installation Guide
Since most distributional metrics are not auto-installed, install each one from the code source linked in the tables above before enabling its key in your configuration.

Best Practices
- Use multiple metrics: Don’t rely on a single distributional metric
- Track over time: Monitor metrics throughout training to detect issues early
- Compare to baselines: Establish baseline scores with known good models
- Match preprocessing: Ensure reference and generated audio use identical preprocessing
- Large sample sizes: Use as many samples as feasible for stable estimates
Related Resources
- FAD Toolkit Documentation
- Sony Audio Metrics Paper
- Original FID Paper (image domain inspiration)