Function Signature
Parameters
Predicted/generated audio signal (1D array). For multi-channel audio, only the first channel is used.
Ground truth/reference audio signal (1D array)
Sampling rate in Hz
Minimum F0 in Hz for extraction (e.g., 80 for male, 100 for female)
Maximum F0 in Hz for extraction (e.g., 400 for male, 600 for female)
Frame shift in milliseconds for mel-cepstral analysis
FFT length for spectral analysis
Dimension of mel-cepstral coefficients
All-pass constant for mel-cepstral analysis. Common values:
- 0.466 for 16 kHz
- 0.410 for 22.05 kHz
- 0.544 for 48 kHz
Maximum allowed sequence length mismatch ratio when dtw=False (0.1 = 10%)
Power threshold in dB for Voice Activity Detection (VAD) when using DTW
Whether to use Dynamic Time Warping for alignment:
- True: Apply DTW alignment (handles timing differences)
- False: Direct frame-by-frame comparison (requires similar lengths)
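The length check applied when dtw=False can be pictured as a relative-difference test. The sketch below is illustrative only: the function name `within_tolerance` is hypothetical, and dividing by the reference length is an assumption, since the exact definition of the mismatch ratio is not stated here.

```python
def within_tolerance(n_pred: int, n_ref: int, tolerance: float = 0.1) -> bool:
    """Return True if the frame counts differ by at most `tolerance` (as a ratio).

    Assumption: the mismatch ratio is |n_pred - n_ref| / n_ref.
    """
    if n_ref <= 0:
        raise ValueError("reference length must be positive")
    return abs(n_pred - n_ref) / n_ref <= tolerance


print(within_tolerance(105, 100))  # 5% mismatch -> True
print(within_tolerance(120, 100))  # 20% mismatch -> False
```

Signals that fail a check like this are candidates for dtw=True instead.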
Returns
Mel-Cepstral Distortion in dB (lower is better)
- Measures spectral envelope similarity
- Typical range: 4-8 dB for good systems
Root Mean Square Error of F0 in Hz (lower is better)
- Measures pitch accuracy
- Returns NaN if no voiced frames found
Pearson correlation coefficient of F0 (-1 to 1, higher is better)
- Measures pitch contour similarity
- Returns NaN if no voiced frames found
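As a rough illustration of the two F0 metrics, the following dependency-free sketch computes RMSE and Pearson correlation over voiced frames and returns NaN when none are found. The function name `f0_metrics` is hypothetical. Treating a frame as voiced only when F0 > 0 in both signals is an assumption, not a statement about the actual implementation.

```python
import math


def f0_metrics(f0_pred, f0_ref):
    """Return (RMSE in Hz, Pearson correlation) over frames voiced in both inputs."""
    # Assumption: a frame counts only if both F0 values are > 0 (voiced).
    pairs = [(p, r) for p, r in zip(f0_pred, f0_ref) if p > 0 and r > 0]
    if not pairs:
        return math.nan, math.nan

    n = len(pairs)
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in pairs) / n)

    mean_p = sum(p for p, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in pairs)
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    var_r = sum((r - mean_r) ** 2 for _, r in pairs)
    # A constant contour has zero variance, so correlation is undefined.
    corr = math.nan if var_p == 0 or var_r == 0 else cov / math.sqrt(var_p * var_r)
    return rmse, corr


# The unvoiced first frame (F0 = 0) is skipped.
print(f0_metrics([0, 100, 110, 120], [0, 102, 112, 122]))  # (2.0, 1.0)
```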
Usage Examples
Basic Usage (No DTW)
With Dynamic Time Warping
Custom MCEP Configuration
Technical Details
Feature Extraction Pipeline
- Preprocessing
  - Scale audio to int16 range
  - Apply low-cut filter (70 Hz cutoff)
  - Handle multi-channel by using first channel
- WORLD Vocoder Analysis
  - F0 extraction using HARVEST algorithm
  - Spectral envelope via CheapTrick
  - Aperiodicity estimation with D4C
  - Convert spectral envelope to mel-cepstrum
- MCD Calculation
  - Compute frame-wise Euclidean distance
  - Apply conversion factor: 10/ln(10) * sqrt(2 * sum(diff^2))
  - Average across all frames
- F0 Metrics
  - Extract only voiced frames (F0 > 0)
  - Calculate RMSE and correlation
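The MCD step of the pipeline can be sketched in plain Python as follows. This is a minimal illustration of the formula 10/ln(10) * sqrt(2 * sum(diff^2)) averaged over frames, assuming the two mel-cepstral sequences are already aligned and of equal length. The function name `mcd` is hypothetical, and whether the 0th (energy) coefficient is excluded beforehand is left to the caller.

```python
import math


def mcd(mcep_pred, mcep_ref):
    """Mean mel-cepstral distortion in dB over paired frames.

    Each argument is a list of frames; each frame is a list of coefficients.
    """
    if len(mcep_pred) != len(mcep_ref) or not mcep_pred:
        raise ValueError("need the same, non-zero number of frames")

    scale = 10.0 / math.log(10.0)  # conversion factor to dB
    per_frame = []
    for frame_pred, frame_ref in zip(mcep_pred, mcep_ref):
        squared_diff = sum((x - y) ** 2 for x, y in zip(frame_pred, frame_ref))
        per_frame.append(scale * math.sqrt(2.0 * squared_diff))
    return sum(per_frame) / len(per_frame)


# A single frame that differs by 1.0 in one coefficient.
print(round(mcd([[1.0, 0.0]], [[0.0, 0.0]]), 3))  # 6.142
```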
DTW vs Non-DTW
Without DTW (dtw=False):
- Direct frame-by-frame comparison
- Requires similar sequence lengths (within tolerance)
- Faster computation
- Best for well-aligned signals
With DTW (dtw=True):
- Applies Voice Activity Detection
- Aligns sequences using Dynamic Time Warping
- Handles timing variations
- Power-based alignment for MCD
- F0-based alignment for F0 metrics
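To make the alignment step concrete, here is a textbook Dynamic Time Warping sketch that returns the minimum-cost path between two feature sequences. It is a generic illustration using Euclidean frame distance and the standard three-way step pattern. The name `dtw_path` is hypothetical, and the VAD step and the library's actual DTW routine are not reproduced here.

```python
import math


def dtw_path(seq_a, seq_b):
    """Return the minimum-cost alignment as a list of (i, j) index pairs."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = dist + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Backtrack from the end, preferring the diagonal step on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


a = [[0.0], [1.0], [2.0]]
b = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_path(a, b))  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The index pairs can then be used to pick matching frames from each sequence before computing a frame-wise metric.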
Parameter Guidelines
F0 Range by Voice Type
| Voice Type | f0min (Hz) | f0max (Hz) |
|---|---|---|
| Male | 80 | 400 |
| Female | 100 | 600 |
| Child | 150 | 800 |
| Mixed | 80 | 600 |
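If the voice type is known ahead of time, the table above can be kept as a small lookup. The helper below is a hypothetical convenience, not part of the documented API. Falling back to the "mixed" range for unknown labels is an assumption made for this sketch.

```python
# F0 search ranges in Hz, copied from the table above.
F0_RANGES = {
    "male": (80, 400),
    "female": (100, 600),
    "child": (150, 800),
    "mixed": (80, 600),
}


def f0_range(voice_type: str) -> tuple:
    """Return (f0min, f0max) in Hz, falling back to the mixed range."""
    return F0_RANGES.get(voice_type.lower(), F0_RANGES["mixed"])


f0min, f0max = f0_range("Female")
print(f0min, f0max)  # 100 600
```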
Alpha Values by Sampling Rate
| Sampling Rate | mcep_alpha |
|---|---|
| 16 kHz | 0.466 |
| 22.05 kHz | 0.410 |
| 24 kHz | 0.395 |
| 44.1 kHz | 0.510 |
| 48 kHz | 0.544 |
FFT Length Recommendations
| Sampling Rate | mcep_fftl |
|---|---|
| 16 kHz | 1024 |
| 22.05 kHz | 2048 |
| 48 kHz | 4096 |
Interpretation
MCD Values
- < 4.5 dB: Excellent similarity (professional quality)
- 4.5 - 6.5 dB: Good similarity (acceptable quality)
- 6.5 - 8.5 dB: Moderate similarity (noticeable differences)
- > 8.5 dB: Poor similarity (significant degradation)
F0 Correlation
- > 0.9: Excellent pitch contour matching
- 0.7 - 0.9: Good pitch tracking
- 0.5 - 0.7: Moderate correlation
- < 0.5: Poor pitch matching
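The bands above can be encoded as simple threshold checks, for example when summarizing many evaluation runs. The function names are hypothetical. Because the listed ranges share their end points, assigning a boundary value such as 6.5 dB or 0.7 to the better band is an assumption of this sketch.

```python
import math


def describe_mcd(mcd_db: float) -> str:
    """Map an MCD value in dB to the qualitative bands listed above."""
    if math.isnan(mcd_db):
        return "undefined"
    if mcd_db < 4.5:
        return "excellent"
    if mcd_db <= 6.5:
        return "good"
    if mcd_db <= 8.5:
        return "moderate"
    return "poor"


def describe_f0_corr(corr: float) -> str:
    """Map an F0 correlation to the qualitative bands listed above."""
    if math.isnan(corr):  # e.g. no voiced frames were found
        return "undefined"
    if corr > 0.9:
        return "excellent"
    if corr >= 0.7:
        return "good"
    if corr >= 0.5:
        return "moderate"
    return "poor"


print(describe_mcd(5.2), describe_f0_corr(0.93))  # good excellent
```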
Dependencies
Use Cases
- Voice conversion quality evaluation
- Speech synthesis assessment
- Singing voice synthesis validation
- Speaker adaptation evaluation
- Prosody transfer measurement
Error Handling
Multi-channel Audio: If multi-channel input is detected, the first channel is used automatically and a warning is issued.
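The multi-channel behavior can be sketched as below. This is an illustration only: the helper name `to_mono` is hypothetical, and it assumes the input is laid out as a sequence of frames, each holding one sample per channel.

```python
import warnings


def to_mono(audio):
    """Return a 1D list of samples, keeping only the first channel of 2D input."""
    # Assumption: 2D input is shaped (num_samples, num_channels).
    if audio and isinstance(audio[0], (list, tuple)):
        if len(audio[0]) > 1:
            warnings.warn("multi-channel audio detected; using first channel only")
        return [frame[0] for frame in audio]
    return list(audio)


stereo = [[0.1, 0.9], [0.2, 0.8]]
print(to_mono(stereo))  # [0.1, 0.2]
```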