MCD (Mel-Cepstral Distortion) and F0 metrics measure the acoustic similarity between two audio signals. They are commonly used to evaluate voice conversion and speech synthesis systems.

Function Signature

mcd_f0(
    pred_x,
    gt_x,
    fs,
    f0min,
    f0max,
    mcep_shift=5,
    mcep_fftl=1024,
    mcep_dim=39,
    mcep_alpha=0.466,
    seq_mismatch_tolerance=0.1,
    power_threshold=-20,
    dtw=False
)

Parameters

pred_x
numpy.ndarray
required
Predicted/generated audio signal (1D array). If multi-channel audio is passed, only the first channel is used
gt_x
numpy.ndarray
required
Ground truth/reference audio signal (1D array)
fs
int
required
Sampling rate in Hz
f0min
float
required
Minimum F0 in Hz for extraction (e.g., 80 for male, 100 for female)
f0max
float
required
Maximum F0 in Hz for extraction (e.g., 400 for male, 600 for female)
mcep_shift
int
default:"5"
Frame shift in milliseconds for mel-cepstral analysis
mcep_fftl
int
default:"1024"
FFT length for spectral analysis
mcep_dim
int
default:"39"
Dimension of mel-cepstral coefficients
mcep_alpha
float
default:"0.466"
All-pass constant for mel-cepstral analysis. Common values:
  • 0.466 for 16 kHz
  • 0.410 for 22.05 kHz
  • 0.544 for 48 kHz
seq_mismatch_tolerance
float
default:"0.1"
Maximum allowed sequence length mismatch ratio when dtw=False (0.1 = 10%)
power_threshold
float
default:"-20"
Power threshold in dB for Voice Activity Detection (VAD) when using DTW
dtw
bool
default:"false"
Whether to use Dynamic Time Warping for alignment:
  • True: Apply DTW alignment (handles timing differences)
  • False: Direct frame-by-frame comparison (requires similar lengths)

Returns

mcd
float
Mel-Cepstral Distortion in dB (lower is better)
  • Measures spectral envelope similarity
  • Typical range: 4-8 dB for good systems
f0rmse
float
Root Mean Square Error of F0 in Hz (lower is better)
  • Measures pitch accuracy
  • Returns NaN if no voiced frames found
f0corr
float
Pearson correlation coefficient of F0 (-1 to 1, higher is better)
  • Measures pitch contour similarity
  • Returns NaN if no voiced frames found

Usage Examples

Basic Usage (No DTW)

import numpy as np
from versa import mcd_f0

# Load audio signals
reference = np.random.random(16000)  # Replace with actual reference
generated = np.random.random(16000)  # Replace with actual generated audio
fs = 16000

# Calculate MCD and F0 metrics
results = mcd_f0(
    pred_x=generated,
    gt_x=reference,
    fs=fs,
    f0min=80,   # Male voice range
    f0max=400,
    dtw=False
)

print(f"MCD: {results['mcd']:.2f} dB")
print(f"F0 RMSE: {results['f0rmse']:.2f} Hz")
print(f"F0 Correlation: {results['f0corr']:.3f}")

With Dynamic Time Warping

import numpy as np
from versa import mcd_f0

# For sequences with timing differences
reference = np.random.random(24000)
generated = np.random.random(20000)  # Different length OK with DTW
fs = 16000

results = mcd_f0(
    pred_x=generated,
    gt_x=reference,
    fs=fs,
    f0min=100,  # Female voice range
    f0max=600,
    dtw=True,   # Enable DTW alignment
    power_threshold=-20
)

print(f"MCD (DTW): {results['mcd']:.2f} dB")
print(f"F0 RMSE: {results['f0rmse']:.2f} Hz")
print(f"F0 Correlation: {results['f0corr']:.3f}")

Custom MCEP Configuration

import numpy as np
from versa import mcd_f0

reference = np.random.random(22050)
generated = np.random.random(22050)
fs = 22050

# Custom configuration for 22.05 kHz
results = mcd_f0(
    pred_x=generated,
    gt_x=reference,
    fs=fs,
    f0min=80,
    f0max=500,
    mcep_shift=5,
    mcep_fftl=2048,      # Larger FFT for higher SR
    mcep_dim=39,
    mcep_alpha=0.410,    # Alpha for 22.05 kHz
    dtw=False
)

print(f"MCD: {results['mcd']:.2f} dB")

Technical Details

Feature Extraction Pipeline

  1. Preprocessing
    • Scale audio to int16 range
    • Apply low-cut filter (70 Hz cutoff)
    • Handle multi-channel by using first channel
  2. WORLD Vocoder Analysis
    • F0 extraction using HARVEST algorithm
    • Spectral envelope via CheapTrick
    • Aperiodicity estimation with D4C
    • Convert spectral envelope to mel-cepstrum
  3. MCD Calculation
    • Compute frame-wise Euclidean distance
    • Apply conversion factor: 10/ln(10) * sqrt(2 * sum(diff^2))
    • Average across all frames
  4. F0 Metrics
    • Extract only voiced frames (F0 > 0)
    • Calculate RMSE and correlation
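
The MCD step above can be sketched in pure NumPy. This is an illustrative reimplementation of the documented conversion formula, not versa's internal code; it assumes the two mel-cepstrum matrices are already frame-aligned, and any special handling of the 0th coefficient is omitted:

```python
import numpy as np

def mcd_from_mcep(mcep_pred, mcep_gt):
    """Mean Mel-Cepstral Distortion in dB over aligned frames (frames x dims)."""
    diff = np.asarray(mcep_pred) - np.asarray(mcep_gt)
    # Per-frame distance: 10 / ln(10) * sqrt(2 * sum(diff^2))
    frame_dist = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Average across all frames
    return float(np.mean(frame_dist))
```

Identical inputs give 0 dB; larger spectral differences grow the per-frame Euclidean distance and therefore the average.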

DTW vs Non-DTW

Without DTW (dtw=False):
  • Direct frame-by-frame comparison
  • Requires similar sequence lengths (within tolerance)
  • Faster computation
  • Best for well-aligned signals
With DTW (dtw=True):
  • Applies Voice Activity Detection
  • Aligns sequences using Dynamic Time Warping
  • Handles timing variations
  • Power-based alignment for MCD
  • F0-based alignment for F0 metrics
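
To make the alignment idea concrete, here is a minimal dynamic-programming DTW over two 1-D feature sequences. This is a toy sketch for intuition only; versa's implementation uses the fastdtw package and aligns on power (for MCD) and F0 (for F0 metrics) as described above:

```python
import numpy as np

def dtw_path(a, b):
    """Minimal DTW on two 1-D sequences; returns the aligned index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # Fill the accumulated-cost matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Once the path is known, both feature sequences can be indexed by it to produce equal-length, frame-paired arrays for the distance computation.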

Parameter Guidelines

F0 Range by Voice Type

Voice Type    f0min (Hz)    f0max (Hz)
Male          80            400
Female        100           600
Child         150           800
Mixed         80            600

Alpha Values by Sampling Rate

Sampling Rate    mcep_alpha
16 kHz           0.466
22.05 kHz        0.410
24 kHz           0.395
44.1 kHz         0.510
48 kHz           0.544

FFT Length Recommendations

Sampling Rate    mcep_fftl
16 kHz           1024
22.05 kHz        2048
48 kHz           4096
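
The two tables above can be folded into a small lookup helper. This is a hypothetical convenience, not part of the versa API; only the sampling rates tabulated here are covered, and mcep_fftl is returned as None for rates without a listed FFT length:

```python
# Values taken from the tables above; fftl is only tabulated for three rates.
ALPHA_BY_FS = {16000: 0.466, 22050: 0.410, 24000: 0.395, 44100: 0.510, 48000: 0.544}
FFTL_BY_FS = {16000: 1024, 22050: 2048, 48000: 4096}

def mcep_settings(fs):
    """Return (mcep_alpha, mcep_fftl) for a tabulated sampling rate."""
    if fs not in ALPHA_BY_FS:
        raise ValueError(f"No tabulated mcep_alpha for fs={fs}; set it manually")
    # mcep_fftl may be None when the tables above give no recommendation
    return ALPHA_BY_FS[fs], FFTL_BY_FS.get(fs)
```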

Interpretation

MCD Values

  • < 4.5 dB: Excellent similarity (professional quality)
  • 4.5 - 6.5 dB: Good similarity (acceptable quality)
  • 6.5 - 8.5 dB: Moderate similarity (noticeable differences)
  • > 8.5 dB: Poor similarity (significant degradation)
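
As a quick illustration, the MCD bands above can be turned into a simple labeling helper (a hypothetical function; the thresholds are only the rough guide given here, not a standard):

```python
def mcd_quality(mcd_db):
    """Map an MCD value in dB to the rough quality bands listed above."""
    if mcd_db < 4.5:
        return "excellent"
    if mcd_db < 6.5:
        return "good"
    if mcd_db < 8.5:
        return "moderate"
    return "poor"
```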

F0 Correlation

  • > 0.9: Excellent pitch contour matching
  • 0.7 - 0.9: Good pitch tracking
  • 0.5 - 0.7: Moderate correlation
  • < 0.5: Poor pitch matching

Dependencies

pip install pyworld pysptk scipy fastdtw librosa

Use Cases

  • Voice conversion quality evaluation
  • Speech synthesis assessment
  • Singing voice synthesis validation
  • Speaker adaptation evaluation
  • Prosody transfer measurement

Error Handling

No Voiced Frames: If no F0 > 0 is detected, f0rmse and f0corr return NaN. This may occur with:
  • Unconverged model training
  • Silent or whispered speech
  • Incorrect f0min/f0max range
Sequence Length Mismatch: When dtw=False, sequences must be within seq_mismatch_tolerance ratio. Use dtw=True for variable-length sequences.
Multi-channel Audio: Automatically uses first channel with a warning if multi-channel detected.
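
The checks above can be replicated as a pre-flight validation before calling mcd_f0. This is an illustrative helper mirroring the documented behavior, not versa's internal code:

```python
import numpy as np

def preflight(pred_x, gt_x, seq_mismatch_tolerance=0.1):
    """Drop extra channels and verify lengths are close enough for dtw=False."""
    # Multi-channel: keep only the first channel, as mcd_f0 does
    if pred_x.ndim > 1:
        pred_x = pred_x[:, 0]
    if gt_x.ndim > 1:
        gt_x = gt_x[:, 0]
    # With dtw=False, lengths must match within the tolerance ratio
    ratio = abs(len(pred_x) - len(gt_x)) / max(len(pred_x), len(gt_x))
    if ratio > seq_mismatch_tolerance:
        raise ValueError("Length mismatch exceeds tolerance; use dtw=True")
    return pred_x, gt_x
```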
