MCD (Mel-Cepstral Distortion) and F0 metrics measure the acoustic similarity between two audio signals. They are commonly used to evaluate voice conversion and speech synthesis systems.

Function Signature

mcd_f0(
    pred_x,
    gt_x,
    fs,
    f0min,
    f0max,
    mcep_shift=5,
    mcep_fftl=1024,
    mcep_dim=39,
    mcep_alpha=0.466,
    seq_mismatch_tolerance=0.1,
    power_threshold=-20,
    dtw=False
)

Parameters

pred_x
numpy.ndarray
required
Predicted/generated audio signal (1D array). If multi-channel audio is passed, only the first channel is used
gt_x
numpy.ndarray
required
Ground truth/reference audio signal (1D array)
fs
int
required
Sampling rate in Hz
f0min
float
required
Minimum F0 in Hz for extraction (e.g., 80 for male, 100 for female)
f0max
float
required
Maximum F0 in Hz for extraction (e.g., 400 for male, 600 for female)
mcep_shift
int
default:"5"
Frame shift in milliseconds for mel-cepstral analysis
mcep_fftl
int
default:"1024"
FFT length for spectral analysis
mcep_dim
int
default:"39"
Dimension of mel-cepstral coefficients
mcep_alpha
float
default:"0.466"
All-pass constant for mel-cepstral analysis. Common values:
  • 0.466 for 16 kHz
  • 0.410 for 22.05 kHz
  • 0.544 for 48 kHz
seq_mismatch_tolerance
float
default:"0.1"
Maximum allowed sequence length mismatch ratio when dtw=False (0.1 = 10%)
power_threshold
float
default:"-20"
Power threshold in dB for Voice Activity Detection (VAD) when using DTW
dtw
bool
default:"false"
Whether to use Dynamic Time Warping for alignment:
  • True: Apply DTW alignment (handles timing differences)
  • False: Direct frame-by-frame comparison (requires similar lengths)

Returns

mcd
float
Mel-Cepstral Distortion in dB (lower is better)
  • Measures spectral envelope similarity
  • Typical range: 4-8 dB for good systems
f0rmse
float
Root Mean Square Error of F0 in Hz (lower is better)
  • Measures pitch accuracy
  • Returns NaN if no voiced frames found
f0corr
float
Pearson correlation coefficient of F0 (-1 to 1, higher is better)
  • Measures pitch contour similarity
  • Returns NaN if no voiced frames found

Usage Examples

Basic Usage (No DTW)

import numpy as np
from versa import mcd_f0

# Load audio signals
reference = np.random.random(16000)  # Replace with actual reference
generated = np.random.random(16000)  # Replace with actual generated audio
fs = 16000

# Calculate MCD and F0 metrics
results = mcd_f0(
    pred_x=generated,
    gt_x=reference,
    fs=fs,
    f0min=80,   # Male voice range
    f0max=400,
    dtw=False
)

print(f"MCD: {results['mcd']:.2f} dB")
print(f"F0 RMSE: {results['f0rmse']:.2f} Hz")
print(f"F0 Correlation: {results['f0corr']:.3f}")

With Dynamic Time Warping

import numpy as np
from versa import mcd_f0

# For sequences with timing differences
reference = np.random.random(24000)
generated = np.random.random(20000)  # Different length OK with DTW
fs = 16000

results = mcd_f0(
    pred_x=generated,
    gt_x=reference,
    fs=fs,
    f0min=100,  # Female voice range
    f0max=600,
    dtw=True,   # Enable DTW alignment
    power_threshold=-20
)

print(f"MCD (DTW): {results['mcd']:.2f} dB")
print(f"F0 RMSE: {results['f0rmse']:.2f} Hz")
print(f"F0 Correlation: {results['f0corr']:.3f}")

Custom MCEP Configuration

import numpy as np
from versa import mcd_f0

reference = np.random.random(22050)
generated = np.random.random(22050)
fs = 22050

# Custom configuration for 22.05 kHz
results = mcd_f0(
    pred_x=generated,
    gt_x=reference,
    fs=fs,
    f0min=80,
    f0max=500,
    mcep_shift=5,
    mcep_fftl=2048,      # Larger FFT for higher SR
    mcep_dim=39,
    mcep_alpha=0.410,    # Alpha for 22.05 kHz
    dtw=False
)

print(f"MCD: {results['mcd']:.2f} dB")

Technical Details

Feature Extraction Pipeline

  1. Preprocessing
    • Scale audio to int16 range
    • Apply low-cut filter (70 Hz cutoff)
    • Handle multi-channel by using first channel
  2. WORLD Vocoder Analysis
    • F0 extraction using HARVEST algorithm
    • Spectral envelope via CheapTrick
    • Aperiodicity estimation with D4C
    • Convert spectral envelope to mel-cepstrum
  3. MCD Calculation
    • Compute frame-wise Euclidean distance
    • Apply conversion factor: 10/ln(10) * sqrt(2 * sum(diff^2))
    • Average across all frames
  4. F0 Metrics
    • Extract only voiced frames (F0 > 0)
    • Calculate RMSE and correlation
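
The MCD step above can be sketched in pure NumPy. This is an illustrative reimplementation of the documented conversion formula, not versa's internal code; it assumes the two mel-cepstrum matrices are already frame-aligned, and any special handling of the 0th coefficient is omitted:

```python
import numpy as np

def mcd_from_mcep(mcep_pred, mcep_gt):
    """Mean Mel-Cepstral Distortion in dB over aligned frames (frames x dims)."""
    diff = np.asarray(mcep_pred) - np.asarray(mcep_gt)
    # Per-frame distance: 10 / ln(10) * sqrt(2 * sum(diff^2))
    frame_dist = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Average across all frames
    return float(np.mean(frame_dist))
```

Identical inputs give 0 dB; larger spectral differences grow the per-frame Euclidean distance and therefore the average.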

DTW vs Non-DTW

Without DTW (dtw=False):
  • Direct frame-by-frame comparison
  • Requires similar sequence lengths (within tolerance)
  • Faster computation
  • Best for well-aligned signals
With DTW (dtw=True):
  • Applies Voice Activity Detection
  • Aligns sequences using Dynamic Time Warping
  • Handles timing variations
  • Power-based alignment for MCD
  • F0-based alignment for F0 metrics
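
To make the alignment idea concrete, here is a minimal dynamic-programming DTW over two 1-D feature sequences. This is a toy sketch for intuition only; versa's implementation uses the fastdtw package and aligns on power (for MCD) and F0 (for F0 metrics) as described above:

```python
import numpy as np

def dtw_path(a, b):
    """Minimal DTW on two 1-D sequences; returns the aligned index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # Fill the accumulated-cost matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Once the path is known, both feature sequences can be indexed by it to produce equal-length, frame-paired arrays for the distance computation.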

Parameter Guidelines

F0 Range by Voice Type

Voice Type    f0min (Hz)    f0max (Hz)
Male          80            400
Female        100           600
Child         150           800
Mixed         80            600

Alpha Values by Sampling Rate

Sampling Rate    mcep_alpha
16 kHz           0.466
22.05 kHz        0.410
24 kHz           0.395
44.1 kHz         0.510
48 kHz           0.544

FFT Length Recommendations

Sampling Rate    mcep_fftl
16 kHz           1024
22.05 kHz        2048
48 kHz           4096
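
The two tables above can be folded into a small lookup helper. This is a hypothetical convenience, not part of the versa API; only the sampling rates tabulated here are covered, and mcep_fftl is returned as None for rates without a listed FFT length:

```python
# Values taken from the tables above; fftl is only tabulated for three rates.
ALPHA_BY_FS = {16000: 0.466, 22050: 0.410, 24000: 0.395, 44100: 0.510, 48000: 0.544}
FFTL_BY_FS = {16000: 1024, 22050: 2048, 48000: 4096}

def mcep_settings(fs):
    """Return (mcep_alpha, mcep_fftl) for a tabulated sampling rate."""
    if fs not in ALPHA_BY_FS:
        raise ValueError(f"No tabulated mcep_alpha for fs={fs}; set it manually")
    # mcep_fftl may be None when the tables above give no recommendation
    return ALPHA_BY_FS[fs], FFTL_BY_FS.get(fs)
```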

Interpretation

MCD Values

  • < 4.5 dB: Excellent similarity (professional quality)
  • 4.5 - 6.5 dB: Good similarity (acceptable quality)
  • 6.5 - 8.5 dB: Moderate similarity (noticeable differences)
  • > 8.5 dB: Poor similarity (significant degradation)
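
As a quick illustration, the MCD bands above can be turned into a simple labeling helper (a hypothetical function; the thresholds are only the rough guide given here, not a standard):

```python
def mcd_quality(mcd_db):
    """Map an MCD value in dB to the rough quality bands listed above."""
    if mcd_db < 4.5:
        return "excellent"
    if mcd_db < 6.5:
        return "good"
    if mcd_db < 8.5:
        return "moderate"
    return "poor"
```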

F0 Correlation

  • > 0.9: Excellent pitch contour matching
  • 0.7 - 0.9: Good pitch tracking
  • 0.5 - 0.7: Moderate correlation
  • < 0.5: Poor pitch matching

Dependencies

pip install pyworld pysptk scipy fastdtw librosa

Use Cases

  • Voice conversion quality evaluation
  • Speech synthesis assessment
  • Singing voice synthesis validation
  • Speaker adaptation evaluation
  • Prosody transfer measurement

Error Handling

No Voiced Frames: If no F0 > 0 is detected, f0rmse and f0corr return NaN. This may occur with:
  • Unconverged model training
  • Silent or whispered speech
  • Incorrect f0min/f0max range
Sequence Length Mismatch: When dtw=False, sequences must be within seq_mismatch_tolerance ratio. Use dtw=True for variable-length sequences.
Multi-channel Audio: Automatically uses first channel with a warning if multi-channel detected.
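
The checks above can be replicated as a pre-flight validation before calling mcd_f0. This is an illustrative helper mirroring the documented behavior, not versa's internal code:

```python
import numpy as np

def preflight(pred_x, gt_x, seq_mismatch_tolerance=0.1):
    """Drop extra channels and verify lengths are close enough for dtw=False."""
    # Multi-channel: keep only the first channel, as mcd_f0 does
    if pred_x.ndim > 1:
        pred_x = pred_x[:, 0]
    if gt_x.ndim > 1:
        gt_x = gt_x[:, 0]
    # With dtw=False, lengths must match within the tolerance ratio
    ratio = abs(len(pred_x) - len(gt_x)) / max(len(pred_x), len(gt_x))
    if ratio > seq_mismatch_tolerance:
        raise ValueError("Length mismatch exceeds tolerance; use dtw=True")
    return pred_x, gt_x
```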
