Function Signature
Parameters
The input audio. Can be:
- A string path to an audio file (will be loaded using load_audio)
- A NumPy array containing the audio waveform at 16 kHz
- A PyTorch Tensor containing the audio waveform at 16 kHz
The number of Mel-frequency filters to use. Only 80 and 128 are supported. The default 80 matches Whisper’s standard configuration.
Number of zero samples to pad to the right of the audio waveform.
If specified, the audio tensor is moved to this device before computing the STFT. Use "cuda" for GPU acceleration or "cpu" for CPU processing.
Returns
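A Tensor containing the log-Mel spectrogram. Values are scaled by (log_spec + 4.0) / 4.0 after the dynamic range is limited to 8 log10 units, so within one spectrogram they span at most 2.0 between minimum and maximum; they are not strictly confined to [0, 1].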
Example
Processing Pipeline
The function performs the following steps:
- Input Conversion: If the input is a file path, it loads the audio using load_audio(). NumPy arrays are converted to PyTorch tensors.
- Device Transfer: If a device is specified, the audio tensor is moved to that device.
- Padding: If padding > 0, zero samples are added to the right.
- STFT Computation: Applies Short-Time Fourier Transform with:
  - Window: Hann window of size N_FFT (400)
  - Hop length: HOP_LENGTH (160 samples)
  - Returns complex-valued spectrogram
- Magnitude Calculation: Computes the squared magnitude of the STFT (power spectrum), excluding the last time frame. All N_FFT / 2 + 1 = 201 frequency bins are kept so that the result matches the Mel filterbank.
- Mel Filtering: Projects the power spectrum onto the Mel scale using pre-computed filterbanks.
- Log Scaling and Normalization:
  - Clamps minimum values to 1e-10 to avoid log(0)
  - Converts to log10 scale
  - Applies dynamic range compression (maximum 80 dB range)
  - Normalizes: (log_spec + 4.0) / 4.0
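The STFT and magnitude steps can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the function's own code: the helper name power_spectrogram is hypothetical, a periodic Hann window is assumed, and frames are taken without any centering or edge padding, so frame counts will differ from the PyTorch-based function.

```python
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz input, as stated under Parameters (name assumed)
N_FFT = 400          # Hann window size in samples
HOP_LENGTH = 160     # samples between successive frames


def power_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Hypothetical helper: windowed FFT per frame, then squared magnitude.

    Returns an array of shape (N_FFT // 2 + 1, n_frames).
    Assumes len(audio) >= N_FFT and applies no centering or edge padding.
    """
    # Periodic Hann window (assumption; a symmetric window differs slightly).
    n = np.arange(N_FFT)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / N_FFT)

    n_frames = 1 + (len(audio) - N_FFT) // HOP_LENGTH
    # Row i of `frames` holds samples [i * HOP_LENGTH, i * HOP_LENGTH + N_FFT).
    starts = HOP_LENGTH * np.arange(n_frames)
    frames = audio[starts[:, None] + n[None, :]]

    spectrum = np.fft.rfft(frames * window, axis=1)  # complex, (n_frames, 201)
    return (np.abs(spectrum) ** 2).T                 # power, (201, n_frames)


# One second of a 1 kHz tone; each bin is 16000 / 400 = 40 Hz wide.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2.0 * np.pi * 1000.0 * t)
power = power_spectrogram(tone)
peak_bin = int(power.mean(axis=1).argmax())
```

With a 40 Hz bin width, the 1 kHz tone concentrates its energy in bin 25, which is a quick sanity check that the window size and sample rate are consistent.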
Audio Constants Used
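As a rough sketch of how the constants referenced on this page relate to one another, the snippet below derives the window and hop durations and the number of STFT frequency bins. Only the values 16 kHz, N_FFT = 400 and HOP_LENGTH = 160 come from the text on this page; the name SAMPLE_RATE and the derived variable names are assumptions made for illustration.

```python
SAMPLE_RATE = 16000  # Hz; name assumed, value from the 16 kHz input requirement
N_FFT = 400          # STFT window size in samples
HOP_LENGTH = 160     # STFT hop length in samples

# Durations implied by the sample rate.
window_ms = 1000 * N_FFT / SAMPLE_RATE          # length of one analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE        # spacing between frames
frames_per_second = SAMPLE_RATE // HOP_LENGTH   # spectrogram frames per second

# Frequency resolution of a real-input FFT of size N_FFT.
n_freq_bins = N_FFT // 2 + 1                    # non-negative frequency bins
bin_width_hz = SAMPLE_RATE / N_FFT              # spacing between bins in Hz
```

Under these values each frame covers 25 ms of audio, frames are 10 ms apart, and the STFT yields 201 frequency bins spaced 40 Hz apart.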
Notes
- The function uses pre-computed Mel filterbanks stored in mel_filters.npz to avoid a dependency on librosa.
- The STFT uses a Hann window, which tapers each frame at its edges to reduce spectral leakage.
- The dynamic range is limited to 80 dB by clamping: torch.maximum(log_spec, log_spec.max() - 8.0). On a log10 power scale, 8 units correspond to 80 dB.
- The final normalization (log_spec + 4.0) / 4.0 shifts and rescales the values into a range suitable for the neural network.
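The clamping and normalization described in these notes can be sketched with NumPy in place of PyTorch. The function name log_scale_and_normalize is hypothetical, the input is assumed to be a non-negative Mel power spectrogram, and np.maximum stands in for torch.maximum.

```python
import numpy as np


def log_scale_and_normalize(mel_power: np.ndarray) -> np.ndarray:
    """Hypothetical NumPy version of the log scaling steps described above."""
    # Clamp to 1e-10 so that log10 never sees zero.
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    # Keep at most 8 log10 units (80 dB) below the loudest value.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # Shift and rescale.
    return (log_spec + 4.0) / 4.0


# Toy input spanning many orders of magnitude, including an exact zero.
mel_power = np.array([[0.0, 1e-12, 1e-6], [1e-3, 1.0, 100.0]])
normalized = log_scale_and_normalize(mel_power)
span = float(normalized.max() - normalized.min())
```

Because the clamp leaves at most 8 log10 units between the quietest and loudest values, and the last step divides by 4, the difference between the largest and smallest output never exceeds 2.0.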