Function Signature
pad_or_trim(
array: Union[np.ndarray, torch.Tensor],
length: int = N_SAMPLES,
*,
axis: int = -1
) -> Union[np.ndarray, torch.Tensor]
Pads or trims an audio array to a fixed length, as expected by the Whisper encoder. This ensures all audio inputs have consistent dimensions for batch processing.
Parameters
array
Union[np.ndarray, torch.Tensor]
required
The input audio array to pad or trim. Can be either a NumPy array or a PyTorch tensor.
length
int
default: N_SAMPLES
The target length for the array. Defaults to N_SAMPLES (480000), which corresponds to 30 seconds of audio at a 16 kHz sample rate.
axis
int
keyword-only, default: -1
The axis along which to pad or trim. Defaults to -1 (the last axis). This is a keyword-only argument.
Returns
result
Union[np.ndarray, torch.Tensor]
The padded or trimmed array. The return type matches the input: a NumPy array input returns a NumPy array, and a PyTorch tensor input returns a PyTorch tensor.
Example
import numpy as np
import torch
from whisper.audio import load_audio, pad_or_trim, N_SAMPLES
# Load audio (may be shorter or longer than 30 seconds)
audio = load_audio("speech.mp3")
print(audio.shape) # (unknown_length,)
# Pad or trim to exactly 30 seconds (480000 samples)
audio = pad_or_trim(audio)
print(audio.shape) # (480000,)
# Works with PyTorch tensors too
audio_tensor = torch.from_numpy(audio)
audio_tensor = pad_or_trim(audio_tensor)
print(audio_tensor.shape) # torch.Size([480000])
# Custom length
audio_10s = pad_or_trim(audio, length=16000 * 10) # 10 seconds
print(audio_10s.shape) # (160000,)
# Works on multi-dimensional arrays (specify axis)
batch_audio = np.random.randn(4, 100000) # 4 samples
batch_audio = pad_or_trim(batch_audio, length=N_SAMPLES, axis=1)
print(batch_audio.shape) # (4, 480000)
Behavior
When Array is Too Long (Trimming)
If array.shape[axis] > length, the array is trimmed:
- For PyTorch tensors: uses index_select to select the first length elements along the specified axis.
- For NumPy arrays: uses take with indices=range(length) along the specified axis.
# Example: 60 seconds of audio trimmed to 30 seconds
audio_60s = np.random.randn(960000) # 60 seconds
audio_30s = pad_or_trim(audio_60s) # Keeps first 30 seconds
print(audio_30s.shape) # (480000,)
When Array is Too Short (Padding)
If array.shape[axis] < length, the array is zero-padded on the right:
- For PyTorch tensors: uses F.pad to append zeros.
- For NumPy arrays: uses np.pad with zero padding.
# Example: 10 seconds of audio padded to 30 seconds
audio_10s = np.random.randn(160000) # 10 seconds
audio_30s = pad_or_trim(audio_10s) # Pads with 320000 zeros
print(audio_30s.shape) # (480000,)
When Array is Exact Length
If array.shape[axis] == length, the array is returned as-is.
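The three cases above can be summarized in a simplified NumPy-only sketch. This is for illustration, not the library's actual code: the real implementation also handles PyTorch tensors (via index_select and F.pad) and performs additional validation.

```python
import numpy as np

N_SAMPLES = 480000  # 30 seconds at 16 kHz

def pad_or_trim_sketch(array: np.ndarray, length: int = N_SAMPLES, *, axis: int = -1) -> np.ndarray:
    """Simplified NumPy-only sketch of pad_or_trim's documented behavior."""
    if array.shape[axis] > length:
        # Too long: keep the first `length` elements along `axis`.
        array = array.take(indices=range(length), axis=axis)
    elif array.shape[axis] < length:
        # Too short: append zeros on the right along `axis`.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)
    # Exact length: returned unchanged.
    return array

print(pad_or_trim_sketch(np.zeros(960000)).shape)  # (480000,)
print(pad_or_trim_sketch(np.zeros(160000)).shape)  # (480000,)
```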
Audio Constants
SAMPLE_RATE = 16000 # 16 kHz
CHUNK_LENGTH = 30 # 30 seconds
N_SAMPLES = 480000 # CHUNK_LENGTH * SAMPLE_RATE
HOP_LENGTH = 160 # Samples between frames
N_FRAMES = 3000 # Frames in mel spectrogram (N_SAMPLES / HOP_LENGTH)
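These constants are internally consistent, which a quick arithmetic check confirms:

```python
SAMPLE_RATE = 16000   # audio sample rate in Hz
CHUNK_LENGTH = 30     # chunk duration in seconds
HOP_LENGTH = 160      # samples between successive spectrogram frames

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # samples per 30-second chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH      # mel-spectrogram frames per chunk

print(N_SAMPLES, N_FRAMES)  # 480000 3000
```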
Multi-Dimensional Arrays
The function works on arrays of any dimensionality:
# 1D audio
audio_1d = np.random.randn(100000)
result = pad_or_trim(audio_1d, axis=-1)
print(result.shape) # (480000,)
# 2D batch of audio
audio_2d = np.random.randn(8, 200000) # 8 samples
result = pad_or_trim(audio_2d, axis=1)
print(result.shape) # (8, 480000)
# 3D spectrogram-like
audio_3d = torch.randn(2, 80, 1000) # batch, mels, frames
result = pad_or_trim(audio_3d, length=3000, axis=-1)
print(result.shape) # torch.Size([2, 80, 3000])
Integration with Preprocessing Pipeline
import whisper
# Complete preprocessing pipeline
model = whisper.load_model("base")
# Step 1: Load audio
audio = whisper.load_audio("speech.mp3")
# Step 2: Pad or trim to 30 seconds
audio = whisper.pad_or_trim(audio)
# Step 3: Compute log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Step 4: Detect language or transcribe
_, probs = model.detect_language(mel)
result = model.decode(mel)
Notes
- The function preserves the input type (NumPy or PyTorch) and, for tensors, the device.
- For PyTorch tensors on GPU, padding/trimming operations stay on the same device.
- Padding allocates a new array or tensor (via np.pad or F.pad); the input is left unmodified.
# GPU example
audio = torch.randn(100000).cuda()
audio_padded = pad_or_trim(audio)
print(audio_padded.device) # cuda:0
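Because padding goes through np.pad (or F.pad for tensors), the result is a newly allocated array and the input is not mutated. A quick NumPy illustration of that underlying behavior:

```python
import numpy as np

audio = np.ones(100)
padded = np.pad(audio, (0, 50))  # right-pad with 50 zeros, as pad_or_trim does for NumPy input

print(padded.shape)  # (150,)
print(audio.shape)   # (100,) -- the original array is unchanged
```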