Whisper
The main Whisper model class that performs speech recognition. Inherits from `torch.nn.Module`.
Constructor
`dims` (`ModelDimensions`) — Model dimensions configuration containing the architecture parameters
Attributes
`dims` — The model dimensions used to initialize the encoder and decoder
`encoder` — The audio encoder that processes mel spectrograms into audio features
`decoder` — The text decoder that generates text tokens from audio features
`alignment_heads` — Sparse boolean tensor indicating which attention heads to use for time alignment. By default, the heads of the last half of the decoder layers are used.
Properties
`device` — The device (CPU/GPU) where the model parameters are stored
`is_multilingual` — True if the model supports multiple languages (vocab size >= 51865)
`num_languages` — Number of languages supported by the model
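The two language-related properties can be sketched as plain functions. The constants below (the 51865 vocabulary threshold stated above, and a 51765 base offset for the non-language tokens) follow whisper's vocabulary layout, but treat them as assumptions for other model variants:

```python
# Sketch of the is_multilingual / num_languages logic described above.
# The constants are assumptions based on the vocabulary layout noted in this doc.
def is_multilingual(n_vocab: int) -> bool:
    # Multilingual checkpoints have at least 51865 tokens in their vocabulary.
    return n_vocab >= 51865

def num_languages(n_vocab: int) -> int:
    # Language tokens sit between the base text vocabulary and the special tokens.
    return n_vocab - 51765 - int(is_multilingual(n_vocab))

assert is_multilingual(51865) and num_languages(51865) == 99  # multilingual vocab
assert not is_multilingual(51864)                             # English-only vocab
```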
Methods
embed_audio
`mel` — Mel spectrogram with shape `(batch_size, n_mels, n_ctx)`
Returns: Encoded audio features with shape `(batch_size, n_audio_ctx, n_audio_state)`
logits
`tokens` — Text tokens with shape `(batch_size, seq_len)`
`audio_features` — Encoded audio features from `embed_audio()`
Returns: Logits for the next token with shape `(batch_size, seq_len, vocab_size)`
forward
`mel` — Mel spectrogram input
`tokens` — Text tokens
Returns: Decoder output logits
set_alignment_heads
`dump` — Base85-encoded, gzip-compressed boolean array specifying which attention heads to use
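A stdlib-only sketch of that dump format (assuming one byte per boolean flag, as in a NumPy bool array): the layer-by-head mask is flattened, gzip-compressed, then Base85-encoded, and decoding simply reverses the steps:

```python
import base64
import gzip

# Hypothetical 4-layer x 6-head mask; True marks a head used for alignment
# (here: the last half of the layers, matching the documented default).
n_layer, n_head = 4, 6
mask = [[layer >= n_layer // 2 for _ in range(n_head)] for layer in range(n_layer)]

# Encode: flatten to one byte per flag, gzip-compress, Base85-encode.
flat = bytes(int(flag) for row in mask for flag in row)
dump = base64.b85encode(gzip.compress(flat)).decode("ascii")

# Decode: reverse the steps and restore the (n_layer, n_head) shape.
raw = gzip.decompress(base64.b85decode(dump))
restored = [[bool(b) for b in raw[i * n_head:(i + 1) * n_head]] for i in range(n_layer)]
assert restored == mask
```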
install_kv_cache_hooks
`cache` — Existing cache dictionary to extend, or `None` to create a new cache
Returns: A dictionary mapping key/value projection modules to cached tensors, and a list of PyTorch hook handles that can be used to remove the hooks
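The hook mechanism can be illustrated with a generic PyTorch forward hook (a simplified pattern, not Whisper's exact implementation): each call appends the module's new output to a per-module cache along the sequence dimension, so later decoding steps reuse earlier keys/values.

```python
import torch
import torch.nn as nn

# Simplified KV-cache pattern: cache each projection's output so later
# decoding steps can reuse keys/values from earlier steps.
key_proj = nn.Linear(4, 4)
cache = {}  # maps projection modules to cached tensors

def save_to_cache(module, _inputs, output):
    if module not in cache:
        cache[module] = output
    else:
        cache[module] = torch.cat([cache[module], output], dim=1)
    return cache[module]  # returned value replaces the module's output

hooks = [key_proj.register_forward_hook(save_to_cache)]

# Two decoding "steps" of one token each: the cached sequence grows.
step1 = key_proj(torch.randn(1, 1, 4))
step2 = key_proj(torch.randn(1, 1, 4))
assert step1.shape == (1, 1, 4)
assert step2.shape == (1, 2, 4)

for hook in hooks:
    hook.remove()  # hook handles let you uninstall the cache later
```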
transcribe
Transcribe an entire audio file end to end (chunking, language detection, decoding); attached from `whisper.transcribe`
decode
Decode a 30-second mel segment or its audio features into text, controlled by `DecodingOptions`; attached from `whisper.decoding`
detect_language
Detect the spoken language from the audio; attached from `whisper.decoding`
ModelDimensions
Dataclass containing model architecture dimensions.
Fields
`n_mels` — Number of mel filterbank channels (typically 80)
`n_audio_ctx` — Audio context length: number of frames in the encoder
`n_audio_state` — Hidden dimension size of the audio encoder
`n_audio_head` — Number of attention heads in the audio encoder
`n_audio_layer` — Number of layers in the audio encoder
`n_vocab` — Vocabulary size; determines whether the model is multilingual (>= 51865)
`n_text_ctx` — Text context length: maximum sequence length for the decoder
`n_text_state` — Hidden dimension size of the text decoder
`n_text_head` — Number of attention heads in the text decoder
`n_text_layer` — Number of layers in the text decoder
Usage Examples
Loading a Model
Using Model Components
Creating Custom Model
Notes
- The `Whisper` class is typically loaded using `whisper.load_model()` rather than instantiated directly
- Models are available in sizes: tiny, base, small, medium, large
- Multilingual models have vocab size >= 51865 and support 99 languages
- English-only models are more accurate for English but don’t support other languages
- Use `model.to(device)` to move the model to GPU for faster inference
- The encoder processes 30-second audio chunks at a time
- KV cache hooks are used internally by the decode function for efficiency