ESPnet WER
espnet_wer_setup
Initialize ESPnet ASR model for WER calculation.
- ESPnet model identifier. Default uses espnet/simpleoier_librispeech_asr_train_asr_conformer7_wavlm_large_raw_en_bpe5000_sp
- Beam size for beam search decoding
- Text cleaning method to normalize transcripts
- Whether to use GPU acceleration
Returns: Dictionary containing model, text cleaner, and beam size for WER calculation
espnet_levenshtein_metric
Calculate WER and CER between reference and predicted transcripts.
- Utility dictionary from espnet_wer_setup() containing model and configuration
- Audio signal array with shape (time,)
- Reference transcript text
- Sampling rate in Hz (automatically resampled to 16 kHz if different)
Returns: Dictionary containing:
- espnet_hyp_text (str): Predicted transcript
- ref_text (str): Cleaned reference text
- espnet_wer_delete (int): Number of deletions (WER)
- espnet_wer_insert (int): Number of insertions (WER)
- espnet_wer_replace (int): Number of substitutions (WER)
- espnet_wer_equal (int): Number of correct words (WER)
- espnet_cer_delete (int): Number of deletions (CER)
- espnet_cer_insert (int): Number of insertions (CER)
- espnet_cer_replace (int): Number of substitutions (CER)
- espnet_cer_equal (int): Number of correct characters (CER)
OWSM WER
owsm_wer_setup
Initialize OWSM (Open Whisper-style Speech Model) for multilingual WER calculation.
- OWSM model identifier. Default uses espnet/owsm_v3.1_ebf
- Beam size for beam search decoding
- Text cleaning method to normalize transcripts
- Whether to use GPU acceleration
Returns: Dictionary containing model, text cleaner, and beam size for WER calculation
owsm_levenshtein_metric
Calculate WER and CER using OWSM model with automatic language detection and long-form support.
- Utility dictionary from owsm_wer_setup() containing model and configuration
- Audio signal array with shape (time,). Supports long-form audio > 30 seconds
- Reference transcript text
- Sampling rate in Hz (automatically resampled to 16 kHz if different)
Returns: Dictionary containing:
- owsm_hyp_text (str): Predicted transcript
- ref_text (str): Cleaned reference text
- owsm_wer_delete (int): Number of deletions (WER)
- owsm_wer_insert (int): Number of insertions (WER)
- owsm_wer_replace (int): Number of substitutions (WER)
- owsm_wer_equal (int): Number of correct words (WER)
- owsm_cer_delete (int): Number of deletions (CER)
- owsm_cer_insert (int): Number of insertions (CER)
- owsm_cer_replace (int): Number of substitutions (CER)
- owsm_cer_equal (int): Number of correct characters (CER)
OWSM automatically detects the language from the first 30 seconds and supports long-form decoding for audio longer than 30 seconds.
Whisper WER
whisper_wer_setup
Initialize OpenAI Whisper model for WER calculation.
- Whisper model size. Default is "large". Options: tiny, base, small, medium, large
- Beam size for beam search decoding
- Text cleaning method to normalize transcripts
- Whether to use GPU acceleration
Returns: Dictionary containing model, text cleaner, and beam size for WER calculation
whisper_levenshtein_metric
Calculate WER and CER using Whisper model with optional transcript caching.
- Utility dictionary from whisper_wer_setup() containing model and configuration
- Audio signal array with shape (time,)
- Reference transcript text
- Sampling rate in Hz (automatically resampled to 16 kHz if different)
- Pre-computed transcription to skip ASR inference. Useful when transcription is already available
Returns: Dictionary containing:
- whisper_hyp_text (str): Predicted transcript
- ref_text (str): Cleaned reference text
- whisper_wer_delete (int): Number of deletions (WER)
- whisper_wer_insert (int): Number of insertions (WER)
- whisper_wer_replace (int): Number of substitutions (WER)
- whisper_wer_equal (int): Number of correct words (WER)
- whisper_cer_delete (int): Number of deletions (CER)
- whisper_cer_insert (int): Number of insertions (CER)
- whisper_cer_replace (int): Number of substitutions (CER)
- whisper_cer_equal (int): Number of correct characters (CER)
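The pre-computed transcription option makes it possible to run ASR once per audio clip and reuse the text on later evaluations. A minimal sketch of such a cache follows. The helper cached_transcript and the transcribe callable are hypothetical names introduced for illustration, not part of this API, and how the cached text is then handed to whisper_levenshtein_metric is not shown here.

```python
import hashlib
from typing import Callable, Dict


def cached_transcript(
    audio_bytes: bytes,
    transcribe: Callable[[bytes], str],
    cache: Dict[str, str],
) -> str:
    """Return a transcript for the audio, running ASR only on a cache miss.

    `transcribe` is a hypothetical stand-in for whatever performs ASR inference.
    """
    # Key the cache on a hash of the raw audio bytes so identical clips share an entry.
    key = hashlib.sha256(audio_bytes).hexdigest()
    if key not in cache:
        cache[key] = transcribe(audio_bytes)
    return cache[key]
```

With a NumPy signal, the bytes could be obtained with the array's tobytes() method; the returned string is what would be supplied as the pre-computed transcription.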
Calculating WER Score
- ESPnet
- OWSM
- Whisper
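The metric functions documented above return edit counts rather than a rate. A minimal sketch of turning those counts into a score, assuming the standard definition (deletions + insertions + substitutions, divided by the number of reference units, which equals deletions + substitutions + matches). The helper error_rate is a hypothetical name, not part of this API; only the dictionary keys come from the documentation above.

```python
from typing import Dict, Iterable


def error_rate(
    results: Iterable[Dict[str, int]],
    prefix: str = "espnet",
    unit: str = "wer",
) -> float:
    """Aggregate edit counts over one or more utterances into an error rate.

    prefix: "espnet", "owsm", or "whisper"; unit: "wer" or "cer".
    """
    deletes = inserts = replaces = equals = 0
    for result in results:
        deletes += result[f"{prefix}_{unit}_delete"]
        inserts += result[f"{prefix}_{unit}_insert"]
        replaces += result[f"{prefix}_{unit}_replace"]
        equals += result[f"{prefix}_{unit}_equal"]

    # Every reference unit is either deleted, substituted, or matched.
    ref_len = deletes + replaces + equals
    if ref_len == 0:
        raise ValueError("reference is empty; error rate is undefined")
    return (deletes + inserts + replaces) / ref_len
```

Summing counts before dividing gives a corpus-level rate, which weights long utterances more heavily than averaging per-utterance rates would. Note that the rate can exceed 1.0 when the hypothesis contains many insertions.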
Understanding Edit Operations
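The Levenshtein distance metrics track four types of operations. The descriptions below are for WER; for CER the same operations are counted over characters instead of words.
- Delete: Words in the reference that are missing from the hypothesis
- Insert: Extra words in the hypothesis that are not in the reference
- Replace: Reference words substituted by different words in the hypothesis
- Equal: Correctly matched words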
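To make the four counts concrete, here is a minimal sketch of counting them with a textbook Levenshtein alignment over token sequences. It is an illustration only, not the implementation used by the metrics above, and count_edit_ops is a hypothetical name. When several alignments share the same minimal distance, the split between operation types depends on tie-breaking, so other tools may report different per-type counts with the same total.

```python
from typing import Dict, Sequence


def count_edit_ops(ref: Sequence[str], hyp: Sequence[str]) -> Dict[str, int]:
    """Count delete/insert/replace/equal operations aligning hyp to ref."""
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i reference tokens
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j hypothesis tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = {"delete": 0, "insert": 0, "replace": 0, "equal": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            ops["equal"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops["replace"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["delete"] += 1
            i -= 1
        else:
            ops["insert"] += 1
            j -= 1
    return ops
```

Passing text.split() gives word-level counts as used for WER; passing list(text) gives character-level counts as used for CER.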
All models automatically resample audio to 16kHz if a different sampling rate is provided.