Text Normalizers

The normalizers module provides utilities for normalizing and standardizing text output from Whisper models. These tools help compare transcriptions by removing formatting differences, converting numbers, and standardizing spellings.

BasicTextNormalizer

Basic text normalizer that lowercases text, removes bracketed and parenthesized content, replaces symbols and punctuation with spaces, and optionally removes diacritics.

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer()
text = normalizer("HELLO [background noise] (speaker coughs) café")
print(text)  # "hello café"

Initialization

remove_diacritics

bool

default:"False"

If True, removes diacritical marks from characters (e.g., “café” → “cafe”)

split_letters

bool

default:"False"

If True, splits text into individual Unicode grapheme clusters (useful for character-level analysis)

Normalization Process

The BasicTextNormalizer performs the following operations:

Converts text to lowercase
Removes content within angle brackets or square brackets: <...>, [...]
Removes content within parentheses: (...)
Replaces symbols and punctuation with spaces, and additionally strips diacritics if remove_diacritics=True
Optionally splits into individual letters/graphemes (if split_letters=True)
Collapses multiple whitespace characters into single spaces (leading and trailing whitespace is not stripped)
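
The steps above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the function name basic_normalize_sketch is hypothetical, it covers only the default settings (diacritics kept, no letter splitting), and it assumes that "symbols and punctuation" means the Unicode categories starting with M, S, or P.

```python
import re
import unicodedata


def basic_normalize_sketch(text: str) -> str:
    """Hypothetical helper approximating the default pipeline described above."""
    text = text.lower()
    # Drop anything inside <...> or [...]
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)
    # Drop anything inside (...)
    text = re.sub(r"\([^)]+?\)", "", text)
    # Replace marks, symbols, and punctuation (categories M*, S*, P*) with spaces.
    # NFKC keeps accented letters as single code points, so diacritics survive.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    # Collapse runs of whitespace into one space (no strip, so edges may keep a space)
    return re.sub(r"\s+", " ", text)


print(basic_normalize_sketch("HELLO [background noise] (speaker coughs) café"))
# hello café
```

Because the last step only collapses whitespace, an input that ends in punctuation or a removed bracket keeps a single trailing space.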

Examples

Basic Usage

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# Remove brackets and parentheses
text = normalizer("Hello [music] world (coughs)")
print(text)  # "hello world " (whitespace is collapsed; a trailing space remains)

# Lowercase conversion
text = normalizer("The QUICK Brown Fox")
print(text)  # "the quick brown fox"

# Preserve diacritics by default
text = normalizer("café résumé naïve")
print(text)  # "café résumé naïve"

With Diacritics Removal

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer(remove_diacritics=True)

# Remove diacritics
text = normalizer("café résumé naïve")
print(text)  # "cafe resume naive"

# Handle various diacritical marks
text = normalizer("Zürich Österreich Köln")
print(text)  # "zurich osterreich koln"

# Special character mappings
text = normalizer("Schloß Æsop œuvre ø")
print(text)  # "schloss aesop oeuvre o"

With Letter Splitting

from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer(split_letters=True)

# Split into individual characters
text = normalizer("Hello")
print(text)  # "h e l l o"

# Works with Unicode grapheme clusters
text = normalizer("café")
print(text)  # "c a f é"

EnglishTextNormalizer

Comprehensive English text normalizer that expands contractions, standardizes numbers, and normalizes spellings.

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
text = normalizer("I can't believe it's twenty-one dollars!")
print(text)  # "i can not believe it is $21"

Normalization Process

The EnglishTextNormalizer performs extensive normalization:

Converts text to lowercase
Removes content in brackets/parentheses: <...>, [...], (...)
Removes filler words: “hmm”, “mm”, “mhm”, “mmm”, “uh”, “um”
Expands contractions: “can’t” → “can not”, “won’t” → “will not”
Expands titles and abbreviations: “Mr.” → “mister”, “Dr.” → “doctor”
Standardizes numbers: converts spelled-out numbers to digits
Normalizes spellings: British to American English
Removes symbols and punctuation (keeping numeric symbols during processing)
Collapses whitespace
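
The filler-removal and contraction-expansion steps in the list above are essentially ordered regular-expression substitutions. The sketch below illustrates that idea under stated assumptions: expand_sketch and both tables are hypothetical, the tables are a small subset chosen for illustration, only ASCII apostrophes are handled, and the final strip() is a convenience of this sketch.

```python
import re

# Illustrative subset only; order matters, specific patterns come first.
FILLERS = r"\b(hmm|mm|mhm|mmm|uh|um)\b"
REPLACEMENTS = {
    r"\bwon't\b": "will not",
    r"\bcan't\b": "can not",
    r"\bgonna\b": "going to",
    r"n't\b": " not",
    r"'re\b": " are",
    r"'s\b": " is",
    r"'ll\b": " will",
    r"'m\b": " am",
}


def expand_sketch(text: str) -> str:
    """Hypothetical helper: lowercase, drop fillers, expand a few contractions."""
    text = text.lower()
    text = re.sub(FILLERS, "", text)
    # dicts preserve insertion order, so "can't" is handled before the generic "n't"
    for pattern, replacement in REPLACEMENTS.items():
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()


print(expand_sketch("uh I can't believe you won't come"))
# i can not believe you will not come
```

Putting the specific patterns ahead of the generic suffix rules is what keeps "won't" from being rewritten as "wo not".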

Examples

Contraction Expansion

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Common contractions
text = normalizer("I can't believe you won't come")
print(text)  # "i can not believe you will not come"

# Informal contractions
text = normalizer("I'm gonna wanna get some")
print(text)  # "i am going to want to get some"

# Perfect tenses ("'d been" becomes "had been"; a bare "'s" is expanded to "is")
text = normalizer("He'd been there, she's done it")
print(text)  # "he had been there she is done it"

Title and Abbreviation Expansion

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Titles
text = normalizer("Mr. Smith and Mrs. Jones")
print(text)  # "mister smith and missus jones"

# Professional titles
text = normalizer("Dr. Brown, Prof. White")
print(text)  # "doctor brown professor white"

# Military and government titles
text = normalizer("Gen. Lee, Sen. Adams, Rep. Carter")
print(text)  # "general lee senator adams representative carter"

Number Normalization

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Spelled-out numbers to digits
text = normalizer("twenty-one")
print(text)  # "21"

text = normalizer("one hundred and fifty-three")
print(text)  # "153"

text = normalizer("three million two hundred thousand")
print(text)  # "3200000"

# Currency
text = normalizer("twenty dollars and fifty cents")
print(text)  # "$20.50"

text = normalizer("five euros")
print(text)  # "€5"

# Ordinals
text = normalizer("the twenty-first century")
print(text)  # "the 21st century"

# Percentages
text = normalizer("fifty percent")
print(text)  # "50%"

Spelling Normalization

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# British to American English
text = normalizer("colour favour honour")
print(text)  # "color favor honor"

text = normalizer("organise recognise")
print(text)  # "organize recognize"

text = normalizer("centre theatre")
print(text)  # "center theater"

Complete Example

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Complex sentence with multiple normalizations
# ("he'd" expands to "he would" unless followed by been/gone/done)
text = normalizer(
    "Mr. Smith said he'd invested twenty-three thousand, "
    "five hundred dollars in the 1990s. [applause] "
    "That's approximately fifty-two percent of his income!"
)
print(text)
# "mister smith said he would invested $23500 in the 1990s that is approximately 52% of his income"

EnglishNumberNormalizer

Specialized normalizer for converting spelled-out numbers to Arabic numerals.

from whisper.normalizers.english import EnglishNumberNormalizer

normalizer = EnglishNumberNormalizer()
# Hyphens are not split by this class; EnglishTextNormalizer replaces them with spaces first
text = normalizer("twenty one")
print(text)  # "21"

Features

Converts spelled-out numbers to digits
Works on comma-free numbers (EnglishTextNormalizer removes commas between digits before this step)
Preserves suffixes: “1960s”, “21st”, “32nd”
Handles currency symbols: “$20 million” → “$20000000”, “twenty dollars” → “$20”
Interprets successive single digits: “one oh one” → “101”
Supports special patterns: “double three” → “33”, “triple five” → “555”
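
To make the last two features concrete, here is a deliberately narrow sketch of how successive spoken digits and the "double"/"triple" patterns can be joined into one string. It is not the library's algorithm: spoken_digits_sketch and its tables are hypothetical, and the sketch handles only single digits and repeat words, not tens, multipliers, or currency.

```python
# Spoken forms of single digits; "oh" and "o" are read as zero.
DIGITS = {
    "zero": "0", "oh": "0", "o": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
REPEATS = {"double": 2, "triple": 3}


def spoken_digits_sketch(phrase: str) -> str:
    """Hypothetical helper: join spoken digits, honoring double/triple."""
    out = []
    repeat = 1
    for word in phrase.lower().split():
        if word in REPEATS:
            # Remember the multiplier and apply it to the next digit
            repeat = REPEATS[word]
        elif word in DIGITS:
            out.append(DIGITS[word] * repeat)
            repeat = 1
        else:
            raise ValueError(f"not a spoken digit: {word!r}")
    return "".join(out)


print(spoken_digits_sketch("one oh one"))         # 101
print(spoken_digits_sketch("double zero seven"))  # 007
print(spoken_digits_sketch("triple five"))        # 555
```

The result is kept as a string rather than an integer so that leading zeros, as in "007", are preserved.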

Examples

Basic Number Conversion

from whisper.normalizers.english import EnglishNumberNormalizer

normalizer = EnglishNumberNormalizer()

# Simple numbers
print(normalizer("one"))          # "one" (literal)
print(normalizer("twenty"))       # "20"
print(normalizer("ninety nine"))  # "99" (hyphenated input such as "ninety-nine" is left unchanged)

# Large numbers
print(normalizer("one thousand"))              # "1000"
print(normalizer("five million"))              # "5000000"
print(normalizer("two billion"))               # "2000000000"
print(normalizer("three point five million"))  # "3500000"

Ordinals and Cardinals

from whisper.normalizers.english import EnglishNumberNormalizer

normalizer = EnglishNumberNormalizer()

# Ordinal numbers
print(normalizer("first"))        # "1st"
print(normalizer("second"))       # "2nd"
print(normalizer("third"))        # "3rd"
print(normalizer("twenty first")) # "21st"
print(normalizer("hundredth"))    # "100th"

# Plural forms
print(normalizer("twenties"))     # "20s"
print(normalizer("hundreds"))     # "100s"
print(normalizer("thousands"))    # "1000s"

Currency and Symbols

from whisper.normalizers.english import EnglishNumberNormalizer

normalizer = EnglishNumberNormalizer()

# Dollar amounts
print(normalizer("twenty dollars"))              # "$20"
print(normalizer("fifty cents"))                 # "¢50"
print(normalizer("twenty dollars and fifty cents"))  # "$20.50"

# Other currencies
print(normalizer("ten euros"))   # "€10"
print(normalizer("five pounds"))  # "£5"

# Percentages
print(normalizer("fifty percent"))      # "50%"
print(normalizer("ninety nine percent")) # "99%"

Special Patterns

from whisper.normalizers.english import EnglishNumberNormalizer

normalizer = EnglishNumberNormalizer()

# Phone numbers and codes
print(normalizer("one oh one"))          # "101"
print(normalizer("nine one one"))        # "911"
print(normalizer("double zero seven"))   # "007"

# Repetition patterns
print(normalizer("double three"))   # "33"
print(normalizer("triple five"))    # "555"

# Decimal numbers
print(normalizer("three point one four"))  # "3.14"
print(normalizer("zero point five"))       # "0.5"

# Fractions as decimals
print(normalizer("five and a half"))       # "5.5"
print(normalizer("ten and a half"))        # "10.5"

Signs and Prefixes

from whisper.normalizers.english import EnglishNumberNormalizer

normalizer = EnglishNumberNormalizer()

# Positive/negative
print(normalizer("minus five"))      # "-5"
print(normalizer("negative ten"))    # "-10"
print(normalizer("plus three"))      # "+3"
print(normalizer("positive seven"))  # "+7"

EnglishSpellingNormalizer

Normalizes British English spellings to American English.

from whisper.normalizers.english import EnglishSpellingNormalizer

normalizer = EnglishSpellingNormalizer()
text = normalizer("colour favour honour")
print(text)  # "color favor honor"

Examples

from whisper.normalizers.english import EnglishSpellingNormalizer

normalizer = EnglishSpellingNormalizer()

# -our to -or
print(normalizer("colour favour honour"))  # "color favor honor"

# -ise to -ize
print(normalizer("organise realise"))      # "organize realize"

# -re to -er
print(normalizer("centre theatre"))        # "center theater"

# -ll to -l
print(normalizer("travelled cancelled"))   # "traveled canceled"
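
Conceptually, this kind of spelling normalization is a whole-word lookup. The sketch below shows the idea with a tiny table built from the words in the examples above. The function name american_spelling_sketch and the table are hypothetical, and the real normalizer's mapping is assumed to be far larger.

```python
# Tiny illustrative table; entries are the words used in the examples above.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "favour": "favor",
    "honour": "honor",
    "organise": "organize",
    "realise": "realize",
    "centre": "center",
    "theatre": "theater",
    "travelled": "traveled",
    "cancelled": "canceled",
}


def american_spelling_sketch(text: str) -> str:
    """Hypothetical helper: replace each whitespace-separated word found in the table."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in text.split())


print(american_spelling_sketch("the colour of the theatre"))
# the color of the theater
```

A lookup like this only matches exact tokens, so text should already be lowercased and stripped of punctuation before it is applied.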

Utility Functions

remove_symbols_and_diacritics()

Removes symbols, punctuation, and diacritical marks from text.

from whisper.normalizers.basic import remove_symbols_and_diacritics

text = remove_symbols_and_diacritics("café, naïve!")
print(text)  # "cafe  naive "

text = remove_symbols_and_diacritics("$100.50", keep=".$")
print(text)  # "$100.50"

s

str

required

Input text to process

keep

str

default:""

String of characters to preserve (e.g., “.$” to keep dollar signs and periods)

return

str

Text with symbols and diacritics removed (replaced with spaces), except for characters in keep

remove_symbols()

Removes symbols and punctuation while preserving diacritics.

from whisper.normalizers.basic import remove_symbols

text = remove_symbols("café, naïve!")
print(text)  # "café  naïve "

text = remove_symbols("$100 (fifty)")
print(text)  # " 100  fifty "

s

str

required

Input text to process

return

str

Text with symbols and punctuation removed (replaced with spaces), but diacritics preserved

Use Cases

Comparing Transcriptions

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Compare two transcriptions
transcript1 = "Mr. Smith said he'd pay twenty-one dollars"
transcript2 = "Mister Smith said he would pay $21"

norm1 = normalizer(transcript1)
norm2 = normalizer(transcript2)

print(norm1)  # "mister smith said he would pay $21"
print(norm2)  # "mister smith said he would pay $21"
print(norm1 == norm2)  # True (title, contraction, and amount all normalize to the same form)

Calculating Word Error Rate (WER)

from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# Normalize reference and hypothesis before WER calculation
reference = "I can't believe it's the twenty-first century!"
hypothesis = "I cannot believe its the 21st century"

ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)

print("Reference:", ref_norm)
print("Hypothesis:", hyp_norm)

# Now calculate WER on normalized texts
# (WER calculation code here)
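
The WER step left as a placeholder above can be filled in with a word-level edit distance. The following is a minimal, dependency-free sketch. The function name word_error_rate is an assumption of this example and not part of the normalizers module, and WER is taken here as (substitutions + deletions + insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (can -> cannot) and one deletion (not) over 5 reference words
print(word_error_rate("i can not believe it", "i cannot believe it"))  # 0.4
```

The normalized strings from the example above (ref_norm and hyp_norm) can be passed to this function directly.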

Multilingual Text Cleaning

from whisper.normalizers import BasicTextNormalizer

# For non-English languages, use BasicTextNormalizer
normalizer = BasicTextNormalizer(remove_diacritics=False)

# Preserve diacritics in French (the apostrophe is punctuation, so it becomes a space)
french_text = normalizer("C'est un café très élégant [musique]")
print(french_text)  # "c est un café très élégant " (trailing space remains)

# Remove diacritics for ASCII compatibility
normalizer_ascii = BasicTextNormalizer(remove_diacritics=True)
ascii_text = normalizer_ascii("C'est un café très élégant")
print(ascii_text)  # "c est un cafe tres elegant"

Preprocessing for Search/Indexing

from whisper.normalizers import EnglishTextNormalizer, BasicTextNormalizer

# English content
english_normalizer = EnglishTextNormalizer()
query = english_normalizer("Dr. Smith's twenty-first presentation")
print(query)  # "doctor smith is 21st presentation" (possessive "'s" is expanded like a contraction)

# Non-English content
basic_normalizer = BasicTextNormalizer(remove_diacritics=True)
query = basic_normalizer("Zürich café résumé")
print(query)  # "zurich cafe resume"

Diacritics Mapping

The following special characters are mapped when remove_diacritics=True:

Character	Replacement
œ, Œ	oe, OE
ø, Ø	o, O
æ, Æ	ae, AE
ß, ẞ	ss, SS
đ, Đ	d, D
ð, Ð	d, D
þ, Þ	th, th
ł, Ł	l, L

Additionally, the text is decomposed with Unicode NFKD normalization, and all characters in the “Mn” (Mark, nonspacing) category are removed.
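
The table and the “Mn” rule can be reproduced with the standard unicodedata module. This is a sketch under stated assumptions rather than the library's code: strip_diacritics_sketch is a hypothetical name, the mapping is copied from the table above, and the sketch neither lowercases text nor touches punctuation.

```python
import unicodedata

# Letters that Unicode decomposition does not split into base + mark
# (copied from the table above).
SPECIAL = {
    "œ": "oe", "Œ": "OE", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE",
    "ß": "ss", "ẞ": "SS", "đ": "d", "Đ": "D", "ð": "d", "Ð": "D",
    "þ": "th", "Þ": "th", "ł": "l", "Ł": "L",
}


def strip_diacritics_sketch(text: str) -> str:
    """Hypothetical helper: apply the table, then drop nonspacing marks."""
    out = []
    # NFKD splits characters such as "é" into "e" plus a combining accent
    for ch in unicodedata.normalize("NFKD", text):
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif unicodedata.category(ch) == "Mn":
            continue  # combining accent: drop it, keep the base letter
        else:
            out.append(ch)
    return "".join(out)


print(strip_diacritics_sketch("café Zürich Øre Łódź"))
# cafe Zurich Ore Lodz
```

The explicit table is needed because letters such as “ø” and “ł” are single code points with no decomposition, so removing “Mn” marks alone would leave them unchanged.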
