The tokenizer module provides tools for encoding and decoding text using Whisper’s tiktoken-based tokenization system. It supports both multilingual and English-only models with language-specific tokens and special control tokens.
Tokenizer Class
A thin wrapper around tiktoken providing quick access to special tokens and language-specific encoding.
Initialization
from whisper.tokenizer import Tokenizer
import tiktoken
# Typically created via get_tokenizer() function
tokenizer = Tokenizer(
    encoding=encoding,
    num_languages=99,
    language="en",
    task="transcribe",
)
Parameters
encoding
tiktoken.Encoding
required
The underlying tiktoken encoding instance
num_languages
int
Number of languages supported by this tokenizer (typically 99)
language
Optional[str]
The language code (e.g., “en”, “fr”, “es”) for this tokenizer instance
task
Optional[str]
The task type: either “transcribe” or “translate”
sot_sequence
Tuple[int]
Start-of-transcript token sequence (automatically generated in __post_init__)
special_tokens
Dict[str, int]
default:"{}"
Dictionary mapping special token strings to their token IDs (automatically populated)
Methods
encode()
Encode text into a list of token IDs.
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # e.g. [15496, 11, 1917, 0] (exact IDs depend on the vocabulary)
Parameters
text
str
The text to encode
**kwargs
Additional keyword arguments passed to the underlying tiktoken encoding
Returns
List[int]
List of token IDs representing the encoded text
decode()
Decode token IDs back into text, filtering out timestamp tokens.
text = tokenizer.decode([15496, 11, 1917, 0])
print(text)  # "Hello, world!"
Parameters
token_ids
List[int]
List of token IDs to decode
**kwargs
Additional keyword arguments passed to the underlying tiktoken decoder
Returns
str
The decoded text with timestamp tokens filtered out (tokens >= timestamp_begin are removed)
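The filtering rule can be sketched in plain Python. The IDs below are illustrative (50364 is the multilingual vocabulary's timestamp_begin; 2425 stands in for an ordinary text token); this is a sketch of the rule, not the library's code:

```python
def filter_timestamp_tokens(token_ids, timestamp_begin):
    """Drop timestamp tokens before decoding, mirroring decode()'s filtering rule."""
    return [t for t in token_ids if t < timestamp_begin]

# 50364 and 50414 are timestamp tokens here; only the text token survives.
print(filter_timestamp_tokens([50364, 2425, 50414], timestamp_begin=50364))  # [2425]
```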
decode_with_timestamps()
Decode token IDs including timestamp annotations.
text = tokenizer.decode_with_timestamps(token_ids)
# Output: "Hello <|1.08|> world <|2.34|>"
Parameters
tokens
List[int]
List of token IDs to decode
**kwargs
Additional keyword arguments passed to the underlying tiktoken decoder
Returns
str
The decoded text with timestamp tokens rendered in the <|1.08|> format
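Timestamp tokens advance in 0.02-second increments from <|0.00|>, so the annotation for any one token can be sketched as below. The default of 50364 for timestamp_begin is the multilingual vocabulary's value, assumed here for illustration:

```python
def render_timestamp(token_id, timestamp_begin=50364):
    """Format a timestamp token as <|s.ss|>; each token is one 0.02 s step."""
    seconds = (token_id - timestamp_begin) * 0.02
    return f"<|{seconds:.2f}|>"

print(render_timestamp(50364))  # <|0.00|>
print(render_timestamp(50418))  # <|1.08|>
```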
to_language_token()
Convert a language code to its corresponding token ID.
token_id = tokenizer.to_language_token("fr")
print(token_id)  # Token ID for French
Parameters
language
str
Language code (e.g., “en”, “fr”, “es”)
Returns
int
Token ID corresponding to the language
Raises: KeyError if the language is not found in the tokenizer.
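The lookup amounts to checking the special-token table for a <|{language}|> entry. A minimal sketch with a tiny illustrative table (the IDs are assumptions for demonstration; real IDs come from the tokenizer):

```python
# Illustrative subset of the special-token table, not the real mapping.
SPECIAL_TOKENS = {"<|en|>": 50259, "<|fr|>": 50265}

def to_language_token(language: str) -> int:
    """Look up the token ID for a language code, raising KeyError when unknown."""
    token = SPECIAL_TOKENS.get(f"<|{language}|>")
    if token is None:
        raise KeyError(f"Language {language!r} not found")
    return token

print(to_language_token("fr"))  # 50265
```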
split_to_word_tokens()
Split tokens into word-level tokens based on language-specific rules.
words, word_tokens = tokenizer.split_to_word_tokens(token_ids)
Parameters
tokens
List[int]
List of token IDs to split
Returns
Tuple[List[str], List[List[int]]]
A tuple containing:
- List of decoded words
- List of token ID lists corresponding to each word
Note: For languages written without spaces (Chinese, Japanese, Thai, Lao, Burmese, Cantonese), splitting is Unicode-based; for other languages, it is space-based.
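For space-delimited languages, the grouping step can be sketched as follows: a decoded subword piece that starts with a space opens a new word, and any other piece is appended to the current word. This is a simplified illustration (the library's version also handles punctuation and special tokens), with made-up subword pieces and token IDs:

```python
def split_word_tokens_by_space(subwords, subword_tokens):
    """Group decoded subword pieces into words; a leading space starts a new word."""
    words, word_tokens = [], []
    for piece, toks in zip(subwords, subword_tokens):
        if not words or piece.startswith(" "):
            words.append(piece)
            word_tokens.append(list(toks))
        else:
            words[-1] += piece
            word_tokens[-1].extend(toks)
    return words, word_tokens

w, t = split_word_tokens_by_space([" Hel", "lo", " world"], [[1], [2], [3]])
print(w)  # [' Hello', ' world']
print(t)  # [[1, 2], [3]]
```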
Special Token Properties
All special token properties are cached for performance.
eot
token_id = tokenizer.eot
End-of-transcript token ID for <|endoftext|>
sot
token_id = tokenizer.sot
Start-of-transcript token ID for <|startoftranscript|>
transcribe
token_id = tokenizer.transcribe
Transcribe task token ID for <|transcribe|>
translate
token_id = tokenizer.translate
Translate task token ID for <|translate|>
sot_lm
token_id = tokenizer.sot_lm
Start-of-language-model token ID for <|startoflm|>
sot_prev
token_id = tokenizer.sot_prev
Start-of-previous token ID for <|startofprev|>
no_speech
token_id = tokenizer.no_speech
No-speech token ID for <|nospeech|>
no_timestamps
token_id = tokenizer.no_timestamps
No-timestamps token ID for <|notimestamps|>
timestamp_begin
token_id = tokenizer.timestamp_begin
Token ID for the first timestamp token <|0.00|>
language_token
token_id = tokenizer.language_token
Token ID for the language configured in this tokenizer instance
Raises: ValueError if no language is configured.
all_language_tokens
tokens = tokenizer.all_language_tokens
Tuple of all language token IDs supported by this tokenizer
all_language_codes
codes = tokenizer.all_language_codes
Tuple of all language codes (e.g., “en”, “fr”, “es”) supported by this tokenizer
sot_sequence_including_notimestamps
sequence = tokenizer.sot_sequence_including_notimestamps
The start-of-transcript sequence with the no-timestamps token appended
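The construction amounts to taking the sot_sequence (start token, language token, task token) and appending the no-timestamps token. The IDs below are illustrative values from the multilingual vocabulary, assumed here for demonstration:

```python
# Assumed multilingual IDs: <|startoftranscript|>, <|en|>, <|transcribe|>, <|notimestamps|>
SOT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS = 50258, 50259, 50359, 50363

sot_sequence = (SOT, LANG_EN, TRANSCRIBE)
sot_sequence_including_notimestamps = sot_sequence + (NO_TIMESTAMPS,)
print(sot_sequence_including_notimestamps)  # (50258, 50259, 50359, 50363)
```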
non_speech_tokens
tokens = tokenizer.non_speech_tokens
Tuple of token IDs for non-speech annotations (e.g., speaker tags, music symbols) that should be suppressed during generation
Includes tokens for:
- Music notation: ♪♪♪
- Speaker tags: [DAVID]
- Stage directions: (SPEAKING FOREIGN LANGUAGE)
- Various symbols and brackets
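A common way to suppress such tokens during generation is to mask their logits to negative infinity before sampling. This is a minimal sketch of that idea over a plain Python list, not the library's actual decoding code:

```python
import math

def suppress(logits, token_ids):
    """Set the logit of each suppressed token to -inf so it can never be sampled."""
    for t in token_ids:
        logits[t] = -math.inf
    return logits

logits = [0.1, 0.5, 0.2, 0.9]
suppress(logits, [1, 3])
print(logits)  # [0.1, -inf, 0.2, -inf]
```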
get_tokenizer()
Factory function to create a tokenizer instance for Whisper models.
from whisper.tokenizer import get_tokenizer
# For multilingual model
tokenizer = get_tokenizer(
    multilingual=True,
    language="en",
    task="transcribe",
)
# For English-only model
tokenizer = get_tokenizer(multilingual=False)
Parameters
multilingual
bool
Whether to use the multilingual tokenizer (True) or the English-only tokenizer (False)
num_languages
int
Number of languages to support (only relevant for multilingual models)
language
Optional[str]
Language code or name (e.g., “en”, “english”, “fr”, “french”). If multilingual=True and language is None, defaults to “en”
task
Optional[str]
Task type: “transcribe” or “translate”. If multilingual=True and task is None, defaults to “transcribe”
Returns
Tokenizer
Configured tokenizer instance
Raises: ValueError if an unsupported language is provided.
Language Code Resolution
The function accepts both language codes and full language names:
# These are equivalent
tokenizer1 = get_tokenizer(multilingual=True, language="en")
tokenizer2 = get_tokenizer(multilingual=True, language="english")
# Language aliases are also supported
tokenizer3 = get_tokenizer(multilingual=True, language="mandarin") # -> "zh"
tokenizer4 = get_tokenizer(multilingual=True, language="castilian") # -> "es"
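The resolution logic can be sketched as: lowercase the input, accept it if it is already a code, otherwise fall back to the name/alias table, and raise ValueError if neither matches. The tables below are small illustrative subsets of LANGUAGES and TO_LANGUAGE_CODE:

```python
# Illustrative subsets; the real tables cover 99 languages plus aliases.
CODES = {"en": "english", "zh": "chinese", "es": "spanish"}
NAMES = {"english": "en", "mandarin": "zh", "castilian": "es"}

def resolve_language(language: str) -> str:
    """Resolve a code, full name, or alias to a language code."""
    language = language.lower()
    if language in CODES:
        return language
    if language in NAMES:
        return NAMES[language]
    raise ValueError(f"Unsupported language: {language}")

print(resolve_language("mandarin"))  # zh
print(resolve_language("EN"))        # en
```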
Language Constants
LANGUAGES
Dictionary mapping language codes to language names.
from whisper.tokenizer import LANGUAGES
print(LANGUAGES["en"]) # "english"
print(LANGUAGES["fr"]) # "french"
print(len(LANGUAGES)) # 99
Dictionary with 99 language-code-to-name mappings
Supported languages include:
- Western European: English, French, German, Spanish, Italian, Portuguese, Dutch, etc.
- Eastern European: Russian, Polish, Czech, Ukrainian, Romanian, etc.
- Asian: Chinese, Japanese, Korean, Hindi, Thai, Vietnamese, etc.
- Middle Eastern: Arabic, Hebrew, Persian, Turkish, Urdu, etc.
- African: Swahili, Afrikaans, Amharic, Hausa, Yoruba, etc.
- Others: Latin, Sanskrit, Hawaiian, Maori, etc.
TO_LANGUAGE_CODE
Dictionary for looking up language codes by name, including aliases.
from whisper.tokenizer import TO_LANGUAGE_CODE
print(TO_LANGUAGE_CODE["english"]) # "en"
print(TO_LANGUAGE_CODE["mandarin"]) # "zh"
print(TO_LANGUAGE_CODE["burmese"]) # "my"
print(TO_LANGUAGE_CODE["castilian"]) # "es"
Dictionary mapping language names and aliases to their codes
Includes aliases such as:
- “burmese” → “my”
- “mandarin” → “zh”
- “castilian” → “es”
- “flemish” → “nl”
- “haitian” → “ht”
- “moldavian”/“moldovan” → “ro”
- “sinhalese” → “si”
Examples
Basic Transcription Setup
from whisper.tokenizer import get_tokenizer
# Create tokenizer for English transcription
tokenizer = get_tokenizer(
    multilingual=True,
    language="en",
    task="transcribe",
)
# Encode text
tokens = tokenizer.encode("Hello, how are you?")
print(f"Tokens: {tokens}")
# Get start-of-transcript sequence
print(f"SOT sequence: {tokenizer.sot_sequence}")
print(f"With no_timestamps: {tokenizer.sot_sequence_including_notimestamps}")
Translation Task
from whisper.tokenizer import get_tokenizer
# Create tokenizer for French to English translation
tokenizer = get_tokenizer(
    multilingual=True,
    language="fr",
    task="translate",
)
print(f"Language token: {tokenizer.language_token}")
print(f"Task token: {tokenizer.translate}")
print(f"SOT sequence: {tokenizer.sot_sequence}")
Working with Special Tokens
from whisper.tokenizer import get_tokenizer
tokenizer = get_tokenizer(multilingual=True)
# Get all special tokens
print(f"All special tokens: {tokenizer.special_tokens}")
# Access specific special tokens
print(f"EOT: {tokenizer.eot}")
print(f"No speech: {tokenizer.no_speech}")
print(f"Timestamp begin: {tokenizer.timestamp_begin}")
# Get non-speech tokens for suppression
suppress_tokens = tokenizer.non_speech_tokens
print(f"Tokens to suppress: {suppress_tokens[:10]}...") # First 10
Decoding with Timestamps
from whisper.tokenizer import get_tokenizer
tokenizer = get_tokenizer(multilingual=True)
# Token IDs including timestamp tokens
token_ids = [50364, 2425, 50414, 1917, 50464]
# Decode without timestamps (filters them out)
text = tokenizer.decode(token_ids)
print(f"Without timestamps: {text}")
# Decode with timestamp annotations
text_with_ts = tokenizer.decode_with_timestamps(token_ids)
print(f"With timestamps: {text_with_ts}")
Multi-Language Support
from whisper.tokenizer import get_tokenizer, LANGUAGES, TO_LANGUAGE_CODE
# List all supported languages
print(f"Total languages: {len(LANGUAGES)}")
for code, name in list(LANGUAGES.items())[:5]:
    print(f"  {code}: {name}")
# Get tokenizer for different languages
for lang_code in ["en", "fr", "es", "ja", "zh"]:
    tokenizer = get_tokenizer(multilingual=True, language=lang_code)
    lang_name = LANGUAGES[lang_code]
    print(f"{lang_name}: token ID = {tokenizer.language_token}")
# Use language aliases
tokenizer = get_tokenizer(multilingual=True, language="mandarin")
print(f"Mandarin tokenizer language: {tokenizer.language}")
Word-Level Tokenization
from whisper.tokenizer import get_tokenizer
# English (space-separated)
tokenizer_en = get_tokenizer(multilingual=True, language="en")
token_ids = tokenizer_en.encode("Hello world")
words, word_tokens = tokenizer_en.split_to_word_tokens(token_ids)
print(f"English words: {words}")
print(f"Word tokens: {word_tokens}")
# Chinese (no spaces)
tokenizer_zh = get_tokenizer(multilingual=True, language="zh")
token_ids = tokenizer_zh.encode("你好世界")
words, word_tokens = tokenizer_zh.split_to_word_tokens(token_ids)
print(f"Chinese words: {words}")
print(f"Word tokens: {word_tokens}")