TokenizerBase

The TokenizerBase class defines the tokenizer interface used by TensorRT-LLM. It extends the Hugging Face PreTrainedTokenizerBase protocol.

Overview

TensorRT-LLM uses tokenizers to convert text to token IDs (encoding) and token IDs back to text (decoding). The default implementation wraps Hugging Face transformers tokenizers.
from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Access the tokenizer
tokenizer = llm.tokenizer

# Encode text
token_ids = tokenizer.encode("Hello, world!")
print(token_ids)  # e.g. [128000, 9906, 11, 1917, 0] (exact IDs depend on the tokenizer)

# Decode token IDs (skip special tokens such as BOS)
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text)  # "Hello, world!"

Loading Tokenizers

Automatic Loading

Tokenizers are automatically loaded when creating an LLM instance:
# Load from Hugging Face Hub
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Load from local directory
llm = LLM(model="/path/to/model")

# Custom tokenizer path
llm = LLM(
    model="/path/to/model",
    tokenizer="/path/to/tokenizer"
)

Skip Tokenizer Initialization

If you plan to work with token IDs directly:
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    skip_tokenizer_init=True
)

# tokenizer will be None
assert llm.tokenizer is None

# Provide token IDs directly
token_ids = [128000, 9906, 11, 1917, 0]  # "Hello, world!" for the Llama 3.1 tokenizer
output = llm.generate(token_ids)  # output contains token IDs only; decode them yourself

Using a Pre-loaded Tokenizer

from transformers import AutoTokenizer
from tensorrt_llm import LLM

# Load tokenizer separately
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Pass to LLM
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tokenizer=tokenizer
)

Core Methods

encode()

Convert text to token IDs.
token_ids = tokenizer.encode(
    "Hello, world!",
    add_special_tokens=True
)
print(token_ids)  # e.g. [128000, 9906, 11, 1917, 0]

Parameters

text
str
required
Input text to encode.
add_special_tokens
bool
default:"True"
Whether to add special tokens (BOS, EOS) during encoding.

Returns

token_ids
List[int]
List of token IDs.

decode()

Convert token IDs back to text.
text = tokenizer.decode(
    [128000, 9906, 11, 1917, 0],
    skip_special_tokens=True
)
print(text)  # "Hello, world!"

Parameters

token_ids
List[int]
required
Token IDs to decode.
skip_special_tokens
bool
default:"False"
Whether to remove special tokens (BOS, EOS, PAD) from output.
spaces_between_special_tokens
bool
default:"True"
Whether to add spaces between special tokens in output.

Returns

text
str
Decoded text.

batch_encode_plus()

Encode multiple texts in a batch.
encoded = tokenizer.batch_encode_plus(
    ["Hello, world!", "How are you?"],
    padding=True,
    return_tensors="pt"
)
print(encoded["input_ids"])

Parameters

texts
List[str]
required
List of texts to encode.

Returns

encoded
dict
Dictionary containing:
  • input_ids: Token IDs
  • attention_mask: Attention mask
  • Other tokenizer-specific outputs
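
The relationship between padding and attention_mask can be sketched in plain Python. This is a toy illustration of the output shape, not the tokenizer's actual implementation (real tokenizers also handle subword merging, special tokens, and tensor conversion); toy_batch_encode is a hypothetical helper:

```python
def toy_batch_encode(batches, pad_id=0):
    # Right-pad every sequence to the longest one in the batch.
    max_len = max(len(ids) for ids in batches)
    input_ids, attention_mask = [], []
    for ids in batches:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)              # padded token IDs
        attention_mask.append([1] * len(ids) + [0] * pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_batch_encode([[5, 6, 7], [8, 9]])
print(encoded["input_ids"])       # [[5, 6, 7], [8, 9, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask lets the model ignore padding positions, which is why batch_encode_plus returns both fields together.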

decode_incrementally()

Incremental decoding for streaming generation. This method is optimized for streaming scenarios where tokens are generated one at a time.
prev_text = ""
states = None

for new_token_ids in streaming_tokens:
    text, states = tokenizer.decode_incrementally(
        new_token_ids,
        prev_text=prev_text,
        states=states,
        skip_special_tokens=True
    )
    print(text[len(prev_text):], end="", flush=True)  # Print only new text
    prev_text = text

Parameters

token_ids
List[int]
required
Incremental token IDs to decode.
prev_text
str
default:"None"
Previously decoded text. None for first iteration.
states
dict
default:"None"
Internal decoding state from previous iteration. None for first iteration.
flush
bool
default:"False"
Force flush pending tokens to output.
skip_special_tokens
bool
default:"False"
Whether to skip special tokens in output.
spaces_between_special_tokens
bool
default:"True"
Whether to add spaces between special tokens.
stream_interval
int
default:"1"
Iteration interval for streaming updates.

Returns

result
Tuple[str, dict]
Tuple of:
  • text: Current decoded text
  • states: Updated decoding state (pass to next iteration)
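
The streaming pattern above can be sketched in plain Python. Here decode_stub is a hypothetical stand-in for the tokenizer (a toy four-entry vocabulary); the point is the loop shape: carry the full decoded text and opaque state forward, and emit only the newly added suffix each iteration:

```python
def decode_stub(new_token_ids, prev_text, states):
    # Toy decoder: maps IDs straight to strings. A real tokenizer would use
    # states to buffer partial byte sequences until they form valid text.
    vocab = {1: "Hello", 2: ",", 3: " world", 4: "!"}
    text = prev_text + "".join(vocab[t] for t in new_token_ids)
    return text, states

prev_text, states = "", None
stream = [[1], [2, 3], [4]]   # token IDs arriving in chunks
chunks = []
for new_token_ids in stream:
    text, states = decode_stub(new_token_ids, prev_text, states)
    chunks.append(text[len(prev_text):])  # emit only the new suffix
    prev_text = text

print("".join(chunks))  # Hello, world!
```

This is why decode_incrementally returns the full text plus a state object rather than just the new fragment: the state lets it defer emitting bytes that do not yet form complete characters.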

Properties

eos_token_id
int
End-of-sequence token ID.
print(tokenizer.eos_token_id)  # model-specific, e.g. 128009 for Llama 3.1 Instruct
pad_token_id
int
Padding token ID.
print(tokenizer.pad_token_id)  # may be None if the model defines no pad token
name_or_path
str
Model name or path the tokenizer was loaded from.
print(tokenizer.name_or_path)  # "meta-llama/Llama-3.1-8B-Instruct"
is_fast
bool
Whether this is a fast (Rust-based) tokenizer.
print(tokenizer.is_fast)  # True

Chat Templates

apply_chat_template()

Format a conversation using the model’s chat template.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you explain more?"},
]

prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)

output = llm.generate(prompt)

Parameters

conversation
List[Dict[str, str]]
required
List of message dictionaries with "role" and "content" keys.
tokenize
bool
default:"True"
If True, return token IDs. If False, return formatted string.
add_generation_prompt
bool
default:"False"
Add prompt for the next assistant message.

Returns

result
str | List[int]
Formatted prompt as string or token IDs (depending on tokenize parameter).
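
To make the formatting concrete, here is a rough sketch of what a chat template does, using generic role markers. This is an illustration only: the actual Llama 3.1 template uses different special tokens (such as header and end-of-turn markers), and render_generic_template is a hypothetical helper, not part of the API:

```python
def render_generic_template(conversation, add_generation_prompt=False):
    # Wrap each message in a role marker, one turn per block.
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in conversation]
    if add_generation_prompt:
        # Open an assistant turn so the model knows to answer next.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
]
print(render_generic_template(messages, add_generation_prompt=True))
```

Without add_generation_prompt, the rendered string ends after the last user message and the model may continue that message instead of answering it.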

Custom Tokenizers

You can implement custom tokenizers by inheriting from TokenizerBase:
from tensorrt_llm.llmapi.tokenizer import TokenizerBase
from typing import List

class CustomTokenizer(TokenizerBase):
    def __init__(self, vocab_file: str):
        # Load your custom vocabulary: token string -> token ID
        self.vocab = self._load_vocab(vocab_file)
        # Inverse mapping for decoding: token ID -> token string
        self.vocab_inv = {tid: tok for tok, tid in self.vocab.items()}
        self._eos_token_id = 2
        self._pad_token_id = 0

    def _load_vocab(self, vocab_file: str) -> dict:
        # Example format: one token per line; the line number is the token ID
        with open(vocab_file) as f:
            return {line.strip(): i for i, line in enumerate(f)}

    @property
    def eos_token_id(self) -> int:
        return self._eos_token_id

    @property
    def pad_token_id(self) -> int:
        return self._pad_token_id

    def encode(self, text: str, **kwargs) -> List[int]:
        # Your encoding logic (here: naive whitespace splitting)
        return [self.vocab.get(word, 0) for word in text.split()]

    def decode(self, token_ids: List[int], **kwargs) -> str:
        # Your decoding logic
        return " ".join(self.vocab_inv.get(tid, "<unk>") for tid in token_ids)

# Use custom tokenizer
custom_tokenizer = CustomTokenizer("vocab.txt")
llm = LLM(
    model="/path/to/model",
    tokenizer=custom_tokenizer
)


TransformersTokenizer

The default tokenizer implementation that wraps Hugging Face transformers tokenizers:
from tensorrt_llm.llmapi.tokenizer import TransformersTokenizer
from transformers import AutoTokenizer

# Create from HF tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = TransformersTokenizer(hf_tokenizer)

# Or use from_pretrained class method
tokenizer = TransformersTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    trust_remote_code=True
)

Utility Functions

load_hf_tokenizer()

Load a Hugging Face tokenizer directly:
from tensorrt_llm.llmapi.tokenizer import load_hf_tokenizer

tokenizer = load_hf_tokenizer(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    trust_remote_code=True,
    use_fast=True
)

Environment Variables

TLLM_INCREMENTAL_DETOKENIZATION_BACKEND
str
default:"'HF'"
Backend for incremental detokenization:
  • 'HF': Use Hugging Face tokenizers backend (faster for small stream intervals)
  • 'TRTLLM': Use TensorRT-LLM backend
TLLM_STREAM_INTERVAL_THRESHOLD
int
default:"24"
Threshold for switching between HF and TRTLLM incremental detokenization backends.
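
These variables are read from the process environment, so they must be set before the detokenization path is first used. A minimal sketch (the values shown are illustrative, not recommendations):

```python
import os

# Select the incremental detokenization backend and streaming threshold
# before constructing the LLM, so the settings take effect.
os.environ["TLLM_INCREMENTAL_DETOKENIZATION_BACKEND"] = "TRTLLM"
os.environ["TLLM_STREAM_INTERVAL_THRESHOLD"] = "32"

# from tensorrt_llm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
```

Setting them in the shell (export TLLM_INCREMENTAL_DETOKENIZATION_BACKEND=TRTLLM) before launching the process works equally well.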

Usage Examples

Basic Encoding and Decoding

from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer = llm.tokenizer

# Encode
text = "The quick brown fox jumps over the lazy dog"
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")

# Decode
decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
print(f"Decoded: {decoded}")

Streaming with Incremental Decoding

from tensorrt_llm import LLM
from tensorrt_llm.sampling_params import SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer = llm.tokenizer

# Generate with streaming
future = llm.generate_async(
    "Write a poem about AI",
    sampling_params=SamplingParams(max_tokens=200),
    streaming=True
)

for partial_output in future:
    # text_diff uses incremental decoding internally
    new_text = partial_output.outputs[0].text_diff
    print(new_text, end="", flush=True)

Chat Template Formatting

from tensorrt_llm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

# Format with chat template
prompt = llm.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print("Formatted prompt:")
print(prompt)

# Generate response
output = llm.generate(prompt)
print("\nResponse:")
print(output.outputs[0].text)

Batch Encoding

texts = [
    "Hello, world!",
    "How are you today?",
    "Machine learning is fascinating."
]

encoded = tokenizer.batch_encode_plus(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

print(f"Input IDs shape: {encoded['input_ids'].shape}")
print(f"Attention mask shape: {encoded['attention_mask'].shape}")
