Tokenization is the process of breaking down text into tokens that models can process. Understanding token counts is important for managing context windows and API costs.
Count tokens
Count the number of tokens in your content:
```python
from google import genai

client = genai.Client()

response = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents='why is the sky blue?',
)
print(response)
```
The response includes:
- total_tokens - Total number of tokens in the content
- cached_content_token_count - Number of tokens served from cached content (if applicable)
Compute tokens
Compute tokens is supported only on Vertex AI.
The compute_tokens method provides more detailed token information:
```python
response = client.models.compute_tokens(
    model='gemini-2.5-flash',
    contents='why is the sky blue?',
)
print(response)
```
This returns additional details about the tokenization, including the token IDs and the corresponding token strings.
Async token counting
Use async methods for non-blocking token counting:
```python
response = await client.aio.models.count_tokens(
    model='gemini-2.5-flash',
    contents='why is the sky blue?',
)
print(response)
```
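The async variant is most useful when counting tokens for many documents at once, since the calls can run concurrently. A minimal sketch of that fan-out pattern with asyncio.gather, using a hypothetical stand-in coroutine in place of the real client.aio.models.count_tokens call:

```python
import asyncio

async def count_tokens_stub(text: str) -> int:
    # Stand-in for client.aio.models.count_tokens; here we simply
    # approximate one token per whitespace-separated word.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return len(text.split())

async def count_all(documents: list[str]) -> list[int]:
    # Fan out the counts concurrently instead of awaiting one at a time.
    return await asyncio.gather(*(count_tokens_stub(doc) for doc in documents))

counts = asyncio.run(count_all(['why is the sky blue?', 'hello world']))
print(counts)  # [5, 2]
```

With the real client, each stub call would be replaced by an awaited count_tokens request, and gather would overlap the network round trips.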
Local tokenizer
For offline token counting without making API calls, use the local tokenizer:
```python
from google.genai import local_tokenizer

tokenizer = local_tokenizer.LocalTokenizer(model_name='gemini-2.5-flash')
result = tokenizer.count_tokens("What is your name?")
print(result)
```
Local compute tokens
Compute detailed token information locally:
```python
from google.genai import local_tokenizer

tokenizer = local_tokenizer.LocalTokenizer(model_name='gemini-2.5-flash')
result = tokenizer.compute_tokens("What is your name?")
print(result)
```
The local tokenizer:
- Works offline without API calls
- Provides faster token counting
- Useful for preprocessing and validation
- Produces counts consistent with what the API returns for text content
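When neither the API nor the local tokenizer is available, a rough character-based heuristic can serve as a first-pass estimate. The sketch below assumes roughly four characters per token, a common rule of thumb for English text; real counts vary by model and language, so verify with count_tokens or the local tokenizer before relying on the number:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic: English text averages ~4 characters per token.
    # Use only as a coarse pre-check, never for billing or hard limits.
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens('why is the sky blue?'))  # 5
```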
Token counting for different content types
Count tokens for various content types:
```python
from google.genai import types

# Text content
response = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents='Hello, world!',
)

# Multimodal content
response = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=[
        types.Part.from_text(text='Describe this image'),
        types.Part.from_uri(file_uri='gs://bucket/image.jpg', mime_type='image/jpeg'),
    ],
)

# Chat messages
response = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=[
        types.Content(role='user', parts=[types.Part.from_text(text='Hello')]),
        types.Content(role='model', parts=[types.Part.from_text(text='Hi there!')]),
    ],
)
```
Managing context windows
Use token counting to manage model context limits:
```python
MAX_TOKENS = 32000  # Example context window size

# Count tokens before sending (long_text is your input string)
token_count = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=long_text,
)

if token_count.total_tokens > MAX_TOKENS:
    print(f"Content exceeds limit: {token_count.total_tokens} tokens")
    # Truncate or split content
else:
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=long_text,
    )
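One way to handle over-limit content is to split it into chunks that each fit the budget. A sketch that packs paragraphs greedily, taking the counting function as a parameter so it works with count_tokens, the local tokenizer, or a heuristic; the example below uses a simple word-count stand-in so it runs without a client:

```python
def split_to_fit(text: str, max_tokens: int, count_fn) -> list[str]:
    # Greedily pack paragraphs into chunks whose token count,
    # per the supplied count_fn, stays within max_tokens.
    chunks, current = [], ''
    for para in text.split('\n\n'):
        candidate = f'{current}\n\n{para}' if current else para
        if count_fn(candidate) <= max_tokens or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

# Word-count stand-in for a real token counter:
chunks = split_to_fit('a b c\n\nd e\n\nf g h', max_tokens=5,
                      count_fn=lambda t: len(t.split()))
print(chunks)  # ['a b c\n\nd e', 'f g h']
```

Note that a single paragraph longer than the budget is kept as its own chunk here; splitting within a paragraph would need a finer-grained strategy.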
Estimating costs
Use token counts to estimate API costs:
```python
# Count input tokens
input_count = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=prompt,
)
print(f"Input tokens: {input_count.total_tokens}")

# Generate content
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=prompt,
)

# Count output tokens
output_count = client.models.count_tokens(
    model='gemini-2.5-flash',
    contents=response.text,
)
print(f"Output tokens: {output_count.total_tokens}")
print(f"Total tokens: {input_count.total_tokens + output_count.total_tokens}")
```

After generation, the response's usage_metadata also reports prompt and response token counts directly, without a separate counting call.
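Turning the counts into a dollar figure is then simple arithmetic over per-token prices. A sketch; the rates below are placeholders for illustration only (check current pricing for your model), and note that input and output tokens are typically priced differently:

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    # Prices are quoted per million tokens, with separate input/output rates.
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical rates, not real pricing:
cost = estimate_cost_usd(input_tokens=1200, output_tokens=800,
                         input_price_per_m=0.30, output_price_per_m=2.50)
print(f'${cost:.6f}')  # $0.002360
```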