The OpenAIChatCompletionsTokenClient class extends OpenAIChatCompletionsClient to use vLLM's custom /v1/chat/completions/tokens endpoint for token-level prompt stitching (TITO: token-in, token-out) instead of message-level inference (MITO: message-in, token-out).
Overview
This client optimizes multi-turn conversations by reusing tokenized prompts from previous turns rather than re-tokenizing the entire conversation history on each turn. It:
- Detects message-level prefix matches in the conversation trajectory
- Reuses token IDs from previous turns when possible
- Handles chat template suffix tokens correctly
- Falls back to standard message-based inference for the first turn or when multimodal content is present
- Automatically manages token stitching across truncated turns
This client requires a vLLM server with the custom /v1/chat/completions/tokens and /tokenize endpoints. It is designed for inference optimization and will fall back to standard behavior when necessary.

Type Aliases
Inherits from OpenAIChatCompletionsClient:
- ClientT: AsyncOpenAI
- MessagesT: OpenAIChatMessages
- ResponseT: OpenAIChatResponse
- ToolT: OpenAITool

Class Definition
Constructor
Either a pre-configured AsyncOpenAI client or a ClientConfig to create one. The base URL should point to a vLLM server.

Example
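The original example was not preserved here; the snippet below is a minimal sketch assuming the constructor accepts a pre-configured AsyncOpenAI client via a `client` keyword (the keyword name and constructor shape are assumptions, not the documented signature):

```python
# Sketch only: the constructor keyword `client` is an assumption.
from openai import AsyncOpenAI

# Point the client at a vLLM server that exposes the custom
# /v1/chat/completions/tokens and /tokenize endpoints.
client = OpenAIChatCompletionsTokenClient(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY"),
)
```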
Properties
token_client
AsyncOpenAI client with the /v1 suffix stripped from the base URL, for accessing vLLM's /tokenize endpoint.
Returns: AsyncOpenAI instance configured for tokenization endpoints.
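The base-URL handling can be illustrated with a small standalone sketch (an approximation of the behavior described above, not the library's actual code):

```python
def strip_v1_suffix(base_url: str) -> str:
    """Strip a trailing /v1 so the resulting client can reach vLLM's
    root-level /tokenize endpoint, which is not served under /v1."""
    base_url = base_url.rstrip("/")
    if base_url.endswith("/v1"):
        base_url = base_url[: -len("/v1")]
    return base_url

print(strip_v1_suffix("http://localhost:8000/v1"))  # http://localhost:8000
```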
Methods
get_native_response
Parameters:
- List of OpenAI message parameters.
- Model identifier hosted on vLLM.
- Sampling parameters. max_tokens is automatically renamed to max_completion_tokens; logprobs is automatically set to True, and return_token_ids=True is added to extra_body.
- Optional list of tools in OpenAI format.
- Keyword arguments; must include state (type State) for accessing the trajectory and managing cached tokens.

Returns: ChatCompletion object.
Behavior:
- First turn (len(state["trajectory"]) == 0): Uses the standard /chat/completions endpoint (MITO)
- Multimodal content present: Falls back to the standard endpoint because vLLM's /tokenize doesn't run the multimodal processor
- Subsequent text-only turns: Uses the /chat/completions/tokens endpoint (TITO) with token stitching
- No prefix match found: Falls back to the standard endpoint
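The dispatch rules above can be sketched as a small standalone function (the multimodal check and the pre-computed prompt_ids argument are simplified stand-ins for the client's internal helpers):

```python
def choose_route(state: dict, messages: list, prompt_ids) -> str:
    """Decide between message-level (MITO) and token-level (TITO)
    inference, mirroring the fallback rules described above."""
    if len(state["trajectory"]) == 0:
        return "MITO"  # first turn: no cached tokens to reuse
    if any(isinstance(m.get("content"), list) for m in messages):
        return "MITO"  # multimodal content: /tokenize can't handle it
    if prompt_ids is None:
        return "MITO"  # no prefix match found in the trajectory
    return "TITO"      # text-only follow-up turn with reusable tokens

# First turn falls back to MITO even for plain text:
print(choose_route({"trajectory": []}, [{"role": "user", "content": "hi"}], None))  # MITO
```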
get_prompt_ids
Parameters:
- Current rollout state containing the trajectory history.
- Current prompt messages to convert to token IDs.
- Tools in OpenAI format (affects tokenization).

Returns: Stitched prompt token IDs, or None if no prefix match is found.
Algorithm:
- Scans the trajectory backwards to find the step whose messages form the longest prefix of prompt_messages
- Extracts token IDs from that step (prompt_ids + completion_ids)
- Computes and appends chat template suffix tokens (e.g., EOM tokens)
- Tokenizes the full prompt to derive the environment response tokens
- Returns prev_turn_ids + suffix_ids + env_response_ids
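The steps above can be modeled on plain lists (a simplified standalone sketch: trajectory steps are modeled as dicts, suffix handling is folded into the tokenized prompt, and `tokenize` stands in for the /tokenize endpoint):

```python
def stitch_prompt_ids(trajectory, prompt_messages, tokenize):
    """Find the longest message-prefix match in the trajectory and reuse
    its token IDs, stitching on the environment-response tokens.

    Each step is modeled as {"messages", "prompt_ids", "completion_ids"};
    `tokenize` maps a message list to token IDs.
    """
    # Scan backwards for the step whose messages form the longest
    # prefix of prompt_messages.
    for step in reversed(trajectory):
        n = len(step["messages"])
        if prompt_messages[:n] == step["messages"]:
            prev_turn_ids = step["prompt_ids"] + step["completion_ids"]
            # Tokenize the full prompt; everything past the reused
            # prefix is treated as the environment response here.
            full_ids = tokenize(prompt_messages)
            env_response_ids = full_ids[len(prev_turn_ids):]
            return prev_turn_ids + env_response_ids
    return None  # no prefix match: caller falls back to MITO

trajectory = [{"messages": [{"role": "user", "content": "hi"}],
               "prompt_ids": [1, 2], "completion_ids": [3]}]
prompt = [{"role": "user", "content": "hi"},
          {"role": "assistant", "content": "hello"},
          {"role": "user", "content": "next"}]
toy_tokenize = lambda msgs: [1, 2, 3, 4, 5]  # stand-in for /tokenize
print(stitch_prompt_ids(trajectory, prompt, toy_tokenize))  # [1, 2, 3, 4, 5]
```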
tokenize
Tokenizes text or messages via vLLM's /tokenize API.

Parameters:
- Either a plain text string or a list of OpenAI message parameters.
- Optional tools (affects tokenization of messages).
- Model identifier for tokenization.
- Additional parameters for tokenization (e.g., add_generation_prompt).

Usage Example
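The original usage example was not preserved; the sketch below assumes a get_native_response call with messages, model, sampling args, and the required state keyword (the parameter names sampling_args and the model string are assumptions; state is documented above):

```python
# Hypothetical sketch; parameter names other than `state` are assumptions.
import asyncio

async def main():
    state = {"trajectory": []}
    response = await client.get_native_response(
        messages=[{"role": "user", "content": "Hello!"}],
        model="my-model",                                # model served by vLLM
        sampling_args={"max_tokens": 256, "temperature": 0.7},
        state=state,  # required: gives the client access to the trajectory
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```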
Multi-Turn Example
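The original multi-turn example was not preserved; the loop below is a hedged sketch of how TITO engages after the first turn (the env_response helper, parameter names, and trajectory bookkeeping are stand-ins, not the documented API):

```python
# Hypothetical sketch; `env_response` and parameter names are stand-ins.
state = {"trajectory": []}
messages = [{"role": "user", "content": "Start the task."}]

for turn in range(3):
    # Turn 0 uses MITO; later text-only turns reuse cached token IDs (TITO).
    response = await client.get_native_response(
        messages=messages,
        model="my-model",
        sampling_args={"max_tokens": 256},
        state=state,
    )
    messages.append({"role": "assistant",
                     "content": response.choices[0].message.content})
    messages.append({"role": "user", "content": env_response(messages)})
```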
State Keys
The client manages these keys in state:
- _cached_suffix_ids: Cached chat template suffix tokens, computed once per rollout. Used to correctly handle message delimiter tokens across turns.
TITO vs MITO
Message-In Token-Out (MITO)
Standard behavior:
- Sends the full message history on each turn
- Server re-tokenizes everything
- Used for: first turn, multimodal content
Token-In Token-Out (TITO)
Optimized behavior:
- Sends token IDs directly to skip re-tokenization
- Reuses cached tokens from previous turns
- Stitches new tokens for environment responses
- Used for: subsequent text-only turns

TITO is most beneficial for:
- Long conversation histories
- Models with complex chat templates
- High-throughput inference scenarios
Multimodal Content Handling
The client automatically detects multimodal content (images, audio) and falls back to MITO: the /tokenize endpoint doesn't run the multimodal processor, so image placeholders stay collapsed (1 token instead of N) and token stitching would produce broken prompts.
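The detection can be sketched as a simple content check (a standalone approximation: in the OpenAI message format, multimodal messages carry a list of content parts instead of a plain string):

```python
def has_multimodal_content(messages) -> bool:
    """True if any message carries image/audio content parts."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            # Content-part lists may mix text with image/audio parts.
            if any(part.get("type") in ("image_url", "input_audio")
                   for part in content):
                return True
    return False

print(has_multimodal_content([{"role": "user", "content": "hi"}]))  # False
```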
Chat Template Suffix Tokens
The client handles chat template suffix tokens (e.g., EOM tokens, newlines) correctly:
- Computes suffix tokens once using dummy messages
- Caches them in state["_cached_suffix_ids"]
- For each turn, finds the largest overlap between the previous turn's tokens and the suffix tokens
- Appends the non-overlapping suffix tokens to handle truncated turns
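The overlap step can be sketched on plain token lists (a simplified standalone model of the behavior described above):

```python
def append_missing_suffix(prev_turn_ids, suffix_ids):
    """Append only the suffix tokens the previous turn doesn't already
    end with, so truncated turns (which lost part of the chat-template
    suffix) are repaired without duplicating delimiter tokens."""
    # Find the largest k such that the last k tokens of prev_turn_ids
    # equal the first k tokens of suffix_ids.
    for k in range(min(len(prev_turn_ids), len(suffix_ids)), -1, -1):
        if k == 0 or prev_turn_ids[-k:] == suffix_ids[:k]:
            return prev_turn_ids + suffix_ids[k:]

# Turn already ends with the first suffix token (9): only 10 is appended.
print(append_missing_suffix([1, 2, 9], [9, 10]))  # [1, 2, 9, 10]
```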
Fallback Conditions
The client falls back to standard MITO when:
- First turn: len(state["trajectory"]) == 0
- Multimodal content: the current or any previous turn contains images/audio
- No prefix match: get_prompt_ids() returns None
Error Handling
Inherits error handling from OpenAIChatCompletionsClient:
- Context length errors → OverlongPromptError
- Empty responses → EmptyModelResponseError
- Invalid responses → InvalidModelResponseError
- Authentication errors → re-raised from the provider
See Also
- OpenAIChatCompletionsClient - Parent class with standard message-based inference
- Client - Base client interface
- State - State type with trajectory structure
- Response - Response type