The OpenAIChatCompletionsTokenClient class extends OpenAIChatCompletionsClient to use vLLM’s custom /v1/chat/completions/tokens endpoint for token-level prompt stitching (TITO: token-in, token-out) instead of message-level inference (MITO: message-in, token-out).

Overview

This client optimizes multi-turn conversations by reusing tokenized prompts from previous turns rather than re-tokenizing the entire conversation history on each turn. It:
  • Detects message-level prefix matches in the conversation trajectory
  • Reuses token IDs from previous turns when possible
  • Handles chat template suffix tokens correctly
  • Falls back to standard message-based inference for the first turn or when multimodal content is present
  • Automatically manages token stitching across truncated turns
This client requires a vLLM server with the custom /v1/chat/completions/tokens and /tokenize endpoints. It is designed for inference optimization and will fall back to standard behavior when necessary.

Type Aliases

Inherits from OpenAIChatCompletionsClient:
OpenAIChatMessage = ChatCompletionMessageParam
OpenAIChatMessages = list[OpenAIChatMessage]
OpenAIChatResponse = ChatCompletion
OpenAITool = ChatCompletionToolParam
Additional response type:
class TokenizeResponse(BaseModel):
    count: int
    max_model_len: int
    tokens: list[int]
    token_strs: Optional[list[str]] = None

Class Definition

class OpenAIChatCompletionsTokenClient(OpenAIChatCompletionsClient)
Inherits all generic type parameters from OpenAIChatCompletionsClient:
  • ClientT: AsyncOpenAI
  • MessagesT: OpenAIChatMessages
  • ResponseT: OpenAIChatResponse
  • ToolT: OpenAITool

Constructor

OpenAIChatCompletionsTokenClient(client_or_config: AsyncOpenAI | ClientConfig)
client_or_config
AsyncOpenAI | ClientConfig
required
Either a pre-configured AsyncOpenAI client or a ClientConfig to create one. The base URL should point to a vLLM server.

Example

from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig

# Using ClientConfig with vLLM server
client = OpenAIChatCompletionsTokenClient(
    ClientConfig(
        api_key="EMPTY",  # vLLM typically doesn't require a real API key
        base_url="http://localhost:8000/v1"
    )
)

# Using pre-configured AsyncOpenAI client
from openai import AsyncOpenAI
client = OpenAIChatCompletionsTokenClient(
    AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
)

Properties

token_client

@property
def token_client(self) -> AsyncOpenAI
Provides an AsyncOpenAI client with the /v1 suffix stripped from the base URL, for accessing vLLM’s /tokenize endpoint. Returns: AsyncOpenAI instance configured for tokenization endpoints.
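The URL handling can be sketched as follows (strip_v1_suffix is a hypothetical helper for illustration; the actual property returns a configured AsyncOpenAI instance):

```python
# Hypothetical sketch of the base-URL handling behind token_client.
# vLLM serves OpenAI-compatible routes under /v1, but /tokenize lives
# at the server root, so the /v1 suffix must be stripped.
def strip_v1_suffix(base_url: str) -> str:
    base_url = base_url.rstrip("/")
    if base_url.endswith("/v1"):
        base_url = base_url[: -len("/v1")]
    return base_url

print(strip_v1_suffix("http://localhost:8000/v1"))  # http://localhost:8000
```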

Methods

get_native_response

@handle_openai_overlong_prompt
async def get_native_response(
    self,
    prompt: OpenAIChatMessages,
    model: str,
    sampling_args: SamplingArgs,
    tools: list[OpenAITool] | None = None,
    **kwargs,
) -> OpenAIChatResponse
Calls the vLLM Chat Completions API, using token-level inference when possible.
prompt
OpenAIChatMessages
required
List of OpenAI message parameters.
model
str
required
Model identifier hosted on vLLM.
sampling_args
SamplingArgs
required
Sampling parameters. max_tokens is automatically renamed to max_completion_tokens. logprobs is automatically set to True and return_token_ids=True is added to extra_body.
tools
list[OpenAITool] | None
default:"None"
Optional list of tools in OpenAI format.
kwargs
dict
Must include state (type State) for accessing trajectory and managing cached tokens.
Returns: OpenAI ChatCompletion object. Behavior:
  • First turn (len(state["trajectory"]) == 0): Uses standard /chat/completions endpoint (MITO)
  • Multimodal content present: Falls back to standard endpoint because vLLM’s /tokenize doesn’t run the multimodal processor
  • Subsequent text-only turns: Uses /chat/completions/tokens endpoint (TITO) with token stitching
  • No prefix match found: Falls back to standard endpoint
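The dispatch among these cases can be sketched as a small predicate (illustrative only; the real decision happens inside get_native_response, and only the trajectory key is taken from the source):

```python
# Illustrative sketch of the MITO/TITO dispatch described above.
def choose_endpoint(state: dict, has_multimodal: bool, prompt_ids) -> str:
    if len(state["trajectory"]) == 0:
        return "MITO"  # first turn: standard /chat/completions
    if has_multimodal:
        return "MITO"  # /tokenize can't expand multimodal placeholders
    if prompt_ids is None:
        return "MITO"  # no message-level prefix match found
    return "TITO"      # /chat/completions/tokens with stitched token IDs

print(choose_endpoint({"trajectory": []}, False, [1, 2]))    # MITO
print(choose_endpoint({"trajectory": [{}]}, False, [1, 2]))  # TITO
```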

get_prompt_ids

async def get_prompt_ids(
    self,
    state: State,
    prompt_messages: OpenAIChatMessages,
    oai_tools: list[OpenAITool] | None,
) -> list[int] | None
Builds prompt token IDs by finding the longest message-level prefix match in the trajectory and stitching with new tokens.
state
State
required
Current rollout state containing trajectory history.
prompt_messages
OpenAIChatMessages
required
Current prompt messages to convert to token IDs.
oai_tools
list[OpenAITool] | None
required
Tools in OpenAI format (affects tokenization).
Returns: List of token IDs representing the full prompt, or None if no prefix match found. Algorithm:
  1. Scans trajectory backwards to find the step whose messages form the longest prefix of prompt_messages
  2. Extracts token IDs from that step (prompt_ids + completion_ids)
  3. Computes and appends chat template suffix tokens (e.g., EOM tokens)
  4. Tokenizes the full prompt to derive environment response tokens
  5. Returns prev_turn_ids + suffix_ids + env_response_ids
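The algorithm above can be sketched as a simplified standalone function (names are illustrative: real trajectory steps store token IDs under step["tokens"], and the suffix handling is more involved than shown here):

```python
# Simplified sketch of the prompt-stitching algorithm described above.
def stitch_prompt_ids(trajectory, prompt_messages, suffix_ids, tokenize_fn):
    # 1. Scan backwards for the step whose messages form the longest
    #    prefix of the current prompt messages.
    for step in reversed(trajectory):
        n = len(step["messages"])
        if prompt_messages[:n] == step["messages"]:
            # 2. Reuse that step's prompt + completion token IDs.
            prev_ids = step["prompt_ids"] + step["completion_ids"]
            # 3-4. Tokenize the full prompt; the tail beyond the reused
            #      tokens and template suffix is the environment response.
            full_ids = tokenize_fn(prompt_messages)
            env_ids = full_ids[len(prev_ids) + len(suffix_ids):]
            # 5. Stitch reused tokens, suffix tokens, and new tokens.
            return prev_ids + suffix_ids + env_ids
    return None  # no prefix match: caller falls back to MITO
```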

tokenize

async def tokenize(
    self,
    messages: str | OpenAIChatMessages,
    tools: list[OpenAITool] | None,
    model: str,
    extra_kwargs: dict = {},
    **kwargs,
) -> list[int]
Tokenizes messages or text using the vLLM /tokenize API.
messages
str | OpenAIChatMessages
required
Either a plain text string or a list of OpenAI message parameters.
tools
list[OpenAITool] | None
required
Optional tools (affects tokenization of messages).
model
str
required
Model identifier for tokenization.
extra_kwargs
dict
default:"{}"
Additional parameters for tokenization (e.g., add_generation_prompt).
Returns: List of token IDs.
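The request body this method sends can be sketched as follows (the field names prompt, messages, and add_generation_prompt follow vLLM’s tokenize API; build_tokenize_payload itself is a hypothetical helper, not part of the client):

```python
# Sketch of building a /tokenize request body, dispatching on whether
# the input is raw text or a chat message list (as the signature above
# allows). Tools are forwarded because they affect chat templating.
def build_tokenize_payload(messages, tools, model, extra_kwargs=None):
    payload = {"model": model, **(extra_kwargs or {})}
    if isinstance(messages, str):
        payload["prompt"] = messages    # plain-text tokenization
    else:
        payload["messages"] = messages  # chat-template tokenization
        if tools is not None:
            payload["tools"] = tools
    return payload

print(build_tokenize_payload("hello", None, "m"))  # {'model': 'm', 'prompt': 'hello'}
```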

Usage Example

import asyncio
import verifiers as vf
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig, SamplingArgs

async def main():
    # Initialize client pointing to vLLM server
    client = OpenAIChatCompletionsTokenClient(
        ClientConfig(
            api_key="EMPTY",
            base_url="http://localhost:8000/v1"
        )
    )
    
    # Create a simple environment
    def load_environment():
        dataset = vf.Environment.make_dataset([
            {"question": "What is 2+2?"}
        ])
        
        def correctness(completion: vf.Messages, **kwargs) -> float:
            text = vf.content_to_text(completion[-1].content)
            return 1.0 if "4" in text else 0.0
        
        return vf.SingleTurnEnv(
            dataset=dataset,
            rubric=vf.Rubric(correctness),
        )
    
    env = load_environment()
    
    # Run rollout - token client will automatically use TITO on subsequent turns
    state = await env.rollout(
        input={"question": "What is 2+2?"},
        client=client,
        model="meta-llama/Llama-3.1-8B-Instruct",
        sampling_args=SamplingArgs(temperature=0.0, max_tokens=100)
    )
    
    # Check token information
    if state["trajectory"]:
        first_turn = state["trajectory"][0]
        if first_turn["tokens"]:
            print(f"Prompt tokens: {len(first_turn['tokens']['prompt_ids'])}")
            print(f"Completion tokens: {len(first_turn['tokens']['completion_ids'])}")
    
    await client.close()

asyncio.run(main())

Multi-Turn Example

import asyncio
import verifiers as vf
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig, SamplingArgs

async def main():
    client = OpenAIChatCompletionsTokenClient(
        ClientConfig(api_key="EMPTY", base_url="http://localhost:8000/v1")
    )
    
    # Multi-turn environment
    class CountingEnv(vf.MultiTurnEnv):
        def __init__(self):
            dataset = vf.Environment.make_dataset([{"start": 1}])
            super().__init__(
                dataset=dataset,
                rubric=vf.Rubric(lambda **kw: 1.0),
                max_turns=5
            )
        
        async def env_response(
            self, completion: vf.Messages, state: vf.State
        ) -> vf.Messages:
            turn = state["turn"]
            return [vf.UserMessage(content=f"What is {turn + 1} + 1?")]
        
        async def is_completed(self, state: vf.State) -> bool:
            return state["turn"] >= 5
    
    env = CountingEnv()
    
    state = await env.rollout(
        input={"start": 1},
        client=client,
        model="meta-llama/Llama-3.1-8B-Instruct",
        sampling_args=SamplingArgs(temperature=0.0, max_tokens=50)
    )
    
    # Turn 0: uses MITO (first turn)
    # Turns 1-4: use TITO (reuse tokens from previous turns)
    
    print(f"Completed {len(state['trajectory'])} turns")
    for i, step in enumerate(state["trajectory"]):
        prompt_len = len(step["tokens"]["prompt_ids"]) if step["tokens"] else 0
        completion_len = len(step["tokens"]["completion_ids"]) if step["tokens"] else 0
        print(f"Turn {i}: prompt={prompt_len} tokens, completion={completion_len} tokens")
    
    await client.close()

asyncio.run(main())

State Keys

The client manages these keys in state:
_cached_suffix_ids
list[int]
Cached chat template suffix tokens computed once per rollout. Used to correctly handle message delimiter tokens across turns.

TITO vs MITO

Message-In Token-Out (MITO)

Standard behavior:
  • Sends full message history on each turn
  • Server re-tokenizes everything
  • Used for: first turn, multimodal content

Token-In Token-Out (TITO)

Optimized behavior:
  • Sends token IDs directly to skip re-tokenization
  • Reuses cached tokens from previous turns
  • Stitches new tokens for environment responses
  • Used for: subsequent text-only turns
Performance benefit: TITO eliminates redundant tokenization overhead in multi-turn conversations, especially valuable for:
  • Long conversation histories
  • Models with complex chat templates
  • High-throughput inference scenarios
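A rough, illustrative cost model makes the benefit concrete: if each turn appends about t tokens to the history, MITO re-tokenizes the entire history on every turn (quadratic total work), while TITO tokenizes only the new tokens (linear). The numbers below are illustrative, not measured:

```python
# Illustrative cost model: total tokens tokenized server-side across
# n turns, assuming each turn appends roughly t tokens to the history.
def mito_cost(n_turns: int, t: int) -> int:
    # Re-tokenize the full history (i turns' worth) on every turn.
    return sum(i * t for i in range(1, n_turns + 1))

def tito_cost(n_turns: int, t: int) -> int:
    # Only each turn's new tokens are tokenized.
    return n_turns * t

print(mito_cost(10, 500))  # 27500 tokens tokenized
print(tito_cost(10, 500))  # 5000 tokens tokenized
```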

Multimodal Content Handling

The client automatically detects multimodal content (images, audio) and falls back to MITO:
def _has_multimodal_content(messages) -> bool:
    """Check if any message contains multimodal content."""
    for msg in messages:
        content = msg.get("content") if hasattr(msg, "get") else None
        if isinstance(content, list):
            for part in content:
                if hasattr(part, "get") and part.get("type") in (
                    "image_url",
                    "input_audio",
                ):
                    return True
    return False
Reason: vLLM ≤0.16’s /tokenize endpoint doesn’t run the multimodal processor, so image placeholders stay collapsed (1 token instead of N) and token-stitching produces broken prompts.

Chat Template Suffix Tokens

The client handles chat template suffix tokens (e.g., EOM tokens, newlines) correctly:
  1. Computes suffix tokens once using dummy messages
  2. Caches them in state["_cached_suffix_ids"]
  3. For each turn, finds the largest overlap between previous turn tokens and suffix tokens
  4. Appends non-overlapping suffix tokens to handle truncated turns
This ensures that token stitching respects the chat template format even when turns are truncated mid-message.
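Steps 3-4 can be sketched as a small overlap helper (illustrative only; missing_suffix is not the actual implementation):

```python
# Sketch of the suffix-overlap handling described above: append only the
# portion of the cached suffix tokens that the previous turn's tokens
# don't already end with (e.g. when a turn was truncated mid-message).
def missing_suffix(prev_ids: list[int], suffix_ids: list[int]) -> list[int]:
    # Find the largest k such that prev_ids already ends with the
    # first k suffix tokens.
    for k in range(len(suffix_ids), 0, -1):
        if prev_ids[-k:] == suffix_ids[:k]:
            return suffix_ids[k:]
    return suffix_ids

print(missing_suffix([5, 6, 9], [9, 10]))  # [10]
print(missing_suffix([5, 6], [9, 10]))     # [9, 10]
```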

Fallback Conditions

The client falls back to standard MITO when:
  1. First turn: len(state["trajectory"]) == 0
  2. Multimodal content: Current or any previous turn contains images/audio
  3. No prefix match: get_prompt_ids() returns None

Error Handling

Inherits error handling from OpenAIChatCompletionsClient:
  • Context length errors → OverlongPromptError
  • Empty responses → EmptyModelResponseError
  • Invalid responses → InvalidModelResponseError
  • Authentication errors → Re-raised from provider
