The OpenAIChatCompletionsTokenClient class extends OpenAIChatCompletionsClient to use vLLM’s custom /v1/chat/completions/tokens endpoint for token-level prompt stitching (TITO: token-in, token-out) instead of message-level inference (MITO: message-in, token-out).

Overview

This client optimizes multi-turn conversations by reusing tokenized prompts from previous turns rather than re-tokenizing the entire conversation history on each turn. It:
  • Detects message-level prefix matches in the conversation trajectory
  • Reuses token IDs from previous turns when possible
  • Handles chat template suffix tokens correctly
  • Falls back to standard message-based inference for the first turn or when multimodal content is present
  • Automatically manages token stitching across truncated turns
This client requires a vLLM server with the custom /v1/chat/completions/tokens and /tokenize endpoints. It is designed for inference optimization and will fall back to standard behavior when necessary.

Type Aliases

Inherits from OpenAIChatCompletionsClient:
OpenAIChatMessage = ChatCompletionMessageParam
OpenAIChatMessages = list[OpenAIChatMessage]
OpenAIChatResponse = ChatCompletion
OpenAITool = ChatCompletionToolParam
Additional response type:
class TokenizeResponse(BaseModel):
    count: int
    max_model_len: int
    tokens: list[int]
    token_strs: Optional[list[str]] = None

Class Definition

class OpenAIChatCompletionsTokenClient(OpenAIChatCompletionsClient)
Inherits all generic type parameters from OpenAIChatCompletionsClient:
  • ClientT: AsyncOpenAI
  • MessagesT: OpenAIChatMessages
  • ResponseT: OpenAIChatResponse
  • ToolT: OpenAITool

Constructor

OpenAIChatCompletionsTokenClient(client_or_config: AsyncOpenAI | ClientConfig)
client_or_config
AsyncOpenAI | ClientConfig
required
Either a pre-configured AsyncOpenAI client or a ClientConfig to create one. The base URL should point to a vLLM server.

Example

from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig

# Using ClientConfig with vLLM server
client = OpenAIChatCompletionsTokenClient(
    ClientConfig(
        api_key="EMPTY",  # vLLM typically doesn't require a real API key
        base_url="http://localhost:8000/v1"
    )
)

# Using pre-configured AsyncOpenAI client
from openai import AsyncOpenAI
client = OpenAIChatCompletionsTokenClient(
    AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
)

Properties

token_client

@property
def token_client(self) -> AsyncOpenAI
Provides an AsyncOpenAI client with the /v1 suffix stripped from the base URL, for accessing vLLM’s /tokenize endpoint. Returns: AsyncOpenAI instance configured for tokenization endpoints.
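The URL handling can be sketched as follows (strip_v1_suffix is a hypothetical helper for illustration; the actual property returns a configured AsyncOpenAI instance):

```python
# Hypothetical sketch of the base-URL handling behind token_client.
# vLLM serves OpenAI-compatible routes under /v1, but /tokenize lives
# at the server root, so the /v1 suffix must be stripped.
def strip_v1_suffix(base_url: str) -> str:
    base_url = base_url.rstrip("/")
    if base_url.endswith("/v1"):
        base_url = base_url[: -len("/v1")]
    return base_url

print(strip_v1_suffix("http://localhost:8000/v1"))  # http://localhost:8000
```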

Methods

get_native_response

@handle_openai_overlong_prompt
async def get_native_response(
    self,
    prompt: OpenAIChatMessages,
    model: str,
    sampling_args: SamplingArgs,
    tools: list[OpenAITool] | None = None,
    **kwargs,
) -> OpenAIChatResponse
Calls the vLLM Chat Completions API, using token-level inference when possible.
prompt
OpenAIChatMessages
required
List of OpenAI message parameters.
model
str
required
Model identifier hosted on vLLM.
sampling_args
SamplingArgs
required
Sampling parameters. max_tokens is automatically renamed to max_completion_tokens. logprobs is automatically set to True and return_token_ids=True is added to extra_body.
tools
list[OpenAITool] | None
default:"None"
Optional list of tools in OpenAI format.
kwargs
dict
Must include state (type State) for accessing trajectory and managing cached tokens.
Returns: OpenAI ChatCompletion object. Behavior:
  • First turn (len(state["trajectory"]) == 0): Uses standard /chat/completions endpoint (MITO)
  • Multimodal content present: Falls back to standard endpoint because vLLM’s /tokenize doesn’t run the multimodal processor
  • Subsequent text-only turns: Uses /chat/completions/tokens endpoint (TITO) with token stitching
  • No prefix match found: Falls back to standard endpoint
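The dispatch among these cases can be sketched as a small predicate (illustrative only; the real decision happens inside get_native_response, and only the trajectory key is taken from the source):

```python
# Illustrative sketch of the MITO/TITO dispatch described above.
def choose_endpoint(state: dict, has_multimodal: bool, prompt_ids) -> str:
    if len(state["trajectory"]) == 0:
        return "MITO"  # first turn: standard /chat/completions
    if has_multimodal:
        return "MITO"  # /tokenize can't expand multimodal placeholders
    if prompt_ids is None:
        return "MITO"  # no message-level prefix match found
    return "TITO"      # /chat/completions/tokens with stitched token IDs

print(choose_endpoint({"trajectory": []}, False, [1, 2]))    # MITO
print(choose_endpoint({"trajectory": [{}]}, False, [1, 2]))  # TITO
```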

get_prompt_ids

async def get_prompt_ids(
    self,
    state: State,
    prompt_messages: OpenAIChatMessages,
    oai_tools: list[OpenAITool] | None,
) -> list[int] | None
Builds prompt token IDs by finding the longest message-level prefix match in the trajectory and stitching with new tokens.
state
State
required
Current rollout state containing trajectory history.
prompt_messages
OpenAIChatMessages
required
Current prompt messages to convert to token IDs.
oai_tools
list[OpenAITool] | None
required
Tools in OpenAI format (affects tokenization).
Returns: List of token IDs representing the full prompt, or None if no prefix match found. Algorithm:
  1. Scans trajectory backwards to find the step whose messages form the longest prefix of prompt_messages
  2. Extracts token IDs from that step (prompt_ids + completion_ids)
  3. Computes and appends chat template suffix tokens (e.g., EOM tokens)
  4. Tokenizes the full prompt to derive environment response tokens
  5. Returns prev_turn_ids + suffix_ids + env_response_ids
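The algorithm above can be sketched as a simplified standalone function (names are illustrative: real trajectory steps store token IDs under step["tokens"], and the suffix handling is more involved than shown here):

```python
# Simplified sketch of the prompt-stitching algorithm described above.
def stitch_prompt_ids(trajectory, prompt_messages, suffix_ids, tokenize_fn):
    # 1. Scan backwards for the step whose messages form the longest
    #    prefix of the current prompt messages.
    for step in reversed(trajectory):
        n = len(step["messages"])
        if prompt_messages[:n] == step["messages"]:
            # 2. Reuse that step's prompt + completion token IDs.
            prev_ids = step["prompt_ids"] + step["completion_ids"]
            # 3-4. Tokenize the full prompt; the tail beyond the reused
            #      tokens and template suffix is the environment response.
            full_ids = tokenize_fn(prompt_messages)
            env_ids = full_ids[len(prev_ids) + len(suffix_ids):]
            # 5. Stitch reused tokens, suffix tokens, and new tokens.
            return prev_ids + suffix_ids + env_ids
    return None  # no prefix match: caller falls back to MITO
```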

tokenize

async def tokenize(
    self,
    messages: str | OpenAIChatMessages,
    tools: list[OpenAITool] | None,
    model: str,
    extra_kwargs: dict = {},
    **kwargs,
) -> list[int]
Tokenizes messages or text using the vLLM /tokenize API.
messages
str | OpenAIChatMessages
required
Either a plain text string or a list of OpenAI message parameters.
tools
list[OpenAITool] | None
required
Optional tools (affects tokenization of messages).
model
str
required
Model identifier for tokenization.
extra_kwargs
dict
default:"{}"
Additional parameters for tokenization (e.g., add_generation_prompt).
Returns: List of token IDs.
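The request body this method sends can be sketched as follows (the field names prompt, messages, and add_generation_prompt follow vLLM’s tokenize API; build_tokenize_payload itself is a hypothetical helper, not part of the client):

```python
# Sketch of building a /tokenize request body, dispatching on whether
# the input is raw text or a chat message list (as the signature above
# allows). Tools are forwarded because they affect chat templating.
def build_tokenize_payload(messages, tools, model, extra_kwargs=None):
    payload = {"model": model, **(extra_kwargs or {})}
    if isinstance(messages, str):
        payload["prompt"] = messages    # plain-text tokenization
    else:
        payload["messages"] = messages  # chat-template tokenization
        if tools is not None:
            payload["tools"] = tools
    return payload

print(build_tokenize_payload("hello", None, "m"))  # {'model': 'm', 'prompt': 'hello'}
```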

Usage Example

import asyncio
import verifiers as vf
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig, SamplingArgs

async def main():
    # Initialize client pointing to vLLM server
    client = OpenAIChatCompletionsTokenClient(
        ClientConfig(
            api_key="EMPTY",
            base_url="http://localhost:8000/v1"
        )
    )
    
    # Create a simple environment
    def load_environment():
        dataset = vf.Environment.make_dataset([
            {"question": "What is 2+2?"}
        ])
        
        def correctness(completion: vf.Messages, **kwargs) -> float:
            text = vf.content_to_text(completion[-1].content)
            return 1.0 if "4" in text else 0.0
        
        return vf.SingleTurnEnv(
            dataset=dataset,
            rubric=vf.Rubric(correctness),
        )
    
    env = load_environment()
    
    # Run rollout - token client will automatically use TITO on subsequent turns
    state = await env.rollout(
        input={"question": "What is 2+2?"},
        client=client,
        model="meta-llama/Llama-3.1-8B-Instruct",
        sampling_args=SamplingArgs(temperature=0.0, max_tokens=100)
    )
    
    # Check token information
    if state["trajectory"]:
        first_turn = state["trajectory"][0]
        if first_turn["tokens"]:
            print(f"Prompt tokens: {len(first_turn['tokens']['prompt_ids'])}")
            print(f"Completion tokens: {len(first_turn['tokens']['completion_ids'])}")
    
    await client.close()

asyncio.run(main())

Multi-Turn Example

import asyncio
import verifiers as vf
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig, SamplingArgs

async def main():
    client = OpenAIChatCompletionsTokenClient(
        ClientConfig(api_key="EMPTY", base_url="http://localhost:8000/v1")
    )
    
    # Multi-turn environment
    class CountingEnv(vf.MultiTurnEnv):
        def __init__(self):
            dataset = vf.Environment.make_dataset([{"start": 1}])
            super().__init__(
                dataset=dataset,
                rubric=vf.Rubric(lambda **kw: 1.0),
                max_turns=5
            )
        
        async def env_response(
            self, completion: vf.Messages, state: vf.State
        ) -> vf.Messages:
            turn = state["turn"]
            return [vf.UserMessage(content=f"What is {turn + 1} + 1?")]
        
        async def is_completed(self, state: vf.State) -> bool:
            return state["turn"] >= 5
    
    env = CountingEnv()
    
    state = await env.rollout(
        input={"start": 1},
        client=client,
        model="meta-llama/Llama-3.1-8B-Instruct",
        sampling_args=SamplingArgs(temperature=0.0, max_tokens=50)
    )
    
    # Turn 0: uses MITO (first turn)
    # Turns 1-4: use TITO (reuse tokens from previous turns)
    
    print(f"Completed {len(state['trajectory'])} turns")
    for i, step in enumerate(state["trajectory"]):
        prompt_len = len(step["tokens"]["prompt_ids"]) if step["tokens"] else 0
        completion_len = len(step["tokens"]["completion_ids"]) if step["tokens"] else 0
        print(f"Turn {i}: prompt={prompt_len} tokens, completion={completion_len} tokens")
    
    await client.close()

asyncio.run(main())

State Keys

The client manages these keys in state:
_cached_suffix_ids
list[int]
Cached chat template suffix tokens computed once per rollout. Used to correctly handle message delimiter tokens across turns.

TITO vs MITO

Message-In Token-Out (MITO)

Standard behavior:
  • Sends full message history on each turn
  • Server re-tokenizes everything
  • Used for: first turn, multimodal content

Token-In Token-Out (TITO)

Optimized behavior:
  • Sends token IDs directly to skip re-tokenization
  • Reuses cached tokens from previous turns
  • Stitches new tokens for environment responses
  • Used for: subsequent text-only turns
Performance benefit: TITO eliminates redundant tokenization overhead in multi-turn conversations, especially valuable for:
  • Long conversation histories
  • Models with complex chat templates
  • High-throughput inference scenarios
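A rough, illustrative cost model makes the benefit concrete: if each turn appends about t tokens to the history, MITO re-tokenizes the entire history on every turn (quadratic total work), while TITO tokenizes only the new tokens (linear). The numbers below are illustrative, not measured:

```python
# Illustrative cost model: total tokens tokenized server-side across
# n turns, assuming each turn appends roughly t tokens to the history.
def mito_cost(n_turns: int, t: int) -> int:
    # Re-tokenize the full history (i turns' worth) on every turn.
    return sum(i * t for i in range(1, n_turns + 1))

def tito_cost(n_turns: int, t: int) -> int:
    # Only each turn's new tokens are tokenized.
    return n_turns * t

print(mito_cost(10, 500))  # 27500 tokens tokenized
print(tito_cost(10, 500))  # 5000 tokens tokenized
```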

Multimodal Content Handling

The client automatically detects multimodal content (images, audio) and falls back to MITO:
def _has_multimodal_content(messages) -> bool:
    """Check if any message contains multimodal content."""
    for msg in messages:
        content = msg.get("content") if hasattr(msg, "get") else None
        if isinstance(content, list):
            for part in content:
                if hasattr(part, "get") and part.get("type") in (
                    "image_url",
                    "input_audio",
                ):
                    return True
    return False
Reason: vLLM ≤0.16’s /tokenize endpoint doesn’t run the multimodal processor, so image placeholders stay collapsed (1 token instead of N) and token-stitching produces broken prompts.

Chat Template Suffix Tokens

The client handles chat template suffix tokens (e.g., EOM tokens, newlines) correctly:
  1. Computes suffix tokens once using dummy messages
  2. Caches them in state["_cached_suffix_ids"]
  3. For each turn, finds the largest overlap between previous turn tokens and suffix tokens
  4. Appends non-overlapping suffix tokens to handle truncated turns
This ensures that token stitching respects the chat template format even when turns are truncated mid-message.
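Steps 3-4 can be sketched as a small overlap helper (illustrative only; missing_suffix is not the actual implementation):

```python
# Sketch of the suffix-overlap handling described above: append only the
# portion of the cached suffix tokens that the previous turn's tokens
# don't already end with (e.g. when a turn was truncated mid-message).
def missing_suffix(prev_ids: list[int], suffix_ids: list[int]) -> list[int]:
    # Find the largest k such that prev_ids already ends with the
    # first k suffix tokens.
    for k in range(len(suffix_ids), 0, -1):
        if prev_ids[-k:] == suffix_ids[:k]:
            return suffix_ids[k:]
    return suffix_ids

print(missing_suffix([5, 6, 9], [9, 10]))  # [10]
print(missing_suffix([5, 6], [9, 10]))     # [9, 10]
```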

Fallback Conditions

The client falls back to standard MITO when:
  1. First turn: len(state["trajectory"]) == 0
  2. Multimodal content: Current or any previous turn contains images/audio
  3. No prefix match: get_prompt_ids() returns None

Error Handling

Inherits error handling from OpenAIChatCompletionsClient:
  • Context length errors → OverlongPromptError
  • Empty responses → EmptyModelResponseError
  • Invalid responses → InvalidModelResponseError
  • Authentication errors → Re-raised from provider
