The ZeroEval SDK automatically traces streaming LLM requests from OpenAI and Vercel AI SDK, capturing latency, throughput, and token metrics without code changes.
## OpenAI streaming

Enable streaming in OpenAI requests:

```typescript
import { OpenAI } from 'openai';
import { wrap } from 'zeroeval';

const openai = wrap(new OpenAI());

const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Tell me a short story.' }
  ],
  stream: true
});

// Consume the stream
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  if (content) {
    process.stdout.write(content);
  }
}
```
The wrapper automatically tracks:
- Time to first token (latency)
- Characters per second (throughput)
- Total input and output tokens
- Full response assembly
## Vercel AI SDK streaming

Stream with the Vercel AI SDK:

```typescript
import * as ai from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
import { wrap } from 'zeroeval';

const openai = createOpenAI();
const wrappedAI = wrap(ai);

const result = await wrappedAI.streamText({
  model: openai('gpt-4o-mini'),
  system: 'You are a concise assistant.',
  messages: [
    { role: 'user', content: 'Explain TypeScript in one sentence.' }
  ]
});

// Consume the text stream
for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```
The wrapper supports both the `textStream` and `fullStream` iterators.
## Token tracking

For OpenAI streaming requests, enable usage tracking:

```typescript
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [ /* ... */ ],
  stream: true,
  stream_options: { include_usage: true }
});
```
The SDK automatically sets `stream_options.include_usage` for OpenAI-native models, capturing token counts in the final chunk.

Token usage in streaming mode is only available for OpenAI models (gpt-4, gpt-3.5-turbo, etc.). Third-party models served through an OpenAI-compatible endpoint may not support this option.
## Latency metrics

The SDK captures time to first token (TTFT):

```typescript
import { getCurrentSpan } from 'zeroeval';

const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [ /* ... */ ],
  stream: true
});

// Latency is tracked automatically
for await (const chunk of stream) {
  // The first content chunk triggers the latency calculation
}

// Check the span for latency
const span = getCurrentSpan();
console.log('Latency:', span?.attributes.latency, 'seconds');
```
Latency is calculated from request start to the first content chunk and stored in `span.attributes.latency` (in seconds).
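As a minimal illustration of the TTFT arithmetic (the timestamps below are made-up values, not SDK internals):

```typescript
// Hypothetical timestamps in milliseconds
const requestStartMs = 1_000;  // when the request was sent
const firstContentMs = 1_450;  // when the first content chunk arrived

// TTFT is reported in seconds, matching span.attributes.latency
const latencySeconds = (firstContentMs - requestStartMs) / 1000;
console.log(latencySeconds); // 0.45
```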
## Throughput calculation

Throughput is calculated as characters per second:

```typescript
import { getCurrentSpan } from 'zeroeval';

const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [ /* ... */ ],
  stream: true
});

let fullResponse = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  fullResponse += content;
}

// Throughput is calculated automatically
const span = getCurrentSpan();
console.log('Throughput:', span?.attributes.throughput, 'chars/sec');
```
The formula: `throughput = total_characters / elapsed_time`
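A quick worked example of the formula (the numbers are illustrative only):

```typescript
// Hypothetical stream: 1,200 characters received over 4 seconds
const totalCharacters = 1200;
const elapsedSeconds = 4;

const throughput = totalCharacters / elapsedSeconds;
console.log(throughput, 'chars/sec'); // 300 chars/sec
```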
## Full stream vs text stream

The Vercel AI SDK offers two stream types:

```typescript
const result = await wrappedAI.streamText({ /* ... */ });

// Text stream - raw string chunks
for await (const textChunk of result.textStream) {
  console.log('Text:', textChunk);
}

// Full stream - structured chunks with metadata
for await (const chunk of result.fullStream) {
  if (chunk.type === 'text-delta') {
    console.log('Delta:', chunk.textDelta);
  }
  if (chunk.type === 'finish') {
    console.log('Tokens:', chunk.usage);
  }
}
```
Both are automatically traced. Use `fullStream` when you need metadata such as token counts and chunk types.
## Stream error handling

Errors during streaming are captured automatically:

```typescript
try {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [ /* ... */ ],
    stream: true
  });

  for await (const chunk of stream) {
    // Process chunk
  }
} catch (error) {
  // The error is automatically attached to the span
  console.error('Stream error:', error);
}
```
The span will include:
- Error code (error name)
- Error message
- Stack trace
## Streaming with prompts

Combine streaming with prompt management:

```typescript
import { OpenAI } from 'openai';
import { wrap, prompt } from 'zeroeval';

const openai = wrap(new OpenAI());

const systemPrompt = await prompt({
  name: 'storyteller',
  content: 'You are a creative storyteller.'
});

const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: 'Tell me about a brave cat.' }
  ],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}
```
Prompt metadata is extracted before streaming begins and attached to the span.
## Streaming metrics

Metrics automatically captured for streaming requests:

| Metric | Attribute | Description |
|---|---|---|
| Time to first token | `latency` | Seconds from request start to first content chunk |
| Throughput | `throughput` | Characters per second (total chars / elapsed time) |
| Input tokens | `inputTokens` | Prompt tokens (if available) |
| Output tokens | `outputTokens` | Completion tokens (if available) |
| Chunk count | `chunkCount` | Total number of chunks received (Vercel AI SDK only) |
| Streaming | `streaming` | Boolean flag indicating streaming mode |
## Next.js streaming example

Stream responses in Next.js API routes:

```typescript
import { OpenAI } from 'openai';
import { wrap } from 'zeroeval';
import { NextRequest } from 'next/server';

const openai = wrap(new OpenAI());

export async function POST(request: NextRequest) {
  const { message } = await request.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: message }
    ],
    stream: true
  });

  // Create a ReadableStream for the response
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || '';
        if (content) {
          controller.enqueue(encoder.encode(content));
        }
      }
      controller.close();
    }
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' }
  });
}
```
## Vercel AI SDK data stream

Use the Vercel AI SDK's data stream utilities:

```typescript
import * as ai from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
import { wrap } from 'zeroeval';

const openai = createOpenAI();
const wrappedAI = wrap(ai);

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = await wrappedAI.streamText({
    model: openai('gpt-4o-mini'),
    messages
  });

  // Convert to a data stream response
  return result.toDataStreamResponse();
}
```
The span is automatically closed when `toDataStreamResponse()` is called.
## Best practices

### Always consume streams completely
Ensure you iterate through all chunks to get accurate metrics. Partial consumption will result in incomplete data.
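One way to guarantee full consumption is a small drain helper; this is a sketch of our own (`drainStream` is not part of the SDK):

```typescript
// Hypothetical helper: fully consume any async-iterable stream,
// invoking an optional callback per chunk, and return the chunk count.
async function drainStream<T>(
  stream: AsyncIterable<T>,
  onChunk?: (chunk: T) => void
): Promise<number> {
  let count = 0;
  for await (const chunk of stream) {
    count += 1;
    onChunk?.(chunk);
  }
  return count;
}
```

Passing the raw stream through a helper like this ensures the final chunks (where OpenAI reports usage) are always read, even when your UI stops rendering early.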
### Enable usage tracking for OpenAI

Set `stream_options.include_usage` to `true` to capture token counts. The SDK does this automatically for native OpenAI models.
### Handle errors gracefully
Wrap stream consumption in try-catch blocks. Errors are automatically attached to spans but should still be handled in your application.
### Use fullStream for detailed metrics

When using the Vercel AI SDK, prefer `fullStream` over `textStream` if you need chunk-level metadata and usage statistics.
The SDK’s streaming wrapper adds minimal overhead:
- Latency: < 1ms per chunk (imperceptible to users)
- Memory: Buffers only the current chunk, not the entire response
- Throughput: No throttling or rate limiting
Streaming traces are ended when the stream is fully consumed or when conversion methods like `toDataStreamResponse()` are called.