BAML’s streaming API allows you to receive partial, structured results as the LLM generates its response. This enables you to display real-time progress and provide a better user experience.

Basic Streaming

Use b.stream.FunctionName() to stream responses:
from baml_client.async_client import b

async def example():
    stream = b.stream.ExtractResume(resume_text)
    
    # Iterate over partial results
    async for partial in stream:
        print(f"Partial: {partial}")
        # partial has nullable fields populated as data arrives
    
    # Get the final, validated result
    final = await stream.get_final_response()
    print(f"Final: {final}")
Sync version:
from baml_client.sync_client import b

def example():
    stream = b.stream.ExtractResume(resume_text)
    
    for partial in stream:
        print(f"Partial: {partial}")
    
    final = stream.get_final_response()
    print(f"Final: {final}")

Partial Types

BAML generates partial types for streaming in the partial_types module. By default:
  • All class fields become nullable in partial types
  • Fields are filled with non-null values as tokens arrive
  • The final result is validated against your original type
Example: Given this BAML class:
class Resume {
    name string
    email string
    skills string[]
    experience Experience[]
}

class Experience {
    company string
    title string
    years int
}
The generated partial types look like:
from baml_client.partial_types import Resume, Experience

# Partial types have nullable fields
class Resume:
    name: str | None
    email: str | None
    skills: list[str] | None
    experience: list[Experience] | None

class Experience:
    company: str | None
    title: str | None
    years: int | None

Stream Request

Use .stream_request to get the HTTP request for streaming without actually sending it:
from baml_client.async_client import b

async def example():
    request = await b.stream_request.ExtractResume(resume_text)
    print(request.url)
    print(request.headers)
    print(request.body.json())

Parse Stream

Parse streaming responses yourself using .parse_stream:
from openai import AsyncOpenAI
from baml_client.async_client import b

async def example():
    client = AsyncOpenAI()
    
    request = await b.stream_request.ExtractResume(resume_text)
    stream = await client.chat.completions.create(**request.body.json())
    
    llm_response = []
    async for chunk in stream:
        if len(chunk.choices) > 0 and chunk.choices[0].delta.content:
            llm_response.append(chunk.choices[0].delta.content)
            # Parse accumulated response
            partial = b.parse_stream.ExtractResume("".join(llm_response))
            print(partial)

Streaming with Options

Pass options to streaming calls just like regular calls:
from baml_client.async_client import b

async def example():
    stream = b.stream.ExtractResume(
        resume_text,
        baml_options={
            "client": "openai/gpt-4o-mini",
            "tags": {"user_id": "123"},
        }
    )
    
    async for partial in stream:
        print(partial)
    
    final = await stream.get_final_response()

Stream Behavior

Partial Updates

As the LLM streams tokens, BAML:
  1. Accumulates the raw JSON text
  2. Attempts to parse partial JSON into your defined types
  3. Fills fields with values as they become available
  4. Emits partial results that can be displayed immediately
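This loop can be illustrated with a simplified, self-contained sketch. Note that the stand-in parser below is far cruder than BAML's: BAML's error-tolerant parser can complete truncated JSON and emit partials early, while plain `json.loads` only succeeds once the accumulated text is valid:

```python
import json

def try_parse(accumulated: str):
    """Crude stand-in for BAML's error-tolerant parser: returns a
    value only once the accumulated text is valid JSON."""
    try:
        return json.loads(accumulated)
    except json.JSONDecodeError:
        return None

# Simulated token chunks arriving from an LLM
chunks = ['{"name": "Jo', 'hn Doe", "email": ', '"john@example.com"}']

buffer = ""
partials = []
for chunk in chunks:
    buffer += chunk                 # 1. accumulate raw JSON text
    parsed = try_parse(buffer)      # 2. attempt to parse it
    if parsed is not None:
        partials.append(parsed)     # 4. emit a (partial) result

print(partials[-1]["name"])  # John Doe
```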

Example Stream Progression

For a Resume type, you might see:
# First partial - only name
Resume(name="John Doe", email=None, skills=None, experience=None)

# Second partial - name and email
Resume(name="John Doe", email="john@example.com", skills=None, experience=None)

# Third partial - with some skills
Resume(name="John Doe", email="john@example.com", skills=["Python"], experience=None)

# Final response - all fields populated
Resume(
    name="John Doe",
    email="john@example.com",
    skills=["Python", "TypeScript", "Go"],
    experience=[...]
)

Final Response

The final response from get_final_response() / getFinalResponse():
  • Is fully validated against your original BAML types
  • Raises a validation error if the LLM output doesn’t match your schema
  • Returns the non-nullable, complete type

Error Handling

Streaming can raise errors:
from baml_client.async_client import b
from baml_py import BamlValidationError

async def example():
    stream = b.stream.ExtractResume(resume_text)
    
    try:
        async for partial in stream:
            print(partial)
        
        final = await stream.get_final_response()
    except BamlValidationError as e:
        print(f"Validation failed: {e.message}")
        print(f"Raw output: {e.raw_output}")

Best Practices

  1. Use streaming for long responses - Better UX when generating large amounts of structured data
  2. Handle partial data gracefully - Check for null/None fields in partial results
  3. Display progress incrementally - Update UI as partial results arrive
  4. Always call get_final_response() - Ensures full validation of the complete result
  5. Handle errors - Stream can fail at any point during generation
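The practices above combine into one defensive consumption pattern. A sketch using a stub stream so it stands alone (real code would iterate `b.stream.ExtractResume(...)` and call `get_final_response()`):

```python
class StubStream:
    """Stand-in for a BAML sync stream, so this sketch is self-contained."""
    def __init__(self, partials, final):
        self._partials = partials
        self._final = final
    def __iter__(self):
        return iter(self._partials)
    def get_final_response(self):
        return self._final

def consume(stream, on_partial):
    """Display partials as they arrive, then return the validated final
    result; fall back to the last partial if the stream fails mid-way."""
    last_partial = None
    try:
        for partial in stream:              # display progress incrementally
            last_partial = partial
            on_partial(partial)
        return stream.get_final_response()  # full validation of the result
    except Exception:
        return last_partial                 # stream failed mid-generation

seen = []
final = consume(StubStream([{"name": None}, {"name": "Ada"}], {"name": "Ada"}),
                seen.append)
print(final)  # {'name': 'Ada'}
```

Whether to fall back to the last partial or re-raise on failure is an application choice; for data pipelines, re-raising is usually safer than accepting an unvalidated partial.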
