Why Stream?
- Better UX: Users see responses immediately, not after 10+ seconds
- Lower latency: First token appears much faster
- Early cancellation: Users can stop generation once the answer is sufficient
- Long outputs: Handle long responses without timeout issues
Basic Streaming
All LangChain chat models support streaming via the `stream()` method:
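The consumption pattern looks like this. A live call needs provider credentials, so a stubbed generator below stands in for `llm.stream()`; the chunk class and prompt are illustrative:

```python
from dataclasses import dataclass

# Stand-in for langchain_core's AIMessageChunk. With a real model the
# loop body is identical: `for chunk in llm.stream("Tell me a joke")`.
@dataclass
class Chunk:
    content: str

def fake_stream(text):
    """Simulates llm.stream(): yields the response one chunk at a time."""
    for word in text.split():
        yield Chunk(content=word + " ")

pieces = []
for chunk in fake_stream("Streaming delivers tokens incrementally"):
    print(chunk.content, end="", flush=True)  # render each piece as it arrives
    pieces.append(chunk.content)

full_text = "".join(pieces)
```

The key point is that the loop body runs once per chunk, so the UI can render output long before the full response exists.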
Message Chunks
Streaming returns `AIMessageChunk` objects:
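Chunks support `+`, so they can be accumulated into one message as they arrive. A minimal stand-in class below shows the accumulation pattern (real `AIMessageChunk` addition also merges tool calls and metadata, not just `content`):

```python
from dataclasses import dataclass

# Simplified stand-in for AIMessageChunk: only `content` concatenates here.
@dataclass
class MessageChunk:
    content: str
    def __add__(self, other):
        return MessageChunk(content=self.content + other.content)

chunks = [MessageChunk("Hello"), MessageChunk(", "), MessageChunk("world")]

# Accumulate the stream into a single message.
full = None
for chunk in chunks:
    full = chunk if full is None else full + chunk
```

After the loop, `full.content` holds the complete response text.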
Async Streaming
Use async streaming for concurrent operations:

Streaming Multiple Queries
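With async streams, several queries can be consumed concurrently rather than one after another. The sketch below uses a simulated `astream()` (an async generator), since a live `llm.astream(prompt)` call needs credentials:

```python
import asyncio

# Simulated astream(): a real call would be `async for chunk in llm.astream(prompt)`.
async def fake_astream(prompt):
    for word in f"Answer to: {prompt}".split():
        await asyncio.sleep(0)  # yield control, as network I/O would
        yield word + " "

async def collect(prompt):
    parts = []
    async for chunk in fake_astream(prompt):
        parts.append(chunk)
    return "".join(parts).strip()

async def main():
    # asyncio.gather drives both streams concurrently.
    return await asyncio.gather(collect("Q1"), collect("Q2"))

results = asyncio.run(main())
```

In real use, the await points fall on network reads, so the total wall time approaches that of the slowest stream instead of the sum of both.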
Streaming Chains
Stream through LCEL chains:

Streaming with Multiple Steps
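The idea behind chain streaming is that each step transforms chunks as they pass through, rather than waiting for the step before it to finish. In LCEL this is `(prompt | model | parser).stream(inputs)`; here plain generators stand in for the runnables:

```python
def model_step(prompt):
    # Stands in for the chat model, yielding raw chunks.
    for word in f"echo: {prompt}".split():
        yield word

def parser_step(chunks):
    # Stands in for an output parser, transforming chunks in flight.
    for chunk in chunks:
        yield chunk.upper()

def chain_stream(prompt):
    # Composition: each chunk flows through the parser immediately.
    yield from parser_step(model_step(prompt))

out = list(chain_stream("hello"))
```

Because the parser consumes chunks lazily, the first transformed chunk is available as soon as the model emits its first chunk.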
Streaming Events
Get granular control with `astream_events()`:
Event Types
- `on_chat_model_stream`: individual token/chunk from the model
- `on_chat_model_start`: a chat model invocation begins
- `on_chat_model_end`: a chat model invocation completes
- `on_chain_stream`: a chunk emitted by a chain or runnable
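Consuming the event stream amounts to filtering on the event name and reading the payload. The hand-written event list below stands in for a live `astream_events()` run (the payload shape is simplified to a bare string chunk):

```python
# Simulated event stream: in practice these arrive via
# `async for event in model.astream_events(prompt)`.
events = [
    {"event": "on_chat_model_start", "data": {}},
    {"event": "on_chat_model_stream", "data": {"chunk": "Hel"}},
    {"event": "on_chat_model_stream", "data": {"chunk": "lo"}},
    {"event": "on_chat_model_end", "data": {}},
]

# Keep only the token events and reassemble the text.
tokens = [
    e["data"]["chunk"]
    for e in events
    if e["event"] == "on_chat_model_stream"
]
text = "".join(tokens)
```

The same filtering approach works for the other event types, e.g. using `on_chat_model_end` to trigger cleanup or logging.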
Streaming RAG
Stream retrieval-augmented generation:

Streaming Tool Calls
Stream agent tool calls:

Token Usage Tracking
Track tokens while streaming:

Custom Stream Processing
Process chunks with custom logic:

Buffering Strategies
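One common strategy is word-level buffering: hold characters until whitespace so the UI renders whole words instead of jittery partial tokens. The chunk boundaries below are arbitrary, since a real stream's boundaries are unpredictable:

```python
def buffer_words(chunks):
    """Re-chunk an incoming stream at word boundaries."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Emit every complete word currently in the buffer.
        while " " in buf:
            word, buf = buf.split(" ", 1)
            yield word + " "
    if buf:
        yield buf  # flush the trailing partial word

raw_chunks = ["Hel", "lo wo", "rld fr", "om Lang", "Chain"]
words = list(buffer_words(raw_chunks))
```

The same shape works for sentence-level buffering by splitting on sentence-ending punctuation instead of spaces.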
Error Handling
Handle streaming errors gracefully:

Streaming with Callbacks
Use callbacks for side effects:

Custom Callback
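The sketch below is modeled on the `on_llm_new_token` hook of LangChain's `BaseCallbackHandler`, but decoupled from the library so it runs stand-alone: the handler fires once per chunk as a side effect, while the caller still receives the stream unchanged.

```python
class TokenLogger:
    """Minimal handler in the shape of a LangChain callback."""
    def __init__(self):
        self.tokens = []

    def on_llm_new_token(self, token: str) -> None:
        self.tokens.append(token)  # side effect: log, meter, forward...

def stream_with_callbacks(chunks, handlers):
    # Notify every handler for each chunk, then pass the chunk through.
    for chunk in chunks:
        for h in handlers:
            h.on_llm_new_token(chunk)
        yield chunk

logger = TokenLogger()
out = list(stream_with_callbacks(["a", "b", "c"], [logger]))
```

With the real library, the handler instance is passed via the `callbacks` parameter at invocation time instead of wrapping the stream by hand.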
Best Practices
Use async for concurrency
Process multiple streams concurrently with `astream()` and `asyncio.gather()`.

Performance Tips
- Use `astream()` over `stream()` in async contexts
- Buffer chunks for smoother display (word or sentence level)
- Set `temperature=0` for faster, more deterministic streaming
- Use smaller models (e.g. gpt-4o-mini) for lower latency
- Enable streaming callbacks for automatic handling
Next Steps
- Learn about Chat Models for model configuration
- Explore Output Parsing for structured streaming
- Build real-time agents with Building Agents
- Check LangSmith for streaming observability