Advanced Streaming
KoreShield supports streaming responses through its OpenAI-compatible proxy. This page covers production-grade streaming guidance, timeouts, and infrastructure considerations. Streaming enables low-latency user experiences by delivering partial responses as they’re generated, while maintaining full security scanning.
Use Cases
- Low-latency UX with partial tokens appearing in real-time
- Long-form generation where full responses exceed typical timeouts
- Real-time dashboards and agent pipelines
- Interactive chat applications with immediate feedback
How Streaming Works
- Client sends a request with `stream: true` to the KoreShield proxy
- KoreShield applies security checks, then forwards the request to the provider
- The proxy relays streamed chunks to the client as they arrive
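For OpenAI-compatible endpoints, those relayed chunks arrive as server-sent events: one `data:` line per chunk, terminated by a `data: [DONE]` sentinel (payloads abbreviated here):

```
data: {"id":"chatcmpl-1","choices":[{"index":0,"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-1","choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"id":"chatcmpl-1","choices":[{"index":0,"delta":{"content":"lo"}}]}

data: [DONE]
```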
Client Examples
TypeScript (Fetch)
TypeScript (OpenAI SDK)
Python (Requests)
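A minimal sketch using `requests`, assuming the KoreShield proxy is reachable at `http://localhost:8080/v1` and accepts a bearer token from the `OPENAI_API_KEY` environment variable (both are placeholders for your deployment):

```python
import json
import os


def iter_sse_data(lines):
    """Yield the JSON payload of each SSE `data:` line, stopping at [DONE]."""
    for line in lines:
        if not line or not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def stream_chat(prompt, base_url="http://localhost:8080/v1"):
    import requests  # imported here so the parsing helper stays dependency-free

    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,  # ask the proxy to relay chunks as they arrive
        },
        stream=True,       # tell requests not to buffer the response body
        timeout=(5, 120),  # connect timeout, then per-read timeout
    )
    resp.raise_for_status()
    for event in iter_sse_data(resp.iter_lines(decode_unicode=True)):
        delta = event["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
```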
Python (OpenAI SDK)
Reverse Proxy and Load Balancer Settings
Streaming requires long-lived connections. Ensure any proxy or load balancer supports:
- Idle timeouts of 60 to 120 seconds or higher
- HTTP/1.1 keep-alive or HTTP/2 support
- Response buffering disabled or minimized
NGINX Configuration
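An illustrative `location` block; `koreshield_upstream` is a placeholder for your KoreShield proxy address:

```nginx
location /v1/ {
    proxy_pass         http://koreshield_upstream;
    proxy_http_version 1.1;           # required for chunked streaming
    proxy_set_header   Connection ""; # keep upstream connections alive
    proxy_buffering    off;           # relay chunks immediately
    proxy_cache        off;
    proxy_read_timeout 120s;          # match or exceed your longest stream
    proxy_send_timeout 120s;
}
```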
AWS Application Load Balancer
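The main ALB knob for streaming is the idle timeout. One way to raise it with the AWS CLI (the ARN is a placeholder):

```shell
# Raise the idle timeout so long-lived streams are not cut off mid-response.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <your-load-balancer-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=120
```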
Timeouts and Retries
- Client timeouts should be higher than your longest expected response
- Retries should be disabled for streaming requests unless you support resume logic
- Consider a fallback to non-streaming if streaming fails
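The fallback guidance above can be sketched as follows; `stream_fn` and `non_stream_fn` are hypothetical wrappers around your streaming and blocking client calls:

```python
def complete_with_fallback(stream_fn, non_stream_fn):
    """Try a streaming completion; fall back to one blocking call on failure.

    No automatic retry of the stream itself: without resume logic, a
    retried stream would replay tokens the user has already seen.
    """
    try:
        return "".join(stream_fn())
    except (ConnectionError, TimeoutError):
        return non_stream_fn()  # single non-streaming fallback request
```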
Security Considerations
Observability
Logging Streaming Requests
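One client-side approach is to wrap the chunk iterator so each request logs its time-to-first-chunk and totals; the logger name and fields here are illustrative:

```python
import logging
import time

logger = logging.getLogger("koreshield.streaming")  # illustrative name


def logged_stream(chunks, request_id):
    """Wrap a chunk iterator, logging time-to-first-chunk and totals."""
    start = time.monotonic()
    count = 0
    try:
        for chunk in chunks:
            if count == 0:
                logger.info("request %s: first chunk after %.3fs",
                            request_id, time.monotonic() - start)
            count += 1
            yield chunk
    finally:
        # Runs even if the client disconnects mid-stream.
        logger.info("request %s: %d chunks in %.3fs",
                    request_id, count, time.monotonic() - start)
```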
Prometheus Metrics
Troubleshooting
Empty stream or no chunks received
Possible causes:
- Provider doesn’t support streaming for the selected model
- Missing `stream: true` in request
- Proxy buffering responses
Resolutions:
- Verify model supports streaming in provider docs
- Check request payload includes `stream: true`
- Disable buffering in reverse proxy (see NGINX config above)
Broken connections or premature stream termination
Possible causes:
- Idle timeout too short
- Load balancer closing connection
- Client timeout exceeded
Resolutions:
- Increase idle timeouts on load balancers to 120s+
- Implement keep-alive headers
- Increase client read timeout
Delayed chunks or buffering
Possible causes:
- Response buffering enabled in proxy
- Network latency
- Provider throttling
Resolutions:
- Disable response buffering in proxy config
- Check network latency to provider
- Monitor provider API status
High latency on first chunk
Possible causes:
- Cold start on serverless infrastructure
- Security scanning overhead
- Provider model loading time
Resolutions:
- Use provisioned concurrency for serverless
- Cache security scan results for identical prompts
- Select models optimized for low TTFT (time to first token)
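The scan-caching idea above can be sketched as a digest-keyed memo; `scan_fn` stands in for whatever runs KoreShield's security checks in your setup:

```python
import hashlib

_scan_cache = {}


def scan_with_cache(prompt, scan_fn):
    """Memoize scan verdicts by prompt digest so identical prompts skip
    re-scanning. `scan_fn` is a hypothetical callable running the checks."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _scan_cache:
        _scan_cache[key] = scan_fn(prompt)
    return _scan_cache[key]
```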