Overview
Routing in vLLora enables you to:

- Multi-provider support - Route to OpenAI, Anthropic, Gemini, Bedrock, and custom providers
- Automatic fallbacks - Retry failed requests on different providers
- Load balancing - Distribute requests based on various strategies
- Cost optimization - Route to cheaper models when appropriate
- A/B testing - Split traffic between different models or configurations
Provider flexibility
Switch between providers seamlessly without code changes
High availability
Automatic failover ensures requests succeed even if providers fail
Cost control
Route to cost-effective models based on request characteristics
Performance tuning
Optimize for latency, throughput, or quality
Routing strategies
vLLora implements several routing strategies, each suited for different use cases.

1. Fallback routing
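Fallback routing tries an ordered list of targets until one responds. A minimal sketch of the idea (illustrative Python, not vLLora's implementation; `call_provider` is a hypothetical stand-in for a real client call):

```python
def route_with_fallback(targets, request, call_provider):
    """Try each target in order; return the first successful response."""
    last_error = None
    for target in targets:
        try:
            return call_provider(target, request)
        except Exception as exc:  # a real router would classify errors first
            last_error = exc
    raise RuntimeError(f"all targets failed: {last_error}")
```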
Try providers in sequence until one succeeds.

2. Percentage routing
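Percentage routing picks a target by weighted coin flip. A sketch of the selection step (illustrative, not vLLora's implementation):

```python
import random

def pick_by_percentage(splits, rng=random.random):
    """splits: list of (target, share) pairs whose shares sum to 100."""
    point = rng() * 100
    cumulative = 0.0
    for target, share in splits:
        cumulative += share
        if point < cumulative:
            return target
    return splits[-1][0]  # guard against floating-point rounding at the boundary
```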
Split traffic by percentage for A/B testing.

3. Conditional routing
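Conditional routing applies predicate rules to each request, first match wins. A toy sketch (the rules and provider names here are invented for illustration):

```python
def route_by_condition(request):
    """First matching rule decides the target; otherwise fall through to a default."""
    rules = [
        (lambda r: r.get("model", "").startswith("gpt-"), "openai"),
        (lambda r: len(r.get("prompt", "")) > 4000, "anthropic"),  # e.g. long prompts
    ]
    for predicate, target in rules:
        if predicate(request):
            return target
    return "default"
```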
Route based on request characteristics.

4. Optimized routing
Automatically select the best provider based on metrics:

- latency - Fastest response time
- cost - Lowest cost per token
- success_rate - Highest success rate
- throughput - Highest tokens per second
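Metric-based selection like the above boils down to a min/max over per-provider stats. A sketch (illustrative, not vLLora's implementation):

```python
def pick_optimized(stats, metric):
    """stats: {provider: {"latency": ..., "cost": ..., "success_rate": ..., "throughput": ...}}.
    Lower is better for latency and cost; higher is better for the other two."""
    minimize = metric in ("latency", "cost")
    chooser = min if minimize else max
    return chooser(stats, key=lambda p: stats[p][metric])
```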
Router configuration
Via request
Specify routing inline with each request.

Via configuration file
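A global routing section in the configuration file might look like the sketch below. The key names (`routing`, `strategy`, `targets`) are assumptions for illustration, not vLLora's documented schema; see the Configuration page for the real keys.

```yaml
# Hypothetical sketch - key names are illustrative only
routing:
  strategy: fallback
  targets:
    - provider: openai
      model: gpt-4o
    - provider: anthropic
      model: claude-sonnet
```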
Define global routing rules.

Provider targeting
Basic provider selection
Specific model
Custom endpoint
Provider-specific parameters
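The four targeting options above might be expressed as follows. All field names and values here are illustrative assumptions, not vLLora's documented schema:

```yaml
# Hypothetical sketches - field names are illustrative only
targets:
  - provider: openai                      # basic provider selection
  - provider: openai
    model: gpt-4o                         # specific model
  - provider: custom
    endpoint: https://llm.example.com/v1  # custom endpoint
  - provider: anthropic
    params:
      max_tokens: 1024                    # provider-specific parameters
```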
Fallback strategies
Automatic retry
vLLora automatically retries failed requests on the next target.

Retry configuration
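A retry block might look like the sketch below. Only `max_retries` is mentioned elsewhere on this page; the other keys are illustrative assumptions, not vLLora's documented schema:

```yaml
# Hypothetical sketch - only max_retries appears elsewhere in this page
retry:
  max_retries: 3
  backoff: exponential
  initial_delay_ms: 500
```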
Error handling
Different errors trigger different behaviors:

- Rate limits (429) - Wait and retry with backoff
- Server errors (5xx) - Immediately try next target
- Client errors (4xx) - Fail without retrying (bad request)
- Network errors - Retry with next target
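The classification above can be sketched as a simple mapping from HTTP status code to retry behavior (illustrative, not vLLora's implementation; call it only for failed requests):

```python
def classify_error(status_code):
    """Map an HTTP error status to one of the behaviors listed above."""
    if status_code == 429:
        return "retry_with_backoff"   # rate limit: wait, then retry
    if 500 <= status_code < 600:
        return "try_next_target"      # server error: fail over immediately
    if 400 <= status_code < 500:
        return "fail"                 # client error: retrying won't help
    return "try_next_target"          # treat anything else like a network error
```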
Load balancing
Round-robin
Distribute requests evenly.

Weighted distribution
Control distribution with weights.

Least-loaded
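Least-loaded selection picks the provider with the fewest in-flight requests. A minimal sketch (illustrative, not vLLora's implementation):

```python
def pick_least_loaded(active_requests):
    """active_requests: {provider: current in-flight request count}."""
    return min(active_requests, key=active_requests.get)
```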
Route to the provider with the fewest active requests.

Cost optimization
Model selection by cost
Route to cheaper models when possible.

Budget-based routing
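Budget-based routing tracks cumulative spend and downgrades once the premium model no longer fits. A toy sketch (model names and accounting are illustrative, not vLLora's implementation):

```python
class BudgetRouter:
    """Prefer the premium model until spend approaches the configured budget."""

    def __init__(self, budget, premium="premium-model", cheap="cheap-model"):
        self.budget = budget
        self.spent = 0.0
        self.premium = premium
        self.cheap = cheap

    def pick(self, estimated_cost):
        if self.spent + estimated_cost <= self.budget:
            return self.premium
        return self.cheap

    def record(self, actual_cost):
        self.spent += actual_cost
```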
Route based on remaining budget.

Performance optimization
Latency-based routing
Route to fastest providers.

Geographic routing
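Geographic routing matches the user's region against each provider's region. A minimal sketch (regions and fallback behavior are illustrative, not vLLora's implementation):

```python
def pick_by_region(user_region, provider_regions):
    """provider_regions: {provider: region}; prefer a provider in the user's region."""
    for provider, region in provider_regions.items():
        if region == user_region:
            return provider
    return next(iter(provider_regions))  # fall back to the first configured provider
```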
Route to providers closest to the user.

Monitoring routing decisions
Trace routing
Routing decisions are recorded in traces.

Metrics
Track routing effectiveness:

- Success rate per provider
- Average latency per provider
- Cost per provider
- Fallback frequency
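The metrics above can be derived from raw routing records. A sketch of the aggregation (illustrative; the record shape is an assumption, not vLLora's trace format):

```python
from collections import defaultdict

def summarize_routing(records):
    """records: [(provider, succeeded, latency_ms, cost)]; returns per-provider stats."""
    totals = defaultdict(lambda: {"requests": 0, "successes": 0, "latency_sum": 0.0, "cost": 0.0})
    for provider, ok, latency_ms, cost in records:
        t = totals[provider]
        t["requests"] += 1
        t["successes"] += ok
        t["latency_sum"] += latency_ms
        t["cost"] += cost
    return {
        p: {
            "success_rate": t["successes"] / t["requests"],
            "avg_latency_ms": t["latency_sum"] / t["requests"],
            "total_cost": t["cost"],
        }
        for p, t in totals.items()
    }
```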
Best practices
Always configure fallbacks
Providers can fail unexpectedly. Always configure at least 2-3 fallback targets to ensure high availability.
Monitor routing metrics
Track which providers are used most frequently, their success rates, and costs. Use this data to optimize routing strategies.
Test routing in development
Verify routing behavior in development before deploying to production. Use trace inspection to confirm requests route as expected.
Use conditional routing for cost control
Route simple requests to cheaper models and complex ones to premium models. This balances cost and quality.
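One crude but common heuristic for "simple vs. complex" is prompt length. A sketch (the model names and threshold are illustrative only):

```python
def pick_by_complexity(prompt, threshold=2000):
    """Short prompts go to a cheap model, long ones to a premium model."""
    return "cheap-model" if len(prompt) < threshold else "premium-model"
```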
Set appropriate retry limits
Don't retry indefinitely. Set max_retries to prevent cascading failures and excessive latency.

Next steps
Providers
Learn about supported providers
Configuration
Configure routing in config.yaml
Tracing
Monitor routing decisions
API reference
Router API documentation