Overview
SGLang supports distributed request tracing using OpenTelemetry. Tracing provides detailed insight into request execution flow, helping you:
- Debug performance bottlenecks
- Understand request lifecycle across components
- Track latency breakdown by processing stage
- Correlate requests across prefill-decode disaggregation
- Trace requests through distributed multi-GPU systems
Prerequisites
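Tracing needs the OpenTelemetry SDK and an OTLP exporter. Assuming the standard Python distribution packages (the exact package list is not confirmed by this document, so check SGLang's installation notes), a typical install is:

```shell
# Core OpenTelemetry API/SDK plus the OTLP exporter (package names assumed)
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
```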
Install the OpenTelemetry packages before starting the server.
Quick Start
1. Set up an OTLP Collector
You need an OpenTelemetry collector endpoint. Common options:
- Jaeger (local development)
- Grafana Tempo (production)
- DataDog, New Relic, Honeycomb (managed services)
Run Jaeger Locally
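For local development you can run the standard Jaeger all-in-one image; the ports below are Jaeger's defaults (4317 OTLP/gRPC, 4318 OTLP/HTTP, 16686 UI):

```shell
# Jaeger all-in-one with OTLP ingestion enabled
docker run --rm -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 -p 4318:4318 -p 16686:16686 \
  jaegertracing/all-in-one:latest
```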
2. Enable Tracing in SGLang
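A launch command might look like the sketch below. --trace-level is described later in this document; the tracing-enable and endpoint flag names are assumptions, so verify them against `python -m sglang.launch_server --help` for your version:

```shell
# Flag names other than --trace-level are assumed; check --help.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-trace \
  --otlp-traces-endpoint localhost:4317 \
  --trace-level 3
```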
Start the server with tracing enabled.
3. Send Requests
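Any ordinary request produces a trace once tracing is enabled. For example, against SGLang's OpenAI-compatible endpoint (the port and path are the usual SGLang defaults, but verify for your deployment):

```shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```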
Send requests to generate traces.
4. View Traces
Open the Jaeger UI and search for traces from the “sglang server” service.
Configuration Options
Trace Levels
Control tracing granularity with the --trace-level flag:
- 1: Basic request-level tracing (root spans only)
- 2: Intermediate detail (major processing stages)
- 3: Detailed tracing (default; includes all operations)
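For example, a latency-sensitive production deployment might keep only root spans (the launch-command shape is assumed; only --trace-level itself appears in this document):

```shell
python -m sglang.launch_server --model-path <your-model> --trace-level 1
```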
OTLP Protocol
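How SGLang exposes the protocol choice is not shown here. With the standard OpenTelemetry SDK, the exporter protocol is commonly selected via the `OTEL_EXPORTER_OTLP_PROTOCOL` environment variable; treating that as applicable to SGLang is an assumption:

```shell
# Standard OpenTelemetry SDK environment variables (assumed to apply)
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc            # or http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```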
Choose between gRPC (the default) and HTTP.
Batch Span Processing
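The variable names below come from this document; the values are illustrative starting points, not confirmed defaults:

```shell
# Export smaller batches, more frequently (values are examples only)
export SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE=512
export SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS=500
```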
Optimize trace export with environment variables.
Trace Structure
Span Hierarchy
A typical request trace contains:
- Root Span: Req {request_id}
  - Represents the entire request lifecycle
  - Contains request metadata (rid, model)
- Thread Spans: process-specific execution threads
  - Example: Scheduler [TP 0] (host:abcd1234 | pid:12345)
  - Labels: tp_rank, host_id, pid, thread_label
- Slice Spans: specific processing stages
  - Examples: prefill, decode, kv_transfer
  - Nested to show parent-child relationships
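Putting the pieces together, a trace for one request might have this shape (illustrative only; real span names follow the patterns above):

```
Req 8a1f...                                        ← root span
├─ Scheduler [TP 0] (host:abcd1234 | pid:12345)    ← thread span
│  ├─ prefill                                      ← slice spans
│  └─ decode
└─ Scheduler [TP 1] (host:abcd1234 | pid:12346)
   ├─ prefill
   └─ decode
```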
Span Attributes
Traces include standard GenAI semantic attributes:
- gen_ai.usage.prompt_tokens: number of input tokens
- gen_ai.usage.completion_tokens: number of output tokens
- gen_ai.usage.cached_tokens: number of cached tokens
- gen_ai.request.max_tokens: maximum tokens requested
- gen_ai.request.temperature: sampling temperature
- gen_ai.request.top_p: top-p sampling parameter
- gen_ai.response.model: model identifier
- gen_ai.response.finish_reasons: why generation stopped
- gen_ai.request.id: request identifier
- gen_ai.latency.time_in_queue: queue waiting time
- gen_ai.latency.time_to_first_token: TTFT latency
- gen_ai.latency.e2e: end-to-end latency
- gen_ai.latency.time_in_model_prefill: prefill execution time
- gen_ai.latency.time_in_model_decode: decode execution time
Events
Spans can contain events marking specific occurrences:
- Token generation milestones
- Cache hits/misses
- Queue admissions
- Error conditions
Distributed Tracing
Multi-Process Tracing
In multi-GPU setups (TP/PP/DP), SGLang automatically:
- Assigns unique IDs to prevent collisions across processes
- Labels spans with rank information (tp_rank, pp_rank, dp_rank)
- Links spans across processes using trace context propagation
Prefill-Decode Disaggregation
When using PD disaggregation, traces span both the prefill and decode workers; the bootstrap_room attribute links the related requests.
Client-Side Trace Context
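A client propagates its own trace context by sending a W3C traceparent header. The header format (version-traceid-parentid-flags) comes from the W3C Trace Context specification; the target URL in the comment is an assumed SGLang default:

```python
import re
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes  -> 16 hex chars
    flags = "01" if sampled else "00"  # 01 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

headers = {"traceparent": make_traceparent()}
# e.g. send with: requests.post("http://localhost:30000/generate",
#                               json={...}, headers=headers)
print(headers["traceparent"])
```

Backends that support trace context will then join the server-side spans to the client's trace.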
Propagate trace context from clients using W3C Trace Context headers.
Performance Considerations
Overhead
Tracing introduces minimal overhead:
- Level 1: <1% latency impact
- Level 2: ~1-2% latency impact
- Level 3: ~2-5% latency impact
Sampling
For high-throughput production systems, consider sampling:
- Use a tracing backend with sampling support (e.g., Tempo)
- Configure sampling at the collector level
- Sample based on trace characteristics (slow requests, errors)
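Collector-level sampling on trace characteristics can be sketched with the OpenTelemetry Collector's tail-sampling processor (a sketch under the assumption that you run the contrib collector; thresholds are examples):

```yaml
# OpenTelemetry Collector (contrib) tail-sampling sketch
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
```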
Selective Tracing
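The document notes this requires a custom implementation. One possible approach (an assumption, not a documented SGLang API) is to mark only chosen requests as sampled via the W3C traceparent trace-flags byte and let a collector-side sampler honor that flag:

```python
import secrets

def traceparent_for(interesting: bool) -> str:
    """Set the sampled flag (01) only for requests we want traced.
    Whether downstream components honor the flag depends on your
    collector/sampler configuration (an assumption in this sketch)."""
    flags = "01" if interesting else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

# Example policy: trace only long prompts
prompt = "a very long prompt ..."
headers = {"traceparent": traceparent_for(len(prompt) > 20)}
```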
To trace specific requests, control tracing at the request level (requires a custom implementation).
Troubleshooting
No Traces Appearing
- Verify OpenTelemetry is installed
- Check OTLP endpoint connectivity
- Review SGLang logs for errors
- Verify protocol configuration:
  - gRPC: port 4317 (default)
  - HTTP: port 4318 with the /v1/traces path
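The checks above might look like this (generic diagnostics, not SGLang-specific commands; the log file path is an assumption):

```shell
# 1. Verify the OpenTelemetry packages are importable
python -c "import opentelemetry; print(opentelemetry.__name__)"

# 2. Check OTLP/HTTP endpoint connectivity (any HTTP response beats
#    "connection refused")
curl -v http://localhost:4318/v1/traces

# 3. Scan server logs for exporter errors (log path assumed)
grep -i "otlp\|trace" sglang.log
```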
High Memory Usage
- Reduce SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE
- Decrease SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS
- Lower --trace-level
Trace ID Collisions
SGLang uses a custom ID generator to prevent collisions across processes. If you still see issues:
- Ensure each process has a unique host identifier
- Check that /etc/machine-id exists and is unique
- Verify that MAC addresses differ across machines
Integration Examples
Grafana Tempo
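A minimal sketch of forwarding traces to Tempo through an OpenTelemetry Collector (the `tempo:4317` endpoint and insecure TLS are assumptions for a typical docker-compose setup):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```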
DataDog
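For Datadog, one common pattern is to point the OTLP exporter at a local Datadog Agent with OTLP ingestion enabled. The agent fragment below is a sketch; consult Datadog's documentation for your agent version:

```yaml
# datadog.yaml fragment enabling OTLP ingestion on the Agent (sketch)
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
```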
Next Steps
- Explore monitoring for real-time metrics
- Review available Prometheus metrics
- Run benchmarks to establish baselines
