Overview
SGLang exposes Prometheus metrics when launched with the --enable-metrics flag. These metrics provide insight into performance, resource utilization, and system behavior.
Accessing Metrics
Metrics are exposed at the /metrics endpoint.
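As a rough illustration of reading the endpoint, the sketch below fetches /metrics and parses the Prometheus text format into (name, labels, value) tuples using only the standard library. The base URL http://localhost:30000 and the helper names parse_samples and fetch_metrics are assumptions made for this example; adjust the address to wherever your server listens.

```python
import re
from urllib.request import urlopen

# Matches one sample line: metric name, optional {label block}, then the value.
# Metric names may contain colons, as in "sglang:token_usage".
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_samples(text):
    """Parse Prometheus text exposition into (name, label_block, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(base_url="http://localhost:30000"):
    """Fetch and parse /metrics. The default base_url is an assumption."""
    with urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling fetch_metrics() against a running server returns every exposed sample; parse_samples can also be used on a saved scrape for offline inspection.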
Core Metrics
Request Metrics
Token Counters
sglang:prompt_tokens_total (Counter)
- Number of prefill tokens processed
- Labels: model_name, engine_type, tp_rank, pp_rank, moe_ep_rank
sglang:generation_tokens_total (Counter)
- Number of generation tokens processed
- Labels: model_name, engine_type, tp_rank, pp_rank, moe_ep_rank
sglang:cached_tokens_total (Counter)
- Number of cached tokens (prefix cache hits)
- Labels: model_name, engine_type, cache_source
- Cache sources: device, host, storage_<backend>
sglang:realtime_tokens_total (Counter)
- Total tokens processed, updated on each log interval
- Labels: model_name, mode (values: prefill_compute, prefill_cache, decode)
Request Counts
sglang:num_requests_total (Counter)
- Total number of requests processed
sglang:num_so_requests_total (Counter)
- Number of structured output (grammar) requests processed
sglang:num_aborted_requests_total (Counter)
- Number of requests that were aborted
Latency Histograms
sglang:time_to_first_token_seconds (Histogram)
- Time from request start to first token generation
- Buckets: 0.001s to 400s (logarithmic scale)
sglang:inter_token_latency_seconds (Histogram)
- Time between consecutive token generations
- Buckets: 0.002s to 8s
sglang:e2e_request_latency_seconds (Histogram)
- End-to-end request latency from submission to completion
- Buckets: 0.1s to 2400s
sglang:time_per_output_token_seconds (Histogram)
- Average time per output token
- Calculated as (total_latency - TTFT) / (num_tokens - 1)
- Buckets: 0.005s to 2.5s
sglang:per_stage_req_latency_seconds (Histogram)
- Latency breakdown by request processing stage
- Labels: stage (various internal stages)
- Buckets: Exponential from 1ms to ~1191s
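The formula given for sglang:time_per_output_token_seconds can be written out as a small function. This is a minimal sketch of the arithmetic only; the function name and the choice to return None for requests with fewer than two tokens are assumptions for illustration, not a description of how SGLang handles that case.

```python
def time_per_output_token(total_latency_s, ttft_s, num_tokens):
    """Average time per output token, excluding the first token.

    Implements (total_latency - TTFT) / (num_tokens - 1).
    Returns None when fewer than two tokens were generated, because the
    denominator would be zero (this fallback is an assumption).
    """
    if num_tokens < 2:
        return None
    return (total_latency_s - ttft_s) / (num_tokens - 1)


# Example: 2.5 s end to end, 0.5 s to first token, 101 tokens generated.
tpot = time_per_output_token(2.5, 0.5, 101)
print(f"{tpot:.3f} s per output token")
```

The first token is excluded because its latency is dominated by prefill and is already captured by the time-to-first-token histogram.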
System State Metrics
Resource Usage
sglang:num_running_reqs (Gauge)
- Number of currently running requests
sglang:num_queue_reqs (Gauge)
- Number of requests in the waiting queue
sglang:num_used_tokens (Gauge)
- Number of tokens currently in use in the KV cache
sglang:token_usage (Gauge)
- Fraction of KV cache capacity in use (0.0 to 1.0)
sglang:max_total_num_tokens (Gauge)
- Maximum total number of tokens in the KV cache pool
sglang:gen_throughput (Gauge)
- Current generation throughput in tokens per second
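The resource gauges above can be combined to estimate KV cache headroom. The sketch below assumes that sglang:token_usage corresponds to num_used_tokens divided by max_total_num_tokens; that relationship is an assumption made here for illustration, and the function name is hypothetical.

```python
def kv_cache_usage(num_used_tokens, max_total_num_tokens):
    """Return (usage_fraction, free_tokens) for the KV cache pool.

    Assumes usage is simply used / capacity, which should be checked
    against the values the server actually reports.
    """
    if max_total_num_tokens <= 0:
        raise ValueError("max_total_num_tokens must be positive")
    usage = num_used_tokens / max_total_num_tokens
    free_tokens = max_total_num_tokens - num_used_tokens
    return usage, free_tokens


# Example with made-up gauge readings.
usage, free_tokens = kv_cache_usage(25_000, 100_000)
print(f"usage={usage:.2f}, free tokens={free_tokens}")
```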
Cache Metrics
sglang:cache_hit_rate (Gauge)
- Prefix cache hit rate (0.0 to 1.0)
sglang:cache_config_info (Gauge)
- Cache configuration information
- Labels: page_size, num_pages
- Value is always 1 (info metric)
Performance Metrics
Function Latency
sglang:func_latency_seconds (Histogram)
- Latency of key functions in seconds
- Labels: name (function name, e.g., generate_request)
- Buckets: 50ms to ~50s (exponential)
GPU Execution
sglang:gpu_execution_seconds_total (Counter)
- Total time GPU is busy executing workloads
- Labels: category (forward mode category)
sglang:cuda_graph_passes_total (Counter)
- Number of forward passes using CUDA graphs
- Labels: mode (decode_cuda_graph or decode_none)
sglang:is_cuda_graph (Gauge)
- Whether the current batch is using CUDA graph (1.0 or 0.0)
Queue Time
sglang:queue_time_seconds (Histogram)
- Time requests spend in the waiting queue
- Buckets: 0s to 3000s
Speculative Decoding Metrics
These metrics are available when speculative decoding is enabled:
sglang:spec_accept_length (Gauge)
- Average number of tokens accepted per speculative decoding step
sglang:spec_accept_rate (Gauge)
- Acceptance rate: accepted tokens / total draft tokens
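The acceptance rate definition above is a plain ratio. The sketch below shows the arithmetic; the function name and the choice to return 0.0 when no draft tokens were proposed are assumptions for illustration.

```python
def spec_accept_rate(accepted_tokens, total_draft_tokens):
    """Acceptance rate: accepted tokens / total draft tokens.

    Returns 0.0 when no draft tokens were proposed (an assumed convention
    to avoid dividing by zero).
    """
    if total_draft_tokens == 0:
        return 0.0
    return accepted_tokens / total_draft_tokens


# Example: 30 of 40 drafted tokens were accepted.
print(spec_accept_rate(30, 40))
```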
Advanced Features
Grammar/Structured Output Metrics
sglang:grammar_compilation_time_seconds (Histogram)
- Time to compile grammar/schema definitions
- Buckets: 0s to 240s
sglang:num_grammar_cache_hit_total (Counter)
- Number of grammar cache hits
sglang:num_grammar_aborted_total (Counter)
- Number of grammar requests that were aborted
sglang:num_grammar_timeout_total (Counter)
- Number of grammar timeouts
sglang:num_grammar_queue_reqs (Gauge)
- Number of requests in the grammar waiting queue
sglang:grammar_schema_count (Histogram)
- Number of schemas in grammar definitions
sglang:grammar_ebnf_size (Histogram)
- Size of EBNF grammar definitions in bytes
sglang:grammar_tree_traversal_time_avg (Histogram)
- Average time for grammar tree traversal
sglang:grammar_tree_traversal_time_max (Histogram)
- Maximum time for grammar tree traversal
Retraction Metrics
sglang:num_retracted_reqs (Gauge)
- Current number of retracted requests
sglang:num_retracted_requests_total (Counter)
- Total number of requests that have been retracted
sglang:num_retracted_input_tokens_total (Counter)
- Total number of input tokens from retracted requests
sglang:num_retracted_output_tokens_total (Counter)
- Total number of output tokens from retracted requests
sglang:num_retractions (Histogram)
- Distribution of retraction counts per request
sglang:num_paused_reqs (Gauge)
- Number of requests paused by async weight sync
LoRA Metrics
Available when LoRA adapters are enabled:
sglang:lora_pool_slots_used (Gauge)
- Number of LoRA adapter slots currently in use
sglang:lora_pool_slots_total (Gauge)
- Total number of LoRA adapter slots available (max_loras_per_batch)
sglang:lora_pool_utilization (Gauge)
- LoRA pool utilization ratio (used/total), 1.0 means pool is full
Hierarchical Cache (HiCache) Metrics
Available when hierarchical cache is enabled:
sglang:hicache_host_used_tokens (Gauge)
- Number of tokens currently in host (CPU) memory cache
sglang:hicache_host_total_tokens (Gauge)
- Total capacity of host KV cache in tokens
Prefill-Decode Disaggregation
These metrics are available in disaggregated prefill/decode mode:
Prefill Worker Metrics
sglang:num_prefill_prealloc_queue_reqs (Gauge)
- Number of requests in prefill preallocation queue
sglang:num_prefill_inflight_queue_reqs (Gauge)
- Number of requests in prefill inflight queue
sglang:num_prefill_retries_total (Counter)
- Total number of prefill retries
Decode Worker Metrics
sglang:num_decode_prealloc_queue_reqs (Gauge)
- Number of requests in decode preallocation queue
sglang:num_decode_transfer_queue_reqs (Gauge)
- Number of requests in decode transfer queue
KV Transfer Metrics
sglang:kv_transfer_speed_gb_s (Gauge)
- KV cache transfer speed in GB/s
sglang:kv_transfer_latency_ms (Gauge)
- KV cache transfer latency in milliseconds
sglang:kv_transfer_bootstrap_ms (Gauge)
- Bootstrap time for KV transfer in milliseconds
sglang:kv_transfer_alloc_ms (Gauge)
- Allocation waiting time for KV transfer in milliseconds
sglang:kv_transfer_total_mb (Gauge)
- Total size of KV data transferred in megabytes
sglang:num_bootstrap_failed_reqs_total (Counter)
- Number of requests that failed during bootstrap
sglang:num_transfer_failed_reqs_total (Counter)
- Number of requests that failed during transfer
Storage Backend Metrics
For L3 storage cache:
sglang:prefetched_tokens_total (Counter)
- Number of tokens prefetched from storage
sglang:backuped_tokens_total (Counter)
- Number of tokens backed up to storage
sglang:prefetch_pgs (Histogram)
- Distribution of prefetch page counts
sglang:backup_pgs (Histogram)
- Distribution of backup page counts
sglang:prefetch_bandwidth (Histogram)
- Prefetch bandwidth in GB/s
sglang:backup_bandwidth (Histogram)
- Backup bandwidth in GB/s
Routing Key Metrics
sglang:num_unique_running_routing_keys (Gauge)
- Number of unique routing keys in the running batch
sglang:routing_key_running_req_count (GaugeHistogram)
- Distribution of routing keys by running request count
sglang:routing_key_all_req_count (GaugeHistogram)
- Distribution of routing keys by total (running + waiting) request count
CPU Metrics
sglang:process_cpu_seconds_total (Counter)
- Total CPU time consumed by the process (user + system)
- Labels: component
Utilization Metrics
sglang:utilization (Gauge)
- Overall system utilization (0.0 to >1.0)
- Calculated from request load and token usage
sglang:max_running_requests_under_SLO (Gauge)
- Maximum number of running requests while meeting SLO targets
sglang:new_token_ratio (Gauge)
- Ratio of new tokens to total tokens in prefill batches
Engine Startup Metrics
sglang:engine_startup_time (Gauge)
- Time taken for the engine to start up in seconds
sglang:engine_load_weights_time (Gauge)
- Time taken to load model weights in seconds
Data Parallel Cooperation Metrics
For multi-rank data parallel setups:
sglang:dp_cooperation_realtime_tokens_total (Counter)
- Tokens processed with DP cooperation labels
- Additional label: num_prefill_ranks
sglang:dp_cooperation_gpu_execution_seconds_total (Counter)
- GPU execution time with DP cooperation labels
- Additional label: num_prefill_ranks
Prefill Delayer Metrics
sglang:prefill_delayer_wait_forward_passes (Histogram)
- Number of forward passes waited by prefill delayer
sglang:prefill_delayer_wait_seconds (Histogram)
- Wait time in seconds by prefill delayer
sglang:prefill_delayer_outcomes_total (Counter)
- Prefill delayer outcome counts
- Labels: input_estimation, output_allow, output_reason, actual_execution
MoE Expert Parallel Metrics
For MoE models with expert parallelism:
sglang:eplb_balancedness (Summary)
- Load balancing across MoE experts
- Labels: forward_mode
Label Descriptions
Common labels across metrics:
- model_name: Name of the served model
- engine_type: Type of engine (unified, prefill, or decode)
- tp_rank: Tensor parallel rank (0 to tp_size-1)
- pp_rank: Pipeline parallel rank (0 to pp_size-1)
- dp_rank: Data parallel rank (if applicable)
- moe_ep_rank: MoE expert parallel rank
Additional labels can be supplied with --extra-metric-labels.
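When post-processing scraped samples, it can help to turn a label block into a dictionary and filter on the common labels listed above. The following is a minimal sketch; parse_labels and matches are hypothetical helper names, and the sample label values are made up.

```python
import re

# Matches name="value" pairs inside a Prometheus label block.
# Escaped characters inside values are kept as-is, not unescaped.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_labels(label_block):
    """Convert '{a="1",b="2"}' into {'a': '1', 'b': '2'}."""
    return {key: value for key, value in _LABEL_RE.findall(label_block)}


def matches(label_block, **wanted):
    """Return True if every wanted label is present with the given value."""
    labels = parse_labels(label_block)
    return all(labels.get(key) == value for key, value in wanted.items())


block = '{model_name="example-model",engine_type="unified",tp_rank="0"}'
print(parse_labels(block))
print(matches(block, engine_type="unified", tp_rank="0"))
```

Label values are always strings in the exposition format, so ranks such as tp_rank compare against "0" rather than the integer 0.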
Querying Metrics
Example PromQL Queries
Average TTFT over last 5 minutes:
Customizing Histogram Buckets
You can customize histogram buckets using environment variables.
Next Steps
- Set up monitoring dashboards
- Enable request tracing for detailed insights
- Run performance benchmarks
