
Overview

SGLang exposes Prometheus metrics when launched with the --enable-metrics flag. The metrics cover request latency, token throughput, cache and resource utilization, and scheduler state.

Accessing Metrics

Metrics are exposed at the /metrics endpoint:
curl http://localhost:30000/metrics
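The endpoint returns the Prometheus text exposition format, so the same data can be read from a script. The following is a minimal sketch using only the Python standard library; the helper names parse_metrics and fetch_metrics are illustrative and not part of SGLang, and the parser handles only plain sample lines (it skips HELP and TYPE comments).

```python
import re
import urllib.request

# Matches sample lines such as: name{label="value",...} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url="http://localhost:30000/metrics"):
    """Scrape the endpoint once and parse the response (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling fetch_metrics() against a running server should yield one tuple per sample line. For production monitoring, a Prometheus server scraping the endpoint is the more usual setup.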

Core Metrics

Request Metrics

Token Counters

sglang:prompt_tokens_total (Counter)
  • Number of prefill tokens processed
  • Labels: model_name, engine_type, tp_rank, pp_rank, moe_ep_rank
sglang:generation_tokens_total (Counter)
  • Number of generation tokens processed
  • Labels: model_name, engine_type, tp_rank, pp_rank, moe_ep_rank
sglang:cached_tokens_total (Counter)
  • Number of cached tokens (prefix cache hits)
  • Labels: model_name, engine_type, cache_source
  • Cache sources: device, host, storage_<backend>
sglang:realtime_tokens_total (Counter)
  • Total tokens processed, updated on each log interval
  • Labels: model_name, mode (values: prefill_compute, prefill_cache, decode)

Request Counts

sglang:num_requests_total (Counter)
  • Total number of requests processed
sglang:num_so_requests_total (Counter)
  • Number of structured output (grammar) requests processed
sglang:num_aborted_requests_total (Counter)
  • Number of requests that were aborted

Latency Histograms

sglang:time_to_first_token_seconds (Histogram)
  • Time from request start to first token generation
  • Buckets: 0.001s to 400s (logarithmic scale)
Example output:
sglang:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2351897.947
sglang:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 11008
sglang:inter_token_latency_seconds (Histogram)
  • Time between consecutive token generations
  • Buckets: 0.002s to 8s
sglang:e2e_request_latency_seconds (Histogram)
  • End-to-end request latency from submission to completion
  • Buckets: 0.1s to 2400s
sglang:time_per_output_token_seconds (Histogram)
  • Average time per output token
  • Calculated as (total_latency - TTFT) / (num_tokens - 1)
  • Buckets: 0.005s to 2.5s
sglang:per_stage_req_latency_seconds (Histogram)
  • Latency breakdown by request processing stage
  • Labels: stage (various internal stages)
  • Buckets: Exponential from 1ms to ~1191s
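
The formula given for sglang:time_per_output_token_seconds can be worked through with made-up numbers. This is a sketch only: the function name and the input values are illustrative, and returning None for requests with fewer than two output tokens is an assumption, not documented SGLang behavior.

```python
def time_per_output_token(total_latency_s, ttft_s, num_tokens):
    """Compute (total_latency - TTFT) / (num_tokens - 1) in seconds per token."""
    if num_tokens < 2:
        # The formula divides by zero for a single token; this sketch
        # treats that case as undefined (an assumption).
        return None
    return (total_latency_s - ttft_s) / (num_tokens - 1)


# Illustrative request: 2.5 s end to end, 0.5 s to first token, 101 tokens.
tpot = time_per_output_token(total_latency_s=2.5, ttft_s=0.5, num_tokens=101)
print(f"time per output token: {tpot:.3f} s")
```

The first token is excluded from the denominator because its latency is already accounted for by TTFT.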

System State Metrics

Resource Usage

sglang:num_running_reqs (Gauge)
  • Number of currently running requests
sglang:num_queue_reqs (Gauge)
  • Number of requests in the waiting queue
sglang:num_used_tokens (Gauge)
  • Number of tokens currently in use in the KV cache
sglang:token_usage (Gauge)
  • Fraction of KV cache capacity in use (0.0 to 1.0)
sglang:max_total_num_tokens (Gauge)
  • Maximum total number of tokens in the KV cache pool
sglang:gen_throughput (Gauge)
  • Current generation throughput in tokens per second

Cache Metrics

sglang:cache_hit_rate (Gauge)
  • Prefix cache hit rate (0.0 to 1.0)
sglang:cache_config_info (Gauge)
  • Cache configuration information
  • Labels: page_size, num_pages
  • Value is always 1 (info metric)

Performance Metrics

Function Latency

sglang:func_latency_seconds (Histogram)
  • Latency of key functions in seconds
  • Labels: name (function name, e.g., generate_request)
  • Buckets: 50ms to ~50s (exponential)

GPU Execution

sglang:gpu_execution_seconds_total (Counter)
  • Total time GPU is busy executing workloads
  • Labels: category (forward mode category)
sglang:cuda_graph_passes_total (Counter)
  • Number of forward passes using CUDA graphs
  • Labels: mode (decode_cuda_graph or decode_none)
sglang:is_cuda_graph (Gauge)
  • Whether the current batch is using CUDA graph (1.0 or 0.0)

Queue Time

sglang:queue_time_seconds (Histogram)
  • Time requests spend in the waiting queue
  • Buckets: 0s to 3000s

Speculative Decoding Metrics

These metrics are available when speculative decoding is enabled:
sglang:spec_accept_length (Gauge)
  • Average number of tokens accepted per speculative decoding step
sglang:spec_accept_rate (Gauge)
  • Acceptance rate: accepted tokens / total draft tokens
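
As a worked instance of the acceptance-rate definition, the sketch below divides accepted tokens by total draft tokens. The function name and the sample counts are hypothetical, and returning 0.0 when no draft tokens were produced is an assumption made to avoid division by zero.

```python
def spec_accept_rate(accepted_tokens, draft_tokens):
    """Acceptance rate: accepted tokens / total draft tokens."""
    if draft_tokens == 0:
        return 0.0  # assumption: report 0.0 when nothing was drafted
    return accepted_tokens / draft_tokens


# Illustrative counts: 300 of 400 drafted tokens were accepted.
rate = spec_accept_rate(accepted_tokens=300, draft_tokens=400)
print(f"acceptance rate: {rate:.2f}")
```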

Advanced Features

Grammar/Structured Output Metrics

sglang:grammar_compilation_time_seconds (Histogram)
  • Time to compile grammar/schema definitions
  • Buckets: 0s to 240s
sglang:num_grammar_cache_hit_total (Counter)
  • Number of grammar cache hits
sglang:num_grammar_aborted_total (Counter)
  • Number of grammar requests that were aborted
sglang:num_grammar_timeout_total (Counter)
  • Number of grammar timeouts
sglang:num_grammar_queue_reqs (Gauge)
  • Number of requests in the grammar waiting queue
sglang:grammar_schema_count (Histogram)
  • Number of schemas in grammar definitions
sglang:grammar_ebnf_size (Histogram)
  • Size of EBNF grammar definitions in bytes
sglang:grammar_tree_traversal_time_avg (Histogram)
  • Average time for grammar tree traversal
sglang:grammar_tree_traversal_time_max (Histogram)
  • Maximum time for grammar tree traversal

Retraction Metrics

sglang:num_retracted_reqs (Gauge)
  • Current number of retracted requests
sglang:num_retracted_requests_total (Counter)
  • Total number of requests that have been retracted
sglang:num_retracted_input_tokens_total (Counter)
  • Total number of input tokens from retracted requests
sglang:num_retracted_output_tokens_total (Counter)
  • Total number of output tokens from retracted requests
sglang:num_retractions (Histogram)
  • Distribution of retraction counts per request
sglang:num_paused_reqs (Gauge)
  • Number of requests paused by async weight sync

LoRA Metrics

Available when LoRA adapters are enabled:
sglang:lora_pool_slots_used (Gauge)
  • Number of LoRA adapter slots currently in use
sglang:lora_pool_slots_total (Gauge)
  • Total number of LoRA adapter slots available (max_loras_per_batch)
sglang:lora_pool_utilization (Gauge)
  • LoRA pool utilization ratio (used/total), 1.0 means pool is full

Hierarchical Cache (HiCache) Metrics

Available when hierarchical cache is enabled:
sglang:hicache_host_used_tokens (Gauge)
  • Number of tokens currently in host (CPU) memory cache
sglang:hicache_host_total_tokens (Gauge)
  • Total capacity of host KV cache in tokens

Prefill-Decode Disaggregation

These metrics are available in disaggregated prefill/decode mode:

Prefill Worker Metrics

sglang:num_prefill_prealloc_queue_reqs (Gauge)
  • Number of requests in prefill preallocation queue
sglang:num_prefill_inflight_queue_reqs (Gauge)
  • Number of requests in prefill inflight queue
sglang:num_prefill_retries_total (Counter)
  • Total number of prefill retries

Decode Worker Metrics

sglang:num_decode_prealloc_queue_reqs (Gauge)
  • Number of requests in decode preallocation queue
sglang:num_decode_transfer_queue_reqs (Gauge)
  • Number of requests in decode transfer queue

KV Transfer Metrics

sglang:kv_transfer_speed_gb_s (Gauge)
  • KV cache transfer speed in GB/s
sglang:kv_transfer_latency_ms (Gauge)
  • KV cache transfer latency in milliseconds
sglang:kv_transfer_bootstrap_ms (Gauge)
  • Bootstrap time for KV transfer in milliseconds
sglang:kv_transfer_alloc_ms (Gauge)
  • Allocation waiting time for KV transfer in milliseconds
sglang:kv_transfer_total_mb (Gauge)
  • Total size of KV data transferred in megabytes
sglang:num_bootstrap_failed_reqs_total (Counter)
  • Number of requests that failed during bootstrap
sglang:num_transfer_failed_reqs_total (Counter)
  • Number of requests that failed during transfer

Storage Backend Metrics

For L3 storage cache:
sglang:prefetched_tokens_total (Counter)
  • Number of tokens prefetched from storage
sglang:backuped_tokens_total (Counter)
  • Number of tokens backed up to storage
sglang:prefetch_pgs (Histogram)
  • Distribution of prefetch page counts
sglang:backup_pgs (Histogram)
  • Distribution of backup page counts
sglang:prefetch_bandwidth (Histogram)
  • Prefetch bandwidth in GB/s
sglang:backup_bandwidth (Histogram)
  • Backup bandwidth in GB/s

Routing Key Metrics

sglang:num_unique_running_routing_keys (Gauge)
  • Number of unique routing keys in the running batch
sglang:routing_key_running_req_count (GaugeHistogram)
  • Distribution of routing keys by running request count
sglang:routing_key_all_req_count (GaugeHistogram)
  • Distribution of routing keys by total (running + waiting) request count

CPU Metrics

sglang:process_cpu_seconds_total (Counter)
  • Total CPU time consumed by the process (user + system)
  • Labels: component

Utilization Metrics

sglang:utilization (Gauge)
  • Overall system utilization (starts at 0.0 and can exceed 1.0)
  • Calculated from request load and token usage
sglang:max_running_requests_under_SLO (Gauge)
  • Maximum number of running requests while meeting SLO targets
sglang:new_token_ratio (Gauge)
  • Ratio of new tokens to total tokens in prefill batches

Engine Startup Metrics

sglang:engine_startup_time (Gauge)
  • Time taken for the engine to start up in seconds
sglang:engine_load_weights_time (Gauge)
  • Time taken to load model weights in seconds

Data Parallel Cooperation Metrics

For multi-rank data parallel setups:
sglang:dp_cooperation_realtime_tokens_total (Counter)
  • Tokens processed with DP cooperation labels
  • Additional label: num_prefill_ranks
sglang:dp_cooperation_gpu_execution_seconds_total (Counter)
  • GPU execution time with DP cooperation labels
  • Additional label: num_prefill_ranks

Prefill Delayer Metrics

sglang:prefill_delayer_wait_forward_passes (Histogram)
  • Number of forward passes waited by prefill delayer
sglang:prefill_delayer_wait_seconds (Histogram)
  • Wait time in seconds by prefill delayer
sglang:prefill_delayer_outcomes_total (Counter)
  • Prefill delayer outcome counts
  • Labels: input_estimation, output_allow, output_reason, actual_execution

MoE Expert Parallel Metrics

For MoE models with expert parallelism:
sglang:eplb_balancedness (Summary)
  • Load balancing across MoE experts
  • Labels: forward_mode

Label Descriptions

Common labels across metrics:
  • model_name: Name of the served model
  • engine_type: Type of engine (unified, prefill, or decode)
  • tp_rank: Tensor parallel rank (0 to tp_size-1)
  • pp_rank: Pipeline parallel rank (0 to pp_size-1)
  • dp_rank: Data parallel rank (if applicable)
  • moe_ep_rank: MoE expert parallel rank
Custom labels can be added via --extra-metric-labels.

Querying Metrics

Example PromQL Queries

Average TTFT over last 5 minutes:
rate(sglang:time_to_first_token_seconds_sum[5m]) / 
rate(sglang:time_to_first_token_seconds_count[5m])
Token throughput (tokens/sec):
rate(sglang:generation_tokens_total[1m])
Cache hit rate:
sglang:cache_hit_rate
Current queue depth:
sglang:num_queue_reqs
P99 E2E latency:
histogram_quantile(0.99, 
  rate(sglang:e2e_request_latency_seconds_bucket[5m])
)
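
To make the average and quantile queries above more concrete, here is a rough offline approximation of what they compute. The function names and sample numbers are illustrative. The quantile estimate uses linear interpolation inside the matching bucket, which is the general approach behind histogram_quantile; it is a simplified sketch and not Prometheus's exact implementation. The bucket counts passed in are assumed to be the cumulative increases over the query window.

```python
import math


def window_average(sum_start, count_start, sum_end, count_end):
    """Mean observation between two scrapes of a histogram's _sum and _count."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no new observations, or a counter reset, in the window
    return (sum_end - sum_start) / delta_count


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Quantile falls in the +Inf bucket (or an empty bucket):
                # fall back to the last known finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Illustrative data: two scrapes of _sum/_count, and one set of bucket counts.
avg = window_average(sum_start=100.0, count_start=10, sum_end=160.0, count_end=40)
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = estimate_quantile(0.5, example_buckets)
p99 = estimate_quantile(0.99, example_buckets)
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the buckets cover the range where most observations fall.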

Customizing Histogram Buckets

You can customize histogram buckets using environment variables:
export SGLANG_PROMPT_TOKENS_BUCKETS="100,500,1000,5000,10000"
export SGLANG_GENERATION_TOKENS_BUCKETS="10,50,100,500,1000"

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-metrics
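
Before exporting a custom bucket list, it can help to check that the value is a well-formed, strictly increasing list of numbers, since Prometheus histogram bounds must be in ascending order. The helper below is a hypothetical pre-flight check and does not reflect how SGLang itself parses these variables.

```python
def parse_buckets(value):
    """Parse a string like '100,500,1000' into strictly increasing floats."""
    bounds = [float(part) for part in value.split(",") if part.strip()]
    if not bounds:
        raise ValueError("no bucket bounds given")
    if not all(a < b for a, b in zip(bounds, bounds[1:])):
        raise ValueError(f"bucket bounds must be strictly increasing: {bounds}")
    return bounds


# The value used for SGLANG_PROMPT_TOKENS_BUCKETS in the example above.
prompt_buckets = parse_buckets("100,500,1000,5000,10000")
print(prompt_buckets)
```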
