Prometheus Metrics

Overview

SGLang exposes comprehensive Prometheus metrics when launched with the --enable-metrics flag. These metrics provide insights into performance, resource utilization, and system behavior.

Accessing Metrics

Metrics are exposed at the /metrics endpoint:

curl http://localhost:30000/metrics
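
The endpoint returns the Prometheus text exposition format, so the same data can be read programmatically. The following is a minimal sketch, not part of SGLang: the helper names fetch_metrics and parse_metrics are hypothetical, the URL assumes the default address used in the curl example, and the parser handles only plain sample lines (it skips # HELP and # TYPE comments and ignores optional timestamps).

```python
import re
import urllib.request
from typing import Dict, List, Tuple

# A sample line looks like: name{label="value",...} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def fetch_metrics(url: str = "http://localhost:30000/metrics") -> str:
    """Download the raw metrics text (URL is an assumption; adjust host/port)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text: str) -> List[Sample]:
    """Turn exposition text into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

With a server running, parse_metrics(fetch_metrics()) yields one tuple per exposed sample, which can then be filtered by metric name or label.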

Core Metrics

Request Metrics

Token Counters

sglang:prompt_tokens_total (Counter)

Number of prefill tokens processed
Labels: model_name, engine_type, tp_rank, pp_rank, moe_ep_rank

sglang:generation_tokens_total (Counter)

Number of generation tokens processed
Labels: model_name, engine_type, tp_rank, pp_rank, moe_ep_rank

sglang:cached_tokens_total (Counter)

Number of cached tokens (prefix cache hits)
Labels: model_name, engine_type, cache_source
Cache sources: device, host, storage_<backend>

sglang:realtime_tokens_total (Counter)

Total tokens processed, updated on each log interval
Labels: model_name, mode (values: prefill_compute, prefill_cache, decode)

Request Counts

sglang:num_requests_total (Counter)

Total number of requests processed

sglang:num_so_requests_total (Counter)

Number of structured output (grammar) requests processed

sglang:num_aborted_requests_total (Counter)

Number of requests that were aborted

Latency Histograms

sglang:time_to_first_token_seconds (Histogram)

Time from request start to first token generation
Buckets: 0.001s to 400s (logarithmic scale)

Example output:

sglang:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 2351897.947
sglang:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 11008

sglang:inter_token_latency_seconds (Histogram)

Time between consecutive token generations
Buckets: 0.002s to 8s

sglang:e2e_request_latency_seconds (Histogram)

End-to-end request latency from submission to completion
Buckets: 0.1s to 2400s

sglang:time_per_output_token_seconds (Histogram)

Average time per output token
Calculated as (total_latency - TTFT) / (num_tokens - 1)
Buckets: 0.005s to 2.5s
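
The formula above can be written out as a small function. This is an illustrative sketch only: the function name is hypothetical, and raising an error for requests with fewer than two output tokens is a choice made for this example, not a statement about how SGLang handles that case.

```python
def time_per_output_token(total_latency: float, ttft: float, num_tokens: int) -> float:
    """Average seconds per output token after the first one.

    Implements (total_latency - TTFT) / (num_tokens - 1).
    """
    if num_tokens < 2:
        # With a single output token there is no inter-token interval to average.
        raise ValueError("num_tokens must be at least 2")
    return (total_latency - ttft) / (num_tokens - 1)
```

For example, a request that took 2.0 s end to end, with a 0.5 s TTFT and 16 output tokens, spends 1.5 s on the remaining 15 tokens.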

sglang:per_stage_req_latency_seconds (Histogram)

Latency breakdown by request processing stage
Labels: stage (various internal stages)
Buckets: Exponential from 1ms to ~1191s

System State Metrics

Resource Usage

sglang:num_running_reqs (Gauge)

Number of currently running requests

sglang:num_queue_reqs (Gauge)

Number of requests in the waiting queue

sglang:num_used_tokens (Gauge)

Number of tokens currently in use in the KV cache

sglang:token_usage (Gauge)

Fraction of KV cache capacity in use (0.0 to 1.0)

sglang:max_total_num_tokens (Gauge)

Maximum total number of tokens in the KV cache pool

sglang:gen_throughput (Gauge)

Current generation throughput in tokens per second

Cache Metrics

sglang:cache_hit_rate (Gauge)

Prefix cache hit rate (0.0 to 1.0)

sglang:cache_config_info (Gauge)

Cache configuration information
Labels: page_size, num_pages
Value is always 1 (info metric)

Performance Metrics

Function Latency

sglang:func_latency_seconds (Histogram)

Latency of key functions in seconds
Labels: name (function name, e.g., generate_request)
Buckets: 50ms to ~50s (exponential)

GPU Execution

sglang:gpu_execution_seconds_total (Counter)

Total time GPU is busy executing workloads
Labels: category (forward mode category)

sglang:cuda_graph_passes_total (Counter)

Number of forward passes using CUDA graphs
Labels: mode (decode_cuda_graph or decode_none)

sglang:is_cuda_graph (Gauge)

Whether the current batch is using CUDA graph (1.0 or 0.0)

Queue Time

sglang:queue_time_seconds (Histogram)

Time requests spend in the waiting queue
Buckets: 0s to 3000s

Speculative Decoding Metrics

These metrics are available when speculative decoding is enabled:

sglang:spec_accept_length (Gauge)

Average number of tokens accepted per speculative decoding step

sglang:spec_accept_rate (Gauge)

Acceptance rate: accepted tokens / total draft tokens
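
To make the two definitions concrete, the sketch below derives both values from raw counts. The function and its three inputs are hypothetical names chosen for illustration; how SGLang tracks these counts internally is not covered here.

```python
from typing import Tuple


def spec_accept_stats(accepted_tokens: int, draft_tokens: int, num_steps: int) -> Tuple[float, float]:
    """Return (accept_length, accept_rate) from raw counts.

    accept_length: accepted tokens per speculative decoding step
    accept_rate:   accepted tokens / total draft tokens
    """
    accept_length = accepted_tokens / num_steps if num_steps else 0.0
    accept_rate = accepted_tokens / draft_tokens if draft_tokens else 0.0
    return accept_length, accept_rate
```

An accept rate close to 1.0 means that most draft tokens are being kept, while a low rate suggests that the draft model's proposals are often rejected.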

Advanced Features

Grammar/Structured Output Metrics

sglang:grammar_compilation_time_seconds (Histogram)

Time to compile grammar/schema definitions
Buckets: 0s to 240s

sglang:num_grammar_cache_hit_total (Counter)

Number of grammar cache hits

sglang:num_grammar_aborted_total (Counter)

Number of grammar requests that were aborted

sglang:num_grammar_timeout_total (Counter)

Number of grammar timeouts

sglang:num_grammar_queue_reqs (Gauge)

Number of requests in the grammar waiting queue

sglang:grammar_schema_count (Histogram)

Number of schemas in grammar definitions

sglang:grammar_ebnf_size (Histogram)

Size of EBNF grammar definitions in bytes

sglang:grammar_tree_traversal_time_avg (Histogram)

Average time for grammar tree traversal

sglang:grammar_tree_traversal_time_max (Histogram)

Maximum time for grammar tree traversal

Retraction Metrics

sglang:num_retracted_reqs (Gauge)

Current number of retracted requests

sglang:num_retracted_requests_total (Counter)

Total number of requests that have been retracted

sglang:num_retracted_input_tokens_total (Counter)

Total number of input tokens from retracted requests

sglang:num_retracted_output_tokens_total (Counter)

Total number of output tokens from retracted requests

sglang:num_retractions (Histogram)

Distribution of retraction counts per request

sglang:num_paused_reqs (Gauge)

Number of requests paused by async weight sync

LoRA Metrics

Available when LoRA adapters are enabled:

sglang:lora_pool_slots_used (Gauge)

Number of LoRA adapter slots currently in use

sglang:lora_pool_slots_total (Gauge)

Total number of LoRA adapter slots available (max_loras_per_batch)

sglang:lora_pool_utilization (Gauge)

LoRA pool utilization ratio (used/total), 1.0 means pool is full

Hierarchical Cache (HiCache) Metrics

Available when hierarchical cache is enabled:

sglang:hicache_host_used_tokens (Gauge)

Number of tokens currently in host (CPU) memory cache

sglang:hicache_host_total_tokens (Gauge)

Total capacity of host KV cache in tokens

Prefill-Decode Disaggregation

These metrics are available in disaggregated prefill/decode mode:

Prefill Worker Metrics

sglang:num_prefill_prealloc_queue_reqs (Gauge)

Number of requests in prefill preallocation queue

sglang:num_prefill_inflight_queue_reqs (Gauge)

Number of requests in prefill inflight queue

sglang:num_prefill_retries_total (Counter)

Total number of prefill retries

Decode Worker Metrics

sglang:num_decode_prealloc_queue_reqs (Gauge)

Number of requests in decode preallocation queue

sglang:num_decode_transfer_queue_reqs (Gauge)

Number of requests in decode transfer queue

KV Transfer Metrics

sglang:kv_transfer_speed_gb_s (Gauge)

KV cache transfer speed in GB/s

sglang:kv_transfer_latency_ms (Gauge)

KV cache transfer latency in milliseconds

sglang:kv_transfer_bootstrap_ms (Gauge)

Bootstrap time for KV transfer in milliseconds

sglang:kv_transfer_alloc_ms (Gauge)

Allocation waiting time for KV transfer in milliseconds

sglang:kv_transfer_total_mb (Gauge)

Total size of KV data transferred in megabytes

sglang:num_bootstrap_failed_reqs_total (Counter)

Number of requests that failed during bootstrap

sglang:num_transfer_failed_reqs_total (Counter)

Number of requests that failed during transfer

Storage Backend Metrics

For L3 storage cache:

sglang:prefetched_tokens_total (Counter)

Number of tokens prefetched from storage

sglang:backuped_tokens_total (Counter)

Number of tokens backed up to storage

sglang:prefetch_pgs (Histogram)

Distribution of prefetch page counts

sglang:backup_pgs (Histogram)

Distribution of backup page counts

sglang:prefetch_bandwidth (Histogram)

Prefetch bandwidth in GB/s

sglang:backup_bandwidth (Histogram)

Backup bandwidth in GB/s

Routing Key Metrics

sglang:num_unique_running_routing_keys (Gauge)

Number of unique routing keys in the running batch

sglang:routing_key_running_req_count (GaugeHistogram)

Distribution of routing keys by running request count

sglang:routing_key_all_req_count (GaugeHistogram)

Distribution of routing keys by total (running + waiting) request count

CPU Metrics

sglang:process_cpu_seconds_total (Counter)

Total CPU time consumed by the process (user + system)
Labels: component

Utilization Metrics

sglang:utilization (Gauge)

Overall system utilization (starts at 0.0 and can exceed 1.0)
Calculated from request load and token usage

sglang:max_running_requests_under_SLO (Gauge)

Maximum number of running requests while meeting SLO targets

sglang:new_token_ratio (Gauge)

Ratio of new tokens to total tokens in prefill batches

Engine Startup Metrics

sglang:engine_startup_time (Gauge)

Time taken for the engine to start up in seconds

sglang:engine_load_weights_time (Gauge)

Time taken to load model weights in seconds

Data Parallel Cooperation Metrics

For multi-rank data parallel setups:

sglang:dp_cooperation_realtime_tokens_total (Counter)

Tokens processed with DP cooperation labels
Additional label: num_prefill_ranks

sglang:dp_cooperation_gpu_execution_seconds_total (Counter)

GPU execution time with DP cooperation labels
Additional label: num_prefill_ranks

Prefill Delayer Metrics

sglang:prefill_delayer_wait_forward_passes (Histogram)

Number of forward passes waited by prefill delayer

sglang:prefill_delayer_wait_seconds (Histogram)

Wait time in seconds by prefill delayer

sglang:prefill_delayer_outcomes_total (Counter)

Prefill delayer outcome counts
Labels: input_estimation, output_allow, output_reason, actual_execution

MoE Expert Parallel Metrics

For MoE models with expert parallelism:

sglang:eplb_balancedness (Summary)

Load balancing across MoE experts
Labels: forward_mode

Label Descriptions

Common labels across metrics:

model_name: Name of the served model
engine_type: Type of engine (unified, prefill, or decode)
tp_rank: Tensor parallel rank (0 to tp_size-1)
pp_rank: Pipeline parallel rank (0 to pp_size-1)
dp_rank: Data parallel rank (if applicable)
moe_ep_rank: MoE expert parallel rank

Custom labels can be added via --extra-metric-labels.

Querying Metrics

Example PromQL Queries

Average TTFT over last 5 minutes:

rate(sglang:time_to_first_token_seconds_sum[5m]) / 
rate(sglang:time_to_first_token_seconds_count[5m])
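
The ratio works because both series are cumulative: dividing the growth in _sum by the growth in _count over the same window gives the mean for requests observed in that window, and the per-second scaling applied by rate() cancels out. A minimal sketch of that arithmetic on two scrapes follows; the function name is hypothetical, and it ignores counter resets and the extrapolation that Prometheus applies at window edges.

```python
from typing import Optional


def windowed_mean(sum_start: float, count_start: float,
                  sum_end: float, count_end: float) -> Optional[float]:
    """Mean observation between two scrapes of a histogram's _sum and _count."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No new observations in the window (or a counter reset): no mean to report.
        return None
    return (sum_end - sum_start) / delta_count
```

For instance, if _sum grows from 100.0 to 160.0 while _count grows from 50 to 80, the 30 new requests averaged 2.0 seconds each.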

Token throughput (tokens/sec):

rate(sglang:generation_tokens_total[1m])

Cache hit rate:

sglang:cache_hit_rate

Current queue depth:

sglang:num_queue_reqs

P99 E2E latency:

histogram_quantile(0.99, 
  rate(sglang:e2e_request_latency_seconds_bucket[5m])
)
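
Quantiles from a histogram are estimates, since only per-bucket counts are stored. The sketch below is a simplified illustration of the linear interpolation that histogram_quantile performs on classic cumulative buckets. It is not Prometheus code: it takes already-computed bucket counts (or rates) as input, assumes the lowest bucket starts at 0, and the function name merely echoes the PromQL function.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
             ending with the +Inf bucket (math.inf).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all

    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # The quantile lies beyond the largest finite bound, so that bound is reported.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = count - prev_count
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / in_bucket
```

One practical consequence: the estimate can never be more precise than the bucket boundaries, and a P99 that falls into the +Inf bucket is reported as the largest finite bound.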

Customizing Histogram Buckets

You can customize histogram buckets using environment variables:

export SGLANG_PROMPT_TOKENS_BUCKETS="100,500,1000,5000,10000"
export SGLANG_GENERATION_TOKENS_BUCKETS="10,50,100,500,1000"

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-metrics

Next Steps

Set up monitoring dashboards
Enable request tracing for detailed insights
Run performance benchmarks
