SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list.
SGLang uses two prefixes for environment variables: SGL_ and SGLANG_. This is due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.
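Because two prefixes are in use, it can help to list every matching variable in the current environment before launching a server. The snippet below is a minimal sketch; the helper name `collect_sglang_env` is hypothetical and not part of SGLang.

```python
import os

# Both prefixes named above. "SGLANG_" does not start with "SGL_",
# so each must be listed explicitly.
PREFIXES = ("SGLANG_", "SGL_")


def collect_sglang_env(environ=None):
    """Return the variables whose names start with either prefix (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(PREFIXES)}


if __name__ == "__main__":
    for name, value in collect_sglang_env().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check in isolation.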
General Configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
| SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
| SGLANG_PORT | Port for the server | auto-detected |
| SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
| SGLANG_DISABLE_REQUEST_LOGGING | Disable request logging | false |
| SGLANG_LOG_REQUEST_HEADERS | Comma-separated list of additional HTTP headers to log when --log-requests is enabled. Appends to the default x-smg-routing-key. | Not set |
| SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health check in seconds | 20 |
| SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL | Interval, in passes, at which to collect the selected count of physical experts on each layer and GPU rank. 0 means disabled. | 0 |
| SGLANG_FORWARD_UNKNOWN_TOOLS | Forward unknown tool calls to clients instead of dropping them | false |
| SGLANG_REQ_WAITING_TIMEOUT | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 (disabled) |
| SGLANG_REQ_RUNNING_TIMEOUT | Timeout (in seconds) for requests running in the decode batch | -1 (disabled) |
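Environment variables are always strings, so numeric settings such as the timeouts above need converting before they are handed to a child process. The following is a minimal sketch of preparing an environment for a server subprocess. The helper `build_server_env` is hypothetical, the chosen values are only illustrative, and the assumption that `"true"` is an accepted spelling for the boolean flag is not confirmed by this document.

```python
import os


def build_server_env(overrides, base=None):
    """Copy the base environment and apply overrides as strings (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        env[name] = str(value)  # environment values must be strings
    return env


# Illustrative values only; the variable names come from the table above.
overrides = {
    "SGLANG_HEALTH_CHECK_TIMEOUT": 60,
    "SGLANG_REQ_WAITING_TIMEOUT": 300,
    "SGLANG_DISABLE_REQUEST_LOGGING": "true",  # assumed spelling for a boolean flag
}
env = build_server_env(overrides)
# `env` can then be passed as the `env=` argument of subprocess.Popen
# when starting the server process.
```

Copying the parent environment first keeps unrelated settings such as `PATH` intact in the child process.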
Performance Tuning
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
| SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | true |
| SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | 0 |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allows the scheduler to overwrite longer context length requests | 0 |
| SGLANG_IS_FLASHINFER_AVAILABLE | Control FlashInfer availability check | true |
| SGLANG_SKIP_P2P_CHECK | Skip P2P (peer-to-peer) access check | false |
| SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD | Sets the threshold for enabling chunked prefix caching | 8192 |
| SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in fused Multi-head Latent Attention (MLA) | 1 |
| SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP | Disable overlap schedule for consecutive prefill batches | false |
| SGLANG_SCHEDULER_MAX_RECV_PER_POLL | Maximum number of requests received per poll; a negative value means no limit | -1 |
| SGLANG_DISABLE_FA4_WARMUP | Disable Flash Attention 4 warmup passes (set to 1, true, yes, or on to disable) | false |
| SGLANG_DATA_PARALLEL_BUDGET_INTERVAL | Interval for DPBudget updates | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT | Default weight value for scheduler recv skipper counter | 1000 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE | Weight increment for decode forward mode in scheduler recv skipper | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY | Weight increment for target verify forward mode in scheduler recv skipper | 1 |
| SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE | Weight increment when forward mode is None in scheduler recv skipper | 1 |
| SGLANG_MM_BUFFER_SIZE_MB | Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. Set to 0 to disable. | 0 |
| SGLANG_MM_PRECOMPUTE_HASH | Enable precomputing of hash values for MultimodalDataItem | false |
| SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH | Enable NCCL for gathering when preparing the MLP sync batch under the overlap scheduler | false |
| SGLANG_SYMM_MEM_PREALLOC_GB_SIZE | Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool. Only effective when server arg --enable-symm-mem is set. | 4 |
| SGLANG_CUSTOM_ALLREDUCE_ALGO | The algorithm of custom all-reduce. Set to oneshot/1stage or twoshot/2stage to force use. | "" |
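SGLANG_DISABLE_FA4_WARMUP is documented as accepting 1, true, yes, or on. A parser for that style of flag might look like the sketch below. The helper `env_flag` is hypothetical, and the case-insensitive matching and whitespace stripping are assumptions; SGLang's own parsing may differ, and other variables may accept a different set of values.

```python
import os

# The accepted spellings listed for SGLANG_DISABLE_FA4_WARMUP.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default  # unset: fall back to the documented default
    # Assumption: compare case-insensitively and ignore surrounding whitespace.
    return raw.strip().lower() in TRUTHY


if __name__ == "__main__":
    print(env_flag("SGLANG_DISABLE_FA4_WARMUP"))
```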
DeepGEMM Configuration
DeepGEMM is an advanced optimization for NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs. It’s automatically enabled when the package is installed.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_ENABLE_JIT_DEEPGEMM | Enable Just-In-Time compilation of DeepGEMM kernels (set to "0" to disable) | "true" |
| SGLANG_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
| SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
| SGLANG_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| SGLANG_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | "0" |
| SGLANG_USE_DEEPGEMM_BMM | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | "false" |
| SGLANG_JIT_DEEPGEMM_FAST_WARMUP | Precompile fewer kernels during warmup. Reduces warmup time from 30 min to under 3 min but might cause performance degradation. | "false" |
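The default cache location contains a `~`, which has to be expanded before the path can be used. The sketch below shows one way to resolve the effective cache directory from the environment. The helper `deepgemm_cache_dir` is hypothetical and only mirrors the documented default; it is not how SGLang itself resolves the path.

```python
import os
from pathlib import Path

# Default taken from the table above.
DEFAULT_DG_CACHE_DIR = "~/.cache/deep_gemm"


def deepgemm_cache_dir(environ=None):
    """Resolve the DeepGEMM kernel cache directory (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SGLANG_DG_CACHE_DIR", DEFAULT_DG_CACHE_DIR)
    return Path(raw).expanduser()  # turn a leading "~" into the home directory


if __name__ == "__main__":
    print(deepgemm_cache_dir())
```

Pointing SGLANG_DG_CACHE_DIR at a persistent volume is a common reason to override it, since compiled kernels can then be reused across container restarts.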
DeepEP Configuration
DeepEP is optimized for DeepSeek models with expert parallelism.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DEEPEP_BF16_DISPATCH | Use Bfloat16 for dispatch | "false" |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU | "128" |
| SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK | The maximum number of dispatched tokens on each GPU for --moe-a2a-backend=flashinfer | "1024" |
| SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS | Number of SMs used for DeepEP combine when single batch overlap is enabled | "32" |
| SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO | Run shared experts on an alternate stream when single batch overlap is enabled on GB200 | "false" |
MORI Configuration
MORI is an advanced MoE optimization framework for multi-node deployments.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_MORI_FP8_DISP | Use FP8 for dispatch | "false" |
| SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
| SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD | Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types | 256 |
| SGLANG_MORI_QP_PER_TRANSFER | Number of RDMA Queue Pairs (QPs) used per transfer operation | 1 |
| SGLANG_MORI_POST_BATCH_SIZE | Number of RDMA work requests posted in a single batch to each QP | -1 |
| SGLANG_MORI_NUM_WORKERS | Number of worker threads in the RDMA executor thread pool | 1 |
NSA Backend Configuration
NSA backend is optimized for DeepSeek V3.2 and later models.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_NSA_FUSE_TOPK | Fuse the operation of picking topk logits and picking topk indices from page table | true |
| SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA | Precompute metadata that can be shared among different draft steps when MTP is enabled | true |
Memory Management
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
| SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip max new tokens estimation for memory planning | 4096 |
| SGLANG_DETOKENIZER_MAX_STATES | Maximum states for detokenizer | System-dependent |
| SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK | Enable checks for memory imbalance across Tensor Parallel ranks | true |
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, INTRA_NODE_NVLINK. If set to true, defaults to NVLINK. | None |
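The SGLANG_MOONCAKE_CUSTOM_MEM_POOL rule (three named pool types, with true mapping to NVLINK) can be expressed as a small lookup. This is a sketch of the documented behavior only. The function name `resolve_mooncake_pool` is hypothetical, and the case-insensitive matching and the error raised for unknown values are assumptions rather than SGLang's confirmed behavior.

```python
# Pool types listed in the table above.
SUPPORTED_POOLS = ("NVLINK", "BAREX", "INTRA_NODE_NVLINK")


def resolve_mooncake_pool(raw):
    """Map the raw variable value to a pool type, or None when unset (hypothetical helper)."""
    if raw is None or raw.strip() == "":
        return None  # documented default: no custom memory pool
    value = raw.strip()
    if value.lower() == "true":
        return "NVLINK"  # documented: true defaults to NVLINK
    value = value.upper()  # assumption: matching is case-insensitive
    if value in SUPPORTED_POOLS:
        return value
    # Assumption: unknown values are rejected rather than silently ignored.
    raise ValueError(f"Unsupported Mooncake memory pool type: {raw!r}")


if __name__ == "__main__":
    import os

    print(resolve_mooncake_pool(os.environ.get("SGLANG_MOONCAKE_CUSTOM_MEM_POOL")))
```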
Model-Specific Options
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_USE_AITER | Use the AITER-optimized implementation | false |
| SGLANG_MOE_PADDING | Enable MoE padding (sets padding size to 128 if value is 1) | 0 |
| SGLANG_CUTLASS_MOE | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use --moe-runner-backend=cutlass) | false |
Quantization
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2 | Apply the per-token-group quantization kernel with fused silu and mul and masked m | false |
| SGLANG_FORCE_FP8_MARLIN | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_FLASHINFER_FP4_GEMM_BACKEND | DEPRECATED: Use --fp4-gemm-backend instead | "" |
| SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN | Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
| SGLANG_MOE_NVFP4_DISPATCH | Use NVFP4 for MoE dispatch (on flashinfer_cutlass or flashinfer_cutedsl MoE runner backend) | "false" |
| SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE | Quantize the MoE of the nextn layer from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint | false |
| SGLANG_ENABLE_FLASHINFER_FP8_GEMM | DEPRECATED: Use --fp8-gemm-backend=flashinfer_trtllm (SM100/SM103) or --fp8-gemm-backend=flashinfer_cutlass (SM120+) instead | false |
| SGLANG_SUPPORT_CUTLASS_BLOCK_FP8 | DEPRECATED: Use --fp8-gemm-backend=cutlass instead | false |
Distributed Computing
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero rank children processes | 1 |
| SGLANG_IS_FIRST_RANK_ON_NODE | Indicates if the current process is the first rank on its node | "true" |
| SGLANG_PP_LAYER_PARTITION | Pipeline parallel layer partition specification | Not set |
| SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS | Set one visible device per process for distributed computing | false |
Testing & Debugging
These variables are primarily used for internal testing, continuous integration, or debugging. Do not use in production unless you understand the implications.
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_IS_IN_CI | Indicates if running in CI environment | false |
| SGLANG_IS_IN_CI_AMD | Indicates running in AMD CI environment | 0 |
| SGLANG_TEST_RETRACT | Enable retract decode testing | false |
| SGLANG_TEST_RETRACT_NO_PREFILL_BS | When SGLANG_TEST_RETRACT is enabled, no prefill is performed if the batch size exceeds this value | 2 ** 31 |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |
Profiling & Benchmarking
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
| SGLANG_PROFILE_WITH_STACK | Set with_stack option for PyTorch profiler (capture stack trace) | true |
| SGLANG_PROFILE_RECORD_SHAPES | Set record_shapes option for PyTorch profiler | true |
| SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS | Sets BatchSpanProcessor.schedule_delay_millis when tracing is enabled | 500 |
| SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE | Sets BatchSpanProcessor.max_export_batch_size when tracing is enabled | 64 |
Storage & Caching
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_WAIT_WEIGHTS_READY_TIMEOUT | Timeout period for waiting on weights | 120 |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable Outlines disk cache | true |
| SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE | Use SGLang’s custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) | false |
Function Calling / Tool Use
| Environment Variable | Description | Default Value |
|---|---|---|
| SGLANG_TOOL_STRICT_LEVEL | Controls the strictness level of tool call parsing and validation. Level 0: Off (no strict validation). Level 1: Function strict (enables structural tag constraints). Level 2: Parameter strict (enforces strict parameter validation). | 0 |
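The three strictness levels map naturally onto an integer enumeration. The sketch below reads the variable and rejects values outside 0 to 2. The names `ToolStrictLevel` and `read_tool_strict_level` are hypothetical and chosen for illustration; they are not SGLang's own types.

```python
import os
from enum import IntEnum


class ToolStrictLevel(IntEnum):
    """Levels as described in the table above (hypothetical enum)."""

    OFF = 0        # no strict validation
    FUNCTION = 1   # structural tag constraints
    PARAMETER = 2  # strict parameter validation


def read_tool_strict_level(environ=None):
    """Parse SGLANG_TOOL_STRICT_LEVEL, defaulting to 0 when unset (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")
    # int() raises ValueError for non-numeric input, and the enum
    # constructor raises ValueError for numbers outside 0..2.
    return ToolStrictLevel(int(raw))


if __name__ == "__main__":
    print(read_tool_strict_level().name)
```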
See Also