Environment Variables

SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list.

For historical reasons, SGLang uses two prefixes for environment variables: `SGL_` and `SGLANG_`. Both are currently supported, each for different settings; future versions might consolidate them.
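
Because settings are split across two prefixes, it can help to see which ones are actually set in a given shell before launching. The following is a minimal sketch using only the Python standard library; the helper name `list_sglang_env` is hypothetical and not part of SGLang.

```python
import os

# The two prefixes described above.
SGLANG_PREFIXES = ("SGL_", "SGLANG_")


def list_sglang_env(environ=None):
    """Return the variables whose names start with either prefix.

    `environ` defaults to the current process environment; passing a plain
    dict makes the helper easy to try out without touching os.environ.
    """
    environ = os.environ if environ is None else environ
    return {
        name: environ[name]
        for name in sorted(environ)
        if name.startswith(SGLANG_PREFIXES)
    }


if __name__ == "__main__":
    for name, value in list_sglang_env().items():
        print(f"{name}={value}")
```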

General Configuration

Environment Variable	Description	Default Value
`SGLANG_USE_MODELSCOPE`	Enable using models from ModelScope	`false`
`SGLANG_HOST_IP`	Host IP address for the server	`0.0.0.0`
`SGLANG_PORT`	Port for the server	auto-detected
`SGLANG_LOGGING_CONFIG_PATH`	Custom logging configuration path	Not set
`SGLANG_DISABLE_REQUEST_LOGGING`	Disable request logging	`false`
`SGLANG_LOG_REQUEST_HEADERS`	Comma-separated list of additional HTTP headers to log when `--log-requests` is enabled. Appends to the default `x-smg-routing-key`.	Not set
`SGLANG_HEALTH_CHECK_TIMEOUT`	Timeout for health check in seconds	`20`
`SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL`	Interval (in passes) at which to collect the selected-count metric for physical experts on each layer and GPU rank. `0` disables collection.	`0`
`SGLANG_FORWARD_UNKNOWN_TOOLS`	Forward unknown tool calls to clients instead of dropping them	`false`
`SGLANG_REQ_WAITING_TIMEOUT`	Timeout (in seconds) for requests waiting in the queue before being scheduled	`-1` (disabled)
`SGLANG_REQ_RUNNING_TIMEOUT`	Timeout (in seconds) for requests running in the decode batch	`-1` (disabled)
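
The timeout variables in the table above take seconds, with `-1` meaning disabled. As a rough illustration, the sketch below builds an environment mapping that could be handed to a child process. The function `build_server_env` and its parameter names are hypothetical; only the variable names and defaults come from the table.

```python
import os


def build_server_env(base=None, waiting_timeout_s=-1, running_timeout_s=-1,
                     health_check_timeout_s=20):
    """Copy `base` (or the current environment) and set the timeout variables.

    Environment values must be strings, so every number is converted.
    The defaults mirror the table: -1 leaves a request timeout disabled.
    """
    env = dict(os.environ if base is None else base)
    env["SGLANG_HEALTH_CHECK_TIMEOUT"] = str(health_check_timeout_s)
    env["SGLANG_REQ_WAITING_TIMEOUT"] = str(waiting_timeout_s)
    env["SGLANG_REQ_RUNNING_TIMEOUT"] = str(running_timeout_s)
    return env


# Limit queue waiting to 60 seconds and keep the other defaults.
env = build_server_env(base={}, waiting_timeout_s=60)
print(env)
```

The resulting mapping can be passed as the `env` argument of `subprocess.Popen` together with whatever launch command you normally use.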

Performance Tuning

Environment Variable	Description	Default Value
`SGLANG_ENABLE_TORCH_INFERENCE_MODE`	Control whether to use torch.inference_mode	`false`
`SGLANG_ENABLE_TORCH_COMPILE`	Enable torch.compile	`true`
`SGLANG_SET_CPU_AFFINITY`	Enable CPU affinity setting (often set to `1` in Docker builds)	`0`
`SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN`	Allows the scheduler to overwrite longer context length requests	`0`
`SGLANG_IS_FLASHINFER_AVAILABLE`	Control FlashInfer availability check	`true`
`SGLANG_SKIP_P2P_CHECK`	Skip P2P (peer-to-peer) access check	`false`
`SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD`	Sets the threshold for enabling chunked prefix caching	`8192`
`SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION`	Enable RoPE fusion in fused Multi-head Latent Attention (MLA)	`1`
`SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP`	Disable overlap schedule for consecutive prefill batches	`false`
`SGLANG_SCHEDULER_MAX_RECV_PER_POLL`	Set the maximum number of requests per poll, with a negative value indicating no limit	`-1`
`SGLANG_DISABLE_FA4_WARMUP`	Disable Flash Attention 4 warmup passes (set to `1`, `true`, `yes`, or `on` to disable)	`false`
`SGLANG_DATA_PARALLEL_BUDGET_INTERVAL`	Interval for DPBudget updates	`1`
`SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT`	Default weight value for scheduler recv skipper counter	`1000`
`SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE`	Weight increment for decode forward mode in scheduler recv skipper	`1`
`SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY`	Weight increment for target verify forward mode in scheduler recv skipper	`1`
`SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE`	Weight increment when forward mode is None in scheduler recv skipper	`1`
`SGLANG_MM_BUFFER_SIZE_MB`	Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. Set to `0` to disable.	`0`
`SGLANG_MM_PRECOMPUTE_HASH`	Enable precomputing of hash values for MultimodalDataItem	`false`
`SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH`	Enable NCCL for gathering when preparing mlp sync batch under overlap scheduler	`false`
`SGLANG_SYMM_MEM_PREALLOC_GB_SIZE`	Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool. Only effective when server arg `--enable-symm-mem` is set.	`4`
`SGLANG_CUSTOM_ALLREDUCE_ALGO`	Algorithm used by custom all-reduce. Set to `oneshot`/`1stage` or `twoshot`/`2stage` to force a specific one.	`""`
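
Boolean defaults in these tables appear in several spellings (`false`, `0`, `"true"`), and the `SGLANG_DISABLE_FA4_WARMUP` row lists `1`, `true`, `yes`, and `on` as accepted truthy values. The sketch below shows one way to read such a flag. The helper `env_flag` is hypothetical, and both the case-insensitive matching and the idea that other variables accept the same spellings are assumptions rather than documented behavior.

```python
import os

# Truthy spellings listed for SGLANG_DISABLE_FA4_WARMUP in the table above.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean.

    Returns `default` when the variable is unset. Any value outside TRUTHY
    (after trimming and lowercasing, which is an assumption) counts as False.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


if __name__ == "__main__":
    print(env_flag("SGLANG_DISABLE_FA4_WARMUP"))
```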

DeepGEMM Configuration

DeepGEMM is an advanced optimization for NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs. It’s automatically enabled when the package is installed.

Environment Variable	Description	Default Value
`SGLANG_ENABLE_JIT_DEEPGEMM`	Enable Just-In-Time compilation of DeepGEMM kernels (set to `"0"` to disable)	`"true"`
`SGLANG_JIT_DEEPGEMM_PRECOMPILE`	Enable precompilation of DeepGEMM kernels	`"true"`
`SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS`	Number of workers for parallel DeepGEMM kernel compilation	`4`
`SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE`	Indicator flag used during the DeepGEMM precompile script	`"false"`
`SGLANG_DG_CACHE_DIR`	Directory for caching compiled DeepGEMM kernels	`~/.cache/deep_gemm`
`SGLANG_DG_USE_NVRTC`	Use NVRTC (instead of Triton) for JIT compilation (Experimental)	`"0"`
`SGLANG_USE_DEEPGEMM_BMM`	Use DeepGEMM for Batched Matrix Multiplication (BMM) operations	`"false"`
`SGLANG_JIT_DEEPGEMM_FAST_WARMUP`	Precompile fewer kernels during warmup. Reduces warmup time from about 30 minutes to under 3 minutes, but might degrade performance.	`"false"`
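
To make the cache directory and worker-count defaults concrete, here is a small sketch that resolves both values the way a script around the precompile step might. The function `deepgemm_jit_settings` is hypothetical, and rejecting worker counts below 1 is an assumption; SGLang's own parsing may differ.

```python
import os


def deepgemm_jit_settings(environ=None):
    """Resolve the kernel cache directory and compile worker count.

    Falls back to the defaults from the table above when a variable is
    unset. `~` in the cache path is expanded to the user's home directory.
    """
    environ = os.environ if environ is None else environ
    cache_dir = os.path.expanduser(
        environ.get("SGLANG_DG_CACHE_DIR", "~/.cache/deep_gemm")
    )
    workers = int(environ.get("SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS", "4"))
    if workers < 1:
        # Assumption: a non-positive worker count is treated as a mistake.
        raise ValueError("SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS must be >= 1")
    return cache_dir, workers


if __name__ == "__main__":
    print(deepgemm_jit_settings())
```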

DeepEP Configuration

DeepEP is optimized for DeepSeek models with expert parallelism.

Environment Variable	Description	Default Value
`SGLANG_DEEPEP_BF16_DISPATCH`	Use Bfloat16 for dispatch	`"false"`
`SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`	The maximum number of dispatched tokens on each GPU	`"128"`
`SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK`	The maximum number of dispatched tokens on each GPU for `--moe-a2a-backend=flashinfer`	`"1024"`
`SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS`	Number of SMs used for DeepEP combine when single batch overlap is enabled	`"32"`
`SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO`	Run shared experts on an alternate stream when single batch overlap is enabled on GB200	`"false"`

MORI Configuration

MORI is an advanced MoE optimization framework for multi-node deployments.

Environment Variable	Description	Default Value
`SGLANG_MORI_FP8_DISP`	Use FP8 for dispatch	`"false"`
`SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK`	Maximum number of dispatch tokens per rank for MORI-EP buffer allocation	`4096`
`SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD`	Threshold for switching between `InterNodeV1` and `InterNodeV1LL` kernel types	`256`
`SGLANG_MORI_QP_PER_TRANSFER`	Number of RDMA Queue Pairs (QPs) used per transfer operation	`1`
`SGLANG_MORI_POST_BATCH_SIZE`	Number of RDMA work requests posted in a single batch to each QP	`-1`
`SGLANG_MORI_NUM_WORKERS`	Number of worker threads in the RDMA executor thread pool	`1`

NSA Backend Configuration

NSA backend is optimized for DeepSeek V3.2 and later models.

Environment Variable	Description	Default Value
`SGLANG_NSA_FUSE_TOPK`	Fuse the operation of picking topk logits and picking topk indices from page table	`true`
`SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA`	Precompute metadata that can be shared among different draft steps when MTP is enabled	`true`

Memory Management

Environment Variable	Description	Default Value
`SGLANG_DEBUG_MEMORY_POOL`	Enable memory pool debugging	`false`
`SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION`	Clip max new tokens estimation for memory planning	`4096`
`SGLANG_DETOKENIZER_MAX_STATES`	Maximum states for detokenizer	System-dependent
`SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK`	Enable checks for memory imbalance across Tensor Parallel ranks	`true`
`SGLANG_MOONCAKE_CUSTOM_MEM_POOL`	Configure the custom memory pool type for Mooncake. Supports `NVLINK`, `BAREX`, `INTRA_NODE_NVLINK`. If set to `true`, defaults to `NVLINK`.	`None`
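
The `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` row describes a small resolution rule: three named pool types, with `true` acting as shorthand for `NVLINK`. A minimal sketch of that rule follows. The function `resolve_mooncake_pool` is hypothetical, and the case-insensitive matching and the error on unknown values are assumptions, not documented behavior.

```python
# Pool types listed in the table above.
VALID_POOLS = ("NVLINK", "BAREX", "INTRA_NODE_NVLINK")


def resolve_mooncake_pool(raw):
    """Map the raw variable value to a pool type, or None when unset.

    `true` is shorthand for NVLINK, as stated in the table. Rejecting
    unknown values with ValueError is an assumption made for this sketch.
    """
    if raw is None or raw.strip() == "":
        return None
    value = raw.strip()
    if value.lower() == "true":
        return "NVLINK"
    value = value.upper()
    if value in VALID_POOLS:
        return value
    raise ValueError(f"unsupported pool type: {raw!r}")


if __name__ == "__main__":
    import os
    print(resolve_mooncake_pool(os.environ.get("SGLANG_MOONCAKE_CUSTOM_MEM_POOL")))
```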

Model-Specific Options

Environment Variable	Description	Default Value
`SGLANG_USE_AITER`	Use the AITER-optimized implementation	`false`
`SGLANG_MOE_PADDING`	Enable MoE padding (sets padding size to 128 if value is `1`)	`0`
`SGLANG_CUTLASS_MOE`	Use the Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated; use `--moe-runner-backend=cutlass` instead)	`false`

Quantization

Environment Variable	Description	Default Value
`SGLANG_INT4_WEIGHT`	Enable INT4 weight quantization	`false`
`SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2`	Use the per-token-group quantization kernel with fused SiLU-and-mul and masked m	`false`
`SGLANG_FORCE_FP8_MARLIN`	Force using FP8 MARLIN kernels even if other FP8 kernels are available	`false`
`SGLANG_FLASHINFER_FP4_GEMM_BACKEND`	DEPRECATED: Use `--fp4-gemm-backend` instead	`""`
`SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN`	Quantize q_b_proj from BF16 to FP8 when launching DeepSeek NVFP4 checkpoint	`false`
`SGLANG_MOE_NVFP4_DISPATCH`	Use NVFP4 for MoE dispatch (with the `flashinfer_cutlass` or `flashinfer_cutedsl` MoE runner backend)	`"false"`
`SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE`	Quantize the MoE of the NextN layer from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint	`false`
`SGLANG_ENABLE_FLASHINFER_FP8_GEMM`	DEPRECATED: Use `--fp8-gemm-backend=flashinfer_trtllm` (SM100/SM103) or `--fp8-gemm-backend=flashinfer_cutlass` (SM120+) instead	`false`
`SGLANG_SUPPORT_CUTLASS_BLOCK_FP8`	DEPRECATED: Use `--fp8-gemm-backend=cutlass` instead	`false`

Distributed Computing

Environment Variable	Description	Default Value
`SGLANG_BLOCK_NONZERO_RANK_CHILDREN`	Control blocking of non-zero rank children processes	`1`
`SGLANG_IS_FIRST_RANK_ON_NODE`	Indicates if the current process is the first rank on its node	`"true"`
`SGLANG_PP_LAYER_PARTITION`	Pipeline parallel layer partition specification	Not set
`SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS`	Set one visible device per process for distributed computing	`false`

Testing & Debugging

These variables are primarily used for internal testing, continuous integration, or debugging. Do not use in production unless you understand the implications.

Environment Variable	Description	Default Value
`SGLANG_IS_IN_CI`	Indicates if running in CI environment	`false`
`SGLANG_IS_IN_CI_AMD`	Indicates running in AMD CI environment	`0`
`SGLANG_TEST_RETRACT`	Enable retract decode testing	`false`
`SGLANG_TEST_RETRACT_NO_PREFILL_BS`	When `SGLANG_TEST_RETRACT` is enabled, no prefill is performed if the batch size exceeds this value	`2 ** 31`
`SGLANG_RECORD_STEP_TIME`	Record step time for profiling	`false`
`SGLANG_TEST_REQUEST_TIME_STATS`	Test request time statistics	`false`

Profiling & Benchmarking

Environment Variable	Description	Default Value
`SGLANG_TORCH_PROFILER_DIR`	Directory for PyTorch profiler output	`/tmp`
`SGLANG_PROFILE_WITH_STACK`	Set `with_stack` option for PyTorch profiler (capture stack trace)	`true`
`SGLANG_PROFILE_RECORD_SHAPES`	Set `record_shapes` option for PyTorch profiler	`true`
`SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS`	Sets `BatchSpanProcessor.schedule_delay_millis` when tracing is enabled	`500`
`SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE`	Sets `BatchSpanProcessor.max_export_batch_size` when tracing is enabled	`64`

Storage & Caching

Environment Variable	Description	Default Value
`SGLANG_WAIT_WEIGHTS_READY_TIMEOUT`	Timeout period for waiting on weights	`120`
`SGLANG_DISABLE_OUTLINES_DISK_CACHE`	Disable Outlines disk cache	`true`
`SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE`	Use SGLang’s custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA)	`false`

Function Calling / Tool Use

Environment Variable	Description	Default Value
`SGLANG_TOOL_STRICT_LEVEL`	Controls the strictness of tool call parsing and validation. Level `0` (off): no strict validation. Level `1` (function strict): enables structural tag constraints. Level `2` (parameter strict): enforces strict parameter validation.	`0`
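
The three strictness levels map naturally onto an enumeration. The sketch below is one possible way to read the variable in client-side tooling; the class `ToolStrictLevel`, its member names, and the function `tool_strict_level` are hypothetical and do not mirror SGLang's internal types.

```python
import os
from enum import IntEnum


class ToolStrictLevel(IntEnum):
    """Levels described in the table above (member names are illustrative)."""
    OFF = 0        # no strict validation
    FUNCTION = 1   # structural tag constraints
    PARAMETER = 2  # strict parameter validation


def tool_strict_level(environ=None):
    """Read SGLANG_TOOL_STRICT_LEVEL, defaulting to 0 (off) when unset.

    Values outside 0..2 raise ValueError from the enum constructor.
    """
    environ = os.environ if environ is None else environ
    return ToolStrictLevel(int(environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")))


if __name__ == "__main__":
    print(tool_strict_level().name)
```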
