SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list.
SGLang uses two prefixes for environment variables: SGL_ and SGLANG_. This is due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.
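To see which of these variables are set in a given shell, a small helper can filter the process environment by the two prefixes. This is a minimal sketch; the function name `sglang_env_vars` is hypothetical and is not part of SGLang.

```python
import os

# The two prefixes described above.
PREFIXES = ("SGLANG_", "SGL_")


def sglang_env_vars(environ=None):
    """Return the variables whose names start with SGLANG_ or SGL_."""
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(PREFIXES)
    }


if __name__ == "__main__":
    # Print every matching variable from the current process environment.
    for name, value in sglang_env_vars().items():
        print(f"{name}={value}")
```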

General Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_MODELSCOPE` | Enable using models from ModelScope | false |
| `SGLANG_HOST_IP` | Host IP address for the server | 0.0.0.0 |
| `SGLANG_PORT` | Port for the server | auto-detected |
| `SGLANG_LOGGING_CONFIG_PATH` | Custom logging configuration path | Not set |
| `SGLANG_DISABLE_REQUEST_LOGGING` | Disable request logging | false |
| `SGLANG_LOG_REQUEST_HEADERS` | Comma-separated list of additional HTTP headers to log when `--log-requests` is enabled. Appends to the default `x-smg-routing-key`. | Not set |
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | 20 |
| `SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL` | Interval, in passes, at which to collect the selected-count metric of physical experts on each layer and GPU rank. 0 means disabled. | 0 |
| `SGLANG_FORWARD_UNKNOWN_TOOLS` | Forward unknown tool calls to clients instead of dropping them | false |
| `SGLANG_REQ_WAITING_TIMEOUT` | Timeout (in seconds) for requests waiting in the queue before being scheduled | -1 (disabled) |
| `SGLANG_REQ_RUNNING_TIMEOUT` | Timeout (in seconds) for requests running in the decode batch | -1 (disabled) |
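One way to apply the general settings is to pass them in the environment of the server process. The sketch below assumes the server is started with `python -m sglang.launch_server --model-path ...`; the helper names and the example values are hypothetical, and the model path is left for the caller to supply.

```python
import os
import subprocess
import sys


def build_server_env(overrides, base=None):
    """Merge overrides into a copy of the base environment, as strings."""
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


def launch_server(model_path, overrides):
    """Start the server as a child process with the overrides applied."""
    cmd = [sys.executable, "-m", "sglang.launch_server", "--model-path", model_path]
    return subprocess.Popen(cmd, env=build_server_env(overrides))


# Illustrative values only; tune them for your deployment.
EXAMPLE_OVERRIDES = {
    "SGLANG_HEALTH_CHECK_TIMEOUT": 60,   # seconds
    "SGLANG_REQ_WAITING_TIMEOUT": 120,   # seconds; -1 disables the timeout
}
```

Environment values are always strings, so the helper converts numbers before handing them to the child process.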

Performance Tuning

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_TORCH_INFERENCE_MODE` | Control whether to use `torch.inference_mode` | false |
| `SGLANG_ENABLE_TORCH_COMPILE` | Enable `torch.compile` | true |
| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to 1 in Docker builds) | 0 |
| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allows the scheduler to overwrite longer context length requests | 0 |
| `SGLANG_IS_FLASHINFER_AVAILABLE` | Control FlashInfer availability check | true |
| `SGLANG_SKIP_P2P_CHECK` | Skip P2P (peer-to-peer) access check | false |
| `SGLANG_CHUNKED_PREFIX_CACHE_THRESHOLD` | Sets the threshold for enabling chunked prefix caching | 8192 |
| `SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION` | Enable RoPE fusion in fused Multi-head Latent Attention (MLA) | 1 |
| `SGLANG_DISABLE_CONSECUTIVE_PREFILL_OVERLAP` | Disable overlap schedule for consecutive prefill batches | false |
| `SGLANG_SCHEDULER_MAX_RECV_PER_POLL` | Maximum number of requests received per poll. A negative value means no limit. | -1 |
| `SGLANG_DISABLE_FA4_WARMUP` | Disable Flash Attention 4 warmup passes (set to 1, true, yes, or on to disable) | false |
| `SGLANG_DATA_PARALLEL_BUDGET_INTERVAL` | Interval for DPBudget updates | 1 |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DEFAULT` | Default weight value for scheduler recv skipper counter | 1000 |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_DECODE` | Weight increment for decode forward mode in scheduler recv skipper | 1 |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_VERIFY` | Weight increment for target verify forward mode in scheduler recv skipper | 1 |
| `SGLANG_SCHEDULER_RECV_SKIPPER_WEIGHT_NONE` | Weight increment when forward mode is None in scheduler recv skipper | 1 |
| `SGLANG_MM_BUFFER_SIZE_MB` | Size of preallocated GPU buffer (in MB) for multi-modal feature hashing optimization. Set to 0 to disable. | 0 |
| `SGLANG_MM_PRECOMPUTE_HASH` | Enable precomputing of hash values for MultimodalDataItem | false |
| `SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH` | Enable NCCL for gathering when preparing the MLP sync batch under the overlap scheduler | false |
| `SGLANG_SYMM_MEM_PREALLOC_GB_SIZE` | Size of preallocated GPU buffer (in GB) for NCCL symmetric memory pool. Only effective when server arg `--enable-symm-mem` is set. | 4 |
| `SGLANG_CUSTOM_ALLREDUCE_ALGO` | The algorithm of custom all-reduce. Set to oneshot/1stage or twoshot/2stage to force use. | |
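Several boolean flags in this document accept more than one spelling; `SGLANG_DISABLE_FA4_WARMUP`, for example, is documented as accepting 1, true, yes, or on. The sketch below shows one way such a flag could be interpreted. It is an illustration only: the helper `env_flag` is hypothetical, and the case-insensitive matching is an assumption, so SGLang's own parsing may differ in details.

```python
import os

# Spellings documented above as enabling a flag.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean flag."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # Variable not set: fall back to the documented default.
        return default
    # Assumption: matching ignores case and surrounding whitespace.
    return raw.strip().lower() in TRUTHY
```

For example, `env_flag("SGLANG_DISABLE_FA4_WARMUP")` returns `False` when the variable is unset, matching the default in the table.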

DeepGEMM Configuration

DeepGEMM is an advanced optimization for NVIDIA Hopper (SM90) and Blackwell (SM100) GPUs. It’s automatically enabled when the package is installed.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_JIT_DEEPGEMM` | Enable Just-In-Time compilation of DeepGEMM kernels (set to "0" to disable) | "true" |
| `SGLANG_JIT_DEEPGEMM_PRECOMPILE` | Enable precompilation of DeepGEMM kernels | "true" |
| `SGLANG_JIT_DEEPGEMM_COMPILE_WORKERS` | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| `SGLANG_IN_DEEPGEMM_PRECOMPILE_STAGE` | Indicator flag used during the DeepGEMM precompile script | "false" |
| `SGLANG_DG_CACHE_DIR` | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| `SGLANG_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | "0" |
| `SGLANG_USE_DEEPGEMM_BMM` | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | "false" |
| `SGLANG_JIT_DEEPGEMM_FAST_WARMUP` | Precompile fewer kernels during warmup. Reduces warmup time from 30 min to under 3 min but might cause performance degradation. | "false" |
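When checking whether compiled kernels are being reused between runs, it can help to resolve the cache directory the same way the table describes it: the value of `SGLANG_DG_CACHE_DIR` if set, otherwise `~/.cache/deep_gemm`. The following is a minimal sketch; both helper names are hypothetical, and expanding `~` to the home directory is an assumption about how the default is interpreted.

```python
import os
from pathlib import Path

# Default taken from the table above.
DEFAULT_DG_CACHE_DIR = "~/.cache/deep_gemm"


def deepgemm_cache_dir(environ=None):
    """Return the DeepGEMM cache directory implied by the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SGLANG_DG_CACHE_DIR", DEFAULT_DG_CACHE_DIR)
    # Assumption: a leading "~" refers to the current user's home directory.
    return Path(raw).expanduser()


def count_cache_entries(cache_dir):
    """Count top-level entries in the cache directory (0 if it is missing)."""
    if not cache_dir.is_dir():
        return 0
    return sum(1 for _ in cache_dir.iterdir())
```

An empty or missing directory before the first launch suggests that kernels will be compiled during warmup.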

DeepEP Configuration

DeepEP is optimized for DeepSeek models with expert parallelism.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEEPEP_BF16_DISPATCH` | Use Bfloat16 for dispatch | "false" |
| `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each GPU | "128" |
| `SGLANG_FLASHINFER_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each GPU for `--moe-a2a-backend=flashinfer` | "1024" |
| `SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS` | Number of SMs used for DeepEP combine when single batch overlap is enabled | "32" |
| `SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO` | Run shared experts on an alternate stream when single batch overlap is enabled on GB200 | "false" |

MORI Configuration

MORI is an advanced MoE optimization framework for multi-node deployments.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_MORI_FP8_DISP` | Use FP8 for dispatch | "false" |
| `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | 4096 |
| `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` | Threshold for switching between InterNodeV1 and InterNodeV1LL kernel types | 256 |
| `SGLANG_MORI_QP_PER_TRANSFER` | Number of RDMA Queue Pairs (QPs) used per transfer operation | 1 |
| `SGLANG_MORI_POST_BATCH_SIZE` | Number of RDMA work requests posted in a single batch to each QP | -1 |
| `SGLANG_MORI_NUM_WORKERS` | Number of worker threads in the RDMA executor thread pool | 1 |

NSA Backend Configuration

The NSA backend is optimized for DeepSeek V3.2 and later models.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_NSA_FUSE_TOPK` | Fuse the operation of picking topk logits and picking topk indices from page table | true |
| `SGLANG_NSA_ENABLE_MTP_PRECOMPUTE_METADATA` | Precompute metadata that can be shared among different draft steps when MTP is enabled | true |

Memory Management

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEBUG_MEMORY_POOL` | Enable memory pool debugging | false |
| `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` | Clip max new tokens estimation for memory planning | 4096 |
| `SGLANG_DETOKENIZER_MAX_STATES` | Maximum states for detokenizer | System-dependent |
| `SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK` | Enable checks for memory imbalance across Tensor Parallel ranks | true |
| `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` | Configure the custom memory pool type for Mooncake. Supports NVLINK, BAREX, INTRA_NODE_NVLINK. If set to true, defaults to NVLINK. | None |

Model-Specific Options

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_AITER` | Use the AITER optimized implementation | false |
| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is 1) | 0 |
| `SGLANG_CUTLASS_MOE` | Use Cutlass FP8 MoE kernel on Blackwell GPUs (deprecated, use `--moe-runner-backend=cutlass`) | false |

Quantization

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_INT4_WEIGHT` | Enable INT4 weight quantization | false |
| `SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2` | Apply the per-token-group quantization kernel with fused silu and mul and masked m | false |
| `SGLANG_FORCE_FP8_MARLIN` | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
| `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` | DEPRECATED: Use `--fp4-gemm-backend` instead | |
| `SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN` | Quantize q_b_proj from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| `SGLANG_MOE_NVFP4_DISPATCH` | Use NVFP4 for MoE dispatch (on the flashinfer_cutlass or flashinfer_cutedsl MoE runner backend) | "false" |
| `SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE` | Quantize the MoE of the nextn layer from BF16 to FP8 when launching a DeepSeek NVFP4 checkpoint | false |
| `SGLANG_ENABLE_FLASHINFER_FP8_GEMM` | DEPRECATED: Use `--fp8-gemm-backend=flashinfer_trtllm` (SM100/SM103) or `--fp8-gemm-backend=flashinfer_cutlass` (SM120+) instead | false |
| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` | DEPRECATED: Use `--fp8-gemm-backend=cutlass` instead | false |

Distributed Computing

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_BLOCK_NONZERO_RANK_CHILDREN` | Control blocking of non-zero rank children processes | 1 |
| `SGLANG_IS_FIRST_RANK_ON_NODE` | Indicates if the current process is the first rank on its node | "true" |
| `SGLANG_PP_LAYER_PARTITION` | Pipeline parallel layer partition specification | Not set |
| `SGLANG_ONE_VISIBLE_DEVICE_PER_PROCESS` | Set one visible device per process for distributed computing | false |

Testing & Debugging

These variables are primarily used for internal testing, continuous integration, or debugging. Do not use in production unless you understand the implications.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_IS_IN_CI` | Indicates if running in CI environment | false |
| `SGLANG_IS_IN_CI_AMD` | Indicates running in AMD CI environment | 0 |
| `SGLANG_TEST_RETRACT` | Enable retract decode testing | false |
| `SGLANG_TEST_RETRACT_NO_PREFILL_BS` | When `SGLANG_TEST_RETRACT` is enabled, no prefill is performed if the batch size exceeds this value | `2 ** 31` |
| `SGLANG_RECORD_STEP_TIME` | Record step time for profiling | false |
| `SGLANG_TEST_REQUEST_TIME_STATS` | Test request time statistics | false |
Profiling & Benchmarking

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_TORCH_PROFILER_DIR` | Directory for PyTorch profiler output | /tmp |
| `SGLANG_PROFILE_WITH_STACK` | Set `with_stack` option for PyTorch profiler (capture stack trace) | true |
| `SGLANG_PROFILE_RECORD_SHAPES` | Set `record_shapes` option for PyTorch profiler | true |
| `SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS` | Configure `BatchSpanProcessor.schedule_delay_millis` if tracing is enabled | 500 |
| `SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE` | Configure `BatchSpanProcessor.max_export_batch_size` if tracing is enabled | 64 |
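The two OTLP variables map onto keyword arguments of OpenTelemetry's `BatchSpanProcessor`. The sketch below shows how such a mapping could look, using the defaults from the table. The function name is hypothetical and this is not SGLang's actual wiring; it only builds the keyword arguments and does not import OpenTelemetry.

```python
import os


def batch_span_processor_kwargs(environ=None):
    """Build BatchSpanProcessor keyword arguments from the environment."""
    environ = os.environ if environ is None else environ
    return {
        # Delay between two consecutive exports, in milliseconds.
        "schedule_delay_millis": int(
            environ.get("SGLANG_OTLP_EXPORTER_SCHEDULE_DELAY_MILLIS", "500")
        ),
        # Maximum number of spans sent in one export batch.
        "max_export_batch_size": int(
            environ.get("SGLANG_OTLP_EXPORTER_MAX_EXPORT_BATCH_SIZE", "64")
        ),
    }
```

With the OpenTelemetry SDK installed, the result could be passed as `BatchSpanProcessor(exporter, **batch_span_processor_kwargs())`, where `exporter` is a span exporter you construct yourself.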

Storage & Caching

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_WAIT_WEIGHTS_READY_TIMEOUT` | Timeout period for waiting on weights | 120 |
| `SGLANG_DISABLE_OUTLINES_DISK_CACHE` | Disable Outlines disk cache | true |
| `SGLANG_USE_CUSTOM_TRITON_KERNEL_CACHE` | Use SGLang’s custom Triton kernel cache implementation for lower overheads (automatically enabled on CUDA) | false |

Function Calling / Tool Use

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_TOOL_STRICT_LEVEL` | Controls the strictness level of tool call parsing and validation. Level 0: Off, no strict validation. Level 1: Function strict, enables structural tag constraints. Level 2: Parameter strict, enforces strict parameter validation. | 0 |
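The three strictness levels can be modeled as an integer enumeration. This is a minimal sketch for reading the setting in your own tooling; the `ToolStrictLevel` class and `tool_strict_level` helper are hypothetical names and are not SGLang's API.

```python
import os
from enum import IntEnum


class ToolStrictLevel(IntEnum):
    """Strictness levels as documented for SGLANG_TOOL_STRICT_LEVEL."""

    OFF = 0        # no strict validation
    FUNCTION = 1   # structural tag constraints
    PARAMETER = 2  # strict parameter validation


def tool_strict_level(environ=None):
    """Read the level from the environment, defaulting to 0 (off)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SGLANG_TOOL_STRICT_LEVEL", "0")
    # Raises ValueError for non-integer input or an undocumented level.
    return ToolStrictLevel(int(raw))
```

Rejecting undocumented values with an error is a design choice of this sketch; it surfaces typos early instead of silently falling back to the default.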

See Also