CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:

During Prefill
If OOM occurs during prefill, try reducing `--chunked-prefill-size` to 4096 or 2048. This saves memory but slows down prefill for long prompts.
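For instance, a launch command with a reduced chunk size might look like the following sketch (the model path is a placeholder, and `sglang.launch_server` is assumed as the entry point):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 2048
```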
During Decoding
If OOM occurs during decoding, try lowering `--max-running-requests`.
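For example (the value 32 is illustrative; tune it for your workload and GPU memory):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 32
```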
Memory Pool Size
You can also decrease `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
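A sketch of the corresponding launch command (model path is a placeholder):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.7
```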
Input Logprobs for Long Prompts
Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts:
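As a minimal sketch, a request that only returns input logprobs for the tail of a long prompt might look like this (field names follow the pattern the text describes; the prompt and the offset 2000 are purely illustrative):

```python
# Stand-in for a very long prompt.
long_prompt = "word " * 3000

payload = {
    "text": long_prompt,
    "return_logprob": True,
    # Only compute input logprobs from this token index onward,
    # instead of for the entire prompt:
    "logprob_start_len": 2000,
    "sampling_params": {"max_new_tokens": 8, "temperature": 0.0},
}
# The payload would then be sent to the server, e.g.:
# requests.post("http://localhost:30000/generate", json=payload)
```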
CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:

Kernel Error
If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub with:
- Your model configuration
- Full error traceback
- SGLang version (`pip show sglang`)
- GPU type and CUDA version
- Minimal reproduction steps
Out of Memory
If it is an out-of-memory issue, it may be reported as this error instead of an explicit "Out of Memory" message. Refer to the CUDA Out of Memory section above for guidance on avoiding OOM issues.

The Server Hangs
If the server hangs during initialization or while running, it may be due to:

Memory Issues
If it is out of memory, you might see that `avail mem` is very low during or right after initialization. In this case, you can try to:
- Decrease `--mem-fraction-static`
- Decrease `--cuda-graph-max-bs`
- Decrease `--chunked-prefill-size`
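A sketch combining all three reductions in one launch command (the model path and values are illustrative starting points, not recommendations):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.7 \
  --cuda-graph-max-bs 16 \
  --chunked-prefill-size 2048
```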
Network Issues
For multi-GPU or multi-node setups, the server might hang due to NCCL errors. Check:
- Network connectivity between nodes
- NCCL environment variables (e.g., `NCCL_DEBUG=INFO` for detailed logs)
- Firewall settings
- InfiniBand/RDMA configuration if applicable
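To surface NCCL errors, you can prepend the debug variable to the launch command, for example (model path and `--tp 2` are placeholders):

```bash
NCCL_DEBUG=INFO python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2
```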
Other Bugs
For other bugs, please file an issue on GitHub with detailed information.

Model Loading Issues
Model Not Found
If you see errors like "Model not found" or "Repository not found":
- Check model path: Ensure the model path is correct
- Hugging Face authentication: For private models, set your token
- Use ModelScope (in China)
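As a sketch, assuming the standard `HF_TOKEN` environment variable for Hugging Face authentication and SGLang's `SGLANG_USE_MODELSCOPE` switch for ModelScope downloads:

```bash
# Private Hugging Face models: export your access token first
export HF_TOKEN=<your_token>

# Or, to download from ModelScope instead of Hugging Face:
export SGLANG_USE_MODELSCOPE=true
```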
Unsupported Model Architecture
If your model architecture is not supported:
- Check the supported models documentation
- Try using `--trust-remote-code` if the model uses custom code
- Consider using the Transformers fallback
- See Adding Support for New Models
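A hypothetical launch with remote code enabled (the model path is a placeholder for a repository that ships custom modeling code):

```bash
python -m sglang.launch_server \
  --model-path <org/model-with-custom-code> \
  --trust-remote-code
```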
Performance Issues
Low Throughput
- Check GPU utilization: Use `nvidia-smi dmon` to monitor GPU usage
- Increase batch size: Try `--max-running-requests 256` or higher
- Enable CUDA graphs: CUDA graphs are on by default; make sure `--disable-cuda-graph` is not set
- Tune chunked prefill: Adjust `--chunked-prefill-size`
- Use quantization: Consider FP8 or INT4 quantization
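A throughput-oriented launch sketch combining the flags above (the model path and values are illustrative; tune them for your hardware):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 256 \
  --chunked-prefill-size 8192
```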
High Latency
- Reduce batch size: Lower `--max-running-requests`
- Decrease chunked prefill size: Set `--chunked-prefill-size 2048`
- Disable prefix caching: Add `--disable-radix-cache` (if determinism is needed)
- Check for resource contention: Ensure no other processes are competing for the GPU
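A latency-oriented launch sketch with the flags above (model path and values are placeholders):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 16 \
  --chunked-prefill-size 2048 \
  --disable-radix-cache
```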
API and Integration Issues
Connection Refused
If you can't connect to the server:
- Check the server is running: Look for "SGLang started successfully" in the logs
- Verify host and port
- Check firewall: Ensure the port is not blocked
- Network configuration: For remote connections, verify `--host 0.0.0.0`
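A quick connectivity check, assuming the default port 30000 and the server's health and OpenAI-compatible endpoints:

```bash
# Confirm the port is open and the server responds
curl http://localhost:30000/health

# List the served model via the OpenAI-compatible API
curl http://localhost:30000/v1/models
```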
Request Timeout
If requests are timing out:
- Increase client timeout: Set a longer timeout in your client
- Check the request queue: The server might be overloaded; consider increasing `--max-running-requests`
- Monitor server logs: Look for errors or warnings
- Set request timeouts on individual requests
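A client-side timeout sketch using only the standard library (the URL, payload, and timeout value are illustrative; adjust them to your deployment):

```python
import json
import urllib.request

payload = {
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 32},
}
# Long generations need a generous read timeout (seconds).
TIMEOUT_S = 300

def generate(base_url: str = "http://localhost:30000"):
    """Send one generate request with an explicit timeout."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=TIMEOUT_S) as resp:
        return json.loads(resp.read())
```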
Incorrect Output Format
If the output format is not as expected:
- Check chat template: Some models require specific templates
- Verify model compatibility: Ensure the model supports the requested features
- Review sampling parameters: Check temperature, top_p, max_tokens, etc.
- Use a custom chat template: See Custom Chat Template
Multi-GPU and Distributed Issues
Tensor Parallelism Errors
If you encounter errors with tensor parallelism:
- Check GPU count: Ensure `--tp` matches the number of available GPUs
- Verify NCCL: Test with `NCCL_DEBUG=INFO`
- P2P access: Some GPUs require enabling peer-to-peer access
- Memory imbalance: Check `SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK`
Multi-Node Communication
For multi-node setups:
- Network configuration: Ensure all nodes can communicate
- NCCL settings: You may need to configure `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE`, etc.
- Shared filesystem: Some features require shared storage
- Synchronization: Check clock synchronization across nodes
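A two-node launch sketch, assuming the `--dist-init-addr`, `--nnodes`, and `--node-rank` flags; the model path, address, and TP degree are placeholders for your cluster:

```bash
# On node 0 (rank 0, reachable at 10.0.0.1):
python -m sglang.launch_server --model-path <model> \
  --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0

# On node 1:
python -m sglang.launch_server --model-path <model> \
  --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1
```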
Quantization Issues
Quantized Model Loading Fails
- Check quantization format: Ensure it’s supported (AWQ, GPTQ, FP8, etc.)
- Verify model files: Check that quantized weights are present
- GPU compatibility: FP8 requires Hopper or later GPUs
- Install dependencies: Some quantization methods need additional packages
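As a sketch, assuming a `--quantization` flag selecting the format (the model path is a placeholder for pre-quantized FP8 weights):

```bash
python -m sglang.launch_server \
  --model-path <fp8-quantized-model> \
  --quantization fp8
```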
Accuracy Degradation
If quantized models produce poor results:
- Try different quantization methods: accuracy is typically FP8 > INT8 > INT4
- Check calibration: Some models need proper calibration data
- Per-channel vs. per-tensor: Per-channel quantization usually gives better results
- Compare with FP16: Verify the model works correctly without quantization
Vision and Multimodal Issues
Image Loading Errors
If images aren't loading properly:
- Check image format: Ensure it's a supported format (PNG, JPEG, etc.)
- Verify the image URL: Test the URL in a browser
- Image size: Very large images may cause OOM
- Base64 encoding: For base64 images, check that the encoding is correct
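A minimal check that image bytes round-trip through base64 and form a well-shaped data URL (the payload here is a tiny stand-in, not a real image):

```python
import base64

# PNG magic bytes as a stand-in for real image data.
raw_bytes = b"\x89PNG\r\n\x1a\n"

encoded = base64.b64encode(raw_bytes).decode("ascii")
data_url = f"data:image/png;base64,{encoded}"

# Round-trip: decoding must reproduce the original bytes exactly.
assert base64.b64decode(encoded) == raw_bytes
```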
Multimodal Model Errors
- Check model support: Verify the model supports multimodal inputs
- Message format: Follow the correct message structure with `image_url`
- Memory requirements: Multimodal models need more memory
- Environment variables: Check `SGLANG_MM_BUFFER_SIZE_MB` and related settings
Debugging Tips
Enable Verbose Logging
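As a sketch, assuming a `--log-level` server flag (the model path is a placeholder):

```bash
python -m sglang.launch_server \
  --model-path <model> \
  --log-level debug
```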
Check System Information
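SGLang ships an environment checker module that prints versions of key dependencies, GPU topology, and CUDA details:

```bash
python -m sglang.check_env
```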
Monitor GPU Memory
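Two common ways to watch GPU memory from the command line:

```bash
# Refresh full GPU status (including memory usage) once per second
watch -n 1 nvidia-smi

# Or stream per-device memory and utilization counters
nvidia-smi dmon -s mu
```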
Profile Performance
See Benchmark and Profiling for detailed profiling instructions.

Getting Help
If you're still experiencing issues:
- Search existing issues: Check GitHub Issues
- Join Slack: Ask the community at https://slack.sglang.io/
- File a bug report: Include:
  - SGLang version
  - GPU type and driver version
  - Full error traceback
  - Minimal reproduction code
  - Model and configuration used
See Also
- FAQ - Frequently asked questions
- Environment Variables - Configuration reference
- Hyperparameter Tuning - Performance optimization
- Server Arguments - Command-line options
