CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:

During Prefill
If OOM occurs during prefill, try reducing `--chunked-prefill-size` to 4096 or 2048. This saves memory but slows down prefill for long prompts.
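For instance, a launch command with a reduced chunk size might look like the following sketch (the model path is a placeholder, and `sglang.launch_server` is assumed as the entry point):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 2048
```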
During Decoding
If OOM occurs during decoding, try lowering `--max-running-requests`.
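For example (the value 32 is illustrative; tune it for your workload and GPU memory):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 32
```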
Memory Pool Size
You can also decrease `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
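A sketch of the corresponding launch command (model path is a placeholder):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.7
```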
Input Logprobs for Long Prompts
Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts:
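As a minimal sketch, a request that only returns input logprobs for the tail of a long prompt might look like this (field names follow the pattern the text describes; the prompt and the offset 2000 are purely illustrative):

```python
# Stand-in for a very long prompt.
long_prompt = "word " * 3000

payload = {
    "text": long_prompt,
    "return_logprob": True,
    # Only compute input logprobs from this token index onward,
    # instead of for the entire prompt:
    "logprob_start_len": 2000,
    "sampling_params": {"max_new_tokens": 8, "temperature": 0.0},
}
# The payload would then be sent to the server, e.g.:
# requests.post("http://localhost:30000/generate", json=payload)
```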
CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:

Kernel Error
If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub with:
- Your model configuration
- Full error traceback
- SGLang version (`pip show sglang`)
- GPU type and CUDA version
- Minimal reproduction steps
Out of Memory
If it is an out-of-memory issue, it may be reported as this error instead of an explicit "Out of Memory" message. Refer to the CUDA Out of Memory section above for guidance on avoiding OOM issues.

The Server Hangs
If the server hangs during initialization or while running, it may be due to:

Memory Issues
If it is out of memory, you might see that `avail mem` is very low during or right after initialization. In this case, you can try to:
- Decrease `--mem-fraction-static`
- Decrease `--cuda-graph-max-bs`
- Decrease `--chunked-prefill-size`
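A sketch combining all three reductions in one launch command (the model path and values are illustrative starting points, not recommendations):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.7 \
  --cuda-graph-max-bs 16 \
  --chunked-prefill-size 2048
```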
Network Issues
For multi-GPU or multi-node setups, the server might hang due to NCCL errors. Check:
- Network connectivity between nodes
- NCCL environment variables (e.g., `NCCL_DEBUG=INFO` for detailed logs)
- Firewall settings
- InfiniBand/RDMA configuration if applicable
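To surface NCCL errors, you can prepend the debug variable to the launch command, for example (model path and `--tp 2` are placeholders):

```bash
NCCL_DEBUG=INFO python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2
```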
Other Bugs
For other bugs, please file an issue on GitHub with detailed information.

Model Loading Issues
Model Not Found
If you see errors like "Model not found" or "Repository not found":
- Check model path: Ensure the model path is correct
- Hugging Face authentication: For private models, set your token
- Use ModelScope (in China)
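As a sketch, assuming the standard `HF_TOKEN` environment variable for Hugging Face authentication and SGLang's `SGLANG_USE_MODELSCOPE` switch for ModelScope downloads:

```bash
# Private Hugging Face models: export your access token first
export HF_TOKEN=<your_token>

# Or, to download from ModelScope instead of Hugging Face:
export SGLANG_USE_MODELSCOPE=true
```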
Unsupported Model Architecture
If your model architecture is not supported:
- Check the supported models documentation
- Try using `--trust-remote-code` if the model uses custom code
- Consider using the Transformers fallback
- See Adding Support for New Models
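A hypothetical launch with remote code enabled (the model path is a placeholder for a repository that ships custom modeling code):

```bash
python -m sglang.launch_server \
  --model-path <org/model-with-custom-code> \
  --trust-remote-code
```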
Performance Issues
Low Throughput
- Check GPU utilization: Use `nvidia-smi dmon` to monitor GPU usage
- Increase batch size: Try `--max-running-requests 256` or higher
- Enable CUDA graphs: CUDA graphs are on by default; make sure `--disable-cuda-graph` is not set
- Tune chunked prefill: Adjust `--chunked-prefill-size`
- Use quantization: Consider FP8 or INT4 quantization
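A throughput-oriented launch sketch combining the flags above (the model path and values are illustrative; tune them for your hardware):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 256 \
  --chunked-prefill-size 8192
```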
High Latency
- Reduce batch size: Lower `--max-running-requests`
- Decrease chunked prefill size: Set `--chunked-prefill-size 2048`
- Disable prefix caching: Add `--disable-radix-cache` (if determinism is needed)
- Check for resource contention: Ensure no other processes are competing for the GPU
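A latency-oriented launch sketch with the flags above (model path and values are placeholders):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --max-running-requests 16 \
  --chunked-prefill-size 2048 \
  --disable-radix-cache
```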
API and Integration Issues
Connection Refused
If you can't connect to the server:
- Check the server is running: Look for "SGLang started successfully" in the logs
- Verify host and port
- Check firewall: Ensure the port is not blocked
- Network configuration: For remote connections, verify `--host 0.0.0.0`
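A quick connectivity check, assuming the default port 30000 and the server's health and OpenAI-compatible endpoints:

```bash
# Confirm the port is open and the server responds
curl http://localhost:30000/health

# List the served model via the OpenAI-compatible API
curl http://localhost:30000/v1/models
```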
Request Timeout
If requests are timing out:
- Increase client timeout: Set a longer timeout in your client
- Check the request queue: The server might be overloaded; consider increasing `--max-running-requests`
- Monitor server logs: Look for errors or warnings
- Set request timeouts on individual requests
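A client-side timeout sketch using only the standard library (the URL, payload, and timeout value are illustrative; adjust them to your deployment):

```python
import json
import urllib.request

payload = {
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 32},
}
# Long generations need a generous read timeout (seconds).
TIMEOUT_S = 300

def generate(base_url: str = "http://localhost:30000"):
    """Send one generate request with an explicit timeout."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=TIMEOUT_S) as resp:
        return json.loads(resp.read())
```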
Incorrect Output Format
If the output format is not as expected:
- Check chat template: Some models require specific templates
- Verify model compatibility: Ensure the model supports the requested features
- Review sampling parameters: Check temperature, top_p, max_tokens, etc.
- Use a custom chat template: See Custom Chat Template
Multi-GPU and Distributed Issues
Tensor Parallelism Errors
If you encounter errors with tensor parallelism:
- Check GPU count: Ensure `--tp` matches the number of available GPUs
- Verify NCCL: Test with `NCCL_DEBUG=INFO`
- P2P access: Some GPUs require enabling peer-to-peer access
- Memory imbalance: Check `SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK`
Multi-Node Communication
For multi-node setups:
- Network configuration: Ensure all nodes can communicate
- NCCL settings: You may need to configure `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE`, etc.
- Shared filesystem: Some features require shared storage
- Synchronization: Check clock synchronization across nodes
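A two-node launch sketch, assuming the `--dist-init-addr`, `--nnodes`, and `--node-rank` flags; the model path, address, and TP degree are placeholders for your cluster:

```bash
# On node 0 (rank 0, reachable at 10.0.0.1):
python -m sglang.launch_server --model-path <model> \
  --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0

# On node 1:
python -m sglang.launch_server --model-path <model> \
  --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1
```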
Quantization Issues
Quantized Model Loading Fails
- Check quantization format: Ensure it’s supported (AWQ, GPTQ, FP8, etc.)
- Verify model files: Check that quantized weights are present
- GPU compatibility: FP8 requires Hopper or later GPUs
- Install dependencies: Some quantization methods need additional packages
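As a sketch, assuming a `--quantization` flag selecting the format (the model path is a placeholder for pre-quantized FP8 weights):

```bash
python -m sglang.launch_server \
  --model-path <fp8-quantized-model> \
  --quantization fp8
```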
Accuracy Degradation
If quantized models produce poor results:
- Try different quantization methods: accuracy is typically FP8 > INT8 > INT4
- Check calibration: Some models need proper calibration data
- Per-channel vs. per-tensor: Per-channel quantization usually gives better results
- Compare with FP16: Verify the model works correctly without quantization
Vision and Multimodal Issues
Image Loading Errors
If images aren't loading properly:
- Check image format: Ensure it's a supported format (PNG, JPEG, etc.)
- Verify the image URL: Test the URL in a browser
- Image size: Very large images may cause OOM
- Base64 encoding: For base64 images, check that the encoding is correct
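A minimal check that image bytes round-trip through base64 and form a well-shaped data URL (the payload here is a tiny stand-in, not a real image):

```python
import base64

# PNG magic bytes as a stand-in for real image data.
raw_bytes = b"\x89PNG\r\n\x1a\n"

encoded = base64.b64encode(raw_bytes).decode("ascii")
data_url = f"data:image/png;base64,{encoded}"

# Round-trip: decoding must reproduce the original bytes exactly.
assert base64.b64decode(encoded) == raw_bytes
```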
Multimodal Model Errors
- Check model support: Verify the model supports multimodal inputs
- Message format: Follow the correct message structure with `image_url`
- Memory requirements: Multimodal models need more memory
- Environment variables: Check `SGLANG_MM_BUFFER_SIZE_MB` and related settings
Debugging Tips
Enable Verbose Logging
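As a sketch, assuming a `--log-level` server flag (the model path is a placeholder):

```bash
python -m sglang.launch_server \
  --model-path <model> \
  --log-level debug
```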
Check System Information
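SGLang ships an environment checker module that prints versions of key dependencies, GPU topology, and CUDA details:

```bash
python -m sglang.check_env
```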
Monitor GPU Memory
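Two common ways to watch GPU memory from the command line:

```bash
# Refresh full GPU status (including memory usage) once per second
watch -n 1 nvidia-smi

# Or stream per-device memory and utilization counters
nvidia-smi dmon -s mu
```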
Profile Performance
See Benchmark and Profiling for detailed profiling instructions.

Getting Help
If you're still experiencing issues:
- Search existing issues: Check GitHub Issues
- Join Slack: Ask the community at https://slack.sglang.io/
- File a bug report: Include:
  - SGLang version
  - GPU type and driver version
  - Full error traceback
  - Minimal reproduction code
  - Model and configuration used
See Also
- FAQ - Frequently asked questions
- Environment Variables - Configuration reference
- Hyperparameter Tuning - Performance optimization
- Server Arguments - Command-line options
