Offline Inference Benchmarks
Offline inference measures throughput when processing a batch of requests without streaming or online constraints.

Test Configuration
- Hardware: 1x H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100-1024 tokens
- Output Length: Randomly sampled between 100-1024 tokens
- Random Seed: 0 (for reproducibility)
Benchmark Script
The offline benchmark is implemented in benchmark/offline/bench.py and follows the test configuration listed above.
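The configuration above can be sketched in a few lines. The `OfflineBenchConfig` and `sample_requests` names below are illustrative, not Mini-SGLang's actual API; the numbers come directly from the test configuration:

```python
import random
from dataclasses import dataclass

# Values from the docs: 256 requests, lengths in [100, 1024], seed 0.
# The class and function names are illustrative placeholders.
@dataclass
class OfflineBenchConfig:
    num_requests: int = 256
    min_len: int = 100
    max_len: int = 1024
    seed: int = 0

def sample_requests(cfg: OfflineBenchConfig) -> list[tuple[int, int]]:
    rng = random.Random(cfg.seed)  # fixed seed -> reproducible workload
    return [
        (rng.randint(cfg.min_len, cfg.max_len),   # input length
         rng.randint(cfg.min_len, cfg.max_len))   # output length
        for _ in range(cfg.num_requests)
    ]
```

Because the seed is fixed, two invocations produce the identical request set, which is what makes the results reproducible across runs.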
Results
Mini-SGLang demonstrates throughput competitive with the full SGLang implementation.
The benchmark shows the impact of overlap scheduling, which can be disabled for ablation studies via a command-line flag.
Running the Benchmark
To run the offline benchmark yourself, invoke benchmark/offline/bench.py. The script reports:
- Total tokens generated
- Time elapsed
- Throughput in tokens/second
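The three reported metrics reduce to a simple timing loop. In this sketch, `generate` is a placeholder for the engine's batch call, not a real Mini-SGLang function:

```python
import time

def run_offline_benchmark(generate, requests):
    # `generate` is a stand-in for the engine call; it must return
    # the number of output tokens produced for one request.
    start = time.perf_counter()
    total_tokens = sum(generate(req) for req in requests)
    elapsed = time.perf_counter() - start
    return {
        "total_tokens": total_tokens,          # total tokens generated
        "elapsed_s": elapsed,                  # time elapsed
        "throughput": total_tokens / elapsed,  # tokens/second
    }
```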
Online Inference Benchmarks
Online inference simulates real-world serving conditions with concurrent requests arriving at different times.

Test Configuration
- Hardware: 4x H200 GPU (connected by NVLink)
- Model: Qwen3-32B
- Dataset: Qwen trace
- Requests: First 1000 requests from the trace
- Request Scales: 0.4, 0.5, 0.6, 0.7, 0.8, 1.6 (from fast to slow arrival rates)
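One plausible reading of the request scale is a multiplier on the trace's arrival timestamps, so a smaller scale compresses the trace into a faster arrival rate. This is an assumption about the convention; Mini-SGLang's script may implement it differently:

```python
def scale_arrivals(timestamps: list[float], scale: float) -> list[float]:
    """Compress (scale < 1) or stretch (scale > 1) trace arrival times,
    so that a scale of 0.4 gives the fastest arrivals and 1.6 the slowest."""
    t0 = timestamps[0]
    return [t0 + (t - t0) * scale for t in timestamps]
```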
Server Launch Commands
Mini-SGLang:

Results
Mini-SGLang achieves performance comparable to SGLang in online serving scenarios.
The benchmark measures:
- Request throughput (requests/second)
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- End-to-end latency
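Given per-token arrival timestamps, these metrics can be computed as below; the function is a sketch of the standard definitions, not code from the benchmark script:

```python
def latency_metrics(send_time: float, token_times: list[float]) -> dict:
    """Per-request latency metrics from timestamps:
    send_time   -- when the request was issued
    token_times -- arrival time of each output token."""
    ttft = token_times[0] - send_time            # Time to First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean Inter-Token Latency
    e2e = token_times[-1] - send_time            # end-to-end latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e}
```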
Running the Benchmark
First, start the server, then launch the benchmark client. The script will:
- Download the Qwen trace dataset if not already present
- Run benchmarks at different request rates
- Report performance metrics for each rate
Alternative Benchmarks
WildChat Dataset
For more realistic workloads, use the WildChat benchmark. The script:
- Downloads real user conversations from the WildChat-4.8M dataset
- Filters for English and Chinese prompts
- Excludes toxic or redacted content
- Reports length statistics (p50, p90, p99)
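The length statistics can be computed with nearest-rank percentiles; this helper is a sketch of that standard method, not the benchmark's own code:

```python
import math

def length_percentiles(lengths, ps=(50, 90, 99)) -> dict:
    """Nearest-rank percentiles for prompt/response length statistics."""
    xs = sorted(lengths)
    stats = {}
    for p in ps:
        # Nearest-rank: the ceil(p/100 * n)-th smallest value (1-indexed).
        k = max(1, math.ceil(p / 100 * len(xs)))
        stats[f"p{p}"] = xs[k - 1]
    return stats
```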
Simple Online Benchmark
For quick testing with custom batch sizes, a simpler benchmark is available with these defaults:
- Default batch size: 64
- Max input length: 8192 tokens
- Output length: Randomly sampled 16-1024 tokens
- Port: 1919
Benchmark Methodology
Warmup
All benchmarks include a warmup phase to:
- JIT-compile CUDA kernels
- Initialize FlashInfer cache
- Ensure stable GPU clocks
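Structurally, warmup is just a few untimed iterations before measurement begins; `run_step` here is a stand-in for one engine forward pass, not a real API:

```python
def warmup(run_step, iters: int = 8) -> None:
    # Run a handful of untimed iterations so that one-time costs
    # (kernel JIT compilation, cache initialization, clock ramp-up)
    # do not pollute the measured phase.
    for _ in range(iters):
        run_step()
```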
Metrics Collected
Offline:
- Total output tokens generated
- Elapsed time
- Throughput (tokens/second)

Online:
- Request latency distribution
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- Throughput under load
Reproducibility
All benchmarks use fixed random seeds (typically 42 or 0) to ensure reproducible results. Hardware configurations and model versions are explicitly documented.

Performance Factors
Several factors influence benchmark results:
- Overlap Scheduling: Hides CPU overhead during GPU computation
- CUDA Graphs: Reduces kernel launch overhead in decode phase
- Chunked Prefill: Prevents OOM for long sequences
- Attention Backend: FlashAttention 3 for prefill, FlashInfer for decode
- Page Size: Affects memory fragmentation and allocation efficiency
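As an illustration of the first factor, overlap scheduling can be modeled as a two-stage pipeline in which the CPU prepares batch i+1 while batch i executes. This is a toy sketch of the idea, not Mini-SGLang's actual scheduler:

```python
import queue
import threading

def overlapped_loop(prepare_batch, run_batch, num_steps: int) -> list:
    """Toy overlap scheduler: CPU-side batch preparation for step i+1
    runs on a worker thread while step i executes, hiding scheduling
    overhead behind compute."""
    q = queue.Queue(maxsize=1)  # at most one batch prepared ahead

    def producer():
        for i in range(num_steps):
            q.put(prepare_batch(i))  # CPU: build the next batch
        q.put(None)                  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(run_batch(batch))  # "GPU": execute current batch
    return results
```

Disabling overlap for an ablation corresponds to running `prepare_batch` and `run_batch` strictly in sequence on one thread, which exposes the scheduler's CPU time in the end-to-end numbers.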
Hardware Notes
- H200 GPUs provide 141 GB HBM3e memory and improved bandwidth over H100
- NVLink enables efficient tensor parallelism across multiple GPUs
- CUDA Toolkit version should match driver version for optimal kernel performance