Mini-SGLang achieves state-of-the-art throughput and latency through advanced optimizations including Radix Cache, Chunked Prefill, and Overlap Scheduling.

Offline Inference Benchmarks

Offline inference measures throughput when processing a batch of requests without streaming or online arrival constraints.

Test Configuration

  • Hardware: 1x H200 GPU
  • Models: Qwen3-0.6B, Qwen3-14B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100 and 1024 tokens
  • Output Length: Randomly sampled between 100 and 1024 tokens
  • Random Seed: 0 (for reproducibility)
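The length sampling above can be sketched as follows; the function name and exact sampling call are illustrative, not taken from the benchmark script itself:

```python
import random

def sample_lengths(num_requests, low=100, high=1024, seed=0):
    """Sample per-request (input, output) lengths uniformly from [low, high].

    Using a dedicated Random instance with a fixed seed makes the sampled
    workload identical across runs.
    """
    rng = random.Random(seed)
    return [(rng.randint(low, high), rng.randint(low, high))
            for _ in range(num_requests)]

# 256 request shapes, matching the test configuration above
lengths = sample_lengths(256)
```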

Benchmark Script

The offline benchmark is implemented in benchmark/offline/bench.py. Key configurations:
llm = LLM(
    "Qwen/Qwen3-0.6B",
    max_seq_len_override=4096,
    max_extend_tokens=16384,
    cuda_graph_max_bs=256,
    page_size=256,
)

Results

Mini-SGLang demonstrates competitive throughput compared to the full SGLang implementation (see the offline benchmark results figure). The benchmark shows the impact of overlap scheduling, which can be disabled for ablation studies using:
MINISGL_DISABLE_OVERLAP_SCHEDULING=1 python benchmark/offline/bench.py

Running the Benchmark

To run the offline benchmark yourself:
cd benchmark/offline
python bench.py
The script will output:
  • Total tokens generated
  • Time elapsed
  • Throughput in tokens/second
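The three reported quantities are related in the obvious way: throughput is total tokens divided by elapsed time. A minimal sketch of that measurement loop (the function and the shape of `generate_fn`'s return value are assumptions, not the script's actual interface):

```python
import time

def measure_throughput(generate_fn, requests):
    """Time a batch generation and compute the offline metrics:
    total tokens, elapsed time, and tokens/second."""
    start = time.perf_counter()
    outputs = generate_fn(requests)  # assumed: one token list per request
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(tokens) for tokens in outputs)
    return {
        "total_tokens": total_tokens,
        "elapsed_s": elapsed,
        "throughput_tok_s": total_tokens / elapsed,
    }
```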

Online Inference Benchmarks

Online inference simulates real-world serving conditions with concurrent requests arriving at different times.

Test Configuration

  • Hardware: 4x H200 GPUs (connected via NVLink)
  • Model: Qwen3-32B
  • Dataset: Qwen trace
  • Requests: First 1000 requests from the trace
  • Request Scales: 0.4, 0.5, 0.6, 0.7, 0.8, 1.6 (smaller scales mean faster arrival rates)
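One common way to apply such a scale factor is to multiply the inter-arrival gaps of the trace timestamps, so a scale below 1 compresses the trace (faster arrivals) and a scale above 1 stretches it. This is a plausible sketch, not necessarily how the benchmark client implements it:

```python
def scale_arrivals(timestamps, scale):
    """Rescale inter-arrival gaps in a trace of arrival timestamps.

    scale < 1 compresses the trace (requests arrive faster);
    scale > 1 stretches it (slower arrivals).
    """
    t0 = timestamps[0]
    return [t0 + (t - t0) * scale for t in timestamps]
```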

Server Launch Commands

Mini-SGLang:
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive
SGLang (for comparison):
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer

Results

Mini-SGLang achieves comparable performance to SGLang in online serving scenarios (see the online benchmark results figure). The benchmark measures:
  • Request throughput (requests/second)
  • Time to First Token (TTFT)
  • Inter-Token Latency (ITL)
  • End-to-end latency
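Given per-token arrival timestamps for a streamed response, TTFT, ITL, and end-to-end latency follow directly; this helper is a sketch for clarity, not code from the benchmark:

```python
def latency_metrics(request_start, token_times):
    """Derive TTFT, mean ITL, and end-to-end latency from the timestamps
    at which each output token arrived."""
    ttft = token_times[0] - request_start          # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0  # inter-token latency
    return {
        "ttft": ttft,
        "mean_itl": mean_itl,
        "e2e_latency": token_times[-1] - request_start,
    }
```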

Running the Benchmark

First, start the server:
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive
Then run the benchmark client:
cd benchmark/online
python bench_qwen.py
The script will:
  1. Download the Qwen trace dataset if not already present
  2. Run benchmarks at different request rates
  3. Report performance metrics for each rate

Alternative Benchmarks

WildChat Dataset

For more realistic workloads, use the WildChat benchmark:
cd benchmark/offline
python bench_wildchat.py
This benchmark:
  • Downloads real user conversations from the WildChat-4.8M dataset
  • Filters for English and Chinese prompts
  • Excludes toxic or redacted content
  • Reports length statistics (p50, p90, p99)
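The p50/p90/p99 length statistics can be computed with a simple nearest-rank percentile over the collected lengths; this is an illustrative implementation, not the script's:

```python
import math

def length_percentiles(lengths, percentiles=(50, 90, 99)):
    """Nearest-rank percentile statistics over a list of prompt/response lengths."""
    data = sorted(lengths)
    n = len(data)
    # nearest-rank: the value at ceil(p/100 * n), clamped to the data range
    return {f"p{p}": data[min(n - 1, math.ceil(p / 100 * n) - 1)]
            for p in percentiles}
```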

Simple Online Benchmark

For quick testing with custom batch sizes:
cd benchmark/online
python bench_simple.py
Configuration:
  • Default batch size: 64
  • Max input length: 8192 tokens
  • Output length: Randomly sampled between 16 and 1024 tokens
  • Port: 1919
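A fixed-batch-size client like this one can be sketched with `asyncio`: fire all requests concurrently, each with its own sampled output length, and wait for every response. The `send_request` callable and its `max_new_tokens` parameter are assumptions for illustration:

```python
import asyncio
import random

async def run_batch(send_request, batch_size=64, seed=42):
    """Fire a fixed-size batch of concurrent requests, each with a random
    output length in [16, 1024], and wait for all responses."""
    rng = random.Random(seed)
    tasks = [
        asyncio.create_task(send_request(max_new_tokens=rng.randint(16, 1024)))
        for _ in range(batch_size)
    ]
    return await asyncio.gather(*tasks)
```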

Benchmark Methodology

Warmup

All benchmarks include a warmup phase to:
  • JIT-compile CUDA kernels
  • Initialize FlashInfer cache
  • Ensure stable GPU clocks
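In its simplest form, a warmup phase is just a few untimed generations before measurement begins; a minimal sketch (the `generate_fn` signature is hypothetical):

```python
def warmup(generate_fn, num_iters=3):
    """Run a few short, untimed generations so one-time costs (kernel JIT
    compilation, cache initialization) don't skew the measured results."""
    for _ in range(num_iters):
        generate_fn(["warmup prompt"], max_new_tokens=8)
```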

Metrics Collected

Offline:
  • Total output tokens generated
  • Elapsed time
  • Throughput (tokens/second)
Online:
  • Request latency distribution
  • Time to First Token (TTFT)
  • Inter-Token Latency (ITL)
  • Throughput under load

Reproducibility

All benchmarks use fixed random seeds (typically 42 or 0) to ensure reproducible results. Hardware configurations and model versions are explicitly documented.
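Seeding for reproducibility typically means fixing every RNG the workload touches; a sketch of such a helper (the function name is illustrative, and the NumPy/PyTorch branches are only exercised when those packages are installed):

```python
import random

def seed_everything(seed=0):
    """Seed Python's RNG, plus NumPy and PyTorch when installed, so repeated
    benchmark runs sample identical workloads."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```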

Performance Factors

Several factors influence benchmark results:
  1. Overlap Scheduling: Hides CPU overhead during GPU computation
  2. CUDA Graphs: Reduces kernel launch overhead in decode phase
  3. Chunked Prefill: Prevents OOM for long sequences
  4. Attention Backend: FlashAttention 3 for prefill, FlashInfer for decode
  5. Page Size: Affects memory fragmentation and allocation efficiency
See Optimization Tips for detailed tuning guidance.

Hardware Notes

  • H200 GPUs provide 141 GB HBM3e memory and improved bandwidth over H100
  • NVLink enables efficient tensor parallelism across multiple GPUs
  • The CUDA Toolkit version should be compatible with the installed driver for optimal kernel performance
For different hardware configurations, adjust batch sizes and parallelism settings accordingly.
