Offline Inference Benchmarks
Offline inference measures throughput when processing a batch of requests without streaming or online constraints.

Test Configuration
- Hardware: 1x H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100-1024 tokens
- Output Length: Randomly sampled between 100-1024 tokens
- Random Seed: 0 (for reproducibility)
Benchmark Script
The offline benchmark is implemented in benchmark/offline/bench.py and follows the test configuration listed above.
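The configuration above can be sketched in a few lines. The `OfflineBenchConfig` and `sample_requests` names below are illustrative, not Mini-SGLang's actual API; the numbers come directly from the test configuration:

```python
import random
from dataclasses import dataclass

# Values from the docs: 256 requests, lengths in [100, 1024], seed 0.
# The class and function names are illustrative placeholders.
@dataclass
class OfflineBenchConfig:
    num_requests: int = 256
    min_len: int = 100
    max_len: int = 1024
    seed: int = 0

def sample_requests(cfg: OfflineBenchConfig) -> list[tuple[int, int]]:
    rng = random.Random(cfg.seed)  # fixed seed -> reproducible workload
    return [
        (rng.randint(cfg.min_len, cfg.max_len),   # input length
         rng.randint(cfg.min_len, cfg.max_len))   # output length
        for _ in range(cfg.num_requests)
    ]
```

Because the seed is fixed, two invocations produce the identical request set, which is what makes the results reproducible across runs.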
Results
Mini-SGLang demonstrates throughput competitive with the full SGLang implementation.
The benchmark shows the impact of overlap scheduling, which can be disabled for ablation studies via a command-line flag.
Running the Benchmark
To run the offline benchmark yourself, invoke benchmark/offline/bench.py. The script reports:
- Total tokens generated
- Time elapsed
- Throughput in tokens/second
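The three reported metrics reduce to a simple timing loop. In this sketch, `generate` is a placeholder for the engine's batch call, not a real Mini-SGLang function:

```python
import time

def run_offline_benchmark(generate, requests):
    # `generate` is a stand-in for the engine call; it must return
    # the number of output tokens produced for one request.
    start = time.perf_counter()
    total_tokens = sum(generate(req) for req in requests)
    elapsed = time.perf_counter() - start
    return {
        "total_tokens": total_tokens,          # total tokens generated
        "elapsed_s": elapsed,                  # time elapsed
        "throughput": total_tokens / elapsed,  # tokens/second
    }
```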
Online Inference Benchmarks
Online inference simulates real-world serving conditions with concurrent requests arriving at different times.

Test Configuration
- Hardware: 4x H200 GPU (connected by NVLink)
- Model: Qwen3-32B
- Dataset: Qwen trace
- Requests: First 1000 requests from the trace
- Request Scales: 0.4, 0.5, 0.6, 0.7, 0.8, 1.6 (from fast to slow arrival rates)
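One plausible reading of the request scale is a multiplier on the trace's arrival timestamps, so a smaller scale compresses the trace into a faster arrival rate. This is an assumption about the convention; Mini-SGLang's script may implement it differently:

```python
def scale_arrivals(timestamps: list[float], scale: float) -> list[float]:
    """Compress (scale < 1) or stretch (scale > 1) trace arrival times,
    so that a scale of 0.4 gives the fastest arrivals and 1.6 the slowest."""
    t0 = timestamps[0]
    return [t0 + (t - t0) * scale for t in timestamps]
```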
Server Launch Commands
Mini-SGLang:

Results
Mini-SGLang achieves performance comparable to SGLang in online serving scenarios.
The benchmark measures:
- Request throughput (requests/second)
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- End-to-end latency
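Given per-token arrival timestamps, these metrics can be computed as below; the function is a sketch of the standard definitions, not code from the benchmark script:

```python
def latency_metrics(send_time: float, token_times: list[float]) -> dict:
    """Per-request latency metrics from timestamps:
    send_time   -- when the request was issued
    token_times -- arrival time of each output token."""
    ttft = token_times[0] - send_time            # Time to First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean Inter-Token Latency
    e2e = token_times[-1] - send_time            # end-to-end latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e}
```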
Running the Benchmark
First, start the server, then launch the benchmark client. The script will:
- Download the Qwen trace dataset if not already present
- Run benchmarks at different request rates
- Report performance metrics for each rate
Alternative Benchmarks
WildChat Dataset
For more realistic workloads, use the WildChat benchmark. The script:
- Downloads real user conversations from the WildChat-4.8M dataset
- Filters for English and Chinese prompts
- Excludes toxic or redacted content
- Reports length statistics (p50, p90, p99)
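The length statistics can be computed with nearest-rank percentiles; this helper is a sketch of that standard method, not the benchmark's own code:

```python
import math

def length_percentiles(lengths, ps=(50, 90, 99)) -> dict:
    """Nearest-rank percentiles for prompt/response length statistics."""
    xs = sorted(lengths)
    stats = {}
    for p in ps:
        # Nearest-rank: the ceil(p/100 * n)-th smallest value (1-indexed).
        k = max(1, math.ceil(p / 100 * len(xs)))
        stats[f"p{p}"] = xs[k - 1]
    return stats
```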
Simple Online Benchmark
For quick testing with custom batch sizes, a simpler benchmark is available with these defaults:
- Default batch size: 64
- Max input length: 8192 tokens
- Output length: Randomly sampled 16-1024 tokens
- Port: 1919
Benchmark Methodology
Warmup
All benchmarks include a warmup phase to:
- JIT-compile CUDA kernels
- Initialize FlashInfer cache
- Ensure stable GPU clocks
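Structurally, warmup is just a few untimed iterations before measurement begins; `run_step` here is a stand-in for one engine forward pass, not a real API:

```python
def warmup(run_step, iters: int = 8) -> None:
    # Run a handful of untimed iterations so that one-time costs
    # (kernel JIT compilation, cache initialization, clock ramp-up)
    # do not pollute the measured phase.
    for _ in range(iters):
        run_step()
```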
Metrics Collected
Offline:
- Total output tokens generated
- Elapsed time
- Throughput (tokens/second)

Online:
- Request latency distribution
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- Throughput under load
Reproducibility
All benchmarks use fixed random seeds (typically 42 or 0) to ensure reproducible results. Hardware configurations and model versions are explicitly documented.

Performance Factors
Several factors influence benchmark results:
- Overlap Scheduling: Hides CPU overhead during GPU computation
- CUDA Graphs: Reduces kernel launch overhead in decode phase
- Chunked Prefill: Prevents OOM for long sequences
- Attention Backend: FlashAttention 3 for prefill, FlashInfer for decode
- Page Size: Affects memory fragmentation and allocation efficiency
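As an illustration of the first factor, overlap scheduling can be modeled as a two-stage pipeline in which the CPU prepares batch i+1 while batch i executes. This is a toy sketch of the idea, not Mini-SGLang's actual scheduler:

```python
import queue
import threading

def overlapped_loop(prepare_batch, run_batch, num_steps: int) -> list:
    """Toy overlap scheduler: CPU-side batch preparation for step i+1
    runs on a worker thread while step i executes, hiding scheduling
    overhead behind compute."""
    q = queue.Queue(maxsize=1)  # at most one batch prepared ahead

    def producer():
        for i in range(num_steps):
            q.put(prepare_batch(i))  # CPU: build the next batch
        q.put(None)                  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(run_batch(batch))  # "GPU": execute current batch
    return results
```

Disabling overlap for an ablation corresponds to running `prepare_batch` and `run_batch` strictly in sequence on one thread, which exposes the scheduler's CPU time in the end-to-end numbers.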
Hardware Notes
- H200 GPUs provide 141 GB HBM3e memory and improved bandwidth over H100
- NVLink enables efficient tensor parallelism across multiple GPUs
- CUDA Toolkit version should match driver version for optimal kernel performance