## Overview
Data Parallelism (DP) replicates the entire model across multiple GPU sets, with each replica processing independent batches of requests. This is the simplest and most effective way to scale throughput when you have sufficient GPU memory.

## Types of Data Parallelism
- Standard DP: Full model replication, independent inference per replica
- Data Parallelism Attention (DPA): Advanced strategy that applies DP specifically to attention layers
## Standard Data Parallelism

### How It Works

Each DP replica:
- Has a complete copy of the model weights
- Processes different batches independently
- Requires no inter-replica communication during inference
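As a minimal sketch, a native-DP launch on one node looks like the following (the model name and port are placeholders; `--dp-size` is the flag this page documents, and the other flags follow sglang's standard `launch_server` conventions, so verify them against your installed version):

```shell
# Two full model replicas on one node; each replica serves batches independently.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 2 \
  --port 30000
```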
### When to Use Standard DP

Use standard DP when:
- The model fits in GPU memory (or across TP within a node)
- You need to maximize throughput with simple scaling
- You are working with standard attention models (Llama, Qwen, Mistral)
- You have sufficient GPU resources for full replicas
## Data Parallelism Attention (DPA)

DPA is an advanced parallelism strategy that applies data parallelism specifically to the attention component, providing significant benefits for Multi-Head Latent Attention (MLA) models.

### Why DPA for MLA Models?

MLA models like DeepSeek have only one KV head. With standard Tensor Parallelism:

❌ Problems:
- KV cache is duplicated across all GPUs
- Wasted memory limits batch size
- Throughput is reduced by memory constraints

✅ With DPA:
- Each DP replica maintains its own KV cache (no duplication)
- Memory savings enable significantly larger batch sizes
- Each replica can be in a different forward mode (prefill, decode, idle)
- Decoding throughput improves substantially
### Architecture
- Each DP replica processes different batches independently
- No KV cache duplication across replicas
- Independent forward modes per replica
- Can be combined with EP for MoE models
### Supported Models

MLA (Multi-Head Latent Attention) models, where DPA provides maximum benefit:
- DeepSeek family (DeepSeek-V2, DeepSeek-V3, DeepSeek-R1)
- MiniMax models
- Kimi-K2
- Other MLA-architecture models

Also supported:
- Qwen models (see PR #6121)

Not recommended:
- Llama (use standard DP or TP instead)
- Models with standard GQA
## Configuration

### Standard DP with SGLang Model Gateway (Recommended)
The recommended way to deploy data parallelism is using SGLang Model Gateway (SMG), which provides:
- Cache-aware routing (up to 92% throughput improvement)
- Advanced load balancing policies
- Health monitoring and circuit breakers
- Hot worker add/remove
- 40+ Prometheus metrics
- Production-ready reliability
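As a sketch, the simplest SMG deployment co-launches the gateway and the DP workers from one command. The module name follows the sglang-router conventions SMG grew out of, and `--router-policy` comes from this page's configuration table; both are assumptions to verify against your installed version:

```shell
# Gateway + 4 DP workers from a single command, with cache-aware routing.
python -m sglang_router.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --router-policy cache_aware \
  --port 30000
```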
### DPA for MLA Models

Basic DPA setup: both `--dp-size > 1` and `--enable-dp-attention` are required.
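A minimal sketch of that setup, using the two flags above plus sglang's standard launch flags (verify against your installed version):

```shell
# 8 GPUs: attention runs data-parallel (dp=8) within the tp=8 group,
# so the single-KV-head cache is not duplicated across GPUs.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --trust-remote-code
```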
DPA + EP (recommended for DeepSeek MoE):
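A sketch of the combined setup, assuming an `--ep-size` flag for expert parallelism (an assumption; the exact EP flag may differ by sglang version):

```shell
# DPA for the attention layers plus EP for the MoE layers (dp-size == ep-size).
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --ep-size 8 \
  --trust-remote-code
```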
### Multi-Node DPA
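A sketch of a two-node DPA launch using sglang's standard multi-node flags (`--nnodes`, `--node-rank`, `--dist-init-addr`); the address is a placeholder:

```shell
# Node 0 (master): 16 GPUs total across 2 nodes, dp=16 with DPA.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 16 --dp-size 16 --enable-dp-attention \
  --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code

# Node 1: identical command except for the rank.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 16 --dp-size 16 --enable-dp-attention \
  --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code
```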
## SGLang Model Gateway (SMG)

SGLang Model Gateway is a production-ready, Rust-based routing system for DP deployments.

### Installation
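The gateway has historically been distributed as the `sglang-router` package on PyPI; the package name below is an assumption if SMG has since been renamed:

```shell
pip install sglang-router
```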
### Deployment Options

#### Option A: Co-launch (Simplest)

The gateway launches the DP workers itself from a single command, so no separate worker management is needed.

### Load Balancing Policies
| Policy | Description | Best For |
|---|---|---|
| `cache_aware` | Combines cache locality with load balancing | Recommended for most workloads |
| `round_robin` | Cycles through workers in order | Simple, predictable distribution |
| `random` | Random worker selection | Baseline, testing |
| `power_of_two` | Samples two workers, picks the lighter one | Low-latency requirements |
How `cache_aware` routing works:
- Maintains an approximate radix tree per worker
- Routes to worker with highest prefix match
- Falls back to shortest-queue when imbalanced
- Auto-evicts old entries to prevent memory overflow
Measured impact:
- Workloads with shared prefixes: +92% throughput, +275% cache hit rate
- See the SGLang v0.4 blog
### Recommended Production Setup
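A sketch of a production co-launch, using the routing and health-check flags from this page's configuration summary (flag names are taken from that table; verify against your installed version):

```shell
# Gateway with cache-aware routing and periodic worker health checks.
python -m sglang_router.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --router-policy cache_aware \
  --router-health-check-interval-secs 30 \
  --port 30000
```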
### Monitoring

Worker status and routing behavior can be checked through the gateway, which exposes 40+ Prometheus metrics.

## Combining with Other Parallelism
### DP + TP
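A sketch using sglang's standard flags (model name is a placeholder; verify flags against your installed version):

```shell
# 8 GPUs as 2 replicas (dp=2), each sharded across 4 GPUs (tp=4).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --dp-size 2
```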
This is the most common combination: replicate the model across DP groups and shard each replica with TP.

### DPA + EP + TP (DeepSeek)
This is the recommended setup for DeepSeek MoE models; combine the DPA and EP flags from the configuration section above.

### Standard DP for MLA Models with SMG
To use standard DP (not DPA) for MLA models:
- Launch each replica independently with DPA disabled
- Connect the replicas to SMG for load balancing
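The two steps above can be sketched as follows. The standalone-router module, `--worker-urls`, and `--policy` follow sglang-router conventions and are assumptions to verify; ports and hosts are placeholders:

```shell
# Two independent TP-only replicas (note: no --enable-dp-attention).
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 --trust-remote-code --port 30001
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 --trust-remote-code --port 30002

# Gateway load-balancing across the two replicas.
python -m sglang_router.launch_router \
  --worker-urls http://localhost:30001 http://localhost:30002 \
  --policy cache_aware
```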
## Performance Comparison

### SMG vs Native DP
| Feature | Native DP | SMG-Based DP |
|---|---|---|
| Load Balancing | Basic in-process | Cache-aware, power-of-two |
| Cache Awareness | ❌ No | ✅ Yes (up to +275% hit rate) |
| Throughput | Baseline | +92% with cache-aware |
| Multi-Node | Limited | ✅ Full support |
| Health Monitoring | Basic | ✅ Circuit breakers, health checks |
| Reliability | Basic | ✅ Retries, rate limiting |
| Observability | Basic | ✅ 40+ Prometheus metrics |
| Hot Add/Remove | ❌ No | ✅ Yes |
### DPA vs Standard TP (DeepSeek)

Memory efficiency:
- Standard TP (tp=8): KV cache duplicated 8 times
- DPA (dp=8): each replica holds a unique KV cache
- Result: 8× more memory available for KV cache → larger batches

Throughput:
- DPA + EP on DeepSeek: up to 5× improvement vs vanilla TP
- See the Large-Scale EP Blog
## Best Practices
For Standard DP:
- Always use SMG instead of native DP for production
- Enable cache-aware routing for workloads with shared prefixes
- Monitor cache hit rates to validate routing effectiveness
- Use health checks to detect and remove unhealthy workers
- Start with co-launch for simplicity, then scale to separate workers
For DPA:
- Use DPA for MLA models (DeepSeek, MiniMax, Kimi-K2)
- Combine with EP for MoE models (DeepSeek-V3)
- Set `dp-size` equal to `ep-size` for optimal performance
- Ensure the `tp_size % dp_size == 0` constraint is satisfied
- Monitor per-replica utilization to ensure balanced workload
Production Deployment:
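An end-to-end sketch combining the practices above: DPA workers on two nodes fronted by the gateway. Module names, `--worker-urls`, `--policy`, and the health-check flag follow the sglang-router conventions and this page's configuration table; hosts and ports are placeholders, so verify everything against your installed version:

```shell
# Worker on node A (10.0.0.1); run the same command on node B (10.0.0.2).
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --trust-remote-code --port 30001

# Gateway with cache-aware routing and health checks across both workers.
python -m sglang_router.launch_router \
  --worker-urls http://10.0.0.1:30001 http://10.0.0.2:30001 \
  --policy cache_aware \
  --router-health-check-interval-secs 30
```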
## Troubleshooting

### DPA Not Activating

Symptom: `--enable-dp-attention` has no effect.

Solution: Ensure `--dp-size > 1`; DPA is a no-op when `dp-size == 1`.
### TP/DP Size Constraint Error

Symptom: "Constraint `tp_size % dp_size == 0` not satisfied"

Solution: Ensure TP is divisible by DP, e.g. `tp-size=8` with a `dp-size` of 1, 2, 4, or 8.
### Low Cache Hit Rate with SMG

Symptom: Low cache hit rate despite cache-aware routing.

Solution: Tune the cache-aware parameters `--cache-threshold` and `--balance-abs-threshold` (see the configuration summary below).

## Configuration Summary
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `--dp-size` | Data parallel size | 1 | 2-8 |
| `--enable-dp-attention` | Enable DPA | False | Enable for MLA models |
| `--router-policy` | SMG routing policy | round_robin | cache_aware |
| `--router-health-check-interval-secs` | Health check interval | None | 30 |
| `--cache-threshold` | Cache-aware threshold | 0.5 | 0.3-0.7 |
| `--balance-abs-threshold` | Load balance threshold | 32 | 32-64 |
## When to Choose Each Strategy
| Strategy | Use Case | Key Benefit |
|---|---|---|
| Native DP | Never recommended | Educational purposes only |
| SMG-Based DP | Production standard DP | Cache-aware routing, reliability |
| DPA | DeepSeek/MLA models | Eliminates KV cache duplication |
| DPA + EP | DeepSeek MoE models | Maximum throughput (up to 5× improvement) |
## Related Documentation
- SGLang Model Gateway - Complete SMG documentation
- Expert Parallelism - EP for MoE models
- Tensor Parallelism - TP fundamentals
- Large-Scale EP Blog - DPA+EP performance analysis
- SGLang v0.4 Blog - Cache-aware routing benchmarks
