## Benchmark Results
Benchmarked on AWS m5.xlarge (4 vCPU, 16GB RAM) using `memtier_benchmark` with 200 clients and 256-byte values.
### Throughput (ops/sec)
| Workload | Redis 8 | Dragonfly 1.37 | Kora | vs Redis | vs Dragonfly |
|---|---|---|---|---|---|
| SET-only | 138,239 | 236,885 | 229,535 | +66.0% | -3.1% |
| GET-only | 144,240 | 241,305 | 239,230 | +65.9% | -0.9% |
| Mixed 1:1 | 139,014 | 232,507 | 233,377 | +67.9% | +0.4% |
| Pipeline x16 | 510,705 | 389,286 | 769,374 | +50.7% | +97.7% |
### p50 Latency (ms)
| Workload | Redis 8 | Dragonfly | Kora |
|---|---|---|---|
| SET-only | 1.415 | 0.847 | 0.831 |
| GET-only | 1.359 | 0.839 | 0.839 |
| Mixed 1:1 | 1.415 | 0.871 | 0.847 |
| Pipeline x16 | 6.303 | 8.191 | 4.063 |

Kora delivers sub-millisecond median latency across all non-pipelined workloads.

### p99 Latency (ms)
| Workload | Redis 8 | Dragonfly | Kora |
|---|---|---|---|
| SET-only | 2.111 | 1.175 | 1.223 |
| GET-only | 1.943 | 1.143 | 1.327 |
| Mixed 1:1 | 1.999 | 1.175 | 1.631 |
| Pipeline x16 | 9.087 | 10.815 | 7.519 |

Kora maintains consistent tail latency even under high load.

## Architecture Benefits
Kora’s performance comes from its core design principles:

### 1. Shard-Affinity I/O
Each worker thread owns:

- Data shard — subset of keys hashed to this worker
- Connection I/O — a `current_thread` Tokio runtime for connections accessing this shard
- No locks — data structures use `Rc<RefCell<>>` instead of `Arc<Mutex<>>`
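As an illustrative sketch (not Kora's actual types), a worker-owned shard with single-threaded interior mutability might look like this — because exactly one thread ever touches the map, `Rc<RefCell<>>` suffices and no atomics or mutexes are needed:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Sketch of a worker-local shard. The types `Shard`, `set`, and `get`
// are hypothetical; only the Rc<RefCell<>> pattern comes from the text.
struct Shard {
    data: Rc<RefCell<HashMap<String, String>>>,
}

impl Shard {
    fn new() -> Self {
        Shard { data: Rc::new(RefCell::new(HashMap::new())) }
    }
    fn set(&self, key: &str, value: &str) {
        // borrow_mut() is a cheap runtime check, not a lock.
        self.data.borrow_mut().insert(key.to_string(), value.to_string());
    }
    fn get(&self, key: &str) -> Option<String> {
        self.data.borrow().get(key).cloned()
    }
}

fn main() {
    let shard = Shard::new();
    shard.set("user:1", "alice");
    assert_eq!(shard.get("user:1").as_deref(), Some("alice"));
    assert_eq!(shard.get("user:2"), None);
    println!("ok");
}
```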
### 2. Zero Channel Hops for Local Keys
Commands accessing local-shard keys execute inline on the owning worker, with no channel hop.

### 3. Single Async Hop for Foreign Keys
Cross-shard commands use a single `tokio::mpsc` + `oneshot` round-trip.
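The round-trip can be sketched with std threads and channels standing in for Kora's Tokio tasks and `tokio::sync` channels — a simplification, but it shows the shape: one send to the shard's queue, one reply on a per-command channel:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// A command routed to a foreign shard; each carries its own reply
// channel, playing the role of a oneshot sender. Names are hypothetical.
enum Cmd {
    Set(String, String, mpsc::Sender<()>),
    Get(String, mpsc::Sender<Option<String>>),
}

// Spawn a shard worker that owns its map and serves commands from a queue.
fn spawn_shard() -> mpsc::Sender<Cmd> {
    let (tx, rx) = mpsc::channel::<Cmd>();
    thread::spawn(move || {
        let mut data: HashMap<String, String> = HashMap::new();
        for cmd in rx {
            match cmd {
                Cmd::Set(k, v, reply) => {
                    data.insert(k, v);
                    let _ = reply.send(());
                }
                Cmd::Get(k, reply) => {
                    let _ = reply.send(data.get(&k).cloned());
                }
            }
        }
    });
    tx
}

fn main() {
    let shard = spawn_shard();

    // One send + one reply: the single round-trip a foreign-key command pays.
    let (ack_tx, ack_rx) = mpsc::channel();
    shard.send(Cmd::Set("k".into(), "v".into(), ack_tx)).unwrap();
    ack_rx.recv().unwrap();

    let (val_tx, val_rx) = mpsc::channel();
    shard.send(Cmd::Get("k".into(), val_tx)).unwrap();
    assert_eq!(val_rx.recv().unwrap().as_deref(), Some("v"));
    println!("ok");
}
```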
### 4. Pipeline Performance
Kora excels at pipelined workloads because:

- Batch parsing — the RESP parser processes multiple commands per `read()` syscall
- Parallel shard dispatch — pipelined commands to different shards execute concurrently
- Response aggregation — responses are written in order to the TCP socket
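Batch parsing is why pipelining amortizes syscall cost: one `read()` can yield many complete commands. A toy sketch (real RESP parsing also handles partial frames and binary payloads, which this deliberately skips):

```rust
// Toy RESP batch parser: pulls every complete array command out of a
// single read buffer. Function name and shape are hypothetical.
fn parse_batch(buf: &str) -> Vec<Vec<String>> {
    let mut lines = buf.split("\r\n");
    let mut cmds = Vec::new();
    while let Some(header) = lines.next() {
        if header.is_empty() {
            break; // trailing empty segment after the final CRLF
        }
        // "*N" — an array of N bulk strings follows.
        let n: usize = header[1..].parse().unwrap();
        let mut cmd = Vec::new();
        for _ in 0..n {
            let _len = lines.next().unwrap(); // "$L" length line (unchecked here)
            cmd.push(lines.next().unwrap().to_string());
        }
        cmds.push(cmd);
    }
    cmds
}

fn main() {
    // Two pipelined commands arriving in one buffer.
    let buf = "*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n*2\r\n$3\r\nGET\r\n$1\r\nk\r\n";
    let cmds = parse_batch(buf);
    assert_eq!(cmds.len(), 2);
    assert_eq!(cmds[0], vec!["SET", "k", "v"]);
    assert_eq!(cmds[1], vec!["GET", "k"]);
    println!("ok");
}
```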
## Worker Count Configuration
The `--workers` flag controls shard parallelism and is the most important performance-tuning knob.
### Auto-Detection (Default)
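The original code sample for this subsection appears to have been lost. As a sketch — the binary name `kora-server` is an assumption, as it is not given in this document — auto-detection needs no flags:

```shell
# Hypothetical invocation: binary name "kora-server" is an assumption.
# With no --workers flag, the worker count defaults to the CPU count.
kora-server
nproc   # compare: the default worker count should match this
```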
### Explicit Worker Count
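Again as a sketch, with the same assumed binary name (`--workers` itself is documented above):

```shell
# Hypothetical invocation: binary name "kora-server" is an assumption.
kora-server --workers 8   # pin shard parallelism to 8 workers
```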
| System | Recommended Workers |
|---|---|
| 4-core laptop | 4 (default) |
| 8-core server | 8 (default) |
| 32-core EPYC | 32 (match physical cores) |
| 64-core Threadripper | 32-48 (avoid hyperthreads if contention detected) |
| Low memory (less than 2GB) | 2-4 (reduce memory overhead) |
### Tuning Strategy
- Start with default — let Kora auto-detect
- Monitor CPU usage — workers should be 70-90% utilized under load
- Increase workers if CPU is underutilized and you have cores available
- Decrease workers if you see context-switch overhead (check `perf` or `vmstat`)
### Memory vs. Workers
Each worker has a base overhead of ~2MB (shard metadata, Tokio runtime). A 32-worker setup therefore starts at roughly 64MB of worker overhead before any data is stored.

### Shard Key Distribution
Kora uses consistent hashing to distribute keys across shards. To reduce the impact of hot keys:

- More workers — finer-grained sharding reduces hot-shard impact
- Client-side sharding — distribute hot keys across multiple Kora instances
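The consistent-hash key-to-shard mapping can be sketched as a minimal hash ring (Kora's actual ring parameters and hash function are not specified here; these names are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash64<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

// Minimal consistent-hash ring: each shard contributes several virtual
// nodes, and a key belongs to the first vnode clockwise from its hash.
struct Ring {
    ring: BTreeMap<u64, usize>,
}

impl Ring {
    fn new(shards: usize, vnodes_per_shard: usize) -> Self {
        let mut ring = BTreeMap::new();
        for shard in 0..shards {
            for vnode in 0..vnodes_per_shard {
                ring.insert(hash64(&(shard, vnode)), shard);
            }
        }
        Ring { ring }
    }

    fn shard_for(&self, key: &str) -> usize {
        let h = hash64(&key);
        // First vnode at or after the key's hash, wrapping to the start.
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, &shard)| shard)
            .expect("ring is non-empty")
    }
}

fn main() {
    let ring = Ring::new(8, 64);
    let shard = ring.shard_for("user:42");
    assert!(shard < 8);
    // Routing is deterministic: the same key always lands on the same shard.
    assert_eq!(shard, ring.shard_for("user:42"));
    println!("ok");
}
```

Virtual nodes smooth out the key distribution, which is why more workers reduce hot-shard impact: each extra shard takes a finer slice of the ring.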
## Network Tuning
### TCP Socket Options
Kora uses Tokio’s default TCP socket configuration. For high-throughput production deployments, tune OS-level TCP settings (for example, `net.core.somaxconn` and `net.ipv4.tcp_max_syn_backlog`).

### Unix Sockets
For single-machine deployments, use Unix domain sockets for lower latency:

- ~20% lower latency for local connections
- ~10% higher throughput for pipelined workloads

Connect with `redis-cli -s /tmp/kora.sock`.
## Pipeline Optimization
Pipelining dramatically improves throughput by batching commands:

| Pipeline Depth | Throughput (ops/sec) |
|---|---|
| 1 | 233,377 |
| 4 | ~450,000 |
| 8 | ~650,000 |
| 16 | 769,374 |
| 32 | ~850,000 |
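A sweep like the table above can be run by varying `memtier_benchmark`'s pipeline depth; a sketch, where the host and port are assumptions:

```shell
# Sweep pipeline depth against a local Kora instance.
# 127.0.0.1:6379 is an assumption; adjust to your deployment.
for depth in 1 4 8 16 32; do
  memtier_benchmark -s 127.0.0.1 -p 6379 \
    --clients=200 --data-size=256 \
    --ratio=1:1 --pipeline="$depth"
done
```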
### Client-Side Pipelining
Most Redis clients support pipelining; batch commands in the client and flush them in a single write.

## Storage Performance
Persistence impacts write throughput depending on the `wal_sync` policy:
| WAL Sync Policy | Throughput Impact | Durability |
|---|---|---|
| `os_managed` | 0-5% overhead | Lowest |
| `every_second` | 5-10% overhead | Medium (1s window) |
| `every_write` | 30-50% overhead | Highest |
Recommended: `every_second` for production. It provides strong durability (a 1-second loss window on crash) with minimal performance cost.
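As a sketch, the policy would be set in the server configuration; the section name below is an assumption — only the `wal_sync` key and its values appear in this document:

```toml
# Hypothetical config fragment — the [persistence] section name is an assumption.
[persistence]
wal_sync = "every_second"   # fsync the WAL once per second
```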
### RDB Snapshots
RDB snapshots run in the background and do not block the data path, and automatic snapshots can be enabled in the server configuration. Resource impact during a snapshot:

- CPU: 5-10% during snapshot creation (LZ4 compression)
- I/O: burst write to disk (proportional to dataset size)
- Memory: copy-on-write duplication of keys modified during the snapshot
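Enabling automatic snapshots might look like the following; every key name here is an assumption, as the original snippet was not preserved:

```toml
# Hypothetical config fragment — all key names are assumptions.
[persistence]
rdb_snapshot_interval = "15m"        # snapshot every 15 minutes
rdb_path = "/var/lib/kora/dump.rdb"  # where the snapshot is written
```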
## Memory Optimization
### Value Size
Smaller values → higher throughput (less memory-copy overhead):

| Value Size | SET Throughput |
|---|---|
| 64 bytes | ~280K ops/sec |
| 256 bytes | ~230K ops/sec |
| 1KB | ~150K ops/sec |
| 4KB | ~60K ops/sec |
### Key Size
Use short, structured keys (e.g. `user:42:profile`). Key length affects:

- Memory usage — key storage and hash table overhead
- Network bandwidth — smaller RESP2 payloads
- CPU usage — faster hashing and comparison
## Profiling and Debugging
### CPU Profiling
Use `perf` to identify bottlenecks:
- Hot functions in shard workers
- Lock contention (should be minimal)
- Syscall overhead (`read`, `write`, `epoll_wait`)
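A typical `perf` session against a running server might look like this; the process name is an assumption:

```shell
# Sample call stacks of the running server for 30 seconds.
# "kora" as the process name is an assumption; adjust to your binary.
perf record -g -p "$(pgrep -f kora)" -- sleep 30
perf report --sort symbol   # inspect hot functions and stack traces
```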
### Latency Tracing
Enable trace-level logging for a detailed latency breakdown:

- Command parse time
- Shard routing decision
- Execution time
- Response serialization time
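If Kora follows the common Rust `tracing`/`env_logger` convention — an assumption, since the exact logging setup is not shown in this document — trace logging could be enabled via the environment:

```shell
# Assumption: Kora reads the conventional RUST_LOG filter.
# The binary name "kora-server" is also an assumption.
RUST_LOG=trace kora-server
```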
## Benchmarking Kora
Reproduce the benchmark results with `memtier_benchmark`, using the same parameters as above (200 clients, 256-byte values).

## Performance Checklist
- ✅ Use default worker count (auto-detect) unless profiling shows otherwise
- ✅ Enable pipelining in client applications
- ✅ Use `wal_sync = "every_second"` for balanced durability/performance
- ✅ Keep value sizes under 1KB for cache workloads
- ✅ Use short, structured keys
- ✅ Monitor per-command latency via Prometheus metrics
- ✅ Profile with `perf` if throughput plateaus
- ✅ Consider Unix sockets for single-machine deployments