Kora is designed for high throughput and low latency using a shared-nothing, shard-affinity threading architecture. This page covers benchmark results, performance tuning, and optimization strategies.

Benchmark Results

Benchmarked on AWS m5.xlarge (4 vCPU, 16GB RAM) using memtier_benchmark with 200 clients and 256-byte values.

Throughput (ops/sec)

| Workload     | Redis 8 | Dragonfly 1.37 | Kora    | vs Redis | vs Dragonfly |
|--------------|---------|----------------|---------|----------|--------------|
| SET-only     | 138,239 | 236,885        | 229,535 | +66.0%   | -3.1%        |
| GET-only     | 144,240 | 241,305        | 239,230 | +65.9%   | -0.9%        |
| Mixed 1:1    | 139,014 | 232,507        | 233,377 | +67.9%   | +0.4%        |
| Pipeline x16 | 510,705 | 389,286        | 769,374 | +50.7%   | +97.7%      |
Kora achieves ~66% higher throughput than Redis on single-command workloads; on pipelined workloads it is ~51% faster than Redis and ~98% faster than Dragonfly.

p50 Latency (ms)

| Workload     | Redis 8 | Dragonfly | Kora  |
|--------------|---------|-----------|-------|
| SET-only     | 1.415   | 0.847     | 0.831 |
| GET-only     | 1.359   | 0.839     | 0.839 |
| Mixed 1:1    | 1.415   | 0.871     | 0.847 |
| Pipeline x16 | 6.303   | 8.191     | 4.063 |

Kora delivers sub-millisecond median latency on all single-command workloads.

p99 Latency (ms)

| Workload     | Redis 8 | Dragonfly | Kora  |
|--------------|---------|-----------|-------|
| SET-only     | 2.111   | 1.175     | 1.223 |
| GET-only     | 1.943   | 1.143     | 1.327 |
| Mixed 1:1    | 1.999   | 1.175     | 1.631 |
| Pipeline x16 | 9.087   | 10.815    | 7.519 |

Kora maintains consistent tail latency even under high load.

Architecture Benefits

Kora’s performance comes from its core design principles:

1. Shard-Affinity I/O

Each worker thread owns:
  • Data shard — subset of keys hashed to this worker
  • Connection I/O — a current_thread Tokio runtime for connections accessing this shard
  • No locks — data structures use Rc<RefCell<>> instead of Arc<Mutex<>>
Benefit: Zero lock contention on the data path. Linear scaling with CPU cores.

2. Zero Channel Hops for Local Keys

Commands accessing local-shard keys execute inline:
Client → RESP Parser → Hash(key) → Local shard?
                                    ↓ Yes
                                    Execute inline (no channel)
                                    ↓
                                    Response
No message passing, no context switches.

3. Single Async Hop for Foreign Keys

Cross-shard commands use one tokio::mpsc + oneshot round-trip:
Client → RESP Parser → Hash(key) → Foreign shard?
                                    ↓ Yes
                                    Send via mpsc
                                    ↓
                                    Target worker executes
                                    ↓
                                    Response via oneshot
Minimal overhead: ~100ns for channel send.
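Both routing paths can be modeled in Python with asyncio (an illustrative sketch only — Kora's actual implementation uses Rust's tokio mpsc and oneshot channels, and `Worker` and `dispatch` here are hypothetical names, not Kora APIs):

```python
import asyncio

WORKERS = 4

def shard_of(key: str) -> int:
    # Same routing rule as Kora: hash the key, modulo worker count.
    return hash(key) % WORKERS

class Worker:
    """Owns one shard; other workers reach it only through its queue (mpsc analog)."""
    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.data = {}                 # shard-local store: no locks needed
        self.inbox = asyncio.Queue()   # mpsc analog

    async def run(self):
        while True:
            cmd, reply = await self.inbox.get()
            reply.set_result(self.execute(cmd))   # answer via oneshot analog

    def execute(self, cmd):
        op, key, *rest = cmd
        if op == "SET":
            self.data[key] = rest[0]
            return "OK"
        return self.data.get(key)

async def dispatch(workers, local_id: int, cmd):
    target = shard_of(cmd[1])
    if target == local_id:
        return workers[target].execute(cmd)              # local key: inline, no hop
    reply = asyncio.get_running_loop().create_future()   # oneshot analog
    await workers[target].inbox.put((cmd, reply))        # the single mpsc send
    return await reply

async def main():
    workers = [Worker(i) for i in range(WORKERS)]
    tasks = [asyncio.create_task(w.run()) for w in workers]
    assert await dispatch(workers, 0, ("SET", "k1", "v1")) == "OK"
    assert await dispatch(workers, 0, ("GET", "k1")) == "v1"
    for t in tasks:
        t.cancel()

asyncio.run(main())
```

Whichever path a command takes, exactly one task ever touches a given shard's data, which is what makes the lock-free `Rc<RefCell<>>` design safe.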

4. Pipeline Performance

Kora excels at pipelined workloads because:
  • Batch parsing — RESP parser processes multiple commands per read() syscall
  • Parallel shard dispatch — pipelined commands to different shards execute concurrently
  • Response aggregation — responses written in order to TCP socket
Result: 769K ops/sec on pipeline x16 (97% faster than Dragonfly).
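Batch parsing works because a pipeline is simply commands concatenated back to back on the wire, so one read() can pull in many of them. A minimal sketch of the RESP2 framing (`encode_resp` and `encode_pipeline` are illustrative helpers, not Kora or redis-cli APIs):

```python
def encode_resp(*args: str) -> bytes:
    """Encode one command as a RESP2 array of bulk strings."""
    out = [f"*{len(args)}\r\n".encode()]
    for a in args:
        out.append(f"${len(a)}\r\n{a}\r\n".encode())
    return b"".join(out)

def encode_pipeline(commands) -> bytes:
    # A pipeline is just commands concatenated: the server parses
    # several of them out of a single read() buffer.
    return b"".join(encode_resp(*c) for c in commands)

buf = encode_pipeline([("SET", "key1", "value1"), ("GET", "key1")])
# buf begins with b"*3\r\n$3\r\nSET\r\n..."
```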

Worker Count Configuration

The --workers flag controls shard parallelism and is the most important performance tuning knob.

Auto-Detection (Default)

kora-cli --port 6379  # workers = num_cpus::get()
Kora auto-detects logical CPU count and creates one worker per core. This is optimal for most workloads.

Explicit Worker Count

kora-cli --workers 16 --port 6379
Guidelines:
| System                    | Recommended Workers                               |
|---------------------------|---------------------------------------------------|
| 4-core laptop             | 4 (default)                                       |
| 8-core server             | 8 (default)                                       |
| 32-core EPYC              | 32 (match physical cores)                         |
| 64-core Threadripper      | 32-48 (avoid hyperthreads if contention detected) |
| Low memory (less than 2GB)| 2-4 (reduce memory overhead)                      |

Tuning Strategy

  1. Start with default — let Kora auto-detect
  2. Monitor CPU usage — workers should be 70-90% utilized under load
  3. Increase workers if CPU is underutilized and you have cores available
  4. Decrease workers if you see context switch overhead (check perf or vmstat)

Memory vs. Workers

Each worker has a base overhead of ~2MB (shard metadata, tokio runtime). For a 32-worker setup:
32 workers × 2MB = 64MB base overhead
Data memory is proportional to total key count, not worker count.

Shard Key Distribution

Kora uses hash-modulo sharding to distribute keys across shards (note this is not consistent hashing: changing worker_count remaps most keys):
let shard_id = hash(key) % worker_count;
Uniform distribution is critical for balanced CPU usage. Measure shard balance:
redis-cli -p 6379 INFO | grep shard
(Hypothetical command — Kora doesn’t expose shard-level stats via INFO yet.) Non-uniform workloads (e.g., 80% of requests hit 20% of keys) may cause hot shards. Monitor per-shard CPU usage and consider:
  • More workers — finer-grained sharding reduces hot-shard impact
  • Client-side sharding — distribute hot keys across multiple Kora instances
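To get a feel for how a skewed workload maps onto shards, the hash-modulo rule can be simulated standalone (Python's built-in hash is not Kora's actual hash function, and the 80/20 workload below is illustrative):

```python
import random
from collections import Counter

def shard_of(key: str, workers: int) -> int:
    return hash(key) % workers

def shard_load(requests, workers):
    """Count how many requests land on each shard."""
    counts = Counter(shard_of(k, workers) for k in requests)
    return [counts.get(i, 0) for i in range(workers)]

# Skewed workload: 80% of requests hit 20% of the keys.
random.seed(42)
hot = [f"key{i}" for i in range(20)]
cold = [f"key{i}" for i in range(20, 100)]
requests = random.choices(hot, k=8000) + random.choices(cold, k=2000)

for workers in (4, 16):
    load = shard_load(requests, workers)
    print(workers, "workers -> busiest shard handles",
          f"{max(load) / len(requests):.0%} of requests")
```

With more workers, each hot key still pins one shard, but the busiest shard's share of total traffic typically shrinks, which is the "finer-grained sharding" effect described above.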

Network Tuning

TCP Socket Options

Kora uses Tokio’s default TCP socket configuration. For high-throughput production deployments, tune OS-level TCP settings:
# Increase TCP buffer sizes
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase max socket backlog
sysctl -w net.core.somaxconn=4096

Unix Sockets

For single-machine deployments, use Unix domain sockets for lower latency:
kora-cli --unix-socket /tmp/kora.sock
Unix sockets eliminate TCP/IP stack overhead:
  • ~20% lower latency for local connections
  • ~10% higher throughput for pipelined workloads
Connect with redis-cli -s /tmp/kora.sock.

Pipeline Optimization

Pipelining dramatically improves throughput by batching commands:
# Single command per round-trip
redis-cli -p 6379 SET key1 value1  # ~0.8ms latency
redis-cli -p 6379 SET key2 value2  # 2 × 0.8ms = 1.6ms total

# Pipelined (both commands in one batch)
echo -e "SET key1 value1\nSET key2 value2" | redis-cli -p 6379 --pipe
# ~0.9ms total (80% faster)
Kora’s pipeline performance:
| Pipeline Depth | Throughput (ops/sec) |
|----------------|----------------------|
| 1              | 233,377              |
| 4              | ~450,000             |
| 8              | ~650,000             |
| 16             | 769,374              |
| 32             | ~850,000             |
Diminishing returns after 16-32 commands per pipeline.

Client-Side Pipelining

Most Redis clients support pipelining:
import redis
r = redis.Redis(host='localhost', port=6379)

pipe = r.pipeline()
for i in range(100):
    pipe.set(f'key{i}', f'value{i}')
pipe.execute()  # Send all 100 commands in one batch

Storage Performance

Persistence impacts write throughput depending on wal_sync policy:
| WAL Sync Policy | Throughput Impact | Durability         |
|-----------------|-------------------|--------------------|
| os_managed      | 0-5% overhead     | Lowest             |
| every_second    | 5-10% overhead    | Medium (1s window) |
| every_write     | 30-50% overhead   | Highest            |
Recommendation: Use every_second for production. It provides strong durability (1-second loss window on crash) with minimal performance cost.
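Assuming the same `[storage]` config section used for snapshots below (the key placement is an assumption; the `wal_sync` values are those from the table above), the recommended setting would look like:

```toml
[storage]
wal_sync = "every_second"  # fsync the WAL once per second (~1s loss window on crash)
```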

RDB Snapshots

RDB snapshots run in the background and do not block the data path. Enable automatic snapshots:
[storage]
snapshot_interval_secs = 300  # Snapshot every 5 minutes
Snapshot overhead:
  • CPU: 5-10% during snapshot creation (LZ4 compression)
  • I/O: Burst write to disk (on the order of the dataset size)
  • Memory: Copy-on-write for modified keys during snapshot

Memory Optimization

Value Size

Smaller values → higher throughput (less memory copy overhead):
| Value Size | SET Throughput |
|------------|----------------|
| 64 bytes   | ~280K ops/sec  |
| 256 bytes  | ~230K ops/sec  |
| 1KB        | ~150K ops/sec  |
| 4KB        | ~60K ops/sec   |
Guideline: Keep values under 1KB for cache workloads.

Key Size

Use short, structured keys:
❌ user:profile:12345:firstname
✅ u:12345:fn

❌ product_catalog_item_2024_12345
✅ p:12345
Shorter keys reduce:
  • Memory usage — key storage and hash table overhead
  • Network bandwidth — smaller RESP2 payloads
  • CPU usage — faster hashing and comparison
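These savings compound with key count. A back-of-the-envelope calculation for the first example above (key payload bytes only; real per-key cost also includes hash-table and allocator overhead):

```python
long_key  = "user:profile:12345:firstname"
short_key = "u:12345:fn"

# Bytes of key payload saved per key.
per_key_saving = len(long_key) - len(short_key)
print(per_key_saving)  # 18

# Across 10 million keys: ~180 MB of raw key bytes alone,
# before hash-table and RESP framing overhead.
total_mb = per_key_saving * 10_000_000 / 1_000_000
print(f"{total_mb:.0f} MB")  # 180 MB
```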

Profiling and Debugging

CPU Profiling

Use perf to identify bottlenecks:
# Record CPU samples for 10 seconds
perf record -F 999 -p $(pgrep kora-cli) sleep 10

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > kora.svg
Look for:
  • Hot functions in shard workers
  • Lock contention (should be minimal)
  • Syscall overhead (read, write, epoll_wait)

Latency Tracing

Enable trace-level logging for detailed latency breakdown:
RUST_LOG=kora_server=trace kora-cli --port 6379
Logs include:
  • Command parse time
  • Shard routing decision
  • Execution time
  • Response serialization time

Benchmarking Kora

Reproduce the benchmark results:
# Start Kora
kora-cli --port 6379 --workers 4

# Run memtier_benchmark (SET)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --key-pattern=R:R --ratio=1:0 \
  --data-size=256 --requests=1000000

# Run memtier_benchmark (GET)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --key-pattern=R:R --ratio=0:1 \
  --data-size=256 --requests=1000000

# Run memtier_benchmark (Mixed 1:1)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --key-pattern=R:R --ratio=1:1 \
  --data-size=256 --requests=1000000

# Run memtier_benchmark (Pipeline x16)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --pipeline=16 --key-pattern=R:R --ratio=1:1 \
  --data-size=256 --requests=1000000

Performance Checklist

  • ✅ Use default worker count (auto-detect) unless profiling shows otherwise
  • ✅ Enable pipelining in client applications
  • ✅ Use wal_sync = "every_second" for balanced durability/performance
  • ✅ Keep value sizes under 1KB for cache workloads
  • ✅ Use short, structured keys
  • ✅ Monitor per-command latency via Prometheus metrics
  • ✅ Profile with perf if throughput plateaus
  • ✅ Consider Unix sockets for single-machine deployments
