Kora is designed for high throughput and low latency using a shared-nothing, shard-affinity threading architecture. This page covers benchmark results, performance tuning, and optimization strategies.

Benchmark Results

Benchmarked on AWS m5.xlarge (4 vCPU, 16GB RAM) using memtier_benchmark with 200 clients and 256-byte values.

Throughput (ops/sec)

| Workload     | Redis 8 | Dragonfly 1.37 | Kora    | vs Redis | vs Dragonfly |
|--------------|---------|----------------|---------|----------|--------------|
| SET-only     | 138,239 | 236,885        | 229,535 | +66.0%   | -3.1%        |
| GET-only     | 144,240 | 241,305        | 239,230 | +65.9%   | -0.9%        |
| Mixed 1:1    | 139,014 | 232,507        | 233,377 | +67.9%   | +0.4%        |
| Pipeline x16 | 510,705 | 389,286        | 769,374 | +50.7%   | +97.7%      |
Kora achieves ~66% higher throughput than Redis on single-command workloads; on pipelined workloads it is ~51% faster than Redis and ~98% faster than Dragonfly.

p50 Latency (ms)

| Workload     | Redis 8 | Dragonfly | Kora  |
|--------------|---------|-----------|-------|
| SET-only     | 1.415   | 0.847     | 0.831 |
| GET-only     | 1.359   | 0.839     | 0.839 |
| Mixed 1:1    | 1.415   | 0.871     | 0.847 |
| Pipeline x16 | 6.303   | 8.191     | 4.063 |

Kora delivers sub-millisecond median latency on all single-command workloads.

p99 Latency (ms)

| Workload     | Redis 8 | Dragonfly | Kora  |
|--------------|---------|-----------|-------|
| SET-only     | 2.111   | 1.175     | 1.223 |
| GET-only     | 1.943   | 1.143     | 1.327 |
| Mixed 1:1    | 1.999   | 1.175     | 1.631 |
| Pipeline x16 | 9.087   | 10.815    | 7.519 |

Kora maintains consistent tail latency even under high load.

Architecture Benefits

Kora’s performance comes from its core design principles:

1. Shard-Affinity I/O

Each worker thread owns:
  • Data shard — subset of keys hashed to this worker
  • Connection I/O — a current_thread Tokio runtime for connections accessing this shard
  • No locks — data structures use Rc<RefCell<>> instead of Arc<Mutex<>>
Benefit: Zero lock contention on the data path. Linear scaling with CPU cores.

2. Zero Channel Hops for Local Keys

Commands accessing local-shard keys execute inline:
Client → RESP Parser → Hash(key) → Local shard?
                                    ↓ Yes
                                    Execute inline (no channel)
                                    ↓
                                    Response
No message passing, no context switches.

3. Single Async Hop for Foreign Keys

Cross-shard commands use one tokio::mpsc + oneshot round-trip:
Client → RESP Parser → Hash(key) → Foreign shard?
                                    ↓ Yes
                                    Send via mpsc
                                    ↓
                                    Target worker executes
                                    ↓
                                    Response via oneshot
Minimal overhead: ~100ns for channel send.
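Both routing paths can be modeled in Python with asyncio (an illustrative sketch only — Kora's actual implementation uses Rust's tokio mpsc and oneshot channels, and `Worker` and `dispatch` here are hypothetical names, not Kora APIs):

```python
import asyncio

WORKERS = 4

def shard_of(key: str) -> int:
    # Same routing rule as Kora: hash the key, modulo worker count.
    return hash(key) % WORKERS

class Worker:
    """Owns one shard; other workers reach it only through its queue (mpsc analog)."""
    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.data = {}                 # shard-local store: no locks needed
        self.inbox = asyncio.Queue()   # mpsc analog

    async def run(self):
        while True:
            cmd, reply = await self.inbox.get()
            reply.set_result(self.execute(cmd))   # answer via oneshot analog

    def execute(self, cmd):
        op, key, *rest = cmd
        if op == "SET":
            self.data[key] = rest[0]
            return "OK"
        return self.data.get(key)

async def dispatch(workers, local_id: int, cmd):
    target = shard_of(cmd[1])
    if target == local_id:
        return workers[target].execute(cmd)              # local key: inline, no hop
    reply = asyncio.get_running_loop().create_future()   # oneshot analog
    await workers[target].inbox.put((cmd, reply))        # the single mpsc send
    return await reply

async def main():
    workers = [Worker(i) for i in range(WORKERS)]
    tasks = [asyncio.create_task(w.run()) for w in workers]
    assert await dispatch(workers, 0, ("SET", "k1", "v1")) == "OK"
    assert await dispatch(workers, 0, ("GET", "k1")) == "v1"
    for t in tasks:
        t.cancel()

asyncio.run(main())
```

Whichever path a command takes, exactly one task ever touches a given shard's data, which is what makes the lock-free `Rc<RefCell<>>` design safe.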

4. Pipeline Performance

Kora excels at pipelined workloads because:
  • Batch parsing — RESP parser processes multiple commands per read() syscall
  • Parallel shard dispatch — pipelined commands to different shards execute concurrently
  • Response aggregation — responses written in order to TCP socket
Result: 769K ops/sec on pipeline x16 (97% faster than Dragonfly).
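Batch parsing works because a pipeline is simply commands concatenated back to back on the wire, so one read() can pull in many of them. A minimal sketch of the RESP2 framing (`encode_resp` and `encode_pipeline` are illustrative helpers, not Kora or redis-cli APIs):

```python
def encode_resp(*args: str) -> bytes:
    """Encode one command as a RESP2 array of bulk strings."""
    out = [f"*{len(args)}\r\n".encode()]
    for a in args:
        out.append(f"${len(a)}\r\n{a}\r\n".encode())
    return b"".join(out)

def encode_pipeline(commands) -> bytes:
    # A pipeline is just commands concatenated: the server parses
    # several of them out of a single read() buffer.
    return b"".join(encode_resp(*c) for c in commands)

buf = encode_pipeline([("SET", "key1", "value1"), ("GET", "key1")])
# buf begins with b"*3\r\n$3\r\nSET\r\n..."
```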

Worker Count Configuration

The --workers flag controls shard parallelism and is the most important performance tuning knob.

Auto-Detection (Default)

kora-cli --port 6379  # workers = num_cpus::get()
Kora auto-detects logical CPU count and creates one worker per core. This is optimal for most workloads.

Explicit Worker Count

kora-cli --workers 16 --port 6379
Guidelines:
| System                    | Recommended Workers                               |
|---------------------------|---------------------------------------------------|
| 4-core laptop             | 4 (default)                                       |
| 8-core server             | 8 (default)                                       |
| 32-core EPYC              | 32 (match physical cores)                         |
| 64-core Threadripper      | 32-48 (avoid hyperthreads if contention detected) |
| Low memory (less than 2GB)| 2-4 (reduce memory overhead)                      |

Tuning Strategy

  1. Start with default — let Kora auto-detect
  2. Monitor CPU usage — workers should be 70-90% utilized under load
  3. Increase workers if CPU is underutilized and you have cores available
  4. Decrease workers if you see context switch overhead (check perf or vmstat)

Memory vs. Workers

Each worker has a base overhead of ~2MB (shard metadata, tokio runtime). For a 32-worker setup:
32 workers × 2MB = 64MB base overhead
Data memory is proportional to total key count, not worker count.

Shard Key Distribution

Kora uses hash-modulo sharding to distribute keys across shards (note this is not consistent hashing: changing worker_count remaps most keys):
let shard_id = hash(key) % worker_count;
Uniform distribution is critical for balanced CPU usage. Measure shard balance:
redis-cli -p 6379 INFO | grep shard
(Hypothetical command — Kora doesn’t expose shard-level stats via INFO yet.) Non-uniform workloads (e.g., 80% of requests hit 20% of keys) may cause hot shards. Monitor per-shard CPU usage and consider:
  • More workers — finer-grained sharding reduces hot-shard impact
  • Client-side sharding — distribute hot keys across multiple Kora instances
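To get a feel for how a skewed workload maps onto shards, the hash-modulo rule can be simulated standalone (Python's built-in hash is not Kora's actual hash function, and the 80/20 workload below is illustrative):

```python
import random
from collections import Counter

def shard_of(key: str, workers: int) -> int:
    return hash(key) % workers

def shard_load(requests, workers):
    """Count how many requests land on each shard."""
    counts = Counter(shard_of(k, workers) for k in requests)
    return [counts.get(i, 0) for i in range(workers)]

# Skewed workload: 80% of requests hit 20% of the keys.
random.seed(42)
hot = [f"key{i}" for i in range(20)]
cold = [f"key{i}" for i in range(20, 100)]
requests = random.choices(hot, k=8000) + random.choices(cold, k=2000)

for workers in (4, 16):
    load = shard_load(requests, workers)
    print(workers, "workers -> busiest shard handles",
          f"{max(load) / len(requests):.0%} of requests")
```

With more workers, each hot key still pins one shard, but the busiest shard's share of total traffic typically shrinks, which is the "finer-grained sharding" effect described above.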

Network Tuning

TCP Socket Options

Kora uses Tokio’s default TCP socket configuration. For high-throughput production deployments, tune OS-level TCP settings:
# Increase TCP buffer sizes
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase max socket backlog
sysctl -w net.core.somaxconn=4096

Unix Sockets

For single-machine deployments, use Unix domain sockets for lower latency:
kora-cli --unix-socket /tmp/kora.sock
Unix sockets eliminate TCP/IP stack overhead:
  • ~20% lower latency for local connections
  • ~10% higher throughput for pipelined workloads
Connect with redis-cli -s /tmp/kora.sock.

Pipeline Optimization

Pipelining dramatically improves throughput by batching commands:
# Single command per round-trip
redis-cli -p 6379 SET key1 value1  # ~0.8ms latency
redis-cli -p 6379 SET key2 value2  # 2 × 0.8ms = 1.6ms total

# Pipelined (both commands in one batch)
echo -e "SET key1 value1\nSET key2 value2" | redis-cli -p 6379 --pipe
# ~0.9ms total (80% faster)
Kora’s pipeline performance:
| Pipeline Depth | Throughput (ops/sec) |
|----------------|----------------------|
| 1              | 233,377              |
| 4              | ~450,000             |
| 8              | ~650,000             |
| 16             | 769,374              |
| 32             | ~850,000             |
Diminishing returns after 16-32 commands per pipeline.

Client-Side Pipelining

Most Redis clients support pipelining:
import redis
r = redis.Redis(host='localhost', port=6379)

pipe = r.pipeline()
for i in range(100):
    pipe.set(f'key{i}', f'value{i}')
pipe.execute()  # Send all 100 commands in one batch

Storage Performance

Persistence impacts write throughput depending on wal_sync policy:
| WAL Sync Policy | Throughput Impact | Durability         |
|-----------------|-------------------|--------------------|
| os_managed      | 0-5% overhead     | Lowest             |
| every_second    | 5-10% overhead    | Medium (1s window) |
| every_write     | 30-50% overhead   | Highest            |
Recommendation: Use every_second for production. It provides strong durability (1-second loss window on crash) with minimal performance cost.
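Assuming the same `[storage]` config section used for snapshots below (the key placement is an assumption; the `wal_sync` values are those from the table above), the recommended setting would look like:

```toml
[storage]
wal_sync = "every_second"  # fsync the WAL once per second (~1s loss window on crash)
```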

RDB Snapshots

RDB snapshots run in the background and do not block the data path. Enable automatic snapshots:
[storage]
snapshot_interval_secs = 300  # Snapshot every 5 minutes
Snapshot overhead:
  • CPU: 5-10% during snapshot creation (LZ4 compression)
  • I/O: Burst write to disk (on the order of the dataset size)
  • Memory: Copy-on-write for modified keys during snapshot

Memory Optimization

Value Size

Smaller values → higher throughput (less memory copy overhead):
| Value Size | SET Throughput |
|------------|----------------|
| 64 bytes   | ~280K ops/sec  |
| 256 bytes  | ~230K ops/sec  |
| 1KB        | ~150K ops/sec  |
| 4KB        | ~60K ops/sec   |
Guideline: Keep values under 1KB for cache workloads.

Key Size

Use short, structured keys:
❌ user:profile:12345:firstname
✅ u:12345:fn

❌ product_catalog_item_2024_12345
✅ p:12345
Shorter keys reduce:
  • Memory usage — key storage and hash table overhead
  • Network bandwidth — smaller RESP2 payloads
  • CPU usage — faster hashing and comparison
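These savings compound with key count. A back-of-the-envelope calculation for the first example above (key payload bytes only; real per-key cost also includes hash-table and allocator overhead):

```python
long_key  = "user:profile:12345:firstname"
short_key = "u:12345:fn"

# Bytes of key payload saved per key.
per_key_saving = len(long_key) - len(short_key)
print(per_key_saving)  # 18

# Across 10 million keys: ~180 MB of raw key bytes alone,
# before hash-table and RESP framing overhead.
total_mb = per_key_saving * 10_000_000 / 1_000_000
print(f"{total_mb:.0f} MB")  # 180 MB
```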

Profiling and Debugging

CPU Profiling

Use perf to identify bottlenecks:
# Record CPU samples for 10 seconds
perf record -F 999 -p $(pgrep kora-cli) sleep 10

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > kora.svg
Look for:
  • Hot functions in shard workers
  • Lock contention (should be minimal)
  • Syscall overhead (read, write, epoll_wait)

Latency Tracing

Enable trace-level logging for detailed latency breakdown:
RUST_LOG=kora_server=trace kora-cli --port 6379
Logs include:
  • Command parse time
  • Shard routing decision
  • Execution time
  • Response serialization time

Benchmarking Kora

Reproduce the benchmark results:
# Start Kora
kora-cli --port 6379 --workers 4

# Run memtier_benchmark (SET)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --key-pattern=R:R --ratio=1:0 \
  --data-size=256 --requests=1000000

# Run memtier_benchmark (GET)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --key-pattern=R:R --ratio=0:1 \
  --data-size=256 --requests=1000000

# Run memtier_benchmark (Mixed 1:1)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --key-pattern=R:R --ratio=1:1 \
  --data-size=256 --requests=1000000

# Run memtier_benchmark (Pipeline x16)
memtier_benchmark -p 6379 -c 50 -t 4 \
  --pipeline=16 --key-pattern=R:R --ratio=1:1 \
  --data-size=256 --requests=1000000

Performance Checklist

  • ✅ Use default worker count (auto-detect) unless profiling shows otherwise
  • ✅ Enable pipelining in client applications
  • ✅ Use wal_sync = "every_second" for balanced durability/performance
  • ✅ Keep value sizes under 1KB for cache workloads
  • ✅ Use short, structured keys
  • ✅ Monitor per-command latency via Prometheus metrics
  • ✅ Profile with perf if throughput plateaus
  • ✅ Consider Unix sockets for single-machine deployments
