workerd performs well out of the box, but there’s significant headroom for optimization. This guide covers compilation flags, runtime configuration, and application-level tuning.
Performance tuning is an ongoing effort in workerd. Many optimizations mentioned here represent low-hanging fruit that can dramatically improve performance.
Build optimizations
Release builds
Always use release builds for production:
# Standard release build
bazel build //src/workerd/server:workerd --config=release
Release mode enables:
- Compiler optimizations (-O2)
- Dead code elimination
- Inlining
- NDEBUG mode (disables assertions)
Link-time optimization (LTO)
For maximum performance, use thin LTO:
bazel build --config=thin-lto //src/workerd/server:workerd
LTO benefits:
- ~2x performance improvement on simple benchmarks
- Better cross-module optimization
- Smaller binary size
- Longer compile times
Experiments suggest workerd can roughly double performance on “hello world” benchmarks with LTO and memory allocator tuning.
Memory allocator
workerd uses tcmalloc by default, which is already optimized for allocation-heavy workloads. Alternative allocators:
- tcmalloc: Default, excellent for workerd’s usage patterns
- jemalloc: May perform better for specific workloads
- mimalloc: Lightweight alternative
To benchmark allocators, link against different allocator libraries at build time.
Runtime configuration
Multi-process deployment
workerd is single-threaded. Utilize all CPU cores by running multiple instances:
# Run one instance per CPU core
for i in {0..7}; do
workerd serve config.capnp --socket-fd http=$((3+i)) &
done
See systemd deployment for production-ready multi-instance setup.
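A production multi-instance setup can be sketched as a templated systemd unit. Everything below — unit name, binary and config paths — is a hypothetical example rather than the canonical workerd setup, and how the instances share a port (e.g. SO_REUSEPORT or systemd socket activation) is deployment-specific:

```ini
# /etc/systemd/system/workerd@.service (hypothetical path)
[Unit]
Description=workerd instance %i
After=network.target

[Service]
ExecStart=/usr/local/bin/workerd serve /etc/workerd/config.capnp
Restart=always
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

Instances can then be started per core with `systemctl enable --now workerd@{0..7}`.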
Connection pooling
For outbound requests, workerd pools connections automatically. Keep in mind that any connection limits are per-process, so a multi-instance deployment multiplies them by the number of instances.
Cache configuration
For Durable Objects, tune the storage cache:
// Limit cache usage to prevent memory exhaustion
const options = {
softLimit: 16 * 1024 * 1024, // 16 MiB
hardLimit: 32 * 1024 * 1024, // 32 MiB
dirtyListByteLimit: 4 * 1024 * 1024, // 4 MiB
};
See Durable Objects storage internals for details.
Application-level optimizations
Minimize cold starts
Pre-warm isolates
workerd creates V8 isolates on-demand. For predictable performance:
// Heavy initialization at module level
import crypto from 'node:crypto';
import { expensive } from './utils';
const precomputedData = expensive();
export default {
async fetch(request) {
// Fast path uses pre-computed data
return new Response(precomputedData);
}
};
Avoid dynamic imports
// ❌ Slow: Dynamic import on every request
export default {
async fetch(request) {
const { handler } = await import('./handler');
return handler(request);
}
};
// ✅ Fast: Static import
import { handler } from './handler';
export default {
async fetch(request) {
return handler(request);
}
};
Optimize request handling
Batch operations
// ❌ Slow: Sequential requests
const results = [];
for (const url of urls) {
const response = await fetch(url);
results.push(await response.json());
}
// ✅ Fast: Parallel requests
const responses = await Promise.all(
urls.map(url => fetch(url))
);
const results = await Promise.all(
responses.map(r => r.json())
);
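Note that an unbounded Promise.all fires every request at once, which can exhaust connection or subrequest limits for large URL lists. A small limiter keeps parallelism bounded (a sketch; mapWithConcurrency is a hypothetical helper, not a workerd API):

```javascript
// Run an async mapper over items with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed index until items run out.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Used as `await mapWithConcurrency(urls, 6, url => fetch(url).then(r => r.json()))`.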
Stream when possible
// ✅ Stream large responses
export default {
async fetch(request) {
const upstream = await fetch('https://example.com/large-file');
// Stream through without buffering
return new Response(upstream.body, {
headers: upstream.headers
});
}
};
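Bodies can also be transformed on the fly without buffering, using the standard TransformStream API (a sketch; uppercaseStream is a hypothetical helper):

```javascript
// Build a TransformStream that uppercases string chunks as they pass
// through, without accumulating the whole body in memory.
function uppercaseStream() {
  return new TransformStream({
    transform(chunk, controller) {
      controller.enqueue(String(chunk).toUpperCase());
    }
  });
}
```

A real response body carries bytes, so it would first go through a `TextDecoderStream` before a string transform like this.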
Minimize allocations
Reuse objects
// ❌ Creates a new Headers object on every request
export default {
  async fetch(request) {
    const headers = new Headers();
    headers.set('X-Custom', 'value');
    return new Response(data, { headers });
  }
};
// ✅ Reuse a module-level, effectively immutable Headers object
const COMMON_HEADERS = new Headers({
  'X-Custom': 'value',
  'Content-Type': 'application/json'
});
export default {
  async fetch(request) {
    return new Response(data, { headers: COMMON_HEADERS });
  }
};
Avoid string concatenation in hot paths
// ❌ Slow: String concatenation
let html = '<html>';
html += '<body>';
html += '<h1>' + title + '</h1>';
html += '</body></html>';
// ✅ Fast: Template literals or array join
const html = `<html><body><h1>${title}</h1></body></html>`;
// ✅ Also fast: Array join for dynamic lists
const parts = ['<html><body>'];
for (const item of items) {
parts.push(`<li>${item}</li>`);
}
parts.push('</body></html>');
const html = parts.join('');
Storage optimizations
For Durable Objects:
Batch reads and writes
// ❌ Slow: Individual operations
await storage.put('key1', 'value1');
await storage.put('key2', 'value2');
await storage.put('key3', 'value3');
// ✅ Fast: Batch operation
await storage.put({
key1: 'value1',
key2: 'value2',
key3: 'value3',
});
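Batched put() calls are still subject to a per-call cap (Durable Objects document a limit of 128 key/value pairs per put), so very large writes need chunking. A sketch with a hypothetical putAll helper:

```javascript
// Write a large set of entries in chunks that stay under the storage
// API's per-call batch limit (128 pairs for Durable Objects).
async function putAll(storage, entries, batchSize = 128) {
  const keys = Object.keys(entries);
  for (let i = 0; i < keys.length; i += batchSize) {
    const batch = {};
    for (const k of keys.slice(i, i + batchSize)) batch[k] = entries[k];
    await storage.put(batch);
  }
}
```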
Use transactions
// ✅ Atomic batch with transaction
await storage.transaction(async (txn) => {
const current = (await txn.get('counter')) ?? 0;
await txn.put('counter', current + 1);
await txn.put('lastUpdate', Date.now());
// Single commit for all operations
});
Limit list() operations
// ❌ Can exhaust cache
const all = await storage.list();
// ✅ Use pagination (list() returns a Map in key order)
const page1 = await storage.list({ limit: 1000 });
const lastKey = [...page1.keys()].pop();
const page2 = await storage.list({
  startAfter: lastKey,
  limit: 1000
});
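To walk an arbitrarily large keyspace, the pagination pattern can be wrapped in an async generator (a sketch; listAll is a hypothetical helper, assuming list() returns a Map and accepts limit/startAfter options):

```javascript
// Yield every entry page by page so no single list() call has to
// hold the whole keyspace in the cache at once.
async function* listAll(storage, pageSize = 1000) {
  let startAfter;
  while (true) {
    const page = await storage.list({ limit: pageSize, startAfter });
    if (page.size === 0) return;
    yield* page; // yields [key, value] pairs
    startAfter = [...page.keys()].pop();
  }
}
```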
WebSocket optimizations
Use hibernation for idle connections
class ChatRoom {
  constructor(state) { this.state = state; }
  async fetch(request) {
    const pair = new WebSocketPair();
    // Accept via the hibernation API: workerd can evict the isolate
    // while idle connections stay open
    this.state.acceptWebSocket(pair[1]);
    return new Response(null, { status: 101, webSocket: pair[0] });
  }
  async webSocketMessage(ws, message) {
    // Sockets accepted with acceptWebSocket() hibernate automatically
    // when idle; no per-message call is needed
    this.broadcast(message);
  }
}
Batch broadcasts
// ❌ Slow: re-serializes the message for every connection
for (const ws of connections) {
  ws.send(JSON.stringify(message));
}
// ✅ Fast: serialize once, send the same string to every socket
const payload = JSON.stringify(message);
for (const ws of this.state.getWebSockets()) {
  ws.send(payload);
}
Monitoring and profiling
Request timing
Add timing headers:
export default {
  async fetch(request) {
    const start = Date.now();
    const response = await handleRequest(request);
    // Re-wrap: headers on a Response returned by fetch() are immutable
    const timed = new Response(response.body, response);
    timed.headers.set('X-Response-Time', `${Date.now() - start}ms`);
    return timed;
  }
};
CPU profiling
workerd supports V8 CPU profiling:
# Enable CPU profiling
workerd serve config.capnp --experimental-enable-cpu-profiler
CPU profiling adds overhead. Only enable for debugging, not in production.
Memory profiling
Monitor memory usage:
// performance.memory is a nonstandard Chrome API and may be absent;
// guard before using it
const mem = performance.memory;
if (mem && mem.usedJSHeapSize / mem.jsHeapSizeLimit > 0.8) {
  console.warn('Memory usage high:', mem.usedJSHeapSize / 1024 / 1024, 'MB');
}
Benchmarking
Load testing
Use wrk or ab for load testing:
# wrk - 10 threads, 100 connections, 30 seconds
wrk -t10 -c100 -d30s http://localhost:8080/
# Apache Bench - 10000 requests, 100 concurrent
ab -n 10000 -c 100 http://localhost:8080/
Simple “Hello World” performance (with LTO):
- Throughput: ~100k req/s per core (single instance)
- Latency p50: <1ms
- Latency p99: <5ms
Your mileage will vary based on:
- Application complexity
- I/O operations
- System resources
Linux
Increase file descriptor limits
# In systemd service file
LimitNOFILE=65536
# Or system-wide in /etc/security/limits.conf
workerd soft nofile 65536
workerd hard nofile 65536
Use hugepages
For workloads with large memory footprints:
# Enable transparent hugepages (writing to /sys requires root)
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
TCP tuning
For high connection counts:
# Increase TCP buffer sizes (sysctl -w changes do not survive reboot)
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
# Persist across reboots by adding the same keys under /etc/sysctl.d/
macOS
Increase file descriptor limits:
# Check current limits
ulimit -n
# Increase (temporary, current shell only)
ulimit -n 65536
# Raise the system-wide cap (resets at reboot)
sudo launchctl limit maxfiles 65536 200000
# Permanent for shells: add "ulimit -n 65536" to ~/.zshrc or ~/.bashrc
V8 tuning
Heap size limits
workerd uses V8’s default heap limits. For memory-intensive applications:
# Increase V8 heap size via workerd flags
# Note: This is per-isolate
workerd serve config.capnp --v8-max-heap-size=512
Optimization tier
V8 has multiple optimization tiers (Ignition, TurboFan). Hot functions automatically get optimized.
To warm up optimization (testing only):
// Repeated calls let V8 tier the function up to TurboFan
// (don't rely on this in production)
function hotPath(data) {
// Your hot path code
}
// Warm up the function
for (let i = 0; i < 10000; i++) {
hotPath(testData);
}
Anti-patterns
Avoid these performance pitfalls:
❌ Synchronous crypto
// Blocks the event loop
const hash = crypto.createHash('sha256')
.update(largeData)
.digest('hex');
// ✅ Use async crypto when available
const hash = await crypto.subtle.digest('SHA-256', largeData);
❌ Large JSON.parse/stringify
// Blocks event loop for large objects
const data = JSON.parse(hugeJsonString);
// ✅ Stream or chunk large data
const stream = response.body;
const reader = stream.getReader();
// Process in chunks
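For line-delimited JSON, chunked processing can be made concrete with an async generator over the stream (a sketch; ndjson is a hypothetical helper, assuming WHATWG streams and TextDecoderStream):

```javascript
// Parse newline-delimited JSON from a byte stream one record at a
// time, instead of buffering and JSON.parse-ing the whole body.
async function* ndjson(stream) {
  const reader = stream.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += value;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line
    for (const line of lines) {
      if (line.trim()) yield JSON.parse(line);
    }
  }
  if (buffer.trim()) yield JSON.parse(buffer);
}
```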
❌ Inefficient regex
// Catastrophic backtracking
const regex = /^(a+)+$/;
// ✅ Use specific, bounded patterns
const regex = /^a{1,100}$/;
❌ Memory leaks
// Global array grows forever
const cache = [];
export default {
  fetch(request) {
    cache.push(request.url); // ❌ unbounded growth across requests
    return new Response('ok');
  }
};
// ✅ Use a bounded cache with eviction (LRUCache here stands in for a
// bundled LRU library; it is not a built-in)
const cache = new LRUCache({ max: 1000 });
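If bundling a library is not an option, a bounded LRU is small enough to write inline on top of Map's insertion order (a sketch; SimpleLRU is a hypothetical class):

```javascript
// Minimal bounded LRU cache: Map preserves insertion order, so the
// first key is always the least recently used.
class SimpleLRU {
  constructor(max) { this.max = max; this.map = new Map(); }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const v = this.map.get(key);
    this.map.delete(key); this.map.set(key, v); // mark recently used
    return v;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      this.map.delete(this.map.keys().next().value); // evict oldest
    }
  }
}
```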
Further reading