Five 9s (99.999%) requires zero-downtime deployments, automated failover, multi-region redundancy, and comprehensive chaos engineering. This is extremely expensive and only justified for critical financial or life-safety systems.
Parallel redundancy dramatically improves availability. Two independent 99.9% nodes in parallel achieve 99.9999% availability — but only if failures are truly independent.
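The arithmetic behind that claim can be sketched in a few lines (function name is illustrative):

```javascript
// Availability of N independent replicas in parallel:
// the system is down only when ALL replicas are down simultaneously.
function parallelAvailability(nodeAvailability, n) {
  return 1 - (1 - nodeAvailability) ** n;
}

parallelAvailability(0.999, 2); // ≈ 0.999999 — six 9s from two three-9 nodes
```

Note that correlated failures (shared power, shared deploy pipeline, same bad config push) break the independence assumption and make the real number far worse.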
Synchronous replication: Primary waits for replica ACK (slower writes, zero data loss)
Asynchronous replication: Primary doesn’t wait (faster writes, possible data loss)
Replication lag: Time delay between primary write and replica visibility
Monitoring Lag
Read Routing Strategy
Synchronous Replication
```sql
-- PostgreSQL: check replication lag (run on the replica)
-- Note: these recovery functions run on the standby itself;
-- pg_stat_replication is only populated on the primary.
SELECT now() - pg_last_xact_replay_timestamp() AS lag_duration,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS lag_bytes;

-- Alert thresholds
-- Warning:  lag > 5 seconds
-- Critical: lag > 30 seconds
```
```javascript
// Application-level read routing
class DatabasePool {
  async read(query) {
    // Random replica for load distribution
    const replica = this.replicas[Math.floor(Math.random() * this.replicas.length)];
    return replica.query(query);
  }

  async write(query) {
    // Always primary for writes
    return this.primary.query(query);
  }

  async readAfterWrite(query) {
    // Read-your-own-writes: route to primary
    return this.primary.query(query);
  }
}

// Usage
await db.write('INSERT INTO users ...');
await db.readAfterWrite('SELECT * FROM users WHERE id = ?'); // → primary
```
```sql
-- PostgreSQL: configure synchronous replication
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1, replica2)';
SELECT pg_reload_conf();

-- Primary waits for at least 1 replica ACK before commit
-- Trade-off: slower writes, zero data loss
```
Synchronous replication adds latency (typically 1-5ms) to every write. Use only for critical data (financial, inventory) where data loss is unacceptable.
Use master-slave replication by default. Master-master adds significant complexity and should only be used for active-active multi-region deployments with geographic write distribution.
```text
T+0s:  Primary node fails (process crash, hardware failure)
T+5s:  Heartbeat timeout detected by orchestrator
T+8s:  Replica promoted to primary (pg_ctl promote)
T+10s: Load balancer updated to new primary IP
T+70s: All clients with TTL=60s see new primary

// Total outage: ~70 seconds (cannot meet 99.99% SLA)
```
Active-passive failover typically takes 60-120 seconds due to health check timeout + promotion + DNS propagation. This is insufficient for five 9s (99.999% = 5.26 min/year downtime).
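The downtime budget for each availability target is simple to derive (function name is illustrative):

```javascript
// Allowed downtime per year for a given availability target
function downtimeMinutesPerYear(availability) {
  const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600
  return (1 - availability) * MINUTES_PER_YEAR;
}

downtimeMinutesPerYear(0.99999); // ≈ 5.26 minutes/year (five 9s)
downtimeMinutesPerYear(0.9999);  // ≈ 52.6 minutes/year (four 9s)
```

A single 70-second failover consumes over a fifth of the annual five-9s budget, which is why that target demands automated, sub-minute failover.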
Pair circuit breakers with fallback responses — when the circuit opens, return cached values, empty results, or feature degradation instead of propagating errors.
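A minimal synchronous sketch of this pairing, with illustrative names (`CircuitBreaker`, `failureThreshold`, a caller-supplied `fallback`), not a production implementation:

```javascript
// Circuit breaker that returns a fallback instead of propagating errors
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, fallback }) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.fallback = fallback;      // e.g. return cached or empty results
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  call(fn, ...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = 'HALF_OPEN';          // allow one trial request through
      } else {
        return this.fallback(...args);     // fail fast with degraded response
      }
    }
    try {
      const result = fn(...args);
      this.failures = 0;                   // success: reset and close
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold || this.state === 'HALF_OPEN') {
        this.state = 'OPEN';               // trip: stop hammering the dependency
        this.openedAt = Date.now();
      }
      return this.fallback(...args);       // degrade instead of propagating
    }
  }
}
```

The half-open state lets a single probe request test recovery without re-exposing the dependency to full traffic.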
Isolate resources to prevent one failing component from exhausting shared resources:
```javascript
// Thread-pool bulkhead (JavaScript pseudocode modeled on Java's ThreadPoolExecutor)
class BulkheadExecutor {
  constructor() {
    // Separate pools per dependency: (core size, max size, keep-alive seconds)
    this.dbPool = new ThreadPoolExecutor(10, 20, 60);
    this.apiPool = new ThreadPoolExecutor(5, 10, 60);
    this.searchPool = new ThreadPoolExecutor(3, 5, 60);
  }

  // A slow external API can't exhaust the DB pool
  async queryDatabase(sql) {
    return this.dbPool.submit(() => db.query(sql));
  }

  async callExternalAPI(url) {
    return this.apiPool.submit(() => http.get(url));
  }
}
```
```text
// Without bulkhead: shared pool exhausted
┌─────────────────────┐
│ Thread Pool (50)    │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓   │ ← all threads blocked on slow API
└─────────────────────┘

// With bulkhead: isolated pools
┌──────────┬──────────┬──────────┐
│ DB (20)  │ API (10) │Search (5)│
│ ░░░      │ ▓▓▓▓▓▓▓  │ ░        │ ← API slow, but DB unaffected
└──────────┴──────────┴──────────┘
```
Without jitter, all clients retry simultaneously after failure, creating a “thundering herd” that overwhelms the recovering service. Always add randomness to retry delays.
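A sketch of exponential backoff with "full jitter" (delay drawn uniformly from zero up to the exponential ceiling); function names and defaults are illustrative:

```javascript
// Full jitter: pick a uniform delay in [0, min(cap, base * 2^attempt)]
function backoffWithJitter(attempt, baseMs = 100, capMs = 10000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retry loop using the jittered delay (hypothetical helper)
async function retry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts
      await new Promise(r => setTimeout(r, backoffWithJitter(attempt)));
    }
  }
}
```

Because each client draws a random delay, retries spread out over the window instead of arriving in synchronized waves.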
Liveness should only verify the process is responsive. Do NOT check external dependencies — that causes restart loops when dependencies are temporarily unavailable.
Question: Can it serve traffic?
Action: Remove from load balancer if it fails
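A minimal sketch of the two probe handlers as plain functions (names and the injected `checkDb` dependency check are illustrative):

```javascript
// Liveness: only asserts the process is up and the event loop is answering.
function livenessHandler() {
  return { status: 200, body: 'alive' };
}

// Readiness: MAY check dependencies. Failing it removes the instance from
// the load balancer without restarting it — no restart loop.
function readinessHandler(checkDb) {
  try {
    checkDb();
    return { status: 200, body: 'ready' };
  } catch (err) {
    return { status: 503, body: 'not ready' };
  }
}
```

The asymmetry is the point: a dead process needs a restart (liveness), while a process waiting on a flaky dependency only needs to stop receiving traffic (readiness).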
Set concurrencyPolicy: Forbid on all CronJobs to prevent overlapping runs. A slow job that overlaps the next schedule creates a “cron storm” that exhausts resources.
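A CronJob spec with this policy might look like the following (job name, schedule, image, and deadlines are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical job name
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid       # skip a run if the previous one is still going
  startingDeadlineSeconds: 300    # give up on a run it can't start within 5 min
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600 # hard-kill a job stuck past 1 hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example/report:latest
```

`activeDeadlineSeconds` complements `Forbid`: it bounds how long a stuck run can block subsequent schedules.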
Do: Calculate end-to-end availability across all dependencies
Map your entire dependency chain and multiply availability percentages. An external API with 99.5% SLA prevents you from achieving better than 99.5% end-to-end.
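The multiplication is worth making explicit (function name and the example SLAs are illustrative):

```javascript
// End-to-end availability of serial dependencies is the product of each one's availability
function serialAvailability(availabilities) {
  return availabilities.reduce((acc, a) => acc * a, 1);
}

// Your service (99.9%) × its database (99.95%) × an external API (99.5%):
serialAvailability([0.999, 0.9995, 0.995]); // ≈ 0.9935 → about 99.35% end-to-end
```

Every serial dependency can only subtract availability; the result is always at or below the weakest link.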
Do: Test failover monthly under production-equivalent load
Failover that has never been exercised under production conditions will fail when it matters most. Chaos engineering tools such as Netflix's Chaos Monkey deliberately kill instances to verify that the system actually degrades gracefully.
Do: Monitor replication lag and alert at 5s / 30s / 60s thresholds
Rising replication lag signals that replicas cannot keep up with the write load. The immediate symptom is stale reads; during a failover, a lagging replica means a slower promotion or lost transactions.
Don't: Promise higher availability than your weakest dependency
If your payment provider has 99.5% SLA, your checkout flow cannot exceed 99.5% availability regardless of your infrastructure.
Don't: Use IP hash load balancing with NAT or mobile clients
Many users share one NAT IP, creating severe load imbalance. Use consistent hashing or eliminate sticky sessions instead.