
Availability Fundamentals

Availability is the percentage of time a system is operational. Each additional “nine” reduces allowed downtime by approximately 10x.

SLA Nines

| Availability | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|
| 90% (one 9) | 36.5 days | 3 days | 16.8 hours |
| 99% (two 9s) | 3.65 days | 7.2 hours | 1.68 hours |
| 99.9% (three 9s) | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.99% (four 9s) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five 9s) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
Five 9s (99.999%) requires zero-downtime deployments, automated failover, multi-region redundancy, and comprehensive chaos engineering. This is extremely expensive and only justified for critical financial or life-safety systems.
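The figures in the table follow directly from the availability percentage; a quick sketch to reproduce them:

```javascript
// Downtime budget implied by an availability percentage
function downtimeBudget(availability) {
  const downFraction = 1 - availability / 100;
  const minutesPerYear = 365.25 * 24 * 60;  // 525,960
  return {
    perYearMinutes: downFraction * minutesPerYear,
    perMonthMinutes: (downFraction * minutesPerYear) / 12,
    perWeekMinutes: (downFraction * minutesPerYear) / 52,
  };
}

console.log(downtimeBudget(99.99).perYearMinutes.toFixed(1));  // prints 52.6
```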

Serial vs Parallel Availability

Serial dependencies (all must be up):
// End-to-end availability (serial)
Load Balancer:  99.99%
App Server:     99.9%
PostgreSQL:     99.9%
External API:   99.5%

// Result: 0.9999 × 0.999 × 0.999 × 0.995
= 0.9929 = 99.29% → fails even three 9s!

// External API is the availability ceiling
Parallel redundancy (all nodes must fail):
// Parallel availability for two nodes
Single node:     99.9%
Two parallel:    1 - (1 - 0.999)² = 1 - 0.000001 = 99.9999% (six 9s!)

// Three parallel nodes:
1 - (1 - 0.999)³ = 1 - 0.000000001 = 99.9999999% (nine 9s!)
Parallel redundancy dramatically improves availability. Two independent 99.9% nodes in parallel achieve 99.9999% availability — but only if failures are truly independent.
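Both formulas fit in a couple of helpers; a sketch using the numbers from the examples above:

```javascript
// Serial: every dependency must be up, so availabilities multiply
const serial = (...avails) => avails.reduce((acc, a) => acc * a, 1);

// Parallel: the system is down only when all n replicas fail at once
const parallel = (avail, n) => 1 - Math.pow(1 - avail, n);

serial(0.9999, 0.999, 0.999, 0.995);  // ≈ 0.9929 (99.29%)
parallel(0.999, 2);                   // ≈ 0.999999 (six 9s)
```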

Replication Strategies

Master-Slave Replication

All writes to primary, reads from replicas:
# PostgreSQL streaming replication
Primary (postgresql.conf):
  wal_level = replica
  max_wal_senders = 3
  wal_keep_size = 1GB

Replica (standby.signal + postgresql.auto.conf):
  primary_conninfo = 'host=primary port=5432 user=replicator'
  hot_standby = on
  max_standby_streaming_delay = 30s
Architecture:
         ┌──────────┐
         │ Primary  │
         │  (Write) │
         └────┬─────┘
              │ WAL stream
      ┌───────┼───────┐
      ▼       ▼       ▼
  ┌───────┬───────┬───────┐
  │Replica│Replica│Replica│
  │(Read) │(Read) │(Read) │
  └───────┴───────┴───────┘
Key concepts:
  • Synchronous replication: Primary waits for replica ACK (slower writes, zero data loss)
  • Asynchronous replication: Primary doesn’t wait (faster writes, possible data loss)
  • Replication lag: Time delay between primary write and replica visibility
-- PostgreSQL: check replication lag (run on the replica;
-- pg_stat_replication only has rows on the primary)
SELECT
  now() - pg_last_xact_replay_timestamp() AS lag_duration,
  pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes;

-- Alert thresholds
-- Warning: lag > 5 seconds
-- Critical: lag > 30 seconds
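A monitoring job can map the measured lag onto those thresholds; a minimal sketch (the defaults mirror the alert comments above):

```javascript
// Classify replication lag against warning/critical thresholds
function lagSeverity(lagSeconds, warn = 5, critical = 30) {
  if (lagSeconds > critical) return 'critical';
  if (lagSeconds > warn) return 'warning';
  return 'ok';
}
```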

Master-Master Replication

Multiple writable nodes with conflict resolution:
    ┌─────────┐      ┌─────────┐
    │Master A │◄────►│Master B │
    │ us-east │      │ eu-west │
    └─────────┘      └─────────┘
         ▲                ▲
         │                │
    US clients      EU clients
Conflict resolution strategies:
-- Timestamp-based conflict resolution
CREATE TABLE users (
  id UUID PRIMARY KEY,
  name TEXT,
  email TEXT,
  updated_at TIMESTAMP DEFAULT now()
);

-- Conflict: both masters update the same row
-- Master A: UPDATE users SET name='Alice' WHERE id='<uuid>'; (at T+0s)
-- Master B: UPDATE users SET name='Bob'   WHERE id='<uuid>'; (at T+2s)

-- Result: 'Bob' wins (later timestamp)
Pros: simple, deterministic. Cons: data loss (the earlier write is silently discarded).
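Last-write-wins can be expressed as a tiny merge function; a sketch with hypothetical row objects carrying the updated_at column:

```javascript
// Last-write-wins: the version with the later updated_at survives
function resolveConflict(a, b) {
  return new Date(a.updated_at) >= new Date(b.updated_at) ? a : b;
}

const winner = resolveConflict(
  { name: 'Alice', updated_at: '2024-06-01T12:00:00Z' },  // T+0s
  { name: 'Bob',   updated_at: '2024-06-01T12:00:02Z' }   // T+2s
);
// winner.name === 'Bob'; Alice's write is discarded
```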
Use master-slave replication by default. Master-master adds significant complexity and should only be used for active-active multi-region deployments with geographic write distribution.

Failover Patterns

Active-Passive (Hot Standby)

Primary serves all traffic; standby takes over on failure:
// Normal operation
     ┌─────────┐
     │   LB    │
     └────┬────┘

      ┌───▼────┐       ┌─────────┐
      │Primary │──────►│ Standby │
      │(ACTIVE)│ sync  │(PASSIVE)│
      └────────┘       └─────────┘

// After failover
     ┌─────────┐
     │   LB    │
     └────┬────┘

      ┌───┴────┐       ┌─────────┐
      │Primary │       │ Standby │
      │(FAILED)│       │(ACTIVE) │
      └────────┘       └────┬────┘

                       promoted
Failover timeline:
T+0s:  Primary node fails (process crash, hardware failure)
T+5s:  Heartbeat timeout detected by orchestrator
T+8s:  Replica promoted to primary (pg_ctl promote)
T+10s: Load balancer updated to new primary IP
T+70s: All clients with TTL=60s see new primary

// Total outage: ~70 seconds per failover (a few per year exhausts a five-9s budget)
Active-passive failover typically takes 60-120 seconds due to health check timeout + promotion + DNS propagation. This is insufficient for five 9s (99.999% = 5.26 min/year downtime).
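To see why, compare the per-failover outage against the yearly downtime budget (same arithmetic as the SLA table):

```javascript
// Seconds of downtime allowed per year at a given availability
const budgetSeconds = availability => (1 - availability) * 365.25 * 24 * 3600;

Math.floor(budgetSeconds(0.99999) / 70);  // five 9s: ~316 s/year → only 4 failovers
Math.floor(budgetSeconds(0.9999) / 70);   // four 9s: ~3156 s/year → 45 failovers
```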
Implementation: PostgreSQL with Patroni
# Patroni configuration (HA PostgreSQL)
scope: postgres-cluster
namespace: /service/
name: node1

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB

postgresql:
  data_dir: /var/lib/postgresql/14/main
  bin_dir: /usr/lib/postgresql/14/bin
  parameters:
    wal_level: replica
    hot_standby: on
    max_wal_senders: 5
    max_replication_slots: 5
    
  # Automatic failover
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: true

Active-Active (Multi-Master)

All nodes serve traffic simultaneously:
        ┌─────────┐
        │   LB    │
        └────┬────┘

    ┌────────┼────────┐
    ▼        ▼        ▼
┌────────┬────────┬────────┐
│ Node A │ Node B │ Node C │
│(ACTIVE)│(ACTIVE)│(ACTIVE)│
└───┬────┴───┬────┴───┬────┘
    └────────┼────────┘
         replication
Benefits:
  • Utilizes all capacity (no idle standby)
  • Instant failover (LB health check removal)
  • Higher aggregate throughput
Trade-offs:
  • Write conflicts require resolution
  • Eventual consistency
  • More complex application logic
Active-active is ideal for read-heavy workloads with infrequent writes. For write-heavy workloads, use active-passive to avoid write conflicts.

Reliability Patterns

Circuit Breaker

Prevent cascading failures by failing fast when error rate exceeds threshold:
// State machine
CLOSED: calls pass through → failures reach threshold (e.g. 5) → OPEN

OPEN: fail fast → after 60s timeout → HALF-OPEN

HALF-OPEN: probe calls allowed
    enough successes (e.g. 2) → CLOSED
    any failure               → OPEN (60s timer resets)
Implementation:
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.successThreshold = options.successThreshold || 2;
    this.timeout = options.timeout || 60000;  // 60s
    
    this.state = 'CLOSED';
    this.failures = 0;
    this.successes = 0;
    this.nextAttempt = Date.now();
  }
  
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF-OPEN';
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
  
  onSuccess() {
    this.failures = 0;
    
    if (this.state === 'HALF-OPEN') {
      this.successes++;
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED';
        this.successes = 0;
      }
    }
  }
  
  onFailure() {
    this.failures++;
    this.successes = 0;
    
    // A single failed probe in HALF-OPEN reopens the circuit immediately
    if (this.state === 'HALF-OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

// Usage
const breaker = new CircuitBreaker({ failureThreshold: 5, timeout: 60000 });

try {
  const data = await breaker.call(() => externalAPI.get('/data'));
} catch (err) {
  // Circuit open: return cached data or default
  return cachedData;
}
Pair circuit breakers with fallback responses — when the circuit opens, return cached values, empty results, or feature degradation instead of propagating errors.
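A small wrapper makes the fallback path explicit; a sketch (the fallback value is whatever degraded response makes sense, e.g. a cached copy):

```javascript
// Fail open to a degraded value instead of propagating the error
async function withFallback(fn, fallback) {
  try {
    return await fn();
  } catch {
    return fallback;  // cached data, empty list, or a feature-off default
  }
}

// e.g. withFallback(() => breaker.call(() => externalAPI.get('/data')), cachedData)
```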

Bulkhead Pattern

Isolate resources to prevent one failing component from exhausting shared resources:
// Bulkhead via per-dependency concurrency caps (sketch; db/http are placeholders)
class Bulkhead {
  constructor(limit) {
    this.limit = limit;   // max concurrent calls
    this.active = 0;
    this.waiters = [];
  }

  async run(fn) {
    // Wait for a free slot when the pool is saturated
    if (this.active >= this.limit) {
      await new Promise(resolve => this.waiters.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      const next = this.waiters.shift();
      if (next) next();  // hand the freed slot to the next waiter
    }
  }
}

// Separate pools per dependency: a slow external API can't exhaust DB capacity
const dbPool = new Bulkhead(20);
const apiPool = new Bulkhead(10);
const searchPool = new Bulkhead(5);

const queryDatabase = sql => dbPool.run(() => db.query(sql));
const callExternalAPI = url => apiPool.run(() => http.get(url));
// Without bulkhead: shared pool exhausted
┌─────────────────────┐
│  Thread Pool (50)   │
│  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  │  ← all threads blocked on slow API
└─────────────────────┘

// With bulkhead: isolated pools
┌──────────┬──────────┬──────────┐
│ DB (20)  │ API (10) │Search (5)│
│  ░░░     │  ▓▓▓▓▓▓▓ │  ░       │  ← API slow, but DB unaffected
└──────────┴──────────┴──────────┘

Retry with Exponential Backoff

Retry transient failures with increasing delay:
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function retryWithBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;
      
      // Exponential backoff: 2^attempt × 100ms
      const baseDelay = Math.pow(2, attempt) * 100;
      
      // Jitter: random ±50ms to prevent thundering herd
      const jitter = Math.random() * 100 - 50;
      
      const delay = Math.min(baseDelay + jitter, 30000);  // cap at 30s
      
      console.log(`Retry ${attempt + 1}/${maxRetries} after ${delay}ms`);
      await sleep(delay);
    }
  }
}

// Retry schedule with jitter
// Attempt 1: ~100ms  (2^0 × 100)
// Attempt 2: ~200ms  (2^1 × 100)
// Attempt 3: ~400ms  (2^2 × 100)
// Attempt 4: ~800ms  (2^3 × 100)
// Attempt 5: ~1600ms (2^4 × 100)
Without jitter, all clients retry simultaneously after failure, creating a “thundering herd” that overwhelms the recovering service. Always add randomness to retry delays.
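An alternative popularized by AWS's backoff analysis is "full jitter": draw the entire delay uniformly from [0, base] rather than adding a small offset, which spreads retries even more evenly. A sketch:

```javascript
// Full jitter: delay ~ uniform(0, min(2^attempt × 100ms, 30s))
const fullJitterDelay = attempt =>
  Math.random() * Math.min(Math.pow(2, attempt) * 100, 30000);
```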

Health Checks

Liveness probe
Question: Is the process alive?
Action: Restart the process if the check fails.
// Minimal liveness check
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});
Liveness should only verify the process is responsive. Do NOT check external dependencies — that causes restart loops when dependencies are temporarily unavailable.
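A separate readiness check is where dependency checks belong: failing readiness removes the instance from load balancing without restarting it. A sketch (the checks object is hypothetical; wire in real probes like `pool.query('SELECT 1')`):

```javascript
// Readiness: run each dependency check; any failure → 503 (not ready)
async function readiness(checks) {
  const results = {};
  let ready = true;
  for (const [name, check] of Object.entries(checks)) {
    try {
      await check();
      results[name] = 'ok';
    } catch {
      results[name] = 'failed';
      ready = false;
    }
  }
  return { status: ready ? 200 : 503, body: results };
}

// e.g. app.get('/ready', async (req, res) => {
//   const { status, body } = await readiness({ db: () => pool.query('SELECT 1') });
//   res.status(status).json(body);
// });
```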

Background Jobs

Event-Driven Jobs

// SQS queue consumer
const { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({ region: 'us-east-1' });

async function processQueue() {
  while (true) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123/orders',
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20  // long polling
    }));
    
    if (!Messages) continue;
    
    for (const msg of Messages) {
      try {
        await processOrder(JSON.parse(msg.Body));
        
        // Delete after successful processing
        await sqs.send(new DeleteMessageCommand({
          QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123/orders',
          ReceiptHandle: msg.ReceiptHandle
        }));
      } catch (err) {
        console.error('Processing failed:', err);
        // Message returns to queue after visibility timeout
      }
    }
  }
}

Schedule-Driven Jobs (Cron)

# Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
spec:
  schedule: "0 2 * * *"  # 2am daily
  concurrencyPolicy: Forbid  # prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report-generator
            image: reports:v1.0
            command: ["python", "generate_daily_report.py"]
          restartPolicy: OnFailure
          activeDeadlineSeconds: 3600  # 1 hour timeout
Set concurrencyPolicy: Forbid on all CronJobs to prevent overlapping runs. A slow job that overlaps the next schedule creates a “cron storm” that exhausts resources.

Best Practices

Map your entire dependency chain and multiply availability percentages. An external API with 99.5% SLA prevents you from achieving better than 99.5% end-to-end.
Failover that has never been exercised in production will fail when it matters most. Chaos engineering (Chaos Monkey) ensures graceful degradation.
Rising replication lag signals that replicas cannot keep up with write load. This causes stale reads and eventually failover problems.
If your payment provider has 99.5% SLA, your checkout flow cannot exceed 99.5% availability regardless of your infrastructure.
Avoid IP-based sticky sessions: many users share one NAT IP, creating severe load imbalance. Use consistent hashing or eliminate sticky sessions instead.

Next Steps

Load Balancing

Traffic distribution and health checking strategies

Scalability

Horizontal scaling and autoscaling patterns

Databases

Replication, connection pooling, and query optimization

Caching

Design cache strategies for graceful degradation
