Building Resilient Systems - System Design 101

Introduction

Have you noticed that the largest incidents are usually caused by something very small? A minor error starts the snowball effect that keeps building up. Suddenly, everything is down. Building resilient systems requires understanding failure modes and implementing patterns that minimize the impact of failures. This guide covers essential strategies for creating robust, fault-tolerant systems. Resiliency Patterns

8 Cloud Design Patterns for Resilience

Here are 8 cloud design patterns to reduce the damage done by failures:

Timeout
Retry
Circuit Breaker
Rate Limiting
Load Shedding
Bulkhead
Back Pressure
Let it Crash

These patterns are usually not used alone. To apply them effectively, we need to understand why we need them, how they work, and their limitations.

1. Timeout Pattern

Set maximum time limits for operations to prevent indefinite waiting. Purpose:

Prevent resource exhaustion
Fail fast instead of hanging
Free up resources for other requests
Improve user experience

Implementation:

Connection Timeout
Request Timeout
Idle Timeout

Time to establish a connection to the server.Typical values: 5-10 seconds

Best Practices:

Set aggressive timeouts for non-critical operations
Use longer timeouts for critical operations
Make timeouts configurable
Log timeout events for monitoring

Shopify’s experience: Lower timeouts help services fail early. They use read timeout of 5 seconds and write timeout of 1 second.

2. Retry Pattern

Automatically retry failed operations with intelligent backoff strategies.

Retry Strategies

Linear Backoff

Wait for progressively increasing fixed interval between retries.Example: 1s, 2s, 3s, 4s, 5sPros: Simple to implement and understandCons: May cause retry storms under high load

Linear Jitter Backoff

Linear backoff with added randomness.Example: 1s±0.5s, 2s±1s, 3s±1.5sPros: Reduces synchronized retriesCons: Still somewhat predictable

Exponential Backoff

Delay increases exponentially between retries.Example: 1s, 2s, 4s, 8s, 16sPros: Significantly reduces system loadCons: May delay resolution for transient issues

Exponential Jitter Backoff

Exponential backoff with randomness (recommended).Example: 1s±0.5s, 2s±1s, 4s±2s, 8s±4sPros: Best balance, reduces collisions significantlyCons: Randomness can cause longer delays

Retry Best Practices

Idempotency

Ensure operations can be safely retried without side effects.

Max Attempts

Set a maximum number of retry attempts (typically 3-5).

Retry Conditions

Only retry transient failures (network errors, timeouts, 5xx errors).

Circuit Breaking

Combine with circuit breaker to stop retrying when service is down.

Errors to Retry:

Network timeouts
Connection failures
HTTP 429 (Too Many Requests)
HTTP 503 (Service Unavailable)
Temporary database unavailability

Errors NOT to Retry:

HTTP 4xx (except 429)
Authentication failures
Validation errors
Resource not found

3. Circuit Breaker Pattern

Prevent cascading failures by stopping requests to failing services. Covered in detail in the Distributed Patterns Guide Shopify Resilience Principles

Shopify’s Approach

Shopify developed Semian to protect services with circuit breakers:

Net::HTTP protection
MySQL protection
Redis protection
gRPC services protection

Key Configuration:

Fast failure with low timeouts
Circuit breaker thresholds tuned per service
Automatic recovery testing

4. Rate Limiting Pattern

Control the rate of requests to protect system resources. Algorithms:

Token Bucket
Leaky Bucket
Fixed Window
Sliding Window

Tokens added at fixed rate, requests consume tokens.Best for: Smooth traffic with occasional bursts

Rate Limiting Strategies:

Per User: Limit requests per user/API key
Per IP: Limit requests per IP address
Global: System-wide rate limits
Endpoint Specific: Different limits per endpoint

Response Handling:

HTTP 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000

5. Load Shedding Pattern

Drop low-priority requests when system is overloaded to maintain service for high-priority requests. Priority Levels:

Critical

Core business operations, never shedExamples: Payment processing, order creation

High

Important features, shed only under extreme loadExamples: User authentication, checkout

Medium

Standard features, can be shed under moderate loadExamples: Search, recommendations

Low

Nice-to-have features, shed firstExamples: Analytics, thumbnails, tracking

Implementation Strategies:

Queue-based: Drop from queue tail when full
Probabilistic: Randomly drop based on load level
Admission Control: Reject at entry point
Graceful Degradation: Reduce quality/features

6. Bulkhead Pattern

Isolate system resources to prevent cascading failures. Concept: Like compartments in a ship’s hull, failures in one area don’t sink the entire system. Implementation:

Thread Pool Isolation

Separate thread pools for different operations.One slow operation doesn’t block all threads.

Connection Pool Isolation

Separate connection pools per client/tenant.One heavy user doesn’t exhaust all connections.

Process Isolation

Run services in separate processes.Process crash doesn’t affect others.

Resource Limits

Set CPU, memory, disk limits per component.Prevent resource monopolization.

Example Configuration:

Services:
  critical-service:
    thread_pool_size: 100
    connection_pool_size: 50
    cpu_limit: 2000m
    memory_limit: 2Gi
  
  non-critical-service:
    thread_pool_size: 20
    connection_pool_size: 10
    cpu_limit: 500m
    memory_limit: 512Mi

7. Back Pressure Pattern

Signal to upstream services when downstream is overloaded. Mechanisms:

Bounded Queues

Use fixed-size queues that block or reject when full.Provides natural back pressure to producers.

Flow Control

Explicit signals to slow down/pause sending.Common in streaming protocols (gRPC, HTTP/2).

Reactive Streams

Publisher-subscriber with demand signaling.Subscriber controls how much data it can handle.

Load Feedback

Share load metrics with clients.Clients adjust request rate based on feedback.

Benefits:

Prevents memory exhaustion
Maintains system stability
Better resource utilization
Coordinated flow control

8. Let it Crash Pattern

Allow processes to fail and restart rather than trying to handle every error. Philosophy:

Design for failure and recovery
Supervision trees restart failed processes
Clean state after restart
Simple error handling

Requirements:

Fast startup time
Stateless design or external state storage
Health checks and monitoring
Supervision and orchestration (Kubernetes, Nomad)

Erlang/Elixir Example: This pattern originated from Erlang’s “Let it crash” philosophy:

Processes are lightweight
Supervisors monitor and restart
System remains available
Proven in telecom systems (99.9999999% availability)

High Availability Architecture

What is High Availability?

In system design, availability means all (non-failing) nodes are available for queries. When you send requests to nodes, a non-failing node returns a reasonable response within a reasonable time (no error or timeout). Target: For 4-9’s (99.99%), services can only be down for 52.5 minutes per year.

Availability guarantees you will receive a response, but doesn’t guarantee the data is the most up-to-date.

High Availability Architectures

Primary-Backup
Primary-Secondary
Primary-Primary

Setup:

Backup node is stand-by
Data replicated from primary to backup
Manual failover when primary fails

Pros: Simple, data integrityCons: Backup resources underutilized, manual failoverAvailability: 90% → 99% (if deployed on EC2)

Multi-Region Deployment

Benefits:

Lower latency for global users
Disaster recovery
Regulatory compliance
Higher availability

Challenges:

Data consistency across regions
Increased complexity
Higher costs
Latency between regions

Disaster Recovery Strategies

An effective Disaster Recovery (DR) plan is not just a precaution; it’s a necessity.

Key Metrics

RTO

Recovery Time ObjectiveMaximum acceptable time application can be offline after a disaster.

RPO

Recovery Point ObjectiveMaximum acceptable amount of data loss measured in time.

DR Strategies

Backup and Restore

Regular backups to facilitate post-disaster recovery.RTO: Several hours to daysRPO: Hours to last backupCost: LowestUse case: Non-critical systems, archival

Pilot Light

Critical components in ready-to-activate mode.RTO: Minutes to several hoursRPO: Depends on data sync frequencyCost: Low to mediumUse case: Important but not critical systems

Warm Standby

Semi-active environment with current data.RTO: Minutes to hoursRPO: Last few minutes to hoursCost: Medium to highUse case: Business-critical applications

Hot Site / Multi-Site

Fully operational duplicate environment running in parallel.RTO: Almost immediate (minutes)RPO: Minimal (seconds)Cost: HighestUse case: Mission-critical systems (banking, healthcare)

10 Principles from Shopify

Shopify has valuable tips for building resilient payment systems that apply broadly:

Lower the timeouts

Let services fail early. Use read timeout of 5s and write timeout of 1s.

Install circuit breakers

Use tools like Semian to protect services (Net::HTTP, MySQL, Redis, gRPC).

Capacity management

Understand throughput: 50 requests in queue, 100ms processing = 500 req/sec.

Add monitoring and alerting

Monitor four golden signals: latency, traffic, errors, and saturation.

Implement structured logging

Store logs centrally and make them easily searchable.

Use idempotency keys

Use ULID (Universally Unique Lexicographically Sortable Identifier) for idempotency.

Be consistent with reconciliation

Store reconciliation breaks with financial partners in database.

Incorporate load testing

Regularly simulate high-volume scenarios to get benchmark results.

Get on top of incident management

Define clear roles: Incident Manager, Support Response Manager, Service Owners.

Organize incident retrospectives

Ask three questions: What happened? What assumptions were wrong? How to prevent?

Observability and Monitoring

Four Golden Signals (Google SRE)

Latency

Time to service a request.Track p50, p95, p99 percentiles.

Traffic

Demand on your system.Requests per second, transactions per second.

Errors

Rate of failed requests.Track error types and frequencies.

Saturation

How “full” your service is.CPU, memory, disk, network utilization.

Distributed Tracing

Components:

Trace: End-to-end request flow
Span: Individual operation
Tags: Metadata for filtering
Logs: Time-stamped events

Popular Tools:

Jaeger
Zipkin
AWS X-Ray
Google Cloud Trace
Datadog APM

Structured Logging

Best Practices:

Use JSON format
Include correlation IDs
Add contextual information
Centralize logs (ELK, Splunk, CloudWatch)
Set appropriate log levels

Chaos Engineering

Proactively test system resilience by introducing failures. Principles:

Build a hypothesis around steady state behavior
Vary real-world events (server failures, network issues)
Run experiments in production (carefully)
Automate experiments to run continuously
Minimize blast radius to limit potential impact

Common Experiments:

Kill random instances
Introduce network latency
Simulate network partitions
Exhaust resources (CPU, memory, disk)
Inject errors in dependencies

Tools:

Chaos Monkey (Netflix)
Gremlin
Chaos Toolkit
LitmusChaos
AWS Fault Injection Simulator

Testing for Resilience

Load Testing

Test system behavior under expected load.Tools: JMeter, Gatling, k6

Stress Testing

Push system beyond normal capacity.Identify breaking points.

Soak Testing

Run at normal load for extended period.Detect memory leaks, resource exhaustion.

Spike Testing

Sudden increase in load.Test auto-scaling, surge capacity.

Best Practices Summary

Design for Failure: Assume everything will fail eventually. Build systems that can handle partial failures gracefully.

Key Principles:

✅ Implement multiple layers of resilience
✅ Use timeouts and circuit breakers
✅ Design for graceful degradation
✅ Implement comprehensive monitoring
✅ Practice incident response
✅ Use load testing and chaos engineering
✅ Document failure modes and runbooks
✅ Implement automated recovery
✅ Use bulkheads to isolate failures
✅ Regular disaster recovery drills

Resilience is not a feature you add at the end. It must be designed into the system from the beginning.

AWS Services

Learn about AWS services for resilient systems

Scalability

Scale systems while maintaining resilience

Distributed Patterns

Patterns for building distributed systems

APIs & Web

Databases

Caching & Performance

Security

Cloud & Distributed Systems

DevOps & CI/CD

​Introduction

​8 Cloud Design Patterns for Resilience

​1. Timeout Pattern

​2. Retry Pattern

​Retry Strategies

​Retry Best Practices

Idempotency

Max Attempts

Retry Conditions

Circuit Breaking

​3. Circuit Breaker Pattern

​Shopify’s Approach

​4. Rate Limiting Pattern

​5. Load Shedding Pattern

​6. Bulkhead Pattern

Thread Pool Isolation

Connection Pool Isolation

Process Isolation

Resource Limits

​7. Back Pressure Pattern

​8. Let it Crash Pattern

​High Availability Architecture

​What is High Availability?

​High Availability Architectures

​Multi-Region Deployment

​Disaster Recovery Strategies

​Key Metrics

RTO

RPO

​DR Strategies

​10 Principles from Shopify

​Observability and Monitoring

​Four Golden Signals (Google SRE)

Latency

Traffic

Errors

Saturation

​Distributed Tracing

​Structured Logging

​Chaos Engineering

​Testing for Resilience

Load Testing

Stress Testing

Soak Testing

Spike Testing

​Best Practices Summary

​Related Topics

AWS Services

Scalability

Distributed Patterns

Build docs developers (and LLMs) love

Introduction

8 Cloud Design Patterns for Resilience

1. Timeout Pattern

2. Retry Pattern

Retry Strategies

Retry Best Practices

3. Circuit Breaker Pattern

Shopify’s Approach

4. Rate Limiting Pattern

5. Load Shedding Pattern

6. Bulkhead Pattern

7. Back Pressure Pattern

8. Let it Crash Pattern

High Availability Architecture

What is High Availability?

High Availability Architectures

Multi-Region Deployment

Disaster Recovery Strategies

Key Metrics

DR Strategies

10 Principles from Shopify

Observability and Monitoring

Four Golden Signals (Google SRE)

Distributed Tracing

Structured Logging

Chaos Engineering

Testing for Resilience

Best Practices Summary

Related Topics