Introduction
Many applications today are data-intensive rather than compute-intensive. The biggest challenges are usually the amount of data, the complexity of data, and the speed at which it’s changing—not the raw CPU power.
A data-intensive application is typically built from standard building blocks:
- Databases: Store data for later retrieval
- Caches: Remember expensive operation results to speed up reads
- Search indexes: Allow users to search/filter data by keywords
- Stream processing: Send messages to other processes for asynchronous handling
- Batch processing: Periodically crunch large amounts of accumulated data
This chapter focuses on three fundamental concerns in software systems:
- Reliability: The system continues working correctly even when things go wrong
- Scalability: As the system grows, there are reasonable ways of dealing with that growth
- Maintainability: Over time, different people will work on the system, and they should be able to work on it productively
1. Reliability
Reliability means the system continues to work correctly, even when things go wrong. The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
A fault is not the same as a failure:
- Fault: One component deviating from its spec
- Failure: The whole system stops providing the required service to the user
The goal is to prevent faults from causing failures.
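A simple illustration of that goal: wrap a component in retries, so that a transient fault in the component never surfaces to the caller as a failure. This is a minimal sketch; the `flaky_operation` component is a simulated stand-in, not a real service.

```python
import random
import time

def call_with_retries(func, attempts=3, base_delay=0.1):
    """Retry a transiently-failing operation so a fault doesn't become a failure."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: the fault becomes a visible failure
            # Exponential backoff with jitter to avoid synchronized retry storms
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))

# Simulated component: faults twice, then recovers
calls = {"n": 0}
def flaky_operation():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(call_with_retries(flaky_operation))  # prints: ok; the caller never sees the faults
```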
1.1 Hardware faults
Hardware faults are common in large datacenters:
- Hard disks crash
- RAM becomes faulty
- Power grid has a blackout
- Network cables are unplugged
Hardware redundancy examples:
- Disks set up in RAID configuration
- Dual power supplies
- Hot-swappable CPUs
- Backup power (batteries and diesel generators)
Modern trend: Moving toward systems that can tolerate the loss of entire machines by using software fault-tolerance techniques in addition to hardware redundancy.
Example: AWS spot instances can be terminated at any moment, so applications must be designed to handle sudden loss of machines.
1.2 Software errors
Software errors are systematic and harder to anticipate:
Real-world examples:
- Linux leap second bug (2012): Many applications hung due to a kernel bug triggered by leap second
- Memory leak: Process gradually uses more memory until system runs out
- Cascading failure: One server’s slowdown causes others to overload
Mitigation approaches:
```python
# Example: Circuit breaker pattern to prevent cascading failures
import time
from enum import Enum

import requests


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN


# Usage
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def call_external_service():
    # Simulated external API call
    response = requests.get("https://api.example.com/data")
    return response.json()

try:
    data = breaker.call(call_external_service)
except Exception as e:
    print(f"Service unavailable: {e}")
```
1.3 Human errors
Humans are known to be unreliable. Studies show that configuration errors by operators are a leading cause of outages.
Best practices to minimize human errors:
- Design systems that minimize opportunities for error
  - Well-designed abstractions, APIs, and admin interfaces
  - Example: An admin UI that makes it easy to “do the right thing” and discourages “the wrong thing”
- Decouple the places where people make mistakes from the places where they can cause failures
  - Provide fully featured sandbox environments where people can explore safely
  - Example: A separate staging environment identical to production
- Test thoroughly at all levels
  - Unit tests, integration tests, and manual tests
  - Automated testing for corner cases
- Allow quick and easy recovery from human errors
  - Fast rollback of configuration changes
  - Roll out new code gradually
  - Provide tools to recompute data
- Set up detailed and clear monitoring
  - Telemetry: performance metrics and error rates
  - Early warning signals
```python
# Example: Safe database operations with confirmations
from datetime import datetime


class SafeDatabase:
    def __init__(self, db_connection):
        self.db = db_connection
        self.staging_mode = False

    def delete_users(self, user_ids, confirm=False):
        """Safely delete users with a confirmation requirement"""
        if not confirm:
            raise ValueError(
                f"About to delete {len(user_ids)} users. "
                "Set confirm=True to proceed."
            )
        # Additional safety: require staging mode for bulk operations
        if len(user_ids) > 100 and not self.staging_mode:
            raise ValueError(
                f"Bulk deletion of {len(user_ids)} users not allowed in production. "
                "Use staging environment for large operations."
            )
        # Log the operation before executing
        self.log_operation("DELETE_USERS", user_ids)
        # Perform deletion with parameterized placeholders
        placeholders = ','.join(['%s'] * len(user_ids))
        self.db.execute(f"DELETE FROM users WHERE id IN ({placeholders})", user_ids)
        return f"Deleted {len(user_ids)} users"

    def log_operation(self, operation, details):
        """Log all operations for an audit trail"""
        timestamp = datetime.now().isoformat()
        print(f"[{timestamp}] {operation}: {details}")


# Usage - requires explicit confirmation
db = SafeDatabase(connection)  # `connection` is an existing DB-API connection
# db.delete_users([1, 2, 3])              # Raises ValueError
db.delete_users([1, 2, 3], confirm=True)  # Works
```
2. Scalability
Scalability is the term we use to describe a system’s ability to cope with increased load. Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future with 10x or 100x the load.
2.1 Describing load
Load can be described with a few numbers called load parameters. The best choice depends on your system architecture:
Example: Twitter’s scaling challenge (circa 2012)
At the time, Twitter had two main operations:
- Post tweet: User publishes a new message (4.6k requests/sec on average, 12k at peak)
- Home timeline: User views tweets from people they follow (300k requests/sec)
Approach 1: Insert tweet and query at read time
```sql
-- Posting a tweet
INSERT INTO tweets (user_id, content, timestamp)
VALUES (current_user_id, 'Hello world', now());

-- Reading home timeline
SELECT tweets.*, users.*
FROM tweets
JOIN users ON users.id = tweets.user_id
JOIN follows ON follows.followee_id = tweets.user_id
WHERE follows.follower_id = current_user_id
ORDER BY tweets.timestamp DESC
LIMIT 100;
```
Approach 2: Maintain a cache for each user’s home timeline
```python
# Posting a tweet - fan-out on write
def post_tweet(user_id, content):
    # Insert tweet
    tweet_id = db.insert_tweet(user_id, content)
    # Get all followers
    followers = db.get_followers(user_id)
    # Insert into each follower's timeline cache
    for follower_id in followers:
        cache.add_to_timeline(follower_id, tweet_id)
    return tweet_id

# Reading home timeline - simple cache read
def get_home_timeline(user_id):
    return cache.get_timeline(user_id, limit=100)
```
Challenge: Users with millions of followers create a fan-out load challenge. If a celebrity with 30 million followers posts a tweet, that’s 30 million writes to home timeline caches!
Twitter’s solution: Hybrid approach
- Most users: fan-out on write (fast reads)
- Celebrities: fan-out on read (avoid massive writes)
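That hybrid can be sketched roughly as follows. The threshold, the `db`/`cache` helper methods, and the assumption that tweet ids are assigned in time order are all illustrative; this is not Twitter's actual code.

```python
CELEBRITY_THRESHOLD = 1_000_000  # illustrative cutoff

def post_tweet_hybrid(db, cache, user_id, content):
    tweet_id = db.insert_tweet(user_id, content)
    followers = db.get_followers(user_id)
    if len(followers) < CELEBRITY_THRESHOLD:
        # Ordinary user: fan out on write
        for follower_id in followers:
            cache.add_to_timeline(follower_id, tweet_id)
    # Celebrity tweets are NOT fanned out; they are merged in at read time
    return tweet_id

def get_home_timeline_hybrid(db, cache, user_id, limit=100):
    tweet_ids = cache.get_timeline(user_id, limit=limit)
    # Fan out on read: fetch recent tweets from followed celebrities
    for celeb_id in db.get_followed_celebrities(user_id):
        tweet_ids.extend(db.get_recent_tweet_ids(celeb_id))
    # Ids are assumed time-ordered, so sorting gives newest first
    return sorted(tweet_ids, reverse=True)[:limit]
```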
2.2 Describing performance
Once you’ve described load, you can investigate what happens when load increases:
- When load increases and system resources stay the same, how is performance affected?
- When load increases, how much do you need to increase resources to keep performance unchanged?
Use percentiles, not averages, to measure response time: the median (p50) shows the typical case, while high percentiles (p95, p99, p99.9) show how bad the outliers are.
Example: Monitoring response time percentiles
```python
import time
from collections import deque

import numpy as np


class ResponseTimeMonitor:
    def __init__(self, window_size=1000):
        self.response_times = deque(maxlen=window_size)

    def record(self, response_time_ms):
        self.response_times.append(response_time_ms)

    def get_percentiles(self):
        if not self.response_times:
            return {}
        times = np.array(self.response_times)
        return {
            'p50': np.percentile(times, 50),
            'p95': np.percentile(times, 95),
            'p99': np.percentile(times, 99),
            'p999': np.percentile(times, 99.9),
            'mean': np.mean(times),
            'max': np.max(times),
        }

    def print_stats(self):
        stats = self.get_percentiles()
        print("Response Time Statistics (ms):")
        print(f"  p50 (median): {stats['p50']:.2f}")
        print(f"  p95:   {stats['p95']:.2f}")
        print(f"  p99:   {stats['p99']:.2f}")
        print(f"  p99.9: {stats['p999']:.2f}")
        print(f"  mean:  {stats['mean']:.2f}")
        print(f"  max:   {stats['max']:.2f}")


# Usage
monitor = ResponseTimeMonitor()

def handle_request():
    start = time.time()
    # Process request...
    time.sleep(np.random.exponential(0.1))  # Simulated work
    duration_ms = (time.time() - start) * 1000
    monitor.record(duration_ms)
    return duration_ms

# Simulate requests
for _ in range(1000):
    handle_request()

monitor.print_stats()
```
Queueing delays and head-of-line blocking:
Even if only a small percentage of backend calls are slow, if a user request requires multiple backend calls, the probability of getting a slow call increases (tail latency amplification).
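The amplification is easy to quantify: if each backend call is slow with probability p, a user request that fans out to n independent backend calls is slow with probability 1 - (1 - p)^n. A quick check:

```python
def prob_slow_request(p_slow_call, num_calls):
    """Probability that at least one of num_calls independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** num_calls

# Even if only 1% of backend calls are slow, a request touching
# 100 backends hits a slow call about 63% of the time.
print(f"{prob_slow_request(0.01, 100):.2%}")  # prints: 63.40%
```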
2.3 Approaches for coping with load
How do we maintain good performance when load parameters increase by an order of magnitude?
Scaling strategies:
- Vertical scaling (scaling up): Moving to a more powerful machine
  - Simpler to implement
  - Limited by maximum machine size
  - Single point of failure
- Horizontal scaling (scaling out): Distributing load across multiple smaller machines
  - Can continue scaling by adding more machines
  - More complex (distributed systems challenges)
  - Better fault tolerance
Reality: Most large-scale systems use a pragmatic mixture of both approaches.
Example: Designing for scale
```python
import random


# Bad: Shared state makes horizontal scaling difficult
class BadCounter:
    def __init__(self):
        self.count = 0  # Shared state in memory

    def increment(self):
        self.count += 1
        return self.count


# Good: Stateless design for easy horizontal scaling
class GoodCounter:
    def __init__(self, redis_client):
        self.redis = redis_client

    def increment(self, key):
        # State stored in an external system
        return self.redis.incr(key)


# Even better: Partitioned for horizontal scaling
class ShardedCounter:
    def __init__(self, redis_clients):
        self.shards = redis_clients

    def increment(self, key):
        # Spread the write load for a hot key across shards;
        # reads must then sum over all shards
        shard = random.randrange(len(self.shards))
        return self.shards[shard].incr(key)

    def get_total(self, key):
        total = 0
        for shard in self.shards:
            total += int(shard.get(key) or 0)
        return total
```
There is no one-size-fits-all scalable architecture. The architecture of systems that operate at large scale is usually highly specific to the application.
Example volume calculations:
```python
def calculate_capacity_requirements():
    """Example: Planning capacity for a social media timeline"""
    # Load parameters
    daily_active_users = 100_000_000
    tweets_per_day_per_user = 0.5
    follows_per_user = 200

    # Derived metrics
    total_tweets_per_day = daily_active_users * tweets_per_day_per_user
    tweets_per_second = total_tweets_per_day / (24 * 3600)

    # Fan-out calculation (followers per user approximated by follows per user)
    timeline_writes_per_tweet = follows_per_user
    timeline_writes_per_second = tweets_per_second * timeline_writes_per_tweet

    print("Load Analysis:")
    print(f"  Daily active users: {daily_active_users:,}")
    print(f"  Tweets per second: {tweets_per_second:,.0f}")
    print(f"  Timeline writes per second: {timeline_writes_per_second:,.0f}")
    print("\nCapacity needed:")
    print(f"  Database write capacity: {timeline_writes_per_second:,.0f} writes/sec")

    # Storage calculation
    avg_tweet_size_bytes = 500
    storage_per_day_gb = (total_tweets_per_day * avg_tweet_size_bytes) / (1024**3)
    storage_per_year_tb = storage_per_day_gb * 365 / 1024
    print(f"  Storage per day: {storage_per_day_gb:.2f} GB")
    print(f"  Storage per year: {storage_per_year_tb:.2f} TB")

calculate_capacity_requirements()
```
3. Maintainability
The majority of the cost of software is in ongoing maintenance:
- Fixing bugs
- Keeping systems operational
- Investigating failures
- Adapting to new platforms
- Modifying for new use cases
- Repaying technical debt
- Adding new features
3.1 Operability: Making life easy for operations
Good operability means making routine tasks easy, allowing operations teams to focus on high-value activities.
Good operability practices:
```python
# Example: Health check endpoint
import time

import psutil
from flask import Flask, jsonify

app = Flask(__name__)
start_time = time.time()

# `db` and `cache` are assumed to be initialized elsewhere
# (e.g. a database client and a Redis client)


@app.route('/health')
def health_check():
    """Comprehensive health check for monitoring systems"""
    health_status = {
        'status': 'healthy',
        'timestamp': time.time(),
        'uptime_seconds': time.time() - start_time,
        'checks': {}
    }

    # Check database connectivity
    try:
        db.execute('SELECT 1')
        health_status['checks']['database'] = 'ok'
    except Exception as e:
        health_status['checks']['database'] = f'error: {e}'
        health_status['status'] = 'unhealthy'

    # Check cache connectivity
    try:
        cache.ping()
        health_status['checks']['cache'] = 'ok'
    except Exception as e:
        health_status['checks']['cache'] = f'error: {e}'
        health_status['status'] = 'degraded'

    # Check system resources
    cpu_percent = psutil.cpu_percent()
    memory_percent = psutil.virtual_memory().percent
    disk_percent = psutil.disk_usage('/').percent
    health_status['resources'] = {
        'cpu_percent': cpu_percent,
        'memory_percent': memory_percent,
        'disk_percent': disk_percent
    }

    # Warn if resources are high
    if cpu_percent > 80 or memory_percent > 80 or disk_percent > 80:
        health_status['status'] = 'degraded'

    return jsonify(health_status)


@app.route('/metrics')
def metrics():
    """Prometheus-compatible metrics endpoint.

    The request_count_* and bucket_* counters are assumed to be
    maintained elsewhere in the application.
    """
    return f"""\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{{method="GET",status="200"}} {request_count_200}
http_requests_total{{method="GET",status="500"}} {request_count_500}
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{{le="0.1"}} {bucket_100ms}
http_request_duration_seconds_bucket{{le="0.5"}} {bucket_500ms}
http_request_duration_seconds_bucket{{le="1.0"}} {bucket_1s}
"""
```
3.2 Simplicity: Managing complexity
As projects grow, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system.
Solution: Abstraction - hiding implementation details behind clean, simple-to-understand interfaces.
Example: Good abstraction
```python
# Bad: Complexity exposed
import psycopg2

def get_user_orders(user_id):
    # Direct database access with complex logic
    connection = psycopg2.connect(
        host="db.example.com",
        database="orders",
        user="app",
        password="secret"
    )
    cursor = connection.cursor()
    # Complex query with joins
    cursor.execute("""
        SELECT o.*, oi.*, p.*
        FROM orders o
        JOIN order_items oi ON o.id = oi.order_id
        JOIN products p ON oi.product_id = p.id
        WHERE o.user_id = %s
          AND o.deleted_at IS NULL
        ORDER BY o.created_at DESC
    """, (user_id,))
    # Manual result processing
    results = cursor.fetchall()
    orders = {}
    for row in results:
        order_id = row[0]
        if order_id not in orders:
            orders[order_id] = {
                'id': row[0],
                'total': row[2],
                'items': []
            }
        orders[order_id]['items'].append({
            'product_id': row[5],
            'quantity': row[6]
        })
    cursor.close()
    connection.close()
    return list(orders.values())


# Good: Clean abstraction (SQLAlchemy-style repository)
from sqlalchemy.orm import joinedload

class OrderRepository:
    def __init__(self, db):
        self.db = db

    def get_user_orders(self, user_id):
        """Get all orders for a user with related items"""
        return self.db.query(Order)\
            .filter_by(user_id=user_id, deleted_at=None)\
            .options(joinedload(Order.items))\
            .order_by(Order.created_at.desc())\
            .all()


# Usage
order_repo = OrderRepository(db)
orders = order_repo.get_user_orders(user_id)
```
3.3 Evolvability: Making change easy
System requirements change constantly:
- New facts are learned
- Unexpected use cases emerge
- Business priorities shift
- Users request new features
- New platforms emerge
- Legal or regulatory requirements change
- Growth in scale requires architectural changes
Example: Designing for evolvability with clean interfaces
```python
# Payment processing with strategy pattern for easy evolution
from abc import ABC, abstractmethod

import stripe  # third-party Stripe SDK


class PaymentProcessor(ABC):
    """Abstract interface for payment processing"""

    @abstractmethod
    def process_payment(self, amount, currency, customer_id):
        pass

    @abstractmethod
    def refund_payment(self, transaction_id, amount):
        pass


class StripeProcessor(PaymentProcessor):
    def __init__(self, api_key):
        self.stripe = stripe
        self.stripe.api_key = api_key

    def process_payment(self, amount, currency, customer_id):
        charge = self.stripe.Charge.create(
            amount=amount,
            currency=currency,
            customer=customer_id
        )
        return charge.id

    def refund_payment(self, transaction_id, amount):
        refund = self.stripe.Refund.create(
            charge=transaction_id,
            amount=amount
        )
        return refund.id


class PayPalProcessor(PaymentProcessor):
    def __init__(self, client_id, secret):
        # PayPalClient stands in for whichever PayPal SDK wrapper you use
        self.paypal = PayPalClient(client_id, secret)

    def process_payment(self, amount, currency, customer_id):
        # PayPal-specific implementation
        payment = self.paypal.create_payment(
            amount=amount,
            currency=currency,
            customer=customer_id
        )
        return payment.id

    def refund_payment(self, transaction_id, amount):
        # PayPal-specific implementation
        refund = self.paypal.refund(transaction_id, amount)
        return refund.id


# Application code doesn't depend on a specific processor
class OrderService:
    def __init__(self, payment_processor: PaymentProcessor):
        self.payment = payment_processor

    def complete_order(self, order_id, amount, currency, customer_id):
        # Easy to switch payment processors without changing this code
        transaction_id = self.payment.process_payment(
            amount, currency, customer_id
        )
        # Update order
        order = Order.query.get(order_id)
        order.transaction_id = transaction_id
        order.status = 'paid'
        order.save()
        return order


# Configuration determines which processor to use
if config.PAYMENT_PROVIDER == 'stripe':
    processor = StripeProcessor(config.STRIPE_API_KEY)
elif config.PAYMENT_PROVIDER == 'paypal':
    processor = PayPalProcessor(config.PAYPAL_CLIENT_ID, config.PAYPAL_SECRET)

order_service = OrderService(processor)
```
Designing for change: Make it easy to add new payment providers without modifying existing code (Open/Closed Principle).
Summary
This chapter explored three fundamental concerns in software systems:
Reliability
Key takeaways:
- Anticipate and handle faults to prevent failures
- Use redundancy for hardware faults
- Minimize opportunities for human error
- Test thoroughly and monitor continuously
Scalability
Key takeaways:
- Describe load with specific parameters relevant to your system
- Measure performance with percentiles, not just averages
- No universal scalability solution—design for your specific needs
- Plan for growth: vertical scaling (up) or horizontal scaling (out)
Maintainability
Key takeaways:
- Operability: Make it easy to keep the system running smoothly
- Simplicity: Manage complexity with good abstractions
- Evolvability: Design for change from the start
These three concerns (reliability, scalability, maintainability) are the foundation for building data-intensive applications that can grow and evolve successfully over time.