
What is System Design?

System design defines the architecture, components, data flows, and interfaces of a system to satisfy requirements at scale — balancing performance, reliability, and cost trade-offs.
System design bridges requirements and implementation at the architectural level. It demands reasoning simultaneously about performance (latency, throughput), scalability (handling growth), reliability (fault tolerance), and cost. There is rarely one correct answer — every design involves explicit trade-offs with stated justifications.

Interview Framework

When approaching system design problems, follow this structured process:
// System Design Interview Framework
1. Clarify: functional + non-functional requirements
2. Estimate: scale, storage, bandwidth, RPS
3. Define: API contracts, data model, core flows
4. Design: high-level components → deep-dive bottleneck
5. Trade-offs: state choices and their consequences

// Estimation: 10M users × 5 writes/day
50M writes/day ÷ 86400 ≈ 578 WPS avg → ~1750 WPS peak
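The estimation above can be sketched as a small helper. This is a minimal sketch with hypothetical names (`estimate_wps`, `peak_factor`), not a standard library; the 3× peak factor is the same assumption the example uses.

```python
# Back-of-envelope capacity estimator (hypothetical helper names).
SECONDS_PER_DAY = 86_400

def estimate_wps(users: int, writes_per_user_per_day: int,
                 peak_factor: float = 3.0) -> tuple[float, float]:
    """Return (average WPS, estimated peak WPS) for a daily write volume."""
    daily_writes = users * writes_per_user_per_day
    avg = daily_writes / SECONDS_PER_DAY
    return avg, avg * peak_factor

avg, peak = estimate_wps(10_000_000, 5)
# avg ≈ 578.7 WPS, peak ≈ 1736 WPS, matching the figures above
```

Keeping the peak factor as an explicit parameter makes the assumption visible: traffic rarely arrives uniformly, and 2–5× over the daily average is a common planning range.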

Key Principles

  • Clarify requirements first: functional (features) and non-functional (SLAs, scale)
  • Estimate scale: users, RPS, data volume, storage, bandwidth — numbers reveal bottlenecks
  • Back-of-envelope: 100M DAU × 10 req/day = 1B req/day ≈ 12k RPS average; provision for ~35k RPS peak (3x buffer)
  • Identify the primary bottleneck: CPU, memory, disk I/O, network, or DB connections
  • Iterate architecture: start simple (monolith + single DB), add complexity only where proven necessary
  • CAP theorem, consistency models, and availability patterns constrain every design decision
Spend the first 5 minutes in a design interview clarifying requirements before drawing a single box — the best designs solve the stated problem, not an assumed one.

Performance vs Scalability

Performance is how fast a single request is served. Scalability is maintaining performance as concurrent load increases. The two require different strategies and diagnostic tools.

Understanding the Difference

You cannot always scale away a performance problem, and fast single-request performance does not guarantee scale.
  • Performance metric: P50/P99 latency for a single request
  • Scalability metric: RPS at which latency SLA breaks
  • Horizontal scaling (add nodes) works for stateless services — state is the barrier
  • Vertical scaling (bigger machine) has a hardware ceiling and a cost cliff
  • Connection pool exhaustion is among the most common scalability walls in web backends

Diagnostic Example

// Performance vs Scalability diagnostic
Single user P99: 150ms     → performance: OK
At 1000 RPS:    8000ms    → scalability: BROKEN

// Root cause: DB connection pool exhausted
max_connections = 100
app_servers     = 10, threads_each = 50
// 10×50=500 threads competing for 100 connections
Fix: PgBouncer connection pooler or reduce thread count
Connection pool exhaustion manifests as sudden latency spikes under load. Profile before optimizing — the bottleneck is almost never where you expect.
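What a pooler like PgBouncer does can be illustrated with a toy bounded pool: cap how many callers hold a database connection at once, and queue the rest instead of letting 500 threads fight over 100 connections. This is a conceptual sketch only (the class name and stand-in connection object are invented for illustration), not how PgBouncer is implemented.

```python
import threading
from contextlib import contextmanager

class BoundedPool:
    """Toy connection pool: at most max_size callers hold a 'connection'
    at a time; the rest block until one is released. This caps DB-side
    concurrency, which is the core idea behind a pooler like PgBouncer."""
    def __init__(self, max_size: int):
        self._sem = threading.Semaphore(max_size)
        self.max_size = max_size

    @contextmanager
    def connection(self):
        self._sem.acquire()        # block if all connections are checked out
        try:
            yield object()         # stand-in for a real DB connection
        finally:
            self._sem.release()    # always return the connection

pool = BoundedPool(max_size=100)
with pool.connection() as conn:
    pass  # run query here; 500 app threads now share 100 connections safely
```

The trade-off: queueing at the pool adds some latency per request, but it prevents the database from thrashing on more sessions than it can serve.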

Latency vs Throughput

Latency is time from request to response. Throughput is the number of requests handled per unit time. Optimizing one often degrades the other; Little’s Law links them.

Little’s Law

N = λ × W: Concurrent requests in flight equals arrival rate times response time. When latency spikes, concurrent requests balloon, exhausting thread pools and memory.
// Little's Law: latency spike impact
Normal: 1000 RPS × 50ms  = 50 concurrent requests
Spike:  1000 RPS × 500ms = 500 concurrent requests
// 10× concurrency → likely OOM or cascade timeout
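The arithmetic above is simple enough to write as code. A minimal sketch (function names are illustrative, not from any library): Little's Law forward to get in-flight concurrency, and rearranged to get the saturation point of a fixed-size worker pool.

```python
# Little's Law: N = λ × W (concurrency = arrival rate × response time)
def concurrent_requests(rps: float, latency_s: float) -> float:
    return rps * latency_s            # N = λ × W

def max_rps(workers: int, latency_s: float) -> float:
    return workers / latency_s        # rearranged: λ = N / W

concurrent_requests(1000, 0.050)   # 50 in flight at 50ms
concurrent_requests(1000, 0.500)   # 500 in flight when latency spikes 10×
max_rps(500, 0.500)                # those 500 workers now cap out at 1000 RPS
```

The rearranged form explains cascade failures: a latency spike silently cuts the pool's maximum sustainable RPS, so the same incoming traffic now exceeds capacity.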

// Throughput vs latency trade-off in Kafka consumer
fetch.min.bytes=1      // low latency, low throughput
fetch.min.bytes=1048576 // high throughput, +latency

Key Concepts

  • P99 latency matters: the 1% of slow requests drive user churn and support tickets
  • Throughput = sustained RPS the system can serve within its latency SLA
  • Batching: commit to disk every 100ms (throughput) vs every write (latency)
  • Streaming vs batch: real-time pipelines optimize latency; batch jobs optimize throughput
  • Async I/O decouples thread count from I/O concurrency — Node.js, Go goroutines, Java virtual threads
Report P99 latency on dashboards and set SLOs around it — median latency hides the slowest requests that users actually experience as painful.
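Computing P50/P99 from raw latency samples takes only the standard library. A minimal sketch using `statistics.quantiles` (the sample data is invented to show how the median hides a slow tail):

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50 and P99 from raw latency samples via linear interpolation."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}

samples = [20] * 98 + [400, 900]     # mostly fast, with a slow tail
stats = latency_percentiles(samples)
# p50 stays at 20ms while p99 lands in the hundreds of ms:
# exactly the gap a median-only dashboard hides
```

In production you would compute this from a histogram or a sketch structure (e.g. HDR histograms) rather than raw samples, but the percentile definition is the same.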

CAP Theorem

CAP Theorem states that a distributed system guarantees at most two of: Consistency, Availability, and Partition Tolerance. Since partitions are unavoidable, systems must choose between C and A when a partition occurs.
Network partitions are inevitable — packets drop, links fail. Every distributed system must choose: refuse to serve stale data (CP) or always serve data that may be stale (AP).

System Classifications

// Cassandra: CAP choice per operation
WRITE: QUORUM  // ⌊N/2⌋+1 nodes confirm → consistent write
READ:  QUORUM  // read from majority → consistent read
// N=3: QUORUM=2, tolerates 1 node failure

WRITE: ONE    // 1 node confirms → fast but possibly stale
READ:  ONE    // may read from lagging replica
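The rule behind these settings is that reads see the latest acknowledged write whenever the read and write sets must overlap, i.e. R + W > N. A minimal sketch of that check (function names are illustrative):

```python
# Tunable consistency: a read is guaranteed to overlap the latest
# acknowledged write when R + W > N.
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

def quorum(n: int) -> int:
    return n // 2 + 1                 # ⌊N/2⌋+1, e.g. 2 of 3

is_strongly_consistent(3, quorum(3), quorum(3))  # QUORUM/QUORUM → True
is_strongly_consistent(3, 1, 1)                  # ONE/ONE → False, may read stale
```

This is also why QUORUM at N=3 tolerates one node failure: writes and reads each need only 2 of 3 replicas.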

Categories

CP Systems

Return error or block during partition to avoid stale reads.
Examples: PostgreSQL, Zookeeper, etcd, HBase
Use for: bank balances, stock reservations, distributed locks

AP Systems

Return possibly stale data — always available, eventually consistent.
Examples: Cassandra, CouchDB, DynamoDB
Use for: shopping carts, view counters, DNS records

PACELC Extension

PACELC extends CAP: if a Partition occurs, choose Availability or Consistency; Else, every operation still trades Latency against Consistency.
  • Cassandra tunable consistency: QUORUM reads/writes = CP-like; ONE = AP-like per operation
  • Google Spanner: achieves external consistency via TrueTime atomic clocks — practical CP at global scale
Default to AP + eventual consistency, then carve out CP for specific entities where stale reads cause harm. This minimizes the surface area of unavailability.

Consistency Patterns

Consistency patterns define when and how data changes become visible across distributed nodes.

Consistency Levels

Level                 | Guarantee                              | Examples
Weak                  | No guarantee — best effort             | Video streams, VoIP, real-time gaming
Eventual              | Converges given no new writes          | DNS, Cassandra, DynamoDB, S3
Strong (linearizable) | Reads always reflect latest write      | PostgreSQL, Zookeeper, etcd
Read-your-own-writes  | User sees their own writes             | Profile updates, settings
Monotonic reads       | Once seen, older values never returned | Session consistency
Bounded staleness     | Eventual with max lag guarantee        | Azure Cosmos DB

DNS Example: Classic Eventual Consistency

// DNS: classic eventual consistency
ns1: A record → 1.2.3.4  (updated)
ns2: A record → 1.2.3.3  (stale, TTL not expired)
// Both respond differently — converge after TTL expiry

// DynamoDB consistency options
ConsistentRead: false → eventually consistent (half cost)
ConsistentRead: true  → strongly consistent  (full cost)
Eventual consistency is a provable guarantee of convergence, not “sometimes consistent.” The design question is: what is the acceptable staleness window for this specific data?
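Convergence can be shown in miniature with a toy last-write-wins register: each write carries a timestamp, and replicas converge once they exchange their latest versions, mirroring the DNS example above. The class and its merge step are invented for illustration (real systems layer anti-entropy, vector clocks, or CRDTs on top of this idea).

```python
import time

class LWWRegister:
    """Toy last-write-wins register: keeps whichever write has the
    newest timestamp; merging two replicas makes them converge."""
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value, ts=None):
        ts = ts if ts is not None else time.time()
        if ts > self.ts:                  # keep only the newest write
            self.value, self.ts = value, ts

    def merge(self, other: "LWWRegister"):
        self.write(other.value, other.ts)

ns1, ns2 = LWWRegister(), LWWRegister()
ns1.write("1.2.3.4", ts=2)   # updated record
ns2.write("1.2.3.3", ts=1)   # stale record, like an unexpired TTL
ns2.merge(ns1)               # exchange state → replicas converge
```

Note the guarantee is exactly the textbook one: given no new writes, merging eventually makes every replica return the same value.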

Best Practices

  • Use strong consistency for money, inventory levels, and auth tokens
  • Use eventual consistency with idempotency keys for user-generated content
  • Document CAP choices per service in Architecture Decision Records
  • Don’t use AP for financial transactions or inventory reservations
  • Don’t use CP for social media likes, view counts, or feed caching
  • Don’t ignore replication lag monitoring in eventually consistent systems
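The idempotency-key practice above can be sketched in a few lines: the first write with a key performs the side effect, and any retry with the same key returns the recorded result instead of repeating it. This is a conceptual sketch with invented names (`apply_write`, the in-memory `_processed` dict); production systems persist keys in a TTL'd store such as Redis or the database itself.

```python
# Idempotency keys make retries safe: the second attempt with the
# same key is a no-op that returns the original result.
_processed: dict[str, object] = {}   # stand-in for a durable, TTL'd key store

def apply_write(idempotency_key: str, payload):
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # duplicate: replay prior result
    result = {"stored": payload}             # stand-in for the real side effect
    _processed[idempotency_key] = result
    return result

first = apply_write("req-123", {"comment": "hello"})
retry = apply_write("req-123", {"comment": "hello"})  # client retried
assert retry is first                                  # no double-write
```

Clients typically generate the key (e.g. a UUID per logical operation) and resend it unchanged on every retry, which is what makes at-least-once delivery safe on an eventually consistent path.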

Next Steps

Scalability

Learn about horizontal scaling, sharding, and load distribution

Databases

Deep dive into SQL vs NoSQL, sharding, and optimization

Caching

Explore caching strategies and multi-layer cache architecture

Load Balancing

Understand traffic distribution and load balancing algorithms
