Introduction
This book has covered many aspects of data systems: storage, retrieval, replication, partitioning, transactions, consistency, and processing. Now we bring these ideas together to think about how to design better data systems.
Key questions:
- How do we integrate disparate systems?
- How do we ensure correctness across systems?
- How do we evolve systems over time?
- How do we maintain data quality and integrity?
1. Data Integration
Most applications need several different data systems working together. Challenge: Keep all these systems synchronized and consistent.
Combining Specialized Tools
No single database is good at everything.
The Problem of Data Integration
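Traditional approach: Dual writes, where the application itself writes each change to every system.
Problems with dual writes: Race conditions and inconsistency between systems.
Example of race condition: Two clients update the same record concurrently. The database applies the writes in one order and the search index applies them in the other, so the two systems end up permanently disagreeing.
Better Approach: Single Source of Truth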
Write every change to one ordered event log and let all other systems derive their state from it.
Benefits:
- Single source of truth
- Guaranteed ordering (within partition)
- Async consumers can process at own pace
- Easy to add new derived views
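The benefits listed above can be illustrated with a minimal in-memory sketch. It assumes a single-partition log held in a Python list; the names `EventLog` and `DerivedView` are hypothetical and stand in for a real log broker and its consumers.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log; the list index of an event is its offset."""
    events: list = field(default_factory=list)

    def append(self, key, value):
        self.events.append((key, value))
        return len(self.events) - 1  # offset assigned by the log

    def read_from(self, offset):
        return self.events[offset:]


@dataclass
class DerivedView:
    """A consumer that tracks its own offset and applies events in log order."""
    state: dict = field(default_factory=dict)
    offset: int = 0

    def catch_up(self, log):
        for key, value in log.read_from(self.offset):
            self.state[key] = value
            self.offset += 1


log = EventLog()
log.append("x", "A")   # write from client 1
log.append("x", "B")   # write from client 2; the log fixes the order: A, then B

database_view = DerivedView()
search_index = DerivedView()
database_view.catch_up(log)   # consumers run independently...
log.append("y", "C")
search_index.catch_up(log)    # ...and at their own pace
database_view.catch_up(log)

print(database_view.state == search_index.state)  # True
```

Because both views apply the same events in the same order, they converge on the same state even though they consume at different times. Adding another derived view only requires a new consumer that reads from offset 0.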
2. Unbundling Databases
Traditional databases bundle many features together. Modern trend: Unbundle them and compose specialized systems.
Composing Data Storage Technologies
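Database features as separate services: Features that a database bundles internally, such as secondary indexes, materialized views, and replication logs, can instead be provided by separate systems.
Designing Applications Around Dataflow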
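As a rough illustration of the dataflow and CQRS ideas in this section, the sketch below keeps an append-only event list on the write side and a denormalized dictionary on the read side. All names here (`handle_place_order`, `project`, `query_total_spent`, the `OrderPlaced` event) are hypothetical, and a real system would run the projector asynchronously.

```python
event_log = []            # write side: append-only list of events
orders_by_customer = {}   # read side: view shaped for one query pattern


def handle_place_order(customer, order_id, amount):
    """Command handler: validates and records an event; never touches the read model."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    event = {"type": "OrderPlaced", "customer": customer,
             "order_id": order_id, "amount": amount}
    event_log.append(event)
    return event


def project(event):
    """Projector: updates the read model from an event."""
    if event["type"] == "OrderPlaced":
        orders = orders_by_customer.setdefault(event["customer"], [])
        orders.append((event["order_id"], event["amount"]))


def query_total_spent(customer):
    """Query handler: reads only from the read model."""
    return sum(amount for _, amount in orders_by_customer.get(customer, []))


for ev in (handle_place_order("alice", 1, 30), handle_place_order("alice", 2, 12)):
    project(ev)

print(query_total_spent("alice"))  # 42
```

The read model can be discarded and rebuilt at any time by running `project` over `event_log` again, which is what makes it safe to reshape for new query patterns.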
Dataflow architecture: Data flows through the system as events.
Command Query Responsibility Segregation (CQRS): Separate the write path from the read path so that each can be optimized independently.
3. Derived Data
Principle: Some data is derived from other data, so distinguish the system of record from the views derived from it.
Lambda Architecture
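A toy version of the Lambda layout may help when reading this section. It assumes hypothetical page-view events and a fixed cutoff marking what the last batch run covered; `batch_layer`, `speed_layer`, and `serving_layer` are illustrative names, not a real framework's API.

```python
from collections import Counter

# Hypothetical page-view events: (timestamp, page)
events = [(1, "/home"), (2, "/about"), (3, "/home"), (4, "/home"), (5, "/about")]
BATCH_CUTOFF = 3   # assumed: the last batch run covered timestamps <= 3


def batch_layer(all_events, cutoff):
    """Recomputes the view from scratch over the historical data."""
    return Counter(page for ts, page in all_events if ts <= cutoff)


def speed_layer(all_events, cutoff):
    """Incrementally covers only the events the batch run has not seen yet."""
    view = Counter()
    for ts, page in all_events:
        if ts > cutoff:
            view[page] += 1
    return view


def serving_layer(batch_view, realtime_view):
    """Answers queries by merging the two views."""
    return batch_view + realtime_view


merged = serving_layer(batch_layer(events, BATCH_CUTOFF),
                       speed_layer(events, BATCH_CUTOFF))
print(merged["/home"], merged["/about"])  # 3 2
```

Note that the counting logic appears twice, once per layer. Keeping those two implementations in agreement is the maintenance cost of this design.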
Lambda Architecture: Batch + stream processing for derived views.
Problems with Lambda: The same logic has to be implemented and maintained in two code paths.
Kappa Architecture
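The Kappa idea can be sketched with a single function that serves both live processing and full reprocessing. This is a simplified illustration, assuming the log is an in-memory list that can be replayed from offset 0; `build_view` and the `normalize` parameter are hypothetical.

```python
from collections import Counter

event_log = ["/home", "/About", "/home", "/about"]   # replayable log of page views


def build_view(log, normalize, start_offset=0, view=None):
    """The single code path: used for live processing and for full replays."""
    view = Counter() if view is None else view
    for page in log[start_offset:]:
        view[normalize(page)] += 1
    return view


# Version 1 of the logic treats paths as case-sensitive.
view_v1 = build_view(event_log, normalize=lambda page: page)

# Version 2 fixes that. Instead of writing a separate batch job, replay the
# log from offset 0 into a new view, then switch readers over to it.
view_v2 = build_view(event_log, normalize=str.lower)

# New events are processed incrementally by the same function.
processed = len(event_log)
event_log.append("/ABOUT")
view_v2 = build_view(event_log, normalize=str.lower,
                     start_offset=processed, view=view_v2)

print(view_v1["/about"], view_v2["/about"])  # 1 3
```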
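Kappa Architecture: Stream processing only (simpler).
Comparison: Lambda maintains two code paths to serve historical plus real-time results. Kappa keeps a single code path but requires a replayable log.
4. End-to-End Argument for Data Systems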
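One familiar instance of an end-to-end check is a digest computed at the source and verified at the destination, independent of whatever the intermediate hops check. The sketch below is only an illustration; `send` and `receive` are hypothetical helpers, and the corruption is simulated.

```python
import hashlib


def send(payload: bytes):
    """Sender attaches a digest computed at the source."""
    return payload, hashlib.sha256(payload).hexdigest()


def receive(payload: bytes, digest: str) -> bytes:
    """Receiver verifies at the destination, whatever intermediate hops checked."""
    if hashlib.sha256(payload).hexdigest() != digest:
        raise ValueError("end-to-end check failed: payload changed in transit")
    return payload


payload, digest = send(b"transfer 100 to bob")
corrupted = payload.replace(b"100", b"900")   # simulated fault in some component

delivered = receive(payload, digest)
try:
    receive(corrupted, digest)
    detected = False
except ValueError:
    detected = True

print(detected)  # True
```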
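End-to-end argument: Reliability needs checks that span the whole path from one end to the other. Database transactions alone are not enough.
Solution: Application-level end-to-end checks.
Exactly-Once Semantics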
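The idempotence approach discussed in this section can be sketched as follows, assuming the client generates one operation ID per logical request and reuses it on every retry. The `Account` class and the ID format are hypothetical.

```python
class Account:
    """Applies each operation at most once, keyed by a client-chosen operation ID."""

    def __init__(self, balance=0):
        self.balance = balance
        self.applied = {}   # operation ID -> result of the first application

    def deposit(self, op_id, amount):
        if op_id in self.applied:     # retry of an operation already applied
            return self.applied[op_id]
        self.balance += amount
        self.applied[op_id] = self.balance
        return self.balance


account = Account()
op_id = "op-001"             # hypothetical ID, generated once by the client
account.deposit(op_id, 50)
account.deposit(op_id, 50)   # network timeout -> client retries the same request

print(account.balance)  # 50
```

In a real system the balance update and the record of the operation ID would need to be committed atomically, for example in the same database transaction.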
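Challenge: Achieving exactly-once processing in distributed systems.
Idempotence as solution: Make each operation safe to retry, so that applying it more than once has the same effect as applying it once.
Duplicate Suppression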
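A minimal sketch of windowed deduplication, the technique this section names, is given below. It assumes message IDs arrive with non-decreasing timestamps and that duplicates older than the window are acceptable to miss; `WindowedDeduplicator` is a hypothetical name.

```python
from collections import OrderedDict


class WindowedDeduplicator:
    """Remembers message IDs for `window` time units; older IDs are forgotten."""

    def __init__(self, window):
        self.window = window
        self.seen = OrderedDict()   # message ID -> time first seen, oldest first

    def is_duplicate(self, msg_id, now):
        # Evict IDs that have fallen out of the window.
        while self.seen:
            oldest_id, first_seen = next(iter(self.seen.items()))
            if now - first_seen < self.window:
                break
            self.seen.popitem(last=False)
        if msg_id in self.seen:
            return True
        self.seen[msg_id] = now
        return False


dedup = WindowedDeduplicator(window=10)
results = [
    dedup.is_duplicate("m1", now=0),    # first sighting
    dedup.is_duplicate("m1", now=5),    # redelivery inside the window
    dedup.is_duplicate("m1", now=20),   # outside the window: no longer remembered
]
print(results)  # [False, True, False]
```

The window bounds memory use at the cost of letting very late duplicates through, so its size should exceed the longest retry delay the system expects.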
Methods for detecting duplicates: For example, windowed deduplication, which remembers the operation IDs seen within a recent window and drops any repeat.
5. Enforcing Constraints
Challenge: Maintaining integrity across distributed systems.
Uniqueness Constraints
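One way to picture uniqueness enforcement without a shared unique index is to route every claim for a given value to the same log partition and process each partition sequentially. The sketch below assumes four in-memory partitions and uses hypothetical helper names; the first claim in log order wins.

```python
import zlib

NUM_PARTITIONS = 4   # assumed partition count
partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(username):
    # Stable hash, so every request for the same name lands in the same partition.
    return zlib.crc32(username.encode("utf-8")) % NUM_PARTITIONS


def submit_claim(request_id, username):
    partitions[partition_for(username)].append((request_id, username))


def process_partition(requests, taken):
    """Single-threaded consumer: the first claim in log order wins."""
    results = {}
    for request_id, username in requests:
        if username in taken:
            results[request_id] = "rejected"
        else:
            taken.add(username)
            results[request_id] = "accepted"
    return results


submit_claim("req-1", "alice")
submit_claim("req-2", "bob")
submit_claim("req-3", "alice")   # conflicting claim

outcomes = {}
for requests in partitions:
    outcomes.update(process_partition(requests, taken=set()))

print(outcomes["req-1"], outcomes["req-3"])  # accepted rejected
```

Each partition can keep its own set of taken names because a given name always maps to the same partition, so partitions never need to coordinate with each other.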
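In a single database: Easy (unique index).
Across systems: Harder, because uniqueness requires coordination.
Solutions: Route all claims for the same value through a single point of coordination, such as one partition of an ordered log.
Example: Username uniqueness.
Timeliness and Integrity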
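The apology-based approach in this section can be illustrated with a small sketch: orders are accepted immediately, the stock constraint is checked later, and any order that cannot be fulfilled is compensated. The inventory, function names, and the "refund" action are all hypothetical.

```python
stock = {"widget": 2}    # hypothetical inventory
accepted_orders = []     # orders confirmed to customers right away
apologies = []           # compensating actions issued later


def place_order(order_id, item):
    """Fast path: accept without coordinating on stock (timeliness first)."""
    accepted_orders.append((order_id, item))
    return "accepted"


def reconcile():
    """Asynchronous check: enforce the constraint afterwards (integrity)."""
    for order_id, item in accepted_orders:
        if stock.get(item, 0) > 0:
            stock[item] -= 1
        else:
            apologies.append((order_id, "refund"))   # compensate the customer
    accepted_orders.clear()


for order_id in ("o1", "o2", "o3"):
    place_order(order_id, "widget")
reconcile()

print(stock["widget"], apologies)  # 0 [('o3', 'refund')]
```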
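Trade-off: Speed vs correctness (timeliness vs integrity).
Apology-based approach: Accept the operation optimistically, check constraints asynchronously, and compensate (apologize) if a violation turns up later.
Coordination-Avoidance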
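To see why monotonic operations can skip coordination, consider a state that only grows and is merged by set union. The sketch below applies the same updates in every possible order and checks that the result never differs; it is an illustration of the intuition, not a proof, and the `merge` function is a hypothetical example.

```python
from itertools import permutations


def merge(state, update):
    """Set union: the state only grows, and the merge is commutative,
    associative, and idempotent."""
    return state | update


updates = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "c"})]

# Apply the same updates in every possible order, as uncoordinated replicas might.
final_states = set()
for order in permutations(updates):
    state = frozenset()
    for update in order:
        state = merge(state, update)
    final_states.add(state)

print(len(final_states))  # 1
```

A non-monotonic operation such as removing an element does not have this property, because its outcome depends on whether it runs before or after the matching addition.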
CALM theorem: Consistency As Logical Monotonicity. Monotonic operations can avoid coordination.
6. Trust, But Verify
Principle: Don’t blindly trust components.
Auditing
Immutable event log for auditing: Record every change as an append-only event so that history can be inspected later.
Designing for Auditability
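One common way to make an event log tamper-evident is to chain entries by hash, so that each entry commits to everything before it. The sketch below is a minimal illustration under that assumption; the `GENESIS` placeholder and the helper names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64   # assumed placeholder hash for the first entry


def entry_hash(prev_hash, event):
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log, event):
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": entry_hash(prev_hash, event)})


def verify(log):
    """Recompute the chain; an edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append(audit_log, {"user": "alice", "action": "login"})
append(audit_log, {"user": "alice", "action": "delete_record", "id": 7})
before = verify(audit_log)
audit_log[0]["event"]["action"] = "logout"   # tamper with history
after = verify(audit_log)

print(before, after)  # True False
```

This check does not detect truncation of the most recent entries; catching that requires storing the latest hash somewhere the log's operator cannot rewrite.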
7. Doing the Right Thing
Ethical considerations in data systems: Privacy, fairness, and transparency.
Privacy and Data Protection
Example: GDPR compliance, which calls for practices such as data minimization and honoring the right to be forgotten.
Summary
Key Takeaways:
- Data Integration:
  - Avoid dual writes
  - Use event logs as integration backbone
  - Maintain single source of truth
- Unbundling Databases:
  - Combine specialized systems
  - Event log enables loose coupling
  - CQRS separates reads and writes
- Derived Data:
  - Distinguish system of record from derived views
  - Lambda vs Kappa architectures
  - Stream processing for maintaining views
- End-to-End Correctness:
  - Database transactions not enough
  - Need application-level checks
  - Idempotence critical for reliability
- Enforcing Constraints:
  - Uniqueness requires coordination
  - Trade-off: timeliness vs integrity
  - Some operations can avoid coordination (CALM)
- Trust and Verification:
  - Audit everything
  - Immutable event logs
  - Design for forensics
- Ethical Responsibilities:
  - Privacy by design
  - Data minimization
  - Right to be forgotten
  - Fairness and transparency
| Pattern | Pros | Cons | Use When |
|---|---|---|---|
| Traditional DB | Simple, ACID guarantees | Limited scalability, single tool | Small applications |
| Dual Writes | Appears simple | Race conditions, inconsistency | ❌ Don’t use |
| Event Log + CDC | Reliable, ordered, extensible | More complex | Multiple derived views |
| Lambda | Batch + stream | Two code paths | Historical + real-time |
| Kappa | Single code path | Requires replayable log | Event-driven systems |
| CQRS | Optimize reads/writes separately | More components | Complex read patterns |
- Unbundled architectures: Specialized tools working together
- Event-driven design: Data flows as immutable events
- Derived state: Views maintained from event log
- End-to-end thinking: Correctness at application level
- Ethical design: Privacy, fairness, and transparency
Conclusion: This concludes our journey through Designing Data-Intensive Applications. We’ve covered storage, distribution, processing, and now integration: the complete picture of building robust, scalable, and maintainable data systems.