Introduction
On the most fundamental level, a database needs to do two things:- Store data when you give it
- Retrieve data when you ask for it
- Transactional workloads (OLTP - Online Transaction Processing)
- Analytics workloads (OLAP - Online Analytical Processing)
1. Data structures that power your database
The simplest database
Let’s start with the simplest possible database:- Writes: O(1) - very fast! Just append to file
- Reads: O(n) - very slow! Must scan entire file
Indexes
An index is an additional structure derived from the primary data. It helps locate data without scanning everything.Key insight: Indexes slow down writes because you must update both the data and the index. This is an important trade-off in database design.
2. Hash indexes
The simplest indexing strategy is a hash map that keeps an in-memory index of every key’s location in the data file. Implementation:- Writes: O(1) - append to file, update hash map
- Reads: O(1) - hash map lookup, one disk seek
Compaction
Problem: The file grows forever, even for updates! Solution: Compaction. Compaction process: Keep only the latest value for each key and discard old values.Segmentation
Real implementations (like Bitcask) use segments: Why segments?- Freeze segments: Once written, immutable (easier to manage)
- Concurrent reads: Read from old segments while writing new
- Compaction: Merge old segments in background
- Crash recovery: Simpler to recover from crashes
- Hash table must fit in memory
- Range queries are not efficient (must look up each key)
3. SSTables and LSM-trees
Sorted String Table (SSTable)
An SSTable is like a segment file, but with one key difference: keys are sorted. Advantages of sorted keys:- Sparse index: Don’t need to index every key
- Range queries are efficient: Keys are sorted, so scanning a range is sequential
- Compression: Group records into blocks and compress
How to create SSTables
Since writes arrive in random order, how do we create sorted SSTables? LSM-Tree write process:Compaction strategies
Over time, we accumulate many SSTables. Need to merge them: Size-tiered compaction (used by Cassandra):- Merge SSTables of similar size
- Creates exponentially larger SSTables
- Write amplification: each write might be rewritten multiple times
- Each level is 10x larger than previous
- Within each level, SSTables have non-overlapping key ranges
- More predictable performance
4. B-trees
B-trees are the most widely used indexing structure (used by almost all relational databases and many non-relational ones).B-tree properties
- Data organized in fixed-size pages (usually 4 KB)
- Each page contains many keys and references to child pages
- Tree is balanced: All leaf pages at same depth
- Branching factor: Typically hundreds of children per page
B-tree search
B-tree vs LSM-tree
| Use Case | B-Tree | LSM-Tree |
|---|---|---|
| Write-heavy workload | Slower | Much faster |
| Read-heavy workload | Faster | Slower |
| Point lookups | One page read | Multiple SSTables |
| Range scans | Good | Excellent |
| Transaction support | Easier | More complex |
| Storage efficiency | Fragmentation | Better compression |
When to use each
When to use each
B-trees: Choose for read-heavy workloads, point lookups, and when you need strong transaction support with easy implementation.LSM-trees: Choose for write-heavy workloads, when you need better compression, or when you can tolerate slightly slower reads in exchange for much faster writes.
5. Other indexing structures
Secondary indexes
Secondary indexes are non-unique and can have duplicates. They’re crucial for querying on non-primary key columns.Clustered vs non-clustered index
Clustered index: Index contains actual row data, data stored in index order, only one per table Non-clustered index: Index contains pointer to row, data stored separately, multiple per tableMulti-column indexes
The order of columns in a composite index matters! The index can only be used efficiently if the query filters on the leftmost columns first.Full-text search indexes
Use an inverted index where each word maps to the documents containing it.6. Transaction processing vs analytics
There are two main patterns of data access:| Property | OLTP | OLAP |
|---|---|---|
| Read pattern | Small number of records by key | Aggregate over large number of records |
| Write pattern | Random-access, low-latency writes | Bulk import (ETL) or event stream |
| Used by | End user, via web application | Internal analyst, for decision support |
| Data representation | Latest state of data | History of events over time |
| Dataset size | GB to TB | TB to PB |
| Bottleneck | Disk seek time | Disk bandwidth |
Data warehousing
Why separate database for analytics?- OLTP databases optimized for transactions, not analytics
- Analytics queries are expensive, would slow down user-facing app
- Data warehouse can denormalize data for better query performance
- Can combine data from multiple sources
Star schema
- Fact table: Center of star, records events (sales transactions)
- Dimension tables: Surrounding points, describe attributes
7. Column-oriented storage
Traditional row-oriented storage stores entire rows together. Column-oriented storage stores each column separately. Advantages:- Only read columns needed for query
- Better compression (similar data together)
- Vectorized processing possible
Column compression
Columns compress very well because similar data is stored together. Example:Column-oriented storage is a huge win for analytics workloads but not suitable for OLTP workloads that need to access entire rows.
Writing to column-oriented storage
Solution: LSM-tree approach- Writes go to in-memory store (row-oriented for flexibility)
- Periodically merge to disk in column-oriented format
- Queries check both in-memory and disk storage
8. Aggregation: Data cubes and materialized views
Materialized views
A materialized view caches query results on disk for fast access, but must be updated when data changes.Data cubes
A data cube is a special kind of materialized view: a grid of aggregates grouped by different dimensions. Advantage: Extremely fast queries (just lookup in cube) Disadvantage: Less flexible (can only query on pre-aggregated dimensions)Summary
Key takeaways:-
Two main families of storage engines:
- Log-structured (LSM-trees): Append-only, fast writes
- Update-in-place (B-trees): Sorted pages, established and widely used
-
OLTP vs OLAP:
- OLTP: Short transactions, user-facing, row-oriented
- OLAP: Large aggregations, analytics, column-oriented
-
Column-oriented storage is huge win for analytics:
- Only load required columns
- Better compression
- Vectorized processing
-
Indexes speed up reads but slow down writes:
- Choose indexes based on query patterns
- Trade-off between read and write performance
-
Different workloads need different storage engines:
- No one-size-fits-all solution
- Understanding trade-offs is crucial