
SQL vs NoSQL

Choose SQL when data is relational, consistency is critical, and queries are ad-hoc. Choose NoSQL when access patterns are specific and predictable, horizontal write scalability is required, or data shape is inherently non-relational.

Decision Framework

// Use SQL (PostgreSQL) when:
→ Complex joins and ad-hoc queries
→ ACID multi-table transactions required
→ Schema is stable, well-understood

// Use NoSQL when:
→ Write throughput > single-node SQL capacity
→ Access pattern is key-value, range, or time-series
→ Data is documents, graphs, or IoT streams

SQL Databases

Strengths:
  • ACID transactions
  • Joins and foreign keys
  • Schema enforcement
  • Mature tooling and ecosystem
Examples: PostgreSQL, MySQL, SQL Server
Use for: Financial transactions, inventory, user accounts

NoSQL Databases

Strengths:
  • Horizontal write scalability
  • Flexible schema
  • Access-pattern optimized
  • High throughput
Types: Key-value (Redis), Document (MongoDB), Wide-column (Cassandra), Graph (Neo4j)
Use for: Caching, sessions, logs, metrics, social graphs
Start with PostgreSQL — it handles relational, JSONB, full-text search, and time-series workloads natively. Add specialized stores only when PostgreSQL becomes the proven bottleneck.
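To illustrate how far PostgreSQL alone can stretch, a single table can combine relational columns, JSONB documents, and full-text search. The table and index names here are hypothetical, a sketch rather than a prescribed schema:

```sql
-- Hypothetical products table: relational + document + search in one store
CREATE TABLE products (
  id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name       TEXT NOT NULL,
  price      DECIMAL(10,2),
  attrs      JSONB,                      -- flexible, schema-less attributes
  created_at TIMESTAMP DEFAULT now()
);

-- GIN indexes accelerate both document queries and text search
CREATE INDEX idx_products_attrs ON products USING GIN (attrs);
CREATE INDEX idx_products_fts
  ON products USING GIN (to_tsvector('english', name));

-- Document-style query: products whose attrs contain {"color": "red"}
SELECT id, name FROM products WHERE attrs @> '{"color": "red"}';

-- Full-text query
SELECT id, name FROM products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'wireless & mouse');
```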

NoSQL Database Types

Key-Value Stores

Pattern: GET key → value
Examples: Redis, DynamoDB, Memcached
Best for:
  • Session storage
  • Cache layer
  • Rate limiting counters
  • Real-time leaderboards
# Redis examples
SET user:1001:session "jwt_token_here" EX 3600
GET user:1001:session
INCR rate_limit:user:1001
ZADD leaderboard 9500 "player123"
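The INCR-based rate limiting shown above can be sketched in application code. The following is a minimal in-memory model of a fixed-window limiter; a real deployment would issue INCR and EXPIRE against Redis instead of mutating a Map:

```javascript
// In-memory sketch of the Redis INCR fixed-window rate limiter.
// A real deployment would call Redis; this only models the counting logic.
function createRateLimiter(limit, windowMs) {
  const counters = new Map(); // key -> { count, resetAt }
  return function allow(key, now = Date.now()) {
    const entry = counters.get(key);
    if (!entry || now >= entry.resetAt) {
      // New window: equivalent to INCR on a fresh key followed by EXPIRE
      counters.set(key, { count: 1, resetAt: now + windowMs });
      return true;
    }
    entry.count += 1; // equivalent to INCR rate_limit:user:<id>
    return entry.count <= limit;
  };
}

const allow = createRateLimiter(3, 60_000);
const results = [1, 2, 3, 4].map(() => allow('user:1001', 0));
console.log(results); // [ true, true, true, false ] — 4th request rejected
```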

NewSQL: Best of Both Worlds

NewSQL databases (CockroachDB, Google Spanner, YugabyteDB) provide SQL semantics with NoSQL horizontal write scalability:
-- CockroachDB: distributed SQL with automatic sharding
CREATE TABLE orders (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL,
  total DECIMAL(10,2),
  created_at TIMESTAMP DEFAULT now()
);

-- Automatically distributed across nodes
-- ACID transactions across shards
-- SQL interface with joins
Consider NewSQL when you need SQL semantics (joins, transactions) but have outgrown single-node write capacity. Trade-off: increased latency for distributed consensus.

SQL Query Optimization

EXPLAIN ANALYZE

The primary tuning tool for understanding query execution:
-- Before optimization
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC LIMIT 10;

-- Output shows:
Seq Scan on orders  (cost=0.00..45823.00 rows=1000000 width=120)
  Filter: (user_id = 42)
  Rows Removed by Filter: 999500
  Buffers: shared hit=8000 read=12000
Planning Time: 0.5 ms
Execution Time: 2847.3 ms
Sequential scan with high “Rows Removed by Filter” indicates a missing index. The database is scanning the entire table and discarding most rows.
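The fix is an index matching both the filter and the sort. After creating it, re-running the same EXPLAIN should show an index scan; the plan below is illustrative of the expected shape, not captured output, and actual costs depend on your data:

```sql
-- Add an index matching the WHERE and ORDER BY
CREATE INDEX idx_orders_user_created
  ON orders (user_id, created_at DESC);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 ORDER BY created_at DESC LIMIT 10;

-- Expected plan shape (illustrative):
-- Limit
--   -> Index Scan using idx_orders_user_created on orders
--        Index Cond: (user_id = 42)
```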

Index Strategy

Default index type for most use cases:
-- Single-column index
CREATE INDEX idx_user_id ON orders(user_id);

-- Composite index (order matters!)
CREATE INDEX idx_user_status_date 
  ON orders(user_id, status, created_at DESC);

-- Covering index (includes extra columns)
CREATE INDEX idx_user_status_date_covering
  ON orders(user_id, status, created_at DESC)
  INCLUDE (total, item_count);
Column order rules:
  1. Equality filters first (user_id = ?)
  2. Range filters next (created_at > ?)
  3. ORDER BY last
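Applying those rules, a query with two equality filters, a range filter, and a sort maps onto the composite index above column by column (table and index names follow the earlier examples):

```sql
-- Matches idx_user_status_date's column order exactly,
-- so the filter, range, and sort are all index-driven
SELECT id, total
FROM orders
WHERE user_id = 42                               -- equality: leading column
  AND status = 'shipped'                         -- equality: second column
  AND created_at > now() - INTERVAL '30 days'    -- range: last column
ORDER BY created_at DESC                         -- satisfied by the DESC index
LIMIT 20;
```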

Common Query Anti-Patterns

Problem: N+1 queries — a loop that issues one additional query per row
// BAD: 1 + N queries (string interpolation here also invites SQL injection)
const orders = await db.all('SELECT * FROM orders');
for (const o of orders) {
  o.user = await db.get(`SELECT * FROM users WHERE id=${o.userId}`);
}

// GOOD: 1 query with JOIN
const orders = await db.all(`
  SELECT o.*, u.name, u.email
  FROM orders o
  JOIN users u ON o.user_id = u.id
`);
Detection: APM tools (Datadog, New Relic) surface N+1 patterns as runs of repeated, near-identical spans in traces.

Problem: SELECT * fetches unused columns, wasting I/O and network bandwidth
-- BAD: fetches 50 columns, uses 2
SELECT * FROM users WHERE id = 42;

-- GOOD: explicit column list
SELECT id, name, email FROM users WHERE id = 42;
Impact: index-only scans (covering indexes) are possible only with an explicit column list — SELECT * forces a heap fetch for every row.

Problem: a missing WHERE clause accidentally modifies all rows
-- DANGEROUS: updates every row!
UPDATE orders SET status = 'cancelled';
-- Missing: WHERE id = 123

-- Safe: always include WHERE
UPDATE orders SET status = 'cancelled' WHERE id = 123;
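One way to back up that WHERE-clause discipline: run the UPDATE inside an explicit transaction, check the reported row count, and only then commit. A sketch of the workflow in psql:

```sql
BEGIN;

UPDATE orders SET status = 'cancelled' WHERE id = 123;
-- Inspect the row count reported here: "UPDATE 1" is expected.
-- "UPDATE 500000" means the WHERE clause was wrong.

ROLLBACK;   -- undo if the count looks wrong
-- COMMIT;  -- otherwise make the change permanent
```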
Enable sql_safe_updates in MySQL, or wrap bulk changes in a transaction with an explicit COMMIT, to guard against accidental full-table modifications.

Problem: a type mismatch in the WHERE clause prevents index usage
-- BAD: phone is stored as VARCHAR but compared to a number;
-- the column must be cast for every row, so its index is unusable
SELECT * FROM users WHERE phone = 5551234;
-- Forces a sequential scan!

-- GOOD: matching types — the index on phone can be used
SELECT * FROM users WHERE phone = '5551234';

Database Maintenance

-- Vacuum and analyze (PostgreSQL)
-- Reclaim space and update query planner statistics
VACUUM ANALYZE orders;

-- Auto-vacuum configuration
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_analyze_scale_factor = 0.02
);

-- Reindex (rebuild indexes to remove bloat)
REINDEX INDEX CONCURRENTLY idx_user_id;

-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;
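Indexes surfaced by that query (zero scans, large on-disk size) are candidates for removal; dropping them concurrently avoids blocking writes. The index name below is hypothetical:

```sql
-- Verify the index is unused on ALL nodes before dropping:
-- pg_stat_user_indexes is per-instance, and a read replica
-- may be the only consumer of an index.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_legacy_status;
```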
Run VACUUM ANALYZE regularly on write-heavy tables. Stale statistics cause the query planner to choose suboptimal execution plans.

Connection Pooling

Database connections are expensive: each PostgreSQL connection is a separate backend process. Connection poolers sit between the application and the database, multiplexing many client connections over a small pool of server connections.
PgBouncer is a lightweight connection pooler for PostgreSQL:
# pgbouncer.ini
[databases]
mydb = host=postgres port=5432 dbname=production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Pool settings
pool_mode = transaction        # or session, statement
max_client_conn = 10000       # app can have 10k connections
default_pool_size = 25        # but only 25 to PostgreSQL
reserve_pool_size = 5
reserve_pool_timeout = 3
Pool modes:
  • session: connection held for entire client session
  • transaction: connection returned after each transaction (recommended)
  • statement: connection returned after each statement (breaks transactions)
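The recommended transaction mode can be sketched generically: a client borrows a connection only for the duration of one transaction, so a small pool serves many concurrent clients. This toy in-memory model illustrates only the accounting; a real setup would use PgBouncer or a driver-level pool:

```javascript
// Toy transaction-mode pool: connections are checked out per transaction,
// not per client session, so N clients share far fewer connections.
class TransactionPool {
  constructor(size) {
    this.free = Array.from({ length: size }, (_, i) => ({ id: i }));
    this.inUse = 0;
    this.peakInUse = 0;
  }
  async withTransaction(work) {
    while (this.free.length === 0) {
      await new Promise((r) => setTimeout(r, 1)); // wait for a free conn
    }
    const conn = this.free.pop();
    this.inUse += 1;
    this.peakInUse = Math.max(this.peakInUse, this.inUse);
    try {
      return await work(conn); // BEGIN ... COMMIT would happen here
    } finally {
      this.inUse -= 1;
      this.free.push(conn); // returned as soon as the transaction ends
    }
  }
}

// 100 concurrent "clients" share a pool of only 5 connections
async function demo() {
  const pool = new TransactionPool(5);
  await Promise.all(
    Array.from({ length: 100 }, () =>
      pool.withTransaction(async () => { /* query work */ })
    )
  );
  console.log(pool.peakInUse); // prints 5 — never exceeds the pool size
}
demo();
```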

Polyglot Persistence

Use specialized databases for specific access patterns:
// Typical microservice data architecture
       ┌───────────────┐
       │  API Gateway  │
       └───────┬───────┘
               │
     ┌─────────┼─────────┬──────────┐
     │         │         │          │
┌────▼───┐ ┌───▼────┐ ┌──▼─────┐ ┌──▼──────┐
│ Users  │ │ Orders │ │Product │ │Analytics│
│Service │ │Service │ │Service │ │ Service │
└────┬───┘ └───┬────┘ └──┬─────┘ └──┬──────┘
     │         │         │          │
┌────▼───┐ ┌───▼────┐ ┌──▼─────┐ ┌──▼───────┐
│Postgres│ │Postgres│ │Elastic-│ │ClickHouse│
│        │ │        │ │search  │ │  (OLAP)  │
└────┬───┘ └───┬────┘ └────────┘ └──────────┘
     │         │
     └────┬────┘
          │
     ┌────▼────┐
     │  Redis  │  (shared cache + sessions)
     └─────────┘

When to Add Specialized Stores

  • Elasticsearch: Full-text search, log aggregation, analytics
  • Redis: Caching, sessions, rate limiting, real-time leaderboards
  • ClickHouse: OLAP analytics, time-series aggregation
  • S3/Object Storage: Binary files, backups, data lake
  • Neo4j: Graph traversal, recommendations
Each additional database type adds operational complexity. Add specialized stores only when PostgreSQL + proper indexing is insufficient for that specific access pattern.

Next Steps

Caching

Reduce database load with multi-layer caching strategies

Scalability

Learn about sharding, replication, and horizontal scaling

Load Balancing

Distribute database reads across replicas effectively

Availability

High availability through replication and failover
