
Database Requirements

PostgreSQL Version

GOV.UK Notify API requires PostgreSQL 15 or later. Reference: docker/Dockerfile:132

Local Development

For local development, run PostgreSQL via Docker (recommended) or install it with Homebrew:
# Using notifications-local (recommended)
# Check https://github.com/alphagov/notifications-local

# Or install via Homebrew
brew install postgresql@15
brew services start postgresql@15
Reference: README.md:50-57

Database Setup

Initial Database Creation

# Create the database
createdb notification_api

# Run migrations
flask db upgrade
Reference: Makefile:19-20

Connection Configuration

Configure database connections via environment variables:
# Primary database (read/write)
export SQLALCHEMY_DATABASE_URI='postgresql+psycopg2://user:password@localhost:5432/notification_api'

# Bulk/read replica (optional, falls back to primary)
export SQLALCHEMY_DATABASE_URI_BULK='postgresql+psycopg2://replica1:5432,replica2:5432/notification_api'
The bulk connection supports multiple hosts for high availability. Reference: config.py:112-119
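The multi-host URI above is plain libpq syntax: the driver tries each host in order until one accepts the connection. A minimal sketch of assembling such a URI (the `build_bulk_uri` helper is illustrative, not part of the codebase):

```python
# Sketch: building a multi-host psycopg2 URI for the bulk bind.
# The helper and host names are illustrative, not taken from the project.

def build_bulk_uri(hosts, database, user=None, password=None):
    """Join several host:port pairs into one libpq multi-host URI.

    libpq attempts each host in order until a connection succeeds.
    """
    host_part = ",".join(hosts)
    auth = f"{user}:{password}@" if user else ""
    return f"postgresql+psycopg2://{auth}{host_part}/{database}"

uri = build_bulk_uri(["replica1:5432", "replica2:5432"], "notification_api")
print(uri)  # postgresql+psycopg2://replica1:5432,replica2:5432/notification_api
```

Credentials, if needed, go before the first host and apply to every host in the list.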

Database Architecture

Dual Database Binds

The application uses SQLAlchemy binds for two connection types:
  1. Default bind - Primary database for transactional queries
  2. Bulk bind - Read replica for analytics and reporting queries
The bulk bind automatically:
  • Sets readonly = True on connections
  • Uses replica-specific timeouts and worker settings
  • Falls back to primary if no replica is configured
Reference: config.py:113-119, 187-190
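The fallback behaviour can be sketched as a small config helper (illustrative; the real logic lives in config.py):

```python
import os

# Sketch: resolving the two binds, with the bulk bind falling back to the
# primary URI when no replica is configured. Env var names are from this
# page's configuration section; the helper itself is illustrative.

def resolve_binds(environ=os.environ):
    primary = environ["SQLALCHEMY_DATABASE_URI"]
    bulk = environ.get("SQLALCHEMY_DATABASE_URI_BULK") or primary
    return {"default": primary, "bulk": bulk}

binds = resolve_binds({"SQLALCHEMY_DATABASE_URI": "postgresql+psycopg2://localhost/notification_api"})
print(binds["bulk"])  # falls back to the primary URI
```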

Connection Pool Settings

SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 5,              # Connections per process
    "pool_timeout": 30,          # Seconds to wait for connection
    "pool_recycle": 300,         # Recycle connections after 5 minutes
    "connect_args": {
        "connect_timeout": "5",  # Connection timeout
        "tcp_user_timeout": "5000",  # TCP timeout (milliseconds)
        "options": "-c statement_timeout=1200000"  # 20 minute fallback
    }
}
Reference: config.py:174-186

Performance Configuration

The application automatically detects primary vs. replica connections and applies the appropriate settings.
Primary Database:
  • DATABASE_STATEMENT_TIMEOUT_MS - Default: 1,200,000ms (20 minutes)
  • DATABASE_MAX_PARALLEL_WORKERS - Optional, uses database default if not set
  • Optimized for short transactional queries
Replica Database:
  • DATABASE_STATEMENT_TIMEOUT_REPLICA_MS - Default: 1,200,000ms (20 minutes)
  • DATABASE_MAX_PARALLEL_WORKERS_REPLICA - Optional
  • Allows longer-running analytical queries
Reference: config.py:200-212

Connection Event Handlers

The application configures PostgreSQL session parameters on connect:
# At connection time:
1. Check if connected to primary or replica (pg_is_in_recovery())
2. Set appropriate statement_timeout
3. Set max_parallel_workers_per_gather (if configured)
4. Record inet_server_addr for metrics
5. Set readonly mode for bulk bind
Reference: app/__init__.py:491-541
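The five steps above can be sketched against a bare DB-API cursor, so the logic is readable without a live database. `configure_session` and its parameters are illustrative stand-ins for the real event handler, and read-only mode is shown here via `default_transaction_read_only` (the application may instead set the psycopg2 connection flag):

```python
# Sketch of the connect-time logic described above, written against a plain
# DB-API cursor. Names are illustrative, not taken from the codebase.

def configure_session(cursor, primary_timeout_ms, replica_timeout_ms,
                      max_parallel_workers=None, readonly=False):
    # 1. Primary or replica? pg_is_in_recovery() is true on a streaming replica.
    cursor.execute("SELECT pg_is_in_recovery()")
    is_replica = cursor.fetchone()[0]

    # 2. Apply the statement timeout that matches the connection's role.
    timeout = replica_timeout_ms if is_replica else primary_timeout_ms
    cursor.execute(f"SET statement_timeout = {timeout}")

    # 3. Optionally cap per-query parallelism.
    if max_parallel_workers is not None:
        cursor.execute(f"SET max_parallel_workers_per_gather = {max_parallel_workers}")

    # 4. Record which server we actually reached (used as a metric label).
    cursor.execute("SELECT inet_server_addr()")
    server_addr = cursor.fetchone()[0]

    # 5. The bulk bind is marked read-only.
    if readonly:
        cursor.execute("SET default_transaction_read_only = on")

    return {"is_replica": is_replica, "server_addr": server_addr}
```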

Database Migrations

Migration System

GOV.UK Notify uses Alembic via Flask-Migrate for database migrations. Reference: migrations/README.md:1-6

Creating Migrations

# Auto-generate migration from model changes
flask db migrate -m "Description of changes"

# Manually create migration
flask db revision -m "Description of changes"

# Edit the generated migration file
vim migrations/versions/<revision>.py

# Rename to sequential number (project convention)
mv migrations/versions/<hash>_description.py migrations/versions/0123_description.py
Important: Manually rename auto-generated files to use sequential numbers instead of SHA prefixes for easier navigation. Reference: migrations/README.md:5
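Deriving the next sequential prefix from the existing filenames can be sketched like this (the helper is hypothetical, not a project script):

```python
import re

# Sketch: compute the next four-digit sequential prefix from existing
# migration filenames, per the renaming convention above.

def next_migration_prefix(filenames):
    numbers = [int(m.group(1)) for f in filenames
               if (m := re.match(r"(\d{4})_", f))]
    return f"{max(numbers, default=0) + 1:04d}"

print(next_migration_prefix(["0121_add_table.py", "0122_add_index.py"]))  # 0123
```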

Running Migrations

# Apply all pending migrations
flask db upgrade

# Apply specific migration
flask db upgrade <revision>

# Downgrade to previous version
flask db downgrade

# Downgrade to specific version
flask db downgrade <revision>

# Show current version
flask db current

# Show migration history
flask db history

Migration in Docker

# Using entrypoint
docker run notifications-api migration

# Using make
make run-migrations-with-docker
Reference: entrypoint.sh:21-22, Makefile:58-60

Migration Safety Features

1. Advisory Locks

Migrations use PostgreSQL advisory locks to prevent concurrent execution:
# Session-level lock using "alembic" as identifier
lock_id = struct.unpack(">q", struct.pack("8s", b"alembic"))[0]
connection.execute("SELECT pg_advisory_lock(:id)", {"id": lock_id})
This prevents multiple migration processes from running simultaneously. Reference: migrations/env.py:94-101

2. Lock Timeout

Migrations set a 1-second lock timeout:
connection.execute("SET lock_timeout = 1000")  # 1000ms
If a table lock cannot be acquired within 1 second, the migration fails and should be retried during a quieter period. Reference: migrations/env.py:108

3. Retry Logic

Migrations automatically retry up to 10 times with 10-second delays on lock errors:
retry_on_lock_error(
    func=run_migrations_online,
    max_retries=10,
    delay_secs=10
)
Reference: migrations/env.py:151
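A minimal sketch of such a wrapper, assuming a `LockError` stand-in for psycopg2's LockNotAvailable (the real implementation is in migrations/env.py):

```python
import time

# Sketch of the retry wrapper described above. Exception type and sleep
# behaviour are illustrative stand-ins for the real implementation.

class LockError(Exception):
    """Stand-in for psycopg2.errors.LockNotAvailable."""

def retry_on_lock_error(func, max_retries=10, delay_secs=10):
    for attempt in range(max_retries):
        try:
            return func()
        except LockError:
            if attempt == max_retries - 1:
                raise  # exhausted retries: surface the lock error
            time.sleep(delay_secs)
```

On each lock failure the wrapper sleeps and retries; any other exception propagates immediately.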

4. Transaction Per Migration

Each migration runs in its own transaction for isolation:
context.configure(
    transaction_per_migration=True
)
Reference: migrations/env.py:64, 91

5. Current Head Tracking

The system tracks the current migration head to prevent merge conflicts:
# Writes to migrations/.current-alembic-head
if engine.url.database == "notification_api":
    with open(".current-alembic-head", "w") as f:
        f.write(context.get_head_revision() + "\n")
Reference: migrations/env.py:114-119

Testing Migrations

# Test migration downgrade
./scripts/test_sql_migration_downgrade.sh

# Lint migrations locally
./scripts/lint_sql_migrations_locally.sh

# Lint migrations in CI
./scripts/lint_sql_migrations_ci.sh

Checking for Pending Migrations

# Check if there are new migrations
make check-if-migrations-to-run

# This runs:
python3 scripts/check_if_new_migration.py
Reference: Makefile:121-123

Production Migration Strategy

Pre-Deployment

  1. Test migrations thoroughly in staging environment
  2. Check lock requirements - Ensure migrations can acquire locks
  3. Review migration SQL - Verify generated DDL is safe
  4. Plan for rollback - Have downgrade script ready

Deployment Sequence

  1. Run migrations first (before deploying code)
    flask db upgrade
    
  2. Verify migration success
    flask db current
    
  3. Deploy application code
  4. Monitor for errors

Zero-Downtime Migrations

For zero-downtime deployments:
  1. Additive changes first - Add columns/tables without removing old ones
  2. Deploy code that works with both old and new schema
  3. Backfill data if needed
  4. Remove old schema in subsequent migration after code is deployed
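The additive-first pattern can be demonstrated in miniature. This sketch uses sqlite3 and an invented schema (standing in for PostgreSQL) to show why queries from old code keep working after step 1:

```python
import sqlite3

# Sketch: an additive schema change leaves existing queries working, which
# is what lets old and new code run side by side during a rolling deploy.
# (Illustrative table and column names; sqlite3 stands in for PostgreSQL.)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO notifications (status) VALUES ('sent')")

# Step 1: additive migration - a new nullable column, nothing removed.
conn.execute("ALTER TABLE notifications ADD COLUMN status_reason TEXT")

# Old code's query still works, because the columns it names are untouched.
old_row = conn.execute("SELECT id, status FROM notifications").fetchone()

# New code can use the new column; step 3 would backfill existing rows.
conn.execute("UPDATE notifications SET status_reason = 'delivered' WHERE id = 1")
new_row = conn.execute("SELECT status, status_reason FROM notifications").fetchone()
print(old_row, new_row)  # (1, 'sent') ('sent', 'delivered')
```

Only after all running code has stopped referencing the old schema is it safe to drop it in a follow-up migration.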

High-Risk Migrations

For migrations that may take locks on busy tables:
  1. Run during low-traffic period
  2. Increase lock_timeout if appropriate
  3. Consider manual execution rather than automatic deployment
  4. Have rollback plan ready

Database Maintenance

Test Database Management

# Drop all test databases
make drop-test-dbs

# Drop test databases in Docker
make drop-test-dbs-in-docker
Test databases are named:
  • test_notification_api_master - Master test database
  • test_notification_api_gw0, test_notification_api_gw1, etc. - Worker databases
Reference: Makefile:70-86

Monitoring

The application exports database metrics.
Connection Metrics:
  • db_connection_total_connected - Total connections held by server
  • db_connection_total_checked_out - Connections currently in use
  • db_connection_open_duration_seconds - How long connections are held
All labeled by:
  • bind - “default” or “bulk”
  • inet_server_addr - Actual database server IP (primary vs replica)
  • method, host, path - Request context
Reference: app/__init__.py:469-485

Backup and Recovery

Backup Strategy:
  1. Use PostgreSQL continuous archiving (WAL archiving)
  2. Take regular base backups
  3. Retain backups according to retention policy
  4. Test restore procedures regularly
Point-in-Time Recovery:
# Restore a base backup taken with pg_basebackup (a file-level copy into the
# data directory), or restore a logical dump with pg_restore:
pg_restore -d notification_api /path/to/backup

# For WAL-based recovery, set the target in postgresql.conf before starting:
recovery_target_time = '2024-03-03 14:30:00'

Troubleshooting

Connection Pool Exhausted

Symptoms: “QueuePool limit of size X overflow Y reached”
Solutions:
  1. Increase pool_size in SQLALCHEMY_ENGINE_OPTIONS
  2. Check for connection leaks (unclosed sessions)
  3. Verify pool_recycle is set to prevent stale connections
  4. Monitor db_connection_total_checked_out metric

Statement Timeout

Symptoms: sqlalchemy.exc.OperationalError: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout
Solutions:
  1. Optimize slow queries
  2. Increase DATABASE_STATEMENT_TIMEOUT_MS if appropriate
  3. Use bulk bind for long-running analytics queries
  4. Add indexes to improve query performance

Lock Timeout During Migrations

Symptoms: psycopg2.errors.LockNotAvailable
Solutions:
  1. Run migration during low-traffic period
  2. Wait for automatic retry (up to 10 attempts)
  3. Manually terminate blocking queries if safe
  4. Consider increasing lock_timeout for specific migration

Replica Lag

Symptoms: Stale data from bulk queries
Solutions:
  1. Monitor replication lag
  2. Use primary database for time-sensitive queries
  3. Increase replica resources if consistently lagging
  4. Check network connectivity between primary and replica

Connection Refused

Symptoms: psycopg2.OperationalError: could not connect to server
Solutions:
  1. Verify database is running: pg_isready -h <host> -p <port>
  2. Check SQLALCHEMY_DATABASE_URI is correct
  3. Verify network connectivity and firewall rules
  4. Check PostgreSQL pg_hba.conf allows connections
  5. Verify credentials are correct
