Database Requirements
PostgreSQL Version
GOV.UK Notify API requires PostgreSQL 15 or later. Reference: docker/Dockerfile:132

Local Development
For local development, use Postgres.app or Docker.

Database Setup
Initial Database Creation
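The original snippet for this step is not reproduced here; below is a minimal sketch of creating the development databases with psycopg2. The database names are assumptions inferred from the test-database names later in this document.

```python
# Sketch only: the database names below are assumptions, not
# necessarily Notify's actual local database names.
DATABASES = ["notification_api", "test_notification_api_master"]

def create_database_sql(name: str) -> str:
    # CREATE DATABASE cannot be parameterised, so validate the name
    # before interpolating it into the statement.
    if not name.replace("_", "").isalnum():
        raise ValueError(f"unsafe database name: {name!r}")
    return f'CREATE DATABASE "{name}"'

def create_databases(host: str = "localhost", port: int = 5432) -> None:
    import psycopg2  # deferred so this sketch loads without the driver
    conn = psycopg2.connect(host=host, port=port, dbname="postgres")
    conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
    try:
        with conn.cursor() as cur:
            for name in DATABASES:
                cur.execute(create_database_sql(name))
    finally:
        conn.close()
```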
Connection Configuration
Configure database connections via environment variables.

Database Architecture
Dual Database Binds
The application uses SQLAlchemy binds for two connection types:

- Default bind - Primary database for transactional queries
- Bulk bind - Read replica for analytics and reporting queries
- Sets `readonly = True` on connections
- Uses replica-specific timeouts and worker settings
- Falls back to primary if no replica is configured
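A minimal sketch of how the two binds could be configured with Flask-SQLAlchemy (the replica environment variable name is an assumption; `SQLALCHEMY_DATABASE_URI` and the `"bulk"` bind name appear elsewhere in this document):

```python
import os

# Primary (default bind) - used for transactional queries.
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SQLALCHEMY_DATABASE_URI", "postgresql://localhost/notification_api"
)

# Bulk bind - read replica when configured, otherwise fall back
# to the primary database.
SQLALCHEMY_BINDS = {
    "bulk": os.environ.get("SQLALCHEMY_DATABASE_URI_REPLICA")
    or SQLALCHEMY_DATABASE_URI,
}
```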
Connection Pool Settings
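The original settings are not reproduced here; the sketch below shows the shape such configuration usually takes with Flask-SQLAlchemy. Only `SQLALCHEMY_POOL_SIZE` and `pool_recycle` are mentioned elsewhere in this document; the other values are illustrative assumptions.

```python
import os

SQLALCHEMY_POOL_SIZE = int(os.environ.get("SQLALCHEMY_POOL_SIZE", "5"))

SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": SQLALCHEMY_POOL_SIZE,  # persistent connections per process
    "max_overflow": 10,     # extra connections allowed under burst load
    "pool_timeout": 30,     # seconds to wait for a free connection
    "pool_recycle": 300,    # recycle connections to avoid stale ones
    "pool_pre_ping": True,  # probe connections before handing them out
}
```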
Performance Configuration
The application automatically detects primary vs. replica connections and applies appropriate settings:

Primary Database:
- `DATABASE_STATEMENT_TIMEOUT_MS` - Default: 1,200,000ms (20 minutes)
- `DATABASE_MAX_PARALLEL_WORKERS` - Optional, uses database default if not set
- Optimized for short transactional queries

Replica Database:
- `DATABASE_STATEMENT_TIMEOUT_REPLICA_MS` - Default: 1,200,000ms (20 minutes)
- `DATABASE_MAX_PARALLEL_WORKERS_REPLICA` - Optional
- Allows longer-running analytical queries
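As a sketch, the per-connection settings above might be derived like this (the environment variable names and defaults come from this document; the helper function itself is hypothetical):

```python
import os

def session_settings(is_replica: bool) -> dict:
    """Return PostgreSQL session settings for a primary or replica connection."""
    suffix = "_REPLICA" if is_replica else ""
    settings = {
        # Default 1,200,000 ms (20 minutes) for both primary and replica.
        "statement_timeout": int(
            os.environ.get(f"DATABASE_STATEMENT_TIMEOUT{suffix}_MS", "1200000")
        ),
    }
    workers = os.environ.get(f"DATABASE_MAX_PARALLEL_WORKERS{suffix}")
    if workers is not None:
        # Only override when set; otherwise keep the database default.
        settings["max_parallel_workers"] = int(workers)
    return settings
```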
Connection Event Handlers
The application configures PostgreSQL session parameters on connect.

Database Migrations
Migration System
GOV.UK Notify uses Alembic via Flask-Migrate for database migrations. Reference: migrations/README.md:1-6

Creating Migrations
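The command listings for these subsections are not shown; as a sketch, migrations can be created and applied with the Flask-Migrate CLI (`flask db migrate`, `flask db upgrade`) or programmatically via its Python API. The `create_app` factory name below is a hypothetical stand-in for the real application factory:

```python
def create_revision(message: str) -> None:
    # Autogenerate a revision (CLI equivalent: flask db migrate -m "...").
    from flask_migrate import migrate  # deferred: requires Flask-Migrate
    from app import create_app         # hypothetical application factory
    with create_app().app_context():
        migrate(message=message)

def apply_migrations() -> None:
    # Apply all pending migrations (CLI equivalent: flask db upgrade).
    from flask_migrate import upgrade
    from app import create_app
    with create_app().app_context():
        upgrade()
```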
Running Migrations
Migration in Docker
Migration Safety Features
1. Advisory Locks
Migrations use PostgreSQL advisory locks to prevent concurrent execution.

2. Lock Timeout
Migrations set a 1-second lock timeout.

3. Retry Logic
Migrations automatically retry up to 10 times with 10-second delays on lock errors.

4. Transaction Per Migration
Each migration runs in its own transaction for isolation.

5. Current Head Tracking
The system tracks the current migration head to prevent merge conflicts.

Testing Migrations
Checking for Pending Migrations
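The original listing is not shown; one way to check for pending migrations programmatically is to compare the database's current revision with the head of the `migrations/` directory. This is a sketch using Alembic's public API; the config path and helper names are assumptions:

```python
def has_pending_migrations(current_revision, head_revision) -> bool:
    # Pending when the database is behind, or has never been stamped.
    return current_revision != head_revision

def check(database_uri: str, migrations_dir: str = "migrations") -> bool:
    # Deferred imports: Alembic and SQLAlchemy are only needed at runtime.
    from alembic.config import Config
    from alembic.script import ScriptDirectory
    from alembic.runtime.migration import MigrationContext
    from sqlalchemy import create_engine

    config = Config(f"{migrations_dir}/alembic.ini")
    config.set_main_option("script_location", migrations_dir)
    head = ScriptDirectory.from_config(config).get_current_head()

    with create_engine(database_uri).connect() as connection:
        current = MigrationContext.configure(connection).get_current_revision()
    return has_pending_migrations(current, head)
```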
Production Migration Strategy
Pre-Deployment
- Test migrations thoroughly in staging environment
- Check lock requirements - Ensure migrations can acquire locks
- Review migration SQL - Verify generated DDL is safe
- Plan for rollback - Have downgrade script ready
Deployment Sequence
1. Run migrations first (before deploying code)
2. Verify migration success
3. Deploy application code
4. Monitor for errors
Zero-Downtime Migrations
For zero-downtime deployments:

- Additive changes first - Add columns/tables without removing old ones
- Deploy code that works with both old and new schema
- Backfill data if needed
- Remove old schema in subsequent migration after code is deployed
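A sketch of the first step, the additive migration (the table, column, and revision ids are invented for illustration; imports are deferred so the sketch stands alone without Alembic installed):

```python
revision = "0002_add_reply_to_column"  # invented revision ids
down_revision = "0001_initial"

def upgrade():
    import sqlalchemy as sa
    from alembic import op
    # Additive only: a nullable column with no constraints keeps the old
    # code path working while the new code is rolled out.
    op.add_column(
        "services", sa.Column("reply_to_email", sa.String(), nullable=True)
    )

def downgrade():
    from alembic import op
    op.drop_column("services", "reply_to_email")
```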
High-Risk Migrations
For migrations that may take locks on busy tables:

- Run during low-traffic period
- Increase lock_timeout if appropriate
- Consider manual execution rather than automatic deployment
- Have rollback plan ready
Database Maintenance
Test Database Management
- `test_notification_api_master` - Master test database
- `test_notification_api_gw0`, `test_notification_api_gw1`, etc. - Worker databases
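The `gw0`, `gw1` suffixes follow pytest-xdist worker ids. A sketch of how a test fixture might pick its database; the mechanism shown is an assumption, not Notify's actual fixture code:

```python
import os

def worker_database_name(base: str = "test_notification_api") -> str:
    # pytest-xdist sets PYTEST_XDIST_WORKER to "gw0", "gw1", ... per
    # worker process; when unset, use the master database.
    worker = os.environ.get("PYTEST_XDIST_WORKER")
    return f"{base}_{worker}" if worker else f"{base}_master"
```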
Monitoring
The application exports database metrics.

Connection Metrics:
- `db_connection_total_connected` - Total connections held by server
- `db_connection_total_checked_out` - Connections currently in use
- `db_connection_open_duration_seconds` - How long connections are held
Labels:
- `bind` - “default” or “bulk”
- `inet_server_addr` - Actual database server IP (primary vs replica)
- `method`, `host`, `path` - Request context
Backup and Recovery
Backup Strategy:

- Use PostgreSQL continuous archiving (WAL archiving)
- Take regular base backups
- Retain backups according to retention policy
- Test restore procedures regularly
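For the base backups, `pg_basebackup` is the standard tool. A sketch of invoking it from Python (the flags shown are real `pg_basebackup` options, but tune them to your environment and retention policy):

```python
def base_backup_command(host: str, target_dir: str) -> list:
    return [
        "pg_basebackup",
        "-h", host,        # server to back up
        "-D", target_dir,  # directory to write the backup into
        "-X", "stream",    # stream WAL alongside the base backup
        "-P",              # report progress
    ]

def run_base_backup(host: str, target_dir: str) -> None:
    import subprocess  # deferred: only needed when actually running
    subprocess.run(base_backup_command(host, target_dir), check=True)
```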
Troubleshooting
Connection Pool Exhausted
Symptoms: “QueuePool limit of size X overflow Y reached”

Solutions:
- Increase `SQLALCHEMY_POOL_SIZE`
- Check for connection leaks (unclosed sessions)
- Verify `pool_recycle` is set to prevent stale connections
- Monitor the `db_connection_total_checked_out` metric
Statement Timeout
Symptoms: `sqlalchemy.exc.OperationalError: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout`
Solutions:
- Optimize slow queries
- Increase `DATABASE_STATEMENT_TIMEOUT_MS` if appropriate
- Use the bulk bind for long-running analytics queries
- Add indexes to improve query performance
Lock Timeout During Migrations
Symptoms: `psycopg2.errors.LockNotAvailable`
Solutions:
- Run migration during low-traffic period
- Wait for automatic retry (up to 10 attempts)
- Manually terminate blocking queries if safe
- Consider increasing lock_timeout for specific migration
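The automatic retry described under “Migration Safety Features” can be sketched as follows (`run_migration` and the exception class are stand-ins for the real Alembic runner and `psycopg2.errors.LockNotAvailable`):

```python
import time

class LockNotAvailable(Exception):
    """Stand-in for psycopg2.errors.LockNotAvailable."""

def run_with_retries(run_migration, attempts=10, delay_seconds=10, sleep=time.sleep):
    # Up to 10 attempts with 10-second delays, per this document.
    for attempt in range(1, attempts + 1):
        try:
            return run_migration()
        except LockNotAvailable:
            if attempt == attempts:
                raise  # out of retries; surface the lock error
            sleep(delay_seconds)  # wait before the next attempt
```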
Replica Lag
Symptoms: Stale data from bulk queries

Solutions:
- Monitor replication lag
- Use primary database for time-sensitive queries
- Increase replica resources if consistently lagging
- Check network connectivity between primary and replica
Connection Refused
Symptoms: `psycopg2.OperationalError: could not connect to server`
Solutions:
- Verify database is running: `pg_isready -h <host> -p <port>`
- Check `SQLALCHEMY_DATABASE_URI` is correct
- Verify network connectivity and firewall rules
- Check PostgreSQL `pg_hba.conf` allows connections
- Verify credentials are correct