Overview
GOV.UK Notify API is deployed as a containerized application using Docker. The application consists of multiple components:

- Web API (Gunicorn with eventlet workers)
- Celery workers for background processing
- Celery Beat for scheduled tasks
- PostgreSQL database
- Redis for caching
- AWS SQS for task queues
Prerequisites
System Requirements
- Python 3.13
- PostgreSQL 15+
- Redis (optional, enabled via `REDIS_ENABLED=1`)
- AWS credentials with access to SQS, S3, and SES
- Docker (for containerized deployment)
Build Tools
Building the Application
Docker Build
The application uses a multi-stage Dockerfile with separate build targets.

Local Build
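A local image build could look like the following. The `production` target name, image tag, and port are assumptions for illustration, not taken from the actual Dockerfile:

```shell
# Build the final stage of the multi-stage Dockerfile (target name assumed)
docker build --target production -t notify-api:local .

# Run it locally, passing environment configuration from a file (port assumed)
docker run --rm --env-file .env -p 6011:6011 notify-api:local
```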
Deployment Process
1. Database Migrations
Always run migrations before deploying new code. The migration process uses:

- Session-level advisory locks to prevent concurrent migrations
- 1-second lock timeout on table locks
- Transaction-per-migration for safety
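Assuming the standard Flask-Migrate/Alembic setup, the migration step is typically:

```shell
# Apply all pending migrations before new application code is rolled out
flask db upgrade
```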
2. Deploy API Service
The API runs using Gunicorn with eventlet workers, configured with:

- 4 workers (configurable)
- Eventlet worker class for async operations
- 8 worker connections (limits runaway greenthreads)
- 30-second timeout (configurable via `HTTP_SERVE_TIMEOUT_SECONDS`)
- Keepalive disabled by default
- StatsD integration for metrics
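An invocation matching the settings above might look like this; the module path and bind address are assumptions:

```shell
# 4 eventlet workers, 8 connections each, 30 s timeout, keepalive disabled
gunicorn application:application \
  --workers 4 \
  --worker-class eventlet \
  --worker-connections 8 \
  --timeout "${HTTP_SERVE_TIMEOUT_SECONDS:-30}" \
  --keep-alive 0 \
  --bind 0.0.0.0:6011
```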
3. Deploy Celery Workers
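A worker can be pinned to a queue along these lines; the app module and queue name are assumptions, and real queue names carry the `NOTIFICATION_QUEUE_PREFIX`:

```shell
# Dedicated worker pool for one queue; concurrency taken from the environment
celery -A run_celery worker \
  --loglevel INFO \
  --concurrency "${CONCURRENCY:-4}" \
  --queues "${NOTIFICATION_QUEUE_PREFIX}database-tasks"
```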
Multiple specialized worker types handle different queues.

4. Deploy Celery Beat
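Beat runs as a single scheduler process; the app module name is an assumption:

```shell
# Exactly one Beat instance should run per environment
celery -A run_celery beat --loglevel INFO
```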
Celery Beat schedules periodic tasks.

Deployment Environments
Development
Production
Production deployments use:

- Container orchestration (ECS recommended)
- Separate worker pools for different queue types
- Auto-scaling based on queue depth and CPU
- Health checks on the `/status` endpoint
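A container health check or load balancer probe can poll that endpoint; a sketch, with host and port assumed:

```shell
# Fail the health check unless /status returns HTTP 200
curl --fail --silent --max-time 5 http://localhost:6011/status || exit 1
```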
Health Checks
The API provides health check endpoints, including `/status`.

Scaling Considerations
API Workers
- Default: 4 Gunicorn workers
- Each worker handles 8 concurrent connections
- Scale horizontally based on CPU and request latency
Celery Workers
- Default concurrency: 4 (configurable via the `CONCURRENCY` env var)
- Can override `CELERYD_PREFETCH_MULTIPLIER` for long-running tasks
- Separate worker pools prevent queue starvation
Database Connections
- Default pool size: 5 connections per process
- Pool timeout: 30 seconds
- Connection recycling: 300 seconds
- Configurable via `SQLALCHEMY_POOL_SIZE`
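With Flask-SQLAlchemy, these defaults correspond to engine options along these lines. This is a sketch; the option keys are standard SQLAlchemy pool parameters:

```python
# Sketch: engine/pool settings matching the defaults listed above
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 5,       # connections per process (SQLALCHEMY_POOL_SIZE)
    "pool_timeout": 30,   # seconds to wait for a free connection
    "pool_recycle": 300,  # recycle connections older than 300 s
}
```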
Zero-Downtime Deployments
1. Deploy database migrations first (they are backward compatible)
2. Deploy new API containers - Gunicorn gracefully handles SIGTERM
3. Monitor health checks - ensure new containers are healthy
4. Deploy Celery workers - old workers finish current tasks before shutdown
5. Deploy Celery Beat last - only one instance should run
Rollback Procedure
- Revert to previous container image
- Check database migrations - Migrations are not automatically rolled back
- Monitor error rates and queue depths
- If database rollback needed, run migration downgrade:
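Assuming Flask-Migrate, a one-revision downgrade would be:

```shell
# Step the schema back one revision (verify the target revision first)
flask db downgrade -1
```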
Running One-Off Tasks
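A one-off task invocation might look like this; the command group and command name are illustrative assumptions, not taken from the codebase:

```shell
# Hypothetical example - list the Flask CLI commands available in the container
flask --help

# Hypothetical one-off command invocation (name is illustrative)
flask command fix-billable-units
```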
Tasks can be run through Flask CLI commands.

Performance Monitoring
The application exports metrics to:

- StatsD - configured via `STATSD_HOST`
- Prometheus - multiprocess mode for Gunicorn workers
Troubleshooting
Worker Not Processing Tasks
- Check SQS queue visibility
- Verify `NOTIFICATION_QUEUE_PREFIX` matches queue names
- Check worker logs for connection errors
- Verify AWS credentials and permissions
Database Connection Errors
- Check connection pool exhaustion metrics
- Increase `SQLALCHEMY_POOL_SIZE` if needed
- Check for long-running queries blocking connections
- Verify `statement_timeout` settings
High Memory Usage
- Check for memory leaks in long-running Celery workers
- Restart workers periodically using `--max-tasks-per-child`
- Monitor eventlet greenthread creation
Security Considerations
- Run containers as a non-root user (`notify:notify`)
- Use secrets management for sensitive environment variables
- Rotate API keys regularly (DVLA keys rotate monthly)
- Enable TLS for all external connections
- Use VPC endpoints for AWS services
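The non-root user from the first point can be created in the final Dockerfile stage; a sketch, assuming Debian-style user management commands:

```dockerfile
# Create an unprivileged user/group and drop root before the entrypoint runs
RUN groupadd --system notify && useradd --system --gid notify notify
USER notify:notify
```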