
Overview

The GOV.UK Notify API is deployed as a containerized application using Docker. The application consists of the following components:
  • Web API (Gunicorn with eventlet workers)
  • Celery workers for background processing
  • Celery Beat for scheduled tasks
  • PostgreSQL database
  • Redis for caching
  • AWS SQS for task queues

Prerequisites

System Requirements

  • Python 3.13
  • PostgreSQL 15+
  • Redis (optional, enabled via REDIS_ENABLED=1)
  • AWS credentials with access to SQS, S3, and SES
  • Docker (for containerized deployment)

Build Tools

# Install uv for Python dependency management
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install pre-commit for code quality
brew install pre-commit
pre-commit install --install-hooks
Reference: README.md:76-89

Building the Application

Docker Build

The application uses a multi-stage Dockerfile with the following targets:
# Production image
docker build -f docker/Dockerfile --target production -t notifications-api .

# Test image (includes test dependencies)
docker build -f docker/Dockerfile --target test -t notifications-api:test .
Reference: docker/Dockerfile:1-163

Local Build

# Bootstrap the application
make bootstrap

# Generate version file
make generate-version-file
Reference: Makefile:16-20

Deployment Process

1. Database Migrations

Always run migrations before deploying new code:
# Check if there are migrations to run
make check-if-migrations-to-run

# Run migrations
flask db upgrade

# In Docker/ECS
docker run notifications-api migration
Migrations use Alembic with:
  • Session-level advisory locks to prevent concurrent migrations
  • 1-second lock timeout on table locks
  • Transaction-per-migration for safety
Reference: migrations/env.py:94-111
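The lock-then-migrate sequence can be sketched as follows. This is an illustrative Python fragment, not the actual code in migrations/env.py; the helper name and the lock key constant are assumptions, while pg_advisory_lock and lock_timeout are real PostgreSQL features.

```python
# Illustrative sketch of the advisory-lock pattern described above.
# MIGRATION_LOCK_KEY and the helper name are assumptions, not Notify's code.
MIGRATION_LOCK_KEY = 1

def migration_lock_statements(lock_key=MIGRATION_LOCK_KEY, timeout_ms=1000):
    """SQL to run before migrating: take a session-level advisory lock so
    only one process migrates at a time, and set a short lock_timeout so
    table locks fail fast instead of queueing behind application traffic."""
    return [
        f"SELECT pg_advisory_lock({lock_key})",
        f"SET lock_timeout = '{timeout_ms}ms'",
    ]
```

The session-level lock is released automatically when the migrating connection closes, so a crashed migration does not leave the lock held.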

2. Deploy API Service

The API runs using Gunicorn with eventlet workers:
# Using entrypoint.sh
./entrypoint.sh api

# Direct command
gunicorn -c gunicorn_config.py application
Gunicorn Configuration:
  • 4 workers (configurable)
  • Eventlet worker class for async operations
  • 8 worker connections (limits runaway greenthreads)
  • 30-second timeout (configurable via HTTP_SERVE_TIMEOUT_SECONDS)
  • Keepalive disabled by default
  • StatsD integration for metrics
Reference: gunicorn_config.py:18-23
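A minimal gunicorn_config.py matching the bullets above might look like this. HTTP_SERVE_TIMEOUT_SECONDS comes from the source; GUNICORN_WORKERS is an assumed variable name for the "configurable" worker count, and the exact contents of the real file may differ.

```python
import os

# Sketch of a Gunicorn config file with the settings listed above.
workers = int(os.environ.get("GUNICORN_WORKERS", "4"))  # assumed env var name
worker_class = "eventlet"   # async (greenthread) workers
worker_connections = 8      # caps concurrent greenthreads per worker
timeout = int(os.environ.get("HTTP_SERVE_TIMEOUT_SECONDS", "30"))
keepalive = 0               # keepalive disabled by default
```

Gunicorn reads these module-level names directly when started with `gunicorn -c gunicorn_config.py application`.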

3. Deploy Celery Workers

Multiple specialized worker types handle different queues:
# Generic worker (all queues)
./entrypoint.sh worker

# Specialized workers
./entrypoint.sh api-worker-sender           # SMS/Email sending
./entrypoint.sh api-worker-letters          # Letter processing
./entrypoint.sh api-worker-jobs             # Batch jobs
./entrypoint.sh api-worker-receipts         # Delivery receipts
./entrypoint.sh api-worker-periodic         # Periodic tasks
./entrypoint.sh api-worker-reporting        # Reporting tasks
./entrypoint.sh api-worker-internal         # Internal Notify tasks
./entrypoint.sh api-worker-service-callbacks # Service callbacks
./entrypoint.sh api-worker-retry-tasks      # Retries
./entrypoint.sh api-worker-research         # Research mode
Reference: entrypoint.sh:11-66

4. Deploy Celery Beat

Celery Beat schedules periodic tasks:
./entrypoint.sh celery-beat
Only run one instance of Celery Beat per environment.
Reference: entrypoint.sh:67-68

Deployment Environments

Development

export NOTIFY_ENVIRONMENT='development'

# Run Flask development server
make run-flask

# Run Celery worker
make run-celery

# Run Celery Beat
make run-celery-beat
Reference: README.md:90-104

Production

Production deployments use:
  • Container orchestration (ECS recommended)
  • Separate worker pools for different queue types
  • Auto-scaling based on queue depth and CPU
  • Health checks on the /status endpoint

Health Checks

The API provides health check endpoints:
# Basic health check
curl http://localhost:6011/status

# Response format
{
  "status": "ok",
  "db": "ok",
  "git_commit": "<commit-sha>",
  "build_time": "<timestamp>"
}
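A deployment script can gate on this response before shifting traffic. A minimal sketch, using only the field names shown in the example above:

```python
import json

def is_healthy(payload: str) -> bool:
    """Return True when both the app and its database report 'ok'."""
    body = json.loads(payload)
    return body.get("status") == "ok" and body.get("db") == "ok"
```

For example, feed it the body returned by `curl http://localhost:6011/status` and treat a False result as a failed health check.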

Scaling Considerations

API Workers

  • Default: 4 Gunicorn workers
  • Each worker handles up to 8 concurrent connections
  • Scale horizontally based on CPU and request latency

Celery Workers

  • Default concurrency: 4 (configurable via CONCURRENCY env var)
  • Prefetch is tunable via CELERYD_PREFETCH_MULTIPLIER (lower it for long-running tasks)
  • Separate worker pools prevent queue starvation
Reference: config.py:480-482
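The worker invocation implied by these settings could be assembled like this. The flag names are standard Celery CLI options; the env var defaults mirror the bullets above (Celery's own default prefetch multiplier is 4), and the helper itself is illustrative, not Notify's code.

```python
import os

def celery_worker_args(queue: str) -> list:
    """Build a Celery worker command line from the env vars described above.
    A prefetch of 1 suits long-running tasks; higher values suit short ones."""
    concurrency = os.environ.get("CONCURRENCY", "4")
    prefetch = os.environ.get("CELERYD_PREFETCH_MULTIPLIER", "4")
    return [
        "celery", "worker",
        "--queues", queue,
        "--concurrency", concurrency,
        "--prefetch-multiplier", prefetch,
    ]
```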

Database Connections

  • Default pool size: 5 connections per process
  • Pool timeout: 30 seconds
  • Connection recycling: 300 seconds
  • Configurable via SQLALCHEMY_POOL_SIZE
Reference: config.py:174-186
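Expressed as SQLAlchemy engine options, the pool settings above look roughly like this. The key names are real SQLAlchemy pool parameters; the helper and the dict shape are illustrative rather than a copy of config.py.

```python
import os

def engine_options() -> dict:
    """Connection pool settings from the bullets above; only pool_size is
    tunable via the SQLALCHEMY_POOL_SIZE environment variable."""
    return {
        "pool_size": int(os.environ.get("SQLALCHEMY_POOL_SIZE", "5")),
        "pool_timeout": 30,   # seconds to wait for a free connection
        "pool_recycle": 300,  # recycle connections every 5 minutes
    }
```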

Zero-Downtime Deployments

  1. Deploy database migrations first (they must be backward compatible with the running code)
  2. Deploy new API containers - Gunicorn gracefully handles SIGTERM
  3. Monitor health checks - Ensure new containers are healthy
  4. Deploy Celery workers - Old workers finish current tasks before shutdown
  5. Deploy Celery Beat last - Only one instance should run
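Step 2 relies on the standard graceful-shutdown pattern: trap SIGTERM, stop accepting new work, and let in-flight work drain before exiting. A minimal sketch of that pattern (not Notify's actual handler — Gunicorn and Celery implement this internally):

```python
import signal

shutting_down = False

def _handle_sigterm(signum, frame):
    # Flip a flag; the main loop finishes in-flight work, then exits.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, _handle_sigterm)

# Simulate the orchestrator sending SIGTERM during a rolling deploy:
signal.raise_signal(signal.SIGTERM)
```

After the handler runs, a serving loop would check the flag between requests and stop pulling new work once it is set.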

Rollback Procedure

  1. Revert to previous container image
  2. Check database migrations - Migrations are not automatically rolled back
  3. Monitor error rates and queue depths
  4. If a database rollback is needed, run a migration downgrade:
flask db downgrade <revision>

Running One-Off Tasks

Tasks can be run through Flask commands:
# List available commands
flask command --help

# Example: Purge functional test data
flask command purge_functional_test_data -u <prefix>

# In ECS, use ecs-exec.sh script
./scripts/ecs-exec/ecs-exec.sh
<select notify-api>
flask command <command-name>
Reference: README.md:128-146

Performance Monitoring

The application exports metrics to:
  • StatsD - Configured via STATSD_HOST
  • Prometheus - Multiprocess mode for Gunicorn workers
export PROMETHEUS_MULTIPROC_DIR="/tmp"
export STATSD_HOST="statsd.example.com"
Reference: entrypoint.sh:3

Troubleshooting

Worker Not Processing Tasks

  1. Check SQS queue visibility
  2. Verify NOTIFICATION_QUEUE_PREFIX matches queue names
  3. Check worker logs for connection errors
  4. Verify AWS credentials and permissions

Database Connection Errors

  1. Check connection pool exhaustion metrics
  2. Increase SQLALCHEMY_POOL_SIZE if needed
  3. Check for long-running queries blocking connections
  4. Verify statement_timeout settings

High Memory Usage

  1. Check for memory leaks in long-running Celery workers
  2. Restart workers periodically using max-tasks-per-child
  3. Monitor eventlet greenthread creation
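Worker recycling (item 2) maps to Celery's worker_max_tasks_per_child setting, which is a real Celery option; the value below is an arbitrary example, and the plain dict stands in for a Celery app config.

```python
# Hypothetical Celery config fragment: recycle each worker process after it
# has executed this many tasks, bounding slow memory growth in long-running
# workers.
celery_config = {
    "worker_max_tasks_per_child": 500,  # example value, tune per workload
}
```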

Security Considerations

  • Run containers as non-root user (notify:notify)
  • Use secrets management for sensitive environment variables
  • Rotate API keys regularly (DVLA keys rotate monthly)
  • Enable TLS for all external connections
  • Use VPC endpoints for AWS services
Reference: docker/Dockerfile:45-46
