Overview
GOV.UK Notify API is deployed as a containerized application using Docker. The application consists of multiple components:

- Web API (Gunicorn with eventlet workers)
- Celery workers for background processing
- Celery Beat for scheduled tasks
- PostgreSQL database
- Redis for caching
- AWS SQS for task queues
Prerequisites
System Requirements
- Python 3.13
- PostgreSQL 15+
- Redis (optional, enabled via `REDIS_ENABLED=1`)
- AWS credentials with access to SQS, S3, and SES
- Docker (for containerized deployment)
Build Tools
Building the Application
Docker Build
The application uses a multi-stage Dockerfile with separate build targets.

Local Build
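A local image build could look like the following. The `production` target name, image tag, and port are assumptions for illustration, not taken from the actual Dockerfile:

```shell
# Build the final stage of the multi-stage Dockerfile (target name assumed)
docker build --target production -t notify-api:local .

# Run it locally, passing environment configuration from a file (port assumed)
docker run --rm --env-file .env -p 6011:6011 notify-api:local
```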
Deployment Process
1. Database Migrations
Always run migrations before deploying new code. The migration process uses:

- Session-level advisory locks to prevent concurrent migrations
- 1-second lock timeout on table locks
- Transaction-per-migration for safety
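Assuming the standard Flask-Migrate/Alembic setup, the migration step is typically:

```shell
# Apply all pending migrations before new application code is rolled out
flask db upgrade
```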
2. Deploy API Service
The API runs using Gunicorn with eventlet workers, configured with:

- 4 workers (configurable)
- Eventlet worker class for async operations
- 8 worker connections (limits runaway greenthreads)
- 30-second timeout (configurable via `HTTP_SERVE_TIMEOUT_SECONDS`)
- Keepalive disabled by default
- StatsD integration for metrics
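An invocation matching the settings above might look like this; the module path and bind address are assumptions:

```shell
# 4 eventlet workers, 8 connections each, 30 s timeout, keepalive disabled
gunicorn application:application \
  --workers 4 \
  --worker-class eventlet \
  --worker-connections 8 \
  --timeout "${HTTP_SERVE_TIMEOUT_SECONDS:-30}" \
  --keep-alive 0 \
  --bind 0.0.0.0:6011
```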
3. Deploy Celery Workers
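A worker can be pinned to a queue along these lines; the app module and queue name are assumptions, and real queue names carry the `NOTIFICATION_QUEUE_PREFIX`:

```shell
# Dedicated worker pool for one queue; concurrency taken from the environment
celery -A run_celery worker \
  --loglevel INFO \
  --concurrency "${CONCURRENCY:-4}" \
  --queues "${NOTIFICATION_QUEUE_PREFIX}database-tasks"
```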
Multiple specialized worker types handle different queues.

4. Deploy Celery Beat
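Beat runs as a single scheduler process; the app module name is an assumption:

```shell
# Exactly one Beat instance should run per environment
celery -A run_celery beat --loglevel INFO
```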
Celery Beat schedules periodic tasks.

Deployment Environments
Development
Production
Production deployments use:

- Container orchestration (ECS recommended)
- Separate worker pools for different queue types
- Auto-scaling based on queue depth and CPU
- Health checks on the `/status` endpoint
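A container health check or load balancer probe can poll that endpoint; a sketch, with host and port assumed:

```shell
# Fail the health check unless /status returns HTTP 200
curl --fail --silent --max-time 5 http://localhost:6011/status || exit 1
```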
Health Checks
The API provides health check endpoints, including `/status`.

Scaling Considerations
API Workers
- Default: 4 Gunicorn workers
- Each worker handles 8 concurrent connections
- Scale horizontally based on CPU and request latency
Celery Workers
- Default concurrency: 4 (configurable via the `CONCURRENCY` env var)
- Can override `CELERYD_PREFETCH_MULTIPLIER` for long-running tasks
- Separate worker pools prevent queue starvation
Database Connections
- Default pool size: 5 connections per process
- Pool timeout: 30 seconds
- Connection recycling: 300 seconds
- Configurable via `SQLALCHEMY_POOL_SIZE`
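With Flask-SQLAlchemy, these defaults correspond to engine options along these lines. This is a sketch; the option keys are standard SQLAlchemy pool parameters:

```python
# Sketch: engine/pool settings matching the defaults listed above
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 5,       # connections per process (SQLALCHEMY_POOL_SIZE)
    "pool_timeout": 30,   # seconds to wait for a free connection
    "pool_recycle": 300,  # recycle connections older than 300 s
}
```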
Zero-Downtime Deployments
1. Deploy database migrations first (they are backward compatible)
2. Deploy new API containers - Gunicorn gracefully handles SIGTERM
3. Monitor health checks - ensure new containers are healthy
4. Deploy Celery workers - old workers finish current tasks before shutdown
5. Deploy Celery Beat last - only one instance should run
Rollback Procedure
- Revert to previous container image
- Check database migrations - Migrations are not automatically rolled back
- Monitor error rates and queue depths
- If database rollback needed, run migration downgrade:
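Assuming Flask-Migrate, a one-revision downgrade would be:

```shell
# Step the schema back one revision (verify the target revision first)
flask db downgrade -1
```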
Running One-Off Tasks
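A one-off task invocation might look like this; the command group and command name are illustrative assumptions, not taken from the codebase:

```shell
# Hypothetical example - list the Flask CLI commands available in the container
flask --help

# Hypothetical one-off command invocation (name is illustrative)
flask command fix-billable-units
```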
Tasks can be run through Flask CLI commands.

Performance Monitoring
The application exports metrics to:

- StatsD - configured via `STATSD_HOST`
- Prometheus - multiprocess mode for Gunicorn workers
Troubleshooting
Worker Not Processing Tasks
- Check SQS queue visibility
- Verify `NOTIFICATION_QUEUE_PREFIX` matches queue names
- Check worker logs for connection errors
- Verify AWS credentials and permissions
Database Connection Errors
- Check connection pool exhaustion metrics
- Increase `SQLALCHEMY_POOL_SIZE` if needed
- Check for long-running queries blocking connections
- Verify `statement_timeout` settings
High Memory Usage
- Check for memory leaks in long-running Celery workers
- Restart workers periodically using `--max-tasks-per-child`
- Monitor eventlet greenthread creation
Security Considerations
- Run containers as a non-root user (`notify:notify`)
- Use secrets management for sensitive environment variables
- Rotate API keys regularly (DVLA keys rotate monthly)
- Enable TLS for all external connections
- Use VPC endpoints for AWS services
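The non-root user from the first point can be created in the final Dockerfile stage; a sketch, assuming Debian-style user management commands:

```dockerfile
# Create an unprivileged user/group and drop root before the entrypoint runs
RUN groupadd --system notify && useradd --system --gid notify notify
USER notify:notify
```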