
Architecture Overview

GOV.UK Notify is a distributed system built for high availability and scalability. It consists of three main components:
  1. Public REST API - Accepts notification requests from services
  2. Admin Web Interface - Service management and template creation (separate repository)
  3. Asynchronous Workers - Process and deliver notifications to providers
This documentation focuses on the API service component (notifications-api). The admin interface is maintained in a separate repository.

High-Level Architecture

Core Components

1. Public REST API

The public API is a Flask application that provides RESTful endpoints for:
  • Sending SMS, email, and letter notifications
  • Querying notification status
  • Managing templates
  • Retrieving inbound SMS messages
Technology Stack:
  • Framework: Flask (Python 3.13)
  • Web Server: Gunicorn with multiple worker processes
  • Authentication: JWT tokens signed with service API keys
  • Validation: JSON Schema validation for all requests
  • Metrics: Prometheus metrics via gds-metrics
Key Files:
  • app/v2/notifications/post_notifications.py - Notification creation endpoints
  • app/v2/notifications/get_notifications.py - Notification query endpoints
  • app/authentication/auth.py - JWT authentication and authorization
The API uses separate database connections for read and write operations, with read replicas used for query endpoints to reduce load on the primary database.
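A minimal sketch of how read/write routing of this kind might look. The `choose_database` helper, the DSN strings, and the method-based rule are illustrative assumptions, not the notifications-api implementation.

```python
# Hypothetical DSNs; real values would come from configuration.
PRIMARY_DSN = "postgresql://primary.example/notify"
REPLICA_DSN = "postgresql://replica.example/notify"

# HTTP methods that never modify data and can tolerate replica lag.
READ_ONLY_METHODS = {"GET", "HEAD"}


def choose_database(http_method: str, needs_fresh_data: bool = False) -> str:
    """Return the DSN a request should use.

    Writes, and reads that must see the latest data, go to the primary;
    all other reads go to the replica.
    """
    if http_method.upper() in READ_ONLY_METHODS and not needs_fresh_data:
        return REPLICA_DSN
    return PRIMARY_DSN
```

The `needs_fresh_data` flag stands in for the "critical reads" case, where replication lag would be unacceptable.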

2. Celery Task Queue

Asynchronous notification processing is handled by Celery workers that:
  1. Accept tasks from the API via Redis queues
  2. Process notifications (render templates, validate, format)
  3. Send to providers (AWS SES, Firetext, MMG, DVLA)
  4. Update status based on provider callbacks
  5. Retry failures according to configured retry policies
Queue Structure:
Queue Name                  Purpose
send-sms-tasks              SMS delivery tasks
send-email-tasks            Email delivery tasks
create-letters-pdf-tasks    Letter PDF generation
priority-tasks              High-priority notifications
database-tasks              DB maintenance operations
periodic-tasks              Scheduled jobs (cleanup, reporting)
research-mode-tasks         Test notifications
Celery Beat Scheduler: Runs periodic tasks such as:
  • Daily statistics aggregation
  • Old notification cleanup
  • Provider status checks
  • Usage report generation

3. Database Layer

PostgreSQL is the primary data store.
Connection Strategy:
  • Primary database: All writes, critical reads
  • Read replicas: Query endpoints, reporting, analytics
  • Connection pooling: SQLAlchemy with configurable pool sizes
  • Read-only enforcement: Bulk queries use dedicated read-only sessions
Key Tables:
  • services - Service configurations and settings
  • api_keys - Service API keys with type and expiry
  • templates - Notification templates with versioning
  • notifications - All sent notifications with status
  • users - Admin users and permissions
  • jobs - Bulk sending jobs (CSV uploads)
  • inbound_sms - Received SMS messages
Performance Features:
  • Indexed on service_id, created_at, status
  • Partitioning for large tables (notifications)
  • Query timeout protection (configurable per environment)
  • Parallel query execution controls
See app/__init__.py:466-619 for database connection event handling and metrics.

4. Caching Layer (Redis)

Redis serves two purposes:
  1. Celery Message Broker - Queue management for asynchronous tasks
  2. Application Cache - Service data, templates, rate limit counters
Cached Data:
  • Service configurations (reduces DB load)
  • Template definitions (faster rendering)
  • Rate limiting counters (per-service, per-hour)
  • API key validation results
Redis caching must be explicitly enabled via the REDIS_ENABLED environment variable. Without caching, all data is fetched from PostgreSQL on every request.
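The cache-aside behaviour described here can be sketched as follows. This is an illustration only: `TtlCache` is a hypothetical in-memory stand-in for Redis, and the `enabled` flag plays the role of REDIS_ENABLED.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TtlCache:
    """In-memory stand-in for Redis, used only to illustrate cache-aside."""

    def __init__(self, enabled: bool, clock: Callable[[], float] = time.monotonic):
        self.enabled = enabled
        self._clock = clock
        # Maps key -> (expiry time, cached value).
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, ttl_seconds: float, loader: Callable[[], Any]) -> Any:
        """Return the cached value, or call loader() and cache its result."""
        if not self.enabled:
            return loader()  # caching off: every call goes to the database
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit that has not expired
        value = loader()
        self._store[key] = (now + ttl_seconds, value)
        return value
```

With `enabled=False` the loader runs on every call, which mirrors the "all data is fetched from PostgreSQL on every request" case.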

Notification Flow

SMS/Email Notification Lifecycle

1. API Request

Client sends POST request to /v2/notifications/sms or /v2/notifications/email with:
  • JWT token in Authorization header
  • Template ID and personalisation data
  • Recipient (phone number or email)
  • Optional reference and reply-to settings
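As a rough illustration of the request described above, the sketch below builds (but does not send) a POST to /v2/notifications/sms using only the standard library. The `build_sms_request` helper is hypothetical, and the token is assumed to have been created separately.

```python
import json
import urllib.request

# Public base URL of the GOV.UK Notify API.
BASE_URL = "https://api.notifications.service.gov.uk"


def build_sms_request(jwt_token, template_id, phone_number,
                      personalisation=None, reference=None):
    """Build an unsent urllib Request for the SMS endpoint."""
    body = {"phone_number": phone_number, "template_id": template_id}
    if personalisation:
        body["personalisation"] = personalisation
    if reference:
        body["reference"] = reference
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/notifications/sms",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; an email request would differ in the path and in using an email address field instead of a phone number.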
2. Authentication & Validation

  1. JWT token decoded and validated against service API keys
  2. Service active status checked
  3. Service permissions verified (SMS/email/letter)
  4. Rate limits checked (per-service limits)
  5. Request JSON validated against schema
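The per-service rate limit check in step 4 could be implemented with a fixed-window counter. The sketch below assumes that technique and keeps counters in memory; the class name and parameters are hypothetical, and the real service is described as keeping its counters in Redis.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per service in each time window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        # Keyed by (service_id, window index). Old windows are never purged
        # here; a Redis-backed version would rely on key expiry instead.
        self._counts = defaultdict(int)

    def allow(self, service_id: str) -> bool:
        """Record one request and report whether it is within the limit."""
        window = int(self._clock() // self.window_seconds)
        key = (service_id, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True
```

A fixed window is simple but lets a burst straddle two windows; sliding-window or token-bucket schemes trade extra bookkeeping for smoother limits.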
3. Template Processing

  1. Template fetched from database or cache
  2. Personalisation placeholders replaced with data
  3. Content length validated (SMS: 612 chars, Email: no limit)
  4. Template rendering errors caught and returned
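The placeholder and length steps above can be sketched as below. This is a simplified illustration, not the Notify rendering code: it assumes the `((name))` placeholder syntax, takes the 612-character limit from the text above, and uses hypothetical names throughout.

```python
import re

SMS_CHAR_LIMIT = 612  # limit quoted in the steps above

# Matches ((placeholder)) and captures the name between the parentheses.
PLACEHOLDER = re.compile(r"\(\(([^()]+)\)\)")


class TemplateError(ValueError):
    """Raised when a template cannot be rendered or is too long."""


def render_sms(template: str, personalisation: dict) -> str:
    """Replace placeholders with personalisation values and check length."""
    missing = [name for name in PLACEHOLDER.findall(template)
               if name not in personalisation]
    if missing:
        raise TemplateError("Missing personalisation: " + ", ".join(missing))
    content = PLACEHOLDER.sub(lambda m: str(personalisation[m.group(1)]), template)
    if len(content) > SMS_CHAR_LIMIT:
        raise TemplateError(f"Content exceeds {SMS_CHAR_LIMIT} characters")
    return content
```

Checking for missing values before substitution lets the error name every absent placeholder at once rather than failing on the first.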
4. Notification Persistence

Notification record created in PostgreSQL with:
  • Unique notification ID (UUID)
  • Initial status: created
  • Service ID, template ID, API key ID
  • Recipient and personalisation data
  • Client reference (if provided)
Status returned immediately to client (201 Created).
5. Queue Dispatch

Task enqueued to appropriate Celery queue:
  • Simulated recipients (test): No queue, marked delivered immediately
  • Real recipients: Task sent to send-sms-tasks or send-email-tasks
  • Priority notifications: Use priority-tasks queue
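The dispatch rules above amount to a small decision function. The sketch below is one hypothetical way to express them; the queue names come from the queue table, while `choose_queue` and its parameters are assumptions.

```python
from typing import Optional

# Default queue for each notification type, using names from the queue table.
DEFAULT_QUEUES = {"sms": "send-sms-tasks", "email": "send-email-tasks"}


def choose_queue(notification_type: str, is_simulated: bool = False,
                 is_priority: bool = False) -> Optional[str]:
    """Return the Celery queue name, or None when nothing is enqueued."""
    if is_simulated:
        return None  # simulated recipients are marked delivered without a task
    if is_priority:
        return "priority-tasks"
    try:
        return DEFAULT_QUEUES[notification_type]
    except KeyError:
        raise ValueError(f"Unsupported notification type: {notification_type}")
```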
6. Worker Processing

Celery worker:
  1. Picks up task from queue
  2. Fetches notification from database
  3. Updates status to sending
  4. Sends to provider (AWS SES / Firetext / MMG)
  5. Updates status to pending or failed
  6. Logs metrics and delivery attempts
7. Provider Callback

Provider sends delivery receipt to callback endpoint:
  • SMS: POST /notifications/sms/{provider}/delivery-receipt
  • Email: SES delivers receipts through an SNS webhook
Status updated to delivered, permanent-failure, or temporary-failure.
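Taken together, the steps above describe a small state machine. The sketch below encodes only the transitions named in this section; the `next_status` helper is hypothetical, and the real service may allow other statuses and transitions.

```python
# Transitions taken from the lifecycle steps described above.
ALLOWED_TRANSITIONS = {
    "created": {"sending"},
    "sending": {"pending", "failed"},
    "pending": {"delivered", "permanent-failure", "temporary-failure"},
}


def next_status(current: str, new: str) -> str:
    """Return `new` if the transition is allowed, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move from {current!r} to {new!r}")
    return new
```

Rejecting unknown transitions guards against late or duplicate provider callbacks overwriting a final status.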

Letter Notification Lifecycle

Letters follow a different flow due to physical delivery:
  1. API Request - Template-based or precompiled PDF
  2. PDF Generation - Celery task renders letter as PDF
  3. Virus Scanning - AntiVirus check (if enabled)
  4. Address Extraction - OCR for precompiled letters
  5. Upload to S3 - PDF stored for printing
  6. DVLA Submission - Letter sent to printing provider
  7. Status Updates - Callbacks from DVLA as letter progresses
See app/v2/notifications/post_notifications.py:291-393 for letter processing logic.
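To make the conditional steps explicit, the sketch below lists which processing steps would run for a given letter. The step names and the `letter_pipeline` helper are hypothetical labels for the stages listed above, not task names from the codebase.

```python
def letter_pipeline(precompiled: bool, antivirus_enabled: bool) -> list:
    """Return the ordered processing steps for one letter."""
    steps = ["generate_pdf"]
    if antivirus_enabled:
        steps.append("virus_scan")  # only when antivirus is enabled
    if precompiled:
        steps.append("extract_address")  # OCR applies to precompiled letters
    steps += ["upload_to_s3", "submit_to_dvla"]
    return steps
```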

Multi-Provider Strategy

SMS Providers

GOV.UK Notify supports multiple SMS providers with automatic failover:
  • Firetext - Primary UK SMS provider
  • MMG - Secondary provider for redundancy
Provider Selection:
  • Configured per-service in database
  • Can specify primary and fallback providers
  • Failed deliveries automatically retry with alternate provider
Implementation: See app/__init__.py:75-137 for provider client initialization.
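A minimal sketch of the failover idea, assuming each provider is represented by a callable that raises on failure. `ProviderError` and `send_with_failover` are hypothetical names, not the clients initialised in app/__init__.py.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider callable when delivery fails."""


def send_with_failover(
    providers: Sequence[Tuple[str, Callable[[str, str], None]]],
    to: str,
    content: str,
) -> str:
    """Try each provider in order; return the name of the one that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(to, content)
            return name
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember the failure, try the next
    raise ProviderError("All providers failed: " + "; ".join(errors))
```

The provider order would typically be primary first, fallback second, for example `[("firetext", firetext_send), ("mmg", mmg_send)]` with caller-supplied send functions.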

Email Providers

  • AWS SES - Primary email provider
  • AWS SES Stub - Development/testing (local mock server)
Configuration:
# Production: use real AWS SES
aws_ses_client = AwsSesClient(config['AWS_REGION'])

# Development: use the stub only if SES_STUB_URL is set.
# .get() avoids a KeyError when the setting is absent.
if config.get('SES_STUB_URL'):
    aws_ses_stub_client = AwsSesStubClient(
        config['AWS_REGION'],
        stub_url=config['SES_STUB_URL']
    )

Security Architecture

Authentication Flow

  1. API Key Generation - Created in admin interface, stored as hash in database
  2. JWT Token Creation - Client signs JWT with API key (HS256 algorithm)
  3. Token Validation - API verifies signature, expiry, and service status
  4. Request Authorization - Service permissions checked for operation
JWT Token Structure:
{
  "iss": "service-id-uuid",
  "iat": 1234567890,
  "jti": "unique-token-id"
}
Security Features:
  • Tokens expire 30 seconds after their iat timestamp, so client clocks must be closely synchronised
  • Service must be active (not archived)
  • API key must not be revoked
  • Rate limiting per service
  • HTTPS required for all API calls
See app/authentication/auth.py:84-123 for authentication implementation.
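The token flow above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the code in app/authentication/auth.py: `create_token` and `verify_token` are hypothetical helpers, and a production client would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring any stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def create_token(service_id: str, api_key_secret: str, now=None) -> str:
    """Build an HS256-signed JWT with the claims shown above."""
    header = {"typ": "JWT", "alg": "HS256"}
    payload = {
        "iss": service_id,
        "iat": int(time.time() if now is None else now),
        "jti": str(uuid.uuid4()),
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(api_key_secret.encode("utf-8"),
                         signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, api_key_secret: str, now=None, max_age_seconds=30) -> dict:
    """Check algorithm, signature, and age; return the payload or raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("Malformed token")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("Unsupported algorithm")
    expected = hmac.new(api_key_secret.encode("utf-8"),
                        f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("Invalid signature")
    payload = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if abs(current - payload["iat"]) > max_age_seconds:
        raise ValueError("Token expired or clock skew too large")
    return payload
```

The server would additionally look up the service named in `iss`, confirm it is active, and try each of its non-revoked API keys as the secret.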

Data Protection

  • At Rest: PostgreSQL encryption, S3 server-side encryption
  • In Transit: TLS 1.2+ required for all connections
  • Secrets Management: Environment variables, AWS Secrets Manager
  • PII Handling: Notification content not logged, redacted in metrics

Monitoring & Observability

Metrics Collection

Prometheus metrics exposed at /metrics endpoint:
  • Request metrics: HTTP status codes, response times, concurrent requests
  • Database metrics: Connection pool usage, query duration, transaction counts
  • Queue metrics: Task queue lengths, processing times, retry counts
  • Provider metrics: Delivery success rates, provider response times
Key Metrics:
  • concurrent_web_request_count - Active request count
  • db_connection_total_connected - Open database connections
  • db_connection_total_checked_out - Connections in use
  • post_notification_json_parse_duration_seconds - JSON parsing time
See app/__init__.py:64-67 and app/__init__.py:469-485 for metric definitions.

Logging

Structured logging with:
  • Request IDs - Trace requests across components
  • Service IDs - Track per-service usage
  • Error details - Stack traces for failures
  • Performance data - Slow query logs, timeout warnings
Log Levels:
  • INFO - Normal operations, authentication events
  • WARNING - Recoverable errors, rate limit warnings
  • ERROR - Failed requests, provider errors
  • CRITICAL - System failures, database unavailability
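One way to produce structured log lines carrying request and service IDs is a JSON formatter built on the standard `logging` module. The sketch below assumes that approach; `JsonFormatter` and its field names are hypothetical, not the formatter the service uses.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Set by callers through the `extra` argument; None when absent.
            "request_id": getattr(record, "request_id", None),
            "service_id": getattr(record, "service_id", None),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

A caller would attach the formatter to a handler and log with, for example, `logger.info("notification created", extra={"request_id": rid, "service_id": sid})`.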

Scalability Considerations

Horizontal Scaling

  • API servers: Stateless, can scale to many instances behind load balancer
  • Celery workers: Scale independently based on queue depth
  • Database: Read replicas for query load distribution
  • Redis: Can use Redis Cluster for cache sharding

Performance Tuning

  • Connection pooling (default: 5 connections per worker)
  • Read replica routing for GET endpoints
  • Query timeout enforcement (configurable)
  • Parallel workers configuration
  • Statement timeout per environment
Configure in app/config.py:
DATABASE_STATEMENT_TIMEOUT_MS = 1200
DATABASE_STATEMENT_TIMEOUT_REPLICA_MS = 3000
DATABASE_MAX_PARALLEL_WORKERS = 4
DATABASE_MAX_PARALLEL_WORKERS_REPLICA = 8
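One common way to apply a statement timeout is the libpq `options` connection parameter, which sets PostgreSQL session parameters at connect time. The sketch below assumes that approach; the `connect_args_for` helper is hypothetical, and the source does not say how notifications-api applies these values.

```python
# Values copied from the configuration shown above.
DATABASE_STATEMENT_TIMEOUT_MS = 1200
DATABASE_STATEMENT_TIMEOUT_REPLICA_MS = 3000


def connect_args_for(use_replica: bool) -> dict:
    """Build driver connect arguments that set a per-session statement timeout."""
    timeout_ms = (DATABASE_STATEMENT_TIMEOUT_REPLICA_MS if use_replica
                  else DATABASE_STATEMENT_TIMEOUT_MS)
    # PostgreSQL reads a bare statement_timeout number as milliseconds.
    return {"options": f"-c statement_timeout={timeout_ms}"}
```

With SQLAlchemy, a dict like this could be passed as `connect_args` to `create_engine`.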
Queue tuning options for Celery workers:
  • Separate queues for different notification types
  • Priority queues for urgent notifications
  • Prefetch limits to prevent worker starvation
  • Task time limits and soft/hard timeouts
  • Retry policies with exponential backoff
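A retry policy with exponential backoff can be sketched as a delay calculation. The base delay, cap, and jitter parameters below are illustrative assumptions, not the values configured for the Notify workers.

```python
import random


def backoff_delay(attempt: int, base_seconds: float = 1.0,
                  max_seconds: float = 300.0, jitter: float = 0.0,
                  rng=random.random) -> float:
    """Return the delay before retry number `attempt` (0-based).

    The delay doubles with each attempt up to `max_seconds`. A `jitter`
    fraction adds up to that proportion of random extra delay so that
    many failing tasks do not all retry at the same moment.
    """
    delay = min(max_seconds, base_seconds * (2 ** attempt))
    return delay + delay * jitter * rng()
```

In Celery, a delay like this would typically be passed as the `countdown` argument when a task retries itself.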
Enable Redis caching to reduce database load:
export REDIS_ENABLED=1
Cached objects:
  • Service configurations (TTL: 5 minutes)
  • Template definitions (TTL: 10 minutes)
  • API key validation results (TTL: 1 minute)

Deployment Architecture

AWS Infrastructure

GOV.UK Notify runs on AWS with:
  • ECS (Elastic Container Service) - Docker containers for API and workers
  • RDS (PostgreSQL) - Managed database with Multi-AZ deployment
  • ElastiCache (Redis) - Managed Redis for caching and queues
  • S3 - Document storage (file uploads, letter PDFs)
  • CloudFront - CDN for document downloads
  • SES - Email sending service
  • CloudWatch - Logs and metrics aggregation

Environment Configuration

Supported environments defined in app/config.py:
  • Development - Local development with stub providers
  • Preview - Integration testing environment
  • Staging - Pre-production testing
  • Production - Live service
Key Environment Variables:
NOTIFY_ENVIRONMENT=production
FLASK_APP=application.py
DATABASE_URL=postgresql://...
REDIS_URL=redis://...
AWS_REGION=eu-west-1
MMG_API_KEY=xxx
FIRETEXT_API_KEY=xxx
ANTIVIRUS_ENABLED=1

Further Reading

  • API Development - Guidelines for developing new public API endpoints
  • Database Schema - Detailed database schema documentation
  • Celery Tasks - Worker task definitions and queue management
  • Provider Integration - How to integrate new SMS/email providers
