Architecture Overview
GOV.UK Notify is a distributed system built for high availability and scalability. It consists of three main components:
- Public REST API - Accepts notification requests from services
- Admin Web Interface - Service management and template creation (separate repository)
- Asynchronous Workers - Process and deliver notifications to providers
This documentation focuses on the API service component (notifications-api). The admin interface is maintained in a separate repository.
High-Level Architecture
Core Components
1. Public REST API
The public API is a Flask application that provides RESTful endpoints for:
- Sending SMS, email, and letter notifications
- Querying notification status
- Managing templates
- Retrieving inbound SMS messages
Technology stack:
- Framework: Flask (Python 3.13)
- Web Server: Gunicorn with multiple worker processes
- Authentication: JWT tokens signed with service API keys
- Validation: JSON Schema validation for all requests
- Metrics: Prometheus metrics via gds-metrics
Key files:
- app/v2/notifications/post_notifications.py - Notification creation endpoints
- app/v2/notifications/get_notifications.py - Notification query endpoints
- app/authentication/auth.py - JWT authentication and authorization
The API uses separate database connections for read and write operations, with read replicas used for query endpoints to reduce load on the primary database.
2. Celery Task Queue
Asynchronous notification processing is handled by Celery workers that:
- Accept tasks from the API via Redis queues
- Process notifications (render templates, validate, format)
- Send to providers (AWS SES, Firetext, MMG, DVLA)
- Update status based on provider callbacks
- Retry failures according to configured retry policies
| Queue Name | Purpose |
|---|---|
| send-sms-tasks | SMS delivery tasks |
| send-email-tasks | Email delivery tasks |
| create-letters-pdf-tasks | Letter PDF generation |
| priority-tasks | High-priority notifications |
| database-tasks | DB maintenance operations |
| periodic-tasks | Scheduled jobs (cleanup, reporting) |
| research-mode-tasks | Test notifications |
Periodic tasks include:
- Daily statistics aggregation
- Old notification cleanup
- Provider status checks
- Usage report generation
3. Database Layer
PostgreSQL is the primary data store with the following structure:
Connection Strategy:
- Primary database: All writes, critical reads
- Read replicas: Query endpoints, reporting, analytics
- Connection pooling: SQLAlchemy with configurable pool sizes
- Read-only enforcement: Bulk queries use dedicated read-only sessions
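The connection strategy above can be sketched as a small router that chooses a database by request type. This is a minimal illustration, not the routing code in notifications-api: the `DatabaseRouter` class, the `critical` flag, and both URIs are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseRouter:
    """Picks a connection URI per request (hypothetical helper, not Notify code)."""

    primary_uri: str
    replica_uri: str

    def uri_for(self, http_method: str, critical: bool = False) -> str:
        # All writes, and any read flagged as critical, go to the primary.
        if critical or http_method.upper() not in {"GET", "HEAD"}:
            return self.primary_uri
        # Remaining reads (query endpoints, reporting) go to a read replica.
        return self.replica_uri


# Placeholder URIs; a real deployment would read these from configuration.
router = DatabaseRouter(
    primary_uri="postgresql://primary.example/notify",
    replica_uri="postgresql://replica.example/notify",
)
```

A query endpoint would call `router.uri_for("GET")`, while a read that must see the latest write would pass `critical=True`.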
Key tables:
- services - Service configurations and settings
- api_keys - Service API keys with type and expiry
- templates - Notification templates with versioning
- notifications - All sent notifications with status
- users - Admin users and permissions
- jobs - Bulk sending jobs (CSV uploads)
- inbound_sms - Received SMS messages
Performance features:
- Indexed on service_id, created_at, status
- Partitioning for large tables (notifications)
- Query timeout protection (configurable per environment)
- Parallel query execution controls
See app/__init__.py:466-619 for database connection event handling and metrics.
4. Caching Layer (Redis)
Redis serves dual purposes:
- Celery Message Broker - Queue management for asynchronous tasks
- Application Cache - Service data, templates, rate limit counters
Cached data:
- Service configurations (reduces DB load)
- Template definitions (faster rendering)
- Rate limiting counters (per-service, per-hour)
- API key validation results
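The per-service, per-hour rate limit counters could work roughly like the fixed-window sketch below. It keeps counters in a dictionary purely for illustration; in the real system they would live in Redis. The `HourlyRateLimiter` class and its parameters are hypothetical, and the actual limiting algorithm may differ.

```python
import time
from collections import defaultdict


class HourlyRateLimiter:
    """Fixed-window counter keyed by (service_id, hour). Illustrative only."""

    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self.clock = clock  # injectable so the window can be tested
        self.counters = defaultdict(int)  # stands in for Redis counters

    def allow(self, service_id):
        # Each hour since the epoch is one window.
        window = int(self.clock() // 3600)
        key = (service_id, window)
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True
```

A fixed window is the simplest scheme to reason about; it allows short bursts at window boundaries, which a sliding window would smooth out.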
Notification Flow
SMS/Email Notification Lifecycle
API Request
The client sends a POST request to /v2/notifications/sms or /v2/notifications/email with:
- JWT token in the Authorization header
- Template ID and personalisation data
- Recipient (phone number or email)
- Optional reference and reply-to settings
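A request with those parts might be assembled as follows. This sketch only builds the request object and does not send it. The JSON field names (`template_id`, `phone_number`, `personalisation`, `reference`) and the `Bearer` scheme are assumptions not stated in this document, and `build_sms_request` and the base URL are hypothetical.

```python
import json
import urllib.request


def build_sms_request(base_url, token, template_id, phone_number,
                      personalisation, reference=None):
    """Build (but do not send) an SMS notification request."""
    body = {
        "template_id": template_id,
        "phone_number": phone_number,
        "personalisation": personalisation,
    }
    # The client reference is optional, so only include it when given.
    if reference is not None:
        body["reference"] = reference
    return urllib.request.Request(
        url=f"{base_url}/v2/notifications/sms",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call.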
Authentication & Validation
- JWT token decoded and validated against service API keys
- Service active status checked
- Service permissions verified (SMS/email/letter)
- Rate limits checked (per-service limits)
- Request JSON validated against schema
Template Processing
- Template fetched from database or cache
- Personalisation placeholders replaced with data
- Content length validated (SMS: 612 chars, Email: no limit)
- Template rendering errors caught and returned
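The template processing steps can be sketched as a single function. The `((name))` placeholder syntax is an assumption not stated in this document, the 612-character limit is the figure quoted above, and `render_sms` and `TemplateRenderError` are hypothetical names rather than the project's own.

```python
import re

SMS_CHAR_LIMIT = 612  # limit quoted in this document
PLACEHOLDER = re.compile(r"\(\(([^)]+)\)\)")  # assumed ((name)) syntax


class TemplateRenderError(ValueError):
    """Raised when a template cannot be rendered (hypothetical)."""


def render_sms(template, personalisation):
    # Report every missing placeholder at once, not just the first.
    missing = sorted(
        {name for name in PLACEHOLDER.findall(template)
         if name not in personalisation}
    )
    if missing:
        raise TemplateRenderError(
            "Missing personalisation: " + ", ".join(missing)
        )
    content = PLACEHOLDER.sub(
        lambda match: str(personalisation[match.group(1)]), template
    )
    # Length is checked after substitution, since values change it.
    if len(content) > SMS_CHAR_LIMIT:
        raise TemplateRenderError(
            f"Content is {len(content)} characters; limit is {SMS_CHAR_LIMIT}"
        )
    return content
```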
Notification Persistence
Notification record created in PostgreSQL with:
- Unique notification ID (UUID)
- Initial status: created
- Service ID, template ID, API key ID
- Recipient and personalisation data
- Client reference (if provided)
Queue Dispatch
Task enqueued to appropriate Celery queue:
- Simulated recipients (test): No queue, marked delivered immediately
- Real recipients: Task sent to send-sms-tasks or send-email-tasks
- Priority notifications: Use the priority-tasks queue
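The dispatch rules above amount to a small decision function. The sketch below uses the queue names from this document; `choose_queue` and its parameters are hypothetical, and returning `None` to mean "no queue" is a simplification chosen for the example.

```python
from typing import Optional

QUEUE_BY_TYPE = {
    "sms": "send-sms-tasks",
    "email": "send-email-tasks",
}
PRIORITY_QUEUE = "priority-tasks"


def choose_queue(notification_type: str, priority: bool = False,
                 simulated: bool = False) -> Optional[str]:
    """Return the Celery queue name, or None when nothing is enqueued."""
    # Simulated (test) recipients are marked delivered without a task.
    if simulated:
        return None
    if priority:
        return PRIORITY_QUEUE
    return QUEUE_BY_TYPE[notification_type]
```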
Worker Processing
Celery worker:
- Picks up task from queue
- Fetches notification from database
- Updates status to sending
- Sends to provider (AWS SES / Firetext / MMG)
- Updates status to pending or failed
- Logs metrics and delivery attempts
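The status changes in this lifecycle can be modelled as a small transition table. The sketch covers only the statuses named in the steps above (created, sending, pending, failed); later statuses set by provider callbacks are omitted, and `advance_status` is a hypothetical helper.

```python
# Transitions described in the worker steps; callback-driven ones are omitted.
ALLOWED_TRANSITIONS = {
    "created": {"sending"},
    "sending": {"pending", "failed"},
}


def advance_status(current, new):
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {new!r}")
    return new
```

Rejecting unknown transitions makes out-of-order updates fail loudly instead of silently overwriting a later status.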
Letter Notification Lifecycle
Letters follow a different flow due to physical delivery:
- API Request - Template-based or precompiled PDF
- PDF Generation - Celery task renders letter as PDF
- Virus Scanning - AntiVirus check (if enabled)
- Address Extraction - OCR for precompiled letters
- Upload to S3 - PDF stored for printing
- DVLA Submission - Letter sent to printing provider
- Status Updates - Callbacks from DVLA as letter progresses
See app/v2/notifications/post_notifications.py:291-393 for letter processing logic.
Multi-Provider Strategy
SMS Providers
GOV.UK Notify supports multiple SMS providers with automatic failover:
- Firetext - Primary UK SMS provider
- MMG - Secondary provider for redundancy
Provider selection:
- Configured per-service in database
- Can specify primary and fallback providers
- Failed deliveries automatically retry with alternate provider
See app/__init__.py:75-137 for provider client initialization.
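Failover between a primary and a fallback provider might look like the loop below. This is a sketch under stated assumptions: `ProviderError`, `send_with_failover`, and the `send` callable are hypothetical, and the real provider clients and retry scheduling are not shown.

```python
class ProviderError(Exception):
    """Raised when a provider rejects or fails a send (hypothetical)."""


def send_with_failover(providers, send):
    """Try each provider in order; return (provider_name, result).

    `providers` is an ordered list such as ["firetext", "mmg"], and
    `send` is a callable taking a provider name.
    """
    failures = []
    for name in providers:
        try:
            return name, send(name)
        except ProviderError as exc:
            # Remember why this provider failed, then try the next one.
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed (" + "; ".join(failures) + ")")
```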
Email Providers
- AWS SES - Primary email provider
- AWS SES Stub - Development/testing (local mock server)
Security Architecture
Authentication Flow
- API Key Generation - Created in admin interface, stored as hash in database
- JWT Token Creation - Client signs JWT with API key (HS256 algorithm)
- Token Validation - API verifies signature, expiry, and service status
- Request Authorization - Service permissions checked for operation
Security features:
- Tokens expire after 30 seconds (clock skew tolerance)
- Service must be active (not archived)
- API key must not be revoked
- Rate limiting per service
- HTTPS required for all API calls
See app/authentication/auth.py:84-123 for authentication implementation.
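To make the token steps concrete, here is a standard-library sketch of HS256 signing and verification with a 30-second window. It is not the code in app/authentication/auth.py. The claim names `iss` (service ID) and `iat` (issue time) are assumptions, as are the function names, and checks such as service status and key revocation are left out.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode("utf-8"), signing_input.encode("utf-8"),
                    hashlib.sha256).digest()


def create_token(service_id, secret, now=None):
    header = {"typ": "JWT", "alg": "HS256"}
    claims = {"iss": service_id,
              "iat": int(time.time() if now is None else now)}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token, secret, now=None, max_age_seconds=30):
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, claims_b64, signature_b64 = parts
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _sign(f"{header_b64}.{claims_b64}", secret)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(claims_b64))
    current = time.time() if now is None else now
    if abs(current - claims["iat"]) > max_age_seconds:
        raise ValueError("token expired or issued in the future")
    return claims
```

The signature is checked before the claims are trusted, so an attacker cannot influence the expiry check with unsigned data.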
Data Protection
- At Rest: PostgreSQL encryption, S3 server-side encryption
- In Transit: TLS 1.2+ required for all connections
- Secrets Management: Environment variables, AWS Secrets Manager
- PII Handling: Notification content not logged, redacted in metrics
Monitoring & Observability
Metrics Collection
Prometheus metrics are exposed at the /metrics endpoint:
- Request metrics: HTTP status codes, response times, concurrent requests
- Database metrics: Connection pool usage, query duration, transaction counts
- Queue metrics: Task queue lengths, processing times, retry counts
- Provider metrics: Delivery success rates, provider response times
Key metrics:
- concurrent_web_request_count - Active request count
- db_connection_total_connected - Open database connections
- db_connection_total_checked_out - Connections in use
- post_notification_json_parse_duration_seconds - JSON parsing time
See app/__init__.py:64-67 and app/__init__.py:469-485 for metric definitions.
Logging
Structured logging with:
- Request IDs - Trace requests across components
- Service IDs - Track per-service usage
- Error details - Stack traces for failures
- Performance data - Slow query logs, timeout warnings
Log levels:
- INFO - Normal operations, authentication events
- WARNING - Recoverable errors, rate limit warnings
- ERROR - Failed requests, provider errors
- CRITICAL - System failures, database unavailability
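Structured logging with request and service IDs can be approximated with the standard `logging` module and a JSON formatter. The field names, the `notify.sketch` logger name, and the `build_logger` helper are hypothetical; the project's actual logging setup is not shown in this document.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via `extra=`; absent fields are emitted as null.
            "request_id": getattr(record, "request_id", None),
            "service_id": getattr(record, "service_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("notify.sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Each call then attaches context through `extra`, for example `logger.warning("rate limit close", extra={"request_id": "req-1", "service_id": "svc-1"})`.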
Scalability Considerations
Horizontal Scaling
- API servers: Stateless, can scale to many instances behind load balancer
- Celery workers: Scale independently based on queue depth
- Database: Read replicas for query load distribution
- Redis: Can use Redis Cluster for cache sharding
Performance Tuning
Database Optimization
- Connection pooling (default: 5 connections per worker)
- Read replica routing for GET endpoints
- Query timeout enforcement (configurable)
- Parallel workers configuration
- Statement timeout per environment
See app/config.py.
Celery Queue Management
- Separate queues for different notification types
- Priority queues for urgent notifications
- Prefetch limits to prevent worker starvation
- Task time limits and soft/hard timeouts
- Retry policies with exponential backoff
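Exponential backoff means the wait doubles after each failed attempt, usually up to a ceiling. The sketch below shows the arithmetic only; the 5-second base and 300-second cap are illustrative values, not the configured retry policy.

```python
def retry_delay(attempt, base_seconds=5.0, cap_seconds=300.0):
    """Delay before retry number `attempt` (0-based), capped at `cap_seconds`."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(cap_seconds, base_seconds * (2 ** attempt))
```

With these example values the first four delays are 5, 10, 20, and 40 seconds, and every delay from the seventh retry onward is 300 seconds.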
Caching Strategy
Enable Redis caching to reduce database load.
Cached objects:
- Service configurations (TTL: 5 minutes)
- Template definitions (TTL: 10 minutes)
- API key validation results (TTL: 1 minute)
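The TTLs above follow a read-through pattern: look in the cache first and fall back to the database when an entry is absent or expired. The in-memory sketch below stands in for Redis; `TTLCache` and `get_or_load` are hypothetical names, and only the TTL values come from this document.

```python
import time

SERVICE_TTL = 5 * 60    # service configurations
TEMPLATE_TTL = 10 * 60  # template definitions
API_KEY_TTL = 60        # API key validation results


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop the stale entry
            return None
        return value


def get_or_load(cache, key, ttl_seconds, loader):
    """Return the cached value, calling `loader` only on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl_seconds)
    return value
```

Shorter TTLs on security-sensitive entries, such as API key validation, limit how long a revoked key could keep working from cache.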
Deployment Architecture
AWS Infrastructure
GOV.UK Notify runs on AWS with:
- ECS (Elastic Container Service) - Docker containers for API and workers
- RDS (PostgreSQL) - Managed database with Multi-AZ deployment
- ElastiCache (Redis) - Managed Redis for caching and queues
- S3 - Document storage (file uploads, letter PDFs)
- CloudFront - CDN for document downloads
- SES - Email sending service
- CloudWatch - Logs and metrics aggregation
Environment Configuration
Supported environments are defined in app/config.py:
- Development - Local development with stub providers
- Preview - Integration testing environment
- Staging - Pre-production testing
- Production - Live service
Further Reading
- API Development - Guidelines for developing new public API endpoints
- Database Schema - Detailed database schema documentation
- Celery Tasks - Worker task definitions and queue management
- Provider Integration - How to integrate new SMS/email providers