## Overview
GOV.UK Notify uses Celery for asynchronous task processing with:

- Amazon SQS as the message broker
- Multiple specialized worker pools for different task types
- Celery Beat for scheduled tasks
- Integration with StatsD and Prometheus for monitoring
## Celery Configuration

### Broker Configuration
Celery uses Amazon SQS as the message broker.

#### Required Environment Variables

- `NOTIFICATION_QUEUE_PREFIX`: prefix for all SQS queue names, used to separate environments. Examples: `production-`, `staging-`, `local_dev_john`
- AWS region for the SQS queues. Default: `eu-west-1`
- AWS account ID, used to construct SQS queue URLs. Default: `123456789012`

## Queue Structure
### Task Queues
The application defines specialized queues for different task types.

### External Queues

External services write to these queues.

## Running Celery Workers
### Local Development
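The worker can be started locally with the standard Celery CLI. The application module path below is an assumption about the project's entry point, not a confirmed value:

```shell
# Start a local worker; the -A target is an assumed entry point,
# adjust it to the project's actual Celery application module.
celery -A run_celery.notify_celery worker --loglevel=INFO --concurrency=4
```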
### Docker Development
For development environments with pycurl issues, run the workers inside Docker instead.

### Production Workers

In production, run a specialized worker for each queue type:

#### API Worker (Generic)
#### Sender Worker (SMS/Email)

#### Letter Sender Worker

#### Jobs Worker

#### Letters Worker

#### Receipts Worker

#### Periodic Worker

#### Reporting Worker

#### Internal Worker

#### Service Callbacks Worker

#### Callbacks Retry Worker

#### Retry Worker

#### Research Mode Worker

#### Report Requests Worker
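Each of the worker types above follows the same pattern: one `celery worker` process subscribed to its queue(s) with `-Q`. A sketch, where the module path and queue names are illustrative assumptions rather than the project's actual identifiers:

```shell
# A specialized worker consumes only the queues named with -Q,
# leaving other task types to their own worker pools.
celery -A run_celery.notify_celery worker \
  --loglevel=INFO \
  --concurrency=4 \
  -Q database-tasks,job-tasks
```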
## Worker Configuration

### Concurrency

Control the number of concurrent tasks per worker with the `--concurrency` option.

### Prefetch Multiplier

For long-running tasks, set a low prefetch multiplier to prevent task hoarding.

### Log Levels

Set the worker log level with `--loglevel`.
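The options above combine on one command line. Values here are illustrative and the module path is an assumption:

```shell
# --prefetch-multiplier=1 stops a busy worker from reserving tasks
# it cannot start yet; --concurrency caps simultaneous tasks.
celery -A run_celery.notify_celery worker \
  --concurrency=10 \
  --prefetch-multiplier=1 \
  --loglevel=WARNING
```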
## Celery Beat (Scheduled Tasks)
### Running Celery Beat
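Beat runs as its own process; start exactly one instance per environment to avoid duplicate scheduling (the module path is again an assumption):

```shell
# The single scheduler process that enqueues the periodic tasks below
celery -A run_celery.notify_celery beat --loglevel=INFO
```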
### Scheduled Tasks
Celery Beat runs tasks on predefined schedules.

#### High-Frequency Tasks (Every Minute)

- `generate-sms-delivery-stats`: SMS delivery statistics
- `switch-current-sms-provider-on-slow-delivery`: provider failover
- `check-job-status`: monitor job completion
- `update-status-of-fully-processed-jobs`: update job statuses
#### Periodic Tasks (Every 5-15 Minutes)

- `run-scheduled-jobs`: every 15 minutes (at :00, :15, :30, :45)
- `tend-providers-back-to-middle`: every 5 minutes
- `check-for-missing-rows-in-completed-jobs`: every 10 minutes
- `replay-created-notifications`: every 15 minutes
#### Nightly Tasks

- `timeout-sending-notifications`: 00:05 UTC
- `archive-unsubscribe-requests`: 00:05 UTC
- `create-nightly-billing`: 00:15 UTC
- `create-nightly-notification-status`: 00:30 UTC
- `delete-inbound-sms`: 01:40 UTC
- `save-daily-notification-processing-time`: 02:00 UTC
- `delete-notifications-older-than-retention`: 03:00 UTC
- `remove_sms_email_jobs`: 04:00 UTC
- `remove_letter_jobs`: 04:20 UTC
#### Letter Processing Tasks

- `check-if-letters-still-in-created`: weekdays at 07:00 UTC
- `check-if-letters-still-pending-virus-check`: every 10 minutes, plus nightly
- `check-time-to-collate-letters`: 16:50 and 17:50 UTC
- `raise-alert-if-letter-notifications-still-sending`: 19:00 UTC
#### Weekly Tasks

- `check-for-low-available-inbound-sms-numbers`: Monday 09:00 UTC
- `weekly-dwp-report`: Monday 09:00 UTC
- `weekly-user-research-email`: Wednesday 10:00 UTC
#### Monthly Tasks

- `change-dvla-api-key`: first Tuesday of the month, 09:00 UTC
- `change-dvla-password`: the Wednesday after the first Tuesday, 09:00 UTC
- `run-populate-annual-billing`: 1 April, 02:01 UTC
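The schedules above map onto Celery Beat's `crontab` syntax. A sketch covering a few of the entries; the `task` values here are illustrative, since the real task names are dotted module paths that are not reproduced in this document:

```python
from celery.schedules import crontab

# Illustrative beat_schedule fragment showing how the timings above
# are expressed; not the project's actual configuration.
beat_schedule = {
    "run-scheduled-jobs": {
        "task": "run-scheduled-jobs",
        "schedule": crontab(minute="0,15,30,45"),  # every 15 minutes
    },
    "delete-notifications-older-than-retention": {
        "task": "delete-notifications-older-than-retention",
        "schedule": crontab(hour=3, minute=0),  # nightly, 03:00 UTC
    },
    "weekly-dwp-report": {
        "task": "weekly-dwp-report",
        "schedule": crontab(day_of_week="mon", hour=9, minute=0),
    },
    "run-populate-annual-billing": {
        "task": "run-populate-annual-billing",
        # 1 April, 02:01 UTC
        "schedule": crontab(day_of_month=1, month_of_year=4, hour=2, minute=1),
    },
}
```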
## Task Modules

Tasks are organized in separate modules:

- `app.celery.letters_pdf_tasks`: letter PDF generation
- `app.celery.process_letter_client_response_tasks`: letter delivery updates
- `app.celery.process_ses_receipts_tasks`: email delivery receipts
- `app.celery.process_sms_client_response_tasks`: SMS delivery receipts
- `app.celery.provider_tasks`: provider communication
- `app.celery.research_mode_tasks`: research/test mode
- `app.celery.service_callback_tasks`: service webhooks
## Monitoring and Metrics

### Prometheus Metrics

Celery workers export Prometheus metrics.

### StatsD Metrics

Celery tasks emit StatsD metrics.

Callback metrics:

- `callback.{provider}.{status}`: delivery receipt callbacks
- `callback.ses.{status}`: email delivery status
- `international-sms.{status}.{prefix}`: international SMS by country
- `callback-to-notification-created`: time from notification creation to callback
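The naming scheme can be sketched as follows. Here `statsd_client` stands in for whatever StatsD client the application wires up, so the `incr`/`timing` method names are assumptions in the style of common StatsD clients:

```python
def record_callback_metric(statsd_client, provider: str, status: str) -> None:
    """Count a delivery-receipt callback, e.g. callback.ses.delivered."""
    statsd_client.incr(f"callback.{provider}.{status}")


def record_callback_latency(statsd_client, seconds: float) -> None:
    """Record time from notification creation to callback, in milliseconds."""
    statsd_client.timing("callback-to-notification-created", seconds * 1000)
```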
### Queue Monitoring

Monitor SQS queue metrics:

- `ApproximateNumberOfMessages`: messages waiting in the queue
- `ApproximateNumberOfMessagesNotVisible`: messages currently being processed
- `ApproximateAgeOfOldestMessage`: age of the queue backlog
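These attributes are available from the SQS API, for example via the AWS CLI. The queue URL below is an illustrative example built from the prefix and account ID shown earlier, not a real queue:

```shell
# Fetch the three backlog metrics for one queue
aws sqs get-queue-attributes \
  --queue-url "https://sqs.eu-west-1.amazonaws.com/123456789012/production-database-tasks" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateAgeOfOldestMessage
```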
## Worker Scaling

### Horizontal Scaling

Scale workers by running multiple instances. Each instance:

- Polls the same SQS queue
- Processes tasks independently
- Uses SQS visibility timeout to prevent duplicate processing
### Vertical Scaling

Increase concurrency within workers:

- Each concurrent task uses a database connection
- Monitor connection pool usage
- CPU-bound tasks benefit from more workers
- I/O-bound tasks benefit from higher concurrency
### Auto-Scaling

Scale workers based on queue depth: for example, add instances while `ApproximateNumberOfMessages` stays above a threshold, and scale back down as the queue drains.

## Troubleshooting
Worker Not Consuming Tasks
Symptoms: Tasks queued but not processed Solutions:- Check worker is running and connected to SQS
- Verify
NOTIFICATION_QUEUE_PREFIXmatches queue names - Check AWS credentials and permissions
- Verify queue exists in AWS console
- Check worker logs for connection errors
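Two quick checks, assuming AWS CLI access and the Celery entry point used elsewhere in this document (both are assumptions):

```shell
# Confirm the queues the prefix should match actually exist
aws sqs list-queues --queue-name-prefix "production-"

# Confirm workers are alive and reachable
celery -A run_celery.notify_celery inspect ping
```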
### Task Timeouts

Symptoms: tasks fail with timeout errors.

Solutions:

- Increase the task timeout in the task definition
- Optimize slow operations
- Break large tasks into smaller chunks
- Check database query performance
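Celery supports per-task time limits through the standard `soft_time_limit` and `time_limit` options. A hedged sketch; the task name and limit values are illustrative:

```python
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded


@shared_task(soft_time_limit=240, time_limit=300)
def process_job(job_id):
    """Illustrative task: the soft limit raises SoftTimeLimitExceeded so
    the task can clean up before the hard limit kills the process."""
    try:
        ...  # do the work
    except SoftTimeLimitExceeded:
        ...  # log the timeout and release any held resources
```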
### Memory Leaks

Symptoms: worker memory usage grows over time.

Solutions:

- Enable `--max-tasks-per-child` to restart worker processes periodically
- Profile tasks for memory leaks
- Check for unclosed database connections
- Monitor eventlet greenthread creation
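With the prefork pool, the restart threshold is a worker flag; the count below is illustrative:

```shell
# Recycle each worker process after 500 tasks to bound leaked memory
celery -A run_celery.notify_celery worker --max-tasks-per-child=500
```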
### Duplicate Task Execution

Symptoms: tasks are executed multiple times.

Solutions:

- Verify that only one Celery Beat instance is running
- Check SQS visibility timeout is appropriate
- Ensure tasks are idempotent
- Check for worker restarts during task execution
### Connection Pool Exhaustion

Symptoms: `QueuePool limit exceeded` errors.

Solutions:

- Reduce worker concurrency
- Increase `SQLALCHEMY_POOL_SIZE`
- Check for connection leaks in tasks
- Monitor the `db_connection_total_checked_out` metric
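These are the standard Flask-SQLAlchemy pool settings; the values below are illustrative, not the project's configuration. Size the pool against worker concurrency, since each concurrent task holds a connection:

```python
# Illustrative pool sizing: keep pool_size + max_overflow at or above
# the worker's --concurrency so tasks never starve for connections.
SQLALCHEMY_POOL_SIZE = 20
SQLALCHEMY_MAX_OVERFLOW = 10
```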
### PyCURL Issues (Mac M1)

Symptoms: import errors or version mismatches.

Solutions:

- Follow the installation instructions in GitHub Issue #1216
- Use Docker for local development
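A generic sketch of the Docker route; the image tag and command are assumptions, not the project's actual tooling:

```shell
# Build the image so pycurl compiles against the container's libcurl,
# then run a worker inside it.
docker build -t notifications-api .
docker run --rm -it notifications-api \
  celery -A run_celery.notify_celery worker --loglevel=INFO
```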
## Best Practices
- Use specialized workers - Separate critical tasks from batch processing
- Monitor queue depths - Alert on growing queues
- Make tasks idempotent - Tasks should be safe to retry
- Set appropriate timeouts - Prevent tasks from running indefinitely
- Log task failures - Include context for debugging
- Test task retry logic - Ensure failures are handled gracefully
- Use database transactions - Ensure consistency in task operations
- Limit task size - Break large operations into smaller tasks
- Monitor worker health - Alert on worker crashes
- Regular worker restarts - Use `--max-tasks-per-child` to prevent memory leaks