Overview
GOV.UK Notify API provides comprehensive monitoring through:
- Prometheus metrics for time-series data
- StatsD for real-time metric collection
- Structured logging with JSON output
- Health check endpoints for service health
- Cronitor for scheduled task monitoring
Metrics Systems
Prometheus
The application exports Prometheus metrics via the GDS Metrics library. Reference: app/init.py:22, 59
Configuration
Metrics Endpoint
Prometheus metrics are exposed at:
Key Prometheus Metrics
Web Request Metrics:
- concurrent_web_request_count - Number of concurrent requests currently being served. Incremented on request start, decremented on completion.
Database Connection Metrics:
- db_connection_total_connected - Total database connections held by the server (including idle). Labels:
  - bind - “default” or “bulk”
  - inet_server_addr - Database server IP address
- db_connection_total_checked_out - Database connections currently checked out by requests. Labels:
  - bind - “default” or “bulk”
  - inet_server_addr - Database server IP address
- db_connection_open_duration_seconds - Duration connections are held open. Labels:
  - method - HTTP method or “celery”
  - host - Request host or worker name
  - path - URL path or task name
  - bind - “default” or “bulk”
  - inet_server_addr - Database server IP
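The "incremented on request start, decremented on completion" behaviour of the concurrent request gauge can be illustrated with a small standard-library sketch. The `ConcurrentRequestGauge` class and its `track` method are hypothetical stand-ins written for this example; they are not the GDS Metrics library's API.

```python
import threading
from contextlib import contextmanager


class ConcurrentRequestGauge:
    """Thread-safe in-process gauge (hypothetical stand-in for a Prometheus gauge)."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

    @contextmanager
    def track(self):
        # Increment when the request starts ...
        with self._lock:
            self._value += 1
        try:
            yield
        finally:
            # ... and decrement on completion, even if the handler raised.
            with self._lock:
                self._value -= 1


gauge = ConcurrentRequestGauge()

with gauge.track():
    in_flight = gauge.value  # sampled while one request is being served

after = gauge.value  # sampled once the request has finished
```

Decrementing in a `finally` block matters here: without it, a request that fails would leave the gauge permanently inflated.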
StatsD
StatsD metrics provide real-time operational insights.
Configuration
- StatsD server hostname. Example: statsd.internal.example.com
- StatsD server port. Default: 8125
- StatsD reporting is enabled automatically when STATSD_HOST is set.
SMS Delivery Metrics
Provider Success/Failure:
- clients.mmg.success - MMG successful sends
- clients.mmg.error - MMG send errors
- clients.firetext.success - Firetext successful sends
- clients.firetext.error - Firetext send errors
- clients.{provider}.request-time - Provider API request duration

- sms.total-time - Total time to send SMS
- sms.test-key.total-time - SMS send time for test keys
- sms.live-key.total-time - SMS send time for live keys

- international-sms.{status}.{country_prefix} - International SMS by country and status
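For readers unfamiliar with StatsD, the sketch below shows how metric names like those above map onto the plain-text StatsD line format (`<name>:<value>|<type>`, with `c` for counters and `ms` for timers) sent over UDP. The helper names, the `delivered` status and the `44` prefix are illustrative assumptions; the application uses its own StatsD client and may add a namespace prefix that is not shown here.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Render one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str, port: int = 8125) -> None:
    """Send one datagram to the StatsD server (UDP, so delivery is not guaranteed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


provider = "mmg"
# Counter: one successful send through the provider.
counter = format_metric(f"clients.{provider}.success", 1, "c")
# Timer: the provider API call took 230 milliseconds.
timer = format_metric(f"clients.{provider}.request-time", 230, "ms")
# Counter keyed by status and country prefix (values here are made up).
international = format_metric("international-sms.delivered.44", 1, "c")
```

Calling `send_metric(counter, "statsd.internal.example.com")` would emit the counter to the default port 8125.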
Email Delivery Metrics
Provider Success/Failure:
- clients.ses.success - AWS SES successful sends
- clients.ses.error - AWS SES send errors
- clients.ses.request-time - SES API request duration
- clients.ses_stub.success - SES stub successful sends (testing)
- clients.ses_stub.error - SES stub errors
- clients.ses_stub.request-time - SES stub request duration

- email.test-key.total-time - Email send time for test keys
- email.live-key.total-time - Email send time for live keys
Callback Metrics
SMS Callbacks:
- callback.mmg.{status} - MMG delivery receipts by status
- callback.firetext.{status} - Firetext delivery receipts by status
- callback-to-notification-created - Time from creation to callback
Email Callbacks:
- callback.ses.{status} - SES delivery receipts by status
- callback-to-notification-created - Time from creation to callback
Task Metrics
Tasks decorated with the @statsd decorator report:
- tasks.{task_name}.{status} - Task execution status
- tasks.{task_name}.time - Task execution time
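The two task metrics can be pictured as a decorator that counts the outcome and times the call. This is a minimal sketch, not the real `@statsd` decorator: `statsd_sketch`, the `recorded` list (standing in for a StatsD client) and the `success`/`error` status values are all assumptions made for illustration.

```python
import time
from functools import wraps

recorded = []  # stand-in for a StatsD client; collects (metric, value) pairs


def statsd_sketch(task_name: str):
    """Hypothetical decorator: records the outcome and duration of the wrapped task."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = "success"  # assumed status name
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"  # assumed status name
                raise
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                recorded.append((f"tasks.{task_name}.{status}", 1))
                recorded.append((f"tasks.{task_name}.time", elapsed_ms))

        return wrapper

    return decorator


@statsd_sketch("deliver-sms")
def deliver_sms(notification_id: str) -> str:
    return f"sent {notification_id}"


result = deliver_sms("abc123")
```

Recording in `finally` means the timing metric is emitted for failed tasks as well as successful ones.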
Logging
Log Configuration
- Application log level. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default: INFO
- Handler log level. Defaults to NOTIFY_LOG_LEVEL.
- HTTP request logging level. Default: INFO
Celery Logging
- Celery worker log level. Default: CRITICAL (production), INFO (development)
- Celery Beat scheduler log level. Default: INFO
Structured Logging
The application uses structured logging with JSON output.
Log Contexts
Logs include contextual information:
- Request ID - Unique identifier for each request
- User ID - Authenticated user (if applicable)
- Service ID - Service performing action
- Job ID - Batch job identifier
- Notification ID - Individual notification
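As an illustration of how context fields such as these can end up in JSON log lines, here is a minimal sketch using only the standard `logging` module. The `JsonFormatter` class, the field names (`request_id`, `service_id`, and so on) and the output shape are assumptions for the example; the application's actual log schema may differ.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, carrying selected context fields."""

    CONTEXT_FIELDS = ("request_id", "user_id", "service_id", "job_id", "notification_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context arrives via `extra=...`, which logging copies onto the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


stream = io.StringIO()  # captured in memory so the example has no side effects
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("notify.sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("notification created", extra={"request_id": "req-1", "service_id": "svc-9"})
entry = json.loads(stream.getvalue())
```

Because every line is a self-contained JSON object, a log aggregator can filter on `request_id` to follow a single request across components.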
Health Checks
Status Endpoint
The API provides a health check endpoint that reports:
- Whether the application is running
- Database connectivity
- Build information
Responses:
- 200 OK - All checks passed
- 500 Internal Server Error - One or more checks failed
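The 200/500 behaviour can be sketched as a pure function that runs each check and derives the status code from the results. `build_status_response`, the check names and the response body shape are hypothetical; they only illustrate the "all checks pass, otherwise 500" rule described above.

```python
import json
from typing import Callable, Dict, Tuple


def build_status_response(
    checks: Dict[str, Callable[[], bool]], build_info: Dict[str, str]
) -> Tuple[int, str]:
    """Run every check; return 200 if all pass, 500 if any fails or raises."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (e.g. the database is unreachable) counts as failed.
            results[name] = False
    status_code = 200 if all(results.values()) else 500
    body = {
        "status": "ok" if status_code == 200 else "error",
        "checks": results,
        **build_info,
    }
    return status_code, json.dumps(body)


def failing_db_check() -> bool:
    raise ConnectionError("database unreachable")


ok_code, ok_body = build_status_response({"db": lambda: True}, {"git_commit": "abc123"})
bad_code, bad_body = build_status_response({"db": failing_db_check}, {"git_commit": "abc123"})
```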
Container Health Checks
For container orchestration:
Cronitor Monitoring
Cronitor monitors scheduled task execution.
Configuration
- Enable Cronitor monitoring. Set to 1 to enable. Default: 0
- JSON mapping of task names to Cronitor monitor keys. Format: {"task-name": "cronitor-key"}
Monitored Tasks
Scheduled tasks ping Cronitor on:
- Task start
- Task completion
- Task failure
Cronitor alerts when:
- A task doesn’t run on schedule
- A task runs longer than expected
- A task fails repeatedly
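The following sketch shows how a task name could be resolved to a monitor key through the JSON mapping and turned into a ping for each lifecycle event. The function name, the example task name and the `https://cronitor.link/{key}/{state}` URL shape are assumptions; check Cronitor's telemetry documentation for the format your account expects.

```python
import json
from typing import Optional

VALID_STATES = {"run", "complete", "fail"}  # start, completion, failure


def cronitor_ping_url(task_name: str, state: str, keys_json: str) -> Optional[str]:
    """Return the ping URL for a task state, or None if the task is not monitored."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state}")
    keys = json.loads(keys_json)  # {"task-name": "cronitor-key"}
    key = keys.get(task_name)
    if key is None:
        return None
    # Assumed URL shape, shown for illustration only.
    return f"https://cronitor.link/{key}/{state}"


mapping = '{"example-nightly-task": "abc123"}'  # hypothetical task name and key
start_url = cronitor_ping_url("example-nightly-task", "run", mapping)
unmonitored = cronitor_ping_url("some-other-task", "run", mapping)
```

Returning `None` for unmapped tasks lets callers skip the ping quietly, so adding a new scheduled task does not require a Cronitor monitor to exist first.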
Performance Monitoring
Request Profiling
For slow request diagnostics, set NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS. Detailed diagnostics are logged for requests exceeding this duration. The log entry includes:
- CPU profiling
- Process CPU usage
- Memory usage
- Request timing
Eventlet Statistics
Enable eventlet greenthread statistics. Set to 1 to enable. Default: 0
Alerting
SMS Provider Monitoring
Automatic SMS provider switching:
- Monitor SMS delivery speed and switch providers if slow. Set to 1 to enable. Default: 0
switch-current-sms-provider-on-slow-delivery runs every minute to:
- Check delivery statistics
- Detect slow delivery
- Switch to alternate provider
- Send alerts
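The check, detect and switch steps above could look roughly like the sketch below. It is not the task's actual logic: the `DeliveryStats` type, the 30% slow-delivery threshold and the "pick the first alternate" rule are assumptions chosen to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeliveryStats:
    provider: str
    total: int  # notifications sent in the observation window
    slow: int   # of those, how many exceeded the acceptable delivery time


def provider_to_switch_to(
    current: DeliveryStats,
    alternates: List[str],
    slow_ratio_threshold: float = 0.3,  # hypothetical threshold
) -> Optional[str]:
    """Return the provider to switch to, or None if no switch is warranted."""
    if current.total == 0 or not alternates:
        return None  # nothing to judge by, or nowhere to go
    if current.slow / current.total < slow_ratio_threshold:
        return None  # delivery speed is acceptable
    return alternates[0]


slow_window = DeliveryStats(provider="mmg", total=100, slow=40)
healthy_window = DeliveryStats(provider="mmg", total=100, slow=5)

switch_target = provider_to_switch_to(slow_window, ["firetext"])
no_switch = provider_to_switch_to(healthy_window, ["firetext"])
```

Guarding against an empty window avoids switching providers on the strength of zero traffic, which would otherwise be a division by zero.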
Zendesk Alerts
Send operational alerts to Zendesk. Set to 1 to enable. Default: 0
Alerts cover:
- High failure rates
- Services sending to TV numbers
- Low inbound SMS number availability
- Letters stuck in processing
- Other operational issues
Key Performance Indicators
API Performance
Monitor:
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Requests per second
- Concurrent requests (concurrent_web_request_count)
Targets:
- p95 latency < 500ms for notification creation
- Error rate < 1%
- No sustained connection pool exhaustion
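To make the latency and error-rate figures concrete, this sketch computes p50/p95/p99 from raw latency samples with the standard library and compares them with the 500ms and 1% figures listed above. The function names are illustrative, the sample data is synthetic, and counting every status of 400 or above as an error is an assumption; in practice these values would normally come from the metrics backend rather than from application code.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 for a sample of request latencies."""
    # quantiles(n=100) yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of responses with a 4xx or 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 400) / len(status_codes)


def meets_targets(percentiles: Dict[str, float], errors: float) -> bool:
    """Check the p95 < 500ms and error rate < 1% figures."""
    return percentiles["p95"] < 500 and errors < 0.01


latencies = [float(ms) for ms in range(1, 101)]  # synthetic sample: 1ms to 100ms
codes = [200] * 995 + [500] * 5                  # synthetic sample: 0.5% errors

summary = latency_percentiles(latencies)
healthy = meets_targets(summary, error_rate(codes))
```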
Database Performance
Monitor:
- Connection pool utilization
- Statement timeout occurrences
- Query execution time
- Replication lag (for replicas)
- db_connection_total_checked_out / db_connection_total_connected ratio
Targets:
- Connection pool usage < 80%
- Replication lag < 10 seconds
- No frequent statement timeouts
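The pool usage figure is a simple ratio of the two connection gauges, evaluated per `bind`. The sketch below assumes the gauge values have already been read from the metrics backend; the function name and the input shape are illustrative.

```python
from typing import Dict, List, Tuple


def binds_over_threshold(
    samples: Dict[str, Tuple[float, float]], threshold: float = 0.8
) -> List[str]:
    """Return binds whose checked_out / connected ratio is at or above the threshold.

    `samples` maps bind name -> (db_connection_total_checked_out,
    db_connection_total_connected).
    """
    over = []
    for bind, (checked_out, connected) in samples.items():
        if connected <= 0:
            continue  # no connections reported; the ratio is undefined
        if checked_out / connected >= threshold:
            over.append(bind)
    return over


samples = {"default": (9, 10), "bulk": (2, 10)}  # made-up gauge readings
over_threshold = binds_over_threshold(samples)
```

With these made-up readings only the `default` bind, at 90% usage, is reported.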
Notification Delivery
Monitor:
- SMS delivery rate
- Email delivery rate
- Letter processing rate
- Time to delivery (created → sent)
- Provider failure rates
Targets:
- SMS sent within 30 seconds
- Email sent within 60 seconds
- Provider error rate < 5%
Queue Depth
Monitor:
- SQS ApproximateNumberOfMessages for all queues
- Message age (ApproximateAgeOfOldestMessage)
- Messages in flight
Targets:
- Queue depth < 1000 messages
- Message age < 5 minutes for critical queues
- No growing queues over time
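The three queue figures can be evaluated together once the values have been fetched. This sketch assumes the queue attributes and the message age have already been retrieved (for example from SQS and CloudWatch); the function name and the definition of "growing" as a strictly increasing run of samples are assumptions.

```python
from typing import Dict, List, Sequence

MAX_DEPTH = 1000          # queue depth < 1000 messages
MAX_AGE_SECONDS = 5 * 60  # message age < 5 minutes for critical queues


def queue_alerts(
    attributes: Dict[str, str],
    oldest_message_age_seconds: float,
    depth_history: Sequence[int],
) -> List[str]:
    """Return human-readable problems for one queue (empty list if healthy)."""
    alerts = []
    # SQS returns attribute values as strings, so convert before comparing.
    depth = int(attributes["ApproximateNumberOfMessages"])
    if depth >= MAX_DEPTH:
        alerts.append(f"depth {depth} >= {MAX_DEPTH}")
    if oldest_message_age_seconds >= MAX_AGE_SECONDS:
        alerts.append(f"oldest message is {oldest_message_age_seconds:.0f}s old")
    # "Growing" here means every sample is larger than the one before it.
    samples = list(depth_history)
    if len(samples) >= 3 and all(a < b for a, b in zip(samples, samples[1:])):
        alerts.append("depth has grown across every recent sample")
    return alerts


healthy = queue_alerts({"ApproximateNumberOfMessages": "12"}, 4.0, [10, 14, 12])
unhealthy = queue_alerts({"ApproximateNumberOfMessages": "1500"}, 400.0, [900, 1200, 1500])
```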
Worker Health
Monitor:
- Worker process count
- Task processing rate
- Task failure rate
- Worker memory usage
- Worker restarts
Targets:
- Expected number of workers running
- Task failure rate < 1%
- Worker memory < 1GB per process
Dashboards
Recommended Grafana Dashboards
API Overview Dashboard
- Request rate by endpoint
- Response time percentiles
- Error rate by status code
- Concurrent requests
- Database connection usage
Notification Delivery Dashboard
- Notifications created per minute
- Notifications sent per minute by channel (SMS/Email/Letter)
- Delivery success rate
- Time to delivery
- Provider distribution
Queue Dashboard
- Queue depth by queue name
- Message age
- Messages processed per minute
- Worker count by queue
Database Dashboard
- Connection pool usage
- Query execution time
- Slow queries
- Replication lag
- Transaction rate
Troubleshooting with Metrics
High Error Rate
- Check callback.{provider}.{status} metrics for provider failures
- Review application logs for error patterns
- Check database connection metrics for pool exhaustion
- Verify provider status pages
Slow Response Times
- Check db_connection_open_duration_seconds for slow queries
- Review concurrent request count
- Check for database lock contention
- Enable request profiling with NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS
Growing Queues
- Check worker count and concurrency
- Review task processing rate
- Check for failing tasks causing retries
- Scale up workers if needed
- Review ApproximateAgeOfOldestMessage metric
Database Connection Exhaustion
- Review db_connection_total_checked_out metric
- Check for connection leaks (connections not returned)
- Review slow queries holding connections
- Increase SQLALCHEMY_POOL_SIZE if appropriate
- Reduce worker concurrency
Best Practices
- Set up alerting - Don’t just collect metrics, alert on anomalies
- Monitor queue depths - Growing queues indicate capacity issues
- Track provider health - Provider failures impact delivery
- Use structured logging - Makes log analysis easier
- Correlate metrics with logs - Use request IDs to trace issues
- Monitor database separately - Collect database metrics from PostgreSQL itself
- Set retention policies - Balance metric storage with cost
- Regular dashboard reviews - Identify trends before they become issues
- Document runbooks - Link alerts to remediation steps
- Test alerting - Verify alerts fire as expected