
Overview

The GOV.UK Notify API provides comprehensive monitoring through:
  • Prometheus metrics for time-series data
  • StatsD for real-time metric collection
  • Structured logging with JSON output
  • Health check endpoints for service health
  • Cronitor for scheduled task monitoring

Metrics Systems

Prometheus

The application exports Prometheus metrics via the GDS Metrics library. Reference: app/__init__.py:22, 59

Configuration

# For multi-process Gunicorn workers
export PROMETHEUS_MULTIPROC_DIR="/tmp"
Reference: entrypoint.sh:3

Metrics Endpoint

Prometheus metrics are exposed at:
GET /metrics
This endpoint is not protected by authentication, so that Prometheus can scrape it. Reference: metrics route tests

Key Prometheus Metrics

Web Request Metrics:
concurrent_web_request_count
Gauge
Number of concurrent requests currently being served. Incremented on request start, decremented on completion.
Reference: app/__init__.py:64-67

Database Connection Metrics:
db_connection_total_connected
Gauge
Total database connections held by the server (including idle). Labels:
  • bind - “default” or “bulk”
  • inet_server_addr - Database server IP address
db_connection_total_checked_out
Gauge
Database connections currently checked out by requests. Labels:
  • bind - “default” or “bulk”
  • inet_server_addr - Database server IP address
db_connection_open_duration_seconds
Histogram
Duration connections are held open. Labels:
  • method - HTTP method or “celery”
  • host - Request host or worker name
  • path - URL path or task name
  • bind - “default” or “bulk”
  • inet_server_addr - Database server IP
Reference: app/__init__.py:469-485
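To make the gauges above concrete, here is a dependency-free sketch of how such metrics appear in the Prometheus exposition format served at GET /metrics. The metric values and label values are illustrative, not taken from a real deployment:

```python
# Sketch: rendering a gauge and its labelled samples in the Prometheus
# exposition format ("# HELP", "# TYPE", then one line per sample).
def render_gauge(name, help_text, samples):
    """samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

print(render_gauge(
    "db_connection_total_checked_out",
    "Database connections currently checked out",
    [({"bind": "default", "inet_server_addr": "10.0.0.5"}, 3)],
))
```

The final line of the output has the shape `db_connection_total_checked_out{bind="default",inet_server_addr="10.0.0.5"} 3`, which is what Prometheus parses when it scrapes the endpoint.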

StatsD

StatsD metrics provide real-time operational insights.

Configuration

STATSD_HOST
string
StatsD server hostname. Example: statsd.internal.example.com
STATSD_PORT
integer
StatsD server port. Default: 8125
STATSD_ENABLED
boolean
Automatically enabled when STATSD_HOST is set.
Reference: config.py:484-486, gunicorn_config.py:21
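The counters and timers listed below each travel to STATSD_HOST:STATSD_PORT as a single UDP datagram. A minimal stdlib sketch of that wire format (not the app's actual client code):

```python
# Sketch: the StatsD wire format. Each metric is one UDP datagram of the
# form "<name>:<value>|<type>" ("c" for counters, "ms" for timers).
import socket

def statsd_datagram(metric, value, metric_type):
    return f"{metric}:{value}|{metric_type}".encode()

def send_metric(host, port, datagram):
    # Fire-and-forget UDP send, as a typical StatsD client does.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, (host, port))

# clients.mmg.success incremented by 1:
print(statsd_datagram("clients.mmg.success", 1, "c"))  # b'clients.mmg.success:1|c'
# sms.total-time recorded as 420 ms:
print(statsd_datagram("sms.total-time", 420, "ms"))  # b'sms.total-time:420|ms'
```

Because the transport is fire-and-forget UDP, metric emission never blocks or fails a request, which is why it is safe to enable in the hot path.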

SMS Delivery Metrics

Provider Success/Failure:
  • clients.mmg.success - MMG successful sends
  • clients.mmg.error - MMG send errors
  • clients.firetext.success - Firetext successful sends
  • clients.firetext.error - Firetext send errors
  • clients.{provider}.request-time - Provider API request duration
Reference: app/clients/sms/__init__.py:57-79

Delivery Timing:
  • sms.total-time - Total time to send SMS
  • sms.test-key.total-time - SMS send time for test keys
  • sms.live-key.total-time - SMS send time for live keys
Reference: app/delivery/send_to_providers.py:104-109

International SMS:
  • international-sms.{status}.{country_prefix} - International SMS by country and status
Reference: app/delivery/send_to_providers.py:101

Email Delivery Metrics

Provider Success/Failure:
  • clients.ses.success - AWS SES successful sends
  • clients.ses.error - AWS SES send errors
  • clients.ses.request-time - SES API request duration
  • clients.ses_stub.success - SES stub successful sends (testing)
  • clients.ses_stub.error - SES stub errors
  • clients.ses_stub.request-time - SES stub request duration
Reference: app/clients/email/aws_ses.py:104-127, app/clients/email/aws_ses_stub.py:47-55

Delivery Timing:
  • email.test-key.total-time - Email send time for test keys
  • email.live-key.total-time - Email send time for live keys
Reference: app/delivery/send_to_providers.py:187-189

Callback Metrics

SMS Callbacks:
  • callback.mmg.{status} - MMG delivery receipts by status
  • callback.firetext.{status} - Firetext delivery receipts by status
  • callback-to-notification-created - Time from creation to callback
Reference: app/celery/process_sms_client_response_tasks.py:95-102

Email Callbacks:
  • callback.ses.{status} - SES delivery receipts by status
  • callback-to-notification-created - Time from creation to callback
Reference: app/celery/process_ses_receipts_tasks.py:87-91

Task Metrics

Tasks decorated with the @statsd decorator report:
  • tasks.{task_name}.{status} - Task execution status
  • tasks.{task_name}.time - Task execution time
Reference: app/commands.py:20, 409, 464, 509
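A hypothetical stand-in for such a decorator, showing the two metric names it emits per run; the `recorded` list substitutes for a real StatsD client:

```python
# Sketch: a @statsd-style task decorator that reports
# tasks.{task_name}.{status} on success/failure and tasks.{task_name}.time
# for the execution duration.
import time
from functools import wraps

recorded = []  # stand-in for a StatsD client

def statsd_task(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            recorded.append((f"tasks.{func.__name__}.success", None))
            return result
        except Exception:
            recorded.append((f"tasks.{func.__name__}.failure", None))
            raise
        finally:
            recorded.append((f"tasks.{func.__name__}.time", time.monotonic() - start))
    return wrapper

@statsd_task
def process_job():
    return "done"

process_job()
print([name for name, _ in recorded])  # ['tasks.process_job.success', 'tasks.process_job.time']
```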

Logging

Log Configuration

NOTIFY_LOG_LEVEL
string
Application log level. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default: INFO
NOTIFY_LOG_LEVEL_HANDLERS
string
Handler log level. Defaults to NOTIFY_LOG_LEVEL.
NOTIFY_REQUEST_LOG_LEVEL
string
HTTP request logging level. Default: INFO
Reference: config.py:151-154

Celery Logging

CELERY_WORKER_LOG_LEVEL
string
Celery worker log level. Default: CRITICAL (production), INFO (development)
CELERY_BEAT_LOG_LEVEL
string
Celery Beat scheduler log level. Default: INFO
Reference: config.py:91-92

Structured Logging

The application uses structured logging with JSON output:
current_app.logger.info(
    "Processing job",
    extra={
        "job_id": job_id,
        "service_id": service.id,
        "notification_count": count
    }
)

Log Contexts

Logs include contextual information:
  • Request ID - Unique identifier for each request
  • User ID - Authenticated user (if applicable)
  • Service ID - Service performing action
  • Job ID - Batch job identifier
  • Notification ID - Individual notification
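A minimal stdlib sketch of how these context fields end up in a single JSON log line (the real application uses a JSON logging library; the exact field names here are illustrative):

```python
# Sketch: a JSON formatter that merges contextual fields passed via
# `extra=` into each log line alongside the level and message.
import json
import logging

class JsonFormatter(logging.Formatter):
    CONTEXT_FIELDS = ("request_id", "service_id", "job_id", "notification_id")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("notify")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processing job", extra={"job_id": "1234", "service_id": "abcd"})
# emits: {"level": "INFO", "message": "Processing job", "service_id": "abcd", "job_id": "1234"}
```

Because every line is a self-contained JSON object, log aggregators can filter on fields like `service_id` without regex parsing.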

Health Checks

Status Endpoint

The API provides a health check endpoint:
GET /status
Response:
{
  "status": "ok",
  "db": "ok",
  "git_commit": "abc123...",
  "build_time": "2024-03-03T14:30:00"
}
The endpoint checks that:
  • The application is running
  • The database is reachable
It also returns build information (git commit and build time).
Status Codes:
  • 200 OK - All checks passed
  • 500 Internal Server Error - One or more checks failed
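The handler's logic can be sketched as follows (an assumed shape, not the actual implementation): run every named check, and return 200 only when all of them pass.

```python
# Sketch: aggregate health checks into a /status-style response.
def status_check(checks):
    """checks maps a check name to a zero-argument callable returning True/False."""
    results = {name: ("ok" if check() else "failed") for name, check in checks.items()}
    code = 200 if all(v == "ok" for v in results.values()) else 500
    return code, {"status": "ok" if code == 200 else "error", **results}

code, body = status_check({"db": lambda: True})
print(code, body)  # 200 {'status': 'ok', 'db': 'ok'}
```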

Container Health Checks

For container orchestration:
# Example Docker health check
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:6011/status"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 40s

Cronitor Monitoring

Cronitor monitors scheduled task execution.

Configuration

CRONITOR_ENABLED
string
Enable Cronitor monitoring. Set to 1 to enable. Default: 0
CRONITOR_KEYS
json
JSON mapping of task names to Cronitor monitor keys. Format: {"task-name": "cronitor-key"}
Reference: config.py:157-158

Monitored Tasks

Scheduled tasks ping Cronitor on:
  • Task start
  • Task completion
  • Task failure
Cronitor alerts if:
  • Task doesn’t run on schedule
  • Task runs longer than expected
  • Task fails repeatedly
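As a sketch, resolving a task name to a ping URL via the CRONITOR_KEYS mapping looks roughly like this. The monitor key and URL shape are illustrative; consult Cronitor's telemetry API documentation for the current endpoint format.

```python
# Sketch: look up a task's Cronitor monitor key and build a ping URL
# for one of the three lifecycle states.
import json

# CRONITOR_KEYS is configured as a JSON string; key value is made up here.
CRONITOR_KEYS = json.loads('{"delete-verify-codes": "abc123"}')

def cronitor_ping_url(task_name, state):
    """state is 'run', 'complete', or 'fail'."""
    key = CRONITOR_KEYS[task_name]
    return f"https://cronitor.link/{key}/{state}"

print(cronitor_ping_url("delete-verify-codes", "run"))
# https://cronitor.link/abc123/run
```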

Performance Monitoring

Request Profiling

For slow request diagnostics:
NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS
float
Log detailed diagnostics for requests exceeding this duration. Includes:
  • CPU profiling
  • Process CPU usage
  • Memory usage
  • Request timing
Reference: gunicorn_config.py:25-114

Eventlet Statistics

NOTIFY_EVENTLET_STATS
string
Enable eventlet greenthread statistics. Set to 1 to enable. Default: 0
Reference: config.py:163

Alerting

SMS Provider Monitoring

Automatic SMS provider switching:
CHECK_SLOW_TEXT_MESSAGE_DELIVERY
string
Monitor SMS delivery speed and switch providers if slow. Set to 1 to enable. Default: 0
The scheduled task switch-current-sms-provider-on-slow-delivery runs every minute to:
  1. Check delivery statistics
  2. Detect slow delivery
  3. Switch to alternate provider
  4. Send alerts
Reference: config.py:553, config.py:304-308

Zendesk Alerts

SEND_ZENDESK_ALERTS_ENABLED
string
Send operational alerts to Zendesk. Set to 1 to enable. Default: 0
Reference: config.py:552

Scheduled tasks can create Zendesk tickets for:
  • High failure rates
  • Services sending to TV numbers
  • Low inbound SMS number availability
  • Letters stuck in processing
  • Other operational issues

Key Performance Indicators

API Performance

Monitor:
  • Request latency (p50, p95, p99)
  • Error rate (4xx, 5xx)
  • Requests per second
  • Concurrent requests (concurrent_web_request_count)
Thresholds:
  • p95 latency < 500ms for notification creation
  • Error rate < 1%
  • No sustained connection pool exhaustion

Database Performance

Monitor:
  • Connection pool utilization
  • Statement timeout occurrences
  • Query execution time
  • Replication lag (for replicas)
  • db_connection_total_checked_out / db_connection_total_connected ratio
Thresholds:
  • Connection pool usage < 80%
  • Replication lag < 10 seconds
  • No frequent statement timeouts
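The pool-utilisation figure is simply the ratio of the two connection gauges described earlier, compared against the 80% threshold:

```python
# Sketch: pool utilisation from db_connection_total_checked_out and
# db_connection_total_connected (values are illustrative).
def pool_utilisation(checked_out, connected):
    """Fraction of held connections currently checked out."""
    return checked_out / connected if connected else 0.0

usage = pool_utilisation(checked_out=12, connected=20)
print(f"{usage:.0%}")  # 60%
```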

Notification Delivery

Monitor:
  • SMS delivery rate
  • Email delivery rate
  • Letter processing rate
  • Time to delivery (created → sent)
  • Provider failure rates
Thresholds:
  • SMS sent within 30 seconds
  • Email sent within 60 seconds
  • Provider error rate < 5%

Queue Depth

Monitor:
  • SQS ApproximateNumberOfMessages for all queues
  • Message age (ApproximateAgeOfOldestMessage)
  • Messages in flight
Thresholds:
  • Queue depth < 1000 messages
  • Message age < 5 minutes for critical queues
  • No growing queues over time
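These thresholds can be evaluated against the attributes SQS returns; in practice the values come from boto3's `get_queue_attributes`, but the check itself is simple. The attribute values below are illustrative:

```python
# Sketch: compare SQS queue attributes (returned as strings) against
# the depth and message-age thresholds above.
def queue_alerts(attrs, max_depth=1000, max_age_seconds=300):
    alerts = []
    if int(attrs["ApproximateNumberOfMessages"]) > max_depth:
        alerts.append("queue depth exceeds threshold")
    if int(attrs["ApproximateAgeOfOldestMessage"]) > max_age_seconds:
        alerts.append("oldest message too old")
    return alerts

print(queue_alerts({
    "ApproximateNumberOfMessages": "1500",
    "ApproximateAgeOfOldestMessage": "120",
}))  # ['queue depth exceeds threshold']
```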

Worker Health

Monitor:
  • Worker process count
  • Task processing rate
  • Task failure rate
  • Worker memory usage
  • Worker restarts
Thresholds:
  • Expected number of workers running
  • Task failure rate < 1%
  • Worker memory < 1GB per process

Dashboards

API Overview Dashboard

  • Request rate by endpoint
  • Response time percentiles
  • Error rate by status code
  • Concurrent requests
  • Database connection usage

Notification Delivery Dashboard

  • Notifications created per minute
  • Notifications sent per minute by channel (SMS/Email/Letter)
  • Delivery success rate
  • Time to delivery
  • Provider distribution

Queue Dashboard

  • Queue depth by queue name
  • Message age
  • Messages processed per minute
  • Worker count by queue

Database Dashboard

  • Connection pool usage
  • Query execution time
  • Slow queries
  • Replication lag
  • Transaction rate

Troubleshooting with Metrics

High Error Rate

  1. Check callback.{provider}.{status} metrics for provider failures
  2. Review application logs for error patterns
  3. Check database connection metrics for pool exhaustion
  4. Verify provider status pages

Slow Response Times

  1. Check db_connection_open_duration_seconds for slow queries
  2. Review concurrent request count
  3. Check for database lock contention
  4. Enable request profiling with NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS

Growing Queues

  1. Check worker count and concurrency
  2. Review task processing rate
  3. Check for failing tasks causing retries
  4. Scale up workers if needed
  5. Review ApproximateAgeOfOldestMessage metric

Database Connection Exhaustion

  1. Review db_connection_total_checked_out metric
  2. Check for connection leaks (connections not returned)
  3. Review slow queries holding connections
  4. Increase SQLALCHEMY_POOL_SIZE if appropriate
  5. Reduce worker concurrency

Best Practices

  1. Set up alerting - Don’t just collect metrics, alert on anomalies
  2. Monitor queue depths - Growing queues indicate capacity issues
  3. Track provider health - Provider failures impact delivery
  4. Use structured logging - Makes log analysis easier
  5. Correlate metrics with logs - Use request IDs to trace issues
  6. Monitor database separately - Database metrics from PostgreSQL itself
  7. Set retention policies - Balance metric storage with cost
  8. Regular dashboard reviews - Identify trends before they become issues
  9. Document runbooks - Link alerts to remediation steps
  10. Test alerting - Verify alerts fire as expected
