Monitoring and Metrics - GOV.UK Notify API

Overview

GOV.UK Notify API provides comprehensive monitoring through:

Prometheus metrics for time-series data
StatsD for real-time metric collection
Structured logging with JSON output
Health check endpoints for service health
Cronitor for scheduled task monitoring

Metrics Systems

Prometheus

The application exports Prometheus metrics via the GDS Metrics library. Reference: app/__init__.py:22, 59

Configuration

# For multi-process Gunicorn workers
export PROMETHEUS_MULTIPROC_DIR="/tmp"

Reference: entrypoint.sh:3

Metrics Endpoint

Prometheus metrics are exposed at:

GET /metrics

This endpoint is not protected by authentication to allow Prometheus scraping. Reference: Test file reference to metrics route

Key Prometheus Metrics

Web Request Metrics:

concurrent_web_request_count

Gauge

Number of concurrent requests currently being served. Incremented on request start, decremented on completion.
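
The increment-on-start, decrement-on-completion pattern can be sketched as follows. This is a minimal illustration that uses a hypothetical, stdlib-only `InProgressGauge` class as a stand-in for a real Prometheus gauge; the class and method names are assumptions and are not taken from the application code.

```python
import threading
from contextlib import contextmanager


class InProgressGauge:
    """Hypothetical stand-in for a Prometheus gauge (not the real client)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    @property
    def value(self):
        with self._lock:
            return self._value

    @contextmanager
    def track(self):
        # Increment when the request starts.
        with self._lock:
            self._value += 1
        try:
            yield
        finally:
            # Decrement on completion, even if the handler raised.
            with self._lock:
                self._value -= 1


concurrent_web_request_count = InProgressGauge()

with concurrent_web_request_count.track():
    in_flight = concurrent_web_request_count.value  # read while a request is active
after = concurrent_web_request_count.value  # read once the request has finished
```

Wrapping the decrement in `finally` keeps the gauge from drifting upwards when a request handler raises.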

Reference: app/__init__.py:64-67

Database Connection Metrics:

db_connection_total_connected

Gauge

Total database connections held by the server (including idle). Labels:

bind - “default” or “bulk”
inet_server_addr - Database server IP address

db_connection_total_checked_out

Gauge

Database connections currently checked out by requests. Labels:

bind - “default” or “bulk”
inet_server_addr - Database server IP address

db_connection_open_duration_seconds

Histogram

Duration connections are held open. Labels:

method - HTTP method or “celery”
host - Request host or worker name
path - URL path or task name
bind - “default” or “bulk”
inet_server_addr - Database server IP

Reference: app/__init__.py:469-485

StatsD

StatsD metrics provide real-time operational insights.

Configuration

STATSD_HOST

string

StatsD server hostname. Example: statsd.internal.example.com

STATSD_PORT

integer

StatsD server port. Default: 8125

STATSD_ENABLED

boolean

Automatically enabled when STATSD_HOST is set.

Reference: config.py:484-486, gunicorn_config.py:21
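
As a rough sketch of how these three settings relate, the function below derives them from environment variables. The function name `statsd_settings` is hypothetical, and the real logic in config.py may differ; only the variable names, the default port, and the "enabled when a host is set" rule come from the descriptions above.

```python
import os


def statsd_settings(environ=os.environ):
    """Derive StatsD settings; keys mirror the variables documented above."""
    host = environ.get("STATSD_HOST")
    return {
        "STATSD_HOST": host,
        # 8125 is the documented default port.
        "STATSD_PORT": int(environ.get("STATSD_PORT", 8125)),
        # Enabled only when a host has been configured.
        "STATSD_ENABLED": bool(host),
    }
```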

SMS Delivery Metrics

Provider Success/Failure:

clients.mmg.success - MMG successful sends
clients.mmg.error - MMG send errors
clients.firetext.success - Firetext successful sends
clients.firetext.error - Firetext send errors
clients.{provider}.request-time - Provider API request duration

Reference: app/clients/sms/__init__.py:57-79

Delivery Timing:

sms.total-time - Total time to send SMS
sms.test-key.total-time - SMS send time for test keys
sms.live-key.total-time - SMS send time for live keys

Reference: app/delivery/send_to_providers.py:104-109

International SMS:

international-sms.{status}.{country_prefix} - International SMS by country and status

Reference: app/delivery/send_to_providers.py:101
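
To make the naming scheme concrete, the sketch below formats a few of these metrics as StatsD datagrams (`name:value|c` for counters, `name:value|ms` for timers). The helper names are hypothetical, and the `delivered` status and `44` prefix are example values chosen for illustration; whether the application reports the international metric as a counter is also an assumption.

```python
def counter_line(name, value=1):
    """Format a StatsD counter datagram."""
    return f"{name}:{value}|c"


def timing_line(name, seconds):
    """Format a StatsD timer datagram; timers are reported in milliseconds."""
    return f"{name}:{round(seconds * 1000)}|ms"


def provider_metric(provider, suffix):
    # Builds names of the form clients.{provider}.{suffix}.
    return f"clients.{provider}.{suffix}"


success = counter_line(provider_metric("mmg", "success"))
request_time = timing_line(provider_metric("firetext", "request-time"), 0.25)
# Example values for {status} and {country_prefix}; both are assumptions.
international = counter_line("international-sms.delivered.44")
```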

Email Delivery Metrics

Provider Success/Failure:

clients.ses.success - AWS SES successful sends
clients.ses.error - AWS SES send errors
clients.ses.request-time - SES API request duration
clients.ses_stub.success - SES stub successful sends (testing)
clients.ses_stub.error - SES stub errors
clients.ses_stub.request-time - SES stub request duration

Reference: app/clients/email/aws_ses.py:104-127, app/clients/email/aws_ses_stub.py:47-55

Delivery Timing:

email.test-key.total-time - Email send time for test keys
email.live-key.total-time - Email send time for live keys

Reference: app/delivery/send_to_providers.py:187-189

Callback Metrics

SMS Callbacks:

callback.mmg.{status} - MMG delivery receipts by status
callback.firetext.{status} - Firetext delivery receipts by status
callback-to-notification-created - Time from creation to callback

Reference: app/celery/process_sms_client_response_tasks.py:95-102

Email Callbacks:

callback.ses.{status} - SES delivery receipts by status
callback-to-notification-created - Time from creation to callback

Reference: app/celery/process_ses_receipts_tasks.py:87-91

Task Metrics

Tasks decorated with the @statsd decorator report:

tasks.{task_name}.{status} - Task execution status
tasks.{task_name}.time - Task execution time

Reference: app/commands.py:20, 409, 464, 509
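
The following sketch shows one way a decorator could produce both metrics. It is not the application's @statsd decorator: `statsd_task`, the `emit` callback, the `success` and `exception` status labels, and the `delete_old_records` task are all hypothetical names used for illustration.

```python
import functools
import time


def statsd_task(emit, clock=time.monotonic):
    """Hypothetical decorator factory; `emit(name, value)` receives each metric."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = clock()
            status = "success"  # assumed status label
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "exception"  # assumed status label
                raise
            finally:
                name = func.__name__
                emit(f"tasks.{name}.{status}", 1)
                emit(f"tasks.{name}.time", clock() - started)

        return wrapper

    return decorator


recorded = []


@statsd_task(emit=lambda name, value: recorded.append(name))
def delete_old_records():
    return "done"


result = delete_old_records()
```

Emitting from the `finally` block means the timing metric is recorded whether the task succeeds or raises.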

Logging

Log Configuration

NOTIFY_LOG_LEVEL

string

Application log level. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL. Default: INFO

NOTIFY_LOG_LEVEL_HANDLERS

string

Handler log level. Defaults to NOTIFY_LOG_LEVEL.

NOTIFY_REQUEST_LOG_LEVEL

string

HTTP request logging level. Default: INFO

Reference: config.py:151-154

Celery Logging

CELERY_WORKER_LOG_LEVEL

string

Celery worker log level. Default: CRITICAL (production), INFO (development)

CELERY_BEAT_LOG_LEVEL

string

Celery Beat scheduler log level. Default: INFO

Reference: config.py:91-92

Structured Logging

The application uses structured logging with JSON output:

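from flask import current_app

# job_id, service and count come from the surrounding task or request handler.
current_app.logger.info(
    "Processing job",
    extra={
        "job_id": job_id,
        "service_id": service.id,
        "notification_count": count,
    },
)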
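
To illustrate how fields passed through `extra` can end up in JSON output, here is a minimal stdlib-only sketch. The `JsonFormatter` class, the logger name, and the output field names (`level`, `message`) are assumptions made for this example; the application's actual formatter is not shown in this document and may emit different fields.

```python
import io
import json
import logging

# Attributes every LogRecord has; anything else arrived through `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log line."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-JSON types such as UUIDs serialisable.
        return json.dumps(payload, default=str)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured-logging-sketch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("Processing job", extra={"job_id": "job-1", "notification_count": 3})
line = json.loads(stream.getvalue())
```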

Log Contexts

Logs include contextual information:

Request ID - Unique identifier for each request
User ID - Authenticated user (if applicable)
Service ID - Service performing action
Job ID - Batch job identifier
Notification ID - Individual notification

Health Checks

Status Endpoint

The API provides a health check endpoint:

GET /status

Response:

{
  "status": "ok",
  "db": "ok",
  "git_commit": "abc123...",
  "build_time": "2024-03-03T14:30:00"
}

The endpoint:

Confirms the application is running
Checks database connectivity
Returns build information

Status Codes:

200 OK - All checks passed
500 Internal Server Error - One or more checks failed
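
A monitoring client might interpret the response along these lines. This is a sketch only: `is_healthy` is a hypothetical helper, and it assumes the `status` and `db` fields shown in the example response above.

```python
import json


def is_healthy(status_code, body):
    """Return True when a /status response reports every check as ok."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # A body that is not valid JSON is treated as unhealthy.
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ok" and payload.get("db") == "ok"
```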

Container Health Checks

For container orchestration:

# Example Docker Compose health check
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:6011/status"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 40s

Cronitor Monitoring

Cronitor monitors scheduled task execution.

Configuration

CRONITOR_ENABLED

string

Enable Cronitor monitoring. Set to 1 to enable. Default: 0

CRONITOR_KEYS

json

JSON mapping of task names to Cronitor monitor keys. Format: {"task-name": "cronitor-key"}

Reference: config.py:157-158

Monitored Tasks

Scheduled tasks ping Cronitor on:

Task start
Task completion
Task failure

Cronitor alerts if:

Task doesn’t run on schedule
Task runs longer than expected
Task fails repeatedly
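
The start, completion and failure pings can be sketched as a decorator. Everything here is illustrative: the `cronitor` decorator signature, the injectable `ping` callback, the `run`/`complete`/`fail` state names, and the `delete-old-records` task with key `abc123` are assumptions, not the application's implementation. Only the CRONITOR_KEYS mapping format comes from the configuration described above.

```python
import functools
import json


def cronitor(task_name, keys, ping, enabled=True):
    """Hypothetical decorator: report task start, completion and failure."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = keys.get(task_name)
            if not enabled or key is None:
                # No monitor configured for this task: run it without pings.
                return func(*args, **kwargs)
            ping(key, "run")
            try:
                result = func(*args, **kwargs)
            except Exception:
                ping(key, "fail")
                raise
            ping(key, "complete")
            return result

        return wrapper

    return decorator


keys = json.loads('{"delete-old-records": "abc123"}')  # CRONITOR_KEYS format
pings = []


@cronitor("delete-old-records", keys, ping=lambda key, state: pings.append((key, state)))
def delete_old_records():
    return "done"


delete_old_records()
```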

Performance Monitoring

Request Profiling

For slow request diagnostics:

NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS

float

Log detailed diagnostics for requests exceeding this duration. Includes:

CPU profiling
Process CPU usage
Memory usage
Request timing

Reference: gunicorn_config.py:25-114
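
The threshold check itself can be sketched in a few lines. The helper names `load_threshold` and `slow_request_report` are hypothetical, the report fields are illustrative, and treating an unset variable as "diagnostics disabled" is an assumption; the actual logic in gunicorn_config.py is more detailed.

```python
import os
import time

ENV_VAR = "NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS"


def load_threshold(environ=os.environ):
    """Return the threshold in seconds, or None when the variable is unset."""
    raw = environ.get(ENV_VAR)
    return float(raw) if raw else None


def slow_request_report(started, finished, threshold):
    """Return a diagnostics dict for slow requests, otherwise None."""
    elapsed = finished - started
    if threshold is None or elapsed <= threshold:
        return None
    return {
        "request_time_seconds": elapsed,
        # Process-wide CPU time consumed so far (user + system).
        "process_cpu_seconds": time.process_time(),
    }
```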

Eventlet Statistics

NOTIFY_EVENTLET_STATS

string

Enable eventlet greenthread statistics. Set to 1 to enable. Default: 0

Reference: config.py:163

Alerting

SMS Provider Monitoring

Automatic SMS provider switching:

CHECK_SLOW_TEXT_MESSAGE_DELIVERY

string

Monitor SMS delivery speed and switch providers if slow. Set to 1 to enable. Default: 0

The scheduled task switch-current-sms-provider-on-slow-delivery runs every minute to:

Check delivery statistics
Detect slow delivery
Switch to alternate provider
Send alerts

Reference: config.py:553, config.py:304-308
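
The check-detect-switch sequence above can be sketched as a pure function. This is an illustration under stated assumptions: `choose_provider`, the shape of `stats`, and the 30% slow-delivery ratio are all hypothetical and do not reflect the application's real thresholds or data model.

```python
def choose_provider(current, stats, slow_ratio=0.3, providers=("mmg", "firetext")):
    """Return the provider to use after checking delivery statistics.

    `stats` maps a provider name to (slow_count, total_count). The 30% ratio
    is an illustrative assumption, not the application's real threshold.
    """
    slow, total = stats.get(current, (0, 0))
    if total == 0 or slow / total < slow_ratio:
        return current  # delivery looks healthy, keep the current provider
    alternates = [name for name in providers if name != current]
    return alternates[0] if alternates else current
```

Keeping the decision separate from the side effects (persisting the switch, sending alerts) makes the rule easy to test on its own.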

Zendesk Alerts

SEND_ZENDESK_ALERTS_ENABLED

string

Send operational alerts to Zendesk. Set to 1 to enable. Default: 0

Reference: config.py:552

Scheduled tasks can create Zendesk tickets for:

High failure rates
Services sending to TV numbers
Low inbound SMS number availability
Letters stuck in processing
Other operational issues

Key Performance Indicators

API Performance

Monitor:

Request latency (p50, p95, p99)
Error rate (4xx, 5xx)
Requests per second
Concurrent requests (concurrent_web_request_count)

Thresholds:

p95 latency < 500ms for notification creation
Error rate < 1%
No sustained connection pool exhaustion
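
As an illustration of the latency threshold, the sketch below computes p50, p95 and p99 from raw samples with the standard library and compares p95 against the 500 ms target. The function names and the sample data are hypothetical; in practice these percentiles would usually come from the metrics backend rather than application code.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index i - 1 holds the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def meets_latency_target(samples_ms, p95_limit_ms=500):
    """True when p95 latency is below the documented 500 ms threshold."""
    return latency_percentiles(samples_ms)["p95"] < p95_limit_ms


samples = list(range(1, 101))  # illustrative samples: 1 ms to 100 ms
summary = latency_percentiles(samples)
```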

Database Performance

Monitor:

Connection pool utilization
Statement timeout occurrences
Query execution time
Replication lag (for replicas)
db_connection_total_checked_out / db_connection_total_connected ratio

Thresholds:

Connection pool usage < 80%
Replication lag < 10 seconds
No frequent statement timeouts
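
The pool usage threshold can be expressed with the two connection gauges. This is a minimal sketch with hypothetical helper names; it assumes the values have already been read from db_connection_total_checked_out and db_connection_total_connected for the same bind.

```python
def pool_utilization(checked_out, connected):
    """Ratio of checked-out to total connections; 0.0 when none are connected."""
    return checked_out / connected if connected else 0.0


def pool_under_pressure(checked_out, connected, limit=0.8):
    """True when utilization reaches the documented 80% threshold."""
    return pool_utilization(checked_out, connected) >= limit
```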

Notification Delivery

Monitor:

SMS delivery rate
Email delivery rate
Letter processing rate
Time to delivery (created → sent)
Provider failure rates

Thresholds:

SMS sent within 30 seconds
Email sent within 60 seconds
Provider error rate < 5%

Queue Depth

Monitor:

SQS ApproximateNumberOfMessages for all queues
Message age (ApproximateAgeOfOldestMessage)
Messages in flight

Thresholds:

Queue depth < 1000 messages
Message age < 5 minutes for critical queues
No growing queues over time
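
These three thresholds can be checked together as sketched below. `queue_alerts` is a hypothetical helper, and treating three or more strictly increasing samples as a "growing" queue is an assumption made for the example; the inputs are plain numbers already fetched from the queue metrics.

```python
def queue_alerts(depths, oldest_age_seconds, max_depth=1000, max_age_seconds=300):
    """Evaluate queue depth samples (oldest first) against the thresholds above."""
    alerts = []
    if depths and depths[-1] >= max_depth:
        alerts.append("depth")
    if oldest_age_seconds >= max_age_seconds:
        alerts.append("age")
    # Treat a strictly increasing run of samples as a growing queue.
    if len(depths) >= 3 and all(a < b for a, b in zip(depths, depths[1:])):
        alerts.append("growing")
    return alerts
```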

Worker Health

Monitor:

Worker process count
Task processing rate
Task failure rate
Worker memory usage
Worker restarts

Thresholds:

Expected number of workers running
Task failure rate < 1%
Worker memory < 1GB per process

Dashboards

Recommended Grafana Dashboards

API Overview Dashboard

Request rate by endpoint
Response time percentiles
Error rate by status code
Concurrent requests
Database connection usage

Notification Delivery Dashboard

Notifications created per minute
Notifications sent per minute by channel (SMS/Email/Letter)
Delivery success rate
Time to delivery
Provider distribution

Queue Dashboard

Queue depth by queue name
Message age
Messages processed per minute
Worker count by queue

Database Dashboard

Connection pool usage
Query execution time
Slow queries
Replication lag
Transaction rate

Troubleshooting with Metrics

High Error Rate

Check callback.{provider}.{status} metrics for provider failures
Review application logs for error patterns
Check database connection metrics for pool exhaustion
Verify provider status pages

Slow Response Times

Check db_connection_open_duration_seconds for slow queries
Review concurrent request count
Check for database lock contention
Enable request profiling with NOTIFY_GUNICORN_DEBUG_POST_REQUEST_LOG_THRESHOLD_SECONDS

Growing Queues

Check worker count and concurrency
Review task processing rate
Check for failing tasks causing retries
Scale up workers if needed
Review ApproximateAgeOfOldestMessage metric

Database Connection Exhaustion

Review db_connection_total_checked_out metric
Check for connection leaks (connections not returned)
Review slow queries holding connections
Increase SQLALCHEMY_POOL_SIZE if appropriate
Reduce worker concurrency

Best Practices

Set up alerting - Don’t just collect metrics, alert on anomalies
Monitor queue depths - Growing queues indicate capacity issues
Track provider health - Provider failures impact delivery
Use structured logging - Makes log analysis easier
Correlate metrics with logs - Use request IDs to trace issues
Monitor the database separately - Collect database metrics from PostgreSQL itself
Set retention policies - Balance metric storage with cost
Regular dashboard reviews - Identify trends before they become issues
Document runbooks - Link alerts to remediation steps
Test alerting - Verify alerts fire as expected
