
Overview

Permission Mongo includes pre-configured Prometheus alerting rules that monitor critical system conditions. Alerts are defined in monitoring/alerts.yml and evaluated by Prometheus every 15 seconds.
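For reference, a minimal prometheus.yml fragment that would load these rules at the stated 15-second cadence (the mount path is an assumption; adjust to your deployment):

```yaml
global:
  evaluation_interval: 15s   # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml   # mounted from monitoring/alerts.yml (path is an assumption)
```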

Alert Rules

All alerting rules are organized in the permission-mongo-alerts group.

HighErrorRate

Severity: Critical
Condition: Error rate > 5% over 5 minutes
Duration: 5 minutes
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High HTTP error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
Trigger Condition: More than 5% of requests return 5xx status codes
Response Actions:
  • Check application logs for errors
  • Verify database connectivity
  • Review recent deployments
  • Check infrastructure health

HighLatency

Severity: Warning
Condition: P95 latency > 500ms
Duration: 5 minutes
- alert: HighLatency
  expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High request latency detected"
    description: "95th percentile latency is {{ $value }}s"
Trigger Condition: 95th percentile latency exceeds 500ms
Response Actions:
  • Investigate slow database queries
  • Check cache hit rate
  • Review RBAC policy complexity
  • Examine resource utilization

ServiceDown

Severity: Critical
Condition: Target unreachable
Duration: 1 minute
- alert: ServiceDown
  expr: up{job="permission-mongo"} == 0 or up{job="permission-mongo-local"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Permission Mongo service is down"
    description: "The service has been unreachable for more than 1 minute"
Trigger Condition: Prometheus cannot scrape metrics endpoint
Response Actions:
  • Verify service is running: docker ps or systemctl status
  • Check application logs
  • Verify network connectivity
  • Check port availability (8080)

HighGoroutineCount

Severity: Warning
Condition: Goroutines > 10,000
Duration: 5 minutes
- alert: HighGoroutineCount
  expr: goroutines_count > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High goroutine count"
    description: "Goroutine count is {{ $value }}"
Trigger Condition: More than 10,000 active goroutines
Response Actions:
  • Investigate potential goroutine leaks
  • Review recent code changes
  • Check for blocked operations
  • Consider restarting the service
A persistently high goroutine count usually indicates a goroutine leak, which will eventually cause memory exhaustion.
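If the service exposes Go's net/http/pprof handlers (an assumption; the port and path below are the standard pprof defaults, not confirmed for this service), a goroutine dump is the fastest way to locate a leak:

```shell
# Goroutine counts grouped by identical stack trace; a leak shows up as one
# stack with thousands of entries (port 8080 is an assumption)
curl -s "http://localhost:8080/debug/pprof/goroutine?debug=1" | head -n 40

# Full stack for every goroutine, useful for seeing what they are blocked on
curl -s "http://localhost:8080/debug/pprof/goroutine?debug=2" > goroutines.txt
```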

MongoDBErrors

Severity: Warning
Condition: Error rate > 1/second
Duration: 5 minutes
- alert: MongoDBErrors
  expr: rate(mongodb_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MongoDB errors detected"
    description: "MongoDB error rate is {{ $value }}/s"
Trigger Condition: More than 1 MongoDB error per second
Response Actions:
  • Check MongoDB logs
  • Verify database connectivity
  • Check disk space on MongoDB server
  • Review query patterns

MongoDBHighLatency

Severity: Warning
Condition: P95 latency > 100ms
Duration: 5 minutes
- alert: MongoDBHighLatency
  expr: |
    histogram_quantile(0.95, sum(rate(mongodb_operation_duration_seconds_bucket[5m])) by (le, operation))
    > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High MongoDB latency"
    description: "95th percentile MongoDB latency is {{ $value }}s"
Trigger Condition: 95th percentile database operation latency exceeds 100ms
Response Actions:
  • Check MongoDB server resources (CPU, memory, disk I/O)
  • Review slow query logs
  • Verify indexes are present
  • Check for long-running operations

LowCacheHitRate

Severity: Warning
Condition: Hit rate < 50%
Duration: 10 minutes
- alert: LowCacheHitRate
  expr: |
    (
      sum(rate(cache_hits_total[5m]))
      /
      (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
    ) < 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low cache hit rate"
    description: "Cache hit rate is {{ $value | humanizePercentage }}"
Trigger Condition: Cache hit rate below 50%
Response Actions:
  • Review cache TTL settings
  • Check cache memory limits
  • Verify cache eviction policies
  • Consider increasing cache size

AuditQueueBacklog

Severity: Warning
Condition: Queue size > 900
Duration: 5 minutes
- alert: AuditQueueBacklog
  expr: audit_queue_size > 900
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Audit log queue backlog"
    description: "Audit queue size is {{ $value }}, approaching capacity"
Trigger Condition: Audit queue approaching capacity (900 of 1000 max)
Response Actions:
  • Check MongoDB write performance
  • Verify audit batch processing
  • Consider increasing queue size
  • Review audit log volume

AuditLogsDropped

Severity: Critical
Condition: Any drops occurring
Duration: 1 minute
- alert: AuditLogsDropped
  expr: rate(audit_logs_dropped_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Audit logs are being dropped"
    description: "{{ $value }} audit logs/s are being dropped"
Trigger Condition: Any audit logs being dropped
Response Actions:
  • Immediately investigate audit system
  • Check MongoDB connectivity and performance
  • Review audit queue configuration
  • Consider increasing queue size or batch processing rate
Dropped audit logs represent a compliance risk; treat this alert with the highest priority.

Alert Summary Table

| Alert | Threshold | Duration | Severity |
| --- | --- | --- | --- |
| HighErrorRate | > 5% 5xx errors | 5 minutes | Critical |
| HighLatency | P95 > 500ms | 5 minutes | Warning |
| ServiceDown | Target unreachable | 1 minute | Critical |
| HighGoroutineCount | > 10,000 goroutines | 5 minutes | Warning |
| MongoDBErrors | > 1 error/second | 5 minutes | Warning |
| MongoDBHighLatency | P95 > 100ms | 5 minutes | Warning |
| LowCacheHitRate | < 50% hit rate | 10 minutes | Warning |
| AuditQueueBacklog | Queue > 900 | 5 minutes | Warning |
| AuditLogsDropped | Any drops | 1 minute | Critical |

Setting Up Alertmanager

To receive alert notifications, configure Alertmanager:
Step 1: Create alertmanager.yml

monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'
Step 2: Add Alertmanager to Docker Compose

docker-compose.monitoring.yaml
alertmanager:
  image: prom/alertmanager
  ports:
    - "9093:9093"
  volumes:
    - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
Step 3: Update Prometheus configuration

monitoring/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
Step 4: Restart the monitoring stack

docker-compose -f docker-compose.monitoring.yaml restart
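After the restart, a quick check against the standard health endpoints confirms both components came back up (both return HTTP 200 when healthy):

```shell
# Prometheus and Alertmanager expose built-in health endpoints
curl -fsS http://localhost:9090/-/healthy   # Prometheus
curl -fsS http://localhost:9093/-/healthy   # Alertmanager
```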

Notification Channels

Alertmanager supports multiple notification channels:

Email

receivers:
  - name: 'email-alerts'
    email_configs:
      - to: '[email protected]'

Slack

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Permission Mongo Alert'

PagerDuty

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
        severity: 'critical'

Webhook

receivers:
  - name: 'webhook-alerts'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'

Advanced Routing

Route different severity levels to different channels:
monitoring/alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    
    - match:
        severity: warning
      receiver: 'slack-warnings'
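Beyond routing, Alertmanager can also suppress lower-severity notifications while a related critical alert is firing. A sketch of an inhibit rule (standard Alertmanager configuration; the choice of labels here is an assumption):

```yaml
inhibit_rules:
  # Mute warning-level notifications while a critical alert
  # for the same alertname is firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname']
```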

Silencing Alerts

Temporarily silence alerts during maintenance:
  1. Navigate to http://localhost:9093 (Alertmanager)
  2. Click Silences → New Silence
  3. Configure matchers (e.g., alertname=HighLatency)
  4. Set duration
  5. Add comment
  6. Click Create
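The same silence can be created from the command line with amtool, the CLI bundled with Alertmanager (the duration and comment values below are illustrative):

```shell
# Silence HighLatency for two hours
amtool silence add alertname=HighLatency \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --comment="Planned maintenance window"

# List active silences, and expire one by ID when maintenance ends
amtool silence query --alertmanager.url=http://localhost:9093
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093
```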

Testing Alerts

Manually trigger alerts to verify notification setup:
# Send a test alert (v2 API; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing alert notification pipeline"
    }
  }
]'

Custom Alert Rules

To add custom alerting rules:
Step 1: Edit alerts.yml

monitoring/alerts.yml
groups:
  - name: permission-mongo-alerts
    rules:
      # ... existing rules ...
      
      # Add your custom rule
      - alert: CustomAlert
        expr: your_metric > threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Custom alert triggered"
          description: "{{ $value }}"
Step 2: Validate configuration

promtool check rules monitoring/alerts.yml
Step 3: Reload Prometheus

curl -X POST http://localhost:9090/-/reload
This requires Prometheus to be started with the --web.enable-lifecycle flag. Alternatively, restart the container:
docker-compose -f docker-compose.monitoring.yaml restart prometheus

Viewing Active Alerts

Check currently firing alerts in the Prometheus UI at http://localhost:9090/alerts, or in the Alertmanager UI at http://localhost:9093.
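The same information is available from the Prometheus HTTP API. A sketch that filters firing alerts with jq (a sample API response is inlined so the filter can be seen end to end):

```shell
# In practice, query the live endpoint:
#   curl -s http://localhost:9090/api/v1/alerts \
#     | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
# Sample payload mirroring the API's response shape:
echo '{"status":"success","data":{"alerts":[{"labels":{"alertname":"HighLatency","severity":"warning"},"state":"firing"},{"labels":{"alertname":"HighErrorRate","severity":"critical"},"state":"pending"}]}}' \
  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
```

Alerts in state "pending" have matched their expression but not yet for the full `for:` duration, so the filter above reports only HighLatency.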

Best Practices

Thresholds
  • Avoid alert fatigue by setting realistic thresholds
  • Use historical data to establish baselines
  • Tune thresholds based on actual incidents

Durations
  • Use short durations for critical issues (1-2 minutes)
  • Use longer durations for warnings (5-10 minutes)
  • Longer durations prevent flapping alerts from transient issues

Annotations
  • Include specific values in descriptions
  • Add runbook links for complex issues
  • Use clear, descriptive summaries

Testing
  • Regularly test notification channels
  • Verify on-call rotations receive alerts
  • Conduct alert drills
