
Overview

Permission Mongo includes pre-configured Prometheus alerting rules that monitor critical system conditions. Alerts are defined in monitoring/alerts.yml and evaluated by Prometheus every 15 seconds.
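For reference, a minimal prometheus.yml fragment that would load these rules at the stated 15-second cadence (the mount path is an assumption; adjust to your deployment):

```yaml
global:
  evaluation_interval: 15s   # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml   # mounted from monitoring/alerts.yml (path is an assumption)
```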

Alert Rules

All alerting rules are organized in the permission-mongo-alerts group.

HighErrorRate

Severity: Critical
Condition: Error rate > 5% over 5 minutes
Duration: 5 minutes
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High HTTP error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
Trigger Condition: More than 5% of requests return 5xx status codes
Response Actions:
  • Check application logs for errors
  • Verify database connectivity
  • Review recent deployments
  • Check infrastructure health

HighLatency

Severity: Warning
Condition: P95 latency > 500ms
Duration: 5 minutes
- alert: HighLatency
  expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High request latency detected"
    description: "95th percentile latency is {{ $value }}s"
Trigger Condition: 95th percentile latency exceeds 500ms
Response Actions:
  • Investigate slow database queries
  • Check cache hit rate
  • Review RBAC policy complexity
  • Examine resource utilization

ServiceDown

Severity: Critical
Condition: Target unreachable
Duration: 1 minute
- alert: ServiceDown
  expr: up{job="permission-mongo"} == 0 or up{job="permission-mongo-local"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Permission Mongo service is down"
    description: "The service has been unreachable for more than 1 minute"
Trigger Condition: Prometheus cannot scrape metrics endpoint
Response Actions:
  • Verify service is running: docker ps or systemctl status
  • Check application logs
  • Verify network connectivity
  • Check port availability (8080)

HighGoroutineCount

Severity: Warning
Condition: Goroutines > 10,000
Duration: 5 minutes
- alert: HighGoroutineCount
  expr: goroutines_count > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High goroutine count"
    description: "Goroutine count is {{ $value }}"
Trigger Condition: More than 10,000 active goroutines
Response Actions:
  • Investigate potential goroutine leaks
  • Review recent code changes
  • Check for blocked operations
  • Consider restarting the service
A persistently high goroutine count usually indicates a goroutine leak, which will eventually cause memory exhaustion.
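If the service exposes Go's net/http/pprof handlers (an assumption; the port and path below are the standard pprof defaults, not confirmed for this service), a goroutine dump is the fastest way to locate a leak:

```shell
# Goroutine counts grouped by identical stack trace; a leak shows up as one
# stack with thousands of entries (port 8080 is an assumption)
curl -s "http://localhost:8080/debug/pprof/goroutine?debug=1" | head -n 40

# Full stack for every goroutine, useful for seeing what they are blocked on
curl -s "http://localhost:8080/debug/pprof/goroutine?debug=2" > goroutines.txt
```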

MongoDBErrors

Severity: Warning
Condition: Error rate > 1/second
Duration: 5 minutes
- alert: MongoDBErrors
  expr: rate(mongodb_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MongoDB errors detected"
    description: "MongoDB error rate is {{ $value }}/s"
Trigger Condition: More than 1 MongoDB error per second
Response Actions:
  • Check MongoDB logs
  • Verify database connectivity
  • Check disk space on MongoDB server
  • Review query patterns

MongoDBHighLatency

Severity: Warning
Condition: P95 latency > 100ms
Duration: 5 minutes
- alert: MongoDBHighLatency
  expr: |
    histogram_quantile(0.95, sum(rate(mongodb_operation_duration_seconds_bucket[5m])) by (le, operation))
    > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High MongoDB latency"
    description: "95th percentile MongoDB latency is {{ $value }}s"
Trigger Condition: 95th percentile database operation latency exceeds 100ms
Response Actions:
  • Check MongoDB server resources (CPU, memory, disk I/O)
  • Review slow query logs
  • Verify indexes are present
  • Check for long-running operations

LowCacheHitRate

Severity: Warning
Condition: Hit rate < 50%
Duration: 10 minutes
- alert: LowCacheHitRate
  expr: |
    (
      sum(rate(cache_hits_total[5m]))
      /
      (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
    ) < 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low cache hit rate"
    description: "Cache hit rate is {{ $value | humanizePercentage }}"
Trigger Condition: Cache hit rate below 50%
Response Actions:
  • Review cache TTL settings
  • Check cache memory limits
  • Verify cache eviction policies
  • Consider increasing cache size

AuditQueueBacklog

Severity: Warning
Condition: Queue size > 900
Duration: 5 minutes
- alert: AuditQueueBacklog
  expr: audit_queue_size > 900
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Audit log queue backlog"
    description: "Audit queue size is {{ $value }}, approaching capacity"
Trigger Condition: Audit queue approaching capacity (900 of 1000 max)
Response Actions:
  • Check MongoDB write performance
  • Verify audit batch processing
  • Consider increasing queue size
  • Review audit log volume

AuditLogsDropped

Severity: Critical
Condition: Any drops occurring
Duration: 1 minute
- alert: AuditLogsDropped
  expr: rate(audit_logs_dropped_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Audit logs are being dropped"
    description: "{{ $value }} audit logs/s are being dropped"
Trigger Condition: Any audit logs being dropped
Response Actions:
  • Immediately investigate audit system
  • Check MongoDB connectivity and performance
  • Review audit queue configuration
  • Consider increasing queue size or batch processing rate
Dropped audit logs represent a compliance risk; treat this alert with the highest priority.

Alert Summary Table

| Alert | Threshold | Duration | Severity |
| --- | --- | --- | --- |
| HighErrorRate | > 5% 5xx errors | 5 minutes | Critical |
| HighLatency | P95 > 500ms | 5 minutes | Warning |
| ServiceDown | Target unreachable | 1 minute | Critical |
| HighGoroutineCount | > 10,000 goroutines | 5 minutes | Warning |
| MongoDBErrors | > 1 error/second | 5 minutes | Warning |
| MongoDBHighLatency | P95 > 100ms | 5 minutes | Warning |
| LowCacheHitRate | < 50% hit rate | 10 minutes | Warning |
| AuditQueueBacklog | Queue > 900 | 5 minutes | Warning |
| AuditLogsDropped | Any drops | 1 minute | Critical |

Setting Up Alertmanager

To receive alert notifications, configure Alertmanager:
Step 1: Create alertmanager.yml

monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'
Step 2: Add Alertmanager to Docker Compose

docker-compose.monitoring.yaml
alertmanager:
  image: prom/alertmanager
  ports:
    - "9093:9093"
  volumes:
    - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
Step 3: Update Prometheus configuration

monitoring/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
Step 4: Restart the monitoring stack

docker-compose -f docker-compose.monitoring.yaml restart
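After the restart, a quick check against the standard health endpoints confirms both components came back up (both return HTTP 200 when healthy):

```shell
# Prometheus and Alertmanager expose built-in health endpoints
curl -fsS http://localhost:9090/-/healthy   # Prometheus
curl -fsS http://localhost:9093/-/healthy   # Alertmanager
```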

Notification Channels

Alertmanager supports multiple notification channels:

Email

receivers:
  - name: 'email-alerts'
    email_configs:
      - to: '[email protected]'

Slack

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Permission Mongo Alert'

PagerDuty

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
        severity: 'critical'

Webhook

receivers:
  - name: 'webhook-alerts'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'

Advanced Routing

Route different severity levels to different channels:
monitoring/alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    
    - match:
        severity: warning
      receiver: 'slack-warnings'
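Beyond routing, Alertmanager can also suppress lower-severity notifications while a related critical alert is firing. A sketch of an inhibit rule (standard Alertmanager configuration; the choice of labels here is an assumption):

```yaml
inhibit_rules:
  # Mute warning-level notifications while a critical alert
  # for the same alertname is firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname']
```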

Silencing Alerts

Temporarily silence alerts during maintenance:
  1. Navigate to http://localhost:9093 (Alertmanager)
  2. Click Silences → New Silence
  3. Configure matchers (e.g., alertname=HighLatency)
  4. Set duration
  5. Add comment
  6. Click Create
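The same silence can be created from the command line with amtool, the CLI bundled with Alertmanager (the duration and comment values below are illustrative):

```shell
# Silence HighLatency for two hours
amtool silence add alertname=HighLatency \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --comment="Planned maintenance window"

# List active silences, and expire one by ID when maintenance ends
amtool silence query --alertmanager.url=http://localhost:9093
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093
```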

Testing Alerts

Manually trigger alerts to verify notification setup:
# Send a test alert (v2 API; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing alert notification pipeline"
    }
  }
]'

Custom Alert Rules

To add custom alerting rules:
Step 1: Edit alerts.yml

monitoring/alerts.yml
groups:
  - name: permission-mongo-alerts
    rules:
      # ... existing rules ...
      
      # Add your custom rule
      - alert: CustomAlert
        expr: your_metric > threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Custom alert triggered"
          description: "{{ $value }}"
Step 2: Validate configuration

promtool check rules monitoring/alerts.yml
Step 3: Reload Prometheus

curl -X POST http://localhost:9090/-/reload
This requires Prometheus to be started with the --web.enable-lifecycle flag. Alternatively, restart the container:
docker-compose -f docker-compose.monitoring.yaml restart prometheus

Viewing Active Alerts

Check currently firing alerts in the Prometheus UI at http://localhost:9090/alerts, or in the Alertmanager UI at http://localhost:9093.
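The same information is available from the Prometheus HTTP API. A sketch that filters firing alerts with jq (a sample API response is inlined so the filter can be seen end to end):

```shell
# In practice, query the live endpoint:
#   curl -s http://localhost:9090/api/v1/alerts \
#     | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
# Sample payload mirroring the API's response shape:
echo '{"status":"success","data":{"alerts":[{"labels":{"alertname":"HighLatency","severity":"warning"},"state":"firing"},{"labels":{"alertname":"HighErrorRate","severity":"critical"},"state":"pending"}]}}' \
  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
```

Alerts in state "pending" have matched their expression but not yet for the full `for:` duration, so the filter above reports only HighLatency.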

Best Practices

Thresholds
  • Avoid alert fatigue by setting realistic thresholds
  • Use historical data to establish baselines
  • Tune thresholds based on actual incidents

Durations
  • Use short durations for critical issues (1-2 minutes)
  • Use longer durations for warnings (5-10 minutes)
  • Longer durations prevent flapping alerts from transient issues

Annotations
  • Include specific values in descriptions
  • Add runbook links for complex issues
  • Use clear, descriptive summaries

Testing
  • Regularly test notification channels
  • Verify on-call rotations receive alerts
  • Conduct alert drills
