Overview
Permission Mongo includes pre-configured Prometheus alerting rules that monitor critical system conditions. Alerts are defined in monitoring/alerts.yml and evaluated by Prometheus every 15 seconds.
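For context, the rule file is wired into Prometheus roughly as follows. This is a sketch: the scrape job name, target address, and mount path are illustrative assumptions, not taken from this document.

```yaml
# monitoring/prometheus.yml (sketch -- paths and job names are assumptions)
global:
  evaluation_interval: 15s        # how often alerting rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml    # mounted from monitoring/alerts.yml

scrape_configs:
  - job_name: permission-mongo
    static_configs:
      - targets: ['permission-mongo:8080']   # metrics endpoint on port 8080
```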
Alert Rules
All alerting rules are organized in the permission-mongo-alerts group.
HighErrorRate
Severity: Critical
Condition: Error rate > 5% over 5 minutes
Duration: 5 minutes
```yaml
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High HTTP error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
```
Trigger Condition: More than 5% of requests return 5xx status codes
Response Actions:
- Check application logs for errors
- Verify database connectivity
- Review recent deployments
- Check infrastructure health
HighLatency
Severity: Warning
Condition: P95 latency > 500ms
Duration: 5 minutes
```yaml
- alert: HighLatency
  expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High request latency detected"
    description: "95th percentile latency is {{ $value }}s"
```
Trigger Condition: 95th percentile latency exceeds 500ms
Response Actions:
- Investigate slow database queries
- Check cache hit rate
- Review RBAC policy complexity
- Examine resource utilization
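If the P95 expression above is also used on dashboards, it can be precomputed with a Prometheus recording rule so both the alert and the dashboards query a cheap pre-aggregated series. A sketch; the group and record names below are hypothetical:

```yaml
groups:
  - name: permission-mongo-recording
    rules:
      # Precompute P95 request latency on every evaluation cycle.
      # "job:http_request_duration_seconds:p95" is a hypothetical name
      # following the level:metric:operation naming convention.
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

The HighLatency alert could then use `job:http_request_duration_seconds:p95 > 0.5` as its expression.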
ServiceDown
Severity: Critical
Condition: Target unreachable
Duration: 1 minute
```yaml
- alert: ServiceDown
  expr: up{job="permission-mongo"} == 0 or up{job="permission-mongo-local"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Permission Mongo service is down"
    description: "The service has been unreachable for more than 1 minute"
```
Trigger Condition: Prometheus cannot scrape metrics endpoint
Response Actions:
- Verify the service is running: `docker ps` or `systemctl status`
- Check application logs
- Verify network connectivity
- Check port availability (8080)
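Independently of this alert, a container-level health check lets Docker restart the service even when the monitoring stack itself is down. A minimal sketch, assuming the service exposes an HTTP health endpoint on port 8080 (the /health path is an assumption):

```yaml
# docker-compose.yaml (sketch)
services:
  permission-mongo:
    # ... existing service definition ...
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:8080/health"]  # /health path is assumed
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped   # bring the container back up if it exits
```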
HighGoroutineCount
Severity: Warning
Condition: Goroutines > 10,000
Duration: 5 minutes
```yaml
- alert: HighGoroutineCount
  expr: goroutines_count > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High goroutine count"
    description: "Goroutine count is {{ $value }}"
```
Trigger Condition: More than 10,000 active goroutines
Response Actions:
- Investigate potential goroutine leaks
- Review recent code changes
- Check for blocked operations
- Consider restarting the service
A persistently high goroutine count usually indicates a goroutine leak, which will eventually exhaust memory.
MongoDBErrors
Severity: Warning
Condition: Error rate > 1/second
Duration: 5 minutes
```yaml
- alert: MongoDBErrors
  expr: rate(mongodb_errors_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MongoDB errors detected"
    description: "MongoDB error rate is {{ $value }}/s"
```
Trigger Condition: More than 1 MongoDB error per second
Response Actions:
- Check MongoDB logs
- Verify database connectivity
- Check disk space on the MongoDB server
- Review query patterns
MongoDBHighLatency
Severity: Warning
Condition: P95 latency > 100ms
Duration: 5 minutes
```yaml
- alert: MongoDBHighLatency
  expr: |
    histogram_quantile(0.95, sum(rate(mongodb_operation_duration_seconds_bucket[5m])) by (le, operation))
    > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High MongoDB latency"
    description: "95th percentile MongoDB latency is {{ $value }}s"
```
Trigger Condition: 95th percentile database operation latency exceeds 100ms
Response Actions:
- Check MongoDB server resources (CPU, memory, disk I/O)
- Review slow query logs
- Verify indexes are present
- Check for long-running operations
LowCacheHitRate
Severity: Warning
Condition: Hit rate < 50%
Duration: 10 minutes
```yaml
- alert: LowCacheHitRate
  expr: |
    (
      sum(rate(cache_hits_total[5m]))
      /
      (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
    ) < 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low cache hit rate"
    description: "Cache hit rate is {{ $value | humanizePercentage }}"
```
Trigger Condition: Cache hit rate below 50%
Response Actions:
- Review cache TTL settings
- Check cache memory limits
- Verify cache eviction policies
- Consider increasing cache size
AuditQueueBacklog
Severity: Warning
Condition: Queue size > 900
Duration: 5 minutes
```yaml
- alert: AuditQueueBacklog
  expr: audit_queue_size > 900
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Audit log queue backlog"
    description: "Audit queue size is {{ $value }}, approaching capacity"
```
Trigger Condition: Audit queue approaching capacity (900 of 1000 max)
Response Actions:
- Check MongoDB write performance
- Verify audit batch processing
- Consider increasing queue size
- Review audit log volume
AuditLogsDropped
Severity: Critical
Condition: Any drops occurring
Duration: 1 minute
```yaml
- alert: AuditLogsDropped
  expr: rate(audit_logs_dropped_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Audit logs are being dropped"
    description: "{{ $value }} audit logs/s are being dropped"
```
Trigger Condition: Any audit logs being dropped
Response Actions:
- Immediately investigate the audit system
- Check MongoDB connectivity and performance
- Review audit queue configuration
- Consider increasing the queue size or batch processing rate
Dropped audit logs are a compliance risk; treat this alert with the highest priority.
Alert Summary Table
| Alert | Threshold | Duration | Severity |
|-------|-----------|----------|----------|
| HighErrorRate | > 5% 5xx errors | 5 minutes | Critical |
| HighLatency | P95 > 500ms | 5 minutes | Warning |
| ServiceDown | Target unreachable | 1 minute | Critical |
| HighGoroutineCount | > 10,000 goroutines | 5 minutes | Warning |
| MongoDBErrors | > 1 error/second | 5 minutes | Warning |
| MongoDBHighLatency | P95 > 100ms | 5 minutes | Warning |
| LowCacheHitRate | < 50% hit rate | 10 minutes | Warning |
| AuditQueueBacklog | Queue > 900 | 5 minutes | Warning |
| AuditLogsDropped | Any drops | 1 minute | Critical |
Setting Up Alertmanager
To receive alert notifications, configure Alertmanager:
Create alertmanager.yml
monitoring/alertmanager.yml
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'
```
Add Alertmanager to Docker Compose
docker-compose.monitoring.yaml
```yaml
alertmanager:
  image: prom/alertmanager
  ports:
    - "9093:9093"
  volumes:
    - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
```
Update Prometheus configuration
monitoring/prometheus.yml
```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```
Restart the monitoring stack
```bash
docker-compose -f docker-compose.monitoring.yaml restart
```
Notification Channels
Alertmanager supports multiple notification channels:
Email

Email notifications use the email_configs receiver shown in the alertmanager.yml example above.

Slack
```yaml
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Permission Mongo Alert'
```
PagerDuty

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
        severity: 'critical'
```
Webhook
```yaml
receivers:
  - name: 'webhook-alerts'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'
```
Advanced Routing
Route different severity levels to different channels:
monitoring/alertmanager.yml
```yaml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
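Alertmanager can also suppress redundant notifications with inhibition rules, for example muting warning-level alerts from a job while a critical alert for that same job is firing. A sketch:

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    # Only inhibit warnings that share the same job label as the firing
    # critical alert (e.g. HighLatency while ServiceDown is firing).
    equal: ['job']
```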
Silencing Alerts
Temporarily silence alerts during maintenance:
Via the Alertmanager UI:

1. Navigate to http://localhost:9093 (Alertmanager)
2. Click Silences → New Silence
3. Configure matchers (e.g., alertname=HighLatency)
4. Set a duration
5. Add a comment
6. Click Create

Or from the command line with amtool:

```bash
amtool silence add alertname=HighLatency \
  --duration=2h \
  --comment="Maintenance window" \
  --alertmanager.url=http://localhost:9093
```
Testing Alerts
Manually trigger alerts to verify notification setup:
```bash
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing alert notification pipeline"
    }
  }
]'
```

Note: the Alertmanager v1 API is deprecated and has been removed in recent releases; if this request fails on your version, post the same payload to /api/v2/alerts instead.
Custom Alert Rules
To add custom alerting rules:
Edit alerts.yml
```yaml
groups:
  - name: permission-mongo-alerts
    rules:
      # ... existing rules ...

      # Add your custom rule
      - alert: CustomAlert
        expr: your_metric > threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Custom alert triggered"
          description: "{{ $value }}"
```
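As a concrete illustration, here is a hypothetical rule built on process_resident_memory_bytes, a metric the standard Go Prometheus client exports by default; the 1 GiB threshold is an arbitrary example, not a recommendation:

```yaml
# Goes under the same rules: list as the alerts above.
- alert: HighMemoryUsage
  expr: process_resident_memory_bytes > 1073741824  # 1 GiB, example threshold
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High resident memory usage"
    description: "Resident memory is {{ $value | humanize1024 }}B"
```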
Validate configuration
```bash
promtool check rules monitoring/alerts.yml
```
Reload Prometheus
```bash
curl -X POST http://localhost:9090/-/reload
```

The /-/reload endpoint is only available when Prometheus is started with the --web.enable-lifecycle flag. Alternatively, restart the container: docker-compose -f docker-compose.monitoring.yaml restart prometheus
Viewing Active Alerts
Check currently firing alerts:
Prometheus UI: http://localhost:9090/alerts shows each rule's state (inactive, pending, or firing).

Alertmanager UI: http://localhost:9093 shows alerts currently held by Alertmanager, along with active silences.

API:

```bash
# Active alerts
curl http://localhost:9090/api/v1/alerts

# Alert rules
curl http://localhost:9090/api/v1/rules
```
Best Practices
Set appropriate thresholds

- Avoid alert fatigue by setting realistic thresholds
- Use historical data to establish baselines
- Tune thresholds based on actual incidents

Choose durations deliberately

- Short durations for critical issues (1-2 minutes)
- Longer durations for warnings (5-10 minutes)
- Both prevent flapping alerts caused by transient issues

Provide actionable context

- Include specific values in descriptions
- Add runbook links for complex issues
- Use clear, descriptive summaries

Regularly test notification channels

- Verify on-call rotations receive alerts
- Conduct alert drills