Overview
The distributed notification system provides comprehensive monitoring and observability features to track system health, performance, and message flow across all microservices.
Logging Strategy
Correlation IDs
Every request is assigned a unique correlation ID at the API Gateway so that the full notification lifecycle can be traced across all services.
Log Format
All services use structured logging with the following fields:
- Correlation ID: Tracks requests across services
- Timestamp: When the event occurred
- Service Name: Which service generated the log
- Log Level: INFO, WARN, ERROR, DEBUG
- Message: Human-readable description
- Metadata: Additional context (user_id, request_id, etc.)
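As a rough illustration of the two ideas above, the sketch below assigns a correlation ID and emits one JSON log line per event using only the Python standard library. The header name `X-Correlation-ID`, the JSON key names, and the helpers `ensure_correlation_id`, `JsonFormatter`, and `build_logger` are assumptions made for this example; they are not taken from the services' actual code.

```python
import contextvars
import json
import logging
import uuid
from datetime import datetime, timezone

# Holds the correlation ID of the request currently being handled.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


def ensure_correlation_id(headers):
    """Reuse the caller's ID if one was sent, otherwise generate a new one.

    The header name is an assumption for this sketch.
    """
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    return correlation_id


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with the documented fields."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "correlation_id": correlation_id_var.get(),
            "timestamp": timestamp.isoformat(),
            "service": self.service_name,
            "level": record.levelname,
            "message": record.getMessage(),
            # Populated via logger.info(..., extra={"metadata": {...}}).
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(entry)


def build_logger(service_name):
    """Create (or reuse) a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service_name))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    ensure_correlation_id({})  # no incoming header, so a new ID is generated
    log = build_logger("api-gateway")
    log.info("Notification accepted", extra={"metadata": {"user_id": "u-123"}})
```

Downstream services would need to forward the same ID (for example in message headers) for the trace to stay intact; how that is done depends on the actual service code.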
Key Metrics to Track
Queue Metrics
RabbitMQ Queue Lengths:
- email.queue depth
- push.queue depth
- failed.queue depth (dead letter queue)
Message Throughput:
- Messages published per second
- Messages consumed per second
- Message acknowledgment rate
- Message rejection rate
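One way to collect these numbers is to poll the RabbitMQ management HTTP API. The sketch below assumes the management plugin is reachable at the development URL and `guest` credentials given under RabbitMQ Management UI, and that the queues live on the default `/` vhost. The field names (`messages`, `message_stats`, `publish_details`, `deliver_get_details`, `ack_details`) reflect the management API's queue object as commonly documented; verify them against your broker version. The helper names are hypothetical, and rejection rate is not covered here.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumed development values; adjust for your environment.
MANAGEMENT_URL = "http://localhost:15673"
QUEUES = ["email.queue", "push.queue", "failed.queue"]


def summarize_queue(payload):
    """Extract depth and rate figures from one queue object."""
    stats = payload.get("message_stats", {})

    def rate(key):
        # Rates are absent until the queue has seen traffic.
        return stats.get(key, {}).get("rate", 0.0)

    return {
        "queue": payload.get("name"),
        "depth": payload.get("messages", 0),
        "publish_rate": rate("publish_details"),
        "deliver_rate": rate("deliver_get_details"),
        "ack_rate": rate("ack_details"),
    }


def fetch_queue(name, vhost="/", user="guest", password="guest"):
    """Fetch one queue's stats over HTTP basic auth and summarize them."""
    path = "/api/queues/{}/{}".format(
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(name, safe=""),
    )
    request = urllib.request.Request(MANAGEMENT_URL + path)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return summarize_queue(json.load(response))


if __name__ == "__main__":
    for queue_name in QUEUES:
        print(fetch_queue(queue_name))
```

Keeping the parsing in `summarize_queue` separate from the HTTP call makes it possible to test the extraction logic without a running broker.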
Service Metrics
Response Times:
- API Gateway: /api/v1/notifications endpoint latency
- User Service: /users/{user_id} lookup time
- Template Service: /templates/{template_code} retrieval time
- Email/Push Service: Message processing time
Error Rates:
- HTTP 4xx errors (client errors)
- HTTP 5xx errors (server errors)
- RabbitMQ connection failures
- Database connection errors
- SMTP/Push notification delivery failures
Resource Utilization:
- CPU usage per service
- Memory consumption
- Network I/O
- Database connection pool usage
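To make the response-time and error-rate metrics concrete, here is a minimal in-memory recorder. It is only a sketch: the class `ServiceMetrics` and its methods are hypothetical names, and a real deployment would more likely export these values to a metrics backend than keep them in process memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ServiceMetrics:
    """In-memory latency samples and status-code counters, keyed by endpoint."""

    def __init__(self):
        self.latencies = defaultdict(list)  # endpoint -> list of seconds
        self.status_counts = defaultdict(lambda: defaultdict(int))

    @contextmanager
    def timed(self, endpoint):
        """Record how long the wrapped block takes, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latencies[endpoint].append(time.perf_counter() - start)

    def record_status(self, endpoint, status_code):
        """Bucket a response by status class, for example 2xx, 4xx, 5xx."""
        self.status_counts[endpoint][f"{status_code // 100}xx"] += 1

    def error_rate(self, endpoint):
        """Fraction of recorded responses that were 4xx or 5xx."""
        counts = self.status_counts[endpoint]
        total = sum(counts.values())
        errors = counts["4xx"] + counts["5xx"]
        return errors / total if total else 0.0

    def p95_latency(self, endpoint):
        """Nearest-rank 95th percentile of recorded latencies, in seconds."""
        samples = sorted(self.latencies[endpoint])
        if not samples:
            return 0.0
        rank = (95 * len(samples) + 99) // 100  # integer ceiling of 0.95 * n
        return samples[rank - 1]


if __name__ == "__main__":
    metrics = ServiceMetrics()
    with metrics.timed("/api/v1/notifications"):
        time.sleep(0.01)  # stand-in for real request handling
    metrics.record_status("/api/v1/notifications", 200)
    print(metrics.p95_latency("/api/v1/notifications"))
    print(metrics.error_rate("/api/v1/notifications"))
```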
Notification Metrics
Delivery Success:
- Total notifications sent
- Successful deliveries
- Failed deliveries
- Retry attempts
- Average delivery time
User Preferences:
- Notifications filtered by user preferences
- Preference cache hit rate (Redis)
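The derived values in this list (success rate, average delivery time, cache hit rate) follow directly from a handful of counters. The sketch below shows one possible bookkeeping structure; `NotificationStats` and its method names are hypothetical, and it assumes a retry is counted as every attempt after the first.

```python
from dataclasses import dataclass


@dataclass
class NotificationStats:
    sent: int = 0
    delivered: int = 0
    failed: int = 0
    retries: int = 0
    filtered_by_preferences: int = 0
    cache_hits: int = 0
    cache_misses: int = 0
    total_delivery_seconds: float = 0.0

    def record_delivery(self, seconds, attempts=1):
        """Count a successful delivery that took `attempts` tries."""
        self.sent += 1
        self.delivered += 1
        self.retries += attempts - 1
        self.total_delivery_seconds += seconds

    def record_failure(self, attempts=1):
        """Count a notification that failed after `attempts` tries."""
        self.sent += 1
        self.failed += 1
        self.retries += attempts - 1

    def record_preference_lookup(self, cache_hit, allowed):
        """Count a preference lookup and whether it filtered the notification."""
        if cache_hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1
        if not allowed:
            self.filtered_by_preferences += 1

    @property
    def success_rate(self):
        return self.delivered / self.sent if self.sent else 0.0

    @property
    def average_delivery_seconds(self):
        if not self.delivered:
            return 0.0
        return self.total_delivery_seconds / self.delivered

    @property
    def cache_hit_rate(self):
        lookups = self.cache_hits + self.cache_misses
        return self.cache_hits / lookups if lookups else 0.0
```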
Management UIs
RabbitMQ Management UI
Access the RabbitMQ management interface to monitor message queues:
URL: http://localhost:15673
Default Credentials:
- Username: guest
- Password: guest
Features:
- View queue depths and message rates
- Monitor connections and channels
- Inspect message contents
- Configure exchanges and bindings
- Track consumer performance
- View dead letter queue messages
MailHog Email Testing UI
MailHog captures all outgoing emails for testing purposes:
URL: http://localhost:8025
Features:
- View all sent emails in real-time
- Inspect email headers and content
- Test HTML and plain text rendering
- Download email files (.eml format)
- Search emails by recipient, subject, or content
- Delete test emails
MailHog is a development tool. In production, replace it with a real SMTP service such as SendGrid, Mailgun, or AWS SES.
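To check that MailHog is capturing mail, a short smoke test can send a message to its SMTP listener and then be inspected in the web UI. This sketch assumes MailHog's default SMTP port (1025) is exposed on localhost, which may differ from this project's actual port mapping. The sender address, the `X-Correlation-ID` header, and the function names are placeholders invented for the example.

```python
import smtplib
from email.message import EmailMessage

# Assumed values: MailHog's default SMTP port, exposed on localhost.
SMTP_HOST = "localhost"
SMTP_PORT = 1025


def build_test_email(recipient, correlation_id):
    """Build a message with both plain-text and HTML parts."""
    message = EmailMessage()
    message["From"] = "notifications@example.com"  # placeholder sender
    message["To"] = recipient
    message["Subject"] = "MailHog smoke test"
    # Hypothetical header so the email can be matched to log entries.
    message["X-Correlation-ID"] = correlation_id
    message.set_content("Plain text body for the smoke test.")
    message.add_alternative(
        "<p>HTML body for the smoke test.</p>", subtype="html"
    )
    return message


def send_test_email(message, host=SMTP_HOST, port=SMTP_PORT):
    """Send the message without TLS or authentication (development only)."""
    with smtplib.SMTP(host, port, timeout=10) as client:
        client.send_message(message)


if __name__ == "__main__":
    send_test_email(build_test_email("user@example.com", "smoke-test-1"))
```

After running it, the message should appear in the MailHog UI, where both the plain-text and HTML renderings can be inspected.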
SMTP Configuration
The Email Service uses the following SMTP settings:
Recommended Monitoring Tools
Application Performance Monitoring (APM)
Prometheus + Grafana:
- Collect metrics from all services
- Create dashboards for queue depths, latency, error rates
- Set up alerts for anomalies
- Time-series data storage and visualization
- End-to-end distributed tracing
- Automatic service dependency mapping
- Custom metrics and dashboards
- Anomaly detection and alerting
Log Aggregation
ELK Stack (Elasticsearch, Logstash, Kibana):
- Centralized log collection from all services
- Full-text search across logs
- Correlation ID-based log tracing
- Custom dashboards and visualizations
- Lightweight log aggregation
- Integration with existing Grafana setup
- Label-based log filtering
Distributed Tracing
Jaeger / Zipkin:
- Trace requests across microservices
- Visualize service dependencies
- Identify performance bottlenecks
- Root cause analysis for failures
Alerting Strategy
Set up alerts for critical conditions:
Queue Depth Alerts:
- Warning: Queue depth > 1000 messages
- Critical: Queue depth > 5000 messages
Error Rate Alerts:
- Warning: Error rate > 5%
- Critical: Error rate > 10%
Service Health Alerts:
- Critical: Service health check fails for > 1 minute
- Warning: Service response time > 2 seconds
Dead Letter Queue Alerts:
- Warning: Any messages in failed.queue
- Immediate investigation required
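The thresholds above can be expressed as data and evaluated with one small function. The following is a sketch only: the `Threshold` type, the constant names, and `severity` are hypothetical, and in practice these rules would usually live in the alerting tool's own configuration rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Alert when a value is strictly greater than a level; None disables it."""
    warning: Optional[float] = None
    critical: Optional[float] = None


# Values taken from the alerting thresholds listed in this section.
QUEUE_DEPTH = Threshold(warning=1000, critical=5000)
ERROR_RATE = Threshold(warning=0.05, critical=0.10)
RESPONSE_TIME_SECONDS = Threshold(warning=2.0)
HEALTH_CHECK_FAILING_SECONDS = Threshold(critical=60)
FAILED_QUEUE_DEPTH = Threshold(warning=0)


def severity(value, threshold):
    """Return 'critical', 'warning', or 'ok' for a measured value."""
    if threshold.critical is not None and value > threshold.critical:
        return "critical"
    if threshold.warning is not None and value > threshold.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(severity(1200, QUEUE_DEPTH))       # above warning, below critical
    print(severity(0.12, ERROR_RATE))        # above critical
    print(severity(1, FAILED_QUEUE_DEPTH))   # any dead-lettered message
```

Using strict comparisons mirrors the ">" wording of the thresholds, so a queue depth of exactly 1000 does not yet raise a warning.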
Monitoring Best Practices
- Track Correlation IDs: Always log correlation IDs to trace requests end-to-end
- Set Baseline Metrics: Establish normal operating ranges for key metrics
- Alert on Trends: Monitor rate of change, not just absolute values
- Retain Logs: Keep logs for at least 30 days for debugging
- Dashboard Everything: Create service-specific and system-wide dashboards
- Test Alerts: Regularly verify that alerting systems work correctly
- Document Runbooks: Create playbooks for common alert scenarios
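The "Alert on Trends" practice can be illustrated with a simple rate-of-change calculation over recent samples. This sketch uses a plain first-to-last slope, which is an assumption chosen for brevity; production systems often use smoothing or regression over a window instead. The function names are hypothetical.

```python
def rate_of_change(samples):
    """Average change per second across (timestamp_seconds, value) samples.

    Uses only the first and last sample; returns 0.0 when there are fewer
    than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    elapsed = t1 - t0
    return (v1 - v0) / elapsed if elapsed > 0 else 0.0


def seconds_until(samples, limit):
    """Estimate seconds until the metric reaches `limit` at the current rate.

    Returns None when the metric is flat or falling.
    """
    rate = rate_of_change(samples)
    if rate <= 0:
        return None
    remaining = limit - samples[-1][1]
    return max(0.0, remaining / rate)


if __name__ == "__main__":
    # Queue depth sampled once a minute, growing by 300 messages per minute.
    depth_samples = [(0, 200), (60, 500), (120, 800)]
    print(rate_of_change(depth_samples))       # messages per second
    print(seconds_until(depth_samples, 5000))  # time to the critical depth
```

A queue that is still well below its absolute threshold but growing steadily can then be flagged before it reaches the critical depth.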
Next Steps
- Health Checks: Configure health check endpoints and service dependency checks
- Troubleshooting: Diagnose and resolve common system issues