Overview
Unmute includes comprehensive monitoring capabilities using Prometheus for metrics collection and Grafana for visualization. The monitoring stack tracks latency, throughput, error rates, and resource utilization across all services.
Architecture
Prometheus scrapes metrics from all services and stores time-series data. Grafana queries Prometheus to display real-time dashboards.
Metrics Collection
Available Metrics
Unmute exposes metrics via the Prometheus client library (defined in unmute/metrics.py):
Session Metrics
STT (Speech-to-Text) Metrics
TTS (Text-to-Speech) Metrics
LLM (Language Model) Metrics
Error Metrics
Histogram Buckets
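Prometheus estimates percentiles from histogram buckets by interpolating linearly inside the bucket that contains the target rank, so the bucket boundaries bound how accurate a p95 or p99 can be. The sketch below reproduces that estimate in plain Python. The bucket bounds and counts are invented for illustration and are not Unmute's actual configuration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count), mirroring the `le`
    buckets of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # The rank falls in the overflow bucket: the only available
                # answer is the highest finite bound, which hides the tail.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative latency histogram in seconds (made-up values).
example = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)    # interpolated between 0.5 and 1.0
p995 = histogram_quantile(0.995, example)  # capped at the last finite bound
```

In this example the p95 lands at roughly 0.6 s, while the p99.5 is reported as 1.0 s regardless of how slow the slowest requests were. That capping is the symptom to look for when observed latencies exceed the largest finite bucket.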
Metrics use predefined buckets for accurate percentile calculations:
Prometheus Setup
Docker Compose Configuration
For production deployments, add Prometheus to your Docker Swarm stack:
Prometheus Configuration
Create services/prometheus/prometheus.yml:
Service Labels
Label services to expose metrics:
Grafana Setup
Docker Configuration
Data Source Configuration
Create services/grafana/provisioning/datasources/datasources.yaml:
Dashboard Provisioning
Create services/grafana/provisioning/dashboards/dashboards.yaml:
Key Dashboards
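Each dashboard panel is driven by a PromQL query that Grafana sends to Prometheus. The same query can be run directly against the Prometheus HTTP API (`/api/v1/query`), which is useful for checking a panel's data outside Grafana. The following is a minimal sketch using only the standard library. The base URL assumes Prometheus listens on its default port 9090, and the metric name `stt_latency_seconds_bucket` is hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default Prometheus port


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload):
    """Extract (labels, value) pairs from an instant-vector API response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(promql, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run an instant query against a live Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Hypothetical metric name, for illustration only.
P95_QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(stt_latency_seconds_bucket[5m])))"
)
```

Calling `query(P95_QUERY)` against a running Prometheus returns one `(labels, value)` pair per resulting series. An empty list usually means the metric name or label selector does not match anything being scraped.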
System Overview Dashboard
Active Sessions:
Latency Dashboard
STT Latency (p95):
Throughput Dashboard
STT Words Per Second:
Service Health Dashboard
STT Service Misses:
User Behavior Dashboard
Session Duration (Average):
Accessing Dashboards
Local Development
Access Grafana at http://localhost:3000 (default credentials: admin/admin).
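Before opening the dashboards, it can help to confirm that Grafana is actually up. Grafana serves an unauthenticated `/api/health` endpoint whose JSON body includes a `database` field. The sketch below polls it with the standard library. The function names are illustrative and the base URL assumes the local development port shown above.

```python
import json
import urllib.request


def grafana_is_healthy(payload):
    """Interpret the JSON body returned by Grafana's /api/health endpoint."""
    return payload.get("database") == "ok"


def check_grafana(base_url="http://localhost:3000", timeout=5.0):
    """Return True if Grafana answers and reports a healthy database."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return grafana_is_healthy(json.load(resp))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
```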
Production Deployment
For the unmute.sh deployment with Traefik: https://grafana.unmute.sh
Alerting
Example Alert Rules
Create services/prometheus/alerts.yml:
Health Checks
Backend Health Endpoint
Unmute exposes a health check endpoint:
Load Testing
Use the built-in load test client to validate monitoring:
Production Monitoring URLs
From the unmute.sh deployment:
- Main app: https://unmute.sh
- Grafana: https://grafana.unmute.sh
- Prometheus: https://prometheus.unmute.sh
- Traefik: https://traefik.unmute.sh
- Portainer: https://portainer.unmute.sh
Best Practices
- Set appropriate scrape intervals: 5s for real-time, 30s for cost savings
- Use retention policies: Configure Prometheus to retain data for 30-90 days
- Monitor percentiles, not just averages: p95 and p99 reveal tail latencies
- Set up alerts: Proactive notification prevents outages
- Archive long-term data: Export to long-term storage (e.g., S3) for historical analysis
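Scrape interval and retention together determine how much disk Prometheus needs. The Prometheus documentation gives the rule of thumb `retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1 to 2 bytes per sample after compression. A minimal sketch of that estimate follows. The series count of 10,000 is a made-up figure, not a measurement from Unmute.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough Prometheus disk estimate from the documented rule of thumb."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical deployment: 10,000 active series kept for 30 days.
fast = estimate_disk_bytes(10_000, scrape_interval_s=5, retention_days=30)
slow = estimate_disk_bytes(10_000, scrape_interval_s=30, retention_days=30)
print(f"5s scrape:  {fast / 1e9:.1f} GB")
print(f"30s scrape: {slow / 1e9:.1f} GB")
```

Under these assumptions, moving from a 5s to a 30s interval cuts storage by a factor of six, which is the cost side of the trade-off in the first bullet.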
Troubleshooting
Metrics not appearing
Check:
- Service has the prometheus-port label
- Prometheus can reach the service (check the targets page)
- Metrics endpoint returns data:
  curl http://backend/metrics
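Once the endpoint responds, the next question is whether the expected metrics are actually in the output. The sketch below parses the Prometheus text exposition format and reports which expected names are missing. It is a simplified parser for quick checks, not a full implementation of the format, and the metric names used in the example are hypothetical.

```python
import re

# Metric name, optional {labels}, then the sample value.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)")


def parse_samples(text):
    """Map each metric name to the list of sample values found in the text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match:
            name, value = match.groups()
            samples.setdefault(name, []).append(float(value))
    return samples


def missing_metrics(text, expected):
    """Return the expected metric names that do not appear in the output."""
    found = parse_samples(text)
    return [name for name in expected if name not in found]


# Example scrape output with hypothetical metric names.
SAMPLE_OUTPUT = """\
# HELP worker_active_sessions Number of active sessions
# TYPE worker_active_sessions gauge
worker_active_sessions 3.0
stt_latency_seconds_bucket{le="0.5"} 12.0
stt_latency_seconds_bucket{le="+Inf"} 15.0
"""
print(missing_metrics(SAMPLE_OUTPUT, ["worker_active_sessions", "tts_latency_seconds_count"]))
```

To use it against a live service, pass the body returned by the curl command above as `text`.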
High cardinality warnings
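Every distinct combination of label values becomes its own time series, so the worst-case series count for a metric is the product of the number of distinct values per label. The sketch below makes that arithmetic explicit. The label names and counts are invented for illustration.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric.

    `label_cardinalities` maps each label name to its number of distinct values.
    """
    return prod(label_cardinalities.values())


# Bounded labels keep the series count small.
bounded = max_series({"service": 3, "status": 4})

# Adding a per-session label multiplies it by the number of sessions seen.
unbounded = max_series({"service": 3, "status": 4, "session_id": 10_000})

print(bounded, unbounded)
```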
Cause: Too many unique label combinations.
Solution: Avoid using user IDs or session IDs as labels. Aggregate them into counters instead.
Missing histograms
Check: Bucket configuration matches expected latency ranges. Add buckets if values exceed defined ranges.
Next Steps
- Performance Tuning - Optimize based on metrics
- Debugging - Use metrics to identify issues
- Multi-GPU Setup - Monitor GPU-specific metrics