Overview

Unmute includes comprehensive monitoring capabilities using Prometheus for metrics collection and Grafana for visualization. The monitoring stack tracks latency, throughput, error rates, and resource utilization across all services.

Architecture

Prometheus scrapes metrics from all services and stores time-series data. Grafana queries Prometheus to display real-time dashboards.

Metrics Collection

Available Metrics

Unmute exposes metrics via the Prometheus client library (defined in unmute/metrics.py):

Session Metrics

from prometheus_client import Counter, Gauge, Histogram

SESSIONS = Counter("worker_sessions", "")  # Total sessions
ACTIVE_SESSIONS = Gauge("worker_active_sessions", "")  # Current connections
SESSION_DURATION = Histogram("worker_session_duration", "")  # Session length

STT (Speech-to-Text) Metrics

STT_SESSIONS = Counter("worker_stt_sessions", "")
STT_ACTIVE_SESSIONS = Gauge("worker_stt_active_sessions", "")
STT_TTFT = Histogram("worker_stt_ttft", "")  # Time-to-first-token
STT_PING_TIME = Histogram("worker_stt_ping_time", "")
STT_RECV_WORDS = Counter("worker_stt_recv_words", "")
STT_MISSES = Counter("worker_stt_misses", "")  # Connection failures

TTS (Text-to-Speech) Metrics

TTS_SESSIONS = Counter("worker_tts_sessions", "")
TTS_ACTIVE_SESSIONS = Gauge("worker_tts_active_sessions", "")
TTS_TTFT = Histogram("worker_tts_ttft", "")  # Time-to-first-audio
TTS_INTERRUPT = Counter("worker_tts_interrupt", "")  # User interruptions
TTS_AUDIO_DURATION = Histogram("worker_tts_audio_duration", "")
TTS_GEN_DURATION = Histogram("worker_tts_gen_duration", "")

LLM (Language Model) Metrics

VLLM_SESSIONS = Counter("worker_vllm_sessions", "")
VLLM_ACTIVE_SESSIONS = Gauge("worker_vllm_active_sessions", "")
VLLM_TTFT = Histogram("worker_vllm_ttft", "")  # Time-to-first-token
VLLM_REQUEST_LENGTH = Histogram("worker_vllm_request_length", "")
VLLM_REPLY_LENGTH = Histogram("worker_vllm_reply_length", "")
VLLM_INTERRUPTS = Counter("worker_vllm_interrupt", "")

Error Metrics

HARD_ERRORS = Counter("worker_hard_errors", "")  # Fatal errors
SERVICE_MISSES = Counter("worker_service_misses", "")  # Service unavailable
FATAL_SERVICE_MISSES = Counter("worker_fatal_service_misses", "")
FORCE_DISCONNECTS = Counter("worker_force_disconnects", "")
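As an illustration of how a handler might drive these metric objects, here is a minimal sketch using prometheus_client; handle_session is hypothetical and not Unmute's actual API:

```python
import time
from prometheus_client import Counter, Gauge, Histogram

SESSIONS = Counter("worker_sessions", "Total sessions")
ACTIVE_SESSIONS = Gauge("worker_active_sessions", "Current connections")
SESSION_DURATION = Histogram("worker_session_duration", "Session length (s)")

def handle_session(run):
    SESSIONS.inc()                             # counts every session started
    start = time.monotonic()
    with ACTIVE_SESSIONS.track_inprogress():   # +1 on enter, -1 on exit
        run()
    SESSION_DURATION.observe(time.monotonic() - start)

handle_session(lambda: None)
```

In production the registry would be served over HTTP (e.g. prometheus_client's start_http_server) so Prometheus can scrape it.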

Histogram Buckets

Metrics use predefined buckets tuned to each service's expected latency range, so percentile estimates from histogram_quantile stay accurate:
# STT latency buckets (milliseconds)
TTFT_BINS_STT_MS = [10.0, 15.0, 25.0, 50.0, 75.0, 100.0]

# TTS latency buckets (milliseconds)  
TTFT_BINS_TTS_MS = [200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0]

# LLM latency buckets (milliseconds)
TTFT_BINS_VLLM_MS = [50.0, 75.0, 100.0, 150.0, 200.0, 250.0, 300.0, 400.0, 500.0, 750.0, 1000.0]

# Session duration buckets (seconds)
SESSION_DURATION_BINS = [1.0, 10.0, 30.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0]
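For reference, buckets are attached when the histogram is constructed. A minimal sketch (the description string and exact wiring in unmute/metrics.py may differ):

```python
from prometheus_client import Histogram

# STT latency buckets (milliseconds), as listed above
TTFT_BINS_STT_MS = [10.0, 15.0, 25.0, 50.0, 75.0, 100.0]

STT_TTFT = Histogram(
    "worker_stt_ttft",
    "Time to first STT token (ms)",
    buckets=TTFT_BINS_STT_MS,  # a +Inf bucket is appended automatically
)

STT_TTFT.observe(42.0)  # counted in the le=50.0 bucket and above
```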

Prometheus Setup

Docker Compose Configuration

For production deployments, add Prometheus to your Docker Swarm stack:
prometheus:
  image: prom/prometheus:latest
  volumes:
    - ./services/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - prometheus-data:/prometheus
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.path=/prometheus'
  ports:
    - "9090:9090"

Prometheus Configuration

Create services/prometheus/prometheus.yml:
scrape_configs:
  - job_name: 'dockerswarm'
    scrape_interval: 5s
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks
    relabel_configs:
      # Keep only running tasks
      - source_labels: [__meta_dockerswarm_task_desired_state]
        regex: running
        action: keep
      
      # Keep only tasks with prometheus-port label
      - source_labels: [__meta_dockerswarm_service_label_prometheus_port]
        regex: .+
        action: keep
      
      # Set job name from service name
      - source_labels: [__meta_dockerswarm_service_name]
        regex: .*_(.+)
        replacement: $1
        target_label: job
      
      # Set scrape address from prometheus-port label
      - source_labels: [__address__, __meta_dockerswarm_service_label_prometheus_port]
        regex: ([^:]+):\d+;(\d+)
        replacement: $1:$2
        target_label: __address__
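The final rule can be illustrated in Python: Prometheus joins the source labels with ";" and applies the regex to build the new __address__. (Prometheus uses RE2 and fully anchors the pattern, but for this expression Python's re behaves the same; the sample address is invented.)

```python
import re

# __address__ joined with the prometheus-port service label
joined = "10.0.1.7:12345;80"

# Same regex and replacement as the relabel rule above
rewritten = re.sub(r"([^:]+):\d+;(\d+)", r"\1:\2", joined)
print(rewritten)  # 10.0.1.7:80 — task IP, label-specified port
```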

Service Labels

Label services to expose metrics:
backend:
  labels:
    - "prometheus-port=80"  # Backend exposes metrics on port 80

traefik:
  labels:
    - "prometheus-port=8080"  # Traefik metrics on port 8080

Grafana Setup

Docker Configuration

grafana:
  image: grafana/grafana:latest
  volumes:
    - grafana-data:/var/lib/grafana
    - ./services/grafana/grafana.ini:/etc/grafana/grafana.ini
    - ./services/grafana/provisioning:/etc/grafana/provisioning
    - ./services/grafana/dashboards:/etc/grafana/dashboards
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=your_secure_password
  ports:
    - "3000:3000"
  depends_on:
    - prometheus

Data Source Configuration

Create services/grafana/provisioning/datasources/datasources.yaml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    orgId: 1
    url: http://prometheus:9090
    isDefault: true
    editable: true

Dashboard Provisioning

Create services/grafana/provisioning/dashboards/dashboards.yaml:
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/dashboards

Key Dashboards

System Overview Dashboard

Active Sessions:
worker_active_sessions
Total Sessions (Rate):
rate(worker_sessions_total[5m])
Error Rate:
rate(worker_hard_errors_total[5m])
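rate() over a counter is, roughly, the per-second increase across the window. A sketch with made-up samples (real Prometheus additionally handles counter resets and extrapolates to the window boundaries):

```python
# (unix_time, counter_value) samples scraped over a 5-minute window
samples = [
    (1000, 120.0),
    (1300, 180.0),
]

(t0, v0), (t1, v1) = samples[0], samples[-1]
per_second = (v1 - v0) / (t1 - t0)  # increase divided by elapsed seconds
print(per_second)  # 0.2 sessions/sec
```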

Latency Dashboard

STT Latency (p95):
histogram_quantile(0.95, rate(worker_stt_ttft_bucket[5m]))
TTS Latency (p95):
histogram_quantile(0.95, rate(worker_tts_ttft_bucket[5m]))
LLM Latency (p95):
histogram_quantile(0.95, rate(worker_vllm_ttft_bucket[5m]))
Average STT Latency:
rate(worker_stt_ttft_sum[5m]) / rate(worker_stt_ttft_count[5m])
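histogram_quantile estimates a percentile by locating the cumulative bucket that contains the target rank and interpolating linearly within it. A rough sketch with invented bucket counts (Prometheus's actual implementation also handles edge cases like the +Inf bucket):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation inside the containing bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Invented cumulative counts for the STT TTFT buckets (ms)
stt = [(10.0, 40), (15.0, 70), (25.0, 90), (50.0, 98), (75.0, 99), (100.0, 100)]
print(histogram_quantile(0.95, stt))  # 40.625 — interpolated inside the 25-50ms bucket
```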

Throughput Dashboard

STT Words Per Second:
rate(worker_stt_recv_words_total[5m])
TTS Words Per Second:
rate(worker_tts_recv_words_total[5m])
LLM Tokens Per Second:
rate(worker_vllm_recv_words_total[5m])

Service Health Dashboard

STT Service Misses:
rate(worker_stt_misses_total[5m])
TTS Service Misses:
rate(worker_tts_misses_total[5m])
Active STT Sessions:
worker_stt_active_sessions
Active TTS Sessions:
worker_tts_active_sessions
Active LLM Sessions:
worker_vllm_active_sessions

User Behavior Dashboard

Session Duration (Average):
rate(worker_session_duration_sum[5m]) / rate(worker_session_duration_count[5m])
Interruption Rate:
rate(worker_tts_interrupt_total[5m])
Average Request Length:
rate(worker_vllm_request_length_sum[5m]) / rate(worker_vllm_request_length_count[5m])
Average Reply Length:
rate(worker_vllm_reply_length_sum[5m]) / rate(worker_vllm_reply_length_count[5m])

Accessing Dashboards

Local Development

Access Grafana at http://localhost:3000 (default credentials: admin/admin).

Production Deployment

For unmute.sh deployment with Traefik:
grafana:
  deploy:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`grafana.${DOMAIN}`)"
      - "traefik.http.routers.grafana.middlewares=traefik-forward-auth"
      - "traefik.http.routers.grafana.entrypoints=websecure"
      - "traefik.http.routers.grafana.tls=true"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
Access at: https://grafana.unmute.sh

Alerting

Example Alert Rules

Create services/prometheus/alerts.yml:
groups:
  - name: unmute_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(worker_hard_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(worker_tts_ttft_bucket[5m])) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TTS latency"
          description: "P95 TTS latency is {{ $value }}ms"
      
      - alert: ServiceUnavailable
        expr: rate(worker_fatal_service_misses_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backend services unavailable"
Add to Prometheus configuration:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/alerts.yml

Health Checks

Backend Health Endpoint

Unmute exposes a health check endpoint:
curl http://localhost:8000/v1/health
Response:
{
  "ok": true,
  "services": {
    "stt": "available",
    "tts": "available",
    "llm": "available"
  }
}
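A client can act on this payload directly; a small sketch (field names mirror the example above; the real schema may carry more detail):

```python
import json

payload = json.loads("""
{
  "ok": true,
  "services": {"stt": "available", "tts": "available", "llm": "available"}
}
""")

def unavailable_services(health):
    # Names of services not reporting "available"
    return [name for name, state in health["services"].items()
            if state != "available"]

print(unavailable_services(payload))  # []
```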

Load Testing

Use the built-in load test client to validate monitoring:
uv run unmute/loadtest/loadtest_client.py \
  --server-url ws://localhost:8000 \
  --n-workers 10 \
  --n-conversations 50
Watch metrics in Grafana during the test to verify collection.

Production Monitoring URLs

In the unmute.sh deployment, all monitoring services are protected by OAuth authentication (Google) via Traefik Forward Auth.

Best Practices

  1. Set appropriate scrape intervals: 5s for near-real-time dashboards, 30s to reduce storage and compute cost
  2. Use retention policies: Configure Prometheus to retain data for 30-90 days
  3. Monitor percentiles, not just averages: p95 and p99 reveal tail latencies
  4. Set up alerts: Proactive notification prevents outages
  5. Archive long-term data: Export to long-term storage (e.g., S3) for historical analysis
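For example, retention (practice 2) can be set with a flag on the Prometheus container, extending the command list from the compose snippet above (60d is an illustrative value):

```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.path=/prometheus'
    - '--storage.tsdb.retention.time=60d'  # keep 60 days of data
```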

Troubleshooting

Metrics not appearing

Check:
  1. Service has prometheus-port label
  2. Prometheus can reach the service (check targets page)
  3. Metrics endpoint returns data: curl http://backend/metrics

High cardinality warnings

Cause: too many unique label combinations. Solution: avoid using user IDs or session IDs as label values; keep label sets low-cardinality and aggregate per-user detail elsewhere.

Missing histograms

Check: Bucket configuration matches expected latency ranges. Add buckets if values exceed defined ranges.
