Monitoring and Observability - Headscale + Tailscale Docker Stack

Overview

Effective monitoring helps keep your Headscale deployment healthy and performant. This guide covers health checks, Prometheus metrics, log management, resource and database monitoring, alerting, and performance benchmarks.

Health Checks

Headscale Health Endpoint

Headscale exposes a health check endpoint for monitoring service status:

curl http://localhost:8000/health

Example response:

{
  "status": "pass"
}
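
For scripting, it can help to reduce that response to the bare status string. The function below is a minimal sketch that assumes the response shape shown above (a JSON object with a string "status" field). The name health_status is a hypothetical helper, and it uses sed rather than a JSON parser, so it is only suitable for this simple payload.

```shell
# Hypothetical helper: print the value of the "status" field from a
# health response read on stdin (assumes the simple payload shown above).
health_status() {
    tr -d '\n' | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (not run here):
#   [ "$(curl -sf http://localhost:8000/health | health_status)" = "pass" ]
```

The function prints nothing when the field is missing, so comparing its output against "pass" also covers malformed responses.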

Container Health Status

All services include Docker health checks:

# Check all container health
docker compose ps

# Detailed health status
docker inspect --format='{{.State.Health.Status}}' headscale
docker inspect --format='{{.State.Health.Status}}' headscale-db
docker inspect --format='{{.State.Health.Status}}' nginx

Health checks run automatically:

Headscale: Every 30s (command: headscale health)
PostgreSQL: Every 10s (command: pg_isready)
nginx: Every 30s (HTTP check to /health)
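
Because the checks run on a schedule, a freshly started container reports "starting" for a while before it becomes "healthy". A deployment script may want to block until that happens. The following is a sketch only: wait_healthy and container_health are hypothetical helper names, and the default timeout and polling interval are assumptions rather than values from this stack.

```shell
# Hypothetical helper: print the Docker health status of one container.
container_health() {
    docker inspect --format='{{.State.Health.Status}}' "$1" 2>/dev/null
}

# Hypothetical helper: poll until the container is healthy.
# Arguments: container name, timeout in seconds (default 60),
# polling interval in seconds (default 2). Returns 1 on timeout.
wait_healthy() {
    local name=$1 timeout=${2:-60} interval=${3:-2} waited=0
    while [ "$waited" -le "$timeout" ]; do
        if [ "$(container_health "$name")" = "healthy" ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Usage (not run here):
#   wait_healthy headscale 120 || echo "headscale did not become healthy"
```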

Health Check Configuration

From docker-compose.yml:

headscale:
  healthcheck:
    test: [CMD, headscale, health]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 10s

postgres:
  healthcheck:
    test: [CMD-SHELL, "pg_isready -U headscale"]
    interval: 10s
    timeout: 5s
    retries: 5

nginx:
  healthcheck:
    test: [CMD, wget, --quiet, --tries=1, --spider, http://localhost:8080/health]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 10s
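
These settings determine how quickly Docker flags a failing service. Docker marks a container unhealthy after "retries" consecutive failed checks, and each check starts "interval" seconds after the previous one completes and may run for up to "timeout" seconds. The arithmetic below is a rough upper bound under those rules, not a measured value; detection_bound is a hypothetical helper name.

```shell
# Rough upper bound, in seconds, on how long a failure can go unflagged:
# each of the `retries` failing checks may take up to `timeout`, and the
# next check starts `interval` seconds after the previous one completes.
detection_bound() {
    local interval=$1 timeout=$2 retries=$3
    echo $(( retries * (interval + timeout) ))
}

detection_bound 30 10 3   # headscale → 120
detection_bound 10 5 5    # postgres  → 75
detection_bound 30 5 3    # nginx     → 105
```

Failures during start_period do not count toward the retry limit, so the bound applies only once that period has passed.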

Prometheus Metrics

Headscale exposes Prometheus-compatible metrics for detailed monitoring.

Metrics Endpoint

Access metrics on port 9090 (localhost only for security):

# View all metrics
curl http://localhost:9090/metrics

# Filter specific metrics
curl http://localhost:9090/metrics | grep headscale_
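
The endpoint returns the Prometheus text format: one sample per line, with optional labels in braces and the value last. For quick checks without a Prometheus server, a small filter can total every series of one metric. This is a sketch under the assumption that samples carry no trailing timestamp; metric_sum is a hypothetical helper name.

```shell
# Hypothetical helper: sum all series of one metric from Prometheus
# text-format input on stdin. Exits with status 1 if the metric is absent.
# Assumes the value is the last field (no trailing timestamp).
metric_sum() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comments
        NF < 2 { next }                    # skip blank or malformed lines
        {
            metric = $1
            brace = index(metric, "{")     # strip the label set, if any
            if (brace > 0) metric = substr(metric, 1, brace - 1)
            if (metric == name) { sum += $NF; found = 1 }
        }
        END { if (!found) exit 1; print sum }
    '
}

# Usage (not run here), with any metric name that appears in the output:
#   curl -s http://localhost:9090/metrics | metric_sum headscale_nodes_total
```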

Key Metrics

Node Metrics

# Total registered nodes
headscale_nodes_total

# Nodes by state
headscale_nodes_registered
headscale_nodes_online
headscale_nodes_offline

# Node registration rate
rate(headscale_node_registrations_total[5m])

Network Metrics

# Active connections
headscale_derp_connections_active

# Data transfer
headscale_network_bytes_sent_total
headscale_network_bytes_received_total

# Connection quality
headscale_connection_latency_seconds

API Metrics

# Request rate
rate(headscale_http_requests_total[1m])

# Request duration
headscale_http_request_duration_seconds

# Error rate
rate(headscale_http_requests_total{code=~"5.."}[5m])

Database Metrics

# Database connections
headscale_db_connections_open
headscale_db_connections_idle

# Query duration
headscale_db_query_duration_seconds

# Connection pool
headscale_db_max_open_connections

Metrics Configuration

From config/config.yaml:

listen_addr: 0.0.0.0:8080
metrics_listen_addr: 0.0.0.0:9090

Metrics are bound to 0.0.0.0:9090 inside the container but exposed only to 127.0.0.1:9090 on the host via port mapping. Never expose metrics publicly without authentication.
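
One way to confirm the loopback-only mapping is to inspect the published bindings. This sketch assumes the container is named headscale, as in the docker inspect commands earlier, and that docker port prints one host binding per line. The name all_loopback is a hypothetical helper.

```shell
# Hypothetical helper: succeed only if every host binding read from stdin
# is on a loopback address. Empty input counts as a failure, because it
# means the port is not published at all.
all_loopback() {
    local line seen=0
    while IFS= read -r line; do
        [ -z "$line" ] && continue
        seen=1
        case "$line" in
            127.*:*|'[::1]':*) ;;          # loopback binding, keep checking
            *) return 1 ;;                 # anything else is reachable externally
        esac
    done
    [ "$seen" -eq 1 ]
}

# Usage (not run here):
#   docker port headscale 9090 | all_loopback || echo "metrics port is exposed"
```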

Log Management

Viewing Logs

# All service logs
docker compose logs -f

# Specific service
docker compose logs -f headscale
docker compose logs -f postgres
docker compose logs -f nginx

# Last N lines
docker compose logs --tail 100 headscale

# With timestamps
docker compose logs -f --timestamps headscale

# Since specific time
docker compose logs --since 30m headscale

Log Levels

Configure logging in config/config.yaml:

log:
  format: text  # or: json
  level: info   # debug, info, warn, error

Production

log:
  format: json
  level: info

Use JSON format for easier parsing by log aggregators.

Development

log:
  format: text
  level: debug

Use text format and debug level for troubleshooting.

Log Analysis

# Search for errors
docker compose logs headscale | grep -i error

# Count error occurrences
docker compose logs --since 24h headscale | grep -i error | wc -l

# Monitor for failed authentication
docker compose logs -f headscale | grep "authentication failed"

# Track node registrations
docker compose logs headscale | grep "node registered"
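
When the log format is set to json, a per-level tally is often more useful than a plain grep for "error". The sketch below assumes each log line is a JSON object with a string "level" field; that field name is an assumption about the log schema, and count_by_level is a hypothetical helper name.

```shell
# Hypothetical helper: count JSON log lines per level.
# Reads log lines on stdin and prints "<level> <count>", sorted by level.
# Lines without a "level" field are ignored.
count_by_level() {
    sed -n 's/.*"level"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (not run here):
#   docker compose logs --no-log-prefix --since 24h headscale | count_by_level
```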

Log Rotation

Configure Docker log rotation in /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

# Apply configuration
sudo systemctl restart docker
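
With these options each container keeps at most max-file files of max-size each, so the worst-case log footprint is easy to estimate. The container count of 4 below is an assumption for illustration, and log_budget_mb is a hypothetical helper name.

```shell
# Worst-case json-file log footprint in megabytes:
# containers x max-size (MB) x max-file.
log_budget_mb() {
    local containers=$1 max_size_mb=$2 max_file=$3
    echo $(( containers * max_size_mb * max_file ))
}

log_budget_mb 4 10 3   # → 120
```

Note that daemon-level log options apply only to containers created after the change, so existing containers need to be recreated (for example with docker compose up -d --force-recreate) before the limits take effect.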

Resource Monitoring

Container Resource Usage

# Real-time resource stats
docker stats

# Specific containers
docker stats headscale headscale-db nginx

# Single snapshot
docker stats --no-stream

Example output:

NAME            CPU %   MEM USAGE / LIMIT   MEM %   NET I/O         BLOCK I/O
headscale       0.05%   41MB / 970MB        4%      1.6MB / 1.7MB   512KB / 0B
headscale-db    0.01%   25MB / 970MB        2%      800KB / 850KB   1MB / 2MB
nginx           0.00%   30MB / 970MB        3%      140KB / 148KB   0B / 0B
headplane       0.00%   180MB / 970MB       18%     7.6MB / 3.9MB   0B / 0B
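
For unattended checks, the same data can be reduced to the containers that exceed a memory threshold. This sketch relies on the --format option of docker stats with the Name and MemPerc fields; over_mem_threshold is a hypothetical helper name and the threshold is yours to choose.

```shell
# Hypothetical helper: read "<name> <mem%>" lines on stdin and print the
# names whose memory percentage is above the given threshold.
over_mem_threshold() {
    awk -v limit="$1" '
        {
            pct = $2
            sub(/%$/, "", pct)             # drop the trailing percent sign
            if (pct + 0 > limit + 0) print $1
        }
    '
}

# Usage (not run here):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_mem_threshold 80
```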

System Resources

# Disk usage
df -h
du -sh data/ config/ backups/

# Docker disk usage
docker system df

# Detailed breakdown
docker system df -v

# Memory usage
free -h

# CPU load
uptime

Database Monitoring

PostgreSQL

# Connection count
docker exec headscale-db psql -U headscale -c "SELECT count(*) FROM pg_stat_activity;"

# Database size
docker exec headscale-db psql -U headscale -c "SELECT pg_size_pretty(pg_database_size('headscale'));"

# Active queries
docker exec headscale-db psql -U headscale -c "SELECT pid, age(clock_timestamp(), query_start), query FROM pg_stat_activity WHERE state != 'idle';"

# Table sizes
docker exec headscale-db psql -U headscale -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;"

SQLite

# Database size
ls -lh data/db.sqlite
du -h data/db.sqlite

# Check integrity
docker exec headscale sqlite3 /var/lib/headscale/db.sqlite "PRAGMA integrity_check;"

# View tables
docker exec headscale sqlite3 /var/lib/headscale/db.sqlite ".tables"

# Row counts
docker exec headscale sqlite3 /var/lib/headscale/db.sqlite "SELECT 'nodes', COUNT(*) FROM nodes UNION SELECT 'users', COUNT(*) FROM users;"

Monitoring Stack Setup

Prometheus + Grafana

Add monitoring services to your stack:

docker-compose.monitoring.yml

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    ports:
      - "127.0.0.1:9091:9090"
    networks:
      - headscale-network
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3002:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - headscale-network
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false

volumes:
  prometheus-data:
  grafana-data:

Prometheus Configuration

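Create prometheus.yml. The postgres and cadvisor jobs assume postgres-exporter and cAdvisor containers, which the Compose file above does not define; add those services or remove the two jobs: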

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'headscale'
    static_configs:
      - targets: ['headscale:9090']
        labels:
          service: 'headscale'

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          service: 'postgres'

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          service: 'docker'

Alerting

Basic Alert Script

monitor-headscale.sh

#!/bin/bash

# Alert recipient; must be set in the environment (for example in the crontab)
ALERT_EMAIL="${ALERT_EMAIL:?Set ALERT_EMAIL to the alert recipient}"

# Health check
if ! curl -sf http://localhost:8000/health > /dev/null; then
    echo "ALERT: Headscale health check failed" | mail -s "Headscale Down" "$ALERT_EMAIL"
fi

# Disk space check (df -P keeps each filesystem on a single line)
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
    echo "ALERT: Disk usage at ${DISK_USAGE}%" | mail -s "Disk Space Critical" "$ALERT_EMAIL"
fi

# Database connection check
if ! docker exec headscale-db pg_isready -U headscale > /dev/null 2>&1; then
    echo "ALERT: Database connection failed" | mail -s "Database Down" "$ALERT_EMAIL"
fi

Schedule with cron:

# Every 5 minutes
*/5 * * * * /path/to/monitor-headscale.sh
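
Run every 5 minutes, the script repeats the same alert for as long as a problem persists. One common remedy is to remember the last observed state and alert only on transitions. The function below is a sketch of that idea; state_changed and the state file path in the usage comment are hypothetical and not part of this stack.

```shell
# Hypothetical helper: record the new state in a file and succeed only if
# it differs from the previously recorded state. A missing file counts as
# a change, so the first run always reports one.
state_changed() {
    local file=$1 new=$2 old=""
    if [ -f "$file" ]; then
        old=$(cat "$file")
    fi
    printf '%s\n' "$new" > "$file"
    [ "$old" != "$new" ]
}

# Usage (not run here):
#   if curl -sf http://localhost:8000/health > /dev/null; then state=up; else state=down; fi
#   if state_changed /var/tmp/headscale.state "$state" && [ "$state" = down ]; then
#       echo "ALERT: Headscale health check failed"
#   fi
```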

Prometheus Alertmanager

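Create alertmanager.yml. This assumes an Alertmanager container has been added to the stack and is listed under the alerting section of prometheus.yml; neither is shown in the files above: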

route:
  receiver: 'email'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'

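Define alert rules in alerts.yml. Prometheus loads the file only if prometheus.yml lists it under rule_files and the file is mounted into the Prometheus container: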

groups:
  - name: headscale
    rules:
      - alert: HeadscaleDown
        expr: up{job="headscale"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Headscale is down"

      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes{name="headscale"} / container_spec_memory_limit_bytes{name="headscale"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Headscale memory usage above 90%"

      - alert: DatabaseConnectionsFull
        expr: headscale_db_connections_open >= headscale_db_max_open_connections
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool exhausted"

Performance Monitoring

Key Performance Indicators

Response Time

/health endpoint: < 10ms
API endpoints: < 50ms
Node registration: < 500ms

Throughput

API requests: 100+ req/s
WebSocket connections: 1000+ concurrent
DERP relay: 100+ Mbps

Resource Usage

CPU: < 10% average
Memory: < 512MB typical
Disk I/O: < 10 MB/s

Availability

Uptime: 99.9%+
Health checks: 100% pass
Database: < 1s query time
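
An uptime target is easier to reason about as a downtime budget. The arithmetic below converts a percentage target into allowed minutes of downtime over a period; the 30-day window is an assumption for illustration, and downtime_minutes is a hypothetical helper name.

```shell
# Allowed downtime in minutes for an availability target (percent)
# over a period (days): (100 - target) / 100 * days * 24 * 60.
downtime_minutes() {
    LC_ALL=C awk -v target="$1" -v days="$2" \
        'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

downtime_minutes 99.9 30   # → 43.2
```

At 99.9% over 30 days the budget is about 43 minutes, which is a little under nine consecutive failed runs of a 5-minute check.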

Benchmarking

# API response time
time curl http://localhost:8000/health

# Load testing
ab -n 1000 -c 10 http://localhost:8000/health

# Database query performance
docker exec headscale-db psql -U headscale -c "EXPLAIN ANALYZE SELECT * FROM nodes;"
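
A single timed request says little about tail latency, which is what the response-time targets above are usually judged by. The sketch below computes a nearest-rank percentile over a list of timings, which can be collected with curl's time_total write-out variable. The name percentile is a hypothetical helper and the sample count in the usage comment is arbitrary.

```shell
# Hypothetical helper: nearest-rank percentile of numbers read from stdin,
# one per line. Argument: percentile between 0 and 100.
percentile() {
    LC_ALL=C sort -n | awk -v p="$1" '
        NF { values[++n] = $1 }
        END {
            if (n == 0) exit 1
            rank = p * n / 100
            idx = int(rank)
            if (idx < rank) idx++          # round up to the next rank
            if (idx < 1) idx = 1
            print values[idx]
        }
    '
}

# Usage (not run here): 95th percentile of 50 request times, in seconds
#   for i in $(seq 1 50); do
#       curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8000/health
#   done | percentile 95
```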

Status Page

Create a simple status page:

status.html

<!DOCTYPE html>
<html>
<head>
    <title>Headscale Status</title>
    <meta http-equiv="refresh" content="30">
</head>
<body>
    <h1>Headscale Status</h1>
    <div id="status"></div>
    
    <script>
        fetch('http://localhost:8000/health')
            .then(r => r.json())
            .then(d => {
                document.getElementById('status').innerHTML = 
                    `Status: ${d.status}<br>Last checked: ${new Date()}`;
            })
            .catch(e => {
                document.getElementById('status').innerHTML = 
                    `Status: Error - ${e.message}`;
            });
    </script>
</body>
</html>

Troubleshooting

Metrics endpoint not accessible

# Check port binding
docker compose ps | grep headscale

# Verify metrics configuration
grep metrics_listen_addr config/config.yaml

# Test from inside container
docker exec headscale curl http://localhost:9090/metrics

High memory usage

# Check for memory leaks
docker stats --no-stream headscale

# Review database connection pool
grep max_open_conns config/config.yaml

# Restart service
docker compose restart headscale

Logs filling disk

# Check current log size
docker inspect --format='{{.LogPath}}' headscale
sudo du -h "$(docker inspect --format='{{.LogPath}}' headscale)"

# Configure log rotation
sudo nano /etc/docker/daemon.json
# Add log rotation settings

# Restart Docker
sudo systemctl restart docker

Related Resources

Troubleshooting: Diagnose and fix common issues
Security: Secure your monitoring endpoints
