Overview
Monitoring infrastructure provides visibility into host health, service availability, and backup job success across the entire homelab.

- Beszel: host and container metrics
- Uptime Kuma: service uptime monitoring
- Healthchecks.io: backup job heartbeat monitoring
Beszel
Purpose: Lightweight host and container resource monitoring

Architecture:
- Beszel server: runs on pi-prod-01 (192.168.10.20)
- Beszel agents: deployed on all hosts

Access: https://beszel.giohosted.com
Monitored Hosts
| Host | Type | Agent Location | Metrics |
|---|---|---|---|
| pve-prod-01 | Proxmox | Native agent | CPU, RAM, disk, network |
| pve-prod-02 | Proxmox | Native agent | CPU, RAM, disk, network |
| nas-prod-01 | Unraid | Docker agent | CPU, RAM, disk, network, array status |
| docker-prod-01 | VM | Docker agent | CPU, RAM, disk, containers |
| auth-prod-01 | VM | Docker agent | CPU, RAM, disk, containers |
| immich-prod-01 | VM | Docker agent | CPU, RAM, disk, containers |
| pi-prod-01 | Raspberry Pi | Native agent | CPU, RAM, disk, network |
Features
System metrics:
- CPU usage (per core and aggregate)
- Memory usage (used/available/cached)
- Disk I/O and usage percentage
- Network throughput (TX/RX)
- System uptime

Container metrics:
- Per-container CPU and memory usage
- Container status (running/stopped)
- Container count

Alerting:
- Configurable thresholds
- Discord webhook notifications
- Email alerts (optional)
Agent Deployment
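For VMs and Unraid hosts, the agent runs as a Docker container. A minimal sketch of that deployment, assuming the upstream `henrygd/beszel-agent` image and its `PORT`/`KEY` settings (the key is shown by the Beszel server when adding a new system; verify against current Beszel docs before use):

```shell
# Hedged example: image name and variables follow upstream Beszel docs.
# The Docker socket is mounted read-only so the agent can report container metrics.
# KEY is the public key displayed by the Beszel server's "Add system" dialog.
docker run -d \
  --name beszel-agent \
  --restart unless-stopped \
  --network host \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -e PORT=45876 \
  -e KEY="<public key from Beszel server>" \
  henrygd/beszel-agent
```

Host networking keeps the agent reachable on port 45876 without extra port mappings, matching the firewall note in the troubleshooting section.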
Native agent (Proxmox hosts, Raspberry Pi): installed as a systemd service (`beszel-agent`).

Docker agent (for VMs and Unraid): deployed as a container on each host, with the Docker socket mounted for container metrics.

SSO Configuration
OIDC via Authentik.

Configuration steps:
- Create a custom scope in Authentik: `email_verified` with value `true`
- Add the scope to the Beszel OIDC provider
- Configure Beszel OAuth settings with the Authentik endpoints
- Test login with an admin account

Endpoints:
- Dashboard: https://beszel.giohosted.com
- API: https://beszel.giohosted.com/api
Uptime Kuma
Purpose: HTTP/HTTPS service availability monitoring

Location: pi-prod-01 (192.168.10.20)

Access: https://uptime.giohosted.com
Monitored Services
Internal services:
- Traefik (https://traefik.giohosted.com)
- Plex (https://plex.giohosted.com)
- Sonarr, Radarr, Prowlarr (all ARR services)
- Audiobookshelf, Calibre-Web-Automated, Shelfmark
- Immich
- Authentik
- AdGuard Home (dns-prod-01 and dns-prod-02)
- Proxmox (pve-prod-01 and pve-prod-02)

External services (via Cloudflare Tunnel):
- audiobooks.giohosted.com
- books.giohosted.com
- request.giohosted.com
- auth.giohosted.com
Monitor Types
HTTP(S) monitoring:
- Status code checking (expect 200, 301, etc.)
- Response time tracking
- SSL certificate expiration alerts
- Keyword presence/absence validation

TCP port monitoring:
- Port 32400 (Plex external)
- Port 22 (SSH on critical hosts)

Ping monitoring:
- Host reachability
- Network latency
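When a monitor misbehaves, the same three check types can be reproduced by hand from any LAN host. The commands below are illustrative only (hostnames and IPs are taken from this document; `nc` and `openssl` are assumed to be installed):

```shell
# HTTP(S): status code, response time, and certificate validity dates
curl -sS -o /dev/null -w 'status=%{http_code} time=%{time_total}s\n' https://plex.giohosted.com
echo | openssl s_client -connect plex.giohosted.com:443 -servername plex.giohosted.com 2>/dev/null \
  | openssl x509 -noout -dates

# TCP port: is the service listening? (Plex external port from the list above)
nc -zv -w 5 192.168.10.20 32400

# ICMP: host reachability and latency
ping -c 3 192.168.10.20
```

If the manual check succeeds while Uptime Kuma reports the service down, the problem is usually the monitor configuration (retries, timeout, keyword) rather than the service itself.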
Notification Channels
Discord webhook:
- Service down alerts
- Service recovery notifications
- SSL certificate expiration warnings

Check settings:
- Check interval: 60 seconds (critical services), 300 seconds (non-critical)
- Retry: 3 attempts before marking down
- Timeout: 10 seconds per request
Status Page
Public status page (optional):
- Accessible at a custom URL
- Shows current status of monitored services
- Historical uptime percentages
- No authentication required (read-only)

Access:
- Admin panel: https://uptime.giohosted.com (local only, no SSO)
- Status page: can be configured for public access

Uptime Kuma intentionally does NOT have SSO. Admin access is LAN-only, with an optional public status page for read-only viewing.
Healthchecks.io
Purpose: Cron job and backup script heartbeat monitoring

Platform: Cloud-hosted at healthchecks.io (free tier)

Alternative: Self-hosted instance (future consideration)

Monitored Jobs
| Job Name | Frequency | Script | Monitored Action |
|---|---|---|---|
| Docker Backup | Daily (2 AM) | /opt/scripts/backup-docker.sh | Successful rsync to NAS |
| Plex DB Backup | Daily (3 AM) | /opt/scripts/backup-plex-db.sh | Successful Plex DB backup |
| PBS Backup (docker-prod-01) | Daily (1 AM) | Proxmox Backup Server | VM snapshot completion |
| PBS Backup (auth-prod-01) | Weekly (Sun 1 AM) | Proxmox Backup Server | VM snapshot completion |
| PBS Backup (immich-prod-01) | Weekly (Sun 2 AM) | Proxmox Backup Server | VM snapshot completion |
| Synology ABB Pull | Daily (4 AM) | Synology Active Backup | Successful backup pull |
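The script-based jobs above correspond to crontab entries along these lines (a sketch reconstructed from the schedule table, not the live crontab):

```shell
# Hypothetical crontab on docker-prod-01, matching the schedule table above
0 2 * * * /opt/scripts/backup-docker.sh    # Docker backup, daily 2 AM
0 3 * * * /opt/scripts/backup-plex-db.sh   # Plex DB backup, daily 3 AM
```

The PBS and Synology jobs are scheduled by their own platforms rather than cron.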
Integration
Ping URL format:
- `$HC_URL`: success ping (job completed successfully)
- `$HC_URL/start`: start ping (job started)
- `$HC_URL/fail`: failure ping (job failed)
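Wired into a wrapper script, the start/success/fail pattern looks like this (the UUID and script path are placeholders; `-m 10` bounds each ping at 10 seconds, and `|| true` keeps a Healthchecks outage from failing the job itself):

```shell
#!/usr/bin/env bash
# Hedged sketch of the Healthchecks ping pattern; HC_URL is a placeholder.
HC_URL="https://hc-ping.com/<uuid>"

curl -fsS -m 10 --retry 3 "$HC_URL/start" >/dev/null || true   # job started

if /opt/scripts/backup-docker.sh; then
  curl -fsS -m 10 --retry 3 "$HC_URL" >/dev/null || true       # success ping
else
  curl -fsS -m 10 --retry 3 "$HC_URL/fail" >/dev/null || true  # failure ping
fi
```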
Alert Configuration
Grace period:
- Daily jobs: 15 minutes
- Weekly jobs: 60 minutes
- Allows for slight delays without false alarms

Notification channels:
- Discord webhook (primary)
- Email (secondary)
- SMS (critical jobs only, requires paid plan)

Alert triggers:
- Job did not ping within expected schedule plus grace period
- Job sent an explicit failure ping
- Job started but never completed
Dashboard
Access: https://healthchecks.io/checks/

Features:
- Visual timeline of pings
- Last ping time and status
- Expected next ping time
- Historical reliability percentage
- Manual ping test button
Monitoring Strategy
Complementary Roles
Beszel: Infrastructure-level metrics
- “Is the host healthy?”
- “Is CPU/RAM/disk usage normal?”
- “Are containers running?”

Uptime Kuma: Service-level availability
- “Is the service responding?”
- “Is HTTPS working with a valid certificate?”
- “Can users access the service?”

Healthchecks.io: Job-level execution
- “Did the backup run?”
- “Did the backup succeed?”
- “Are cron jobs executing on schedule?”
Alert Fatigue Prevention
Thresholds:
- Beszel: alert only on sustained high usage (>90% for 5+ minutes)
- Uptime Kuma: 3 retry attempts before alerting
- Healthchecks: grace period prevents early alerts

Notification routing:
- Critical alerts: Discord with @mention
- Non-critical alerts: Discord without mention
- Informational: log only, no notification
Backup Monitoring Deep Dive
Docker Backup Script
Script: `/opt/scripts/backup-docker.sh`

Coverage:
- `/opt/stacks/`: all compose files
- `/opt/appdata/`: all container persistent data

Safety features:
- Mountpoint check: fails if the NAS mount is missing (prevents backing up to local disk)
- Lockfile: prevents concurrent runs
- Healthchecks ping on success/failure
- Logs to `/var/log/backup-docker.log`

Healthchecks integration:
- Success ping: backup completed without errors
- Failure ping: rsync error, mountpoint missing, or lockfile conflict
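Putting those safety features together, the script likely follows this shape. This is a sketch, not the live script: the NAS mountpoint (`/mnt/nas`, as used in the troubleshooting section), lockfile path, and Healthchecks UUID are assumptions.

```shell
#!/usr/bin/env bash
# Hedged sketch of /opt/scripts/backup-docker.sh; paths and UUID are assumptions.
set -euo pipefail

HC_URL="https://hc-ping.com/<docker-backup-uuid>"
NAS_MOUNT="/mnt/nas"
DEST="$NAS_MOUNT/backups/docker"
LOCKFILE="/var/run/backup-docker.lock"
LOG="/var/log/backup-docker.log"

fail() { curl -fsS -m 10 "$HC_URL/fail" >/dev/null || true; exit 1; }

# Lockfile: prevent concurrent runs
exec 9>"$LOCKFILE"
flock -n 9 || { echo "$(date -Is) lockfile conflict" >>"$LOG"; fail; }

# Mountpoint check: refuse to back up to the local disk if the NAS is absent
mountpoint -q "$NAS_MOUNT" || { echo "$(date -Is) NAS not mounted" >>"$LOG"; fail; }

curl -fsS -m 10 "$HC_URL/start" >/dev/null || true

rsync -a --delete /opt/stacks/  "$DEST/stacks/"  >>"$LOG" 2>&1 || fail
rsync -a --delete /opt/appdata/ "$DEST/appdata/" >>"$LOG" 2>&1 || fail

curl -fsS -m 10 "$HC_URL" >/dev/null || true
echo "$(date -Is) backup OK" >>"$LOG"
```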
Plex DB Backup Script
Script: `/opt/scripts/backup-plex-db.sh`

Process:
- Stop Plex service
- rsync `/opt/appdata/plex/` to NAS `/backups/plex/db/`
- Restart Plex service
- Ping Healthchecks on success/failure
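A sketch of that process, assuming Plex runs under Docker (container name `plex`) and the NAS is mounted at `/mnt/nas`; both are assumptions, as is the check UUID. The trap guarantees Plex restarts even if the copy fails:

```shell
#!/usr/bin/env bash
# Hedged sketch of /opt/scripts/backup-plex-db.sh; container name, paths, and
# the Healthchecks UUID are assumptions.
set -euo pipefail
HC_URL="https://hc-ping.com/<plex-check-uuid>"

curl -fsS -m 10 "$HC_URL/start" >/dev/null || true

docker stop plex
trap 'docker start plex' EXIT   # always restart Plex, even on rsync failure

if rsync -a /opt/appdata/plex/ /mnt/nas/backups/plex/db/; then
  curl -fsS -m 10 "$HC_URL" >/dev/null || true
else
  curl -fsS -m 10 "$HC_URL/fail" >/dev/null || true
fi
```

Stopping Plex first ensures the SQLite databases are copied in a consistent state.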
PBS Backup Monitoring
Proxmox Backup Server:
- Automated VM snapshots configured per host
- Backup jobs run via the Proxmox scheduler
- Healthchecks ping via a post-job hook script
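On Proxmox VE, that hook can be a vzdump hook script registered via `script:` in `/etc/vzdump.conf`. A sketch follows; the phase names come from the vzdump hook interface, while the script path and UUID are assumptions (each per-VM job in the table above would use its own check UUID):

```shell
#!/usr/bin/env bash
# Hypothetical /usr/local/bin/vzdump-healthchecks.sh; vzdump passes the phase
# as the first argument. HC_URL is a placeholder.
phase="$1"
HC_URL="https://hc-ping.com/<pbs-check-uuid>"

case "$phase" in
  job-start) curl -fsS -m 10 "$HC_URL/start" >/dev/null || true ;;
  job-end)   curl -fsS -m 10 "$HC_URL"       >/dev/null || true ;;
  job-abort) curl -fsS -m 10 "$HC_URL/fail"  >/dev/null || true ;;
esac
```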
Future: Kubernetes Monitoring
When the Phase 6 k3s cluster is introduced:

Prometheus + Grafana:
- Replaces Beszel for k8s metrics
- Pod, node, and persistent volume monitoring
- Custom dashboards for application metrics

Cluster dashboard:
- Kubernetes cluster management UI
- Real-time resource viewing
- Log aggregation

Healthchecks.io:
- Continues for application-level and job monitoring
- Kubernetes-native health checks supplement but don’t replace it
Docker-based services (ARR stack, qBittorrent) will remain on Beszel monitoring even after k3s introduction.
Troubleshooting Monitoring
Beszel agent not reporting:
- Check agent status on the host: `systemctl status beszel-agent` (native) or `docker logs beszel-agent` (Docker)
- Verify network connectivity to the Beszel server: `curl http://192.168.10.20:45876`
- Check the agent key matches the server configuration
- Review firewall rules: agent port 45876 must be accessible
Uptime Kuma false positives:
- Increase retry count: Settings → Monitor → Retries (try 5)
- Increase timeout: Settings → Monitor → Timeout (try 30 seconds)
- Check certificate expiration alerts are set correctly
- Verify keyword matching is not too strict
Healthchecks missing pings:
- Test a ping manually: `curl -fsS https://hc-ping.com/<uuid>`
- Check the cron service is running: `systemctl status cron`
- Review cron logs: `grep CRON /var/log/syslog`
- Verify the script has internet access: test with `curl https://google.com`
- Check the script exit code: add `echo $?` at the end of the script
Backup script not running:
- Check cron syntax: `crontab -l`
- Verify the script is executable: `chmod +x /opt/scripts/backup-docker.sh`
- Check the mountpoint before a run: `mount | grep /mnt/nas`
- Review script logs: `tail -f /var/log/backup-docker.log`
- Test a manual run: `/opt/scripts/backup-docker.sh`