Health Check Endpoints
The agent exposes health check endpoints via the OpenTelemetry Collector’s health_check extension.
Configuration
The health check extension is configured in the collector configuration:
internal/agent/agent.go
The health check endpoint listens on port 13133 by default. This port should be accessible for monitoring systems.
Health Check Endpoints
Liveness Probe
Endpoint: http://localhost:13133/
Returns 200 if the collector is running.
Readiness Probe
Endpoint: http://localhost:13133/
Returns 200 if the collector can receive data.
Testing Health Endpoints
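One way to test the endpoint from a script is sketched below. It assumes the default address http://localhost:13133/ described above; the function name check_health and the 2-second timeout are illustrative choices, not part of the agent.

```python
import http.client
import urllib.error
import urllib.request

# Default address from the text above; adjust if the extension listens elsewhere.
HEALTH_URL = "http://localhost:13133/"


def check_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and HTTP error responses all end up here.
        return False
```

Calling check_health() with no arguments probes the default address. Because the liveness and readiness probes share the same URL, a single call covers both.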
Agent Status Reporting
The agent reports its operational status to the KloudMate platform through periodic status updates.
Status Parameters
The agent sends the following status information:
internal/updater/updater.go
Status Values
Agent Status
- Running: Agent is operational and managing the collector
- Stopped: Agent has been stopped or is shutting down
Systemd Service Monitoring (Linux)
For Linux installations, the agent runs as a systemd service.
Check Service Status
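The service state can also be checked from a script. The sketch below wraps systemctl is-active, which exits with code 0 and prints "active" when a unit is running. The unit name kmagent is an assumption made for illustration; use the name your installation registered.

```python
import subprocess

# Hypothetical unit name; substitute the one used by your installation.
UNIT = "kmagent"


def service_is_active(unit: str = UNIT, run=subprocess.run) -> bool:
    """Return True if `systemctl is-active` reports the unit as active."""
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code simply means "not active"
    )
    return result.returncode == 0 and result.stdout.strip() == "active"
```

The run parameter exists only so the command runner can be replaced in tests; in normal use, call service_is_active() or pass a different unit name.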
Service Management Commands
Docker Container Monitoring
For Docker installations, monitor the agent container directly.
Container Status
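A minimal sketch of reading the container state programmatically, using docker inspect with a Go template on .State.Status. The container name kmagent is hypothetical; use the name reported by docker ps on your host.

```python
import subprocess

# Hypothetical container name; substitute the one shown by `docker ps`.
CONTAINER = "kmagent"


def container_status(name: str = CONTAINER, run=subprocess.run) -> str:
    """Return the container state (e.g. "running", "exited"), or "unknown"."""
    result = run(
        ["docker", "inspect", "--format", "{{.State.Status}}", name],
        capture_output=True,
        text=True,
        check=False,  # inspect exits non-zero when the container does not exist
    )
    if result.returncode != 0:
        return "unknown"
    return result.stdout.strip()
```

Anything other than "running" is worth investigating through the container logs.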
Container Logs
Container Resource Usage
Kubernetes Monitoring
For Kubernetes deployments, use kubectl and Kubernetes-native monitoring.
Pod Status
Pod Health
- Ready: True when pod is accepting traffic
- ContainersReady: True when all containers are ready
- PodScheduled: True when pod is assigned to a node
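The three conditions above appear under status.conditions in the Pod object, each with a type and a status of "True", "False", or "Unknown". The sketch below evaluates them from the parsed JSON that kubectl get pod <pod-name> -o json prints; the function names are illustrative.

```python
REQUIRED_CONDITIONS = ("Ready", "ContainersReady", "PodScheduled")


def pod_conditions(pod: dict) -> dict:
    """Map each condition type in a Pod object to True or False."""
    conditions = pod.get("status", {}).get("conditions", [])
    # Kubernetes encodes condition status as the strings "True"/"False"/"Unknown".
    return {c["type"]: c.get("status") == "True" for c in conditions}


def pod_is_healthy(pod: dict) -> bool:
    """Return True only if every required condition is present and True."""
    state = pod_conditions(pod)
    return all(state.get(name, False) for name in REQUIRED_CONDITIONS)
```

A condition that is missing or "Unknown" is treated as not met, which errs on the side of reporting a problem.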
Pod Logs
Resource Usage
Events
Monitor Kubernetes events for agent-related issues:
Configuration Update Monitoring
The agent periodically checks for configuration updates from the KloudMate platform.
Update Check Interval
The default check interval is configurable:
cmd/kmagent/main.go
The default configuration check interval is 60 seconds. For Kubernetes deployments, this can be customized via Helm values.
Monitoring Update Checks
Look for these log messages:
Performance Metrics
Agent Lifecycle Events
The agent logs key lifecycle events:
internal/agent/agent.go
Error Tracking
Monitor for these error patterns:
Monitoring Best Practices
Set Up Alerts
Configure alerts for:
- Agent/collector status changes
- Health check failures
- Configuration update failures
- High resource usage
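For health check failures, a common way to avoid alerting on a single transient failure is to require several consecutive failed checks. The sketch below illustrates that rule; the threshold of 3 is an assumed value, not something the agent defines.

```python
from typing import Sequence


def trailing_failures(results: Sequence[bool]) -> int:
    """Count failed checks at the end of a chronological result list."""
    count = 0
    for healthy in reversed(results):
        if healthy:
            break
        count += 1
    return count


def should_alert(results: Sequence[bool], threshold: int = 3) -> bool:
    """Return True when the most recent `threshold` checks all failed."""
    return trailing_failures(results) >= threshold
```

Each entry in results is True for a passing check and False for a failing one, oldest first. A successful check resets the count, so the alert clears as soon as the endpoint recovers.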
Regular Health Checks
Schedule periodic health checks:
- Every 30 seconds for production
- Monitor response time trends
- Track uptime metrics
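The points above can be sketched as a small polling loop that records each result and its response time, then summarizes uptime and mean latency. The 30-second interval comes from the list above; the names Sample, probe, monitor, and summarize are illustrative, and check can be any callable that returns True when the endpoint is healthy.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    healthy: bool
    latency_s: float


def probe(check: Callable[[], bool]) -> Sample:
    """Run one health check and record how long it took."""
    start = time.monotonic()
    healthy = check()
    return Sample(healthy, time.monotonic() - start)


def monitor(check: Callable[[], bool], interval_s: float = 30.0,
            rounds: int = 3, sleep: Callable[[float], None] = time.sleep) -> List[Sample]:
    """Run `rounds` checks, waiting `interval_s` seconds between them."""
    samples = []
    for i in range(rounds):
        samples.append(probe(check))
        if i < rounds - 1:
            sleep(interval_s)
    return samples


def summarize(samples: List[Sample]) -> dict:
    """Compute uptime percentage and mean latency over the samples."""
    if not samples:
        return {"uptime_pct": 0.0, "mean_latency_s": 0.0}
    up = sum(1 for s in samples if s.healthy)
    return {
        "uptime_pct": 100.0 * up / len(samples),
        "mean_latency_s": sum(s.latency_s for s in samples) / len(samples),
    }
```

In practice the samples would be sent to a metrics system rather than held in memory; the loop is bounded by rounds here so the sketch terminates.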
Log Aggregation
Centralize logs for:
- Multi-host deployments
- Historical analysis
- Pattern detection
- Compliance requirements
Resource Monitoring
Track resource usage:
- CPU utilization trends
- Memory consumption patterns
- Network traffic volume
- Disk I/O operations
Troubleshooting Monitoring Issues
Health Check Not Responding
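A reasonable first step is to find out whether anything is listening on the health check port at all, which separates a stopped collector or blocked port from an HTTP-level problem. The sketch below attempts a plain TCP connection; it assumes the default port 13133, and the function name is illustrative.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 13133,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False
```

If this returns False, check that the agent is running and that the port is not blocked by a firewall or left unpublished by the container runtime.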
Missing Status Updates
Next Steps
Troubleshooting
Diagnose and resolve common issues
Upgrading
Upgrade to the latest version