Overview
Synthetic monitoring in the KloudMate Agent encompasses:
Health Reporting
Continuous status updates to the KloudMate platform
Error Tracking
Automatic capture and transmission of error states
Version Monitoring
Track agent and collector versions across your infrastructure
Platform Identification
Unique agent identification for multi-environment management
Health Check Mechanism
The agent performs health checks as part of its configuration update cycle, sending detailed status information to the KloudMate API.
Status Reporting Payload
Every configuration check includes comprehensive health data:
updater.go:64-81
| Field | Description | Example Values |
|---|---|---|
| is_docker | Whether running in Docker mode | true, false |
| hostname | System hostname for identification | web-server-01 |
| platform | Operating system | linux, windows, darwin, docker |
| architecture | CPU architecture | amd64, arm64 |
| agent_version | Current agent version | 0.1.0 |
| collector_version | OpenTelemetry Collector version | 0.115.0 |
| agent_status | Agent process state | Running, Stopped |
| collector_status | Collector process state | Running, Stopped |
| last_error_message | Last error encountered | Error string or empty |
Status Collection
The agent gathers status information before each API call:
agent.go:232-257
- Uses mutex lock when checking collector reference
- Reads atomic boolean for agent running state
- Captures last error message for diagnostics
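The three points above can be sketched roughly as follows. The Agent, Collector, and Status types and the snapshot method are hypothetical stand-ins, not the code in agent.go; the sketch also assumes the last error message is guarded by the same mutex as the collector reference.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Collector is a hypothetical placeholder for the collector instance.
type Collector struct{}

// Agent is a hypothetical stand-in for the agent type.
type Agent struct {
	mu             sync.Mutex  // guards collector and collectorError
	collector      *Collector  // non-nil while a collector exists
	collectorError string      // last error message, empty if none
	isRunning      atomic.Bool // agent running state, read without the lock
}

// Status holds the values reported in the health check.
type Status struct {
	AgentStatus      string
	CollectorStatus  string
	LastErrorMessage string
}

// snapshot gathers the current status for the next API call.
func (a *Agent) snapshot() Status {
	s := Status{AgentStatus: "Stopped", CollectorStatus: "Stopped"}

	// The running flag is an atomic boolean, so no lock is needed here.
	if a.isRunning.Load() {
		s.AgentStatus = "Running"
	}

	// The collector reference and last error are read under the mutex.
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.collector != nil {
		s.CollectorStatus = "Running"
	}
	s.LastErrorMessage = a.collectorError
	return s
}

func main() {
	a := &Agent{collector: &Collector{}}
	a.isRunning.Store(true)
	fmt.Printf("%+v\n", a.snapshot())
}
```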
API Communication
The agent communicates with the KloudMate platform via HTTPS POST requests:
updater.go:90-106
- HTTPS for encrypted communication
- API key authentication in Authorization header
- Request timeout protection (20 seconds)
- Context-aware cancellation
API Endpoint Configuration
The update endpoint is automatically derived from the collector endpoint:
config.go:28-52
| Collector Endpoint | Derived API Endpoint |
|---|---|
| https://otel.kloudmate.com:4318 | https://api.kloudmate.com/agents/config-check |
| https://otel.kloudmate.dev:4318 | https://api.kloudmate.dev/agents/config-check |
| https://otel.example.io:4318 | https://api.example.io/agents/config-check |
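One way to reproduce the mapping in the table is sketched below. The rule used here (drop the port, swap the leading otel. host label for api., and append /agents/config-check) is inferred from the three example rows; the function name deriveUpdateURL is hypothetical and the logic in config.go may handle more cases.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// deriveUpdateURL is a hypothetical helper that applies the rule inferred
// from the endpoint table to a collector endpoint.
func deriveUpdateURL(collectorEndpoint string) (string, error) {
	u, err := url.Parse(collectorEndpoint)
	if err != nil {
		return "", err
	}
	host := u.Hostname() // host without the :4318 port
	if host == "" {
		return "", fmt.Errorf("no host in %q", collectorEndpoint)
	}
	if !strings.HasPrefix(host, "otel.") {
		return "", fmt.Errorf("host %q lacks the otel. prefix", host)
	}
	apiHost := "api." + strings.TrimPrefix(host, "otel.")
	return "https://" + apiHost + "/agents/config-check", nil
}

func main() {
	for _, ep := range []string{
		"https://otel.kloudmate.com:4318",
		"https://otel.kloudmate.dev:4318",
		"https://otel.example.io:4318",
	} {
		derived, err := deriveUpdateURL(ep)
		if err != nil {
			panic(err)
		}
		fmt.Println(ep, "->", derived)
	}
}
```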
Health Check Frequency
Health checks occur at configurable intervals:
Configuration
agent.go:194-230
- Default interval: 60 seconds
- First check: Immediate on agent startup
- Subsequent checks: On ticker interval
- Error handling: Logs errors but continues checking
- Graceful shutdown: Stops on context cancellation or shutdown signal
Configuring Check Interval
Recommended intervals:
- Production: 60-300 seconds
- Staging: 30-60 seconds
- Development: 10-30 seconds
Error Tracking and Reporting
The agent captures and reports errors from the collector:
Error Capture
agent.go:151-159
Collector errors are stored in the collectorError field and included in the next health check report.
Error Persistence
Errors remain in the status until the collector successfully starts:
agent.go:240-248
Unique Agent Identification
Each agent is uniquely identifiable through multiple attributes:
Identification Components
Platform Information
Operating system and architecture help categorize agents:
- Platform: linux, windows, darwin, docker
- Architecture: amd64, arm64, 386
Kubernetes-Specific Monitoring
For Kubernetes deployments, additional monitoring data is available through:
- DaemonSet agents: Report node-level metrics
- Deployment agents: Report cluster-level metrics
- Service discovery: Automatic endpoint detection
Dashboard Integration
The KloudMate platform uses health data to provide:
Agent Inventory
Complete list of all agents with status, version, and platform
Health Dashboard
Real-time view of agent and collector health across your infrastructure
Alert Configuration
Set up alerts for agent failures, version mismatches, or error conditions
Trend Analysis
Historical view of agent uptime and error patterns
Configuration Examples
Linux Installation with Custom Monitoring
Docker with Health Reporting
Kubernetes with Enhanced Monitoring
Monitoring Best Practices
Set Appropriate Intervals
Balance between monitoring responsiveness and API load:
- High-value production systems: 30-60 seconds
- Standard deployments: 60-120 seconds
- Large fleets (100+ agents): 120-300 seconds
Set Up Alerts
Configure alerts in the KloudMate dashboard for:
- Agent offline (no heartbeat for 2x check interval)
- Collector stopped
- Repeated errors
- Version drift across agents
Troubleshooting
No health data in dashboard
Symptoms: Agent appears to be running but not reporting to platform
Check:
- Verify API key is correct:
echo $KM_API_KEY
- Confirm update URL is reachable:
- Check agent logs for connection errors
- Verify firewall rules allow outbound HTTPS
Stale health data
Symptoms: Last seen timestamp is outdated
Solutions:
- Check if agent is still running:
- Verify ConfigCheckInterval is set correctly
- Review logs for repeated API failures
- Check network connectivity
Incorrect agent status
Symptoms: Dashboard shows wrong status
Solutions:
- Wait for next check interval to see updated status
- Manually trigger config check (restart agent)
- Verify agent and collector processes are running
- Check for clock skew on the host
High API call volume
Symptoms: Too many requests to update endpoint
Solutions:
- Increase ConfigCheckInterval to reduce frequency
- Verify only one agent instance is running per host
- Check for restart loops causing repeated checks
Security Considerations
Authentication Flow
Metrics and Observability
The agent reports several categories of observability data:
System Metrics
- Agent uptime
- Configuration check success rate
- Last successful check timestamp
- API response times
Collector Metrics
- Collector uptime
- Restart count
- Error frequency
- Configuration reload success rate
Platform Metrics
- Agent distribution by platform
- Version distribution
- Geographic distribution (based on API endpoint)
Next Steps
Multi-Platform Support
Learn about agent deployment across different environments
Configuration
Deep dive into agent configuration options