The KloudMate Agent provides comprehensive monitoring capabilities through built-in health checks, status reporting, and observability endpoints.

Health Check Endpoints

The agent exposes health check endpoints via the OpenTelemetry Collector’s health_check extension.

Configuration

The health check extension is configured in the collector configuration:
internal/agent/agent.go
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
The health check endpoint listens on port 13133 by default. Make sure this port is reachable from your monitoring systems.

Health Check Endpoints

Liveness Probe

Endpoint: http://localhost:13133/
Returns 200 if the collector is running.

Readiness Probe

Endpoint: http://localhost:13133/
Returns 200 if the collector can receive data.

Testing Health Endpoints

curl http://localhost:13133/

Agent Status Reporting

The agent reports its operational status to the KloudMate platform through periodic status updates.

Status Parameters

The agent sends the following status information:
internal/updater/updater.go
data := map[string]interface{}{
    "is_docker":          u.cfg.DockerMode,
    "hostname":           u.cfg.Hostname(),
    "platform":           platform,
    "architecture":       runtime.GOARCH,
    "agent_version":      p.Version,
    "collector_version":  version.GetCollectorVersion(),
    "agent_status":       p.AgentStatus,
    "collector_status":   p.CollectorStatus,
    "last_error_message": p.CollectorLastError,
}

Status Values

1. Agent Status

  • Running: Agent is operational and managing the collector
  • Stopped: Agent has been stopped or is shutting down

2. Collector Status

  • Running: OpenTelemetry Collector is actively processing telemetry
  • Stopped: Collector is not running (may be restarting or failed)
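A consumer of these status reports could decode the payload and flag degraded agents. The sketch below is only an illustration: the `StatusReport` struct, `parseStatus`, and `NeedsAttention` are hypothetical names (the agent itself builds a `map[string]interface{}`), and it assumes the statuses are serialized as the literal strings "Running" and "Stopped".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StatusReport mirrors the keys of the status map shown above.
// The struct is hypothetical and exists only for this example.
type StatusReport struct {
	IsDocker         bool   `json:"is_docker"`
	Hostname         string `json:"hostname"`
	Platform         string `json:"platform"`
	Architecture     string `json:"architecture"`
	AgentVersion     string `json:"agent_version"`
	CollectorVersion string `json:"collector_version"`
	AgentStatus      string `json:"agent_status"`
	CollectorStatus  string `json:"collector_status"`
	LastErrorMessage string `json:"last_error_message"`
}

// NeedsAttention reports whether the payload describes a degraded agent.
// Assumption: statuses arrive as the strings "Running" or "Stopped".
func (s StatusReport) NeedsAttention() bool {
	return s.AgentStatus != "Running" ||
		s.CollectorStatus != "Running" ||
		s.LastErrorMessage != ""
}

// parseStatus decodes a JSON status payload into a StatusReport.
func parseStatus(raw []byte) (StatusReport, error) {
	var s StatusReport
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	// Sample payload invented for the example.
	raw := []byte(`{"hostname":"web-01","agent_status":"Running",` +
		`"collector_status":"Stopped","last_error_message":"exit status 1"}`)
	s, err := parseStatus(raw)
	if err != nil {
		fmt.Println("invalid status payload:", err)
		return
	}
	fmt.Printf("%s needs attention: %v\n", s.Hostname, s.NeedsAttention())
}
```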

Systemd Service Monitoring (Linux)

For Linux installations, the agent runs as a systemd service.

Check Service Status

sudo systemctl status kmagent
Expected Output:
● kmagent.service - KloudMate Agent
   Loaded: loaded (/lib/systemd/system/kmagent.service; enabled)
   Active: active (running) since Thu 2024-03-06 10:15:30 UTC; 2h 15min ago
 Main PID: 12345 (kmagent)
    Tasks: 23
   Memory: 128.5M
   CGroup: /system.slice/kmagent.service
           └─12345 /usr/bin/kmagent start

Service Management Commands

sudo systemctl status kmagent

Docker Container Monitoring

For Docker installations, monitor the agent container directly.

Container Status

docker ps -f name=km-agent

Container Logs

docker logs -f km-agent

Container Resource Usage

docker stats km-agent --no-stream
Example Output:
CONTAINER ID   NAME       CPU %     MEM USAGE / LIMIT     MEM %
a1b2c3d4e5f6   km-agent   2.5%      128MiB / 1.95GiB     6.4%

Kubernetes Monitoring

For Kubernetes deployments, use kubectl and Kubernetes-native monitoring.

Pod Status

kubectl get pods -n km-agent -l app.kubernetes.io/component=node-agent

Pod Health

kubectl describe pod -n km-agent <pod-name>
Check the Conditions section for:
  • Ready: True when pod is accepting traffic
  • ContainersReady: True when all containers are ready
  • PodScheduled: True when pod is assigned to a node

Pod Logs

kubectl logs -n km-agent -l app.kubernetes.io/component=node-agent -f

Resource Usage

kubectl top pods -n km-agent

Events

Monitor Kubernetes events for agent-related issues:
kubectl get events -n km-agent --sort-by='.lastTimestamp'

Configuration Update Monitoring

The agent periodically checks for configuration updates from the KloudMate platform.

Update Check Interval

The check interval is set with the config-check-interval flag or the KM_CONFIG_CHECK_INTERVAL environment variable:
cmd/kmagent/main.go
altsrc.NewIntFlag(&cli.IntFlag{
    Name:        "config-check-interval",
    Usage:       "Interval in seconds to check for config updates",
    Value:       60,
    EnvVars:     []string{"KM_CONFIG_CHECK_INTERVAL"},
    Destination: &program.cfg.ConfigCheckInterval,
}),
The default configuration check interval is 60 seconds. For Kubernetes deployments, this can be customized via Helm values.

Monitoring Update Checks

Look for these log messages:
INFO  config update checker started
DEBUG checking for configuration updates
DEBUG no configuration change detected
INFO  configuration changed, restarting collector
INFO  collector restarted successfully

Performance Metrics

Agent Lifecycle Events

The agent logs key lifecycle events:
internal/agent/agent.go
a.logger.Info("agent start sequence initiated")
a.logger.Info("collector instance created, starting run loop")
a.logger.Info("collector run loop exited normally")
a.logger.Info("collector restarted successfully")

Error Tracking

Monitor for these error patterns:
ERROR Initial collector run failed
ERROR Periodic config check failed
ERROR failed to create new collector instance
ERROR collector run loop exited with error
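These patterns can be counted automatically when logs are scanned in bulk. The sketch below is a minimal example that matches the four messages listed above by substring; `countErrors` is a hypothetical helper, and the sample log lines are invented for illustration (the agent's real log layout may carry timestamps and extra fields).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// errorPatterns holds the error messages listed in this section.
var errorPatterns = []string{
	"Initial collector run failed",
	"Periodic config check failed",
	"failed to create new collector instance",
	"collector run loop exited with error",
}

// countErrors scans log text line by line and counts, per pattern,
// how many lines contain it. Patterns with no match are omitted.
func countErrors(logText string) map[string]int {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(logText))
	for scanner.Scan() {
		line := scanner.Text()
		for _, p := range errorPatterns {
			if strings.Contains(line, p) {
				counts[p]++
			}
		}
	}
	return counts
}

func main() {
	// Invented sample input; real input would come from the agent's logs.
	sample := "INFO  collector restarted successfully\n" +
		"ERROR Periodic config check failed\n" +
		"ERROR collector run loop exited with error\n" +
		"ERROR Periodic config check failed\n"
	for pattern, n := range countErrors(sample) {
		fmt.Printf("%d x %s\n", n, pattern)
	}
}
```

Substring matching keeps the example independent of the exact log format; a stricter parser would be needed if the same phrases can appear in non-error lines.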

Monitoring Best Practices

Set Up Alerts

Configure alerts for:
  • Agent/collector status changes
  • Health check failures
  • Configuration update failures
  • High resource usage

Regular Health Checks

Schedule periodic health checks:
  • Every 30 seconds for production
  • Monitor response time trends
  • Track uptime metrics

Log Aggregation

Centralize logs for:
  • Multi-host deployments
  • Historical analysis
  • Pattern detection
  • Compliance requirements

Resource Monitoring

Track resource usage:
  • CPU utilization trends
  • Memory consumption patterns
  • Network traffic volume
  • Disk I/O operations

Troubleshooting Monitoring Issues

Health Check Not Responding

1. Verify Port Accessibility

Confirm that the agent is listening on port 13133 and that no firewall blocks it:
sudo netstat -tlnp | grep 13133

2. Check Collector Status

Verify the collector is running:
sudo systemctl status kmagent

3. Review Configuration

Confirm that the health_check extension is enabled in the collector configuration.

Missing Status Updates

If status updates are not appearing on the KloudMate platform:
  1. Verify network connectivity to https://api.kloudmate.com
  2. Check that the API key is valid and properly configured
  3. Review agent logs for connection errors
  4. Confirm the update endpoint URL is correct

Next Steps

Troubleshooting

Diagnose and resolve common issues

Upgrading

Upgrade to the latest version
