Monitoring - KloudMate Agent

The KloudMate Agent provides comprehensive monitoring capabilities through built-in health checks, status reporting, and observability endpoints.

Health Check Endpoints

The agent exposes health check endpoints via the OpenTelemetry Collector’s health_check extension.

Configuration

The health check extension is configured in the collector configuration:

internal/agent/agent.go

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

The health check endpoint listens on port 13133 on all interfaces (0.0.0.0) by default. Monitoring systems must be able to reach this port.

Health Check Endpoints

Liveness Probe

Endpoint: http://localhost:13133/

Returns 200 if the collector is running.

Readiness Probe

Endpoint: http://localhost:13133/

Returns 200 if the collector can receive data.

Testing Health Endpoints

curl http://localhost:13133/
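The curl check above can also be scripted. The following is a minimal Go sketch, not part of the agent: the function names `probe` and `isHealthy` and the 5-second timeout are assumptions chosen for illustration. It sends one GET request to the health endpoint, treats only HTTP 200 as healthy, and reports the response time.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

// isHealthy reports whether a status code returned by the health endpoint
// counts as healthy. Only 200 does, matching the probes described above.
func isHealthy(statusCode int) bool {
	return statusCode == http.StatusOK
}

// probe sends a single GET request to url and returns the HTTP status code
// together with the time the request took.
func probe(ctx context.Context, client *http.Client, url string) (int, time.Duration, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, 0, err
	}
	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, time.Since(start), nil
}

func main() {
	// The 5-second timeout is an arbitrary choice for this sketch.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	code, latency, err := probe(ctx, http.DefaultClient, "http://localhost:13133/")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check failed:", err)
		os.Exit(1)
	}
	fmt.Printf("status=%d healthy=%t latency=%s\n", code, isHealthy(code), latency)
	if !isHealthy(code) {
		os.Exit(1)
	}
}
```

The non-zero exit code on failure makes the program usable from cron jobs or other schedulers that alert on failed commands.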

Agent Status Reporting

The agent reports its operational status to the KloudMate platform through periodic status updates.

Status Parameters

The agent sends the following status information:

internal/updater/updater.go

data := map[string]interface{}{
    "is_docker":          u.cfg.DockerMode,
    "hostname":           u.cfg.Hostname(),
    "platform":           platform,
    "architecture":       runtime.GOARCH,
    "agent_version":      p.Version,
    "collector_version":  version.GetCollectorVersion(),
    "agent_status":       p.AgentStatus,
    "collector_status":   p.CollectorStatus,
    "last_error_message": p.CollectorLastError,
}

Status Values

Agent Status

Running: Agent is operational and managing the collector
Stopped: Agent has been stopped or is shutting down

Collector Status

Running: OpenTelemetry Collector is actively processing telemetry
Stopped: Collector is not running (may be restarting or failed)
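For tooling that consumes these status updates, the fields and values above can be modeled as a typed structure. The sketch below is illustrative only: the `StatusPayload` type and the `needsAttention` helper are hypothetical names, the field types are inferred from the map shown earlier, and it assumes the status strings are transmitted exactly as `Running` and `Stopped`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// StatusPayload is a hypothetical typed view of the status map sent by the
// agent. The JSON keys come from the map above; the Go types are assumptions.
type StatusPayload struct {
	IsDocker         bool   `json:"is_docker"`
	Hostname         string `json:"hostname"`
	Platform         string `json:"platform"`
	Architecture     string `json:"architecture"`
	AgentVersion     string `json:"agent_version"`
	CollectorVersion string `json:"collector_version"`
	AgentStatus      string `json:"agent_status"`
	CollectorStatus  string `json:"collector_status"`
	LastErrorMessage string `json:"last_error_message"`
}

// needsAttention returns true when the agent or the collector is not
// "Running", or when a last error message is present.
func needsAttention(p StatusPayload) bool {
	return p.AgentStatus != "Running" ||
		p.CollectorStatus != "Running" ||
		p.LastErrorMessage != ""
}

func main() {
	// Sample payload with made-up values, for illustration only.
	sample := []byte(`{
		"is_docker": false,
		"hostname": "host-1",
		"platform": "linux",
		"architecture": "amd64",
		"agent_version": "0.0.0",
		"collector_version": "0.0.0",
		"agent_status": "Running",
		"collector_status": "Stopped",
		"last_error_message": "example error"
	}`)

	var p StatusPayload
	if err := json.Unmarshal(sample, &p); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	fmt.Printf("host=%s agent=%s collector=%s attention=%t\n",
		p.Hostname, p.AgentStatus, p.CollectorStatus, needsAttention(p))
}
```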

Systemd Service Monitoring (Linux)

For Linux installations, the agent runs as a systemd service.

Check Service Status

sudo systemctl status kmagent

Expected Output:

● kmagent.service - KloudMate Agent
   Loaded: loaded (/lib/systemd/system/kmagent.service; enabled)
   Active: active (running) since Wed 2024-03-06 10:15:30 UTC; 2h 15min ago
 Main PID: 12345 (kmagent)
    Tasks: 23
   Memory: 128.5M
   CGroup: /system.slice/kmagent.service
           └─12345 /usr/bin/kmagent start

Service Management Commands

sudo systemctl status kmagent
sudo systemctl start kmagent
sudo systemctl stop kmagent
sudo systemctl restart kmagent

Docker Container Monitoring

For Docker installations, monitor the agent container directly.

Container Status

docker ps -f name=km-agent

Container Logs

docker logs -f km-agent

Container Resource Usage

docker stats km-agent --no-stream

Example Output:

CONTAINER ID   NAME       CPU %     MEM USAGE / LIMIT     MEM %
a1b2c3d4e5f6   km-agent   2.5%      128MiB / 1.95GiB     6.4%

Kubernetes Monitoring

For Kubernetes deployments, use kubectl and Kubernetes-native monitoring.

Pod Status

kubectl get pods -n km-agent -l app.kubernetes.io/component=node-agent

Pod Health

kubectl describe pod -n km-agent <pod-name>

Check the Conditions section for:

Ready: True when the pod is accepting traffic
ContainersReady: True when all containers in the pod are ready
PodScheduled: True when the pod has been assigned to a node
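The condition check can be automated by reading the pod as JSON instead of scanning the describe output. The sketch below is an assumption-labeled example rather than part of the agent: it expects the output of `kubectl get pod -n km-agent <pod-name> -o json` on standard input, and the helper name `failingConditions` is hypothetical. It relies on the standard `status.conditions` list of the Kubernetes Pod object.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// podCondition and pod model only the parts of the Pod object needed here.
type podCondition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

type pod struct {
	Status struct {
		Conditions []podCondition `json:"conditions"`
	} `json:"status"`
}

// failingConditions returns the required conditions whose status is not
// "True". A condition that is missing from the pod also counts as failing.
func failingConditions(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	status := make(map[string]string)
	for _, c := range p.Status.Conditions {
		status[c.Type] = c.Status
	}
	var failing []string
	for _, name := range []string{"Ready", "ContainersReady", "PodScheduled"} {
		if status[name] != "True" {
			failing = append(failing, name)
		}
	}
	return failing, nil
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	failing, err := failingConditions(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid pod JSON:", err)
		os.Exit(1)
	}
	if len(failing) > 0 {
		fmt.Println("failing conditions:", failing)
		os.Exit(1)
	}
	fmt.Println("all required conditions are True")
}
```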

Pod Logs

kubectl logs -n km-agent -l app.kubernetes.io/component=node-agent -f

Resource Usage

kubectl top pods -n km-agent

Events

Monitor Kubernetes events for agent-related issues:

kubectl get events -n km-agent --sort-by='.lastTimestamp'

Configuration Update Monitoring

The agent periodically checks for configuration updates from the KloudMate platform.

Update Check Interval

The check interval is configurable through the config-check-interval flag or the KM_CONFIG_CHECK_INTERVAL environment variable:

cmd/kmagent/main.go

altsrc.NewIntFlag(&cli.IntFlag{
    Name:        "config-check-interval",
    Usage:       "Interval in seconds to check for config updates",
    Value:       60,
    EnvVars:     []string{"KM_CONFIG_CHECK_INTERVAL"},
    Destination: &program.cfg.ConfigCheckInterval,
}),

The default configuration check interval is 60 seconds. For Kubernetes deployments, it can be customized via Helm values.
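To illustrate how an interval given in whole seconds can drive a periodic check, here is a minimal Go sketch built on `time.Ticker`. It is not the agent's actual implementation: `intervalFromSeconds` and `runChecks` are hypothetical names, and falling back to 60 seconds for non-positive values is an assumption of this example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// intervalFromSeconds converts the integer flag value into a time.Duration.
// Non-positive values fall back to 60 seconds (an assumption of this sketch).
func intervalFromSeconds(seconds int) time.Duration {
	if seconds <= 0 {
		seconds = 60
	}
	return time.Duration(seconds) * time.Second
}

// runChecks calls check once per interval until ctx is cancelled and
// returns how many times check was called. A failing check is logged and
// the loop keeps running.
func runChecks(ctx context.Context, interval time.Duration, check func() error) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	calls := 0
	for {
		select {
		case <-ctx.Done():
			return calls
		case <-ticker.C:
			calls++
			if err := check(); err != nil {
				fmt.Println("Periodic config check failed:", err)
			}
		}
	}
}

func main() {
	// A 1-second interval and a short deadline keep the demo brief.
	interval := intervalFromSeconds(1)
	ctx, cancel := context.WithTimeout(context.Background(), 3500*time.Millisecond)
	defer cancel()

	calls := runChecks(ctx, interval, func() error {
		fmt.Println("checking for configuration updates")
		return nil
	})
	fmt.Println("checks performed:", calls)
}
```

Using a ticker with a cancellable context lets the loop stop promptly on shutdown instead of sleeping through a full interval.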

Monitoring Update Checks

Look for these log messages:

INFO  config update checker started
DEBUG checking for configuration updates
DEBUG no configuration change detected
INFO  configuration changed, restarting collector
INFO  collector restarted successfully

Performance Metrics

Agent Lifecycle Events

The agent logs key lifecycle events:

internal/agent/agent.go

a.logger.Info("agent start sequence initiated")
a.logger.Info("collector instance created, starting run loop")
a.logger.Info("collector run loop exited normally")
a.logger.Info("collector restarted successfully")

Error Tracking

Monitor for these error patterns:

ERROR Initial collector run failed
ERROR Periodic config check failed
ERROR failed to create new collector instance
ERROR collector run loop exited with error
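The patterns above can be counted with a small log filter. The following Go sketch is one possible approach, with hypothetical helper names `matchPattern` and `countErrors`. It reads log lines from standard input, so it can be fed from whichever log source applies to the installation, for example `docker logs km-agent 2>&1` or, assuming the service logs to the journal, `journalctl -u kmagent`.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// errorPatterns lists the error messages named in this section.
var errorPatterns = []string{
	"Initial collector run failed",
	"Periodic config check failed",
	"failed to create new collector instance",
	"collector run loop exited with error",
}

// matchPattern returns the first known error pattern contained in line,
// or an empty string when the line matches none of them.
func matchPattern(line string) string {
	for _, p := range errorPatterns {
		if strings.Contains(line, p) {
			return p
		}
	}
	return ""
}

// countErrors reads log lines from r and counts how often each known
// error pattern occurs.
func countErrors(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if p := matchPattern(sc.Text()); p != "" {
			counts[p]++
		}
	}
	return counts, sc.Err()
}

func main() {
	counts, err := countErrors(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	// Print in a fixed order so the output is stable between runs.
	for _, p := range errorPatterns {
		fmt.Printf("%d\t%s\n", counts[p], p)
	}
}
```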

Monitoring Best Practices

Set Up Alerts

Configure alerts for:

Agent/collector status changes
Health check failures
Configuration update failures
High resource usage

Regular Health Checks

Schedule periodic health checks:

Run checks every 30 seconds in production
Monitor response time trends
Track uptime metrics

Log Aggregation

Centralize logs for:

Multi-host deployments
Historical analysis
Pattern detection
Compliance requirements

Resource Monitoring

Track resource usage:

CPU utilization trends
Memory consumption patterns
Network traffic volume
Disk I/O operations

Troubleshooting Monitoring Issues

Health Check Not Responding

Verify Port Accessibility

Check that the agent is listening on port 13133 and that no firewall blocks the port:

sudo netstat -tlnp | grep 13133
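The command above only shows whether something is listening locally. To test reachability from another machine, such as the monitoring host, a plain TCP connection attempt is enough. The Go sketch below is an illustration under stated assumptions: `healthAddr` and `canConnect` are hypothetical helpers, and the 3-second timeout is arbitrary.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// healthAddr builds the host:port address of the health check endpoint,
// using the default port 13133. IPv6 hosts are bracketed automatically.
func healthAddr(host string) string {
	return net.JoinHostPort(host, "13133")
}

// canConnect reports whether a TCP connection to addr can be opened
// within timeout. Any dial error, including a malformed address, is
// reported as false.
func canConnect(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// The host defaults to localhost and can be overridden by the first argument.
	host := "localhost"
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	addr := healthAddr(host)
	if canConnect(addr, 3*time.Second) {
		fmt.Println("reachable:", addr)
		return
	}
	fmt.Println("not reachable:", addr)
	os.Exit(1)
}
```

If the port is listening locally but this check fails from a remote host, a firewall or security group rule between the two machines is a likely cause.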

Check Collector Status

Verify the collector is running:

sudo systemctl status kmagent

Review Configuration

Confirm that the health_check extension is enabled in the collector configuration.

Missing Status Updates

If status updates are not appearing on the KloudMate platform:

Verify network connectivity to https://api.kloudmate.com
Check that the API key is valid and properly configured
Review the agent logs for connection errors
Confirm that the update endpoint URL is correct

Next Steps

Troubleshooting

Diagnose and resolve common issues

Upgrading

Upgrade to the latest version
