Skip to main content
The KloudMate Agent includes comprehensive synthetic monitoring capabilities that provide visibility into agent health, collector status, and operational metrics. The agent actively reports its state to the KloudMate platform, enabling centralized monitoring and alerting without additional infrastructure.

Overview

Synthetic monitoring in the KloudMate Agent encompasses:

Health Reporting

Continuous status updates to the KloudMate platform

Error Tracking

Automatic capture and transmission of error states

Version Monitoring

Track agent and collector versions across your infrastructure

Platform Identification

Unique agent identification for multi-environment management

Health Check Mechanism

The agent performs health checks as part of its configuration update cycle, sending detailed status information to the KloudMate API.

Status Reporting Payload

Every configuration check includes comprehensive health data:
updater.go:64-81
data := map[string]interface{}{
	"is_docker":          u.cfg.DockerMode,
	"hostname":           u.cfg.Hostname(),
	"platform":           platform,
	"architecture":       runtime.GOARCH,
	"agent_version":      p.Version,
	"collector_version":  version.GetCollectorVersion(),
	"agent_status":       p.AgentStatus,
	"collector_status":   p.CollectorStatus,
	"last_error_message": p.CollectorLastError,
}
Reported metrics:
FieldDescriptionExample Values
is_dockerWhether running in Docker modetrue, false
hostnameSystem hostname for identificationweb-server-01
platformOperating systemlinux, windows, darwin, docker
architectureCPU architectureamd64, arm64
agent_versionCurrent agent version0.1.0
collector_versionOpenTelemetry Collector version0.115.0
agent_statusAgent process stateRunning, Stopped
collector_statusCollector process stateRunning, Stopped
last_error_messageLast error encounteredError string or empty

Status Collection

The agent gathers status information before each API call:
agent.go:232-257
func (a *Agent) performConfigCheck(agentCtx context.Context) error {
	ctx, cancel := context.WithTimeout(agentCtx, 10*time.Second)
	defer cancel()

	a.logger.Debug("checking for configuration updates")

	// Safely read collector status under lock
	a.collectorMu.Lock()
	params := updater.UpdateCheckerParams{
		Version: a.version,
	}
	if a.collector != nil {
		params.CollectorStatus = "Running"
	} else {
		params.CollectorStatus = "Stopped"
		params.CollectorLastError = a.collectorError
	}
	a.collectorMu.Unlock()

	if a.isRunning.Load() {
		params.AgentStatus = "Running"
	} else {
		params.AgentStatus = "Stopped"
	}

	a.logger.Debugf("Checking for updates with params: %+v", params)
}
Thread-safe status gathering:
  • Uses mutex lock when checking collector reference
  • Reads atomic boolean for agent running state
  • Captures last error message for diagnostics

API Communication

The agent communicates with the KloudMate platform via HTTPS POST requests:
updater.go:90-106
req, err := http.NewRequestWithContext(reqCtx, "POST", u.cfg.ConfigUpdateURL, bytes.NewBuffer(jsonData))
if err != nil {
	return false, nil, fmt.Errorf("failed to create request: %w", err)
}

req.Header.Set("Content-Type", "application/json")
if u.cfg.APIKey != "" {
	req.Header.Set("Authorization", u.cfg.APIKey)
}

resp, respErr := u.client.Do(req)
if respErr != nil {
	return false, nil, fmt.Errorf("failed to fetch config updates: %w", respErr)
}
defer resp.Body.Close()
Security features:
  • HTTPS for encrypted communication
  • API key authentication in Authorization header
  • Request timeout protection (20 seconds)
  • Context-aware cancellation

API Endpoint Configuration

The update endpoint is automatically derived from the collector endpoint:
config.go:28-52
func GetAgentConfigUpdaterURL(collectorEndpoint string) string {
	const fallbackURL = "https://api.kloudmate.com/agents/config-check"

	u, _ := url.Parse(collectorEndpoint)
	host := u.Hostname()
	parts := strings.Split(host, ".")

	if len(parts) < 2 {
		return fallbackURL
	}

	// Extract root domain from collector endpoint
	// e.g., "otel.kloudmate.dev" -> "kloudmate.dev"
	rootDomain := parts[len(parts)-2] + "." + parts[len(parts)-1]

	// Build API URL with same domain
	updateURL := url.URL{
		Scheme: u.Scheme,
		Host:   "api." + rootDomain,
		Path:   "/agents/config-check",
	}

	return updateURL.String()
}
Endpoint derivation examples:
Collector EndpointDerived API Endpoint
https://otel.kloudmate.com:4318https://api.kloudmate.com/agents/config-check
https://otel.kloudmate.dev:4318https://api.kloudmate.dev/agents/config-check
https://otel.example.io:4318https://api.example.io/agents/config-check
This allows seamless operation across different KloudMate environments (production, staging, on-premise).

Health Check Frequency

Health checks occur at configurable intervals:

Configuration

agent.go:194-230
func (a *Agent) runConfigUpdateChecker(ctx context.Context) {
	if a.cfg.ConfigUpdateURL == "" {
		a.logger.Debug("config update URL not configured, skipping update checks")
		return
	}
	if a.cfg.ConfigCheckInterval <= 0 {
		a.logger.Debug("config check interval not set, skipping update checks")
		return
	}
	a.logger.Infow("config update checker started",
		"updateURL", a.cfg.ConfigUpdateURL,
		"intervalSeconds", a.cfg.ConfigCheckInterval,
	)
	ticker := time.NewTicker(time.Duration(a.cfg.ConfigCheckInterval) * time.Second)
	defer ticker.Stop()

	// Perform first check immediately
	if err := a.performConfigCheck(ctx); err != nil {
		a.logger.Errorf("Periodic config check failed: %v", err)
	}

	for {
		select {
		case <-ticker.C:
			if err := a.performConfigCheck(ctx); err != nil {
				a.logger.Errorf("Periodic config check failed: %v", err)
			}
		case <-a.shutdownSignal:
			a.logger.Info("config update checker stopping")
			return
		case <-ctx.Done():
			a.logger.Info("config update checker stopping")
			return
		}
	}
}
Check interval behavior:
  • Default interval: 60 seconds
  • First check: Immediate on agent startup
  • Subsequent checks: On ticker interval
  • Error handling: Logs errors but continues checking
  • Graceful shutdown: Stops on context cancellation or shutdown signal

Configuring Check Interval

export KM_CONFIG_CHECK_INTERVAL=30
Recommended intervals:
  • Production: 60-300 seconds
  • Staging: 30-60 seconds
  • Development: 10-30 seconds

Error Tracking and Reporting

The agent captures and reports errors from the collector:

Error Capture

agent.go:151-159
runErr := collector.Run(ctx)
if runErr != nil {
	a.collectorError = runErr.Error()
	a.logger.Errorw("collector run loop exited with error", "error", runErr)
} else {
	a.collectorError = ""
	a.logger.Info("collector run loop exited normally")
}
Errors are stored in the collectorError field and included in the next health check report.

Error Persistence

Errors remain in the status until the collector successfully starts:
agent.go:240-248
a.collectorMu.Lock()
params := updater.UpdateCheckerParams{
	Version: a.version,
}
if a.collector != nil {
	params.CollectorStatus = "Running"
} else {
	params.CollectorStatus = "Stopped"
	params.CollectorLastError = a.collectorError  // Reported to API
}
a.collectorMu.Unlock()
This ensures the KloudMate platform receives error information even if the agent itself is healthy.

Unique Agent Identification

Each agent is uniquely identifiable through multiple attributes:

Identification Components

1

Hostname

System hostname provides human-readable identification:
config.go:119-125
func (c *Config) Hostname() string {
    n, e := os.Hostname()
    if e != nil {
        n = ""
    }
    return n
}
2

Platform Information

Operating system and architecture help categorize agents:
  • Platform: linux, windows, darwin, docker
  • Architecture: amd64, arm64, 386
3

Deployment Context

Additional metadata for Kubernetes deployments:
  • Cluster name (configured via Helm)
  • Namespace
  • Pod name
  • Node name

Kubernetes-Specific Monitoring

For Kubernetes deployments, additional monitoring data is available through:
  • DaemonSet agents: Report node-level metrics
  • Deployment agents: Report cluster-level metrics
  • Service discovery: Automatic endpoint detection

Dashboard Integration

The KloudMate platform uses health data to provide:

Agent Inventory

Complete list of all agents with status, version, and platform

Health Dashboard

Real-time view of agent and collector health across your infrastructure

Alert Configuration

Set up alerts for agent failures, version mismatches, or error conditions

Trend Analysis

Historical view of agent uptime and error patterns

Configuration Examples

Linux Installation with Custom Monitoring

KM_API_KEY="your-api-key" \
KM_COLLECTOR_ENDPOINT="https://otel.kloudmate.com:4318" \
KM_CONFIG_CHECK_INTERVAL=30 \
bash -c "$(curl -L https://cdn.kloudmate.com/scripts/install_linux.sh)"

Docker with Health Reporting

docker run -d \
  --name kloudmate-agent \
  -e KM_API_KEY="your-api-key" \
  -e KM_COLLECTOR_ENDPOINT="https://otel.kloudmate.com:4318" \
  -e KM_CONFIG_CHECK_INTERVAL=60 \
  -v /var/log:/var/log:ro \
  -v /proc:/host/proc:ro \
  ghcr.io/kloudmate/km-kube-agent:latest

Kubernetes with Enhanced Monitoring

helm install kloudmate-release kloudmate/km-kube-agent \
  --namespace km-agent \
  --create-namespace \
  --set API_KEY="your-api-key" \
  --set COLLECTOR_ENDPOINT="https://otel.kloudmate.com:4318" \
  --set KM_CONFIG_CHECK_INTERVAL="30s" \
  --set clusterName="production-cluster"

Monitoring Best Practices

1

Set Appropriate Intervals

Balance between monitoring responsiveness and API load:
  • High-value production systems: 30-60 seconds
  • Standard deployments: 60-120 seconds
  • Large fleets (100+ agents): 120-300 seconds
2

Monitor Agent Logs

Enable debug logging temporarily for troubleshooting:
# Linux/Docker
tail -f /var/log/kmagent/kmagent.log | grep -i error

# Kubernetes
kubectl logs -n km-agent daemonset/km-agent --tail=50 -f
3

Set Up Alerts

Configure alerts in the KloudMate dashboard for:
  • Agent offline (no heartbeat for 2x check interval)
  • Collector stopped
  • Repeated errors
  • Version drift across agents
4

Regular Health Checks

Review agent health dashboard weekly to identify:
  • Agents with persistent errors
  • Version inconsistencies
  • Unusual restart patterns

Troubleshooting

Symptoms: Agent appears to be running but not reporting to platformCheck:
  • Verify API key is correct: echo $KM_API_KEY
  • Confirm update URL is reachable:
    curl -X POST https://api.kloudmate.com/agents/config-check \
      -H "Authorization: your-api-key" \
      -H "Content-Type: application/json" \
      -d '{}'
    
  • Check agent logs for connection errors
  • Verify firewall rules allow outbound HTTPS
Symptoms: Last seen timestamp is outdatedSolutions:
  • Check if agent is still running:
    # Linux
    systemctl status kmagent
    
    # Docker
    docker ps | grep kloudmate
    
    # Kubernetes
    kubectl get pods -n km-agent
    
  • Verify ConfigCheckInterval is set correctly
  • Review logs for repeated API failures
  • Check network connectivity
Symptoms: Dashboard shows wrong statusSolutions:
  • Wait for next check interval to see updated status
  • Manually trigger config check (restart agent)
  • Verify agent and collector processes are running
  • Check for clock skew on the host
Symptoms: Too many requests to update endpointSolutions:
  • Increase ConfigCheckInterval to reduce frequency
  • Verify only one agent instance is running per host
  • Check for restart loops causing repeated checks

Security Considerations

The API key grants access to your KloudMate account. Protect it appropriately:
  • Use secrets management for Kubernetes (e.g., Sealed Secrets, External Secrets)
  • Restrict file permissions on Linux: chmod 600 /etc/kmagent/config.yaml
  • Rotate API keys periodically
  • Use separate API keys for different environments

Authentication Flow

Metrics and Observability

The agent reports several categories of observability data:

System Metrics

  • Agent uptime
  • Configuration check success rate
  • Last successful check timestamp
  • API response times

Collector Metrics

  • Collector uptime
  • Restart count
  • Error frequency
  • Configuration reload success rate

Platform Metrics

  • Agent distribution by platform
  • Version distribution
  • Geographic distribution (based on API endpoint)

Next Steps

Multi-Platform Support

Learn about agent deployment across different environments

Configuration

Deep dive into agent configuration options

Build docs developers (and LLMs) love