Monitor your ZeroClaw agents with comprehensive logging, metrics, and alerting.

Observability Stack

ZeroClaw provides multiple observability layers:

  • Logs: structured logging with tracing
  • Metrics: Prometheus metrics endpoint
  • Health Checks: built-in diagnostics
  • Tracing: OpenTelemetry support

Logging

Log Levels

Control verbosity with the RUST_LOG environment variable:
# Error only
export RUST_LOG=error

# Info (default)
export RUST_LOG=info

# Debug
export RUST_LOG=debug

# Trace (very verbose)
export RUST_LOG=trace

# Module-specific
export RUST_LOG=zeroclaw::agent=debug,zeroclaw::tools=trace

Log Format

Logs use a structured format with timestamps:
2026-03-03T12:00:00.000Z INFO zeroclaw::agent: Starting agent loop session_id="abc123"
2026-03-03T12:00:01.234Z DEBUG zeroclaw::tools::shell: Executing command cmd="ls -la"
2026-03-03T12:00:01.567Z WARN zeroclaw::security: Blocked command attempt cmd="rm -rf /"
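Because the format is line-oriented, standard text tools can triage logs without any extra infrastructure. A minimal sketch (the `zc_errors` helper name is hypothetical, not part of ZeroClaw):

```shell
# zc_errors: keep only WARN/ERROR lines from a structured log stream on stdin
zc_errors() {
  grep -E ' (WARN|ERROR) '
}

# Example: show the last 20 problems from today's log file
# zc_errors < ~/.zeroclaw/logs/zeroclaw.log | tail -n 20
```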

Log Aggregation

Ship logs to aggregation services:
# Install promtail
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
chmod +x promtail-linux-amd64

# Configure promtail.yaml
cat > promtail.yaml <<EOF
server:
  http_listen_port: 9080

positions:
  filename: /tmp/promtail-positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: zeroclaw
    static_configs:
      - targets:
          - localhost
        labels:
          job: zeroclaw
          __path__: /home/user/.zeroclaw/logs/*.log
EOF

# Start promtail (the release binary is named promtail-linux-amd64)
./promtail-linux-amd64 -config.file=promtail.yaml
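Once promtail is shipping logs, you can confirm ingestion from Grafana Explore or `logcli` with a LogQL query against the `job` label set above, e.g. to see only warnings:

```
{job="zeroclaw"} |= "WARN"
```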

Metrics

Prometheus Endpoint

Enable metrics in config.toml:
[observability]
prometheus_enabled = true
prometheus_port = 9090
prometheus_path = "/metrics"
Metrics are then available at http://localhost:9090/metrics. Note that 9090 is also the Prometheus server's default port; if Prometheus runs on the same host, choose a different port (e.g. 9091) and update your scrape target to match.

Key Metrics

zeroclaw_requests_total{channel="telegram",status="success"} 1234
zeroclaw_request_duration_seconds{quantile="0.95"} 0.234
zeroclaw_tool_calls_total{tool="shell",status="success"} 567
zeroclaw_tool_duration_seconds{tool="file_read"} 0.012
zeroclaw_provider_requests_total{provider="anthropic"} 890
zeroclaw_token_usage_total{provider="anthropic",type="input"} 456789
zeroclaw_token_usage_total{provider="anthropic",type="output"} 123456
zeroclaw_errors_total{type="provider_error"} 12
zeroclaw_errors_total{type="tool_failure"} 5
zeroclaw_errors_total{type="security_block"} 3
zeroclaw_memory_bytes 5242880
zeroclaw_cpu_usage_percent 12.5
zeroclaw_threads 42
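These counters and gauges are most useful through PromQL. A few starter queries (metric names as listed above; label values are illustrative):

```promql
# Request throughput per channel over the last 5 minutes
sum by (channel) (rate(zeroclaw_requests_total[5m]))

# Overall error ratio
sum(rate(zeroclaw_errors_total[5m])) / sum(rate(zeroclaw_requests_total[5m]))

# Input vs. output token burn per provider
sum by (provider, type) (rate(zeroclaw_token_usage_total[5m]))
```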

Prometheus Configuration

Add ZeroClaw to Prometheus scrape config:
scrape_configs:
  - job_name: 'zeroclaw'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

Grafana Dashboard

Import the ZeroClaw dashboard:
# Download dashboard JSON
wget https://raw.githubusercontent.com/zeroclaw-labs/zeroclaw/main/grafana-dashboard.json

# Import in Grafana UI:
# Dashboards > Import > Upload JSON file
Dashboard includes:
  • Request rate and latency
  • Tool execution statistics
  • Token usage and costs
  • Error rates
  • System resources

OpenTelemetry

Export traces and metrics to OTLP collectors:
[observability]
otel_enabled = true
otel_endpoint = "http://localhost:4317"
otel_service_name = "zeroclaw-prod"
Build with OTLP support:
cargo build --features observability-otel
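To receive those exports locally, a minimal OpenTelemetry Collector configuration might look like the sketch below. The `debug` exporter simply prints received spans and metrics to the Collector's log (in older Collector releases it was named `logging`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
```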

Trace Example

SPAN: agent_loop
  SPAN: provider_call provider=anthropic
    SPAN: http_request duration=234ms
  SPAN: tool_execution tool=shell
    SPAN: security_check duration=2ms
    SPAN: runtime_exec duration=1234ms

Health Checks

Liveness Probe

Check if the service is running:
curl http://localhost:3000/health
Response:
{
  "status": "healthy",
  "uptime_seconds": 86400,
  "version": "0.1.8"
}
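In a shell watchdog you can key off the `status` field without pulling in jq. A sketch (the `zc_healthy` helper is hypothetical):

```shell
# zc_healthy: succeed only if the JSON passed as $1 reports status "healthy"
zc_healthy() {
  printf '%s' "$1" | grep -q '"status": *"healthy"'
}

# Usage (e.g. from cron or a systemd timer):
# zc_healthy "$(curl -sf http://localhost:3000/health)" || systemctl restart zeroclaw
```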

Readiness Probe

Check if the service is ready to handle requests:
curl http://localhost:3000/ready
Checks:
  • Provider connectivity
  • Channel health
  • Memory backend

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
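If the agent loads large state at boot, a startupProbe (sketch; thresholds are illustrative) keeps the liveness probe from restarting the pod during initialization:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 3000
  failureThreshold: 30
  periodSeconds: 5
```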

Alerting

Prometheus Alerts

Create alerting rules:
groups:
  - name: zeroclaw
    rules:
      - alert: HighErrorRate
        expr: rate(zeroclaw_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
      
      - alert: HighMemoryUsage
        expr: zeroclaw_memory_bytes > 500000000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 500MB"
      
      - alert: ProviderDown
        expr: zeroclaw_provider_health == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider unreachable"

Notification Channels

Configure Alertmanager:
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#zeroclaw-alerts'
  
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
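Receivers alone do not send notifications; Alertmanager also needs a route tree mapping alerts to them. A sketch that sends everything to Slack and escalates critical alerts to PagerDuty:

```yaml
route:
  receiver: 'slack'
  group_by: ['alertname']
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'
```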

Cost Tracking

Monitor API costs:
zeroclaw cost summary
Output:
Provider    | Input Tokens | Output Tokens | Estimated Cost
------------|--------------|---------------|--------------
Anthropic   | 1,234,567    | 456,789       | $12.34
OpenAI      | 567,890      | 234,567       | $8.90
Cost metrics:
zeroclaw_cost_total{provider="anthropic"} 12.34
zeroclaw_cost_total{provider="openai"} 8.90
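With the cost counters exported, spend can also be tracked in PromQL, e.g. estimated dollars per provider over the last day:

```promql
sum by (provider) (increase(zeroclaw_cost_total[24h]))
```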

Performance Tuning

Query Latency

Track provider response times:
zeroclaw_provider_latency_seconds{provider="anthropic",quantile="0.5"} 0.234
zeroclaw_provider_latency_seconds{provider="anthropic",quantile="0.95"} 0.567
zeroclaw_provider_latency_seconds{provider="anthropic",quantile="0.99"} 0.890

Tool Execution

Monitor tool performance:
# Exposed as Prometheus histograms (query the _bucket/_sum/_count series)
zeroclaw_tool_duration_seconds{tool="shell"}
zeroclaw_tool_duration_seconds{tool="file_read"}
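Since these are histograms, per-tool latency percentiles come from histogram_quantile over the _bucket series:

```promql
# p95 tool latency per tool, over 5 minutes
histogram_quantile(0.95,
  sum by (tool, le) (rate(zeroclaw_tool_duration_seconds_bucket[5m])))
```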
