
Overview

captaind provides comprehensive observability through:
  • Structured Logs: JSON-formatted logs with contextual data
  • OpenTelemetry Metrics: Prometheus-compatible metrics
  • Distributed Tracing: Request and round tracing
  • Health Checks: gRPC health check protocol
This enables production monitoring with tools like Prometheus, Grafana, and Jaeger.

Logging

Log Format

Structured JSON logs using the server-log crate:
{"msg":"RoundStarted","round_seq":123456,"timestamp":"2024-01-15T10:30:00.123Z"}
{"msg":"ReceivedRoundPayments","round_seq":123456,"input_count":5,"output_count":8}
{"msg":"BroadcastRoundFundingTx","round_seq":123456,"txid":"abc123...","round_tx_fee":1000}
{"msg":"RoundFinished","round_seq":123456,"nb_input_vtxos":5,"vtxo_expiry_block_height":850000}
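Because every line is a self-contained JSON object, logs can be queried with standard tools such as jq. A small sketch (the sample lines below stand in for a real captaind.log):

```shell
# Write two sample log lines, then extract round_seq and fee
# from each BroadcastRoundFundingTx event.
cat > captaind.log <<'EOF'
{"msg":"RoundStarted","round_seq":123456,"timestamp":"2024-01-15T10:30:00.123Z"}
{"msg":"BroadcastRoundFundingTx","round_seq":123456,"txid":"abc123","round_tx_fee":1000}
EOF
jq -r 'select(.msg == "BroadcastRoundFundingTx") | "\(.round_seq) fee=\(.round_tx_fee)"' captaind.log
# → 123456 fee=1000
```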

Log Levels

Set via CAPTAIND_LOG environment variable:
# Info level (default)
export CAPTAIND_LOG=info

# Debug level (verbose)
export CAPTAIND_LOG=debug

# Trace level (very verbose)
export CAPTAIND_LOG=trace

# Warn level only
export CAPTAIND_LOG=warn

# Per-module filtering
export CAPTAIND_LOG="info,server::round=debug,server::ln=trace"
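For systemd deployments, the same variable can be set via a drop-in unit file (the drop-in path below is illustrative):

```ini
# /etc/systemd/system/captaind.service.d/logging.conf
[Service]
Environment="CAPTAIND_LOG=info,server::round=debug"
```

After editing, reload and restart: `systemctl daemon-reload && systemctl restart captaind`.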

Log Destinations

Stdout (default):
captaind start 2>&1 | tee captaind.log
Systemd Journal:
# View logs
journalctl -u captaind -f

# Filter by level
journalctl -u captaind -p err

# Export to file
journalctl -u captaind --since "1 hour ago" > captaind.log
Log Aggregation (Loki, ELK, etc.):
# Ship logs to Loki (promtail tails the log file per its config)
promtail --config.file=promtail.yaml
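The promtail.yaml referenced above needs, at minimum, a client pointing at Loki and a scrape target for the log file; a sketch (URLs, paths, and labels are illustrative):

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: captaind
    static_configs:
      - targets: [localhost]
        labels:
          job: captaind
          __path__: /var/log/captaind.log
```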

Important Log Messages

Server Lifecycle:
  • ServerStarted: Server initialization complete
  • ServerTerminated: Graceful shutdown
Round Events:
  • RoundStarted: New round initiated
  • ReceivedRoundPayments: User payments collected
  • BroadcastRoundFundingTx: Round tx broadcast
  • RoundFinished: Round completed successfully
  • RoundAbandoned: Round abandoned (no signers)
Errors:
  • RoundPaymentRegistrationFailed: User payment rejected
  • FatalStoringRound: Critical database error
  • ClaimChunkBroadcastFailure: Watchman claim failed
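The error messages above can be watched for directly; a minimal sketch (log path and sample content are assumptions):

```shell
# Count critical error events in the log. The sample file stands in
# for a real captaind.log.
cat > captaind.log <<'EOF'
{"msg":"RoundFinished","round_seq":42}
{"msg":"FatalStoringRound","round_seq":43}
EOF
errors=$(grep -cE '"msg":"(RoundPaymentRegistrationFailed|FatalStoringRound|ClaimChunkBroadcastFailure)"' captaind.log)
echo "critical errors: ${errors}"
# → critical errors: 1
```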

OpenTelemetry Setup

Install OpenTelemetry Collector

# Download collector
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.91.0/otelcol_0.91.0_linux_amd64.tar.gz
tar -xzf otelcol_0.91.0_linux_amd64.tar.gz
sudo mv otelcol /usr/local/bin/

Configure Collector

Create otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  # Prometheus metrics
  prometheus:
    endpoint: "0.0.0.0:8889"
  
  # Jaeger traces
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  
  # Logging (debug)
  logging:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]
    
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, logging]
Start collector:
otelcol --config otel-collector-config.yaml

Configure captaind

Add to captaind.toml:
# OpenTelemetry collector endpoint
otel_collector_endpoint = "http://127.0.0.1:4317"

# Sampling rate: 0.0 to 1.0
# 0.0 = disabled, 1.0 = trace everything, 0.1 = 10% sampling
otel_tracing_sampler = "1.0"

# Deployment name (appears in metrics)
otel_deployment_name = "production-ark-01"

Metrics

Available Metrics

System Metrics

Runtime:
  • bark_spawn_counter: Active background tasks
  • bark_block_height_gauge: Current Bitcoin block height
  • bark_sync_height_gauge: Server’s synced block height
Wallets:
  • bark_wallet_balance_gauge: Wallet balance in sats (by kind)
    • Labels: kind=rounds|watchman

Round Metrics

  • bark_round_seq_gauge: Current round sequence number
  • bark_round_state_gauge: Round state (0-5)
    • 0 = CollectingPayments
    • 1 = SigningVtxoTree
    • 2 = FinishedEmpty
    • 3 = FinishedAbandoned
    • 4 = FinishedSuccess
    • 5 = FinishedError
  • bark_round_attempt_gauge: Current attempt within round
  • bark_round_step_duration_gauge: Duration of each round step (ms)
  • bark_round_input_volume_gauge: Total input amount (sats)
  • bark_round_input_count_gauge: Number of input VTXOs
  • bark_round_output_count_gauge: Number of output VTXOs

Lightning Metrics

  • bark_lightning_node_gauge: Connected CLN nodes (by uri, pubkey)
  • bark_lightning_node_boot_counter: CLN reconnections
  • bark_lightning_payment_counter: Payments by status
    • Labels: status=success|failed|pending
  • bark_lightning_payment_volume: Payment volume in msats
  • bark_lightning_open_invoices_gauge: Open invoices count
  • bark_lightning_invoice_verification_queue_gauge: Pending verifications

VTXO Pool Metrics

  • bark_vtxo_pool_amount_gauge: Current pool amount (sats) by denomination
  • bark_vtxo_pool_amount_max_gauge: Target pool amount
  • bark_vtxo_pool_count_gauge: Current VTXO count by denomination

Database Metrics

  • bark_postgres_connections: Total connections
  • bark_postgres_idle_connections: Idle connections in pool
  • bark_postgres_connections_created: Created connections (counter)
  • bark_postgres_connections_closed_*: Connection close reasons
  • bark_postgres_get_*: Connection pool statistics

gRPC Metrics

  • bark_grpc_in_progress_counter: Active RPC calls
  • bark_grpc_latency_histogram: Request latency (ms)
  • bark_grpc_request_counter: Requests by service, method, status
  • bark_grpc_error_counter: Errors by service, method, error

Fee Estimator Metrics

  • bark_fee_rate_gauge: Current fee rate (sat/vb) by priority
    • Labels: priority=fast|regular|slow
  • bark_fee_rate_using_fallback_gauge: Using fallback fee rate (0/1)
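Since the fallback gauge is a 0/1 value, it doubles as an alert condition; a sketch in PromQL:

```promql
# Fires while captaind is using the fallback fee rate instead of live estimates.
bark_fee_rate_using_fallback_gauge == 1
```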

Prometheus Configuration

Add to prometheus.yml:
scrape_configs:
  - job_name: 'captaind'
    static_configs:
      - targets: ['localhost:8889']  # OTel collector Prometheus endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'captaind-prod-01'

Example Queries

Round success rate:
rate(bark_round_state_gauge{state="FinishedSuccess"}[5m]) / 
rate(bark_round_seq_gauge[5m])
Average round input volume:
avg_over_time(bark_round_input_volume_gauge[1h])
Lightning payment success rate:
sum(rate(bark_lightning_payment_counter{status="success"}[5m])) / 
sum(rate(bark_lightning_payment_counter[5m]))
Database connection pool utilization:
bark_postgres_connections - bark_postgres_idle_connections

Grafana Dashboards

Sample Dashboard Panels

Round Health:
{
  "title": "Round Success Rate",
  "targets": [
    {
      "expr": "rate(bark_round_state_gauge{state=\"FinishedSuccess\"}[5m]) / rate(bark_round_seq_gauge[5m])",
      "legendFormat": "Success Rate"
    }
  ],
  "type": "graph"
}
Wallet Balances:
{
  "title": "Wallet Balances",
  "targets": [
    {
      "expr": "bark_wallet_balance_gauge",
      "legendFormat": "{{kind}}"
    }
  ],
  "type": "graph",
  "yaxes": [{"format": "sat"}]
}
Lightning Volume:
{
  "title": "Lightning Payment Volume",
  "targets": [
    {
      "expr": "rate(bark_lightning_payment_volume[5m]) / 1000",
      "legendFormat": "sats/sec"
    }
  ],
  "type": "graph"
}
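Panels like these can be assembled into a dashboard JSON file and loaded automatically via Grafana's file-based provisioning; a sketch (paths and provider name are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/captaind.yaml
apiVersion: 1
providers:
  - name: captaind
    type: file
    options:
      path: /var/lib/grafana/dashboards/captaind
```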

Distributed Tracing

Jaeger Setup

# Run Jaeger all-in-one
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest
Access UI: http://localhost:16686

Trace Spans

Round Execution:
  • round: Full round execution
    • round_attempt: Single attempt
      • ReceivePayments: Payment collection
      • VtxoTree: Tree construction
      • ReceiveVtxoSignatures: Signature collection
      • SignOnChainTransaction: TX signing
      • BroadcastOnChainTransaction: TX broadcast
      • Persist: Database storage
gRPC Requests:
  • grpc.<service>/<method>: Each RPC call
    • Includes: latency, status, error details
Trace Attributes:
  • round_seq: Round sequence number
  • attempt_seq: Attempt number
  • round_id: Round transaction ID
  • service.name: “captaind”
  • service.version: Version from Cargo.toml

Health Checks

gRPC Health Check

Use grpc_health_probe:
# Install
wget https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/v0.4.24/grpc_health_probe-linux-amd64
chmod +x grpc_health_probe-linux-amd64
sudo mv grpc_health_probe-linux-amd64 /usr/local/bin/grpc_health_probe

# Check health
grpc_health_probe -addr=127.0.0.1:3535
Kubernetes Liveness Probe:
livenessProbe:
  exec:
    command: ["/usr/local/bin/grpc_health_probe", "-addr=:3535"]
  initialDelaySeconds: 10
  periodSeconds: 10

Custom Health Checks

Check wallet balance:
captaind rpc wallet | jq -r '.rounds.confirmed_balance'
Check database connectivity:
psql -h localhost -U postgres -d bark-server-db -c "SELECT 1;" > /dev/null && echo "OK" || echo "FAIL"
Check Bitcoin Core sync:
bitcoin-cli getblockchaininfo | jq -r '.blocks'
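The sync check above can be turned into a pass/fail probe by comparing heights; a sketch with stubbed values (in production, populate them from bitcoin-cli and the bark_sync_height_gauge metric):

```shell
# Fail if captaind's synced height lags Bitcoin Core by more than 3 blocks.
chain_height=850000   # bitcoin-cli getblockchaininfo | jq -r '.blocks'
sync_height=849998    # bark_sync_height_gauge from the metrics endpoint
lag=$((chain_height - sync_height))
if [ "$lag" -gt 3 ]; then
  echo "FAIL: sync lag ${lag} blocks"
  exit 1
fi
echo "OK: sync lag ${lag} blocks"
# → OK: sync lag 2 blocks
```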

Alerting

Prometheus Alertmanager Rules

Create alerts.yml:
groups:
  - name: captaind
    interval: 30s
    rules:
      # Server down
      - alert: CaptaindDown
        expr: up{job="captaind"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "captaind is down"
      
      # High round failure rate
      - alert: HighRoundFailureRate
        expr: |
          rate(bark_round_state_gauge{state="FinishedError"}[5m]) / 
          rate(bark_round_seq_gauge[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High round failure rate (>10%)"
      
      # Low wallet balance
      - alert: LowWalletBalance
        expr: bark_wallet_balance_gauge{kind="rounds"} < 10000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rounds wallet balance low (<0.1 BTC)"
      
      # Lightning payment failures
      - alert: HighLightningFailureRate
        expr: |
          rate(bark_lightning_payment_counter{status="failed"}[5m]) / 
          rate(bark_lightning_payment_counter[5m]) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Lightning failure rate (>20%)"
      
      # Database connection pool exhaustion
      - alert: DatabasePoolExhausted
        expr: bark_postgres_idle_connections == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No idle database connections"
      
      # Slow rounds
      - alert: SlowRounds
        expr: bark_round_step_duration_gauge{step="Persist"} > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Round persistence taking >5 seconds"

Notification Channels

Slack:
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#ark-alerts'
        title: 'Captaind Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Email:
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'

Best Practices

Deploy full stack:
  • Prometheus for metrics
  • Grafana for visualization
  • Jaeger for tracing
  • Alertmanager for notifications
Key indicators to watch:
  • Round success rate (should be >95%)
  • Wallet balances (alert before empty)
  • Lightning payment success rate (>90%)
  • Database connection pool usage
  • Block height sync lag
Avoid alert fatigue:
  • Start with conservative thresholds
  • Adjust based on observed patterns
  • Use the for: duration on rules to prevent flapping
  • Prioritize alerts (critical vs warning)
Retention policies:
  • Logs: 30 days minimum
  • Metrics: 90 days minimum
  • Traces: 7 days (expensive to store)
  • Archive critical events long-term
Regularly verify:
  • Trigger test alerts
  • Simulate failures
  • Practice incident response
  • Update runbooks
