
Overview

captaind provides comprehensive observability through:
  • Structured Logs: JSON-formatted logs with contextual data
  • OpenTelemetry Metrics: Prometheus-compatible metrics
  • Distributed Tracing: Request and round tracing
  • Health Checks: gRPC health check protocol
This enables production monitoring with tools like Prometheus, Grafana, and Jaeger.

Logging

Log Format

Structured JSON logs using the server-log crate:
{"msg":"RoundStarted","round_seq":123456,"timestamp":"2024-01-15T10:30:00.123Z"}
{"msg":"ReceivedRoundPayments","round_seq":123456,"input_count":5,"output_count":8}
{"msg":"BroadcastRoundFundingTx","round_seq":123456,"txid":"abc123...","round_tx_fee":1000}
{"msg":"RoundFinished","round_seq":123456,"nb_input_vtxos":5,"vtxo_expiry_block_height":850000}
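Because every line is a self-contained JSON object, logs can be queried with standard tools such as jq. A small sketch (the sample lines below stand in for a real captaind.log):

```shell
# Write two sample log lines, then extract round_seq and fee
# from each BroadcastRoundFundingTx event.
cat > captaind.log <<'EOF'
{"msg":"RoundStarted","round_seq":123456,"timestamp":"2024-01-15T10:30:00.123Z"}
{"msg":"BroadcastRoundFundingTx","round_seq":123456,"txid":"abc123","round_tx_fee":1000}
EOF
jq -r 'select(.msg == "BroadcastRoundFundingTx") | "\(.round_seq) fee=\(.round_tx_fee)"' captaind.log
# → 123456 fee=1000
```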

Log Levels

Set via CAPTAIND_LOG environment variable:
# Info level (default)
export CAPTAIND_LOG=info

# Debug level (verbose)
export CAPTAIND_LOG=debug

# Trace level (very verbose)
export CAPTAIND_LOG=trace

# Warn level only
export CAPTAIND_LOG=warn

# Per-module filtering
export CAPTAIND_LOG="info,server::round=debug,server::ln=trace"
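For systemd deployments, the same variable can be set via a drop-in unit file (the drop-in path below is illustrative):

```ini
# /etc/systemd/system/captaind.service.d/logging.conf
[Service]
Environment="CAPTAIND_LOG=info,server::round=debug"
```

After editing, reload and restart: `systemctl daemon-reload && systemctl restart captaind`.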

Log Destinations

Stdout (default):
captaind start 2>&1 | tee captaind.log
Systemd Journal:
# View logs
journalctl -u captaind -f

# Filter by level
journalctl -u captaind -p err

# Export to file
journalctl -u captaind --since "1 hour ago" > captaind.log
Log Aggregation (Loki, ELK, etc.):
# Ship logs to Loki (promtail tails the log file per its config)
promtail --config.file=promtail.yaml
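The promtail.yaml referenced above needs, at minimum, a client pointing at Loki and a scrape target for the log file; a sketch (URLs, paths, and labels are illustrative):

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: captaind
    static_configs:
      - targets: [localhost]
        labels:
          job: captaind
          __path__: /var/log/captaind.log
```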

Important Log Messages

Server Lifecycle:
  • ServerStarted: Server initialization complete
  • ServerTerminated: Graceful shutdown
Round Events:
  • RoundStarted: New round initiated
  • ReceivedRoundPayments: User payments collected
  • BroadcastRoundFundingTx: Round tx broadcast
  • RoundFinished: Round completed successfully
  • RoundAbandoned: Round abandoned (no signers)
Errors:
  • RoundPaymentRegistrationFailed: User payment rejected
  • FatalStoringRound: Critical database error
  • ClaimChunkBroadcastFailure: Watchman claim failed
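The error messages above can be watched for directly; a minimal sketch (log path and sample content are assumptions):

```shell
# Count critical error events in the log. The sample file stands in
# for a real captaind.log.
cat > captaind.log <<'EOF'
{"msg":"RoundFinished","round_seq":42}
{"msg":"FatalStoringRound","round_seq":43}
EOF
errors=$(grep -cE '"msg":"(RoundPaymentRegistrationFailed|FatalStoringRound|ClaimChunkBroadcastFailure)"' captaind.log)
echo "critical errors: ${errors}"
# → critical errors: 1
```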

OpenTelemetry Setup

Install OpenTelemetry Collector

# Download collector
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.91.0/otelcol_0.91.0_linux_amd64.tar.gz
tar -xzf otelcol_0.91.0_linux_amd64.tar.gz
sudo mv otelcol /usr/local/bin/

Configure Collector

Create otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  # Prometheus metrics
  prometheus:
    endpoint: "0.0.0.0:8889"
  
  # Jaeger traces
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  
  # Logging (debug)
  logging:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, logging]
    
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, logging]
Start collector:
otelcol --config otel-collector-config.yaml

Configure captaind

Add to captaind.toml:
# OpenTelemetry collector endpoint
otel_collector_endpoint = "http://127.0.0.1:4317"

# Sampling rate: 0.0 to 1.0
# 0.0 = disabled, 1.0 = trace everything, 0.1 = 10% sampling
otel_tracing_sampler = "1.0"

# Deployment name (appears in metrics)
otel_deployment_name = "production-ark-01"

Metrics

Available Metrics

System Metrics

Runtime:
  • bark_spawn_counter: Active background tasks
  • bark_block_height_gauge: Current Bitcoin block height
  • bark_sync_height_gauge: Server’s synced block height
Wallets:
  • bark_wallet_balance_gauge: Wallet balance in sats (by kind)
    • Labels: kind=rounds|watchman

Round Metrics

  • bark_round_seq_gauge: Current round sequence number
  • bark_round_state_gauge: Round state (0-5)
    • 0 = CollectingPayments
    • 1 = SigningVtxoTree
    • 2 = FinishedEmpty
    • 3 = FinishedAbandoned
    • 4 = FinishedSuccess
    • 5 = FinishedError
  • bark_round_attempt_gauge: Current attempt within round
  • bark_round_step_duration_gauge: Duration of each round step (ms)
  • bark_round_input_volume_gauge: Total input amount (sats)
  • bark_round_input_count_gauge: Number of input VTXOs
  • bark_round_output_count_gauge: Number of output VTXOs

Lightning Metrics

  • bark_lightning_node_gauge: Connected CLN nodes (by uri, pubkey)
  • bark_lightning_node_boot_counter: CLN reconnections
  • bark_lightning_payment_counter: Payments by status
    • Labels: status=success|failed|pending
  • bark_lightning_payment_volume: Payment volume in msats
  • bark_lightning_open_invoices_gauge: Open invoices count
  • bark_lightning_invoice_verification_queue_gauge: Pending verifications

VTXO Pool Metrics

  • bark_vtxo_pool_amount_gauge: Current pool amount (sats) by denomination
  • bark_vtxo_pool_amount_max_gauge: Target pool amount
  • bark_vtxo_pool_count_gauge: Current VTXO count by denomination

Database Metrics

  • bark_postgres_connections: Total connections
  • bark_postgres_idle_connections: Idle connections in pool
  • bark_postgres_connections_created: Created connections (counter)
  • bark_postgres_connections_closed_*: Connection close reasons
  • bark_postgres_get_*: Connection pool statistics

gRPC Metrics

  • bark_grpc_in_progress_counter: Active RPC calls
  • bark_grpc_latency_histogram: Request latency (ms)
  • bark_grpc_request_counter: Requests by service, method, status
  • bark_grpc_error_counter: Errors by service, method, error

Fee Estimator Metrics

  • bark_fee_rate_gauge: Current fee rate (sat/vb) by priority
    • Labels: priority=fast|regular|slow
  • bark_fee_rate_using_fallback_gauge: Using fallback fee rate (0/1)
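Since the fallback gauge is a 0/1 value, it doubles as an alert condition; a sketch in PromQL:

```promql
# Fires while captaind is using the fallback fee rate instead of live estimates.
bark_fee_rate_using_fallback_gauge == 1
```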

Prometheus Configuration

Add to prometheus.yml:
scrape_configs:
  - job_name: 'captaind'
    static_configs:
      - targets: ['localhost:8889']  # OTel collector Prometheus endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'captaind-prod-01'

Example Queries

Round success rate:
rate(bark_round_state_gauge{state="FinishedSuccess"}[5m]) / 
rate(bark_round_seq_gauge[5m])
Average round input volume:
avg_over_time(bark_round_input_volume_gauge[1h])
Lightning payment success rate:
sum(rate(bark_lightning_payment_counter{status="success"}[5m])) / 
sum(rate(bark_lightning_payment_counter[5m]))
Database connection pool utilization:
bark_postgres_connections - bark_postgres_idle_connections

Grafana Dashboards

Sample Dashboard Panels

Round Health:
{
  "title": "Round Success Rate",
  "targets": [
    {
      "expr": "rate(bark_round_state_gauge{state=\"FinishedSuccess\"}[5m]) / rate(bark_round_seq_gauge[5m])",
      "legendFormat": "Success Rate"
    }
  ],
  "type": "graph"
}
Wallet Balances:
{
  "title": "Wallet Balances",
  "targets": [
    {
      "expr": "bark_wallet_balance_gauge",
      "legendFormat": "{{kind}}"
    }
  ],
  "type": "graph",
  "yaxes": [{"format": "sat"}]
}
Lightning Volume:
{
  "title": "Lightning Payment Volume",
  "targets": [
    {
      "expr": "rate(bark_lightning_payment_volume[5m]) / 1000",
      "legendFormat": "sats/sec"
    }
  ],
  "type": "graph"
}
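Panels like these can be assembled into a dashboard JSON file and loaded automatically via Grafana's file-based provisioning; a sketch (paths and provider name are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/captaind.yaml
apiVersion: 1
providers:
  - name: captaind
    type: file
    options:
      path: /var/lib/grafana/dashboards/captaind
```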

Distributed Tracing

Jaeger Setup

# Run Jaeger all-in-one
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest
Access UI: http://localhost:16686

Trace Spans

Round Execution:
  • round: Full round execution
    • round_attempt: Single attempt
      • ReceivePayments: Payment collection
      • VtxoTree: Tree construction
      • ReceiveVtxoSignatures: Signature collection
      • SignOnChainTransaction: TX signing
      • BroadcastOnChainTransaction: TX broadcast
      • Persist: Database storage
gRPC Requests:
  • grpc.<service>/<method>: Each RPC call
    • Includes: latency, status, error details
Trace Attributes:
  • round_seq: Round sequence number
  • attempt_seq: Attempt number
  • round_id: Round transaction ID
  • service.name: “captaind”
  • service.version: Version from Cargo.toml

Health Checks

gRPC Health Check

Use grpc_health_probe:
# Install
wget https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/v0.4.24/grpc_health_probe-linux-amd64
chmod +x grpc_health_probe-linux-amd64
sudo mv grpc_health_probe-linux-amd64 /usr/local/bin/grpc_health_probe

# Check health
grpc_health_probe -addr=127.0.0.1:3535
Kubernetes Liveness Probe:
livenessProbe:
  exec:
    command: ["/usr/local/bin/grpc_health_probe", "-addr=:3535"]
  initialDelaySeconds: 10
  periodSeconds: 10

Custom Health Checks

Check wallet balance:
captaind rpc wallet | jq -r '.rounds.confirmed_balance'
Check database connectivity:
psql -h localhost -U postgres -d bark-server-db -c "SELECT 1;" > /dev/null && echo "OK" || echo "FAIL"
Check Bitcoin Core sync:
bitcoin-cli getblockchaininfo | jq -r '.blocks'
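The sync check above can be turned into a pass/fail probe by comparing heights; a sketch with stubbed values (in production, populate them from bitcoin-cli and the bark_sync_height_gauge metric):

```shell
# Fail if captaind's synced height lags Bitcoin Core by more than 3 blocks.
chain_height=850000   # bitcoin-cli getblockchaininfo | jq -r '.blocks'
sync_height=849998    # bark_sync_height_gauge from the metrics endpoint
lag=$((chain_height - sync_height))
if [ "$lag" -gt 3 ]; then
  echo "FAIL: sync lag ${lag} blocks"
  exit 1
fi
echo "OK: sync lag ${lag} blocks"
# → OK: sync lag 2 blocks
```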

Alerting

Prometheus Alertmanager Rules

Create alerts.yml:
groups:
  - name: captaind
    interval: 30s
    rules:
      # Server down
      - alert: CaptaindDown
        expr: up{job="captaind"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "captaind is down"
      
      # High round failure rate
      - alert: HighRoundFailureRate
        expr: |
          rate(bark_round_state_gauge{state="FinishedError"}[5m]) / 
          rate(bark_round_seq_gauge[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High round failure rate (>10%)"
      
      # Low wallet balance
      - alert: LowWalletBalance
        expr: bark_wallet_balance_gauge{kind="rounds"} < 10000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rounds wallet balance low (<0.1 BTC)"
      
      # Lightning payment failures
      - alert: HighLightningFailureRate
        expr: |
          rate(bark_lightning_payment_counter{status="failed"}[5m]) / 
          rate(bark_lightning_payment_counter[5m]) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Lightning failure rate (>20%)"
      
      # Database connection pool exhaustion
      - alert: DatabasePoolExhausted
        expr: bark_postgres_idle_connections == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No idle database connections"
      
      # Slow rounds
      - alert: SlowRounds
        expr: bark_round_step_duration_gauge{step="Persist"} > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Round persistence taking >5 seconds"

Notification Channels

Slack:
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#ark-alerts'
        title: 'Captaind Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Email:
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'

Best Practices

Deploy full stack:
  • Prometheus for metrics
  • Grafana for visualization
  • Jaeger for tracing
  • Alertmanager for notifications
Key indicators to watch:
  • Round success rate (should be >95%)
  • Wallet balances (alert before empty)
  • Lightning payment success rate (>90%)
  • Database connection pool usage
  • Block height sync lag
Avoid alert fatigue:
  • Start with conservative thresholds
  • Adjust based on observed patterns
  • Use the for: duration on rules to prevent flapping
  • Prioritize alerts (critical vs warning)
Retention policies:
  • Logs: 30 days minimum
  • Metrics: 90 days minimum
  • Traces: 7 days (expensive to store)
  • Archive critical events long-term
Regularly verify:
  • Trigger test alerts
  • Simulate failures
  • Practice incident response
  • Update runbooks
