Overview
captaind provides comprehensive observability through:
- Structured Logs: JSON-formatted logs with contextual data
- OpenTelemetry Metrics: Prometheus-compatible metrics
- Distributed Tracing: Request and round tracing
- Health Checks: gRPC health check protocol
Logging
Log Format
Structured JSON logs using the server-log crate:
Log Levels
Set via the CAPTAIND_LOG environment variable:
Log Destinations
Stdout (default):
Important Log Messages
Server Lifecycle:
- ServerStarted: Server initialization complete
- ServerTerminated: Graceful shutdown

- RoundStarted: New round initiated
- ReceivedRoundPayments: User payments collected
- BroadcastRoundFundingTx: Round tx broadcast
- RoundFinished: Round completed successfully
- RoundAbandoned: Round abandoned (no signers)

- RoundPaymentRegistrationFailed: User payment rejected
- FatalStoringRound: Critical database error
- ClaimChunkBroadcastFailure: Watchman claim failed
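Because the logs are JSON, the failure messages listed above can be picked out of a log stream with a few lines of scripting. The following is a minimal sketch, not captaind's own tooling. The exact log schema is not shown here, so the field name `kind` that carries the message name is an assumption and is exposed as a parameter; `failure_events` is a hypothetical helper name.

```python
import json

# Failure-related message names, taken from the list above.
FAILURE_KINDS = {
    "RoundPaymentRegistrationFailed",
    "FatalStoringRound",
    "ClaimChunkBroadcastFailure",
}


def failure_events(lines, kind_field="kind"):
    """Yield parsed log records whose message name is in FAILURE_KINDS.

    `kind_field` is an ASSUMED field name; adjust it to the field that
    actually holds the message name in your log output.
    Blank lines and lines that are not JSON objects are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get(kind_field) in FAILURE_KINDS:
            yield record
```

A typical use would be to pass it an open log file or `sys.stdin` and forward the yielded records to a notification script.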
OpenTelemetry Setup
Install OpenTelemetry Collector
Configure Collector
Create otel-collector-config.yaml:
Configure captaind
Add to captaind.toml:
Metrics
Available Metrics
System Metrics
Runtime:
- bark_spawn_counter: Active background tasks
- bark_block_height_gauge: Current Bitcoin block height
- bark_sync_height_gauge: Server’s synced block height

- bark_wallet_balance_gauge: Wallet balance in sats (by kind)
  - Labels: kind=rounds|watchman
Round Metrics
- bark_round_seq_gauge: Current round sequence number
- bark_round_state_gauge: Round state (0-5)
  - 0 = CollectingPayments
  - 1 = SigningVtxoTree
  - 2 = FinishedEmpty
  - 3 = FinishedAbandoned
  - 4 = FinishedSuccess
  - 5 = FinishedError
- bark_round_attempt_gauge: Current attempt within round
- bark_round_step_duration_gauge: Duration of each round step (ms)
- bark_round_input_volume_gauge: Total input amount (sats)
- bark_round_input_count_gauge: Number of input VTXOs
- bark_round_output_count_gauge: Number of output VTXOs
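When consuming bark_round_state_gauge in scripts or dashboards, it can help to translate the numeric value back into a state name. The sketch below mirrors the 0-5 mapping listed above. Treating the four Finished* values as terminal is an assumption drawn from their names, and `describe_round_state` is a hypothetical helper.

```python
from enum import IntEnum


class RoundState(IntEnum):
    """Numeric values of bark_round_state_gauge, as listed above."""
    COLLECTING_PAYMENTS = 0
    SIGNING_VTXO_TREE = 1
    FINISHED_EMPTY = 2
    FINISHED_ABANDONED = 3
    FINISHED_SUCCESS = 4
    FINISHED_ERROR = 5


def describe_round_state(value):
    """Return (state_name, is_terminal) for a gauge sample.

    Gauge samples usually arrive as floats, so the value is cast to int.
    ASSUMPTION: every Finished* state (value >= 2) is terminal.
    Raises ValueError for values outside 0-5.
    """
    state = RoundState(int(value))
    return state.name, state >= RoundState.FINISHED_EMPTY
```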
Lightning Metrics
- bark_lightning_node_gauge: Connected CLN nodes (by uri, pubkey)
- bark_lightning_node_boot_counter: CLN reconnections
- bark_lightning_payment_counter: Payments by status
  - Labels: status=success|failed|pending
- bark_lightning_payment_volume: Payment volume in msats
- bark_lightning_open_invoices_gauge: Open invoices count
- bark_lightning_invoice_verification_queue_gauge: Pending verifications
VTXO Pool Metrics
- bark_vtxo_pool_amount_gauge: Current pool amount (sats) by denomination
- bark_vtxo_pool_amount_max_gauge: Target pool amount
- bark_vtxo_pool_count_gauge: Current VTXO count by denomination
Database Metrics
- bark_postgres_connections: Total connections
- bark_postgres_idle_connections: Idle connections in pool
- bark_postgres_connections_created: Created connections (counter)
- bark_postgres_connections_closed_*: Connection close reasons
- bark_postgres_get_*: Connection pool statistics
gRPC Metrics
- bark_grpc_in_progress_counter: Active RPC calls
- bark_grpc_latency_histogram: Request latency (ms)
- bark_grpc_request_counter: Requests by service, method, status
- bark_grpc_error_counter: Errors by service, method, error
Fee Estimator Metrics
- bark_fee_rate_gauge: Current fee rate (sat/vb) by priority
  - Labels: priority=fast|regular|slow
- bark_fee_rate_using_fallback_gauge: Using fallback fee rate (0/1)
Prometheus Configuration
Add to prometheus.yml:
Example Queries
Round success rate:
Grafana Dashboards
Sample Dashboard Panels
Round Health:
Distributed Tracing
Jaeger Setup
Trace Spans
Round Execution:
- round: Full round execution
- round_attempt: Single attempt
- ReceivePayments: Payment collection
- VtxoTree: Tree construction
- ReceiveVtxoSignatures: Signature collection
- SignOnChainTransaction: TX signing
- BroadcastOnChainTransaction: TX broadcast
- Persist: Database storage

- grpc.<service>/<method>: Each RPC call
  - Includes: latency, status, error details

- round_seq: Round sequence number
- attempt_seq: Attempt number
- round_id: Round transaction ID
- service.name: “captaind”
- service.version: Version from Cargo.toml
Health Checks
gRPC Health Check
Use grpc_health_probe:
Custom Health Checks
Check wallet balance:
Alerting
Prometheus Alertmanager Rules
Create alerts.yml:
Notification Channels
Slack:
Best Practices
Set up comprehensive monitoring
Deploy full stack:
- Prometheus for metrics
- Grafana for visualization
- Jaeger for tracing
- Alertmanager for notifications
Monitor critical metrics
Key indicators to watch:
- Round success rate (should be >95%)
- Wallet balances (alert before empty)
- Lightning payment success rate (>90%)
- Database connection pool usage
- Block height sync lag
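The indicators above can be turned into a simple scripted check. The sketch below is illustrative only: it takes already-collected numbers as arguments rather than querying Prometheus, uses the >95% and >90% targets stated above, and assumes a hypothetical `max_sync_lag` of 2 blocks, which the text does not specify. The function names are likewise hypothetical.

```python
def success_rate(succeeded, total):
    """Return succeeded/total as a fraction, or None when there is no data."""
    if total <= 0:
        return None
    return succeeded / total


def check_indicators(rounds_ok, rounds_total, ln_ok, ln_total,
                     block_height, sync_height, max_sync_lag=2):
    """Return a list of warning strings for indicators outside their targets.

    Targets from the list above: round success rate > 95%,
    Lightning payment success rate > 90%.
    ASSUMPTION: max_sync_lag (in blocks) is an illustrative default.
    """
    warnings = []

    round_rate = success_rate(rounds_ok, rounds_total)
    if round_rate is not None and round_rate <= 0.95:
        warnings.append(f"round success rate {round_rate:.1%} is not above 95%")

    ln_rate = success_rate(ln_ok, ln_total)
    if ln_rate is not None and ln_rate <= 0.90:
        warnings.append(f"lightning success rate {ln_rate:.1%} is not above 90%")

    # Sync lag: chain tip (bark_block_height_gauge) minus the server's
    # synced height (bark_sync_height_gauge).
    lag = block_height - sync_height
    if lag > max_sync_lag:
        warnings.append(f"sync lag is {lag} blocks (max {max_sync_lag})")

    return warnings
```

Returning None for an empty window keeps a quiet period (no rounds, no payments) from being reported as a 0% success rate.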
Tune alert thresholds
Avoid alert fatigue:
- Start with conservative thresholds
- Adjust based on observed patterns
- Use "for" durations in alert rules to prevent flapping
- Prioritize alerts (critical vs warning)
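To see why a "for" duration reduces flapping, it helps to model it: the alert fires only once its condition has held continuously for the whole duration, and any sample that clears the condition resets the timer. The following is a simplified sketch of that idea, not Prometheus's actual evaluation logic. The function name and the "value below threshold" condition are assumptions chosen for illustration.

```python
def firing_times(samples, threshold, for_seconds):
    """Return timestamps at which a 'value < threshold' alert is firing.

    samples: list of (timestamp_seconds, value) pairs, sorted by time.
    The alert fires at a sample only if every sample since the condition
    first became true has also satisfied it, and at least for_seconds
    have elapsed since then.
    """
    firing = []
    pending_since = None  # time the condition first became true
    for ts, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            # Condition cleared: reset, so brief dips never fire.
            pending_since = None
    return firing
```

With a one-sample dip followed by recovery, a `for_seconds` of 120 produces no firing at all, whereas a value of 0 fires on every dip.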
Retain logs and metrics
Retention policies:
- Logs: 30 days minimum
- Metrics: 90 days minimum
- Traces: 7 days (expensive to store)
- Archive critical events long-term
Test your monitoring
Regularly verify:
- Trigger test alerts
- Simulate failures
- Practice incident response
- Update runbooks