Overview
CoW Protocol Services provide extensive observability through multiple channels:
- Prometheus metrics - Time-series metrics for monitoring and alerting
- OpenTelemetry tracing - Distributed tracing for request flows
- Structured logging - JSON-formatted logs with tracing integration
- Runtime diagnostics - Dynamic log filtering and heap profiling
- Performance profiling - Heap dumps and tokio runtime inspection
Prometheus Metrics
Metrics Endpoint
All services expose Prometheus-compatible metrics on a dedicated port (default: 9586):
# Default metrics endpoint
http://localhost:9586/metrics
# Health check endpoints
http://localhost:9586/liveness
http://localhost:9586/ready
http://localhost:9586/startup
Configuration:
# Services typically expose metrics on 9586 by default
# Check service-specific documentation for custom ports
Available Metrics
Common Metrics (all services):
# Auction overhead tracking
auction_overhead_time{component="autopilot",phase="database"}
auction_overhead_count{component="autopilot",phase="database"}
# Database query performance
persistence_database_queries{type="fetch_orders"}
# HTTP request metrics (via observe crate)
http_requests_total{method="GET",endpoint="/api/v1/orders"}
http_request_duration_seconds{method="GET",endpoint="/api/v1/orders"}
Service-Specific Metrics:
Orderbook:
orderbook_orders_total{status="created"}
orderbook_quotes_total
orderbook_api_requests{endpoint="/api/v1/quote"}
Autopilot:
autopilot_auctions_total
autopilot_auction_orders_count
autopilot_solutions_submitted
autopilot_settlements_completed
Driver:
driver_solutions_simulated
driver_solutions_submitted
driver_tx_submission_time
Scrape Configuration
Prometheus prometheus.yml:
scrape_configs:
- job_name: 'cow-orderbook'
static_configs:
- targets: ['orderbook-1:9586', 'orderbook-2:9586', 'orderbook-3:9586']
scrape_interval: 15s
scrape_timeout: 10s
- job_name: 'cow-autopilot'
static_configs:
- targets: ['autopilot:9586']
scrape_interval: 15s
- job_name: 'cow-driver'
static_configs:
- targets: ['driver:9586']
scrape_interval: 15s
- job_name: 'cow-solver'
static_configs:
- targets: ['solver-baseline:9586']
scrape_interval: 15s
Kubernetes ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: cow-services
namespace: cow-protocol
spec:
selector:
matchLabels:
app.kubernetes.io/part-of: cow-protocol
endpoints:
- port: metrics
interval: 15s
path: /metrics
Services automatically register metrics when they start. No explicit metric initialization is required in most cases.
OpenTelemetry Integration
Distributed Tracing
Services support OpenTelemetry tracing with OTLP export via gRPC.
Configuration:
autopilot \
--tracing-collector-endpoint http://otel-collector:4317 \
--tracing-level INFO \
--tracing-exporter-timeout 10s
Environment Variables:
export TRACING_COLLECTOR_ENDPOINT=http://localhost:4317
export TRACING_LEVEL=INFO
export TRACING_EXPORTER_TIMEOUT=10s
OpenTelemetry Collector Setup
otel-collector-config.yaml:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
memory_limiter:
limit_mib: 512
spike_limit_mib: 128
check_interval: 5s
exporters:
# Export to Jaeger
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
# Export to Tempo
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
# Export to Honeycomb
otlp/honeycomb:
endpoint: api.honeycomb.io:443
headers:
x-honeycomb-team: ${HONEYCOMB_API_KEY}
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [jaeger, otlp/tempo]
Trace Propagation
Services use W3C TraceContext propagation:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
HTTP headers are automatically propagated across service boundaries.
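The traceparent value packs four hyphen-separated fields (version, trace-id, parent/span-id, trace-flags). As a sketch, the example header above can be split with standard shell tools:

```shell
# Split a W3C traceparent header into its fields:
# version - trace-id - parent-id (span id) - trace-flags
traceparent="00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
trace_id=$(echo "$traceparent" | cut -d- -f2)
span_id=$(echo "$traceparent" | cut -d- -f3)
flags=$(echo "$traceparent" | cut -d- -f4)
echo "trace_id=$trace_id span_id=$span_id flags=$flags"
```

A flags value of 01 means the sampled bit is set, so downstream services will record the trace.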
Custom Span Attributes:
use tracing::{info_span, instrument};

#[instrument(skip(context))]
async fn process_auction(auction_id: i64, context: &Context) {
    // #[instrument] creates a "process_auction" span and records
    // auction_id as a field automatically (context is skipped).
    // Child spans for sub-steps are entered explicitly:
    let _enter = info_span!("load_orders", auction_id).entered();
}
Use Jaeger or Grafana Tempo to visualize trace flows across services and identify performance bottlenecks.
Logging
Structured Logging with tracing
All services use the tracing crate for structured logging.
Log Levels:
TRACE - Very verbose, function entry/exit
DEBUG - Detailed diagnostic information
INFO - General informational messages (default)
WARN - Warning conditions
ERROR - Error conditions that need attention
Configuration:
# Production: JSON format
autopilot \
--log-filter "info,autopilot=debug" \
--use-json-logs true \
--log-stderr-threshold ERROR
# Development: Human-readable
autopilot \
--log-filter "debug,shared::http=trace" \
--use-json-logs false
Example output:
{
"timestamp": "2026-03-04T15:23:45.123Z",
"level": "INFO",
"target": "autopilot::run_loop",
"message": "auction created",
"fields": {
"auction_id": 12345,
"order_count": 42
},
"spans": [
{"name": "run_loop", "auction_id": 12345}
],
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b7ad6b7169203331"
}
Log Fields:
timestamp - ISO 8601 timestamp (UTC)
level - Log level
target - Module path
message - Human-readable message
fields - Structured key-value data
spans - Active tracing spans
trace_id - OpenTelemetry trace ID
span_id - OpenTelemetry span ID
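Because trace_id is a top-level field in every JSON log line, logs can be joined to traces with plain text tools and no jq dependency. A minimal sketch:

```shell
# Pull the trace_id out of a JSON-formatted log line
extract_trace_id() {
  sed -n 's/.*"trace_id": *"\([0-9a-f]*\)".*/\1/p'
}

line='{"level":"INFO","message":"auction created","trace_id":"0af7651916cd43dd8448eb211c80319c"}'
echo "$line" | extract_trace_id
```

The extracted ID can be pasted directly into Jaeger or Tempo to open the corresponding trace.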
Log Aggregation
Promtail + Loki:
# promtail-config.yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: cow-services
static_configs:
- targets:
- localhost
labels:
job: cow-services
__path__: /var/log/cow/*.json
pipeline_stages:
- json:
expressions:
level: level
target: target
message: message
trace_id: trace_id
- labels:
level:
target:
Elasticsearch via Filebeat:
# filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/cow/*.json
json.keys_under_root: true
json.add_error_key: true
output.elasticsearch:
hosts: ["elasticsearch:9200"]
index: "cow-services-%{+yyyy.MM.dd}"
setup.template.name: "cow-services"
setup.template.pattern: "cow-services-*"
Runtime Log Filter Adjustment
Change the log filter of a running service without restarting it.
How It Works
Each service creates a UNIX socket:
/tmp/log_filter_override_<program_name>_<pid>.sock
Usage Examples
Local Development:
# Find the socket
ls /tmp/log_filter_override_*
# Apply new filter
echo "debug,autopilot=trace" | nc -U /tmp/log_filter_override_autopilot_12345.sock
# Reset to original
echo "reset" | nc -U /tmp/log_filter_override_autopilot_12345.sock
Kubernetes:
# List sockets
kubectl exec autopilot-5d6f8c9b7-xyz -- ls /tmp/log_filter_override_*
# Apply filter
kubectl exec autopilot-5d6f8c9b7-xyz -- sh -c \
"echo 'trace' | nc -U /tmp/log_filter_override_autopilot_1.sock"
# View response
kubectl logs autopilot-5d6f8c9b7-xyz --tail=10
Docker:
docker exec cow-autopilot sh -c \
"echo 'debug,shared=trace' | nc -U /tmp/log_filter_override_autopilot_1.sock"
Use Cases
- Debug production issues - Enable verbose logging temporarily
- Trace specific flows - Focus on particular modules
- Reduce noise - Filter out chatty modules
- A/B comparison - Compare behavior at different log levels
Changing log filters at runtime is powerful but can generate large volumes of logs. Use trace level sparingly in production.
Tokio Console (Playground Only)
tokio-console provides runtime inspection of async tasks, but is only available in playground environments due to significant memory overhead.
Enabling tokio-console
Playground Environment:
export TOKIO_CONSOLE=true
docker compose -f playground/docker-compose.fork.yml up --build
tokio-console adds significant memory overhead and must never be enabled in production. It is only compiled into playground builds.
Installation
cargo install --locked tokio-console
Usage
# Connect to service (playground ports)
tokio-console http://localhost:6669 # orderbook
tokio-console http://localhost:6670 # autopilot
tokio-console http://localhost:6671 # driver
tokio-console http://localhost:6672 # baseline solver
Console Features:
- View all async tasks
- Monitor task states (running, idle, polling)
- Track task resource usage
- Detect blocking operations
- Identify task leaks
Playground Ports:
orderbook: 6669
autopilot: 6670
driver: 6671
baseline: 6672
tokio-console requires both tokio_unstable cfg and the tokio-console feature. Production builds exclude these.
Heap Profiling with jemalloc
All services use jemalloc as the default allocator with built-in heap profiling support.
Enabling Heap Profiling
Heap profiling is enabled at runtime via environment variables:
# Enable profiling with reasonable sampling
export MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:22"
autopilot
MALLOC_CONF Parameters:
prof:true - Enable profiling capability
prof_active:true - Start profiling immediately
lg_prof_sample:22 - Sample every 4MB (2^22 bytes)
Higher lg_prof_sample values reduce overhead but provide less detail. Start with 22 (4MB) for production.
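The relationship between lg_prof_sample and the sampling interval in bytes is a simple power of two:

```shell
# lg_prof_sample:N sets the average sampling interval to 2^N bytes
for lg in 19 22 24; do
  echo "lg_prof_sample:$lg -> $((1 << lg)) bytes"
done
```

So 19 samples every 512KB (more detail, more overhead), 22 every 4MB, and 24 every 16MB.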
Heap Dump Socket
When profiling is enabled, each service opens a UNIX socket:
/tmp/heap_dump_<binary_name>.sock
Generating Heap Dumps
Local Development:
echo dump | nc -U /tmp/heap_dump_autopilot.sock > heap.pprof
Kubernetes:
kubectl exec <pod> -n <namespace> -- sh -c \
"echo dump | nc -U /tmp/heap_dump_orderbook.sock" > heap.pprof
Docker:
docker exec <container> sh -c \
"echo dump | nc -U /tmp/heap_dump_driver.sock" > heap.pprof
Heap dumps are in pprof binary format and can be large (100MB+). The socket has a 60-second timeout per dump.
Analyzing Heap Dumps
Install pprof:
go install github.com/google/pprof@latest
Interactive Web UI:
pprof -http=:8080 heap.pprof
Navigate to http://localhost:8080 and explore:
- Top - Functions with highest allocation
- Graph - Visual call graph
- Flame Graph - Hierarchical visualization
- Peek - Source code view
- Source - Annotated source
Command-Line Analysis:
# Top allocators
pprof -top heap.pprof
# Call tree
pprof -tree heap.pprof
# Focus on specific function
pprof -top -focus="autopilot::run_loop" heap.pprof
# Compare two dumps
pprof -top -base heap1.pprof heap2.pprof
Example Output:
File: autopilot
Type: inuse_space
Showing nodes accounting for 1.23GB, 89.45% of 1.38GB total
flat flat% sum% cum cum%
512.00MB 37.10% 37.10% 512.00MB 37.10% alloc::vec::Vec::extend_from_slice
256.00MB 18.55% 55.65% 256.00MB 18.55% sqlx::query::Query::fetch_all
192.00MB 13.91% 69.57% 192.00MB 13.91% autopilot::domain::auction::Auction::new
Heap profiling has minimal runtime overhead when enabled:
- CPU overhead: < 1% with lg_prof_sample=22
- Memory overhead: ~5-10% for metadata
- Dump generation: 1-5 seconds depending on heap size
Dumps capture a snapshot of the current heap state. For memory leaks, compare multiple dumps over time.
Health Checks
All services expose health check endpoints for orchestration systems.
Endpoints
Liveness Probe:
GET http://localhost:9586/liveness
# Returns 200 OK if service is alive
# Returns 503 Service Unavailable if unhealthy
Use for detecting deadlocks and crashes.
Readiness Probe:
GET http://localhost:9586/ready
# Returns 200 OK if ready to serve traffic
# Returns 503 if not ready (e.g., database connection lost)
Use for load balancer decisions.
Startup Probe:
GET http://localhost:9586/startup
# Returns 200 OK if startup complete
# Returns 503 during initialization
Use for slow-starting services.
Kubernetes Configuration
apiVersion: v1
kind: Pod
metadata:
name: autopilot
spec:
containers:
- name: autopilot
image: cow/autopilot:latest
ports:
- name: metrics
containerPort: 9586
livenessProbe:
httpGet:
path: /liveness
port: metrics
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: metrics
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
startupProbe:
httpGet:
path: /startup
port: metrics
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30 # 150 seconds max startup time
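The startup budget in the manifest above follows from failureThreshold multiplied by periodSeconds:

```shell
# startupProbe budget: failureThreshold (30) * periodSeconds (5)
echo "$((30 * 5)) seconds"
```

If a service legitimately needs longer to initialize, raise failureThreshold rather than periodSeconds, so failures are still detected promptly once the service is up.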
Alerting Recommendations
Critical Alerts
Service Down:
alert: ServiceDown
expr: up{job="cow-autopilot"} == 0
for: 2m
annotations:
summary: "Autopilot service is down"
High Error Rate:
alert: HighErrorRate
expr: |
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m]) > 0.05
for: 5m
annotations:
summary: "Error rate above 5%"
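The 5% threshold is just the ratio of the 5xx rate to the total request rate. The same arithmetic offline, with illustrative numbers:

```shell
# HighErrorRate fires when 5xx responses exceed 5% of all requests
errors=12
total=200
awk -v e="$errors" -v t="$total" \
  'BEGIN { r = e / t; printf "error_rate=%.3f fire=%s\n", r, (r > 0.05) ? "yes" : "no" }'
```

With 12 errors out of 200 requests the rate is 6%, so the alert would fire after the 5-minute hold period.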
Database Connection Pool Exhausted:
alert: DBPoolExhausted
expr: db_pool_connections_active / db_pool_connections_max > 0.9
for: 5m
annotations:
summary: "Database connection pool nearly exhausted"
Warning Alerts
Slow Queries:
alert: SlowDatabaseQueries
expr: histogram_quantile(0.95, rate(persistence_database_queries_bucket[5m])) > 5
for: 10m
annotations:
summary: "95th percentile DB query time > 5s"
High Memory Usage:
alert: HighMemoryUsage
expr: process_resident_memory_bytes / 1e9 > 8
for: 15m
annotations:
summary: "Service using more than 8GB memory"
Auction Delays:
alert: AuctionDelays
expr: rate(auction_overhead_time_total[5m]) > 10
for: 10m
annotations:
summary: "Auction overhead increasing"
Grafana Dashboards
Example Dashboard Panels
Request Rate:
sum(rate(http_requests_total[5m])) by (endpoint)
Request Duration (p50, p95, p99):
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Database Query Performance:
histogram_quantile(0.95, sum(rate(persistence_database_queries_bucket[5m])) by (le, type))
Active Orders:
orderbook_orders_total{status="active"}
Settlement Success Rate:
sum(rate(autopilot_settlements_completed[5m])) /
sum(rate(autopilot_solutions_submitted[5m]))
Best Practices
- Use structured logging in production - Enable --use-json-logs true
- Set appropriate log levels - info for production, debug/trace for troubleshooting
- Monitor connection pools - Alert on high utilization before exhaustion
- Enable distributed tracing - Track requests across service boundaries
- Use runtime log filtering - Debug production without restarts
- Profile memory regularly - Generate heap dumps weekly to detect leaks
- Never enable tokio-console in production - Playground only
- Configure health checks - Proper liveness/readiness probes
- Set up alerting - Critical and warning alerts for all services
- Aggregate logs centrally - Use Loki, Elasticsearch, or CloudWatch
Troubleshooting
Missing Metrics
Check metrics endpoint:
curl http://localhost:9586/metrics
Verify Prometheus scrape:
Check Status → Targets in the Prometheus UI for failed scrapes of the cow-* jobs.
Trace Not Appearing
Verify collector connection:
# Check collector logs
kubectl logs otel-collector
# Test that the collector port is reachable (it speaks gRPC, so plain curl
# gets no meaningful reply, but "connection refused" indicates a network problem)
curl http://otel-collector:4317
Check trace sampling:
Services use AlwaysOn sampler by default.
Log Volume Too High
Reduce log level:
echo "warn" | nc -U /tmp/log_filter_override_orderbook_1.sock
Filter noisy modules:
echo "info,hyper=warn,sqlx=warn" | nc -U /tmp/log_filter_override_autopilot_1.sock
Heap Dump Fails
Check profiling enabled:
echo $MALLOC_CONF
# Should contain: prof:true,prof_active:true
Verify socket exists:
ls /tmp/heap_dump_*.sock
Timeout:
Large heaps may take longer than 60 seconds to dump. Coarsen the sampling interval to shrink the profile and speed up dumps:
MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:24" # 16MB sampling
See Also