Metrics Architecture
Snuba emits metrics to DataDog (via the StatsD protocol) for monitoring and alerting.

Metrics Configuration
Metric Types
Snuba emits four types of metrics:

Counters
Incremental counters for event tracking.

Key counters:
- consumer.message_processed: Messages consumed from Kafka
- consumer.message_filtered: Messages filtered before processing
- healthcheck_failed: Failed health checks
- query.success: Successful queries
- query.error: Failed queries
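Counters like these travel as plain-text StatsD datagrams over UDP. A minimal sketch of what a counter increment looks like on the wire; the helper names and the direct-socket send are illustrative, not Snuba's actual metrics backend:

```python
import socket

def statsd_counter_packet(name: str, value: int = 1, sample_rate: float = 1.0) -> bytes:
    """Format a StatsD counter increment, e.g. b'consumer.message_processed:1|c'."""
    packet = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # Sampled counters carry their rate so the server can scale them back up.
        packet += f"|@{sample_rate}"
    return packet.encode("ascii")

def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; StatsD agents listen on 8125 by default."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter_packet(name, value), (host, port))
```

Because the transport is UDP, emitting a metric never blocks or fails the hot path, which is why counters are safe to fire per message.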
Timers/Distributions
Timing and distribution metrics.

Key timers:
- query.duration: Query execution time
- consumer.batch_time: Consumer batch processing time
- clickhouse.query.duration: ClickHouse query execution time
- healthcheck.latency: Health check response time
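Timer values are conventionally reported in milliseconds. A sketch of how a block of work can be timed and handed to a metric sink; the `timed` helper and the in-memory sink are illustrative, not Snuba code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(metric_name, sink):
    """Measure a block's wall-clock duration and report it in milliseconds,
    mirroring a timer metric such as query.duration."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        sink(metric_name, elapsed_ms)

# Usage: collect into a dict instead of a real StatsD client.
samples = {}
with timed("query.duration", lambda name, ms: samples.setdefault(name, []).append(ms)):
    time.sleep(0.01)  # stand-in for query execution
```

Using `time.monotonic()` rather than `time.time()` keeps durations immune to wall-clock adjustments.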
Gauges
Point-in-time measurements.

Key gauges:
- consumer.lag: Kafka consumer lag in messages
- consumer.batch_size: Current batch size
- clickhouse.connections: Active ClickHouse connections
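consumer.lag is derived from Kafka offsets: each partition's high watermark minus the consumer's committed offset. A small illustrative computation (the offset numbers are made up):

```python
def consumer_lag(high_watermarks, committed_offsets):
    """Compute per-partition and total Kafka consumer lag.

    high_watermarks:   {partition: latest offset in the log}
    committed_offsets: {partition: consumer's committed offset}
    """
    per_partition = {
        # Clamp at 0: a just-created consumer group can briefly read ahead
        # of a stale watermark snapshot.
        p: max(high_watermarks[p] - committed_offsets.get(p, 0), 0)
        for p in high_watermarks
    }
    return per_partition, sum(per_partition.values())

per_partition, total = consumer_lag(
    {0: 1_500_000, 1: 900_000},
    {0: 1_499_000, 1: 250_000},
)
```

The total is what the gauge reports; the per-partition breakdown is what you check first when lag is concentrated on one partition.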
Sets
Metrics that count distinct values rather than totals: each unique value is recorded once per flush interval, so the reported number is a cardinality.
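As an illustration of set semantics, the sketch below counts unique values per flush interval. The metric name is hypothetical, and real StatsD backends may approximate the count (e.g. with HyperLogLog) rather than hold an exact set:

```python
class SetMetric:
    """Toy model of a StatsD set: reports the number of unique values
    seen since the last flush, then resets."""

    def __init__(self, name):
        self.name = name
        self._values = set()

    def add(self, value):
        self._values.add(value)

    def flush(self):
        count = len(self._values)
        self._values.clear()
        return count

# Hypothetical metric: how many distinct projects issued queries this interval.
unique_projects = SetMetric("queries.unique_projects")
for project_id in [1, 7, 7, 42, 1]:
    unique_projects.add(project_id)
```

Duplicates collapse, so five observations across three projects report a cardinality of 3.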
Health Monitoring
Health Check Endpoints
Snuba provides multiple health check endpoints:

Basic Health Check
Quick sanity check that at least one ClickHouse node is responsive:
- At least one ClickHouse cluster is reachable
- Can execute a SHOW TABLES query
- Timeout: 500ms per cluster (configurable)
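The basic check reduces to "did any cluster answer within the timeout". A sketch of that logic using stand-in probe callables; in a real deployment each probe would run SHOW TABLES against one ClickHouse cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def any_cluster_healthy(probes, timeout_s=0.5):
    """Return True if at least one probe succeeds within the per-cluster
    timeout. Probes are callables returning a truthy value on success;
    failures and timeouts are treated the same way."""
    with ThreadPoolExecutor(max_workers=max(1, len(probes))) as pool:
        for probe in probes:
            future = pool.submit(probe)
            try:
                if future.result(timeout=timeout_s):
                    return True  # one healthy cluster is enough
            except Exception:
                continue  # timeout or probe error: try the next cluster
    return False
```

Probing sequentially with a short per-cluster timeout keeps the endpoint's worst-case latency bounded at roughly `timeout_s * len(probes)`.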
Thorough Health Check
Comprehensive check that verifies all required tables exist:
- All enabled storage tables exist in ClickHouse
- All query nodes are accessible
- All storage nodes are healthy
- Verifies table schema integrity
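The core of the thorough check is a set difference between the tables each enabled storage requires and the tables ClickHouse actually reports. Illustrative sketch; the storage and table names below are made up:

```python
def missing_tables(required, present):
    """Report, per storage, any required table absent from ClickHouse.

    required: {storage_name: [table names the storage needs]}
    present:  table names as returned by SHOW TABLES
    """
    missing = {}
    for storage, tables in required.items():
        absent = sorted(set(tables) - set(present))
        if absent:
            missing[storage] = absent
    return missing

# 'present' would come from SHOW TABLES; names here are illustrative.
gaps = missing_tables(
    {"errors": ["errors_local", "errors_dist"], "spans": ["spans_local"]},
    present=["errors_local", "spans_local"],
)
```

An empty result means the thorough check passes; a non-empty one names exactly which storage is broken.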
Envoy Health Check
Special endpoint for Envoy/load balancer integration:

CLI Health Check
Run health checks from the command line:

Health Check Configuration
Customize health check behavior via runtime config:

Key Metrics to Monitor
API Performance Metrics
Query Rate
Monitor query throughput.

Alert on:
- Sudden drops in query rate (> 50% decrease)
- High error rates (> 5% of total queries)
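The 5% error-rate rule can be expressed directly from the query.success and query.error counters; note the guard against division by zero when there is no traffic:

```python
def error_rate(success, errors):
    """Fraction of failed queries; 0.0 when there is no traffic."""
    total = success + errors
    return errors / total if total else 0.0

def should_alert(success, errors, threshold=0.05):
    """Mirror the alert rule: error rate above 5% of total queries."""
    return error_rate(success, errors) > threshold
```

In practice the counts would be windowed sums (e.g. over 15 minutes) so a single bad second does not page anyone.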
Query Latency
Track query performance.

Alert on:
- P95 latency > 5 seconds
- P99 latency > 10 seconds
- Sudden latency spikes (> 3x baseline)
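The latency thresholds are percentile checks over a window of samples. A nearest-rank percentile is sufficient for alert evaluation; the sample values below are synthetic:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the value at rank ceil(q/100 * N)."""
    ordered = sorted(samples)
    rank = max(math.ceil(q / 100.0 * len(ordered)), 1)
    return ordered[rank - 1]

# Synthetic window: mostly fast queries, a slow tail.
latencies_s = [0.2] * 94 + [6.0] * 4 + [12.0] * 2
p95 = percentile(latencies_s, 95)
p99 = percentile(latencies_s, 99)

# The alert rule from this section: P95 > 5s or P99 > 10s.
breach = p95 > 5 or p99 > 10
```

Percentiles, unlike averages, surface the slow tail that users actually experience, which is why both alert thresholds are percentile-based.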
ClickHouse Query Performance
Monitor database execution time.

Alert on:
- Average query time > 2 seconds
- P95 query time > 5 seconds
Consumer Pipeline Metrics
Consumer Lag
Most critical metric for data freshness.

Alert on:
- Lag > 1,000,000 messages (warning)
- Lag > 5,000,000 messages (critical)
- Lag growing consistently over 15 minutes
Consumer lag indicates how far behind Kafka the consumer is. High lag means delayed data visibility.
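Both the absolute thresholds and the "growing consistently" condition are easy to state as code. A sketch, assuming one lag sample per minute so a 15-sample window covers 15 minutes:

```python
def lag_severity(current_lag):
    """Map current lag to the alert levels from this section."""
    if current_lag > 5_000_000:
        return "critical"
    if current_lag > 1_000_000:
        return "warning"
    return "ok"

def lag_growing(lag_samples, window=15):
    """True if lag grew strictly at every step of the last `window`
    samples (assumed one sample per minute)."""
    recent = lag_samples[-window:]
    if len(recent) < window:
        return False  # not enough history to judge a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

The trend check matters because a large but shrinking lag is a recovering consumer, while a small but steadily growing lag is a consumer that can't keep up.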
Message Processing Rate
Track ingestion throughput.

Alert on:
- Processing rate drops to 0 (consumer stuck)
- Processing rate < expected load
Batch Processing Time
Monitor consumer performance.

Alert on:
- Average batch time > 5 seconds
- P95 batch time > 10 seconds
ClickHouse Health Metrics
Connection Pool
Monitor ClickHouse connections.

Alert on:
- Connections at max pool size
- High connection wait times (> 100ms)
Query Failures
Track ClickHouse errors.

Common error codes:
- 241: Memory limit exceeded
- 159: Query timeout
- 202: Too many simultaneous queries
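These codes come from ClickHouse's error-code table, and a query dispatcher can use them to decide whether a retry is worthwhile. The retry policy below is illustrative, not Snuba's actual handling:

```python
# Codes and names from the ClickHouse error-code table.
CLICKHOUSE_ERRORS = {
    241: "MEMORY_LIMIT_EXCEEDED",
    159: "TIMEOUT_EXCEEDED",
    202: "TOO_MANY_SIMULTANEOUS_QUERIES",
}

# Illustrative policy: transient load/timeout errors are worth retrying
# with backoff; memory-limit errors will usually just fail again.
RETRYABLE = {159, 202}

def classify(code):
    """Map a ClickHouse error code to (symbolic name, retryable?)."""
    return CLICKHOUSE_ERRORS.get(code, "UNKNOWN"), code in RETRYABLE
```

Tagging the error counter with the symbolic name rather than the raw code makes dashboards and alerts far easier to read.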
System Resource Metrics
Production Alerts
Critical Alerts
Alerts that require immediate response:

High API Error Rate
Alert: SLO - High API error rate
Threshold: Error rate > 5% for 15 minutes
Response:
- Check ClickHouse health
- Review error logs for common patterns
- Check for recent deployments
- Verify network connectivity
Consumer Lag Critical
Alert: Consumer lag exceeding threshold
Threshold: Lag > 5M messages for 15 minutes
Response:
- Scale consumer replicas
- Check consumer error rates
- Verify Kafka broker health
- Review ClickHouse insert performance
Pod Restart Loop
Alert: Too many restarts on Snuba pods
Threshold: > 3 restarts in 5 minutes
Response:
- Check pod logs for crash reason
- Review resource limits (OOM?)
- Check health check configuration
- Verify dependencies (ClickHouse, Redis, Kafka)
ClickHouse Connection Failures
Alert: Cannot connect to ClickHouse
Threshold: > 10 failed health checks in 5 minutes
Response:
- Verify ClickHouse is running
- Check network connectivity
- Review ClickHouse logs
- Check authentication credentials
Warning Alerts
Alerts requiring investigation but not immediate action:

DataDog Integration
Health Check Monitor
Create a DataDog monitor to check Snuba health:

Custom Dashboards
Key dashboard widgets:

Distributed Tracing
Sentry Integration
Snuba automatically sends traces to Sentry:

Trace Context
Snuba includes rich context in traces:
- Query: Full query text and parameters
- Dataset: Which dataset was queried
- Storage: Which ClickHouse storage was used
- Referrer: Query origin/caller
- Project IDs: Projects involved in query
- Timing Breakdown: Time spent in each phase
Performance Profiling
Enable profiling for performance debugging:

Logging
Log Levels
Configure logging verbosity:

Structured Logging
Snuba uses structured logging with key context:

Important Log Patterns
Query Performance Monitoring
Slow Query Logging
Snuba automatically logs slow queries:

Query Recording
Enable query recording for debugging:

Cost of Goods Sold (COGS) Tracking
Track query costs:

Operational Runbooks
High Consumer Lag
- Identify lag source: Check which storage has lag
- Scale consumers: Increase consumer replicas
- Check ClickHouse: Verify insert performance
- Review batch sizes: May need to increase batch size
- Check for slow queries: Blocking queries can delay inserts
API Error Rate Spike
- Check error types: Identify common error pattern
- Verify ClickHouse: Ensure database is healthy
- Review recent changes: Check for bad deployments
- Check rate limits: May need to adjust limits
- Monitor resource usage: CPU/memory exhaustion?
ClickHouse Connection Issues
- Verify connectivity: Test network path
- Check credentials: Ensure auth is correct
- Review connection pool: May need larger pool
- Check ClickHouse logs: Look for server errors
- Verify DNS resolution: Ensure hostname resolves
Best Practices
- Monitor consumer lag continuously - This is your most important metric
- Set up PagerDuty/OpsGenie integration for critical alerts
- Use log aggregation (Datadog Logs, Elasticsearch) for debugging
- Create custom dashboards for each team’s use cases
- Review slow query logs weekly to identify optimization opportunities
- Test alerts in staging before deploying to production
- Document runbooks for each critical alert
- Set up synthetic monitoring to test query paths proactively