Overview
On-prem BuildBuddy deployments expose detailed operational metrics for:
- Server health and performance
- Request rates and latency
- Cache performance
- Remote execution metrics
- Database and storage metrics
- Resource utilization
Endpoint
Prometheus metrics are exposed at the following endpoint:
- Port: 9090
- Path: /metrics/
- Format: Prometheus text format
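To give a feel for the text format served at this endpoint, the sketch below parses a small hand-written sample. The metric names in the sample and the `parse_samples` helper are illustrative assumptions, not BuildBuddy's actual output or tooling, and the parser only handles simple lines (no escaped quotes inside label values, no timestamps). A real consumer would normally rely on a Prometheus client library instead.

```python
import re

# Hand-written sample in the Prometheus text exposition format.
# The metric names here are illustrative, not taken from BuildBuddy.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
# HELP example_requests_total Hypothetical request counter.
# TYPE example_requests_total counter
example_requests_total{method="GET",code="200"} 1027
example_requests_total{method="POST",code="500"} 3
"""

# Metric name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE):
        print(name, labels, value)
```

Each non-comment line is one time series sample: a metric name, an optional set of labels, and a numeric value.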
Configuration
BuildBuddy Server
Metrics are enabled by default. You can customize the port.
Prometheus Scrape Config
Add BuildBuddy to your Prometheus scrape configuration.
Multiple Instances
For multiple BuildBuddy servers, list each server as a separate scrape target.
Kubernetes Service Discovery
For Kubernetes deployments, use Prometheus's Kubernetes service discovery in place of static targets.
Visualization with Grafana
To view these metrics in a live-updating dashboard, we recommend using Grafana.
Setup
- Install Grafana.
- Add Prometheus data source:
  - Navigate to Configuration > Data Sources
  - Click “Add data source”
  - Select Prometheus
  - Enter Prometheus URL (e.g., http://prometheus:9090)
  - Click “Save & Test”
- Import BuildBuddy dashboard:
  - Go to Dashboards > Import
  - Upload BuildBuddy dashboard JSON (if available)
  - Or create custom dashboard
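The data source step can also be scripted instead of clicked through. As a rough sketch, the snippet below builds the JSON body that Grafana's data source HTTP API (`POST /api/datasources`) accepts for a Prometheus source. The helper name and the default display name are hypothetical, and the snippet only prints the payload; actually sending it would need your Grafana URL and credentials, which are not covered here.

```python
import json


def prometheus_datasource_payload(prometheus_url, name="Prometheus"):
    """Build the JSON body for creating a Prometheus data source in Grafana."""
    return {
        "name": name,            # display name in Grafana (hypothetical default)
        "type": "prometheus",    # identifier of Grafana's Prometheus data source
        "url": prometheus_url,   # same URL you would enter in the UI
        "access": "proxy",       # the Grafana server issues the queries
        "isDefault": True,
    }


if __name__ == "__main__":
    payload = prometheus_datasource_payload("http://prometheus:9090")
    print(json.dumps(payload, indent=2))
```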
Example Dashboard Panels
Request Rate:
Metric Categories
HTTP/gRPC Metrics
- Request count by method and status
- Request duration histograms
- Request size histograms
- Response size histograms
- Concurrent requests
Cache Metrics
- Cache hits and misses
- Cache read/write bytes
- Cache evictions
- Cache size and utilization
- Digest computation time
Remote Execution Metrics
- Executors connected
- Tasks queued, running, completed
- Task duration histograms
- Executor utilization
- Upload/download bytes
Build Event Metrics
- Build events received
- Event processing duration
- Events by type
- Stream errors
Database Metrics
- Query duration
- Connection pool stats
- Transaction counts
- Slow queries
Storage Metrics
- Bytes read/written
- Object count
- Storage errors
- Backend latency
System Metrics
- CPU usage
- Memory usage
- Goroutines
- File descriptors
- GC pause times
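Many of the counts listed above (requests, cache hits and misses, bytes transferred) are typically exposed as ever-increasing counters, so dashboards usually plot their rate of change instead of the raw value. The following is a minimal sketch of that arithmetic, assuming two scrapes of hypothetical hit and miss counters; in practice PromQL's `rate()` does this for you and handles more edge cases.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Per-second increase between two counter samples.

    A counter that went down is treated as a reset (for example a process
    restart), so the current value is taken as the whole increase.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    increase = curr_value - prev_value
    if increase < 0:
        increase = curr_value
    return increase / elapsed_seconds


def hit_ratio(hit_rate, miss_rate):
    """Fraction of lookups that were hits; None when there was no traffic."""
    total = hit_rate + miss_rate
    return hit_rate / total if total > 0 else None


if __name__ == "__main__":
    # Hypothetical counter values from two scrapes taken 15 seconds apart.
    hits = counter_rate(1200, 1500, 15)
    misses = counter_rate(400, 500, 15)
    print(hits, misses, hit_ratio(hits, misses))
```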
Alerting
Example Alert Rules
Create alert rules in Prometheus.
Alertmanager Integration
Configure Alertmanager for notifications.
Performance Monitoring
Key Metrics to Monitor
Latency:
- API request duration (p50, p95, p99)
- Cache read/write latency
- Database query duration
- Remote execution task duration
Throughput:
- Requests per second
- Bytes transferred (cache, storage)
- Build events per second
- Active executors
Errors:
- HTTP/gRPC error rates
- Cache errors
- Storage errors
- Execution failures
Saturation:
- CPU utilization
- Memory usage
- Disk usage
- Connection pool utilization
- Executor queue depth
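Percentiles such as p50, p95, and p99 are usually estimated from histogram buckets, not measured exactly. The sketch below approximates how such an estimate works: it finds the bucket containing the target rank and interpolates linearly inside it, similar in spirit to PromQL's `histogram_quantile`. The bucket boundaries and counts are made up for illustration, and edge cases are simplified.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the (math.inf, total_count) bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical request durations in seconds: (upper bound, cumulative count).
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    for q in (0.5, 0.95, 0.99):
        print(q, estimate_quantile(q, latency_buckets))
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the percentile of interest.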
Capacity Planning
Resource Utilization Queries
CPU Usage:
Troubleshooting
Metrics not available
- Verify BuildBuddy is running.
- Check the metrics endpoint.
- Verify Prometheus can reach BuildBuddy:
  - Check firewall rules
  - Test network connectivity
- Check Prometheus logs.
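The endpoint check in the steps above can be scripted. The sketch below fetches a URL and applies a simple heuristic to decide whether the response looks like Prometheus text output. The host name in the usage line is an assumption (the port and path come from the Endpoint section), and both function names are hypothetical.

```python
import urllib.error
import urllib.request


def looks_like_metrics(body):
    """Heuristic: exposition output normally contains '# TYPE' comment lines."""
    return any(line.startswith("# TYPE ") for line in body.splitlines())


def check_metrics_endpoint(url, timeout=5.0):
    """Return (ok, detail) describing whether url serves Prometheus-style text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-success status code.
        return False, f"HTTP error {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, DNS failure, timeout, and similar problems.
        return False, f"unreachable: {err}"
    if looks_like_metrics(body):
        return True, f"{len(body.splitlines())} lines of metrics"
    return False, "response does not look like Prometheus text format"


if __name__ == "__main__":
    # Host name is an assumption; adjust it to your deployment.
    print(check_metrics_endpoint("http://localhost:9090/metrics/"))
```

Running the same check from the machine that hosts Prometheus helps separate a BuildBuddy problem from a network or firewall problem.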
High cardinality
If metrics cause performance issues:
- Reduce scrape frequency
- Use recording rules for expensive queries
- Drop high-cardinality labels
- Increase Prometheus resources
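Before dropping labels, it helps to know which metrics contribute the most series. As a rough sketch, the snippet below counts sample lines per metric name in a saved copy of the endpoint's output; each distinct label combination is one series. The sample text and metric names are hypothetical, and histogram series (`_bucket`, `_sum`, `_count`) are counted under their own names.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count sample lines (one per label combination) for each metric name."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical excerpt; in practice, read a saved copy of the /metrics/ output.
    sample = (
        '# TYPE example_requests_total counter\n'
        'example_requests_total{user="a"} 1\n'
        'example_requests_total{user="b"} 4\n'
        'example_requests_total{user="c"} 2\n'
        '# TYPE go_goroutines gauge\n'
        'go_goroutines 42\n'
    )
    for name, count in series_per_metric(sample).most_common(5):
        print(f"{count:6d}  {name}")
```

Metrics at the top of this list, especially ones with labels that take unbounded values, are the first candidates for label dropping or recording rules.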
Best Practices
- Set up dashboards early: Establish baselines before issues occur
- Configure alerts: Be proactive about performance degradation
- Monitor trends: Look for gradual changes over time
- Document incidents: Note when alerts fire and resolution steps
- Regular review: Periodically review and update alert thresholds
Related Topics
For a complete list of available metrics with descriptions, build the BuildBuddy source. This generates comprehensive metric documentation at bazel-bin/server/metrics/generate_docs/docs.mdx.