Overview
Grafana is the industry-standard platform for observability and monitoring. In PentAGI, it provides:
- Real-time Dashboards: Visualize metrics from all system components
- Distributed Tracing: Track requests across microservices with Jaeger
- Log Aggregation: Centralized logging with Loki
- Alerting: Proactive notifications for system issues
- Custom Queries: Build custom visualizations using PromQL
Architecture
The Grafana observability stack includes:
- Grafana: Visualization and dashboard platform (port 3000)
- VictoriaMetrics: Time-series metrics database
- Jaeger: Distributed tracing backend with ClickHouse storage
- Loki: Log aggregation and querying
- OpenTelemetry Collector: Unified data collection and processing
- Node Exporter: System metrics collector
- cAdvisor: Container metrics collector
Setup
Configuration
Data Sources
Grafana is pre-configured with three data sources:

VictoriaMetrics (Prometheus)
Metrics storage and querying: datasource.yml

Jaeger
Distributed tracing: datasource.yml

Loki
Log aggregation: datasource.yml
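The contents of datasource.yml are not reproduced here; the following is a minimal sketch of what such a Grafana provisioning file typically looks like. The service hostnames and ports (victoriametrics:8428, jaeger:16686, loki:3100) are assumptions based on common defaults for this stack, not confirmed values:

```yaml
# Sketch of a Grafana datasource provisioning file (datasource.yml).
# Hostnames and ports are assumptions; adjust to the actual stack.
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus            # VictoriaMetrics exposes a Prometheus-compatible API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Files like this are mounted into Grafana's provisioning directory so the data sources exist on first start without manual setup.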
Environment Variables
Key configuration options:

| Variable | Description | Default |
|---|---|---|
| GRAFANA_LISTEN_PORT | Web UI port | 3000 |
| GRAFANA_LISTEN_IP | Bind address | 127.0.0.1 |
| GF_USERS_ALLOW_SIGN_UP | Allow user registration | false |
| GF_EXPLORE_ENABLED | Enable Explore view | true |
| GF_ALERTING_ENABLED | Enable alerting | true |
| GF_UNIFIED_ALERTING_ENABLED | Use unified alerting | true |
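As an illustration, these options could be set in an .env file (the values shown are simply the defaults from the table above):

```
GRAFANA_LISTEN_PORT=3000
GRAFANA_LISTEN_IP=127.0.0.1
GF_USERS_ALLOW_SIGN_UP=false
GF_EXPLORE_ENABLED=true
GF_ALERTING_ENABLED=true
GF_UNIFIED_ALERTING_ENABLED=true
```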
Dashboards
Grafana includes pre-built dashboards for PentAGI:

System Dashboards
Node Exporter Full
Comprehensive host system metrics:
- CPU usage and load average
- Memory and swap utilization
- Disk I/O and space usage
- Network traffic and errors
- System uptime and processes
Docker Engine
Docker daemon metrics:
- Container count and status
- Image and volume statistics
- Docker engine CPU/memory usage
- Network I/O per container
- Storage driver performance
Docker Containers
Per-container resource usage:
- CPU utilization by container
- Memory consumption and limits
- Network traffic breakdown
- Filesystem I/O operations
- Container health status
Component Dashboards
PentAGI Service
Application-specific metrics:
- Request rates and latency
- Agent execution metrics
- LLM API call statistics
- Error rates and types
- Task queue depth
VictoriaMetrics
Metrics database performance:
- Query performance
- Ingestion rate
- Storage usage
- Cache hit rates
- Active series count
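When building custom panels on top of these dashboards, a few representative PromQL queries may help as starting points. The metric names assume the standard Node Exporter and cAdvisor exporters; label names may differ in your setup:

```promql
# Host CPU usage (%), from Node Exporter
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage per container, from cAdvisor
sum by (name) (container_memory_usage_bytes{name!=""})

# Network receive throughput per container
sum by (name) (rate(container_network_receive_bytes_total{name!=""}[5m]))
```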
Usage
Viewing Metrics
- Navigate to Dashboards → Browse
- Select a dashboard category (Server, Components, etc.)
- Click on a dashboard to open it
- Use time range picker to adjust view period
- Hover over graphs for detailed values
Exploring Traces
- Go to Explore view
- Select Jaeger as data source
- Search for traces by:
  - Service name (e.g., pentagi)
  - Operation name
  - Tags and annotations
- Click on a trace to see span details
- Jump to related logs using trace ID
Querying Logs
- Open Explore view
- Select Loki as data source
- Build queries using LogQL:
- Apply filters and aggregations
- Link to traces using derived fields
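A few representative LogQL queries for step 3 (the container label and the pentagi value are assumptions about how logs are labeled in this stack):

```logql
# All logs from the pentagi container
{container="pentagi"}

# Only lines containing "error"
{container="pentagi"} |= "error"

# Error rate over 5-minute windows
sum(rate({container="pentagi"} |= "error" [5m]))
```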
Creating Alerts
- Navigate to Alerting → Alert rules
- Click New alert rule
- Define query conditions:
- Set evaluation interval and thresholds
- Configure notification channels
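As an example, an alert rule's query condition might use a PromQL expression like the following (the 90% threshold and metric choice are illustrative, not project defaults):

```promql
# Fire when host memory usage exceeds 90% for the evaluation period
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
```

Pair a condition like this with a sensible "for" duration so transient spikes don't page anyone.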
Services
Grafana
Visualization platform: docker-compose-observability.yml

VictoriaMetrics
Time-series database: docker-compose-observability.yml

Jaeger
Distributed tracing: docker-compose-observability.yml

Loki
Log aggregation: docker-compose-observability.yml
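The actual docker-compose-observability.yml is not reproduced here; the following trimmed sketch shows the general shape of such a stack. Image tags, volume names, and port mappings are assumptions, not the project's real file:

```yaml
# Sketch only; consult the project's docker-compose-observability.yml
# for the authoritative service definitions.
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "127.0.0.1:3000:3000"   # GRAFANA_LISTEN_IP:GRAFANA_LISTEN_PORT
    environment:
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    volumes:
      - vm-data:/storage
  jaeger:
    image: jaegertracing/all-in-one:latest
  loki:
    image: grafana/loki:latest
volumes:
  grafana-data:
  vm-data:
```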
OpenTelemetry Integration
The OpenTelemetry Collector provides unified data collection. See the OpenTelemetry documentation for detailed configuration.

Metrics Collection
OTel scrapes metrics from:
- Node Exporter (system metrics)
- cAdvisor (container metrics)
- ClickHouse (database metrics)
- Jaeger (tracing metrics)
- Loki (log metrics)
Traces Processing
OTel receives traces via:
- gRPC endpoint: otelcol:8148
- HTTP endpoint: otelcol:4318
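An application exporting traces to this collector would typically point its OTLP exporter at one of these endpoints, for example via the standard OpenTelemetry environment variables (endpoints taken from the list above):

```
# gRPC exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:8148
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# or the HTTP/OTLP exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```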
Logs Processing
OTel forwards logs to Loki via the OTLP HTTP protocol.

Troubleshooting

Grafana Not Loading
Check service status:

No Metrics Showing
Verify data collection:

Traces Not Appearing
Debug the tracing pipeline:

Dashboard Errors
Resolve panel issues:
- Check data source configuration in Settings → Data sources
- Verify query syntax in panel edit mode
- Review time range and aggregation intervals
- Check Grafana logs for specific errors
- Validate data is actually being collected
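The checks above can be run from the host with commands like these (service names and the VictoriaMetrics port are assumptions based on the stack described earlier; adjust to your deployment):

```shell
# Is the stack up?
docker compose -f docker-compose-observability.yml ps

# Grafana not loading: inspect its logs
docker compose -f docker-compose-observability.yml logs grafana

# No metrics: confirm VictoriaMetrics answers queries (assumed port 8428)
curl -s 'http://localhost:8428/api/v1/query?query=up'

# Traces not appearing: check the collector's logs for export errors
docker compose -f docker-compose-observability.yml logs otelcol
```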
Best Practices
Dashboard Design
- Use consistent color schemes across dashboards
- Group related panels together
- Add descriptions and documentation links
- Set appropriate refresh intervals (default: 5s)
- Use variables for dynamic filtering
Query Optimization
- Limit time ranges for heavy queries
- Use recording rules for complex calculations
- Apply appropriate aggregation intervals
- Filter unnecessary labels early
- Cache frequently-used queries
Alerting Strategy
- Set meaningful alert thresholds
- Use severity levels (warning, critical)
- Configure proper notification channels
- Avoid alert fatigue with sensible conditions
- Document alert resolution procedures
Resource Management
- Configure data retention policies
- Monitor disk usage regularly
- Set up log rotation for long-term storage
- Archive old dashboards
- Clean up unused data sources
Related Documentation
- OpenTelemetry - Data collection configuration
- Langfuse - LLM observability
- Observability Guide - Complete monitoring setup