Grafana provides comprehensive system monitoring and visualization dashboards for PentAGI. Monitor resource usage, track performance metrics, and visualize system health in real-time.

Overview

Grafana is the industry-standard platform for observability and monitoring. In PentAGI, it provides:
  • Real-time Dashboards: Visualize metrics from all system components
  • Distributed Tracing: Track requests across microservices with Jaeger
  • Log Aggregation: Centralized logging with Loki
  • Alerting: Proactive notifications for system issues
  • Custom Queries: Build custom visualizations using PromQL

Architecture

The Grafana observability stack includes:
  • Grafana: Visualization and dashboard platform (port 3000)
  • VictoriaMetrics: Time-series metrics database
  • Jaeger: Distributed tracing backend with ClickHouse storage
  • Loki: Log aggregation and querying
  • OpenTelemetry Collector: Unified data collection and processing
  • Node Exporter: System metrics collector
  • cAdvisor: Container metrics collector

Setup

Step 1: Configure Environment Variables

Edit your .env file to enable OpenTelemetry:
.env
# Enable observability integration
OTEL_HOST=otelcol:8148

# Grafana settings
GRAFANA_LISTEN_PORT=3000
GRAFANA_LISTEN_IP=127.0.0.1

# OpenTelemetry settings
OTEL_GRPC_LISTEN_PORT=8148
OTEL_HTTP_LISTEN_PORT=4318
Step 2: Download Docker Compose File

Download the observability stack configuration:
curl -O https://raw.githubusercontent.com/vxcontrol/pentagi/master/docker-compose-observability.yml
Step 3: Start Observability Stack

Launch the observability services:
docker compose -f docker-compose.yml -f docker-compose-observability.yml up -d
Verify services are running:
docker compose ps grafana victoriametrics jaeger loki otel
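Beyond checking container status, you can probe each service's HTTP health endpoint directly. The script below is a quick sketch, not part of the official tooling; the endpoints are the defaults for this stack, so adjust ports if you changed them in .env:

```shell
#!/bin/sh
# Quick health sweep over the observability services (assumes default
# ports from docker-compose-observability.yml; adjust if customized).
check() {
  name=$1; url=$2
  if curl -sf -o /dev/null --max-time 3 "$url"; then
    echo "$name: up"
  else
    echo "$name: DOWN ($url)"
  fi
}
check grafana         http://localhost:3000/api/health
check victoriametrics http://localhost:8428/health
check loki            http://localhost:3100/ready
check jaeger          http://localhost:16686/
```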
Step 4: Access Grafana

Open your browser and navigate to:
http://localhost:3000
Default credentials:
  • Username: admin
  • Password: admin (you’ll be prompted to change it)

Configuration

Data Sources

Grafana is pre-configured with three data sources:

VictoriaMetrics (Prometheus)

Metrics storage and querying:
datasource.yml
- name: VictoriaMetrics
  uid: victoriametrics
  type: prometheus
  url: http://victoriametrics:8428
  jsonData:
    manageAlerts: true

Jaeger

Distributed tracing:
datasource.yml
- name: Jaeger
  uid: jaeger
  type: jaeger
  url: http://jaeger:16686
  jsonData:
    nodeGraph:
      enabled: true
    tracesToLogs:
      datasourceUid: loki
      filterByTraceID: true

Loki

Log aggregation:
datasource.yml
- name: Loki
  uid: loki
  type: loki
  url: http://loki:3100
  isDefault: true
  jsonData:
    maxLines: 1000
    derivedFields:
      - datasourceUid: jaeger
        matcherRegex: trace_id
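Note that for Grafana to actually extract a trace ID from a log line, matcherRegex generally needs a capture group. A fuller, hypothetical variant of the derived field (not the shipped file; field values are illustrative) might look like:

```yaml
# Hypothetical expanded derived-field config: the capture group pulls
# the ID out of log lines such as "trace_id=4bf92f35".
derivedFields:
  - datasourceUid: jaeger
    name: TraceID
    matcherRegex: 'trace_id=(\w+)'
    url: '$${__value.raw}'   # $$ escapes Grafana's env interpolation
```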

Environment Variables

Key configuration options:
| Variable | Description | Default |
| --- | --- | --- |
| GRAFANA_LISTEN_PORT | Web UI port | 3000 |
| GRAFANA_LISTEN_IP | Bind address | 127.0.0.1 |
| GF_USERS_ALLOW_SIGN_UP | Allow user registration | false |
| GF_EXPLORE_ENABLED | Enable Explore view | true |
| GF_ALERTING_ENABLED | Enable alerting | true |
| GF_UNIFIED_ALERTING_ENABLED | Use unified alerting | true |

Dashboards

Grafana includes pre-built dashboards for PentAGI:

System Dashboards

Node Exporter Full

Comprehensive host system metrics:
  • CPU usage and load average
  • Memory and swap utilization
  • Disk I/O and space usage
  • Network traffic and errors
  • System uptime and processes

Docker Engine

Docker daemon metrics:
  • Container count and status
  • Image and volume statistics
  • Docker engine CPU/memory usage
  • Network I/O per container
  • Storage driver performance

Docker Containers

Per-container resource usage:
  • CPU utilization by container
  • Memory consumption and limits
  • Network traffic breakdown
  • Filesystem I/O operations
  • Container health status

Component Dashboards

PentAGI Service

Application-specific metrics:
  • Request rates and latency
  • Agent execution metrics
  • LLM API call statistics
  • Error rates and types
  • Task queue depth

VictoriaMetrics

Metrics database performance:
  • Query performance
  • Ingestion rate
  • Storage usage
  • Cache hit rates
  • Active series count

Usage

Viewing Metrics

  1. Navigate to Dashboards → Browse
  2. Select a dashboard category (Server, Components, etc.)
  3. Click on a dashboard to open it
  4. Use time range picker to adjust view period
  5. Hover over graphs for detailed values

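As a starting point for custom panels, a few PromQL queries against the node_exporter metrics collected by this stack (sketches; metric names assume default node_exporter naming):

```promql
# CPU utilization (%), averaged across all cores
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Memory in use (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Root filesystem free space (%)
100 * node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"}
```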
Exploring Traces

  1. Go to Explore view
  2. Select Jaeger as data source
  3. Search for traces by:
    • Service name (e.g., pentagi)
    • Operation name
    • Tags and annotations
  4. Click on a trace to see span details
  5. Jump to related logs using trace ID

Querying Logs

  1. Open Explore view
  2. Select Loki as data source
  3. Build queries using LogQL:
    {container_name="pentagi"} |= "error"
    
  4. Apply filters and aggregations
  5. Link to traces using derived fields
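A few more LogQL patterns that may be useful here (the container_name label follows the example above; the JSON fields are assumptions about PentAGI's log format):

```logql
# Errors from the pentagi container, case-insensitive
{container_name="pentagi"} |~ `(?i)error`

# Parse JSON logs and filter by level
{container_name="pentagi"} | json | level="error"

# Error rate over 5-minute windows
sum(rate({container_name="pentagi"} |= "error" [5m]))
```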

Creating Alerts

  1. Navigate to Alerting → Alert rules
  2. Click New alert rule
  3. Define query conditions:
    rate(http_requests_total{status="500"}[5m]) > 0.1
    
  4. Set evaluation interval and thresholds
  5. Configure notification channels
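Some example alert expressions in the same spirit (metric names beyond the http_requests_total example above come from node_exporter and the scrape pipeline, and may need adapting to what is actually exported):

```promql
# Host memory pressure: less than 10% available
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10

# Root filesystem nearly full
node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} < 0.15

# A scrape target has gone down
up == 0
```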

Services

Grafana

Visualization platform:
docker-compose-observability.yml
grafana:
  image: grafana/grafana:11.4.0
  ports:
    - "127.0.0.1:3000:3000"
  volumes:
    - ./observability/grafana/config:/etc/grafana:rw
    - ./observability/grafana/dashboards:/var/lib/grafana/dashboards:rw
    - grafana-data:/var/lib/grafana:rw

VictoriaMetrics

Time-series database:
docker-compose-observability.yml
victoriametrics:
  image: victoriametrics/victoria-metrics:v1.108.1
  command:
    - --storageDataPath=/storage
    - --httpListenAddr=:8428
    - --selfScrapeInterval=10s
  volumes:
    - victoriametrics-data:/storage:rw

Jaeger

Distributed tracing:
docker-compose-observability.yml
jaeger:
  image: jaegertracing/all-in-one:1.56.0
  depends_on:
    - clickstore
  environment:
    SPAN_STORAGE_TYPE: grpc-plugin

Loki

Log aggregation:
docker-compose-observability.yml
loki:
  image: grafana/loki:3.3.2
  command: -config.file=/etc/loki/config.yml
  volumes:
    - ./observability/loki/config.yml:/etc/loki/config.yml:ro

OpenTelemetry Integration

The OpenTelemetry Collector provides unified data collection. See the OpenTelemetry documentation for detailed configuration.

Metrics Collection

OTel scrapes metrics from:
  • Node Exporter (system metrics)
  • cAdvisor (container metrics)
  • ClickHouse (database metrics)
  • Jaeger (tracing metrics)
  • Loki (log metrics)

Traces Processing

OTel receives traces via:
  • gRPC endpoint: otelcol:8148
  • HTTP endpoint: otelcol:4318
It then exports traces to Jaeger for storage and visualization.

Logs Processing

OTel forwards logs to Loki via OTLP HTTP protocol.
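Putting these endpoints together, a minimal collector pipeline could look like the following sketch (the shipped otel configuration may differ; the exporter names and the Loki OTLP path are assumptions based on current OpenTelemetry Collector and Loki conventions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:8148
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp/jaeger:          # traces to Jaeger's OTLP receiver
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlphttp/loki:        # logs to Loki's native OTLP HTTP endpoint
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```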

Troubleshooting

Grafana Not Loading

Check service status:
# View Grafana logs
docker compose logs grafana

# Verify Grafana is listening
docker exec grafana netstat -tlnp | grep 3000

# Check data source connectivity
docker exec grafana curl http://victoriametrics:8428

No Metrics Showing

Verify data collection:
# Check OpenTelemetry collector
docker compose logs otel

# Query VictoriaMetrics directly
curl 'http://localhost:8428/api/v1/query?query=up'

# Verify Prometheus scrape targets
curl 'http://localhost:8428/api/v1/targets'

Traces Not Appearing

Debug tracing pipeline:
# Check Jaeger health
curl http://localhost:16686/

# Verify ClickHouse storage
docker exec clickstore clickhouse-client --query "SHOW DATABASES"

# Review OTLP receiver logs
docker compose logs otel | grep otlp

Dashboard Errors

Resolve panel issues:
  1. Check data source configuration in Settings → Data sources
  2. Verify query syntax in panel edit mode
  3. Review time range and aggregation intervals
  4. Check Grafana logs for specific errors
  5. Validate data is actually being collected

Best Practices

Dashboard Design

  • Use consistent color schemes across dashboards
  • Group related panels together
  • Add descriptions and documentation links
  • Set appropriate refresh intervals (default: 5s)
  • Use variables for dynamic filtering

Query Optimization

  • Limit time ranges for heavy queries
  • Use recording rules for complex calculations
  • Apply appropriate aggregation intervals
  • Filter unnecessary labels early
  • Cache frequently-used queries

Alerting Strategy

  • Set meaningful alert thresholds
  • Use severity levels (warning, critical)
  • Configure proper notification channels
  • Avoid alert fatigue with sensible conditions
  • Document alert resolution procedures

Resource Management

  • Configure data retention policies
  • Monitor disk usage regularly
  • Set up log rotation for long-term storage
  • Archive old dashboards
  • Clean up unused data sources
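For the retention point above, both backends support it natively. A sketch (the flag and field names follow current VictoriaMetrics and Loki documentation, but verify them against the versions pinned in the compose file):

```yaml
# VictoriaMetrics: keep ~3 months of metrics (compose command flag;
# the value is interpreted as months by default)
victoriametrics:
  command:
    - --retentionPeriod=3

# Loki: compactor-based deletion with a 31-day limit (config.yml)
compactor:
  retention_enabled: true
limits_config:
  retention_period: 744h
```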
