Overview
Grafana is the industry-standard platform for observability and monitoring. In PentAGI, it provides:
- Real-time Dashboards: Visualize metrics from all system components
- Distributed Tracing: Track requests across microservices with Jaeger
- Log Aggregation: Centralized logging with Loki
- Alerting: Proactive notifications for system issues
- Custom Queries: Build custom visualizations using PromQL
Architecture
The Grafana observability stack includes:
- Grafana: Visualization and dashboard platform (port 3000)
- VictoriaMetrics: Time-series metrics database
- Jaeger: Distributed tracing backend with ClickHouse storage
- Loki: Log aggregation and querying
- OpenTelemetry Collector: Unified data collection and processing
- Node Exporter: System metrics collector
- cAdvisor: Container metrics collector
Setup
Configuration
Data Sources
Grafana is pre-configured with three data sources:

VictoriaMetrics (Prometheus)
Metrics storage and querying: datasource.yml

Jaeger
Distributed tracing: datasource.yml

Loki
Log aggregation: datasource.yml
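The contents of datasource.yml are not reproduced here; the following is a minimal sketch of what such a Grafana provisioning file typically looks like. The service hostnames and ports (victoriametrics:8428, jaeger:16686, loki:3100) are assumptions based on common defaults for this stack, not confirmed values:

```yaml
# Sketch of a Grafana datasource provisioning file (datasource.yml).
# Hostnames and ports are assumptions; adjust to the actual stack.
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus            # VictoriaMetrics exposes a Prometheus-compatible API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Files like this are mounted into Grafana's provisioning directory so the data sources exist on first start without manual setup.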
Environment Variables
Key configuration options:

| Variable | Description | Default |
|---|---|---|
| GRAFANA_LISTEN_PORT | Web UI port | 3000 |
| GRAFANA_LISTEN_IP | Bind address | 127.0.0.1 |
| GF_USERS_ALLOW_SIGN_UP | Allow user registration | false |
| GF_EXPLORE_ENABLED | Enable Explore view | true |
| GF_ALERTING_ENABLED | Enable alerting | true |
| GF_UNIFIED_ALERTING_ENABLED | Use unified alerting | true |
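As an illustration, these options could be set in an .env file (the values shown are simply the defaults from the table above):

```
GRAFANA_LISTEN_PORT=3000
GRAFANA_LISTEN_IP=127.0.0.1
GF_USERS_ALLOW_SIGN_UP=false
GF_EXPLORE_ENABLED=true
GF_ALERTING_ENABLED=true
GF_UNIFIED_ALERTING_ENABLED=true
```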
Dashboards
Grafana includes pre-built dashboards for PentAGI:

System Dashboards
Node Exporter Full
Comprehensive host system metrics:
- CPU usage and load average
- Memory and swap utilization
- Disk I/O and space usage
- Network traffic and errors
- System uptime and processes
Docker Engine
Docker daemon metrics:
- Container count and status
- Image and volume statistics
- Docker engine CPU/memory usage
- Network I/O per container
- Storage driver performance
Docker Containers
Per-container resource usage:
- CPU utilization by container
- Memory consumption and limits
- Network traffic breakdown
- Filesystem I/O operations
- Container health status
Component Dashboards
PentAGI Service
Application-specific metrics:
- Request rates and latency
- Agent execution metrics
- LLM API call statistics
- Error rates and types
- Task queue depth
VictoriaMetrics
Metrics database performance:
- Query performance
- Ingestion rate
- Storage usage
- Cache hit rates
- Active series count
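When building custom panels on top of these dashboards, a few representative PromQL queries may help as starting points. The metric names assume the standard Node Exporter and cAdvisor exporters; label names may differ in your setup:

```promql
# Host CPU usage (%), from Node Exporter
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage per container, from cAdvisor
sum by (name) (container_memory_usage_bytes{name!=""})

# Network receive throughput per container
sum by (name) (rate(container_network_receive_bytes_total{name!=""}[5m]))
```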
Usage
Viewing Metrics
- Navigate to Dashboards → Browse
- Select a dashboard category (Server, Components, etc.)
- Click on a dashboard to open it
- Use time range picker to adjust view period
- Hover over graphs for detailed values
Exploring Traces
- Go to Explore view
- Select Jaeger as data source
- Search for traces by:
  - Service name (e.g., pentagi)
  - Operation name
  - Tags and annotations
- Click on a trace to see span details
- Jump to related logs using trace ID
Querying Logs
- Open Explore view
- Select Loki as data source
- Build queries using LogQL:
- Apply filters and aggregations
- Link to traces using derived fields
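A few representative LogQL queries for step 3 (the container label and the pentagi value are assumptions about how logs are labeled in this stack):

```logql
# All logs from the pentagi container
{container="pentagi"}

# Only lines containing "error"
{container="pentagi"} |= "error"

# Error rate over 5-minute windows
sum(rate({container="pentagi"} |= "error" [5m]))
```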
Creating Alerts
- Navigate to Alerting → Alert rules
- Click New alert rule
- Define query conditions:
- Set evaluation interval and thresholds
- Configure notification channels
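As an example, an alert rule's query condition might use a PromQL expression like the following (the 90% threshold and metric choice are illustrative, not project defaults):

```promql
# Fire when host memory usage exceeds 90% for the evaluation period
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
```

Pair a condition like this with a sensible "for" duration so transient spikes don't page anyone.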
Services
Grafana
Visualization platform: docker-compose-observability.yml

VictoriaMetrics
Time-series database: docker-compose-observability.yml

Jaeger
Distributed tracing: docker-compose-observability.yml

Loki
Log aggregation: docker-compose-observability.yml
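The actual docker-compose-observability.yml is not reproduced here; the following trimmed sketch shows the general shape of such a stack. Image tags, volume names, and port mappings are assumptions, not the project's real file:

```yaml
# Sketch only; consult the project's docker-compose-observability.yml
# for the authoritative service definitions.
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "127.0.0.1:3000:3000"   # GRAFANA_LISTEN_IP:GRAFANA_LISTEN_PORT
    environment:
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    volumes:
      - vm-data:/storage
  jaeger:
    image: jaegertracing/all-in-one:latest
  loki:
    image: grafana/loki:latest
volumes:
  grafana-data:
  vm-data:
```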
OpenTelemetry Integration
The OpenTelemetry Collector provides unified data collection. See the OpenTelemetry documentation for detailed configuration.

Metrics Collection
OTel scrapes metrics from:
- Node Exporter (system metrics)
- cAdvisor (container metrics)
- ClickHouse (database metrics)
- Jaeger (tracing metrics)
- Loki (log metrics)
Traces Processing
OTel receives traces via:
- gRPC endpoint: otelcol:8148
- HTTP endpoint: otelcol:4318
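An application exporting traces to this collector would typically point its OTLP exporter at one of these endpoints, for example via the standard OpenTelemetry environment variables (endpoints taken from the list above):

```
# gRPC exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:8148
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# or the HTTP/OTLP exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```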
Logs Processing
OTel forwards logs to Loki via the OTLP HTTP protocol.

Troubleshooting

Grafana Not Loading
Check service status:

No Metrics Showing
Verify data collection:

Traces Not Appearing
Debug the tracing pipeline:

Dashboard Errors
Resolve panel issues:
- Check data source configuration in Settings → Data sources
- Verify query syntax in panel edit mode
- Review time range and aggregation intervals
- Check Grafana logs for specific errors
- Validate data is actually being collected
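The checks above can be run from the host with commands like these (service names and the VictoriaMetrics port are assumptions based on the stack described earlier; adjust to your deployment):

```shell
# Is the stack up?
docker compose -f docker-compose-observability.yml ps

# Grafana not loading: inspect its logs
docker compose -f docker-compose-observability.yml logs grafana

# No metrics: confirm VictoriaMetrics answers queries (assumed port 8428)
curl -s 'http://localhost:8428/api/v1/query?query=up'

# Traces not appearing: check the collector's logs for export errors
docker compose -f docker-compose-observability.yml logs otelcol
```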
Best Practices
Dashboard Design
- Use consistent color schemes across dashboards
- Group related panels together
- Add descriptions and documentation links
- Set appropriate refresh intervals (default: 5s)
- Use variables for dynamic filtering
Query Optimization
- Limit time ranges for heavy queries
- Use recording rules for complex calculations
- Apply appropriate aggregation intervals
- Filter unnecessary labels early
- Cache frequently-used queries
Alerting Strategy
- Set meaningful alert thresholds
- Use severity levels (warning, critical)
- Configure proper notification channels
- Avoid alert fatigue with sensible conditions
- Document alert resolution procedures
Resource Management
- Configure data retention policies
- Monitor disk usage regularly
- Set up log rotation for long-term storage
- Archive old dashboards
- Clean up unused data sources
Related Documentation
- OpenTelemetry - Data collection configuration
- Langfuse - LLM observability
- Observability Guide - Complete monitoring setup