OpenTelemetry (OTel) provides standardized collection, processing, and export of telemetry data (metrics, traces, and logs) for PentAGI. It serves as the central hub for all observability data flowing through the system.

Overview

OpenTelemetry is a vendor-neutral observability framework that provides:
  • Unified Collection: Single endpoint for all telemetry data
  • Data Processing: Transform, filter, and enrich observability data
  • Multiple Exporters: Send data to various backends simultaneously
  • Standards-Based: Industry-standard OTLP protocol support
  • Extensible: Rich ecosystem of receivers, processors, and exporters

Architecture

The OpenTelemetry Collector acts as the central data pipeline: it receives OTLP data from PentAGI, scrapes Prometheus-compatible endpoints, and exports traces to Jaeger, logs to Loki, and metrics to VictoriaMetrics.

Setup

Step 1: Configure OpenTelemetry Endpoint

Enable OTel in your .env file:
.env
# OpenTelemetry configuration
OTEL_HOST=otelcol:8148

# OTel Collector ports
OTEL_GRPC_LISTEN_PORT=8148
OTEL_HTTP_LISTEN_PORT=4318
OTEL_GRPC_LISTEN_IP=127.0.0.1
OTEL_HTTP_LISTEN_IP=127.0.0.1
PentAGI will automatically send telemetry to the OTel collector when OTEL_HOST is set.
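
The OTEL_HOST value is a bare host:port pair with no scheme. A minimal sketch of splitting it into its parts (the helper name is ours, not PentAGI's):

```python
def parse_otel_host(value: str) -> tuple[str, int]:
    """Split a host:port pair such as 'otelcol:8148' into (host, port)."""
    host, _, port = value.rpartition(":")
    return host, int(port)

host, port = parse_otel_host("otelcol:8148")
print(host, port)  # otelcol 8148
```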
Step 2: Deploy Observability Stack

The OpenTelemetry Collector is included in the observability stack:
curl -O https://raw.githubusercontent.com/vxcontrol/pentagi/master/docker-compose-observability.yml
docker compose -f docker-compose.yml -f docker-compose-observability.yml up -d
Step 3: Verify Data Collection

Check that OTel is receiving data:
# View collector logs
docker compose logs -f otel

# Check health endpoint
curl http://localhost:13133/

Configuration

The OTel Collector is configured via /observability/otel/config.yml:

Receivers

Data collection endpoints:
config.yml
receivers:
  # OTLP protocol (from PentAGI)
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:8148
      http:
        endpoint: 0.0.0.0:4318
  
  # Prometheus scraping (system metrics)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['node-exporter:9100']
        - job_name: 'clickhouse-collector'
          static_configs:
            - targets: ['clickstore:9363']
        - job_name: 'jaeger-collector'
          static_configs:
            - targets: ['jaeger:14269', 'jaeger:9090']
  
  # Docker metrics
  prometheus/docker:
    config:
      scrape_configs:
        - job_name: 'docker-container-collector'
          static_configs:
            - targets: ['cadvisor:8080']

Processors

Data transformation and filtering:
config.yml
processors:
  # Batch processing for efficiency
  batch:
    timeout: 5s
    send_batch_size: 1000
  
  # Attribute manipulation
  attributes:
    actions:
      - key: service_name_extracted
        action: delete
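
To see what the two batch triggers do, here is a toy Python model of the flush logic: flush when the batch reaches send_batch_size, or when the oldest queued item is older than the timeout. It is illustrative only, not the collector's implementation:

```python
def batch_flushes(event_times, timeout=5.0, send_batch_size=1000):
    """Return the sizes of flushed batches for a sequence of event timestamps (seconds)."""
    batches, current, started = [], 0, None
    for t in event_times:
        if started is None:
            started = t  # timestamp of the oldest item in the current batch
        current += 1
        # Flush on either trigger: batch full, or oldest item past the timeout.
        if current >= send_batch_size or t - started >= timeout:
            batches.append(current)
            current, started = 0, None
    if current:
        batches.append(current)  # final partial batch
    return batches

# 2500 events arriving at once flush as two full batches plus a remainder.
print(batch_flushes([0.0] * 2500))  # [1000, 1000, 500]
```

Larger batches mean fewer export calls; the timeout bounds how stale a small batch can get.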

Exporters

Data output destinations:
config.yml
exporters:
  # Traces to Jaeger
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
  
  # Logs to Loki
  otlphttp:
    endpoint: http://loki:3100/otlp
  
  # Metrics to VictoriaMetrics
  prometheusremotewrite/local:
    endpoint: http://victoriametrics:8428/api/v1/write

Pipelines

Data flow configuration:
config.yml
service:
  pipelines:
    # Traces pipeline
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    
    # Logs pipeline
    logs:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]
    
    # Metrics pipeline
    metrics:
      receivers: [otlp, prometheus, prometheus/docker]
      processors: [batch]
      exporters: [prometheusremotewrite/local]
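
One frequent misconfiguration is a pipeline referencing an exporter that was never defined. The wiring above can be cross-checked mechanically; a Python sketch using the names from this config:

```python
# Pipeline wiring copied from the service section above.
pipelines = {
    "traces":  {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["otlp"]},
    "logs":    {"receivers": ["otlp"], "processors": ["attributes", "batch"], "exporters": ["otlphttp"]},
    "metrics": {"receivers": ["otlp", "prometheus", "prometheus/docker"],
                "processors": ["batch"], "exporters": ["prometheusremotewrite/local"]},
}
defined_exporters = {"otlp", "otlphttp", "prometheusremotewrite/local"}

# Any exporter referenced by a pipeline but missing from the exporters section.
missing = {e for p in pipelines.values() for e in p["exporters"]} - defined_exporters
print(missing)  # set() -> every referenced exporter is defined
```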

Telemetry Types

Traces

Distributed tracing data:
Source: PentAGI application spans
Flow: PentAGI → OTel → Jaeger
Usage: Track request flow through the system
// Example: PentAGI automatically creates spans
ctx, span := tracer.Start(ctx, "agent.execute")
defer span.End()

Metrics

Numerical measurements over time.
Sources:
  • PentAGI application metrics (OTLP)
  • Node Exporter (system metrics)
  • cAdvisor (container metrics)
  • Component health checks
Flow: Sources → OTel → VictoriaMetrics
Usage: Monitor performance and resource usage

Logs

Structured log events:
Source: PentAGI application logs
Flow: PentAGI → OTel → Loki
Usage: Debug issues and audit operations

Integration

PentAGI Integration

PentAGI automatically sends telemetry when configured:
.env
OTEL_HOST=otelcol:8148
The service will:
  1. Create spans for agent operations
  2. Export application metrics
  3. Send structured logs
  4. Include trace context in all operations

Langfuse Integration

Connect Langfuse to OTel for unified observability:
.env
LANGFUSE_OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:4318
LANGFUSE_OTEL_SERVICE_NAME=langfuse
This enables:
  • LLM traces in Jaeger
  • Langfuse metrics in Grafana
  • Unified log aggregation

Monitoring

Collector Health

Built-in health endpoints:
# Check collector status
curl http://localhost:13133/

# View metrics
curl http://localhost:8888/metrics

# Check zpages (detailed internal state)
curl http://localhost:55679/debug/tracez
curl http://localhost:55679/debug/servicez

Performance Metrics

Key metrics to monitor:
Metric                                     Description
otelcol_receiver_accepted_spans            Spans received
otelcol_receiver_refused_spans             Spans rejected
otelcol_exporter_sent_spans                Spans exported
otelcol_processor_batch_batch_send_size    Batch sizes
otelcol_processor_batch_timeout_trigger    Batch timeouts
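
A useful derived signal is the refusal ratio: refused spans divided by all incoming spans. A sustained ratio above a few percent means the collector is shedding data (illustrative calculation):

```python
def refusal_ratio(accepted: int, refused: int) -> float:
    """Fraction of incoming spans the receiver rejected."""
    total = accepted + refused
    return refused / total if total else 0.0

print(refusal_ratio(9_900, 100))  # 0.01 -> 1% of spans refused
```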

Resource Usage

Monitor collector resource consumption:
# Check memory and CPU
docker stats otel

# View detailed metrics
docker exec otel curl localhost:8888/metrics | grep process

Troubleshooting

No Data Flowing

Verify collector is receiving data:
# Check OTLP receivers
docker compose logs otel | grep "Starting OTLP"

# Test gRPC endpoint
grpcurl -plaintext localhost:8148 list

# Test HTTP endpoint
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{}'
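
The empty '{}' body above only verifies that the endpoint responds. A slightly more realistic probe posts an empty resourceSpans envelope, which is a valid OTLP/HTTP JSON document. A Python sketch (the port assumes the defaults on this page; the actual request line is commented out so the snippet runs without a collector):

```python
import json
import urllib.request

# Minimal OTLP/HTTP trace envelope: valid JSON shape, zero spans.
payload = json.dumps({"resourceSpans": []}).encode()

req = urllib.request.Request(
    "http://localhost:4318/v1/traces",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with a running collector
```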

Connection Refused

Check network connectivity:
# From PentAGI container
docker exec pentagi ping otelcol
docker exec pentagi telnet otelcol 8148

# Verify networks
docker network inspect observability-network
docker network inspect pentagi-network

High Memory Usage

Optimize collector configuration:
config.yml
processors:
  batch:
    timeout: 1s          # Flush more frequently
    send_batch_size: 100 # Smaller batches
  
  memory_limiter:
    check_interval: 1s
    limit_mib: 512       # Limit memory usage
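
A rough way to choose limit_mib is to estimate how much memory queued batches can hold. The average span size and in-flight batch count below are assumptions for illustration, not measured values:

```python
def batch_memory_mib(send_batch_size: int, avg_span_bytes: int, in_flight_batches: int) -> float:
    """Back-of-envelope upper bound on memory held by queued batches."""
    return send_batch_size * avg_span_bytes * in_flight_batches / (1024 * 1024)

# 100-span batches at an assumed 2 KiB/span, 10 batches in flight.
print(round(batch_memory_mib(100, 2_048, 10), 1))  # 2.0 (MiB)
```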

Export Failures

Debug exporter issues:
# Enable debug logging
docker compose logs otel | grep -i error

# Check exporter connectivity
docker exec otel curl http://victoriametrics:8428/health
docker exec otel curl http://loki:3100/ready

Advanced Configuration

Sampling

Reduce trace volume:
config.yml
processors:
  probabilistic_sampler:
    sampling_percentage: 10.0  # Sample 10% of traces

service:
  pipelines:
    traces:
      processors: [probabilistic_sampler, batch]
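
Probabilistic sampling decides per trace ID, so every span of a given trace shares the same fate. A toy hash-based model of the idea (not the collector's exact algorithm):

```python
import hashlib

def sampled(trace_id: str, percentage: float) -> bool:
    """Keep a trace iff its hashed ID falls below the sampling threshold."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < percentage * 100  # percentage 10.0 -> keep ~10% of IDs

# Same trace ID always gets the same decision; overall rate approaches 10%.
keep = sum(sampled(f"trace-{i}", 10.0) for i in range(10_000))
print(f"kept {keep} of 10000")  # roughly 1000
```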

Filtering

Drop unwanted data:
config.yml
processors:
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^go_.*  # Exclude Go runtime metrics
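
Before deploying an exclusion pattern, it can be sanity-checked with an ordinary regex; the leading ^ anchors the match to the start of the metric name:

```python
import re

EXCLUDE = re.compile(r"^go_.*")  # same pattern as the filter processor above

metrics = ["go_goroutines", "go_gc_duration_seconds", "otelcol_receiver_accepted_spans"]
kept = [m for m in metrics if not EXCLUDE.match(m)]
print(kept)  # ['otelcol_receiver_accepted_spans']
```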

Enrichment

Add context to telemetry:
config.yml
processors:
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: pentagi-prod-01
        action: insert
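
The upsert and insert actions differ only when the attribute already exists: upsert overwrites, insert leaves the existing value alone. Modeled on plain dicts (the helper is ours, not collector code):

```python
def apply_action(attrs: dict, key: str, value: str, action: str) -> dict:
    """Apply a resource-processor style attribute action to a copy of attrs."""
    out = dict(attrs)
    if action == "upsert":      # always set, overwriting any existing value
        out[key] = value
    elif action == "insert":    # set only if the key is absent
        out.setdefault(key, value)
    return out

attrs = {"environment": "staging"}
print(apply_action(attrs, "environment", "production", "upsert"))  # {'environment': 'production'}
print(apply_action(attrs, "environment", "production", "insert"))  # {'environment': 'staging'}
```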

Multiple Backends

Export to multiple destinations:
config.yml
exporters:
  otlp/primary:
    endpoint: jaeger:4317
  otlp/backup:
    endpoint: backup-collector:4317

service:
  pipelines:
    traces:
      exporters: [otlp/primary, otlp/backup]

Best Practices

Configuration Management

  • Version control your config.yml
  • Use environment variables for secrets
  • Document custom configuration changes
  • Test changes in development first
  • Keep backups of working configurations

Performance Optimization

  • Enable batching for all pipelines
  • Use appropriate batch sizes (100-1000)
  • Configure memory limiters
  • Monitor collector resource usage
  • Scale horizontally if needed

Security

  • Use TLS for production deployments
  • Restrict network access to OTel ports
  • Sanitize sensitive data in processors
  • Implement authentication on receivers
  • Audit configuration regularly

Reliability

  • Configure retry policies for exporters
  • Use persistent queues for critical data
  • Monitor collector health continuously
  • Set up redundant collectors
  • Test failover scenarios
