Monitor your Infrahub deployment to ensure reliability, identify performance bottlenecks, and troubleshoot issues. This page covers monitoring strategies, metrics, logging, and observability.

Monitoring components

Health checks

Infrahub provides built-in health check endpoints.

API server health:
curl http://localhost:8000/api/health

Returns HTTP 200 if healthy.

Configuration endpoint:
curl http://localhost:8000/api/config

Returns version and configuration details.

Component health checks:

    Container health

    Docker Compose includes built-in health checks:
    # Check all services
    docker compose ps
    
    # Services should show (healthy) status
    
    Kubernetes liveness and readiness probes:
    livenessProbe:
      httpGet:
        path: /api/health
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
    
    readinessProbe:
      httpGet:
        path: /api/health
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 5
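The same health endpoint the probes above poll can also be checked from a script. A minimal sketch (the URL and timeout are illustrative; the HTTP opener is injectable so the check is testable without a running server):

```python
import urllib.request
import urllib.error


def is_healthy(url="http://localhost:8000/api/health", timeout=5,
               opener=urllib.request.urlopen):
    """Return True if the health endpoint answers HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, etc. all count as unhealthy
        return False
```

Wiring this into a cron job or deploy pipeline gives a quick smoke test outside the orchestrator's own probes.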
    

    Metrics collection

    Prometheus metrics

    Enable OpenTelemetry export over OTLP. Note that Prometheus does not natively ingest OTLP over gRPC, so point the exporter at an OTLP-capable collector (the otel-collector endpoint below is an example) and have the collector expose metrics for Prometheus to scrape:
    docker-compose.override.yml
    services:
      infrahub-server:
        environment:
          INFRAHUB_TRACE_ENABLE: "true"
          INFRAHUB_TRACE_EXPORTER_TYPE: otlp
          INFRAHUB_TRACE_EXPORTER_PROTOCOL: grpc
          INFRAHUB_TRACE_EXPORTER_ENDPOINT: http://otel-collector:4317
    

    Key metrics to monitor

    API server metrics:
    • Request rate (requests/second)
    • Response time (p50, p95, p99)
    • Error rate (4xx, 5xx responses)
    • Active connections
    • Worker utilization
    Database metrics:
    • Query execution time
    • Transaction rate
    • Cache hit ratio
    • Connection pool usage
    • Page cache usage
    • Heap memory usage
    Task worker metrics:
    • Active tasks
    • Task queue depth
    • Task failure rate
    • Task execution time
    • Worker concurrency
    System metrics:
    • CPU usage
    • Memory usage
    • Disk I/O
    • Network throughput
    • Disk space usage
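    To make the latency percentiles above concrete, p50/p95/p99 can be computed from raw response-time samples with the nearest-rank method (a sketch; in production these usually come from histogram metrics rather than raw samples):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples (p in 0..100)."""
    ordered = sorted(samples)
    index = round(p / 100 * (len(ordered) - 1))
    return ordered[index]


# Illustrative response times in milliseconds; note how a few slow
# requests dominate the tail percentiles but barely move the median.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 900, 14]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
```

This is why dashboards track p95/p99 alongside p50: the median stays flat while the tail reveals the slow requests.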

    Neo4j metrics

    Query Neo4j metrics:
    // Database size
    CALL db.stats.retrieve('GRAPH COUNTS');
    
    // Query performance
    CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Transactions');
    
    // Page cache
    CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Page cache');
    
    Expose Neo4j metrics to Prometheus (the settings below use Neo4j 4.x naming; Neo4j 5 moves these under server.metrics.prometheus.*):
    docker-compose.override.yml
    services:
      database:
        environment:
          NEO4J_metrics_prometheus_enabled: "true"
          NEO4J_metrics_prometheus_endpoint: "0.0.0.0:2004"
        ports:
          - "2004:2004"
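Endpoints like the one above serve plain-text Prometheus exposition format. A minimal parser sketch for ad-hoc spot checks (the metric names in the sample are illustrative):

```python
def parse_prometheus_text(text):
    """Parse simple Prometheus exposition lines into a {metric: value} dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        if name:
            metrics[name] = float(value)
    return metrics


sample = """\
# HELP neo4j_page_cache_hit_ratio Page cache hit ratio
neo4j_page_cache_hit_ratio 0.98
neo4j_transaction_active 3
"""
stats = parse_prometheus_text(sample)
```

For real scraping, let Prometheus do the work; a parser like this is handy only for one-off curl checks.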
    

    RabbitMQ metrics

    Access the RabbitMQ management interface:
    http://localhost:15672

    Default credentials: infrahub / infrahub

    Export Prometheus metrics (this relies on the rabbitmq_prometheus plugin; if your broker image does not honor the variable below, enable the plugin with rabbitmq-plugins enable rabbitmq_prometheus):
    docker-compose.override.yml
    services:
      message-queue:
        environment:
          RABBITMQ_PROMETHEUS_PLUGIN: "true"
        ports:
          - "15692:15692"
    
    Scrape endpoint:
    prometheus.yml
    scrape_configs:
      - job_name: 'rabbitmq'
        static_configs:
          - targets: ['message-queue:15692']
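The management API also serves queue data as JSON at /api/queues. A sketch that flags queues whose backlog exceeds a threshold (the threshold and sample payload are illustrative; name and messages are the management API's field names):

```python
import json


def deep_queues(queues_json, threshold=1000):
    """Return names of queues whose message backlog exceeds the threshold."""
    return [q["name"] for q in json.loads(queues_json)
            if q.get("messages", 0) > threshold]


# Example payload as returned by GET /api/queues (trimmed to relevant fields)
payload = '[{"name": "infrahub", "messages": 4200}, {"name": "events", "messages": 3}]'
backlog = deep_queues(payload)
```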
    

    Redis metrics

    Query Redis info:
    # Memory usage
    docker compose exec cache redis-cli info memory
    
    # Stats
    docker compose exec cache redis-cli info stats
    
    # All info
    docker compose exec cache redis-cli info
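INFO output is key:value lines grouped by section. A sketch that parses it and derives the cache hit ratio from keyspace_hits and keyspace_misses (the sample values are illustrative):

```python
def parse_redis_info(text):
    """Parse redis-cli INFO output into a {key: value} dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks and "# Section" headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


sample = """\
# Stats
keyspace_hits:980
keyspace_misses:20
"""
info = parse_redis_info(sample)
hits, misses = int(info["keyspace_hits"]), int(info["keyspace_misses"])
hit_ratio = hits / (hits + misses)
```

A persistently low hit ratio suggests the cache is too small or keys are being evicted too aggressively.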
    
    Export to Prometheus using redis_exporter:
    docker-compose.override.yml
    services:
      redis-exporter:
        image: oliver006/redis_exporter:latest
        environment:
          REDIS_ADDR: cache:6379
        ports:
          - "9121:9121"
    

    Logging

    Log levels

    Configure log verbosity:
    # Set log level
    INFRAHUB_LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL
    

    Centralized logging

    Aggregate logs using Loki, Elasticsearch, or CloudWatch.

      Structured logging

      Infrahub logs are JSON-formatted for easy parsing:
      {
        "timestamp": "2025-03-02T12:00:00Z",
        "level": "INFO",
        "logger": "infrahub.server",
        "message": "Request processed",
        "request_id": "abc123",
        "duration_ms": 45
      }
      
      Parse with jq (strip the compose service-name prefix and skip unparseable lines):
      docker compose logs --no-log-prefix infrahub-server | jq -R 'fromjson? | select(.level=="ERROR")'
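The same filter in Python, which skips non-JSON lines (such as compose log prefixes) instead of failing on them:

```python
import json


def error_records(lines):
    """Yield parsed log records whose level is ERROR; ignore non-JSON lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. compose service prefixes or plain-text startup lines
        if record.get("level") == "ERROR":
            yield record


logs = [
    '{"level": "INFO", "message": "Request processed"}',
    'not json at all',
    '{"level": "ERROR", "message": "Upstream timeout"}',
]
errors = list(error_records(logs))
```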
      

      Log retention

      Configure Docker log rotation:
      docker-compose.override.yml
      services:
        infrahub-server:
          logging:
            driver: "json-file"
            options:
              max-size: "10m"
              max-file: "3"
      

      Distributed tracing

      Enable OpenTelemetry tracing:
      docker-compose.override.yml
      services:
        infrahub-server:
          environment:
            INFRAHUB_TRACE_ENABLE: "true"
            INFRAHUB_TRACE_EXPORTER_TYPE: otlp
            INFRAHUB_TRACE_EXPORTER_ENDPOINT: http://jaeger:4317
            OTEL_RESOURCE_ATTRIBUTES: service.name=infrahub-server
      
        jaeger:
          image: jaegertracing/all-in-one:latest
          ports:
            - "16686:16686"  # UI
            - "4317:4317"    # OTLP gRPC
            - "4318:4318"    # OTLP HTTP
      
      Access Jaeger UI:
      http://localhost:16686
      

      Alerting

      Prometheus alerts

      Define alerts for critical conditions:
      alerts.yml
      groups:
        - name: infrahub
          interval: 30s
          rules:
            - alert: HighErrorRate
              expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "High error rate on {{ $labels.instance }}"
      
            - alert: HighResponseTime
              expr: histogram_quantile(0.95, sum by (instance, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High response time on {{ $labels.instance }}"
      
            - alert: DatabaseDown
              expr: up{job="neo4j"} == 0
              for: 1m
              labels:
                severity: critical
              annotations:
                summary: "Neo4j database is down"
      
            - alert: HighMemoryUsage
              expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High memory usage on {{ $labels.container_label_com_docker_compose_service }}"
      
            - alert: DiskSpaceLow
              expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "Disk space below 10% on {{ $labels.instance }}"
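The for: clause in these rules means the expression must hold continuously for the whole duration before the alert fires, which suppresses one-off spikes. A sketch of that evaluation over per-scrape samples (threshold and window are illustrative):

```python
def alert_firing(samples, threshold, for_periods):
    """Fire only if the most recent `for_periods` samples all exceed the threshold."""
    if len(samples) < for_periods:
        return False
    return all(value > threshold for value in samples[-for_periods:])


# Error-rate samples, one per scrape interval; a single spike does not fire,
# but three consecutive samples over 0.05 would.
error_rates = [0.01, 0.02, 0.08, 0.09, 0.07]
firing = alert_firing(error_rates, threshold=0.05, for_periods=3)
```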
      

      Alert notifications

      Configure Alertmanager:
      alertmanager.yml
      route:
        receiver: 'default'
        group_by: ['alertname', 'severity']
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 12h
      
      receivers:
        - name: 'default'
          email_configs:
            - to: '[email protected]'
              from: '[email protected]'
              smarthost: 'smtp.example.com:587'
          slack_configs:
            - api_url: 'https://hooks.slack.com/services/XXX'
              channel: '#infrahub-alerts'
      

      Dashboards

      Grafana setup

      Deploy Grafana (change the default admin password before exposing it):
      docker-compose.override.yml
      services:
        grafana:
          image: grafana/grafana:latest
          ports:
            - "3000:3000"
          environment:
            GF_SECURITY_ADMIN_PASSWORD: admin
          volumes:
            - grafana_data:/var/lib/grafana
            - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
            - ./grafana/datasources:/etc/grafana/provisioning/datasources
      
      volumes:
        grafana_data:
      
      Access Grafana:
      http://localhost:3000
      

      Sample dashboard panels

      API request rate:
      rate(http_requests_total[5m])
      
      Database query latency:
      histogram_quantile(0.95, rate(neo4j_cypher_query_duration_seconds_bucket[5m]))
      
      Task queue depth:
      rabbitmq_queue_messages{queue="infrahub"}
      
      Memory usage:
      container_memory_usage_bytes{container_label_com_docker_compose_service="infrahub-server"}
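rate() in these panels is the per-second increase of a counter over the window, tolerant of counter resets (a restarted service starts counting from zero again). A sketch of the computation over (timestamp, value) samples:

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_s, counter_value) samples, handling resets."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count only the post-reset value
        total += (v1 - v0) if v1 >= v0 else v1
    return total / (samples[-1][0] - samples[0][0])


# 600 requests over 300 seconds -> 2 requests/second
samples = [(0, 0), (150, 300), (300, 600)]
rps = counter_rate(samples)
```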
      

      Performance monitoring

      Query performance

      Enable query logging:
      docker-compose.override.yml
      services:
        infrahub-server:
          environment:
            INFRAHUB_MISC_PRINT_QUERY_DETAILS: "true"
      
      View slow queries (over one second):
      docker compose logs --no-log-prefix infrahub-server | jq -R 'fromjson? | select((.query_duration_ms // 0) > 1000)'
      

      Database profiling

      Profile Neo4j queries:
      // Explain query plan
      EXPLAIN MATCH (n:Node) WHERE n.name = 'example' RETURN n;
      
      // Profile query execution
      PROFILE MATCH (n:Node) WHERE n.name = 'example' RETURN n;
      

      Resource utilization

      Monitor container resource usage:
      # Docker stats
      docker stats
      
      # Specific container
      docker stats infrahub-server-1
      
      Kubernetes resource metrics:
      # Pod metrics
      kubectl top pods -n infrahub
      
      # Node metrics
      kubectl top nodes
      

      Troubleshooting

      Common issues

      High memory usage:
      # Check Neo4j heap
      docker compose exec database cypher-shell -u neo4j \
        -c "CALL dbms.queryJmx('java.lang:type=Memory');"
      
      # Increase heap size
      NEO4J_dbms_memory_heap_max__size=4G
      
      Slow API responses:
      # Check query cache hit rate
      curl http://localhost:8000/api/metrics | grep cache_hit_rate
      
      # INFRAHUB_CACHE_DATABASE selects the Redis database index used for caching,
      # not a cache size; if entries are being evicted, raise the Redis maxmemory limit
      INFRAHUB_CACHE_DATABASE=1
      
      Task queue backlog:
      # Check queue depth
      curl -u infrahub:infrahub http://localhost:15672/api/queues | jq '.[] | {name, messages}'
      
      # Scale workers
      docker compose up -d --scale task-worker=4
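A rough way to pick the --scale value: divide the backlog by what one worker can clear in your target drain time (a back-of-the-envelope sketch; the per-worker throughput is an assumption you would measure for your workload):

```python
import math


def workers_needed(queue_depth, tasks_per_worker_per_min, drain_minutes,
                   max_workers=16):
    """Workers required to drain the backlog within drain_minutes."""
    capacity_per_worker = tasks_per_worker_per_min * drain_minutes
    return min(max_workers, max(1, math.ceil(queue_depth / capacity_per_worker)))


# 4200 queued tasks, 60 tasks/min per worker, drain within 20 minutes
needed = workers_needed(4200, 60, 20)
```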
      
