Monitor your Infrahub deployment to ensure reliability, identify performance bottlenecks, and troubleshoot issues. This page covers monitoring strategies, metrics, logging, and observability.

Monitoring components

Health checks

Infrahub provides built-in health check endpoints.

API server health:
curl http://localhost:8000/api/health

Returns HTTP 200 if healthy.

Configuration endpoint:
curl http://localhost:8000/api/config

Returns version and configuration details.

Component health checks:

    Container health

    Docker Compose includes built-in health checks:
    # Check all services
    docker compose ps
    
    # Services should show (healthy) status
    
    Kubernetes liveness and readiness probes:
    livenessProbe:
      httpGet:
        path: /api/health
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
    
    readinessProbe:
      httpGet:
        path: /api/health
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 5
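The same health endpoint the probes above poll can also be checked from a script. A minimal sketch (the URL and timeout are illustrative; the HTTP opener is injectable so the check is testable without a running server):

```python
import urllib.request
import urllib.error


def is_healthy(url="http://localhost:8000/api/health", timeout=5,
               opener=urllib.request.urlopen):
    """Return True if the health endpoint answers HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, etc. all count as unhealthy
        return False
```

Wiring this into a cron job or deploy pipeline gives a quick smoke test outside the orchestrator's own probes.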
    

    Metrics collection

    Prometheus metrics

    Enable OpenTelemetry export over OTLP. Note that Prometheus does not natively ingest OTLP over gRPC, so point the exporter at an OTLP-capable collector (the otel-collector endpoint below is an example) and have the collector expose metrics for Prometheus to scrape:
    docker-compose.override.yml
    services:
      infrahub-server:
        environment:
          INFRAHUB_TRACE_ENABLE: "true"
          INFRAHUB_TRACE_EXPORTER_TYPE: otlp
          INFRAHUB_TRACE_EXPORTER_PROTOCOL: grpc
          INFRAHUB_TRACE_EXPORTER_ENDPOINT: http://otel-collector:4317
    

    Key metrics to monitor

    API server metrics:
    • Request rate (requests/second)
    • Response time (p50, p95, p99)
    • Error rate (4xx, 5xx responses)
    • Active connections
    • Worker utilization
    Database metrics:
    • Query execution time
    • Transaction rate
    • Cache hit ratio
    • Connection pool usage
    • Page cache usage
    • Heap memory usage
    Task worker metrics:
    • Active tasks
    • Task queue depth
    • Task failure rate
    • Task execution time
    • Worker concurrency
    System metrics:
    • CPU usage
    • Memory usage
    • Disk I/O
    • Network throughput
    • Disk space usage
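    To make the latency percentiles above concrete, p50/p95/p99 can be computed from raw response-time samples with the nearest-rank method (a sketch; in production these usually come from histogram metrics rather than raw samples):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples (p in 0..100)."""
    ordered = sorted(samples)
    index = round(p / 100 * (len(ordered) - 1))
    return ordered[index]


# Illustrative response times in milliseconds; note how a few slow
# requests dominate the tail percentiles but barely move the median.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 900, 14]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
```

This is why dashboards track p95/p99 alongside p50: the median stays flat while the tail reveals the slow requests.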

    Neo4j metrics

    Query Neo4j metrics:
    // Database size
    CALL db.stats.retrieve('GRAPH COUNTS');
    
    // Query performance
    CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Transactions');
    
    // Page cache
    CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Page cache');
    
    Expose Neo4j metrics to Prometheus (the settings below use Neo4j 4.x naming; Neo4j 5 moves these under server.metrics.prometheus.*):
    docker-compose.override.yml
    services:
      database:
        environment:
          NEO4J_metrics_prometheus_enabled: "true"
          NEO4J_metrics_prometheus_endpoint: "0.0.0.0:2004"
        ports:
          - "2004:2004"
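Endpoints like the one above serve plain-text Prometheus exposition format. A minimal parser sketch for ad-hoc spot checks (the metric names in the sample are illustrative):

```python
def parse_prometheus_text(text):
    """Parse simple Prometheus exposition lines into a {metric: value} dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        if name:
            metrics[name] = float(value)
    return metrics


sample = """\
# HELP neo4j_page_cache_hit_ratio Page cache hit ratio
neo4j_page_cache_hit_ratio 0.98
neo4j_transaction_active 3
"""
stats = parse_prometheus_text(sample)
```

For real scraping, let Prometheus do the work; a parser like this is handy only for one-off curl checks.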
    

    RabbitMQ metrics

    Access the RabbitMQ management interface:
    http://localhost:15672

    Default credentials: infrahub / infrahub

    Export Prometheus metrics (this relies on the rabbitmq_prometheus plugin; if your broker image does not honor the variable below, enable the plugin with rabbitmq-plugins enable rabbitmq_prometheus):
    docker-compose.override.yml
    services:
      message-queue:
        environment:
          RABBITMQ_PROMETHEUS_PLUGIN: "true"
        ports:
          - "15692:15692"
    
    Scrape endpoint:
    prometheus.yml
    scrape_configs:
      - job_name: 'rabbitmq'
        static_configs:
          - targets: ['message-queue:15692']
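The management API also serves queue data as JSON at /api/queues. A sketch that flags queues whose backlog exceeds a threshold (the threshold and sample payload are illustrative; name and messages are the management API's field names):

```python
import json


def deep_queues(queues_json, threshold=1000):
    """Return names of queues whose message backlog exceeds the threshold."""
    return [q["name"] for q in json.loads(queues_json)
            if q.get("messages", 0) > threshold]


# Example payload as returned by GET /api/queues (trimmed to relevant fields)
payload = '[{"name": "infrahub", "messages": 4200}, {"name": "events", "messages": 3}]'
backlog = deep_queues(payload)
```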
    

    Redis metrics

    Query Redis info:
    # Memory usage
    docker compose exec cache redis-cli info memory
    
    # Stats
    docker compose exec cache redis-cli info stats
    
    # All info
    docker compose exec cache redis-cli info
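INFO output is key:value lines grouped by section. A sketch that parses it and derives the cache hit ratio from keyspace_hits and keyspace_misses (the sample values are illustrative):

```python
def parse_redis_info(text):
    """Parse redis-cli INFO output into a {key: value} dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks and "# Section" headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


sample = """\
# Stats
keyspace_hits:980
keyspace_misses:20
"""
info = parse_redis_info(sample)
hits, misses = int(info["keyspace_hits"]), int(info["keyspace_misses"])
hit_ratio = hits / (hits + misses)
```

A persistently low hit ratio suggests the cache is too small or keys are being evicted too aggressively.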
    
    Export to Prometheus using redis_exporter:
    docker-compose.override.yml
    services:
      redis-exporter:
        image: oliver006/redis_exporter:latest
        environment:
          REDIS_ADDR: cache:6379
        ports:
          - "9121:9121"
    

    Logging

    Log levels

    Configure log verbosity:
    # Set log level
    INFRAHUB_LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL
    

    Centralized logging

    Aggregate logs using Loki, Elasticsearch, or CloudWatch.

      Structured logging

      Infrahub logs are JSON-formatted for easy parsing:
      {
        "timestamp": "2025-03-02T12:00:00Z",
        "level": "INFO",
        "logger": "infrahub.server",
        "message": "Request processed",
        "request_id": "abc123",
        "duration_ms": 45
      }
      
      Parse with jq (strip the compose service-name prefix and skip unparseable lines):
      docker compose logs --no-log-prefix infrahub-server | jq -R 'fromjson? | select(.level=="ERROR")'
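The same filter in Python, which skips non-JSON lines (such as compose log prefixes) instead of failing on them:

```python
import json


def error_records(lines):
    """Yield parsed log records whose level is ERROR; ignore non-JSON lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. compose service prefixes or plain-text startup lines
        if record.get("level") == "ERROR":
            yield record


logs = [
    '{"level": "INFO", "message": "Request processed"}',
    'not json at all',
    '{"level": "ERROR", "message": "Upstream timeout"}',
]
errors = list(error_records(logs))
```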
      

      Log retention

      Configure Docker log rotation:
      docker-compose.override.yml
      services:
        infrahub-server:
          logging:
            driver: "json-file"
            options:
              max-size: "10m"
              max-file: "3"
      

      Distributed tracing

      Enable OpenTelemetry tracing:
      docker-compose.override.yml
      services:
        infrahub-server:
          environment:
            INFRAHUB_TRACE_ENABLE: "true"
            INFRAHUB_TRACE_EXPORTER_TYPE: otlp
            INFRAHUB_TRACE_EXPORTER_ENDPOINT: http://jaeger:4317
            OTEL_RESOURCE_ATTRIBUTES: service.name=infrahub-server
      
        jaeger:
          image: jaegertracing/all-in-one:latest
          ports:
            - "16686:16686"  # UI
            - "4317:4317"    # OTLP gRPC
            - "4318:4318"    # OTLP HTTP
      
      Access Jaeger UI:
      http://localhost:16686
      

      Alerting

      Prometheus alerts

      Define alerts for critical conditions:
      alerts.yml
      groups:
        - name: infrahub
          interval: 30s
          rules:
            - alert: HighErrorRate
              expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "High error rate on {{ $labels.instance }}"
      
            - alert: HighResponseTime
              expr: histogram_quantile(0.95, sum by (instance, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High response time on {{ $labels.instance }}"
      
            - alert: DatabaseDown
              expr: up{job="neo4j"} == 0
              for: 1m
              labels:
                severity: critical
              annotations:
                summary: "Neo4j database is down"
      
            - alert: HighMemoryUsage
              expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High memory usage on {{ $labels.container_label_com_docker_compose_service }}"
      
            - alert: DiskSpaceLow
              expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "Disk space below 10% on {{ $labels.instance }}"
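The for: clause in these rules means the expression must hold continuously for the whole duration before the alert fires, which suppresses one-off spikes. A sketch of that evaluation over per-scrape samples (threshold and window are illustrative):

```python
def alert_firing(samples, threshold, for_periods):
    """Fire only if the most recent `for_periods` samples all exceed the threshold."""
    if len(samples) < for_periods:
        return False
    return all(value > threshold for value in samples[-for_periods:])


# Error-rate samples, one per scrape interval; a single spike does not fire,
# but three consecutive samples over 0.05 would.
error_rates = [0.01, 0.02, 0.08, 0.09, 0.07]
firing = alert_firing(error_rates, threshold=0.05, for_periods=3)
```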
      

      Alert notifications

      Configure Alertmanager:
      alertmanager.yml
      route:
        receiver: 'default'
        group_by: ['alertname', 'severity']
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 12h
      
      receivers:
        - name: 'default'
          email_configs:
            - to: '[email protected]'
              from: '[email protected]'
              smarthost: 'smtp.example.com:587'
          slack_configs:
            - api_url: 'https://hooks.slack.com/services/XXX'
              channel: '#infrahub-alerts'
      

      Dashboards

      Grafana setup

      Deploy Grafana (change the default admin password before exposing it):
      docker-compose.override.yml
      services:
        grafana:
          image: grafana/grafana:latest
          ports:
            - "3000:3000"
          environment:
            GF_SECURITY_ADMIN_PASSWORD: admin
          volumes:
            - grafana_data:/var/lib/grafana
            - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
            - ./grafana/datasources:/etc/grafana/provisioning/datasources
      
      volumes:
        grafana_data:
      
      Access Grafana:
      http://localhost:3000
      

      Sample dashboard panels

      API request rate:
      rate(http_requests_total[5m])
      
      Database query latency:
      histogram_quantile(0.95, rate(neo4j_cypher_query_duration_seconds_bucket[5m]))
      
      Task queue depth:
      rabbitmq_queue_messages{queue="infrahub"}
      
      Memory usage:
      container_memory_usage_bytes{container_label_com_docker_compose_service="infrahub-server"}
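rate() in these panels is the per-second increase of a counter over the window, tolerant of counter resets (a restarted service starts counting from zero again). A sketch of the computation over (timestamp, value) samples:

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_s, counter_value) samples, handling resets."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count only the post-reset value
        total += (v1 - v0) if v1 >= v0 else v1
    return total / (samples[-1][0] - samples[0][0])


# 600 requests over 300 seconds -> 2 requests/second
samples = [(0, 0), (150, 300), (300, 600)]
rps = counter_rate(samples)
```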
      

      Performance monitoring

      Query performance

      Enable query logging:
      docker-compose.override.yml
      services:
        infrahub-server:
          environment:
            INFRAHUB_MISC_PRINT_QUERY_DETAILS: "true"
      
      View slow queries (over one second):
      docker compose logs --no-log-prefix infrahub-server | jq -R 'fromjson? | select((.query_duration_ms // 0) > 1000)'
      

      Database profiling

      Profile Neo4j queries:
      // Explain query plan
      EXPLAIN MATCH (n:Node) WHERE n.name = 'example' RETURN n;
      
      // Profile query execution
      PROFILE MATCH (n:Node) WHERE n.name = 'example' RETURN n;
      

      Resource utilization

      Monitor container resource usage:
      # Docker stats
      docker stats
      
      # Specific container
      docker stats infrahub-server-1
      
      Kubernetes resource metrics:
      # Pod metrics
      kubectl top pods -n infrahub
      
      # Node metrics
      kubectl top nodes
      

      Troubleshooting

      Common issues

      High memory usage:
      # Check Neo4j heap
      docker compose exec database cypher-shell -u neo4j \
        -c "CALL dbms.queryJmx('java.lang:type=Memory');"
      
      # Increase heap size
      NEO4J_dbms_memory_heap_max__size=4G
      
      Slow API responses:
      # Check query cache hit rate
      curl http://localhost:8000/api/metrics | grep cache_hit_rate
      
      # INFRAHUB_CACHE_DATABASE selects the Redis database index used for caching,
      # not a cache size; if entries are being evicted, raise the Redis maxmemory limit
      INFRAHUB_CACHE_DATABASE=1
      
      Task queue backlog:
      # Check queue depth
      curl -u infrahub:infrahub http://localhost:15672/api/queues | jq '.[] | {name, messages}'
      
      # Scale workers
      docker compose up -d --scale task-worker=4
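A rough way to pick the --scale value: divide the backlog by what one worker can clear in your target drain time (a back-of-the-envelope sketch; the per-worker throughput is an assumption you would measure for your workload):

```python
import math


def workers_needed(queue_depth, tasks_per_worker_per_min, drain_minutes,
                   max_workers=16):
    """Workers required to drain the backlog within drain_minutes."""
    capacity_per_worker = tasks_per_worker_per_min * drain_minutes
    return min(max_workers, max(1, math.ceil(queue_depth / capacity_per_worker)))


# 4200 queued tasks, 60 tasks/min per worker, drain within 20 minutes
needed = workers_needed(4200, 60, 20)
```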
      
