Monitoring & Health Checks

Health Check Endpoint

The system exposes a health check endpoint at /api/health that reports the status of its core components: storage, the query engine, the query parser, and the Neo4j connection.

Endpoint Details

curl http://localhost:8000/api/health

Response Structure

The health check returns a JSON response with component status (see chat/app.py:196-219):

{
  "status": "healthy",
  "components": {
    "storage": true,
    "query_engine": true,
    "query_parser": true,
    "neo4j": true
  }
}

Status Values

status (string)

Overall system health status:

healthy: All components operational
degraded: Some components failing but system partially operational

components (object)

Individual component health status:

storage: GraphStorage initialized
query_engine: QueryEngine initialized
query_parser: QueryParser with LLM initialized
neo4j: Active database connection verified
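
As an illustration of how a client might interpret this response, here is a minimal sketch. The helper names failing_components and is_fully_healthy are hypothetical and not part of the project; the payload shape is taken from the sample response above. It treats a missing component key as a failure, which is a deliberately strict assumption.

```python
from typing import Any

# Component names taken from the sample response in this document.
EXPECTED_COMPONENTS = ("storage", "query_engine", "query_parser", "neo4j")


def failing_components(payload: dict[str, Any]) -> list[str]:
    """Return the names of components whose value is anything other than True."""
    components = payload.get("components", {})
    return sorted(name for name, ok in components.items() if ok is not True)


def is_fully_healthy(payload: dict[str, Any]) -> bool:
    """True only if status is "healthy" and every expected component is True.

    A missing key counts as a failure (an assumption of this sketch).
    """
    components = payload.get("components", {})
    return payload.get("status") == "healthy" and all(
        components.get(name) is True for name in EXPECTED_COMPONENTS
    )


sample = {
    "status": "healthy",
    "components": {
        "storage": True,
        "query_engine": True,
        "query_parser": True,
        "neo4j": True,
    },
}
print(is_fully_healthy(sample))    # True
print(failing_components(sample))  # []
```

Checking the individual flags as well as the top-level status guards against a response that reports "healthy" while one of the components is false or absent.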

Health Check Implementation

The health check endpoint is implemented in chat/app.py:

chat/app.py:196-219

@app.get("/api/health")
async def health_check():
    """Health check endpoint."""
    global storage, query_engine, query_parser
    
    status = {
        "status": "healthy",
        "components": {
            "storage": storage is not None,
            "query_engine": query_engine is not None,
            "query_parser": query_parser is not None
        }
    }
    
    # Test Neo4j connection
    try:
        if storage:
            storage.execute_cypher("RETURN 1")
            status["components"]["neo4j"] = True
    except Exception:
        status["components"]["neo4j"] = False
        status["status"] = "degraded"
    
    return status

Component Checks

Initialization Checks

Verifies that critical components were initialized during startup:

GraphStorage instance created
QueryEngine instance created
QueryParser with LLM initialized

Neo4j Connection Test

Actively tests the Neo4j database connection by executing a simple query:

storage.execute_cypher("RETURN 1")

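If this query raises an exception, neo4j is reported as false and the overall status is set to “degraded”. In the handler shown above, this is the only condition that changes the overall status: a component that failed to initialize is reported as false while status stays “healthy”, and when storage is not initialized the connection test is skipped and the neo4j key is omitted from the response.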

Application Startup

The application initializes components during the startup event (chat/app.py:58-79):

chat/app.py:58-79

@app.on_event("startup")
async def startup_event():
    """Initialize components on startup."""
    global storage, query_engine, query_parser
    
    try:
        # Initialize storage
        storage = GraphStorage()
        query_engine = QueryEngine(storage)
        
        # Initialize LLM and parser
        llm = GeminiLLM()
        query_parser = QueryParser(query_engine, llm)
        
        # Load data from configuration files
        await load_configuration_data()
        
        logger.info("Application startup completed successfully")
        
    except Exception as e:
        logger.error(f"Failed to initialize application: {e}")
        raise

Startup Sequence

Database Connection

Connect to Neo4j using environment variables (graph/storage.py:17-39):

self.uri = uri or os.getenv('NEO4J_URI', 'bolt://localhost:7687')
self.user = user or os.getenv('NEO4J_USER', 'neo4j')
self.password = password or os.getenv('NEO4J_PASSWORD', 'password')

self.driver = GraphDatabase.driver(self.uri, auth=(self.user, self.password))

LLM Initialization

Initialize the Gemini LLM client (chat/llm.py:17-24):

self.api_key = api_key or os.getenv('GEMINI_API_KEY')
if not self.api_key:
    raise ValueError("GEMINI_API_KEY environment variable is required")

self.client = genai.Client(api_key=self.api_key)

Data Loading

Load configuration data from YAML files (chat/app.py:90-134):

Clear existing graph data
Parse docker-compose.yml
Parse teams.yaml
Parse k8s-deployments.yaml (optional)
Populate graph database
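
The file-selection part of these steps could be sketched as follows. This is an illustration only: plan_config_load is a hypothetical helper, the idea that all files live in one directory is an assumption, and treating docker-compose.yml and teams.yaml as required is inferred from the fact that only k8s-deployments.yaml is marked optional.

```python
from pathlib import Path

# File names come from the list above; required/optional split is an assumption.
REQUIRED_FILES = ("docker-compose.yml", "teams.yaml")
OPTIONAL_FILES = ("k8s-deployments.yaml",)


def plan_config_load(config_dir: Path) -> list[Path]:
    """Return the configuration files to parse, in load order.

    Raises FileNotFoundError if a required file is absent; optional files
    are included only when they exist.
    """
    missing = [name for name in REQUIRED_FILES if not (config_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing required configuration files: {', '.join(missing)}"
        )
    plan = [config_dir / name for name in REQUIRED_FILES]
    plan += [
        config_dir / name
        for name in OPTIONAL_FILES
        if (config_dir / name).is_file()
    ]
    return plan
```

Resolving the list of files before clearing the graph means a missing required file is detected before any existing data is removed.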

Logging

The system uses Python’s standard logging module. Every message follows one plain-text format: timestamp, logger name, level, and message.

Log Configuration

Logging is configured in main.py:16-20:

main.py:16-20

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

Log Levels

INFO

Normal operational messages:

Component initialization
Configuration loading
Data statistics (nodes/edges loaded)

2026-03-03 14:23:45 - __main__ - INFO - Connecting to Neo4j...
2026-03-03 14:23:46 - __main__ - INFO - Loaded 25 nodes and 48 edges from Docker Compose

WARNING

Non-critical issues that don’t prevent operation:

Configuration validation warnings
Missing optional features

2026-03-03 14:23:47 - __main__ - WARNING - Configuration validation found issues, but continuing...

ERROR

Critical failures:

Missing environment variables
Database connection failures
Configuration parsing errors

2026-03-03 14:23:48 - __main__ - ERROR - Failed to connect to Neo4j: Connection refused
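
Because every line follows the same format, log output can be summarized with a small parser. The sketch below is one possible approach; LOG_LINE and count_levels are hypothetical names. The pattern also accepts the ",123" millisecond suffix that %(asctime)s emits by default, which the sample lines in this document omit.

```python
import re
from collections import Counter
from typing import Iterable

# Matches: "<asctime> - <name> - <levelname> - <message>"
LOG_LINE = re.compile(
    r"^(?P<asctime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:,\d{3})?)"
    r" - (?P<name>\S+) - (?P<levelname>[A-Z]+) - (?P<message>.*)$"
)


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines per level, ignoring lines that do not match the format."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("levelname")] += 1
    return counts


sample_lines = [
    "2026-03-03 14:23:45 - __main__ - INFO - Connecting to Neo4j...",
    "2026-03-03 14:23:47 - __main__ - WARNING - Configuration validation found issues, but continuing...",
    "2026-03-03 14:23:48 - __main__ - ERROR - Failed to connect to Neo4j: Connection refused",
]
print(count_levels(sample_lines))
```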

Viewing Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f ekg-app

# Last 100 lines
docker-compose logs --tail=100 ekg-app

Docker Health Checks

Neo4j Health Check

The Neo4j service includes a built-in health check in docker-compose.yml:

docker-compose.yml:15-19

healthcheck:
  test: ["CMD", "cypher-shell", "-u", "neo4j", "-p", "password", "RETURN 1"]
  interval: 10s
  timeout: 5s
  retries: 5

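This lets Docker Compose mark the Neo4j container as healthy once cypher-shell can run RETURN 1. Dependent services wait for that state only if they declare depends_on with condition: service_healthy; otherwise they start as soon as the Neo4j container has started.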
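
The same retry idea can be applied from application code or a deploy script. The following is a rough sketch, not Docker's exact algorithm: wait_until_healthy is a hypothetical helper, the probe callable is supplied by the caller (for example, a function that runs RETURN 1 against Neo4j), and the defaults simply mirror the interval and retries values shown above.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 5,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `retries` times, pausing `interval` seconds between tries.

    Returns True as soon as the probe succeeds, False if every attempt fails.
    A probe that raises is treated as a failed attempt.
    """
    for attempt in range(1, retries + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat errors as "not healthy yet"
        if attempt < retries:
            sleep(interval)
    return False
```

The sleep function is injectable so the loop can be exercised in tests without real delays.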

Checking Service Health

# View service health status
docker-compose ps

# Output shows health status
NAME                SERVICE   STATUS          PORTS
neo4j               neo4j     Up (healthy)    7474/tcp, 7687/tcp
ekg-app             ekg-app   Up              8000/tcp

Monitoring Metrics

Key Metrics to Track

Request Latency

Monitor API endpoint response times:

/api/query: Query processing time
/api/entities: Entity retrieval time
/api/health: Health check response time

Error Rate

Track HTTP error responses:

500 errors: System failures
503 errors: Service unavailable
Query parsing failures
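
As a minimal sketch of how latency and error rate could be tracked in-process, the class below records one sample per request. RequestMetrics is a hypothetical name and not part of the project; a production setup would more likely export these values to a dedicated metrics system. The percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict


class RequestMetrics:
    """Collect per-endpoint latencies and an overall 5xx error rate."""

    def __init__(self) -> None:
        self._latencies: dict[str, list[float]] = defaultdict(list)
        self._total = 0
        self._server_errors = 0

    def record(self, path: str, status_code: int, seconds: float) -> None:
        """Record one completed request."""
        self._latencies[path].append(seconds)
        self._total += 1
        if status_code >= 500:
            self._server_errors += 1

    def error_rate(self) -> float:
        """Fraction of recorded requests that returned a 5xx status."""
        return self._server_errors / self._total if self._total else 0.0

    def percentile(self, path: str, pct: float) -> float:
        """Nearest-rank percentile of the latencies recorded for one path."""
        samples = sorted(self._latencies.get(path, []))
        if not samples:
            raise ValueError(f"no samples recorded for {path}")
        rank = max(1, math.ceil(pct / 100 * len(samples)))
        return samples[rank - 1]


metrics = RequestMetrics()
metrics.record("/api/query", 200, 0.1)
metrics.record("/api/query", 200, 0.2)
metrics.record("/api/query", 500, 0.3)
metrics.record("/api/query", 200, 0.4)
print(metrics.error_rate())                  # 0.25
print(metrics.percentile("/api/query", 50))  # 0.2
```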

Neo4j Performance

Monitor database metrics:

Connection pool usage
Query execution time
Graph size (nodes/edges)

Resource Usage

Track container resources:

CPU usage
Memory consumption
Disk space (Neo4j volumes)

Docker Stats

Monitor real-time container resource usage:

docker stats

Output:

CONTAINER    CPU %    MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O
neo4j        2.5%     512MiB / 2GiB        25.0%    1.2MB / 850kB    15MB / 8MB
ekg-app      1.2%     256MiB / 1GiB        25.0%    850kB / 1.2MB    5MB / 1MB
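
If you want to alert on memory pressure, the MEM USAGE / LIMIT column can be converted to a percentage. The helpers below are a hypothetical sketch that assumes the binary units (B, KiB, MiB, GiB, TiB) used in that column.

```python
import re

# Binary unit multipliers for the memory column.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_SIZE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB|TiB)\s*$")


def parse_size(text: str) -> float:
    """Convert a string such as "384MiB" to a number of bytes."""
    match = _SIZE.match(text)
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def mem_percent(usage: str) -> float:
    """Convert "used / limit" (for example "384MiB / 1GiB") to a percentage."""
    used, limit = usage.split("/")
    return round(100 * parse_size(used) / parse_size(limit), 1)


print(mem_percent("384MiB / 1GiB"))  # 37.5
```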

Environment Validation

The system validates environment configuration before starting (main.py:23-37):

main.py:23-37

def check_environment():
    """Check that required environment variables are set."""
    required_vars = ['GEMINI_API_KEY', 'NEO4J_URI', 'NEO4J_USER', 'NEO4J_PASSWORD']
    missing_vars = []
    
    for var in required_vars:
        if not os.getenv(var):
            missing_vars.append(var)
    
    if missing_vars:
        logger.error(f"Missing required environment variables: {', '.join(missing_vars)}")
        logger.error("Please create a .env file based on .env.example")
        return False
    
    return True

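The application exits with code 1 if any required environment variable is missing. All four variables are required by this check, even though GraphStorage falls back to default values for the three NEO4J_* variables when they are unset.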
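
To exercise this logic without touching the real process environment, the check can take the environment as a parameter. This is a sketch under that assumption; missing_env_vars and exit_code_for are hypothetical names, and only the variable list comes from check_environment above.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("GEMINI_API_KEY", "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def exit_code_for(env: Mapping[str, str] = os.environ) -> int:
    """Return 1 if any required variable is missing, otherwise 0."""
    return 1 if missing_env_vars(env) else 0


# Example with an explicit mapping instead of the real environment.
partial = {"GEMINI_API_KEY": "example-key", "NEO4J_URI": "bolt://localhost:7687"}
print(missing_env_vars(partial))  # ['NEO4J_USER', 'NEO4J_PASSWORD']
print(exit_code_for(partial))     # 1
```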

Configuration Validation

The system includes a comprehensive configuration validator (scripts/validate_config.py) that checks:

Docker Compose service definitions
Team ownership mappings
Service dependencies
Kubernetes deployment configurations

Run validation manually:

python scripts/validate_config.py

Sample output:

============================================================
CONFIGURATION VALIDATION RESULTS
============================================================

⚠️  WARNINGS (2):
  1. Service 'api-gateway' has no team ownership defined
  2. Team 'Platform' doesn't own any services

============================================================
✅ Validation PASSED - Only warnings found
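
The two warnings in the sample output suggest a cross-check between services and team ownership. The sketch below shows one way such a check could work; it is not the code in scripts/validate_config.py. The function ownership_warnings, its input shapes, and the billing and Payments names are all hypothetical.

```python
def ownership_warnings(services: list[str], teams: dict[str, list[str]]) -> list[str]:
    """Cross-check services against team ownership and return warning messages.

    `services` is a list of service names; `teams` maps a team name to the
    services it owns (both shapes are assumptions of this sketch).
    """
    owned = {svc for owned_services in teams.values() for svc in owned_services}
    warnings = [
        f"Service '{svc}' has no team ownership defined"
        for svc in services
        if svc not in owned
    ]
    warnings += [
        f"Team '{team}' doesn't own any services"
        for team, owned_services in teams.items()
        if not owned_services
    ]
    return warnings


for number, message in enumerate(
    ownership_warnings(
        ["api-gateway", "billing"],
        {"Payments": ["billing"], "Platform": []},
    ),
    start=1,
):
    print(f"  {number}. {message}")
```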

Next Steps

Deployment

Troubleshooting
