Monitoring

The monitoring node is the entry point of Sentinel AI’s autonomous operations cycle. It continuously checks the health of configured services via SSH and triggers the remediation workflow when failures are detected.

Overview

The monitor node connects to the configured remote host via SSH and executes the health check command for each configured service. It records every result in a snapshot of service states and flags the first service whose check fails.

The monitor node runs as the first step in the agent's workflow loop and keeps monitoring until a failure is detected.

Core Functionality

SSH Connection Establishment

Establishes an SSH connection using credentials from the configuration:

ssh = SSHClient(
    hostname=config.SSH_HOST,
    port=config.SSH_PORT,
    username=config.SSH_USER,
    password=config.SSH_PASS
)
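
SSHClient here is the project's own wrapper, and its implementation is not shown on this page. As a rough sketch only, a wrapper with the same constructor arguments and an execute_command method returning (exit code, stdout, stderr) could be built on Paramiko. The connect() and close() methods, the timeout parameter, and the host-key policy below are assumptions for illustration, not the project's actual code.

```python
from typing import Tuple


class SSHClient:
    """Hypothetical wrapper; the project's real SSHClient may differ."""

    def __init__(self, hostname: str, port: int, username: str,
                 password: str, timeout: float = 10.0) -> None:
        self.hostname = hostname
        self.port = port
        self.username = username
        self.password = password
        self.timeout = timeout  # assumed parameter, not in the original
        self._client = None     # set by connect()

    def connect(self) -> None:
        # Imported lazily so the class can be defined without Paramiko installed.
        import paramiko

        client = paramiko.SSHClient()
        # Assumption: only hosts already present in known_hosts are accepted.
        client.load_system_host_keys()
        client.set_missing_host_key_policy(paramiko.RejectPolicy())
        client.connect(
            self.hostname,
            port=self.port,
            username=self.username,
            password=self.password,
            timeout=self.timeout,
        )
        self._client = client

    def execute_command(self, command: str) -> Tuple[int, str, str]:
        """Run a command and return (exit_code, stdout, stderr)."""
        if self._client is None:
            raise RuntimeError("not connected; call connect() first")
        _stdin, stdout, stderr = self._client.exec_command(
            command, timeout=self.timeout
        )
        out = stdout.read().decode("utf-8", errors="replace")
        err = stderr.read().decode("utf-8", errors="replace")
        code = stdout.channel.recv_exit_status()
        return code, out, err

    def close(self) -> None:
        if self._client is not None:
            self._client.close()
            self._client = None
```

Reading both streams before asking for the exit status is adequate for short health check output; commands that produce large amounts of stderr would need a more careful read loop.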

Service Health Checks

Iterates through all configured services and executes their health check commands:

for service_name, service_cfg in config.SERVICES.items():
    code, out, err = ssh.execute_command(service_cfg["check_command"])
    is_running = service_cfg["running_indicator"] in out

Status Snapshot Creation

Builds a comprehensive snapshot of all service states:

services_snapshot[service_name] = {
    "status": status,
    "details": details,
    "type": service_cfg["type"]
}

Failure Detection

Detects the first failed service and prepares state for diagnosis:

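if not is_running and not any_failure:
    # Message text: "Service '<service_name>' is not active."
    any_failure = f"Servicio '{service_name}' no esta activo."
    failed_service = service_name
    # Logs "<service_name> CAIDO" ("down") followed by the check output.
    log("monitor", f"{service_name} CAIDO: {out.strip()}")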
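
The three fragments above (health check, snapshot entry, first-failure guard) fit together in a single loop. The following is a minimal sketch of that loop under stated assumptions: the function name check_services, the "running"/"stopped" status labels, the use of the raw output as details, and the canned command outputs are all hypothetical. The `not any_failure` guard in the original suggests the loop keeps checking the remaining services after the first failure, which is what this sketch does.

```python
from typing import Callable, Dict, Optional, Tuple

CheckResult = Tuple[int, str, str]  # (exit_code, stdout, stderr)


def check_services(
    run: Callable[[str], CheckResult],
    services: Dict[str, dict],
) -> Tuple[Dict[str, dict], Optional[str], Optional[str]]:
    """Return (snapshot, first_error, failed_service)."""
    snapshot: Dict[str, dict] = {}
    first_error: Optional[str] = None
    failed_service: Optional[str] = None

    for name, cfg in services.items():
        _code, out, _err = run(cfg["check_command"])
        is_running = cfg["running_indicator"] in out
        snapshot[name] = {
            # Assumed labels; the original does not show the status values.
            "status": "running" if is_running else "stopped",
            "details": out.strip(),
            "type": cfg["type"],
        }
        # Record only the first failure, but keep checking the rest so the
        # snapshot covers every configured service.
        if not is_running and first_error is None:
            first_error = f"Service '{name}' is not active."
            failed_service = name

    return snapshot, first_error, failed_service


# Canned outputs standing in for a real SSH session.
fake_outputs = {
    "systemctl status nginx": (0, "Active: active (running) since Mon", ""),
    "systemctl status redis": (3, "Active: inactive (dead)", ""),
    "systemctl status cron": (3, "Active: failed (Result: exit-code)", ""),
}
services = {
    name: {
        "check_command": f"systemctl status {name}",
        "running_indicator": "active (running)",
        "type": "systemd",
    }
    for name in ("nginx", "redis", "cron")
}

snapshot, error, failed = check_services(fake_outputs.__getitem__, services)
print(failed)         # redis
print(len(snapshot))  # 3
```

Passing the command runner as a callable keeps the loop testable without a live SSH connection; with a real client, `ssh.execute_command` would be passed instead.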

State Management

The monitor node updates the agent state with critical information:

Success State

When all services are healthy:

current_step: "monitor"
current_error: None
affected_service: None

Failure State

When a service failure is detected:

current_step: "monitor"
current_error: Error description
affected_service: Failed service name
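
The two state shapes above can be written down as a typed dictionary. This is an illustrative sketch: the names MonitorUpdate, healthy_update, and failure_update are hypothetical, and the real agent state may carry additional fields.

```python
from typing import Optional, TypedDict


class MonitorUpdate(TypedDict):
    """The three fields the monitor node writes, as listed above."""
    current_step: str
    current_error: Optional[str]
    affected_service: Optional[str]


def healthy_update() -> MonitorUpdate:
    # All services passed their health checks.
    return {
        "current_step": "monitor",
        "current_error": None,
        "affected_service": None,
    }


def failure_update(error: str, service: str) -> MonitorUpdate:
    # A check failed; later steps read these two fields.
    return {
        "current_step": "monitor",
        "current_error": error,
        "affected_service": service,
    }


print(failure_update("Service 'nginx' is not active.", "nginx"))
```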

Service Configuration

Each monitored service requires configuration with these properties:

SERVICES = {
    "service_name": {
        "check_command": "systemctl status service_name",
        "running_indicator": "active (running)",
        "type": "systemd"
    }
}

The running_indicator string is used to determine if the service is active by checking if it appears in the command output.
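
Because the check is a plain, case-sensitive substring match, the choice of indicator matters. The short example below uses made-up systemctl output lines to show why the full phrase "active (running)" is safer than a shorter string such as "active", which also occurs inside "inactive". The helper name is_running is an assumption for illustration.

```python
def is_running(output: str, running_indicator: str) -> bool:
    # Same test as in the monitor loop: substring membership.
    return running_indicator in output


# Sample lines in the style of `systemctl status` output (illustrative only).
healthy = "   Active: active (running) since Tue 2024-01-02 10:00:00 UTC"
stopped = "   Active: inactive (dead) since Tue 2024-01-02 10:05:00 UTC"

print(is_running(healthy, "active (running)"))  # True
print(is_running(stopped, "active (running)"))  # False
# Pitfall: "active" is a substring of "inactive", so this reports a
# stopped service as running.
print(is_running(stopped, "active"))            # True
```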

Error Handling

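The monitor node catches SSH connection failures, logs them, marks every configured service as "error" in the snapshot, and returns a failure state:

except Exception as e:
    # Message text: "SSH connection failure: <exception>"
    error_msg = f"Fallo en la conexion SSH: {str(e)}"
    log("error", error_msg)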
    
    error_snapshot = {
        name: {"status": "error", "details": "SSH Connect Fail", "type": cfg["type"]}
        for name, cfg in config.SERVICES.items()
    }
    
    return {
        "current_step": "monitor",
        "current_error": error_msg,
        "affected_service": "ssh"
    }

SSH connection failures set affected_service to "ssh", which requires special handling in subsequent remediation steps.
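
The excerpt above shows only the except branch. A minimal, self-contained sketch of the whole guard might look like the following. The function name guarded_connect, the connect callable, and the choice to return the error snapshot alongside the state update are assumptions; how the real node publishes the snapshot is not shown on this page.

```python
from typing import Callable, Dict, Tuple


def guarded_connect(
    connect: Callable[[], None],
    services: Dict[str, dict],
) -> Tuple[dict, Dict[str, dict]]:
    """Try to connect; on failure return (failure_state, error_snapshot)."""
    try:
        connect()
    except Exception as exc:  # broad catch, mirroring the original snippet
        error_msg = f"SSH connection failed: {exc}"
        # Every service is unreachable, so each one is marked as "error".
        error_snapshot = {
            name: {
                "status": "error",
                "details": "SSH Connect Fail",
                "type": cfg["type"],
            }
            for name, cfg in services.items()
        }
        failure_state = {
            "current_step": "monitor",
            "current_error": error_msg,
            "affected_service": "ssh",
        }
        return failure_state, error_snapshot

    healthy_state = {
        "current_step": "monitor",
        "current_error": None,
        "affected_service": None,
    }
    return healthy_state, {}


def refuse() -> None:
    # Stand-in for a real connection attempt that fails.
    raise ConnectionRefusedError("connection refused")


state, snap = guarded_connect(refuse, {"nginx": {"type": "systemd"}})
print(state["affected_service"])  # ssh
print(snap["nginx"]["status"])    # error
```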

Event Logging

The monitor node emits structured events for observability:

status_update

Emitted with the complete service snapshot so the UI can update.

monitor

Carries operational status messages and failure detection logs.

Implementation Location

Source: src/agent/nodes/monitor.py:18

Next Steps

When a failure is detected, the workflow transitions to the diagnosis phase where the error is analyzed using AI and historical data.
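
The transition can be pictured as a small routing function that inspects the state written by the monitor node. This is a sketch only: the function name route_after_monitor and the step names "diagnose" and "monitor" as return values are hypothetical, and the actual workflow wiring is not shown on this page.

```python
from typing import Mapping, Optional


def route_after_monitor(state: Mapping[str, Optional[str]]) -> str:
    """Pick the next step from the monitor node's state update."""
    # Any recorded error, including an SSH failure, leaves the monitor loop.
    if state.get("current_error"):
        return "diagnose"
    # No error recorded: run another monitoring pass.
    return "monitor"


print(route_after_monitor({"current_error": None}))                   # monitor
print(route_after_monitor({"current_error": "nginx is not active"}))  # diagnose
```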
