The verification node validates whether remediation was successful by re-checking the health of the affected service. It determines if the issue is resolved or if additional attempts are needed.

Overview

After executing remediation commands, verification ensures the service has actually recovered. This closes the feedback loop and determines the next action: declare success, retry remediation, or escalate to human operators.
Verification uses the same health check commands as the monitoring node, ensuring consistency in success criteria.

Verification Workflow

Step 1: Service Configuration Lookup

Retrieves health check configuration for the affected service:
service = state.get("affected_service", "")
service_cfg = config.SERVICES.get(service, {})

if not service_cfg:
    log("error", f"Service '{service}' not found in configuration.")
    return {
        "current_step": "verify",
        "current_error": f"Service '{service}' has no verification configuration.",
        "retry_count": state.get("retry_count", 0) + 1
    }

Step 2: SSH Connection

Establishes connection to run the health check:
ssh = SSHClient(
    hostname=config.SSH_HOST,
    port=config.SSH_PORT,
    username=config.SSH_USER,
    password=config.SSH_PASS
)

Step 3: Health Check Execution

Runs the service-specific check command:
code, out, err = ssh.execute_command(service_cfg["check_command"])
ssh.close()

Step 4: Status Evaluation

Checks if the service is now running:
if service_cfg["running_indicator"] in out:
    log("verify", f"Service '{service}' RECOVERED.")
    return {"current_step": "verify", "current_error": None}

Step 5: Retry Decision

If still failed, increments retry counter:
else:
    retry = state.get("retry_count", 0) + 1
    log("warning", f"Service '{service}' still down. Attempt {retry}.")
    return {
        "current_step": "verify",
        "current_error": state.get("current_error"),
        "retry_count": retry
    }
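
The five steps above can be assembled into a single verify node. The sketch below is a condensed, illustrative version: SSHClient, config, and log are stand-ins for the project's real helpers, not the actual implementation.

```python
# Condensed, illustrative sketch of the verify node. SSHClient, config,
# and log are stand-ins for the project's real helpers.
def log(level, msg):
    print(f"[{level}] {msg}")

class SSHClient:
    """Stub SSH client; the real one runs commands on the remote host."""
    def __init__(self, hostname, port, username, password):
        pass

    def execute_command(self, cmd):
        # Pretend the health check reports a running service.
        return 0, "nginx is running", ""

    def close(self):
        pass

class config:
    SSH_HOST, SSH_PORT, SSH_USER, SSH_PASS = "host", 22, "user", "pass"
    SERVICES = {
        "nginx": {
            "check_command": "sudo service nginx status",
            "running_indicator": "is running",
        }
    }

def verify_node(state):
    """Steps 1-5: look up config, connect, check, evaluate, decide."""
    service = state.get("affected_service", "")
    service_cfg = config.SERVICES.get(service, {})

    # Step 1: missing configuration counts as a failed verification.
    if not service_cfg:
        log("error", f"Service '{service}' not found in configuration.")
        return {
            "current_step": "verify",
            "current_error": f"Service '{service}' has no verification configuration.",
            "retry_count": state.get("retry_count", 0) + 1,
        }

    try:
        # Steps 2-3: connect and run the service-specific health check.
        ssh = SSHClient(hostname=config.SSH_HOST, port=config.SSH_PORT,
                        username=config.SSH_USER, password=config.SSH_PASS)
        code, out, err = ssh.execute_command(service_cfg["check_command"])
        ssh.close()
    except Exception as e:
        log("error", f"Verification error: {e}")
        return {
            "current_step": "verify",
            "current_error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

    # Step 4: the service is healthy if the indicator appears in the output.
    if service_cfg["running_indicator"] in out:
        log("verify", f"Service '{service}' RECOVERED.")
        return {"current_step": "verify", "current_error": None}

    # Step 5: still down; preserve the error and bump the retry counter.
    retry = state.get("retry_count", 0) + 1
    log("warning", f"Service '{service}' still down. Attempt {retry}.")
    return {
        "current_step": "verify",
        "current_error": state.get("current_error"),
        "retry_count": retry,
    }
```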

Health Check Consistency

Verification uses the same check logic as monitoring:
code, out, err = ssh.execute_command(service_cfg["check_command"])

if service_cfg["running_indicator"] in out:
    # Service is running
This ensures that a service passing verification will also pass future monitoring checks.
The running_indicator is a string that must appear in the command output for the service to be considered healthy (e.g., “active (running)”).
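
For example, with typical systemd and SysV-style status output (sample strings, not captured from a real host), the substring check behaves as follows:

```python
# Sample status outputs; real output varies by init system and locale.
systemd_out = "Active: active (running) since Mon 2024-01-01"
sysv_out = "nginx is running"
failed_out = "Active: inactive (dead)"

print("active (running)" in systemd_out)  # True: healthy systemd service
print("is running" in sysv_out)           # True: healthy SysV service
print("active (running)" in failed_out)   # False: indicator avoids false match
```

Note that a short indicator such as "active" would also match "inactive (dead)", which is why the full "active (running)" phrase is the safer choice.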

Success Response

When verification succeeds, the error is cleared:
log("verify", f"Service '{service}' RECOVERED.")
return {"current_step": "verify", "current_error": None}

current_error: None

Signals to the workflow that remediation succeeded

Workflow Termination

With no error, the agent returns to monitoring mode

Failure Response

When verification fails, the retry counter is incremented:
retry = state.get("retry_count", 0) + 1
log("warning", f"Service '{service}' still down. Attempt {retry}.")

return {
    "current_step": "verify",
    "current_error": state.get("current_error"),
    "retry_count": retry
}
The persistent current_error triggers another remediation cycle, with the diagnosis and planning nodes now aware of the increased retry count.

Retry Logic

The retry counter influences subsequent remediation:

Diagnosis

The diagnosis node receives retry count and adjusts its analysis accordingly

Planning

The planner includes retry count in its prompt, enabling progressive problem-solving strategies
Example from planning node:
HumanMessage(content=(
    f"Error: {error}\n"
    f"Attempt: {retry_count + 1}\n"
    f"Diagnosis: {diagnosis}"
))

Missing Service Configuration

If the service isn’t in the configuration:
if not service_cfg:
    log("error", f"Service '{service}' not found in configuration.")
    return {
        "current_step": "verify",
        "current_error": f"Service '{service}' has no verification configuration.",
        "retry_count": state.get("retry_count", 0) + 1
    }
This creates an error state that will likely escalate, as the agent cannot verify a service it doesn’t know how to check.

Exception Handling

Verification handles SSH and execution errors:
except Exception as e:
    log("error", f"Verification error: {e}")
    return {
        "current_step": "verify",
        "current_error": str(e),
        "retry_count": state.get("retry_count", 0) + 1
    }
Exceptions during verification are treated as verification failures, triggering another remediation attempt rather than halting the workflow.

State Fields

Verification updates these state fields:

current_error

None if recovered, preserved if still failed

retry_count

Incremented on each failed verification

current_step

Set to “verify” to mark workflow position
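
These fields can be sketched as a typed dictionary. This is a hypothetical shape for illustration; the real agent state likely carries additional fields such as the diagnosis and plan.

```python
from typing import Optional, TypedDict

class VerifyState(TypedDict, total=False):
    affected_service: str         # service whose health is being checked
    current_step: str             # set to "verify" by this node
    current_error: Optional[str]  # None once the service recovers
    retry_count: int              # incremented on each failed verification

# The two outcomes of verification, expressed as states:
recovered: VerifyState = {"current_step": "verify", "current_error": None}
still_down: VerifyState = {"current_step": "verify",
                           "current_error": "nginx down", "retry_count": 2}
```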

Workflow Integration

The verification result determines the next node:
# In workflow graph definition
if state["current_error"] is None:
    # Return to monitoring
    return "monitor"
else:
    # Retry remediation
    return "diagnose"

Progressive Remediation

With each failed verification:
  1. Retry count increases → Diagnosis and planning adapt strategies
  2. Memory accumulates → Failed approaches are excluded
  3. Context grows → LLM has more information about what didn’t work
This progressive approach increases the likelihood of success with each iteration, as the system learns from its mistakes in real-time.

Max Retry Handling

While not implemented in this node, workflow orchestration typically includes max retry limits:
# Typically in workflow conditional logic
if state["retry_count"] >= config.MAX_RETRIES:
    return "escalate"  # Human intervention required
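
Combining the error check with a retry limit, the routing decision after verification might look like the following sketch; MAX_RETRIES and the node names are assumptions based on the snippets above.

```python
MAX_RETRIES = 3  # assumed config value

def route_after_verify(state):
    """Pick the next node based on the verification result."""
    if state.get("current_error") is None:
        return "monitor"    # recovered: resume normal monitoring
    if state.get("retry_count", 0) >= MAX_RETRIES:
        return "escalate"   # retry budget exhausted: human intervention
    return "diagnose"       # still failing: start another remediation cycle
```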

Service Configuration Example

Services must define health check commands:
SERVICES = {
    "nginx": {
        "check_command": "sudo service nginx status",
        "running_indicator": "is running",
        "type": "service"
    },
    "postgresql": {
        "check_command": "sudo service postgresql status",
        "running_indicator": "active (running)",
        "type": "database"
    }
}
The running_indicator must be a reliable substring that only appears when the service is healthy.

Logging and Observability

Detailed logging tracks verification outcomes:
log("verify", "Checking whether the service recovered...")
log("verify", f"Service '{service}' RECOVERED.")
log("warning", f"Service '{service}' still down. Attempt {retry}.")
log("error", f"Verification error: {e}")

Implementation Location

Source: src/agent/nodes/verify.py:17

Workflow Loop

Verification completes the autonomous remediation cycle:
  1. Monitor detects failure
  2. Diagnose analyzes the problem
  3. Plan generates remediation commands
  4. Approve validates security
  5. Execute runs the commands
  6. Verify confirms recovery
If verification fails, the loop repeats from diagnosis with accumulated context and memory.
This closed-loop architecture enables true autonomous operation, with the system continuously learning and adapting until the issue is resolved.
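
A toy simulation of this loop (purely illustrative, with the health check replaced by a counter) shows how retries accumulate until recovery or escalation:

```python
def run_cycle(recovers_on_attempt, max_retries=5):
    """Simulate diagnose -> plan -> approve -> execute -> verify iterations."""
    state = {"current_error": "service down", "retry_count": 0}
    trace = []
    while state["current_error"] is not None:
        attempt = state["retry_count"] + 1
        trace.append(f"remediation attempt {attempt}")
        if attempt >= recovers_on_attempt:       # stand-in for the health check
            state["current_error"] = None        # verification succeeded
            trace.append("verified: back to monitor")
        else:
            state["retry_count"] = attempt       # verification failed
            if state["retry_count"] >= max_retries:
                trace.append("escalate")         # retry budget exhausted
                break
    return trace
```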
