The verification node validates whether remediation was successful by re-checking the health of the affected service. It determines if the issue is resolved or if additional attempts are needed.

Overview

After executing remediation commands, verification ensures the service has actually recovered. This closes the feedback loop and determines the next action: declare success, retry remediation, or escalate to human operators.
Verification uses the same health check commands as the monitoring node, ensuring consistency in success criteria.

Verification Workflow

Step 1: Service Configuration Lookup

Retrieves health check configuration for the affected service:
service = state.get("affected_service", "")
service_cfg = config.SERVICES.get(service, {})

if not service_cfg:
    log("error", f"Service '{service}' not found in configuration.")
    return {
        "current_step": "verify",
        "current_error": f"Service '{service}' has no verification configuration.",
        "retry_count": state.get("retry_count", 0) + 1
    }

Step 2: SSH Connection

Establishes connection to run the health check:
ssh = SSHClient(
    hostname=config.SSH_HOST,
    port=config.SSH_PORT,
    username=config.SSH_USER,
    password=config.SSH_PASS
)

Step 3: Health Check Execution

Runs the service-specific check command:
code, out, err = ssh.execute_command(service_cfg["check_command"])
ssh.close()

Step 4: Status Evaluation

Checks if the service is now running:
if service_cfg["running_indicator"] in out:
    log("verify", f"Service '{service}' RECOVERED.")
    return {"current_step": "verify", "current_error": None}

Step 5: Retry Decision

If still failed, increments retry counter:
else:
    retry = state.get("retry_count", 0) + 1
    log("warning", f"Service '{service}' still down. Attempt {retry}.")
    return {
        "current_step": "verify",
        "current_error": state.get("current_error"),
        "retry_count": retry
    }
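
The five steps above can be assembled into a single verify node. The sketch below is a condensed, illustrative version: SSHClient, config, and log are stand-ins for the project's real helpers, not the actual implementation.

```python
# Condensed, illustrative sketch of the verify node. SSHClient, config,
# and log are stand-ins for the project's real helpers.
def log(level, msg):
    print(f"[{level}] {msg}")

class SSHClient:
    """Stub SSH client; the real one runs commands on the remote host."""
    def __init__(self, hostname, port, username, password):
        pass

    def execute_command(self, cmd):
        # Pretend the health check reports a running service.
        return 0, "nginx is running", ""

    def close(self):
        pass

class config:
    SSH_HOST, SSH_PORT, SSH_USER, SSH_PASS = "host", 22, "user", "pass"
    SERVICES = {
        "nginx": {
            "check_command": "sudo service nginx status",
            "running_indicator": "is running",
        }
    }

def verify_node(state):
    """Steps 1-5: look up config, connect, check, evaluate, decide."""
    service = state.get("affected_service", "")
    service_cfg = config.SERVICES.get(service, {})

    # Step 1: missing configuration counts as a failed verification.
    if not service_cfg:
        log("error", f"Service '{service}' not found in configuration.")
        return {
            "current_step": "verify",
            "current_error": f"Service '{service}' has no verification configuration.",
            "retry_count": state.get("retry_count", 0) + 1,
        }

    try:
        # Steps 2-3: connect and run the service-specific health check.
        ssh = SSHClient(hostname=config.SSH_HOST, port=config.SSH_PORT,
                        username=config.SSH_USER, password=config.SSH_PASS)
        code, out, err = ssh.execute_command(service_cfg["check_command"])
        ssh.close()
    except Exception as e:
        log("error", f"Verification error: {e}")
        return {
            "current_step": "verify",
            "current_error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

    # Step 4: the service is healthy if the indicator appears in the output.
    if service_cfg["running_indicator"] in out:
        log("verify", f"Service '{service}' RECOVERED.")
        return {"current_step": "verify", "current_error": None}

    # Step 5: still down; preserve the error and bump the retry counter.
    retry = state.get("retry_count", 0) + 1
    log("warning", f"Service '{service}' still down. Attempt {retry}.")
    return {
        "current_step": "verify",
        "current_error": state.get("current_error"),
        "retry_count": retry,
    }
```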

Health Check Consistency

Verification uses the same check logic as monitoring:
code, out, err = ssh.execute_command(service_cfg["check_command"])

if service_cfg["running_indicator"] in out:
    # Service is running
This ensures that a service passing verification will also pass future monitoring checks.
The running_indicator is a string that must appear in the command output for the service to be considered healthy (e.g., “active (running)”).
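
For example, with typical systemd and SysV-style status output (sample strings, not captured from a real host), the substring check behaves as follows:

```python
# Sample status outputs; real output varies by init system and locale.
systemd_out = "Active: active (running) since Mon 2024-01-01"
sysv_out = "nginx is running"
failed_out = "Active: inactive (dead)"

print("active (running)" in systemd_out)  # True: healthy systemd service
print("is running" in sysv_out)           # True: healthy SysV service
print("active (running)" in failed_out)   # False: indicator avoids false match
```

Note that a short indicator such as "active" would also match "inactive (dead)", which is why the full "active (running)" phrase is the safer choice.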

Success Response

When verification succeeds, the error is cleared:
log("verify", f"Service '{service}' RECOVERED.")
return {"current_step": "verify", "current_error": None}

current_error: None

Signals to the workflow that remediation succeeded

Workflow Termination

With no error, the agent returns to monitoring mode

Failure Response

When verification fails, the retry counter is incremented:
retry = state.get("retry_count", 0) + 1
log("warning", f"Service '{service}' still down. Attempt {retry}.")

return {
    "current_step": "verify",
    "current_error": state.get("current_error"),
    "retry_count": retry
}
The persistent current_error triggers another remediation cycle, with the diagnosis and planning nodes now aware of the increased retry count.

Retry Logic

The retry counter influences subsequent remediation:

Diagnosis

The diagnosis node receives retry count and adjusts its analysis accordingly

Planning

The planner includes retry count in its prompt, enabling progressive problem-solving strategies
Example from planning node:
HumanMessage(content=(
    f"Error: {error}\n"
    f"Attempt: {retry_count + 1}\n"
    f"Diagnosis: {diagnosis}"
))

Missing Service Configuration

If the service isn’t in the configuration:
if not service_cfg:
    log("error", f"Service '{service}' not found in configuration.")
    return {
        "current_step": "verify",
        "current_error": f"Service '{service}' has no verification configuration.",
        "retry_count": state.get("retry_count", 0) + 1
    }
This creates an error state that will likely escalate, as the agent cannot verify a service it doesn’t know how to check.

Exception Handling

Verification handles SSH and execution errors:
except Exception as e:
    log("error", f"Verification error: {e}")
    return {
        "current_step": "verify",
        "current_error": str(e),
        "retry_count": state.get("retry_count", 0) + 1
    }
Exceptions during verification are treated as verification failures, triggering another remediation attempt rather than halting the workflow.

State Fields

Verification updates these state fields:

current_error

None if recovered, preserved if still failed

retry_count

Incremented on each failed verification

current_step

Set to “verify” to mark workflow position
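
These fields can be sketched as a typed dictionary. This is a hypothetical shape for illustration; the real agent state likely carries additional fields such as the diagnosis and plan.

```python
from typing import Optional, TypedDict

class VerifyState(TypedDict, total=False):
    affected_service: str         # service whose health is being checked
    current_step: str             # set to "verify" by this node
    current_error: Optional[str]  # None once the service recovers
    retry_count: int              # incremented on each failed verification

# The two outcomes of verification, expressed as states:
recovered: VerifyState = {"current_step": "verify", "current_error": None}
still_down: VerifyState = {"current_step": "verify",
                           "current_error": "nginx down", "retry_count": 2}
```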

Workflow Integration

The verification result determines the next node:
# In workflow graph definition
if state["current_error"] is None:
    # Return to monitoring
    return "monitor"
else:
    # Retry remediation
    return "diagnose"

Progressive Remediation

With each failed verification:
  1. Retry count increases → Diagnosis and planning adapt strategies
  2. Memory accumulates → Failed approaches are excluded
  3. Context grows → LLM has more information about what didn’t work
This progressive approach increases the likelihood of success with each iteration, as the system learns from its mistakes in real-time.

Max Retry Handling

While not implemented in this node, workflow orchestration typically includes max retry limits:
# Typically in workflow conditional logic
if state["retry_count"] >= config.MAX_RETRIES:
    return "escalate"  # Human intervention required
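
Combining the error check with a retry limit, the routing decision after verification might look like the following sketch; MAX_RETRIES and the node names are assumptions based on the snippets above.

```python
MAX_RETRIES = 3  # assumed config value

def route_after_verify(state):
    """Pick the next node based on the verification result."""
    if state.get("current_error") is None:
        return "monitor"    # recovered: resume normal monitoring
    if state.get("retry_count", 0) >= MAX_RETRIES:
        return "escalate"   # retry budget exhausted: human intervention
    return "diagnose"       # still failing: start another remediation cycle
```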

Service Configuration Example

Services must define health check commands:
SERVICES = {
    "nginx": {
        "check_command": "sudo service nginx status",
        "running_indicator": "is running",
        "type": "service"
    },
    "postgresql": {
        "check_command": "sudo service postgresql status",
        "running_indicator": "active (running)",
        "type": "database"
    }
}
The running_indicator must be a reliable substring that only appears when the service is healthy.

Logging and Observability

Detailed logging tracks verification outcomes:
log("verify", "Checking whether the service recovered...")
log("verify", f"Service '{service}' RECOVERED.")
log("warning", f"Service '{service}' still down. Attempt {retry}.")
log("error", f"Verification error: {e}")

Implementation Location

Source: src/agent/nodes/verify.py:17

Workflow Loop

Verification completes the autonomous remediation cycle:
  1. Monitor detects failure
  2. Diagnose analyzes the problem
  3. Plan generates remediation commands
  4. Approve validates security
  5. Execute runs the commands
  6. Verify confirms recovery
If verification fails, the loop repeats from diagnosis with accumulated context and memory.
This closed-loop architecture enables true autonomous operation, with the system continuously learning and adapting until the issue is resolved.
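
A toy simulation of this loop (purely illustrative, with the health check replaced by a counter) shows how retries accumulate until recovery or escalation:

```python
def run_cycle(recovers_on_attempt, max_retries=5):
    """Simulate diagnose -> plan -> approve -> execute -> verify iterations."""
    state = {"current_error": "service down", "retry_count": 0}
    trace = []
    while state["current_error"] is not None:
        attempt = state["retry_count"] + 1
        trace.append(f"remediation attempt {attempt}")
        if attempt >= recovers_on_attempt:       # stand-in for the health check
            state["current_error"] = None        # verification succeeded
            trace.append("verified: back to monitor")
        else:
            state["retry_count"] = attempt       # verification failed
            if state["retry_count"] >= max_retries:
                trace.append("escalate")         # retry budget exhausted
                break
    return trace
```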
