The verification node validates whether remediation was successful by re-checking the health of the affected service. It determines if the issue is resolved or if additional attempts are needed.
Overview
After executing remediation commands, verification ensures the service has actually recovered. This closes the feedback loop and determines the next action: declare success, retry remediation, or escalate to human operators.
Verification uses the same health check commands as the monitoring node, ensuring consistency in success criteria.
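As a rough illustration of this shared logic (the `run_health_check` helper and `FakeSSH` stub are hypothetical names, not from the source), both nodes could evaluate health the same way:

```python
def run_health_check(ssh, service_cfg):
    # Both monitor and verify evaluate health identically: run the
    # configured command and look for the indicator substring in stdout.
    code, out, err = ssh.execute_command(service_cfg["check_command"])
    return service_cfg["running_indicator"] in out

# Minimal stub standing in for the real SSH client, for illustration only.
class FakeSSH:
    def execute_command(self, cmd):
        return 0, "nginx is running", ""

cfg = {"check_command": "sudo service nginx status", "running_indicator": "is running"}
print(run_health_check(FakeSSH(), cfg))  # True
```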
Verification Workflow
Service Configuration Lookup
Retrieves the health check configuration for the affected service:

```python
service = state.get("affected_service", "")
service_cfg = config.SERVICES.get(service, {})
if not service_cfg:
    log("error", f"Servicio '{service}' no encontrado en configuracion.")
    return {
        "current_step": "verify",
        "current_error": f"Servicio '{service}' sin configuracion de verificacion.",
        "retry_count": state.get("retry_count", 0) + 1
    }
```
SSH Connection
Establishes an SSH connection to run the health check:

```python
ssh = SSHClient(
    hostname=config.SSH_HOST,
    port=config.SSH_PORT,
    username=config.SSH_USER,
    password=config.SSH_PASS
)
```
Health Check Execution
Runs the service-specific check command:

```python
code, out, err = ssh.execute_command(service_cfg["check_command"])
ssh.close()
```
Status Evaluation
Checks whether the service is now running:

```python
if service_cfg["running_indicator"] in out:
    log("verify", f"Servicio '{service}' RECUPERADO.")
    return {"current_step": "verify", "current_error": None}
```
Retry Decision
If the service is still down, the retry counter is incremented:

```python
else:
    retry = state.get("retry_count", 0) + 1
    log("warning", f"Servicio '{service}' sigue caido. Intento {retry}.")
    return {
        "current_step": "verify",
        "current_error": state.get("current_error"),
        "retry_count": retry
    }
```
Health Check Consistency
Verification uses the same check logic as monitoring:
```python
code, out, err = ssh.execute_command(service_cfg["check_command"])
if service_cfg["running_indicator"] in out:
    # Service is running
```
This ensures that a service passing verification will also pass future monitoring checks.
The running_indicator is a string that must appear in the command output for the service to be considered healthy (e.g., “active (running)”).
Success Response
When verification succeeds, the error is cleared:
```python
log("verify", f"Servicio '{service}' RECUPERADO.")
return {"current_step": "verify", "current_error": None}
```
- current_error set to None signals to the workflow that remediation succeeded
- Workflow termination: with no error, the agent returns to monitoring mode
Failure Response
When verification fails, the retry counter is incremented:
```python
retry = state.get("retry_count", 0) + 1
log("warning", f"Servicio '{service}' sigue caido. Intento {retry}.")
return {
    "current_step": "verify",
    "current_error": state.get("current_error"),
    "retry_count": retry
}
```
The persistent current_error triggers another remediation cycle, with the diagnosis and planning nodes now aware of the increased retry count.
Retry Logic
The retry counter influences subsequent remediation:
- Diagnosis: the diagnosis node receives the retry count and adjusts its analysis accordingly
- Planning: the planner includes the retry count in its prompt, enabling progressive problem-solving strategies
Example from planning node:
```python
HumanMessage(content=(
    f"Error: {error}\n"
    f"Intento: {retry_count + 1}\n"
    f"Diagnostico: {diagnosis}"
))
```
Missing Service Configuration
If the service isn’t in the configuration:
```python
if not service_cfg:
    log("error", f"Servicio '{service}' no encontrado en configuracion.")
    return {
        "current_step": "verify",
        "current_error": f"Servicio '{service}' sin configuracion de verificacion.",
        "retry_count": state.get("retry_count", 0) + 1
    }
```
This creates an error state that will likely escalate, as the agent cannot verify a service it doesn’t know how to check.
Exception Handling
Verification handles SSH and execution errors:
```python
except Exception as e:
    log("error", f"Error en verificacion: {e}")
    return {
        "current_step": "verify",
        "current_error": str(e),
        "retry_count": state.get("retry_count", 0) + 1
    }
```
Exceptions during verification are treated as verification failures, triggering another remediation attempt rather than halting the workflow.
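The pattern can be isolated as a small wrapper (the `safe_verify` and `failing_check` names are illustrative, not part of the source):

```python
def safe_verify(check, state):
    # Any exception raised while checking is converted into a
    # failed-verification state update rather than crashing the workflow.
    try:
        return check(state)
    except Exception as e:
        return {
            "current_step": "verify",
            "current_error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

def failing_check(state):
    raise ConnectionError("ssh unreachable")

result = safe_verify(failing_check, {"retry_count": 1})
print(result["current_error"], result["retry_count"])  # ssh unreachable 2
```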
State Fields
Verification updates these state fields:
- current_error: None if recovered, preserved if still failed
- retry_count: incremented on each failed verification
- current_step: set to "verify" to mark workflow position
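Assuming a TypedDict-style state (the exact schema is not shown on this page, so the shape below is a sketch using the field names from this section):

```python
from typing import Optional, TypedDict

class AgentState(TypedDict, total=False):
    affected_service: str          # service being remediated
    current_step: str              # last node that ran, e.g. "verify"
    current_error: Optional[str]   # None once the service has recovered
    retry_count: int               # failed verification attempts so far

state: AgentState = {"affected_service": "nginx", "current_step": "verify",
                     "current_error": None, "retry_count": 0}
```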
Workflow Integration
The verification result determines the next node:
```python
# In workflow graph definition
if state["current_error"] is None:
    # Return to monitoring
    return "monitor"
else:
    # Retry remediation
    return "diagnose"
```
With each failed verification:
- Retry count increases → diagnosis and planning adapt their strategies
- Memory accumulates → failed approaches are excluded
- Context grows → the LLM has more information about what didn't work
This progressive approach increases the likelihood of success with each iteration, as the system learns from its mistakes in real-time.
Max Retry Handling
While not implemented in this node, workflow orchestration typically includes max retry limits:
```python
# Typically in workflow conditional logic
if state["retry_count"] >= config.MAX_RETRIES:
    return "escalate"  # Human intervention required
```
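Combining the retry limit with the success check, a single routing function might look like this (the `route_after_verify` name and the placement of `MAX_RETRIES` are assumptions, not confirmed by the source):

```python
MAX_RETRIES = 3  # illustrative limit

def route_after_verify(state: dict) -> str:
    # Success: return to monitoring.
    if state.get("current_error") is None:
        return "monitor"
    # Retry budget exhausted: hand off to a human operator.
    if state.get("retry_count", 0) >= MAX_RETRIES:
        return "escalate"
    # Otherwise loop back through diagnosis with the updated retry count.
    return "diagnose"

print(route_after_verify({"current_error": None}))                      # monitor
print(route_after_verify({"current_error": "down", "retry_count": 3}))  # escalate
print(route_after_verify({"current_error": "down", "retry_count": 1}))  # diagnose
```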
Service Configuration Example
Services must define health check commands:
```python
SERVICES = {
    "nginx": {
        "check_command": "sudo service nginx status",
        "running_indicator": "is running",
        "type": "service"
    },
    "postgresql": {
        "check_command": "sudo service postgresql status",
        "running_indicator": "active (running)",
        "type": "database"
    }
}
```
The running_indicator must be a reliable substring that only appears when the service is healthy.
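To see why reliability matters, consider the plain substring check against sample outputs (the sample lines below mimic systemd-style status output and are illustrative only):

```python
def indicator_matches(output: str, indicator: str) -> bool:
    # Plain substring check, as used by the verification node.
    return indicator in output

# "active (running)" is a safe indicator: failure states such as
# "inactive (dead)" or "activating" never contain that exact substring.
print(indicator_matches("Active: active (running) since Mon", "active (running)"))  # True
print(indicator_matches("Active: inactive (dead)", "active (running)"))             # False
print(indicator_matches("Active: activating (start)", "active (running)"))          # False
```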
Logging and Observability
Detailed logging tracks verification outcomes:
```python
log("verify", "Comprobando si el servicio se recupero...")             # "Checking whether the service recovered..."
log("verify", f"Servicio '{service}' RECUPERADO.")                     # "Service RECOVERED."
log("warning", f"Servicio '{service}' sigue caido. Intento {retry}.")  # "Service still down. Attempt {retry}."
log("error", f"Error en verificacion: {e}")                            # "Error during verification."
```
Implementation Location
Source: src/agent/nodes/verify.py:17
Workflow Loop
Verification completes the autonomous remediation cycle:
1. Monitor detects failure
2. Diagnose analyzes the problem
3. Plan generates remediation commands
4. Approve validates security
5. Execute runs the commands
6. Verify confirms recovery
If verification fails, the loop repeats from diagnosis with accumulated context and memory.
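Under stated assumptions (nodes are callables returning partial state updates; all names are illustrative), the cycle's control flow can be simulated as:

```python
def remediation_loop(state, nodes, max_retries=3):
    # Repeat diagnose -> plan -> approve -> execute -> verify until
    # verification clears current_error or the retry budget runs out.
    while state.get("current_error") is not None:
        if state.get("retry_count", 0) >= max_retries:
            return "escalate"
        for node in nodes:
            state.update(node(state))
    return "monitor"

# Toy nodes: everything is a no-op except verify, which fails once.
noop = lambda state: {}
attempts = []

def verify(state):
    attempts.append(1)
    if len(attempts) < 2:  # first attempt: still down
        return {"current_error": "still down",
                "retry_count": state.get("retry_count", 0) + 1}
    return {"current_error": None}  # second attempt: recovered

outcome = remediation_loop({"current_error": "down", "retry_count": 0},
                           [noop, noop, noop, noop, verify])
print(outcome)  # monitor
```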
This closed-loop architecture enables true autonomous operation, with the system continuously learning and adapting until the issue is resolved.