Agent architecture

Clanker’s agent system transforms natural language questions into coordinated AWS telemetry investigations. The architecture employs semantic analysis, decision trees, and dependency-aware parallel execution to gather cloud infrastructure data efficiently.

System overview

The agent package (internal/agent/) orchestrates intelligent context gathering through several specialized subsystems:

agent/
├── agent.go          # High-level orchestrator and public entry points
├── coordinator/      # Dependency-aware parallel execution driver
├── decisiontree/     # Intent rules that map queries to agent types
├── memory/           # Rolling knowledge of previous investigations
├── model/            # Shared structs and type aliases
└── semantic/         # Lightweight NLP classifier for intents

Investigation flow

When you run clanker ask "what lambda functions are failing?", the agent follows this execution path:

Semantic analysis

The semantic.Analyzer performs keyword-based intent classification without external NLP calls. It extracts:

Primary intent (troubleshoot, monitor, analyze)
Urgency level (critical, high, medium, low)
Target services (lambda, ecs, s3, etc.)
Time frame (recent, last_hour, last_day)
Data types (logs, metrics, status)

See internal/agent/semantic/analyzer.go:27
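
To make the keyword-based approach concrete, here is a minimal sketch of substring matching over a lowercased query. The Analysis struct, the keyword tables, and the Classify function are illustrative assumptions, not the contents of analyzer.go, and only three of the five extracted fields are shown.

```go
package main

import (
	"fmt"
	"strings"
)

// Analysis is a hypothetical result type; the real analyzer output may differ.
type Analysis struct {
	Intent   string
	Urgency  string
	Services []string
}

// Illustrative keyword tables, not Clanker's actual vocabulary.
var intentKeywords = []struct {
	intent   string
	keywords []string
}{
	{"troubleshoot", []string{"fail", "error", "broken", "crash"}},
	{"monitor", []string{"status", "health", "watch"}},
}

var knownServices = []string{"lambda", "ecs", "s3"}

// containsAny reports whether s contains at least one of the keywords.
func containsAny(s string, keywords []string) bool {
	for _, k := range keywords {
		if strings.Contains(s, k) {
			return true
		}
	}
	return false
}

// Classify matches substrings in the lowercased query. The first intent
// with a keyword hit wins; "analyze" and "medium" are the assumed defaults.
func Classify(query string) Analysis {
	q := strings.ToLower(query)
	a := Analysis{Intent: "analyze", Urgency: "medium"}
	for _, entry := range intentKeywords {
		if containsAny(q, entry.keywords) {
			a.Intent = entry.intent
			break
		}
	}
	if containsAny(q, []string{"urgent", "outage", "down"}) {
		a.Urgency = "critical"
	}
	for _, svc := range knownServices {
		if strings.Contains(q, svc) {
			a.Services = append(a.Services, svc)
		}
	}
	return a
}

func main() {
	a := Classify("what lambda functions are failing?")
	fmt.Println(a.Intent, a.Urgency, a.Services) // troubleshoot medium [lambda]
}
```

Plain substring matching keeps classification local and fast, at the cost of false positives on words that happen to contain a keyword.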

Decision tree traversal

The decision tree maps semantic intent to concrete agent types and execution parameters:

type Node struct {
    ID         string
    Name       string
    Condition  string        // e.g., "contains_keywords(['error', 'fail'])"
    Priority   int
    AgentTypes []string      // e.g., ["log", "metrics"]
    Parameters model.AWSData
}

Nodes are evaluated depth-first. Matching conditions spawn their configured agent types. See internal/agent/decisiontree/tree.go:31
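
The depth-first evaluation can be sketched as follows. This is a simplified stand-in for the real tree: the Keywords slice replaces the Condition string, and the Children field is an assumption, since the Node struct above does not show how child nodes are linked.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a reduced, hypothetical version of decisiontree.Node.
type node struct {
	ID         string
	Keywords   []string // stands in for the Condition expression
	AgentTypes []string
	Children   []*node // assumed field
}

// matches reports whether the query contains any of the node's keywords.
// A node without keywords (for example, the root) always matches.
func (n *node) matches(query string) bool {
	if len(n.Keywords) == 0 {
		return true
	}
	for _, k := range n.Keywords {
		if strings.Contains(query, k) {
			return true
		}
	}
	return false
}

// collect walks the tree depth-first, descends only into matching nodes,
// and returns the de-duplicated agent types in visit order.
func collect(n *node, query string, seen map[string]bool, out []string) []string {
	if !n.matches(query) {
		return out
	}
	for _, t := range n.AgentTypes {
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	for _, c := range n.Children {
		out = collect(c, query, seen, out)
	}
	return out
}

func main() {
	root := &node{ID: "root", Children: []*node{
		{ID: "errors", Keywords: []string{"error", "fail"}, AgentTypes: []string{"log", "metrics"}},
		{ID: "cost", Keywords: []string{"cost", "bill"}, AgentTypes: []string{"cost"}},
	}}
	query := strings.ToLower("what lambda functions are failing?")
	fmt.Println(collect(root, query, map[string]bool{}, nil)) // [log metrics]
}
```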

Dependency scheduling

The DependencyScheduler groups agents by execution order:

Order 1: Independent collectors (log, metrics, k8s)
Order 2: Infrastructure agents requiring basic data (infrastructure, deployment)
Order 3: Analysis agents requiring enriched data (security, queue)
Order 4+: Higher-order insights (cost, availability)

Each agent declares required and provided data:

AgentTypeLog = AgentType{
    Name: "log",
    Dependencies: Dependency{
        ProvidedData:   []string{"logs", "error_patterns", "log_metrics"},
        ExecutionOrder: 1,
    },
}

AgentTypeSecurity = AgentType{
    Name: "security",
    Dependencies: Dependency{
        RequiredData:   []string{"logs", "service_config"},
        ProvidedData:   []string{"security_status", "access_patterns"},
        ExecutionOrder: 3,
    },
}

See internal/agent/coordinator/agent_types.go:5
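
Grouping by execution order amounts to bucketing agent types by their ExecutionOrder value and visiting the buckets in ascending order. The sketch below assumes a reduced agentType struct and a hypothetical groupByOrder helper; the actual DependencyScheduler may be organized differently.

```go
package main

import (
	"fmt"
	"sort"
)

// agentType keeps only the fields needed for grouping (a simplification).
type agentType struct {
	Name           string
	ExecutionOrder int
}

// groupByOrder buckets agents by ExecutionOrder and returns the buckets
// sorted from the lowest order to the highest.
func groupByOrder(agents []agentType) [][]agentType {
	buckets := map[int][]agentType{}
	for _, a := range agents {
		buckets[a.ExecutionOrder] = append(buckets[a.ExecutionOrder], a)
	}
	orders := make([]int, 0, len(buckets))
	for o := range buckets {
		orders = append(orders, o)
	}
	sort.Ints(orders)
	groups := make([][]agentType, 0, len(orders))
	for _, o := range orders {
		groups = append(groups, buckets[o])
	}
	return groups
}

func main() {
	groups := groupByOrder([]agentType{
		{"security", 3}, {"log", 1}, {"infrastructure", 2}, {"metrics", 1},
	})
	for _, g := range groups {
		fmt.Println(g)
	}
}
```

Every agent in one group can start as soon as the previous group has finished, which is what makes the per-group parallelism in the next step possible.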

Parallel execution

Within each order group, agents run concurrently:

for _, group := range planned {
    var wg sync.WaitGroup
    for _, cfg := range group.Agents {
        if scheduler.Ready(cfg.AgentType, dataBus) {
            agent := newParallelAgent(cfg)
            registry.Register(agent)
            wg.Add(1)
            go runPlannedAgent(ctx, &wg, agent)
        }
    }
    wg.Wait()
}

Each agent:

Copies the main context
Executes AWS operations (CLI calls or SDK methods)
Stores results in its local Results map
Publishes promised data to the SharedDataBus

See internal/agent/coordinator/coordinator.go:63

Result aggregation

The coordinator merges all successful agent outputs:

aggregated := make(model.AWSData)
for _, agent := range registry.Agents() {
    if agent.Status != "completed" {
        continue
    }
    agentKey := agent.Type.Name
    aggregated[agentKey] = agent.Results
    for key, value := range agent.Results {
        aggregated[fmt.Sprintf("%s_%s", agentKey, key)] = value
    }
}

Metadata includes execution counts, timestamps, and decision path. See internal/agent/coordinator/coordinator.go:138
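
To show which keys the merge produces, the sketch below replays the same loop on plain structs. The agentResult type and the sample values are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// agentResult is a hypothetical, flattened view of a finished agent.
type agentResult struct {
	Name    string
	Status  string
	Results map[string]any
}

// aggregate keeps only completed agents. Each one contributes its whole
// result map under its name plus one "<agent>_<key>" entry per result.
func aggregate(agents []agentResult) map[string]any {
	out := make(map[string]any)
	for _, a := range agents {
		if a.Status != "completed" {
			continue
		}
		out[a.Name] = a.Results
		for k, v := range a.Results {
			out[fmt.Sprintf("%s_%s", a.Name, k)] = v
		}
	}
	return out
}

func main() {
	merged := aggregate([]agentResult{
		{Name: "log", Status: "completed", Results: map[string]any{"logs": 12, "error_patterns": 3}},
		{Name: "metrics", Status: "failed", Results: map[string]any{"metrics": 0}},
	})
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys) // [log log_error_patterns log_logs]
}
```

The failed metrics agent contributes nothing, so consumers can look up either the nested map (log) or a flattened key (log_logs) without checking agent status themselves.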

Context building

The final context string merges:

Semantic analysis summary
All parallel agent results (grouped by agent type)
Service-specific log analysis
Error patterns and metrics
Agent reasoning chain (chain of thought)

This structured context is fed to the LLM for final answer generation. See internal/agent/agent.go:306

Core components

Agent orchestrator

The Agent type in agent.go wires everything together:

type Agent struct {
    client       *awsclient.Client
    debug        bool
    maxSteps     int
    aiDecisionFn func(context.Context, string) (string, error)
}

func (a *Agent) InvestigateQuery(ctx context.Context, query string) (*AgentContext, error)

Key responsibilities:

Run semantic analysis
Traverse decision tree via coordinator
Spawn parallel agents or fall back to sequential planner
Build final context for LLM

See internal/agent/agent.go:42

Coordinator

The Coordinator drives dependency-tree-based parallel execution:

type Coordinator struct {
    DecisionTree *dt.Tree
    MainContext  *model.AgentContext
    client       *awsclient.Client
    registry     *AgentRegistry
    dataBus      *SharedDataBus
    scheduler    *DependencyScheduler
}

Public methods:

Analyze(query string) — traverse decision tree
SpawnAgents(ctx, applicable) — launch agents by dependency order
WaitForCompletion(ctx, timeout) — block until all agents finish
AggregateResults() — merge successful outputs
Stats() — execution metrics

See internal/agent/coordinator/coordinator.go:34

Shared data bus

The SharedDataBus stores dependency data produced by agents:

type SharedDataBus struct {
    mu   sync.RWMutex
    data map[string]any
}

func (b *SharedDataBus) Store(key string, value any)
func (b *SharedDataBus) Load(key string) (any, bool)
func (b *SharedDataBus) HasAll(keys []string) bool

Agents publish data using keys from ProvidedData. Downstream agents check RequiredData before executing. See internal/agent/coordinator/state.go:10
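
One plausible way to implement the three methods with the fields shown above is a read-write mutex around the map. The method bodies and the NewSharedDataBus constructor are assumptions; state.go may differ.

```go
package main

import (
	"fmt"
	"sync"
)

type SharedDataBus struct {
	mu   sync.RWMutex
	data map[string]any
}

// NewSharedDataBus is a hypothetical constructor that initializes the map.
func NewSharedDataBus() *SharedDataBus {
	return &SharedDataBus{data: make(map[string]any)}
}

// Store publishes a value under key, replacing any previous value.
func (b *SharedDataBus) Store(key string, value any) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.data[key] = value
}

// Load returns the value for key and whether it was present.
func (b *SharedDataBus) Load(key string) (any, bool) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	v, ok := b.data[key]
	return v, ok
}

// HasAll reports whether every key has been published.
func (b *SharedDataBus) HasAll(keys []string) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, k := range keys {
		if _, ok := b.data[k]; !ok {
			return false
		}
	}
	return true
}

func main() {
	bus := NewSharedDataBus()
	required := []string{"logs", "service_config"} // the security agent's RequiredData

	bus.Store("logs", []string{"ERROR timeout"})
	fmt.Println(bus.HasAll(required)) // false

	bus.Store("service_config", map[string]string{"runtime": "go"})
	fmt.Println(bus.HasAll(required)) // true
}
```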

Agent registry

The AgentRegistry tracks running agents and maintains counters:

type AgentStats struct {
    Total     int
    Completed int
    Failed    int
}

type AgentRegistry struct {
    mu     sync.RWMutex
    agents []*ParallelAgent
    stats  AgentStats
}

Thread-safe methods:

Register(agent) — add agent and increment total
MarkCompleted() / MarkFailed() — update counters
Agents() — snapshot of all agents
Stats() — execution summary

See internal/agent/coordinator/state.go:58
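
The snapshot behavior of Agents() is worth illustrating: returning a copy of the slice lets callers iterate without holding the lock. The sketch below uses reduced, hypothetical types and shows only some of the methods; it is not the code in state.go.

```go
package main

import (
	"fmt"
	"sync"
)

type agent struct{ Name string }

type stats struct{ Total, Completed, Failed int }

// registry is a reduced, hypothetical version of AgentRegistry.
type registry struct {
	mu     sync.RWMutex
	agents []*agent
	stats  stats
}

// Register adds an agent and increments the total counter.
func (r *registry) Register(a *agent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.agents = append(r.agents, a)
	r.stats.Total++
}

// MarkCompleted increments the completed counter.
func (r *registry) MarkCompleted() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.stats.Completed++
}

// Agents returns a copy of the slice, so later registrations do not
// change a snapshot that a caller is already iterating over.
func (r *registry) Agents() []*agent {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]*agent(nil), r.agents...)
}

// Stats returns the counters by value.
func (r *registry) Stats() stats {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.stats
}

func main() {
	r := &registry{}
	r.Register(&agent{Name: "log"})
	snapshot := r.Agents()
	r.Register(&agent{Name: "metrics"})
	r.MarkCompleted()
	fmt.Println(len(snapshot), r.Stats()) // 1 {2 1 0}
}
```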

Agent types

Clanker includes these built-in specialist agents:

Log agent

Execution order: 1 (independent)
Provides: logs, error_patterns, log_metrics
Operations:

Discover relevant log groups
Sample recent log entries
Filter error patterns
Extract log stream metadata

See internal/agent/coordinator/agent_types.go:28

Metrics agent

Execution order: 1 (independent)
Provides: metrics, performance_data, thresholds
Operations:

Query CloudWatch metrics
Check alarm states
Aggregate performance data

See internal/agent/coordinator/agent_types.go:36

Infrastructure agent

Execution order: 2
Provides: service_config, deployment_status, resource_health
Operations:

List EC2, ECS, Lambda resources
Describe service configurations
Check deployment status

See internal/agent/coordinator/agent_types.go:44

Security agent

Execution order: 3
Requires: logs, service_config
Provides: security_status, access_patterns, vulnerabilities
Operations:

Analyze IAM policies
Check security group rules
Audit access logs

See internal/agent/coordinator/agent_types.go:52

Cost agent

Execution order: 4
Requires: metrics, resource_health
Provides: cost_analysis, usage_patterns, optimization_suggestions
Operations:

Query Cost Explorer
Analyze resource utilization
Generate optimization recommendations

See internal/agent/coordinator/agent_types.go:61

K8s agent

Execution order: 1 (independent)
Provides: k8s_resources, k8s_health
Operations:

List pods, deployments, services
Check resource status
Gather cluster metrics

See internal/agent/coordinator/agent_types.go:20

Sequential fallback

When the decision tree returns no applicable nodes, the agent falls back to a traditional sequential approach:

if len(applicableNodes) == 0 {
    a.runSequentialPlanner(ctx, agentCtx)
}

The sequential planner:

Calls an LLM decision function to determine the next action
Executes the chosen action (gather logs, metrics, etc.)
Repeats until complete or maxSteps is reached

This provides a safety net for queries that don’t match decision tree patterns. See internal/agent/planner.go:13

Extending the system

Add a new agent type

Define the agent in internal/agent/coordinator/agent_types.go:

AgentTypeMonitoring = AgentType{
    Name: "monitoring",
    Dependencies: Dependency{
        RequiredData:   []string{"metrics"},
        ProvidedData:   []string{"alerts", "dashboards"},
        ExecutionOrder: 3,
    },
}

Add operations in internal/agent/coordinator/operations.go:

func (c *Coordinator) getOperationsForAgentType(agt AgentType) []awsclient.LLMOperation {
    switch agt.Name {
    case "monitoring":
        return []awsclient.LLMOperation{
            {Operation: "list_cloudwatch_alarms", Parameters: map[string]any{}},
        }
    }
    // Other agent types omitted here; unknown types get no operations.
    return nil
}

Update decision tree

Add decision tree nodes that spawn your agent:

&Node{
    ID:         "monitoring-check",
    Name:       "Monitoring Analysis",
    Condition:  "contains_keywords(['alert', 'alarm', 'dashboard'])",
    Priority:   8,
    AgentTypes: []string{"monitoring"},
}

Keep shared structs in internal/agent/model/ to avoid circular imports. Run gofmt after edits and ensure go build ./... stays green.

Performance considerations

Parallelism

Agents in the same execution order run concurrently, reducing total investigation time. Use --agent-trace to see lifecycle logs.

Timeouts

Each agent type has a WaitTimeout (typically 5-8 seconds). The coordinator waits up to 15 seconds for all agents to complete.

Dependency checks

Agents only execute when their dependencies are satisfied on the data bus. This prevents wasted work and ensures data consistency.

Graceful degradation

Failed agents don’t block the pipeline. The coordinator aggregates whatever data is available and proceeds with partial results.

Debugging agent execution

Enable detailed agent tracing:

clanker ask "show lambda errors" --agent-trace

This outputs:

Decision tree matches and priorities
Execution order groups
Agent start/completion events
Dependency satisfaction checks
Final aggregation stats

Alternatively, set in config:

.clanker.yaml

agent:
  trace: true

See internal/agent/coordinator/coordinator.go:17 for the trace flag check.

Related resources

Debugging: Debug flags and trace output
Backend API: Credential storage and multi-machine sync
Custom profiles: AI provider configuration
Ask command: Natural language queries
