Agent Workflow

Overview

agent-desktop is designed around a deterministic observation-action loop that AI agents execute to control desktop applications. The workflow separates observation (reading UI state) from action (modifying UI state), enabling agents to make informed decisions at each step.

Agent loop:  snapshot → decide → act → snapshot → decide → act → ...

This architecture ensures agents always work with fresh UI state and can recover from failures by re-observing before retrying.
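The loop above can be sketched as a small driver. This is a minimal illustration, not part of agent-desktop: `run_loop` and its three callables (`observe`, `decide`, `act`) are hypothetical names, and in practice `observe` and `act` would wrap the `snapshot` and action commands.

```python
from typing import Callable, List, Optional


def run_loop(
    observe: Callable[[], dict],
    decide: Callable[[dict], Optional[dict]],
    act: Callable[[dict], dict],
    max_steps: int = 20,
) -> List[dict]:
    """Run snapshot -> decide -> act until decide() returns None or max_steps is hit."""
    results = []
    for _ in range(max_steps):
        snapshot = observe()       # always re-observe before deciding
        action = decide(snapshot)  # agent logic lives outside the tool
        if action is None:         # the agent considers the task finished
            break
        results.append(act(action))
    return results
```

The `max_steps` bound is a design choice of this sketch so that a confused agent cannot loop forever.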

The Three-Phase Cycle

1. Snapshot (Observe)

Capture the current accessibility tree and generate refs for all interactive elements:

agent-desktop snapshot --app Finder -i

Example snapshot output

{
  "version": "1.0",
  "ok": true,
  "command": "snapshot",
  "data": {
    "app": "Finder",
    "window": {
      "id": "w-4521",
      "title": "Documents"
    },
    "ref_count": 14,
    "tree": {
      "role": "window",
      "name": "Documents",
      "children": [
        {
          "ref_id": "@e1",
          "role": "button",
          "name": "New Folder",
          "states": ["enabled"]
        },
        {
          "ref_id": "@e2",
          "role": "textfield",
          "name": "Search",
          "value": ""
        }
      ]
    }
  }
}

Each interactive element receives a ref (@e1, @e2, etc.) that actions can target. In a full snapshot (without -i), static elements (labels, groups) also appear for context but receive no ref.
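As a rough illustration of consuming this output, the sketch below walks the tree and indexes nodes by ref. The field names (`data`, `tree`, `children`, `ref_id`) come from the example above; the helper names `iter_nodes` and `collect_refs` are hypothetical, and nested children are assumed to have the same shape as the top-level ones.

```python
import json
from typing import Dict, Iterator


def iter_nodes(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def collect_refs(snapshot: dict) -> Dict[str, dict]:
    """Map each ref_id to its node; nodes without a ref_id are skipped."""
    tree = snapshot["data"]["tree"]
    return {n["ref_id"]: n for n in iter_nodes(tree) if "ref_id" in n}


# Trimmed copy of the example snapshot output.
SAMPLE = json.loads("""
{
  "ok": true,
  "data": {
    "tree": {
      "role": "window",
      "name": "Documents",
      "children": [
        {"ref_id": "@e1", "role": "button", "name": "New Folder"},
        {"ref_id": "@e2", "role": "textfield", "name": "Search", "value": ""}
      ]
    }
  }
}
""")

refs = collect_refs(SAMPLE)
print(sorted(refs))  # the window itself has no ref_id, so only the two children appear
```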

2. Decide (Agent Logic)

The AI agent analyzes the snapshot JSON and determines the next action based on:

Current UI state (element roles, names, values, states)
Task objective (e.g., “Open quarterly report document”)
Available refs for interactive elements

Decision logic lives outside agent-desktop. The tool provides structured data; agents interpret it using LLMs or rule-based systems.
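A rule-based decision step might look like the following sketch. It assumes the snapshot shape shown earlier; `find_ref` and `decide_click` are hypothetical helpers, and an LLM-backed agent would replace the exact role/name match with its own reasoning.

```python
from typing import List, Optional


def find_ref(node: dict, role: str, name: str) -> Optional[str]:
    """Depth-first search for the first ref-bearing node matching role and name."""
    if node.get("role") == role and node.get("name") == name and "ref_id" in node:
        return node["ref_id"]
    for child in node.get("children", []):
        found = find_ref(child, role, name)
        if found is not None:
            return found
    return None


def decide_click(snapshot: dict, role: str, name: str) -> Optional[List[str]]:
    """Return the argv for a click on the matching element, or None if it is absent."""
    ref = find_ref(snapshot["data"]["tree"], role, name)
    if ref is None:
        return None
    return ["agent-desktop", "click", ref]
```

Returning an argument list rather than a shell string keeps quoting out of the decision logic; the caller can hand it to a process runner unchanged.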

3. Act (Execute)

Perform the chosen action using a ref from the most recent snapshot:

agent-desktop click @e3
agent-desktop type @e5 "quarterly report"
agent-desktop press cmd+s

Actions are AX-first: agent-desktop exhausts accessibility API strategies before falling back to mouse/keyboard synthesis. This makes automation more reliable and compatible with assistive technologies.

Why Refs Are Critical

Determinism

Refs provide stable identifiers within a snapshot session:

# Snapshot assigns refs
agent-desktop snapshot --app TextEdit -i
# → @e1 = "Save" button, @e2 = text area

# Agent can reliably reference elements
agent-desktop type @e2 "Hello world"
agent-desktop click @e1

Without refs, agents would need to re-query the accessibility tree for every action using fragile selectors (“find button named ‘Save’”), which break when the UI changes mid-session.

Performance

Refs avoid expensive tree traversals. Action commands use optimistic re-identification to locate elements without walking the full tree:

Load ref metadata from ~/.agent-desktop/last_refmap.json
Match element by (pid, role, name, bounds_hash)
Return STALE_REF error if element changed

This is 10-100x faster than searching the entire tree.
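The matching step can be modeled as a fingerprint comparison. The sketch below is a simplified model under stated assumptions, not the tool's implementation: the in-memory refmap layout, the `Fingerprint` type, and the `resolve` function are all hypothetical, and the on-disk format of last_refmap.json is not shown here.

```python
from typing import Dict, Iterable, NamedTuple


class Fingerprint(NamedTuple):
    """Identity tuple from the steps above: (pid, role, name, bounds_hash)."""
    pid: int
    role: str
    name: str
    bounds_hash: str


def resolve(ref_id: str, refmap: Dict[str, Fingerprint],
            live: Iterable[Fingerprint]) -> dict:
    """Return the stored fingerprint if a live element still matches it exactly."""
    expected = refmap.get(ref_id)
    if expected is None or expected not in set(live):
        # Unknown ref, or the element moved, was renamed, or disappeared.
        return {"ok": False, "error": {"code": "STALE_REF"}}
    return {"ok": True, "element": expected}
```

Because the comparison is exact, any change to the element's bounds or name is reported as STALE_REF rather than guessed at, which is what makes the lookup "optimistic".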

Handling UI Changes

Stale Refs

Refs become stale when the UI changes after the snapshot:

New window opened
Dialog appeared/dismissed
Element moved or removed
Application restarted

When an action encounters a stale ref:

{
  "version": "1.0",
  "ok": false,
  "command": "click",
  "error": {
    "code": "STALE_REF",
    "message": "@e7 not found in current RefMap",
    "suggestion": "Run 'snapshot' to refresh, then retry with updated ref"
  }
}

The recovery pattern is always the same:

snapshot → act → STALE_REF? → snapshot again → retry

Example Recovery Flow

# Initial snapshot
agent-desktop snapshot --app Safari -i
# → @e5 = address bar

# Try to type URL
agent-desktop type @e5 "github.com"
# → STALE_REF (user clicked something, UI changed)

# Re-snapshot to get fresh refs
agent-desktop snapshot --app Safari -i
# → @e3 = address bar (ref ID changed because tree structure changed)

# Retry with new ref
agent-desktop type @e3 "github.com"
# → Success

Agents must never cache or reuse refs across snapshots. Each snapshot replaces the refmap entirely.
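The recovery pattern can be wrapped in a bounded retry. This is a sketch only: `act_with_recovery` and its callables (`snapshot`, `pick_ref`, `act`) are hypothetical wrappers around the CLI, and the response shape is taken from the STALE_REF example above.

```python
from typing import Callable, Optional


def act_with_recovery(
    snapshot: Callable[[], dict],
    pick_ref: Callable[[dict], Optional[str]],
    act: Callable[[str], dict],
    max_attempts: int = 3,
) -> dict:
    """Snapshot, act, and on STALE_REF re-snapshot and retry with a freshly chosen ref."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result: dict = {}
    for _ in range(max_attempts):
        ref = pick_ref(snapshot())  # never reuse a ref from an older snapshot
        if ref is None:
            raise LookupError("target element is not present in the snapshot")
        result = act(ref)
        error_code = result.get("error", {}).get("code")
        if result.get("ok") or error_code != "STALE_REF":
            return result           # success, or an error that a retry will not fix
    return result                   # still stale after max_attempts
```

The ref is looked up again after every snapshot, so a change such as @e5 becoming @e3 is picked up without extra code.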

Optimization: Interactive-Only Mode

Use the -i / --interactive-only flag to reduce snapshot size:

agent-desktop snapshot --app Xcode -i

Without -i: Tree includes all elements (labels, groups, separators, containers) → 5000+ nodes
With -i: Tree includes only interactive elements with refs → 200-300 nodes

This dramatically reduces:

Snapshot execution time (300ms → 50ms for complex apps)
JSON payload size (500KB → 50KB)
LLM token consumption (fewer nodes to analyze)

Always use -i unless you need full structural context.

Batch Execution

For multi-step sequences, use the batch command to execute several actions in a single invocation:

agent-desktop batch '[
  {"command": "click", "args": {"ref_id": "@e2"}},
  {"command": "type", "args": {"ref_id": "@e5", "text": "hello"}},
  {"command": "press", "args": {"combo": "return"}}
]' --stop-on-error

Batch mode:

Executes commands sequentially
Stops on first error if --stop-on-error is set
Returns array of results
Does not automatically re-snapshot between actions

Batch is useful for deterministic sequences where you know refs won’t change. For dynamic scenarios, prefer separate commands with snapshot validation between steps.
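When the batch is assembled programmatically, building the argument list in code avoids hand-quoting JSON in a shell. The sketch below mirrors the command shape shown above; `build_batch_argv` is a hypothetical helper, not something agent-desktop provides.

```python
import json
from typing import List

# Same three steps as the batch example above.
STEPS = [
    {"command": "click", "args": {"ref_id": "@e2"}},
    {"command": "type", "args": {"ref_id": "@e5", "text": "hello"}},
    {"command": "press", "args": {"combo": "return"}},
]


def build_batch_argv(steps: List[dict], stop_on_error: bool = True) -> List[str]:
    """Serialize steps into the JSON array that the batch command takes as its argument."""
    argv = ["agent-desktop", "batch", json.dumps(steps)]
    if stop_on_error:
        argv.append("--stop-on-error")
    return argv
```

The resulting list can be passed directly to `subprocess.run(argv, capture_output=True, text=True)`, which bypasses the shell entirely.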

Complete Workflow Example

# 1. Observe: Get current UI state
agent-desktop snapshot --app Finder -i
# Analyze JSON → decide to create new folder

# 2. Act: Click "New Folder" button
agent-desktop click @e8

# 3. Observe: Re-snapshot to see new state (folder name input appeared)
agent-desktop snapshot --app Finder -i

# 4. Act: Type folder name
agent-desktop type @e12 "Project Files"

# 5. Act: Confirm
agent-desktop press return

# 6. Observe: Verify folder created
agent-desktop snapshot --app Finder -i
# Check tree for "Project Files" folder element
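The final verification step can be done by diffing two snapshots. The following sketch assumes the snapshot shape from the earlier example; `element_names` and `appeared` are hypothetical helpers.

```python
from typing import Set


def element_names(node: dict) -> Set[str]:
    """Collect the non-empty names of a node and all of its descendants."""
    names = {node["name"]} if node.get("name") else set()
    for child in node.get("children", []):
        names |= element_names(child)
    return names


def appeared(before: dict, after: dict) -> Set[str]:
    """Names present in the later snapshot but not in the earlier one."""
    return element_names(after["data"]["tree"]) - element_names(before["data"]["tree"])
```

For the workflow above, the agent would treat the action as confirmed when "Project Files" is in `appeared(before, after)`.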

Key Takeaways

Snapshots create fresh refs

Every snapshot command replaces the refmap entirely. Old refs become invalid.

Refs are session-scoped

Refs are valid until the next snapshot. Don’t cache them across observations.

Always handle STALE_REF

The correct recovery is: snapshot → retry. Never skip re-observation.

Use -i for efficiency

Interactive-only mode is the default recommendation for agent workflows.

Next Steps

Ref System

Deep dive into ref allocation and resolution

Error Handling

Learn recovery patterns for all error codes
