Overview
agent-desktop is designed around a deterministic observation-action loop that AI agents execute to control desktop applications. The workflow separates observation (reading UI state) from action (modifying UI state), enabling agents to make informed decisions at each step.
The Three-Phase Cycle
1. Snapshot (Observe)
Capture the current accessibility tree and generate refs for all interactive elements.
Interactive elements receive refs (@e1, @e2, etc.) that can be used to target actions. Static elements (labels, groups) appear for context but receive no ref.
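To make the ref assignment concrete, here is a minimal Python sketch that walks a snapshot-like tree and collects the refs it contains. The payload shape is an assumption for illustration only: the field names role, name, ref, and children are hypothetical and may not match the JSON that agent-desktop actually emits.

```python
import json

# Hypothetical snapshot payload. The field names ("role", "name", "ref",
# "children") are assumptions for illustration, not the tool's documented schema.
SNAPSHOT_JSON = """
{
  "role": "window", "name": "Documents",
  "children": [
    {"role": "label", "name": "Recent files"},
    {"role": "button", "name": "Open", "ref": "@e1"},
    {"role": "group", "name": "Toolbar", "children": [
      {"role": "textfield", "name": "Search", "ref": "@e2"}
    ]}
  ]
}
"""


def collect_refs(node):
    """Walk the tree depth-first and map each ref to its (role, name)."""
    refs = {}
    if "ref" in node:
        refs[node["ref"]] = (node["role"], node["name"])
    for child in node.get("children", []):
        refs.update(collect_refs(child))
    return refs


refs = collect_refs(json.loads(SNAPSHOT_JSON))
print(refs)
```

In this sketch the label and the group carry no ref, so only the button and the text field end up as action targets.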
2. Decide (Agent Logic)
The AI agent analyzes the snapshot JSON and determines the next action based on:
- Current UI state (element roles, names, values, states)
- Task objective (e.g., “Open quarterly report document”)
- Available refs for interactive elements
Decision logic lives outside agent-desktop. The tool provides structured data; agents interpret it using LLMs or rule-based systems.
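As a rough illustration of the rule-based option, the sketch below picks a ref by matching role and name against the task objective. The function choose_ref and the flat element list are hypothetical; an LLM-driven agent would replace this function with a model call over the same data.

```python
def choose_ref(elements, role, name_contains):
    """Return the ref of the first interactive element matching the rule, or None."""
    needle = name_contains.lower()
    for el in elements:
        # Elements without a ref are static context and cannot be targeted.
        if el.get("ref") and el["role"] == role and needle in el["name"].lower():
            return el["ref"]
    return None


# Hypothetical flattened snapshot for the task "Open quarterly report document".
elements = [
    {"role": "label", "name": "Recent documents"},
    {"role": "button", "name": "Open Quarterly Report", "ref": "@e1"},
    {"role": "button", "name": "Cancel", "ref": "@e2"},
]

target = choose_ref(elements, role="button", name_contains="quarterly report")
print(target)
```

Returning None when nothing matches gives the caller an explicit signal to re-observe or give up, rather than acting on a guess.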
3. Act (Execute)
Perform the chosen action using a ref from the previous snapshot.
Why Refs Are Critical
Determinism
Refs provide stable identifiers within a snapshot session.
Performance
Refs avoid expensive tree traversals. Action commands use optimistic re-identification to locate elements instantly:
- Load ref metadata from ~/.agent-desktop/last_refmap.json
- Match element by (pid, role, name, bounds_hash)
- Return STALE_REF error if element changed
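The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the Element type, the bounds_hash function (a truncated SHA-256 here), and the in-memory refmap are all hypothetical stand-ins for whatever agent-desktop stores in last_refmap.json.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    pid: int
    role: str
    name: str
    bounds: tuple  # (x, y, width, height)


def bounds_hash(bounds):
    # Assumption: any stable digest of the bounds works for this sketch.
    return hashlib.sha256(repr(bounds).encode()).hexdigest()[:12]


def identity(el):
    """The (pid, role, name, bounds_hash) tuple used for matching."""
    return (el.pid, el.role, el.name, bounds_hash(el.bounds))


class StaleRefError(Exception):
    code = "STALE_REF"


def resolve(ref, refmap, candidates):
    """Return the live element whose identity matches the saved one, else raise."""
    expected = tuple(refmap[ref])
    for el in candidates:
        if identity(el) == expected:
            return el
    raise StaleRefError(ref)


saved = Element(101, "button", "Save", (10, 20, 80, 24))
refmap = {"@e1": identity(saved)}  # stands in for the saved ref metadata

moved = Element(101, "button", "Save", (10, 60, 80, 24))
try:
    resolve("@e1", refmap, [moved])
    outcome = "resolved"
except StaleRefError as err:
    outcome = err.code
print(outcome)
```

Because the bounds feed into the identity tuple, an element that merely moved no longer matches, which is consistent with "Element moved" being listed as a cause of stale refs.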
Handling UI Changes
Stale Refs
Refs become stale when the UI changes after the snapshot:
- New window opened
- Dialog appeared/dismissed
- Element moved or removed
- Application restarted
Example Recovery Flow
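A minimal sketch of the snapshot-then-retry pattern, written against a toy in-memory desktop rather than the real tool. FakeDesktop, StaleRef, and click_with_recovery are hypothetical names introduced for illustration; a real agent would invoke the agent-desktop commands at the marked steps.

```python
class StaleRef(Exception):
    """Stands in for an action that fails with the STALE_REF error code."""
    code = "STALE_REF"


class FakeDesktop:
    """Toy desktop whose UI version changes whenever the UI is modified."""

    def __init__(self):
        self.ui_version = 0

    def ui_changed(self):
        self.ui_version += 1  # e.g. a dialog appeared

    def snapshot(self):
        return {"ui_version": self.ui_version, "refs": {"Save": "@e1"}}

    def click(self, snap, name):
        if snap["ui_version"] != self.ui_version:
            raise StaleRef(snap["refs"][name])
        return f"clicked {name}"


def click_with_recovery(desktop, snap, name, max_retries=1):
    """Try the action; on a stale ref, re-observe and retry."""
    for attempt in range(max_retries + 1):
        try:
            return desktop.click(snap, name), snap
        except StaleRef:
            if attempt == max_retries:
                raise
            snap = desktop.snapshot()  # never retry without re-observing


desktop = FakeDesktop()
snap = desktop.snapshot()
desktop.ui_changed()  # the UI changes after the snapshot was taken
result, fresh = click_with_recovery(desktop, snap, "Save")
print(result, fresh["ui_version"])
```

Bounding the retries keeps an agent from looping forever on a UI that changes faster than it can observe.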
Optimization: Interactive-Only Mode
Use the -i / --interactive-only flag to reduce snapshot size:
- Without -i: Tree includes all elements (labels, groups, separators, containers) → 5000+ nodes
- With -i: Tree includes only interactive elements with refs → 200-300 nodes
This dramatically reduces:
- Snapshot execution time (300ms → 50ms for complex apps)
- JSON payload size (500KB → 50KB)
- LLM token consumption (fewer nodes to analyze)
Use -i unless you need full structural context.
Batch Execution
For multi-step sequences, use the batch command to execute actions atomically:
- Executes commands sequentially
- Stops on first error if --stop-on-error is set
- Returns array of results
- Does not automatically re-snapshot between actions
Batch is useful for deterministic sequences where you know refs won’t change. For dynamic scenarios, prefer separate commands with snapshot validation between steps.
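The batch semantics listed above can be modeled in a few lines of Python. This is a sketch of the described behavior only: run_batch, the command strings, and the result fields are hypothetical and do not reflect the tool's actual input or output format.

```python
def run_batch(commands, execute, stop_on_error=False):
    """Run commands in order and return one result entry per attempted command."""
    results = []
    for cmd in commands:
        try:
            results.append({"command": cmd, "ok": True, "result": execute(cmd)})
        except Exception as exc:
            results.append({"command": cmd, "ok": False, "error": str(exc)})
            if stop_on_error:
                break  # later commands are never attempted
    return results


def fake_execute(cmd):
    # Hypothetical executor: pretend that ref @e9 went stale.
    if cmd.endswith("@e9"):
        raise RuntimeError("STALE_REF")
    return "ok"


commands = ["click @e1", "click @e9", "click @e2"]
stopped = run_batch(commands, fake_execute, stop_on_error=True)
full = run_batch(commands, fake_execute)
print(len(stopped), len(full))
```

Note that nothing in the loop re-snapshots between commands, so every ref in the batch must come from the same observation.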
Complete Workflow Example
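The sketch below runs the full observe, decide, act cycle against a toy application. FakeApp, decide, and run are hypothetical names for illustration; in a real workflow each marked step would call agent-desktop and the decision would typically come from an LLM.

```python
class FakeApp:
    """Toy app: clicking File reveals an Open Report item."""

    def __init__(self):
        self.menu_open = False
        self.report_open = False
        self._refmap = {}

    def snapshot(self):
        names = ["Open Report"] if self.menu_open else ["File"]
        # Each snapshot replaces the refmap entirely.
        self._refmap = {f"@e{i}": n for i, n in enumerate(names, start=1)}
        return [{"ref": r, "role": "menuitem", "name": n}
                for r, n in self._refmap.items()]

    def click(self, ref):
        name = self._refmap[ref]
        if name == "File":
            self.menu_open = True
        elif name == "Open Report":
            self.report_open = True


def decide(snapshot):
    """Prefer the goal item; otherwise open the menu that leads to it."""
    for wanted in ("Open Report", "File"):
        for el in snapshot:
            if el["name"] == wanted:
                return el["ref"]
    return None


def run(app, max_steps=5):
    trace = []
    for _ in range(max_steps):
        if app.report_open:
            break
        snap = app.snapshot()  # 1. observe
        ref = decide(snap)     # 2. decide
        if ref is None:
            break
        app.click(ref)         # 3. act
        trace.append(ref)
    return trace


app = FakeApp()
trace = run(app)
print(trace, app.report_open)
```

In this toy run the same ref string, @e1, points at File in the first snapshot and at Open Report in the second, which is why refs should not be cached across observations.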
Key Takeaways
Snapshots create fresh refs
Every snapshot command replaces the refmap entirely. Old refs become invalid.
Refs are session-scoped
Refs are valid until the next snapshot. Don’t cache them across observations.
Always handle STALE_REF
The correct recovery is: snapshot → retry. Never skip re-observation.
Use -i for efficiency
Interactive-only mode is the default recommendation for agent workflows.
Next Steps
Ref System
Deep dive into ref allocation and resolution
Error Handling
Learn recovery patterns for all error codes