Stagehand combines AI with direct browser control to create reliable automation that adapts to real-world web applications. Here’s how it works under the hood.

Architecture Overview

Stagehand’s V3 architecture orchestrates multiple components that work together:

V3 Core

The main orchestrator that manages browser lifecycle, handles method routing, and coordinates between all components.

Handlers

Specialized classes (ActHandler, ExtractHandler, ObserveHandler) that translate user instructions into browser actions.

Context & Pages

Manages CDP connections, frame trees, and page lifecycle across both local Chrome and Browserbase.

Cache System

Self-healing cache that replays successful actions without LLM calls.

The AI + Code Pipeline

When you call a Stagehand method, here’s what happens:

1. Instruction Processing

You provide a natural language instruction:
await stagehand.act("click the submit button");
Instructions are processed alongside optional custom system prompts, allowing you to guide the AI’s behavior.

2. DOM Snapshot Capture

Stagehand captures a hybrid accessibility tree that combines:
  • Semantic structure from the accessibility tree
  • Interactive elements from the DOM
  • Shadow DOM piercing to access elements inside web components
v3.ts:155-158
const { combinedTree, combinedXpathMap } = await captureHybridSnapshot(
  page,
  { experimental: true },
);
This creates a compact, LLM-friendly representation of the page that focuses on actionable elements rather than overwhelming the model with every DOM node.
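To make the idea concrete, here is a minimal sketch of how a hybrid snapshot might be flattened into an outline plus an XPath lookup table. It is not Stagehand's implementation: `SketchNode`, `Snapshot`, and `captureSketchSnapshot` are hypothetical names, and the rule for which nodes to keep (interactive or named) is an assumption made for illustration.

```typescript
// Hypothetical node shape; Stagehand's real snapshot types are not shown here.
interface SketchNode {
  role: string;            // accessibility role, e.g. "button"
  name?: string;           // accessible name, if any
  tag: string;             // DOM tag, used to build the XPath
  interactive?: boolean;   // true when the DOM marks the node as actionable
  children?: SketchNode[];
}

interface Snapshot {
  tree: string;                     // indented outline intended for the LLM
  xpathMap: Record<string, string>; // element id -> XPath
}

function captureSketchSnapshot(root: SketchNode): Snapshot {
  const lines: string[] = [];
  const xpathMap: Record<string, string> = {};
  let nextId = 0;

  const visit = (node: SketchNode, xpath: string, depth: number): void => {
    // Assumption: keep only nodes that are interactive or carry a name.
    const keep = node.interactive === true || Boolean(node.name);
    let childDepth = depth;
    if (keep) {
      const id = String(nextId++);
      xpathMap[id] = xpath;
      const label = node.name ? `: ${node.name}` : "";
      lines.push(`${"  ".repeat(depth)}[${id}] ${node.role}${label}`);
      childDepth = depth + 1;
    }
    // Count siblings per tag so each child gets an unambiguous XPath index.
    const seen: Record<string, number> = {};
    for (const child of node.children ?? []) {
      seen[child.tag] = (seen[child.tag] ?? 0) + 1;
      visit(child, `${xpath}/${child.tag}[${seen[child.tag]}]`, childDepth);
    }
  };

  visit(root, `/${root.tag}[1]`, 0);
  return { tree: lines.join("\n"), xpathMap };
}

const samplePage: SketchNode = {
  role: "document",
  tag: "html",
  children: [
    {
      role: "generic",
      tag: "div",
      children: [
        { role: "textbox", name: "Email", tag: "input", interactive: true },
        { role: "button", name: "Submit", tag: "button", interactive: true },
      ],
    },
  ],
};

const snapshot = captureSketchSnapshot(samplePage);
console.log(snapshot.tree);
```

The unnamed wrapper nodes are dropped from the outline, while the XPath map still records the full path to each kept element, so a later step can resolve an id back to a concrete selector.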

3. LLM Inference

The instruction and DOM snapshot are sent to the LLM along with a system prompt that defines the task:
prompt.ts:150-169
const actSystemPrompt = `
You are helping the user automate the browser by finding elements based on what action the user wants to take on the page

You will be given:
1. a user defined instruction about what action to take
2. a hierarchical accessibility tree showing the semantic structure of the page.

Return the element that matches the instruction if it exists. Otherwise, return an empty object.`;
The LLM returns structured data identifying:
  • Element selector (XPath)
  • Action method (click, type, scroll, etc.)
  • Arguments (text to type, keys to press, etc.)
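Because the model's reply drives real browser actions, it is worth validating before use. The sketch below shows one way to do that, assuming the reply arrives as a JSON string with `selector`, `method`, and `arguments` fields. Those field names and the `parseElementAction` helper are illustrative assumptions, not Stagehand's actual schema.

```typescript
// Hypothetical response shape based on the three fields listed above.
interface ElementAction {
  selector: string;    // XPath of the target element
  method: string;      // e.g. "click", "type", "scroll"
  arguments: string[]; // e.g. text to type or keys to press
}

// Returns null for the "no matching element" case (an empty object)
// and throws when the payload is malformed.
function parseElementAction(raw: string): ElementAction | null {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("LLM response must be a JSON object");
  }
  const obj = data as Record<string, unknown>;
  if (Object.keys(obj).length === 0) return null;

  const selector = obj["selector"];
  const method = obj["method"];
  const args = obj["arguments"] ?? [];
  if (typeof selector !== "string" || selector.length === 0) {
    throw new Error("selector must be a non-empty string");
  }
  if (typeof method !== "string" || method.length === 0) {
    throw new Error("method must be a non-empty string");
  }
  if (!Array.isArray(args) || !args.every((a) => typeof a === "string")) {
    throw new Error("arguments must be an array of strings");
  }
  return { selector, method, arguments: args as string[] };
}

const matchedAction = parseElementAction(
  '{"selector":"/html[1]/body[1]/button[1]","method":"click","arguments":[]}',
);
const noMatch = parseElementAction("{}");
console.log(matchedAction, noMatch);
```

Treating the empty object as `null` keeps the "element not found" case distinct from a malformed reply, which is reported as an error.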

4. Deterministic Execution

Once the LLM identifies the action, Stagehand executes it using deterministic browser control:
actHandler.ts:191-197
const firstResult = await this.takeDeterministicAction(
  firstAction,
  page,
  this.defaultDomSettleTimeoutMs,
  llmClient,
  ensureTimeRemaining,
  variables,
);
This separation is crucial:
  • AI makes decisions about what to do
  • Code executes actions deterministically via CDP
The AI never directly controls the browser. It identifies targets, and reliable code handles the actual interactions.
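The split between deciding and executing can be sketched as a lookup table from method names to fixed executor functions. Everything here is hypothetical: `RecordingDriver` stands in for a CDP-backed page driver and only records calls, and `executePlannedAction` is not a Stagehand function.

```typescript
interface PlannedAction {
  selector: string;
  method: string;
  arguments: string[];
}

// Hypothetical stand-in for a CDP-backed page driver; it only records calls.
class RecordingDriver {
  readonly calls: string[] = [];

  click(selector: string): void {
    this.calls.push(`click ${selector}`);
  }

  type(selector: string, text: string): void {
    this.calls.push(`type ${selector} "${text}"`);
  }
}

type Executor = (driver: RecordingDriver, action: PlannedAction) => void;

// The model only picks a key from this table; the code behind each key is fixed.
const executors = new Map<string, Executor>([
  ["click", (driver, action) => driver.click(action.selector)],
  ["type", (driver, action) => driver.type(action.selector, action.arguments[0] ?? "")],
]);

function executePlannedAction(driver: RecordingDriver, action: PlannedAction): void {
  const run = executors.get(action.method);
  if (!run) throw new Error(`Unsupported method: ${action.method}`);
  run(driver, action);
}

const driver = new RecordingDriver();
executePlannedAction(driver, { selector: "//input[1]", method: "type", arguments: ["hello"] });
executePlannedAction(driver, { selector: "//button[1]", method: "click", arguments: [] });
console.log(driver.calls);
```

An unknown method name is rejected rather than improvised, which is what keeps the execution side predictable even when the model's output is not.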

5. Self-Healing

If an element has moved or changed, Stagehand can automatically adapt:
  1. Initial attempt using the cached selector fails
  2. Re-capture the current DOM state
  3. Diff the trees to find where the element moved
  4. Update the selector and retry
  5. Update the cache with the new selector
actCache.ts:261-269
if (
  success &&
  actions.length > 0 &&
  this.haveActionsChanged(entry.actions, actions)
) {
  await this.refreshCacheEntry(context, {
    ...entry,
    actions,
  });
}
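The replay-then-heal flow can be sketched as follows. This is a simplified illustration, not the code in actCache.ts: it re-runs inference instead of diffing trees, the `attempt` and `infer` callbacks are hypothetical, and both are synchronous here for brevity even though real browser and LLM calls would be asynchronous.

```typescript
interface CachedAction {
  selector: string;
  method: string;
}

// Hypothetical callbacks: `attempt` runs an action and reports success,
// `infer` stands in for a fresh snapshot plus an LLM call.
interface HealDeps {
  attempt: (action: CachedAction) => boolean;
  infer: (instruction: string) => CachedAction;
}

class SelfHealingCache {
  private readonly entries = new Map<string, CachedAction>();
  private readonly deps: HealDeps;
  llmCalls = 0;

  constructor(deps: HealDeps) {
    this.deps = deps;
  }

  run(instruction: string): boolean {
    const cached = this.entries.get(instruction);
    // Fast path: replay the cached action without calling the LLM.
    if (cached && this.deps.attempt(cached)) return true;

    // Cache miss or stale selector: infer again against the current page.
    this.llmCalls += 1;
    const fresh = this.deps.infer(instruction);
    const ok = this.deps.attempt(fresh);
    const changed =
      !cached || cached.selector !== fresh.selector || cached.method !== fresh.method;
    // Only write back when the new action worked and differs from the entry.
    if (ok && changed) this.entries.set(instruction, fresh);
    return ok;
  }

  peek(instruction: string): CachedAction | undefined {
    return this.entries.get(instruction);
  }
}

// Simulated page whose button moves between runs.
let liveSelector = "/html[1]/body[1]/button[1]";
const healingCache = new SelfHealingCache({
  attempt: (action) => action.selector === liveSelector,
  infer: () => ({ selector: liveSelector, method: "click" }),
});

healingCache.run("click the submit button"); // first run: one LLM call, entry stored
healingCache.run("click the submit button"); // replay: no LLM call
liveSelector = "/html[1]/body[1]/div[1]/button[1]"; // the page layout changes
healingCache.run("click the submit button"); // cached selector fails, entry refreshed
console.log(healingCache.llmCalls, healingCache.peek("click the submit button"));
```

As in the excerpt above, the cache is only rewritten when the retried action succeeds and actually differs from the stored one, so a failed retry never overwrites a previously good entry.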

Browser Connection Modes

Stagehand supports two execution environments:
Local mode, for development and testing:
const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 2,
});
Stagehand launches and controls a local Chrome instance via chrome-launcher. Perfect for:
  • Local development
  • Debugging with visible browser
  • Testing on your machine
Source: v3.ts:717-873
Both modes use the same CDP protocol under the hood, so your code works identically in either environment.
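Since the automation code is the same in both modes, the environment can be chosen at startup. The helper below is a hypothetical sketch of that choice: `resolveEnv` is not part of Stagehand, and the only values it relies on are the `"LOCAL"` and `"BROWSERBASE"` settings that appear in this document.

```typescript
type StagehandEnv = "LOCAL" | "BROWSERBASE";

// Hypothetical helper: map a free-form setting (for example an environment
// variable you define yourself) onto the two supported values.
function resolveEnv(value: string | undefined, fallback: StagehandEnv = "LOCAL"): StagehandEnv {
  const normalized = (value ?? "").trim().toUpperCase();
  if (normalized === "") return fallback;
  if (normalized === "LOCAL" || normalized === "BROWSERBASE") return normalized;
  throw new Error(`Unknown Stagehand env: ${value}`);
}

// Options objects in the same shape as the constructor call shown above.
const localOptions = { env: resolveEnv(undefined), verbose: 2 };
const remoteOptions = { env: resolveEnv("browserbase"), verbose: 2 };
console.log(localOptions, remoteOptions);
```

Either object could then be passed to the `Stagehand` constructor. A Browserbase session will also need credentials, which this sketch leaves out.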

CDP Connection Management

All browser control flows through a single CdpConnection managed by V3Context:
context.ts:153-172
static async create(
  wsUrl: string,
  opts?: {
    env?: "LOCAL" | "BROWSERBASE";
    apiClient?: StagehandAPIClient | null;
  },
): Promise<V3Context> {
  const conn = await CdpConnection.connect(wsUrl);
  const ctx = new V3Context(conn, opts?.env ?? "LOCAL");
  await ctx.bootstrap();
  await ctx.waitForFirstTopLevelPage(getFirstTopLevelPageTimeoutMs());
  return ctx;
}
V3Context handles target lifecycle, frame events, and OOPIF (out-of-process iframes) automatically, so you don’t have to think about it.
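For readers unfamiliar with CDP, the core of any connection is matching replies to commands by `id` and routing id-less messages as events. The sketch below illustrates that mechanism only. `SketchCdpConnection` and `Transport` are hypothetical names, not Stagehand's `CdpConnection`, and a real transport would wrap a WebSocket.

```typescript
// Minimal transport abstraction (hypothetical); a real one would wrap a WebSocket.
interface Transport {
  send(data: string): void;
}

interface Pending {
  resolve: (result: unknown) => void;
  reject: (error: Error) => void;
}

type EventListener = (params: unknown) => void;

class SketchCdpConnection {
  private nextId = 1;
  private readonly pending = new Map<number, Pending>();
  private readonly listeners = new Map<string, EventListener[]>();
  private readonly transport: Transport;

  constructor(transport: Transport) {
    this.transport = transport;
  }

  // Send a command; the promise settles when a message with the same id arrives.
  send(method: string, params: Record<string, unknown> = {}): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.transport.send(JSON.stringify({ id, method, params }));
    });
  }

  on(event: string, listener: EventListener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(listener);
    this.listeners.set(event, list);
  }

  // Every incoming message is fed through this method.
  handleMessage(data: string): void {
    const msg = JSON.parse(data) as {
      id?: number;
      method?: string;
      params?: unknown;
      result?: unknown;
      error?: { message: string };
    };
    if (typeof msg.id === "number") {
      // A reply: settle the matching command.
      const entry = this.pending.get(msg.id);
      if (!entry) return;
      this.pending.delete(msg.id);
      if (msg.error) entry.reject(new Error(msg.error.message));
      else entry.resolve(msg.result);
    } else if (msg.method) {
      // An event: notify subscribers.
      for (const listener of this.listeners.get(msg.method) ?? []) listener(msg.params);
    }
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}

const sentFrames: string[] = [];
const cdp = new SketchCdpConnection({ send: (data) => { sentFrames.push(data); } });
const firedEvents: string[] = [];
cdp.on("Page.loadEventFired", () => { firedEvents.push("load"); });

const navigation = cdp.send("Page.navigate", { url: "https://example.com" });
// Simulate the browser's reply and a follow-up event.
cdp.handleMessage(JSON.stringify({ id: 1, result: { frameId: "F1" } }));
cdp.handleMessage(JSON.stringify({ method: "Page.loadEventFired", params: { timestamp: 1 } }));
navigation.then((result) => console.log(result));
```

Routing everything through one connection object means there is a single place where ids are allocated and replies are matched, which is the property the single-connection design relies on.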

Handler Architecture

Each Stagehand method is backed by a specialized handler:
| Handler | Purpose | Returns |
| --- | --- | --- |
| ActHandler | Performs actions (click, type, etc.) | ActResult with success status and executed actions |
| ExtractHandler | Extracts data from the page | Structured data matching your schema |
| ObserveHandler | Finds actionable elements | Array of Action objects |
| V3AgentHandler | Multi-step autonomous execution | AgentResult with full execution history |
Each handler:
  1. Accepts high-level instructions
  2. Captures DOM snapshots
  3. Queries the LLM
  4. Executes deterministic actions
  5. Reports metrics and results

Event Bus System

Stagehand uses an EventEmitter for internal communication:
v3.ts:155
public readonly bus: EventEmitter = new EventEmitter();
This enables:
  • Screenshot capture events during agent execution
  • Page lifecycle notifications
  • Error propagation across components
  • Plugin hooks (future feature)
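The pattern itself is plain Node.js `EventEmitter` usage. The sketch below shows it in isolation; the event names and payloads are hypothetical and are not the events Stagehand actually emits.

```typescript
import { EventEmitter } from "node:events";

// Event names and payloads here are hypothetical, not Stagehand's real events.
const sketchBus = new EventEmitter();
const seenEvents: string[] = [];

sketchBus.on("page:loaded", (url: string) => {
  seenEvents.push(`loaded ${url}`);
});
sketchBus.on("agent:screenshot", (byteLength: number) => {
  seenEvents.push(`screenshot ${byteLength} bytes`);
});
// Node treats "error" specially: emitting it with no listener attached throws.
sketchBus.on("error", (err: Error) => {
  seenEvents.push(`error ${err.message}`);
});

// Components publish without knowing who is listening.
sketchBus.emit("page:loaded", "https://example.com");
sketchBus.emit("agent:screenshot", 2048);
sketchBus.emit("error", new Error("target closed"));

console.log(seenEvents);
```

Listeners run synchronously in registration order when `emit` is called, so a slow listener delays the emitter. That is worth keeping in mind for anything attached to a shared bus.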

Metrics & Observability

Stagehand tracks detailed metrics for every LLM call:
v3.ts:241-267
public stagehandMetrics: StagehandMetrics = {
  actPromptTokens: 0,
  actCompletionTokens: 0,
  actReasoningTokens: 0,
  actCachedInputTokens: 0,
  actInferenceTimeMs: 0,
  extractPromptTokens: 0,
  extractCompletionTokens: 0,
  // ... more metrics
};
Access them at any time:
const metrics = await stagehand.metrics;
console.log(`Total tokens used: ${metrics.totalPromptTokens + metrics.totalCompletionTokens}`);
console.log(`Cache hits saved: ${metrics.totalCachedInputTokens} tokens`);
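The field names above suggest per-method counters that roll up into totals. A minimal sketch of that bookkeeping might look like this. `MetricsTracker` and `CallUsage` are hypothetical names, and the real `StagehandMetrics` object is a flat structure rather than a map.

```typescript
interface CallUsage {
  promptTokens: number;
  completionTokens: number;
  cachedInputTokens: number;
  inferenceTimeMs: number;
}

const emptyUsage = (): CallUsage => ({
  promptTokens: 0,
  completionTokens: 0,
  cachedInputTokens: 0,
  inferenceTimeMs: 0,
});

const addUsage = (a: CallUsage, b: CallUsage): CallUsage => ({
  promptTokens: a.promptTokens + b.promptTokens,
  completionTokens: a.completionTokens + b.completionTokens,
  cachedInputTokens: a.cachedInputTokens + b.cachedInputTokens,
  inferenceTimeMs: a.inferenceTimeMs + b.inferenceTimeMs,
});

// Hypothetical tracker: one running sum per method, totals derived on demand.
class MetricsTracker {
  private readonly perMethod = new Map<string, CallUsage>();

  record(method: string, usage: CallUsage): void {
    const current = this.perMethod.get(method) ?? emptyUsage();
    this.perMethod.set(method, addUsage(current, usage));
  }

  usageFor(method: string): CallUsage {
    return this.perMethod.get(method) ?? emptyUsage();
  }

  get totals(): CallUsage {
    let sum = emptyUsage();
    for (const usage of this.perMethod.values()) sum = addUsage(sum, usage);
    return sum;
  }
}

const tracker = new MetricsTracker();
tracker.record("act", { promptTokens: 120, completionTokens: 30, cachedInputTokens: 0, inferenceTimeMs: 800 });
tracker.record("act", { promptTokens: 120, completionTokens: 10, cachedInputTokens: 100, inferenceTimeMs: 300 });
tracker.record("extract", { promptTokens: 400, completionTokens: 60, cachedInputTokens: 0, inferenceTimeMs: 1200 });

const totals = tracker.totals;
console.log(`Total tokens used: ${totals.promptTokens + totals.completionTokens}`);
```

Deriving totals from the per-method sums, rather than storing them separately, avoids the two sets of counters drifting apart.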

Key Design Principles

  • The LLM identifies elements and plans actions, but all browser control uses deterministic CDP commands. This gives you the best of both worlds: adaptability from AI, reliability from code.
  • Successful actions are cached and replayed without LLM calls. The cache self-heals when pages change, providing speed and reliability.
  • Whether you’re running locally or on Browserbase, the API stays the same. The V3 class abstracts away all environment differences.
  • Every action is logged, every metric is tracked, and session recordings are available. You always know what Stagehand is doing.

Next Steps

Write Effective AI Rules

Learn how to guide the AI with clear instructions

Understand Browser Contexts

Master pages, frames, and context management

Leverage Caching

Speed up execution with self-healing cache

See Examples

Explore real-world usage patterns
