Stagehand uses advanced caching strategies to reduce latency and token costs. This includes prompt caching for repeated content and conversation history compression for long-running agents.

Overview

Caching strategies in Stagehand:
  • Prompt caching - Cache system prompts and static content
  • Image compression - Reduce token usage in conversation history
  • Conversation management - Maintain context while minimizing tokens
  • Provider-specific optimizations - Leverage native caching features

Prompt Caching

Anthropic Prompt Caching

Anthropic supports prompt caching via cache_control blocks. Stagehand applies this automatically to system prompts and accessibility trees. How it works:
// System prompt with caching
const messages = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // Cache this content
      },
    ],
  },
  // User messages...
];
Benefits:
  • System prompts are cached across requests
  • Reduces input token costs by ~90% for cached content
  • Cache persists for 5 minutes of inactivity
  • Particularly effective for accessibility trees
Accessibility Tree Caching
Location: various act/extract implementations
const ariaTree = await page.getAriaTree();

const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: `Accessibility tree:\n${ariaTree}`,
        cache_control: { type: "ephemeral" }, // Cache the tree
      },
      {
        type: "text",
        text: instruction,
      },
    ],
  },
];
Token Savings Example:
// First request: 5000 input tokens
// Subsequent requests with cache: 500 input tokens (90% reduction)
// Cache hit charges: ~10% of uncached cost

OpenAI Prompt Caching

OpenAI does not expose explicit cache-control markers; instead, repeated prompt prefixes are cached automatically on the API side. Stagehand keeps those prefixes stable (as sketched below) by:
  • Reusing system prompts verbatim across calls
  • Minimizing message history
  • Structuring requests so static content forms a consistent prefix
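A minimal sketch of that pattern, assuming the official openai Node SDK (the model name, prompt text, and runStep helper are placeholders, not Stagehand's actual request path):
import OpenAI from "openai";

const client = new OpenAI();

// Built once and sent verbatim on every request so the prompt prefix stays
// byte-identical, which is what automatic prefix caching keys on.
const SYSTEM_PROMPT = "You are a browser automation agent...";

async function runStep(instruction: string) {
  return client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: SYSTEM_PROMPT }, // stable prefix
      { role: "user", content: instruction }, // only the suffix varies
    ],
  });
}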

Google Prompt Caching

Google’s caching is handled automatically by the Gemini API. Stagehand optimizes for it (see the sketch after this list) by:
  • Structuring system instructions consistently
  • Reusing the conversation history format
  • Minimizing changes to content that is likely to be cached
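A minimal sketch with the @google/genai SDK (the model name, system instruction, and runStep helper are placeholders): the config object is built once and passed unchanged, so the static parts of each request stay identical across steps.
import { GoogleGenAI, type Content } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });

// Built once; reusing the same object keeps the system instruction and other
// static settings identical from request to request.
const generateContentConfig = {
  systemInstruction: "You are a browser automation agent...",
  temperature: 1,
  maxOutputTokens: 8192,
};

async function runStep(history: Content[]) {
  return ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: history, // only the conversation history changes
    config: generateContentConfig, // consistent config across steps
  });
}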

Image Compression

Anthropic Image Compression

Location: packages/core/lib/v3/agent/utils/imageCompression.ts
Strategy:
  • Keep first 2 images in conversation at full quality
  • Compress all subsequent images to 25% quality
  • Reduces token usage while maintaining context
Implementation:
import sharp from "sharp";

export async function compressConversationImages(
  items: ResponseInputItem[],
  keepFirstN = 2,
): Promise<void> {
  let imageCount = 0;

  for (const item of items) {
    if ("role" in item && item.role === "user") {
      const content = item.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "image") {
            imageCount++;
            if (imageCount > keepFirstN) {
              // Re-encode this image as a 25%-quality JPEG
              const base64Data = block.source.data;
              const buffer = Buffer.from(base64Data, "base64");
              const compressed = await sharp(buffer)
                .jpeg({ quality: 25 })
                .toBuffer();
              block.source.data = compressed.toString("base64");
            }
          }
        }
      }
    }
  }
}
Usage in CUA:
// In AnthropicCUAClient.ts
const nextInputItems: ResponseInputItem[] = [...inputItems];

// Compress images before adding new messages (await: re-encoding is async)
await compressConversationImages(nextInputItems);

nextInputItems.push(assistantMessage);
nextInputItems.push(userToolResultsMessage);
Token Savings:
// Full quality image: ~1500 tokens
// 25% quality image: ~400 tokens
// Savings: ~73% per compressed image

Google Image Compression

Location: packages/core/lib/v3/agent/utils/imageCompression.ts
Implementation:
import sharp from "sharp";
import type { Content } from "@google/genai";

export async function compressGoogleConversationImages(
  items: Content[],
  keepFirstN = 2,
): Promise<{ items: Content[]; compressed: boolean }> {
  let imageCount = 0;
  let compressed = false;

  for (const item of items) {
    if (item.role === "user" && item.parts) {
      for (const part of item.parts) {
        if (part.inlineData?.mimeType === "image/png") {
          imageCount++;
          if (imageCount > keepFirstN) {
            // Re-encode the PNG screenshot as a 25%-quality JPEG
            const buffer = Buffer.from(part.inlineData.data, "base64");
            const compressedBuffer = await sharp(buffer)
              .jpeg({ quality: 25 })
              .toBuffer();
            part.inlineData.data = compressedBuffer.toString("base64");
            part.inlineData.mimeType = "image/jpeg";
            compressed = true;
          }
        }
      }
    }
  }

  return { items, compressed };
}
Usage:
// In GoogleCUAClient.ts:executeStep()
const compressedResult = await compressGoogleConversationImages(
  this.history,
  2, // Keep first 2 images
);
const compressedHistory = compressedResult.items;

const response = await this.client.models.generateContent({
  model: this.modelName,
  contents: compressedHistory,
  config: this.generateContentConfig,
});

Conversation History Management

CUA Conversation History

All CUA clients maintain conversation history to preserve context.
Anthropic Pattern:
private async executeStep(
  inputItems: ResponseInputItem[],
  logger: (message: LogLine) => void,
): Promise<{ /* ... */ }> {
  // Get model response
  const result = await this.getAction(inputItems);
  
  // Build next input items
  const nextInputItems: ResponseInputItem[] = [...inputItems];
  
  // Compress images (await: re-encoding is async)
  await compressConversationImages(nextInputItems);
  
  // Add assistant message
  nextInputItems.push(assistantMessage);
  
  // Add tool results
  if (toolResults.length > 0) {
    nextInputItems.push(userToolResultsMessage);
  }
  
  return { nextInputItems, /* ... */ };
}
Google Pattern:
private history: Content[] = [];

async executeStep(logger: (message: LogLine) => void) {
  // Compress history before request
  const compressedResult = await compressGoogleConversationImages(this.history, 2);
  const compressedHistory = compressedResult.items;
  
  // Get response
  const response = await this.client.models.generateContent({
    contents: compressedHistory,
    // ...
  });
  
  // Add to history
  this.history.push(sanitizedContent);
  
  if (functionResponses.length > 0) {
    this.history.push({
      role: "user",
      parts: functionResponses,
    });
  }
}
OpenAI Pattern:
private reasoningItems: Map<string, ResponseItem> = new Map();

async executeStep(
  inputItems: ResponseInputItem[],
  previousResponseId: string | undefined,
) {
  // Use previous_response_id for history
  const requestParams = {
    model: this.modelName,
    input: inputItems,
    previous_response_id: previousResponseId,
  };
  
  const response = await this.client.responses.create(requestParams);
  
  // Track reasoning items
  for (const item of response.output) {
    if (item.type === "reasoning") {
      this.reasoningItems.set(item.id, item);
    }
  }
  
  return { responseId: response.id };
}

History Truncation Strategies

Keep recent messages:
function truncateHistory(
  history: ResponseInputItem[],
  maxMessages = 10,
): ResponseInputItem[] {
  // Always keep system message
  const systemMessages = history.filter((m) => m.role === "system");
  const otherMessages = history.filter((m) => m.role !== "system");
  
  // Keep last N messages
  const recentMessages = otherMessages.slice(-maxMessages);
  
  return [...systemMessages, ...recentMessages];
}
Token-based truncation:
// Assumes an estimateTokens(items) helper that approximates token counts
function truncateByTokens(
  history: ResponseInputItem[],
  maxTokens = 100000,
): ResponseInputItem[] {
  const systemMessages = history.filter((m) => m.role === "system");
  const otherMessages = history.filter((m) => m.role !== "system").reverse();
  
  let tokenCount = estimateTokens(systemMessages);
  const keptMessages: ResponseInputItem[] = [];
  
  for (const message of otherMessages) {
    const messageTokens = estimateTokens([message]);
    if (tokenCount + messageTokens > maxTokens) break;
    
    keptMessages.unshift(message);
    tokenCount += messageTokens;
  }
  
  return [...systemMessages, ...keptMessages];
}

Provider-Specific Optimizations

Anthropic Cache Control

// Mark content for caching
const messages = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: longSystemPrompt,
        cache_control: { type: "ephemeral" },
      },
    ],
  },
];

// First request: Full token count
// Subsequent requests: Cache hit (10% cost)

Google Content Reuse

// Structure content consistently for better caching
this.generateContentConfig = {
  temperature: 1,
  topP: 0.95,
  topK: 40,
  maxOutputTokens: 8192,
  tools: [{
    computerUse: { environment: this.environment },
  }],
};

// Reuse config across requests
const response = await this.client.models.generateContent({
  model: this.modelName,
  contents: compressedHistory,
  config: this.generateContentConfig, // Consistent config
});

OpenAI Response Chaining

// Use previous_response_id to chain requests
let previousResponseId: string | undefined;

for (let step = 0; step < maxSteps; step++) {
  const response = await this.client.responses.create({
    model: this.modelName,
    input: inputItems,
    previous_response_id: previousResponseId, // Link to previous
  });
  
  previousResponseId = response.id;
}

Performance Monitoring

Track Token Usage

let totalInputTokens = 0;
let totalOutputTokens = 0;
let totalCachedTokens = 0;

while (!completed && currentStep < maxSteps) {
  const result = await this.executeStep(inputItems, logger);
  
  totalInputTokens += result.usage.input_tokens;
  totalOutputTokens += result.usage.output_tokens;
  
  if (result.usage.cached_input_tokens) {
    totalCachedTokens += result.usage.cached_input_tokens;
  }
  
  currentStep++;
}

console.log("Token usage:", {
  input: totalInputTokens,
  output: totalOutputTokens,
  cached: totalCachedTokens,
  savings: `${((totalCachedTokens / totalInputTokens) * 100).toFixed(1)}%`,
});

Log Compression Results

// estimateSize: helper that approximates the serialized payload size in KB
const before = estimateSize(inputItems);
await compressConversationImages(inputItems);
const after = estimateSize(inputItems);

logger({
  category: "caching",
  message: `Compressed images: ${before}KB → ${after}KB (${((1 - after / before) * 100).toFixed(1)}% reduction)`,
  level: 2,
});

Best Practices

  1. Use prompt caching: Mark static content with cache_control
  2. Compress images: Keep first 2 at full quality, compress rest
  3. Truncate history: Don’t let conversation grow unbounded
  4. Monitor token usage: Track input/output/cached tokens
  5. Structure consistently: Consistent structure improves caching
  6. Batch operations: Fewer requests = better cache utilization
  7. Use appropriate models: Faster models for cached content

Cost Optimization

Example savings with caching:
// Without caching:
// 10 requests × 5000 input tokens = 50,000 tokens
// Cost: $0.15 (at $3/1M tokens)

// With prompt caching (4000 tokens cached):
// Request 1: 5000 input tokens = $0.015
// Requests 2-10: 1000 new + 400 cached = 1400 tokens each
// Cost: $0.015 + (9 × $0.0042) = $0.053
// Savings: 65%
With image compression:
// Full quality: 10 images × 1500 tokens = 15,000 tokens
// Compressed: 2 full + 8 compressed (400 tokens) = 6,200 tokens
// Savings: 59%
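The same arithmetic as a small, hypothetical helper (prices and token counts are illustrative and, like the comments above, it ignores any cache-write surcharge a provider may add):
// Effective cost of a run where part of each prompt is served from cache.
// cacheReadCostRatio is the fraction of the normal input price charged for
// cache hits (roughly 0.1 for Anthropic ephemeral cache reads).
function cachedRunCost(opts: {
  requests: number;
  freshTokensPerRequest: number;
  cachedTokens: number;
  pricePerMillionTokens: number;
  cacheReadCostRatio?: number;
}): number {
  const ratio = opts.cacheReadCostRatio ?? 0.1;
  const firstRequest = opts.freshTokensPerRequest + opts.cachedTokens;
  const laterRequests =
    (opts.requests - 1) *
    (opts.freshTokensPerRequest + opts.cachedTokens * ratio);
  return ((firstRequest + laterRequests) / 1_000_000) * opts.pricePerMillionTokens;
}

// 10 requests, 1000 fresh + 4000 cached tokens each, at $3 per 1M input tokens:
// ≈ $0.053, versus $0.15 for 10 uncached requests of 5000 tokens.
cachedRunCost({
  requests: 10,
  freshTokensPerRequest: 1000,
  cachedTokens: 4000,
  pricePerMillionTokens: 3,
});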

References

  • Image Compression: packages/core/lib/v3/agent/utils/imageCompression.ts
  • Anthropic CUA: packages/core/lib/v3/agent/AnthropicCUAClient.ts:351
  • Google CUA: packages/core/lib/v3/agent/GoogleCUAClient.ts:357
  • OpenAI CUA: packages/core/lib/v3/agent/OpenAICUAClient.ts:420
