Stagehand uses advanced caching strategies to reduce latency and token costs. This includes prompt caching for repeated content and conversation history compression for long-running agents.

Overview

Caching strategies in Stagehand:
  • Prompt caching - Cache system prompts and static content
  • Image compression - Reduce token usage in conversation history
  • Conversation management - Maintain context while minimizing tokens
  • Provider-specific optimizations - Leverage native caching features

Prompt Caching

Anthropic Prompt Caching

Anthropic supports prompt caching via cache_control blocks. Stagehand applies this automatically to system prompts and accessibility trees. How it works:
// System prompt with caching
const messages = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // Cache this content
      },
    ],
  },
  // User messages...
];
Benefits:
  • System prompts are cached across requests
  • Reduces input token costs by ~90% for cached content
  • Cache persists for 5 minutes of inactivity
  • Particularly effective for accessibility trees
Accessibility Tree Caching
Location: various act/extract implementations
const ariaTree = await page.getAriaTree();

const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: `Accessibility tree:\n${ariaTree}`,
        cache_control: { type: "ephemeral" }, // Cache the tree
      },
      {
        type: "text",
        text: instruction,
      },
    ],
  },
];
Token Savings Example:
// First request: 5000 input tokens
// Subsequent requests with cache: 500 input tokens (90% reduction)
// Cache hit charges: ~10% of uncached cost

OpenAI Prompt Caching

OpenAI does not expose explicit cache-control markers; instead, repeated prompt prefixes are cached automatically on the API side. Stagehand keeps those prefixes stable (as sketched below) by:
  • Reusing system prompts verbatim across calls
  • Minimizing message history
  • Structuring requests so static content forms a consistent prefix
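A minimal sketch of that pattern, assuming the official openai Node SDK (the model name, prompt text, and runStep helper are placeholders, not Stagehand's actual request path):
import OpenAI from "openai";

const client = new OpenAI();

// Built once and sent verbatim on every request so the prompt prefix stays
// byte-identical, which is what automatic prefix caching keys on.
const SYSTEM_PROMPT = "You are a browser automation agent...";

async function runStep(instruction: string) {
  return client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: SYSTEM_PROMPT }, // stable prefix
      { role: "user", content: instruction }, // only the suffix varies
    ],
  });
}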

Google Prompt Caching

Google’s caching is handled automatically by the Gemini API. Stagehand optimizes for it (see the sketch after this list) by:
  • Structuring system instructions consistently
  • Reusing the conversation history format
  • Minimizing changes to content that is likely to be cached
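A minimal sketch with the @google/genai SDK (the model name, system instruction, and runStep helper are placeholders): the config object is built once and passed unchanged, so the static parts of each request stay identical across steps.
import { GoogleGenAI, type Content } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });

// Built once; reusing the same object keeps the system instruction and other
// static settings identical from request to request.
const generateContentConfig = {
  systemInstruction: "You are a browser automation agent...",
  temperature: 1,
  maxOutputTokens: 8192,
};

async function runStep(history: Content[]) {
  return ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: history, // only the conversation history changes
    config: generateContentConfig, // consistent config across steps
  });
}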

Image Compression

Anthropic Image Compression

Location: packages/core/lib/v3/agent/utils/imageCompression.ts
Strategy:
  • Keep first 2 images in conversation at full quality
  • Compress all subsequent images to 25% quality
  • Reduces token usage while maintaining context
Implementation:
import sharp from "sharp";

export async function compressConversationImages(
  items: ResponseInputItem[],
  keepFirstN = 2,
): Promise<void> {
  let imageCount = 0;

  for (const item of items) {
    if ("role" in item && item.role === "user") {
      const content = item.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "image") {
            imageCount++;
            if (imageCount > keepFirstN) {
              // Re-encode this image as a 25%-quality JPEG
              const base64Data = block.source.data;
              const buffer = Buffer.from(base64Data, "base64");
              const compressed = await sharp(buffer)
                .jpeg({ quality: 25 })
                .toBuffer();
              block.source.data = compressed.toString("base64");
            }
          }
        }
      }
    }
  }
}
Usage in CUA:
// In AnthropicCUAClient.ts
const nextInputItems: ResponseInputItem[] = [...inputItems];

// Compress images before adding new messages (await: re-encoding is async)
await compressConversationImages(nextInputItems);

nextInputItems.push(assistantMessage);
nextInputItems.push(userToolResultsMessage);
Token Savings:
// Full quality image: ~1500 tokens
// 25% quality image: ~400 tokens
// Savings: ~73% per compressed image

Google Image Compression

Location: packages/core/lib/v3/agent/utils/imageCompression.ts
Implementation:
import sharp from "sharp";
import type { Content } from "@google/genai";

export async function compressGoogleConversationImages(
  items: Content[],
  keepFirstN = 2,
): Promise<{ items: Content[]; compressed: boolean }> {
  let imageCount = 0;
  let compressed = false;

  for (const item of items) {
    if (item.role === "user" && item.parts) {
      for (const part of item.parts) {
        if (part.inlineData?.mimeType === "image/png") {
          imageCount++;
          if (imageCount > keepFirstN) {
            // Re-encode the PNG screenshot as a 25%-quality JPEG
            const buffer = Buffer.from(part.inlineData.data, "base64");
            const compressedBuffer = await sharp(buffer)
              .jpeg({ quality: 25 })
              .toBuffer();
            part.inlineData.data = compressedBuffer.toString("base64");
            part.inlineData.mimeType = "image/jpeg";
            compressed = true;
          }
        }
      }
    }
  }

  return { items, compressed };
}
Usage:
// In GoogleCUAClient.ts:executeStep()
const compressedResult = await compressGoogleConversationImages(
  this.history,
  2, // Keep first 2 images
);
const compressedHistory = compressedResult.items;

const response = await this.client.models.generateContent({
  model: this.modelName,
  contents: compressedHistory,
  config: this.generateContentConfig,
});

Conversation History Management

CUA Conversation History

All CUA clients maintain conversation history to preserve context.
Anthropic Pattern:
private async executeStep(
  inputItems: ResponseInputItem[],
  logger: (message: LogLine) => void,
): Promise<{ /* ... */ }> {
  // Get model response
  const result = await this.getAction(inputItems);
  
  // Build next input items
  const nextInputItems: ResponseInputItem[] = [...inputItems];
  
  // Compress images (await: re-encoding is async)
  await compressConversationImages(nextInputItems);
  
  // Add assistant message
  nextInputItems.push(assistantMessage);
  
  // Add tool results
  if (toolResults.length > 0) {
    nextInputItems.push(userToolResultsMessage);
  }
  
  return { nextInputItems, /* ... */ };
}
Google Pattern:
private history: Content[] = [];

async executeStep(logger: (message: LogLine) => void) {
  // Compress history before request
  const compressedResult = await compressGoogleConversationImages(this.history, 2);
  const compressedHistory = compressedResult.items;
  
  // Get response
  const response = await this.client.models.generateContent({
    contents: compressedHistory,
    // ...
  });
  
  // Add to history
  this.history.push(sanitizedContent);
  
  if (functionResponses.length > 0) {
    this.history.push({
      role: "user",
      parts: functionResponses,
    });
  }
}
OpenAI Pattern:
private reasoningItems: Map<string, ResponseItem> = new Map();

async executeStep(
  inputItems: ResponseInputItem[],
  previousResponseId: string | undefined,
) {
  // Use previous_response_id for history
  const requestParams = {
    model: this.modelName,
    input: inputItems,
    previous_response_id: previousResponseId,
  };
  
  const response = await this.client.responses.create(requestParams);
  
  // Track reasoning items
  for (const item of response.output) {
    if (item.type === "reasoning") {
      this.reasoningItems.set(item.id, item);
    }
  }
  
  return { responseId: response.id };
}

History Truncation Strategies

Keep recent messages:
function truncateHistory(
  history: ResponseInputItem[],
  maxMessages = 10,
): ResponseInputItem[] {
  // Always keep system message
  const systemMessages = history.filter((m) => m.role === "system");
  const otherMessages = history.filter((m) => m.role !== "system");
  
  // Keep last N messages
  const recentMessages = otherMessages.slice(-maxMessages);
  
  return [...systemMessages, ...recentMessages];
}
Token-based truncation:
// Assumes an estimateTokens(items) helper that approximates token counts
function truncateByTokens(
  history: ResponseInputItem[],
  maxTokens = 100000,
): ResponseInputItem[] {
  const systemMessages = history.filter((m) => m.role === "system");
  const otherMessages = history.filter((m) => m.role !== "system").reverse();
  
  let tokenCount = estimateTokens(systemMessages);
  const keptMessages: ResponseInputItem[] = [];
  
  for (const message of otherMessages) {
    const messageTokens = estimateTokens([message]);
    if (tokenCount + messageTokens > maxTokens) break;
    
    keptMessages.unshift(message);
    tokenCount += messageTokens;
  }
  
  return [...systemMessages, ...keptMessages];
}

Provider-Specific Optimizations

Anthropic Cache Control

// Mark content for caching
const messages = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: longSystemPrompt,
        cache_control: { type: "ephemeral" },
      },
    ],
  },
];

// First request: Full token count
// Subsequent requests: Cache hit (10% cost)

Google Content Reuse

// Structure content consistently for better caching
this.generateContentConfig = {
  temperature: 1,
  topP: 0.95,
  topK: 40,
  maxOutputTokens: 8192,
  tools: [{
    computerUse: { environment: this.environment },
  }],
};

// Reuse config across requests
const response = await this.client.models.generateContent({
  model: this.modelName,
  contents: compressedHistory,
  config: this.generateContentConfig, // Consistent config
});

OpenAI Response Chaining

// Use previous_response_id to chain requests
let previousResponseId: string | undefined;

for (let step = 0; step < maxSteps; step++) {
  const response = await this.client.responses.create({
    model: this.modelName,
    input: inputItems,
    previous_response_id: previousResponseId, // Link to previous
  });
  
  previousResponseId = response.id;
}

Performance Monitoring

Track Token Usage

let totalInputTokens = 0;
let totalOutputTokens = 0;
let totalCachedTokens = 0;

while (!completed && currentStep < maxSteps) {
  const result = await this.executeStep(inputItems, logger);
  
  totalInputTokens += result.usage.input_tokens;
  totalOutputTokens += result.usage.output_tokens;
  
  if (result.usage.cached_input_tokens) {
    totalCachedTokens += result.usage.cached_input_tokens;
  }
  
  currentStep++;
}

console.log("Token usage:", {
  input: totalInputTokens,
  output: totalOutputTokens,
  cached: totalCachedTokens,
  savings: `${((totalCachedTokens / totalInputTokens) * 100).toFixed(1)}%`,
});

Log Compression Results

// estimateSize: helper that approximates the serialized payload size in KB
const before = estimateSize(inputItems);
await compressConversationImages(inputItems);
const after = estimateSize(inputItems);

logger({
  category: "caching",
  message: `Compressed images: ${before}KB → ${after}KB (${((1 - after / before) * 100).toFixed(1)}% reduction)`,
  level: 2,
});

Best Practices

  1. Use prompt caching: Mark static content with cache_control
  2. Compress images: Keep first 2 at full quality, compress rest
  3. Truncate history: Don’t let conversation grow unbounded
  4. Monitor token usage: Track input/output/cached tokens
  5. Structure consistently: Consistent structure improves caching
  6. Batch operations: Fewer requests = better cache utilization
  7. Use appropriate models: Faster models for cached content

Cost Optimization

Example savings with caching:
// Without caching:
// 10 requests × 5000 input tokens = 50,000 tokens
// Cost: $0.15 (at $3/1M tokens)

// With prompt caching (4000 tokens cached):
// Request 1: 5000 input tokens = $0.015
// Requests 2-10: 1000 new + 400 cached = 1400 tokens each
// Cost: $0.015 + (9 × $0.0042) = $0.053
// Savings: 65%
With image compression:
// Full quality: 10 images × 1500 tokens = 15,000 tokens
// Compressed: 2 full + 8 compressed (400 tokens) = 6,200 tokens
// Savings: 59%
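The same arithmetic as a small, hypothetical helper (prices and token counts are illustrative and, like the comments above, it ignores any cache-write surcharge a provider may add):
// Effective cost of a run where part of each prompt is served from cache.
// cacheReadCostRatio is the fraction of the normal input price charged for
// cache hits (roughly 0.1 for Anthropic ephemeral cache reads).
function cachedRunCost(opts: {
  requests: number;
  freshTokensPerRequest: number;
  cachedTokens: number;
  pricePerMillionTokens: number;
  cacheReadCostRatio?: number;
}): number {
  const ratio = opts.cacheReadCostRatio ?? 0.1;
  const firstRequest = opts.freshTokensPerRequest + opts.cachedTokens;
  const laterRequests =
    (opts.requests - 1) *
    (opts.freshTokensPerRequest + opts.cachedTokens * ratio);
  return ((firstRequest + laterRequests) / 1_000_000) * opts.pricePerMillionTokens;
}

// 10 requests, 1000 fresh + 4000 cached tokens each, at $3 per 1M input tokens:
// ≈ $0.053, versus $0.15 for 10 uncached requests of 5000 tokens.
cachedRunCost({
  requests: 10,
  freshTokensPerRequest: 1000,
  cachedTokens: 4000,
  pricePerMillionTokens: 3,
});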

References

  • Image Compression: packages/core/lib/v3/agent/utils/imageCompression.ts
  • Anthropic CUA: packages/core/lib/v3/agent/AnthropicCUAClient.ts:351
  • Google CUA: packages/core/lib/v3/agent/GoogleCUAClient.ts:357
  • OpenAI CUA: packages/core/lib/v3/agent/OpenAICUAClient.ts:420
