
Overview

The extract() method allows you to extract structured data from web pages using natural language instructions and Zod schemas. It leverages AI to understand page content and return data in the exact format you need.

Method Signature

extract<T extends StagehandZodSchema>(
  instruction?: string,
  schema?: T,
  options?: ExtractOptions
): Promise<InferStagehandSchema<T>>

Parameters

instruction
string
Natural language description of what data to extract (e.g., “Extract all product listings with their prices”). Optional: when both instruction and schema are omitted, extract() returns the page text.
schema
StagehandZodSchema
Zod schema defining the structure of data to extract. Supports z.object(), z.array(), and nested schemas.
import { z } from "zod";

const schema = z.object({
  title: z.string().describe("Page title"),
  price: z.string().describe("Product price"),
});
options
ExtractOptions
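Optional configuration for extraction. The examples on this page use three options: selector (a CSS or XPath selector that limits extraction to part of the page), timeout (in milliseconds), and page (the page to extract from).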

Return Value

Returns a Promise. The resolved value depends on whether a schema is provided:
  • With schema: Returns typed data matching the schema
  • Without schema: Returns { extraction: string } when an instruction is given, or { pageText: string } when neither instruction nor schema is given
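Code that accepts either of the two no-schema results has to check which field is present. The following is a minimal sketch of one way to do that. The NoSchemaResult type and the textOf helper are hypothetical names introduced for illustration, not part of Stagehand; the two shapes are taken from the list above.

```typescript
// Assumed shapes, copied from the two no-schema cases described above.
type NoSchemaResult = { extraction: string } | { pageText: string };

// Hypothetical helper: return whichever text field the result carries.
function textOf(result: NoSchemaResult): string {
  // The `in` check narrows the union to the matching variant.
  return "extraction" in result ? result.extraction : result.pageText;
}

console.log(textOf({ extraction: "Pricing page" }));
console.log(textOf({ pageText: "Full page text" }));
```

When the call site is known, destructuring the expected field directly (as in the No-Schema Extraction example) is simpler.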

Usage Examples

Basic Extraction

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({
  env: "BROWSERBASE",
  apiKey: process.env.BROWSERBASE_API_KEY,
});

await stagehand.init();
const page = stagehand.context.pages()[0];

await page.goto("https://news.ycombinator.com");

const articles = await stagehand.extract(
  "Extract the top 5 article titles",
  z.object({
    titles: z.array(z.string()),
  })
);

console.log(articles.titles);

Extracting Lists

await page.goto("https://www.apartments.com/san-francisco-ca/");

const listings = await stagehand.extract(
  "Extract all apartment listings with prices and addresses",
  z.object({
    listings: z.array(
      z.object({
        price: z.string().describe("The price of the listing"),
        address: z.string().describe("The address of the listing"),
      })
    ),
  })
);

console.log(`Found ${listings.listings.length} apartments`);
listings.listings.forEach((listing) => {
  console.log(`${listing.address}: ${listing.price}`);
});

Nested Data Structures

const productData = await stagehand.extract(
  "Extract product information",
  z.object({
    product: z.object({
      name: z.string(),
      price: z.string(),
      features: z.array(z.string()),
      reviews: z.object({
        rating: z.number(),
        count: z.number(),
        topReview: z.string(),
      }),
    }),
  })
);

console.log(productData.product.name);
console.log(`Rating: ${productData.product.reviews.rating}/5`);

Extracting URLs

// Fields declared with z.string().url() are filled with the link's URL (its href), not its visible text
const links = await stagehand.extract(
  "Get all navigation links",
  z.object({
    links: z.array(
      z.object({
        text: z.string(),
        url: z.string().url(), // Automatically extracts href attribute
      })
    ),
  })
);

for (const link of links.links) {
  console.log(`${link.text}: ${link.url}`);
}
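Once links are extracted, a common follow-up is to keep only those that stay on the current site. The sketch below does this with the standard URL class. The NavLink interface mirrors the schema above, and sameOriginLinks is a hypothetical helper, not a Stagehand function.

```typescript
// Mirrors the shape of one entry in the `links` array above.
interface NavLink {
  text: string;
  url: string;
}

// Hypothetical helper: keep links whose origin matches the page's origin.
function sameOriginLinks(links: NavLink[], pageUrl: string): NavLink[] {
  const origin = new URL(pageUrl).origin;
  return links.filter((link) => {
    try {
      return new URL(link.url).origin === origin;
    } catch {
      // new URL() throws on strings that are not absolute URLs; drop those.
      return false;
    }
  });
}

const internal = sameOriginLinks(
  [
    { text: "Docs", url: "https://example.com/docs" },
    { text: "Blog", url: "https://blog.example.org/" },
  ],
  "https://example.com/"
);
console.log(internal.map((link) => link.text));
```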

Focused Extraction

// Extract from a specific section of the page
const sidebarData = await stagehand.extract(
  "Extract trending topics",
  z.object({
    topics: z.array(z.string()),
  }),
  {
    selector: "aside.sidebar", // CSS selector
  }
);

// Or use XPath (schema here is any Zod schema, such as the one shown under Parameters)
const contentData = await stagehand.extract(
  "Extract main content",
  schema,
  {
    selector: "xpath=//main[@id='content']",
  }
);

No-Schema Extraction

// Without instruction or schema - returns page text
const { pageText } = await stagehand.extract();
console.log(pageText);

// With instruction only - returns free-form extraction
const { extraction } = await stagehand.extract(
  "What is the main topic of this page?"
);
console.log(extraction);

Multi-Page Extraction

const page1 = stagehand.context.pages()[0];
const page2 = await stagehand.context.newPage();

await page1.goto("https://example.com/page1");
await page2.goto("https://example.com/page2");

const data1 = await stagehand.extract(
  "Extract title",
  z.object({ title: z.string() }),
  { page: page1 }
);

const data2 = await stagehand.extract(
  "Extract title",
  z.object({ title: z.string() }),
  { page: page2 }
);

Using Descriptions

// Add .describe() to help the AI understand what to extract
const userData = await stagehand.extract(
  "Extract user profile information",
  z.object({
    username: z.string().describe("The user's display name"),
    email: z.string().describe("The user's email address"),
    joinDate: z.string().describe("Date the user joined, in MM/DD/YYYY format"),
    isVerified: z.boolean().describe("Whether the user's account is verified"),
  })
);
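A description such as "in MM/DD/YYYY format" guides the model but does not guarantee the format, so it can be worth checking the string before using it. The following is a minimal sketch; parseJoinDate is a hypothetical helper, and it assumes the date should be interpreted as UTC.

```typescript
// Hypothetical helper: parse an MM/DD/YYYY string, returning null if it is malformed.
function parseJoinDate(value: string): Date | null {
  const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(value.trim());
  if (!match) return null;

  const month = Number(match[1]);
  const day = Number(match[2]);
  const year = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));

  // Reject rollovers such as 02/30/2024, which Date would turn into March 1.
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return valid ? date : null;
}

console.log(parseJoinDate("03/15/2021")?.toISOString());
console.log(parseJoinDate("sometime in March"));
```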

Handling Missing Data

// Use .optional() for fields that might not exist
const result = await stagehand.extract(
  "Extract article metadata",
  z.object({
    title: z.string(),
    author: z.string().optional(),
    publishDate: z.string().optional(),
    readTime: z.string().optional(),
  })
);

if (result.author) {
  console.log(`By ${result.author}`);
}
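Optional fields come back as undefined when the page does not contain them, so downstream code needs a fallback for each one. The sketch below shows one way to build a display string from such a result. The ArticleMeta interface mirrors the schema above, and formatByline and its "Unknown author" default are hypothetical choices for illustration.

```typescript
// Mirrors the schema above; optional fields may be undefined.
interface ArticleMeta {
  title: string;
  author?: string;
  publishDate?: string;
  readTime?: string;
}

// Hypothetical helper: build a one-line summary, skipping missing fields.
function formatByline(meta: ArticleMeta): string {
  const author = meta.author ?? "Unknown author";
  const parts = [author, meta.publishDate, meta.readTime].filter(
    (part): part is string => part !== undefined
  );
  return `${meta.title} (${parts.join(", ")})`;
}

console.log(formatByline({ title: "Release notes", author: "Ada", readTime: "5 min" }));
console.log(formatByline({ title: "Release notes" }));
```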

With Timeout

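// ExtractTimeoutError is assumed to be exported by the Stagehand package
try {
  const data = await stagehand.extract(
    "Extract complex data",
    schema,
    { timeout: 30000 } // 30 seconds (value is in milliseconds)
  );
  console.log(data);
} catch (error) {
  if (error instanceof ExtractTimeoutError) {
    console.error("Extraction timed out");
  } else {
    throw error; // don't swallow unrelated failures
  }
}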

Supported Schema Types

Stagehand’s extract() supports most Zod schema types:
  • Primitives: z.string(), z.number(), z.boolean()
  • Objects: z.object({ ... })
  • Arrays: z.array(...)
  • Optionals: .optional()
  • Nested structures: Objects within objects, arrays of objects
  • URLs: z.string().url() - automatically extracts href attributes
  • Descriptions: .describe("...") - helps guide extraction

How It Works

  1. Snapshot: Captures an accessibility tree of the page
  2. LLM Processing: Sends the instruction and schema to the AI model
  3. Extraction: AI identifies and extracts matching data
  4. Validation: Data is validated against your Zod schema
  5. Return: Typed data matching your schema structure
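The validation step (step 4) can be pictured with plain Zod. This is a sketch of the general technique, not Stagehand's internal code: validateExtraction and productSchema are hypothetical names, and the raw object stands in for whatever the model returned.

```typescript
import { z } from "zod";

// Plays the role of the schema passed to extract().
const productSchema = z.object({
  title: z.string(),
  price: z.string(),
});

type Product = z.infer<typeof productSchema>;

// Hypothetical helper: check raw model output against the schema.
function validateExtraction(raw: unknown): Product {
  const parsed = productSchema.safeParse(raw);
  if (!parsed.success) {
    // Collect each failing path and message into one readable error.
    const details = parsed.error.issues
      .map((issue) => `${issue.path.join(".")}: ${issue.message}`)
      .join("; ");
    throw new Error(`Extraction did not match schema (${details})`);
  }
  return parsed.data;
}

const ok = validateExtraction({ title: "Widget", price: "$9.99" });
console.log(ok.title);
```

Using safeParse rather than parse keeps the failure path explicit, which makes it easier to report which field was missing or mistyped.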

Performance Tips

  1. Use focused selectors - Extract from specific page sections
    await stagehand.extract(instruction, schema, {
      selector: ".product-details"
    });
    
  2. Be specific with descriptions - Help the AI understand context
    z.string().describe("The product price in USD format")
    
  3. Use appropriate schemas - Don’t over-complicate structure
    // Good - simple and clear
    z.object({ price: z.string() })
    
    // Overkill - unnecessary complexity
    z.object({ 
      price: z.object({ 
        amount: z.string(), 
        currency: z.string() 
      })
    })
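Keeping price as a plain string in the schema does not rule out numeric work later: the string can be converted after extraction. The following is a minimal sketch with a hypothetical parsePrice helper. It assumes US-style formatting (comma as thousands separator, dot as decimal point) and ignores the currency.

```typescript
// Hypothetical helper: turn an extracted price string into a number, or null if none is found.
function parsePrice(text: string): number | null {
  // Keep digits and the decimal point; drop currency symbols, commas, and suffixes.
  const cleaned = text.replace(/[^0-9.]/g, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice("$1,299.00"));
console.log(parsePrice("Call for pricing"));
```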
    

Error Handling

try {
  const data = await stagehand.extract(instruction, schema);
  console.log(data);
} catch (error) {
  if (error instanceof ExtractTimeoutError) {
    console.error("Extraction timed out");
  } else if (error instanceof StagehandInvalidArgumentError) {
    console.error("Invalid schema or instruction");
  } else {
    console.error("Extraction failed:", error);
  }
}

Best Practices

  1. Clear instructions - Be explicit about what to extract
  2. Use descriptions - Add .describe() to schema fields
  3. Handle optionals - Use .optional() for fields that may not exist
  4. Focus extraction - Use selector option for large pages
  5. Type safety - Let TypeScript infer types from your schema
// TypeScript automatically knows the structure
const result = await stagehand.extract(
  "Extract data",
  z.object({
    title: z.string(),
    count: z.number(),
  })
);

// result.title is string
// result.count is number

Related Methods

  • act() - Perform actions on the page
  • observe() - Preview actions before executing
  • agent() - Autonomous multi-step automation
