
Overview

SmartScraperGraph is a scraping pipeline that automates information extraction from web pages, using a large language model to interpret the page content and answer a natural language prompt. It’s the most commonly used graph for web scraping tasks.

Class Signature

class SmartScraperGraph(AbstractGraph):
    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None,
    )

Constructor Parameters

prompt
str
required
The natural language prompt describing what information to extract from the source.
source
str
required
The source to scrape. Can be:
  • A URL starting with http:// or https://
  • A local file or directory path for offline HTML
config
dict
required
Configuration parameters for the graph. Must include:
  • llm: LLM configuration (e.g., {"model": "openai/gpt-4o"})
Optional parameters:
  • verbose (bool): Enable detailed logging
  • headless (bool): Run browser in headless mode (default: True)
  • html_mode (bool): Process raw HTML without parsing (default: False)
  • reasoning (bool): Enable chain-of-thought reasoning (default: False)
  • reattempt (bool): Retry if initial extraction fails (default: False)
  • additional_info (str): Extra context for the LLM
  • cut (bool): Trim long documents (default: True)
  • force (bool): Force re-fetch even if cached
  • loader_kwargs (dict): Additional parameters for page loading
  • browser_base (dict): BrowserBase configuration
  • scrape_do (dict): ScrapeDo configuration
  • storage_state (str): Path to browser state file
schema
Type[BaseModel]
default: None
Optional Pydantic model defining the expected output structure. Ensures type-safe extraction.
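The optional parameters above can be combined freely in a single config dictionary. A minimal sketch (the API key is a placeholder, and the chosen values are illustrative):

```python
# Sketch of a config combining several of the optional parameters
# documented above. "your-api-key" is a placeholder, not a credential.
graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key",
    },
    "verbose": True,    # detailed logging while developing
    "headless": True,   # run the browser without a window (the default)
    "cut": True,        # trim long documents to save tokens (the default)
    "additional_info": "Prefer data found in tables over prose.",
}
```

Unrecognized keys are simply part of the dictionary, so typos in option names fail silently; double-check spelling against the list above.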

Attributes

prompt
str
The user’s extraction prompt.
source
str
The source URL or local directory path.
config
dict
Configuration dictionary for the graph.
schema
Type[BaseModel]
Optional output schema for structured data extraction.
llm_model
object
The configured language model instance.
verbose
bool
Flag indicating whether verbose logging is enabled.
headless
bool
Flag indicating whether to run browser in headless mode.
input_key
str
Either “url” or “local_dir” based on the source type.

Methods

run()

Executes the scraping process and returns the answer to the prompt.
def run(self) -> str
Returns
str
The extracted information as a string, or “No answer found.” if extraction fails.

Basic Usage

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key"
    }
}

smart_scraper = SmartScraperGraph(
    prompt="List me all the attractions in Chioggia.",
    source="https://en.wikipedia.org/wiki/Chioggia",
    config=graph_config
)

result = smart_scraper.run()
print(result)

Structured Output with Schema

from pydantic import BaseModel, Field
from typing import List

class Attraction(BaseModel):
    name: str = Field(description="Name of the attraction")
    description: str = Field(description="Brief description")
    category: str = Field(description="Type of attraction")

class Attractions(BaseModel):
    attractions: List[Attraction]

smart_scraper = SmartScraperGraph(
    prompt="List all the attractions with their descriptions.",
    source="https://en.wikipedia.org/wiki/Chioggia",
    config=graph_config,
    schema=Attractions
)

result = smart_scraper.run()
# Result is automatically validated against the schema
print(result)

Advanced Configuration

HTML Mode

Process raw HTML without parsing for maximum speed:
config = {
    "llm": {"model": "openai/gpt-4o"},
    "html_mode": True  # Skip parsing step
}

smart_scraper = SmartScraperGraph(
    prompt="Extract product prices",
    source="https://example.com/products",
    config=config
)

Reasoning Mode

Enable chain-of-thought reasoning for complex extractions:
config = {
    "llm": {"model": "openai/gpt-4o"},
    "reasoning": True,  # Enable step-by-step reasoning
    "additional_info": "Focus on numerical data and statistics"
}

smart_scraper = SmartScraperGraph(
    prompt="Analyze the company's financial performance",
    source="https://example.com/annual-report",
    config=config
)

Reattempt Mode

Automatically retry if initial extraction fails:
config = {
    "llm": {"model": "openai/gpt-4o"},
    "reattempt": True  # Retry on empty or "NA" results
}

smart_scraper = SmartScraperGraph(
    prompt="Find the CEO's name",
    source="https://example.com/about",
    config=config
)

Local HTML Files

smart_scraper = SmartScraperGraph(
    prompt="Extract all contact information",
    source="/path/to/local/page.html",
    config=graph_config
)

result = smart_scraper.run()

Browser Configuration

Using BrowserBase

config = {
    "llm": {"model": "openai/gpt-4o"},
    "browser_base": {
        "api_key": "your-browserbase-key",
        "project_id": "your-project-id"
    }
}

Using ScrapeDo

config = {
    "llm": {"model": "openai/gpt-4o"},
    "scrape_do": {
        "api_key": "your-scrapedo-key"
    }
}

Storage State (Cookies/Auth)

config = {
    "llm": {"model": "openai/gpt-4o"},
    "storage_state": "./auth_state.json",  # Browser state with cookies
    "headless": False
}

Accessing Graph State

result = smart_scraper.run()

# Access final state
final_state = smart_scraper.get_state()
print(final_state["answer"])       # The extracted answer
print(final_state["parsed_doc"])  # Parsed document
print(final_state["doc"])         # Raw document

# Access execution info
exec_info = smart_scraper.get_execution_info()
for node_info in exec_info:
    print(f"{node_info['node_name']}: {node_info['exec_time']}s")
    print(f"Tokens: {node_info['total_tokens']}")
    print(f"Cost: ${node_info['total_cost_USD']}")

Graph Workflow Variations

The graph automatically adapts its workflow based on configuration.

Standard workflow (html_mode=False, reasoning=False, reattempt=False):
FetchNode → ParseNode → GenerateAnswerNode
HTML mode (html_mode=True):
FetchNode → GenerateAnswerNode
With reasoning (reasoning=True):
FetchNode → ParseNode → ReasoningNode → GenerateAnswerNode
With reattempt (reattempt=True):
FetchNode → ParseNode → GenerateAnswerNode → ConditionalNode → RegenNode

Error Handling

try:
    result = smart_scraper.run()
    if result == "No answer found.":
        print("Extraction failed")
    else:
        print(f"Success: {result}")
except Exception as e:
    print(f"Error during scraping: {e}")

Performance Tips

  1. Use HTML mode for simple extractions to skip parsing overhead
  2. Enable cut to trim long documents and reduce token usage
  3. Set appropriate chunk_size based on your LLM’s context window
  4. Use caching with cache_path to avoid re-fetching pages
  5. Enable verbose mode during development for debugging
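Several of these tips translate directly into config entries. A sketch, assuming cut and cache_path are accepted at the top level of the config as the tips suggest:

```python
# Performance-oriented config sketch; the cache_path placement is an
# assumption based on the tips above.
config = {
    "llm": {"model": "openai/gpt-4o"},
    "cut": True,              # tip 2: trim long documents
    "cache_path": "./cache",  # tip 4: reuse previously fetched pages
    "verbose": True,          # tip 5: debug logging during development
}
```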
