The SmartScraperGraph is the simplest and most powerful way to extract data from a single webpage using natural language prompts.

Overview

This example demonstrates how to:
  • Configure a basic scraping graph
  • Use natural language to describe what you want to extract
  • Process and display the results
  • Monitor execution details

Complete Example

Here’s a working example that extracts an article from Wired.com:
import json
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance and run it
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# Get graph execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

Step-by-Step Breakdown

1. Import dependencies

import json
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()
Import the required modules and load environment variables from your .env file.
2. Configure the graph

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
Define your configuration with:
  • llm: The language model to use (OpenAI GPT-4o-mini in this case)
  • verbose: Enable detailed logging
  • headless: Set to False to see the browser in action
3. Create and run the graph

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
Create a SmartScraperGraph instance with:
  • prompt: Natural language description of what to extract
  • source: URL of the webpage to scrape
  • config: Your configuration dictionary
4. Process the results

print(json.dumps(result, indent=4))

# Get execution details
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
Display the extracted data and execution information for debugging.

Configuration Options

For production runs, the same configuration works with headless mode enabled, so the browser runs in the background:

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,
}
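
Other providers plug into the same structure. As a sketch, a local model served through Ollama might be configured like this (the model name and base URL below are illustrative; check your provider's documentation for exact values):

```python
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",  # local Ollama server
    },
    "verbose": True,
    "headless": True,
}
```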

Expected Output

The script returns structured JSON whose shape follows your prompt. For the example above, the output looks something like:
{
    "title": "The Latest in AI: What You Need to Know",
    "author": "John Doe",
    "date": "2024-03-15",
    "content": "Artificial intelligence continues to evolve...",
    "url": "https://www.wired.com/story/ai-latest-news"
}
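
Since run() returns a plain Python dict, you can work with the fields directly. The keys below mirror the sample output above and will vary with your prompt, so use .get() to tolerate missing keys:

```python
import json

# Sample result, as returned by smart_scraper_graph.run()
result = {
    "title": "The Latest in AI: What You Need to Know",
    "author": "John Doe",
    "date": "2024-03-15",
    "content": "Artificial intelligence continues to evolve...",
    "url": "https://www.wired.com/story/ai-latest-news",
}

# Access individual fields; .get() returns a default instead of raising
print(result.get("title"))
print(result.get("author", "unknown"))

# Persist the result for later processing
with open("article.json", "w") as f:
    json.dump(result, f, indent=4)
```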

Common Use Cases

News Articles

Extract headlines, authors, dates, and content from news websites

Product Information

Scrape product names, prices, descriptions, and reviews

Contact Details

Extract emails, phone numbers, and addresses from business websites

Event Data

Gather event names, dates, locations, and descriptions

Tips for Better Results

Be specific in your prompts: Instead of “get data”, use “Extract the article title, author name, publication date, and first paragraph”.
Use headless mode for production: Set "headless": True to run the browser in the background for better performance.
Handle errors gracefully: Wrap your scraping code in try-except blocks to handle network issues and parsing errors.
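
The retry advice above can be sketched with a small helper. The run_with_retry function here is hypothetical (not part of the scrapegraphai API); it wraps any zero-argument callable, such as smart_scraper_graph.run, and retries on failure:

```python
import time

def run_with_retry(run, retries=3, delay=2.0):
    """Call `run`, retrying up to `retries` times on any exception,
    sleeping `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return run()
        except Exception as exc:
            if attempt == retries:
                raise  # out of retries: re-raise the last error
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Usage with a graph instance:
# result = run_with_retry(smart_scraper_graph.run)
```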

Monitoring Execution

The get_execution_info() method provides valuable insights:
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
This shows:
  • Execution time for each node
  • Token usage and costs
  • Errors or warnings
  • Graph traversal path

Next Steps

Multi-Page Scraping

Learn to scrape multiple URLs at once

Custom Schemas

Define structured output with Pydantic
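
As a preview, output schemas are plain Pydantic models. The Article model and its fields below are illustrative; assuming your scrapegraphai version accepts a schema parameter, you pass the model when constructing the graph:

```python
from pydantic import BaseModel

class Article(BaseModel):
    """Illustrative schema for one extracted article."""
    title: str
    author: str
    date: str
    content: str

# Pass the model so the output conforms to the schema:
# smart_scraper_graph = SmartScraperGraph(
#     prompt="Extract the first article",
#     source="https://www.wired.com",
#     config=graph_config,
#     schema=Article,
# )
```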

Troubleshooting

Issue: Browser doesn’t open
  • Make sure Playwright is installed: playwright install
  • Check if headless is set to False
Issue: API rate limits
  • Reduce the number of requests
  • Add delays between requests
  • Use a different model or provider
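
The "add delays" suggestion can be sketched as a throttled loop. The scrape callable here stands in for whatever runs one graph per URL (hypothetical helper, not a scrapegraphai API):

```python
import time

def scrape_all(urls, scrape, delay=2.0):
    """Scrape each URL in turn, pausing `delay` seconds between
    requests to stay under provider rate limits. `scrape` is any
    callable taking a URL and returning a result dict."""
    results = []
    for i, url in enumerate(urls):
        results.append(scrape(url))
        if i < len(urls) - 1:
            time.sleep(delay)  # pause before the next request
    return results

# Usage: one SmartScraperGraph per URL, built inside `scrape`:
# results = scrape_all(urls, lambda u: SmartScraperGraph(
#     prompt="Extract the first article", source=u, config=graph_config
# ).run())
```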
Issue: Extraction is incomplete
  • Make your prompt more specific
  • Check if the page requires JavaScript rendering
  • Verify the page structure hasn’t changed
