
Overview

The Web Scraping AI Agent makes web scraping as simple as describing what you want to extract. Using ScrapeGraph AI technology, it converts natural language prompts into intelligent web scraping workflows that extract structured data from any website—no coding required.
FREE Tutorial Available: Follow the complete step-by-step tutorial to learn how to build this from scratch with detailed explanations and best practices.

Two Implementations

This project includes two versions optimized for different use cases:

Local Library

Files: ai_scrapper.py, local_ai_scrapper.py
Uses the open-source ScrapeGraph AI library running locally.
✅ Free to use (no API costs)
✅ Full control over execution
✅ Privacy-friendly (all data stays local)
❌ Requires local installation
❌ Limited by your hardware
❌ Need to manage updates

Cloud SDK

Folder: scrapegraph_ai_sdk/
Uses the managed ScrapeGraph AI API with advanced features.
✅ No setup required (just an API key)
✅ Scalable and fast
✅ Advanced features (SmartCrawler, SearchScraper)
✅ Always up-to-date
❌ Pay-per-use (credit-based)
❌ Requires an internet connection

Features

Local Library Version

Smart Scraping

  • Natural language extraction prompts
  • GPT-4o or local LLM support
  • Automatic HTML parsing
  • Structured data output

Flexible Models

  • OpenAI GPT-4o for best quality
  • GPT-5 support
  • Local models via Ollama (Llama, Mistral, etc.)
  • No vendor lock-in
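
For the Ollama path, only the graph configuration changes. A minimal sketch, where the model names and the base_url are assumptions; use whatever models you have pulled locally:

```python
# Hedged sketch: a SmartScraperGraph config pointing at a local Ollama
# server instead of OpenAI. Model names and the endpoint are examples.
graph_config = {
    "llm": {
        "model": "ollama/llama3",        # any chat model you have pulled
        "temperature": 0,
        "base_url": "http://localhost:11434",  # Ollama's default endpoint
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
}
```

The same config dict is then passed to SmartScraperGraph exactly as in the OpenAI example below; nothing else in the workflow changes.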

Easy Interface

  • Streamlit web UI
  • URL input and prompt entry
  • Instant results display
  • JSON output format

Privacy First

  • All processing happens locally or in your LLM account
  • No data sent to third-party scrapers
  • Open-source transparency

Cloud SDK Version

SmartScraper

Extract structured data using natural language prompts

SearchScraper

AI-powered web search with structured results

SmartCrawler

Crawl multiple pages intelligently (50+ pages in a single request)

Markdownify

Convert webpages to clean markdown format

Setup

1. Clone Repository

git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps/starter_ai_agents/web_scraping_ai_agent
2. Install Dependencies

pip install -r requirements.txt
Required packages:
  • streamlit - Web interface
  • scrapegraphai - Scraping library
  • playwright - Browser automation
3. Get OpenAI API Key

  • Sign up at OpenAI Platform
  • Generate an API key
  • You’ll enter it in the app (no environment variable needed)
4. Run the Application

For cloud models (OpenAI):
streamlit run ai_scrapper.py
For local models (Ollama):
streamlit run local_ai_scrapper.py
Open http://localhost:8501 in your browser

Usage

Local Library Version

1. Enter API Key

Input your OpenAI API key in the sidebar (for cloud models)
2. Select Model

Choose between GPT-4o, GPT-5, or local models
3. Enter URL

Provide the website URL you want to scrape
4. Write Prompt

Describe what data you want to extract:
Extract all product names, prices, and ratings
5. Scrape

Click “Scrape” and view the structured results

Cloud SDK Version

1. Initialize Client

from scrapegraph_py import Client

client = Client(api_key="your-api-key")
2. Choose Method

Select the appropriate scraping method:
  • SmartScraper for single pages
  • SearchScraper for web searches
  • SmartCrawler for multi-page crawling
  • Markdownify for markdown conversion
3. Execute Request

Make the API call with your parameters
4. Process Results

Receive structured JSON or markdown data

Code Examples

Local Library: Basic Scraping

import streamlit as st
from scrapegraphai.graphs import SmartScraperGraph

st.title("Web Scraping AI Agent 🕵️‍♂️")
st.caption("Scrape websites using OpenAI API")

# Get API key
openai_api_key = st.text_input("OpenAI API Key", type="password")

if openai_api_key:
    # Select model
    model = st.radio("Select the model", ["gpt-4o", "gpt-5"], index=0)
    
    # Configure scraper
    graph_config = {
        "llm": {
            "api_key": openai_api_key,
            "model": model,
        },
    }
    
    # Get URL and prompt
    url = st.text_input("Enter the URL of the website")
    user_prompt = st.text_input("What do you want to extract?")
    
    # Create scraper
    smart_scraper_graph = SmartScraperGraph(
        prompt=user_prompt,
        source=url,
        config=graph_config
    )
    
    # Scrape
    if st.button("Scrape"):
        result = smart_scraper_graph.run()
        st.write(result)

Cloud SDK: SmartScraper

from scrapegraph_py import Client

# Initialize client
client = Client(api_key="your-sgai-api-key")

# Extract structured data
response = client.smartscraper(
    website_url="https://example.com/products",
    user_prompt="Extract all product names, prices, and availability"
)

print(response)
# Output: {"products": [{"name": "...", "price": "...", "available": true}, ...]}

Cloud SDK: SearchScraper

# AI-powered web search with structured results
response = client.searchscraper(
    user_prompt="Find the top 5 AI news websites",
    num_results=5
)

for result in response["results"]:
    print(f"{result['title']}: {result['url']}")

Cloud SDK: SmartCrawler

# Crawl multiple pages
request_id = client.smartcrawler(
    url="https://docs.example.com",
    user_prompt="Extract all API endpoints and their descriptions",
    max_pages=50
)

# Check progress
status = client.smartcrawler_progress(request_id)
print(f"Progress: {status['progress']}%")

# Get results when complete
if status['status'] == 'completed':
    results = client.smartcrawler_result(request_id)
    print(results)
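
The progress check above can be wrapped in a small polling loop. This is a sketch that assumes the smartcrawler_progress / smartcrawler_result calls behave as shown above; the poll interval and timeout values are arbitrary:

```python
import time

def wait_for_crawl(client, request_id, poll_seconds=5, timeout=600):
    """Poll the crawl until it completes, fails, or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = client.smartcrawler_progress(request_id)
        if status["status"] == "completed":
            return client.smartcrawler_result(request_id)
        if status["status"] == "failed":
            raise RuntimeError(f"Crawl failed: {status}")
        time.sleep(poll_seconds)  # wait before checking again
    raise TimeoutError("Crawl did not finish within the timeout")
```

This keeps the blocking logic in one place instead of scattering status checks through your script.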

Cloud SDK: Markdownify

# Convert webpage to markdown
response = client.markdownify(
    website_url="https://example.com/article"
)

print(response["markdown"])
# Output: Clean markdown version of the webpage

Example Use Cases

Prompt: “Extract product names, prices, and availability”
Use for:
  • Price monitoring and comparison
  • Inventory tracking
  • Competitor analysis
  • Market research
response = client.smartscraper(
    website_url="https://shop.example.com/category/electronics",
    user_prompt="Extract product name, current price, original price, discount percentage, and stock status for all items"
)
Prompt: “Extract article title, author, date, and main content”
Use for:
  • News aggregation
  • Content curation
  • Research databases
  • Media monitoring
response = client.smartscraper(
    website_url="https://news.example.com/latest",
    user_prompt="Extract headline, author name, publication date, article summary, and full text"
)
Prompt: “Find company names, emails, and phone numbers”
Use for:
  • B2B prospecting
  • Contact list building
  • Sales outreach
  • Market intelligence
response = client.smartscraper(
    website_url="https://directory.example.com",
    user_prompt="Extract company name, contact email, phone number, and address for each listing"
)
Prompt: “Extract property details, prices, and location”
Use for:
  • Market analysis
  • Investment research
  • Comparative pricing
  • Trend tracking
response = client.smartscraper(
    website_url="https://realestate.example.com/listings",
    user_prompt="Extract property address, price, bedrooms, bathrooms, square footage, and agent contact"
)
Prompt: “Extract job title, company, salary, and requirements”
Use for:
  • Job aggregation
  • Salary research
  • Skills analysis
  • Market trends
response = client.smartscraper(
    website_url="https://careers.example.com/openings",
    user_prompt="Extract job title, company name, location, salary range, required skills, and application deadline"
)
Prompt: “Extract all API endpoints and parameters”
Use for:
  • API documentation
  • Integration planning
  • Code generation
  • Technical research
request_id = client.smartcrawler(
    url="https://api-docs.example.com",
    user_prompt="Extract API endpoint URLs, HTTP methods, parameters, and response formats",
    max_pages=100
)

Feature Comparison

Feature        | Local Library         | Cloud SDK
Setup          | Install dependencies  | API key only
Cost           | Free (+ LLM costs)    | Pay-per-use
Processing     | Your hardware         | Cloud-based
Speed          | Depends on hardware   | Fast & optimized
SmartScraper   | ✅                    | ✅
SearchScraper  | ❌                    | ✅
SmartCrawler   | ❌                    | ✅
Markdownify    | ❌                    | ✅
Scheduled Jobs | ❌                    | ✅
Scalability    | Limited               | Unlimited
Maintenance    | Self-managed          | Fully managed

Which Version Should You Use?

Choose the local library version if:
✅ You want a free, open-source solution
✅ You have good hardware (modern CPU/GPU)
✅ You need full control over the process
✅ Privacy is critical (sensitive data)
✅ You’re learning or prototyping
✅ You want to customize the scraping logic
Pro Tip: Start with the local version to learn and experiment, then switch to the SDK for production workloads!

Best Practices

Respect Robots.txt: Always check a website’s robots.txt file and terms of service before scraping. Respect rate limits and crawl delays.
Be Specific: Write detailed prompts. Instead of “get product info”, use “extract product name, SKU, price, color options, and stock status”.
Test First: Test your scraping prompts on a single page before crawling an entire site.

Writing Effective Prompts

1. Identify Data Points

List exactly what fields you want:
  • Product name
  • Price (including currency)
  • Availability status
  • Rating (if present)
2. Be Explicit

Specify formats and edge cases:
  • “Extract price as a number without currency symbols”
  • “If rating is not available, return null”
3. Structure Output

Request specific JSON structure:
  • “Return results as array of objects”
  • “Each object should have ‘name’, ‘price’, and ‘url’ keys”
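
Putting the three steps together, a complete prompt string might look like this (the field names are illustrative):

```python
# One prompt that applies all three guidelines: named fields, explicit
# formats and edge cases, and a requested output structure.
prompt = (
    "Extract every product on the page. "
    "Return a JSON array of objects, each with the keys "
    "'name' (string), 'price' (number without currency symbols), "
    "'url' (absolute URL), and 'rating' (number, or null if not shown)."
)
```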

Troubleshooting

Issue: Scraper returns no data or misses fields
Solutions:
  • Make prompt more specific
  • Check if website uses JavaScript (may need browser automation)
  • Try different model (GPT-4o vs local model)
  • Verify URL is accessible
  • Test with simpler page first
Issue: Scraping times out or hangs
Solutions:
  • Check internet connection
  • Try smaller/simpler pages
  • Use Cloud SDK for heavy scraping
  • Increase timeout in config
  • Check if website blocks scrapers
Issue: Extracted data has wrong format
Solutions:
  • Refine prompt to specify exact format
  • Add data validation examples in prompt
  • Use schema definition if SDK supports it
  • Post-process results with Python
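
As an example of post-processing with Python, here is a small hypothetical helper (field names are illustrative) that normalizes scraped price strings into numbers:

```python
import re

def parse_price(raw):
    """Coerce strings like '$1,299.00' to a float; return None if unparseable."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(raw))  # strip currency symbols, commas
    try:
        return float(cleaned) if cleaned else None
    except ValueError:
        return None

items = [{"name": "Widget", "price": "$1,299.00"},
         {"name": "Gadget", "price": "N/A"}]
for item in items:
    item["price"] = parse_price(item["price"])
# items[0]["price"] == 1299.0, items[1]["price"] is None
```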
Issue: Getting blocked or rate limited
Solutions:
  • Add delays between requests
  • Use Cloud SDK (better rate limit handling)
  • Rotate user agents if needed
  • Respect robots.txt crawl-delay
  • Consider using proxy services
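
A minimal pattern for spacing out requests, as a sketch: scrape_one stands in for whichever scraper call you use (SmartScraperGraph.run, the SDK, etc.), and the default delay is an arbitrary example, not derived from any site's policy:

```python
import time

def scrape_all(urls, scrape_one, delay_seconds=2.0):
    """Scrape each URL in turn, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        results.append(scrape_one(url))
        if i < len(urls) - 1:
            time.sleep(delay_seconds)  # pause to avoid hammering the server
    return results
```

If a site publishes a Crawl-delay in its robots.txt, use that value instead of a guess.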

Performance Tips

Local Library

  • Use local models (Llama, Mistral) for cost savings
  • Start with simpler pages for testing
  • Monitor memory usage with large pages
  • Cache results when possible
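
One way to cache results is a simple on-disk cache keyed by URL and prompt, sketched below; the names are hypothetical and scrape_fn is whatever callable performs the actual scrape:

```python
import hashlib
import json
import os

CACHE_DIR = ".scrape_cache"  # illustrative location

def cached_scrape(url, prompt, scrape_fn):
    """Return a cached result for (url, prompt) if present, else scrape and store."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(f"{url}\n{prompt}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # cache hit: skip the LLM call entirely
    result = scrape_fn(url, prompt)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

Repeated runs of the same extraction then cost nothing, which matters when each scrape invokes an LLM.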

Cloud SDK

  • Use SmartCrawler for multi-page scraping
  • Leverage scheduled jobs for regular scraping
  • Monitor credit usage
  • Use appropriate max_pages limits
Always Review Terms of Service: Many websites prohibit automated scraping. Review and comply with each website’s ToS.

Do

✅ Check robots.txt
✅ Respect crawl delays
✅ Use reasonable rate limits
✅ Identify your bot in the user-agent
✅ Scrape public data only

Don't

❌ Scrape copyrighted content for profit
❌ Overwhelm servers with requests
❌ Bypass authentication or paywalls
❌ Scrape personal data without consent
❌ Ignore cease-and-desist notices

Next Steps

Tutorial

Follow the complete step-by-step tutorial

ScrapeGraph Docs

Read the official ScrapeGraph AI documentation

More Examples

Explore other AI agent examples

GitHub

View source code and contribute
