
Overview

The Web Scraping AI Agent makes web scraping as simple as describing what you want to extract. Using ScrapeGraph AI technology, it converts natural language prompts into intelligent web scraping workflows that extract structured data from any website—no coding required.
FREE Tutorial Available: Follow the complete step-by-step tutorial to learn how to build this from scratch with detailed explanations and best practices.

Two Implementations

This project includes two versions optimized for different use cases:

Local Library

Files: ai_scrapper.py, local_ai_scrapper.py
Uses the open-source ScrapeGraph AI library running locally.
✅ Free to use (no API costs)
✅ Full control over execution
✅ Privacy-friendly (all data stays local)
❌ Requires local installation
❌ Limited by your hardware
❌ Need to manage updates

Cloud SDK

Folder: scrapegraph_ai_sdk/
Uses the managed ScrapeGraph AI API with advanced features.
✅ No setup required (just an API key)
✅ Scalable and fast
✅ Advanced features (SmartCrawler, SearchScraper)
✅ Always up-to-date
❌ Pay-per-use (credit-based)
❌ Requires an internet connection

Features

Local Library Version

Smart Scraping

  • Natural language extraction prompts
  • GPT-4o or local LLM support
  • Automatic HTML parsing
  • Structured data output

Flexible Models

  • OpenAI GPT-4o for best quality
  • GPT-5 support
  • Local models via Ollama (Llama, Mistral, etc.)
  • No vendor lock-in
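
For the Ollama path, only the graph configuration changes. A minimal sketch, where the model names and the base_url are assumptions; use whatever models you have pulled locally:

```python
# Hedged sketch: a SmartScraperGraph config pointing at a local Ollama
# server instead of OpenAI. Model names and the endpoint are examples.
graph_config = {
    "llm": {
        "model": "ollama/llama3",        # any chat model you have pulled
        "temperature": 0,
        "base_url": "http://localhost:11434",  # Ollama's default endpoint
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
}
```

The same config dict is then passed to SmartScraperGraph exactly as in the OpenAI example below; nothing else in the workflow changes.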

Easy Interface

  • Streamlit web UI
  • URL input and prompt entry
  • Instant results display
  • JSON output format

Privacy First

  • All processing happens locally or in your LLM account
  • No data sent to third-party scrapers
  • Open-source transparency

Cloud SDK Version

SmartScraper

Extract structured data using natural language prompts

SearchScraper

AI-powered web search with structured results

SmartCrawler

Crawl multiple pages intelligently (50+ pages in a single request)

Markdownify

Convert webpages to clean markdown format

Setup

1. Clone Repository

git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps/starter_ai_agents/web_scraping_ai_agent
2. Install Dependencies

pip install -r requirements.txt
Required packages:
  • streamlit - Web interface
  • scrapegraphai - Scraping library
  • playwright - Browser automation
3. Get OpenAI API Key

  • Sign up at OpenAI Platform
  • Generate an API key
  • You’ll enter it in the app (no environment variable needed)
4. Run the Application

For cloud models (OpenAI):
streamlit run ai_scrapper.py
For local models (Ollama):
streamlit run local_ai_scrapper.py
Open http://localhost:8501 in your browser

Usage

Local Library Version

1. Enter API Key

Input your OpenAI API key in the sidebar (for cloud models)
2. Select Model

Choose between GPT-4o, GPT-5, or local models
3. Enter URL

Provide the website URL you want to scrape
4. Write Prompt

Describe what data you want to extract:
Extract all product names, prices, and ratings
5. Scrape

Click “Scrape” and view the structured results

Cloud SDK Version

1. Initialize Client

from scrapegraph_py import Client

client = Client(api_key="your-api-key")
2. Choose Method

Select the appropriate scraping method:
  • SmartScraper for single pages
  • SearchScraper for web searches
  • SmartCrawler for multi-page crawling
  • Markdownify for markdown conversion
3. Execute Request

Make the API call with your parameters
4. Process Results

Receive structured JSON or markdown data

Code Examples

Local Library: Basic Scraping

import streamlit as st
from scrapegraphai.graphs import SmartScraperGraph

st.title("Web Scraping AI Agent 🕵️‍♂️")
st.caption("Scrape websites using OpenAI API")

# Get API key
openai_api_key = st.text_input("OpenAI API Key", type="password")

if openai_api_key:
    # Select model
    model = st.radio("Select the model", ["gpt-4o", "gpt-5"], index=0)
    
    # Configure scraper
    graph_config = {
        "llm": {
            "api_key": openai_api_key,
            "model": model,
        },
    }
    
    # Get URL and prompt
    url = st.text_input("Enter the URL of the website")
    user_prompt = st.text_input("What do you want to extract?")
    
    # Create scraper
    smart_scraper_graph = SmartScraperGraph(
        prompt=user_prompt,
        source=url,
        config=graph_config
    )
    
    # Scrape
    if st.button("Scrape"):
        result = smart_scraper_graph.run()
        st.write(result)

Cloud SDK: SmartScraper

from scrapegraph_py import Client

# Initialize client
client = Client(api_key="your-sgai-api-key")

# Extract structured data
response = client.smartscraper(
    website_url="https://example.com/products",
    user_prompt="Extract all product names, prices, and availability"
)

print(response)
# Output: {"products": [{"name": "...", "price": "...", "available": true}, ...]}

Cloud SDK: SearchScraper

# AI-powered web search with structured results
response = client.searchscraper(
    user_prompt="Find the top 5 AI news websites",
    num_results=5
)

for result in response["results"]:
    print(f"{result['title']}: {result['url']}")

Cloud SDK: SmartCrawler

# Crawl multiple pages
request_id = client.smartcrawler(
    url="https://docs.example.com",
    user_prompt="Extract all API endpoints and their descriptions",
    max_pages=50
)

# Check progress
status = client.smartcrawler_progress(request_id)
print(f"Progress: {status['progress']}%")

# Get results when complete
if status['status'] == 'completed':
    results = client.smartcrawler_result(request_id)
    print(results)
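
The progress check above can be wrapped in a small polling loop. This is a sketch that assumes the smartcrawler_progress / smartcrawler_result calls behave as shown above; the poll interval and timeout values are arbitrary:

```python
import time

def wait_for_crawl(client, request_id, poll_seconds=5, timeout=600):
    """Poll the crawl until it completes, fails, or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = client.smartcrawler_progress(request_id)
        if status["status"] == "completed":
            return client.smartcrawler_result(request_id)
        if status["status"] == "failed":
            raise RuntimeError(f"Crawl failed: {status}")
        time.sleep(poll_seconds)  # wait before checking again
    raise TimeoutError("Crawl did not finish within the timeout")
```

This keeps the blocking logic in one place instead of scattering status checks through your script.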

Cloud SDK: Markdownify

# Convert webpage to markdown
response = client.markdownify(
    website_url="https://example.com/article"
)

print(response["markdown"])
# Output: Clean markdown version of the webpage

Example Use Cases

Prompt: “Extract product names, prices, and availability”
Use for:
  • Price monitoring and comparison
  • Inventory tracking
  • Competitor analysis
  • Market research
response = client.smartscraper(
    website_url="https://shop.example.com/category/electronics",
    user_prompt="Extract product name, current price, original price, discount percentage, and stock status for all items"
)
Prompt: “Extract article title, author, date, and main content”
Use for:
  • News aggregation
  • Content curation
  • Research databases
  • Media monitoring
response = client.smartscraper(
    website_url="https://news.example.com/latest",
    user_prompt="Extract headline, author name, publication date, article summary, and full text"
)
Prompt: “Find company names, emails, and phone numbers”
Use for:
  • B2B prospecting
  • Contact list building
  • Sales outreach
  • Market intelligence
response = client.smartscraper(
    website_url="https://directory.example.com",
    user_prompt="Extract company name, contact email, phone number, and address for each listing"
)
Prompt: “Extract property details, prices, and location”
Use for:
  • Market analysis
  • Investment research
  • Comparative pricing
  • Trend tracking
response = client.smartscraper(
    website_url="https://realestate.example.com/listings",
    user_prompt="Extract property address, price, bedrooms, bathrooms, square footage, and agent contact"
)
Prompt: “Extract job title, company, salary, and requirements”
Use for:
  • Job aggregation
  • Salary research
  • Skills analysis
  • Market trends
response = client.smartscraper(
    website_url="https://careers.example.com/openings",
    user_prompt="Extract job title, company name, location, salary range, required skills, and application deadline"
)
Prompt: “Extract all API endpoints and parameters”
Use for:
  • API documentation
  • Integration planning
  • Code generation
  • Technical research
request_id = client.smartcrawler(
    url="https://api-docs.example.com",
    user_prompt="Extract API endpoint URLs, HTTP methods, parameters, and response formats",
    max_pages=100
)

Feature Comparison

Feature        | Local Library         | Cloud SDK
Setup          | Install dependencies  | API key only
Cost           | Free (+ LLM costs)    | Pay-per-use
Processing     | Your hardware         | Cloud-based
Speed          | Depends on hardware   | Fast & optimized
SmartScraper   | ✅                    | ✅
SearchScraper  | ❌                    | ✅
SmartCrawler   | ❌                    | ✅
Markdownify    | ❌                    | ✅
Scheduled Jobs | ❌                    | ✅
Scalability    | Limited               | Unlimited
Maintenance    | Self-managed          | Fully managed

Which Version Should You Use?

Choose the local library version if:
✅ You want a free, open-source solution
✅ You have good hardware (modern CPU/GPU)
✅ You need full control over the process
✅ Privacy is critical (sensitive data)
✅ You’re learning or prototyping
✅ You want to customize the scraping logic
Pro Tip: Start with the local version to learn and experiment, then switch to the SDK for production workloads!

Best Practices

Respect Robots.txt: Always check a website’s robots.txt file and terms of service before scraping. Respect rate limits and crawl delays.
Be Specific: Write detailed prompts. Instead of “get product info”, use “extract product name, SKU, price, color options, and stock status”.
Test First: Test your scraping prompts on a single page before crawling an entire site.

Writing Effective Prompts

1. Identify Data Points

List exactly what fields you want:
  • Product name
  • Price (including currency)
  • Availability status
  • Rating (if present)
2. Be Explicit

Specify formats and edge cases:
  • “Extract price as a number without currency symbols”
  • “If rating is not available, return null”
3. Structure Output

Request specific JSON structure:
  • “Return results as array of objects”
  • “Each object should have ‘name’, ‘price’, and ‘url’ keys”
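
Putting the three steps together, a complete prompt string might look like this (the field names are illustrative):

```python
# One prompt that applies all three guidelines: named fields, explicit
# formats and edge cases, and a requested output structure.
prompt = (
    "Extract every product on the page. "
    "Return a JSON array of objects, each with the keys "
    "'name' (string), 'price' (number without currency symbols), "
    "'url' (absolute URL), and 'rating' (number, or null if not shown)."
)
```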

Troubleshooting

Issue: Scraper returns no data or misses fields
Solutions:
  • Make prompt more specific
  • Check if website uses JavaScript (may need browser automation)
  • Try different model (GPT-4o vs local model)
  • Verify URL is accessible
  • Test with simpler page first
Issue: Scraping times out or hangs
Solutions:
  • Check internet connection
  • Try smaller/simpler pages
  • Use Cloud SDK for heavy scraping
  • Increase timeout in config
  • Check if website blocks scrapers
Issue: Extracted data has wrong format
Solutions:
  • Refine prompt to specify exact format
  • Add data validation examples in prompt
  • Use schema definition if SDK supports it
  • Post-process results with Python
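
As an example of post-processing with Python, here is a small hypothetical helper (field names are illustrative) that normalizes scraped price strings into numbers:

```python
import re

def parse_price(raw):
    """Coerce strings like '$1,299.00' to a float; return None if unparseable."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(raw))  # strip currency symbols, commas
    try:
        return float(cleaned) if cleaned else None
    except ValueError:
        return None

items = [{"name": "Widget", "price": "$1,299.00"},
         {"name": "Gadget", "price": "N/A"}]
for item in items:
    item["price"] = parse_price(item["price"])
# items[0]["price"] == 1299.0, items[1]["price"] is None
```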
Issue: Getting blocked or rate limited
Solutions:
  • Add delays between requests
  • Use Cloud SDK (better rate limit handling)
  • Rotate user agents if needed
  • Respect robots.txt crawl-delay
  • Consider using proxy services
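
A minimal pattern for spacing out requests, as a sketch: scrape_one stands in for whichever scraper call you use (SmartScraperGraph.run, the SDK, etc.), and the default delay is an arbitrary example, not derived from any site's policy:

```python
import time

def scrape_all(urls, scrape_one, delay_seconds=2.0):
    """Scrape each URL in turn, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        results.append(scrape_one(url))
        if i < len(urls) - 1:
            time.sleep(delay_seconds)  # pause to avoid hammering the server
    return results
```

If a site publishes a Crawl-delay in its robots.txt, use that value instead of a guess.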

Performance Tips

Local Library

  • Use local models (Llama, Mistral) for cost savings
  • Start with simpler pages for testing
  • Monitor memory usage with large pages
  • Cache results when possible
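
One way to cache results is a simple on-disk cache keyed by URL and prompt, sketched below; the names are hypothetical and scrape_fn is whatever callable performs the actual scrape:

```python
import hashlib
import json
import os

CACHE_DIR = ".scrape_cache"  # illustrative location

def cached_scrape(url, prompt, scrape_fn):
    """Return a cached result for (url, prompt) if present, else scrape and store."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(f"{url}\n{prompt}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # cache hit: skip the LLM call entirely
    result = scrape_fn(url, prompt)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

Repeated runs of the same extraction then cost nothing, which matters when each scrape invokes an LLM.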

Cloud SDK

  • Use SmartCrawler for multi-page scraping
  • Leverage scheduled jobs for regular scraping
  • Monitor credit usage
  • Use appropriate max_pages limits
Always Review Terms of Service: Many websites prohibit automated scraping. Review and comply with each website’s ToS.

Do

✅ Check robots.txt
✅ Respect crawl delays
✅ Use reasonable rate limits
✅ Identify your bot in the user-agent
✅ Scrape public data only

Don't

❌ Scrape copyrighted content for profit
❌ Overwhelm servers with requests
❌ Bypass authentication or paywalls
❌ Scrape personal data without consent
❌ Ignore cease-and-desist notices

Next Steps

Tutorial

Follow the complete step-by-step tutorial

ScrapeGraph Docs

Read the official ScrapeGraph AI documentation

More Examples

Explore other AI agent examples

GitHub

View source code and contribute
