Overview
The Web Scraping AI Agent makes web scraping as simple as describing what you want to extract. Using ScrapeGraph AI technology, it converts natural language prompts into intelligent web scraping workflows that extract structured data from any website, no coding required.

Two Implementations
This project includes two versions optimized for different use cases:

Local Library
File: ai_scrapper.py, local_ai_scrapper.py

Uses the open-source ScrapeGraph AI library running locally.

✅ Free to use (no API costs)
✅ Full control over execution
✅ Privacy-friendly (all data stays local)
❌ Requires local installation
❌ Limited by your hardware
❌ Need to manage updates

Cloud SDK
Folder: scrapegraph_ai_sdk/

Uses the managed ScrapeGraph AI API with advanced features.

✅ No setup required (just an API key)
✅ Scalable and fast
✅ Advanced features (SmartCrawler, SearchScraper)
✅ Always up-to-date
❌ Pay-per-use (credit-based)
❌ Requires an internet connection

Features
Local Library Version
Smart Scraping
- Natural language extraction prompts
- GPT-4o or local LLM support
- Automatic HTML parsing
- Structured data output
Flexible Models
- OpenAI GPT-4o for best quality
- GPT-5 support
- Local models via Ollama (Llama, Mistral, etc.)
- No vendor lock-in
Easy Interface
- Streamlit web UI
- URL input and prompt entry
- Instant results display
- JSON output format
Privacy First
- All processing happens locally or in your LLM account
- No data sent to third-party scrapers
- Open-source transparency
Cloud SDK Version
SmartScraper
Extract structured data using natural language prompts
SearchScraper
AI-powered web search with structured results
SmartCrawler
Crawl multiple pages intelligently (50+ pages per job)
Markdownify
Convert webpages to clean markdown format
Setup
- Local Library
- Cloud SDK
Install Dependencies
- streamlit - Web interface
- scrapegraphai - Scraping library
- playwright - Browser automation
Get OpenAI API Key
- Sign up at OpenAI Platform
- Generate an API key
- You’ll enter it in the app (no environment variable needed)
Usage
Local Library Version
Cloud SDK Version
Choose Method
Select the appropriate scraping method:
- SmartScraper for single pages
- SearchScraper for web searches
- SmartCrawler for multi-page crawling
- Markdownify for markdown conversion
Code Examples
Local Library: Basic Scraping
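A minimal sketch of the local-library flow, assuming the scrapegraphai package's SmartScraperGraph class. The target URL, prompt, and the RUN_LIVE_SCRAPE opt-in variable are illustrative, so the live call only fires when you explicitly enable it.

```python
import os

# LLM configuration for the local ScrapeGraph AI pipeline.
graph_config = {
    "llm": {
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "openai/gpt-4o",  # or e.g. "ollama/llama3" for a local model
    },
    "verbose": False,
    "headless": True,  # run the browser without a visible window
}

# Gate the live call so the sketch only scrapes when you opt in.
if os.environ.get("RUN_LIVE_SCRAPE"):
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt="Extract product names, prices, and availability",
        source="https://example.com/products",  # illustrative URL
        config=graph_config,
    )
    print(graph.run())  # structured dict of extracted fields
```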
Cloud SDK: SmartScraper
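A hedged sketch of the equivalent SDK call, assuming the scrapegraph-py package's Client and its smartscraper method; the URL, prompt, and the RUN_LIVE_SCRAPE gate are placeholders.

```python
import os

website_url = "https://example.com/products"  # placeholder target
user_prompt = "Extract product names, prices, and availability"

# Requires a ScrapeGraph API key; gated so the sketch runs only when opted in.
if os.environ.get("SGAI_API_KEY") and os.environ.get("RUN_LIVE_SCRAPE"):
    from scrapegraph_py import Client

    client = Client(api_key=os.environ["SGAI_API_KEY"])
    response = client.smartscraper(
        website_url=website_url,
        user_prompt=user_prompt,
    )
    print(response)  # structured result plus request metadata
    client.close()
```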
Cloud SDK: SearchScraper
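A sketch of a SearchScraper call, again assuming scrapegraph-py's Client; the search prompt is illustrative. Unlike SmartScraper, no URL is supplied, since the service searches the web itself.

```python
import os

search_prompt = "What are the top 3 open-source web scraping libraries?"

if os.environ.get("SGAI_API_KEY") and os.environ.get("RUN_LIVE_SCRAPE"):
    from scrapegraph_py import Client

    client = Client(api_key=os.environ["SGAI_API_KEY"])
    # SearchScraper runs a web search, then extracts a structured answer
    # along with the source URLs it used.
    response = client.searchscraper(user_prompt=search_prompt)
    print(response)
    client.close()
```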
Cloud SDK: SmartCrawler
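A rough sketch of a multi-page crawl. The method name (crawl) and its depth/max_pages parameters are assumptions about the SDK's crawler endpoint, so check the official ScrapeGraph reference for the exact signature before relying on this.

```python
import os

start_url = "https://example.com/blog"  # placeholder starting point
crawl_prompt = "Extract the title and summary of every article"

if os.environ.get("SGAI_API_KEY") and os.environ.get("RUN_LIVE_SCRAPE"):
    from scrapegraph_py import Client

    client = Client(api_key=os.environ["SGAI_API_KEY"])
    # Assumed signature: follow links from start_url, bounded by
    # depth and max_pages, extracting per-page data along the way.
    response = client.crawl(
        url=start_url,
        prompt=crawl_prompt,
        depth=2,
        max_pages=10,
    )
    print(response)
    client.close()
```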
Cloud SDK: Markdownify
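A sketch of a Markdownify call, assuming scrapegraph-py's markdownify method; the page URL is a placeholder. No prompt is needed, since the whole page is converted to markdown.

```python
import os

page_url = "https://example.com/docs/getting-started"  # placeholder

if os.environ.get("SGAI_API_KEY") and os.environ.get("RUN_LIVE_SCRAPE"):
    from scrapegraph_py import Client

    client = Client(api_key=os.environ["SGAI_API_KEY"])
    # Converts the rendered page into clean markdown text.
    response = client.markdownify(website_url=page_url)
    print(response)
    client.close()
```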
Example Use Cases
E-commerce Scraping
Prompt: “Extract product names, prices, and availability”

Use for:
- Price monitoring and comparison
- Inventory tracking
- Competitor analysis
- Market research
Content Aggregation
Prompt: “Extract article title, author, date, and main content”

Use for:
- News aggregation
- Content curation
- Research databases
- Media monitoring
Lead Generation
Prompt: “Find company names, emails, and phone numbers”

Use for:
- B2B prospecting
- Contact list building
- Sales outreach
- Market intelligence
Real Estate Data
Prompt: “Extract property details, prices, and location”

Use for:
- Market analysis
- Investment research
- Comparative pricing
- Trend tracking
Job Listings
Prompt: “Extract job title, company, salary, and requirements”

Use for:
- Job aggregation
- Salary research
- Skills analysis
- Market trends
Documentation Extraction
Prompt: “Extract all API endpoints and parameters”

Use for:
- API documentation
- Integration planning
- Code generation
- Technical research
Feature Comparison
| Feature | Local Library | Cloud SDK |
|---|---|---|
| Setup | Install dependencies | API key only |
| Cost | Free (+ LLM costs) | Pay-per-use |
| Processing | Your hardware | Cloud-based |
| Speed | Depends on hardware | Fast & optimized |
| SmartScraper | ✅ | ✅ |
| SearchScraper | ❌ | ✅ |
| SmartCrawler | ❌ | ✅ |
| Markdownify | ❌ | ✅ |
| Scheduled Jobs | ❌ | ✅ |
| Scalability | Limited | Unlimited |
| Maintenance | Self-managed | Fully managed |
Which Version Should You Use?
Choose Local Library if:

✅ You want a free, open-source solution
✅ You have good hardware (modern CPU/GPU)
✅ You need full control over the process
✅ Privacy is critical (sensitive data)
✅ You’re learning or prototyping
✅ You want to customize the scraping logic

Choose Cloud SDK if:

✅ You want zero setup (just an API key)
✅ You need scale and speed
✅ You want advanced features (SmartCrawler, SearchScraper, scheduled jobs)
✅ You prefer a fully managed, always up-to-date service
Pro Tip: Start with the local version to learn and experiment, then switch to the SDK for production workloads!
Best Practices
Test First: Test your scraping prompts on a single page before crawling an entire site.
Writing Effective Prompts
Identify Data Points
List exactly what fields you want:
- Product name
- Price (including currency)
- Availability status
- Rating (if present)
Be Explicit
Specify formats and edge cases:
- “Extract price as a number without currency symbols”
- “If rating is not available, return null”
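Putting both rules together, a well-specified prompt might look like the following; the field list and formats are illustrative.

```python
# An explicit extraction prompt: names each field, pins down formats, and
# covers the missing-value case so the model doesn't improvise.
prompt = (
    "Extract every product on the page as a list of objects with fields: "
    "'name' (string), 'price' (number, no currency symbols), "
    "'currency' (ISO code like 'USD'), 'in_stock' (true/false), "
    "'rating' (number 0-5, or null if not shown)."
)
print(prompt)
```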
Troubleshooting
Empty or Incomplete Results
Issue: Scraper returns no data or misses fields

Solutions:
- Make prompt more specific
- Check if website uses JavaScript (may need browser automation)
- Try different model (GPT-4o vs local model)
- Verify URL is accessible
- Test with simpler page first
Timeout Errors
Issue: Scraping times out or hangs

Solutions:
- Check internet connection
- Try smaller/simpler pages
- Use Cloud SDK for heavy scraping
- Increase timeout in config
- Check if website blocks scrapers
Invalid or Malformed Data
Issue: Extracted data has the wrong format

Solutions:
- Refine prompt to specify exact format
- Add data validation examples in prompt
- Use schema definition if SDK supports it
- Post-process results with Python
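As a sketch of that last point, a small post-processing pass can normalize scraped fields after extraction; the field names and sample data here are made up.

```python
import re

def clean_price(raw):
    """Strip currency symbols and thousands separators from a scraped
    price string; return a float, or None if no number is present."""
    if raw is None:
        return None
    match = re.search(r"[\d.,]+", str(raw))
    if not match:
        return None
    return float(match.group(0).replace(",", ""))

# Illustrative scraper output with messy price strings.
scraped = [{"name": "Widget", "price": "$1,299.99"},
           {"name": "Gadget", "price": None}]
cleaned = [{**item, "price": clean_price(item["price"])} for item in scraped]
print(cleaned)  # prices become 1299.99 and None
```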
Rate Limiting
Issue: Getting blocked or rate limited

Solutions:
- Add delays between requests
- Use Cloud SDK (better rate limit handling)
- Rotate user agents if needed
- Respect robots.txt crawl-delay
- Consider using proxy services
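A minimal sketch of the first point, adding a fixed delay between requests; the fetch callable is a stand-in for your real scraping call.

```python
import time

def polite_get(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL with a fixed pause between requests.
    fetch is any callable, e.g. a wrapper around your scraper."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        results.append(fetch(url))
    return results

# Usage with a stub fetcher (swap in a real scraping call):
pages = polite_get(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda u: f"fetched {u}",
    delay_seconds=0.1,
)
print(pages)
```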
Performance Tips
Local Library
- Use local models (Llama, Mistral) for cost savings
- Start with simpler pages for testing
- Monitor memory usage with large pages
- Cache results when possible
Cloud SDK
- Use SmartCrawler for multi-page scraping
- Leverage scheduled jobs for regular scraping
- Monitor credit usage
- Use appropriate max_pages limits
Legal and Ethical Considerations
Do
✅ Check robots.txt
✅ Respect crawl delays
✅ Use reasonable rate limits
✅ Identify your bot in user-agent
✅ Scrape public data only
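The first two points can be checked with Python's standard-library urllib.robotparser; the robots.txt content and bot name below are made up.

```python
from urllib import robotparser

# Example robots.txt content (normally fetched from https://<site>/robots.txt).
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".strip().splitlines()

parser = robotparser.RobotFileParser()
parser.parse(robots_txt)

# Ask before scraping: is this path allowed, and how long should we wait?
print(parser.can_fetch("my-scraper-bot", "https://example.com/products"))   # True
print(parser.can_fetch("my-scraper-bot", "https://example.com/private/x"))  # False
print(parser.crawl_delay("my-scraper-bot"))                                 # 5
```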
Don't
❌ Scrape copyrighted content for profit
❌ Overwhelm servers with requests
❌ Bypass authentication or paywalls
❌ Scrape personal data without consent
❌ Ignore cease-and-desist notices
Next Steps
Tutorial
Follow the complete step-by-step tutorial
ScrapeGraph Docs
Read the official ScrapeGraph AI documentation
More Examples
Explore other AI agent examples
GitHub
View source code and contribute
