Overview
SmartScraperGraph is the most popular and versatile graph in ScrapeGraphAI. It automates extracting information from web pages, using a large language model to interpret and answer natural language prompts.
Features
- Extract structured data from any web page using natural language prompts
- Automatic HTML parsing and content chunking
- Support for both URLs and local HTML files
- Schema-based output for type-safe data extraction
- Configurable reasoning and reattempt modes for improved accuracy
Parameters
The SmartScraperGraph constructor accepts the following parameters:
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm | dict | Required | LLM model configuration |
| verbose | bool | False | Enable detailed logging |
| headless | bool | True | Run browser in headless mode |
| html_mode | bool | False | Skip HTML parsing for faster execution |
| reasoning | bool | False | Enable reasoning step before extraction |
| reattempt | bool | False | Retry if extraction fails |
| force | bool | False | Force fetch even with cache |
| cut | bool | True | Cut content to model token limit |
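These options are usually supplied together in a single configuration dictionary. The following is a minimal sketch, assuming the defaults listed in the table; the helper name `build_config` and the model string are illustrative and not part of the library.

```python
# Defaults copied from the table above; `llm` has no default and must be supplied.
DEFAULTS = {
    "verbose": False,
    "headless": True,
    "html_mode": False,
    "reasoning": False,
    "reattempt": False,
    "force": False,
    "cut": True,
}


def build_config(llm: dict, **overrides) -> dict:
    """Hypothetical helper: merge the table defaults with caller overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    if not llm:
        raise ValueError("The 'llm' configuration is required")
    return {"llm": dict(llm), **DEFAULTS, **overrides}


# Example: verbose logging, everything else left at its default.
config = build_config({"model": "openai/gpt-4o-mini"}, verbose=True)
```

Rejecting unknown keys up front catches misspelled option names, which would otherwise be easy to miss in a plain dictionary.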
Usage Examples
- OpenAI
- Ollama
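A minimal sketch of the two setups named above. The model identifiers, prompt, and URL are placeholder assumptions; the OpenAI variant expects an `OPENAI_API_KEY` environment variable, and the Ollama variant assumes a local Ollama server with the model already pulled.

```python
import os


def openai_config() -> dict:
    """Configuration for a hosted OpenAI model (model name is an example)."""
    return {
        "llm": {
            "api_key": os.getenv("OPENAI_API_KEY", ""),
            "model": "openai/gpt-4o-mini",
        },
        "verbose": True,
        "headless": True,
    }


def ollama_config() -> dict:
    """Configuration for a local Ollama model (model name is an example)."""
    return {
        "llm": {"model": "ollama/llama3.2"},
        "verbose": True,
        "headless": True,
    }


def scrape(prompt: str, source: str, config: dict):
    """Build the graph and run it.

    The import is done lazily so the config helpers above can be used
    even when the scrapegraphai package is not installed.
    """
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=source, config=config)
    return graph.run()


# Usage (requires the scrapegraphai package and network access):
# result = scrape("List the page title and all section headings",
#                 "https://example.com", openai_config())
# Swap in ollama_config() to run against a local model instead.
```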
Schema-Based Extraction
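As a sketch of what a schema might look like, the example below defines two Pydantic models and passes the outer one through the `schema` argument. The model names, fields, and prompt are invented for illustration.

```python
from typing import List

from pydantic import BaseModel, Field


class Project(BaseModel):
    title: str = Field(description="Project title")
    description: str = Field(description="Short project description")


class Projects(BaseModel):
    projects: List[Project]


def scrape_projects(source: str, config: dict):
    """Run the graph with the schema attached (needs scrapegraphai installed)."""
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt="List all projects with their descriptions",
        source=source,
        config=config,
        schema=Projects,
    )
    return graph.run()


# The same schema can validate a result dictionary on the caller's side.
sample = {"projects": [{"title": "Demo", "description": "An example entry"}]}
validated = Projects(**sample)
```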
Use Pydantic schemas to ensure type-safe, structured output.
Advanced Configuration
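Each mode described in this section is a single boolean in the configuration dictionary. A minimal sketch of the three variants side by side, assuming a shared base configuration with a placeholder model name:

```python
# Shared settings; the model identifier is a placeholder.
base = {
    "llm": {"model": "openai/gpt-4o-mini"},
    "verbose": False,
    "headless": True,
}

# Each variant is a shallow copy of `base` with one flag switched on.
html_mode_config = {**base, "html_mode": True}   # simple pages: skip HTML parsing
reasoning_config = {**base, "reasoning": True}   # complex extraction tasks
reattempt_config = {**base, "reattempt": True}   # retry failed extractions
```

Building each variant from a copy leaves `base` untouched, so the modes can be compared on the same page without interfering with one another.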
HTML Mode
Skip HTML parsing for faster execution when dealing with simple pages.
Reasoning Mode
Enable reasoning for complex extraction tasks.
Reattempt Mode
Automatically retry failed extractions.
Local HTML Files
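A minimal sketch for local input, assuming the file's HTML text is passed as the `source` in place of a URL. The helper names, file path, and prompt are placeholders.

```python
from pathlib import Path


def load_html(path: str) -> str:
    """Read a local HTML file as text."""
    return Path(path).read_text(encoding="utf-8")


def scrape_local(path: str, prompt: str, config: dict):
    """Pass the file's HTML as the source (needs scrapegraphai installed)."""
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=prompt, source=load_html(path), config=config)
    return graph.run()


# Usage (placeholder path and prompt):
# result = scrape_local("page.html", "List all product names", config)
```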
You can scrape local HTML files instead of URLs.
Output Format
The run() method returns extracted data based on your prompt:
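The exact shape depends on the prompt and, if one is supplied, on the schema. Below is a sketch of handling a dictionary result; the sample value is invented for illustration and is not real scraper output.

```python
import json


def format_result(result) -> str:
    """Pretty-print a result for logging or saving to disk."""
    return json.dumps(result, indent=2, ensure_ascii=False)


# Invented example of what a prompt such as "List the article titles" might yield.
sample_result = {"titles": ["First article", "Second article"]}
print(format_result(sample_result))
```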
Error Handling
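A minimal sketch of guarding the call, assuming failures surface as ordinary Python exceptions. The library's specific exception types are not listed here, so the example catches `Exception` broadly; `safe_run` is a hypothetical helper that works with any object exposing `run()`.

```python
import logging

logger = logging.getLogger("scraper")


def safe_run(graph, default=None):
    """Run the graph; log and return `default` if it fails or yields nothing."""
    try:
        result = graph.run()
    except Exception as exc:  # broad on purpose: exact exception types are not assumed
        logger.error("Extraction failed: %s", exc)
        return default
    if not result:
        logger.warning("Extraction returned no data")
        return default
    return result
```

Returning a default instead of re-raising keeps batch jobs running when one page fails; re-raise inside the `except` block if a failure should stop the pipeline.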
Performance Tips
- Use html_mode=True for simple pages to skip parsing overhead
- Set headless=True in production for better performance
- Use cut=True to prevent token limit errors with large pages
- Enable verbose=True during development to debug issues
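The tips above can be folded into one helper that switches between a development and a production profile. This is a sketch: `make_profile` is a hypothetical name, and running with a visible browser during development is an assumption rather than something the tips prescribe.

```python
def make_profile(llm: dict, production: bool, simple_pages: bool = False) -> dict:
    """Hypothetical helper applying the performance tips to one environment."""
    return {
        "llm": llm,
        "headless": production,     # headless in production; visible browser in development (assumption)
        "verbose": not production,  # detailed logs only while developing
        "cut": True,                # keep content within the model's token limit
        "html_mode": simple_pages,  # skip parsing only when pages are simple
    }


# Placeholder model identifier.
dev_config = make_profile({"model": "openai/gpt-4o-mini"}, production=False)
prod_config = make_profile({"model": "openai/gpt-4o-mini"}, production=True)
```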
Related Graphs
- SmartScraperMultiGraph: Scrape multiple URLs at once
- SearchGraph: Search and scrape results
