Configuration
Configuration in ScrapeGraphAI is done through a dictionary that controls the behavior of graphs, nodes, and LLM models.
Configuration Structure
The main configuration dictionary has the following structure:
graph_config = {
    "llm": {                       # LLM configuration (required)
        "model": "provider/model-name",
        "api_key": "your-api-key",
        # ... model-specific settings
    },
    "verbose": False,              # Enable detailed logging
    "headless": True,              # Browser headless mode
    "timeout": 480,                # Request timeout (seconds)
    "cache_path": False,           # Cache directory (False disables caching)
    "loader_kwargs": {},           # Browser loader options
    # ... additional options
}
The llm configuration is required for all graphs. All other settings are optional.
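One way to picture the required/optional split is a plain dict merge. This is an illustrative sketch, not the library's actual code; the `DEFAULTS` values are simply the ones shown in the structure above, and the real library applies its own defaults internally:

```python
# Hypothetical defaults for the optional top-level settings, shown only
# to illustrate the merge; the library applies its own internally.
DEFAULTS = {
    "verbose": False,
    "headless": True,
    "timeout": 480,
    "cache_path": False,
    "loader_kwargs": {},
}

def with_defaults(user_config: dict) -> dict:
    """Fill in unspecified optional keys; the 'llm' section stays mandatory."""
    if "llm" not in user_config:
        raise ValueError("'llm' configuration is required")
    return {**DEFAULTS, **user_config}

cfg = with_defaults({"llm": {"model": "openai/gpt-4o-mini"}, "verbose": True})
# cfg["verbose"] is True (user override); cfg["headless"] is True (default)
```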
LLM Configuration
The llm section configures the language model used for generation.
Basic Configuration
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-...",
"temperature": 0
}
Model Specification
Models can be specified with or without provider prefix:
With Provider (Recommended)
# Format: "provider/model-name"
"llm": {
"model": "openai/gpt-4o-mini"
}
"llm": {
"model": "anthropic/claude-3-opus-20240229"
}
"llm": {
"model": "groq/llama3-70b-8192"
}
Without Provider (Auto-detected)
# ScrapeGraphAI will auto-detect the provider
"llm": {
"model": "gpt-4o-mini" # Detected as OpenAI
}
"llm": {
"model": "llama3.1" # Detected as Ollama
}
If multiple providers support the same model, the first match is used.
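The resolution rule can be sketched as a small helper. This is a simplified illustration, not the library's implementation; `KNOWN_MODELS` is a made-up registry standing in for the library's internal model tables:

```python
# Simplified stand-in for the library's internal model registry.
KNOWN_MODELS = {
    "openai": {"gpt-4o-mini", "gpt-4o"},
    "ollama": {"llama3.1", "mistral"},
}

def resolve_provider(model: str) -> tuple[str, str]:
    """Split an explicit 'provider/model' string, or guess the provider."""
    if "/" in model:
        provider, name = model.split("/", 1)
        return provider, name
    for provider, models in KNOWN_MODELS.items():  # first match wins
        if model in models:
            return provider, model
    raise ValueError(f"cannot detect a provider for {model!r}")

print(resolve_provider("openai/gpt-4o-mini"))  # ('openai', 'gpt-4o-mini')
print(resolve_provider("llama3.1"))            # ('ollama', 'llama3.1')
```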
Supported Providers
OpenAI
GPT-4, GPT-3.5, etc.
Groq
Fast inference models
Ollama
Local model hosting
Azure OpenAI
Azure-hosted models
Bedrock
AWS Bedrock models
Hugging Face
HF model endpoints
Provider-Specific Settings
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": os.getenv("OPENAI_API_KEY"),
"temperature": 0,
"streaming": False,
"model_tokens": 128000 # Optional: override token limit
}
"llm": {
"model": "anthropic/claude-3-sonnet-20240229",
"api_key": os.getenv("ANTHROPIC_API_KEY"),
"temperature": 0,
"max_tokens": 4096
}
"llm": {
"model": "ollama/llama3.1",
"temperature": 0,
"base_url": "http://localhost:11434", # Ollama server URL
"format": "json" # Force JSON output
}
"llm": {
"model": "google_genai/gemini-pro",
"api_key": os.getenv("GOOGLE_API_KEY"),
"temperature": 0
}
"llm": {
"model": "groq/llama3-70b-8192",
"api_key": os.getenv("GROQ_API_KEY"),
"temperature": 0
}
"llm": {
"model": "azure_openai/gpt-4",
"api_key": os.getenv("AZURE_OPENAI_API_KEY"),
"azure_endpoint": "https://your-resource.openai.azure.com/",
"api_version": "2024-02-15-preview",
"azure_deployment": "your-deployment-name",
"temperature": 0
}
Using Model Instances
You can pass pre-configured model instances:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key="sk-..."
)

graph_config = {
    "llm": {
        "model_instance": llm,
        "model_tokens": 128000  # Required with model_instance
    },
    "verbose": True
}
Rate Limiting
Control request rates to avoid API limits:
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-...",
"rate_limit": {
"requests_per_second": 3, # Max requests per second
"max_retries": 5 # Retry attempts on failure
}
}
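Conceptually, these two settings express a throttle plus a retry loop around each LLM call. The standalone sketch below illustrates that behavior only; it is not the library's internals, which delegate rate limiting to the underlying LLM client:

```python
import time

def throttled_with_retry(fn, requests_per_second=3, max_retries=5):
    """Space calls to at most `requests_per_second` and retry on failure."""
    min_interval = 1.0 / requests_per_second
    last_call = [0.0]  # mutable cell so the wrapper can update it

    def wrapper(*args, **kwargs):
        last_error = None
        for _attempt in range(max_retries):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)  # throttle: keep calls spaced apart
            last_call[0] = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                last_error = exc  # retry until attempts are exhausted
        raise last_error

    return wrapper
```

Wrapping a flaky call with `throttled_with_retry(call, requests_per_second=3, max_retries=5)` keeps calls at least a third of a second apart and retries up to five times before giving up.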
Behavior Settings
Verbose Mode
Enable detailed logging for debugging:
"verbose": True # Shows detailed execution logs
Output:
--- Executing Fetch Node ---
Fetching HTML from: https://example.com
--- Executing Parse Node ---
Parsing document into 5 chunks
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|██████████| 5/5 [00:03<00:00]
Headless Mode
Control browser visibility:
"headless": True # Browser runs in background (default)
"headless": False # Browser window visible (debugging)
Set headless: False when debugging to see what the browser is doing.
Timeout
Set request timeout in seconds:
"timeout": 30 # 30 seconds (default)
"timeout": 120 # 2 minutes for slow sites
"timeout": None # No timeout (not recommended)
Applies to:
- HTTP requests in FetchNode
- Browser loading in ChromiumLoader
- LLM generation requests
Cache Path
Enable result caching:
"cache_path": "./scraping_cache" # Cache to specific directory
"cache_path": False # Disable caching (default)
Browser Configuration
Loader Arguments
Pass custom arguments to the browser loader:
"loader_kwargs": {
"proxy": "http://proxy.example.com:8080",
"wait_until": "networkidle", # Wait for network idle
"timeout": 60000, # Browser timeout (ms)
"user_agent": "Custom User Agent"
}
Storage State
Use saved browser sessions (cookies, local storage):
"storage_state": "./auth_state.json" # Load saved session
Useful for:
- Logged-in scraping
- Session persistence
- Cookie-based access
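The file referenced by `storage_state` follows Playwright's storage-state format: a JSON object with `cookies` and `origins` lists. The sketch below writes a placeholder session file (all values are dummies; a real file is produced by logging in with a browser and saving the context state) and reads it back:

```python
import json
from pathlib import Path

# Placeholder session in Playwright's storage-state shape.
state = {
    "cookies": [
        {"name": "session_id", "value": "dummy-value",
         "domain": "example.com", "path": "/"},
    ],
    "origins": [],  # localStorage entries, keyed by origin
}

path = Path("auth_state.json")
path.write_text(json.dumps(state, indent=2))

# The graph config can then point at this file: "storage_state": str(path)
loaded = json.loads(path.read_text())
print(loaded["cookies"][0]["name"])  # session_id
```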
Browser Services
"browser_base": {
"api_key": "your-browserbase-key",
"project_id": "your-project-id"
}
Uses BrowserBase for cloud browser automation.
"scrape_do": {
"api_key": "your-scrapedo-key",
"use_proxy": True,
"geoCode": "us",
"super_proxy": False
}
Uses ScrapeDo proxy service for scraping.
Advanced Settings
HTML Mode
Skip parsing and send raw HTML to LLM:
"html_mode": True # Skip ParseNode, use raw HTML
Use when:
- Content is already clean
- You want maximum context
- Parsing breaks important structure
Force Markdown
Force markdown conversion regardless of model:
"force": True # Always convert to markdown
Default behavior:
- OpenAI models: Automatic markdown conversion
- Other models: No conversion unless forced
Reasoning Mode
Enable chain-of-thought reasoning:
"reasoning": True # Add ReasoningNode to pipeline
Adds a reasoning step before answer generation for better quality.
Reattempt Mode
Retry when extraction fails:
"reattempt": True # Retry if answer is empty or "NA"
Adds a ConditionalNode that checks answer quality and regenerates if needed.
Additional Context
Provide extra context to the LLM:
"additional_info": "Focus on products under $50. Ignore out of stock items."
This text is prepended to prompts in GenerateAnswerNode.
Reduction Factor
Control HTML reduction (Code Generator only):
"reduction": 2 # Reduce HTML size by factor of 2
Max Iterations
Control code generation iterations:
"max_iterations": {
"overall": 10,
"syntax": 3,
"execution": 3,
"validation": 3,
"semantic": 3
}
Burr Integration
Enable workflow tracking with Burr:
"burr_kwargs": {
"app_instance_id": "scraping-session-123",
"project_name": "my-scraper",
"storage_dir": "./burr_state"
}
Burr provides state visualization, debugging tools, and execution replay capabilities.
Complete Configuration Example
Here’s a comprehensive example:
import os
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List

# Define output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="In stock status")

class Products(BaseModel):
    products: List[Product]

# Complete configuration
graph_config = {
    # LLM Configuration (Required)
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "temperature": 0,
        "streaming": False,
        "rate_limit": {
            "requests_per_second": 3,
            "max_retries": 5
        }
    },

    # Behavior Settings
    "verbose": True,          # Show detailed logs
    "headless": True,         # Run browser in background
    "timeout": 60,            # 60 second timeout
    "cache_path": "./cache",  # Enable caching

    # Browser Configuration
    "loader_kwargs": {
        "wait_until": "networkidle",
        "timeout": 60000
    },

    # Processing Options
    "html_mode": False,   # Use parsed text
    "force": True,        # Force markdown conversion
    "cut": True,          # Enable HTML cleanup
    "reasoning": False,   # Disable reasoning step
    "reattempt": True,    # Retry on failure

    # Additional Context
    "additional_info": "Extract only products under $100"
}

# Create and run graph
scraper = SmartScraperGraph(
    prompt="Extract all available products",
    source="https://example.com/shop",
    config=graph_config,
    schema=Products
)

result = scraper.run()
print(result)
Environment Variables
Store sensitive data in environment variables:
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...
from dotenv import load_dotenv
import os

load_dotenv()

graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY")
    }
}
Never commit API keys to version control. Always use environment variables or secure vaults.
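A small guard can make a missing variable fail loudly at startup instead of surfacing later as an authentication error. In this sketch, `require_env` is a hypothetical helper, and the `setdefault` line exists only so the snippet runs standalone; in real use the value comes from your `.env`:

```python
import os

os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")  # demo only

def require_env(name: str) -> str:
    """Return the variable's value, or fail with an actionable message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value

graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": require_env("OPENAI_API_KEY"),
    }
}
```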
Configuration Validation
ScrapeGraphAI validates configuration at runtime:
try:
    graph = SmartScraperGraph(
        prompt="Extract data",
        source="https://example.com",
        config=invalid_config
    )
except ValueError as e:
    print(f"Configuration error: {e}")
    # Handle configuration errors
Common errors:
- Missing llm configuration
- Invalid model provider
- Missing required API keys
- Invalid parameter types
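Those checks can also be run as a pre-flight step before constructing the graph. The helper below is a hypothetical sketch mirroring the error list above, not the library's validator; the provider set is abridged to the prefixes shown in the examples on this page:

```python
# Abridged: provider prefixes shown in the examples on this page.
KNOWN_PROVIDERS = {"openai", "anthropic", "ollama", "groq",
                   "azure_openai", "google_genai"}

def check_config(config: dict) -> list[str]:
    """Collect configuration problems instead of failing on the first one."""
    errors = []
    llm = config.get("llm")
    if llm is None:
        errors.append("missing 'llm' configuration")
    else:
        model = llm.get("model")
        if not model and "model_instance" not in llm:
            errors.append("'llm' needs a 'model' or 'model_instance'")
        elif model and "/" in model:
            provider = model.split("/", 1)[0]
            if provider not in KNOWN_PROVIDERS:
                errors.append(f"invalid model provider: {provider}")
    timeout = config.get("timeout", 0)
    if timeout is not None and not isinstance(timeout, (int, float)):
        errors.append("'timeout' must be a number or None")
    return errors

print(check_config({}))                                    # ["missing 'llm' configuration"]
print(check_config({"llm": {"model": "openai/gpt-4o-mini"}}))  # []
```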
Best Practices
Start Simple
Begin with a minimal configuration:
graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY")
    }
}
Add settings as needed.
Use Environment Variables
Store credentials securely:
# Good
"api_key": os.getenv("OPENAI_API_KEY")

# Bad
"api_key": "sk-hardcoded-key-123"
Enable Verbose for Debugging
Use verbose mode during development:
"verbose": True if os.getenv("DEBUG") else False
Set Appropriate Timeouts
Adjust timeouts based on site complexity:
# Fast static sites
"timeout": 30

# Slow dynamic sites
"timeout": 120
Next Steps
Schemas
Define structured output with Pydantic
Graphs
Learn about graph types and workflows
Examples
See configuration examples
API Reference
Complete API documentation