Overview
Google Gemini provides cutting-edge AI models with industry-leading context windows, perfect for scraping large documents and complex websites.
Key Features:
Massive Context: Up to 2M tokens (Gemini 2.0 Pro)
Multimodal: Process text, images, and video
Fast: Gemini Flash optimized for speed
Cost-Effective: Competitive pricing
Latest Tech: Google's newest AI capabilities
Prerequisites
Install ScrapeGraphAI
pip install scrapegraphai
playwright install
Set Environment Variable
export GOOGLE_API_KEY="your-api-key"
Or create a .env file:
GOOGLE_API_KEY=your-api-key
Basic Configuration
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all product information",
    source="https://example.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
Available Models
Recommended
All Models
Vertex AI
Gemini 2.0 Flash (Best for Most Tasks)
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
    },
}
Context: 1M tokens
Speed: Very fast
Cost: Affordable
Best for: Most scraping tasks
Gemini 2.0 Pro (Maximum Context)
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-pro-exp",
    },
}
Context: 2M tokens (largest available)
Quality: Highest
Best for: Very large documents, complex reasoning
Complete list of Gemini models:

Model | Context | Speed | Best For
gemini-2.0-flash-latest | 1M | Very Fast | General use
gemini-2.0-flash-exp | 1M | Very Fast | Experimental features
gemini-2.0-pro-exp | 2M | Fast | Maximum context
gemini-1.5-pro-latest | 128K | Fast | Stable, reliable
gemini-1.5-flash-latest | 128K | Very Fast | Speed priority
gemini-pro | 128K | Fast | Legacy support
Gemini 2.0 models are recommended for new projects.
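The trade-off in the table above can be automated. A minimal sketch of picking between Flash and Pro based on input size; the pick_gemini_model helper and the 4-characters-per-token heuristic are illustrative assumptions, not part of ScrapeGraphAI:

```python
# Hypothetical helper (not part of ScrapeGraphAI): choose a Gemini model
# from a rough size estimate of the page content.

def pick_gemini_model(page_text: str) -> str:
    # Rough heuristic: ~4 characters per token for English text.
    estimated_tokens = len(page_text) // 4
    if estimated_tokens > 900_000:  # approaching Flash's 1M-token window
        return "google_genai/gemini-2.0-pro-exp"   # 2M-token context
    return "google_genai/gemini-2.0-flash-latest"  # fast, affordable default

print(pick_gemini_model("a short page"))
```

The returned string drops straight into graph_config's "model" key.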
Use Gemini via Google Cloud Vertex AI:
graph_config = {
    "llm": {
        "model": "google_vertexai/gemini-2.0-flash",
        "project_id": "your-gcp-project-id",
        "location": "us-central1",
    },
}
Vertex AI Benefits:
Enterprise features
VPC integration
Better SLAs
Usage quotas
Configuration Options
Temperature
Control response randomness:
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "temperature": 0,  # Deterministic (recommended for scraping)
    },
}
0: Deterministic, consistent results
0.5: Balanced
1.0: Creative, varied responses
Use temperature: 0 for web scraping to ensure consistent data extraction.
Max Tokens
Limit response length:
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "max_tokens": 4000,  # Limit output
    },
}
Safety Settings
Control content filtering:
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "safety_settings": [
            {
                "category": "HARM_CATEGORY_HARASSMENT",
                "threshold": "BLOCK_NONE",
            },
            {
                "category": "HARM_CATEGORY_HATE_SPEECH",
                "threshold": "BLOCK_NONE",
            },
        ],
    },
}
Adjust safety settings if legitimate content is being blocked during scraping.
Complete Examples
Basic Scraping
import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "temperature": 0,
    },
    "verbose": True,
    "headless": False,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract the main article with title, author, and date",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper.run()
print(json.dumps(result, indent=4))

# Get execution info
graph_exec_info = smart_scraper.get_execution_info()
print(prettify_exec_info(graph_exec_info))
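Extraction can also be constrained to a structured schema: SmartScraperGraph accepts a Pydantic model via its schema argument. A minimal sketch, in which the Article/Articles field names are illustrative assumptions:

```python
from typing import List
from pydantic import BaseModel

# Illustrative schema; the field names are assumptions for this example.
class Article(BaseModel):
    title: str
    author: str
    date: str

class Articles(BaseModel):
    articles: List[Article]

# Pass the model through SmartScraperGraph's schema argument, e.g.:
# SmartScraperGraph(prompt=..., source=..., config=graph_config, schema=Articles)

sample = Article(title="Sample", author="Jane Doe", date="2024-01-01")
print(sample.model_dump())
```

With a schema attached, the result follows the declared fields instead of whatever shape the model improvises.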
Using Vertex AI
For enterprise deployments, use Google Cloud Vertex AI:
Install Google Cloud SDK
pip install google-cloud-aiplatform
Authenticate
gcloud auth application-default login
Configure ScrapeGraphAI
graph_config = {
    "llm": {
        "model": "google_vertexai/gemini-2.0-flash",
        "project_id": "your-gcp-project-id",
        "location": "us-central1",
        "temperature": 0,
    },
}
Cost Optimization
Use Gemini Flash for Speed
Gemini Flash models are faster and cheaper:
"model": "google_genai/gemini-2.0-flash-latest"  # Faster, cheaper
Reduce token usage for simple tasks:
"max_tokens": 2000  # Limit response length
Faster scraping = fewer API calls:
"headless": True  # Run browser in background
Implement caching for frequently scraped pages:
import functools

@functools.lru_cache(maxsize=100)
def scrape_cached(url):
    # Build the scraper per URL so the cached result matches its input
    return SmartScraperGraph(prompt="Extract all product information",
                             source=url, config=graph_config).run()
Troubleshooting
Invalid API Key
Error: Invalid API key
Solution:
Verify your API key at Google AI Studio
Ensure the key is active and not expired
Check that the environment variable is set:
echo $GOOGLE_API_KEY
Rate Limit Exceeded
Error: 429 Rate limit exceeded
Solution: Implement retry logic:
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
def scrape_with_retry():
    return scraper.run()
Content Blocked by Safety Filters
Error: Content blocked due to safety settings
Solution: Adjust safety settings:
"safety_settings": [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]
Model Not Found
Error: Model not found
Solution: Use the correct model name with the provider prefix:
# Correct
"model": "google_genai/gemini-2.0-flash-latest"

# Wrong
"model": "gemini-2.0-flash-latest"  # Missing prefix
Advantages of Gemini
Massive Context: Process up to 2M tokens in a single request, perfect for scraping entire websites or long documents.
Multimodal: Process text, images, and video together for richer data extraction.
Fast: Gemini Flash models offer industry-leading speed for real-time scraping.
Cost-Effective: Competitive pricing compared to other providers, especially for large contexts.
Best Practices
Use Latest Models: Always use the latest model versions:
"model": "google_genai/gemini-2.0-flash-latest"
Handle Rate Limits: Implement exponential backoff for production use.
Next Steps
Advanced Configuration Learn about proxy rotation and browser settings
Groq Explore ultra-fast Groq inference