Overview

SearchGraph is a scraping pipeline that searches the internet for answers to a given prompt, scrapes the top results, and merges the extracted information into a single comprehensive answer. It combines web search with intelligent, LLM-driven scraping.

Features

  • Automatic internet search based on your prompt
  • Scrapes multiple search results automatically
  • Merges information from multiple sources
  • Configurable number of results to scrape
  • Returns both the answer and the list of URLs that were considered

Parameters

The SearchGraph constructor accepts the following parameters:
SearchGraph(
    prompt: str,              # Natural language search query
    config: dict,             # Configuration dictionary
    schema: Optional[BaseModel] = None  # Pydantic schema for structured output
)
Unlike SmartScraperGraph, SearchGraph does not require a source parameter. It automatically searches and finds relevant URLs.

Configuration Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| llm | dict | Required | LLM model configuration |
| max_results | int | 3 | Number of search results to scrape |
| verbose | bool | False | Enable detailed logging |
| search_engine | str | "google" | Search engine to use |
| serper_api_key | str | Optional | API key for the Serper search engine |

Usage Examples

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SearchGraph

load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    "max_results": 2,
    "verbose": True,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me Chioggia's famous dishes",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)

# Get the URLs that were considered
urls = search_graph.get_considered_urls()
print("Scraped URLs:", urls)

Schema-Based Extraction

Use Pydantic schemas to structure the merged output:
import os
from typing import List

from pydantic import BaseModel, Field

class Dish(BaseModel):
    name: str = Field(description="Dish name")
    description: str = Field(description="Dish description")
    ingredients: List[str] = Field(description="Main ingredients")

class DishList(BaseModel):
    dishes: List[Dish] = Field(description="List of famous dishes")

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "max_results": 3,
}

search_graph = SearchGraph(
    prompt="What are Chioggia's famous dishes?",
    config=graph_config,
    schema=DishList
)

result = search_graph.run()
print(result)

Getting Considered URLs

Retrieve the list of URLs that were scraped:
search_graph = SearchGraph(
    prompt="Latest AI developments",
    config=graph_config
)

result = search_graph.run()

# Get the URLs that were considered during the search
urls = search_graph.get_considered_urls()
for url in urls:
    print(f"Scraped: {url}")

Custom Search Engine

You can configure different search engines:
graph_config = {
    "llm": {"model": "openai/gpt-4o"},
    "search_engine": "google",  # or "bing", "duckduckgo"
    "max_results": 5,
}

Using Serper API

For more reliable search results, use the Serper API:
import os

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "serper_api_key": os.getenv("SERPER_API_KEY"),
    "max_results": 3,
}

search_graph = SearchGraph(
    prompt="Best practices for Python web scraping",
    config=graph_config
)

Output Format

The run() method returns a merged answer from all scraped sources:
result = search_graph.run()
# Returns: dict with merged information from multiple sources
# (shaped by your Pydantic schema, if one was provided)

urls = search_graph.get_considered_urls()
# Returns: List[str] of URLs that were scraped

How It Works

  1. Search: The graph searches the internet based on your prompt
  2. Scrape: Top results (configured by max_results) are scraped using SmartScraperGraph
  3. Merge: Information from all sources is intelligently merged into a single answer
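Conceptually, the three steps can be sketched in plain Python. The `search`, `scrape`, and `merge` helpers below are illustrative stand-ins for the library's internals, not its actual API:

```python
def search(prompt, max_results):
    # Stand-in for step 1: return candidate URLs for the prompt.
    return [f"https://example.com/result/{i}" for i in range(max_results)]

def scrape(url, prompt):
    # Stand-in for step 2: each URL is scraped individually
    # (the real pipeline uses SmartScraperGraph here).
    return {"url": url, "answer": f"facts about '{prompt}' from {url}"}

def merge(fragments):
    # Stand-in for step 3: combine per-URL fragments into one answer.
    return {
        "answer": [f["answer"] for f in fragments],
        "sources": len(fragments),
    }

def run_pipeline(prompt, max_results=3):
    urls = search(prompt, max_results)             # 1. Search
    fragments = [scrape(u, prompt) for u in urls]  # 2. Scrape
    return merge(fragments), urls                  # 3. Merge

result, urls = run_pipeline("Chioggia's famous dishes", max_results=2)
print(result["sources"])  # 2
```

This also explains why `max_results` dominates runtime: step 2 runs once per URL, so each extra result adds a full scrape pass.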

Example Output

search_graph = SearchGraph(
    prompt="What are the health benefits of green tea?",
    config=graph_config
)

result = search_graph.run()

# Example output:
# {
#   "benefits": [
#     "Rich in antioxidants",
#     "May improve brain function",
#     "Increases fat burning",
#     "May reduce risk of certain cancers"
#   ],
#   "sources": 3
# }

urls = search_graph.get_considered_urls()
# ["https://healthline.com/...", "https://medicalnewstoday.com/...", ...]

Performance Tips

  • Setting max_results too high will increase execution time significantly
  • Each result is scraped individually using SmartScraperGraph
  • Recommended max_results: 2-5 for optimal performance
  • Use verbose=True to monitor which URLs are being scraped
  • The graph automatically filters out low-quality or inaccessible URLs
  • Merging happens at the end, combining information from all sources
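Since each result is scraped individually, execution time grows roughly linearly with max_results. A toy back-of-the-envelope estimate (the timings below are invented for illustration, not measured):

```python
PER_URL_SECONDS = 4.0  # invented average time to scrape one URL

def estimated_runtime(max_results, per_url=PER_URL_SECONDS, overhead=2.0):
    # Fixed search overhead plus one scrape pass per result.
    return overhead + max_results * per_url

for n in (2, 5, 10):
    print(f"max_results={n}: ~{estimated_runtime(n):.0f}s")
```

With these made-up numbers, going from 2 to 10 results quadruples the runtime, which is why the 2-5 range is recommended.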

Error Handling

try:
    result = search_graph.run()
    if result:
        print("Search successful:", result)
        print(f"Scraped {len(search_graph.get_considered_urls())} URLs")
    else:
        print("No results found")
except Exception as e:
    print(f"Error during search: {e}")
