
Overview

The FetchNode is responsible for fetching content from various sources including web pages, PDF files, CSV files, JSON files, XML files, and local HTML content. It acts as the starting point in many scraping workflows.

Class Signature

class FetchNode(BaseNode):
    def __init__(
        self,
        input: str,
        output: List[str],
        node_config: Optional[dict] = None,
        node_name: str = "Fetch",
    )
Source: scrapegraphai/nodes/fetch_node.py:20

Parameters

input
str
required
Boolean expression defining the input keys needed from the state. Supported input types:
  • url - Web page URL
  • local_dir - Local HTML content string
  • pdf - Path to PDF file
  • csv - Path to CSV file
  • json - Path to JSON file
  • xml - Path to XML file
  • md - Path to Markdown file
  • *_dir - Directory paths (pdf_dir, csv_dir, json_dir, xml_dir, md_dir)
output
List[str]
required
List of output keys to be updated in the state. Typically ["document"] or ["doc"]
node_config
dict
default:None
Optional configuration dictionary. Recognized keys include headless, verbose, timeout, loader_kwargs, llm_model, force, browser_base, and scrape_do (see Usage Examples below)
node_name
str
default:"Fetch"
The unique identifier name for the node
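
The input argument accepts a boolean expression over state keys (for example, an OR of two keys so the node can consume whichever is present). As an illustrative sketch of that OR semantics — an approximation, not the library's actual expression parser:

```python
# Minimal sketch of OR-resolution for an input expression such as
# "url | local_dir"; this approximates the idea, not the library's
# actual expression parser.
def resolve_input(expression: str, state: dict) -> str:
    """Return the first key in the OR-expression present in the state."""
    for key in (part.strip() for part in expression.split("|")):
        if key in state:
            return key
    raise ValueError(f"no input key from {expression!r} found in state")

print(resolve_input("url | local_dir", {"url": "https://example.com"}))  # url
```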

State Keys

Input State

url
str
URL of the web page to fetch
local_dir
str
Local HTML content as a string
pdf
str
Path to PDF file
csv
str
Path to CSV file
json
str
Path to JSON file
xml
str
Path to XML file
md
str
Path to Markdown file

Output State

document
List[Document]
List of LangChain Document objects containing the fetched content with metadata
doc
List[Document]
Copy of the document list for compatibility
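
Both keys hold LangChain Document objects. As a rough stand-in to illustrate the shape of the output state (the real class is langchain_core.documents.Document):

```python
from dataclasses import dataclass, field

# Stand-in illustrating the shape of the output state; the real class
# is langchain_core.documents.Document.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Document(page_content="<html>...</html>", metadata={"source": "html file"})
state = {"document": [doc], "doc": [doc]}  # "doc" mirrors "document"
```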

Methods

execute(state: dict) -> dict

Executes the node’s logic to fetch content from the specified source and update the state.
def execute(self, state):
    """
    Executes the node's logic to fetch HTML content from a specified URL and
    update the state with this content.
    """
Source: scrapegraphai/nodes/fetch_node.py:90
Returns: Updated state dictionary with the fetched document

handle_web_source(state: dict, source: str) -> dict

Handles fetching content from web URLs using ChromiumLoader, BrowserBase, or ScrapeDo.
def handle_web_source(self, state, source):
    """Handles the web source by fetching HTML content from a URL"""
Source: scrapegraphai/nodes/fetch_node.py:262

handle_file(state: dict, input_type: str, source: str) -> dict

Loads content from files based on their type (PDF, CSV, JSON, XML, MD).
def handle_file(self, state, input_type, source):
    """Loads the content of a file based on its input type"""
Source: scrapegraphai/nodes/fetch_node.py:142

handle_local_source(state: dict, source: str) -> dict

Handles local HTML content strings with optional markdown conversion.
def handle_local_source(self, state, source):
    """Handles the local source by fetching HTML content"""
Source: scrapegraphai/nodes/fetch_node.py:219

Usage Examples

Basic Web Scraping

from scrapegraphai.nodes import FetchNode

# Create fetch node
fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "headless": True,
        "verbose": False,
        "timeout": 30
    }
)

# Execute node
state = {"url": "https://example.com"}
updated_state = fetch_node.execute(state)

print(updated_state["document"])
# Output: [Document(page_content="...", metadata={"source": "html file"})]

Fetch with Browser Automation

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "headless": False,
        "loader_kwargs": {
            "wait_until": "networkidle",
            "timeout": 60000
        }
    }
)

state = {"url": "https://dynamic-site.com"}
updated_state = fetch_node.execute(state)

Fetch PDF Content

fetch_node = FetchNode(
    input="pdf",
    output=["document"],
    node_config={
        "timeout": 60,  # Longer timeout for large PDFs
        "verbose": True
    }
)

state = {"pdf": "/path/to/document.pdf"}
updated_state = fetch_node.execute(state)

Fetch with BrowserBase

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "browser_base": {
            "api_key": "your_api_key",
            "project_id": "your_project_id"
        }
    }
)

state = {"url": "https://protected-site.com"}
updated_state = fetch_node.execute(state)

Fetch with Proxy (ScrapeDo)

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "scrape_do": {
            "api_key": "your_api_key",
            "use_proxy": True,
            "geoCode": "us",
            "super_proxy": True
        }
    }
)

state = {"url": "https://geo-restricted.com"}
updated_state = fetch_node.execute(state)

Fetch CSV Data

fetch_node = FetchNode(
    input="csv",
    output=["document"]
)

state = {"csv": "/path/to/data.csv"}
updated_state = fetch_node.execute(state)

# Document contains CSV data as string
print(updated_state["document"][0].page_content)

Fetch with Markdown Conversion

from langchain_openai import ChatOpenAI

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "force": True,  # Force markdown conversion
    }
)

state = {"url": "https://example.com"}
updated_state = fetch_node.execute(state)
# Content is automatically converted to markdown

Supported Input Types

Input Type   Description           Example
url          Web page URL          https://example.com
local_dir    HTML string           <html>...</html>
pdf          PDF file path         /path/to/file.pdf
csv          CSV file path         /path/to/data.csv
json         JSON file path        /path/to/data.json
xml          XML file path         /path/to/data.xml
md           Markdown file path    /path/to/README.md
pdf_dir      PDF directory         /path/to/pdfs/
csv_dir      CSV directory         /path/to/csvs/
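
A hedged sketch of how type-to-loader dispatch can work; the loader functions here are illustrative placeholders (the real node delegates to LangChain document loaders), but the unsupported-type check mirrors the documented behavior:

```python
import json

# Illustrative dispatch from input type to a loader function; the real
# node delegates to LangChain document loaders instead of raw reads.
def load_text(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def load_json(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return json.dumps(json.load(f))

LOADERS = {"csv": load_text, "xml": load_text, "md": load_text, "json": load_json}

def handle_file(input_type: str, source: str) -> str:
    if input_type not in LOADERS:
        raise ValueError(f"unsupported input type: {input_type}")
    return LOADERS[input_type](source)
```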

Error Handling

The FetchNode reports failures in the following scenarios:
  • Invalid input type: Raises ValueError if the input type is not supported
  • Empty content: Raises ValueError if fetched content is empty or contains only whitespace
  • Timeout: Raises TimeoutError if operation exceeds configured timeout
  • HTTP errors: Logs warning if HTTP request fails (status code != 200)
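
The cases above suggest a defensive call pattern. A minimal sketch that works with any object exposing execute(state):

```python
import logging

logger = logging.getLogger("fetch")

# Defensive wrapper around node.execute covering the failure modes
# listed above; returns None instead of propagating fetch errors.
def safe_execute(node, state: dict):
    try:
        return node.execute(state)
    except ValueError as exc:   # unsupported input type or empty content
        logger.error("fetch failed: %s", exc)
    except TimeoutError:        # exceeded the configured timeout
        logger.error("fetch timed out for keys %s", list(state))
    return None
```

Returning None lets the calling pipeline decide whether to retry, skip, or abort.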

Best Practices

  1. Set appropriate timeouts - Use longer timeouts for large files or slow websites
  2. Use headless mode in production - Set headless=True for better performance
  3. Configure loader_kwargs - Fine-tune ChromiumLoader behavior for dynamic content
  4. Handle timeouts gracefully - Wrap node execution in try-except blocks
  5. Use BrowserBase for complex sites - Better handling of JavaScript-heavy pages
  6. Enable markdown conversion - Improves LLM processing with cleaner text format
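
Practices 1 and 4 can be combined in a small retry helper; the attempt count and delays below are illustrative defaults, not library settings:

```python
import time

# Retry transient timeouts with exponential backoff before giving up.
def execute_with_retry(node, state: dict, attempts: int = 3, base_delay: float = 1.0) -> dict:
    for attempt in range(attempts):
        try:
            return node.execute(state)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```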
