
Overview

The FetchNode is responsible for fetching content from various sources including web pages, PDF files, CSV files, JSON files, XML files, and local HTML content. It acts as the starting point in many scraping workflows.

Class Signature

class FetchNode(BaseNode):
    def __init__(
        self,
        input: str,
        output: List[str],
        node_config: Optional[dict] = None,
        node_name: str = "Fetch",
    )
Source: scrapegraphai/nodes/fetch_node.py:20

Parameters

input
str
required
Boolean expression defining the input keys needed from the state. Supported input types:
  • url - Web page URL
  • local_dir - Local HTML content string
  • pdf - Path to PDF file
  • csv - Path to CSV file
  • json - Path to JSON file
  • xml - Path to XML file
  • md - Path to Markdown file
  • *_dir - Directory paths (pdf_dir, csv_dir, json_dir, xml_dir, md_dir)
output
List[str]
required
List of output keys to be updated in the state. Typically ["document"] or ["doc"]
node_config
dict
default:None
Optional configuration dictionary. Recognized keys include headless, verbose, timeout, loader_kwargs, llm_model, force, browser_base, and scrape_do (see Usage Examples below)
node_name
str
default:"Fetch"
The unique identifier name for the node
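
The input argument accepts a boolean expression over state keys (for example, an OR of two keys so the node can consume whichever is present). As an illustrative sketch of that OR semantics — an approximation, not the library's actual expression parser:

```python
# Minimal sketch of OR-resolution for an input expression such as
# "url | local_dir"; this approximates the idea, not the library's
# actual expression parser.
def resolve_input(expression: str, state: dict) -> str:
    """Return the first key in the OR-expression present in the state."""
    for key in (part.strip() for part in expression.split("|")):
        if key in state:
            return key
    raise ValueError(f"no input key from {expression!r} found in state")

print(resolve_input("url | local_dir", {"url": "https://example.com"}))  # url
```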

State Keys

Input State

url
str
URL of the web page to fetch
local_dir
str
Local HTML content as a string
pdf
str
Path to PDF file
csv
str
Path to CSV file
json
str
Path to JSON file
xml
str
Path to XML file
md
str
Path to Markdown file

Output State

document
List[Document]
List of LangChain Document objects containing the fetched content with metadata
doc
List[Document]
Copy of the document list for compatibility
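
Both keys hold LangChain Document objects. As a rough stand-in to illustrate the shape of the output state (the real class is langchain_core.documents.Document):

```python
from dataclasses import dataclass, field

# Stand-in illustrating the shape of the output state; the real class
# is langchain_core.documents.Document.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Document(page_content="<html>...</html>", metadata={"source": "html file"})
state = {"document": [doc], "doc": [doc]}  # "doc" mirrors "document"
```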

Methods

execute(state: dict) -> dict

Executes the node’s logic to fetch content from the specified source and update the state.
def execute(self, state):
    """
    Executes the node's logic to fetch HTML content from a specified URL and
    update the state with this content.
    """
Source: scrapegraphai/nodes/fetch_node.py:90
Returns: Updated state dictionary with the fetched document

handle_web_source(state: dict, source: str) -> dict

Handles fetching content from web URLs using ChromiumLoader, BrowserBase, or ScrapeDo.
def handle_web_source(self, state, source):
    """Handles the web source by fetching HTML content from a URL"""
Source: scrapegraphai/nodes/fetch_node.py:262

handle_file(state: dict, input_type: str, source: str) -> dict

Loads content from files based on their type (PDF, CSV, JSON, XML, MD).
def handle_file(self, state, input_type, source):
    """Loads the content of a file based on its input type"""
Source: scrapegraphai/nodes/fetch_node.py:142

handle_local_source(state: dict, source: str) -> dict

Handles local HTML content strings with optional markdown conversion.
def handle_local_source(self, state, source):
    """Handles the local source by fetching HTML content"""
Source: scrapegraphai/nodes/fetch_node.py:219

Usage Examples

Basic Web Scraping

from scrapegraphai.nodes import FetchNode

# Create fetch node
fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "headless": True,
        "verbose": False,
        "timeout": 30
    }
)

# Execute node
state = {"url": "https://example.com"}
updated_state = fetch_node.execute(state)

print(updated_state["document"])
# Output: [Document(page_content="...", metadata={"source": "html file"})]

Fetch with Browser Automation

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "headless": False,
        "loader_kwargs": {
            "wait_until": "networkidle",
            "timeout": 60000
        }
    }
)

state = {"url": "https://dynamic-site.com"}
updated_state = fetch_node.execute(state)

Fetch PDF Content

fetch_node = FetchNode(
    input="pdf",
    output=["document"],
    node_config={
        "timeout": 60,  # Longer timeout for large PDFs
        "verbose": True
    }
)

state = {"pdf": "/path/to/document.pdf"}
updated_state = fetch_node.execute(state)

Fetch with BrowserBase

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "browser_base": {
            "api_key": "your_api_key",
            "project_id": "your_project_id"
        }
    }
)

state = {"url": "https://protected-site.com"}
updated_state = fetch_node.execute(state)

Fetch with Proxy (ScrapeDo)

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "scrape_do": {
            "api_key": "your_api_key",
            "use_proxy": True,
            "geoCode": "us",
            "super_proxy": True
        }
    }
)

state = {"url": "https://geo-restricted.com"}
updated_state = fetch_node.execute(state)

Fetch CSV Data

fetch_node = FetchNode(
    input="csv",
    output=["document"]
)

state = {"csv": "/path/to/data.csv"}
updated_state = fetch_node.execute(state)

# Document contains CSV data as string
print(updated_state["document"][0].page_content)

Fetch with Markdown Conversion

from langchain_openai import ChatOpenAI

fetch_node = FetchNode(
    input="url",
    output=["document"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "force": True,  # Force markdown conversion
    }
)

state = {"url": "https://example.com"}
updated_state = fetch_node.execute(state)
# Content is automatically converted to markdown

Supported Input Types

Input Type   Description           Example
url          Web page URL          https://example.com
local_dir    HTML string           <html>...</html>
pdf          PDF file path         /path/to/file.pdf
csv          CSV file path         /path/to/data.csv
json         JSON file path        /path/to/data.json
xml          XML file path         /path/to/data.xml
md           Markdown file path    /path/to/README.md
pdf_dir      PDF directory         /path/to/pdfs/
csv_dir      CSV directory         /path/to/csvs/
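
A hedged sketch of how type-to-loader dispatch can work; the loader functions here are illustrative placeholders (the real node delegates to LangChain document loaders), but the unsupported-type check mirrors the documented behavior:

```python
import json

# Illustrative dispatch from input type to a loader function; the real
# node delegates to LangChain document loaders instead of raw reads.
def load_text(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def load_json(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return json.dumps(json.load(f))

LOADERS = {"csv": load_text, "xml": load_text, "md": load_text, "json": load_json}

def handle_file(input_type: str, source: str) -> str:
    if input_type not in LOADERS:
        raise ValueError(f"unsupported input type: {input_type}")
    return LOADERS[input_type](source)
```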

Error Handling

The FetchNode reports failures in the following scenarios:
  • Invalid input type: Raises ValueError if the input type is not supported
  • Empty content: Raises ValueError if fetched content is empty or contains only whitespace
  • Timeout: Raises TimeoutError if operation exceeds configured timeout
  • HTTP errors: Logs warning if HTTP request fails (status code != 200)
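
The cases above suggest a defensive call pattern. A minimal sketch that works with any object exposing execute(state):

```python
import logging

logger = logging.getLogger("fetch")

# Defensive wrapper around node.execute covering the failure modes
# listed above; returns None instead of propagating fetch errors.
def safe_execute(node, state: dict):
    try:
        return node.execute(state)
    except ValueError as exc:   # unsupported input type or empty content
        logger.error("fetch failed: %s", exc)
    except TimeoutError:        # exceeded the configured timeout
        logger.error("fetch timed out for keys %s", list(state))
    return None
```

Returning None lets the calling pipeline decide whether to retry, skip, or abort.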

Best Practices

  1. Set appropriate timeouts - Use longer timeouts for large files or slow websites
  2. Use headless mode in production - Set headless=True for better performance
  3. Configure loader_kwargs - Fine-tune ChromiumLoader behavior for dynamic content
  4. Handle timeouts gracefully - Wrap node execution in try-except blocks
  5. Use BrowserBase for complex sites - Better handling of JavaScript-heavy pages
  6. Enable markdown conversion - Improves LLM processing with cleaner text format
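
Practices 1 and 4 can be combined in a small retry helper; the attempt count and delays below are illustrative defaults, not library settings:

```python
import time

# Retry transient timeouts with exponential backoff before giving up.
def execute_with_retry(node, state: dict, attempts: int = 3, base_delay: float = 1.0) -> dict:
    for attempt in range(attempts):
        try:
            return node.execute(state)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```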
