Overview
The FetchNode is responsible for fetching content from various sources, including web pages, PDF files, CSV files, JSON files, XML files, and local HTML content. It acts as the starting point in many scraping workflows.
Class Signature
`scrapegraphai/nodes/fetch_node.py:20`
Parameters

- `input` (str): Boolean expression defining the input keys needed from the state. Supported input types:
  - `url`: Web page URL
  - `local_dir`: Local HTML content string
  - `pdf`: Path to PDF file
  - `csv`: Path to CSV file
  - `json`: Path to JSON file
  - `xml`: Path to XML file
  - `md`: Path to Markdown file
  - `*_dir`: Directory paths (`pdf_dir`, `csv_dir`, `json_dir`, `xml_dir`, `md_dir`)
- `output` (List[str]): List of output keys to be updated in the state, typically `["document"]` or `["doc"]`
- `node_config` (dict, optional): Configuration dictionary with options such as `headless`, `loader_kwargs`, and `timeout`
- `node_name` (str): The unique identifier name for the node
State Keys
Input State
- `url`: URL of the web page to fetch
- `local_dir`: Local HTML content as a string
- `pdf`: Path to PDF file
- `csv`: Path to CSV file
- `json`: Path to JSON file
- `xml`: Path to XML file
- `md`: Path to Markdown file
Output State
- `document`: List of LangChain Document objects containing the fetched content with metadata
- `doc`: Copy of the document list for compatibility
Methods
execute(state: dict) -> dict

Executes the node's logic to fetch content from the specified source and update the state. Defined at `scrapegraphai/nodes/fetch_node.py:90`.

Returns: Updated state dictionary with the fetched document

handle_web_source(state: dict, source: str) -> dict

Handles fetching content from web URLs using ChromiumLoader, BrowserBase, or ScrapeDo. Defined at `scrapegraphai/nodes/fetch_node.py:262`.

handle_file(state: dict, input_type: str, source: str) -> dict

Loads content from files based on their type (PDF, CSV, JSON, XML, MD). Defined at `scrapegraphai/nodes/fetch_node.py:142`.

handle_local_source(state: dict, source: str) -> dict

Handles local HTML content strings with optional markdown conversion. Defined at `scrapegraphai/nodes/fetch_node.py:219`.
Usage Examples
Basic Web Scraping
Fetch with Browser Automation
Fetch PDF Content
Fetch with BrowserBase
Fetch with Proxy (ScrapeDo)
Fetch CSV Data
Fetch with Markdown Conversion
Supported Input Types
| Input Type | Description | Example |
|---|---|---|
| `url` | Web page URL | `https://example.com` |
| `local_dir` | HTML string | `<html>...</html>` |
| `pdf` | PDF file path | `/path/to/file.pdf` |
| `csv` | CSV file path | `/path/to/data.csv` |
| `json` | JSON file path | `/path/to/data.json` |
| `xml` | XML file path | `/path/to/data.xml` |
| `md` | Markdown file path | `/path/to/README.md` |
| `pdf_dir` | PDF directory | `/path/to/pdfs/` |
| `csv_dir` | CSV directory | `/path/to/csvs/` |
Error Handling
The FetchNode raises errors in the following scenarios:

- Invalid input type: raises `ValueError` if the input type is not supported
- Empty content: raises `ValueError` if the fetched content is empty or contains only whitespace
- Timeout: raises `TimeoutError` if the operation exceeds the configured timeout
- HTTP errors: logs a warning if the HTTP request fails (status code != 200)
Best Practices
- Set appropriate timeouts: use longer timeouts for large files or slow websites
- Use headless mode in production: set `headless=True` for better performance
- Configure `loader_kwargs`: fine-tune ChromiumLoader behavior for dynamic content
- Handle timeouts gracefully: wrap node execution in try-except blocks
- Use BrowserBase for complex sites: better handling of JavaScript-heavy pages
- Enable markdown conversion: improves LLM processing with a cleaner text format
Related Nodes
- ParseNode - Parse and chunk the fetched content
- GenerateAnswerNode - Generate answers from fetched content
