Overview
The `ParseNode` is responsible for parsing HTML content from documents, converting it to text, extracting URLs (links and images), and splitting the content into chunks for further processing.
Class Signature
`scrapegraphai/nodes/parse_node.py:17`
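The constructor shape below is a sketch reconstructed from the parameter descriptions in this page; the illustrative class here is not the real implementation (which subclasses `BaseNode`), so verify the signature against `parse_node.py` in your installed version.

```python
from typing import List, Optional

class ParseNode:  # illustration only; the real class subclasses BaseNode
    def __init__(self, input: str, output: List[str],
                 node_config: Optional[dict] = None,
                 node_name: str = "ParseNode"):
        self.input = input                    # boolean expression over state keys
        self.output = output                  # state keys this node writes
        self.node_config = node_config or {}  # parse_html, parse_urls, chunk_size, ...
        self.node_name = node_name            # unique identifier for the node
```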
Parameters

input
Boolean expression defining the input keys needed from the state. Common patterns:

- `"document"` - Only document parsing
- `"document & url"` - Document parsing with URL extraction

output
List of output keys to be updated in the state. Examples:

- `["parsed_doc"]` - Only parsed chunks
- `["parsed_doc", "link_urls", "img_urls"]` - Parsed content with URLs

node_config
Configuration dictionary with the following options:

- `parse_html` - Whether to convert HTML to text before chunking
- `parse_urls` - Whether to extract link and image URLs
- `chunk_size` - Maximum size of each text chunk
- `verbose` - Enables logging of chunk processing progress

node_name
The unique identifier name for the node
State Keys

Input State

- `document` - List of LangChain Document objects containing HTML or text content
- `url` - Source URL (required only if `parse_urls=True`, for resolving relative URLs)

Output State

- `parsed_doc` - List of text chunks after parsing and splitting the document
- `link_urls` - List of extracted link URLs (only if `parse_urls=True`)
- `img_urls` - List of extracted image URLs (only if `parse_urls=True`)

Methods
execute(state: dict) -> dict

Executes the node’s logic to parse the HTML document content and split it into chunks.

`scrapegraphai/nodes/parse_node.py:62`

Returns: Updated state dictionary with parsed chunks and optionally extracted URLs
_extract_urls(text: str, source: str) -> Tuple[List[str], List[str]]

Extracts URLs from the given text, separating links and images.

`scrapegraphai/nodes/parse_node.py:131`
_clean_urls(urls: List[str]) -> List[str]

Cleans and normalizes extracted URLs by removing markdown artifacts.

`scrapegraphai/nodes/parse_node.py:179`
_is_valid_url(url: str) -> bool

Static method to check if a URL format is valid.

`scrapegraphai/nodes/parse_node.py:206`
Usage Examples
Basic HTML Parsing
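A minimal sketch; the import path and the `docs` variable are assumptions, and the node construction is shown commented out for illustration:

```python
# Assumed import path; verify against your scrapegraphai version.
# from scrapegraphai.nodes import ParseNode

node_config = {
    "parse_html": True,  # convert the HTML document to plain text before chunking
}

# parse_node = ParseNode(
#     input="document",
#     output=["parsed_doc"],
#     node_config=node_config,
#     node_name="Parse",
# )
# state = parse_node.execute({"document": docs})  # docs: list of LangChain Documents
# state["parsed_doc"] now holds the list of text chunks
```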
Parse with URL Extraction
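A sketch of enabling URL extraction alongside parsing; import path and `docs` are assumptions:

```python
# from scrapegraphai.nodes import ParseNode  # assumed import path

node_config = {
    "parse_html": True,
    "parse_urls": True,  # also extract link and image URLs
}

# parse_node = ParseNode(
#     input="document & url",  # the source url is needed to resolve relative links
#     output=["parsed_doc", "link_urls", "img_urls"],
#     node_config=node_config,
# )
# state = parse_node.execute({"document": docs, "url": "https://example.com"})
# state["link_urls"] and state["img_urls"] hold the extracted URLs
```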
Parse Markdown Content
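For markdown input, HTML conversion is skipped so the markdown structure survives; a sketch under the same assumptions as above:

```python
# from scrapegraphai.nodes import ParseNode  # assumed import path

node_config = {
    "parse_html": False,  # skip Html2TextTransformer; keep markdown structure intact
}

# parse_node = ParseNode(input="document", output=["parsed_doc"], node_config=node_config)
# state = parse_node.execute({"document": markdown_docs})
```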
Custom Chunk Size
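A sketch of overriding the chunk size, for example to fit a smaller model context window (the value here is illustrative):

```python
# from scrapegraphai.nodes import ParseNode  # assumed import path

node_config = {
    "parse_html": True,
    "chunk_size": 2048,  # illustrative value; match your LLM's context window
}

# parse_node = ParseNode(input="document", output=["parsed_doc"], node_config=node_config)
```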
Extract Only Links
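To keep only link URLs, a sketch that simply omits `img_urls` from the output keys (same assumed import path as above):

```python
# from scrapegraphai.nodes import ParseNode  # assumed import path

node_config = {"parse_html": True, "parse_urls": True}

# parse_node = ParseNode(
#     input="document & url",
#     output=["parsed_doc", "link_urls"],  # omit img_urls to keep only links
#     node_config=node_config,
# )
```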
Parsing Process
The ParseNode follows this process:-
HTML to Text Conversion (if
parse_html=True)- Uses
Html2TextTransformerto convert HTML to plain text - Preserves link structure for URL extraction
- Uses
-
URL Extraction (if
parse_urls=True)- Extracts absolute and relative URLs using regex patterns
- Resolves relative URLs against the source URL
- Separates links from images based on file extensions
- Cleans URLs by removing markdown artifacts
-
Text Chunking
- Splits text into chunks based on
chunk_size - Adjusts chunk size for HTML vs non-HTML content
- Ensures chunks don’t exceed token limits
- Splits text into chunks based on
-
State Updates
- Updates state with parsed chunks
- Optionally adds extracted URLs
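The steps above can be sketched as one simplified, self-contained function. This mirrors the flow, not the library's implementation: the tag-stripping regex stands in for `Html2TextTransformer`, and the greedy character-count splitter stands in for token-aware chunking.

```python
import re
from urllib.parse import urljoin

def parse_document(html_text: str, source: str, chunk_size: int = 64,
                   parse_urls: bool = True) -> dict:
    """Simplified sketch of the ParseNode flow: HTML-to-text,
    URL extraction, chunking, and state updates."""
    # 1. HTML to text conversion (stand-in for Html2TextTransformer)
    text = re.sub(r"<[^>]+>", " ", html_text)
    state = {}
    # 2. URL extraction: pull href targets and resolve relative URLs
    if parse_urls:
        urls = re.findall(r'href="([^"]+)"', html_text)
        state["link_urls"] = [urljoin(source, u) for u in urls]
    # 3. Text chunking: greedy word-based split (stand-in for token-aware splitting)
    chunks, cur = [], ""
    for word in text.split():
        cur = f"{cur} {word}".strip()
        if len(cur) >= chunk_size:
            chunks.append(cur)
            cur = ""
    if cur:
        chunks.append(cur)
    # 4. State updates
    state["parsed_doc"] = chunks
    return state

state = parse_document('<p>Hello <a href="/docs">docs</a></p>', "https://example.com")
# state["link_urls"] -> ['https://example.com/docs']
# state["parsed_doc"] -> ['Hello docs']
```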
URL Extraction Details
Supported URL Patterns
The node recognizes these URL patterns:

- Absolute URLs: `http://example.com`, `https://example.com`
- Relative URLs: `/path/to/page`, `../relative/path`
- Protocol-relative URLs: `//example.com/path`
- URLs in markdown: `[text](url)`, `![alt](url)`
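Illustrative regular expressions for these patterns (these are assumptions for demonstration, not the library's actual expressions):

```python
import re

# Hypothetical patterns, one per case above; the real regexes may differ.
ABSOLUTE = re.compile(r"https?://[^\s)\"']+")
PROTOCOL_RELATIVE = re.compile(r"(?<!:)//[^\s)\"']+")   # '//host' not preceded by ':'
MARKDOWN_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")  # [text](url)

assert ABSOLUTE.search("see https://example.com/page")
assert PROTOCOL_RELATIVE.search("src=//example.com/path")
m = MARKDOWN_LINK.search("[text](http://example.com)")
# m.group(2) -> 'http://example.com'
```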
Image Detection
Images are identified by common image file extensions (e.g. `.png`, `.jpg`, `.jpeg`, `.gif`, `.svg`, `.webp`).

URL Cleaning

Extracted URLs are cleaned by:

- Removing markdown syntax artifacts (`[`, `]`, `(`, `)`)
- Stripping trailing punctuation (`.`, `,`, `)`)
- Resolving relative paths to absolute URLs
- Filtering invalid URL formats
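A simplified, per-URL analogue of these cleaning steps (a sketch under stated assumptions, not the library's `_clean_urls`):

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

def clean_url(url: str, source: Optional[str] = None) -> Optional[str]:
    """Sketch of the cleaning steps above for a single URL."""
    url = url.strip("[]()")         # remove markdown syntax artifacts
    url = url.rstrip(".,)")         # strip trailing punctuation
    if source is not None:
        url = urljoin(source, url)  # resolve relative paths to absolute URLs
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None                 # filter invalid URL formats
    return url

# clean_url("(https://example.com/page).")       -> 'https://example.com/page'
# clean_url("/guide", "https://example.com")     -> 'https://example.com/guide'
# clean_url("not a url")                         -> None
```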
Chunking Strategy
The node adjusts the effective chunk size based on content type, applying different reductions for HTML content (when `parse_html=True`) and non-HTML content (when `parse_html=False`) so that chunks stay within token limits.
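The general idea can be sketched as reserving headroom below the configured chunk size; the margin values here are assumptions for illustration, not the library's actual numbers.

```python
def effective_chunk_size(chunk_size: int, parse_html: bool) -> int:
    """Hypothetical headroom adjustment; the margins (250 / 500)
    are assumed values, not the library's actual constants."""
    margin = 250 if parse_html else 500
    return max(chunk_size - margin, 1)

# effective_chunk_size(4096, True)  -> 3846
# effective_chunk_size(4096, False) -> 3596
```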
Error Handling
The ParseNode handles errors gracefully:

- Missing input keys: Raises `KeyError` with a descriptive message
- URL extraction failures: Returns empty URL lists instead of failing
- Invalid documents: Processes only the first document if a list is provided
Best Practices
- Enable URL extraction only when needed: saves processing time
- Adjust `chunk_size` for your LLM: consider the model's context window
- Provide a source URL for relative links: ensures correct URL resolution
- Use `parse_html=True` for HTML content: better text extraction
- Set `parse_html=False` for markdown: preserves markdown structure
- Enable verbose mode for debugging: see chunk processing progress
Performance Considerations
- URL extraction adds ~10-20% processing time
- Chunk size affects memory usage (smaller = more chunks = more memory)
- HTML parsing is faster than URL extraction
- Large documents benefit from parallel chunk processing in downstream nodes
Related Nodes
- FetchNode - Fetch content before parsing
- GenerateAnswerNode - Process parsed chunks
- SearchNode - Search using extracted URLs
