Overview

The ParseNode is responsible for parsing HTML content from documents, converting it to text, extracting URLs (links and images), and splitting the content into chunks for further processing.

Class Signature

class ParseNode(BaseNode):
    def __init__(
        self,
        input: str,
        output: List[str],
        node_config: Optional[dict] = None,
        node_name: str = "ParseNode",
    )
Source: scrapegraphai/nodes/parse_node.py:17

Parameters

input
str
required
Boolean expression defining the input keys needed from the state. Common patterns:
  • "document" - Only document parsing
  • "document & url" - Document parsing with URL extraction
output
List[str]
required
List of output keys to be updated in the state. Examples:
  • ["parsed_doc"] - Only parsed chunks
  • ["parsed_doc", "link_urls", "img_urls"] - Parsed content with URLs
node_config
dict
Configuration dictionary. The options used throughout this page are:
  • llm_model - LLM instance used for token-aware chunking
  • chunk_size - Maximum size of each text chunk
  • parse_html - Whether to convert HTML to plain text before chunking
  • parse_urls - Whether to extract link and image URLs
  • verbose - Whether to log processing progress
node_name
str
default:"ParseNode"
The unique identifier name for the node

State Keys

Input State

document
List[Document]
List of LangChain Document objects containing HTML or text content
url
str
Source URL (required only if parse_urls=True for resolving relative URLs)

Output State

parsed_doc
List[str]
List of text chunks after parsing and splitting the document
link_urls
List[str]
List of extracted link URLs (only if parse_urls=True)
img_urls
List[str]
List of extracted image URLs (only if parse_urls=True)

Methods

execute(state: dict) -> dict

Executes the node’s logic to parse the HTML document content and split it into chunks.
def execute(self, state: dict) -> dict:
    """
    Executes the node's logic to parse the HTML document content and split it into chunks.
    
    Args:
        state (dict): The current state of the graph.
    
    Returns:
        dict: The updated state with the output key containing the parsed content chunks.
    """
Source: scrapegraphai/nodes/parse_node.py:62

Returns: Updated state dictionary with parsed chunks and optionally extracted URLs

_extract_urls(text: str, source: str) -> Tuple[List[str], List[str]]

Extracts URLs from the given text, separating links and images.
def _extract_urls(self, text: str, source: str) -> Tuple[List[str], List[str]]:
    """
    Extracts URLs from the given text.
    
    Args:
        text (str): The text to extract URLs from.
        source (str): Base URL for resolving relative URLs.
    
    Returns:
        Tuple[List[str], List[str]]: A tuple containing link URLs and image URLs.
    """
Source: scrapegraphai/nodes/parse_node.py:131

_clean_urls(urls: List[str]) -> List[str]

Cleans and normalizes extracted URLs by removing markdown artifacts.
def _clean_urls(self, urls: List[str]) -> List[str]:
    """Cleans the URLs extracted from the text."""
Source: scrapegraphai/nodes/parse_node.py:179

_is_valid_url(url: str) -> bool

Static method to check if a URL format is valid.
@staticmethod
def _is_valid_url(url: str) -> bool:
    """Checks if the URL format is valid."""
Source: scrapegraphai/nodes/parse_node.py:206
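The source excerpt shows only the docstring, not the implementation. A plausible minimal sketch, assuming a URL counts as valid when it parses with both a scheme and a network location:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    # Treat a URL as valid when it parses with both a scheme
    # (e.g. "https") and a network location (the domain).
    try:
        parsed = urlparse(url)
        return bool(parsed.scheme) and bool(parsed.netloc)
    except ValueError:
        return False

print(is_valid_url("https://example.com/page"))  # → True
print(is_valid_url("not a url"))                 # → False
```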

Usage Examples

Basic HTML Parsing

from scrapegraphai.nodes import ParseNode
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI

# Create parse node
parse_node = ParseNode(
    input="document",
    output=["parsed_doc"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "chunk_size": 4096,
        "parse_html": True,
        "verbose": False
    }
)

# Execute node
state = {
    "document": [Document(page_content="<html>...</html>")]
}
updated_state = parse_node.execute(state)

print(updated_state["parsed_doc"])
# Output: ["chunk1", "chunk2", "chunk3"]

Parse with URL Extraction

parse_node = ParseNode(
    input="document & url",
    output=["parsed_doc", "link_urls", "img_urls"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "chunk_size": 4096,
        "parse_html": True,
        "parse_urls": True,  # Enable URL extraction
        "verbose": True
    }
)

state = {
    "document": [Document(page_content="<html>...</html>")],
    "url": "https://example.com"
}
updated_state = parse_node.execute(state)

print("Parsed chunks:", len(updated_state["parsed_doc"]))
print("Links found:", updated_state["link_urls"])
print("Images found:", updated_state["img_urls"])

Parse Markdown Content

parse_node = ParseNode(
    input="document",
    output=["parsed_doc"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "chunk_size": 2048,
        "parse_html": False,  # Skip HTML parsing for markdown
        "verbose": False
    }
)

state = {
    "document": [Document(page_content="# Title\n\nContent...")]
}
updated_state = parse_node.execute(state)

Custom Chunk Size

parse_node = ParseNode(
    input="document",
    output=["parsed_doc"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "chunk_size": 8192,  # Larger chunks for long-context models
        "parse_html": True,
        "verbose": True
    }
)

state = {
    "document": [Document(page_content="<html>...</html>")]
}
updated_state = parse_node.execute(state)
Filter Extracted Links

parse_node = ParseNode(
    input="document & url",
    output=["parsed_doc", "link_urls", "img_urls"],
    node_config={
        "llm_model": ChatOpenAI(model="gpt-4"),
        "chunk_size": 4096,
        "parse_html": True,
        "parse_urls": True
    }
)

state = {
    "document": [Document(page_content="...")],
    "url": "https://example.com/page"
}
updated_state = parse_node.execute(state)

# Filter only HTTP/HTTPS links
http_links = [url for url in updated_state["link_urls"] 
              if url.startswith("http")]
print(f"Found {len(http_links)} HTTP links")

Parsing Process

The ParseNode follows this process:
  1. HTML to Text Conversion (if parse_html=True)
    • Uses Html2TextTransformer to convert HTML to plain text
    • Preserves link structure for URL extraction
  2. URL Extraction (if parse_urls=True)
    • Extracts absolute and relative URLs using regex patterns
    • Resolves relative URLs against the source URL
    • Separates links from images based on file extensions
    • Cleans URLs by removing markdown artifacts
  3. Text Chunking
    • Splits text into chunks based on chunk_size
    • Adjusts chunk size for HTML vs non-HTML content
    • Ensures chunks don’t exceed token limits
  4. State Updates
    • Updates state with parsed chunks
    • Optionally adds extracted URLs
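As an illustration of the chunking step, here is a naive character-based splitter; the actual node splits on token counts using the configured LLM's tokenizer, so this sketch only shows the packing idea:

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Pack whitespace-separated words into chunks whose character
    # length stays at or under chunk_size.
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(split_into_chunks("one two three four five", 9))
# → ['one two', 'three', 'four five']
```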

URL Extraction Details

Supported URL Patterns

The node recognizes these URL patterns:
  • Absolute URLs: http://example.com, https://example.com
  • Relative URLs: /path/to/page, ../relative/path
  • Protocol-relative URLs: //example.com/path
  • URLs in markdown: [text](url), ![alt](image.jpg)
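The markdown patterns above can be matched with regular expressions along these lines (a sketch; the library's actual regexes may differ):

```python
import re

text = "See [docs](https://example.com/docs) and ![logo](/img/logo.png)."

# Markdown images: ![alt](url) — note the leading "!"
img_pattern = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")
# Markdown links: [text](url), excluding image syntax via a
# negative lookbehind on "!"
link_pattern = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")

print(img_pattern.findall(text))   # → ['/img/logo.png']
print(link_pattern.findall(text))  # → ['https://example.com/docs']
```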

Image Detection

Images are identified by these file extensions:
image_extensions = [
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", 
    ".webp", ".ico", ".tiff"
]
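Separating links from images by extension can be sketched as follows, checking the URL path (case-insensitively) so query strings don't interfere:

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp",
                    ".svg", ".webp", ".ico", ".tiff")

def split_urls(urls: list[str]) -> tuple[list[str], list[str]]:
    # Classify each URL as an image or a link by its path extension.
    links, images = [], []
    for url in urls:
        path = urlparse(url).path.lower()
        (images if path.endswith(IMAGE_EXTENSIONS) else links).append(url)
    return links, images

links, images = split_urls([
    "https://example.com/page",
    "https://example.com/photo.JPG?size=large",
])
print(links)   # → ['https://example.com/page']
print(images)  # → ['https://example.com/photo.JPG?size=large']
```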

URL Cleaning

Extracted URLs are cleaned by:
  • Removing markdown syntax artifacts ([](), []())
  • Stripping trailing punctuation such as ".", "-", and ")"
  • Resolving relative paths to absolute URLs
  • Filtering invalid URL formats
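A minimal sketch of the cleaning steps above, using the standard library's urljoin for relative-path resolution (the node's actual _clean_urls may handle more cases):

```python
from urllib.parse import urljoin

def clean_url(url: str, base: str) -> str:
    # Strip stray markdown bracket artifacts and trailing
    # punctuation, then resolve relative paths against the base URL.
    url = url.strip().strip("[]()").rstrip(".-)")
    return urljoin(base, url)

print(clean_url("/docs/intro).", "https://example.com"))
# → https://example.com/docs/intro
```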

Chunking Strategy

The node adjusts chunk sizes based on content type:
  • HTML content (when parse_html=True):
    chunk_size = self.chunk_size - 250
    
  • Non-HTML content (when parse_html=False):
    chunk_size = min(chunk_size - 500, int(chunk_size * 0.8))
    
This ensures chunks don’t exceed LLM token limits after formatting.
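Putting the two adjustments together, the effective chunk size can be computed as:

```python
def effective_chunk_size(chunk_size: int, parse_html: bool) -> int:
    # Mirrors the adjustments above: reserve headroom for HTML
    # formatting, or shrink further for non-HTML content.
    if parse_html:
        return chunk_size - 250
    return min(chunk_size - 500, int(chunk_size * 0.8))

print(effective_chunk_size(4096, parse_html=True))   # → 3846
print(effective_chunk_size(4096, parse_html=False))  # → 3276
```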

Error Handling

The ParseNode handles errors gracefully:
  • Missing input keys: Raises KeyError with descriptive message
  • URL extraction failures: Returns empty URL lists instead of failing
  • Invalid documents: When a list of documents is provided, only the first document is processed

Best Practices

  1. Enable URL extraction only when needed - Saves processing time
  2. Adjust chunk_size for your LLM - Consider model context window
  3. Provide source URL for relative links - Ensures correct URL resolution
  4. Use parse_html=True for HTML content - Better text extraction
  5. Set parse_html=False for markdown - Preserves markdown structure
  6. Enable verbose mode for debugging - See chunk processing progress

Performance Considerations

  • URL extraction adds ~10-20% processing time
  • Chunk size affects memory usage (smaller = more chunks = more memory)
  • HTML parsing is faster than URL extraction
  • Large documents benefit from parallel chunk processing in downstream nodes

Build docs developers (and LLMs) love