## Overview
XMLScraperGraph is a scraping pipeline that extracts information from XML files using natural language queries. It allows you to query XML data without writing XPath or complex XML parsing code.
## Class Signature

```python
class XMLScraperGraph(AbstractGraph):
    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None,
    )
```
## Constructor Parameters

- `prompt` (str, required): The natural language query to extract information from the XML file.
- `source` (str, required): The source XML file or directory. Can be:
  - Path to a single XML file (e.g., `"data.xml"`)
  - Path to a directory containing multiple XML files (e.g., `"./xml_data/"`)
- `config` (dict, required): Configuration parameters for the graph. Must include:
  - `llm`: LLM configuration (e.g., `{"model": "openai/gpt-4o"}`)

  Optional parameters:
  - `verbose` (bool): Enable detailed logging
  - `headless` (bool): Run in headless mode
  - `additional_info` (str): Extra context for the LLM
- `schema` (`Type[BaseModel]`, optional, default `None`): Pydantic model defining the expected output structure.
Attributes
The XML file path or directory path.
Configuration dictionary for the graph.
Optional output schema for structured data extraction.
The configured language model instance.
Either “xml” (single file) or “xml_dir” (directory) based on the source.
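The `input_key` value suggests the constructor inspects the source path to decide between single-file and directory mode. A minimal sketch of how such a switch could work (an assumption for illustration, not the library's actual code):

```python
def infer_input_key(source: str) -> str:
    """Guess how a source should be treated: a single .xml file or a directory of them."""
    return "xml" if source.endswith(".xml") else "xml_dir"

print(infer_input_key("data/chioggia.xml"))  # "xml"
print(infer_input_key("./xml_data/"))        # "xml_dir"
```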
## Methods

### run()

Executes the XML querying process and returns the answer.

**Returns:** The extracted information from the XML file(s), or `"No answer found."` if extraction fails.
Basic Usage
from scrapegraphai.graphs import XMLScraperGraph
graph_config = {
"llm": {
"model": "openai/gpt-4o",
"api_key": "your-api-key"
}
}
xml_scraper = XMLScraperGraph(
prompt="List all the attractions in Chioggia.",
source="data/chioggia.xml",
config=graph_config
)
result = xml_scraper.run()
print(result)
## Example XML File

```xml
<?xml version="1.0" encoding="UTF-8"?>
<city>
  <name>Chioggia</name>
  <country>Italy</country>
  <population>50000</population>
  <attractions>
    <attraction id="1">
      <name>Piazza Vigo</name>
      <type>square</type>
      <description>Main square with historic buildings</description>
      <rating>4.5</rating>
      <hours>
        <open>24/7</open>
      </hours>
    </attraction>
    <attraction id="2">
      <name>Museo della Laguna Sud</name>
      <type>museum</type>
      <description>Museum showcasing local history</description>
      <rating>4.2</rating>
      <hours>
        <open>09:00</open>
        <close>18:00</close>
      </hours>
    </attraction>
    <attraction id="3">
      <name>Cathedral of Santa Maria</name>
      <type>church</type>
      <description>Historic cathedral from the 11th century</description>
      <rating>4.7</rating>
    </attraction>
  </attractions>
  <restaurants>
    <restaurant id="1">
      <name>Ristorante El Gato</name>
      <cuisine>Italian</cuisine>
      <price_range>$$</price_range>
      <rating>4.6</rating>
    </restaurant>
  </restaurants>
</city>
```
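For contrast, here is a minimal `xml.etree.ElementTree` sketch (run against an abridged copy of the file above) of the hand-written parsing that natural-language queries replace:

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example XML file above
xml_data = """<?xml version="1.0" encoding="UTF-8"?>
<city>
  <name>Chioggia</name>
  <attractions>
    <attraction id="1"><name>Piazza Vigo</name><rating>4.5</rating></attraction>
    <attraction id="2"><name>Museo della Laguna Sud</name><rating>4.2</rating></attraction>
    <attraction id="3"><name>Cathedral of Santa Maria</name><rating>4.7</rating></attraction>
  </attractions>
</city>"""

root = ET.fromstring(xml_data)

# "List all the attractions" done by hand
names = [a.findtext("name") for a in root.iter("attraction")]

# "List all attractions with a rating above 4.5" done by hand
top_rated = [a.findtext("name") for a in root.iter("attraction")
             if float(a.findtext("rating")) > 4.5]

print(names)      # all attraction names
print(top_rated)  # only the Cathedral qualifies (4.7 > 4.5)
```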
## Query Examples

### Simple Value Extraction

```python
xml_scraper = XMLScraperGraph(
    prompt="What is the population of the city?",
    source="data/chioggia.xml",
    config=graph_config
)

result = xml_scraper.run()
# Example output: "The population is 50,000"
```
### Nested Element Access

```python
xml_scraper = XMLScraperGraph(
    prompt="List all museum attractions with their opening hours",
    source="data/chioggia.xml",
    config=graph_config
)

result = xml_scraper.run()
```

### Attribute Access

```python
xml_scraper = XMLScraperGraph(
    prompt="Extract all attraction IDs and names",
    source="data/chioggia.xml",
    config=graph_config
)

result = xml_scraper.run()
```
### Filtering

```python
xml_scraper = XMLScraperGraph(
    prompt="List all attractions with a rating above 4.5",
    source="data/chioggia.xml",
    config=graph_config
)

result = xml_scraper.run()
```
## Structured Output with Schema

```python
from pydantic import BaseModel, Field
from typing import List

class Attraction(BaseModel):
    name: str = Field(description="Attraction name")
    type: str = Field(description="Type of attraction")
    rating: float = Field(description="Rating score")
    description: str = Field(description="Description")

class AttractionList(BaseModel):
    attractions: List[Attraction]
    count: int = Field(description="Total number of attractions")
    average_rating: float = Field(description="Average rating")

xml_scraper = XMLScraperGraph(
    prompt="Extract all attractions with their details and calculate average rating",
    source="data/chioggia.xml",
    config=graph_config,
    schema=AttractionList
)

result = xml_scraper.run()
# Result is automatically validated against the schema
```
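Schema enforcement amounts to Pydantic validation of the LLM's structured answer. A simplified sketch of the kind of check involved (trimmed field set; not the graph's internal code):

```python
from pydantic import BaseModel, ValidationError
from typing import List

class Attraction(BaseModel):
    name: str
    type: str
    rating: float

class AttractionList(BaseModel):
    attractions: List[Attraction]
    count: int
    average_rating: float

# A well-formed answer validates cleanly and yields typed fields
data = {
    "attractions": [
        {"name": "Piazza Vigo", "type": "square", "rating": 4.5},
        {"name": "Cathedral of Santa Maria", "type": "church", "rating": 4.7},
    ],
    "count": 2,
    "average_rating": 4.6,
}
validated = AttractionList(**data)

# A malformed answer (wrong types, missing fields) raises ValidationError
try:
    AttractionList(attractions=[], count="oops")
    raised = False
except ValidationError:
    raised = True
```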
## Complex XML Structures

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Tech News</title>
    <link>https://example.com</link>
    <description>Latest technology news</description>
    <item>
      <title>AI Breakthrough Announced</title>
      <link>https://example.com/ai-news</link>
      <pubDate>Mon, 06 Mar 2026 10:00:00 GMT</pubDate>
      <description>Scientists achieve major AI milestone</description>
      <category>Artificial Intelligence</category>
    </item>
    <item>
      <title>New Programming Language Released</title>
      <link>https://example.com/lang-news</link>
      <pubDate>Sun, 05 Mar 2026 15:30:00 GMT</pubDate>
      <description>Developer community welcomes innovation</description>
      <category>Software Development</category>
    </item>
  </channel>
</rss>
```
```python
xml_scraper = XMLScraperGraph(
    prompt="Summarize the latest tech news from the RSS feed",
    source="data/rss_feed.xml",
    config=graph_config
)

result = xml_scraper.run()
```
## Configuration Files

```xml
<?xml version="1.0"?>
<configuration>
  <database>
    <host>localhost</host>
    <port>5432</port>
    <name>myapp_db</name>
    <credentials>
      <username>admin</username>
    </credentials>
  </database>
  <features>
    <feature name="analytics" enabled="true">
      <setting key="tracking_id">UA-12345</setting>
    </feature>
    <feature name="payments" enabled="false"/>
  </features>
</configuration>
```
```python
xml_scraper = XMLScraperGraph(
    prompt="List all enabled features with their settings",
    source="config/app_config.xml",
    config=graph_config
)

result = xml_scraper.run()
```
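Note that the enabled/disabled state here lives in XML *attributes* rather than element text. To make concrete what the prompt is asking for, a stdlib sketch of the same extraction done by hand:

```python
import xml.etree.ElementTree as ET

# Abridged copy of the config file above
config_xml = """<?xml version="1.0"?>
<configuration>
  <features>
    <feature name="analytics" enabled="true">
      <setting key="tracking_id">UA-12345</setting>
    </feature>
    <feature name="payments" enabled="false"/>
  </features>
</configuration>"""

root = ET.fromstring(config_xml)

# Keep only features whose "enabled" attribute is "true", with their settings
enabled = {
    f.get("name"): {s.get("key"): s.text for s in f.findall("setting")}
    for f in root.iter("feature")
    if f.get("enabled") == "true"
}
print(enabled)
```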
Multiple XML Files
# Directory structure:
# xml_data/
# ├── products_2023.xml
# ├── products_2024.xml
# └── products_2025.xml
xml_scraper = XMLScraperGraph(
prompt="Count total products across all years",
source="./xml_data/", # Directory path
config=graph_config
)
result = xml_scraper.run()
# Automatically processes all XML files in the directory
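In directory mode the graph must enumerate the XML files itself. A plausible sketch of that enumeration step (an assumption about the internals, demonstrated against a throwaway directory):

```python
from pathlib import Path
import tempfile

def list_xml_files(directory: str) -> list:
    """Collect every .xml file in a directory, in a stable sorted order."""
    return sorted(str(p) for p in Path(directory).glob("*.xml"))

# Demo: recreate the directory layout from the comment above
with tempfile.TemporaryDirectory() as d:
    for year in (2023, 2024, 2025):
        (Path(d) / f"products_{year}.xml").write_text("<products/>")
    files = list_xml_files(d)
    print([Path(f).name for f in files])
```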
## Graph Workflow

The XMLScraperGraph uses a simple two-node pipeline:

```
FetchNode → GenerateAnswerNode
```

- **FetchNode**: Loads and parses the XML file(s)
- **GenerateAnswerNode**: Processes the XML data and answers the query
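Conceptually, the two nodes pass a shared state dictionary along the pipeline. A toy sketch of that flow (hypothetical function names, LLM call stubbed out; not the library's node implementation):

```python
def fetch_node(state: dict) -> dict:
    """Toy FetchNode: load the XML source into the shared state."""
    # The real node reads and parses the file; here we just record a placeholder.
    state["doc"] = f"<contents of {state['source']}>"
    return state

def generate_answer_node(state: dict, llm) -> dict:
    """Toy GenerateAnswerNode: ask the LLM about the fetched document."""
    state["answer"] = llm(f"{state['prompt']}\n\n{state['doc']}")
    return state

# Stub LLM so the sketch runs without an API key
fake_llm = lambda prompt: "stub answer"

state = {"prompt": "What is the population?", "source": "data/chioggia.xml"}
state = generate_answer_node(fetch_node(state), fake_llm)
print(state["answer"])
```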
## Advanced Usage

### With Additional Context

```python
config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": """
    This XML contains product catalog data.
    Prices are in EUR. Focus on in-stock items only.
    """
}

xml_scraper = XMLScraperGraph(
    prompt="List available products under 100 EUR",
    source="data/catalog.xml",
    config=config
)
```
### SOAP Response Analysis

```xml
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetProductResponse>
      <Product>
        <ID>12345</ID>
        <Name>Widget</Name>
        <Price currency="USD">29.99</Price>
        <Stock>150</Stock>
      </Product>
    </GetProductResponse>
  </soap:Body>
</soap:Envelope>
```

```python
xml_scraper = XMLScraperGraph(
    prompt="Extract product information from the SOAP response",
    source="data/soap_response.xml",
    config=graph_config
)

result = xml_scraper.run()
```
## Use Cases

- **RSS Feed Analysis**: Parse and analyze RSS feeds
- **Configuration Management**: Extract information from XML config files
- **SOAP API Responses**: Parse SOAP web service responses
- **Data Migration**: Extract data from XML exports
- **Document Processing**: Query XML documents
## Complete Example: RSS Feed Analysis

```python
from pydantic import BaseModel, Field
from typing import List

class NewsItem(BaseModel):
    title: str
    link: str
    published: str
    category: str
    summary: str

class NewsSummary(BaseModel):
    articles: List[NewsItem]
    total_count: int
    categories: List[str]
    summary: str = Field(description="Overall summary of the news")

config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": "Focus on technology and AI related news"
}

xml_scraper = XMLScraperGraph(
    prompt="""Extract all news items and provide:
    1. List of articles with details
    2. Total count
    3. All categories
    4. Overall summary of the news
    """,
    source="data/rss_feed.xml",
    config=config,
    schema=NewsSummary
)

result = xml_scraper.run()
```
## Accessing Results

```python
result = xml_scraper.run()

# Get the answer
print("Answer:", result)

# Access full state
final_state = xml_scraper.get_state()
raw_data = final_state.get("doc")
answer = final_state.get("answer")
print(f"Processed XML data size: {len(str(raw_data))} characters")

# Execution info
exec_info = xml_scraper.get_execution_info()
for node_info in exec_info:
    print(f"{node_info['node_name']}: {node_info['exec_time']:.2f}s")
    print(f"Tokens: {node_info['total_tokens']}")
    print(f"Cost: ${node_info['total_cost_USD']:.4f}")
```
## Working with Namespaces

```xml
<?xml version="1.0"?>
<root xmlns:app="http://example.com/app">
  <app:products>
    <app:product id="1">
      <app:name>Widget</app:name>
      <app:price>29.99</app:price>
    </app:product>
  </app:products>
</root>
```

```python
config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": "The XML uses namespaces. Focus on product elements."
}

xml_scraper = XMLScraperGraph(
    prompt="Extract all product names and prices",
    source="data/namespaced.xml",
    config=config
)
```
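For reference, this is what namespaces cost in hand-written code: every lookup needs a prefix-to-URI map, which is part of why a natural-language query is convenient here. A stdlib sketch against the document above:

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<root xmlns:app="http://example.com/app">
  <app:products>
    <app:product id="1">
      <app:name>Widget</app:name>
      <app:price>29.99</app:price>
    </app:product>
  </app:products>
</root>"""

# ElementTree requires an explicit prefix map for every namespaced query
ns = {"app": "http://example.com/app"}

root = ET.fromstring(doc)
products = [
    (p.findtext("app:name", namespaces=ns),
     float(p.findtext("app:price", namespaces=ns)))
    for p in root.findall(".//app:product", ns)
]
print(products)
```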
## Error Handling

```python
import xml.etree.ElementTree as ET

try:
    # Validate XML first
    ET.parse("data/file.xml")

    result = xml_scraper.run()
    if result == "No answer found.":
        print("Failed to extract information from XML")
    else:
        print(f"Success: {result}")
except ET.ParseError as e:
    print(f"Invalid XML format: {e}")
except FileNotFoundError:
    print("XML file not found")
except Exception as e:
    print(f"Error during processing: {e}")
```
## Tips for Better Results

- **Understand structure**: Know your XML structure before querying
- **Be specific**: Clear queries get better answers
- **Use schema**: Define schemas for type-safe output
- **Handle namespaces**: Mention in the prompt if the XML uses namespaces
- **Provide context**: Use `additional_info` for domain knowledge
- **Test queries**: Start simple and iterate
- **Validate XML**: Ensure the XML is well-formed
## XML-Specific Considerations

| Aspect | Consideration |
|---|---|
| Attributes vs Elements | The LLM can handle both |
| Namespaces | Provide context in the prompt |
| CDATA | Handled automatically |
| Comments | Usually ignored |
| Deep Nesting | May increase token usage |
## Performance Considerations

- **File Size**: Large XML files may exceed LLM context limits
- **Nesting Depth**: Deeply nested structures consume more tokens
- **Multiple Files**: Processing multiple files increases execution time
- **Namespaces**: Complex namespaces may require more context
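Before sending a large file through the graph, a quick sanity check on size and nesting depth can help anticipate context-limit problems. A rough heuristic sketch (the "one token per ~4 characters" figure is a common rule of thumb, not a guarantee):

```python
import xml.etree.ElementTree as ET

def xml_stats(xml_text: str) -> dict:
    """Rough size/depth statistics to gauge whether a file fits an LLM context."""
    def depth(elem, level=1):
        # Depth of the deepest element under (and including) elem
        return max([depth(c, level + 1) for c in elem], default=level)

    root = ET.fromstring(xml_text)
    return {
        "chars": len(xml_text),
        "approx_tokens": len(xml_text) // 4,  # crude rule of thumb
        "elements": sum(1 for _ in root.iter()),
        "max_depth": depth(root),
    }

sample = "<a><b><c>leaf</c></b><b/></a>"
stats = xml_stats(sample)
print(stats)
```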