## Overview

CSVScraperGraph is a scraping pipeline that extracts information from CSV files using natural language queries. It lets you query CSV data without writing SQL or pandas code.
## Class Signature

```python
class CSVScraperGraph(AbstractGraph):
    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None,
    )
```
## Constructor Parameters

- `prompt` (str): The natural language query to extract information from the CSV file.
- `source` (str): The source CSV file or directory. Can be:
  - Path to a single CSV file (e.g., `"data.csv"`)
  - Path to a directory containing multiple CSV files (e.g., `"./csv_data/"`)
- `config` (dict): Configuration parameters for the graph. Must include:
  - `llm`: LLM configuration (e.g., `{"model": "openai/gpt-4o"}`)

  Optional parameters:
  - `verbose` (bool): Enable detailed logging
  - `additional_info` (str): Extra context for the LLM
- `schema` (Optional[Type[BaseModel]], default `None`): Optional Pydantic model defining the expected output structure.
## Attributes

- `source`: The CSV file path or directory path.
- `config`: Configuration dictionary for the graph.
- `schema`: Optional output schema for structured data extraction.
- `llm_model`: The configured language model instance.
- `input_key`: Either `"csv"` (single file) or `"csv_dir"` (directory), based on the source.
## Methods

### run()

Executes the CSV querying process and returns the extracted information from the CSV file(s), or `"No answer found."` if extraction fails.
## Basic Usage

```python
from scrapegraphai.graphs import CSVScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key"
    }
}

csv_scraper = CSVScraperGraph(
    prompt="What are the top 5 products by revenue?",
    source="./data/sales.csv",
    config=graph_config
)

result = csv_scraper.run()
print(result)
```
## Example CSV File

```csv
product_name,category,price,quantity_sold,revenue
Laptop Pro,Electronics,1299.99,150,194998.50
Wireless Mouse,Electronics,29.99,500,14995.00
Desk Chair,Furniture,249.99,200,49998.00
Notebook,Stationery,5.99,1000,5990.00
Table Lamp,Furniture,79.99,300,23997.00
```
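The examples below read from `./data/sales.csv`. A quick way to create that file locally with the sample rows above, using only the standard library:

```python
import csv
import os

# Write the sample dataset used throughout these examples
os.makedirs("./data", exist_ok=True)
rows = [
    ["product_name", "category", "price", "quantity_sold", "revenue"],
    ["Laptop Pro", "Electronics", 1299.99, 150, 194998.50],
    ["Wireless Mouse", "Electronics", 29.99, 500, 14995.00],
    ["Desk Chair", "Furniture", 249.99, 200, 49998.00],
    ["Notebook", "Stationery", 5.99, 1000, 5990.00],
    ["Table Lamp", "Furniture", 79.99, 300, 23997.00],
]
with open("./data/sales.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Sanity check: total revenue across all products
with open("./data/sales.csv", newline="") as f:
    total = sum(float(r["revenue"]) for r in csv.DictReader(f))
print(f"${total:,.2f}")  # $289,978.50
```

The computed total matches the figure quoted in the aggregation example below.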
## Query Examples

### Simple Aggregation

```python
csv_scraper = CSVScraperGraph(
    prompt="What is the total revenue across all products?",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
# Output: "The total revenue across all products is $289,978.50"
```
### Filtering and Sorting

```python
csv_scraper = CSVScraperGraph(
    prompt="List all Electronics products sorted by price in descending order",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
```
### Statistical Analysis

```python
csv_scraper = CSVScraperGraph(
    prompt="What is the average price per category?",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
```
### Comparison Queries

```python
csv_scraper = CSVScraperGraph(
    prompt="Which category has the highest total revenue?",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
```
## Structured Output with Schema

```python
from pydantic import BaseModel, Field
from typing import List

class TopProduct(BaseModel):
    name: str = Field(description="Product name")
    revenue: float = Field(description="Total revenue")
    units_sold: int = Field(description="Number of units sold")

class TopProducts(BaseModel):
    products: List[TopProduct] = Field(description="List of top products")

csv_scraper = CSVScraperGraph(
    prompt="List the top 3 products by revenue with their details",
    source="./data/sales.csv",
    config=graph_config,
    schema=TopProducts
)

result = csv_scraper.run()
# Result is automatically validated and structured
```
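The exact shape of the structured result can depend on the model and configuration; if it comes back as a plain dict, it can be re-validated explicitly against the same Pydantic model. A minimal sketch, where the `raw_result` values are illustrative, not real output:

```python
from typing import List
from pydantic import BaseModel, Field

class TopProduct(BaseModel):
    name: str = Field(description="Product name")
    revenue: float = Field(description="Total revenue")
    units_sold: int = Field(description="Number of units sold")

class TopProducts(BaseModel):
    products: List[TopProduct] = Field(description="List of top products")

# Hypothetical dict-shaped result, as the graph might return it
raw_result = {
    "products": [
        {"name": "Laptop Pro", "revenue": 194998.50, "units_sold": 150},
        {"name": "Desk Chair", "revenue": 49998.00, "units_sold": 200},
        {"name": "Table Lamp", "revenue": 23997.00, "units_sold": 300},
    ]
}

# Constructor validation coerces the nested dicts into TopProduct instances
validated = TopProducts(**raw_result)
print(validated.products[0].name)  # Laptop Pro
```

Validating this way fails fast with a clear error if the LLM's output drifts from the expected structure.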
## Multiple CSV Files

```python
# Directory structure:
# csv_data/
# ├── sales_2023.csv
# ├── sales_2024.csv
# └── sales_2025.csv

csv_scraper = CSVScraperGraph(
    prompt="Compare total revenue across all years",
    source="./csv_data/",  # Directory path
    config=graph_config
)

result = csv_scraper.run()
# Automatically processes all CSV files in the directory
```
## Graph Workflow

The CSVScraperGraph uses a simple two-node pipeline:

```
FetchNode → GenerateAnswerCSVNode
```

- FetchNode: Loads the CSV file(s) into memory
- GenerateAnswerCSVNode: Processes the CSV data and answers the query
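Conceptually, each node receives a shared state dict, adds its output, and passes it on. The sketch below imitates that flow with stand-in functions; the real `FetchNode` reads from disk and `GenerateAnswerCSVNode` calls the LLM, so everything here beyond the node names and the `doc`/`answer` state keys is illustrative:

```python
import csv
import io

CSV_TEXT = """product_name,category,price,quantity_sold,revenue
Laptop Pro,Electronics,1299.99,150,194998.50
Wireless Mouse,Electronics,29.99,500,14995.00
"""

def fetch_node(state: dict) -> dict:
    # FetchNode's role: load the CSV content into the state under "doc"
    # (the real node would read the file named by the source path)
    state["doc"] = CSV_TEXT
    return state

def generate_answer_csv_node(state: dict) -> dict:
    # GenerateAnswerCSVNode's role: answer the prompt from the CSV data.
    # A real run prompts the LLM; this stub computes one aggregate
    # directly to show the data flow.
    rows = csv.DictReader(io.StringIO(state["doc"]))
    total = sum(float(r["revenue"]) for r in rows)
    state["answer"] = f"Total revenue: ${total:,.2f}"
    return state

state = {"user_prompt": "What is the total revenue?"}
for node in (fetch_node, generate_answer_csv_node):
    state = node(state)
print(state["answer"])  # Total revenue: $209,993.50
```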
## Advanced Usage

### With Additional Context

```python
config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": """
    The CSV contains sales data from our e-commerce platform.
    Revenue is in USD. Focus on year-over-year growth trends.
    """
}

csv_scraper = CSVScraperGraph(
    prompt="Analyze sales trends and identify top performing categories",
    source="./data/sales.csv",
    config=config
)
```
### Complex Analysis

```python
from pydantic import BaseModel
from typing import List

class CategoryAnalysis(BaseModel):
    category: str
    total_revenue: float
    total_units: int
    avg_price: float
    product_count: int

class SalesAnalysis(BaseModel):
    categories: List[CategoryAnalysis]
    insights: str

config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": "Provide detailed insights on category performance"
}

csv_scraper = CSVScraperGraph(
    prompt="Perform a comprehensive analysis of sales by category including key insights",
    source="./data/sales.csv",
    config=config,
    schema=SalesAnalysis
)

result = csv_scraper.run()
```
## Use Cases
- Data Analysis: Query CSV files using natural language
- Report Generation: Extract insights for reports
- Business Intelligence: Answer business questions from data
- Data Exploration: Explore datasets without coding
- Automated Insights: Generate automated data summaries
## Accessing Results

```python
result = csv_scraper.run()

# Get the answer
print("Answer:", result)

# Access the full graph state
final_state = csv_scraper.get_state()
raw_data = final_state.get("doc")
answer = final_state.get("answer")
print(f"Processed CSV data: {len(str(raw_data))} characters")

# Execution info
exec_info = csv_scraper.get_execution_info()
for node_info in exec_info:
    print(f"{node_info['node_name']}: {node_info['exec_time']:.2f}s")
    print(f"Tokens: {node_info['total_tokens']}")
```
## Example: Sales Dashboard

```python
from pydantic import BaseModel
from typing import List

class ProductPerformance(BaseModel):
    product: str
    revenue: float
    units: int
    rank: int

class CategorySummary(BaseModel):
    category: str
    total_revenue: float
    product_count: int

class Dashboard(BaseModel):
    top_products: List[ProductPerformance]
    category_summary: List[CategorySummary]
    total_revenue: float
    insights: str

config = {
    "llm": {"model": "openai/gpt-4o"},
    "verbose": True
}

csv_scraper = CSVScraperGraph(
    prompt="""Create a sales dashboard with:
    1. Top 5 products by revenue
    2. Summary by category
    3. Total revenue
    4. Key insights
    """,
    source="./data/sales.csv",
    config=config,
    schema=Dashboard
)

result = csv_scraper.run()
print(result)
```
## Performance Considerations

- File Size: Large CSV files take longer to process
- LLM Context: Very large CSVs may exceed the LLM's context limit
- Multiple Files: Processing multiple files increases execution time
- Complex Queries: More complex analysis consumes more tokens
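When a CSV risks exceeding the context window, one workaround is to trim it before building the graph, keeping only the columns (and optionally rows) the query actually needs. A sketch using only the standard library; the `slim_csv` helper is hypothetical, not part of the library:

```python
import csv
from typing import List, Optional

def slim_csv(in_path: str, out_path: str,
             columns: List[str], max_rows: Optional[int] = None) -> None:
    """Copy only the selected columns (and optionally the first
    max_rows data rows) so the result fits in the LLM context."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for i, row in enumerate(reader):
            if max_rows is not None and i >= max_rows:
                break
            writer.writerow({c: row[c] for c in columns})

# Then point the graph at the slimmed file, e.g.:
# slim_csv("./data/sales.csv", "./data/sales_slim.csv",
#          ["product_name", "revenue"], max_rows=500)
```

Dropping irrelevant columns also tends to improve answer quality, since the model sees less noise.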
## Error Handling

```python
try:
    result = csv_scraper.run()
    if result == "No answer found.":
        print("Failed to extract information from CSV")
    else:
        print(f"Success: {result}")
except FileNotFoundError:
    print("CSV file not found")
except Exception as e:
    print(f"Error during processing: {e}")
```
## Tips for Better Results

- Be specific: Clear queries get better answers
- Use schema: Define schemas for structured output
- Provide context: Use `additional_info` for domain knowledge
- Test queries: Start with simple queries and iterate
- Check data: Ensure the CSV is well-formatted
## Comparison with Pandas

| Feature | CSVScraperGraph | Pandas |
|---|---|---|
| Input | Natural language | Python code |
| Learning Curve | Low | Medium |
| Flexibility | High | Very High |
| Performance | Depends on LLM | Fast |
| Complex Analysis | Good | Excellent |
| Use Case | Quick insights | Data processing |
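For a concrete sense of the trade-off, here is the "top 5 products by revenue" query from Basic Usage written in pandas, using the sample data above (built inline rather than read from disk):

```python
import pandas as pd

# The sample sales data as a DataFrame
df = pd.DataFrame({
    "product_name": ["Laptop Pro", "Wireless Mouse", "Desk Chair",
                     "Notebook", "Table Lamp"],
    "revenue": [194998.50, 14995.00, 49998.00, 5990.00, 23997.00],
})

# Equivalent of: "What are the top 5 products by revenue?"
top5 = df.nlargest(5, "revenue")[["product_name", "revenue"]]
print(top5.iloc[0]["product_name"])  # Laptop Pro
```

The pandas version is deterministic and fast, but it requires knowing the column names and API up front; CSVScraperGraph trades that precision for a natural language interface.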