## Overview

CSVScraperGraph is a scraping pipeline that extracts information from CSV files using natural language queries. It lets you query CSV data without writing SQL or pandas code.
## Class Signature

```python
class CSVScraperGraph(AbstractGraph):
    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None,
    )
```
## Constructor Parameters

- `prompt` (str): The natural language query to extract information from the CSV file.
- `source` (str): The source CSV file or directory. Can be:
  - Path to a single CSV file (e.g., `"data.csv"`)
  - Path to a directory containing multiple CSV files (e.g., `"./csv_data/"`)
- `config` (dict): Configuration parameters for the graph. Must include:
  - `llm`: LLM configuration (e.g., `{"model": "openai/gpt-4o"}`)

  Optional parameters:
  - `verbose` (bool): Enable detailed logging
  - `additional_info` (str): Extra context for the LLM
- `schema` (Optional[Type[BaseModel]], default `None`): Optional Pydantic model defining the expected output structure.
## Attributes

- `source`: The CSV file path or directory path.
- `config`: Configuration dictionary for the graph.
- `schema`: Optional output schema for structured data extraction.
- `llm_model`: The configured language model instance.
- `input_key`: Either `"csv"` (single file) or `"csv_dir"` (directory), based on the source.
## Methods

### run()

Executes the CSV querying process and returns the extracted information from the CSV file(s), or `"No answer found."` if extraction fails.
## Basic Usage

```python
from scrapegraphai.graphs import CSVScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key"
    }
}

csv_scraper = CSVScraperGraph(
    prompt="What are the top 5 products by revenue?",
    source="./data/sales.csv",
    config=graph_config
)

result = csv_scraper.run()
print(result)
```
## Example CSV File

```csv
product_name,category,price,quantity_sold,revenue
Laptop Pro,Electronics,1299.99,150,194998.50
Wireless Mouse,Electronics,29.99,500,14995.00
Desk Chair,Furniture,249.99,200,49998.00
Notebook,Stationery,5.99,1000,5990.00
Table Lamp,Furniture,79.99,300,23997.00
```
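The examples below read from `./data/sales.csv`. A quick way to create that file locally with the sample rows above, using only the standard library:

```python
import csv
import os

# Write the sample dataset used throughout these examples
os.makedirs("./data", exist_ok=True)
rows = [
    ["product_name", "category", "price", "quantity_sold", "revenue"],
    ["Laptop Pro", "Electronics", 1299.99, 150, 194998.50],
    ["Wireless Mouse", "Electronics", 29.99, 500, 14995.00],
    ["Desk Chair", "Furniture", 249.99, 200, 49998.00],
    ["Notebook", "Stationery", 5.99, 1000, 5990.00],
    ["Table Lamp", "Furniture", 79.99, 300, 23997.00],
]
with open("./data/sales.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Sanity check: total revenue across all products
with open("./data/sales.csv", newline="") as f:
    total = sum(float(r["revenue"]) for r in csv.DictReader(f))
print(f"${total:,.2f}")  # $289,978.50
```

The computed total matches the figure quoted in the aggregation example below.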
## Query Examples

### Simple Aggregation

```python
csv_scraper = CSVScraperGraph(
    prompt="What is the total revenue across all products?",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
# Output: "The total revenue across all products is $289,978.50"
```
### Filtering and Sorting

```python
csv_scraper = CSVScraperGraph(
    prompt="List all Electronics products sorted by price in descending order",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
```
### Statistical Analysis

```python
csv_scraper = CSVScraperGraph(
    prompt="What is the average price per category?",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
```
### Comparison Queries

```python
csv_scraper = CSVScraperGraph(
    prompt="Which category has the highest total revenue?",
    source="./data/sales.csv",
    config=graph_config
)
result = csv_scraper.run()
```
## Structured Output with Schema

```python
from pydantic import BaseModel, Field
from typing import List

class TopProduct(BaseModel):
    name: str = Field(description="Product name")
    revenue: float = Field(description="Total revenue")
    units_sold: int = Field(description="Number of units sold")

class TopProducts(BaseModel):
    products: List[TopProduct] = Field(description="List of top products")

csv_scraper = CSVScraperGraph(
    prompt="List the top 3 products by revenue with their details",
    source="./data/sales.csv",
    config=graph_config,
    schema=TopProducts
)

result = csv_scraper.run()
# Result is automatically validated and structured
```
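The exact shape of the structured result can depend on the model and configuration; if it comes back as a plain dict, it can be re-validated explicitly against the same Pydantic model. A minimal sketch, where the `raw_result` values are illustrative, not real output:

```python
from typing import List
from pydantic import BaseModel, Field

class TopProduct(BaseModel):
    name: str = Field(description="Product name")
    revenue: float = Field(description="Total revenue")
    units_sold: int = Field(description="Number of units sold")

class TopProducts(BaseModel):
    products: List[TopProduct] = Field(description="List of top products")

# Hypothetical dict-shaped result, as the graph might return it
raw_result = {
    "products": [
        {"name": "Laptop Pro", "revenue": 194998.50, "units_sold": 150},
        {"name": "Desk Chair", "revenue": 49998.00, "units_sold": 200},
        {"name": "Table Lamp", "revenue": 23997.00, "units_sold": 300},
    ]
}

# Constructor validation coerces the nested dicts into TopProduct instances
validated = TopProducts(**raw_result)
print(validated.products[0].name)  # Laptop Pro
```

Validating this way fails fast with a clear error if the LLM's output drifts from the expected structure.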
## Multiple CSV Files

```python
# Directory structure:
# csv_data/
# ├── sales_2023.csv
# ├── sales_2024.csv
# └── sales_2025.csv

csv_scraper = CSVScraperGraph(
    prompt="Compare total revenue across all years",
    source="./csv_data/",  # Directory path
    config=graph_config
)

result = csv_scraper.run()
# Automatically processes all CSV files in the directory
```
## Graph Workflow

The CSVScraperGraph uses a simple two-node pipeline:

```
FetchNode → GenerateAnswerCSVNode
```

- FetchNode: Loads the CSV file(s) into memory
- GenerateAnswerCSVNode: Processes the CSV data and answers the query
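Conceptually, each node receives a shared state dict, adds its output, and passes it on. The sketch below imitates that flow with stand-in functions; the real `FetchNode` reads from disk and `GenerateAnswerCSVNode` calls the LLM, so everything here beyond the node names and the `doc`/`answer` state keys is illustrative:

```python
import csv
import io

CSV_TEXT = """product_name,category,price,quantity_sold,revenue
Laptop Pro,Electronics,1299.99,150,194998.50
Wireless Mouse,Electronics,29.99,500,14995.00
"""

def fetch_node(state: dict) -> dict:
    # FetchNode's role: load the CSV content into the state under "doc"
    # (the real node would read the file named by the source path)
    state["doc"] = CSV_TEXT
    return state

def generate_answer_csv_node(state: dict) -> dict:
    # GenerateAnswerCSVNode's role: answer the prompt from the CSV data.
    # A real run prompts the LLM; this stub computes one aggregate
    # directly to show the data flow.
    rows = csv.DictReader(io.StringIO(state["doc"]))
    total = sum(float(r["revenue"]) for r in rows)
    state["answer"] = f"Total revenue: ${total:,.2f}"
    return state

state = {"user_prompt": "What is the total revenue?"}
for node in (fetch_node, generate_answer_csv_node):
    state = node(state)
print(state["answer"])  # Total revenue: $209,993.50
```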
## Advanced Usage

### With Additional Context

```python
config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": """
    The CSV contains sales data from our e-commerce platform.
    Revenue is in USD. Focus on year-over-year growth trends.
    """
}

csv_scraper = CSVScraperGraph(
    prompt="Analyze sales trends and identify top performing categories",
    source="./data/sales.csv",
    config=config
)
```
### Complex Analysis

```python
from pydantic import BaseModel
from typing import List

class CategoryAnalysis(BaseModel):
    category: str
    total_revenue: float
    total_units: int
    avg_price: float
    product_count: int

class SalesAnalysis(BaseModel):
    categories: List[CategoryAnalysis]
    insights: str

config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": "Provide detailed insights on category performance"
}

csv_scraper = CSVScraperGraph(
    prompt="Perform a comprehensive analysis of sales by category including key insights",
    source="./data/sales.csv",
    config=config,
    schema=SalesAnalysis
)

result = csv_scraper.run()
```
## Use Cases
- Data Analysis: Query CSV files using natural language
- Report Generation: Extract insights for reports
- Business Intelligence: Answer business questions from data
- Data Exploration: Explore datasets without coding
- Automated Insights: Generate automated data summaries
## Accessing Results

```python
result = csv_scraper.run()

# Get the answer
print("Answer:", result)

# Access the full graph state
final_state = csv_scraper.get_state()
raw_data = final_state.get("doc")
answer = final_state.get("answer")
print(f"Processed CSV data: {len(str(raw_data))} characters")

# Execution info
exec_info = csv_scraper.get_execution_info()
for node_info in exec_info:
    print(f"{node_info['node_name']}: {node_info['exec_time']:.2f}s")
    print(f"Tokens: {node_info['total_tokens']}")
```
## Example: Sales Dashboard

```python
from pydantic import BaseModel
from typing import List

class ProductPerformance(BaseModel):
    product: str
    revenue: float
    units: int
    rank: int

class CategorySummary(BaseModel):
    category: str
    total_revenue: float
    product_count: int

class Dashboard(BaseModel):
    top_products: List[ProductPerformance]
    category_summary: List[CategorySummary]
    total_revenue: float
    insights: str

config = {
    "llm": {"model": "openai/gpt-4o"},
    "verbose": True
}

csv_scraper = CSVScraperGraph(
    prompt="""Create a sales dashboard with:
    1. Top 5 products by revenue
    2. Summary by category
    3. Total revenue
    4. Key insights
    """,
    source="./data/sales.csv",
    config=config,
    schema=Dashboard
)

result = csv_scraper.run()
print(result)
```
## Performance Considerations

- File Size: Large CSV files take longer to process
- LLM Context: Very large CSVs may exceed the LLM's context limit
- Multiple Files: Processing multiple files increases execution time
- Complex Queries: More complex analysis consumes more tokens
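When a CSV risks exceeding the context window, one workaround is to trim it before building the graph, keeping only the columns (and optionally rows) the query actually needs. A sketch using only the standard library; the `slim_csv` helper is hypothetical, not part of the library:

```python
import csv
from typing import List, Optional

def slim_csv(in_path: str, out_path: str,
             columns: List[str], max_rows: Optional[int] = None) -> None:
    """Copy only the selected columns (and optionally the first
    max_rows data rows) so the result fits in the LLM context."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for i, row in enumerate(reader):
            if max_rows is not None and i >= max_rows:
                break
            writer.writerow({c: row[c] for c in columns})

# Then point the graph at the slimmed file, e.g.:
# slim_csv("./data/sales.csv", "./data/sales_slim.csv",
#          ["product_name", "revenue"], max_rows=500)
```

Dropping irrelevant columns also tends to improve answer quality, since the model sees less noise.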
## Error Handling

```python
try:
    result = csv_scraper.run()
    if result == "No answer found.":
        print("Failed to extract information from CSV")
    else:
        print(f"Success: {result}")
except FileNotFoundError:
    print("CSV file not found")
except Exception as e:
    print(f"Error during processing: {e}")
```
## Tips for Better Results

- Be specific: Clear queries get better answers
- Use schema: Define schemas for structured output
- Provide context: Use `additional_info` for domain knowledge
- Test queries: Start with simple queries and iterate
- Check data: Ensure the CSV is well-formatted
## Comparison with Pandas

| Feature | CSVScraperGraph | Pandas |
|---|---|---|
| Input | Natural language | Python code |
| Learning Curve | Low | Medium |
| Flexibility | High | Very High |
| Performance | Depends on LLM | Fast |
| Complex Analysis | Good | Excellent |
| Use Case | Quick insights | Data processing |
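For a concrete sense of the trade-off, here is the "top 5 products by revenue" query from Basic Usage written in pandas, using the sample data above (built inline rather than read from disk):

```python
import pandas as pd

# The sample sales data as a DataFrame
df = pd.DataFrame({
    "product_name": ["Laptop Pro", "Wireless Mouse", "Desk Chair",
                     "Notebook", "Table Lamp"],
    "revenue": [194998.50, 14995.00, 49998.00, 5990.00, 23997.00],
})

# Equivalent of: "What are the top 5 products by revenue?"
top5 = df.nlargest(5, "revenue")[["product_name", "revenue"]]
print(top5.iloc[0]["product_name"])  # Laptop Pro
```

The pandas version is deterministic and fast, but it requires knowing the column names and API up front; CSVScraperGraph trades that precision for a natural language interface.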