
Overview

The AI Data Analysis Agent is a powerful tool that makes data analysis accessible to everyone, regardless of SQL expertise. Built with the Agno Agent framework and OpenAI’s GPT-4o, it converts natural language questions into SQL queries, executes them against your data using DuckDB, and returns clear, actionable insights.
FREE Tutorial Available: Follow the complete step-by-step tutorial to learn how to build this from scratch with detailed code walkthroughs and best practices.

Features

File Upload Support

  • Upload CSV and Excel files
  • Automatic data type detection
  • Schema inference
  • Multiple format support (.csv, .xlsx)

Natural Language Queries

  • Convert plain English to SQL
  • No SQL knowledge required
  • Complex aggregations and filters
  • Instant answers about your data

Advanced Analysis

  • Complex data aggregations
  • Filtering and sorting
  • Statistical summaries
  • Data visualizations

Interactive UI

  • User-friendly Streamlit interface
  • Real-time query processing
  • Clear result presentation
  • Data preview tables

How It Works

1. Data Upload

Users upload CSV or Excel files through the Streamlit interface
2. Preprocessing

The agent automatically:
  • Detects data types (dates, numbers, strings)
  • Handles missing values
  • Infers schema
  • Loads data into DuckDB
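The type-inference step above can be sketched with a simple stdlib heuristic. This is an illustration only; the app's actual preprocessing uses pandas, as shown in the Code Example section below.

```python
import re
from datetime import datetime

def infer_type(values):
    """Guess a column's type from string samples (illustrative heuristic,
    not the app's pandas-based logic)."""
    # Drop the same missing-value markers the app treats as NA
    non_missing = [v for v in values if v not in ("", "NA", "N/A", "missing")]
    if not non_missing:
        return "string"
    if all(re.fullmatch(r"-?\d+", v) for v in non_missing):
        return "integer"
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in non_missing):
        return "float"
    try:
        for v in non_missing:
            datetime.strptime(v, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "string"

print(infer_type(["1", "2", "NA"]))       # integer
print(infer_type(["1.5", "2"]))           # float
print(infer_type(["2024-01-01", "N/A"]))  # date
```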
3. Natural Language Query

Users ask questions in plain English:
  • “What are the top 5 products by revenue?”
  • “Show me average sales by region”
  • “Which customers made purchases over $1000?”
4. SQL Generation & Execution

The AI agent:
  • Converts natural language to SQL
  • Executes query against DuckDB
  • Processes results
  • Returns formatted answers
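The execute step boils down to running the generated SQL against an in-memory table. Here is a dependency-free illustration using stdlib sqlite3 standing in for DuckDB; the query string is a hypothetical example of what the agent might generate, not actual agent output.

```python
import sqlite3

# Illustration only: the app executes generated SQL against DuckDB;
# sqlite3 stands in here to keep the example dependency-free.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploaded_data (product TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO uploaded_data VALUES (?, ?)",
    [("Widget", 1200.0), ("Gadget", 800.0), ("Gizmo", 1500.0)],
)

# Hypothetical SQL the agent might generate for
# "What are the top 2 products by revenue?"
generated_sql = """
    SELECT product, revenue
    FROM uploaded_data
    ORDER BY revenue DESC
    LIMIT 2
"""
rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('Gizmo', 1500.0), ('Widget', 1200.0)]
```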

Setup

1. Clone the Repository

git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps/starter_ai_agents/ai_data_analysis_agent
2. Install Dependencies

pip install -r requirements.txt
Required packages:
  • streamlit==1.41.1 - Web interface
  • openai==1.58.1 - OpenAI API client
  • duckdb>=1.4.1 - In-process analytical SQL database
  • pandas - Data manipulation
  • numpy==1.26.4 - Numerical computing
  • agno>=2.2.10 - Agent framework
3. Configure API Keys

Get your OpenAI API key from the OpenAI Platform. You’ll enter it in the Streamlit sidebar when running the app (no environment variables needed).
4. Run the Application

streamlit run ai_data_analyst.py
Open your browser to http://localhost:8501

Usage

Quick Start

1. Enter API Key

Provide your OpenAI API key in the sidebar
2. Upload Data

Upload your CSV or Excel file using the file uploader
3. Preview Data

The app displays your data in an interactive table with column names
4. Ask Questions

Type your questions in natural language and click “Submit Query”

Example Queries

What are the total sales by region?
Show me the top 10 products by revenue
Which month had the highest sales?
What's the average order value?
How many customers made repeat purchases?
What's the customer retention rate?
Show me customers with lifetime value over $5000
Which customers haven't purchased in 90 days?
What are the most popular products?
Show me products with declining sales
Which category has the highest margins?
List products that are frequently bought together
Show me year-over-year growth
What are the seasonal trends?
Compare Q1 vs Q2 performance
Show daily active users over the last month

Code Example

Data Preprocessing

import tempfile
import csv
import pandas as pd
import streamlit as st

def preprocess_and_save(file):
    """Preprocess and save uploaded file with proper type detection."""
    try:
        # Read the uploaded file
        if file.name.endswith('.csv'):
            df = pd.read_csv(file, encoding='utf-8', 
                           na_values=['NA', 'N/A', 'missing'])
        elif file.name.endswith('.xlsx'):
            df = pd.read_excel(file, 
                             na_values=['NA', 'N/A', 'missing'])
        else:
            return None, None, None
        
        # Escape double quotes in string columns so CSV quoting stays valid
        for col in df.select_dtypes(include=['object']):
            df[col] = df[col].astype(str).replace({r'"': '""'}, regex=True)
        
        # Parse dates and numeric columns
        for col in df.columns:
            if 'date' in col.lower():
                df[col] = pd.to_datetime(df[col], errors='coerce')
            elif df[col].dtype == 'object':
                try:
                    df[col] = pd.to_numeric(df[col])
                except (ValueError, TypeError):
                    pass
        
        # Save to temporary CSV
        with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as temp_file:
            temp_path = temp_file.name
            df.to_csv(temp_path, index=False, quoting=csv.QUOTE_ALL)
        
        return temp_path, df.columns.tolist(), df
        
    except Exception as e:
        st.error(f"Error processing file: {e}")
        return None, None, None

Agent Configuration

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckdb import DuckDbTools
from agno.tools.pandas import PandasTools
import streamlit as st

# Initialize DuckDB tools
duckdb_tools = DuckDbTools()

# Load data into DuckDB
duckdb_tools.load_local_csv_to_table(
    path=temp_path,
    table="uploaded_data",
)

# Create the AI Data Analyst Agent
data_analyst_agent = Agent(
    model=OpenAIChat(id="gpt-4o", api_key=openai_key),
    tools=[duckdb_tools, PandasTools()],
    system_message="""
    You are an expert data analyst. Use the 'uploaded_data' table 
    to answer user queries. Generate SQL queries using DuckDB tools 
    to solve the user's query. Provide clear and concise answers 
    with the results.
    """,
    markdown=True,
)

# Process user query
user_query = "What are the top 5 customers by total purchase amount?"
response = data_analyst_agent.run(user_query)

# Display results
st.markdown(response.content)

Complete Streamlit App Structure

import streamlit as st

st.title("📊 Data Analyst Agent")

# Sidebar for API keys
with st.sidebar:
    st.header("API Keys")
    openai_key = st.text_input("Enter your OpenAI API key:", type="password")
    if openai_key:
        st.session_state.openai_key = openai_key
        st.success("API key saved!")
    else:
        st.warning("Please enter your OpenAI API key to proceed.")

# File upload
uploaded_file = st.file_uploader("Upload a CSV or Excel file", 
                                 type=["csv", "xlsx"])

if uploaded_file and "openai_key" in st.session_state:
    # Preprocess data
    temp_path, columns, df = preprocess_and_save(uploaded_file)
    
    if temp_path:
        # Display uploaded data
        st.write("Uploaded Data:")
        st.dataframe(df)
        st.write("Columns:", columns)
        
        # Initialize agent (code shown above)
        # ...
        
        # Query interface
        user_query = st.text_area("Ask a query about the data:")
        
        st.info("💡 Check your terminal for clearer agent output")
        
        if st.button("Submit Query"):
            if user_query.strip():
                with st.spinner('Processing your query...'):
                    response = data_analyst_agent.run(user_query)
                    st.markdown(response.content)
            else:
                st.warning("Please enter a query.")

Advanced Features

DuckDB Integration

DuckDB provides fast, in-memory SQL analysis:
  • Fast: OLAP-optimized for analytical queries
  • In-Memory: No external database needed
  • SQL-Compatible: Standard SQL syntax
  • Pandas Integration: Seamless data exchange

Automatic Type Detection

The agent intelligently detects and converts:
for col in df.columns:
    # Date detection
    if 'date' in col.lower():
        df[col] = pd.to_datetime(df[col], errors='coerce')
    
    # Numeric conversion
    elif df[col].dtype == 'object':
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # Keep as string
Handles:
  • Date formats (YYYY-MM-DD, MM/DD/YYYY, etc.)
  • Numeric types (integers, floats)
  • Currency values ($1,234.56)
  • Missing values (NA, N/A, null)
  • String data with proper quoting
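Currency and missing-value handling can be illustrated with a small stdlib helper. This is not part of the app's code (which relies on pandas' na_values and to_numeric); it just shows the kind of normalization involved.

```python
import re

def parse_currency(value):
    """Convert strings like '$1,234.56' to float; return None for
    missing-value markers (illustrative helper, not the app's code)."""
    if value is None or value.strip() in ("", "NA", "N/A", "null"):
        return None
    # Strip currency symbol and thousands separators
    cleaned = re.sub(r"[$,]", "", value.strip())
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_currency("$1,234.56"))  # 1234.56
print(parse_currency("N/A"))        # None
```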

Use Cases

Sales Analytics

Analyze sales trends, top products, regional performance, and revenue metrics

Customer Analysis

Segment customers, analyze behavior, calculate lifetime value, identify churn

Financial Reporting

Generate P&L summaries, expense analysis, budget tracking, and forecasts

Operations Data

Inventory analysis, supply chain metrics, production efficiency, logistics

Marketing Analytics

Campaign performance, conversion rates, ROI analysis, channel attribution

HR Analytics

Headcount analysis, turnover rates, compensation benchmarks, performance metrics

Best Practices

Clean Data: Ensure your CSV/Excel files have clear column headers and consistent data formats for best results.
Data Privacy: All data processing happens locally or in your OpenAI account. Avoid uploading sensitive data without proper safeguards.
Terminal Output: The app displays cleaner output in the terminal. Check your terminal for detailed agent reasoning and SQL queries.

Optimizing Queries

1. Be Specific

Instead of “analyze sales”, try “show me total sales by product category for Q1 2024”
2. Reference Column Names

If you know column names, include them: “sum the revenue_usd column by region”
3. Specify Formats

Request specific output: “show as a table”, “sort by highest value”, “limit to top 10”

Troubleshooting

Issue: File won’t upload or shows errors

Solutions:
  • Ensure file is .csv or .xlsx format
  • Check for special characters in column names
  • Verify file isn’t corrupted
  • Try saving as CSV if Excel file fails
Issue: Agent returns no data or “0 rows”

Solutions:
  • Verify data is loaded (check data preview)
  • Check column names match your query
  • Try simpler query first: “show me first 5 rows”
  • Ensure filters aren’t too restrictive
Issue: Agent misunderstands the query

Solutions:
  • Rephrase query more explicitly
  • Reference specific column names
  • Break complex queries into simpler parts
  • Check terminal for generated SQL
Issue: OpenAI API errors

Solutions:
  • Wait a moment between queries
  • Check API key has credits
  • Verify API key is valid
  • Monitor OpenAI dashboard for usage

Performance Tips

File Size

Works best with files under 100MB. For larger datasets, consider sampling or filtering data first.

Query Complexity

Simple aggregations are fast. Complex multi-step queries may take 5-10 seconds.

Result Size

Limit large result sets: “show me top 100” instead of returning millions of rows.

API Costs

Each query uses OpenAI API tokens. Simple queries use ~500 tokens, complex ones may use 2000+.
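A back-of-the-envelope cost estimate based on the token counts above. The per-million-token price is a placeholder, not a quoted rate; check OpenAI's pricing page for current figures.

```python
def estimate_cost(tokens, price_per_million):
    """Rough per-query cost in dollars. price_per_million is whatever
    your model's current rate is (the value used below is a placeholder,
    not an actual OpenAI price)."""
    return tokens / 1_000_000 * price_per_million

# A 2000-token query at a hypothetical $5 per million tokens:
print(f"${estimate_cost(2000, 5.0):.4f}")  # $0.0100
```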

Next Steps

Tutorial

Follow the complete step-by-step tutorial

More Examples

Explore other AI agent examples

Agno Framework

Learn more about the Agno agent framework

GitHub

View source code and contribute
