
Overview

The AI Data Analysis Agent is a powerful tool that makes data analysis accessible to everyone, regardless of SQL expertise. Built with the Agno Agent framework and OpenAI’s GPT-4o, it converts natural language questions into SQL queries, executes them against your data using DuckDB, and returns clear, actionable insights.
FREE Tutorial Available: Follow the complete step-by-step tutorial to learn how to build this from scratch with detailed code walkthroughs and best practices.

Features

File Upload Support

  • Upload CSV and Excel files
  • Automatic data type detection
  • Schema inference
  • Multiple format support (.csv, .xlsx)

Natural Language Queries

  • Convert plain English to SQL
  • No SQL knowledge required
  • Complex aggregations and filters
  • Instant answers about your data

Advanced Analysis

  • Complex data aggregations
  • Filtering and sorting
  • Statistical summaries
  • Data visualizations

Interactive UI

  • User-friendly Streamlit interface
  • Real-time query processing
  • Clear result presentation
  • Data preview tables

How It Works

1. Data Upload

Users upload CSV or Excel files through the Streamlit interface
2. Preprocessing

The agent automatically:
  • Detects data types (dates, numbers, strings)
  • Handles missing values
  • Infers schema
  • Loads data into DuckDB
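The type-inference step above can be sketched with a simple stdlib heuristic. This is an illustration only; the app's actual preprocessing uses pandas, as shown in the Code Example section below.

```python
import re
from datetime import datetime

def infer_type(values):
    """Guess a column's type from string samples (illustrative heuristic,
    not the app's pandas-based logic)."""
    # Drop the same missing-value markers the app treats as NA
    non_missing = [v for v in values if v not in ("", "NA", "N/A", "missing")]
    if not non_missing:
        return "string"
    if all(re.fullmatch(r"-?\d+", v) for v in non_missing):
        return "integer"
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in non_missing):
        return "float"
    try:
        for v in non_missing:
            datetime.strptime(v, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "string"

print(infer_type(["1", "2", "NA"]))       # integer
print(infer_type(["1.5", "2"]))           # float
print(infer_type(["2024-01-01", "N/A"]))  # date
```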
3. Natural Language Query

Users ask questions in plain English:
  • “What are the top 5 products by revenue?”
  • “Show me average sales by region”
  • “Which customers made purchases over $1000?”
4. SQL Generation & Execution

The AI agent:
  • Converts natural language to SQL
  • Executes query against DuckDB
  • Processes results
  • Returns formatted answers
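The execute step boils down to running the generated SQL against an in-memory table. Here is a dependency-free illustration using stdlib sqlite3 standing in for DuckDB; the query string is a hypothetical example of what the agent might generate, not actual agent output.

```python
import sqlite3

# Illustration only: the app executes generated SQL against DuckDB;
# sqlite3 stands in here to keep the example dependency-free.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploaded_data (product TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO uploaded_data VALUES (?, ?)",
    [("Widget", 1200.0), ("Gadget", 800.0), ("Gizmo", 1500.0)],
)

# Hypothetical SQL the agent might generate for
# "What are the top 2 products by revenue?"
generated_sql = """
    SELECT product, revenue
    FROM uploaded_data
    ORDER BY revenue DESC
    LIMIT 2
"""
rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('Gizmo', 1500.0), ('Widget', 1200.0)]
```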

Setup

1. Clone the Repository

git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
cd awesome-llm-apps/starter_ai_agents/ai_data_analysis_agent
2. Install Dependencies

pip install -r requirements.txt
Required packages:
  • streamlit==1.41.1 - Web interface
  • openai==1.58.1 - OpenAI API client
  • duckdb>=1.4.1 - In-process analytical SQL database
  • pandas - Data manipulation
  • numpy==1.26.4 - Numerical computing
  • agno>=2.2.10 - Agent framework
3. Configure API Keys

Get your OpenAI API key from the OpenAI Platform. You’ll enter it in the Streamlit sidebar when running the app (no environment variables needed).
4. Run the Application

streamlit run ai_data_analyst.py
Open your browser to http://localhost:8501

Usage

Quick Start

1. Enter API Key

Provide your OpenAI API key in the sidebar
2. Upload Data

Upload your CSV or Excel file using the file uploader
3. Preview Data

The app displays your data in an interactive table with column names
4. Ask Questions

Type your questions in natural language and click “Submit Query”

Example Queries

What are the total sales by region?
Show me the top 10 products by revenue
Which month had the highest sales?
What's the average order value?
How many customers made repeat purchases?
What's the customer retention rate?
Show me customers with lifetime value over $5000
Which customers haven't purchased in 90 days?
What are the most popular products?
Show me products with declining sales
Which category has the highest margins?
List products that are frequently bought together
Show me year-over-year growth
What are the seasonal trends?
Compare Q1 vs Q2 performance
Show daily active users over the last month

Code Example

Data Preprocessing

import tempfile
import csv
import pandas as pd
import streamlit as st

def preprocess_and_save(file):
    """Preprocess and save uploaded file with proper type detection."""
    try:
        # Read the uploaded file
        if file.name.endswith('.csv'):
            df = pd.read_csv(file, encoding='utf-8', 
                           na_values=['NA', 'N/A', 'missing'])
        elif file.name.endswith('.xlsx'):
            df = pd.read_excel(file, 
                             na_values=['NA', 'N/A', 'missing'])
        else:
            return None, None, None
        
        # Escape double quotes in string columns so CSV quoting stays valid
        for col in df.select_dtypes(include=['object']):
            df[col] = df[col].astype(str).replace({r'"': '""'}, regex=True)
        
        # Parse dates and numeric columns
        for col in df.columns:
            if 'date' in col.lower():
                df[col] = pd.to_datetime(df[col], errors='coerce')
            elif df[col].dtype == 'object':
                try:
                    df[col] = pd.to_numeric(df[col])
                except (ValueError, TypeError):
                    pass
        
        # Save to temporary CSV
        with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as temp_file:
            temp_path = temp_file.name
            df.to_csv(temp_path, index=False, quoting=csv.QUOTE_ALL)
        
        return temp_path, df.columns.tolist(), df
        
    except Exception as e:
        st.error(f"Error processing file: {e}")
        return None, None, None

Agent Configuration

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.duckdb import DuckDbTools
from agno.tools.pandas import PandasTools
import streamlit as st

# Initialize DuckDB tools
duckdb_tools = DuckDbTools()

# Load data into DuckDB
duckdb_tools.load_local_csv_to_table(
    path=temp_path,
    table="uploaded_data",
)

# Create the AI Data Analyst Agent
data_analyst_agent = Agent(
    model=OpenAIChat(id="gpt-4o", api_key=openai_key),
    tools=[duckdb_tools, PandasTools()],
    system_message="""
    You are an expert data analyst. Use the 'uploaded_data' table 
    to answer user queries. Generate SQL queries using DuckDB tools 
    to solve the user's query. Provide clear and concise answers 
    with the results.
    """,
    markdown=True,
)

# Process user query
user_query = "What are the top 5 customers by total purchase amount?"
response = data_analyst_agent.run(user_query)

# Display results
st.markdown(response.content)

Complete Streamlit App Structure

import streamlit as st

st.title("📊 Data Analyst Agent")

# Sidebar for API keys
with st.sidebar:
    st.header("API Keys")
    openai_key = st.text_input("Enter your OpenAI API key:", type="password")
    if openai_key:
        st.session_state.openai_key = openai_key
        st.success("API key saved!")
    else:
        st.warning("Please enter your OpenAI API key to proceed.")

# File upload
uploaded_file = st.file_uploader("Upload a CSV or Excel file", 
                                 type=["csv", "xlsx"])

if uploaded_file and "openai_key" in st.session_state:
    # Preprocess data
    temp_path, columns, df = preprocess_and_save(uploaded_file)
    
    if temp_path:
        # Display uploaded data
        st.write("Uploaded Data:")
        st.dataframe(df)
        st.write("Columns:", columns)
        
        # Initialize agent (code shown above)
        # ...
        
        # Query interface
        user_query = st.text_area("Ask a query about the data:")
        
        st.info("💡 Check your terminal for clearer agent output")
        
        if st.button("Submit Query"):
            if user_query.strip():
                with st.spinner('Processing your query...'):
                    response = data_analyst_agent.run(user_query)
                    st.markdown(response.content)
            else:
                st.warning("Please enter a query.")

Advanced Features

DuckDB Integration

DuckDB provides fast, in-memory SQL analysis:
  • Fast: OLAP-optimized for analytical queries
  • In-Memory: No external database needed
  • SQL-Compatible: Standard SQL syntax
  • Pandas Integration: Seamless data exchange

Automatic Type Detection

The agent intelligently detects and converts:
for col in df.columns:
    # Date detection
    if 'date' in col.lower():
        df[col] = pd.to_datetime(df[col], errors='coerce')
    
    # Numeric conversion
    elif df[col].dtype == 'object':
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # Keep as string
Handles:
  • Date formats (YYYY-MM-DD, MM/DD/YYYY, etc.)
  • Numeric types (integers, floats)
  • Currency values ($1,234.56)
  • Missing values (NA, N/A, null)
  • String data with proper quoting
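Currency and missing-value handling can be illustrated with a small stdlib helper. This is not part of the app's code (which relies on pandas' na_values and to_numeric); it just shows the kind of normalization involved.

```python
import re

def parse_currency(value):
    """Convert strings like '$1,234.56' to float; return None for
    missing-value markers (illustrative helper, not the app's code)."""
    if value is None or value.strip() in ("", "NA", "N/A", "null"):
        return None
    # Strip currency symbol and thousands separators
    cleaned = re.sub(r"[$,]", "", value.strip())
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_currency("$1,234.56"))  # 1234.56
print(parse_currency("N/A"))        # None
```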

Use Cases

Sales Analytics

Analyze sales trends, top products, regional performance, and revenue metrics

Customer Analysis

Segment customers, analyze behavior, calculate lifetime value, identify churn

Financial Reporting

Generate P&L summaries, expense analysis, budget tracking, and forecasts

Operations Data

Inventory analysis, supply chain metrics, production efficiency, logistics

Marketing Analytics

Campaign performance, conversion rates, ROI analysis, channel attribution

HR Analytics

Headcount analysis, turnover rates, compensation benchmarks, performance metrics

Best Practices

Clean Data: Ensure your CSV/Excel files have clear column headers and consistent data formats for best results.
Data Privacy: All data processing happens locally or in your OpenAI account. Avoid uploading sensitive data without proper safeguards.
Terminal Output: The app displays cleaner output in the terminal. Check your terminal for detailed agent reasoning and SQL queries.

Optimizing Queries

1. Be Specific

Instead of “analyze sales”, try “show me total sales by product category for Q1 2024”
2. Reference Column Names

If you know column names, include them: “sum the revenue_usd column by region”
3. Specify Formats

Request specific output: “show as a table”, “sort by highest value”, “limit to top 10”

Troubleshooting

Issue: File won’t upload or shows errors

Solutions:
  • Ensure file is .csv or .xlsx format
  • Check for special characters in column names
  • Verify file isn’t corrupted
  • Try saving as CSV if Excel file fails
Issue: Agent returns no data or “0 rows”

Solutions:
  • Verify data is loaded (check data preview)
  • Check column names match your query
  • Try simpler query first: “show me first 5 rows”
  • Ensure filters aren’t too restrictive
Issue: Agent misunderstands the query

Solutions:
  • Rephrase query more explicitly
  • Reference specific column names
  • Break complex queries into simpler parts
  • Check terminal for generated SQL
Issue: OpenAI API errors

Solutions:
  • Wait a moment between queries
  • Check API key has credits
  • Verify API key is valid
  • Monitor OpenAI dashboard for usage

Performance Tips

File Size

Works best with files under 100MB. For larger datasets, consider sampling or filtering data first.

Query Complexity

Simple aggregations are fast. Complex multi-step queries may take 5-10 seconds.

Result Size

Limit large result sets: “show me top 100” instead of returning millions of rows.

API Costs

Each query uses OpenAI API tokens. Simple queries use ~500 tokens, complex ones may use 2000+.
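A back-of-the-envelope cost estimate based on the token counts above. The per-million-token price is a placeholder, not a quoted rate; check OpenAI's pricing page for current figures.

```python
def estimate_cost(tokens, price_per_million):
    """Rough per-query cost in dollars. price_per_million is whatever
    your model's current rate is (the value used below is a placeholder,
    not an actual OpenAI price)."""
    return tokens / 1_000_000 * price_per_million

# A 2000-token query at a hypothetical $5 per million tokens:
print(f"${estimate_cost(2000, 5.0):.4f}")  # $0.0100
```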

Next Steps

Tutorial

Follow the complete step-by-step tutorial

More Examples

Explore other AI agent examples

Agno Framework

Learn more about the Agno agent framework

GitHub

View source code and contribute
