Installation

Install the CSV connector:
npm install @evidence-dev/csv

Configuration

connection.yaml
name: csv
type: csv
options:
  options: ""
The CSV connector uses DuckDB’s read_csv function under the hood, which automatically detects file structure and types.

Configuration Parameters

options
string
DuckDB CSV options string (see DuckDB CSV documentation)

How It Works

The CSV connector:
  1. Looks for .csv files in your sources directory
  2. Automatically loads them using DuckDB’s CSV reader
  3. Applies auto-detection for delimiters, headers, and data types
  4. Makes the data queryable via SQL

File Placement

Place CSV files in your sources directory:
your-project/
  sources/
    my_connection/
      connection.yaml
      sales_data.csv
      customers.csv
      queries/
        *.sql
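
The layout above, and step 1 of "How It Works" (file discovery), can be sketched with Python's stdlib. The snippet builds the example directory structure in a temp folder and then globs for the .csv files the connector would pick up; all file names are the illustrative ones from this page, not required names.

```python
from pathlib import Path
import tempfile

# Build the example layout from the docs in a temporary directory
root = Path(tempfile.mkdtemp()) / "your-project" / "sources" / "my_connection"
(root / "queries").mkdir(parents=True)
(root / "connection.yaml").write_text("name: csv\ntype: csv\n")
(root / "sales_data.csv").write_text("order_id,amount\n1001,149.99\n")
(root / "customers.csv").write_text("customer_id,customer_name\nC001,Acme\n")

# Step 1 of "How It Works": discover .csv files in the source directory
# (connection.yaml and the queries/ subdirectory are not matched)
csv_files = sorted(p.name for p in root.glob("*.csv"))
print(csv_files)  # ['customers.csv', 'sales_data.csv']
```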

Example Queries

Query a CSV File

Create a .sql file that references your CSV:
queries/sales_summary.sql
SELECT 
  DATE_TRUNC('month', order_date) as month,
  SUM(amount) as total_revenue,
  COUNT(*) as order_count
FROM 'sales_data.csv'
GROUP BY 1
ORDER BY 1 DESC

Join Multiple CSV Files

queries/customer_orders.sql
SELECT 
  c.customer_name,
  c.segment,
  COUNT(o.order_id) as order_count,
  SUM(o.amount) as total_spent
FROM 'customers.csv' c
LEFT JOIN 'orders.csv' o
  ON c.customer_id = o.customer_id
GROUP BY 1, 2
ORDER BY 4 DESC

Filter and Transform

queries/recent_orders.sql
SELECT 
  order_id,
  customer_id,
  order_date,
  amount,
  CASE 
    WHEN amount >= 1000 THEN 'High Value'
    WHEN amount >= 500 THEN 'Medium Value'
    ELSE 'Low Value'
  END as order_category
FROM 'orders.csv'
WHERE order_date >= CURRENT_DATE - INTERVAL 30 DAYS
ORDER BY order_date DESC

CSV Options

Customize CSV parsing using the options parameter:
connection.yaml
name: csv
type: csv
options:
  options: "delim=';', header=true, quote='\"'"

Common Options

  • delim=',' - Column delimiter (default: auto-detect)
  • header=true - First row contains headers (default: auto-detect)
  • quote='"' - Quote character (default: '"')
  • escape='"' - Escape character (default: '"')
  • skip=0 - Number of lines to skip at the beginning
  • dateformat='%Y-%m-%d' - Date format string
  • timestampformat='%Y-%m-%d %H:%M:%S' - Timestamp format
  • nullstr='NULL' - String representing NULL values
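
If you are unsure which delimiter a file actually uses, Python's stdlib csv.Sniffer can detect it before you write the options string. This is a quick diagnostic sketch run outside Evidence, not part of the connector itself; the sample data is made up.

```python
import csv

# A few sample lines from a file whose delimiter is unknown
sample = "order_id;amount;region\n1001;149.99;EMEA\n1002;79.98;APAC\n"

# Restrict candidates to the common delimiters listed above
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(repr(dialect.delimiter))  # ';'

# Feed the result into the connector's options string
options = f"delim='{dialect.delimiter}'"
print(options)  # delim=';'
```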

Example with Custom Options

connection.yaml
name: csv_custom
type: csv
options:
  options: "delim='|', header=true, nullstr='N/A', dateformat='%d/%m/%Y'"
In most cases, you can leave options empty. DuckDB’s auto-detection is very good at identifying the correct format.

Advanced Features

Wildcard Patterns

Query multiple files at once:
queries/all_regions.sql
SELECT 
  region,
  SUM(sales) as total_sales
FROM 'sales_*.csv'
GROUP BY region

Glob Patterns

Use glob patterns for flexible file matching. With DuckDB's Hive partitioning enabled (add hive_partitioning=true to the options string), directory keys such as year=*/month=* become queryable columns:
queries/yearly_data.sql
SELECT *
FROM 'data/year=*/month=*/*.csv'
WHERE year >= 2023

Column Selection

Select specific columns:
queries/subset.sql
SELECT 
  order_id,
  customer_id,
  amount
FROM 'large_dataset.csv'
LIMIT 1000

Type Casting

Explicitly cast column types:
queries/with_types.sql
SELECT 
  order_id::INTEGER,
  order_date::DATE,
  amount::DECIMAL(10,2)
FROM 'orders.csv'

CSV File Best Practices

File Format

  • Use UTF-8 encoding
  • Include headers in the first row
  • Use consistent date formats (ISO 8601 recommended: YYYY-MM-DD)
  • Quote fields containing delimiters or newlines

Example Well-Formatted CSV

sales_data.csv
order_id,order_date,customer_id,product,quantity,amount
1001,2024-01-15,C001,Widget A,5,149.99
1002,2024-01-15,C002,Widget B,2,79.98
1003,2024-01-16,C001,Widget C,1,299.99
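
A short stdlib script can confirm a file follows these conventions — header row present, ISO 8601 dates, parseable decimal amounts. The column names mirror the example above; adapt them to your own files.

```python
import csv
import io
from datetime import date
from decimal import Decimal

# The well-formatted example from above, inlined for the sketch
data = """order_id,order_date,customer_id,product,quantity,amount
1001,2024-01-15,C001,Widget A,5,149.99
1002,2024-01-15,C002,Widget B,2,79.98
1003,2024-01-16,C001,Widget C,1,299.99
"""

rows = list(csv.DictReader(io.StringIO(data)))
total = Decimal("0")
for row in rows:
    date.fromisoformat(row["order_date"])  # raises ValueError if not ISO 8601
    total += Decimal(row["amount"])        # raises if not a valid decimal
print(len(rows), total)  # 3 529.96
```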

Optimize for Performance

  • Keep files under 100MB for fastest loading
  • For larger datasets, split into multiple files
  • Use consistent column types across files
  • Remove unnecessary columns
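
Splitting a large file can be done with the stdlib csv module. The sketch below writes every rows_per_file rows to a new part file, repeating the header in each so every part remains independently loadable; the part_NNN.csv naming is illustrative.

```python
import csv
import io
import tempfile
from pathlib import Path

def split_csv(text, rows_per_file, out_dir):
    """Split CSV text into part files, repeating the header in each part."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    paths = []
    for i in range(0, len(rows), rows_per_file):
        path = Path(out_dir) / f"part_{i // rows_per_file:03d}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows[i:i + rows_per_file])
        paths.append(path)
    return paths

# 5 data rows split into chunks of 2 -> 3 part files
text = "id,amount\n" + "".join(f"{n},{n * 10}\n" for n in range(5))
parts = split_csv(text, rows_per_file=2, out_dir=tempfile.mkdtemp())
print([p.name for p in parts])  # ['part_000.csv', 'part_001.csv', 'part_002.csv']
```

Because each part keeps the header, the connector can load them all at once with a wildcard such as 'part_*.csv'.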

Type Detection

DuckDB automatically detects column types:
  • Numbers: Integers and decimals
  • Dates: ISO 8601 formats (YYYY-MM-DD)
  • Timestamps: ISO 8601 with time (YYYY-MM-DD HH:MM:SS)
  • Strings: Any non-numeric data
  • Booleans: true/false, t/f, yes/no, 1/0
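
The detection rules above can be approximated in a few lines of stdlib Python. This is a simplified sketch of the idea, not DuckDB's actual sniffer — for instance, '1'/'0' fall through to INTEGER here rather than BOOLEAN, and real detection samples many rows per column.

```python
from datetime import date, datetime

def infer_type(value: str) -> str:
    """Rough mimic of CSV type sniffing: try each type in order, fall back to string."""
    if value.lower() in {"true", "false", "t", "f", "yes", "no"}:
        return "BOOLEAN"
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "DOUBLE"
    except ValueError:
        pass
    try:
        date.fromisoformat(value)   # ISO 8601 date only
        return "DATE"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(value)  # ISO 8601 date + time
        return "TIMESTAMP"
    except ValueError:
        return "VARCHAR"

sample = ["42", "3.14", "2024-01-15", "2024-01-15 10:30:00", "true", "Widget A"]
print([infer_type(v) for v in sample])
# ['INTEGER', 'DOUBLE', 'DATE', 'TIMESTAMP', 'BOOLEAN', 'VARCHAR']
```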

Override Type Detection

If auto-detection fails, cast explicitly in queries:
SELECT 
  id::INTEGER,
  created_date::DATE,
  revenue::DECIMAL(10,2)
FROM 'data.csv'

Troubleshooting

File not found
  • Ensure CSV files are in your sources directory
  • Check that file names match exactly (case-sensitive)
  • Use relative paths from the sources directory

Type errors
  • Check for inconsistent data types in columns
  • Look for non-numeric values in numeric columns
  • Use explicit type casting in queries:
    SELECT column_name::INTEGER FROM 'file.csv'

Wrong delimiter
  • Specify the delimiter explicitly:
    options: "delim=';'"
  • Common delimiters: , (comma), ; (semicolon), \t (tab), | (pipe)

Date parsing errors
  • Use ISO 8601 format (YYYY-MM-DD) when possible
  • Specify the date format if your files use a different one:
    options: "dateformat='%d/%m/%Y'"

Slow queries
  • Split large files into smaller chunks
  • Select only the columns you need:
    SELECT col1, col2 FROM 'large.csv'
  • Filter to reduce the data scanned:
    SELECT * FROM 'large.csv' WHERE date >= '2024-01-01'

Encoding problems
  • Ensure files are UTF-8 encoded
  • Convert files to UTF-8 if needed:
    iconv -f ISO-8859-1 -t UTF-8 input.csv > output.csv

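If iconv is not available, the same ISO-8859-1 to UTF-8 conversion can be done with Python's stdlib. The file names are placeholders, and the snippet writes its own Latin-1 test file so it runs standalone.

```python
import tempfile
from pathlib import Path

# Stand-in for input.csv: a Latin-1 encoded file with a non-ASCII character
src = Path(tempfile.mkdtemp()) / "input.csv"
src.write_bytes("id,name\n1,Café\n".encode("latin-1"))

# Re-encode: read as ISO-8859-1, write as UTF-8
dst = src.with_name("output.csv")
dst.write_text(src.read_text(encoding="iso-8859-1"), encoding="utf-8")
print(dst.read_text(encoding="utf-8"))  # id,name / 1,Café
```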

Converting Other Formats

Excel to CSV

Export Excel files to CSV format before using with Evidence.

TSV (Tab-Separated Values)

connection.yaml
name: tsv
type: csv
options:
  options: "delim='\t'"

Pipe-Delimited Files

connection.yaml
name: pipe_delimited
type: csv
options:
  options: "delim='|'"
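
Alternatively, instead of pointing the connector at the pipe-delimited file, you can rewrite it as a standard comma-separated file with the stdlib csv module. A minimal sketch with inlined sample data; real usage would read from and write to files.

```python
import csv
import io

# Pipe-delimited input (stand-in for a real file)
pipe_text = "order_id|product|amount\n1001|Widget A|149.99\n"

reader = csv.reader(io.StringIO(pipe_text), delimiter="|")
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")  # default delimiter is ','
writer.writerows(reader)
print(out.getvalue())  # order_id,product,amount
                       # 1001,Widget A,149.99
```

Quoting is handled automatically: a field containing a comma would be wrapped in double quotes on output.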

Migrating to Parquet

For better performance, convert CSV files to Parquet:
-- Export CSV to Parquet
COPY (SELECT * FROM 'data.csv') 
TO 'data.parquet' (FORMAT PARQUET);
Then use the DuckDB connector to query Parquet files directly:
SELECT * FROM 'data.parquet'
