
Overview

Finance Agent uses PostgreSQL with the pgvector extension for:
  • Vector embeddings storage and similarity search
  • Hybrid search (vector + keyword/TF-IDF)
  • User authentication and session management
  • Chat history and conversation threads
  • Query analytics and logging
The pgvector extension is required for Finance Agent to function. Standard PostgreSQL without pgvector will not work.
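
Hybrid search combines the two retrieval signals listed above. As an illustrative sketch only (the weighting below is a hypothetical example, not Finance Agent's actual formula), a blended score might weight a normalized vector-similarity score against a keyword/TF-IDF score:

```python
# Hypothetical sketch of hybrid-score blending; Finance Agent's actual
# weighting lives in its search code and may differ.
def hybrid_score(vector_score, keyword_score, alpha=0.7):
    """Blend a vector-similarity score with a keyword/TF-IDF score.

    Both inputs are assumed normalized to [0, 1]; alpha weights the vector side.
    """
    return alpha * vector_score + (1 - alpha) * keyword_score

# A chunk that matches both semantically and lexically outranks one that
# matches on only one signal.
print(hybrid_score(0.9, 0.8))  # strong on both signals
print(hybrid_score(0.9, 0.0))  # vector-only match ranks lower
```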

PostgreSQL Installation

macOS

# Using Homebrew
brew install postgresql@14

# Start PostgreSQL
brew services start postgresql@14

Ubuntu/Debian

# Install PostgreSQL
sudo apt update
sudo apt install postgresql postgresql-contrib

# Start PostgreSQL
sudo systemctl start postgresql
sudo systemctl enable postgresql

Windows

Download and install from postgresql.org/download/windows

pgvector Extension Setup

1. Install pgvector

macOS

# Using Homebrew
brew install pgvector

Ubuntu/Debian

# Install build dependencies (match the -dev package to your PostgreSQL major version)
sudo apt install postgresql-server-dev-14 build-essential git

# Clone and build pgvector
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
Docker

# Use the official pgvector image
docker run -d \
  --name finance-agent-db \
  -e POSTGRES_PASSWORD=changeme \
  -e POSTGRES_DB=stratalens \
  -p 5432:5432 \
  pgvector/pgvector:pg14
2. Create the database

# Connect to PostgreSQL
psql -U postgres

-- Create the database
CREATE DATABASE stratalens;

-- Connect to the database
\c stratalens

-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify the installation
SELECT * FROM pg_extension WHERE extname = 'vector';
Expected output:
  oid  | extname | extowner | extnamespace | ...
-------+---------+----------+--------------+-----
   ... | vector  |       10 |         2200 | ...
3. Create a database user (optional)

For production or multi-user setups:
-- Create user
CREATE USER finance_agent WITH PASSWORD 'secure_password_here';

-- Grant privileges
GRANT ALL PRIVILEGES ON DATABASE stratalens TO finance_agent;

-- Connect to database and grant schema privileges
\c stratalens
GRANT ALL ON SCHEMA public TO finance_agent;
4. Update environment variables

Add the database URL to your .env file:
# For default postgres user
DATABASE_URL=postgresql://postgres:changeme@localhost:5432/stratalens

# For custom user
DATABASE_URL=postgresql://finance_agent:secure_password_here@localhost:5432/stratalens
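
A malformed DATABASE_URL is a common source of startup failures. A quick sanity check with Python's standard library (the URL below mirrors the example above; substitute your own credentials):

```python
# Sanity-check a DATABASE_URL before starting the app.
from urllib.parse import urlparse

url = "postgresql://finance_agent:secure_password_here@localhost:5432/stratalens"
parts = urlparse(url)

assert parts.scheme == "postgresql"
assert parts.hostname == "localhost"
assert parts.port == 5432
assert parts.path.lstrip("/") == "stratalens"  # database name
print("DATABASE_URL looks well-formed")
```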

Schema Initialization

Finance Agent automatically creates all required tables on first startup. The schema includes:

Core Tables

Users and Authentication

  • users - User accounts with Clerk integration
  • invitations - User invitation codes and metadata
  • user_preferences - User settings and preferences

RAG System

  • transcript_chunks - Earnings transcript embeddings and metadata
    • Contains: chunk_text, embedding (vector), ticker, year, quarter, chunk_index
    • Primary data source for earnings call queries
  • ten_k_chunks - SEC 10-K filing embeddings and metadata
    • Contains: chunk_text, embedding (vector), ticker, fiscal_year, sec_section, chunk_index
    • Used for annual report and financial statement queries
  • ten_k_tables - Extracted tables from 10-K filings
    • Contains: table_data, ticker, fiscal_year, is_financial_statement, statement_type

Chat and History

  • chat_conversations - Conversation threads (like ChatGPT)
  • chat_messages - Individual messages within conversations
  • chat_history - Legacy chat history (backward compatibility)
  • chat_analytics - Query analytics and performance metrics

Analytics and Logging

  • query_history - User query history and execution details
  • search_queries - Comprehensive search query logs
  • user_usage - Usage tracking for rate limiting and billing
  • system_logs - Application-level logging
All tables are created automatically when you first run the application. You don’t need to run SQL scripts manually.

Verify Database Setup

1. Check database connection

# Test PostgreSQL is running
pg_isready

# Expected output:
# /var/run/postgresql:5432 - accepting connections
2. Verify pgvector extension

psql -U postgres -d stratalens -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"

# Expected output (your installed version may be newer):
#  extversion 
# ------------
#  0.5.0
3. Start the application

python -m uvicorn app.main:app --host 0.0.0.0 --port 8000
Check the logs for:
✅ Database pool created: min_size=10, max_size=50
✅ Usage tracking table created
✅ Chat conversations table created
✅ System logs table created
4. Verify tables were created

psql -U postgres -d stratalens -c "\dt"
You should see tables like:
  • users
  • chat_conversations
  • chat_messages
  • transcript_chunks
  • ten_k_chunks
  • query_history
  • And more…

Data Ingestion

After setting up the database, you need to ingest financial data for the RAG system to work.

Ingest Earnings Transcripts

# Download transcripts
python agent/rag/data_ingestion/download_transcripts.py

# Ingest transcripts for a specific company and date range
python agent/rag/data_ingestion/ingest_with_structure.py \
  --ticker AAPL \
  --year-start 2020 \
  --year-end 2025
This creates embeddings and stores them in the transcript_chunks table with pgvector indexes for fast similarity search.
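
pgvector's `<=>` operator computes cosine distance (1 − cosine similarity), and `ORDER BY embedding <=> query` ranks rows by it. A pure-Python miniature of that ranking (the "chunks" and their toy 3-dimensional embeddings below are illustrative, not real model output):

```python
import math

def cosine_distance(a, b):
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

# Toy rows standing in for transcript_chunks.
chunks = {
    "revenue guidance": [0.9, 0.1, 0.0],
    "supply chain":     [0.1, 0.9, 0.0],
    "buyback program":  [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

# ORDER BY embedding <=> query LIMIT 2, in miniature:
top2 = sorted(chunks, key=lambda k: cosine_distance(chunks[k], query))[:2]
print(top2)
```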

Ingest SEC 10-K Filings

# Ingest 10-K filings for a specific company
python agent/rag/data_ingestion/ingest_10k_to_database.py \
  --ticker AMZN \
  --year-start 2019 \
  --year-end 2024
The ingestion process:
  1. Downloads filings via datamule
  2. Extracts text with structural metadata (sections, tables)
  3. Generates embeddings using sentence-transformers
  4. Stores chunks in ten_k_chunks table
  5. Creates vector indexes for similarity search
Data ingestion can be resource-intensive. For large date ranges, consider running ingestion overnight or in batches.
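
Step 2 of the pipeline splits each document into overlapping chunks, each stored with its chunk_index. A minimal sketch of that idea (the character-based sizes here are assumptions; the real pipeline chunks on structural boundaries like sections):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks, returning (chunk_index, chunk_text) pairs.

    Sizes are illustrative only; overlap preserves context across chunk borders.
    """
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        chunks.append((i, text[start:start + chunk_size]))
    return chunks

doc = "x" * 500
pieces = chunk_text(doc)
print(len(pieces))  # a few overlapping windows covering the document
```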

Database Indexes

Finance Agent automatically creates indexes for optimal performance:

Vector Indexes (pgvector)

-- Automatically created during ingestion
CREATE INDEX ON transcript_chunks USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX ON ten_k_chunks USING ivfflat (embedding vector_cosine_ops);
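
The statements above use the default number of ivfflat lists. If you ever tune this manually, the pgvector documentation's rule of thumb is rows / 1000 for up to about one million rows, and sqrt(rows) beyond that:

```python
import math

def ivfflat_lists(row_count):
    """pgvector's documented rule of thumb for the ivfflat `lists` parameter:
    rows / 1000 up to ~1M rows, sqrt(rows) above that (minimum of 1)."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))

print(ivfflat_lists(100_000))    # 100
print(ivfflat_lists(4_000_000))  # 2000
```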

B-tree Indexes

-- User and authentication indexes
CREATE INDEX idx_users_clerk_id ON users(clerk_user_id);
CREATE INDEX idx_users_username ON users(username);
CREATE INDEX idx_users_email ON users(email);

-- Chat and analytics indexes
CREATE INDEX idx_chat_messages_conversation_id ON chat_messages(conversation_id);
CREATE INDEX idx_chat_analytics_created_at ON chat_analytics(created_at);
CREATE INDEX idx_query_history_user_id ON query_history(user_id);
All indexes are created automatically by the application. You don’t need to create them manually.

Connection Pool Configuration

Finance Agent uses connection pooling for optimal performance:

Development Settings

# From config.py
DEVELOPMENT_MIN_SIZE = 10  # Minimum connections
DEVELOPMENT_MAX_SIZE = 50  # Maximum connections
DEVELOPMENT_COMMAND_TIMEOUT = 30  # seconds
DEVELOPMENT_TIMEOUT = 20  # seconds

Production Settings

# From config.py
PRODUCTION_MIN_SIZE = 5   # Minimum connections
PRODUCTION_MAX_SIZE = 30  # Maximum connections
PRODUCTION_COMMAND_TIMEOUT = 20  # seconds
PRODUCTION_TIMEOUT = 15  # seconds
The application automatically selects the appropriate configuration based on the ENVIRONMENT variable.
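
A sketch of what that environment-based selection might look like (hypothetical; the actual logic in config.py may differ in detail):

```python
import os

# Mirror the settings documented above, keyed by environment name.
POOL_SETTINGS = {
    "development": {"min_size": 10, "max_size": 50, "command_timeout": 30, "timeout": 20},
    "production":  {"min_size": 5,  "max_size": 30, "command_timeout": 20, "timeout": 15},
}

def pool_config(env=None):
    """Return pool settings for the given environment, or read ENVIRONMENT."""
    env = (env or os.getenv("ENVIRONMENT", "development")).lower()
    # Fall back to development settings for unrecognized environments.
    return POOL_SETTINGS.get(env, POOL_SETTINGS["development"])

print(pool_config("production")["max_size"])  # 30
```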

Troubleshooting

pgvector Extension Not Found

Error: ERROR: extension "vector" does not exist

Solution:
  1. Verify pgvector is installed: ls $(pg_config --pkglibdir) | grep vector
  2. Reinstall pgvector (see installation steps above)
  3. Restart PostgreSQL: sudo systemctl restart postgresql

Connection Pool Exhausted

Error: asyncpg.exceptions.TooManyConnectionsError

Solution:
  1. Increase max_connections in PostgreSQL:
    -- Check current setting
    SHOW max_connections;
    
    -- Edit postgresql.conf
    max_connections = 100
    
  2. Restart PostgreSQL
  3. Adjust pool sizes in config.py if needed

Slow Vector Queries

Symptom: Vector similarity searches taking >1 second

Solution:
  1. Ensure vector indexes exist:
    SELECT indexname FROM pg_indexes WHERE tablename = 'transcript_chunks';
    
  2. Rebuild indexes if needed:
    REINDEX INDEX CONCURRENTLY transcript_chunks_embedding_idx;
    
  3. Enable debug mode to see query plans:
    RAG_DEBUG_MODE=true
    

Database Locks

Error: deadlock detected

Solution:
  1. Check for long-running queries:
    SELECT pid, now() - query_start as duration, query 
    FROM pg_stat_activity 
    WHERE state = 'active' AND now() - query_start > interval '5 minutes';
    
  2. Kill problematic queries if needed, using the pid returned by the previous query:
    SELECT pg_terminate_backend(<pid>);

Production Recommendations

These settings are critical for production deployments.

Security

  1. Use strong passwords for database users
  2. Enable SSL/TLS for database connections:
    DATABASE_URL=postgresql://user:pass@host:5432/db?sslmode=require
    
  3. Restrict network access using firewall rules
  4. Enable connection limits per user:
    ALTER USER finance_agent CONNECTION LIMIT 50;
    

Performance

  1. Enable query logging for monitoring:
    # In postgresql.conf
    log_min_duration_statement = 1000  # log queries taking longer than 1 second
    
  2. Configure shared_buffers (25% of RAM):
    shared_buffers = 4GB
    
  3. Enable autovacuum for maintenance:
    autovacuum = on
    
  4. Monitor table bloat and run VACUUM regularly

Backup

  1. Set up automated backups:
    # Daily backup
    pg_dump -U postgres stratalens > backup_$(date +%Y%m%d).sql
    
  2. Use point-in-time recovery (PITR) with WAL archiving
  3. Test restore procedures regularly
