
Overview

Finance Agent requires two main types of financial data:
  1. Earnings Transcripts: Quarterly earnings call transcripts stored with full text and vector embeddings
  2. 10-K Filings: Annual SEC filings with sophisticated section extraction and table identification
Both are stored in PostgreSQL with the pgvector extension for semantic search.

Prerequisites

1. Database Setup

Ensure PostgreSQL with the pgvector extension is configured:
# Set in your .env file
PG_VECTOR=postgresql://username:password@host:5432/database
2. API Keys

Configure the required API keys in .env:
OPENAI_API_KEY=sk-your-key-here  # For embeddings
API_NINJAS_KEY=your-key-here     # For downloading transcripts
3. S3 Bucket (Optional)

For full transcript storage (enables the transcript viewer):
RAILWAY_BUCKET_ENDPOINT=your-s3-endpoint
RAILWAY_BUCKET_ACCESS_KEY_ID=your-access-key
RAILWAY_BUCKET_SECRET_KEY=your-secret-key
RAILWAY_BUCKET_NAME=your-bucket-name
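Before running any ingestion script, it can save time to confirm the configuration is complete. The sketch below checks for the variable names listed above; the `check_env` helper is a hypothetical convenience for illustration, not part of the repo:

```python
import os

REQUIRED = ["PG_VECTOR", "OPENAI_API_KEY", "API_NINJAS_KEY"]
OPTIONAL_S3 = [
    "RAILWAY_BUCKET_ENDPOINT",
    "RAILWAY_BUCKET_ACCESS_KEY_ID",
    "RAILWAY_BUCKET_SECRET_KEY",
    "RAILWAY_BUCKET_NAME",
]

def check_env() -> list[str]:
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED if not os.environ.get(name)]

missing = check_env()
if missing:
    print(f"Missing required env vars: {', '.join(missing)}")
else:
    s3_ready = all(os.environ.get(name) for name in OPTIONAL_S3)
    print("Core config OK; transcript viewer", "enabled" if s3_ready else "disabled")
```

Run it once after editing .env; an empty result from `check_env()` means the required ingestion settings are in place.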

Ingesting Earnings Transcripts

Step 1: Download Transcripts

Download earnings call transcripts using API Ninjas:
cd agent/rag/data_ingestion
python download_transcripts.py --ticker AAPL
Transcripts are saved as JSON files in agent/rag/data_ingestion/earnings_transcripts/.

Step 2: Process and Store in Database

Create embeddings and store chunks in PostgreSQL:
python ingest_with_structure.py --ticker AAPL
What happens during ingestion:
  1. Text Chunking: Transcripts are split into chunks (1000 characters with 200-character overlap) for semantic search
  2. Embedding Generation: Each chunk is embedded using all-MiniLM-L6-v2 (384-dimensional vectors)
  3. Database Storage: Chunks and embeddings are stored in the transcript_chunks table with metadata (ticker, year, quarter)
  4. Hybrid Search Index: Both a vector (IVFFlat) index and a keyword (GIN) index are created for hybrid search
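The chunking step can be sketched in a few lines. The 1000-character window and 200-character overlap match the parameters described above; the function itself is an illustrative stand-in for what ingest_with_structure.py does, not its actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with a sliding overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # advance 800 chars per chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

transcript = "word " * 600  # ~3000 characters of stand-in transcript text
chunks = chunk_text(transcript)
print(f"{len(chunks)} chunks; first chunk is {len(chunks[0])} chars")
```

The overlap means each chunk repeats the last 200 characters of its predecessor, so a sentence that straddles a chunk boundary is still retrievable in full from at least one chunk.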

Step 3: Upload Full Transcripts to S3 (Optional)

For the transcript viewer feature, upload full transcripts to S3 with character offsets:
python ingest_transcripts_to_bucket.py --tickers AAPL MSFT GOOGL
This script:
  • Uploads full transcript text to S3 (transcripts/{ticker}/{year}_Q{quarter}.txt)
  • Updates complete_transcripts table with bucket_key
  • Computes char_offset and chunk_length for each chunk to enable precise navigation
Without S3 upload, the transcript viewer won’t be able to display full transcripts with highlighted chunks. The RAG system will still work using database chunks.
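Conceptually, the char_offset/chunk_length bookkeeping just locates each stored chunk inside the full transcript text. A minimal sketch (using str.find as a simplification; the real script can track offsets directly while chunking, and the helper name is hypothetical):

```python
def locate_chunks(full_text: str, chunks: list[str]) -> list[dict]:
    """Map each chunk back to its (offset, length) within the full transcript."""
    positions, cursor = [], 0
    for chunk in chunks:
        offset = full_text.find(chunk, cursor)  # search forward from the last match
        positions.append({"char_offset": offset, "chunk_length": len(chunk)})
        if offset != -1:
            cursor = offset + 1  # step past the match so overlapping chunks resolve
    return positions

text = "Operator: Good afternoon. CEO: Revenue grew 8% year over year."
spans = locate_chunks(text, ["Operator: Good afternoon.", "CEO: Revenue grew 8%"])
print(spans)
```

With these spans stored per chunk, the viewer can slice the S3 full text as `full_text[offset:offset + length]` to highlight exactly the passage the retriever matched.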

Ingesting 10-K Filings

Download and Process 10-K Filings

The 10-K ingestion pipeline performs hierarchical section extraction and table identification as it processes each filing:
cd agent/rag/data_ingestion
python ingest_10k_to_database.py --ticker AAPL

Advanced Options

# Process all US companies (9500+)
python ingest_10k_to_database.py --all-tickers --workers 8

# Resume from specific position
python ingest_10k_to_database.py --all-tickers --skip-first 100

# Limit number of tickers
python ingest_10k_to_database.py --all-tickers --max-tickers 50

# Set timeout per ticker (in minutes)
python ingest_10k_to_database.py --ticker AAPL --timeout 45

# Retry failed tickers
python ingest_10k_to_database.py --all-tickers --retry-failed

10-K Processing Pipeline

The ingestion runs the following document-processing stages:
  1. Download from SEC EDGAR: Uses the datamule library to fetch 10-K filings for the specified fiscal years
  2. Hierarchical Section Extraction: Identifies SEC sections (Item 1, Item 1A, Item 7, etc.) and preserves document structure
  3. Table Extraction: Extracts tables and identifies financial statements (balance sheet, income statement, cash flow)
  4. Contextual Chunking: Chunks preserve section context in metadata for better retrieval
  5. Embedding Generation: Creates 384-dimensional embeddings using all-MiniLM-L6-v2
  6. Database Storage: Stores results in three tables:
     • ten_k_chunks: Text chunks with embeddings and section metadata
     • ten_k_tables: Extracted tables with financial statement flags
     • complete_sec_filings: Full document text with section offsets
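To make the contextual-chunking stage concrete, each chunk record might carry its section context roughly as below. The field names mirror the ten_k_chunks schema; the builder function and its path format are illustrative assumptions, not the repo's actual code:

```python
def build_chunk_row(ticker: str, fiscal_year: int, chunk_index: int,
                    chunk_text: str, section: str, section_title: str) -> dict:
    """Assemble a 10-K chunk record with its SEC section context preserved."""
    return {
        "ticker": ticker,
        "fiscal_year": fiscal_year,
        "filing_type": "10-K",
        "chunk_index": chunk_index,
        "chunk_text": chunk_text,
        "sec_section": section,              # e.g. "item_1a"
        "sec_section_title": section_title,  # e.g. "Risk Factors"
        "chunk_type": "text",
        "path_string": f"10-K/{section}/{section_title}",  # hierarchical path
    }

row = build_chunk_row("AAPL", 2024, 0,
                      "The Company faces intense competition...",
                      "item_1a", "Risk Factors")
print(row["path_string"])
```

Keeping the section and path alongside every chunk is what lets retrieval filter or boost by section (for example, restricting a risk question to Item 1A) rather than searching the filing as undifferentiated text.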

Database Schema

transcript_chunks
CREATE TABLE transcript_chunks (
    id SERIAL PRIMARY KEY,
    ticker VARCHAR(10),
    year INTEGER,
    quarter INTEGER,
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding vector(384),
    char_offset INTEGER,        -- For S3 navigation
    chunk_length INTEGER,       -- For S3 navigation
    metadata JSONB
);
ten_k_chunks
CREATE TABLE ten_k_chunks (
    id SERIAL PRIMARY KEY,
    ticker VARCHAR(10),
    fiscal_year INTEGER,
    filing_type VARCHAR(10),
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding vector(384),
    sec_section VARCHAR(50),         -- e.g., "item_1", "item_1a"
    sec_section_title VARCHAR(200),  -- e.g., "Business", "Risk Factors"
    chunk_type VARCHAR(50),          -- e.g., "text", "table"
    path_string TEXT,                -- Hierarchical path
    metadata JSONB
);
ten_k_tables
CREATE TABLE ten_k_tables (
    id SERIAL PRIMARY KEY,
    table_id VARCHAR(100) UNIQUE,
    ticker VARCHAR(10),
    fiscal_year INTEGER,
    content TEXT,
    table_data JSONB,
    is_financial_statement BOOLEAN,
    statement_type VARCHAR(50),  -- e.g., "balance_sheet", "income_statement"
    priority VARCHAR(20)         -- e.g., "HIGH", "NORMAL"
);
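In production the vector(384) columns are compared inside PostgreSQL (pgvector's cosine-distance operator over the IVFFlat index), but the ranking it performs is ordinary cosine similarity. A sketch with toy 3-dimensional vectors standing in for the 384-dimensional all-MiniLM-L6-v2 embeddings (the example vectors and labels are invented):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.9, 0.1, 0.0]  # embedding of a hypothetical revenue question
chunks = {
    "revenue discussion": [0.8, 0.2, 0.1],
    "legal boilerplate": [0.0, 0.1, 0.9],
}
ranked = sorted(chunks, key=lambda k: cosine_similarity(query, chunks[k]),
                reverse=True)
print(ranked[0])
```

The hybrid search combines this vector ranking with keyword matches from the GIN index, so exact terms like tickers and fiscal quarters still hit even when their embeddings are not the nearest neighbors.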

Verification

After ingestion, verify the data:

Check Transcript Data

-- Count chunks per ticker
SELECT ticker, year, quarter, COUNT(*) as chunk_count
FROM transcript_chunks
GROUP BY ticker, year, quarter
ORDER BY year DESC, quarter DESC;

-- Verify embeddings
SELECT COUNT(*) as total_chunks,
       COUNT(embedding) as chunks_with_embeddings
FROM transcript_chunks;

Check 10-K Data

-- Count 10-K chunks per company
SELECT ticker, fiscal_year, COUNT(*) as chunk_count
FROM ten_k_chunks
GROUP BY ticker, fiscal_year
ORDER BY ticker, fiscal_year DESC;

-- Check financial statements
SELECT ticker, fiscal_year, statement_type, COUNT(*)
FROM ten_k_tables
WHERE is_financial_statement = true
GROUP BY ticker, fiscal_year, statement_type
ORDER BY ticker, fiscal_year DESC;
Test the RAG Agent

python -c "
import asyncio
from agent.rag.rag_agent import RAGAgent

agent = RAGAgent()
result = asyncio.run(agent.execute_rag_flow(
    'What was AAPL revenue in Q1 2025?',
    conversation_id='test'
))
print(result['answer'])
"

Troubleshooting

Common Issues:
  1. Missing pgvector extension: Run CREATE EXTENSION vector; in PostgreSQL
  2. API key errors: Verify OPENAI_API_KEY and API_NINJAS_KEY in .env
  3. Memory issues during 10-K ingestion: Reduce --workers or increase --timeout
  4. Duplicate data: The scripts automatically skip already-processed filings

Performance Tips

  • Parallel workers: Use --workers 4-8 for faster processing
  • Batch size: Process 50-100 tickers at a time for optimal memory usage
  • Resume capability: Use --skip-first N to resume interrupted batch jobs
  • Timeout: Set --timeout 45 for companies with large 10-K filings

Next Steps

After ingestion:
  1. Configure RAG settings for optimal performance
  2. Learn how to ask effective questions
  3. Set up authentication for production deployment
