Overview
Finance Agent uses PostgreSQL with pgvector for storing embeddings and metadata, while full documents are stored in Railway S3. This hybrid approach balances search performance with cost-effective storage.Schema Tables
Finance Agent maintains three core tables:transcript_chunks- Earnings call transcript embeddingsten_k_chunks- SEC 10-K filing text embeddingsten_k_tables- SEC 10-K extracted financial tables
Transcript Chunks
Schema: transcript_chunks
Stores chunked earnings call transcripts with vector embeddings for semantic search.
Key Fields
| Field | Type | Description |
|---|---|---|
chunk_text | TEXT | Transcript segment (max 1000 characters, 200 overlap) |
embedding | VECTOR(384) | Dense vector from all-MiniLM-L6-v2 model |
ticker | VARCHAR(10) | Company stock ticker (e.g., “AAPL”, “MSFT”) |
year | INTEGER | Calendar year of earnings call (e.g., 2024) |
quarter | INTEGER | Fiscal quarter (1-4) |
chunk_index | INTEGER | Position in original transcript for ordering |
speaker | VARCHAR(255) | Speaker name (CEO, CFO, analyst) |
metadata | JSONB | Flexible JSON storage for additional context |
Chunking Strategy
- Chunk Size: 1000 characters
- Overlap: 200 characters between consecutive chunks
- Embedding Model:
all-MiniLM-L6-v2(384 dimensions) - Purpose: Balance between semantic coherence and search granularity
Search Strategy
Hybrid Search (70% vector + 30% keyword):Data Coverage
- Time Range: 2020-2025
- Companies: S&P 500 constituents + major tech/finance companies
- Source: Public earnings call transcripts
SEC 10-K Chunks
Schema: ten_k_chunks
Stores chunked SEC 10-K annual filing text with section metadata for targeted retrieval.
Key Fields
| Field | Type | Description |
|---|---|---|
chunk_text | TEXT | SEC filing text segment |
embedding | VECTOR(384) | Dense vector from all-MiniLM-L6-v2 |
ticker | VARCHAR(10) | Company stock ticker |
fiscal_year | INTEGER | Fiscal year of 10-K filing (e.g., 2024) |
sec_section | VARCHAR(50) | Section identifier (see SEC Sections below) |
sec_section_title | VARCHAR(255) | Human-readable section name |
is_financial_statement | BOOLEAN | Indicates if chunk contains financial data |
chunk_index | INTEGER | Order within section |
metadata | JSONB | Additional filing metadata |
SEC 10-K Sections
The SEC 10-K agent routes queries to specific sections based on question intent:| Section ID | Section Title | Content |
|---|---|---|
item_1 | Business | Company description, products, services, operations |
item_1a | Risk Factors | Business risks, legal, regulatory, market risks |
item_1b | Unresolved Staff Comments | SEC comments (usually empty) |
item_2 | Properties | Real estate, facilities, locations |
item_3 | Legal Proceedings | Lawsuits, regulatory actions |
item_4 | Mine Safety Disclosures | Mining operations (rarely applicable) |
item_5 | Market for Registrant’s Common Equity | Stock performance, dividends |
item_6 | Reserved | (Previously “Selected Financial Data”, now deprecated) |
item_7 | Management’s Discussion and Analysis (MD&A) | Financial results, trends, outlook |
item_7a | Quantitative and Qualitative Disclosures About Market Risk | Market risk exposures |
item_8 | Financial Statements and Supplementary Data | Balance sheet, income statement, cash flow |
item_9 | Changes in and Disagreements With Accountants | Auditor changes (rare) |
item_9a | Controls and Procedures | Internal controls |
item_9b | Other Information | Miscellaneous disclosures |
item_10 | Directors, Executive Officers and Corporate Governance | Board, executives, governance |
item_11 | Executive Compensation | CEO/executive pay, stock awards |
item_12 | Security Ownership of Certain Beneficial Owners | Major shareholders |
item_13 | Certain Relationships and Related Transactions | Related-party transactions |
item_14 | Principal Accounting Fees and Services | Auditor fees |
item_15 | Exhibits, Financial Statement Schedules | Exhibits list |
SEC Agent Retrieval Strategy
The SEC 10-K agent uses planning-driven retrieval with section routing:- Planning Phase: Generate sub-questions for targeted retrieval
- Section Routing: LLM determines relevant sections (e.g., Item 11 for executive compensation)
- Parallel Search: Execute searches across sections concurrently
- Table Selection: LLM selects relevant financial tables from
ten_k_tables - Hybrid Search: Combine TF-IDF + semantic search with cross-encoder reranking
- Iterative Improvement: Re-plan and search if answer quality < 90%
Data Coverage
- Time Range: 2018-2025 (currently 2024-25 loaded)
- Filing Type: 10-K only (10-Q and 8-K coming soon)
- Companies: Major public companies
SEC 10-K Tables
Schema: ten_k_tables
Stores extracted financial tables from SEC 10-K filings in structured JSONB format.
Key Fields
| Field | Type | Description |
|---|---|---|
ticker | VARCHAR(10) | Company stock ticker |
fiscal_year | INTEGER | Fiscal year of 10-K filing |
table_index | INTEGER | Position in original document |
content | JSONB | Structured table data (headers + rows) |
statement_type | VARCHAR(100) | Financial statement type (see below) |
is_financial_statement | BOOLEAN | True for core statements (Item 8) |
sec_section | VARCHAR(50) | Source section (usually item_8) |
metadata | JSONB | Additional table metadata |
Statement Types
| Type | Description |
|---|---|
income_statement | Revenue, expenses, net income |
balance_sheet | Assets, liabilities, shareholders’ equity |
cash_flow | Operating, investing, financing cash flows |
shareholders_equity | Changes in equity accounts |
segment_reporting | Business segment breakdowns |
other | Supplementary tables |
Table Selection Process
The SEC agent uses LLM-based table selection when questions require financial data:- Question: “What was Apple’s inventory in 2023?”
- Selected Tables: Balance sheet tables containing “inventory” line items
- Result: Precise extraction without irrelevant tables
Vector Search Configuration
pgvector Setup
Distance Metrics
- Cosine Similarity (
<=>) - Default for semantic search - L2 Distance (
<->) - Euclidean distance (alternative) - Inner Product (
<#>) - Dot product (for normalized vectors)
Embedding Model
Model:all-MiniLM-L6-v2 (via sentence-transformers)
- Dimensions: 384
- Speed: ~5ms per encoding on CPU
- Quality: Strong performance on semantic similarity tasks
- License: Apache 2.0 (open source)
Data Ingestion
For detailed instructions on populating these tables, see:- agent/rag/data_ingestion/README.md - Complete ingestion pipeline documentation
Query Examples
Hybrid Transcript Search
SEC Section-Targeted Search
Financial Table Retrieval
Performance Considerations
- Index Maintenance: Rebuild IVFFlat indexes after bulk inserts
- Connection Pooling: Use
asyncpgwith connection pool (10-20 connections) - Query Limits: Cap
top_kat 50 to prevent slow queries - JSONB Indexes: GIN indexes on
metadataandcontentfor fast JSON queries - Partitioning: Consider table partitioning by year for very large datasets
Next Steps
Project Structure
Understand the codebase organization
Tech Stack
Learn about the technologies powering Finance Agent