Finance Agent implements a sophisticated Retrieval-Augmented Generation (RAG) pipeline that combines semantic search, hybrid retrieval, and iterative self-improvement to deliver accurate, well-sourced financial analysis.

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language models by:
  1. Retrieving relevant information from a knowledge base
  2. Augmenting the LLM prompt with retrieved context
  3. Generating responses grounded in actual data
This approach provides several advantages over pure LLMs:

Factual Accuracy

Responses are grounded in real financial data, not LLM hallucinations

Citations

Every claim can be traced back to source documents

Up-to-date

Knowledge base can be updated without retraining the model

Domain-Specific

Specialized for financial analysis with structured data

Hybrid Search Strategy

Finance Agent uses hybrid search that combines two complementary approaches:

Vector Search (70% weight)

Semantic similarity using embeddings:
  • Model: all-MiniLM-L6-v2 (384 dimensions)
  • Database: PostgreSQL with pgvector extension
  • Similarity: Cosine distance between query and document embeddings
  • Advantages: Understands meaning, handles synonyms, works with natural language
# From agent/rag/search_engine.py:93-95
query_embedding = self.embedding_model.encode([query])
vector_results = self.database_manager._search_postgres_with_ticker(
    query_embedding, ticker, search_quarter
)

Keyword Search (30% weight)

Traditional text matching using TF-IDF:
  • Method: PostgreSQL full-text search with ts_rank
  • Preprocessing: Extract keywords, build search vectors
  • Advantages: Exact phrase matching, handles technical terms, fast execution
# Keyword search complements vector search
keyword_results = self._search_keywords_with_ticker(
    query, ticker, search_quarter
)

Why Hybrid?

Combining both approaches provides the best of both worlds:
Vector search contributes:
  • Understands “capex” and “capital expenditures” are related
  • Handles questions phrased differently than source text
  • Captures semantic meaning beyond exact words
Keyword search contributes:
  • Finds exact numbers and technical terms (“$2.5B”, “EBITDA”)
  • Better for precise phrase matching
  • Faster execution on large datasets
Together, the hybrid delivers:
  • Higher recall: finds more relevant chunks
  • Better precision: ranks the most relevant chunks first
  • Robustness to different query styles
Configuration: The 70/30 split is configurable in rag/config.py:
"vector_weight": 0.7,
"keyword_weight": 0.3,

Database Schema

The RAG system uses PostgreSQL with the pgvector extension:
-- Earnings call transcripts
CREATE TABLE transcript_chunks (
    chunk_text TEXT,              -- 1000 chars max, 200 overlap
    embedding VECTOR(384),        -- all-MiniLM-L6-v2 embeddings
    ticker VARCHAR,               -- e.g., "AAPL"
    year INTEGER,                 -- e.g., 2024
    quarter INTEGER,              -- 1-4
    metadata JSONB
);

-- 10-K filing text chunks
CREATE TABLE ten_k_chunks (
    chunk_text TEXT,
    embedding VECTOR(384),
    sec_section VARCHAR,          -- item_1, item_7, item_8, etc.
    sec_section_title TEXT,       -- Human-readable section name
    is_financial_statement BOOLEAN
);

-- 10-K extracted tables
CREATE TABLE ten_k_tables (
    content JSONB,                -- Table data
    statement_type VARCHAR,       -- income_statement, balance_sheet, etc.
    is_financial_statement BOOLEAN
);
Chunking strategy: Text is split into 1000-character chunks with 200-character overlap to ensure context continuity across chunk boundaries.
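The overlapping-window strategy above can be sketched in a few lines; `chunk_text` is an illustrative helper, not the pipeline's actual function.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks whose tails repeat in the next
    chunk's head, so no sentence is stranded at a chunk boundary."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # advance 800 chars per chunk by default
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk already covers the remaining text
    return chunks
```

With the defaults, the last 200 characters of each chunk reappear at the start of the next one.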

Parallel Retrieval

For performance, Finance Agent executes searches in parallel:

Multi-Ticker Parallelization

When comparing multiple companies:
Input: "Compare $AAPL and $MSFT revenue"

Process:
├── Rephrase for AAPL: "revenue and sales performance"
├── Rephrase for MSFT: "revenue and sales performance"
├── Search AAPL chunks (parallel)
├── Search MSFT chunks (parallel)
├── Synthesis prompt combines both
└── Output: Comparative analysis

Multi-Quarter Parallelization

For time-range queries:
Example: "last 3 quarters" = [2024_q4, 2025_q1, 2025_q2]
Follow-up phrases: ["capex guidance", "AI investments", "margin trends"]

Execution:
├── "capex guidance" → searches 2024_q4, 2025_q1, 2025_q2 (parallel)
├── "AI investments" → searches 2024_q4, 2025_q1, 2025_q2 (parallel)
└── "margin trends" → searches 2024_q4, 2025_q1, 2025_q2 (parallel)

Result: All chunks deduped by citation, merged into context
Parallel execution uses a ThreadPoolExecutor with up to 10 workers, so searches across quarters and tickers overlap rather than run sequentially.
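The fan-out-then-dedupe pattern above can be sketched as follows. Everything here is illustrative: `search_all_quarters`, the `search_fn` callable, and the `"citation"` dedupe key are assumptions about the interface, not the pipeline's real API.

```python
from concurrent.futures import ThreadPoolExecutor

def search_all_quarters(search_fn, phrase, quarters, max_workers=10):
    """Run one keyword phrase against every target quarter in parallel,
    then merge results, keeping the first chunk seen per citation.

    search_fn(phrase, quarter) -> list of chunk dicts with a "citation" key.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_quarter = list(pool.map(lambda q: search_fn(phrase, q), quarters))
    merged, seen = [], set()
    for chunks in per_quarter:
        for chunk in chunks:
            if chunk["citation"] not in seen:  # dedupe across quarters
                seen.add(chunk["citation"])
                merged.append(chunk)
    return merged
```

Each follow-up phrase would get one such fan-out, and the deduped lists are concatenated into the final context.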

Iterative Improvement Loop

The agent doesn’t just generate one answer and stop. It performs iterative self-improvement until quality thresholds are met:
┌─────────────────────────────────────────────────────────────────┐
│                    ITERATION LOOP                                │
│                                                                  │
│  ┌──────────────────┐                                           │
│  │ Generate Answer  │◄──────────────────────────────────┐       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐                                   │       │
│  │ Evaluate Quality │                                   │       │
│  │ • completeness   │                                   │       │
│  │ • specificity    │                                   │       │
│  │ • accuracy       │                                   │       │
│  │ • vs. reasoning  │ ← Checks if reasoning goals met   │       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐    YES    ┌─────────────────┐    │       │
│  │ Confidence < 90% │─────────► │ Search for more │────┘       │
│  │ & iterations left│           │ context (tools) │            │
│  └────────┬─────────┘           └─────────────────┘            │
│           │ NO                                                  │
│           ▼                                                     │
│     ┌───────────┐                                               │
│     │  OUTPUT   │                                               │
│     └───────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

Evaluation Metrics

The agent evaluates each answer on four dimensions (0-100 scale):
Metric         What It Measures
Completeness   Does the answer fully address the question?
Specificity    Does it include specific numbers, quotes, dates?
Accuracy       Is the information factually correct?
Clarity        Is the response well-structured and readable?
These scores are combined into an overall confidence (0-1 scale).
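One natural way to collapse four 0-100 scores into a 0-1 confidence is a mean; note this is an assumption for illustration, since the real pipeline may weight the metrics unevenly.

```python
def overall_confidence(scores):
    """Collapse per-metric scores (0-100) into a single 0-1 confidence.

    Assumes an unweighted mean; the actual evaluator may weight metrics.
    """
    metrics = ("completeness", "specificity", "accuracy", "clarity")
    return sum(scores[m] for m in metrics) / (len(metrics) * 100.0)
```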

Iteration Actions

During iteration, the agent can:
1. Generate Follow-up Keywords
   Create search-optimized keyword phrases (NOT verbose questions) for missing information.
   Old approach (verbose questions):
   ❌ "What specific revenue growth percentage was reported?"
   New approach (search-optimized keywords):
   ✅ "revenue growth percentage quarter comparison"
   ✅ "capex guidance 2025 AI allocation"
2. Request Transcript Search
   Set needs_transcript_search=true to search earnings call transcripts.
3. Request News Search
   Set needs_news_search=true to search for recent developments.
4. Search All Quarters
   Each keyword phrase searches ALL target quarters in parallel for comprehensive coverage.
5. Regenerate Answer
   Build an improved answer with expanded context from all sources.

Stop Conditions

Iteration stops when any of these conditions are met:
  1. Confidence ≥ threshold (varies by answer mode: 70-95%)
  2. Max iterations reached (2-10 depending on answer mode)
  3. Agent decides answer is sufficient (explicit satisfaction signal)
  4. No follow-up keyword phrases generated (nothing left to search)
The agent automatically adjusts iteration depth based on question complexity (answer mode: direct/standard/detailed).
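The four stop conditions combine with a simple OR. The predicate below is an illustrative sketch; `should_stop` and its parameter names are not the pipeline's actual identifiers.

```python
def should_stop(confidence, iteration, max_iterations,
                threshold, satisfied, follow_up_phrases):
    """True when any stop condition listed above is met."""
    return (
        confidence >= threshold          # 1. quality threshold reached
        or iteration >= max_iterations   # 2. iteration budget exhausted
        or satisfied                     # 3. explicit satisfaction signal
        or not follow_up_phrases         # 4. nothing left to search
    )
```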

Answer Generation

Once sufficient context is retrieved, the agent generates answers using:

Single-Ticker Responses

For questions about one company:
# From agent/rag/response_generator.py
def generate_openai_response(question, chunks, metadata):
    """
    Generate response with:
    - Company-specific context
    - Quarter metadata preserved
    - All financial figures included
    - Citation markers [1], [2], etc.
    """

Multi-Ticker Synthesis

For comparative questions:
def generate_multi_ticker_response(question, ticker_results):
    """
    Synthesis requirements:
    • ALWAYS maintain period metadata (Q1 2025, FY 2024)
    • ALWAYS include ALL financial figures from ALL sources
    • Show trends and comparisons across companies
    • Use human-friendly format: "Q1 2025" not "2025_q1"
    """

Citation System

Every claim is backed by citations:
  • Transcript citations: [1], [2], [3]
  • 10-K citations: [10K1], [10K2], [10K3]
  • News citations: [N1], [N2], [N3]
Citations include:
  • Source type (earnings call, 10-K filing, news article)
  • Company ticker
  • Time period (Q1 2025, FY 2024, date)
  • URL or document reference
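The three marker schemes can be produced by one small formatter. This is a sketch: `citation_marker` and its source-type keys are hypothetical, chosen to match the prefixes listed above.

```python
def citation_marker(source_type, index):
    """Build the inline citation marker for a given source type.

    Transcripts get bare numbers, 10-K filings a "10K" prefix,
    news articles an "N" prefix.
    """
    prefixes = {"transcript": "", "ten_k": "10K", "news": "N"}
    return f"[{prefixes[source_type]}{index}]"
```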

Configuration

The RAG pipeline is highly configurable:
{
    # Retrieval settings
    "chunks_per_quarter": 15,      # Results per quarter
    "max_quarters": 12,            # Max 3 years of data
    "max_tickers": 8,              # Max companies per query

    # Hybrid search weights
    "keyword_weight": 0.3,
    "vector_weight": 0.7,

    # Models
    "cerebras_model": "qwen-3-235b-a22b-instruct-2507",
    "openai_model": "gpt-5-nano-2025-08-07",
    "evaluation_model": "qwen-3-235b-a22b-instruct-2507",
    "embedding_model": "all-MiniLM-L6-v2",
    "llm_provider": "cerebras",  # or "openai" | "auto"

    # Chunking
    "chunk_size": 1000,
    "chunk_overlap": 200,
}
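Since the hybrid weights must sum to 1.0 and the chunk overlap must be smaller than the chunk size, a startup sanity check is a natural companion to this config. The validator below is an illustrative sketch, not part of rag/config.py.

```python
def validate_rag_config(config):
    """Sanity-check the invariants implied by the config above."""
    if abs(config["vector_weight"] + config["keyword_weight"] - 1.0) > 1e-9:
        raise ValueError("hybrid search weights must sum to 1.0")
    if config["chunk_overlap"] >= config["chunk_size"]:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return True
```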

Performance Optimizations

  • Parallel hybrid search: Vector and keyword searches run in parallel using ThreadPoolExecutor for roughly a 2x speedup.
  • Async query embedding: Query embeddings are generated asynchronously to avoid blocking the event loop:
async def encode_query_async(self, query: str) -> np.ndarray:
    # get_running_loop() is preferred inside a coroutine
    # (get_event_loop() is deprecated in this context)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        self._executor, self.embedding_model.encode, [query]
    )
  • Connection reuse: Database connections are reused across requests to avoid connection overhead.
  • Deduplication: Duplicate chunks across multiple searches are removed to reduce context size.
  • Early stopping: Iteration stops as soon as confidence thresholds are met, avoiding unnecessary LLM calls.

Next Steps

Data Sources

Learn about the three specialized data source tools

Architecture

Understand the complete six-stage pipeline
