Finance Agent implements a sophisticated Retrieval-Augmented Generation (RAG) pipeline that combines semantic search, hybrid retrieval, and iterative self-improvement to deliver accurate, well-sourced financial analysis.

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language models by:
  1. Retrieving relevant information from a knowledge base
  2. Augmenting the LLM prompt with retrieved context
  3. Generating responses grounded in actual data
This approach provides several advantages over pure LLMs:

Factual Accuracy

Responses are grounded in real financial data, not LLM hallucinations

Citations

Every claim can be traced back to source documents

Up-to-date

Knowledge base can be updated without retraining the model

Domain-Specific

Specialized for financial analysis with structured data

Hybrid Search Strategy

Finance Agent uses hybrid search that combines two complementary approaches:

Vector Search (70% weight)

Semantic similarity using embeddings:
  • Model: all-MiniLM-L6-v2 (384 dimensions)
  • Database: PostgreSQL with pgvector extension
  • Similarity: Cosine distance between query and document embeddings
  • Advantages: Understands meaning, handles synonyms, works with natural language
# From agent/rag/search_engine.py:93-95
query_embedding = self.embedding_model.encode([query])
vector_results = self.database_manager._search_postgres_with_ticker(
    query_embedding, ticker, search_quarter
)

Keyword Search (30% weight)

Traditional text matching using TF-IDF:
  • Method: PostgreSQL full-text search with ts_rank
  • Preprocessing: Extract keywords, build search vectors
  • Advantages: Exact phrase matching, handles technical terms, fast execution
# Keyword search complements vector search
keyword_results = self._search_keywords_with_ticker(
    query, ticker, search_quarter
)

Why Hybrid?

Combining both approaches provides the best of both worlds:
Vector search contributes:
  • Understands “capex” and “capital expenditures” are related
  • Handles questions phrased differently than source text
  • Captures semantic meaning beyond exact words
Keyword search contributes:
  • Finds exact numbers and technical terms (“$2.5B”, “EBITDA”)
  • Better for precise phrase matching
  • Faster execution on large datasets
Together, the hybrid delivers:
  • Higher recall: finds more relevant chunks
  • Better precision: ranks the most relevant chunks first
  • Robustness to different query styles
Configuration: The 70/30 split is configurable in rag/config.py:
"vector_weight": 0.7,
"keyword_weight": 0.3,

Database Schema

The RAG system uses PostgreSQL with the pgvector extension:
-- Earnings call transcripts
CREATE TABLE transcript_chunks (
    chunk_text TEXT,              -- 1000 chars max, 200 overlap
    embedding VECTOR(384),        -- all-MiniLM-L6-v2 embeddings
    ticker VARCHAR,               -- e.g., "AAPL"
    year INTEGER,                 -- e.g., 2024
    quarter INTEGER,              -- 1-4
    metadata JSONB
);

-- 10-K filing text chunks
CREATE TABLE ten_k_chunks (
    chunk_text TEXT,
    embedding VECTOR(384),
    sec_section VARCHAR,          -- item_1, item_7, item_8, etc.
    sec_section_title TEXT,       -- Human-readable section name
    is_financial_statement BOOLEAN
);

-- 10-K extracted tables
CREATE TABLE ten_k_tables (
    content JSONB,                -- Table data
    statement_type VARCHAR,       -- income_statement, balance_sheet, etc.
    is_financial_statement BOOLEAN
);
Chunking strategy: Text is split into 1000-character chunks with 200-character overlap to ensure context continuity across chunk boundaries.
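The overlapping-window strategy above can be sketched in a few lines; `chunk_text` is an illustrative helper, not the pipeline's actual function.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks whose tails repeat in the next
    chunk's head, so no sentence is stranded at a chunk boundary."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # advance 800 chars per chunk by default
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk already covers the remaining text
    return chunks
```

With the defaults, the last 200 characters of each chunk reappear at the start of the next one.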

Parallel Retrieval

For performance, Finance Agent executes searches in parallel:

Multi-Ticker Parallelization

When comparing multiple companies:
Input: "Compare $AAPL and $MSFT revenue"

Process:
├── Rephrase for AAPL: "revenue and sales performance"
├── Rephrase for MSFT: "revenue and sales performance"
├── Search AAPL chunks (parallel)
├── Search MSFT chunks (parallel)
├── Synthesis prompt combines both
└── Output: Comparative analysis

Multi-Quarter Parallelization

For time-range queries:
Example: "last 3 quarters" = [2024_q4, 2025_q1, 2025_q2]
Follow-up phrases: ["capex guidance", "AI investments", "margin trends"]

Execution:
├── "capex guidance" → searches 2024_q4, 2025_q1, 2025_q2 (parallel)
├── "AI investments" → searches 2024_q4, 2025_q1, 2025_q2 (parallel)
└── "margin trends" → searches 2024_q4, 2025_q1, 2025_q2 (parallel)

Result: All chunks deduped by citation, merged into context
Parallel execution uses a ThreadPoolExecutor with up to 10 workers, so searches across quarters and tickers overlap rather than run sequentially.
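The fan-out-then-dedupe pattern above can be sketched as follows. Everything here is illustrative: `search_all_quarters`, the `search_fn` callable, and the `"citation"` dedupe key are assumptions about the interface, not the pipeline's real API.

```python
from concurrent.futures import ThreadPoolExecutor

def search_all_quarters(search_fn, phrase, quarters, max_workers=10):
    """Run one keyword phrase against every target quarter in parallel,
    then merge results, keeping the first chunk seen per citation.

    search_fn(phrase, quarter) -> list of chunk dicts with a "citation" key.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_quarter = list(pool.map(lambda q: search_fn(phrase, q), quarters))
    merged, seen = [], set()
    for chunks in per_quarter:
        for chunk in chunks:
            if chunk["citation"] not in seen:  # dedupe across quarters
                seen.add(chunk["citation"])
                merged.append(chunk)
    return merged
```

Each follow-up phrase would get one such fan-out, and the deduped lists are concatenated into the final context.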

Iterative Improvement Loop

The agent doesn’t just generate one answer and stop. It performs iterative self-improvement until quality thresholds are met:
┌─────────────────────────────────────────────────────────────────┐
│                    ITERATION LOOP                                │
│                                                                  │
│  ┌──────────────────┐                                           │
│  │ Generate Answer  │◄──────────────────────────────────┐       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐                                   │       │
│  │ Evaluate Quality │                                   │       │
│  │ • completeness   │                                   │       │
│  │ • specificity    │                                   │       │
│  │ • accuracy       │                                   │       │
│  │ • vs. reasoning  │ ← Checks if reasoning goals met   │       │
│  └────────┬─────────┘                                   │       │
│           │                                             │       │
│           ▼                                             │       │
│  ┌──────────────────┐    YES    ┌─────────────────┐    │       │
│  │ Confidence < 90% │─────────► │ Search for more │────┘       │
│  │ & iterations left│           │ context (tools) │            │
│  └────────┬─────────┘           └─────────────────┘            │
│           │ NO                                                  │
│           ▼                                                     │
│     ┌───────────┐                                               │
│     │  OUTPUT   │                                               │
│     └───────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

Evaluation Metrics

The agent evaluates each answer on four dimensions (0-100 scale):
Metric         What It Measures
Completeness   Does the answer fully address the question?
Specificity    Does it include specific numbers, quotes, dates?
Accuracy       Is the information factually correct?
Clarity        Is the response well-structured and readable?
These scores are combined into an overall confidence (0-1 scale).
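One natural way to collapse four 0-100 scores into a 0-1 confidence is a mean; note this is an assumption for illustration, since the real pipeline may weight the metrics unevenly.

```python
def overall_confidence(scores):
    """Collapse per-metric scores (0-100) into a single 0-1 confidence.

    Assumes an unweighted mean; the actual evaluator may weight metrics.
    """
    metrics = ("completeness", "specificity", "accuracy", "clarity")
    return sum(scores[m] for m in metrics) / (len(metrics) * 100.0)
```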

Iteration Actions

During iteration, the agent can:
1. Generate Follow-up Keywords
   Create search-optimized keyword phrases (NOT verbose questions) for missing information.
   Old approach (verbose questions):
   ❌ "What specific revenue growth percentage was reported?"
   New approach (search-optimized keywords):
   ✅ "revenue growth percentage quarter comparison"
   ✅ "capex guidance 2025 AI allocation"
2. Request Transcript Search
   Set needs_transcript_search=true to search earnings call transcripts.
3. Request News Search
   Set needs_news_search=true to search for recent developments.
4. Search All Quarters
   Each keyword phrase searches ALL target quarters in parallel for comprehensive coverage.
5. Regenerate Answer
   Build an improved answer with expanded context from all sources.

Stop Conditions

Iteration stops when any of these conditions are met:
  1. Confidence ≥ threshold (varies by answer mode: 70-95%)
  2. Max iterations reached (2-10 depending on answer mode)
  3. Agent decides answer is sufficient (explicit satisfaction signal)
  4. No follow-up keyword phrases generated (nothing left to search)
The agent automatically adjusts iteration depth based on question complexity (answer mode: direct/standard/detailed).
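The four stop conditions combine with a simple OR. The predicate below is an illustrative sketch; `should_stop` and its parameter names are not the pipeline's actual identifiers.

```python
def should_stop(confidence, iteration, max_iterations,
                threshold, satisfied, follow_up_phrases):
    """True when any stop condition listed above is met."""
    return (
        confidence >= threshold          # 1. quality threshold reached
        or iteration >= max_iterations   # 2. iteration budget exhausted
        or satisfied                     # 3. explicit satisfaction signal
        or not follow_up_phrases         # 4. nothing left to search
    )
```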

Answer Generation

Once sufficient context is retrieved, the agent generates answers using:

Single-Ticker Responses

For questions about one company:
# From agent/rag/response_generator.py
def generate_openai_response(question, chunks, metadata):
    """
    Generate response with:
    - Company-specific context
    - Quarter metadata preserved
    - All financial figures included
    - Citation markers [1], [2], etc.
    """

Multi-Ticker Synthesis

For comparative questions:
def generate_multi_ticker_response(question, ticker_results):
    """
    Synthesis requirements:
    • ALWAYS maintain period metadata (Q1 2025, FY 2024)
    • ALWAYS include ALL financial figures from ALL sources
    • Show trends and comparisons across companies
    • Use human-friendly format: "Q1 2025" not "2025_q1"
    """

Citation System

Every claim is backed by citations:
  • Transcript citations: [1], [2], [3]
  • 10-K citations: [10K1], [10K2], [10K3]
  • News citations: [N1], [N2], [N3]
Citations include:
  • Source type (earnings call, 10-K filing, news article)
  • Company ticker
  • Time period (Q1 2025, FY 2024, date)
  • URL or document reference
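The three marker schemes can be produced by one small formatter. This is a sketch: `citation_marker` and its source-type keys are hypothetical, chosen to match the prefixes listed above.

```python
def citation_marker(source_type, index):
    """Build the inline citation marker for a given source type.

    Transcripts get bare numbers, 10-K filings a "10K" prefix,
    news articles an "N" prefix.
    """
    prefixes = {"transcript": "", "ten_k": "10K", "news": "N"}
    return f"[{prefixes[source_type]}{index}]"
```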

Configuration

The RAG pipeline is highly configurable:
{
    # Retrieval settings
    "chunks_per_quarter": 15,      # Results per quarter
    "max_quarters": 12,            # Max 3 years of data
    "max_tickers": 8,              # Max companies per query

    # Hybrid search weights
    "keyword_weight": 0.3,
    "vector_weight": 0.7,

    # Models
    "cerebras_model": "qwen-3-235b-a22b-instruct-2507",
    "openai_model": "gpt-5-nano-2025-08-07",
    "evaluation_model": "qwen-3-235b-a22b-instruct-2507",
    "embedding_model": "all-MiniLM-L6-v2",
    "llm_provider": "cerebras",  # or "openai" | "auto"

    # Chunking
    "chunk_size": 1000,
    "chunk_overlap": 200,
}
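Since the hybrid weights must sum to 1.0 and the chunk overlap must be smaller than the chunk size, a startup sanity check is a natural companion to this config. The validator below is an illustrative sketch, not part of rag/config.py.

```python
def validate_rag_config(config):
    """Sanity-check the invariants implied by the config above."""
    if abs(config["vector_weight"] + config["keyword_weight"] - 1.0) > 1e-9:
        raise ValueError("hybrid search weights must sum to 1.0")
    if config["chunk_overlap"] >= config["chunk_size"]:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return True
```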

Performance Optimizations

  • Parallel hybrid search: Vector and keyword searches run in parallel using ThreadPoolExecutor for roughly a 2x speedup.
  • Async query embedding: Query embeddings are generated asynchronously to avoid blocking the event loop:
async def encode_query_async(self, query: str) -> np.ndarray:
    # get_running_loop() is preferred inside a coroutine
    # (get_event_loop() is deprecated in this context)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        self._executor, self.embedding_model.encode, [query]
    )
  • Connection reuse: Database connections are reused across requests to avoid connection overhead.
  • Deduplication: Duplicate chunks across multiple searches are removed to reduce context size.
  • Early stopping: Iteration stops as soon as confidence thresholds are met, avoiding unnecessary LLM calls.

Next Steps

Data Sources

Learn about the three specialized data source tools

Architecture

Understand the complete six-stage pipeline
