
Overview

The PDF RAG Analyzer enables interactive conversations with multiple PDF files using Google Gemini 1.5 Flash and FAISS vector storage. It is specialized for analyzing financial documents, annual reports, and related-party transactions of companies listed on Indian stock markets.

Key Features

  • Multiple PDF Support: Upload and query multiple PDF files simultaneously
  • Financial Analysis: Specialized prompts for analyzing financial statements and reports
  • Google Gemini 1.5: Uses gemini-1.5-flash for fast, accurate responses
  • FAISS Vector Database: Efficient local vector storage and similarity search
  • Chat Interface: User-friendly chat interface with message history
  • Export Functionality: Export conversation history as CSV

Architecture

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from PyPDF2 import PdfReader

RAG Pipeline

Upload PDFs → extract text (PyPDF2) → split into chunks → embed (models/embedding-001) → store in FAISS → retrieve top-k chunks per query → answer with Gemini 1.5 Flash.

Implementation

PDF Text Extraction

from PyPDF2 import PdfReader

def get_pdf_text(pdf_docs):
    """Extract text from multiple PDF files."""
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += page.extract_text() or ""
    return text

Text Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_text_chunks(text):
    """Split text into chunks for embedding."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=10000,      # Characters per chunk
        chunk_overlap=1000,    # Overlap between chunks
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_text(text)
    return chunks
Chunking Strategy:
  • Chunk Size: 10,000 characters for comprehensive context
  • Overlap: 1,000 characters to maintain continuity
  • Separators: Prioritizes natural text boundaries (paragraphs, sentences)
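The size/overlap interaction above can be seen without any LangChain dependency. This is a minimal sliding-window sketch, not what RecursiveCharacterTextSplitter actually does (it additionally respects the separator priorities), but it shows why each chunk repeats the tail of the previous one:

```python
def chunk_text(text, chunk_size=10, overlap=3):
    """Minimal sliding-window chunker illustrating chunk_size/chunk_overlap.

    Simplified stand-in for RecursiveCharacterTextSplitter: it ignores
    separator priorities and just steps through the text in fixed strides.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
# Each chunk shares its last 3 characters with the start of the next,
# so a sentence cut at a chunk boundary still appears whole in one chunk.
```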

FAISS Vector Store

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.vectorstores import FAISS
import os

def get_vector_store(text_chunks):
    """Create FAISS vector store from text chunks."""
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=os.getenv("GOOGLE_API_KEY")
    )
    
    vector_store = FAISS.from_texts(
        text_chunks,
        embedding=embeddings
    )
    
    # Save locally
    vector_store.save_local("faiss_index")
    
    return vector_store
FAISS Benefits:
  • Fast similarity search on CPU
  • Local storage (no cloud dependency)
  • Efficient memory usage
  • Support for billion-scale datasets
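For intuition about what the index is doing, here is the computation FAISS's simplest index (a flat L2 index) performs, written in plain NumPy. FAISS's value is doing this at scale with optimized (and optionally approximate) index structures:

```python
import numpy as np

def l2_search(index_vectors, query, k=2):
    """Brute-force L2 nearest-neighbor search over stored embeddings.

    Conceptually equivalent to querying a flat (exhaustive) FAISS index:
    compute the squared L2 distance to every stored vector, return the
    ids and distances of the k closest.
    """
    dists = np.sum((index_vectors - query) ** 2, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
ids, dists = l2_search(vecs, np.array([0.9, 0.1]), k=2)
# Nearest stored vector is [1.0, 0.0] (id 1), then [0.0, 0.0] (id 0).
```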

Conversational Retrieval Chain

from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS
import os

def get_conversational_chain():
    """Create conversational retrieval chain."""
    # Finance-aware prompt for the answer-generation step.
    # ConversationalRetrievalChain handles chat history separately
    # (it rewrites the question using the history), so this template
    # only needs {context} and {question}.
    prompt_template = """
    You are a financial analysis expert. Analyze the provided documents carefully.
    
    Context from documents:
    {context}
    
    Question: {question}
    
    When answering:
    1. Focus on financial statements, ratios, and metrics
    2. Identify irregularities or red flags
    3. Analyze related-party transactions
    4. Evaluate managerial remuneration
    5. Provide data-driven insights
    
    If the answer is not in the documents, say "I don't have that information in the provided documents."
    
    Answer:
    """
    qa_prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    # Initialize Gemini model
    model = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0.3,  # Lower temperature for factual responses
        google_api_key=os.getenv("GOOGLE_API_KEY")
    )
    
    # Load FAISS index
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001"
    )
    vector_store = FAISS.load_local(
        "faiss_index",
        embeddings,
        allow_dangerous_deserialization=True
    )
    
    # Create retrieval chain, wiring the finance prompt into the
    # answer-generation step via combine_docs_chain_kwargs
    chain = ConversationalRetrievalChain.from_llm(
        llm=model,
        retriever=vector_store.as_retriever(
            search_kwargs={"k": 5}  # Retrieve top 5 chunks
        ),
        combine_docs_chain_kwargs={"prompt": qa_prompt},
        return_source_documents=True,
        verbose=True
    )
    
    return chain

Financial Analysis Prompting

# Finance-specific system prompt
FINANCE_SYSTEM_PROMPT = """
You are a financial analysis expert specializing in:
- Balance sheet analysis
- Cash flow evaluation
- Related-party transaction detection
- Key Managerial Personnel (KMP) remuneration analysis
- Debt-to-equity ratio calculation
- CFO to Net Profit conversion analysis
- Red flag identification in financial statements

Always:
1. Cite specific numbers from the documents
2. Calculate ratios when relevant
3. Highlight unusual or concerning patterns
4. Compare year-over-year trends
5. Flag potential irregularities
"""
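Two of the ratios the prompt asks the model to compute are simple enough to sanity-check by hand, which is worth doing before trusting a generated answer. A sketch with hypothetical figures (the numbers below are illustrative only, not from any real report):

```python
def debt_to_equity(total_debt, shareholders_equity):
    """Debt-to-equity ratio: total debt divided by shareholders' equity."""
    return total_debt / shareholders_equity

def cfo_conversion(cfo, net_profit):
    """Cash flow from operations as a fraction of net profit.

    Values near or above 1.0 suggest reported profits are backed by cash;
    persistently low values can be a red flag.
    """
    return cfo / net_profit

# Hypothetical figures for illustration only
ratio = debt_to_equity(1200.0, 3000.0)  # 0.4
conv = cfo_conversion(450.0, 500.0)     # 0.9
```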

Streamlit Application

import streamlit as st
import pandas as pd
import os
from datetime import datetime

st.set_page_config(page_title="Chat with PDFs", page_icon="📚")
st.title("📚 Chat with Multiple PDFs")

# Sidebar: PDF upload and processing
with st.sidebar:
    st.header("Upload Documents")
    
    # Google AI API key input
    api_key = st.text_input(
        "Google AI API Key",
        type="password",
        help="Get your API key from https://ai.google.dev/"
    )
    
    if api_key:
        os.environ["GOOGLE_API_KEY"] = api_key
    
    # PDF upload
    pdf_docs = st.file_uploader(
        "Upload PDF files",
        accept_multiple_files=True,
        type=["pdf"]
    )
    
    if st.button("Process Documents") and pdf_docs:
        with st.spinner("Processing PDFs..."):
            # Extract text
            raw_text = get_pdf_text(pdf_docs)
            
            # Create chunks
            text_chunks = get_text_chunks(raw_text)
            st.write(f"Created {len(text_chunks)} text chunks")
            
            # Create vector store
            vector_store = get_vector_store(text_chunks)
            
            st.success("✅ Documents processed successfully!")
            st.session_state.docs_processed = True

# Chat interface
if "messages" not in st.session_state:
    st.session_state.messages = []

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

# Display chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Chat input
if prompt := st.chat_input("Ask about your documents..."):
    if not st.session_state.get('docs_processed', False):
        st.warning("Please upload and process documents first.")
    else:
        # Add user message
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)
        
        # Get response
        with st.chat_message("assistant"):
            with st.spinner("Thinking..."):
                chain = get_conversational_chain()
                response = chain({
                    "question": prompt,
                    "chat_history": st.session_state.chat_history
                })
                
                answer = response["answer"]
                st.markdown(answer)
                
                # Update history
                st.session_state.messages.append({"role": "assistant", "content": answer})
                st.session_state.chat_history.append((prompt, answer))

# Export conversation
if st.session_state.messages:
    if st.sidebar.button("💾 Export Conversation"):
        df = pd.DataFrame(st.session_state.messages)
        csv = df.to_csv(index=False)
        st.sidebar.download_button(
            "Download CSV",
            csv,
            f"conversation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv",
            "text/csv"
        )

Financial Analysis Examples

Example Queries

# Balance sheet analysis
"What is the company's debt-to-equity ratio over the last 3 years?"

# Related-party transactions
"Identify all related-party transactions mentioned in the annual report."

# Cash flow analysis
"What is the CFO to Net Profit conversion trend?"

# Remuneration analysis
"Has there been any unusual increase in Key Managerial Personnel pay?"

# Comparative analysis
"Compare the debt levels across the 5 companies in these reports."

# Red flags
"Are there any red flags or irregularities in the financial statements?"

Installation

git clone https://github.com/rakshithsantosh/pdf-chatbot-gemini.git
cd pdf-chatbot-gemini
uv sync

Environment Setup

  1. Get Google AI API key from Google AI Studio
  2. Enter the key in the Streamlit sidebar

Running the Application

streamlit run app.py

Use Cases

  • Financial Analysis: Analyze annual reports and financial statements
  • Compliance Review: Review regulatory filings and compliance documents
  • Due Diligence: Conduct financial due diligence on companies
  • Investment Research: Research companies for investment decisions

Best Practices

  1. Upload Related Documents: Upload multiple related PDFs (e.g., 3-5 years of annual reports)
  2. Process Once: Process all documents together so queries can span them
  3. Specific Questions: Ask specific, financial-metric-focused questions
  4. Verify Numbers: Always verify critical numbers against the source documents

Configuration Options

Text Splitting

  • chunk_size (int, default: 10000): Number of characters per chunk
  • chunk_overlap (int, default: 1000): Character overlap between chunks for continuity

Retrieval

  • k (int, default: 5): Number of chunks to retrieve for each query

Model

  • temperature (float, default: 0.3): Lower values for factual responses, higher for creative output
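If these tunables were gathered in one place, they could look like the following. This is an illustrative sketch only; the app as shown keeps them inline, and the names here are not from its code:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Illustrative container for the pipeline's tunables (names hypothetical)."""
    chunk_size: int = 10000      # characters per chunk
    chunk_overlap: int = 1000    # overlap between chunks
    retrieval_k: int = 5         # chunks retrieved per query
    temperature: float = 0.3     # lower = more factual

cfg = RagConfig()
```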

Resources

  • Google AI Studio: Get Gemini API keys and documentation
  • LangChain: LangChain framework documentation
  • FAISS: FAISS vector database
