
Overview

The PDF RAG Analyzer enables interactive conversations with multiple PDF files using Google Gemini 1.5 Flash and FAISS vector storage. It is specialized for analyzing financial documents, annual reports, and related-party transactions of companies listed on Indian stock markets.

Key Features

  • Multiple PDF Support: Upload and query multiple PDF files simultaneously
  • Financial Analysis: Specialized prompts for analyzing financial statements and reports
  • Google Gemini 1.5: Uses gemini-1.5-flash for fast, accurate responses
  • FAISS Vector Database: Efficient local vector storage and similarity search
  • Chat Interface: User-friendly chat interface with message history
  • Export Functionality: Export conversation history as CSV

Architecture

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from PyPDF2 import PdfReader

RAG Pipeline

Upload PDFs → extract text (PyPDF2) → split into chunks → embed (models/embedding-001) → store in FAISS → retrieve top-k chunks per query → answer with Gemini 1.5 Flash.

Implementation

PDF Text Extraction

from PyPDF2 import PdfReader

def get_pdf_text(pdf_docs):
    """Extract text from multiple PDF files."""
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += page.extract_text() or ""
    return text

Text Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_text_chunks(text):
    """Split text into chunks for embedding."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=10000,      # Characters per chunk
        chunk_overlap=1000,    # Overlap between chunks
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_text(text)
    return chunks
Chunking Strategy:
  • Chunk Size: 10,000 characters for comprehensive context
  • Overlap: 1,000 characters to maintain continuity
  • Separators: Prioritizes natural text boundaries (paragraphs, sentences)
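The size/overlap interaction above can be seen without any LangChain dependency. This is a minimal sliding-window sketch, not what RecursiveCharacterTextSplitter actually does (it additionally respects the separator priorities), but it shows why each chunk repeats the tail of the previous one:

```python
def chunk_text(text, chunk_size=10, overlap=3):
    """Minimal sliding-window chunker illustrating chunk_size/chunk_overlap.

    Simplified stand-in for RecursiveCharacterTextSplitter: it ignores
    separator priorities and just steps through the text in fixed strides.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
# Each chunk shares its last 3 characters with the start of the next,
# so a sentence cut at a chunk boundary still appears whole in one chunk.
```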

FAISS Vector Store

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.vectorstores import FAISS
import os

def get_vector_store(text_chunks):
    """Create FAISS vector store from text chunks."""
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=os.getenv("GOOGLE_API_KEY")
    )
    
    vector_store = FAISS.from_texts(
        text_chunks,
        embedding=embeddings
    )
    
    # Save locally
    vector_store.save_local("faiss_index")
    
    return vector_store
FAISS Benefits:
  • Fast similarity search on CPU
  • Local storage (no cloud dependency)
  • Efficient memory usage
  • Support for billion-scale datasets
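For intuition about what the index is doing, here is the computation FAISS's simplest index (a flat L2 index) performs, written in plain NumPy. FAISS's value is doing this at scale with optimized (and optionally approximate) index structures:

```python
import numpy as np

def l2_search(index_vectors, query, k=2):
    """Brute-force L2 nearest-neighbor search over stored embeddings.

    Conceptually equivalent to querying a flat (exhaustive) FAISS index:
    compute the squared L2 distance to every stored vector, return the
    ids and distances of the k closest.
    """
    dists = np.sum((index_vectors - query) ** 2, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
ids, dists = l2_search(vecs, np.array([0.9, 0.1]), k=2)
# Nearest stored vector is [1.0, 0.0] (id 1), then [0.0, 0.0] (id 0).
```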

Conversational Retrieval Chain

from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS
import os

def get_conversational_chain():
    """Create conversational retrieval chain."""
    # Finance-aware prompt for the answer-generation step.
    # ConversationalRetrievalChain handles chat history separately
    # (it rewrites the question using the history), so this template
    # only needs {context} and {question}.
    prompt_template = """
    You are a financial analysis expert. Analyze the provided documents carefully.
    
    Context from documents:
    {context}
    
    Question: {question}
    
    When answering:
    1. Focus on financial statements, ratios, and metrics
    2. Identify irregularities or red flags
    3. Analyze related-party transactions
    4. Evaluate managerial remuneration
    5. Provide data-driven insights
    
    If the answer is not in the documents, say "I don't have that information in the provided documents."
    
    Answer:
    """
    qa_prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    # Initialize Gemini model
    model = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0.3,  # Lower temperature for factual responses
        google_api_key=os.getenv("GOOGLE_API_KEY")
    )
    
    # Load FAISS index
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001"
    )
    vector_store = FAISS.load_local(
        "faiss_index",
        embeddings,
        allow_dangerous_deserialization=True
    )
    
    # Create retrieval chain, wiring the finance prompt into the
    # answer-generation step via combine_docs_chain_kwargs
    chain = ConversationalRetrievalChain.from_llm(
        llm=model,
        retriever=vector_store.as_retriever(
            search_kwargs={"k": 5}  # Retrieve top 5 chunks
        ),
        combine_docs_chain_kwargs={"prompt": qa_prompt},
        return_source_documents=True,
        verbose=True
    )
    
    return chain

Financial Analysis Prompting

# Finance-specific system prompt
FINANCE_SYSTEM_PROMPT = """
You are a financial analysis expert specializing in:
- Balance sheet analysis
- Cash flow evaluation
- Related-party transaction detection
- Key Managerial Personnel (KMP) remuneration analysis
- Debt-to-equity ratio calculation
- CFO to Net Profit conversion analysis
- Red flag identification in financial statements

Always:
1. Cite specific numbers from the documents
2. Calculate ratios when relevant
3. Highlight unusual or concerning patterns
4. Compare year-over-year trends
5. Flag potential irregularities
"""
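Two of the ratios the prompt asks the model to compute are simple enough to sanity-check by hand, which is worth doing before trusting a generated answer. A sketch with hypothetical figures (the numbers below are illustrative only, not from any real report):

```python
def debt_to_equity(total_debt, shareholders_equity):
    """Debt-to-equity ratio: total debt divided by shareholders' equity."""
    return total_debt / shareholders_equity

def cfo_conversion(cfo, net_profit):
    """Cash flow from operations as a fraction of net profit.

    Values near or above 1.0 suggest reported profits are backed by cash;
    persistently low values can be a red flag.
    """
    return cfo / net_profit

# Hypothetical figures for illustration only
ratio = debt_to_equity(1200.0, 3000.0)  # 0.4
conv = cfo_conversion(450.0, 500.0)     # 0.9
```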

Streamlit Application

import streamlit as st
import pandas as pd
import os
from datetime import datetime

st.set_page_config(page_title="Chat with PDFs", page_icon="📚")
st.title("📚 Chat with Multiple PDFs")

# Sidebar: PDF upload and processing
with st.sidebar:
    st.header("Upload Documents")
    
    # Google AI API key input
    api_key = st.text_input(
        "Google AI API Key",
        type="password",
        help="Get your API key from https://ai.google.dev/"
    )
    
    if api_key:
        os.environ["GOOGLE_API_KEY"] = api_key
    
    # PDF upload
    pdf_docs = st.file_uploader(
        "Upload PDF files",
        accept_multiple_files=True,
        type=["pdf"]
    )
    
    if st.button("Process Documents") and pdf_docs:
        with st.spinner("Processing PDFs..."):
            # Extract text
            raw_text = get_pdf_text(pdf_docs)
            
            # Create chunks
            text_chunks = get_text_chunks(raw_text)
            st.write(f"Created {len(text_chunks)} text chunks")
            
            # Create vector store
            vector_store = get_vector_store(text_chunks)
            
            st.success("✅ Documents processed successfully!")
            st.session_state.docs_processed = True

# Chat interface
if "messages" not in st.session_state:
    st.session_state.messages = []

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

# Display chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Chat input
if prompt := st.chat_input("Ask about your documents..."):
    if not st.session_state.get('docs_processed', False):
        st.warning("Please upload and process documents first.")
    else:
        # Add user message
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)
        
        # Get response
        with st.chat_message("assistant"):
            with st.spinner("Thinking..."):
                chain = get_conversational_chain()
                response = chain({
                    "question": prompt,
                    "chat_history": st.session_state.chat_history
                })
                
                answer = response["answer"]
                st.markdown(answer)
                
                # Update history
                st.session_state.messages.append({"role": "assistant", "content": answer})
                st.session_state.chat_history.append((prompt, answer))

# Export conversation
if st.session_state.messages:
    if st.sidebar.button("💾 Export Conversation"):
        df = pd.DataFrame(st.session_state.messages)
        csv = df.to_csv(index=False)
        st.sidebar.download_button(
            "Download CSV",
            csv,
            f"conversation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv",
            "text/csv"
        )

Financial Analysis Examples

Example Queries

# Balance sheet analysis
"What is the company's debt-to-equity ratio over the last 3 years?"

# Related-party transactions
"Identify all related-party transactions mentioned in the annual report."

# Cash flow analysis
"What is the CFO to Net Profit conversion trend?"

# Remuneration analysis
"Has there been any unusual increase in Key Managerial Personnel pay?"

# Comparative analysis
"Compare the debt levels across the 5 companies in these reports."

# Red flags
"Are there any red flags or irregularities in the financial statements?"

Installation

git clone https://github.com/rakshithsantosh/pdf-chatbot-gemini.git
cd pdf-chatbot-gemini
uv sync

Environment Setup

  1. Get Google AI API key from Google AI Studio
  2. Enter the key in the Streamlit sidebar

Running the Application

streamlit run app.py

Use Cases

  • Financial Analysis: Analyze annual reports and financial statements
  • Compliance Review: Review regulatory filings and compliance documents
  • Due Diligence: Conduct financial due diligence on companies
  • Investment Research: Research companies for investment decisions

Best Practices

  1. Upload Related Documents: Upload multiple related PDFs (e.g., 3-5 years of annual reports)
  2. Process Once: Process all documents together so queries can span them
  3. Specific Questions: Ask specific, financial-metric-focused questions
  4. Verify Numbers: Always verify critical numbers against the source documents

Configuration Options

Text Splitting

  • chunk_size (int, default: 10000): Number of characters per chunk
  • chunk_overlap (int, default: 1000): Character overlap between chunks for continuity

Retrieval

  • k (int, default: 5): Number of chunks to retrieve for each query

Model

  • temperature (float, default: 0.3): Lower values for factual responses, higher for creative output
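If these tunables were gathered in one place, they could look like the following. This is an illustrative sketch only; the app as shown keeps them inline, and the names here are not from its code:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Illustrative container for the pipeline's tunables (names hypothetical)."""
    chunk_size: int = 10000      # characters per chunk
    chunk_overlap: int = 1000    # overlap between chunks
    retrieval_k: int = 5         # chunks retrieved per query
    temperature: float = 0.3     # lower = more factual

cfg = RagConfig()
```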

Resources

  • Google AI Studio: Get Gemini API keys and documentation
  • LangChain: LangChain framework documentation
  • FAISS: FAISS vector database
