
Prerequisites

Before you begin, ensure you have the following installed:
  • Python 3.10 or higher
  • Conda (Anaconda or Miniconda)
  • Git for cloning the repository
  • DeepSeek API Key (or any OpenAI-compatible LLM API key)
DeenPAL uses the DeepSeek model via the OpenRouter API. You can sign up for a free API key at OpenRouter.

Installation

Follow these steps to install and set up DeenPAL:
1. Clone the repository

Clone the DeenPAL repository from GitHub:
git clone https://github.com/Raza-Aziz/DeenPAL-RAG-based-Islamic-Hadith-Chatbot.git
cd DeenPAL-RAG-based-Islamic-Hadith-Chatbot
2. Set up a virtual environment

Create and activate a new conda environment with Python 3.10:
conda create -n deen-pal python=3.10 -y
conda activate deen-pal
Using a virtual environment ensures dependency isolation and prevents conflicts with other Python projects.
3. Configure environment variables

Create a .env file in the root directory and add your OpenRouter API key:
.env
OPENAI_API_KEY="your_openrouter_api_key_here"
DeenPAL uses OpenRouter to access the DeepSeek model. Get your API key from OpenRouter.
Keep your API key secure and never commit the .env file to version control. Add it to .gitignore if not already included.
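To illustrate what loading this file amounts to, here is a minimal hand-rolled .env parser. This is a sketch only: the helper name `load_env` and the commented-out client setup are illustrative and not part of DeenPAL (projects typically use python-dotenv's `load_dotenv()` for this).

```python
import os

def load_env(path=".env"):
    # Minimal .env parser: reads KEY="value" lines into os.environ.
    # Illustrative only -- python-dotenv's load_dotenv() is the usual choice.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# With the key loaded, an OpenAI-compatible client can target OpenRouter, e.g.:
#   client = OpenAI(base_url="https://openrouter.ai/api/v1",
#                   api_key=os.environ["OPENAI_API_KEY"])
```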
4. Install dependencies

Install the required Python packages. You can use either pip or the uv package manager:
pip install -r colab_requirements.txt
# or, equivalently, with uv:
uv pip install -r colab_requirements.txt
The uv package manager is significantly faster than pip for dependency installation. If you install frequently, consider using uv.
5. Prepare your data

Place your Hadith documents in the data/ directory in PDF format:
mkdir -p data/
# Copy your Hadith PDF files into the data/ directory
The original implementation uses Sahih Muslim and Sahih Bukhari books (all volumes) as the data source. You can use any Hadith PDF files that follow a similar structure with chapter and book numbering.
Expected PDF naming format:
  • <prefix>_Sahih_Bukhari_Volume_1.pdf
  • <prefix>_Sahih_Muslim_Volume_1.pdf
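The source name and volume are recoverable from filenames in this format. As a hedged sketch of what such metadata extraction could look like (the `parse_source` helper is hypothetical; the real extraction lives in loader.py):

```python
import re

def parse_source(filename):
    # Hypothetical parser for names like '<prefix>_Sahih_Bukhari_Volume_1.pdf'.
    # Returns (book name, volume number), or None if the pattern doesn't match.
    m = re.search(r"(Sahih_[A-Za-z]+)_Volume_(\d+)\.pdf$", filename)
    if m is None:
        return None
    return m.group(1).replace("_", " "), int(m.group(2))
```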

Running DeenPAL

Once installation is complete, you can start the chatbot:
1. Start the Streamlit application

Run the following command in your terminal:
streamlit run app.py
The first run will take longer as the system loads PDFs, generates embeddings, and initializes the ChromaDB vector store. Subsequent runs will be much faster due to caching.
2. Access the chatbot interface

Open your web browser and navigate to:
http://localhost:8501
Streamlit runs on port 8501 by default. If you need to use a different port, run streamlit run app.py --server.port 8080.
You should see the Deen Pal Chatbot interface.
3. Ask your first question

Try asking a question in the chat input at the bottom of the page. For example:

Example Query

“What does the Hadith say about prayer?”
The chatbot will:
  1. Retrieve relevant Hadiths from the database
  2. Display each Hadith with source citations (book number, hadith number, chapter)
  3. Provide a brief explanation for each Hadith
  4. Generate a concise answer to your question

Understanding the First Run

During the first execution, DeenPAL performs several initialization steps:
# From loader.py - cached for performance
@st.cache_resource
def load_and_prepare_data():
    # 1. Loading Hadith PDFs from data/ directory
    # 2. Processing metadata (extracting source names)
    # 3. Splitting documents into semantic chunks
    # 4. Generating embeddings using HuggingFace model
    # 5. Storing embeddings in ChromaDB
The @st.cache_resource decorator ensures this data loading happens only once per app session, significantly improving response times for subsequent queries.
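Within a single process, this caching behaves much like memoizing the loader. A rough stdlib analogy, assuming a placeholder return value rather than the real PDF/embedding setup:

```python
from functools import lru_cache

call_count = {"n": 0}

@lru_cache(maxsize=1)
def load_and_prepare_data():
    # Stand-in for the expensive PDF loading / embedding / ChromaDB setup.
    call_count["n"] += 1
    return {"db": "vector-store handle (placeholder)"}

first = load_and_prepare_data()
second = load_and_prepare_data()  # served from cache; the body does not run again
```

Unlike `lru_cache`, `st.cache_resource` also shares the cached object across Streamlit reruns and user sessions, which is why the expensive setup happens only on the first run.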

What Happens Behind the Scenes

When you submit a query, here’s what DeenPAL does:
  1. Semantic Search: Your query is converted to an embedding and compared against the Hadith database
  2. MMR Retrieval: The system retrieves the top 4 diverse results from 10 candidates using Maximal Marginal Relevance
  3. Context Building: Retrieved Hadiths are formatted with their metadata
  4. LLM Generation: The DeepSeek model generates a response based on the retrieved context
  5. Response Display: The answer is shown with proper Hadith citations and explanations
From chains.py:15-18:
retriever = db.as_retriever(
    search_type="mmr",  # Use Maximal Marginal Relevance
    search_kwargs={"k": 4, "fetch_k": 10}  # Retrieve top 4 diverse results from 10 candidates
)
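To make the MMR step concrete, here is a minimal greedy MMR over precomputed similarities. This is a generic sketch of the algorithm, not the Chroma/LangChain implementation; `lam` (assumed 0.5 here) trades relevance against redundancy with already-selected results.

```python
def mmr(query_sim, doc_sims, k=4, fetch_k=10, lam=0.5):
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    query_sim[i]   -- similarity of candidate i to the query
    doc_sims[i][j] -- similarity between candidates i and j
    Returns the indices of up to k diverse, relevant candidates.
    """
    # Consider only the fetch_k candidates most similar to the query.
    candidates = sorted(range(len(query_sim)),
                        key=lambda i: query_sim[i], reverse=True)[:fetch_k]
    selected = []
    while candidates and len(selected) < k:
        # Score = relevance minus worst-case redundancy with picks so far.
        best = max(
            candidates,
            key=lambda i: lam * query_sim[i]
            - (1 - lam) * max((doc_sims[i][j] for j in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate top hits, plain top-k would return both; MMR keeps one and reaches for a more diverse third candidate instead.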

Troubleshooting

Port 8501 already in use
By default, Streamlit uses port 8501. If you need to use port 8080, run:
streamlit run app.py --server.port 8080
API key errors
Ensure your .env file is in the root directory (same level as app.py) and contains:
OPENAI_API_KEY="your-actual-api-key"
Make sure there are no extra spaces around the equals sign.
No documents found
Verify that:
  • The data/ directory exists in the project root
  • Your PDF files are placed directly in the data/ directory
  • The PDF files are readable and not corrupted
First run is slow
This is expected behavior. The first run involves:
  • Loading all PDF documents
  • Downloading the HuggingFace embedding model (sentence-transformers/all-MiniLM-L6-v2)
  • Generating embeddings for all chunks
  • Initializing the ChromaDB vector store
Subsequent runs will be much faster due to caching.

Next Steps

Architecture

Learn about the technical architecture and how RAG works in DeenPAL.

Configuration

Customize the retrieval parameters, models, and prompts.
