DeenPAL uses PDF documents containing Islamic Hadiths as its knowledge base. This guide explains how to prepare and organize your Hadith data for optimal retrieval.

Overview

The data preparation process involves:
  1. Placing Hadith PDFs in the correct directory
  2. Automatic PDF loading and parsing
  3. Metadata extraction and enrichment
  4. Semantic text splitting by chapter/book
  5. Embedding generation and vector storage

Supported Data Formats

Supported Format: PDF only
DeenPAL currently supports Hadith documents in PDF format. The system uses PyPDFDirectoryLoader to load all PDFs from the data directory.

Placing Hadith PDFs

Step 1: Create Data Directory

Ensure the data/ directory exists in the project root:
mkdir -p data
Step 2: Add PDF Files

Place your Hadith PDF files in the data/ directory:
DeenPAL-RAG-based-Islamic-Hadith-Chatbot/
├── data/
│   ├── Sahih_Bukhari_Vol1.pdf
│   ├── Sahih_Bukhari_Vol2.pdf
│   ├── Sahih_Muslim_Vol1.pdf
│   └── Sahih_Muslim_Vol2.pdf
├── app.py
└── ...
The project author recommends using these authentic Hadith collections:
  • Sahih Bukhari (all volumes)
  • Sahih Muslim (all volumes)
These are considered the most authentic Hadith collections in Islamic scholarship. You can use other Hadith collections as well, but ensure they’re in PDF format.
Ensure your PDF files are text-based, not scanned images. The loader requires extractable text content.
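If you're unsure whether a PDF is text-based, a quick heuristic can help. The sketch below is not part of DeenPAL; it assumes you have already extracted per-page text with a reader such as pypdf (via page.extract_text()) and simply checks whether the pages yield a meaningful amount of text:

```python
def looks_text_based(page_texts, min_chars_per_page=20):
    """Heuristic: scanned-image PDFs yield little or no extractable text.

    page_texts is a list of strings, one per page, e.g. produced by
    pypdf's page.extract_text(). A threshold of ~20 characters per page
    is an arbitrary cutoff chosen for illustration.
    """
    if not page_texts:
        return False
    avg_chars = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg_chars >= min_chars_per_page
```

A scanned PDF typically returns empty or whitespace-only strings for every page, which this check flags.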

How the Loader Processes PDFs

DeenPAL’s loader.py module handles the entire data processing pipeline:

1. PDF Loading

folder_path = "data/"
loader = PyPDFDirectoryLoader(folder_path)
documents = loader.load()
The PyPDFDirectoryLoader automatically loads all PDF files from the data/ directory.

2. Metadata Extraction

After loading, the system extracts and cleans metadata from each document:
for doc in documents:
    split_source = doc.metadata['source'].split("/")[-1]            # e.g. "001_Sahih_Bukhari_Vol1.pdf"
    exact_source_with_ext = split_source.split('_', maxsplit=1)[1]  # drop the numeric prefix
    exact_source = exact_source_with_ext.split('.')[0]              # drop the ".pdf" extension
    doc.metadata = {'source': exact_source}
Example:
  • Original: data/001_Sahih_Bukhari_Vol1.pdf
  • Extracted: Sahih_Bukhari_Vol1
This provides clean source attribution for each Hadith.
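The cleaning steps above can be captured as a small standalone function (a hypothetical helper, not part of loader.py) so the transformation is easy to verify:

```python
def clean_source(path: str) -> str:
    """Turn a raw loader path into a clean source name.

    "data/001_Sahih_Bukhari_Vol1.pdf" -> "Sahih_Bukhari_Vol1"
    """
    filename = path.split("/")[-1]                       # drop the directory
    without_prefix = filename.split("_", maxsplit=1)[1]  # drop the numeric prefix
    return without_prefix.split(".")[0]                  # drop the extension
```

Note that everything before the first underscore is discarded, so this logic assumes filenames carry a prefix such as 001_; a filename with no underscore at all would raise an IndexError.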

3. Text Splitting Strategy

DeenPAL uses semantic splitting based on Hadith structure rather than fixed character counts:
pattern = r"(?:Chapter\s\d+:)|(?:Book\s\d+,\sNumber\s\d+:)"
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    separators=[pattern],
    is_separator_regex=True
)
chunks = text_splitter.split_documents(documents)
Why Semantic Splitting?Instead of splitting text at arbitrary character positions, this approach:
  • Splits at natural boundaries (Chapter/Book markers)
  • Keeps each Hadith intact as a single chunk
  • Preserves semantic meaning and context
  • Improves retrieval accuracy
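To see where the pattern places chunk boundaries, you can run it against a toy string (the sample text below is invented for illustration):

```python
import re

# Same separator pattern the splitter uses.
pattern = r"(?:Chapter\s\d+:)|(?:Book\s\d+,\sNumber\s\d+:)"

sample = (
    "Book 2, Number 13: Narrated Anas: ... first hadith text ... "
    "Book 2, Number 14: Narrated Abu Huraira: ... second hadith text ..."
)

# Each match marks the start of a new chunk, mirroring how the splitter
# treats the pattern as a separator.
boundaries = [m.start() for m in re.finditer(pattern, sample)]
```

Here the two "Book …, Number …:" markers produce two chunks, each holding one complete Hadith.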

4. Hadith Number Extraction

After splitting, the system extracts and adds Hadith numbers to metadata:
for chunk in chunks:
    matches = re.search(pattern, chunk.page_content)
    if matches:
        hadith_number = "".join([word for word in matches.group(0) if word.isdigit()])
        chunk.metadata.update({'hadith_number': hadith_number})
This allows for precise citation in chatbot responses.
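The extraction logic can be exercised on its own (a hypothetical helper; the sample strings are invented). One subtlety worth knowing: for a "Book X, Number Y:" marker, the digits of both numbers are concatenated.

```python
import re

pattern = r"(?:Chapter\s\d+:)|(?:Book\s\d+,\sNumber\s\d+:)"

def extract_hadith_number(text):
    """Pull the digits out of the first Chapter/Book marker in a chunk."""
    match = re.search(pattern, text)
    if not match:
        return None
    # Keep only digit characters from the matched marker.
    return "".join(ch for ch in match.group(0) if ch.isdigit())

extract_hadith_number("Chapter 7: On Prayer ...")          # "7"
extract_hadith_number("Book 2, Number 13: Narrated ...")   # "213" (digits concatenated)
```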

5. Embedding Generation

The system uses the sentence-transformers/all-MiniLM-L6-v2 model to generate embeddings:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
The embedding model is downloaded automatically on first run and cached locally for future use.
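Retrieval works by comparing the query's embedding vector against the stored chunk vectors; all-MiniLM-L6-v2 produces 384-dimensional vectors. As a rough illustration (pure Python, not project code), cosine similarity over toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 means
    identical direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors for illustration; real embeddings have 384 dims.
cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # 1.0
cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0.0
```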

6. Vector Store Initialization

Finally, embeddings are stored in ChromaDB for efficient retrieval:
persist_directory = 'database/chroma_db'
db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory
)
The vector store is persisted to disk, so it only needs to be created once.
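Because the store persists to disk, a simple existence check is enough to decide whether the expensive load-split-embed pipeline must run again. A hypothetical helper sketching that decision (the project itself relies on Streamlit caching instead, as described below):

```python
import os

def needs_rebuild(persist_directory="database/chroma_db"):
    """Return True when the Chroma store has not been created yet.

    The pipeline only needs to run when the persist directory is
    missing or empty; otherwise the persisted store can be reused.
    """
    return not os.path.isdir(persist_directory) or not os.listdir(persist_directory)
```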

Caching for Performance

The entire data loading process is cached using Streamlit’s caching mechanism:
@st.cache_resource
def load_and_prepare_data():
    # ... data loading logic
This ensures that:
  • Data is only processed once per session
  • Subsequent queries don’t reload or re-embed data
  • The chatbot responds quickly after initial setup
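Outside Streamlit, the same behavior can be approximated with ordinary memoization. The sketch below (not project code) uses functools.lru_cache and a call counter to show that the body runs only once:

```python
from functools import lru_cache

calls = {"n": 0}  # counts how many times the body actually executes

@lru_cache(maxsize=1)
def load_and_prepare_data_once():
    """Stand-in for the @st.cache_resource-decorated loader."""
    calls["n"] += 1  # expensive load/split/embed work would happen here
    return "db-handle"

load_and_prepare_data_once()  # first call does the work
load_and_prepare_data_once()  # second call returns the cached result
```

@st.cache_resource goes further than lru_cache (it shares the resource across Streamlit reruns and sessions), but the run-once effect is the same idea.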

File Naming Conventions

While there’s no strict naming requirement, consider using descriptive names.
Good:
  • Sahih_Bukhari_Vol1.pdf
  • Sahih_Muslim_Complete.pdf
  • 001_Jami_at_Tirmidhi.pdf
Avoid:
  • document1.pdf
  • hadith.pdf
  • untitled.pdf
Clear names help with source attribution in chatbot responses.
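If you want to enforce this convention in your own tooling, a small check can flag the generic names above (a hypothetical helper, not part of DeenPAL):

```python
import re

def is_descriptive_name(filename):
    """Reject obviously generic PDF names like 'document1.pdf'."""
    generic = re.fullmatch(r"(document|hadith|untitled)\d*\.pdf",
                           filename, re.IGNORECASE)
    return filename.lower().endswith(".pdf") and generic is None
```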

Verifying Data Preparation

When you first run the application, you’ll see console output indicating the data preparation progress:
1- Loading Hadith PDFs
2- Documents loaded successfully.
3- Documents split and metadata added.
4- Chroma vector store initialized.
If you see all four messages, your data preparation was successful.

Next Steps

With your data prepared, you can now:
  1. Configure API keys
  2. Run the chatbot
