Overview
The data preparation process involves:
- Placing Hadith PDFs in the correct directory
- Automatic PDF loading and parsing
- Metadata extraction and enrichment
- Semantic text splitting by chapter/book
- Embedding generation and vector storage
Supported Data Formats
Supported Format: PDF only
DeenPAL currently supports Hadith documents in PDF format. The system uses PyPDFDirectoryLoader to load all PDFs from the data directory.
Placing Hadith PDFs
Recommended Sources
The project author recommends using these authentic Hadith collections:
- Sahih Bukhari (all volumes)
- Sahih Muslim (all volumes)
How the Loader Processes PDFs
DeenPAL’s loader.py module handles the entire data processing pipeline:
1. PDF Loading
PyPDFDirectoryLoader automatically loads all PDF files from the data/ directory.
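As a rough sketch of this step, loading could look like the following. It assumes the loader is imported from the `langchain_community` package (the import path DeenPAL actually uses is not shown here), and the helper names `list_pdfs` and `load_hadith_pdfs` are hypothetical.

```python
from pathlib import Path


def list_pdfs(data_dir="data"):
    """Return the PDF files directly inside data_dir, sorted by name."""
    return sorted(p for p in Path(data_dir).glob("*.pdf") if p.is_file())


def load_hadith_pdfs(data_dir="data"):
    """Load every PDF in data_dir into LangChain documents (one per page)."""
    # Fail early with a clear message if the directory holds no PDFs.
    if not list_pdfs(data_dir):
        raise FileNotFoundError(f"No PDF files found in {str(data_dir)!r}")

    # Imported lazily so list_pdfs() works even without LangChain installed.
    # Assumption: the loader comes from langchain_community.
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(str(data_dir)).load()
```

Calling `list_pdfs("data")` before starting the app is a quick way to confirm that the expected volumes are in place.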
2. Metadata Extraction
After loading, the system extracts and cleans metadata from each document:
- Original: data/001_Sahih_Bukhari_Vol1.pdf
- Extracted: Sahih Bukhari Vol1
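One way to get from the original path to the extracted name is sketched below. The function name `clean_source_name` and the exact cleaning rules (drop the directory and extension, strip a leading numeric prefix, turn underscores into spaces) are assumptions inferred from the example above, not DeenPAL's confirmed implementation.

```python
import re
from pathlib import Path


def clean_source_name(pdf_path):
    """Turn a PDF path into a readable source label."""
    stem = Path(pdf_path).stem                  # "001_Sahih_Bukhari_Vol1"
    stem = re.sub(r"^\d+[_\-\s]*", "", stem)    # drop a leading numeric prefix
    return re.sub(r"[_\s]+", " ", stem).strip() # underscores -> single spaces


print(clean_source_name("data/001_Sahih_Bukhari_Vol1.pdf"))  # Sahih Bukhari Vol1
```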
3. Text Splitting Strategy
DeenPAL uses semantic splitting based on Hadith structure rather than fixed character counts.
Why Semantic Splitting?
Instead of splitting text at arbitrary character positions, this approach:
- Splits at natural boundaries (Chapter/Book markers)
- Keeps each Hadith intact as a single chunk
- Preserves semantic meaning and context
- Improves retrieval accuracy
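The idea can be illustrated with a small standard-library sketch. It assumes that a new unit starts on a line beginning with "Chapter" or "Book" and that Hadith numbers appear in the text as "Number 7" or "Hadith 7"; both patterns are hypothetical and would need adjusting to the layout of the actual PDFs. The sketch also tags each chunk with its Hadith number, the step described in the next section.

```python
import re

# Assumption: a new unit starts on a line beginning with "Chapter" or "Book".
BOUNDARY = re.compile(r"(?m)^(?=(?:Chapter|Book)\b)")
# Assumption: Hadith numbers appear as "Number 7" or "Hadith 7" in the text.
HADITH_NUMBER = re.compile(r"\b(?:Number|Hadith)\s+(\d+)\b")


def split_by_markers(text):
    """Split before each Chapter/Book marker, keeping the marker with its chunk."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]


def to_chunks(text, source):
    """Build chunk dicts with source and (when found) hadith_number metadata."""
    chunks = []
    for part in split_by_markers(text):
        metadata = {"source": source}
        match = HADITH_NUMBER.search(part)
        if match:
            metadata["hadith_number"] = int(match.group(1))
        chunks.append({"text": part, "metadata": metadata})
    return chunks
```

Because the split happens only at marker lines, a long narration is never cut in the middle the way a fixed character window could cut it.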
4. Hadith Number Extraction
After splitting, the system extracts Hadith numbers and adds them to each chunk’s metadata.
5. Embedding Generation
The system uses the sentence-transformers/all-MiniLM-L6-v2 model to generate embeddings.
The embedding model is downloaded automatically on first run and cached locally for future use.
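A minimal sketch of this step, together with the vector store step that follows, might look like this. It assumes the `langchain_community` wrappers `HuggingFaceEmbeddings` and `Chroma` are used; the function name `build_vector_store` and the `persist_directory` parameter are illustrative choices, not taken from DeenPAL's code.

```python
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def build_vector_store(documents, persist_directory=None):
    """Embed the given LangChain documents and store them in Chroma."""
    # Imported lazily; assumption: the langchain_community wrappers are used.
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma

    # The model weights are fetched on first use and reused from the local cache.
    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    return Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=persist_directory,  # None keeps the store in memory
    )
```

The returned store can then be queried, for example with `similarity_search(query, k=3)`, to fetch the chunks closest to a user question.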
6. Vector Store Initialization
Finally, embeddings are stored in ChromaDB for efficient retrieval.
Caching for Performance
The entire data loading process is cached using Streamlit’s caching mechanism:
- Data is only processed once per session
- Subsequent queries don’t reload or re-embed data
- The chatbot responds quickly after initial setup
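The effect of caching can be illustrated with the standard library. The sketch below uses `functools.lru_cache` as a stand-in, not Streamlit's own decorator; in a Streamlit app a decorator such as `st.cache_resource` plays this role, though which decorator DeenPAL uses is not stated here. The function `load_pipeline` is a hypothetical placeholder for the load, split, and embed steps.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive step really runs


@lru_cache(maxsize=1)
def load_pipeline(data_dir="data"):
    """Stand-in for the expensive load/split/embed step."""
    global load_count
    load_count += 1
    return {"data_dir": data_dir, "ready": True}


first = load_pipeline()   # does the work
second = load_pipeline()  # served from the cache
print(load_count)         # 1
print(first is second)    # True
```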
File Naming Conventions
While there’s no strict naming requirement, consider using descriptive names:
✅ Good:
- Sahih_Bukhari_Vol1.pdf
- Sahih_Muslim_Complete.pdf
- 001_Jami_at_Tirmidhi.pdf
❌ Avoid:
- document1.pdf
- hadith.pdf
- untitled.pdf
