Overview
The loader.py module is responsible for loading hadith PDFs, processing them into searchable chunks, and storing them in a vector database. This is the foundation of DeenPAL’s retrieval system.
The data loading process runs only once when the app starts, thanks to Streamlit’s @st.cache_resource decorator.
Core Function
The load_and_prepare_data() function orchestrates the entire data preparation pipeline.
Pipeline Stages
1. Loading PDFs with PyPDFDirectoryLoader
The loader reads all PDF files from the data/ directory.
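A minimal sketch of this step, assuming the loader is imported from langchain_community (the import path used in loader.py is not shown here). The helper name load_pdfs is hypothetical, and the import is deferred so the snippet can be loaded without LangChain installed.

```python
DATA_DIR = "data/"  # directory named in the text above


def load_pdfs(data_dir: str = DATA_DIR):
    """Load every PDF under data_dir as a list of LangChain Document objects.

    Hypothetical helper: the import path below is an assumption and may
    differ between LangChain versions.
    """
    # Imported lazily so this sketch can be imported without LangChain or pypdf.
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    loader = PyPDFDirectoryLoader(data_dir)
    # load() typically yields one Document per PDF page, with metadata such as
    # the source file path and the page number.
    return loader.load()
```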
PyPDFDirectoryLoader automatically processes all PDFs in the specified directory, extracting text and basic metadata.
2. Metadata Processing
The loader extracts the source name from filenames by removing prefixes and extensions:
- Input: data/01_Sahih_Al-Bukhari.pdf
- Output metadata: {'source': 'Sahih Al-Bukhari'}
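The filename cleanup can be sketched with the standard library alone. The function name source_name_from_path is hypothetical, and the underscore-to-space step is inferred from the input/output pair above rather than taken from loader.py.

```python
import re
from pathlib import Path


def source_name_from_path(path: str) -> str:
    """Turn 'data/01_Sahih_Al-Bukhari.pdf' into 'Sahih Al-Bukhari'."""
    stem = Path(path).stem             # drop the directory and the '.pdf' extension
    stem = re.sub(r"^\d+_", "", stem)  # drop a leading numeric prefix such as '01_'
    return stem.replace("_", " ")      # assumption: underscores become spaces


metadata = {"source": source_name_from_path("data/01_Sahih_Al-Bukhari.pdf")}
print(metadata)  # {'source': 'Sahih Al-Bukhari'}
```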
3. Text Splitting with Regex Pattern
Documents are split into chunks using a specialized regex pattern that recognizes hadith structure.
Regex Pattern Breakdown:
- (?:Chapter\s\d+:) - Matches “Chapter 5:” format
- (?:Book\s\d+,\sNumber\s\d+:) - Matches “Book 1, Number 123:” format
- These patterns split documents at natural hadith boundaries
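To illustrate how the two patterns above can act as split points, here is a sketch using re.split from the standard library. It is not necessarily how loader.py performs the split: wrapping the alternatives in a lookahead, the name split_hadiths, and the placeholder sample text are all assumptions of this example.

```python
import re

# The two alternatives from the breakdown above, joined with '|'. The lookahead
# (?=...) is a choice of this sketch: it splits just before each marker, so the
# marker stays at the start of its own chunk.
HADITH_BOUNDARY = re.compile(r"(?=(?:Chapter\s\d+:)|(?:Book\s\d+,\sNumber\s\d+:))")


def split_hadiths(text: str) -> list[str]:
    """Split text at hadith boundaries, dropping empty pieces."""
    return [piece.strip() for piece in HADITH_BOUNDARY.split(text) if piece.strip()]


# Placeholder text that only mimics the marker format.
sample = (
    "Chapter 5: Placeholder chapter title. "
    "Book 1, Number 1: First placeholder text. "
    "Book 1, Number 2: Second placeholder text."
)
for chunk in split_hadiths(sample):
    print(chunk)
```

Splitting on a zero-width lookahead requires Python 3.7 or later.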
4. Adding Hadith Number Metadata
After splitting, the loader extracts hadith numbers and adds them as metadata, for example {'hadith_number': '123'}, to each chunk.
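One way to extract the number is a capturing group on the “Book 1, Number 123:” marker. This sketch works on plain strings and dicts instead of LangChain Document objects; the function name add_hadith_number is hypothetical, and chunks without a marker are simply left unchanged.

```python
import re

# Captures the digits after "Number" in a "Book 1, Number 123:" marker.
HADITH_NUMBER = re.compile(r"Book\s\d+,\sNumber\s(\d+):")


def add_hadith_number(chunk_text: str, metadata: dict) -> dict:
    """Return a copy of metadata with 'hadith_number' added when a marker is found."""
    updated = dict(metadata)
    match = HADITH_NUMBER.search(chunk_text)
    if match:
        updated["hadith_number"] = match.group(1)  # kept as a string, e.g. '123'
    return updated


print(add_hadith_number("Book 1, Number 123: placeholder text",
                        {"source": "Sahih Al-Bukhari"}))
```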
5. Generating Embeddings
The loader uses HuggingFace’s sentence transformers to create vector embeddings.
Model: sentence-transformers/all-MiniLM-L6-v2
- Fast and efficient
- 384-dimensional embeddings
- Optimized for semantic similarity
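A sketch of how the embedding model might be created, plus a small cosine-similarity helper to show what “semantic similarity” means numerically. The langchain_community import path and both function names are assumptions, and the import is deferred so the snippet loads without the model libraries installed.

```python
import math

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # model named above


def build_embeddings(model_name: str = MODEL_NAME):
    """Create the embedding model (hypothetical helper; import path assumed)."""
    # Lazy import: needs langchain-community and sentence-transformers installed.
    from langchain_community.embeddings import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(model_name=model_name)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the dependencies installed, build_embeddings().embed_query("some text") should return a list of 384 floats, and two such vectors can be compared with cosine_similarity.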
6. Initializing ChromaDB
Finally, the embeddings are stored in a persistent ChromaDB vector store.
Caching with @st.cache_resource
The @st.cache_resource decorator ensures this expensive operation only runs once.
Why @st.cache_resource?
- Prevents reloading data on every Streamlit rerun
- Shares the database and embeddings across reruns and user sessions
- Significantly improves app performance
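The run-once behavior can be illustrated without Streamlit. The sketch below uses functools.lru_cache from the standard library as a stand-in. It is not how st.cache_resource is implemented, and the function body is a placeholder; in the app itself the decorator would be @st.cache_resource.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive body actually runs


@lru_cache(maxsize=None)  # stand-in for @st.cache_resource in this sketch
def load_and_prepare_data():
    """Placeholder for the real pipeline; returns dummy (db, embeddings) objects."""
    global call_count
    call_count += 1
    return {"chunks": []}, object()


# Simulate three script reruns: the body runs once, later calls reuse the result.
results = [load_and_prepare_data() for _ in range(3)]
print(call_count)  # 1
```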
Return Values
The function returns two objects:
- db: ChromaDB vector store containing all document chunks
- embeddings: HuggingFaceEmbeddings instance for consistent embedding generation
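As a rough sketch of how the vector store might be built and then queried by a caller: the langchain_community import path, the persist directory name, and the helper names build_vector_store and top_matches are all assumptions, since the actual values in loader.py are not shown here.

```python
PERSIST_DIR = "chroma_db"  # hypothetical path; the real directory is not given here


def build_vector_store(chunks, embeddings, persist_directory: str = PERSIST_DIR):
    """Embed the chunks and store them in a persistent Chroma collection."""
    # Lazy import: needs langchain-community and chromadb installed.
    from langchain_community.vectorstores import Chroma

    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )


def top_matches(db, query: str, k: int = 3):
    """Return the k chunks most similar to query from a LangChain vector store."""
    return db.similarity_search(query, k=k)
```

A caller could then unpack the return values with db, embeddings = load_and_prepare_data() and pass db to top_matches along with a user question.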
