Why Document Processing Matters
Raw documents are too large to feed into LLMs in their entirety. Document processing breaks them into smaller, semantically meaningful chunks that can be:

- Embedded as vectors for similarity search
- Retrieved individually based on relevance
- Fit within the LLM’s context window
Good chunking is critical for RAG performance. Chunks must be large enough to be meaningful but small enough to be specific.
The Processing Pipeline
RAG Chat processes PDF documents through three stages:

1. Load: Extract text from the PDF using PyPDFLoader
2. Split: Chunk the text using RecursiveCharacterTextSplitter
3. Embed: Convert chunks to vectors and store them in ChromaDB
PDF Loading
The process_file() function handles PDF uploads:
app.py
Streamlit file uploads are in-memory. The code saves them to a temporary file so PyPDFLoader can read them, then deletes the temp file after processing.
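The temp-file pattern can be sketched as follows. This is a simplified stand-in, not the actual app.py code: the loader argument takes the place of PyPDFLoader so the pattern is visible without external dependencies.

```python
import os
import tempfile

def load_pdf_bytes(data: bytes, loader=None):
    """Write in-memory upload bytes to a temp file, run a path-based
    loader over it, then delete the temp file.

    In the real app the loader would be something like
    lambda path: PyPDFLoader(path).load(); a byte-reading stand-in
    is used here so the sketch runs on its own.
    """
    if loader is None:
        def loader(path):
            with open(path, "rb") as f:
                return f.read()
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
    try:
        tmp.write(data)   # persist the upload so a path-based loader can read it
        tmp.close()
        return loader(tmp.name)
    finally:
        os.unlink(tmp.name)  # clean up the temp file after processing

print(load_pdf_bytes(b"%PDF-1.4 demo"))  # -> b'%PDF-1.4 demo' with the stand-in loader
```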
What PyPDFLoader Extracts
PyPDFLoader reads PDF files and extracts:

- Raw text content from each page
- Page numbers (stored as metadata)
- Document structure

The result is a list of Document objects, one per page.
Text Chunking
The RecursiveCharacterTextSplitter
After loading, documents are split into chunks:

app.py
Parameter Breakdown
- chunk_size=1000: Each chunk is ~1000 characters (≈200-250 words)
- chunk_overlap=400: Consecutive chunks share 400 characters
- “Recursive”: Tries to split on natural boundaries (paragraphs, sentences, words)
Why 1000 Characters?
The chunk size balances two competing needs:

- Too Small (< 500): Lacks context; may miss connections between ideas
- Too Large (> 2000): Too much noise; reduces retrieval precision
- Just Right (1000): Captures complete thoughts; good for semantic search
- With Overlap (400): Ensures concepts spanning boundaries are captured
The Importance of Overlap
Chunk overlap prevents information loss at boundaries. Without overlap, a concept split across chunks might not be retrieved; the 400-character overlap ensures context continuity.
Recursive Splitting Strategy
The “recursive” part of RecursiveCharacterTextSplitter means it tries multiple separators in order:

- Paragraph breaks (\n\n)
- Line breaks (\n)
- Sentences (. )
- Words (single spaces)
- Characters (last resort)
Example: Splitting in Action
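To make the fallback behavior concrete, here is a deliberately simplified, dependency-free mimic of the recursive strategy (not LangChain's actual implementation): try coarse separators first, recursing to finer ones only when a piece is still too long.

```python
# Simplified recursive splitter: paragraph breaks, then line breaks,
# then spaces, then hard character cuts as a last resort.
SEPARATORS = ["\n\n", "\n", " ", ""]

def recursive_split(text, chunk_size=60, separators=SEPARATORS):
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate       # piece still fits; keep accumulating
            continue
        if current:
            chunks.append(current)    # flush what we have so far
        if len(piece) > chunk_size:
            # This piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = ("Paragraph one is short.\n\n"
       "Paragraph two is much longer and keeps going well past the limit.")
for c in recursive_split(doc):
    print(repr(c))
```

The short first paragraph survives intact as one chunk, while the overlong second paragraph falls through to word-level splitting.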
Complete Processing Function
Here’s the full process_file() function:
app.py
Integration with Vector Store
Processed chunks flow into the vector store:

app.py

Each chunk is:

- Converted to a vector embedding via the OpenAI API
- Stored in ChromaDB with its original text
- Indexed for fast similarity search
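A minimal stand-in can make these three steps concrete. Everything below is a toy (a bag-of-words "embedding" and a brute-force similarity store); the real app gets dense vectors from the OpenAI API and persistence/indexing from ChromaDB.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: a normalized word-count vector. The real pipeline
    # gets a dense neural embedding back from the OpenAI API instead.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {word: v / norm for word, v in counts.items()}

class ToyVectorStore:
    def __init__(self):
        self.entries = []  # each entry keeps the vector AND the original text

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def search(self, query, k=1):
        # Brute-force cosine similarity; ChromaDB indexes for speed instead.
        q = embed(query)
        def score(entry):
            vec, _ = entry
            return sum(w * q.get(word, 0.0) for word, w in vec.items())
        ranked = sorted(self.entries, key=score, reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add("Chunking splits documents into retrievable pieces.")
store.add("ChromaDB stores embeddings for similarity search.")
print(store.search("embeddings similarity search"))
```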
Chunking Best Practices
For Different Document Types
- Research Papers: chunk_size=1000 works well; captures full paragraphs
- Legal Documents: chunk_size=1500; longer clauses need more context
- News Articles: chunk_size=800; shorter, punchier content
- Technical Manuals: chunk_size=1200; procedures need complete steps
Tuning Chunk Size
If retrieval quality is poor, consider:

- Smaller chunks (500-800): When answers are very specific and localized
- Larger chunks (1200-1500): When answers need more surrounding context
- More overlap (500-600): For dense, technical content where concepts are tightly connected
- Less overlap (200-300): For simple documents with clear topic boundaries
The current configuration (1000/400) is a good starting point for most documents.
Performance Implications
Chunk Count
A 100-page PDF (≈50,000 words) produces:

- Character count: ≈250,000 characters
- Chunks: ≈420 chunks (with a 400-character overlap, each chunk advances only ~600 new characters)
- Embedding cost: ≈$0.10-0.15
- Storage: A few MB in ChromaDB
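The chunk-count estimate can be sanity-checked in a few lines. This is an approximation: each chunk advances roughly chunk_size − chunk_overlap new characters, so the count comes out well above a naive chars / chunk_size estimate.

```python
chars = 250_000       # ≈50,000 words × ~5 characters/word
chunk_size = 1000
chunk_overlap = 400
stride = chunk_size - chunk_overlap   # ~600 new characters per chunk

# First chunk covers chunk_size chars; each later chunk adds one stride.
chunks = 1 + (chars - chunk_size + stride - 1) // stride  # ceiling division
print(chunks)  # -> 416
```

Real counts vary because the splitter snaps to paragraph and sentence boundaries rather than cutting at exact character offsets.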
Processing Time
For typical PDFs:

- PDF loading: 1-3 seconds
- Text splitting: < 1 second
- Embedding API calls: 5-15 seconds (depends on chunk count)
- Vector store insertion: 1-2 seconds
The embedding API calls are the bottleneck. LangChain batches requests automatically for efficiency.
Metadata Preservation
Chunks retain metadata from the original PDF, which enables:

- Source attribution in responses
- Filtering by page number
- Document provenance tracking
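Metadata filtering is then a matter of matching on those fields. The sketch below uses plain dicts rather than LangChain Document objects; the source/page keys follow PyPDFLoader's usual convention, but the exact schema depends on the loader version.

```python
# Each chunk carries a metadata dict alongside its text.
chunks = [
    {"text": "Chunking splits documents...", "metadata": {"source": "report.pdf", "page": 0}},
    {"text": "Overlap preserves context...", "metadata": {"source": "report.pdf", "page": 1}},
    {"text": "Embeddings enable search...",  "metadata": {"source": "notes.pdf",  "page": 0}},
]

def filter_by(chunks, **criteria):
    # Keep chunks whose metadata matches every given key/value pair.
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in criteria.items())]

first_pages = filter_by(chunks, page=0)
report_only = filter_by(chunks, source="report.pdf")
print(len(first_pages), len(report_only))  # -> 2 2
```

The same metadata is what lets the chat app attribute an answer back to "report.pdf, page 1".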
Next Steps
- Vector Store: Learn how processed chunks are stored and retrieved
- RAG Overview: See how chunking fits into the complete RAG pipeline