
Why Document Processing Matters

Raw documents are too large to feed into LLMs in their entirety. Document processing breaks them into smaller, semantically meaningful chunks that can be:
  1. Embedded as vectors for similarity search
  2. Retrieved individually based on relevance
  3. Fit within the LLM’s context window
Good chunking is critical for RAG performance. Chunks must be large enough to be meaningful but small enough to be specific.

The Processing Pipeline

RAG Chat processes PDF documents through three stages:

1. Load

Extract text from PDF using PyPDFLoader

2. Split

Chunk text using RecursiveCharacterTextSplitter

3. Embed

Convert chunks to vectors and store in ChromaDB

PDF Loading

The process_file() function handles PDF uploads:
app.py
def process_file(file):
    # Write the in-memory upload to disk so PyPDFLoader can open it
    with NamedTemporaryFile(delete=False, suffix='.pdf') as temp_file:
        temp_file.write(file.read())
        temp_file_path = temp_file.name

    # The with block has exited, so the file is closed and fully flushed
    loader = PyPDFLoader(temp_file_path)
    docs = loader.load()

    # ... splitting happens next ...

    os.remove(temp_file_path)
    return chunks
Streamlit file uploads are in-memory. The code saves them to a temporary file so PyPDFLoader can read them, then deletes the temp file after processing.

What PyPDFLoader Extracts

PyPDFLoader reads PDF files and extracts:
  • Raw text content from each page
  • Page numbers (stored as metadata)
  • Document structure
The result is a list of Document objects, one per page.
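Each page, in other words, becomes a small record of text plus metadata. A minimal sketch of that shape, using plain dicts as stand-ins for LangChain's Document objects (which expose `page_content` and `metadata` attributes); `fake_load` and the sample filename are hypothetical:

```python
def fake_load(pages, source="example.pdf"):
    """Mimic the shape of PyPDFLoader output: one record per page."""
    return [
        {"page_content": text, "metadata": {"source": source, "page": i}}
        for i, text in enumerate(pages)
    ]

docs = fake_load(["Page one text.", "Page two text."])
print(len(docs))                     # one record per page
print(docs[1]["metadata"]["page"])   # page numbers travel with the text
```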

Text Chunking

The RecursiveCharacterTextSplitter

After loading, documents are split into chunks:
app.py
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=400
)
chunks = text_splitter.split_documents(docs)
This configuration is a solid general-purpose setup for RAG:
  • chunk_size=1000: Each chunk is at most ~1000 characters (≈150-200 words)
  • chunk_overlap=400: Consecutive chunks share 400 characters
  • “Recursive”: Tries to split on natural boundaries (paragraphs, sentences, words)

Why 1000 Characters?

The chunk size balances two competing needs:

Too Small (< 500)

Lacks context; may miss connections between ideas

Too Large (> 2000)

Too much noise; reduces retrieval precision

Just Right (1000)

Captures complete thoughts; good for semantic search

With Overlap (400)

Ensures concepts spanning boundaries are captured

The Importance of Overlap

Chunk overlap prevents information loss at boundaries:
Chunk 1: "...important concept is RAG. RAG combines retrieval..."
Chunk 2: "...RAG combines retrieval with generation to improve..."
                  ↑ 400 character overlap ↑
Without overlap, a concept split across chunks might not be retrieved. The 400-character overlap ensures context continuity.
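The arithmetic can be sketched with a simple fixed-stride splitter (`sliding_chunks` is illustrative; the real splitter prefers natural boundaries, but the overlap math is the same):

```python
def sliding_chunks(text, chunk_size=20, overlap=8):
    """Fixed-stride chunking: each chunk starts (chunk_size - overlap)
    characters after the previous one, so neighbors share `overlap` chars."""
    stride = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

text = "RAG combines retrieval with generation to improve answers."
chunks = sliding_chunks(text)

# The tail of each chunk reappears at the head of the next one
for a, b in zip(chunks, chunks[1:]):
    assert a[-8:] == b[:8]
```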

Recursive Splitting Strategy

The “recursive” part of RecursiveCharacterTextSplitter means it tries multiple separators in order:
  1. Paragraph breaks (\n\n)
  2. Line breaks (\n)
  3. Words ( )
  4. Characters (last resort)
Sentence-level separators (e.g. ". ") are not in the default list, but can be supplied via the splitter's separators argument.
This preserves natural semantic boundaries when possible.
Original text (2500 chars):
"RAG is a technique... [paragraph] ...vector stores enable... [paragraph] ...chunking matters because..."

Step 1: Try splitting on paragraph breaks
→ Paragraph 1: 900 chars ✓ (fits in 1000)
→ Paragraph 2: 1100 chars ✗ (too big)

Step 2: Split Paragraph 2 at the next available separator
→ Paragraph 2a: 800 chars ✓
→ Paragraph 2b: 300 chars ✓

Result: 3 chunks, all semantically coherent
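The cascade can be sketched as a minimal recursive splitter. This is a simplification: the real RecursiveCharacterTextSplitter also merges adjacent small pieces back up toward chunk_size and applies overlap, which this toy version skips:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator present in the text; recurse into
    any piece still larger than chunk_size. Character-level splitting is
    the last resort when no separator remains."""
    if len(text) <= chunk_size:
        return [text]
    for idx, sep in enumerate(separators):
        if sep in text:
            out = []
            for piece in text.split(sep):
                out.extend(recursive_split(piece, chunk_size, separators[idx + 1:]))
            return [p for p in out if p]
    # No separator left: hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A short paragraph stays whole; an oversized one is split at finer levels
text = "Short para.\n\n" + ("word " * 300).strip()
chunks = recursive_split(text, chunk_size=1000)
assert all(len(c) <= 1000 for c in chunks)
```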

Complete Processing Function

Here’s the full process_file() function:
app.py
def process_file(file):
    # Save the uploaded file temporarily; the with block closes it before
    # reading so the write is flushed (and so os.remove works on Windows)
    with NamedTemporaryFile(delete=False, suffix='.pdf') as temp_file:
        temp_file.write(file.read())
        temp_file_path = temp_file.name

    try:
        # Load PDF and extract text
        loader = PyPDFLoader(temp_file_path)
        docs = loader.load()

        # Split into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=400
        )
        chunks = text_splitter.split_documents(docs)
    finally:
        # Clean up temp file even if loading or splitting fails
        os.remove(temp_file_path)

    return chunks

Integration with Vector Store

Processed chunks flow into the vector store:
app.py
all_chunks = []
for uploaded_file in uploaded_files:
    chunks = process_file(uploaded_file)
    all_chunks.extend(chunks)

if all_chunks:
    vector_store = add_to_vector_store(
        vector_store=vector_store,
        documents=all_chunks
    )
Each chunk is:
  1. Converted to a vector embedding via OpenAI API
  2. Stored in ChromaDB with its original text
  3. Indexed for fast similarity search
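The flow can be sketched with a toy in-memory store and a fake letter-frequency embedding, stand-ins for the OpenAI embedding API and ChromaDB (`ToyVectorStore` and `fake_embed` are illustrative, not the app's code):

```python
import math

def fake_embed(text):
    """Stand-in for an embedding API: a crude letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class ToyVectorStore:
    """Keeps (vector, original text) pairs; searches by cosine similarity."""
    def __init__(self):
        self.entries = []

    def add(self, texts):
        for t in texts:
            self.entries.append((fake_embed(t), t))  # vector + original text

    def search(self, query, k=1):
        q = fake_embed(query)
        scored = sorted(self.entries,
                        key=lambda e: -sum(a * b for a, b in zip(e[0], q)))
        return [t for _, t in scored[:k]]

store = ToyVectorStore()
store.add(["chunking splits documents", "embeddings are vectors", "zebras zigzag"])
print(store.search("vector embeddings"))  # most similar chunk comes back first
```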

Chunking Best Practices

For Different Document Types

Research Papers

chunk_size=1000 works well; captures full paragraphs

Legal Documents

chunk_size=1500; longer clauses need more context

News Articles

chunk_size=800; shorter, punchier content

Technical Manuals

chunk_size=1200; procedures need complete steps
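These starting points can be collected into a small lookup. The chunk sizes mirror the suggestions above; the overlap values and the `splitter_params` helper are illustrative assumptions:

```python
# Suggested starting points per document type (chunk sizes from the
# guidance above; overlap values are illustrative assumptions)
CHUNK_PARAMS = {
    "research_paper":   {"chunk_size": 1000, "chunk_overlap": 400},
    "legal_document":   {"chunk_size": 1500, "chunk_overlap": 400},
    "news_article":     {"chunk_size": 800,  "chunk_overlap": 300},
    "technical_manual": {"chunk_size": 1200, "chunk_overlap": 400},
}

def splitter_params(doc_type):
    # Fall back to the general-purpose 1000/400 configuration
    return CHUNK_PARAMS.get(doc_type, {"chunk_size": 1000, "chunk_overlap": 400})
```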

Tuning Chunk Size

If retrieval quality is poor, consider:
  • Smaller chunks (500-800): When answers are very specific and localized
  • Larger chunks (1200-1500): When answers need more surrounding context
  • More overlap (500-600): For dense, technical content where concepts are tightly connected
  • Less overlap (200-300): For simple documents with clear topic boundaries
The current configuration (1000/400) is a good starting point for most documents.

Performance Implications

Chunk Count

A 100-page PDF (≈50,000 words) produces:
  • Character count: ≈250,000 characters
  • Chunks: ≈400-420 chunks (the 400-character overlap means each 1000-char chunk adds only ~600 new characters)
  • Embedding cost: ≈$0.10-0.15
  • Storage: A few MB in ChromaDB
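A back-of-the-envelope check of the chunk count, assuming a fixed stride of chunk_size − chunk_overlap new characters per chunk (`estimate_chunks` is a hypothetical helper, not the app's code):

```python
import math

def estimate_chunks(total_chars, chunk_size=1000, chunk_overlap=400):
    """Rough chunk-count estimate under a fixed-stride approximation."""
    stride = chunk_size - chunk_overlap  # new characters per chunk
    return math.ceil(total_chars / stride)

print(estimate_chunks(250_000))  # ≈417 chunks for a 100-page PDF
```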

Processing Time

For typical PDFs:
  • PDF loading: 1-3 seconds
  • Text splitting: < 1 second
  • Embedding API calls: 5-15 seconds (depends on chunk count)
  • Vector store insertion: 1-2 seconds
The embedding API calls are the bottleneck. LangChain batches requests automatically for efficiency.

Metadata Preservation

Chunks retain metadata from the original PDF:
{
  "page_content": "RAG combines retrieval with generation...",
  "metadata": {
    "source": "research_paper.pdf",
    "page": 3
  }
}
This allows for:
  • Source attribution in responses
  • Filtering by page number
  • Document provenance tracking
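As a sketch, filtering by page over plain chunk dicts (`from_page` is a hypothetical helper; with the real store, Chroma's search calls accept a metadata filter to the same effect):

```python
chunks = [
    {"page_content": "RAG combines retrieval with generation...",
     "metadata": {"source": "research_paper.pdf", "page": 3}},
    {"page_content": "Chunking splits documents into pieces...",
     "metadata": {"source": "research_paper.pdf", "page": 7}},
]

def from_page(chunks, page):
    """Keep only chunks whose metadata records the given page."""
    return [c for c in chunks if c["metadata"]["page"] == page]

print(from_page(chunks, 3)[0]["metadata"])  # source + page of the match
```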

Next Steps

Vector Store

Learn how processed chunks are stored and retrieved

RAG Overview

See how chunking fits into the complete RAG pipeline
