What is RAG?
Retrieval-Augmented Generation (RAG) is an AI pattern that enhances language models by providing them with relevant context retrieved from external knowledge sources. Instead of relying solely on the model’s training data, RAG systems:
- Retrieve relevant information from a knowledge base
- Augment the user’s query with this context
- Generate an informed response using the language model
RAG dramatically improves accuracy and reduces hallucinations by grounding AI responses in actual document content.
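The three steps above compose into a single pipeline. The sketch below is illustrative TypeScript with stubbed retrieval and generation, not code from this project:

```typescript
// Illustrative types and stubs: a real system would back retrieve()
// with a vector store and generate() with a language model.
type Chunk = { text: string; score: number };

// Stage 1: retrieve - find chunks relevant to the query (stubbed:
// scores would normally come from similarity against the query).
function retrieve(_query: string, knowledgeBase: Chunk[]): Chunk[] {
  return knowledgeBase.filter((c) => c.score >= 0.7);
}

// Stage 2: augment - prepend the retrieved context to the question.
function augment(query: string, context: Chunk[]): string {
  const block = context.map((c) => c.text).join("\n");
  return `CONTEXT:\n${block}\n\nQUESTION: ${query}`;
}

// Stage 3: generate - call the language model (stubbed).
function generate(prompt: string): string {
  return `Answer grounded in ${prompt.length} chars of prompt`;
}

function answerQuestion(query: string, kb: Chunk[]): string {
  return generate(augment(query, retrieve(query, kb)));
}
```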
Pipeline Architecture
The PDF AI RAG pipeline consists of three main stages:

Stage 1: Document Ingestion
Document ingestion transforms PDFs into searchable vector embeddings.

Stage 2: Context Retrieval
When a user asks a question, relevant document chunks are retrieved.

Stage 3: Response Generation
The retrieved context is combined with the user’s question to generate an answer.

Document Processing
The document processing pipeline is implemented in src/lib/pinecone.ts and handles converting PDFs into searchable vectors.
Step 1: PDF Download from S3
The file is stored at D:/pdf-${Date.now()}.pdf during processing (src/lib/s3-server.ts:20).

Step 2: PDF Text Extraction
PDFLoader extracts text from every page, producing an array of page objects:
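The resulting array looks roughly like this (the field names follow LangChain’s Document shape; exact metadata can vary with loader version):

```typescript
// Approximate shape of the PDFLoader output: one object per page,
// following LangChain's Document interface.
interface PdfPage {
  pageContent: string;           // extracted text of the page
  metadata: {
    loc: { pageNumber: number }; // 1-based page number
  };
}

const pages: PdfPage[] = [
  { pageContent: "Introduction ...", metadata: { loc: { pageNumber: 1 } } },
  { pageContent: "Chapter 1 ...", metadata: { loc: { pageNumber: 2 } } },
];
```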
Step 3: Document Chunking
Large documents are split into smaller, semantically meaningful chunks:

Why chunking is important
Chunking serves three critical purposes:
- Token Limits - Language models have context window limits. Smaller chunks ensure we stay within bounds.
- Semantic Precision - Smaller chunks provide more focused, relevant context. A chunk about “security” won’t also include unrelated content about “pricing”.
- Pinecone Constraints - Metadata is limited to 36KB per vector (hence the truncateStringByBytes call).
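A simplified character-based splitter illustrates the idea (chunk size and overlap values here are illustrative; the project may rely on a library splitter such as LangChain’s RecursiveCharacterTextSplitter):

```typescript
// Split text into fixed-size chunks with overlap so that content cut
// at a boundary still appears intact in the neighboring chunk.
// Assumes chunkSize > overlap.
function splitText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
}
```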
Step 4: Byte Truncation
Pinecone has a 36KB metadata limit per vector, so chunk text stored in metadata is truncated with truncateStringByBytes before upload.

Step 5: Embedding Generation
Each chunk is converted to a 1536-dimension vector using OpenAI’s text-embedding-ada-002 model:
Why use MD5 hashing for IDs?

MD5 hashing ensures deterministic IDs. If the same content is processed twice, it gets the same ID, preventing duplicate vectors in Pinecone.
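Deterministic IDs of this kind can be produced with Node’s built-in crypto module; this is a sketch of the idea rather than the project’s exact code:

```typescript
import { createHash } from "node:crypto";

// Hash the chunk text so identical content always maps to the same
// vector ID, making re-ingestion idempotent.
function vectorId(chunkText: string): string {
  return createHash("md5").update(chunkText).digest("hex");
}
```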
Step 6: Pinecone Upload
Vectors are upserted to Pinecone with namespace isolation. Each PDF gets its own namespace (based on the S3 file key) to ensure data isolation between documents.
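Building the records for upsert might look like the sketch below. The truncateStringByBytes helper and record shape here are illustrative reconstructions, not the actual src/lib/pinecone.ts implementation:

```typescript
import { createHash } from "node:crypto";

// Sketch of a byte-safe truncation helper: walks code points and stops
// before the UTF-8 encoding would exceed maxBytes, so multi-byte
// characters are never split.
function truncateStringByBytes(str: string, maxBytes: number): string {
  const encoder = new TextEncoder();
  if (encoder.encode(str).length <= maxBytes) return str;
  let result = "";
  let byteCount = 0;
  for (const char of str) { // iterates code points, not UTF-16 units
    const charBytes = encoder.encode(char).length;
    if (byteCount + charBytes > maxBytes) break;
    result += char;
    byteCount += charBytes;
  }
  return result;
}

type PineconeRecord = {
  id: string;
  values: number[];
  metadata: { text: string; pageNumber: number };
};

// Build one upsert record per chunk: deterministic MD5 id, embedding
// values, and metadata kept under the byte budget.
function toRecord(text: string, pageNumber: number, embedding: number[]): PineconeRecord {
  return {
    id: createHash("md5").update(text).digest("hex"),
    values: embedding,
    metadata: { text: truncateStringByBytes(text, 36_000), pageNumber },
  };
}
```

With recent versions of the Pinecone SDK, the records would then be written per document, e.g. index.namespace(fileKey).upsert(records).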
Context Retrieval
When a user asks a question, the system retrieves relevant document chunks through semantic similarity search.

Step 1: Query Embedding
The question is embedded with the same text-embedding-ada-002 model used during ingestion, so queries and document chunks live in the same vector space.
Step 2: Vector Similarity Search
Understanding topK parameter

topK: 5 means “return the 5 most similar document chunks.” Pinecone uses cosine similarity to rank vectors. The similarity score ranges from 0 to 1:
- 0.9-1.0 - Extremely similar
- 0.7-0.9 - Highly relevant
- 0.5-0.7 - Somewhat relevant
- < 0.5 - Not relevant
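Conceptually, the ranking Pinecone performs server-side looks like this self-contained sketch (not the SDK call):

```typescript
// Cosine similarity between two equal-length vectors:
// 1 = identical direction, 0 = orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored vectors against the query and keep the topK matches,
// mirroring what Pinecone does on the server.
function topKMatches(
  query: number[],
  vectors: { id: string; values: number[] }[],
  topK = 5
) {
  return vectors
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```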
Step 3: Filtering by Relevance
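Matches below a similarity threshold are dropped before prompt assembly. A sketch (the 0.7 cutoff is illustrative):

```typescript
type Match = { id: string; score: number; metadata: { text: string } };

// Keep only matches that clear a minimum similarity score, so weakly
// related chunks never reach the prompt.
function filterRelevant(matches: Match[], minScore = 0.7): Match[] {
  return matches.filter((m) => m.score >= minScore);
}
```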
Step 4: Context Assembly
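The surviving chunks are then concatenated into a single context string. A sketch with an illustrative character budget:

```typescript
type ScoredChunk = { score: number; metadata: { text: string } };

// Concatenate chunk texts into one context string, stopping before a
// character budget is exceeded so the prompt stays within token limits.
function assembleContext(matches: ScoredChunk[], maxChars = 3000): string {
  let context = "";
  for (const m of matches) {
    if (context.length + m.metadata.text.length > maxChars) break;
    context += m.metadata.text + "\n";
  }
  return context.trim();
}
```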
Response Generation
The final stage combines retrieved context with the language model to generate answers.

System Prompt Construction
The system prompt explicitly instructs the AI to only use information from the context block, which helps prevent hallucinations.
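A prompt in that style might be assembled like this (wording illustrative):

```typescript
// Embed the retrieved context in a clearly delimited block and tell
// the model to answer only from it.
function buildSystemPrompt(context: string): string {
  return [
    "You are a helpful AI assistant answering questions about a PDF document.",
    "START CONTEXT BLOCK",
    context,
    "END OF CONTEXT BLOCK",
    "Only use information inside the context block to answer.",
    "If the answer is not in the context, say \"I don't know\".",
  ].join("\n");
}
```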
Streaming Chat Completion
Benefits of streaming
- Perceived Performance - Users see responses immediately instead of waiting for complete generation
- Better UX - Users can start reading while the AI is still writing
- Error Handling - If generation fails mid-stream, users still see partial responses
- Lower Time-to-First-Byte - Critical for Edge Runtime cold starts
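The streaming flow can be illustrated with a generic async iterator; a real deployment receives tokens over a streaming HTTP response rather than from an in-memory array:

```typescript
// Generic token streaming: the model yields tokens as they are
// generated and the client renders each one immediately.
async function* fakeCompletionStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    yield token; // in production, tokens arrive over a streaming HTTP body
  }
}

async function renderStream(stream: AsyncIterable<string>): Promise<string> {
  let rendered = "";
  for await (const token of stream) {
    rendered += token; // a UI would append to the visible message here
  }
  return rendered;
}
```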
Pipeline Performance
Document Processing Metrics
- PDF Download: ~1-3 seconds (depends on file size)
- Text Extraction: ~0.5-2 seconds per page
- Embedding Generation: ~0.3 seconds per chunk
- Pinecone Upload: ~1-2 seconds (batch operation)
Query Performance Metrics
- Query Embedding: ~0.3 seconds
- Pinecone Search: ~50-200ms
- GPT-4 First Token: ~1-2 seconds
- Streaming Completion: ~3-10 seconds (varies by response length)
All timings assume Edge Runtime deployment with optimal network conditions.
Error Handling
The pipeline includes error handling at each stage:
- S3 Download Fails - Returns an error before processing begins
- OpenAI API Error - Caught and logged, returns 500
- Pinecone Timeout - Automatically retried by SDK
- No Context Found - AI responds with “I don’t know”
Optimization Opportunities
Potential improvements
Caching
- Cache query embeddings for common questions
- Cache Pinecone search results (with TTL)
Parallel Processing
- Process PDF pages in parallel
- Batch embedding API calls
Smarter Chunking
- Use semantic chunking instead of character-based
- Preserve section boundaries
Hybrid Search
- Combine vector search with keyword search
- Re-rank results using cross-encoder models
Context Optimization
- Dynamically adjust context window based on query complexity
- Use compression techniques for longer contexts
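The caching idea above, for example, could start as a simple in-memory TTL map (a sketch; a production system would more likely use Redis or a bounded LRU):

```typescript
// Minimal TTL cache: entries expire ttlMs after insertion. Suitable
// for caching query embeddings or Pinecone search results.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict stale entries on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```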