Architecture overview
The RAG pipeline consists of six main stages:
- Document parsing - Extract text from PDFs and TXT files
- Text chunking - Split text into overlapping segments
- Embedding generation - Convert chunks to vector embeddings
- Vector storage - Store embeddings in MongoDB
- Semantic search - Find relevant chunks using vector similarity
- LLM generation - Generate natural language answers
Document parsing
The DocumentParserService extracts text from uploaded documents based on their MIME type.
Supported formats
PDF files
Uses the Smalot PDF Parser library to extract text from PDF documents. Handles multi-page documents and preserves text content while stripping formatting.
Limitations:
- Image-only PDFs without text layers will return empty content
- Complex layouts may have text extraction order issues
- OCR is not supported
Text files
Plain text files are read directly using PHP’s file_get_contents(). All UTF-8 content is preserved.
If parsing fails or returns empty text, the document status is set to “failed” and processing stops. The error is logged for debugging.
Text chunking
The TextChunkerService splits extracted text into smaller, overlapping segments optimized for vector search.
Chunking strategy
- Chunk size: 1200 characters per chunk (configurable)
- Overlap: 300 characters between consecutive chunks
- Boundary detection: Chunks break at word boundaries when possible to avoid splitting mid-word
- Normalization: Multiple whitespace characters are collapsed to single spaces
The overlap ensures that content spanning chunk boundaries is still captured. This improves retrieval accuracy when a query matches content near a chunk boundary.
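A minimal sketch of this strategy (Python for illustration; Filebright’s TextChunkerService is PHP, and edge-case handling here is simplified):

```python
import re

def chunk_text(text: str, size: int = 1200, overlap: int = 300) -> list[str]:
    # Normalization: collapse whitespace runs into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Boundary detection: back up to the last space so no word is split.
            space = text.rfind(" ", start + 1, end)
            if space != -1:
                end = space
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Overlap: the next chunk re-reads the last `overlap` characters.
        start = max(end - overlap, start + 1)
    return chunks
```

Because each chunk steps forward by roughly size minus overlap characters, any passage near a boundary appears in full in at least one of the two adjacent chunks.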
Why these parameters?
1200 character chunks
This size balances:
- Semantic coherence: Large enough to contain complete thoughts and context
- Specificity: Small enough to avoid mixing unrelated topics
- Embedding quality: Optimal length for embedding models
- Token limits: Stays well under LLM context windows when multiple chunks are sent
300 character overlap
The 25% overlap (300/1200) ensures:
- Context continuity across chunks
- Critical information near boundaries isn’t lost
- Better retrieval recall for queries matching boundary regions
Embedding generation
The EmbeddingService converts text chunks into high-dimensional vector embeddings using the OpenRouter API.
Embedding model
By default, Filebright uses text-embedding-3-small (configurable via OPENROUTER_EMBEDDING_MODEL):
- Dimensions: 1536-dimensional vectors
- Use case: Optimized for semantic search
- Performance: Fast and cost-effective
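Against an OpenAI-compatible embeddings endpoint, a request for this model carries the text to embed in an input field; the payload builder below is an illustration of that shape, not Filebright’s actual client code:

```python
def build_embedding_request(texts: list[str],
                            model: str = "text-embedding-3-small") -> dict:
    # "input" accepts a list of strings, so many chunks can be embedded in
    # one call; the response contains one 1536-dimensional vector per string.
    return {"model": model, "input": texts}
```

Passing a list rather than a single string is what allows a whole document’s chunks to be embedded in a single round trip.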
Bulk processing
All chunks from a document are embedded in a single API call for efficiency.
Vector storage
The VectorStorageService stores document chunks and their embeddings in MongoDB.
Data model
Each DocumentChunk document in MongoDB stores the chunk’s text, its embedding vector, and references to the owning user and source document.
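As an illustration, a stored chunk might look like the following (field names other than embedding are assumptions inferred from the behavior described in this document, and a real embedding array holds 1536 floats, not three):

```json
{
  "user_id": "<owner's ObjectId, used to scope searches>",
  "document_id": "<parent document's ObjectId>",
  "content": "<the chunk's text>",
  "embedding": [0.0132, -0.0457, 0.0213]
}
```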
Vector index
MongoDB’s vector search requires a vector index on the embedding field.
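With MongoDB Atlas, such an index is declared as a vectorSearch index definition along these lines (the filter field name is an assumption; the dimensions and similarity function match the embedding model described above):

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    { "type": "filter", "path": "user_id" }
  ]
}
```

The additional filter field is what lets a search be restricted to a single user’s chunks.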
Semantic search
When a user sends a query, the RAG system retrieves relevant chunks using vector similarity search.
Query flow
Embed the query
The user’s question is converted to a vector embedding using the same model as document chunks.
Filter by user
The search is scoped to the authenticated user’s documents only. Other users’ chunks are never included in results.
Search parameters
numCandidates: 100
The number of candidate documents to consider before selecting the final results. Higher values improve accuracy but reduce speed. 100 is a good balance for most use cases.
limit: 3
The maximum number of chunks returned. Three chunks typically provide sufficient context without exceeding LLM token limits or including irrelevant information.
similarity: cosine
Cosine similarity measures the angle between vectors, making it ideal for text embeddings where magnitude is less important than direction.
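The measure itself is short to state in code. This standalone sketch (the pipeline itself relies on MongoDB’s built-in implementation) shows why magnitude drops out:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Scaling a vector leaves its similarity unchanged: [1, 2] and [2, 4] point the same way, so their cosine similarity is 1.0 even though their lengths differ.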
Vector search is semantic, not lexical. It finds chunks with similar meaning to the query, even if they don’t share exact keywords.
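Put together, the retrieval step corresponds to an Atlas $vectorSearch aggregation stage along these lines (the index name and filter field are assumptions; the numeric parameters are the ones documented above):

```python
def build_search_stage(query_vector: list[float], user_id: str) -> dict:
    """Build the $vectorSearch stage for one user's query."""
    return {
        "$vectorSearch": {
            "index": "vector_index",         # assumed index name
            "path": "embedding",
            "queryVector": query_vector,     # embedding of the user's question
            "numCandidates": 100,            # candidates scored before ranking
            "limit": 3,                      # chunks handed to the LLM stage
            "filter": {"user_id": user_id},  # user isolation
        }
    }
```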
LLM response generation
The final stage combines retrieved chunks with the user’s query and sends them to an LLM for natural language generation.
Prompt engineering
The system uses a simple but effective prompt structure:
- System instruction: “You are a helpful assistant…”
- Context: The 3 retrieved chunks separated by ---
- Question: The user’s original query
- Instruction: “Answer:”
This structure encourages the model to:
- Ground responses in the provided context
- Answer the specific question asked
- Maintain a helpful, conversational tone
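Assembled as plain text, the user-facing part of the prompt might look like this (a sketch of the structure above, not Filebright’s exact template; the system instruction would typically travel separately as the chat API’s system message):

```python
def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n---\n".join(chunks)  # the 3 retrieved chunks
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```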
Default model
Filebright uses gpt-3.5-turbo by default (configurable via OPENROUTER_CHAT_MODEL):
- Fast response times
- Good reasoning capabilities
- Cost-effective for high query volumes
- Supports sufficient context window for 3 chunks + query
Error handling
The RAG pipeline includes robust error handling at every stage:
Parsing failures
Embedding failures
Query failures
No results
All errors are logged to Laravel’s log files at storage/logs/laravel.log for debugging and monitoring.
Performance considerations
Async processing
Document processing runs in background jobs via Laravel’s queue system:
- Uploads complete instantly
- Heavy processing doesn’t block the web server
- Failed jobs can be retried automatically
Bulk operations
Embeddings are generated in bulk to minimize API calls and processing time.
Database indexing
MongoDB’s vector index enables fast similarity search even across millions of chunks.
Caching opportunities
Potential optimizations (not currently implemented):
- Cache frequently asked queries and their results
- Cache embeddings for common queries
- Implement query result pagination for very large result sets
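The first two ideas could be prototyped with a per-user cache keyed on a hash of the normalized query, for example (purely illustrative; none of this exists in Filebright today):

```python
import hashlib

_cache: dict[str, str] = {}  # in production this would be Redis or similar

def cache_key(user_id: str, query: str) -> str:
    # Scope the key by user so cached answers never leak across tenants;
    # normalize casing and whitespace so trivially different queries hit.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{user_id}:{normalized}".encode()).hexdigest()

def get_or_answer(user_id: str, query: str, answer_fn) -> str:
    key = cache_key(user_id, query)
    if key not in _cache:
        _cache[key] = answer_fn(query)  # full RAG pipeline runs only on a miss
    return _cache[key]
```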
Scalability
The RAG architecture scales well:
- Horizontal scaling: Add more queue workers to process documents in parallel
- Vector storage: MongoDB Atlas vector search handles billions of vectors
- API limits: OpenRouter provides high rate limits and can be swapped for self-hosted models
- User isolation: All data is scoped by user ID, enabling multi-tenancy