Architecture overview
The RAG pipeline consists of six main stages:
- Document parsing - Extract text from PDFs and TXT files
- Text chunking - Split text into overlapping segments
- Embedding generation - Convert chunks to vector embeddings
- Vector storage - Store embeddings in MongoDB
- Semantic search - Find relevant chunks using vector similarity
- LLM generation - Generate natural language answers
Document parsing
The DocumentParserService extracts text from uploaded documents based on their MIME type.
Supported formats
PDF files
Uses the Smalot PDF Parser library to extract text from PDF documents. Handles multi-page documents and preserves text content while stripping formatting.
Limitations:
- Image-only PDFs without text layers will return empty content
- Complex layouts may have text extraction order issues
- OCR is not supported
Text files
Plain text files are read directly using PHP’s file_get_contents(). All UTF-8 content is preserved.
If parsing fails or returns empty text, the document status is set to “failed” and processing stops. The error is logged for debugging.
Text chunking
The TextChunkerService splits extracted text into smaller, overlapping segments optimized for vector search.
Chunking strategy
- Chunk size: 1200 characters per chunk (configurable)
- Overlap: 300 characters between consecutive chunks
- Boundary detection: Chunks break at word boundaries when possible to avoid splitting mid-word
- Normalization: Multiple whitespace characters are collapsed to single spaces
The overlap ensures that content spanning chunk boundaries is still captured. This improves retrieval accuracy when a query matches content near a chunk boundary.
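A minimal sketch of this strategy (Python for illustration; Filebright’s TextChunkerService is PHP, and edge-case handling here is simplified):

```python
import re

def chunk_text(text: str, size: int = 1200, overlap: int = 300) -> list[str]:
    # Normalization: collapse whitespace runs into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Boundary detection: back up to the last space so no word is split.
            space = text.rfind(" ", start + 1, end)
            if space != -1:
                end = space
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Overlap: the next chunk re-reads the last `overlap` characters.
        start = max(end - overlap, start + 1)
    return chunks
```

Because each chunk steps forward by roughly size minus overlap characters, any passage near a boundary appears in full in at least one of the two adjacent chunks.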
Why these parameters?
1200 character chunks
This size balances:
- Semantic coherence: Large enough to contain complete thoughts and context
- Specificity: Small enough to avoid mixing unrelated topics
- Embedding quality: Optimal length for embedding models
- Token limits: Stays well under LLM context windows when multiple chunks are sent
300 character overlap
The 25% overlap (300/1200) ensures:
- Context continuity across chunks
- Critical information near boundaries isn’t lost
- Better retrieval recall for queries matching boundary regions
Embedding generation
The EmbeddingService converts text chunks into high-dimensional vector embeddings using the OpenRouter API.
Embedding model
By default, Filebright uses text-embedding-3-small (configurable via OPENROUTER_EMBEDDING_MODEL):
- Dimensions: 1536-dimensional vectors
- Use case: Optimized for semantic search
- Performance: Fast and cost-effective
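Against an OpenAI-compatible embeddings endpoint, a request for this model carries the text to embed in an input field; the payload builder below is an illustration of that shape, not Filebright’s actual client code:

```python
def build_embedding_request(texts: list[str],
                            model: str = "text-embedding-3-small") -> dict:
    # "input" accepts a list of strings, so many chunks can be embedded in
    # one call; the response contains one 1536-dimensional vector per string.
    return {"model": model, "input": texts}
```

Passing a list rather than a single string is what allows a whole document’s chunks to be embedded in a single round trip.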
Bulk processing
All chunks from a document are embedded in a single API call for efficiency.
Vector storage
The VectorStorageService stores document chunks and their embeddings in MongoDB.
Data model
Each DocumentChunk document in MongoDB stores the chunk’s text, its embedding vector, and references to the owning user and source document.
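As an illustration, a stored chunk might look like the following (field names other than embedding are assumptions inferred from the behavior described in this document, and a real embedding array holds 1536 floats, not three):

```json
{
  "user_id": "<owner's ObjectId, used to scope searches>",
  "document_id": "<parent document's ObjectId>",
  "content": "<the chunk's text>",
  "embedding": [0.0132, -0.0457, 0.0213]
}
```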
Vector index
MongoDB’s vector search requires a vector index on the embedding field.
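With MongoDB Atlas, such an index is declared as a vectorSearch index definition along these lines (the filter field name is an assumption; the dimensions and similarity function match the embedding model described above):

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    { "type": "filter", "path": "user_id" }
  ]
}
```

The additional filter field is what lets a search be restricted to a single user’s chunks.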
Semantic search
When a user sends a query, the RAG system retrieves relevant chunks using vector similarity search.
Query flow
Embed the query
The user’s question is converted to a vector embedding using the same model as document chunks.
Filter by user
The search is scoped to the authenticated user’s documents only. Other users’ chunks are never included in results.
Search parameters
numCandidates: 100
The number of candidate documents to consider before selecting the final results. Higher values improve accuracy but reduce speed. 100 is a good balance for most use cases.
limit: 3
The maximum number of chunks returned. Three chunks typically provide sufficient context without exceeding LLM token limits or including irrelevant information.
similarity: cosine
Cosine similarity measures the angle between vectors, making it ideal for text embeddings where magnitude is less important than direction.
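The measure itself is short to state in code. This standalone sketch (the pipeline itself relies on MongoDB’s built-in implementation) shows why magnitude drops out:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Scaling a vector leaves its similarity unchanged: [1, 2] and [2, 4] point the same way, so their cosine similarity is 1.0 even though their lengths differ.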
Vector search is semantic, not lexical. It finds chunks with similar meaning to the query, even if they don’t share exact keywords.
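Put together, the retrieval step corresponds to an Atlas $vectorSearch aggregation stage along these lines (the index name and filter field are assumptions; the numeric parameters are the ones documented above):

```python
def build_search_stage(query_vector: list[float], user_id: str) -> dict:
    """Build the $vectorSearch stage for one user's query."""
    return {
        "$vectorSearch": {
            "index": "vector_index",         # assumed index name
            "path": "embedding",
            "queryVector": query_vector,     # embedding of the user's question
            "numCandidates": 100,            # candidates scored before ranking
            "limit": 3,                      # chunks handed to the LLM stage
            "filter": {"user_id": user_id},  # user isolation
        }
    }
```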
LLM response generation
The final stage combines retrieved chunks with the user’s query and sends them to an LLM for natural language generation.
Prompt engineering
The system uses a simple but effective prompt structure:
- System instruction: “You are a helpful assistant…”
- Context: The 3 retrieved chunks separated by ---
- Question: The user’s original query
- Instruction: “Answer:”
This structure encourages the model to:
- Ground responses in the provided context
- Answer the specific question asked
- Maintain a helpful, conversational tone
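Assembled as plain text, the user-facing part of the prompt might look like this (a sketch of the structure above, not Filebright’s exact template; the system instruction would typically travel separately as the chat API’s system message):

```python
def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n---\n".join(chunks)  # the 3 retrieved chunks
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```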
Default model
Filebright uses gpt-3.5-turbo by default (configurable via OPENROUTER_CHAT_MODEL):
- Fast response times
- Good reasoning capabilities
- Cost-effective for high query volumes
- Supports sufficient context window for 3 chunks + query
Error handling
The RAG pipeline includes robust error handling at every stage:
Parsing failures
Embedding failures
Query failures
No results
All errors are logged to Laravel’s log files at storage/logs/laravel.log for debugging and monitoring.
Performance considerations
Async processing
Document processing runs in background jobs via Laravel’s queue system:
- Uploads complete instantly
- Heavy processing doesn’t block the web server
- Failed jobs can be retried automatically
Bulk operations
Embeddings are generated in bulk to minimize API calls and processing time.
Database indexing
MongoDB’s vector index enables fast similarity search even across millions of chunks.
Caching opportunities
Potential optimizations (not currently implemented):
- Cache frequently asked queries and their results
- Cache embeddings for common queries
- Implement query result pagination for very large result sets
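The first two ideas could be prototyped with a per-user cache keyed on a hash of the normalized query, for example (purely illustrative; none of this exists in Filebright today):

```python
import hashlib

_cache: dict[str, str] = {}  # in production this would be Redis or similar

def cache_key(user_id: str, query: str) -> str:
    # Scope the key by user so cached answers never leak across tenants;
    # normalize casing and whitespace so trivially different queries hit.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{user_id}:{normalized}".encode()).hexdigest()

def get_or_answer(user_id: str, query: str, answer_fn) -> str:
    key = cache_key(user_id, query)
    if key not in _cache:
        _cache[key] = answer_fn(query)  # full RAG pipeline runs only on a miss
    return _cache[key]
```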
Scalability
The RAG architecture scales well:
- Horizontal scaling: Add more queue workers to process documents in parallel
- Vector storage: MongoDB Atlas vector search handles billions of vectors
- API limits: OpenRouter provides high rate limits and can be swapped for self-hosted models
- User isolation: All data is scoped by user ID, enabling multi-tenancy