Filebright uses a RAG (Retrieval Augmented Generation) architecture to enable intelligent document querying. This system combines vector embeddings, semantic search, and large language models to provide accurate answers from your documents.

Architecture overview

The RAG pipeline consists of six main stages:
  1. Document parsing - Extract text from PDFs and TXT files
  2. Text chunking - Split text into overlapping segments
  3. Embedding generation - Convert chunks to vector embeddings
  4. Vector storage - Store embeddings in MongoDB
  5. Semantic search - Find relevant chunks using vector similarity
  6. LLM generation - Generate natural language answers

Document parsing

The DocumentParserService extracts text from uploaded documents based on their MIME type.
// DocumentParserService.php:17-46
public function parse(string $filePath, string $mimeType): string
{
    if (!file_exists($filePath)) {
        return '';
    }

    return match ($mimeType) {
        'application/pdf' => $this->parsePdf($filePath),
        'text/plain' => file_get_contents($filePath) ?: '',
        default => '',
    };
}

private function parsePdf(string $filePath): string
{
    try {
        $pdf = $this->pdfParser->parseFile($filePath);
        return $pdf->getText();
    } catch (\Exception $e) {
        Log::error("PDF Parsing failed: " . $e->getMessage());
        return '';
    }
}

Supported formats

Filebright uses the Smalot PDF Parser library to extract text from PDF documents. It handles multi-page documents and preserves text content while stripping formatting. Limitations:
  • Image-only PDFs without text layers will return empty content
  • Complex layouts may have text extraction order issues
  • OCR is not supported
Plain text files are read directly using PHP’s file_get_contents(). All UTF-8 content is preserved.
If parsing fails or returns empty text, the document status is set to “failed” and processing stops. The error is logged for debugging.

Text chunking

The TextChunkerService splits extracted text into smaller, overlapping segments optimized for vector search.
// TextChunkerService.php:7-54
public function chunk(string $text, int $chunkSize = 1200, int $overlap = 300): array
{
    if (empty($text)) {
        return [];
    }

    $text = preg_replace('/\s+/', ' ', trim($text));
    $textLength = strlen($text);

    if ($textLength <= $chunkSize) {
        return [$text];
    }

    $chunks = [];
    $start = 0;

    while ($start < $textLength) {
        $end = $start + $chunkSize;

        if ($end < $textLength) {
            $lastSpace = strrpos(substr($text, $start, $chunkSize), ' ');
            if ($lastSpace !== false && $lastSpace > ($chunkSize - 100)) {
                $end = $start + $lastSpace;
            }
        } else {
            $end = $textLength;
        }

        $chunk = trim(substr($text, $start, $end - $start));
        if (!empty($chunk)) {
            $chunks[] = $chunk;
        }

        if ($end >= $textLength) {
            break;
        }

        $nextStart = $end - $overlap;
        $start = ($nextStart > $start) ? $nextStart : $end;
    }

    return $chunks;
}

Chunking strategy

  • Chunk size: 1200 characters per chunk (configurable)
  • Overlap: 300 characters between consecutive chunks
  • Boundary detection: Chunks break at word boundaries when possible to avoid splitting mid-word
  • Normalization: Multiple whitespace characters are collapsed to single spaces
The overlap ensures that content spanning chunk boundaries is still captured. This improves retrieval accuracy when a query matches content near a chunk boundary.
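For intuition, here is the stride arithmetic in isolation (ignoring the word-boundary snapping): with a 1200-character chunk and a 300-character overlap, each new chunk starts 900 characters after the previous one. This standalone sketch is illustrative, not code from TextChunkerService:

```php
<?php
// Stride arithmetic of the chunker, ignoring word-boundary snapping:
// each chunk starts (chunkSize - overlap) characters after the previous one.
function chunkStarts(int $textLength, int $chunkSize = 1200, int $overlap = 300): array
{
    $starts = [];
    for ($start = 0; $start < $textLength; $start += $chunkSize - $overlap) {
        $starts[] = $start;
        if ($start + $chunkSize >= $textLength) {
            break; // the last chunk reaches the end of the text
        }
    }
    return $starts;
}

// A 3000-character text yields chunks starting at offsets 0, 900, and 1800,
// so the regions 900-1200 and 1800-2100 each appear in two chunks.
print_r(chunkStarts(3000));
```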

Why these parameters?

The 1200-character chunk size balances:
  • Semantic coherence: Large enough to contain complete thoughts and context
  • Specificity: Small enough to avoid mixing unrelated topics
  • Embedding quality: Optimal length for embedding models
  • Token limits: Stays well under LLM context windows when multiple chunks are sent
The 25% overlap (300/1200) ensures:
  • Context continuity across chunks
  • Critical information near boundaries isn’t lost
  • Better retrieval recall for queries matching boundary regions

Embedding generation

The EmbeddingService converts text chunks into high-dimensional vector embeddings using the OpenRouter API.
// EmbeddingService.php:19-53
public function getBulkEmbeddings(array $texts): array
{
    if (empty($texts)) return [];

    $sanitizedTexts = array_map([$this, 'sanitize'], $texts);
    
    try {
        $response = Http::withHeaders([
            'Authorization' => 'Bearer ' . $this->apiKey,
            'HTTP-Referer' => config('app.url'),
        ])->post('https://openrouter.ai/api/v1/embeddings', [
            'model' => $this->model,
            'input' => $sanitizedTexts,
        ]);

        if ($response->failed()) {
            Log::error("OpenRouter Bulk Embedding API failed: " . $response->body());
            return [];
        }

        $data = $response->json('data');
        if (empty($data)) return [];

        // Sort by index if provided by API, though usually it matches order
        usort($data, fn($a, $b) => ($a['index'] ?? 0) <=> ($b['index'] ?? 0));

        return array_map(fn($item) => $item['embedding'], $data);
    } catch (\Exception $e) {
        Log::error("Bulk Embedding API Exception: " . $e->getMessage());
        return [];
    }
}

private function sanitize(string $text): string
{
    // Strip malformed UTF-8 characters
    return mb_convert_encoding($text, 'UTF-8', 'UTF-8');
}

Embedding model

By default, Filebright uses text-embedding-3-small (configurable via OPENROUTER_EMBEDDING_MODEL):
  • Dimensions: 1536-dimensional vectors
  • Use case: Optimized for semantic search
  • Performance: Fast and cost-effective

Bulk processing

All chunks from a document are embedded in a single API call for efficiency:
// ProcessDocument.php:48-53
$this->document->update(['status' => 'vectorizing']);
$embeddings = $embeddingService->getBulkEmbeddings($chunks);

if (count($embeddings) !== $chunkCount) {
    throw new \Exception("Failed to generate embeddings for all chunks.");
}
Bulk embedding reduces API calls and processing time. A 100-page document might generate 200+ chunks, which would take 200 API calls if done individually but only 1 with bulk processing.

Vector storage

The VectorStorageService stores document chunks and their embeddings in MongoDB.
// VectorStorageService.php:17-39
public function storeChunks(int $documentId, int $userId, array $chunks, array $embeddings): void
{
    try {
        foreach ($chunks as $index => $content) {
            $this->chunkModel->create([
                'document_id' => $documentId,
                'content' => $content,
                'embedding' => $embeddings[$index],
                'metadata' => [
                    'user_id' => $userId,
                    'chunk_index' => $index,
                ]
            ]);
        }
    } catch (\Exception $e) {
        Log::error("Vector storage failed for document ID {$documentId}: " . $e->getMessage());
        throw $e;
    }
}

Data model

Each DocumentChunk document in MongoDB contains:
{
  "_id": "ObjectId",
  "document_id": 123,
  "content": "The chunk text content...",
  "embedding": [0.123, -0.456, 0.789, ...], // 1536 dimensions
  "metadata": {
    "user_id": 42,
    "chunk_index": 0
  },
  "created_at": "2026-03-03T12:00:00Z",
  "updated_at": "2026-03-03T12:00:00Z"
}

Vector index

MongoDB’s vector search requires a vector index on the embedding field:
// MongoDB Atlas vector index configuration
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "metadata.user_id"
    }
  ]
}
The vector index must be created manually in MongoDB Atlas before the RAG system can perform searches. Without this index, queries will fail.
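The index can also be created from mongosh rather than the Atlas UI. A sketch, assuming the DocumentChunk model maps to a document_chunks collection (substitute your actual collection name); the index name must match the one referenced in RAGService:

```javascript
// mongosh: create the vector search index used by $vectorSearch.
// Requires MongoDB Atlas (or a deployment with search indexes enabled).
db.document_chunks.createSearchIndex(
  "vector_index",   // must match the 'index' name used in RAGService
  "vectorSearch",
  {
    fields: [
      { type: "vector", path: "embedding", numDimensions: 1536, similarity: "cosine" },
      { type: "filter", path: "metadata.user_id" }
    ]
  }
);
```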

Semantic search

When a user sends a query, the RAG system retrieves relevant chunks using vector similarity search.

Query flow

1. Embed the query

The user’s question is converted to a vector embedding using the same model as document chunks:
// RAGService.php:18-20
public function answer(string $query, int $userId): string
{
    $queryEmbedding = $this->embeddingService->getEmbedding($query);
    // ...
}
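getEmbedding() is not shown in the listings above; a plausible implementation simply delegates to the bulk endpoint. This is a hypothetical sketch, not the actual EmbeddingService code:

```php
// Hypothetical single-text wrapper around getBulkEmbeddings().
// Returns an empty array if the API call fails, which the caller checks for.
public function getEmbedding(string $text): array
{
    $embeddings = $this->getBulkEmbeddings([$text]);
    return $embeddings[0] ?? [];
}
```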
2. Vector search

MongoDB performs a $vectorSearch aggregation to find similar chunks:
// RAGService.php:37-55
protected function retrieveContext(array $embedding, int $userId): \Illuminate\Support\Collection
{
    return DocumentChunk::raw(function ($collection) use ($embedding, $userId) {
        return $collection->aggregate([
            [
                '$vectorSearch' => [
                    'index' => 'vector_index',
                    'path' => 'embedding',
                    'queryVector' => $embedding,
                    'numCandidates' => 100,
                    'limit' => 3,
                    'filter' => [
                        'metadata.user_id' => $userId
                    ]
                ]
            ]
        ]);
    });
}
3. Filter by user

The search is scoped to the authenticated user’s documents only. Other users’ chunks are never included in results.
4. Rank and limit

MongoDB returns the top 3 chunks ranked by cosine similarity to the query vector.

Search parameters

numCandidates (100): The number of candidate documents to consider before selecting the final results. Higher values improve accuracy but reduce speed; 100 is a good balance for most use cases.
limit (3): The maximum number of chunks returned. Three chunks typically provide sufficient context without exceeding LLM token limits or including irrelevant information.
similarity (cosine): Cosine similarity measures the angle between vectors, making it ideal for text embeddings, where magnitude matters less than direction.
Vector search is semantic, not lexical. It finds chunks with similar meaning to the query, even if they don’t share exact keywords.
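To make the similarity metric concrete, here is cosine similarity computed by hand. This plain-PHP sketch illustrates the formula Atlas applies internally; it is not code from Filebright:

```php
<?php
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Ranges from -1 (opposite direction) to 1 (same direction);
// vector magnitude is ignored, only the angle matters.
function cosineSimilarity(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Vectors pointing the same way score ~1.0 regardless of magnitude.
echo cosineSimilarity([1, 2, 3], [2, 4, 6]);
```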

LLM response generation

The final stage combines retrieved chunks with the user’s query and sends them to an LLM for natural language generation.
// RAGService.php:57-88
protected function getLLMResponse(string $query, string $context): string
{
    $apiKey = config('services.openrouter.key');
    $model = config('services.openrouter.chat_model', 'openai/gpt-3.5-turbo');

    $prompt = "You are a helpful assistant. Use the following pieces of retrieved context to answer the user's question.\n\n"
            . "Context:\n" . $context . "\n\n"
            . "Question: " . $query . "\n\n"
            . "Answer:";

    try {
        $response = Http::withHeaders([
            'Authorization' => 'Bearer ' . $apiKey,
            'Content-Type' => 'application/json',
        ])->post('https://openrouter.ai/api/v1/chat/completions', [
            'model' => $model,
            'messages' => [
                ['role' => 'user', 'content' => $prompt]
            ],
        ]);

        if ($response->successful()) {
            return $response->json('choices.0.message.content') ?? "Error retrieving response.";
        }

        Log::error("OpenRouter API Error: " . $response->body());
        return "Error communicating with AI service.";
    } catch (\Exception $e) {
        Log::error("RAG Service Exception: " . $e->getMessage());
        return "An unexpected error occurred.";
    }
}

Prompt engineering

The system uses a simple but effective prompt structure:
  1. System instruction: “You are a helpful assistant…”
  2. Context: The 3 retrieved chunks separated by ---
  3. Question: The user’s original query
  4. Instruction: “Answer:”
This structure guides the LLM to:
  • Ground responses in the provided context
  • Answer the specific question asked
  • Maintain a helpful, conversational tone
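The assembly of these pieces can be sketched as a standalone function. The --- separator follows the structure described above; the helper name is illustrative, not the actual RAGService code:

```php
<?php
// Assemble the RAG prompt: instruction, retrieved chunks joined by ---,
// then the user's question. Mirrors the structure used in getLLMResponse().
function buildPrompt(array $chunks, string $question): string
{
    $context = implode("\n---\n", $chunks);
    return "You are a helpful assistant. Use the following pieces of retrieved context to answer the user's question.\n\n"
         . "Context:\n{$context}\n\n"
         . "Question: {$question}\n\n"
         . "Answer:";
}

echo buildPrompt(['Chunk one.', 'Chunk two.'], 'What is covered?');
```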

Default model

Filebright uses gpt-3.5-turbo by default (configurable via OPENROUTER_CHAT_MODEL):
  • Fast response times
  • Good reasoning capabilities
  • Cost-effective for high query volumes
  • Supports sufficient context window for 3 chunks + query
You can configure a more powerful model like GPT-4 or Claude for improved reasoning and longer context handling. Update the OPENROUTER_CHAT_MODEL environment variable.
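For example, in .env (the GPT-4 slug is illustrative; verify the exact identifier against OpenRouter's model list before deploying):

```ini
# Model selection — values follow the variables documented above
OPENROUTER_CHAT_MODEL=openai/gpt-4
OPENROUTER_EMBEDDING_MODEL=text-embedding-3-small
```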

Error handling

The RAG pipeline includes robust error handling at every stage:

Parsing failures

// ProcessDocument.php:39-41
if (empty($text)) {
    throw new \Exception("Extraction returned empty text.");
}
Sets document status to “failed” and logs the error.

Embedding failures

// ProcessDocument.php:51-53
if (count($embeddings) !== $chunkCount) {
    throw new \Exception("Failed to generate embeddings for all chunks.");
}
Ensures all chunks were successfully embedded before proceeding.

Query failures

// RAGService.php:22-24
if (empty($queryEmbedding)) {
    return "I'm sorry, I couldn't process your request at the moment.";
}
Provides user-friendly error messages instead of exposing technical failures.

No results

// RAGService.php:28-30
if ($chunks->isEmpty()) {
    return "I couldn't find any relevant information in your documents to answer that.";
}
Clearly communicates when no relevant content was found.
All errors are logged to Laravel’s log files at storage/logs/laravel.log for debugging and monitoring.

Performance considerations

Async processing

Document processing runs in background jobs via Laravel’s queue system:
// DocumentController.php:38
ProcessDocument::dispatch($document);
This ensures:
  • Uploads complete instantly
  • Heavy processing doesn’t block the web server
  • Failed jobs can be retried automatically
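Background processing therefore requires a running queue worker. For a standard Laravel setup:

```shell
# Start a worker to pick up dispatched ProcessDocument jobs
php artisan queue:work
```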

Bulk operations

Embeddings are generated in bulk to minimize API calls and processing time.

Database indexing

MongoDB’s vector index enables fast similarity search even across millions of chunks.

Caching opportunities

Potential optimizations (not currently implemented):
  • Cache frequently asked queries and their results
  • Cache embeddings for common queries
  • Implement query result pagination for very large result sets

Scalability

The RAG architecture scales well:
  • Horizontal scaling: Add more queue workers to process documents in parallel
  • Vector storage: MongoDB Atlas vector search handles billions of vectors
  • API limits: OpenRouter provides high rate limits and can be swapped for self-hosted models
  • User isolation: All data is scoped by user ID, enabling multi-tenancy
Monitor your OpenRouter API usage and costs, especially if you have many users. Consider implementing rate limiting on the /api/chat endpoint to prevent abuse.
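Laravel's built-in throttle middleware is one way to add that rate limit. A sketch only: the route path, controller name, and limits below are illustrative, not Filebright's actual route file:

```php
// routes/api.php — cap each authenticated user at 30 chat requests per minute.
// 'ChatController' and '/chat' are hypothetical names for illustration.
Route::middleware(['auth:sanctum', 'throttle:30,1'])
    ->post('/chat', [ChatController::class, 'send']);
```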
