Upload Document

Overview

This endpoint accepts a document upload and automatically runs it through the RAG (Retrieval-Augmented Generation) pipeline. The document is:

Validated and stored
Parsed to extract its text, based on file type
Split into chunks with configurable size and overlap
Converted to embeddings using OpenAI’s embedding model
Indexed as vectors for semantic search

Request

document (file, required)

The document file to upload. Supported formats:

PDF (.pdf) - Extracted using text processing utilities
DOCX (.docx) - Microsoft Word documents
TXT (.txt) - Plain text files

The maximum file size is set by the uploads.max_size configuration value.

Request Details

Content-Type: multipart/form-data

File Validation:

File type checked against allowedTypes configuration (api/upload.php:24)
File size validated against maxSize limit (api/upload.php:25)
Duplicate detection using MD5 hash (DocumentService.php:53)
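
The three checks above can be sketched as follows. This is a minimal illustration, not the code in api/upload.php or DocumentService.php: the function names, the return values, and the idea of matching on file extension are assumptions made for the example.

```php
<?php
// Hypothetical helper (not from the source): returns null when the upload
// passes, or a short reason string when it is rejected.
function validateUpload(string $filename, int $sizeBytes, array $allowedTypes, int $maxSize): ?string
{
    // Assumption: the type check compares the lowercase file extension.
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if (!in_array($extension, $allowedTypes, true)) {
        return 'type_not_allowed';
    }
    if ($sizeBytes > $maxSize) {
        return 'file_too_large';
    }
    return null;
}

// Hypothetical duplicate check: hashes the content with MD5 and looks the
// hash up in a list of hashes of previously uploaded documents.
function isDuplicate(string $contents, array $knownHashes): bool
{
    return in_array(md5($contents), $knownHashes, true);
}
```

Hashing the content rather than the filename means a renamed copy of the same file is still detected as a duplicate.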

Response

success (boolean, required)
Indicates whether the upload and indexing succeeded.

document (object)
Contains details about the uploaded document.

document.id (integer)
Database ID of the uploaded document.

document.filename (string)
Original filename of the uploaded document.

document.chunks (integer)
Number of text chunks created and indexed for RAG retrieval.

error (string)
Error message if the upload failed.

RAG Indexing Process

After a successful upload, the document is indexed automatically (api/upload.php:64-69):

$chunksIndexed = $rag->indexDocument(
    $document['id'],
    $document['text'],
    Config::get('rag.chunk_size'),
    Config::get('rag.chunk_overlap')
);

Text Extraction

Text is extracted using TextProcessor::extractText() (DocumentService.php:52):

PDF: Uses text extraction utilities
DOCX: Parses Word document structure
TXT: Direct file read

Chunking Strategy

Chunk Size: Configured via rag.chunk_size (typically 500-1000 tokens)
Chunk Overlap: Configured via rag.chunk_overlap (typically 50-200 tokens)
Overlap ensures context continuity between chunks
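
A sliding-window chunker with overlap might look like the sketch below. It is an illustration only: the function name is hypothetical, and it approximates tokens with whitespace-separated words, which is an assumption and not necessarily how indexDocument() counts them.

```php
<?php
// Hypothetical chunker (not from the source). Assumption: "tokens" are
// approximated by whitespace-separated words.
function chunkText(string $text, int $chunkSize, int $chunkOverlap): array
{
    if ($chunkSize <= 0 || $chunkOverlap < 0 || $chunkOverlap >= $chunkSize) {
        throw new InvalidArgumentException('Require 0 <= overlap < chunk size');
    }

    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $step = $chunkSize - $chunkOverlap; // how far the window advances each time
    $chunks = [];

    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $chunkSize));
        if ($start + $chunkSize >= $total) {
            break; // this window already reached the end of the text
        }
    }
    return $chunks;
}
```

Each chunk repeats the last $chunkOverlap words of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.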

Embedding Generation

Each chunk is converted to a vector embedding using OpenAI’s embedding model. The OpenAIService client is constructed with the embedding model name (api/upload.php:36-40):

$openai = new OpenAIService(
    $oaiCreds['api_key'],
    $oaiCreds['model'],
    $oaiCreds['embedding_model'],
    $logger
);

Vector Storage

Embeddings are stored in the vectors table with:

Document ID reference
Chunk text
Chunk index (position in document)
Vector embedding (for similarity search)
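
As a rough sketch of how the four fields listed above could be assembled before insertion, the helper below builds one row per chunk. The function name, the array keys, and the choice to serialize the embedding as JSON are all assumptions for illustration; the actual column names and storage format of the vectors table are not shown in this document.

```php
<?php
// Hypothetical row builder (not from the source). Keys mirror the fields
// listed above; real column names may differ.
function buildVectorRows(int $documentId, array $chunks, array $embeddings): array
{
    if (count($chunks) !== count($embeddings)) {
        throw new InvalidArgumentException('Each chunk needs exactly one embedding');
    }

    $chunks = array_values($chunks);
    $embeddings = array_values($embeddings);
    $rows = [];

    foreach ($chunks as $index => $chunkText) {
        $rows[] = [
            'document_id' => $documentId,
            'chunk_index' => $index,            // position of the chunk in the document
            'chunk_text'  => $chunkText,
            // Assumption: the embedding is stored as a JSON-encoded list of floats.
            'embedding'   => json_encode($embeddings[$index]),
        ];
    }
    return $rows;
}
```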

Example

curl -X POST https://your-domain.com/api/upload \
  -F "document=@./company-handbook.pdf"

Success Response

{
  "success": true,
  "document": {
    "id": 42,
    "filename": "company-handbook.pdf",
    "chunks": 127
  }
}

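Error Responses

Error messages are returned in Spanish. "Error al subir documento" means "Error uploading document", and "Documento duplicado: 'company-handbook.pdf' ya fue subido previamente" means "Duplicate document: 'company-handbook.pdf' was already uploaded previously". The missing-file and invalid-file-type examples share the same generic message.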

Missing File

{
  "success": false,
  "error": "Error al subir documento"
}

Duplicate Document

{
  "success": false,
  "error": "Documento duplicado: 'company-handbook.pdf' ya fue subido previamente"
}

Invalid File Type

{
  "success": false,
  "error": "Error al subir documento"
}
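
A client can branch on the success flag and read either the document object or the error string. The sketch below uses only the response fields documented on this page; the function name and the exception types are assumptions for the example.

```php
<?php
// Hypothetical client-side helper (not from the source): returns the
// "document" object on success and throws with the API's error text otherwise.
function parseUploadResponse(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !array_key_exists('success', $data)) {
        throw new UnexpectedValueException('Response is not a valid upload result');
    }
    if ($data['success'] !== true) {
        throw new RuntimeException($data['error'] ?? 'Upload failed without an error message');
    }
    return $data['document'] ?? [];
}
```

Because the missing-file and invalid-type cases return the same message, a client cannot tell them apart from the error string alone and may want to validate the file locally before uploading.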

Implementation Details

Credential Resolution

The endpoint first tries to load OpenAI credentials from the database. If that lookup throws an exception, it falls back to the credentials in the config file (api/upload.php:30-51):

try {
    $encryption = new EncryptionService();
    $credentialService = new CredentialService($db, $encryption);
    if ($credentialService->hasOpenAICredentials()) {
        $oaiCreds = $credentialService->getOpenAICredentials();
        $openai = new OpenAIService(
            $oaiCreds['api_key'],
            $oaiCreds['model'],
            $oaiCreds['embedding_model'],
            $logger
        );
    }
} catch (\Exception $credEx) {
    // Fallback to config file credentials
    $openai = new OpenAIService(
        Config::get('openai.api_key'),
        Config::get('openai.model'),
        Config::get('openai.embedding_model'),
        $logger
    );
}
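
In the snippet above, the config fallback runs only inside the catch block. Whether the case where hasOpenAICredentials() returns false is handled elsewhere in api/upload.php is not visible here. A variant that falls back in both situations could be sketched as follows; the function name and the callable-based design are hypothetical and chosen only to keep the example self-contained.

```php
<?php
// Hypothetical resolver (not from the source). $loadFromDb is any callable
// that returns a credentials array, or null when nothing is stored.
function resolveOpenAICredentials(callable $loadFromDb, array $configCredentials): array
{
    try {
        $stored = $loadFromDb();
        if (is_array($stored) && !empty($stored['api_key'])) {
            return $stored; // database credentials take priority
        }
    } catch (\Exception $e) {
        // Lookup or decryption failed; fall through to the config values.
    }
    return $configCredentials;
}
```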

Vector Search Configuration

The similarity method is configurable (api/upload.php:53):

cosine - Cosine similarity (default)
euclidean - Euclidean distance
dot_product - Dot product similarity
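
For reference, the three measures can be computed as in the sketch below. The function names are illustrative and are not taken from the project's code; only the method identifiers cosine, euclidean, and dot_product come from the list above.

```php
<?php
// Illustrative implementations of the three similarity methods.
function dotProduct(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

function cosineSimilarity(array $a, array $b): float
{
    $norms = sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b));
    // A zero vector has no direction; treat its similarity as 0.
    return $norms == 0.0 ? 0.0 : dotProduct($a, $b) / $norms;
}

function euclideanDistance(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

// Dispatches on the configured method name.
function similarity(string $method, array $a, array $b): float
{
    switch ($method) {
        case 'cosine':
            return cosineSimilarity($a, $b);
        case 'euclidean':
            return euclideanDistance($a, $b);
        case 'dot_product':
            return dotProduct($a, $b);
        default:
            throw new InvalidArgumentException("Unknown similarity method: $method");
    }
}
```

Note that cosine and dot product are similarities (higher means closer), while Euclidean is a distance (lower means closer), so ranking code has to sort in the opposite direction for euclidean.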

Related Endpoints

Get Documents - List all uploaded documents
Delete Document - Remove a document and its vectors
Get Document Content - Retrieve a document's text chunks
