Overview
The DocumentService manages the knowledge base: it handles document uploads, extracts text content, rejects duplicates, and maintains document metadata. It supports multiple file formats, and the text it extracts is the input to the chunking and embedding steps of the RAG system.
Class Structure
Namespace: App\Services
Dependencies:
Database - Database connection for document metadata
TextProcessor - Utility for extracting text from various file formats
Constructor
public function __construct(
    Database $db,
    string $uploadPath,
    array $allowedTypes,
    int $maxSize
)
$db - Database instance for document persistence
$uploadPath - Directory path for storing uploaded files
$allowedTypes - Array of allowed file extensions (e.g., ['pdf', 'txt', 'docx'])
$maxSize - Maximum file size in bytes
Behavior:
- Automatically creates upload directory if it doesn’t exist
- Sets up file validation rules
$documentService = new DocumentService(
    $db,
    __DIR__ . '/../uploads/documents',
    ['pdf', 'txt', 'docx', 'doc', 'md'],
    10 * 1024 * 1024 // 10 MB
);
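The source does not show how the constructor creates the upload directory. A minimal sketch of one way to do it follows, assuming a hypothetical helper named ensureUploadDirectory() that is not part of the documented API:

```php
<?php
// Hypothetical helper (not part of DocumentService): make sure the
// upload directory exists and can be written to.
function ensureUploadDirectory(string $path): void
{
    // mkdir() with $recursive = true also creates missing parent directories.
    // The second is_dir() check covers the case where another request
    // created the directory between the first check and mkdir().
    if (!is_dir($path) && !mkdir($path, 0755, true) && !is_dir($path)) {
        throw new \RuntimeException("Unable to create upload directory: {$path}");
    }
    if (!is_writable($path)) {
        throw new \RuntimeException("Upload directory is not writable: {$path}");
    }
}

// Calling it twice is harmless: the second call finds the directory in place.
$base = sys_get_temp_dir() . '/docs_demo_' . uniqid();
ensureUploadDirectory($base . '/uploads/documents');
ensureUploadDirectory($base . '/uploads/documents');
```

The 0755 mode is an assumption; choose permissions that match your web server's user and group setup.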
Public Methods
uploadDocument
Uploads and processes a document file, extracting its text content.
public function uploadDocument(array $file): array
$file - PHP $_FILES array entry containing:
name - Original filename
tmp_name - Temporary file path
size - File size in bytes
error - Upload error code
Returns: Array with document metadata:
[
    'id' => 42,
    'filename' => 'abc123_1234567890.pdf',
    'original_name' => 'manual.pdf',
    'file_type' => 'pdf',
    'file_size' => 524288,
    'text' => 'Extracted text content...'
]
Throws:
\RuntimeException if the upload fails, the file type is not allowed, the size limit is exceeded, or a duplicate is detected
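The checks behind these exceptions might look roughly like the sketch below. The function name validateUpload() and the exact message wording are assumptions for illustration; only the conditions themselves (upload error, size limit, allowed extension) come from this page.

```php
<?php
// Hypothetical validator (not the service's actual code). It returns the
// normalized extension when the upload passes every check.
function validateUpload(array $file, array $allowedTypes, int $maxSize): string
{
    // PHP reports upload problems through the 'error' key of the $_FILES entry.
    $error = $file['error'] ?? UPLOAD_ERR_NO_FILE;
    if ($error !== UPLOAD_ERR_OK) {
        throw new \RuntimeException("Upload failed with error code {$error}");
    }
    if ($file['size'] > $maxSize) {
        throw new \RuntimeException('File size exceeds the configured maximum');
    }
    // Compare extensions case-insensitively so "Manual.PDF" matches 'pdf'.
    $extension = strtolower(pathinfo($file['name'], PATHINFO_EXTENSION));
    if (!in_array($extension, $allowedTypes, true)) {
        throw new \RuntimeException("File type not allowed: {$extension}");
    }
    return $extension;
}

$extension = validateUpload(
    ['name' => 'Manual.PDF', 'size' => 524288, 'error' => UPLOAD_ERR_OK],
    ['pdf', 'txt', 'docx'],
    10 * 1024 * 1024
);
```

In a real request handler you would typically also confirm that tmp_name passes is_uploaded_file() before moving it; that check is left out here so the sketch can run outside an HTTP upload.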
$documentService = new DocumentService(
    $db,
    __DIR__ . '/uploads/documents',
    ['pdf', 'txt', 'docx', 'doc', 'md'],
    10 * 1024 * 1024
);
try {
    $result = $documentService->uploadDocument($_FILES['document']);
    echo "Document uploaded successfully!\n";
    echo "ID: {$result['id']}\n";
    echo "Original name: {$result['original_name']}\n";
    echo "Text extracted: " . strlen($result['text']) . " chars\n";
    // Now chunk and embed the document
    $chunks = $vectorService->chunkText($result['text'], 500);
    foreach ($chunks as $chunk) {
        $vectorService->addChunk($result['id'], $chunk);
    }
    $documentService->updateChunkCount($result['id'], count($chunks));
} catch (\RuntimeException $e) {
    echo "Upload failed: " . $e->getMessage();
}
Features:
- Duplicate Detection: Uses MD5 file hash to prevent duplicate uploads
- Text Extraction: Automatically extracts text using TextProcessor
- Atomic Operations: Cleans up file on error
- Unique Filenames: Generates unique names using uniqid() + timestamp
The service stores both the original filename (for display) and a unique system filename (for storage).
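The duplicate check and filename scheme listed above can be sketched as follows. Both function names are hypothetical, and the $knownHashes array stands in for a lookup against the file_hash column; the service's real implementation is not shown in this page.

```php
<?php
// Hypothetical helper: build a storage filename from uniqid() and a
// Unix timestamp, keeping the original extension in lowercase.
function makeStorageFilename(string $originalName): string
{
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    $suffix = $extension !== '' ? '.' . $extension : '';
    return uniqid() . '_' . time() . $suffix;
}

// Hypothetical duplicate check: hashes the file contents, so the result
// does not depend on the file's name.
function isDuplicate(string $path, array $knownHashes): bool
{
    $hash = md5_file($path);
    if ($hash === false) {
        throw new \RuntimeException("Unable to hash file: {$path}");
    }
    return in_array($hash, $knownHashes, true);
}

// Demo with a temporary file standing in for an upload.
$upload = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($upload, 'hello world');
$alreadyStored = [md5('hello world')];
var_dump(isDuplicate($upload, $alreadyStored));
echo makeStorageFilename('manual.pdf'), "\n";
unlink($upload);
```

MD5 is adequate for spotting accidental re-uploads, which is the use described here, but it is not collision-resistant against deliberately crafted files.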
getDocument
Retrieves a document’s metadata by ID.
public function getDocument(int $id): ?array
Returns: Document record or null if not found
$document = $documentService->getDocument(42);
if ($document) {
    echo "Name: {$document['original_name']}\n";
    echo "Type: {$document['file_type']}\n";
    echo "Size: {$document['file_size']} bytes\n";
    echo "Chunks: {$document['chunk_count']}\n";
    echo "Uploaded: {$document['created_at']}\n";
}
getAllDocuments
Retrieves all documents, ordered by creation date (newest first).
public function getAllDocuments(int $limit = 100): array
$limit - Maximum number of documents to retrieve
Returns: Array of document records
$documents = $documentService->getAllDocuments(50);
foreach ($documents as $doc) {
    // Escape user-supplied values before writing them into HTML
    $name = htmlspecialchars($doc['original_name'], ENT_QUOTES, 'UTF-8');
    $type = htmlspecialchars($doc['file_type'], ENT_QUOTES, 'UTF-8');
    echo "<tr>";
    echo "<td>{$doc['id']}</td>";
    echo "<td>{$name}</td>";
    echo "<td>{$type}</td>";
    echo "<td>" . number_format($doc['file_size'] / 1024, 2) . " KB</td>";
    echo "<td>{$doc['chunk_count']}</td>";
    echo "<td>{$doc['created_at']}</td>";
    echo "</tr>";
}
deleteDocument
Deletes a document and its physical file.
public function deleteDocument(int $id): bool
Returns: Boolean indicating success
Throws: \RuntimeException if document not found
Behavior:
- Removes physical file from disk
- Deletes database record
- Handles missing files gracefully
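"Handles missing files gracefully" presumably means that a file already absent from disk does not block deletion of the database record. A minimal sketch of that idea, using a hypothetical helper named removeStoredFile():

```php
<?php
// Hypothetical helper (not the service's actual code): remove a stored
// file, treating "already gone" as success.
function removeStoredFile(string $path): bool
{
    if (!file_exists($path)) {
        // Nothing to remove, so the caller can go on to delete the record.
        return true;
    }
    return unlink($path);
}

// Demo: the second call succeeds even though the file no longer exists.
$stored = tempnam(sys_get_temp_dir(), 'doc');
$firstAttempt = removeStoredFile($stored);
$secondAttempt = removeStoredFile($stored);
```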
Cascade deletion required: This method deletes only the document record and its physical file. You must separately delete the associated chunks from the vector database.
// deleteDocument() expects an int, and $_POST values arrive as strings
$documentId = (int) $_POST['document_id'];
try {
    // 1. Delete vector embeddings first
    $vectorService->deleteDocumentChunks($documentId);
    // 2. Delete document and file
    $documentService->deleteDocument($documentId);
    echo json_encode(['success' => true]);
} catch (\RuntimeException $e) {
    echo json_encode([
        'success' => false,
        'error' => $e->getMessage()
    ]);
}
updateChunkCount
Updates the number of chunks created from this document.
public function updateChunkCount(int $documentId, int $count): bool
$documentId - ID of the document to update
$count - Number of text chunks created
Returns: Boolean indicating success
// After chunking and embedding a document
$chunks = $vectorService->chunkText($document['text'], 500);
foreach ($chunks as $chunk) {
    $vectorService->addChunk($documentId, $chunk);
}
// Update the document's chunk count
$documentService->updateChunkCount($documentId, count($chunks));
getDocumentStats
Retrieves aggregate statistics about all documents.
public function getDocumentStats(): array
Returns: Array with statistics:
$stats = $documentService->getDocumentStats();
// Returns:
// [
//     'total' => 47,
//     'total_size' => 15728640, // bytes
//     'by_type' => [
//         'pdf' => 23,
//         'txt' => 12,
//         'docx' => 8,
//         'md' => 4
//     ]
// ]
echo "Total documents: {$stats['total']}\n";
echo "Total size: " . number_format($stats['total_size'] / 1024 / 1024, 2) . " MB\n";
// A type with no documents may be missing from by_type, so default to 0
echo "PDFs: " . ($stats['by_type']['pdf'] ?? 0) . "\n";
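The page does not say how the service computes these figures; it may well use SQL aggregates. To make the shape of the result concrete, here is an in-memory sketch that builds the same structure from an array of document rows. The function name summarizeDocuments() is hypothetical.

```php
<?php
// Hypothetical aggregation over rows shaped like the document records
// above. It is a stand-in for whatever query the service really runs.
function summarizeDocuments(array $documents): array
{
    $stats = ['total' => 0, 'total_size' => 0, 'by_type' => []];
    foreach ($documents as $doc) {
        $stats['total']++;
        $stats['total_size'] += (int) $doc['file_size'];
        // Count documents per file extension.
        $type = $doc['file_type'];
        $stats['by_type'][$type] = ($stats['by_type'][$type] ?? 0) + 1;
    }
    return $stats;
}

$summary = summarizeDocuments([
    ['file_type' => 'pdf', 'file_size' => 2048],
    ['file_type' => 'pdf', 'file_size' => 1024],
    ['file_type' => 'txt', 'file_size' => 512],
]);
print_r($summary);
```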
Database Schema
documents Table
| Column | Type | Description |
|---|---|---|
| id | INT | Primary key |
| filename | VARCHAR | System filename (unique) |
| original_name | VARCHAR | User’s original filename |
| file_type | VARCHAR | File extension |
| content_text | LONGTEXT | Extracted text content |
| file_size | INT | File size in bytes |
| file_hash | VARCHAR(32) | MD5 hash for duplicate detection |
| chunk_count | INT | Number of vector chunks |
| created_at | TIMESTAMP | Upload time |
Supported File Types
The TextProcessor utility supports:
- PDF (.pdf) - Extracted using pdftotext or similar
- Text (.txt) - Raw text files
- Markdown (.md) - Markdown files
- Word (.docx, .doc) - Microsoft Word documents
- Rich Text (.rtf) - Rich Text Format
File type support depends on your TextProcessor implementation. Check src/Utils/TextProcessor.php for available extractors.
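Because the real TextProcessor API is not shown here, the following is only an illustration of dispatching extraction by extension. The function extractTextSketch() is hypothetical, and it covers only the plain-text formats, which need no external tools.

```php
<?php
// Hypothetical dispatch by file extension. This is NOT the real
// TextProcessor API, only a sketch of the idea.
function extractTextSketch(string $path, string $extension): string
{
    switch (strtolower($extension)) {
        case 'txt':
        case 'md':
            // Plain-text formats can be read directly.
            $text = file_get_contents($path);
            if ($text === false) {
                throw new \RuntimeException("Unable to read file: {$path}");
            }
            return $text;
        default:
            // PDF, Word, and RTF need dedicated extractors, omitted here.
            throw new \RuntimeException("No extractor available for: {$extension}");
    }
}

// Demo with a temporary Markdown file.
$sample = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($sample, "# Title\n\nBody text.");
$extracted = extractTextSketch($sample, 'md');
unlink($sample);
```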
Complete Upload Workflow
if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_FILES['document'])) {
    $documentService = new DocumentService(
        $db,
        __DIR__ . '/uploads/documents',
        ['pdf', 'txt', 'docx', 'doc', 'md'],
        10 * 1024 * 1024
    );
    try {
        // 1. Upload and extract text
        $document = $documentService->uploadDocument($_FILES['document']);
        // 2. Chunk the text
        $vectorService = new VectorSearchService($db, $openai);
        $chunks = $vectorService->chunkText($document['text'], 500);
        // 3. Generate and store embeddings
        foreach ($chunks as $index => $chunk) {
            $vectorService->addChunk(
                $document['id'],
                $chunk,
                $index
            );
        }
        // 4. Update chunk count
        $documentService->updateChunkCount($document['id'], count($chunks));
        echo json_encode([
            'success' => true,
            'document_id' => $document['id'],
            'chunks_created' => count($chunks)
        ]);
    } catch (\RuntimeException $e) {
        http_response_code(400);
        echo json_encode([
            'success' => false,
            'error' => $e->getMessage()
        ]);
    }
}
Error Handling
Common Exceptions
try {
    $result = $documentService->uploadDocument($_FILES['document']);
} catch (\RuntimeException $e) {
    $message = $e->getMessage();
    if (strpos($message, 'File size exceeds') !== false) {
        // File too large
        echo "Please upload a smaller file (max 10MB)";
    } elseif (strpos($message, 'File type not allowed') !== false) {
        // Invalid file type
        echo "Please upload PDF, DOCX, or TXT files only";
    } elseif (strpos($message, 'Documento duplicado') !== false) {
        // Duplicate file ("Documento duplicado" is Spanish for "duplicate document";
        // the string must match the service's message exactly)
        echo "This document has already been uploaded";
    } else {
        // Other error
        echo "Upload failed: {$message}";
    }
}
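If the same mapping is needed in more than one place, the branches above can be collapsed into a lookup table. This is a sketch only: friendlyUploadError() is a hypothetical helper, and it still relies on matching message fragments, so it will break if the service's wording changes.

```php
<?php
// Hypothetical helper: translate exception message fragments into
// user-facing text. The fragments are the ones matched in the example above.
function friendlyUploadError(string $message): string
{
    $map = [
        'File size exceeds'     => 'Please upload a smaller file (max 10MB)',
        'File type not allowed' => 'Please upload PDF, DOCX, or TXT files only',
        'Documento duplicado'   => 'This document has already been uploaded',
    ];
    foreach ($map as $fragment => $friendly) {
        if (strpos($message, $fragment) !== false) {
            return $friendly;
        }
    }
    // Fall back to the raw message for anything unrecognized.
    return "Upload failed: {$message}";
}

echo friendlyUploadError('Documento duplicado'), "\n";
```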
Best Practices
Chunk immediately after upload to make the document searchable. Store the chunk count for monitoring.
Delete cascade: When deleting a document, always delete its vector embeddings first to prevent orphaned data.
Duplicate detection uses file content hash, not filename. The same file with a different name will be rejected.
Source Code
Location: src/Services/DocumentService.php:1-147