Overview
The Knowledge Base feature enables users to upload documents that are automatically parsed, chunked, and vectorized for Retrieval-Augmented Generation (RAG). The system uses PostgreSQL with the pgvector extension for vector storage and supports streaming responses via Server-Sent Events (SSE) for real-time AI interactions. All vectorization operations are processed asynchronously using Redis Streams to handle large documents efficiently.
Supported Document Formats
- PDF Documents: Adobe PDF files with a text layer
- Word Documents: Microsoft Word (DOCX, DOC)
- Text Files: Plain text (TXT) and Markdown (MD)
- Max File Size: Up to 50MB per document
Upload and Vectorization Workflow
The knowledge base follows an asynchronous processing pipeline:

Upload Document
Users upload a document with optional metadata.

Form Parameters:
- file: Document file (required)
- name: Custom name (optional, defaults to filename)
- category: Classification tag (optional, e.g., "Java", "System Design")
Duplicate Detection
The system calculates a SHA-256 hash to prevent duplicate uploads. If a duplicate is detected, the existing knowledge base entry is returned immediately.
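The dedup check reduces to hashing the raw file bytes and looking the digest up before inserting a new entry. A minimal sketch using the JDK's MessageDigest (the class and method names here are illustrative, not the actual service code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class FileHasher {
    // Computes the SHA-256 hex digest of the raw file bytes.
    // Identical bytes always produce the same digest, so a unique
    // index on this value is enough to detect duplicate uploads.
    static String sha256(byte[] fileBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(fileBytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```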
Vectorization Task
A task is sent to a Redis Stream for async processing. The API returns immediately with status PENDING.
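A Redis Stream entry is a map of fields. The field names and stream key below are assumptions for illustration only; the document does not specify the actual task schema:

```java
import java.util.Map;

class VectorizeTask {
    // Builds the field map appended to the Redis Stream.
    // Field names are illustrative -- the real task schema is not
    // documented here.
    static Map<String, String> build(long knowledgeBaseId, String storagePath) {
        return Map.of(
            "knowledgeBaseId", String.valueOf(knowledgeBaseId),
            "storagePath", storagePath,
            "status", "PENDING");
    }
}
// With Spring Data Redis, the record could be enqueued roughly like:
//   redisTemplate.opsForStream()
//       .add(MapRecord.create("kb:vectorize", VectorizeTask.build(42L, "/kb/42.pdf")));
```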
Vectorization Status Flow
Status Definitions
- PENDING: Task queued; vectorization has not started yet
- PROCESSING: Chunks are being embedded and stored
- COMPLETED: All chunks are vectorized and searchable
- FAILED: Vectorization aborted; see the vectorError field
Document Chunking Strategy
Large documents are split into smaller chunks for effective embedding.

Chunking Method
TokenTextSplitter from Spring AI splits text based on token count rather than character count for accurate embedding.
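TokenTextSplitter counts tokens with the model's tokenizer. As a dependency-free illustration of the idea only (whitespace-separated words stand in for real tokens here):

```java
import java.util.ArrayList;
import java.util.List;

class SimpleChunker {
    // Splits text into chunks of at most maxTokens "tokens".
    // Whitespace words approximate tokens for illustration;
    // TokenTextSplitter uses the embedding model's tokenizer instead.
    static List<String> split(String text, int maxTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String word : text.trim().split("\\s+")) {
            if (count == maxTokens) {      // current chunk is full
                chunks.add(current.toString());
                current.setLength(0);
                count = 0;
            }
            if (count > 0) current.append(' ');
            current.append(word);
            count++;
        }
        if (count > 0) chunks.add(current.toString());
        return chunks;
    }
}
```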
Chunk Metadata
Each chunk stores:
- Original document ID
- Chunk index
- Document metadata (name, category)
- Embedding vector
Chunk count is tracked in KnowledgeBaseEntity.chunkCount for statistics and debugging.

Category Management
Organize knowledge bases with categories:

List All Categories
Filter by Category
Update Category
RAG Query Flow
The system uses Retrieval-Augmented Generation to answer questions based on uploaded documents.

Query Rewriting (Optional)
If enabled, the question is rewritten for better retrieval:
Why Query Rewriting?
User questions are often:
- Too vague ("tell me about Redis")
- Full of typos or colloquialisms
- Missing key technical terms

Rewriting helps to:
- Add relevant technical keywords
- Clarify ambiguous terms
- Optimize the query for vector similarity search
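The rewriting step itself is just another chat-model call. The actual prompt is not shown in this document; a hedged sketch of how the instruction might be framed:

```java
class QueryRewritePrompt {
    // Wraps the raw user question in an instruction asking the model to
    // emit a retrieval-friendly query. Prompt wording is illustrative.
    static String build(String question) {
        return """
            Rewrite the following question as a concise search query.
            Add relevant technical keywords, fix typos, and clarify
            ambiguous terms. Return only the rewritten query.

            Question: %s""".formatted(question);
    }
}
```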
Dynamic Search Parameters
Search parameters adapt based on query length:
- Short Query (≤4 characters): topK 20, minScore 0.18
- Medium Query (5-12 characters): topK 12, minScore 0.28
- Long Query (>12 characters): topK 8, minScore 0.28
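The three tiers above reduce to a small lookup on query length (class and method names here are illustrative): short queries cast a wide, lenient net, while longer queries can afford fewer, stricter matches.

```java
class SearchParams {
    final int topK;
    final double minScore;

    SearchParams(int topK, double minScore) {
        this.topK = topK;
        this.minScore = minScore;
    }

    // Maps query length (in characters) to the retrieval settings
    // documented above.
    static SearchParams forQuery(String query) {
        int len = query.length();
        if (len <= 4)  return new SearchParams(20, 0.18); // short
        if (len <= 12) return new SearchParams(12, 0.28); // medium
        return new SearchParams(8, 0.28);                 // long
    }
}
```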
Vector Similarity Search
The system performs vector search across the selected knowledge bases using pgvector's cosine similarity.
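Cosine similarity is the dot product of two embedding vectors divided by the product of their magnitudes; pgvector exposes it through the `<=>` operator (which returns cosine *distance*, i.e. 1 − similarity). For reference:

```java
class CosineSimilarity {
    // Cosine similarity of two embedding vectors: dot(a, b) / (|a| * |b|).
    // Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
    static double of(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```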
Effective Hit Validation
For short queries, the system validates that retrieved chunks actually contain the search term:
This prevents the AI from generating vague “information not found” responses when vector similarity produces false positives.
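One way to implement the validation, assuming retrieved chunks are available as plain strings (the actual method names are not given in this document):

```java
import java.util.List;

class HitValidator {
    // For short queries, keeps only chunks that literally contain the
    // search term (case-insensitive). This guards against vector search
    // returning semantically "close" chunks that never mention the term.
    static List<String> filterEffectiveHits(String query, List<String> chunks) {
        String needle = query.toLowerCase();
        return chunks.stream()
                .filter(chunk -> chunk.toLowerCase().contains(needle))
                .toList();
    }
}
```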
AI Response Generation
The context and question are sent to the AI model:
- System Prompt: Instructs the AI to answer based only on the provided context
- User Prompt: Template with context and question variables
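Neither prompt's exact text is shown in this document; a hedged sketch of the two-part assembly, with placeholder wording:

```java
class RagPrompt {
    // Illustrative system prompt -- the real wording is not documented here.
    static final String SYSTEM = """
        Answer using only the provided context.
        If the context does not contain the answer, say so.""";

    // Fills the user-prompt template with the retrieved context and
    // the (possibly rewritten) question.
    static String user(String context, String question) {
        return """
            Context:
            %s

            Question: %s""".formatted(context, question);
    }
}
```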
Streaming SSE Responses
For real-time, typewriter-style responses, use the streaming endpoint.

SSE Response Format
- Client Implementation
- Stream Probing
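An SSE stream arrives as text lines, with each token chunk carried in a `data:` field. A minimal, dependency-free sketch of extracting the payloads (a real client, such as the browser's EventSource, also handles `event:`, `id:`, and `retry` fields):

```java
import java.util.ArrayList;
import java.util.List;

class SseParser {
    // Extracts data payloads from raw SSE lines. Payload lines start
    // with "data:"; blank lines delimit events and carry no payload.
    static List<String> dataLines(List<String> rawLines) {
        List<String> out = new ArrayList<>();
        for (String line : rawLines) {
            if (line.startsWith("data:")) {
                out.add(line.substring(5).stripLeading());
            }
        }
        return out;
    }
}
```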
Listing Knowledge Bases
Retrieve all uploaded knowledge bases:
- sortBy: Sort field (uploadedAt, name, questionCount)
- vectorStatus: Filter by status (PENDING, PROCESSING, COMPLETED, FAILED)
Searching Knowledge Bases
Search by filename or content. The search matches:
- Knowledge base name
- Original filename
- Category tags
Downloading Documents
Retrieve the original uploaded file.

Statistics Dashboard
Get aggregated statistics.

Manual Re-vectorization
If vectorization fails, users can retry. This endpoint is rate-limited to 2 requests per IP to prevent abuse.
Deleting Knowledge Bases
Remove a knowledge base and all associated vectors:
- Deletes the entity from the database
- Removes all vector embeddings from pgvector
- Does not delete the original file from storage (for audit purposes)
Rate Limiting
Protection mechanisms:
- Upload: 3 uploads per window
- Query: 10 queries per window
- Streaming: 5 streams per window
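The per-IP limits above suggest a fixed-window counter. A minimal in-memory sketch (a real deployment would typically keep the counters in Redis so limits hold across instances; the window length and storage here are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

class FixedWindowRateLimiter {
    private final int limit;
    private final long windowMillis;
    // ip -> {windowStart, count}
    private final Map<String, long[]> state = new HashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Returns true if a request from this IP is allowed in the current window.
    synchronized boolean tryAcquire(String ip, long nowMillis) {
        long[] s = state.get(ip);
        if (s == null || nowMillis - s[0] >= windowMillis) {
            state.put(ip, new long[]{nowMillis, 1}); // start a fresh window
            return true;
        }
        if (s[1] < limit) {
            s[1]++;
            return true;
        }
        return false; // over the limit for this window
    }
}
```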
Error Handling
Vectorization Failed
Status: FAILED

Common Causes:
- Document too large for the embedding model
- Invalid UTF-8 encoding
- AI API rate limit or timeout
- Database connection failure

Solution: Check the vectorError field for details and use manual re-vectorization.

No Results Found
Response: Standard "no information found" message

Causes:
- Question topic not covered in uploaded documents
- Query rewriting produced poor keywords
- Vector similarity threshold too strict
Solutions:
- Rephrase the question with more specific terms
- Adjust the minScore parameters (requires a config change)
- Upload more relevant documents
File Parse Failed
Error: 无法从文件中提取文本内容 ("Unable to extract text content from the file")

Causes:
- Scanned PDF without OCR
- Corrupted or encrypted file
- Unsupported document structure
Best Practices
Optimize Document Structure
Use clear headings and sections. Well-structured documents chunk better and retrieve more accurately.
Use Descriptive Names
Name knowledge bases descriptively (e.g., “Spring Boot 3.0 Official Guide” vs. “doc.pdf”).
Organize with Categories
Assign categories consistently to enable filtered searches and multi-KB queries.
Monitor Chunk Count
If chunkCount is very low (< 5), the document may be too short or poorly parsed.

Poll Vectorization Status
Implement polling (every 3-5 seconds) after upload:
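A polling loop can be sketched as follows; the status supplier stands in for a GET against the knowledge-base status endpoint, and the method names are assumptions:

```java
import java.util.function.Supplier;

class StatusPoller {
    // Polls the status supplier until vectorization reaches a terminal
    // state (COMPLETED or FAILED) or maxAttempts is exhausted.
    // In practice sleepMillis would be 3000-5000 and the supplier would
    // wrap an HTTP call to the status endpoint (names assumed).
    static String waitForVectorization(Supplier<String> status,
                                       int maxAttempts, long sleepMillis) {
        String current = "PENDING";
        for (int i = 0; i < maxAttempts; i++) {
            current = status.get();
            if (current.equals("COMPLETED") || current.equals("FAILED")) {
                return current;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return current;
            }
        }
        return current; // still PENDING/PROCESSING -- treat as a timeout
    }
}
```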
Handle Streaming Errors
Always implement onerror handlers for SSE connections and provide fallback UI.