Overview
The Knowledge Base feature enables users to upload documents that are automatically parsed, chunked, and vectorized for Retrieval-Augmented Generation (RAG). The system uses PostgreSQL with the pgvector extension for vector storage and supports streaming responses via Server-Sent Events (SSE) for real-time AI interactions. All vectorization operations are processed asynchronously using Redis Streams to handle large documents efficiently.
Supported Document Formats
- PDF Documents: Adobe PDF files with a text layer
- Word Documents: Microsoft Word (DOCX, DOC)
- Text Files: Plain text (TXT) and Markdown (MD)
- Max File Size: Up to 50MB per document
Upload and Vectorization Workflow
The knowledge base follows an asynchronous processing pipeline:

Upload Document
Users upload a document with optional metadata.

Form Parameters:
- file: Document file (required)
- name: Custom name (optional, defaults to filename)
- category: Classification tag (optional, e.g., "Java", "System Design")
Duplicate Detection
The system calculates a SHA-256 hash to prevent duplicate uploads. If a duplicate is detected, the existing knowledge base entry is returned immediately.
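The dedup check reduces to hashing the raw file bytes and looking the digest up before inserting a new entry. A minimal sketch using the JDK's MessageDigest (the class and method names here are illustrative, not the actual service code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class FileHasher {
    // Computes the SHA-256 hex digest of the raw file bytes.
    // Identical bytes always produce the same digest, so a unique
    // index on this value is enough to detect duplicate uploads.
    static String sha256(byte[] fileBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(fileBytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```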
Vectorization Task
A task is sent to a Redis Stream for async processing. The API returns immediately with status PENDING.
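A Redis Stream entry is a map of fields. The field names and stream key below are assumptions for illustration only; the document does not specify the actual task schema:

```java
import java.util.Map;

class VectorizeTask {
    // Builds the field map appended to the Redis Stream.
    // Field names are illustrative -- the real task schema is not
    // documented here.
    static Map<String, String> build(long knowledgeBaseId, String storagePath) {
        return Map.of(
            "knowledgeBaseId", String.valueOf(knowledgeBaseId),
            "storagePath", storagePath,
            "status", "PENDING");
    }
}
// With Spring Data Redis, the record could be enqueued roughly like:
//   redisTemplate.opsForStream()
//       .add(MapRecord.create("kb:vectorize", VectorizeTask.build(42L, "/kb/42.pdf")));
```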
Vectorization Status Flow
Status Definitions
- PENDING: Task queued; vectorization has not started yet
- PROCESSING: Chunks are being embedded and stored
- COMPLETED: All chunks are vectorized and searchable
- FAILED: Vectorization aborted; see the vectorError field
Document Chunking Strategy
Large documents are split into smaller chunks for effective embedding.

Chunking Method
TokenTextSplitter from Spring AI splits text based on token count rather than character count for accurate embedding.
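TokenTextSplitter counts tokens with the model's tokenizer. As a dependency-free illustration of the idea only (whitespace-separated words stand in for real tokens here):

```java
import java.util.ArrayList;
import java.util.List;

class SimpleChunker {
    // Splits text into chunks of at most maxTokens "tokens".
    // Whitespace words approximate tokens for illustration;
    // TokenTextSplitter uses the embedding model's tokenizer instead.
    static List<String> split(String text, int maxTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String word : text.trim().split("\\s+")) {
            if (count == maxTokens) {      // current chunk is full
                chunks.add(current.toString());
                current.setLength(0);
                count = 0;
            }
            if (count > 0) current.append(' ');
            current.append(word);
            count++;
        }
        if (count > 0) chunks.add(current.toString());
        return chunks;
    }
}
```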
Chunk Metadata
Each chunk stores:
- Original document ID
- Chunk index
- Document metadata (name, category)
- Embedding vector
Chunk count is tracked in KnowledgeBaseEntity.chunkCount for statistics and debugging.

Category Management
Organize knowledge bases with categories:

List All Categories
Filter by Category
Update Category
RAG Query Flow
The system uses Retrieval-Augmented Generation to answer questions based on uploaded documents.

Query Rewriting (Optional)
If enabled, the question is rewritten for better retrieval:
Why Query Rewriting?
User questions are often:
- Too vague ("tell me about Redis")
- Full of typos or colloquialisms
- Missing key technical terms

Rewriting helps to:
- Add relevant technical keywords
- Clarify ambiguous terms
- Optimize the query for vector similarity search
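The rewriting step itself is just another chat-model call. The actual prompt is not shown in this document; a hedged sketch of how the instruction might be framed:

```java
class QueryRewritePrompt {
    // Wraps the raw user question in an instruction asking the model to
    // emit a retrieval-friendly query. Prompt wording is illustrative.
    static String build(String question) {
        return """
            Rewrite the following question as a concise search query.
            Add relevant technical keywords, fix typos, and clarify
            ambiguous terms. Return only the rewritten query.

            Question: %s""".formatted(question);
    }
}
```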
Dynamic Search Parameters
Search parameters adapt based on query length:
- Short Query (≤4 characters): topK 20, minScore 0.18
- Medium Query (5-12 characters): topK 12, minScore 0.28
- Long Query (>12 characters): topK 8, minScore 0.28
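The three tiers above reduce to a small lookup on query length (class and method names here are illustrative): short queries cast a wide, lenient net, while longer queries can afford fewer, stricter matches.

```java
class SearchParams {
    final int topK;
    final double minScore;

    SearchParams(int topK, double minScore) {
        this.topK = topK;
        this.minScore = minScore;
    }

    // Maps query length (in characters) to the retrieval settings
    // documented above.
    static SearchParams forQuery(String query) {
        int len = query.length();
        if (len <= 4)  return new SearchParams(20, 0.18); // short
        if (len <= 12) return new SearchParams(12, 0.28); // medium
        return new SearchParams(8, 0.28);                 // long
    }
}
```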
Vector Similarity Search
The system performs vector search across the selected knowledge bases using pgvector's cosine similarity.
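Cosine similarity is the dot product of two embedding vectors divided by the product of their magnitudes; pgvector exposes it through the `<=>` operator (which returns cosine *distance*, i.e. 1 − similarity). For reference:

```java
class CosineSimilarity {
    // Cosine similarity of two embedding vectors: dot(a, b) / (|a| * |b|).
    // Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
    static double of(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```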
Effective Hit Validation
For short queries, the system validates that retrieved chunks actually contain the search term:
This prevents the AI from generating vague “information not found” responses when vector similarity produces false positives.
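One way to implement the validation, assuming retrieved chunks are available as plain strings (the actual method names are not given in this document):

```java
import java.util.List;

class HitValidator {
    // For short queries, keeps only chunks that literally contain the
    // search term (case-insensitive). This guards against vector search
    // returning semantically "close" chunks that never mention the term.
    static List<String> filterEffectiveHits(String query, List<String> chunks) {
        String needle = query.toLowerCase();
        return chunks.stream()
                .filter(chunk -> chunk.toLowerCase().contains(needle))
                .toList();
    }
}
```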
AI Response Generation
The context and question are sent to the AI model:
- System Prompt: Instructs the AI to answer based only on the provided context
- User Prompt: Template with context and question variables
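Neither prompt's exact text is shown in this document; a hedged sketch of the two-part assembly, with placeholder wording:

```java
class RagPrompt {
    // Illustrative system prompt -- the real wording is not documented here.
    static final String SYSTEM = """
        Answer using only the provided context.
        If the context does not contain the answer, say so.""";

    // Fills the user-prompt template with the retrieved context and
    // the (possibly rewritten) question.
    static String user(String context, String question) {
        return """
            Context:
            %s

            Question: %s""".formatted(context, question);
    }
}
```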
Streaming SSE Responses
For real-time, typewriter-style responses, use the streaming endpoint.

SSE Response Format
- Client Implementation
- Stream Probing
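An SSE stream arrives as text lines, with each token chunk carried in a `data:` field. A minimal, dependency-free sketch of extracting the payloads (a real client, such as the browser's EventSource, also handles `event:`, `id:`, and `retry` fields):

```java
import java.util.ArrayList;
import java.util.List;

class SseParser {
    // Extracts data payloads from raw SSE lines. Payload lines start
    // with "data:"; blank lines delimit events and carry no payload.
    static List<String> dataLines(List<String> rawLines) {
        List<String> out = new ArrayList<>();
        for (String line : rawLines) {
            if (line.startsWith("data:")) {
                out.add(line.substring(5).stripLeading());
            }
        }
        return out;
    }
}
```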
Listing Knowledge Bases
Retrieve all uploaded knowledge bases:
- sortBy: Sort field (uploadedAt, name, questionCount)
- vectorStatus: Filter by status (PENDING, PROCESSING, COMPLETED, FAILED)
Searching Knowledge Bases
Search by filename or content. The search matches:
- Knowledge base name
- Original filename
- Category tags
Downloading Documents
Retrieve the original uploaded file.

Statistics Dashboard
Get aggregated statistics.

Manual Re-vectorization
If vectorization fails, users can retry. This endpoint is rate-limited to 2 requests per IP to prevent abuse.
Deleting Knowledge Bases
Remove a knowledge base and all associated vectors:
- Deletes the entity from the database
- Removes all vector embeddings from pgvector
- Does not delete the original file from storage (for audit purposes)
Rate Limiting
Protection mechanisms:
- Upload: 3 uploads per window
- Query: 10 queries per window
- Streaming: 5 streams per window
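The per-IP limits above suggest a fixed-window counter. A minimal in-memory sketch (a real deployment would typically keep the counters in Redis so limits hold across instances; the window length and storage here are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

class FixedWindowRateLimiter {
    private final int limit;
    private final long windowMillis;
    // ip -> {windowStart, count}
    private final Map<String, long[]> state = new HashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Returns true if a request from this IP is allowed in the current window.
    synchronized boolean tryAcquire(String ip, long nowMillis) {
        long[] s = state.get(ip);
        if (s == null || nowMillis - s[0] >= windowMillis) {
            state.put(ip, new long[]{nowMillis, 1}); // start a fresh window
            return true;
        }
        if (s[1] < limit) {
            s[1]++;
            return true;
        }
        return false; // over the limit for this window
    }
}
```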
Error Handling
Vectorization Failed
Status: FAILED

Common Causes:
- Document too large for the embedding model
- Invalid UTF-8 encoding
- AI API rate limit or timeout
- Database connection failure

Solution: Check the vectorError field for details and use manual re-vectorization.

No Results Found
Response: Standard "no information found" message

Causes:
- Question topic not covered in uploaded documents
- Query rewriting produced poor keywords
- Vector similarity threshold too strict
Solutions:
- Rephrase the question with more specific terms
- Adjust the minScore parameters (requires a config change)
- Upload more relevant documents
File Parse Failed
Error: 无法从文件中提取文本内容 ("Unable to extract text content from the file")

Causes:
- Scanned PDF without OCR
- Corrupted or encrypted file
- Unsupported document structure
Best Practices
Optimize Document Structure
Use clear headings and sections. Well-structured documents chunk better and retrieve more accurately.
Use Descriptive Names
Name knowledge bases descriptively (e.g., “Spring Boot 3.0 Official Guide” vs. “doc.pdf”).
Organize with Categories
Assign categories consistently to enable filtered searches and multi-KB queries.
Monitor Chunk Count
If chunkCount is very low (< 5), the document may be too short or poorly parsed.

Poll Vectorization Status
Implement polling (every 3-5 seconds) after upload:
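A polling loop can be sketched as follows; the status supplier stands in for a GET against the knowledge-base status endpoint, and the method names are assumptions:

```java
import java.util.function.Supplier;

class StatusPoller {
    // Polls the status supplier until vectorization reaches a terminal
    // state (COMPLETED or FAILED) or maxAttempts is exhausted.
    // In practice sleepMillis would be 3000-5000 and the supplier would
    // wrap an HTTP call to the status endpoint (names assumed).
    static String waitForVectorization(Supplier<String> status,
                                       int maxAttempts, long sleepMillis) {
        String current = "PENDING";
        for (int i = 0; i < maxAttempts; i++) {
            current = status.get();
            if (current.equals("COMPLETED") || current.equals("FAILED")) {
                return current;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return current;
            }
        }
        return current; // still PENDING/PROCESSING -- treat as a timeout
    }
}
```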
Handle Streaming Errors
Always implement onerror handlers for SSE connections and provide fallback UI.