Overview
Query one or more knowledge bases using RAG (Retrieval-Augmented Generation). The system performs semantic vector search to find relevant document chunks, then uses an LLM to generate a contextual answer based on the retrieved information.

Both endpoints require knowledge bases to have vectorStatus: COMPLETED. Ensure vectorization is finished before querying.

RAG Query Workflow
- Semantic Search - Your question is embedded and compared against stored document vectors
- Relevance Ranking - Most relevant chunks are retrieved based on cosine similarity
- Context Assembly - Retrieved chunks are formatted as context for the LLM
- Answer Generation - LLM generates a natural language answer grounded in the retrieved content
- Response Delivery - Answer is returned either as complete JSON or streamed via SSE
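The retrieval and context-assembly steps above can be sketched as follows. This is a minimal illustration with toy vectors, not the server's implementation: the embedding model and the LLM call are external services, and all names here (`retrieve`, `build_prompt`, the chunk dict shape) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, top_k=3):
    """Steps 1-2: rank stored chunk vectors by cosine similarity."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(question, retrieved):
    """Step 3: format the retrieved chunks as context for the LLM."""
    context = "\n\n".join(c["text"] for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Steps 4-5 (answer generation and delivery) happen server-side; the two endpoints below differ only in how the answer is delivered.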
Standard Query (Non-Streaming)
Endpoint: POST /api/knowledgebase/query
Rate Limit: 10 requests per time window (Global + IP-based)
Returns the complete answer in a single JSON response after processing is finished.
Request
- Array of knowledge base IDs to query. Supports querying multiple knowledge bases simultaneously for broader context.
  Type: integer[]
  Example: [1, 2, 5]
  Validation: At least one ID is required
- The question to answer based on the knowledge base content.
  Example: "What are the system requirements for deployment?"
  Validation: Cannot be blank
Response
- Response status code. 200 indicates success.
- Response message. "success" on a successful query.
- Query result containing the generated answer.
Examples
- cURL
- JavaScript (Fetch)
- Python
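A Python sketch of the standard query using only the standard library. The field names (`knowledgeBaseIds`, `question`) and the localhost base URL are assumptions; confirm them against the Request section above and your deployment.

```python
import json
import urllib.request

# Hypothetical field names -- check the Request section for the exact schema.
payload = {
    "knowledgeBaseIds": [1, 2, 5],
    "question": "What are the system requirements for deployment?",
}

req = urllib.request.Request(
    "http://localhost:8080/api/knowledgebase/query",  # adjust host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment against a live server; the full answer arrives in one JSON body:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```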
Response Example
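Based on the response fields documented above, the body is expected to look roughly like this. The exact field names (`code`, `message`, `data`) are assumptions, and the answer text is illustrative:

```json
{
  "code": 200,
  "message": "success",
  "data": {
    "answer": "..."
  }
}
```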
Streaming Query (SSE)
Endpoint: POST /api/knowledgebase/query/stream
Rate Limit: 5 requests per time window (Global + IP-based)
Content-Type: text/event-stream
Returns the answer incrementally as Server-Sent Events (SSE), enabling real-time streaming of the LLM response for better user experience.
Request
Request body is identical to the standard query:
- Array of knowledge base IDs to query.
- The question to answer.
Response
The response is a stream of Server-Sent Events; each event contains a chunk of the generated answer.

Event Format:

SSE Format Notes:
- Newlines in the answer are escaped as \n to maintain SSE protocol compatibility
- Carriage returns are escaped as \r
- Each event is followed by two newlines
- The stream ends when answer generation is complete
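A client can reassemble the answer by undoing the escaping described in the notes above. This sketch assumes each event carries its chunk in a standard SSE `data:` field; the example event strings are illustrative:

```python
def parse_sse_stream(lines):
    """Reassemble answer text from SSE 'data:' lines.

    Assumes chunks arrive in `data:` fields with newlines escaped as \\n
    and carriage returns as \\r, per the format notes above.
    """
    answer = []
    for line in lines:
        if line.startswith("data:"):
            chunk = line[len("data:"):]
            if chunk.startswith(" "):
                chunk = chunk[1:]  # SSE strips one optional space after the colon
            # Undo the SSE-compatibility escaping.
            chunk = chunk.replace("\\n", "\n").replace("\\r", "\r")
            answer.append(chunk)
    return "".join(answer)

# Illustrative events; each is followed by a blank line (two newlines on the wire).
events = ["data: The system requires\\n", "", "data: 8 GB of RAM.", ""]
print(parse_sse_stream(events))
```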
Examples
- cURL
- JavaScript (EventSource)
- Python (requests)
- Python (sseclient)
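A standard-library Python sketch of the streaming call. As with the non-streaming example, the field names and localhost URL are assumptions to adjust for your deployment:

```python
import json
import urllib.request

# Hypothetical field names -- see the Request section for the exact schema.
payload = {"knowledgeBaseIds": [1], "question": "Summarize the deployment guide."}

req = urllib.request.Request(
    "http://localhost:8080/api/knowledgebase/query/stream",  # adjust host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
    method="POST",
)

# Against a live server, read the stream line by line as events arrive:
# with urllib.request.urlopen(req) as resp:
#     for raw in resp:
#         line = raw.decode("utf-8").rstrip("\n")
#         if line.startswith("data:"):
#             print(line[5:].lstrip(" ").replace("\\n", "\n"), end="", flush=True)
```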
Stream Response Example
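Following the format notes above, the raw stream should look roughly like this (illustrative only; actual chunk boundaries depend on the LLM):

```
data: The system requires\n

data: 8 GB of RAM and Docker 20.10 or later.
```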
Error Responses
Validation Errors
Returned when the request body fails validation, for example an empty knowledge base ID array or a blank question.
Knowledge Base Not Found
Returned when one of the specified knowledge base IDs does not exist.
Vectorization Not Complete
Returned when a specified knowledge base does not yet have vectorStatus: COMPLETED.
Rate Limit Exceeded
Returned when the per-window limit is exceeded (10 requests for /query, 5 for /query/stream).
Multi-Knowledge Base Querying
Both endpoints support querying multiple knowledge bases simultaneously:
- Search across multiple documents in a single query
- Combine information from different sources
- Improve answer quality with broader context
- All specified knowledge bases must have vectorStatus: COMPLETED
- Retrieval considers chunks from all knowledge bases
- Response may synthesize information from multiple sources
Best Practices
When to Use Streaming
Use streaming (/query/stream) when:
- Building real-time chat interfaces
- Answers are expected to be long
- User experience benefits from incremental display
- You need to show progress during generation
Use standard query (/query) when:
- Building APIs or batch processing
- You need the complete answer before proceeding
- Simpler client-side implementation is preferred
- Logging or storing complete responses
Question Quality
For best results:
- Be specific and clear in your questions
- Include relevant context or keywords
- Ask one question at a time
- Refer to concepts likely present in the documents
See Also
- Upload Knowledge Base - Upload and vectorize documents
- List Knowledge Bases - View available knowledge bases
- RAG Chat Sessions - Persistent conversation context over knowledge bases
