Overview
The Knowledge Base system enables semantic search over custom documents using vector embeddings. Upload your own research papers, documentation, or datasets to create a private knowledge base that agents can query during literature search.
Use Cases
- Search internal research documentation
- Query proprietary datasets and protocols
- Retrieve information from uploaded papers
- Cross-reference findings with custom knowledge
Architecture
Components
- Document Processor - Extracts text from PDF, DOCX, Markdown
- Text Chunker - Splits documents into searchable chunks
- Embedding Provider - Generates vector embeddings (Voyage AI, OpenAI, Cohere)
- pgvector - PostgreSQL extension for vector similarity search
- Cohere Reranker - Two-stage retrieval for precision
Configuration
Environment Variables
.env
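The original listing is not preserved here; a minimal sketch of the relevant variables, assuming the names beyond SIMILARITY_THRESHOLD, RERANKER_SCORE_THRESHOLD, and EMBEDDING_DIMENSIONS (which are referenced elsewhere on this page) are illustrative:

```bash
# Embedding provider: voyage (recommended), openai, or cohere -- names are illustrative
EMBEDDING_PROVIDER=voyage
VOYAGE_API_KEY=your-voyage-api-key
# Must match the chosen embedding model (see Best Practices)
EMBEDDING_DIMENSIONS=1024
# Reranking via Cohere (Stage 2)
COHERE_API_KEY=your-cohere-api-key
# Retrieval thresholds (see Best Practices)
SIMILARITY_THRESHOLD=0.5
RERANKER_SCORE_THRESHOLD=0.3
# PostgreSQL connection with pgvector enabled
DATABASE_URL=postgresql://user:password@localhost:5432/knowledge_base
```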
Embedding Providers
- Voyage AI (Recommended)
  - Highest quality embeddings for scientific text
  - Superior retrieval performance
  - Optimized for long documents
- OpenAI
- Cohere
Database Setup
Enable the pgvector extension in PostgreSQL:
src/embeddings/setup.sql
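The SQL listing itself is not reproduced on this page; a minimal sketch of what setup.sql typically contains, where the table name, column names, and 1024-dimension vector are illustrative assumptions:

```sql
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Illustrative schema: chunked documents with one embedding per chunk
CREATE TABLE IF NOT EXISTS document_chunks (
  id BIGSERIAL PRIMARY KEY,
  document_id TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1024)  -- must match EMBEDDING_DIMENSIONS
);

-- IVFFlat index for cosine similarity (see Performance Tuning for the lists parameter)
CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx
  ON document_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
```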
Document Processing
Supported File Types
| Format | Extensions | Notes |
|---|---|---|
| Markdown | .md | Front-matter support |
| PDF | .pdf | Text extraction only |
| Word | .docx | Requires mammoth |
Document Processor
src/embeddings/documentProcessor.ts
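The processor's source is not shown here; a minimal sketch of the extraction step for the three supported formats, assuming pdf-parse and gray-matter alongside the mammoth dependency mentioned above, with function and field names as illustrations only:

```typescript
import { readFile } from "node:fs/promises";
import path from "node:path";
import pdfParse from "pdf-parse";   // PDF text extraction (assumed dependency)
import mammoth from "mammoth";      // DOCX extraction (mentioned above)
import matter from "gray-matter";   // Markdown front-matter parsing (assumed dependency)

export interface ProcessedDocument {
  text: string;
  metadata: Record<string, unknown>;
}

// Extract plain text (plus front-matter metadata for Markdown) from a supported file.
export async function processDocument(filePath: string): Promise<ProcessedDocument> {
  const ext = path.extname(filePath).toLowerCase();
  const buffer = await readFile(filePath);

  switch (ext) {
    case ".md": {
      const { data, content } = matter(buffer.toString("utf-8"));
      return { text: content, metadata: data };
    }
    case ".pdf": {
      const { text } = await pdfParse(buffer);
      return { text, metadata: {} };
    }
    case ".docx": {
      const { value } = await mammoth.extractRawText({ buffer });
      return { text: value, metadata: {} };
    }
    default:
      throw new Error(`Unsupported file type: ${ext}`);
  }
}
```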
Vector Search Pipeline
Two-Stage Retrieval
The system uses a two-stage approach for optimal precision:
- Vector Search - Fast approximate nearest neighbor search (20 results)
- Reranking - Precise relevance scoring using Cohere (top 5)
src/embeddings/vectorSearch.ts
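The listing is not reproduced here; a minimal sketch of how the two stages might be composed, where vectorSearch and rerank are the illustrative helpers sketched in the Stage 1 and Stage 2 snippets below:

```typescript
export interface SearchResult {
  documentId: string;
  content: string;
  score: number;
}

// Two-stage retrieval: broad vector search, then precise reranking.
export async function searchKnowledgeBase(query: string): Promise<SearchResult[]> {
  // Stage 1: approximate nearest-neighbor search over chunk embeddings (top 20)
  const candidates = await vectorSearch(query, { limit: 20 });
  // Stage 2: rerank candidates with Cohere and keep the top 5
  return rerank(query, candidates, { topN: 5 });
}
```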
Vector Search (Stage 1)
src/embeddings/vectorSearch.ts
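A sketch of the Stage 1 query using pgvector's cosine distance operator, assuming node-postgres, the schema from the setup sketch above, and an illustrative embedQuery helper for the provider call:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Stage 1: approximate nearest-neighbor search with pgvector (cosine distance).
export async function vectorSearch(
  query: string,
  { limit = 20 }: { limit?: number } = {},
): Promise<SearchResult[]> {
  const embedding = await embedQuery(query); // illustrative provider call (Voyage AI / OpenAI / Cohere)
  const threshold = Number(process.env.SIMILARITY_THRESHOLD ?? 0.5);

  // `<=>` is pgvector's cosine distance; similarity = 1 - distance.
  const { rows } = await pool.query(
    `SELECT document_id, content, 1 - (embedding <=> $1::vector) AS similarity
       FROM document_chunks
      WHERE 1 - (embedding <=> $1::vector) >= $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [JSON.stringify(embedding), threshold, limit],
  );

  return rows.map((r) => ({
    documentId: r.document_id,
    content: r.content,
    score: r.similarity,
  }));
}
```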
Reranking (Stage 2)
src/embeddings/vectorSearch.ts
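A sketch of the Stage 2 call using the Cohere SDK's rerank endpoint; the model name and threshold handling are assumptions:

```typescript
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

// Stage 2: rerank vector-search candidates for precision and keep the best few.
export async function rerank(
  query: string,
  candidates: SearchResult[],
  { topN = 5 }: { topN?: number } = {},
): Promise<SearchResult[]> {
  if (candidates.length === 0) return [];

  const response = await cohere.rerank({
    model: "rerank-english-v3.0",   // assumed model choice
    query,
    documents: candidates.map((c) => c.content),
    topN,
  });

  const minScore = Number(process.env.RERANKER_SCORE_THRESHOLD ?? 0.3);
  return response.results
    .filter((r) => r.relevanceScore >= minScore)
    .map((r) => ({ ...candidates[r.index], score: r.relevanceScore }));
}
```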
Integration with Literature Agent
Knowledge Base is automatically queried during literature searches:
src/agents/literature/knowledge.ts
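The integration code is not shown here; a sketch of how the literature agent might fold knowledge-base hits into its results, where queryKnowledgeBase and the LiteratureSource shape are illustrative names:

```typescript
import { searchKnowledgeBase } from "../../embeddings/vectorSearch";

export interface LiteratureSource {
  title: string;
  snippet: string;
  origin: "knowledge_base" | "external";
}

// Run the knowledge base alongside external literature providers.
export async function queryKnowledgeBase(topic: string): Promise<LiteratureSource[]> {
  const hits = await searchKnowledgeBase(topic);
  return hits.map((hit) => ({
    title: hit.documentId,
    snippet: hit.content,
    origin: "knowledge_base",
  }));
}
```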
src/routes/chat.ts
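Likewise, the chat route's wiring is not reproduced; a sketch assuming an Express-style handler and an illustrative endpoint path:

```typescript
import { Router } from "express";
import { searchKnowledgeBase } from "../embeddings/vectorSearch";

export const chatRouter = Router();

// Illustrative chat endpoint: retrieve knowledge-base context before answering.
chatRouter.post("/chat", async (req, res) => {
  const { message } = req.body as { message: string };
  const context = await searchKnowledgeBase(message);
  // ...pass `context` to the agent that generates the reply (omitted here)
  res.json({ contextChunks: context.length });
});
```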
Adding Documents
Via File System
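The original snippet is not preserved; a sketch of indexing a local folder, reusing the processDocument sketch above, where the knowledge_base/ path and the indexDocument helper are illustrative:

```typescript
import { readdir } from "node:fs/promises";
import path from "node:path";
import { processDocument } from "../embeddings/documentProcessor";

// Index every supported file in a local folder.
export async function indexDirectory(dir = "knowledge_base"): Promise<void> {
  const files = await readdir(dir);
  for (const file of files) {
    if (![".md", ".pdf", ".docx"].includes(path.extname(file).toLowerCase())) continue;
    const doc = await processDocument(path.join(dir, file));
    await indexDocument(file, doc); // chunk, embed, and store (illustrative helper)
  }
}
```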
Via API
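A sketch of a programmatic upload, assuming a hypothetical /api/knowledge-base/documents endpoint and Node 18+ globals:

```typescript
import { readFile } from "node:fs/promises";

// Upload a document to a hypothetical knowledge-base endpoint.
const form = new FormData();
form.append("file", new Blob([await readFile("protocol.pdf")]), "protocol.pdf");

const response = await fetch("http://localhost:3000/api/knowledge-base/documents", {
  method: "POST",
  body: form,
});
console.log(await response.json());
```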
Search API
Direct Search
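A usage sketch of the search function from the pipeline sketch above (the import path is illustrative):

```typescript
import { searchKnowledgeBase } from "./embeddings/vectorSearch";

const results = await searchKnowledgeBase("CRISPR off-target effects");
for (const r of results) {
  console.log(`${r.score.toFixed(2)}  ${r.documentId}`);
}
```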
Get Statistics
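The getStats() helper referenced under Troubleshooting is not shown on this page; a sketch of how it might be used, with the returned field names as assumptions:

```typescript
import { getStats } from "./embeddings/vectorSearch";

const stats = await getStats();
// Illustrative shape: { documents: number, chunks: number, lastIndexedAt: string }
console.log(`Indexed ${stats.documents} documents (${stats.chunks} chunks)`);
```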
Performance Tuning
Vector Index Configuration
- lists = 100: Good for < 10,000 documents
- lists = 200: Good for 10,000 - 100,000 documents
- lists = 500: Good for > 100,000 documents
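Following the guidance above, a SQL sketch of rebuilding the IVFFlat index with a larger lists value as the corpus grows (index and table names follow the setup sketch earlier on this page):

```sql
-- Rebuild the index with more lists once the collection outgrows the original setting
DROP INDEX IF EXISTS document_chunks_embedding_idx;
CREATE INDEX document_chunks_embedding_idx
  ON document_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 200);
```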
Caching
Search results are cached for 5 minutes:
src/embeddings/vectorSearch.ts
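The caching code is not reproduced; a minimal in-memory TTL sketch (the 5-minute constant mirrors the description above; the cache shape and names are illustrative):

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
const cache = new Map<string, { expires: number; results: SearchResult[] }>();

// Return cached results when fresh; otherwise run the search and cache it.
export async function cachedSearch(query: string): Promise<SearchResult[]> {
  const hit = cache.get(query);
  if (hit && hit.expires > Date.now()) return hit.results;

  const results = await searchKnowledgeBase(query);
  cache.set(query, { expires: Date.now() + CACHE_TTL_MS, results });
  return results;
}
```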
Batch Processing
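A sketch of embedding chunks in batches to stay within provider rate limits; the batch size and the embedBatch helper are assumptions:

```typescript
// Embed chunks in fixed-size batches to respect provider rate limits.
export async function embedInBatches(chunks: string[], batchSize = 64): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    embeddings.push(...(await embedBatch(batch))); // illustrative provider batch call
  }
  return embeddings;
}
```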
Best Practices
Choosing Embedding Models
- Voyage AI: Best for scientific/technical content
- OpenAI: Good general-purpose option
- Cohere: Cost-effective, strong multilingual
Match EMBEDDING_DIMENSIONS to your model!
Document Chunking
- Keep chunks < 1000 tokens for optimal embedding quality
- Preserve context: don’t split mid-sentence
- Use overlapping chunks for continuity
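One way to implement the guidelines above: a sketch of sentence-aware chunking with overlap, where character-based sizes stand in for token counts and all names are illustrative:

```typescript
// Split text into overlapping chunks, breaking on sentence boundaries where possible.
export function chunkText(text: string, maxChars = 3000, overlapChars = 300): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current.trim());
      current = current.slice(-overlapChars); // carry an overlap into the next chunk
    }
    current += sentence + " ";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```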
Similarity Thresholds
- SIMILARITY_THRESHOLD=0.5: Good default (cosine similarity)
- RERANKER_SCORE_THRESHOLD=0.3: Filters low-quality reranked results
- Adjust based on precision/recall needs
Reranking Strategy
- Always use reranking for user-facing queries (better precision)
- Disable reranking for high-throughput batch processing
- Increase vectorLimit if top results are poor
Troubleshooting
No results returned
Possible causes:
- Similarity threshold too high (lower SIMILARITY_THRESHOLD)
- No documents indexed (check getStats())
- Query too specific (broaden search terms)
Poor search quality
Possible causes:
- Weak embedding model
- Reranking disabled
- Insufficient vector search candidates
Slow queries
Possible causes:
- Missing vector index
- Index needs tuning
- Cold cache
Related Resources
- Chat Mode - Query knowledge base via chat
- Deep Research - Use knowledge base in research cycles
- File Upload - Upload documents to knowledge base
- pgvector Docs - PostgreSQL vector extension