How it works
When you upload files, DeepTutor runs them through a multi-stage pipeline:
- Parsing: PDFs are parsed with Docling/MinerU; TXT and Markdown files are read directly.
- Chunking & embedding: Document chunks are embedded and stored in a vector store for semantic search.
- Knowledge graph construction: Entities and relations are extracted and stored in a graph store (rag_storage/).
- Numbered-item extraction: Definitions, theorems, equations, figures, and tables are catalogued in numbered_items.json for precise lookup by the Query Item tool.
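To make the chunking stage more concrete, here is a minimal sketch of one common strategy: fixed-size word windows with overlap. This page does not state how DeepTutor actually splits documents, so the function name `chunk_words` and the `size` and `overlap` parameters are assumptions made purely for illustration.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words.

    Each window shares `overlap` words with the previous one, so a sentence
    that straddles a boundary still appears whole in at least one chunk.
    Hypothetical helper: not DeepTutor's real chunker.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


print(chunk_words("a b c d e f g", size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g']
```

Each resulting chunk would then be passed to the embedding model and written to the vector store.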
Supported file types
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Parsed with Docling/MinerU; images extracted |
| Plain text | .txt | Read directly |
| Markdown | .md | Read directly |
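The table above can be expressed as a small pre-upload check. The sketch below groups candidate files by how they would be handled. The `HANDLERS` mapping and the `plan_ingestion` function are hypothetical names, and treating extensions case-insensitively is an assumption, not documented DeepTutor behavior.

```python
from pathlib import Path

# Extensions come from the supported-formats table; the handler labels
# ("parse", "read") are illustrative, not DeepTutor identifiers.
HANDLERS = {
    ".pdf": "parse",  # parsed with Docling/MinerU
    ".txt": "read",   # read directly
    ".md": "read",    # read directly
}


def plan_ingestion(paths):
    """Group file paths by handler; anything else lands in "unsupported"."""
    plan = {"parse": [], "read": [], "unsupported": []}
    for p in paths:
        handler = HANDLERS.get(Path(p).suffix.lower(), "unsupported")
        plan[handler].append(str(p))
    return plan


print(plan_ingestion(["lecture1.pdf", "notes.md", "slides.docx"]))
# {'parse': ['lecture1.pdf'], 'read': ['notes.md'], 'unsupported': ['slides.docx']}
```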
Creating a knowledge base
Upload your documents
Upload one or more PDF, TXT, or Markdown files. You can upload multiple files at once.
Monitor indexing
Watch the terminal for progress. Indexing time scales with document size and your embedding API throughput.
Knowledge base names are used as directory names on disk. Use alphanumeric characters and underscores — avoid spaces and special characters.
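Because the name becomes a directory name, it can help to validate it before creating the knowledge base. The following is a minimal sketch of the naming rule above. Both helper functions are hypothetical, and limiting "alphanumeric" to ASCII letters and digits is an assumption.

```python
import re

# ASCII letters, digits, and underscores only (assumed reading of the rule).
_VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_safe_kb_name(name: str) -> bool:
    """Return True if the whole name consists of allowed characters."""
    return _VALID_NAME.fullmatch(name) is not None


def suggest_kb_name(name: str) -> str:
    """Replace each run of disallowed characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")


print(is_safe_kb_name("linear_algebra_2024"))  # True
print(is_safe_kb_name("my notes"))             # False
print(suggest_kb_name("My Notes (v2)!"))       # My_Notes_v2
```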
RAG retrieval modes
DeepTutor supports two retrieval strategies. You can configure the default per module in config/main.yaml.
- Hybrid (recommended): Combines dense vector search with graph-based entity retrieval. The knowledge graph links related concepts across your documents, making it well suited for questions that span multiple topics or require structured reasoning.
- Naive
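As a rough illustration of the hybrid idea, the toy sketch below ranks chunks by cosine similarity and then adds chunks reached through entities linked in a small graph. It is not DeepTutor's retrieval code: every name and data structure here (`vector_search`, `hybrid_search`, `entity_chunks`, `graph`) is hypothetical, and the real system may merge and rank results differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]


def hybrid_search(query_vec, query_entities, chunk_vecs, entity_chunks, graph, k=2):
    """Vector hits first, then chunks tied to query entities and their graph neighbours."""
    hits = vector_search(query_vec, chunk_vecs, k)
    entities = set(query_entities)
    for entity in query_entities:
        entities.update(graph.get(entity, ()))  # follow one hop in the graph
    for entity in sorted(entities):
        for cid in entity_chunks.get(entity, ()):
            if cid not in hits:
                hits.append(cid)
    return hits


chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}
entity_chunks = {"eigenvalue": ["c1"], "determinant": ["c3"]}
graph = {"eigenvalue": ["determinant"]}

print(vector_search([1.0, 0.0], chunk_vecs))
# ['c1', 'c2']
print(hybrid_search([1.0, 0.0], ["eigenvalue"], chunk_vecs, entity_chunks, graph))
# ['c1', 'c2', 'c3']
```

In this toy data, chunk c3 is dissimilar to the query vector but is reached through the graph link between the two entities, which is the kind of cross-topic connection the hybrid description refers to.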
Adding documents incrementally
You can add documents to an existing knowledge base without rebuilding it from scratch. Only the new files are processed; the existing knowledge graph is preserved and merged automatically.
Management commands
Data storage
Each knowledge base is stored in its own directory under data/knowledge_bases/.
The metadata.json file records every create and add-documents operation with timestamps.
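To see what is on disk, a short script can walk the storage root and report which artifacts each knowledge base contains. This is a sketch under assumptions: the function `list_knowledge_bases` is hypothetical, and it assumes that metadata.json, numbered_items.json, and rag_storage/ sit directly inside each knowledge base directory, which this page does not confirm.

```python
from pathlib import Path


def list_knowledge_bases(root="data/knowledge_bases"):
    """Map each knowledge base name to the artifacts found in its directory.

    Assumes (not confirmed by the docs) that the three artifacts live at the
    top level of each knowledge base directory.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    report = {}
    for kb_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        report[kb_dir.name] = {
            "metadata.json": (kb_dir / "metadata.json").is_file(),
            "numbered_items.json": (kb_dir / "numbered_items.json").is_file(),
            "rag_storage/": (kb_dir / "rag_storage").is_dir(),
        }
    return report


if __name__ == "__main__":
    for name, artifacts in list_knowledge_bases().items():
        print(name, artifacts)
```

The script only reads the directory listing and does not modify anything.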
Demo knowledge bases
Two pre-built knowledge bases are available for download to help you get started immediately.
Research papers
Five papers from HKUDS lab, including AI-Researcher and LightRAG. Good for testing research-oriented queries.
Data science textbook
Eight chapters, 296 pages of deep representation learning content. Good for testing long-form document retrieval.
Download the demo archives
Download from Google Drive.
How knowledge bases are used across modules
| Module | How it uses the knowledge base |
|---|---|
| Smart Solver | RAG retrieval (hybrid or naive) at each solve step; Query Item for numbered definitions and theorems |
| Question Generator | Background knowledge retrieval to ground generated questions in source material |
| Guided Learning | LocateAgent reads notebook records; InteractiveAgent may pull supporting content |
| Deep Research | RAG retrieval across planning and research phases; combined with web and paper search |