Overview
The prepare_kb.py script is the first step in building the knowledge base. It processes raw JSON files containing question-answer pairs, assigns topics and subtopics through keyword matching, determines difficulty levels, and outputs clean, structured data ready for indexing.
Script Location
source/scripts/prepare_kb.py
What It Does
The preparation script performs these key operations:
- Loads raw data from data/raw/*.json files
- Normalizes text by removing HTML, standardizing whitespace, and handling Unicode
- Assigns topics (DBMS, OOPs, OS) based on filename patterns
- Assigns subtopics using keyword matching rules from config/topic_rules.json
- Determines difficulty (Beginner/Intermediate/Advanced) through heuristic analysis
- Deduplicates questions based on normalized text
- Outputs clean JSON and JSONL files for indexing
Input Requirements
Raw Data Format
Place JSON files in source/data/raw/. Each file should contain a JSON array of objects.
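As a minimal sketch, assuming each object carries `question` and `answer` strings plus an optional `id` (field names inferred from the walkthrough later on this page, not confirmed against the script), a raw file could be sanity-checked before running the preparation step. The helper name `load_raw_file` is hypothetical.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("question", "answer")  # assumed field names


def load_raw_file(path):
    """Load one raw file and return its list of Q&A objects.

    Raises ValueError if the file is not a JSON array of objects
    that each contain the assumed required string fields.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a JSON array at the top level")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"{path}: item {i} is not an object")
        missing = [f for f in REQUIRED_FIELDS if not isinstance(item.get(f), str)]
        if missing:
            raise ValueError(f"{path}: item {i} is missing {missing}")
    return data
```

Failing early on a malformed file keeps bad records from reaching the later topic and subtopic steps.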
Current Raw Files
- database_qna.json - DBMS questions
- oops_qna_simplified.json - Object-oriented programming questions
- os_qna.json - Operating systems questions
Topic Assignment Logic
Topics are determined from the source filename. For example, database_qna.json will have all its questions assigned to the “DBMS” topic.
See source/scripts/prepare_kb.py:67-75
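The filename-to-topic step can be sketched as follows. This is an illustration rather than the code at the lines cited above: the keyword table is assembled from the filename keywords listed under Troubleshooting, the function name `topic_from_filename` is hypothetical, and matching on whole tokens (rather than substrings) is an assumption made here so that "os" does not match inside unrelated words.

```python
import re
from pathlib import Path

# Assumed keyword-to-topic table; the real mapping lives in prepare_kb.py.
TOPIC_KEYWORDS = {
    "database": "DBMS",
    "dbms": "DBMS",
    "oops": "OOPs",
    "os": "OS",
}


def topic_from_filename(path):
    """Return the topic implied by a raw file's name, or "uncategorized"."""
    stem = Path(path).stem.lower()
    # Split on non-alphanumerics so "os" only matches as a whole token.
    tokens = re.split(r"[^a-z0-9]+", stem)
    for token in tokens:
        if token in TOPIC_KEYWORDS:
            return TOPIC_KEYWORDS[token]
    return "uncategorized"
```

With this table, the three current raw files map to DBMS, OOPs, and OS respectively, and any other filename falls through to "uncategorized".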
Subtopic Assignment
Subtopics are assigned through keyword matching with configurable rules from config/topic_rules.json.
How It Works
source/scripts/prepare_kb.py:104-134
Keyword Matching Process
- Combines question and answer text
- Tokenizes into words (alphanumeric + some special chars)
- For each rule matching the topic, counts keyword matches
- Calculates coverage score (matches / total keywords)
- Selects subtopic with highest coverage
- Requires minimum 25% coverage threshold
- Falls back to default subtopic if threshold not met
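The steps above can be sketched in a few lines. Several details are assumptions for illustration: the rule shape (dicts with `topic`, `subtopic`, and `keywords` keys), the default subtopic name "General", the exact token pattern, and the choice to match multi-word keywords against the raw text while single words match against the token set. The actual structure of config/topic_rules.json may differ.

```python
import re

MIN_COVERAGE = 0.25  # minimum coverage threshold stated above


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    # Keeps letters, digits, and a few symbols such as "+" and "#".
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def assign_subtopic(question, answer, topic, rules, default="General"):
    """Pick the subtopic whose keyword list is best covered by the text."""
    text = f"{question} {answer}".lower()
    tokens = tokenize(text)
    best_subtopic, best_score = default, 0.0
    for rule in rules:
        if rule["topic"] != topic or not rule["keywords"]:
            continue
        matches = sum(
            1
            for kw in rule["keywords"]
            # Single words match against tokens, phrases against the raw text.
            if (kw in text if " " in kw else kw in tokens)
        )
        score = matches / len(rule["keywords"])  # coverage
        if score > best_score:
            best_subtopic, best_score = rule["subtopic"], score
    return best_subtopic if best_score >= MIN_COVERAGE else default
```

Because coverage divides by the rule's total keyword count, rules with very long keyword lists need proportionally more hits to clear the threshold, which is worth keeping in mind when editing the rules file.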
Example
For a question about “B+ tree indexing in databases”:
- Keywords matched: ["index", "indexing", "b+ tree", "b tree"]
- Rule matched: DBMS → Indexing
- Coverage: 4/8 = 0.5 (50%)
- Assigned subtopic: Indexing
Subtopic Refinement
Special disambiguation logic handles overlapping concepts:
source/scripts/prepare_kb.py:137-152
This ensures “deadlock” questions go to the right topic-specific subtopic, and “memory” questions are properly categorized.
Difficulty Determination
Difficulty levels are assigned using keyword heuristics:
source/scripts/prepare_kb.py:155-173
Difficulty Criteria
- Advanced: Contains 2+ advanced technical terms (MVCC, CAP theorem, B+ tree, etc.)
- Beginner: Contains 2+ beginner indicators (“what is”, “define”, “basic”, etc.)
- Intermediate: Default for everything else
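A minimal sketch of these criteria follows. The term lists contain only the examples named above, the two-argument signature is an assumption, and so is the precedence (Advanced checked before Beginner); the real difficulty_heuristic() may differ on all three points.

```python
# Illustrative term lists containing only the examples named in this section.
ADVANCED_TERMS = ("mvcc", "cap theorem", "b+ tree")
BEGINNER_TERMS = ("what is", "define", "basic")


def difficulty_heuristic(question, answer=""):
    """Classify a Q&A pair as Beginner, Intermediate, or Advanced."""
    text = f"{question} {answer}".lower()
    advanced_hits = sum(term in text for term in ADVANCED_TERMS)
    beginner_hits = sum(term in text for term in BEGINNER_TERMS)
    # Advanced is checked first, so it wins if both thresholds are met.
    if advanced_hits >= 2:
        return "Advanced"
    if beginner_hits >= 2:
        return "Beginner"
    return "Intermediate"
```

Note that a question containing only one beginner indicator, such as a lone "what is", stays at Intermediate under the 2+ rule.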
Text Normalization
All text goes through normalization to ensure consistency:
source/scripts/prepare_kb.py:26-33
HTML Handling
source/scripts/prepare_kb.py:21-24
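One way to implement the normalization described in this section, using only the standard library, is sketched below. The function names `strip_html` and `normalize_text`, the regex-based tag removal, and the choice of NFKC as the Unicode form are assumptions; the script may use a different approach.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Drop HTML tags and decode entities such as &amp; or &nbsp;."""
    # Tags become spaces so that "a<br>b" does not collapse into "ab".
    return html.unescape(TAG_RE.sub(" ", text))


def normalize_text(text):
    """Remove HTML, apply Unicode NFKC normalization, collapse whitespace."""
    text = strip_html(text)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("<p>What is   normalization&nbsp;in DBMS?</p>")` returns `"What is normalization in DBMS?"`.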
Output Files
The script generates two output files in source/data/processed/:
1. kb_clean.json
Structured JSON array with all metadata.
2. kb_chunks.jsonl
Newline-delimited JSON for embedding generation.
source/scripts/prepare_kb.py:217-241
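To make the difference between the two formats concrete, here is a hedged sketch of a writer. The record fields (`id`, `question`, `answer`, `topic`, `subtopic`, `difficulty`) follow the metadata described on this page, but the chunk layout, in particular the combined `text` field, and the function name `write_outputs` are assumptions.

```python
import json
from pathlib import Path


def write_outputs(records, out_dir):
    """Write kb_clean.json (one array) and kb_chunks.jsonl (one object per line)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    clean_path = out_dir / "kb_clean.json"
    clean_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    chunks_path = out_dir / "kb_chunks.jsonl"
    with chunks_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            chunk = {
                "id": rec["id"],
                # Assumed chunk text: question and answer joined for embedding.
                "text": f"{rec['question']}\n{rec['answer']}",
                "topic": rec["topic"],
                "subtopic": rec["subtopic"],
                "difficulty": rec["difficulty"],
            }
            fh.write(json.dumps(chunk, ensure_ascii=False) + "\n")
    return clean_path, chunks_path
```

JSONL suits the embedding step because each line can be read and processed independently, without loading the whole file.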
Running the Script
Prerequisites
Execution
Expected Output
Deduplication
The script prevents duplicate questions:
source/scripts/prepare_kb.py:181-195
Duplicates are identified by normalized question+answer pairs (case-insensitive).
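The idea can be sketched with a set of seen keys. This assumes records are dicts whose `question` and `answer` fields have already been normalized, and that the first occurrence is the one kept; the function name `deduplicate` is hypothetical.

```python
def deduplicate(records):
    """Keep the first record for each case-insensitive (question, answer) pair."""
    seen = set()
    unique = []
    for rec in records:
        # Assumes question and answer were already normalized upstream.
        key = (rec["question"].casefold(), rec["answer"].casefold())
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique
```

Because the key includes the answer, two records with the same question but different answers are both kept.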
Code Walkthrough Example
Let’s trace how a single question is processed:
Input
Processing Steps
- Load: Read from database_qna.json
- Topic: Filename contains “database” → topic = "DBMS"
- Normalize question: "What is normalization in DBMS?" (HTML removed)
- Normalize answer: "Normalization is organizing data to reduce redundancy using normal forms like 1NF, 2NF, 3NF."
- Tokenize: {"what", "is", "normalization", "in", "dbms", "organizing", "data", "reduce", "redundancy", "normal", "forms", "1nf", "2nf", "3nf"}
- Match keywords: Rule for “Normalization” has keywords: ["normalization", "normal form", "1nf", "2nf", "3nf", ...]
- Calculate coverage: 5 matches / 10 keywords = 0.5 (50%)
- Assign subtopic: subtopic = "Normalization"
- Check difficulty: Contains “normal” + “1nf”, “2nf”, “3nf” but not enough advanced terms → difficulty = "Intermediate"
- Generate ID: Keep existing "42"
Output
Troubleshooting
Issue: Questions assigned to wrong subtopic
Solution: Update keyword rules in config/topic_rules.json to include more specific keywords for that subtopic.
Issue: Too many “uncategorized” topics
Solution: Ensure raw JSON filenames contain recognizable keywords (“database”, “dbms”, “oops”, “os”).
Issue: All questions marked as “Intermediate”
Solution: Adjust the difficulty heuristics in the difficulty_heuristic() function or add more domain-specific keywords.
Issue: Duplicate questions in output
Solution: This shouldn’t happen due to deduplication. If it does, check whether the questions differ in whitespace or formatting in a way that bypasses normalization.
Next Steps
After running prepare_kb.py, proceed to:
- FAISS Indexing - Build vector search index
- Query the system - Use the RAG query interface
Related Configuration
- Adding Topics - Extend to new knowledge domains
- config/taxonomy.json - Topic hierarchy structure
- config/topic_rules.json - Keyword matching rules