Overview

The prepare_kb.py script is the first step in building the knowledge base. It processes raw JSON files containing question-answer pairs, assigns topics and subtopics through keyword matching, determines difficulty levels, and outputs clean, structured data ready for indexing.

Script Location

source/scripts/prepare_kb.py

What It Does

The preparation script performs these key operations:
  1. Loads raw data from data/raw/*.json files
  2. Normalizes text by removing HTML, standardizing whitespace, and handling Unicode
  3. Assigns topics (DBMS, OOPs, OS) based on filename patterns
  4. Assigns subtopics using keyword matching rules from config/topic_rules.json
  5. Determines difficulty (Beginner/Intermediate/Advanced) through heuristic analysis
  6. Deduplicates questions based on normalized text
  7. Outputs clean JSON and JSONL files for indexing

Input Requirements

Raw Data Format

Place JSON files in source/data/raw/. Each file should contain an array of objects:
[
  {
    "id": 1,
    "question": "What is a database?",
    "answer": "A database is an organized collection of structured data..."
  },
  {
    "id": 2,
    "question": "What is normalization?",
    "answer": "Normalization is the process of organizing data..."
  }
]
Note: The script also handles nested JSON where the array is a value within an object.

Current Raw Files

  • database_qna.json - DBMS questions
  • oops_qna_simplified.json - Object-oriented programming questions
  • os_qna.json - Operating systems questions

Topic Assignment Logic

Topics are determined from the source filename:
def topic_from_filename(fname: str) -> str:
    fname = fname.lower()
    if "database" in fname or "dbms" in fname:
        return "DBMS"
    if "oops" in fname:
        return "OOPs"
    if "os" in fname:
        return "OS"
    return "uncategorized"
Example: A file named database_qna.json will have all its questions assigned to the “DBMS” topic. See source/scripts/prepare_kb.py:67-75
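A quick standalone check of the filename matching (the function body is repeated from the snippet above so the example runs on its own):

```python
def topic_from_filename(fname: str) -> str:
    # Same logic as in prepare_kb.py, repeated for a self-contained demo.
    fname = fname.lower()
    if "database" in fname or "dbms" in fname:
        return "DBMS"
    if "oops" in fname:
        return "OOPs"
    if "os" in fname:
        return "OS"
    return "uncategorized"

print(topic_from_filename("os_qna.json"))      # OS
print(topic_from_filename("misc_notes.json"))  # uncategorized
```

Note that the checks are substring matches, so a file that matches none of the patterns silently falls into "uncategorized" (see the Troubleshooting section).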

Subtopic Assignment

Subtopics are assigned through keyword matching with configurable rules from config/topic_rules.json.

How It Works

def assign_subtopic(q: str, a: str, topic: str, rules: List[Dict[str, Any]]) -> str:
    text = f"{q} {a}".lower()
    tokens = tokenize(text)

    best_subtopic = None
    best_score = 0.0

    for rule in rules:
        if rule["topic"] != topic:
            continue

        keywords = rule["keywords"]
        matched = sum(1 for kw in keywords if kw in text or kw in tokens)
        coverage = matched / len(keywords)

        if coverage > best_score:
            best_score = coverage
            best_subtopic = rule["subtopic"]

    # confidence threshold
    if best_score >= 0.25:
        return best_subtopic

    # fallback to default subtopic
    fallback = {
        "DBMS": "DBMS Architecture",
        "OOPs": "Classes",
        "OS": "Processes"
    }
    # Default avoids returning None for topics outside the fallback map
    # (e.g. "uncategorized").
    return fallback.get(topic, "General")
See source/scripts/prepare_kb.py:104-134

Keyword Matching Process

  1. Combines question and answer text
  2. Tokenizes into words (alphanumeric + some special chars)
  3. For each rule matching the topic, counts keyword matches
  4. Calculates coverage score (matches / total keywords)
  5. Selects subtopic with highest coverage
  6. Requires minimum 25% coverage threshold
  7. Falls back to default subtopic if threshold not met
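The `tokenize` helper referenced in the snippet is not shown in this section. A minimal sketch consistent with "alphanumeric + some special chars" could look like this (the exact pattern in prepare_kb.py may differ):

```python
import re

def tokenize(text: str) -> set:
    # Lowercase tokens of letters, digits, and '+', so terms like
    # "b+" and "1nf" survive as single tokens.
    return set(re.findall(r"[a-z0-9+]+", text.lower()))

print(sorted(tokenize("B+ tree indexing in DBMS")))
# ['b+', 'dbms', 'in', 'indexing', 'tree']
```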

Example

For a question about “B+ tree indexing in databases”:
  • Keywords matched: ["index", "indexing", "b+ tree", "b tree"]
  • Rule matched: DBMS → Indexing
  • Coverage: 4/8 = 0.5 (50%)
  • Assigned subtopic: Indexing

Subtopic Refinement

Special disambiguation logic handles overlapping concepts:
def refine_subtopic(q: str, a: str, topic: str, subtopic: str) -> str:
    text = f"{q} {a}".lower()

    if "deadlock" in text:
        if topic == "DBMS" and any(x in text for x in ["transaction", "lock"]):
            return "Deadlocks"
        if topic == "OS" and any(x in text for x in ["process", "resource"]):
            return "Deadlocks"

    if "memory" in text:
        if topic == "OOPs":
            return "Memory Management in OOP"
        if topic == "OS":
            return "Memory Management"

    return subtopic
See source/scripts/prepare_kb.py:137-152. This ensures that "deadlock" questions land in the correct topic-specific subtopic and that "memory" questions are categorized per topic.

Difficulty Determination

Difficulty levels are assigned using keyword heuristics:
def difficulty_heuristic(q: str, a: str) -> str:
    text = f"{q} {a}".lower()

    advanced = [
        "mvcc", "2pl", "serializable", "cap theorem", "wal",
        "b+ tree", "page replacement", "banker's algorithm",
        "vtable", "raii"
    ]

    beginner = [
        "what is", "define", "basic", "class", "object",
        "process", "thread", "primary key"
    ]

    if sum(1 for t in advanced if t in text) >= 2:
        return "Advanced"
    if sum(1 for t in beginner if t in text) >= 2:
        return "Beginner"
    return "Intermediate"
See source/scripts/prepare_kb.py:155-173

Difficulty Criteria

  • Advanced: Contains 2+ advanced technical terms (MVCC, CAP theorem, B+ tree, etc.)
  • Beginner: Contains 2+ beginner indicators (“what is”, “define”, “basic”, etc.)
  • Intermediate: Default for everything else
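The counting and threshold logic can be exercised in isolation; the advanced list here is abbreviated from the snippet above:

```python
advanced = ["mvcc", "2pl", "serializable", "cap theorem", "wal"]
text = "explain how mvcc and 2pl interact under serializable isolation".lower()

# Substring counting, as in difficulty_heuristic().
hits = sum(1 for term in advanced if term in text)
print(hits)  # 3 matches -> at least 2, so the heuristic returns "Advanced"
```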

Text Normalization

All text goes through normalization to ensure consistency:
def normalize_text(s: str) -> str:
    s = unidecode(s)  # Convert Unicode to ASCII
    s = strip_html(s)  # Remove HTML tags
    s = s.replace("\xa0", " ")  # Replace non-breaking spaces
    s = unicodedata.normalize("NFKC", s)  # Normalize Unicode
    s = re.sub(r"[ \t]+", " ", s)  # Collapse whitespace
    s = re.sub(r"\n\s*\n\s*", "\n\n", s)  # Normalize line breaks
    return s.strip()
See source/scripts/prepare_kb.py:26-33
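The whitespace and Unicode steps can be exercised with the standard library alone; this sketch intentionally omits the `unidecode` and HTML-stripping steps from the full function:

```python
import re
import unicodedata

def normalize_ws(s: str) -> str:
    # Stdlib-only subset of normalize_text: NBSP replacement, NFKC
    # normalization, and whitespace collapsing.
    s = s.replace("\xa0", " ")
    s = unicodedata.normalize("NFKC", s)
    s = re.sub(r"[ \t]+", " ", s)
    s = re.sub(r"\n\s*\n\s*", "\n\n", s)
    return s.strip()

print(normalize_ws("What\xa0is   a\tdatabase?"))  # What is a database?
```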

HTML Handling

def strip_html(text: str) -> str:
    text = re.sub(r"<\s*br\s*/?>", "\n", text, flags=re.I)  # <br> → newline
    text = re.sub(r"<[^>]+>", " ", text)  # Remove all other tags
    return text
See source/scripts/prepare_kb.py:21-24

Output Files

The script generates two output files in source/data/processed/:

1. kb_clean.json

Structured JSON array with all metadata:
[
  {
    "id": "1",
    "question": "What is a database?",
    "answer": "A database is an organized collection...",
    "topic": "DBMS",
    "subtopic": "DBMS Architecture",
    "difficulty": "Beginner",
    "source": "database_qna.json"
  }
]

2. kb_chunks.jsonl

Newline-delimited JSON for embedding generation:
{"id": "1", "text": "Q: What is a database?\nA: A database is...", "metadata": {"topic": "DBMS", "subtopic": "DBMS Architecture", "difficulty": "Beginner", "source": "database_qna.json"}}
{"id": "2", "text": "Q: What is normalization?\nA: Normalization is...", "metadata": {"topic": "DBMS", "subtopic": "Normalization", "difficulty": "Intermediate", "source": "database_qna.json"}}
See source/scripts/prepare_kb.py:217-241

Running the Script

Prerequisites

pip install unidecode

Execution

cd source/scripts
python prepare_kb.py

Expected Output

🔄 Preparing KB with strict topic control...
✅ KB preparation complete
Clean JSON : data/processed/kb_clean.json
Chunks     : data/processed/kb_chunks.jsonl

Deduplication

The script prevents duplicate questions:
seen = set()

for obj in raw_items:
    q = normalize_text(obj.get("question", ""))
    a = normalize_text(obj.get("answer", ""))

    if not q or not a:
        continue

    key = (q.lower(), a.lower())
    if key in seen:
        continue
    seen.add(key)
See source/scripts/prepare_kb.py:181-195. Duplicates are identified by normalized question+answer pairs (case-insensitive).
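In isolation, the case-insensitive key behaves like this:

```python
pairs = [
    ("What is a database?", "An organized collection of data."),
    ("WHAT IS A DATABASE?", "An organized collection of data."),  # duplicate
    ("What is normalization?", "Organizing data to reduce redundancy."),
]

seen = set()
unique = []
for q, a in pairs:
    key = (q.lower(), a.lower())
    if key in seen:
        continue  # the second "database" variant is skipped here
    seen.add(key)
    unique.append(q)

print(len(unique))  # 2
```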

Code Walkthrough Example

Let’s trace how a single question is processed:

Input

{
  "id": 42,
  "question": "What is <b>normalization</b> in DBMS?",
  "answer": "Normalization is organizing data to reduce redundancy using normal forms like 1NF, 2NF, 3NF."
}

Processing Steps

  1. Load: Read from database_qna.json
  2. Topic: Filename contains “database” → topic = "DBMS"
  3. Normalize question: "What is normalization in DBMS?" (HTML removed)
  4. Normalize answer: "Normalization is organizing data to reduce redundancy using normal forms like 1NF, 2NF, 3NF."
  5. Tokenize: {"what", "is", "normalization", "in", "dbms", "organizing", "data", "reduce", "redundancy", "normal", "forms", "1nf", "2nf", "3nf"}
  6. Match keywords: Rule for “Normalization” has keywords: ["normalization", "normal form", "1nf", "2nf", "3nf", ...]
  7. Calculate coverage: 5 matches / 10 keywords = 0.5 (50%)
  8. Assign subtopic: subtopic = "Normalization"
  9. Check difficulty: Only one beginner indicator matches (“what is”), and no advanced terms match; neither list reaches the 2-match threshold → difficulty = "Intermediate"
  10. Generate ID: Keep existing "42"

Output

{
  "id": "42",
  "question": "What is normalization in DBMS?",
  "answer": "Normalization is organizing data to reduce redundancy using normal forms like 1NF, 2NF, 3NF.",
  "topic": "DBMS",
  "subtopic": "Normalization",
  "difficulty": "Intermediate",
  "source": "database_qna.json"
}

Troubleshooting

Issue: Questions assigned to wrong subtopic

Solution: Update keyword rules in config/topic_rules.json to include more specific keywords for that subtopic.

Issue: Too many “uncategorized” topics

Solution: Ensure raw JSON filenames contain recognizable keywords (“database”, “dbms”, “oops”, “os”).

Issue: All questions marked as “Intermediate”

Solution: Adjust the difficulty heuristics in the difficulty_heuristic() function or add more domain-specific keywords.

Issue: Duplicate questions in output

Solution: This shouldn’t happen due to deduplication. If it does, check if questions have different whitespace/formatting that bypasses normalization.

Next Steps

After running prepare_kb.py, proceed to:
  1. FAISS Indexing - Build the vector search index
  2. Query the system - Use the RAG query interface

Related references:
  • Adding Topics - Extend to new knowledge domains
  • config/taxonomy.json - Topic hierarchy structure
  • config/topic_rules.json - Keyword matching rules
