Overview
The CS Interview Assistant uses a vector-based knowledge base powered by FAISS (Facebook AI Similarity Search) to retrieve relevant technical interview questions. The knowledge base covers three main computer science topics:
- DBMS - Database Management Systems
- OOPs - Object-Oriented Programming
- OS - Operating Systems
If you’re running the project for the first time or the technical interview section is empty, you need to build the knowledge base index.
Knowledge Base System
The indexing system consists of two main scripts:
- prepare_kb.py - Processes raw JSON data and creates a clean, normalized dataset
- mistral_faiss.py (or reindex_mistral.py) - Builds the FAISS vector index from the processed data
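Data Flow
data/raw/*.json → prepare_kb.py → data/processed/kb_clean.json → mistral_faiss.py (or reindex_mistral.py) → data/processed/faiss_mistral/ (index.faiss, metas.json)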
What Gets Created
After indexing, the following files are generated in data/processed/faiss_mistral/:
- index.faiss - The FAISS vector index containing embeddings
- metas.json - Metadata for each indexed question (topic, difficulty, source, etc.)
Step 1: Prepare Raw Data
Add raw data files
Place your raw JSON files containing interview questions in the data/raw/ directory. Each JSON file should contain an array of question objects or a dictionary with a list value.
Run the preparation script
From the root be directory with your virtual environment activated, run prepare_kb.py. If kb_clean.json already exists, you can skip this step unless you’ve added new data.
Data Processing Steps
The prepare_kb.py script performs the following operations:
- Load JSON files from the data/raw/ directory
- Topic detection from filename (database → DBMS, oops → OOPs, os → OS)
- Text normalization:
  - Strip HTML tags
  - Normalize Unicode characters
  - Remove extra whitespace
  - Convert special characters
- Generate stable IDs using SHA-1 hashing
- Apply topic rules from config/topic_rules.json
- Output clean JSON to data/processed/kb_clean.json
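The processing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the actual prepare_kb.py: the record fields question and answer, the output keys, the helper names, the substring-based topic match, and the exact string that is hashed are all assumptions, and the rules from config/topic_rules.json are left out.

```python
import hashlib
import html
import json
import re
import unicodedata
from pathlib import Path

# Filename keyword -> topic, taken from the list above.
# Plain substring matching is an assumption about how detection works.
TOPIC_KEYWORDS = [("database", "DBMS"), ("oops", "OOPs"), ("os", "OS")]


def detect_topic(filename):
    """Guess the topic from the file name; 'unknown' is a hypothetical fallback."""
    name = Path(filename).stem.lower()
    for keyword, topic in TOPIC_KEYWORDS:
        if keyword in name:
            return topic
    return "unknown"


def normalize_text(text):
    """Strip HTML tags, convert entities, normalize Unicode, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def stable_id(topic, question):
    """First 12 hex characters of a SHA-1 digest (the hash input is an assumption)."""
    digest = hashlib.sha1(f"{topic}|{question}".encode("utf-8")).hexdigest()
    return digest[:12]


def load_records(path):
    """Accept either a JSON array or a dictionary holding a list value."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        for value in data.values():
            if isinstance(value, list):
                return value
    return []


def prepare(raw_dir, out_path):
    """Clean every *.json file in raw_dir and write the combined result."""
    cleaned = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        topic = detect_topic(path.name)
        for rec in load_records(path):
            if not isinstance(rec, dict):
                continue
            question = normalize_text(str(rec.get("question", "")))
            if not question:
                continue  # skip records without a usable question
            cleaned.append({
                "id": stable_id(topic, question),
                "topic": topic,
                "question": question,
                "answer": normalize_text(str(rec.get("answer", ""))),
                "source": path.name,
            })
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(cleaned, ensure_ascii=False, indent=2), encoding="utf-8")
    return cleaned
```

Because the ID is derived from the content rather than from a counter, re-running the preparation on unchanged data yields the same IDs.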
Step 2: Build FAISS Index
Run the indexing script
From the root be directory with your virtual environment activated, run mistral_faiss.py, or reindex_mistral.py if using the alternative script.
Monitor the indexing process
The script will display progress as it:
- Loads data from kb_clean.json
- Creates text chunks (Q&A pairs)
- Generates embeddings using the all-MiniLM-L6-v2 model
- Builds the FAISS index
- Saves the index and metadata
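The indexing steps listed above could look roughly like the sketch below. It is not the contents of mistral_faiss.py: the "Q: ... A: ..." chunk layout, the record field names, the function names, and the choice to L2-normalize the embeddings are assumptions. The faiss and sentence_transformers imports sit inside build_index so that the chunking helper can be used without those packages installed.

```python
import json
from pathlib import Path


def make_chunks(records):
    """Turn cleaned records into Q&A text chunks plus one metadata dict each.

    Assumes each record has 'question' and (optionally) 'answer' keys;
    every other key is carried over as metadata.
    """
    chunks, metas = [], []
    for rec in records:
        chunks.append(f"Q: {rec['question']}\nA: {rec.get('answer', '')}".strip())
        metas.append({k: v for k, v in rec.items() if k not in ("question", "answer")})
    return chunks, metas


def build_index(kb_path, out_dir):
    """Embed every chunk and write index.faiss and metas.json into out_dir."""
    # Heavy dependencies are imported lazily; both must be installed to run this.
    import faiss
    from sentence_transformers import SentenceTransformer

    records = json.loads(Path(kb_path).read_text(encoding="utf-8"))
    chunks, metas = make_chunks(records)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Normalizing here is an assumption; it makes inner product equal cosine similarity.
    vectors = model.encode(chunks, normalize_embeddings=True, show_progress_bar=True)
    vectors = vectors.astype("float32")  # FAISS expects float32 input

    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    faiss.write_index(index, str(out / "index.faiss"))
    (out / "metas.json").write_text(
        json.dumps(metas, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return index.ntotal
```

Row i of the index corresponds to entry i of metas.json, which is what lets a search hit be mapped back to its topic and source.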
Embedding Model
The indexing script uses Sentence Transformers with the all-MiniLM-L6-v2 model, which maps each text chunk to a 384-dimensional vector.
FAISS Index Type
The system uses IndexFlatIP (Inner Product) for exact similarity search.
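To make "exact inner-product search" concrete, the pure-Python sketch below does what a flat inner-product index does conceptually: it scores the query against every stored vector and keeps the top k. It is an illustration only, not the FAISS API, and the function names are hypothetical. When vectors are L2-normalized, the inner product equals cosine similarity; whether the indexing script normalizes its embeddings is an assumption here.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(vectors, query, k):
    """Brute-force search: score every vector, return the k best (score, index) pairs."""
    scored = [(inner_product(v, query), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Nothing is approximated or pruned, so the results are exact, at the cost of time that grows linearly with the number of indexed questions.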
Refreshing Data
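To add new questions or refresh the existing knowledge base, place the new JSON files in data/raw/, re-run prepare_kb.py to regenerate kb_clean.json, then re-run the indexing script to rebuild the FAISS index.
Metadata Structure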
Each entry in metas.json contains:
- Unique identifier for the question (12-character SHA-1 hash)
- Main topic category: DBMS, OOPs, or OS
- Specific subtopic within the main category
- Question difficulty level: easy, medium, or hard
- Original source filename from data/raw/
Troubleshooting
kb_clean.json not found
Error: ❌ kb_clean.json not found. Run prepare_kb.py first.
Solution: Run the data preparation script, prepare_kb.py.
kb_clean.json is empty
Error: ❌ kb_clean.json is empty.
Solution: Check that you have JSON files in data/raw/ and re-run prepare_kb.py.
No raw data files
Issue: No JSON files in the data/raw/ directory.
Solution: Add your interview question JSON files to the data/raw/ directory before running prepare_kb.py.
Embedding model download issues
Issue: First-time run requires downloading the Sentence Transformers model.
Solution: Ensure you have internet connectivity. The model will be cached locally after the first download.
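The file-related problems above can be checked in one pass before starting the application. The helper below is a hypothetical convenience, not part of the project; it assumes the directory layout described on this page and reuses the error wording from the scripts where it is known.

```python
import json
from pathlib import Path


def check_kb(root="."):
    """Return a list of problems found under root; an empty list means all checks passed."""
    root = Path(root)
    problems = []

    raw_dir = root / "data" / "raw"
    if not (raw_dir.is_dir() and any(raw_dir.glob("*.json"))):
        problems.append("No JSON files in data/raw/.")

    kb_path = root / "data" / "processed" / "kb_clean.json"
    if not kb_path.exists():
        problems.append("kb_clean.json not found. Run prepare_kb.py first.")
    else:
        text = kb_path.read_text(encoding="utf-8").strip()
        if not text or not json.loads(text):
            problems.append("kb_clean.json is empty.")

    index_dir = root / "data" / "processed" / "faiss_mistral"
    for name in ("index.faiss", "metas.json"):
        if not (index_dir / name).exists():
            problems.append(f"{name} not found. Run the indexing script.")

    return problems
```

Calling check_kb() from the project root and printing each returned line gives a quick summary of which step still needs to be run.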
Data indexing complete! Your knowledge base is ready. Start the application to begin using the CS Interview Assistant.