Overview
The FAISS (Facebook AI Similarity Search) indexing step transforms the clean knowledge base into a searchable vector database. This enables fast semantic similarity search to retrieve relevant Q&A pairs for answering user queries.
Script Location
`source/scripts/mistral_faiss.py`
What FAISS Does
FAISS provides:
- Vector similarity search: Find semantically similar questions
- Efficient indexing: Handle thousands of vectors with millisecond latency
- Scalability: Supports millions of vectors with GPU acceleration
- Inner product search: Uses normalized embeddings for cosine similarity
Prerequisites
Before running the indexing script:
- Run preparation: Execute `prepare_kb.py` to generate `kb_clean.json`
- Install dependencies: The script needs the `faiss` and `sentence-transformers` packages
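The original install command isn't shown here; based on the libraries this page uses, it likely looks like:

```shell
# CPU-only FAISS build plus the embedding library
pip install faiss-cpu sentence-transformers
```

Swap `faiss-cpu` for `faiss-gpu` if you plan to use the GPU acceleration described below.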
Process Overview
The indexing pipeline:
1. Load the clean knowledge base data
2. Create text chunks combining questions and answers
3. Generate embeddings using Sentence Transformers
4. Build a FAISS index with inner product similarity
5. Save the index and metadata files
Text Chunking
Each Q&A pair is formatted as a single text chunk (see `source/scripts/mistral_faiss.py:24-40`).
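The chunking code itself isn't reproduced here; based on the "Q: … A: …" format the rest of this page assumes, a minimal sketch (with a hypothetical `build_chunk` helper and field names) could be:

```python
def build_chunk(qa: dict) -> str:
    """Combine a question and its answer into one retrievable text chunk."""
    return f"Q: {qa['question']}\nA: {qa['answer']}"

qa = {"question": "What is FAISS?", "answer": "A library for vector similarity search."}
print(build_chunk(qa))
# Q: What is FAISS?
# A: A library for vector similarity search.
```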
Chunk Format
Input: a cleaned Q&A record from `kb_clean.json`. Output: a single "Q: … A: …" text chunk.
Embedding Generation
The script uses Sentence Transformers to convert text into dense vector embeddings.
Model Selection
Default Model: `all-MiniLM-L6-v2`
- Dimension: 384
- Performance: Fast inference, good quality
- Size: 80MB
- Use case: General-purpose semantic search
| Model | Dimensions | Size | Performance | Use Case |
|---|---|---|---|---|
| `all-mpnet-base-v2` | 768 | 420MB | Best quality | High accuracy needed |
| `all-MiniLM-L12-v2` | 384 | 120MB | Balanced | More context |
| `all-MiniLM-L6-v2` | 384 | 80MB | Fastest | Production (default) |
| `paraphrase-multilingual-MiniLM-L12-v2` | 384 | 420MB | Multi-language | Non-English |
Embedding Code
source/scripts/mistral_faiss.py:43-66
Why Normalize Embeddings?
- Without normalization: `IP(A, B) = A · B`, which depends on vector magnitudes
- With normalization: `IP(A, B) = cos(θ)`, where θ is the angle between the vectors, so inner product search ranks by cosine similarity
FAISS Index Types
IndexFlatIP (Current)
- Exhaustive search (checks all vectors)
- 100% accuracy
- Best for < 1M vectors
- O(n) search complexity
Use IndexFlatIP when:
- The dataset is small to medium (< 100K vectors)
- You need perfect recall
- Latency < 100ms is acceptable
Alternative: IndexIVFFlat
- Inverted file index
- ~90-95% recall (configurable)
- Good for 100K - 10M vectors
- Much faster than flat search
- `nlist`: Number of clusters (√n is a good default)
- `nprobe`: Number of clusters to search (higher = more accurate, slower)
Alternative: IndexHNSWFlat
- Hierarchical Navigable Small World graphs
- ~95-99% recall
- Best for 10K - 100M vectors
- Fast search, slow indexing
- `M`: Number of connections per node (32 is a good default); higher M = better recall but more memory
Index Optimization
For Larger Datasets (> 100K vectors)
Replace the flat index creation with an IVF index (see Alternative: IndexIVFFlat above).
For GPU Acceleration
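A sketch of moving an existing CPU index onto a GPU with FAISS's standard conversion API (requires the `faiss-gpu` build and a CUDA device, so it won't run on CPU-only machines):

```python
import faiss

cpu_index = faiss.IndexFlatIP(384)             # the existing CPU index
res = faiss.StandardGpuResources()             # manages GPU scratch memory
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move to device 0
```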
Output Files
The script generates three files in `source/data/processed/faiss_mistral/`:
1. index.faiss
Binary FAISS index file containing:
- Vector embeddings
- Index structure
- Search metadata
2. metas.json
JSON array with per-vector metadata (e.g., topic, subtopic, difficulty).
3. ids.json (Optional)
Some implementations also save a separate mapping from vector position to record ID.
Running the Script
Basic Usage
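Given the script location cited above, the basic invocation is presumably:

```shell
# run from the repository root, after prepare_kb.py has produced kb_clean.json
python source/scripts/mistral_faiss.py
```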
Expected Output
Performance Metrics
For ~500 questions:
- Embedding generation: ~2-5 seconds (CPU), ~0.5s (GPU)
- Index building: < 1 second
- Total time: ~3-6 seconds
- Index size: ~750KB
- Metadata size: ~50KB
Resume FAISS vs Knowledge Base FAISS
The system uses different FAISS indexes for different purposes:
Knowledge Base FAISS (Current)
Location: `source/data/processed/faiss_mistral/`
Content: Q&A pairs from computer science topics
Use: Answering domain-specific questions
Index: Questions + Answers as chunks
Resume FAISS (Separate)
Location: `source/data/processed/faiss_resume/` (if implemented)
Content: Resume text, job descriptions, candidate profiles
Use: Matching resumes to jobs, finding candidates
Index: Resume sections or full documents
Key Differences
| Aspect | Knowledge Base | Resume |
|---|---|---|
| Data | Q&A pairs | Resume documents |
| Chunk size | Question + Answer | Section or full doc |
| Metadata | Topic, subtopic, difficulty | Skills, experience, education |
| Query type | User questions | Job requirements |
| Update frequency | Periodic (new questions) | Frequent (new candidates) |
Troubleshooting
Issue: “kb_clean.json not found”
Cause: The preparation script hasn't been run.
Solution: Run `prepare_kb.py` first to generate `kb_clean.json`.
Issue: Out of memory during embedding generation
Cause: Too many texts are encoded at once.
Solution: Reduce the batch size, e.g. `model.encode(texts, batch_size=16, normalize_embeddings=True)`.
Issue: Index file is huge
Cause: High-dimensional embeddings or a large dataset.
Solutions:
- Use a smaller model (e.g., MiniLM-L6 instead of MPNet)
- Use product quantization:
Issue: Search is too slow
Cause: Using a flat index on a large dataset.
Solutions:
- Switch to an IVF index (see the optimization section)
- Use GPU acceleration
- Reduce nprobe parameter
Issue: Poor search results
Possible Causes & Solutions:
- Embeddings not normalized: Ensure `normalize_embeddings=True` when encoding
- Wrong similarity metric: Use IndexFlatIP for cosine similarity
- Model mismatch: Use same model for indexing and querying
- Bad chunk formatting: Ensure consistent “Q: … A: …” format
Verification
Check Index Statistics
Test Search
Advanced Configuration
Batch Processing for Large Datasets
Multi-GPU Indexing
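FAISS provides a one-call helper for spreading an index across all visible GPUs; a sketch (requires the `faiss-gpu` build and at least one CUDA device):

```python
import faiss

cpu_index = faiss.IndexFlatIP(384)
# distributes the index across every visible GPU
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)
```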
Next Steps
After building the index:
- Query the system: Use `rag_query.py` to test retrieval
- Monitor performance: Track search latency and accuracy
- Iterate on quality: Refine embeddings or index type based on results
Related Documentation
- KB Preparation - Preparing data for indexing
- Adding Topics - Expanding the knowledge base
- FAISS Documentation: https://github.com/facebookresearch/faiss/wiki
- Sentence Transformers: https://www.sbert.net/