Pipeline overview
The ingestion process follows five sequential steps. Each step transforms the data into a format optimized for semantic search.

Step 1: Load files from GitHub
Component: GitHubCodeBaseLoader (src/rag/github_codebase_loader.py)
The loader fetches repository contents using GitHub’s API with intelligent filtering to exclude non-source files:
Filtering strategy
The loader implements two-level filtering.

Excluded file extensions
Binary and non-text files that don’t contribute to code understanding:
- Images: .png, .jpg, .svg, .ico, .webp
- Archives: .zip, .tar, .gz, .rar, .7z
- Binaries: .exe, .dll, .so, .pyc, .class
- Media: .mp3, .mp4, .wav, .avi
- Documents: .pdf, .doc, .xls, .ppt
- Minified: .min.js, .min.css
- Databases: .db, .sqlite
github_codebase_loader.py:3-14

Excluded directories
Common folders containing dependencies and generated files:
- node_modules/ - JavaScript dependencies
- .git/ - Version control data
- dist/, build/ - Build artifacts
- __pycache__/ - Python bytecode
- venv/, .venv/ - Python virtual environments
github_codebase_loader.py:16-24

Lazy loading
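The two-level filter described above can be sketched as follows. This is a minimal illustration, not the loader's actual code; the filter sets below are abbreviated samples of the exclusion lists on this page, and `should_load` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Abbreviated samples of the exclusion lists described above
# (the full values live in src/rag/github_codebase_loader.py:3-24).
EXCLUDED_EXTENSIONS = {".png", ".jpg", ".zip", ".exe", ".mp4", ".pdf", ".min.js", ".db"}
EXCLUDED_DIRS = {"node_modules", ".git", "dist", "build", "__pycache__", "venv", ".venv"}

def should_load(repo_path: str) -> bool:
    """Return True if a repository file passes both filter levels."""
    path = PurePosixPath(repo_path)
    # Level 1: skip files inside excluded directories anywhere in the path.
    if any(part in EXCLUDED_DIRS for part in path.parts):
        return False
    # Level 2: skip excluded extensions (endswith handles compound ones like .min.js).
    name = path.name.lower()
    return not any(name.endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(should_load("src/rag/text_splitter.py"))     # True
print(should_load("node_modules/react/index.js"))  # False
print(should_load("assets/logo.png"))              # False
```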
Files are loaded one-by-one using lazy loading to handle large repositories efficiently. Each loaded document includes metadata like path, source, and repo for traceability during retrieval.

Step 2: Chunk documents
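The lazy-loading pattern amounts to a generator that yields one document at a time instead of building a full list. This sketch uses a made-up `Document` stand-in; only the metadata keys (path, source, repo) come from this page.

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:
    """Minimal stand-in for the loader's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def lazy_load(file_entries: list[dict]) -> Iterator[Document]:
    """Yield documents one at a time so a large repo never sits fully in memory."""
    for entry in file_entries:
        # In the real loader, each file's content would be fetched from the
        # GitHub API here; this sketch assumes it is already in the entry.
        yield Document(
            page_content=entry["content"],
            metadata={"path": entry["path"], "source": entry["path"], "repo": entry["repo"]},
        )

docs = lazy_load([{"path": "src/app.py", "repo": "octocat/demo", "content": "print('hi')"}])
first = next(docs)
print(first.metadata["path"])  # src/app.py
```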
Component: TextSplitter (src/rag/text_splitter.py)
Code files are split into smaller chunks to fit within embedding model constraints and improve retrieval precision:
Language-aware splitting
The splitter recognizes 20+ programming languages and uses syntax-aware boundaries:
- Web: JavaScript, TypeScript, PHP, HTML
- Systems: C, C++, Rust, Go, Swift
- JVM: Java, Kotlin, Scala
- Scripting: Python, Ruby, Lua, Perl, R
- Functional: Haskell, Elixir
- Others: Solidity, C#, PowerShell, Markdown
text_splitter.py:3-50
Chunking parameters
Chunk size: 1000 characters
Balances context preservation with embedding model efficiency. Large enough to capture function implementations, small enough for precise matching.
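The core idea of syntax-aware splitting can be sketched for one language. This is not the project's splitter: it handles only Python, uses a single regex in place of real per-language separator lists, and `split_code` is a hypothetical name; the 1000-character default comes from this page.

```python
import re

def split_code(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedy splitter: prefer breaks at 'def '/'class ' boundaries,
    then hard-slice any single piece still larger than chunk_size."""
    # Split just before top-level definitions (a stand-in for the real
    # splitter's per-language separator lists).
    pieces = re.split(r"(?=^def |^class )", text, flags=re.MULTILINE)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece
        else:
            if current:
                chunks.append(current)
            while len(piece) > chunk_size:  # fallback for oversized pieces
                chunks.append(piece[:chunk_size])
                piece = piece[chunk_size:]
            current = piece
    if current:
        chunks.append(current)
    return chunks

source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
chunks = split_code(source, chunk_size=40)
print(len(chunks))  # 2
```

Splitting at definition boundaries keeps whole functions together, which is why a 1000-character chunk is usually enough to capture a complete implementation.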
Fallback splitter
For unrecognized file types, a generic recursive splitter is used.

Step 3: Generate embeddings
Component: EmbeddingManager (src/rag/embedding_manager.py)
Text chunks are converted to numerical vectors using the Sentence Transformers library:
Model: all-MiniLM-L6-v2
This model is specifically chosen for code understanding:
- Dimensions: 384-dimensional vectors
- Speed: Encodes ~10,000 tokens/second on CPU
- Quality: Trained on 1B+ sentence pairs for semantic similarity
- Size: 80MB model weights, small enough for local execution
Output format
Generates a numpy array of shape (n_chunks, 384).
Step 4: Store in ChromaDB
Component: VectorStore (src/rag/vector_store.py)
Embeddings are persisted in ChromaDB for efficient similarity search:
ChromaDB configuration
The vector store is initialized with specific settings.

Similarity metric: Cosine similarity measures the angle between vectors, making it ideal for semantic similarity regardless of text length.
Document storage
Each chunk is stored with:
- Unique ID: Generated UUID for tracking (doc_{uuid}_{index})
- Embedding: 384-dimensional vector
- Metadata: Original file path, document index, content length
- Content: Full text of the chunk
vector_store.py:42-81
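Building those per-chunk records might look like the sketch below. The doc_{uuid}_{index} ID format is from this page; `make_records` and the exact metadata field names are assumptions for illustration.

```python
import uuid

def make_records(chunks: list[str], path: str) -> list[dict]:
    """Build one record per chunk: unique ID, metadata, and full content."""
    doc_uuid = uuid.uuid4().hex  # one UUID shared across a document's chunks
    records = []
    for index, content in enumerate(chunks):
        records.append({
            "id": f"doc_{doc_uuid}_{index}",
            # Field names are illustrative stand-ins for the real metadata keys.
            "metadata": {"path": path, "doc_index": index, "length": len(content)},
            "content": content,
        })
    return records

records = make_records(["def a(): ...", "def b(): ..."], "src/app.py")
print(records[0]["id"][:4], records[1]["metadata"]["doc_index"])  # doc_ 1
```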
Persistence strategy
Local storage location
Vectors are stored at ~/.RepoRAGX/vector_store/. This allows instant reuse without re-embedding when querying the same repository.

Collection naming
Collections are named after repositories with slashes replaced:
- facebook/react → facebook_react
- microsoft/vscode → microsoft_vscode
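The mapping is a simple string replacement; `collection_name` is a hypothetical helper name for the transformation this page describes.

```python
def collection_name(repo: str) -> str:
    """Collection names can't contain '/', so replace it with '_'."""
    return repo.replace("/", "_")

print(collection_name("facebook/react"))    # facebook_react
print(collection_name("microsoft/vscode"))  # microsoft_vscode
```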
HNSW indexing
ChromaDB uses Hierarchical Navigable Small World (HNSW) graphs for fast approximate nearest neighbor search:
- Search time: O(log n) instead of O(n) for brute force
- Accuracy: >95% recall at top-10 results
- Trade-off: Small amount of disk space for dramatic speed improvement
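For contrast, here is the exact O(n) brute-force cosine search that HNSW approximates: it scores the query against every stored vector, which is what the graph traversal avoids. The data is random, purely for shape.

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 384)).astype(np.float32)  # stored embeddings
query = rng.normal(size=384).astype(np.float32)

# Brute-force cosine search: compute every similarity, then take the top 10.
norms = np.linalg.norm(index, axis=1) * np.linalg.norm(query)
scores = index @ query / norms
top10 = np.argsort(scores)[::-1][:10]
print(top10.shape)  # (10,)
```

HNSW reaches roughly the same top-10 while visiting only a logarithmic number of graph nodes, which is where the >95% recall trade-off comes from.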
Complete ingestion flow
Here’s the complete pipeline as implemented in src/main.py:37-44.
Performance considerations
Embedding speed
~10,000 tokens/second on modern CPUs. A 10,000-line repository (≈500k tokens) embeds in ~50 seconds.
Storage efficiency
384 dimensions × 4 bytes = 1.5KB per chunk. 1,000 chunks = ~1.5MB storage (plus index overhead).
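The arithmetic above checks out with float32 vectors:

```python
dims, bytes_per_float = 384, 4  # float32 components

per_chunk = dims * bytes_per_float   # 1536 bytes ≈ 1.5KB per chunk
total_kb = per_chunk * 1000 / 1024   # 1500KB ≈ 1.5MB for 1,000 chunks
print(per_chunk, round(total_kb))    # 1536 1500
```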
Memory usage
Model: 80MB (loaded once). Peak: ~500MB for large repos during embedding generation.
Network transfer
Depends on repo size. Lazy loading prevents memory spikes from large repositories.
Error handling
The pipeline gracefully handles common issues.

Next steps
RAG retrieval
Learn how queries are processed and answers are generated using the ingested data