Architecture overview
The system consists of two distinct pipelines that work together:
Data ingestion pipeline
Fetches code from GitHub, processes it into chunks, generates embeddings, and stores them in a vector database for efficient retrieval.
RAG retrieval pipeline
Answers each user question by embedding the query, retrieving the most similar code chunks, and generating a natural language answer.
Core components
RepoRAGX orchestrates several specialized components, each handling a specific responsibility:
GitHubCodeBaseLoader
Location: src/rag/github_codebase_loader.py
Responsible for fetching repository contents from GitHub with intelligent filtering: it skips binary files and non-source directories (such as node_modules/ and .git/) to focus on actual source code.
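The filtering logic can be sketched as a simple path predicate. The directory and extension lists below are illustrative assumptions; the real rules live in src/rag/github_codebase_loader.py and may differ.

```python
# Hypothetical filtering rules; the loader's actual lists may differ.
SKIP_DIRS = {"node_modules", ".git", "dist", "build", "__pycache__"}
BINARY_EXTENSIONS = {".png", ".jpg", ".zip", ".pdf", ".exe", ".so"}

def should_load(path: str) -> bool:
    """Return True if a repository path looks like loadable source code."""
    parts = path.split("/")
    # Reject anything inside a skipped directory.
    if any(part in SKIP_DIRS for part in parts[:-1]):
        return False
    # Reject files with binary extensions.
    name = parts[-1]
    return not any(name.endswith(ext) for ext in BINARY_EXTENSIONS)
```

Filtering before download keeps the chunk and embedding stages small, since build artifacts and binaries carry no retrievable meaning.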
TextSplitter
Location: src/rag/text_splitter.py
Breaks documents into manageable chunks while preserving code structure:
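A minimal character-based sketch of overlapping chunking (the real splitter also respects code structure such as line boundaries; the 1000/200 defaults here mirror the overlap figures quoted later in this page):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into chunks of at most chunk_size characters.

    Each chunk repeats the final `overlap` characters of the previous
    one, so context that straddles a boundary appears in both chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, leaving the overlap behind
    return chunks
```

With these defaults, consecutive chunks share a 200-character window, which is what lets retrieval recover a function whose definition happens to span a boundary.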
EmbeddingManager
Location: src/rag/embedding_manager.py
Generates vector embeddings using Sentence Transformers:
The all-MiniLM-L6-v2 model produces 384-dimensional embeddings optimized for semantic similarity.
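A typical embedding step with the sentence-transformers package looks like the sketch below. The `SentenceTransformer` class and `encode()` method are the library's real API; the wrapper class itself is a hypothetical stand-in for the code in src/rag/embedding_manager.py.

```python
class EmbeddingManager:
    """Hypothetical wrapper; the real class lives in src/rag/embedding_manager.py."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        # Deferred import so the class can be defined without the package installed.
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts):
        # encode() returns one 384-dimensional vector per input text.
        return self.model.encode(texts, convert_to_numpy=True)
```

Because the same model embeds both documents and queries, every vector lives in one shared 384-dimensional space, which is what makes cosine comparison meaningful.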
VectorStore
Location: src/rag/vector_store.py
Manages persistent storage using ChromaDB with cosine similarity:
Embeddings are persisted to ~/.RepoRAGX/vector_store for fast retrieval across sessions.
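Opening a persistent cosine-similarity collection with ChromaDB typically looks like this. `PersistentClient`, `get_or_create_collection`, and the `"hnsw:space": "cosine"` metadata key are ChromaDB's documented API; the function wrapper and collection name are assumptions standing in for src/rag/vector_store.py.

```python
def open_store(path: str = "~/.RepoRAGX/vector_store"):
    """Open (or create) a persistent ChromaDB collection using cosine distance.

    Hypothetical sketch; the real logic lives in src/rag/vector_store.py.
    """
    import os
    import chromadb  # deferred so the sketch parses without the package

    client = chromadb.PersistentClient(path=os.path.expanduser(path))
    # "hnsw:space": "cosine" makes the index rank results by cosine similarity.
    return client.get_or_create_collection(
        name="codebase", metadata={"hnsw:space": "cosine"}
    )
```

Because the client is persistent, re-running the tool reopens the same on-disk index instead of re-embedding the repository.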
RAGRetriever
Location: src/rag/rag_retriever.py
Performs semantic search to find relevant code chunks:
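The core of semantic search is cosine-similarity ranking. In production this is delegated to ChromaDB's index, but the math reduces to the following self-contained sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """chunks: list of (vector, text) pairs; return the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```

Cosine similarity ignores vector magnitude and compares direction only, so a long file chunk and a short query can still score as close neighbors.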
GroqLLM
Location: src/rag/groq_llm.py
Generates natural language answers using Groq’s LLM API:
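The generation step amounts to assembling retrieved chunks into a grounded prompt and sending it to Groq's chat completions endpoint. The `Groq` client and `chat.completions.create` call are the Groq Python SDK's real API; the prompt format, model name, and function shapes below are illustrative assumptions, not the code in src/rag/groq_llm.py.

```python
def build_prompt(question: str, chunks: list) -> str:
    """Assemble retrieved code chunks into a grounded prompt (hypothetical format)."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only this code context:\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def ask_groq(question: str, chunks: list, model: str = "llama-3.1-8b-instant") -> str:
    """Send the grounded prompt to Groq. Requires GROQ_API_KEY in the environment."""
    from groq import Groq  # deferred so the sketch parses without the package

    client = Groq()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question, chunks)}],
    )
    return resp.choices[0].message.content
```

Keeping the retrieved context inside the prompt is what makes the answer traceable back to actual files in the repository.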
The two-pipeline flow
Pipeline 1: Data ingestion (one-time setup)
This pipeline runs once per repository to build the searchable knowledge base:
- Load: Fetch files from GitHub repository
- Filter: Exclude binary files and build artifacts
- Chunk: Split files into overlapping segments
- Embed: Convert text to 384-dimensional vectors
- Store: Persist embeddings in ChromaDB
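The five steps above can be wired together as a single function. Every argument here is an injected stand-in for the real component, so the sketch shows only the data flow, not RepoRAGX's actual signatures:

```python
def ingest(loader, splitter, embedder, store):
    """Run the ingestion pipeline once: Load/Filter -> Chunk -> Embed -> Store.

    All four arguments are plain callables/mappings standing in for the
    real components (GitHubCodeBaseLoader, TextSplitter, etc.).
    """
    docs = loader()                                      # Load + Filter
    chunks = [c for d in docs for c in splitter(d)]      # Chunk
    vectors = [embedder(c) for c in chunks]              # Embed
    for chunk, vec in zip(chunks, vectors):              # Store
        store[chunk] = vec
    return len(chunks)
```

Because ingestion runs once per repository, its cost is amortized across every later query.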
src/main.py:37-44
Pipeline 2: RAG retrieval (per query)
This pipeline runs for each user question:
- Query: User asks a question about the codebase
- Embed: Convert query to same vector space
- Search: Find top-k similar chunks using cosine similarity
- Retrieve: Fetch matching code snippets with metadata
- Generate: LLM synthesizes answer from context
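The per-query flow glues the components from above into one short function. As with the ingestion sketch, the callables are injected stand-ins for the real embedder, retriever, and LLM:

```python
def answer_query(question, embed, search, generate):
    """Per-query flow: embed the question, retrieve similar chunks,
    then let the LLM synthesize an answer from that context."""
    qvec = embed(question)               # Embed into the same vector space
    chunks = search(qvec)                # Search + Retrieve top-k chunks
    return generate(question, chunks)    # Generate the final answer
```

The symmetry matters: the query is embedded with the same model as the documents, so "Search" is a like-for-like comparison in one vector space.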
src/main.py:49-54 and src/rag/groq_llm.py:28-56
Data flow diagram
Key design decisions
Why ChromaDB? Provides efficient cosine similarity search with persistent storage, allowing the vector database to be reused across sessions without re-embedding.
Why all-MiniLM-L6-v2? Balances speed and quality—fast enough for real-time embedding generation while maintaining strong semantic understanding for code.
Why chunk overlap? The 200-character overlap (20% of chunk size) ensures important context isn’t lost at chunk boundaries, improving retrieval accuracy.
Main execution flow
The entry point at src/main.py orchestrates both pipelines.
Next steps
- Data ingestion: Deep dive into the ingestion pipeline
- RAG retrieval: Learn how query processing works