The data ingestion pipeline is the foundation of RepoRAGX. It runs once per repository to fetch code from GitHub, process it into chunks, generate embeddings, and store them in a vector database for fast semantic search.

Pipeline overview

The ingestion process follows four sequential steps. Each step transforms the data into a format optimized for semantic search.

Step 1: Load files from GitHub

Component: GitHubCodeBaseLoader (src/rag/github_codebase_loader.py)

The loader fetches repository contents using GitHub’s API, with filtering to exclude non-source files:
loader = GitHubCodeBaseLoader(
    repo="owner/repo",
    branch="main",
    access_token=github_token
)
docs = loader.load()

Filtering strategy

The loader implements two-level filtering.

Level 1: Excluded extensions. Binary and non-text files that don’t contribute to code understanding:
  • Images: .png, .jpg, .svg, .ico, .webp
  • Archives: .zip, .tar, .gz, .rar, .7z
  • Binaries: .exe, .dll, .so, .pyc, .class
  • Media: .mp3, .mp4, .wav, .avi
  • Documents: .pdf, .doc, .xls, .ppt
  • Minified: .min.js, .min.css
  • Databases: .db, .sqlite
Defined in github_codebase_loader.py:3-14
Level 2: Excluded directories. Common folders containing dependencies and generated files:
  • node_modules/ - JavaScript dependencies
  • .git/ - Version control data
  • dist/, build/ - Build artifacts
  • __pycache__/ - Python bytecode
  • venv/, .venv/ - Python virtual environments
Defined in github_codebase_loader.py:16-24
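Combined, the two levels amount to a simple path predicate. A minimal sketch (the sets here are abbreviated samples of the lists in github_codebase_loader.py, and `should_skip` is an illustrative name, not the actual API):

```python
# Abbreviated samples; the full lists live in github_codebase_loader.py.
EXCLUDED_EXTENSIONS = {".png", ".zip", ".exe", ".mp4", ".pdf", ".db", ".min.js"}
EXCLUDED_DIRS = {"node_modules", ".git", "dist", "build", "__pycache__", "venv", ".venv"}

def should_skip(path: str) -> bool:
    """Return True if a repository path should be excluded from loading."""
    parts = path.split("/")
    # Level 2 check: skip anything nested inside an excluded directory.
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # Level 1 check: skip files with an excluded extension.
    filename = parts[-1]
    return any(filename.endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(should_skip("node_modules/react/index.js"))  # True
print(should_skip("src/rag/vector_store.py"))      # False
```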

Lazy loading

Files are loaded one-by-one using lazy loading to handle large repositories efficiently:
# Pull documents one at a time so a single bad file can be
# skipped without aborting the whole iteration.
iterator = self.loader.lazy_load()
while True:
    try:
        doc = next(iterator)
    except StopIteration:
        break  # all files processed
    except Exception as e:
        print(f"Skipping file: {e}")
        continue
    docs.append(doc)
This approach prevents memory issues when processing repositories with thousands of files.
Each loaded document includes metadata like path, source, and repo for traceability during retrieval.
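For illustration, a retrieval step can turn that metadata straight into a citation (the field values below are hypothetical):

```python
# Hypothetical metadata dict mirroring the fields described above.
doc_metadata = {
    "path": "src/rag/vector_store.py",  # file path inside the repository
    "source": "github",                 # origin of the document
    "repo": "owner/repo",               # repository identifier
}

# At retrieval time the path lets an answer cite the originating file.
citation = f"{doc_metadata['repo']}:{doc_metadata['path']}"
print(citation)  # owner/repo:src/rag/vector_store.py
```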

Step 2: Chunk documents

Component: TextSplitter (src/rag/text_splitter.py)

Code files are split into smaller chunks to fit within embedding model constraints and improve retrieval precision:
chunks = TextSplitter(
    documents=docs,
    chunk_size=1000,
    chunk_overlap=200
).split_documents_into_chunks()

Language-aware splitting

The splitter recognizes 20+ programming languages and uses syntax-aware boundaries:
# Splits on class/function definitions
class MyClass:         # ← Natural boundary
    def method(self):  # ← Another boundary
        pass
Supported languages include:
  • Web: JavaScript, TypeScript, PHP, HTML
  • Systems: C, C++, Rust, Go, Swift
  • JVM: Java, Kotlin, Scala
  • Scripting: Python, Ruby, Lua, Perl, R
  • Functional: Haskell, Elixir
  • Others: Solidity, C#, PowerShell, Markdown
Mapping defined in text_splitter.py:3-50
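A sketch of what such a mapping enables, assuming a table from file extension to preferred separators (the separator lists and the `separators_for` helper are illustrative, not the exact contents of text_splitter.py):

```python
# Illustrative extension-to-separator table; the full 20+ language
# mapping is defined in text_splitter.py.
LANGUAGE_SEPARATORS = {
    ".py": ["\nclass ", "\ndef ", "\n\n", "\n", " "],
    ".js": ["\nfunction ", "\nconst ", "\n\n", "\n", " "],
    ".rs": ["\nfn ", "\nimpl ", "\n\n", "\n", " "],
}

def separators_for(path: str) -> list[str]:
    """Pick syntax-aware separators by extension, with a generic fallback."""
    for ext, seps in LANGUAGE_SEPARATORS.items():
        if path.endswith(ext):
            return seps
    return ["\n\n", "\n", " ", ""]  # generic fallback splitter order

print(separators_for("src/main.py")[0])  # '\nclass '
```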

Chunking parameters

1. Chunk size: 1000 characters

Balances context preservation with embedding model efficiency. Large enough to capture function implementations, small enough for precise matching.

2. Chunk overlap: 200 characters

Creates 20% overlap between consecutive chunks to prevent information loss at boundaries. Critical for functions that span chunk edges.
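The size/overlap arithmetic can be illustrated with a naive fixed-stride splitter (the real splitter prefers syntax-aware boundaries; `chunk_text` here shows only the math):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-stride chunking that illustrates the size/overlap parameters.
    Each chunk starts 800 characters after the previous one, so consecutive
    chunks share a 200-character overlap."""
    stride = chunk_size - chunk_overlap  # 800 new characters per chunk
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(2600))
chunks = chunk_text(text)
print(len(chunks))                          # 3
print(chunks[0][-200:] == chunks[1][:200])  # True: 200-char overlap
```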

Fallback splitter

For unrecognized file types, a generic recursive splitter is used:
RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
This tries splitting on double newlines first, then single newlines, then spaces.
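The idea can be sketched as a recursion over the separator list; note that the real RecursiveCharacterTextSplitter also merges small pieces back up toward chunk_size and applies the overlap, which this sketch omits:

```python
def recursive_split(text: str, separators: list[str], chunk_size: int = 1000) -> list[str]:
    """Sketch of recursive splitting: try the coarsest separator first,
    recursing with finer ones only for pieces still over chunk_size.
    Omits the merge-back and overlap steps of the real splitter."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    # The empty-string separator means "split into individual characters".
    pieces = text.split(sep) if sep else list(text)
    out = []
    for piece in pieces:
        out.extend(recursive_split(piece, rest, chunk_size))
    return out

paras = recursive_split("a" * 30 + "\n\n" + "b" * 30, ["\n\n", "\n", " ", ""], chunk_size=40)
print([len(p) for p in paras])  # [30, 30]: the double newline was enough
```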

Step 3: Generate embeddings

Component: EmbeddingManager (src/rag/embedding_manager.py)

Text chunks are converted to numerical vectors using the Sentence Transformers library:
embedding_manager = EmbeddingManager(model_name="all-MiniLM-L6-v2")
texts = [doc.page_content for doc in chunks]
embeddings = embedding_manager.generate_embeddings(texts)

Model: all-MiniLM-L6-v2

This model is specifically chosen for code understanding:
  • Dimensions: 384-dimensional vectors
  • Speed: encodes ~10,000 tokens/second on CPU
  • Quality: trained on 1B+ sentence pairs for semantic similarity
  • Size: 80MB of model weights, small enough for local execution
The model is loaded once and reused for all chunks:
self.model = SentenceTransformer(self.model_name)
embeddings = self.model.encode(texts, show_progress_bar=True)

Output format

Generates a numpy array of shape (n_chunks, 384):
# Example for 1,000 chunks:
embeddings.shape  # (1000, 384)
Each row is a 384-dimensional vector representing the semantic meaning of one chunk.
The same model must be used for both document embedding and query embedding to ensure vectors exist in the same semantic space.

Step 4: Store in ChromaDB

Component: VectorStore (src/rag/vector_store.py)

Embeddings are persisted in ChromaDB for efficient similarity search:
vector_store = VectorStore(
    collection_name=repo.replace("/", "_"),
    persist_directory=Path.home() / ".RepoRAGX" / "vector_store"
)
vector_store.add_documents(chunks, embeddings)

ChromaDB configuration

The vector store is initialized with specific settings:
self.collection = self.client.get_or_create_collection(
    name=collection_name,
    metadata={
        "hnsw:space": "cosine",           # Cosine similarity metric
        "repo": collection_name,           # Repository identifier
        "type": "github_codebase",         # Collection type
        "embedding_model": "all-MiniLM-L6-v2"  # Model reference
    }
)
Similarity metric: Cosine similarity measures the angle between vectors, making it ideal for semantic similarity regardless of text length.
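To make that concrete, cosine similarity can be computed by hand; scaling a vector changes its length but not its angle, so the score is unchanged:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity in [-1, 1]; invariant to vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
print(round(cosine_similarity(a, [2.0, 4.0, 6.0]), 6))  # 1.0: scaling preserves the angle
print(round(cosine_similarity(a, [3.0, 2.0, 1.0]), 3))  # 0.714
```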

Document storage

Each chunk is stored with:
  1. Unique ID: Generated UUID for tracking (doc_{uuid}_{index})
  2. Embedding: 384-dimensional vector
  3. Metadata: Original file path, document index, content length
  4. Content: Full text of the chunk
self.collection.add(
    ids=ids,
    embeddings=embeddings_list,
    metadatas=metadatas,
    documents=documents_text
)
Implementation: vector_store.py:42-81
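Assembling the parallel lists passed to collection.add might look like the following sketch (chunks are plain dicts here, and `build_records` is an illustrative helper, not the actual implementation):

```python
import uuid

def build_records(chunks):
    """Build parallel id/metadata/text lists in the doc_{uuid}_{index} scheme.
    Chunks are plain dicts here; the real code works with Document objects."""
    batch_id = uuid.uuid4().hex
    ids, metadatas, documents_text = [], [], []
    for index, chunk in enumerate(chunks):
        ids.append(f"doc_{batch_id}_{index}")
        metadatas.append({
            "path": chunk["path"],         # original file path
            "index": index,                # position within the batch
            "length": len(chunk["text"]),  # content length in characters
        })
        documents_text.append(chunk["text"])
    return ids, metadatas, documents_text

ids, metas, texts = build_records([{"path": "src/main.py", "text": "def main(): ..."}])
print(ids[0].startswith("doc_"), metas[0]["length"])  # True 15
```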

Persistence strategy

Vectors are stored at ~/.RepoRAGX/vector_store/ with the following structure:
~/.RepoRAGX/
└── vector_store/
    ├── chroma.sqlite3        # Metadata database
    └── {collection_id}/       # Per-collection data
        ├── data_level0.bin   # HNSW index
        └── header.bin        # Index metadata
This allows instant reuse without re-embedding when querying the same repository.
Collections are named after repositories with slashes replaced:
  • facebook/react → facebook_react
  • microsoft/vscode → microsoft_vscode
Each repository gets its own isolated collection.

HNSW indexing

ChromaDB uses Hierarchical Navigable Small World (HNSW) graphs for fast approximate nearest neighbor search:
  • Search time: O(log n) instead of O(n) for brute force
  • Accuracy: >95% recall at top-10 results
  • Trade-off: Small amount of disk space for dramatic speed improvement
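For contrast, the exact O(n) baseline that HNSW approximates is a full scan over every stored vector; a minimal sketch:

```python
import math

def brute_force_top_k(query, vectors, k=2):
    """O(n) exact nearest-neighbor search by cosine similarity, the baseline
    that HNSW's graph traversal approximates in O(log n)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    # Score every stored vector, then keep the k best indices.
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(brute_force_top_k([1.0, 0.0], vectors))  # [0, 2]
```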

Complete ingestion flow

Here’s the complete pipeline as implemented in src/main.py:37-44:
# Step 1: Load from GitHub
docs = GitHubCodeBaseLoader(
    repo=repo,
    branch=branch,
    access_token=github_token
).load()

# Step 2: Chunk documents
chunks = TextSplitter(docs).split_documents_into_chunks()

# Step 3: Generate embeddings
embedding_manager = EmbeddingManager()
texts = [doc.page_content for doc in chunks]
embeddings = embedding_manager.generate_embeddings(texts)

# Step 4: Store in vector database
vector_store = VectorStore(
    collection_name=repo.replace("/", "_"),
    persist_directory=persist_directory
)
vector_store.add_documents(chunks, embeddings)

Performance considerations

Embedding speed

~10,000 tokens/second on modern CPUs. A 10,000-line repository (≈500k tokens) embeds in ~50 seconds.

Storage efficiency

384 dimensions × 4 bytes = ~1.5KB per chunk. 1,000 chunks = ~1.5MB of storage (plus index overhead).
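These figures follow directly from the vector shape (treating 1 KB as 1024 bytes):

```python
DIMENSIONS = 384       # all-MiniLM-L6-v2 output size
BYTES_PER_FLOAT32 = 4  # each dimension stored as a 32-bit float

bytes_per_chunk = DIMENSIONS * BYTES_PER_FLOAT32
print(bytes_per_chunk)                              # 1536 bytes, i.e. 1.5 KB
print(round(1000 * bytes_per_chunk / 1024 ** 2, 2)) # 1.46 MB for 1,000 chunks
```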

Memory usage

Model: 80MB (loaded once). Peak: ~500MB for large repositories during embedding generation.

Network transfer

Transfer volume depends on repository size. Lazy loading prevents memory spikes when fetching large repositories.

Error handling

The pipeline gracefully handles common issues:
1. File access errors

Skipped files are logged but don’t halt the pipeline:
except Exception as e:
    print(f"Skipping file: {e}")

2. Existing collections

Previous collections are deleted before re-indexing:
self.client.delete_collection(name=self.collection_name)

3. Model loading failures

Model-loading errors are raised early with a clear message:
except Exception as e:
    print(f"Error loading model {self.model_name}: {e}")
    raise

Next steps

RAG retrieval

Learn how queries are processed and answers are generated using the ingested data
