The data ingestion pipeline is the foundation of RepoRAGX. It runs once per repository to fetch code from GitHub, process it into chunks, generate embeddings, and store them in a vector database for fast semantic search.

Pipeline overview

The ingestion process follows four sequential steps. Each step transforms the data into a format optimized for semantic search.

Step 1: Load files from GitHub

Component: GitHubCodeBaseLoader (src/rag/github_codebase_loader.py)

The loader fetches repository contents using GitHub’s API, with filtering to exclude non-source files:
loader = GitHubCodeBaseLoader(
    repo="owner/repo",
    branch="main",
    access_token=github_token
)
docs = loader.load()

Filtering strategy

The loader implements two-level filtering.

Level 1: Excluded extensions. Binary and non-text files that don’t contribute to code understanding:
  • Images: .png, .jpg, .svg, .ico, .webp
  • Archives: .zip, .tar, .gz, .rar, .7z
  • Binaries: .exe, .dll, .so, .pyc, .class
  • Media: .mp3, .mp4, .wav, .avi
  • Documents: .pdf, .doc, .xls, .ppt
  • Minified: .min.js, .min.css
  • Databases: .db, .sqlite
Defined in github_codebase_loader.py:3-14
Level 2: Excluded directories. Common folders containing dependencies and generated files:
  • node_modules/ - JavaScript dependencies
  • .git/ - Version control data
  • dist/, build/ - Build artifacts
  • __pycache__/ - Python bytecode
  • venv/, .venv/ - Python virtual environments
Defined in github_codebase_loader.py:16-24
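Combined, the two levels amount to a simple path predicate. A minimal sketch (the sets here are abbreviated samples of the lists in github_codebase_loader.py, and `should_skip` is an illustrative name, not the actual API):

```python
# Abbreviated samples; the full lists live in github_codebase_loader.py.
EXCLUDED_EXTENSIONS = {".png", ".zip", ".exe", ".mp4", ".pdf", ".db", ".min.js"}
EXCLUDED_DIRS = {"node_modules", ".git", "dist", "build", "__pycache__", "venv", ".venv"}

def should_skip(path: str) -> bool:
    """Return True if a repository path should be excluded from loading."""
    parts = path.split("/")
    # Level 2 check: skip anything nested inside an excluded directory.
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # Level 1 check: skip files with an excluded extension.
    filename = parts[-1]
    return any(filename.endswith(ext) for ext in EXCLUDED_EXTENSIONS)

print(should_skip("node_modules/react/index.js"))  # True
print(should_skip("src/rag/vector_store.py"))      # False
```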

Lazy loading

Files are loaded one-by-one using lazy loading to handle large repositories efficiently:
# Pull documents one at a time so a single bad file can be
# skipped without aborting the whole iteration.
iterator = self.loader.lazy_load()
while True:
    try:
        doc = next(iterator)
    except StopIteration:
        break  # all files processed
    except Exception as e:
        print(f"Skipping file: {e}")
        continue
    docs.append(doc)
This approach prevents memory issues when processing repositories with thousands of files.
Each loaded document includes metadata like path, source, and repo for traceability during retrieval.
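For illustration, a retrieval step can turn that metadata straight into a citation (the field values below are hypothetical):

```python
# Hypothetical metadata dict mirroring the fields described above.
doc_metadata = {
    "path": "src/rag/vector_store.py",  # file path inside the repository
    "source": "github",                 # origin of the document
    "repo": "owner/repo",               # repository identifier
}

# At retrieval time the path lets an answer cite the originating file.
citation = f"{doc_metadata['repo']}:{doc_metadata['path']}"
print(citation)  # owner/repo:src/rag/vector_store.py
```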

Step 2: Chunk documents

Component: TextSplitter (src/rag/text_splitter.py)

Code files are split into smaller chunks to fit within embedding model constraints and improve retrieval precision:
chunks = TextSplitter(
    documents=docs,
    chunk_size=1000,
    chunk_overlap=200
).split_documents_into_chunks()

Language-aware splitting

The splitter recognizes 20+ programming languages and uses syntax-aware boundaries:
# Splits on class/function definitions
class MyClass:         # ← Natural boundary
    def method(self):  # ← Another boundary
        pass
Supported languages include:
  • Web: JavaScript, TypeScript, PHP, HTML
  • Systems: C, C++, Rust, Go, Swift
  • JVM: Java, Kotlin, Scala
  • Scripting: Python, Ruby, Lua, Perl, R
  • Functional: Haskell, Elixir
  • Others: Solidity, C#, PowerShell, Markdown
Mapping defined in text_splitter.py:3-50
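A sketch of what such a mapping enables, assuming a table from file extension to preferred separators (the separator lists and the `separators_for` helper are illustrative, not the exact contents of text_splitter.py):

```python
# Illustrative extension-to-separator table; the full 20+ language
# mapping is defined in text_splitter.py.
LANGUAGE_SEPARATORS = {
    ".py": ["\nclass ", "\ndef ", "\n\n", "\n", " "],
    ".js": ["\nfunction ", "\nconst ", "\n\n", "\n", " "],
    ".rs": ["\nfn ", "\nimpl ", "\n\n", "\n", " "],
}

def separators_for(path: str) -> list[str]:
    """Pick syntax-aware separators by extension, with a generic fallback."""
    for ext, seps in LANGUAGE_SEPARATORS.items():
        if path.endswith(ext):
            return seps
    return ["\n\n", "\n", " ", ""]  # generic fallback splitter order

print(separators_for("src/main.py")[0])  # '\nclass '
```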

Chunking parameters

1. Chunk size: 1000 characters

Balances context preservation with embedding model efficiency. Large enough to capture function implementations, small enough for precise matching.

2. Chunk overlap: 200 characters

Creates 20% overlap between consecutive chunks to prevent information loss at boundaries. Critical for functions that span chunk edges.
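The size/overlap arithmetic can be illustrated with a naive fixed-stride splitter (the real splitter prefers syntax-aware boundaries; `chunk_text` here shows only the math):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-stride chunking that illustrates the size/overlap parameters.
    Each chunk starts 800 characters after the previous one, so consecutive
    chunks share a 200-character overlap."""
    stride = chunk_size - chunk_overlap  # 800 new characters per chunk
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(2600))
chunks = chunk_text(text)
print(len(chunks))                          # 3
print(chunks[0][-200:] == chunks[1][:200])  # True: 200-char overlap
```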

Fallback splitter

For unrecognized file types, a generic recursive splitter is used:
RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
This tries splitting on double newlines first, then single newlines, then spaces.
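The idea can be sketched as a recursion over the separator list; note that the real RecursiveCharacterTextSplitter also merges small pieces back up toward chunk_size and applies the overlap, which this sketch omits:

```python
def recursive_split(text: str, separators: list[str], chunk_size: int = 1000) -> list[str]:
    """Sketch of recursive splitting: try the coarsest separator first,
    recursing with finer ones only for pieces still over chunk_size.
    Omits the merge-back and overlap steps of the real splitter."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    # The empty-string separator means "split into individual characters".
    pieces = text.split(sep) if sep else list(text)
    out = []
    for piece in pieces:
        out.extend(recursive_split(piece, rest, chunk_size))
    return out

paras = recursive_split("a" * 30 + "\n\n" + "b" * 30, ["\n\n", "\n", " ", ""], chunk_size=40)
print([len(p) for p in paras])  # [30, 30]: the double newline was enough
```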

Step 3: Generate embeddings

Component: EmbeddingManager (src/rag/embedding_manager.py)

Text chunks are converted to numerical vectors using the Sentence Transformers library:
embedding_manager = EmbeddingManager(model_name="all-MiniLM-L6-v2")
texts = [doc.page_content for doc in chunks]
embeddings = embedding_manager.generate_embeddings(texts)

Model: all-MiniLM-L6-v2

This model is specifically chosen for code understanding:
  • Dimensions: 384-dimensional vectors
  • Speed: encodes ~10,000 tokens/second on CPU
  • Quality: trained on 1B+ sentence pairs for semantic similarity
  • Size: 80MB of model weights, small enough for local execution
The model is loaded once and reused for all chunks:
self.model = SentenceTransformer(self.model_name)
embeddings = self.model.encode(texts, show_progress_bar=True)

Output format

Generates a numpy array of shape (n_chunks, 384):
# Example for 1,000 chunks:
embeddings.shape  # (1000, 384)
Each row is a 384-dimensional vector representing the semantic meaning of one chunk.
The same model must be used for both document embedding and query embedding to ensure vectors exist in the same semantic space.

Step 4: Store in ChromaDB

Component: VectorStore (src/rag/vector_store.py)

Embeddings are persisted in ChromaDB for efficient similarity search:
vector_store = VectorStore(
    collection_name=repo.replace("/", "_"),
    persist_directory=Path.home() / ".RepoRAGX" / "vector_store"
)
vector_store.add_documents(chunks, embeddings)

ChromaDB configuration

The vector store is initialized with specific settings:
self.collection = self.client.get_or_create_collection(
    name=collection_name,
    metadata={
        "hnsw:space": "cosine",           # Cosine similarity metric
        "repo": collection_name,           # Repository identifier
        "type": "github_codebase",         # Collection type
        "embedding_model": "all-MiniLM-L6-v2"  # Model reference
    }
)
Similarity metric: Cosine similarity measures the angle between vectors, making it ideal for semantic similarity regardless of text length.
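To make that concrete, cosine similarity can be computed by hand; scaling a vector changes its length but not its angle, so the score is unchanged:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity in [-1, 1]; invariant to vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
print(round(cosine_similarity(a, [2.0, 4.0, 6.0]), 6))  # 1.0: scaling preserves the angle
print(round(cosine_similarity(a, [3.0, 2.0, 1.0]), 3))  # 0.714
```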

Document storage

Each chunk is stored with:
  1. Unique ID: Generated UUID for tracking (doc_{uuid}_{index})
  2. Embedding: 384-dimensional vector
  3. Metadata: Original file path, document index, content length
  4. Content: Full text of the chunk
self.collection.add(
    ids=ids,
    embeddings=embeddings_list,
    metadatas=metadatas,
    documents=documents_text
)
Implementation: vector_store.py:42-81
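Assembling the parallel lists passed to collection.add might look like the following sketch (chunks are plain dicts here, and `build_records` is an illustrative helper, not the actual implementation):

```python
import uuid

def build_records(chunks):
    """Build parallel id/metadata/text lists in the doc_{uuid}_{index} scheme.
    Chunks are plain dicts here; the real code works with Document objects."""
    batch_id = uuid.uuid4().hex
    ids, metadatas, documents_text = [], [], []
    for index, chunk in enumerate(chunks):
        ids.append(f"doc_{batch_id}_{index}")
        metadatas.append({
            "path": chunk["path"],         # original file path
            "index": index,                # position within the batch
            "length": len(chunk["text"]),  # content length in characters
        })
        documents_text.append(chunk["text"])
    return ids, metadatas, documents_text

ids, metas, texts = build_records([{"path": "src/main.py", "text": "def main(): ..."}])
print(ids[0].startswith("doc_"), metas[0]["length"])  # True 15
```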

Persistence strategy

Vectors are stored at ~/.RepoRAGX/vector_store/ with the following structure:
~/.RepoRAGX/
└── vector_store/
    ├── chroma.sqlite3        # Metadata database
    └── {collection_id}/       # Per-collection data
        ├── data_level0.bin   # HNSW index
        └── header.bin        # Index metadata
This allows instant reuse without re-embedding when querying the same repository.
Collections are named after repositories with slashes replaced:
  • facebook/react → facebook_react
  • microsoft/vscode → microsoft_vscode
Each repository gets its own isolated collection.

HNSW indexing

ChromaDB uses Hierarchical Navigable Small World (HNSW) graphs for fast approximate nearest neighbor search:
  • Search time: O(log n) instead of O(n) for brute force
  • Accuracy: >95% recall at top-10 results
  • Trade-off: Small amount of disk space for dramatic speed improvement
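For contrast, the exact O(n) baseline that HNSW approximates is a full scan over every stored vector; a minimal sketch:

```python
import math

def brute_force_top_k(query, vectors, k=2):
    """O(n) exact nearest-neighbor search by cosine similarity, the baseline
    that HNSW's graph traversal approximates in O(log n)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    # Score every stored vector, then keep the k best indices.
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(brute_force_top_k([1.0, 0.0], vectors))  # [0, 2]
```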

Complete ingestion flow

Here’s the complete pipeline as implemented in src/main.py:37-44:
# Step 1: Load from GitHub
docs = GitHubCodeBaseLoader(
    repo=repo,
    branch=branch,
    access_token=github_token
).load()

# Step 2: Chunk documents
chunks = TextSplitter(docs).split_documents_into_chunks()

# Step 3: Generate embeddings
embedding_manager = EmbeddingManager()
texts = [doc.page_content for doc in chunks]
embeddings = embedding_manager.generate_embeddings(texts)

# Step 4: Store in vector database
vector_store = VectorStore(
    collection_name=repo.replace("/", "_"),
    persist_directory=persist_directory
)
vector_store.add_documents(chunks, embeddings)

Performance considerations

Embedding speed

~10,000 tokens/second on modern CPUs. A 10,000-line repository (≈500k tokens) embeds in ~50 seconds.

Storage efficiency

384 dimensions × 4 bytes = ~1.5KB per chunk. 1,000 chunks = ~1.5MB of storage (plus index overhead).
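These figures follow directly from the vector shape (treating 1 KB as 1024 bytes):

```python
DIMENSIONS = 384       # all-MiniLM-L6-v2 output size
BYTES_PER_FLOAT32 = 4  # each dimension stored as a 32-bit float

bytes_per_chunk = DIMENSIONS * BYTES_PER_FLOAT32
print(bytes_per_chunk)                              # 1536 bytes, i.e. 1.5 KB
print(round(1000 * bytes_per_chunk / 1024 ** 2, 2)) # 1.46 MB for 1,000 chunks
```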

Memory usage

Model: 80MB (loaded once). Peak: ~500MB for large repositories during embedding generation.

Network transfer

Transfer volume depends on repository size. Lazy loading prevents memory spikes when fetching large repositories.

Error handling

The pipeline gracefully handles common issues:
1. File access errors

Skipped files are logged but don’t halt the pipeline:
except Exception as e:
    print(f"Skipping file: {e}")

2. Existing collections

Previous collections are deleted before re-indexing:
self.client.delete_collection(name=self.collection_name)

3. Model loading failures

Model-loading errors are raised early with a clear message:
except Exception as e:
    print(f"Error loading model {self.model_name}: {e}")
    raise

Next steps

RAG retrieval

Learn how queries are processed and answers are generated using the ingested data
