┌─────────────────┐
│   Source Code   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Tree-sitter   │  Parse ASTs
│     Parsers     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Ingestion    │  Extract symbols,
│    Pipeline     │  resolve calls/imports
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  KuzuDB Graph   │  Store nodes + edges
│    Database     │
└────────┬────────┘
         │
         ├──► BM25 Full-Text Search
         │
         └──► Semantic Embeddings
         │
         ▼
┌───────────────┐
│  MCP Server   │
│   HTTP API    │
└───────────────┘
```
## Tech Stack
### CLI vs Web Comparison
<table>
<thead>
<tr>
<th>Component</th>
<th>CLI (Node.js)</th>
<th>Web (Browser)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Runtime</strong></td>
<td>Node.js ≥18</td>
<td>Modern browsers (Chrome, Firefox, Safari)</td>
</tr>
<tr>
<td><strong>Database</strong></td>
<td>KuzuDB (native) - embedded graph database</td>
<td>KuzuDB WASM - browser-compiled version</td>
</tr>
<tr>
<td><strong>Parser</strong></td>
<td>Tree-sitter native bindings (C)</td>
<td>Tree-sitter WASM modules</td>
</tr>
<tr>
<td><strong>Embeddings</strong></td>
<td>transformers.js + ONNX Runtime (native or CUDA)</td>
<td>transformers.js (WebGPU or WASM)</td>
</tr>
<tr>
<td><strong>Concurrency</strong></td>
<td>Worker threads (parallel parsing)</td>
<td>Web Workers (parallel parsing)</td>
</tr>
<tr>
<td><strong>Storage</strong></td>
<td><code>.gitnexus/</code> directory (filesystem)</td>
<td>IndexedDB (browser storage)</td>
</tr>
<tr>
<td><strong>Memory</strong></td>
<td>8GB heap (auto-allocated)</td>
<td>Browser-limited (~2-4GB typical)</td>
</tr>
</tbody>
</table>
## Core Technologies
### KuzuDB - Graph Database
[KuzuDB](https://kuzudb.com/) is an **embedded graph database** optimized for knowledge graphs:
**Why KuzuDB?**
- **Embedded** - No separate server process required
- **Graph-native** - Optimized for relationship traversal
- **Cypher queries** - Standard graph query language
- **Fast ingestion** - Bulk CSV loading for millions of nodes/edges
- **Single-file storage** - Database stored as a single `.kuzu` file
**Schema:**
GitNexus uses a **multi-label node schema**:
- **File nodes** - `File`, `Folder`
- **Code nodes** - `Function`, `Class`, `Method`, `Interface`, `Struct`, `Enum`, `Trait`, etc.
- **Grouping nodes** - `Community` (clusters), `Process` (execution flows)
- **Embedding nodes** - `CodeEmbedding` (semantic vectors)
**Relationships:**
- `CALLS` - Function/method calls
- `IMPORTS` - Module imports
- `EXTENDS` - Class inheritance
- `IMPLEMENTS` - Interface implementation
- `DEFINES` - File defines symbol
- `MEMBER_OF` - Symbol belongs to community
- `STEP_IN_PROCESS` - Symbol is part of execution flow
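With this schema, relationships can be traversed directly in Cypher. For example (an illustrative query; the symbol name and exact property names are assumptions):

```cypher
// Functions that reach `parseFile` within two CALLS hops
MATCH (caller:Function)-[:CALLS*1..2]->(target:Function {name: "parseFile"})
RETURN DISTINCT caller.name
```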
### Tree-sitter - AST Parsing
[Tree-sitter](https://tree-sitter.github.io/) provides **fast, incremental parsing**:
**Key Features:**
- **Error-tolerant** - Parses incomplete code
- **Incremental** - Re-parses only changed regions
- **Language-agnostic** - Consistent API across languages
- **Native speed** - C implementation with bindings
**Integration:**
- CLI uses `tree-sitter` npm package (native Node.js bindings)
- Web uses `tree-sitter-wasm` (WebAssembly builds)
- Each language has a separate parser package (e.g., `tree-sitter-typescript`)
### transformers.js - Embeddings
[transformers.js](https://huggingface.co/docs/transformers.js) runs **Hugging Face models in JavaScript**:
**Model:**
GitNexus uses `snowflake-arctic-embed-xs`:
- **Size**: 22M parameters (~90MB download)
- **Dimensions**: 384-dimensional vectors
- **Quality**: Optimized for code and technical text
- **Speed**: ~1000 embeddings/minute on CPU
**Hardware Acceleration:**
<table>
<thead>
<tr>
<th>Platform</th>
<th>Backend</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Windows CLI</td>
<td>DirectML (DirectX 12)</td>
<td>🚀 GPU-accelerated</td>
</tr>
<tr>
<td>Linux CLI (with CUDA)</td>
<td>ONNX Runtime CUDA</td>
<td>🚀 GPU-accelerated</td>
</tr>
<tr>
<td>Linux/macOS CLI</td>
<td>ONNX Runtime CPU</td>
<td>⚡ CPU (still fast)</td>
</tr>
<tr>
<td>Web (Chrome/Edge)</td>
<td>WebGPU</td>
<td>🚀 GPU-accelerated</td>
</tr>
<tr>
<td>Web (Firefox/Safari)</td>
<td>WASM</td>
<td>🐢 Slower</td>
</tr>
</tbody>
</table>
**Usage:**
Embeddings are **optional** (use `--embeddings` flag). They enable:
- Semantic code search (beyond keyword matching)
- Similarity-based symbol ranking
- Hybrid search (BM25 + semantic + RRF fusion)
## Indexing Pipeline
GitNexus builds the knowledge graph through a **6-phase pipeline**:
### Phase 1: Structure (0-20%)
1. **Scan repository** - Walk file tree, collect paths
2. **Build file graph** - Create `File` and `Folder` nodes
3. **Detect framework** - Identify project type (React, Django, etc.)
### Phase 2: Parsing (20-60%)
1. **Chunk files** - Group files into 20MB batches
2. **Parse ASTs** - Extract functions, classes, methods
3. **Create symbol table** - Register all code symbols
4. **Worker pool** - Parse in parallel across CPU cores
**Worker Thread Concurrency:**
GitNexus uses **Node.js worker threads** for parallel parsing:
- **Pool size**: `Math.min(8, os.cpus().length - 1)`
- **Sub-batching**: Files sent to workers in chunks of 1500
- **Timeout**: 30 seconds per sub-batch
- **Fallback**: Sequential parsing if workers fail
Each worker:
1. Receives batch of files
2. Parses ASTs in isolation
3. Extracts symbols, calls, imports
4. Returns structured data to main thread
### Phase 3: Resolution (60-82%)
1. **Import resolution** - Map import statements to files
2. **Call resolution** - Link function calls to definitions
3. **Heritage resolution** - Build class hierarchies
**Language-Aware Logic:**
- Builds suffix index for fast path matching
- Caches resolution results per chunk
- Handles language-specific module systems
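The suffix index can be sketched like this (a simplified illustration; function names are assumptions, and real resolution also handles extensions, index files, and each language's module system):

```javascript
// Build a map from a file's base name (extension stripped) to its full paths,
// so an import specifier's last segment resolves in O(1).
function buildSuffixIndex(filePaths) {
  const index = new Map();
  for (const p of filePaths) {
    const base = p.split('/').pop().replace(/\.[^.]+$/, '');
    if (!index.has(base)) index.set(base, []);
    index.get(base).push(p);
  }
  return index;
}

// Resolve an import specifier such as "./utils/helpers" to candidate files.
function resolveImport(index, specifier) {
  return index.get(specifier.split('/').pop()) ?? [];
}
```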
### Phase 4: Communities (82-92%)
Uses **Leiden algorithm** to detect functional clusters:
1. Build weighted graph (call frequency + imports)
2. Run community detection (modularity optimization)
3. Assign heuristic labels (e.g., "auth", "api", "ui")
4. Calculate cohesion scores
### Phase 5: Processes (92-98%)
Traces **execution flows** through call chains:
1. Score entry points (routes, CLI commands, event handlers)
2. Traverse call graph depth-first
3. Detect cross-community flows
4. Label processes by dominant pattern
Output: ~20-300 processes per repo (dynamic limit based on size)
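The depth-first traversal in steps 1-2 can be sketched as follows (a simplified walk; the adjacency shape, names, and depth limit are illustrative, and the real tracer also scores entry points and detects cross-community hops):

```javascript
// Walk the call graph depth-first from an entry point, recording each symbol
// once, up to a maximum depth.
function traceProcess(adjacency, entry, maxDepth = 10) {
  const steps = [];
  const seen = new Set();
  (function walk(symbol, depth) {
    if (depth > maxDepth || seen.has(symbol)) return;
    seen.add(symbol);
    steps.push(symbol);
    for (const callee of adjacency.get(symbol) ?? []) walk(callee, depth + 1);
  })(entry, 0);
  return steps;
}
```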
### Phase 6: Search Indexes (98-100%)
1. **KuzuDB bulk load** - CSV import for nodes + edges
2. **Full-text search** - Create FTS indexes on `name` and `content`
3. **Embeddings** (optional) - Generate semantic vectors
## Storage Layout
### CLI Storage (`.gitnexus/`)
Each repository stores its index in `.gitnexus/`:
```text
my-project/
├── .gitnexus/
│   ├── graph.kuzu         # KuzuDB database file
│   ├── graph.kuzu.wal     # Write-ahead log
│   ├── graph.kuzu.lock    # Lock file
│   └── meta.json          # Index metadata
├── AGENTS.md              # GitNexus context for AI agents
└── CLAUDE.md              # Claude-specific context
```
**Global Registry** (`~/.gitnexus/`):
```text
~/.gitnexus/
└── registry.json # List of all indexed repos
```
### Web Storage (Browser)
The web UI uses **IndexedDB** for client-side storage:
- Database per repository
- Nodes + edges stored as structured data
- No filesystem access required
## Query Engine
### Hybrid Search
GitNexus uses **3-tier ranking**:
1. **BM25** - Full-text relevance (KuzuDB FTS)
2. **Semantic** - Vector similarity (cosine distance)
3. **RRF** - Reciprocal Rank Fusion to combine scores
**Algorithm:**
```text
RRF_score = Σ (1 / (k + rank_i))
```
Where `k = 60` (standard RRF constant).
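This fusion can be sketched in JavaScript (a minimal illustration; the function name is assumed):

```javascript
// Reciprocal Rank Fusion: sum 1/(k + rank) across each ranker's result list.
// `ranks` holds one 1-based rank per ranker for a single document.
function rrfScore(ranks, k = 60) {
  return ranks.reduce((sum, rank) => sum + 1 / (k + rank), 0);
}
```

A document ranked highly by both BM25 and semantic search accumulates a larger score than one ranked highly by only one of them.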
### Process-Grouped Results
Search results are **grouped by execution flow**:
- Symbols in the same process are clustered together
- Helps AI understand functional context
- Reduces noise from unrelated matches
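A sketch of that grouping (the `processId` field and result shape are assumptions, not the actual API):

```javascript
// Cluster flat search hits by the process (execution flow) they belong to.
function groupByProcess(hits) {
  const groups = new Map();
  for (const hit of hits) {
    const key = hit.processId ?? 'ungrouped';
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(hit);
  }
  return groups;
}
```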
## MCP Server
The **Model Context Protocol server** runs via stdio:
```bash
npx gitnexus mcp
```
**Multi-Repo Support:**
- Reads `~/.gitnexus/registry.json` at startup
- Switches active database per tool call
- Session lock prevents concurrent DB access
**Tools Exposed:**
- `query` - Hybrid search
- `context` - Symbol 360° view
- `impact` - Blast radius analysis
- `detect_changes` - Git diff impact
- `rename` - Coordinated multi-file rename
- `cypher` - Raw graph queries
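A typical MCP client configuration entry might look like this (the exact file location and key names depend on the client):

```json
{
  "mcpServers": {
    "gitnexus": {
      "command": "npx",
      "args": ["gitnexus", "mcp"]
    }
  }
}
```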
## HTTP API (Optional)
Run a local server for the web UI:
```bash
gitnexus serve
```
Provides REST endpoints:
- `GET /repos` - List indexed repos
- `POST /query` - Run hybrid search
- `POST /cypher` - Execute Cypher query
- `GET /schema` - Get graph schema
## Performance Optimizations
### Memory Management
1. **Chunked parsing** - Process 20MB at a time
2. **AST cache eviction** - LRU cache with size limits
3. **CSV streaming** - Avoid loading entire graph in memory
4. **Lazy embedding loading** - Only load model when `--embeddings` used
### Parse Speed
1. **Worker parallelism** - Use all CPU cores
2. **Sub-batching** - Limit structured clone overhead
3. **Suffix index** - O(1) import path lookups
4. **Resolve cache** - Memoize import resolution
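The resolve cache amounts to memoizing the resolver function; a sketch (names are illustrative):

```javascript
// Wrap a resolver so each import specifier is resolved at most once.
function makeResolver(resolveFn) {
  const cache = new Map();
  return (specifier) => {
    if (!cache.has(specifier)) cache.set(specifier, resolveFn(specifier));
    return cache.get(specifier);
  };
}
```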
### Query Speed
1. **KuzuDB indexing** - Automatic graph indexes
2. **FTS indexes** - Fast keyword search
3. **Embedding batching** - Process 200 embeddings at a time
4. **Connection pooling** - Reuse KuzuDB connections
## Limitations
### Current Constraints
- **Max repo size**: ~50,000 files (memory-limited)
- **Embedding limit**: 50,000 nodes (auto-skip above this)
- **WASM performance**: 2-3x slower than native
- **Dynamic languages**: Limited runtime dispatch resolution
### Known Issues
- ONNX Runtime cleanup segfaults on macOS (CLI force-exits to avoid)
- Swift parser is optional (may fail to install on some platforms)
- Minified files are not supported