Pipeline Architecture
The pipeline is implemented in pipeline.ts:27 and runs in memory, using chunked processing for large repositories.
Phase 1: Scan Repository
Purpose: Walk the file tree and collect all file paths with metadata (size).
Progress: 0% → 15%
Implementation: filesystem-walker.ts:walkRepositoryPaths()
- Recursively scans the repository starting from the root
- Filters out .git, node_modules, .gitnexus, and other ignored directories
- Collects file paths and sizes (used for chunking in Phase 3)
- No content is read yet; this is purely a directory walk
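The filtering step can be illustrated with a minimal sketch. It is not the code in filesystem-walker.ts: the names `IGNORED_DIRS`, `isIgnoredPath`, and `filterScanned` are hypothetical, the directory list only contains names mentioned on this page, and .gitignore handling is left out.

```typescript
// ASSUMPTION: only the directory names mentioned on this page are listed.
// The real ignore list is longer and .gitignore rules are not handled here.
const IGNORED_DIRS: ReadonlySet<string> = new Set([
  ".git", "node_modules", ".gitnexus", "dist", "build", "venv",
]);

// Returns true when any directory segment of a repo-relative path is ignored.
function isIgnoredPath(relPath: string): boolean {
  const segments = relPath.split("/").filter((s) => s.length > 0);
  // The last segment is the file name, so only the directories are checked.
  return segments.slice(0, -1).some((dir) => IGNORED_DIRS.has(dir));
}

// A scanned entry carries path and size only; no content is read in Phase 1.
interface ScannedFile {
  path: string;
  size: number;
}

function filterScanned(files: ScannedFile[]): ScannedFile[] {
  return files.filter((f) => !isIgnoredPath(f.path));
}
```

A file named `src/build.ts` is kept because only directory segments are compared against the ignore list.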
GitNexus respects .gitignore and skips common development directories like node_modules, dist, build, venv, etc.
Phase 2: Structure
Purpose: Create File and Folder nodes representing the project structure.
Progress: 15% → 20%
Implementation: structure-processor.ts:processStructure()
- Creates File nodes for all scanned files
- Creates Folder nodes for all directories
- Establishes CONTAINS relationships (Folder → File)
- No parsing or content analysis yet
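As a rough illustration of this phase, the sketch below derives File and Folder nodes and CONTAINS edges from a list of scanned paths. The node id format, the `StructureNode` and `ContainsEdge` shapes, and the function name `buildStructure` are assumptions for illustration. Linking nested folders with CONTAINS is also an assumption; the text above only states Folder → File.

```typescript
interface StructureNode {
  id: string;
  label: "File" | "Folder";
  path: string;
}

interface ContainsEdge {
  type: "CONTAINS";
  from: string;
  to: string;
}

// Builds one node per file and per ancestor directory, plus parent -> child edges.
function buildStructure(filePaths: string[]): { nodes: StructureNode[]; edges: ContainsEdge[] } {
  const nodes = new Map<string, StructureNode>();
  const edges: ContainsEdge[] = [];

  for (const filePath of filePaths) {
    const parts = filePath.split("/");
    let parentId: string | null = null;
    for (let i = 0; i < parts.length; i++) {
      const path = parts.slice(0, i + 1).join("/");
      const isFile = i === parts.length - 1;
      const label = isFile ? "File" : "Folder";
      const id = `${label}:${path}`; // hypothetical id scheme
      if (!nodes.has(id)) {
        nodes.set(id, { id, label, path });
        // ASSUMPTION: nested folders are linked with CONTAINS as well.
        if (parentId !== null) edges.push({ type: "CONTAINS", from: parentId, to: id });
      }
      parentId = id;
    }
  }
  return { nodes: Array.from(nodes.values()), edges };
}
```

Only paths are needed as input, which matches the note that no parsing or content analysis happens at this stage.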
Phase 3: Parsing
Purpose: Extract symbols (functions, classes, methods, interfaces) using Tree-sitter AST parsing.
Progress: 20% → 82%
Implementation: parsing-processor.ts:processParsing()
Chunked Processing
To handle large repositories (e.g., Linux kernel with 80K+ files), GitNexus uses byte-budget chunking:
- Chunk size: 20MB of source code per chunk
- Files are read in chunks to avoid loading the entire codebase into memory
- Each chunk is parsed, processed, and then freed
pipeline.ts:22
Memory optimization: 20MB of source code expands to ~200-400MB in memory after AST parsing. Chunking keeps peak memory usage under control.
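Byte-budget chunking can be sketched as a greedy grouping over the file sizes collected in Phase 1. This is an illustrative sketch, not the code at pipeline.ts:22. The names `ScannedEntry`, `CHUNK_BYTE_BUDGET`, and `chunkByByteBudget` are hypothetical, and treating 20MB as 20 × 1024 × 1024 bytes is an assumption.

```typescript
interface ScannedEntry {
  path: string;
  size: number; // bytes, as collected during the scan phase
}

// ASSUMPTION: "20MB" is interpreted as 20 * 1024 * 1024 bytes.
const CHUNK_BYTE_BUDGET = 20 * 1024 * 1024;

// Greedily groups files so that each chunk stays within the byte budget.
// A single file larger than the budget ends up in a chunk of its own.
function chunkByByteBudget(
  files: ScannedEntry[],
  budget: number = CHUNK_BYTE_BUDGET,
): ScannedEntry[][] {
  const chunks: ScannedEntry[][] = [];
  let current: ScannedEntry[] = [];
  let currentBytes = 0;

  for (const file of files) {
    // Close the current chunk if adding this file would exceed the budget.
    if (current.length > 0 && currentBytes + file.size > budget) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(file);
    currentBytes += file.size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each returned chunk would then be read, parsed, and released before the next one is loaded, which is what bounds peak memory.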
Tree-sitter Parsing
GitNexus uses Tree-sitter for fast, incremental AST parsing:
- 11 languages supported: TypeScript, JavaScript, Python, Java, C, C++, C#, Go, Rust, PHP, Swift
- Native bindings (CLI) or WASM (Web UI)
- Worker threads parallelize parsing across CPU cores
Symbol Extraction
For each file, the parser extracts:
- Functions (top-level function declarations)
- Classes (class definitions)
- Methods (functions inside classes)
- Interfaces (TypeScript/Java interfaces, Go interfaces)
- Enums, Variables, Structs, Traits, etc.
Output:
- Symbol nodes in the graph
- Symbol table for lookup during resolution
- AST cache for import/call resolution
Phase 4: Resolution
Purpose: Resolve imports and function calls across files to create IMPORTS and CALLS edges.
Progress: Integrated into Phase 3 (parsed chunk by chunk)
Implementation:
- import-processor.ts:processImports() resolves import statements
- call-processor.ts:processCalls() resolves function calls
- heritage-processor.ts:processHeritage() resolves inheritance (EXTENDS, IMPLEMENTS)
Import Resolution
Resolve import paths
Uses language-specific resolution logic:
- TypeScript: Respects tsconfig.json path aliases
- Python: Resolves relative and package imports
- Go: Uses go.mod module paths
- PHP: Uses Composer PSR-4 autoloading
- Swift: Uses Swift Package Manager targets
A lookup from partial paths to files (e.g., auth/validate.ts → all matching files) is used to speed up fuzzy path matching.
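The path lookup mentioned above (auth/validate.ts → all matching files) might look like the following sketch. The source sentence is incomplete, so the exact structure is not known: this assumes a map from every trailing path suffix to the files that end with it. `buildSuffixIndex` and `lookupBySuffix` are hypothetical names.

```typescript
// ASSUMPTION: the lookup is a map from each trailing path suffix to the
// files ending with that suffix. The real structure may differ.
function buildSuffixIndex(filePaths: string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const filePath of filePaths) {
    const parts = filePath.split("/");
    // Register "a/b/c.ts", "b/c.ts", and "c.ts" for the path "a/b/c.ts".
    for (let i = 0; i < parts.length; i++) {
      const suffix = parts.slice(i).join("/");
      const bucket = index.get(suffix);
      if (bucket) bucket.push(filePath);
      else index.set(suffix, [filePath]);
    }
  }
  return index;
}

// Returns every known file whose path ends with the given import path.
function lookupBySuffix(index: Map<string, string[]>, importPath: string): string[] {
  return index.get(importPath) ?? [];
}
```

Building the map once turns each fuzzy match into a single lookup instead of a scan over all file paths, at the cost of extra memory for the suffix keys.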
Call Resolution
Call resolution is more complex because it requires matching call sites to function definitions:
- Find the caller: Use Tree-sitter to find the enclosing function for each call site
- Find the callee: Look up the called function in the symbol table
- Resolve across files: Use import statements to resolve calls to imported functions
- Assign confidence:
  - Same file: 1.0
  - Import-resolved: 0.85
  - Fuzzy global match: 0.5 or 0.3
call-processor.ts:48
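The confidence tiers can be sketched as a lookup that tries the most reliable match first. This is an illustration, not the code at call-processor.ts:48. The types, the `reason` strings, and the function name `resolveCall` are hypothetical. The text does not say when a fuzzy match gets 0.5 versus 0.3, so the sketch assumes 0.5 for a unique global candidate and 0.3 for an ambiguous one.

```typescript
interface SymbolDef {
  name: string;
  filePath: string;
}

interface ResolvedCall {
  target: SymbolDef;
  confidence: number;
  reason: string; // hypothetical labels, for illustration only
}

function resolveCall(
  calleeName: string,
  callerFile: string,
  importedFiles: ReadonlySet<string>,
  symbolTable: Map<string, SymbolDef[]>,
): ResolvedCall | null {
  const candidates = symbolTable.get(calleeName) ?? [];

  // Tier 1: a definition in the caller's own file.
  const sameFile = candidates.find((c) => c.filePath === callerFile);
  if (sameFile) return { target: sameFile, confidence: 1.0, reason: "same-file" };

  // Tier 2: a definition in a file the caller imports.
  const imported = candidates.find((c) => importedFiles.has(c.filePath));
  if (imported) return { target: imported, confidence: 0.85, reason: "import-resolved" };

  if (candidates.length === 0) return null;

  // Tier 3: fuzzy global match.
  // ASSUMPTION: 0.5 when the name is unique globally, 0.3 when it is ambiguous.
  const confidence = candidates.length === 1 ? 0.5 : 0.3;
  return { target: candidates[0], confidence, reason: "fuzzy-global" };
}
```

Storing the confidence on each edge lets later phases discard weak matches with a simple threshold.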
Heritage Resolution
Finds class inheritance and interface implementation:
- EXTENDS edges for class inheritance
- IMPLEMENTS edges for interface implementation
- Confidence: 1.0 (these are explicit declarations)
Output:
- IMPORTS, CALLS, EXTENDS, IMPLEMENTS edges in the graph
- All edges have confidence and reason properties
Phase 5: Communities
Purpose: Group related symbols into functional clusters using the Leiden algorithm.
Progress: 82% → 92%
Implementation: community-processor.ts:processCommunities()
Leiden Algorithm
GitNexus uses the Leiden algorithm (vendored from graphology) to detect communities:
- Input: Graph of symbols connected by CALLS, EXTENDS, IMPLEMENTS edges
- Output: Community assignments for each symbol
- Modularity score: Measures how well-defined the communities are
community-processor.ts:124
Large graph optimization: For repos with >10K symbols, GitNexus filters out low-confidence edges (< 0.5) and degree-1 nodes before running Leiden to reduce runtime.
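The large-graph pre-filter can be sketched as follows. This is a sketch under assumptions rather than the project's implementation. The names are hypothetical, degrees are computed after the low-confidence edges are removed, and degree-1 nodes are dropped in a single pass. Isolated nodes are kept because the text only mentions degree-1 nodes.

```typescript
interface WeightedEdge {
  source: string;
  target: string;
  confidence: number;
}

const LARGE_GRAPH_THRESHOLD = 10000; // symbols, per the note above
const MIN_CONFIDENCE = 0.5;

function prefilterForLeiden(
  nodeIds: string[],
  edges: WeightedEdge[],
  threshold: number = LARGE_GRAPH_THRESHOLD,
): { nodeIds: string[]; edges: WeightedEdge[] } {
  // Small graphs are passed through unchanged.
  if (nodeIds.length <= threshold) return { nodeIds, edges };

  // Drop low-confidence edges first.
  const strongEdges = edges.filter((e) => e.confidence >= MIN_CONFIDENCE);

  // ASSUMPTION: degree is counted on the remaining edges.
  const degree = new Map<string, number>();
  for (const e of strongEdges) {
    degree.set(e.source, (degree.get(e.source) ?? 0) + 1);
    degree.set(e.target, (degree.get(e.target) ?? 0) + 1);
  }

  // Remove nodes with exactly one edge (single pass, not repeated).
  const kept = new Set(nodeIds.filter((id) => (degree.get(id) ?? 0) !== 1));
  return {
    nodeIds: Array.from(kept),
    edges: strongEdges.filter((e) => kept.has(e.source) && kept.has(e.target)),
  };
}
```

The filtered node and edge lists would then be handed to the community detection step in place of the full graph.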
Heuristic Labels
Communities are auto-labeled using folder name heuristics.
community-processor.ts:295
Example labels: Authentication, Validation, Database, Api
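One plausible folder-name heuristic is a majority vote over the parent folders of a community's member files. The sketch below is a guess at the general idea, not the logic at community-processor.ts:295. The function name `heuristicLabel`, the voting rule, and the "Unlabeled" fallback are all assumptions.

```typescript
// ASSUMPTION: the label is the most common immediate parent folder among
// the community's files, with the first letter capitalized.
function heuristicLabel(memberFilePaths: string[]): string {
  const counts = new Map<string, number>();
  for (const filePath of memberFilePaths) {
    const parts = filePath.split("/");
    if (parts.length < 2) continue; // file at the repo root: no folder to count
    const folder = parts[parts.length - 2];
    counts.set(folder, (counts.get(folder) ?? 0) + 1);
  }

  let best = "";
  let bestCount = 0;
  counts.forEach((count, folder) => {
    // Ties keep the folder that was seen first.
    if (count > bestCount) {
      best = folder;
      bestCount = count;
    }
  });

  if (best === "") return "Unlabeled"; // hypothetical fallback
  return best.charAt(0).toUpperCase() + best.slice(1);
}
```

Under these assumptions a community whose files mostly live in an `api` folder is labeled `Api`, which matches the capitalization style of the labels listed above.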
Community Properties
- cohesion: Internal edge density (0-1) — what fraction of edges stay within the community
- symbolCount: Number of symbols in the community
- heuristicLabel: Auto-generated name
Output:
- Community nodes
- MEMBER_OF edges (symbol → community)
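The cohesion property can be made concrete with a small sketch. It assumes cohesion is the number of edges with both endpoints inside the community divided by the number of edges touching at least one member. The source only describes it as the fraction of edges that stay within the community, so the exact denominator is an assumption, as are the names `MemberEdge` and `cohesion`.

```typescript
interface MemberEdge {
  source: string;
  target: string;
}

// ASSUMPTION: cohesion = internal edges / edges touching at least one member.
function cohesion(members: ReadonlySet<string>, edges: MemberEdge[]): number {
  let internal = 0;
  let touching = 0;
  for (const e of edges) {
    const sourceInside = members.has(e.source);
    const targetInside = members.has(e.target);
    if (sourceInside || targetInside) touching++;
    if (sourceInside && targetInside) internal++;
  }
  // A community with no edges at all gets 0 rather than NaN.
  return touching === 0 ? 0 : internal / touching;
}
```

With this definition a value near 1 means almost every call involving the community stays inside it, and a value near 0 means most calls cross its boundary.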
Phase 6: Processes
Purpose: Trace execution flows from entry points through call chains.
Progress: 92% → 100%
Implementation: process-processor.ts:processProcesses()
Entry Point Detection
Processes start from entry points: functions that initiate execution flows.
entry-point-scoring.ts
Functions with @Controller, @Get, or @app.route() decorators get boosted scores.
Trace Algorithm
From each entry point, GitNexus traces forward using BFS:
- Start at entry point
- Follow CALLS edges with confidence ≥ 0.5
- Limit depth to 10 steps (configurable)
- Limit branching to 4 paths per node (configurable)
- Stop at terminal nodes (no outgoing calls)
process-processor.ts:337
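The trace steps above can be sketched as a breadth-first search over paths. This is not the code at process-processor.ts:337. The type and function names are hypothetical, "depth" is interpreted as the number of symbols in a path, and two further assumptions are made: higher-confidence edges are preferred when the branching limit applies, and a symbol is never revisited within one path.

```typescript
interface CallsEdge {
  from: string;
  to: string;
  confidence: number;
}

interface TraceOptions {
  maxDepth: number; // ASSUMPTION: counted as symbols per path
  maxBranching: number;
  minConfidence: number;
}

const DEFAULT_TRACE_OPTIONS: TraceOptions = { maxDepth: 10, maxBranching: 4, minConfidence: 0.5 };

function traceFromEntryPoint(
  entryId: string,
  edges: CallsEdge[],
  options: TraceOptions = DEFAULT_TRACE_OPTIONS,
): string[][] {
  // Adjacency list restricted to edges that meet the confidence threshold.
  const outgoing = new Map<string, CallsEdge[]>();
  for (const e of edges) {
    if (e.confidence < options.minConfidence) continue;
    const list = outgoing.get(e.from);
    if (list) list.push(e);
    else outgoing.set(e.from, [e]);
  }

  const traces: string[][] = [];
  const queue: string[][] = [[entryId]];
  while (queue.length > 0) {
    const path = queue.shift()!;
    const last = path[path.length - 1];
    const next = (outgoing.get(last) ?? [])
      .filter((e) => path.indexOf(e.to) === -1) // ASSUMPTION: no revisits in a path
      .sort((a, b) => b.confidence - a.confidence) // ASSUMPTION: strongest edges first
      .slice(0, options.maxBranching);

    // A path ends at a terminal node or when the depth limit is reached.
    if (next.length === 0 || path.length >= options.maxDepth) {
      traces.push(path);
      continue;
    }
    for (const e of next) queue.push(path.concat(e.to));
  }
  return traces;
}
```

The depth and branching limits bound the number of paths per entry point, which otherwise grows exponentially in densely connected call graphs.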
Deduplication
Multiple similar traces are deduplicated:
- Remove traces that are subsets of longer traces
- Keep only the longest path per entry→terminal pair
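Both deduplication rules can be sketched in a few lines. The sketch interprets "subset" as a contiguous sub-path of a longer trace, which is an assumption, and the names `isContiguousSubpath` and `dedupeTraces` are hypothetical.

```typescript
// True when `short` appears as a contiguous run inside the longer trace.
function isContiguousSubpath(short: string[], long: string[]): boolean {
  if (short.length >= long.length) return false;
  for (let start = 0; start + short.length <= long.length; start++) {
    let match = true;
    for (let i = 0; i < short.length; i++) {
      if (long[start + i] !== short[i]) {
        match = false;
        break;
      }
    }
    if (match) return true;
  }
  return false;
}

function dedupeTraces(traces: string[][]): string[][] {
  // Rule: keep only the longest trace per entry -> terminal pair.
  const longestByPair = new Map<string, string[]>();
  for (const trace of traces) {
    if (trace.length === 0) continue;
    const key = JSON.stringify([trace[0], trace[trace.length - 1]]);
    const existing = longestByPair.get(key);
    if (!existing || trace.length > existing.length) longestByPair.set(key, trace);
  }
  const candidates = Array.from(longestByPair.values());

  // Rule: drop traces fully contained in a longer surviving trace.
  // ASSUMPTION: "subset" means contiguous sub-path.
  return candidates.filter(
    (t) => !candidates.some((other) => isContiguousSubpath(t, other)),
  );
}
```

Applying the per-pair rule first keeps the pairwise containment check small, since it runs only on the surviving candidates.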
Process Properties
- processType: intra_community (single cluster) or cross_community (spans multiple clusters)
- stepCount: Number of steps in the trace
- communities: List of community IDs touched
- entryPointId / terminalId: First and last symbols in the chain
Output:
- Process nodes
- STEP_IN_PROCESS edges (symbol → process, with step number)
Search Indexing
Purpose: Build full-text search indexes for fast keyword retrieval.
Implementation: KuzuDB FTS indexes are created automatically during graph ingestion.
- File FTS: Index file paths
- Function FTS: Index function names
- Class FTS: Index class names
- Method FTS: Index method names
- Interface FTS: Index interface names
Output
The final knowledge graph is written to KuzuDB at .gitnexus/db/ inside your repository.
Typical statistics for a 1K-file repo:
- Nodes: ~5,000 (files + symbols + communities + processes)
- Relationships: ~15,000 (calls + imports + memberships + steps)
- Communities: ~20-50 (functional areas)
- Processes: ~50-100 (execution flows)
Index freshness: Run npx gitnexus analyze again to update the graph after code changes. Incremental indexing is on the roadmap.
Performance Characteristics
| Repository Size | Files | Symbols | Index Time | Peak Memory |
|---|---|---|---|---|
| Small | Under 100 | Under 1K | 5-10s | ~200MB |
| Medium | 1K | ~10K | 30-60s | ~500MB |
| Large | 5K | ~50K | 3-5 min | ~1GB |
| Very Large | 20K+ | 200K+ | 10-20 min | ~2GB |
Worker parallelism: Parsing uses worker threads to parallelize across CPU cores. Set NODE_ENV=development to see detailed phase logs.
Next Steps
Knowledge Graph
Understand the graph schema and node types
Processes & Flows
Learn how execution flows are traced