WhatDoc’s AI engine transforms raw source code into production-grade documentation using advanced code analysis and large language models. The system is optimized for speed, accuracy, and token efficiency.

How It Works

The generation pipeline consists of three core stages:

1. Repository Cloning (Shallow Clone)

WhatDoc uses shallow cloning to minimize disk usage and maximize speed:
const git = simpleGit();
await git.clone(repoUrl, tempPath, ['--depth', '1']);
Only the latest commit is cloned (no history), which dramatically reduces clone time for large repositories.

2. Code Ingestion & Token Optimization

The engine walks through your repository and intelligently extracts source files while applying multiple optimization layers:

Fat-Trimmer Blacklist

WhatDoc automatically filters out files that waste tokens without providing documentation value:
const BLOCKED_FILENAMES = new Set([
    'package-lock.json', 'yarn.lock', 'pnpm-lock.yaml',
    'Cargo.lock', 'Gemfile.lock', 'composer.lock',
    '.DS_Store', 'Thumbs.db',
    // Existing docs — we generate from source, not READMEs
    'README.md', 'CHANGELOG.md', 'LICENSE'
]);
Test files (.test.js, .spec.ts) and minified bundles (.min.js, .bundle.js) are also automatically excluded.
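Taken together, the blacklist plus suffix rules amount to a single predicate. A minimal sketch (the helper name `isTokenWaste` and the exact suffix list are illustrative, not WhatDoc's actual API):

```javascript
// Illustrative sketch: combine the filename blacklist with suffix rules.
// isTokenWaste is a hypothetical helper name.
const BLOCKED_FILENAMES = new Set([
    'package-lock.json', 'yarn.lock', 'pnpm-lock.yaml',
    'README.md', 'CHANGELOG.md', 'LICENSE',
]);
const BLOCKED_SUFFIXES = ['.test.js', '.spec.ts', '.min.js', '.bundle.js'];

function isTokenWaste(filePath) {
    const base = filePath.split('/').pop();
    return BLOCKED_FILENAMES.has(base) ||
           BLOCKED_SUFFIXES.some((suffix) => base.endsWith(suffix));
}

isTokenWaste('dist/app.min.js');  // → true (excluded)
isTokenWaste('server/index.js');  // → false (kept)
```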

Regex Guillotine: Code Minification

Before sending code to the LLM, WhatDoc strips noise that doesn’t contribute to understanding:
function minifyFileContent(raw) {
    let s = raw;
    
    // 1. Strip block comments  /* ... */
    s = s.replace(/\/\*[\s\S]*?\*\//g, '');
    
    // 2. Collapse runs of 3+ consecutive single-line comments
    s = s.replace(/(?:^[ \t]*\/\/.*\n){3,}/gm, '// [comments collapsed]\n');
    
    // 3. Truncate base64 / data-URI blobs
    s = s.replace(/data:[\w+/.-]+;base64,[A-Za-z0-9+/=]{100,}/g, '[BASE64 DATA TRUNCATED]');
    
    // 4. Truncate long string literals (>500 chars)
    s = s.replace(/(["\`'])(?:[^\\]|\\.){500,}?\1/g, '$1[LONG STRING TRUNCATED]$1');
    
    // 5. Collapse 3+ consecutive blank lines → 1
    s = s.replace(/(\n\s*){3,}/g, '\n\n');
    
    return s;
}
This aggressive minification can reduce token usage by 30-50% without losing semantic information.
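A quick way to sanity-check the savings is to diff character counts before and after. A condensed sketch of the minifier (block-comment and blank-line rules only) applied to a toy input:

```javascript
// Condensed sketch of the minifier (block comments + blank-line rules only),
// used here to measure the size reduction on a sample file.
function minifySketch(raw) {
    return raw
        .replace(/\/\*[\s\S]*?\*\//g, '')                                // block comments
        .replace(/(?:^[ \t]*\/\/.*\n){3,}/gm, '// [comments collapsed]\n')
        .replace(/(\n\s*){3,}/g, '\n\n');                                // blank-line runs
}

const raw = [
    '/* license header */',
    'const x = 1;',
    '', '', '', '',
    'const y = 2;',
].join('\n');

const minified = minifySketch(raw);
const saved = Math.round((1 - minified.length / raw.length) * 100);
console.log(`${saved}% smaller`); // → "46% smaller" on this toy input
```

Real-world savings depend heavily on how comment-heavy and padded the source is.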

Context Window Management

WhatDoc concatenates all whitelisted files into a single payload with clear file boundaries:
--- FILE: server/services/engine.js ---

const { EventEmitter } = require('events');
// ... actual code ...

--- FILE: client/src/App.jsx ---

import React from 'react';
// ... actual code ...
Free-tier limit: 800,000 characters (~200k tokens)
Pro limit: 900,000 characters (~225k tokens)
If a repository exceeds the limit, the engine truncates at the nearest file boundary to avoid mid-file garbage.
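Under those limits, the boundary-aware concatenation can be sketched as follows (`buildPayload` and the file object shape are illustrative, not WhatDoc's actual internals):

```javascript
// Sketch: concatenate files with boundary markers, stopping at the last
// file that still fits under the character limit (names are illustrative).
const FREE_TIER_CHAR_LIMIT = 800_000;

function buildPayload(files, limit = FREE_TIER_CHAR_LIMIT) {
    let payload = '';
    for (const { path, content } of files) {
        const chunk = `--- FILE: ${path} ---\n\n${content}\n\n`;
        // Truncate at the file boundary: stop once the next whole file
        // would push the payload over the limit.
        if (payload.length + chunk.length > limit) break;
        payload += chunk;
    }
    return payload;
}
```

Breaking at whole-file boundaries trades a little capacity for never handing the LLM a file cut off mid-function.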

3. LLM Generation with Paradigm-Aware Prompting

The concatenated codebase is sent to Google Gemini 2.5 Flash with a highly specialized system prompt that:
  • Detects the repository paradigm (REST API, Frontend App, CLI Tool, SDK/Library)
  • Adapts documentation style based on the detected type
  • Generates two documents: a README and a TECHNICAL_REFERENCE
  • Enforces strict markdown quality rules (syntax highlighting, proper heading hierarchy, GitHub-flavored alerts)

Adaptive Documentation Strategy

The AI automatically adjusts its output based on what it finds:
For a REST API, the documentation focuses on:
  • Endpoint documentation (HTTP method, path, auth)
  • Database models and schemas
  • Authentication flows
  • Request/response examples with real schemas
  • Interactive API playground blocks (see API Playground)
For a Frontend App, the documentation focuses on:
  • Component architecture
  • State management patterns
  • Routing structure
  • Props/hooks documentation
  • UI component trees
For an SDK/Library, the documentation focuses on:
  • Exported functions and classes
  • Method signatures
  • Usage examples
  • Installation instructions
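Paradigm detection itself happens inside the LLM system prompt, but the idea can be illustrated with a naive path-based heuristic (entirely hypothetical, not WhatDoc's implementation):

```javascript
// Naive path-based heuristic, for illustration only; WhatDoc delegates
// paradigm detection to the LLM via its system prompt.
function guessParadigm(filePaths) {
    const has = (re) => filePaths.some((p) => re.test(p));
    if (has(/(^|\/)(routes|controllers|api)\//)) return 'REST API';
    if (has(/\.(jsx|tsx)$/)) return 'Frontend App';
    if (has(/(^|\/)bin\//)) return 'CLI Tool';
    return 'SDK/Library';
}
```

The LLM-based approach goes beyond paths: it can read route handlers, component trees, and export signatures before deciding.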

Retry Logic & Rate Limiting

WhatDoc includes exponential backoff for rate-limited requests:
const MAX_RETRIES = 3;
const INITIAL_BACKOFF_MS = 15_000; // 15 seconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inside the async generation function:
for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
        const result = await model.generateContent({ contents, generationConfig });
        return result.response.text();
    } catch (err) {
        const is429 = err?.status === 429 ||
                      err?.message?.toLowerCase().includes('rate limit');

        if (is429 && attempt < MAX_RETRIES) {
            // 15s → 30s → 60s between attempts
            const backoff = INITIAL_BACKOFF_MS * Math.pow(2, attempt);
            await sleep(backoff);
            continue;
        }
        throw err;
    }
}

Bring Your Own Key (BYOK)

Pro users can connect their own Google Gemini API key to bypass rate limits and access higher-tier models:
const isCustomKeyValid = customKey && 
                         customKey !== 'null' && 
                         customKey.trim().length > 30;

const apiKeyToUse = isCustomKeyValid ? customKey.trim() : getNextApiKey();
BYOK users automatically bypass the free-tier token guillotine and get access to the full 900k character limit.

Supported Languages

WhatDoc can analyze and document projects in:
  • JavaScript, TypeScript, JSX, TSX
  • Python
  • Java, Kotlin, Scala
  • C, C++, C#
  • Go, Rust, Ruby, PHP, Swift
  • Configuration files (JSON, YAML, Dockerfile, Makefile)
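That language list implies an extension whitelist on the ingestion side. A minimal sketch (set contents and helper name are illustrative, not WhatDoc's exact configuration):

```javascript
// Sketch of an extension whitelist backing the language list above
// (set contents are illustrative, not WhatDoc's exact configuration).
const ALLOWED_EXTENSIONS = new Set([
    '.js', '.ts', '.jsx', '.tsx', '.py',
    '.java', '.kt', '.scala', '.c', '.cpp', '.cs',
    '.go', '.rs', '.rb', '.php', '.swift',
    '.json', '.yaml', '.yml',
]);
const ALLOWED_BASENAMES = new Set(['Dockerfile', 'Makefile']);

function isSupported(filePath) {
    const base = filePath.split('/').pop();
    if (ALLOWED_BASENAMES.has(base)) return true;
    const dot = base.lastIndexOf('.');
    return dot > 0 && ALLOWED_EXTENSIONS.has(base.slice(dot));
}
```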

Real-Time Progress Streaming

The engine emits real-time events during generation:
emit(projectId, 'log', { 
  step: 'analyzing', 
  message: `Aggregation complete — concatenated ${fileCount} files into context window.` 
});

emit(projectId, 'status', { 
  status: 'generating', 
  message: 'AI is writing the documentation…',
  ts: Date.now() 
});
These events power the live generation terminal in the UI.

Performance Benchmarks

Repository Size         Files Analyzed   Generation Time   Token Count
Small (< 50 files)      42               12s               ~45k tokens
Medium (50-200 files)   156              28s               ~120k tokens
Large (200+ files)      287              45s               ~200k tokens
Times are averages using Gemini 2.5 Flash. Pro models (Gemini 2.5 Pro) may take longer but produce higher-quality output.

Next Steps

Templates

Explore 14+ professional documentation templates

Live Editor

Edit generated docs with the rich markdown editor

Build docs developers (and LLMs) love