WhatDoc’s AI engine transforms raw source code into production-grade documentation using advanced code analysis and large language models. The system is optimized for speed, accuracy, and token efficiency.

How It Works

The generation pipeline consists of three core stages:

1. Repository Cloning (Shallow Clone)

WhatDoc uses shallow cloning to minimize disk usage and maximize speed:
const git = simpleGit();
await git.clone(repoUrl, tempPath, ['--depth', '1']);
Only the latest commit is cloned (no history), which dramatically reduces clone time for large repositories.

2. Code Ingestion & Token Optimization

The engine walks through your repository and intelligently extracts source files while applying multiple optimization layers:

Fat-Trimmer Blacklist

WhatDoc automatically filters out files that waste tokens without providing documentation value:
const BLOCKED_FILENAMES = new Set([
    'package-lock.json', 'yarn.lock', 'pnpm-lock.yaml',
    'Cargo.lock', 'Gemfile.lock', 'composer.lock',
    '.DS_Store', 'Thumbs.db',
    // Existing docs — we generate from source, not READMEs
    'README.md', 'CHANGELOG.md', 'LICENSE'
]);
Test files (.test.js, .spec.ts) and minified bundles (.min.js, .bundle.js) are also automatically excluded.
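Taken together, the blacklist plus suffix rules amount to a single predicate. A minimal sketch (the helper name `isTokenWaste` and the exact suffix list are illustrative, not WhatDoc's actual API):

```javascript
// Illustrative sketch: combine the filename blacklist with suffix rules.
// isTokenWaste is a hypothetical helper name.
const BLOCKED_FILENAMES = new Set([
    'package-lock.json', 'yarn.lock', 'pnpm-lock.yaml',
    'README.md', 'CHANGELOG.md', 'LICENSE',
]);
const BLOCKED_SUFFIXES = ['.test.js', '.spec.ts', '.min.js', '.bundle.js'];

function isTokenWaste(filePath) {
    const base = filePath.split('/').pop();
    return BLOCKED_FILENAMES.has(base) ||
           BLOCKED_SUFFIXES.some((suffix) => base.endsWith(suffix));
}

isTokenWaste('dist/app.min.js');  // → true (excluded)
isTokenWaste('server/index.js');  // → false (kept)
```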

Regex Guillotine: Code Minification

Before sending code to the LLM, WhatDoc strips noise that doesn’t contribute to understanding:
function minifyFileContent(raw) {
    let s = raw;
    
    // 1. Strip block comments  /* ... */
    s = s.replace(/\/\*[\s\S]*?\*\//g, '');
    
    // 2. Collapse runs of 3+ consecutive single-line comments
    s = s.replace(/(?:^[ \t]*\/\/.*\n){3,}/gm, '// [comments collapsed]\n');
    
    // 3. Truncate base64 / data-URI blobs
    s = s.replace(/data:[\w+/.-]+;base64,[A-Za-z0-9+/=]{100,}/g, '[BASE64 DATA TRUNCATED]');
    
    // 4. Truncate long string literals (>500 chars)
    s = s.replace(/(["\`'])(?:[^\\]|\\.){500,}?\1/g, '$1[LONG STRING TRUNCATED]$1');
    
    // 5. Collapse 3+ consecutive blank lines → 1
    s = s.replace(/(\n\s*){3,}/g, '\n\n');
    
    return s;
}
This aggressive minification can reduce token usage by 30-50% without losing semantic information.
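A quick way to sanity-check the savings is to diff character counts before and after. A condensed sketch of the minifier (block-comment and blank-line rules only) applied to a toy input:

```javascript
// Condensed sketch of the minifier (block comments + blank-line rules only),
// used here to measure the size reduction on a sample file.
function minifySketch(raw) {
    return raw
        .replace(/\/\*[\s\S]*?\*\//g, '')                                // block comments
        .replace(/(?:^[ \t]*\/\/.*\n){3,}/gm, '// [comments collapsed]\n')
        .replace(/(\n\s*){3,}/g, '\n\n');                                // blank-line runs
}

const raw = [
    '/* license header */',
    'const x = 1;',
    '', '', '', '',
    'const y = 2;',
].join('\n');

const minified = minifySketch(raw);
const saved = Math.round((1 - minified.length / raw.length) * 100);
console.log(`${saved}% smaller`); // → "46% smaller" on this toy input
```

Real-world savings depend heavily on how comment-heavy and padded the source is.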

Context Window Management

WhatDoc concatenates all whitelisted files into a single payload with clear file boundaries:
--- FILE: server/services/engine.js ---

const { EventEmitter } = require('events');
// ... actual code ...

--- FILE: client/src/App.jsx ---

import React from 'react';
// ... actual code ...
Free-tier limit: 800,000 characters (~200k tokens)
Pro limit: 900,000 characters (~225k tokens)
If a repository exceeds the limit, the engine truncates at the nearest file boundary to avoid mid-file garbage.
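Under those limits, the boundary-aware concatenation can be sketched as follows (`buildPayload` and the file object shape are illustrative, not WhatDoc's actual internals):

```javascript
// Sketch: concatenate files with boundary markers, stopping at the last
// file that still fits under the character limit (names are illustrative).
const FREE_TIER_CHAR_LIMIT = 800_000;

function buildPayload(files, limit = FREE_TIER_CHAR_LIMIT) {
    let payload = '';
    for (const { path, content } of files) {
        const chunk = `--- FILE: ${path} ---\n\n${content}\n\n`;
        // Truncate at the file boundary: stop once the next whole file
        // would push the payload over the limit.
        if (payload.length + chunk.length > limit) break;
        payload += chunk;
    }
    return payload;
}
```

Breaking at whole-file boundaries trades a little capacity for never handing the LLM a file cut off mid-function.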

3. LLM Generation with Paradigm-Aware Prompting

The concatenated codebase is sent to Google Gemini 2.5 Flash with a highly specialized system prompt that:
  • Detects the repository paradigm (REST API, Frontend App, CLI Tool, SDK/Library)
  • Adapts documentation style based on the detected type
  • Generates two documents: a README and a TECHNICAL_REFERENCE
  • Enforces strict markdown quality rules (syntax highlighting, proper heading hierarchy, GitHub-flavored alerts)

Adaptive Documentation Strategy

The AI automatically adjusts its output based on what it finds:
For a REST API, the documentation focuses on:
  • Endpoint documentation (HTTP method, path, auth)
  • Database models and schemas
  • Authentication flows
  • Request/response examples with real schemas
  • Interactive API playground blocks (see API Playground)
For a Frontend App, the documentation focuses on:
  • Component architecture
  • State management patterns
  • Routing structure
  • Props/hooks documentation
  • UI component trees
For an SDK/Library, the documentation focuses on:
  • Exported functions and classes
  • Method signatures
  • Usage examples
  • Installation instructions
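Paradigm detection itself happens inside the LLM system prompt, but the idea can be illustrated with a naive path-based heuristic (entirely hypothetical, not WhatDoc's implementation):

```javascript
// Naive path-based heuristic, for illustration only; WhatDoc delegates
// paradigm detection to the LLM via its system prompt.
function guessParadigm(filePaths) {
    const has = (re) => filePaths.some((p) => re.test(p));
    if (has(/(^|\/)(routes|controllers|api)\//)) return 'REST API';
    if (has(/\.(jsx|tsx)$/)) return 'Frontend App';
    if (has(/(^|\/)bin\//)) return 'CLI Tool';
    return 'SDK/Library';
}
```

The LLM-based approach goes beyond paths: it can read route handlers, component trees, and export signatures before deciding.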

Retry Logic & Rate Limiting

WhatDoc includes exponential backoff for rate-limited requests:
const MAX_RETRIES = 3;
const INITIAL_BACKOFF_MS = 15_000; // 15 seconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inside the async generation function:
for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
        const result = await model.generateContent({ contents, generationConfig });
        return result.response.text();
    } catch (err) {
        const is429 = err?.status === 429 ||
                      err?.message?.toLowerCase().includes('rate limit');

        if (is429 && attempt < MAX_RETRIES) {
            // 15s → 30s → 60s between attempts
            const backoff = INITIAL_BACKOFF_MS * Math.pow(2, attempt);
            await sleep(backoff);
            continue;
        }
        throw err;
    }
}

Bring Your Own Key (BYOK)

Pro users can connect their own Google Gemini API key to bypass rate limits and access higher-tier models:
const isCustomKeyValid = customKey && 
                         customKey !== 'null' && 
                         customKey.trim().length > 30;

const apiKeyToUse = isCustomKeyValid ? customKey.trim() : getNextApiKey();
BYOK users automatically bypass the free-tier token guillotine and get access to the full 900k character limit.

Supported Languages

WhatDoc can analyze and document projects in:
  • JavaScript, TypeScript, JSX, TSX
  • Python
  • Java, Kotlin, Scala
  • C, C++, C#
  • Go, Rust, Ruby, PHP, Swift
  • Configuration files (JSON, YAML, Dockerfile, Makefile)
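That language list implies an extension whitelist on the ingestion side. A minimal sketch (set contents and helper name are illustrative, not WhatDoc's exact configuration):

```javascript
// Sketch of an extension whitelist backing the language list above
// (set contents are illustrative, not WhatDoc's exact configuration).
const ALLOWED_EXTENSIONS = new Set([
    '.js', '.ts', '.jsx', '.tsx', '.py',
    '.java', '.kt', '.scala', '.c', '.cpp', '.cs',
    '.go', '.rs', '.rb', '.php', '.swift',
    '.json', '.yaml', '.yml',
]);
const ALLOWED_BASENAMES = new Set(['Dockerfile', 'Makefile']);

function isSupported(filePath) {
    const base = filePath.split('/').pop();
    if (ALLOWED_BASENAMES.has(base)) return true;
    const dot = base.lastIndexOf('.');
    return dot > 0 && ALLOWED_EXTENSIONS.has(base.slice(dot));
}
```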

Real-Time Progress Streaming

The engine emits real-time events during generation:
emit(projectId, 'log', { 
  step: 'analyzing', 
  message: `Aggregation complete — concatenated ${fileCount} files into context window.` 
});

emit(projectId, 'status', { 
  status: 'generating', 
  message: 'AI is writing the documentation…',
  ts: Date.now() 
});
These events power the live generation terminal in the UI.

Performance Benchmarks

Repository Size         Files Analyzed   Generation Time   Token Count
Small (< 50 files)      42               12s               ~45k tokens
Medium (50-200 files)   156              28s               ~120k tokens
Large (200+ files)      287              45s               ~200k tokens
Times are averages using Gemini 2.5 Flash. Pro models (Gemini 2.5 Pro) may take longer but produce higher-quality output.

Next Steps

Templates

Explore 14+ professional documentation templates

Live Editor

Edit generated docs with the rich markdown editor

Build docs developers (and LLMs) love