Introduction
DocuGen AI uses a three-phase pipeline to transform Python source code into comprehensive documentation. Each phase has a specific responsibility, ensuring separation of concerns and maintainability.
Phase 1: Ingestion & Scanning
Purpose: Discover all relevant Python files in a project while respecting .gitignore rules.
Implementation
The scanning phase is implemented in docugen/core/scanner.py:94 with the scan_python_files() function:
The scanner automatically excludes common directories like __pycache__, .git, .venv, build, and dist to avoid processing unnecessary files.
Key Features
- GitIgnore Support: Parses .gitignore files and respects negation patterns, directory-only rules, and anchored patterns
- Smart Filtering: Built-in exclusions for common development directories (defined in scanner.py:8)
- Single File or Directory: Handles both individual Python files and entire project directories
- Relative Path Tracking: Maintains relative paths for cleaner documentation references
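The scanning logic can be sketched roughly as follows. This is a minimal illustration, not the actual scanner.py implementation; the function name and exclusion set mirror the description above, while the body is an assumption:

```python
import os

# Assumed exclusion set, mirroring the built-in exclusions described above.
EXCLUDED_DIRS = {"__pycache__", ".git", ".venv", "build", "dist"}

def scan_python_files(root):
    """Yield paths to .py files under root, skipping excluded directories.

    Simplified sketch: the real scanner also applies .gitignore rules.
    """
    if os.path.isfile(root):
        if root.endswith(".py"):
            yield root
        return
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            if name.endswith(".py"):
                # Track paths relative to the project root for cleaner docs.
                yield os.path.relpath(os.path.join(dirpath, name), root)
```

Pruning dirnames in place is the idiomatic way to stop os.walk from descending into excluded trees, rather than filtering results afterward.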
GitIgnore Rule Parsing
The scanner implements a sophisticated gitignore parser (scanner.py:28-61) that handles:
- Negation patterns (!important.py)
- Directory-only rules (build/)
- Anchored patterns (/dist)
- Glob patterns (*.pyc, __pycache__/*)
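In spirit, these rules can be approximated with the stdlib fnmatch module. The sketch below is illustrative only and far simpler than the parser at scanner.py:28-61; real .gitignore semantics have more edge cases:

```python
import fnmatch

def gitignore_match(rel_path, patterns, is_dir=False):
    """Return True if rel_path is ignored under a simplified rule set.

    Later rules win, and a matching negation (!pattern) un-ignores a path.
    """
    ignored = False
    for raw in patterns:
        pattern = raw.strip()
        if not pattern or pattern.startswith("#"):
            continue  # skip blank lines and comments
        negated = pattern.startswith("!")
        if negated:
            pattern = pattern[1:]
        if pattern.endswith("/"):      # directory-only rule
            if not is_dir:
                continue
            pattern = pattern.rstrip("/")
        if pattern.startswith("/"):    # anchored to the project root
            matched = fnmatch.fnmatch(rel_path, pattern[1:])
        else:                          # match the full path or the basename
            matched = (fnmatch.fnmatch(rel_path, pattern)
                       or fnmatch.fnmatch(rel_path.split("/")[-1], pattern))
        if matched:
            ignored = not negated
    return ignored
```

For example, with patterns ["*.py", "!important.py"], important.py matches the first rule but is un-ignored by the negation.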
Phase 2: Parsing & Normalization
Purpose: Extract structured metadata from Python source code using Abstract Syntax Trees (AST) and normalize it for AI consumption.
2.1 AST Parsing
Implemented in docugen/core/parser.py:81, the parse_file() function extracts:
- Classes with base classes and docstrings
- Methods within classes (including async methods)
- Functions at module level
- Type annotations for arguments and return values
- Default values for function parameters
- Code metrics (line count, class count, function count)
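A stripped-down version of this extraction might look like the following. The helper name parse_source is hypothetical; only the use of the stdlib ast module is taken from the text above:

```python
import ast

def parse_source(source):
    """Extract class and function metadata from Python source.

    Simplified sketch: the real parser also records defaults,
    async methods' details, and code metrics.
    """
    tree = ast.parse(source)
    info = {"classes": [], "functions": []}
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            info["classes"].append({
                "name": node.name,
                "bases": [ast.unparse(b) for b in node.bases],
                "docstring": ast.get_docstring(node),
                "methods": [n.name for n in node.body
                            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))],
            })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            info["functions"].append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "returns": ast.unparse(node.returns) if node.returns else None,
            })
    return info
```

ast.unparse (Python 3.9+) turns annotation nodes like the int in -> int back into source text, which is convenient for documentation output.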
The parser uses Python’s built-in ast module to ensure accurate parsing. Syntax errors are caught and recorded in the errors field rather than causing the entire process to fail.
2.2 Normalization
The processor (docugen/core/processor.py:57) normalizes raw AST data into a clean, consistent format:
- Converting all values to clean strings (_as_clean_text() at processor.py:6)
- Standardizing function signatures with argument kinds (positional, keyword-only, variadic)
- Aggregating project-level statistics
- Filtering out empty or null values
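As a rough illustration of this normalization pass: the helper name _as_clean_text mirrors the one mentioned above, but the bodies here are assumptions, not the processor.py implementation:

```python
def _as_clean_text(value):
    """Coerce any value to a stripped string, or None if effectively empty."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None

def normalize(entry):
    """Drop empty/null fields and coerce the rest to clean strings.

    Simplified sketch of a processor-style normalization step.
    """
    cleaned = {}
    for key, value in entry.items():
        if isinstance(value, list):
            items = [_as_clean_text(v) for v in value]
            items = [v for v in items if v is not None]
            if items:
                cleaned[key] = items
        else:
            text = _as_clean_text(value)
            if text is not None:
                cleaned[key] = text
    return cleaned
```

For example, normalize({"name": "  Greeter ", "docstring": None}) would yield {"name": "Greeter"}, so downstream consumers never see padding or null fields.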
Phase 3: Synthesis & Rendering
Purpose: Generate human-readable documentation using AI and template-based rendering.
3.1 AI Synthesis
The GeminiClient (docugen/api/gemini_client.py:23) sends normalized metadata to Google’s Gemini API. Each request includes:
- JSON-formatted project metadata
- A system prompt defining the Technical Writer role
- Optional user-provided instructions
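Without reproducing the actual Gemini call, assembling those three pieces could be sketched like this. The prompt text, function name, and layout are illustrative assumptions, not the real system prompt or client code:

```python
import json

# Illustrative placeholder, not the actual Technical Writer prompt.
SYSTEM_PROMPT = "You are a Technical Writer. Produce clear Markdown documentation."

def build_request(metadata, user_instructions=None):
    """Combine the system prompt, JSON-encoded project metadata,
    and optional user-provided instructions into one request body."""
    parts = [SYSTEM_PROMPT, "Project metadata:\n" + json.dumps(metadata, indent=2)]
    if user_instructions:
        parts.append("Additional instructions:\n" + user_instructions)
    return "\n\n".join(parts)
```

Serializing the metadata as indented JSON keeps the structure unambiguous for the model while remaining human-inspectable in logs.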
3.2 Template Rendering
The final step uses Jinja2 templates (docugen/templates/engine.py:11) to combine:
- AI-generated content
- Project metadata (file counts, class counts)
- Custom branding and structure
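Combining AI output with project metadata in a Jinja2 template might look like the following. The template content and variable names are assumptions, not the shipped default_readme.md.j2:

```python
from jinja2 import Template  # third-party: pip install jinja2

# Hypothetical inline template; the real default_readme.md.j2 will differ.
TEMPLATE = Template(
    "# {{ project_name }}\n\n"
    "{{ ai_content }}\n\n"
    "*{{ file_count }} files, {{ class_count }} classes documented.*\n"
)

rendered = TEMPLATE.render(
    project_name="DocuGen AI",
    ai_content="AI-generated overview goes here.",
    file_count=12,
    class_count=4,
)
print(rendered)
```

In the real engine a FileSystemLoader would load templates from a directory instead of an inline string, which is what makes a custom template directory possible.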
The template engine supports custom templates. By default, it uses default_readme.md.j2, but you can provide your own template directory.
Data Flow Summary
| Phase | Input | Output | Key Module |
|---|---|---|---|
| Ingestion | Project directory or file path | List of .py file paths | scanner.py |
| Parsing | Python source files | AST metadata (classes, functions, signatures) | parser.py |
| Normalization | Raw AST metadata | Clean, structured JSON | processor.py |
| AI Synthesis | Normalized metadata | AI-generated Markdown content | gemini_client.py |
| Rendering | AI content + metadata | Final documentation file | engine.py |
Error Handling
Each phase includes robust error handling:
- Scanner: Checks for path existence, handles permission errors
- Parser: Catches syntax errors and continues processing other files
- Processor: Filters invalid data and tracks error counts
- Gemini Client: Wraps API errors with context
- Template Engine: Validates template existence
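The parser's continue-on-error behavior, for instance, can be sketched as follows (the function name and result shape are illustrative assumptions):

```python
import ast

def safe_parse(path_to_source):
    """Parse a mapping of {path: source}, recording syntax errors
    instead of raising, so one broken file does not abort the run."""
    results, errors = {}, {}
    for path, source in path_to_source.items():
        try:
            results[path] = ast.parse(source)
        except SyntaxError as exc:
            # Record the failure and keep processing the remaining files.
            errors[path] = f"{exc.msg} (line {exc.lineno})"
    return results, errors
```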
Next Steps
- AST Parsing Deep Dive: Learn how metadata is extracted from Python code
- AI Generation: Understand how Gemini transforms metadata into docs
