Pipeline Overview
The video generation pipeline consists of 11 stages that transform a text prompt into a complete video presentation with synchronized audio and visuals.
Pipeline Stages
Stage 1: Initialization (0-10%)
Duration: ~1 second
Location: backend/app.py:223-237
Tasks:
- Receive POST request with topic, num_slides, language, tone
- Sanitize topic for file naming
- Create generation ID
- Initialize progress tracking
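The initialization tasks above can be sketched roughly as follows. This is a minimal illustration, not the code in backend/app.py: the function names `sanitize_topic` and `init_generation`, the slug rules, the use of a UUID for the generation ID, and the shape of the progress record are all assumptions.

```python
import re
import uuid


def sanitize_topic(topic: str, max_length: int = 50) -> str:
    """Turn a free-form topic into a filesystem-safe slug (assumed rules)."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", topic).strip("_").lower()
    return slug[:max_length] or "untitled"


def init_generation(topic: str, num_slides: int, language: str, tone: str) -> dict:
    """Build the initial state for one generation request (hypothetical shape)."""
    return {
        "generation_id": uuid.uuid4().hex,  # assumption: random hex ID
        "safe_topic": sanitize_topic(topic),
        "num_slides": num_slides,
        "language": language,
        "tone": tone,
        "progress": {"status": "started", "percent": 0},
    }
```

For example, `sanitize_topic("Newton's Laws: Part 2!")` yields `newton_s_laws_part_2`, which is safe to embed in output file names.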
Stage 2: Content Generation (10-20%)
Duration: 10-30 seconds
Location: backend/app.py:240-243
Generator: ContentGenerator (content_generator.py)
Process:
Build Prompt
Create detailed prompt with requirements for slide structure, mutual exclusivity rules, and content guidelines
Validate Structure
- Check all required fields present
- Enforce mutual exclusivity (animation XOR image)
- Add missing fields with defaults
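A validation pass along these lines might look like the sketch below. The field names (`title`, `content`, `animation`, `image`) and the default values are hypothetical, and so is the tie-break: when a slide carries both an animation and an image, this sketch keeps the animation. Because text slides carry neither visual, the rule is treated here as "never both" rather than a strict XOR.

```python
from copy import deepcopy

# Hypothetical field names and defaults; the real schema may differ.
SLIDE_DEFAULTS = {
    "title": "Untitled",
    "content": "",
    "animation": None,
    "image": None,
}


def validate_slide(slide: dict) -> dict:
    """Return a copy of the slide with defaults filled in and at most one visual."""
    fixed = deepcopy(slide)
    for key, default in SLIDE_DEFAULTS.items():
        fixed.setdefault(key, default)
    # Mutual exclusivity: a slide may have an animation or an image, not both.
    if fixed["animation"] and fixed["image"]:
        fixed["image"] = None  # assumption: the animation takes priority
    return fixed
```

Returning a corrected copy instead of raising keeps the pipeline moving when the model output is only slightly malformed, which matches the "add defaults and continue" behavior described for missing fields.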
Stage 3: Script Generation (20-30%)
Duration: 10-20 seconds
Location: backend/app.py:246-249
Generator: ScriptGenerator (script_generator.py)
Process:
Special Cases:
Animation Slides
Narration includes visual descriptions:
- “As you can see on screen…”
- “Watch as the rocket…”
- “Notice how the force vectors…”
Image Slides
Natural image references:
- “Looking at this image…”
- “This diagram shows…”
Text Slides
Pure conceptual explanation without visual references
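One way to apply these three cases is to pick a narration hint per slide type and append it to the script prompt. The sketch below assumes hypothetical `animation` and `image` fields on each slide and invented hint wording; it does not reflect how script_generator.py is actually written.

```python
# Hypothetical hint text, loosely modeled on the example phrases above.
NARRATION_HINTS = {
    "animation": "Describe what happens on screen, e.g. 'Watch as the rocket...'.",
    "image": "Refer to the image naturally, e.g. 'This diagram shows...'.",
    "text": "Explain the concept without referring to any visuals.",
}


def slide_kind(slide: dict) -> str:
    """Classify a slide by the visual it carries (field names are assumed)."""
    if slide.get("animation"):
        return "animation"
    if slide.get("image"):
        return "image"
    return "text"


def narration_hint(slide: dict) -> str:
    """Return the narration guidance for this slide's type."""
    return NARRATION_HINTS[slide_kind(slide)]
```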
Stage 4: Audio Generation (30-48%)
Duration: 30-60 seconds (depends on slide count)
Location: backend/app.py:252-303
Generator: VoiceGenerator (voice_generator.py)
Process:
API Call (voice_generator.py):
- Individual files: outputs/audio/{topic}_slide_1.mp3, etc.
- Durations stored in memory for timestamp correction
Stage 4.5: Audio Combining (48-49%)
Duration: 2-5 seconds
Location: backend/app.py:301-303
Task: Concatenate all slide audio files into single track
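Because the combined track is a plain concatenation, each slide's start time in it is the sum of the durations before it. The sketch below shows that arithmetic using the per-slide file naming from Stage 4. The helper names are hypothetical, and it assumes durations are kept as a list of seconds in slide order.

```python
from itertools import accumulate


def slide_audio_path(topic: str, slide_number: int) -> str:
    """Path of one slide's audio file, following the naming shown in Stage 4."""
    return f"outputs/audio/{topic}_slide_{slide_number}.mp3"


def slide_start_times(durations: list[float]) -> list[float]:
    """Start offset (seconds) of each slide within the concatenated track."""
    # Running totals starting at 0.0; the last total is the track length,
    # not a slide start, so it is dropped.
    return list(accumulate(durations, initial=0.0))[:-1]


def total_duration(durations: list[float]) -> float:
    """Length of the combined track in seconds."""
    return sum(durations)
```

With durations of 4.0, 6.5, and 3.0 seconds, the slides start at 0.0, 4.0, and 10.5 seconds, and the combined track lasts 13.5 seconds.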
Stage 5: Visual Generation (50-80%)
Duration: 1-3 minutes (varies by visual complexity)
Location: backend/app.py:306-433
Generators: ManimGenerator, ImageFetcher, SlideRenderer
Process Loop:
Stage 6: Video Composition (85-95%)
Duration: 30-90 seconds
Location: backend/app.py:441-451
Composer: VideoComposer (video_composer.py)
Process:
Timing Validation (video_composer.py:289-290):
Stage 7: Completion (100%)
Duration: Instant
Location: backend/app.py:453-470
Tasks:
Frontend Transition:
Error Handling & Fallbacks
Content Generation Errors
- Gemini API timeout → Retry with exponential backoff
- Invalid JSON response → Clean and re-parse
- Missing fields → Add defaults and continue
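The first two fallbacks can be illustrated with a short sketch. It assumes the API client raises `TimeoutError` on a timeout and that "cleaning" the response means stripping a Markdown code fence the model sometimes wraps around its JSON; both are assumptions, and the function names are hypothetical.

```python
import json
import time

FENCE = "`" * 3  # Markdown code-fence marker


def retry_with_backoff(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on TimeoutError, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence, if one is present."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including any language tag.
        newline = text.find("\n")
        text = text[newline + 1:] if newline != -1 else ""
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-len(FENCE)]
    return text.strip()


def parse_model_json(raw: str):
    """Parse JSON directly, then retry once after cleaning the response."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(strip_code_fence(raw))
```

Passing `sleep` as a parameter makes the backoff schedule easy to test without waiting in real time.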
Audio Generation Errors
Image Fetch Errors
Animation Generation Errors
- Syntax error in generated code → Show error, fallback to text
- Rendering timeout → Skip animation, use text slide
- Missing dependencies → Warning in logs, text fallback
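The common pattern in all three cases is "try the animation, fall back to a text slide". A minimal sketch follows, assuming the renderer signals these failures with `SyntaxError`, `TimeoutError`, and `ImportError`. The two renderer callables are hypothetical stand-ins for the real generators.

```python
import logging

logger = logging.getLogger("pipeline")


def render_with_fallback(slide: dict, render_animation, render_text):
    """Try the animation renderer; fall back to a text slide on known failures."""
    try:
        return render_animation(slide)
    except (SyntaxError, TimeoutError, ImportError) as exc:
        # Assumed mapping: bad generated code, render timeout, missing dependency.
        logger.warning("Animation failed (%s); using text slide", type(exc).__name__)
        return render_text(slide)
```

Catching only the expected exception types lets unrelated bugs surface instead of being silently converted into text slides.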
Progress Tracking
Progress Percentages
| Stage | Start % | End % | Duration | Status ID |
|---|---|---|---|---|
| Initialization | 0 | 10 | 1s | started |
| Content Generation | 10 | 20 | 10-30s | generating_content |
| Script Generation | 20 | 30 | 10-20s | generating_scripts |
| Audio Generation | 30 | 48 | 30-60s | generating_audio |
| Audio Combining | 48 | 49 | 2-5s | combining_audio |
| Visual Generation | 50 | 80 | 60-180s | generating_media |
| - Animation | 50-80 | (portion) | varies | generating_animation |
| - Images | 50-80 | (portion) | varies | fetching_image |
| - Text Slides | 50-80 | (portion) | varies | generating_slide |
| Video Composition | 85 | 95 | 30-90s | composing_video |
| Completion | 95 | 100 | instant | completed |
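The ranges in the table can be turned into an overall percentage by interpolating within the current stage. The sketch below copies the ranges and status IDs from the table. The function name and the idea of passing a 0-1 stage fraction are assumptions, not the backend's actual interface.

```python
# (start %, end %) per status ID, taken from the table above.
STAGE_RANGES = {
    "started": (0, 10),
    "generating_content": (10, 20),
    "generating_scripts": (20, 30),
    "generating_audio": (30, 48),
    "combining_audio": (48, 49),
    "generating_media": (50, 80),
    "composing_video": (85, 95),
    "completed": (95, 100),
}


def overall_percent(status: str, fraction: float) -> int:
    """Map progress within a stage (0.0 to 1.0) to the overall percentage."""
    start, end = STAGE_RANGES[status]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range input
    return round(start + (end - start) * fraction)
```

For example, halfway through audio generation gives `overall_percent("generating_audio", 0.5)`, which is 39.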
Real-time Updates
Backend (app.py:52-61):
Performance Characteristics
Total Time by Slide Count
Times are approximate and vary based on:
- Gemini API response time
- Number of animations (slowest step)
- Audio length
- System performance
| Slides | Text-Only | With Images | With Animations | Total |
|---|---|---|---|---|
| 3 | 1.5 min | 2 min | 3.5 min | ~2-3 min |
| 5 | 2 min | 2.5 min | 4.5 min | ~3-5 min |
| 10 | 3 min | 4 min | 7 min | ~5-8 min |
Bottlenecks
- Manim Rendering (30-60s per animation)
  - Solution: Limit animations to 1-2 per presentation
  - Alternative: Pre-render common animations
- Gemini API Calls (10-30s per call)
  - Solution: Use streaming responses (future)
  - Cache: Store common content patterns
- Video Composition (30-90s)
  - Depends on: Total video length, number of clips
  - Optimization: Use GPU acceleration if available