Overview
The AI Video Presentation Generator supports 11+ languages for voice narration through Sarvam AI’s text-to-speech API. Each language has carefully selected voice speakers and configurable parameters for tone, pace, and style.
Supported Languages
The system supports the following languages with dedicated voice speakers:
Primary Languages
English
Language Code: en-IN
Speaker: anushka
Voice Type: Indian English Female
Use Case: International presentations, technical content
Hindi
Language Code: hi-IN
Speaker: manisha
Voice Type: Hindi Female
Use Case: North Indian audience, cultural content
Kannada
Language Code: kn-IN
Speaker: vidya
Voice Type: Kannada Female
Use Case: Karnataka region, local content
Telugu
Language Code: te-IN
Speaker: arya
Voice Type: Telugu Female
Use Case: Andhra Pradesh, Telangana regions
Additional Supported Languages
The system also supports these languages (configured in voice_generator.py:142-155):
| Language | Code | Region | Status |
|---|---|---|---|
| Tamil | ta-IN | Tamil Nadu | Supported |
| Bengali | bn-IN | West Bengal, Bangladesh | Supported |
| Gujarati | gu-IN | Gujarat | Supported |
| Malayalam | ml-IN | Kerala | Supported |
| Marathi | mr-IN | Maharashtra | Supported |
| Odia | or-IN | Odisha | Supported |
| Punjabi | pa-IN | Punjab | Supported |
All language codes follow the ISO 639-1 standard, with an -IN suffix for Indian regional variants.
Language Configuration
Setting Language Parameter
Specify the language when making a presentation generation request:
Language-Speaker Mapping
The system automatically maps language names to appropriate voice speakers (from config.py:40-45):
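The original mapping from config.py is not reproduced here. As a minimal sketch, assuming the mapping is a plain dictionary keyed by lowercase language name, it could look like the following. The speaker names come from the Primary Languages section; the names `SPEAKER_MAP`, `DEFAULT_SPEAKER`, and `get_speaker`, and the fallback behaviour, are assumptions for illustration.

```python
# Speaker names are taken from the Primary Languages section.
# The dictionary layout and the fallback are assumptions, not the
# confirmed contents of config.py.
SPEAKER_MAP = {
    "english": "anushka",
    "hindi": "manisha",
    "kannada": "vidya",
    "telugu": "arya",
}

# Hypothetical fallback for languages without a dedicated speaker entry.
DEFAULT_SPEAKER = "anushka"


def get_speaker(language: str) -> str:
    """Return the voice speaker for a language name (case-insensitive)."""
    return SPEAKER_MAP.get(language.strip().lower(), DEFAULT_SPEAKER)
```

Normalizing the key with `strip().lower()` keeps lookups tolerant of inputs such as "Hindi" or " hindi ", which may or may not match how the real code behaves.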
Tone Options
Customize the narration style with different tone parameters:
Available Tones
- Formal
- Casual
- Enthusiastic
Formal tone
Use Case: Professional presentations, academic content, business meetings
Characteristics:
- Professional and authoritative
- Clear pronunciation
- Structured delivery
- Appropriate for corporate settings
- Technical documentation
- Research presentations
- Product launches
- Investor pitches
Setting Tone
Include the tone parameter in your request:
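The request example from the original page is not available, so the following is only a sketch of how a request body carrying `language` and `tone` might be assembled and validated. The `topic` field, the function name `build_generation_request`, and the default values are hypothetical; the endpoint itself is not shown because the source does not name it.

```python
import json

# The three tone names come from the Available Tones list.
ALLOWED_TONES = {"formal", "casual", "enthusiastic"}


def build_generation_request(topic: str, language: str = "english",
                             tone: str = "formal") -> dict:
    """Build a request body; the field names are assumptions for illustration."""
    normalized_tone = tone.strip().lower()
    if normalized_tone not in ALLOWED_TONES:
        allowed = ", ".join(sorted(ALLOWED_TONES))
        raise ValueError(f"Unknown tone {tone!r}; expected one of: {allowed}")
    return {
        "topic": topic,  # hypothetical field name
        "language": language.strip().lower(),
        "tone": normalized_tone,
    }


if __name__ == "__main__":
    body = build_generation_request("Solar energy basics",
                                    language="Hindi", tone="Casual")
    # Serialize the body as JSON, ready to send to the generation endpoint.
    print(json.dumps(body))
```

Validating the tone on the client side gives a clearer error than waiting for the server to reject the request.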
Voice Customization
TTS Parameters
The voice generator uses these parameters for audio synthesis (from voice_generator.py:26-36):
pitch: Voice pitch adjustment in semitones. Range: -12 to +12.
- Negative values: Lower pitch (deeper voice)
- Zero: Natural pitch
- Positive values: Higher pitch
pace: Speech rate multiplier. Range: 0.5 to 2.0.
- 0.5: Half speed (slower)
- 1.0: Normal speed
- 2.0: Double speed (faster)
loudness: Audio volume multiplier. Range: 0.5 to 2.0.
- 1.0: Standard volume
- 1.5: Increased volume (default)
- 2.0: Maximum volume
speech_sample_rate: Audio sample rate in Hz. Higher values give better quality but larger files.
- 16000: Lower quality, smaller files
- 22050: Balanced (default)
- 44100: High quality, larger files
Example Custom Configuration
To customize voice parameters, modify the payload in voice_generator.py:26-36:
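The actual payload in voice_generator.py is not reproduced here. The sketch below shows one way to keep the TTS parameters together and check them against the ranges listed in this section before building a payload. The class `VoiceSettings`, the function `build_tts_payload`, and the payload keys `inputs` and `target_language_code` are assumptions; the defaults for `pitch` and `pace` are also assumed (natural pitch and normal speed), while the `loudness` and `speech_sample_rate` defaults follow the values marked as default above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Voice parameters with the ranges documented in this section."""
    pitch: float = 0.0              # assumed default: natural pitch
    pace: float = 1.0               # assumed default: normal speed
    loudness: float = 1.5           # documented default
    speech_sample_rate: int = 22050  # documented default

    def __post_init__(self) -> None:
        if not -12 <= self.pitch <= 12:
            raise ValueError("pitch must be between -12 and +12")
        if not 0.5 <= self.pace <= 2.0:
            raise ValueError("pace must be between 0.5 and 2.0")
        if not 0.5 <= self.loudness <= 2.0:
            raise ValueError("loudness must be between 0.5 and 2.0")
        if self.speech_sample_rate <= 0:
            raise ValueError("speech_sample_rate must be positive")


def build_tts_payload(text: str, language_code: str, speaker: str,
                      settings: VoiceSettings = VoiceSettings()) -> dict:
    """Merge text, language, speaker, and voice settings into one payload."""
    payload = {
        "inputs": [text],                       # assumed key name
        "target_language_code": language_code,  # assumed key name
        "speaker": speaker,
    }
    payload.update(asdict(settings))
    return payload
```

For example, `build_tts_payload("Hello", "en-IN", "anushka", VoiceSettings(pace=0.9, speech_sample_rate=44100))` yields a slower, higher-quality configuration, while an out-of-range value such as `VoiceSettings(pace=3.0)` raises `ValueError` before any request is made.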
Multi-Language Support Details
How Language Processing Works
Content Generation
Gemini AI generates presentation content in the requested language. The ScriptGenerator creates narration scripts appropriate for the language and tone.
Script Formatting
Scripts are formatted with timestamps and segmented by slide. Each slide gets a dedicated narration text aligned with its content.
Language Mapping
The system maps the language name to the appropriate:
- Language code (e.g., en-IN, hi-IN)
- Voice speaker (e.g., anushka, manisha)
- Regional variations if applicable
Audio Generation
Sarvam AI’s TTS engine synthesizes audio for each slide. Audio files are saved to backend/outputs/audio/ as WAV files.
Per-Slide Audio Generation
The system generates audio per slide for better timing control (from app.py:258-286):
- Accurate slide timing
- Better audio quality
- Easier debugging
- Flexible editing
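The per-slide approach described above can be sketched as a simple loop that synthesizes one audio file per slide. This is not the code from app.py: the function name `generate_slide_audio`, the `slide_N.wav` naming scheme, and the injected `synthesize` callable (standing in for the real TTS call) are all assumptions.

```python
from pathlib import Path
from typing import Callable, List


def generate_slide_audio(narrations: List[str],
                         synthesize: Callable[[str], bytes],
                         output_dir: Path) -> List[Path]:
    """Write one WAV file per slide narration and return the file paths.

    `synthesize` is a placeholder for the real TTS call; it takes the
    narration text and returns raw audio bytes.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, text in enumerate(narrations, start=1):
        audio_bytes = synthesize(text)
        path = output_dir / f"slide_{index}.wav"  # hypothetical naming scheme
        path.write_bytes(audio_bytes)
        paths.append(path)
    return paths
```

Because each slide has its own file, a failed or mispronounced slide can be regenerated without touching the others, which is where the debugging and editing benefits come from.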
Text Length Limitations
Sarvam AI TTS has a 500-character limit per request. The system handles this automatically (see voice_generator.py:157-183).
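How the system splits long text is not shown in this page. One plausible approach, sketched below under that caveat, is to pack whole sentences into chunks of at most 500 characters and hard-split any single sentence that exceeds the limit. The function name `chunk_text` and the sentence-boundary rule are assumptions, not the behaviour of the real voice_generator.py.

```python
import re
from typing import List

MAX_CHARS = 500  # per-request limit stated in this section


def chunk_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters.

    Sentences (ending in '.', '!', '?', or the Devanagari danda) are kept
    whole where possible; an over-long sentence is cut at the limit.
    """
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split a sentence that cannot fit in a single chunk.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own TTS request and the resulting audio concatenated in order.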
Language Selection Best Practices
Choose Based on Audience
Consider your target audience:
- International: Use English
- Regional: Use local language (Hindi, Kannada, etc.)
- Multilingual: Generate multiple versions
Match Content Type to Tone
Tone selection guidelines:
- Technical/Professional: Use formal tone
- Educational/Tutorial: Use casual tone
- Marketing/Sales: Use enthusiastic tone
Test Different Languages
Quality assurance:
- Generate test videos in target languages
- Verify pronunciation and clarity
- Check cultural appropriateness
- Gather feedback from native speakers
Consider Regional Variations
Regional considerations:
- Indian English has distinct pronunciation
- Regional languages may have dialects
- Cultural context affects word choice
- Formal vs. casual varies by culture
Code Examples
Backend: Language Processing
From voice_generator.py:140-155, the language code mapping:
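The original snippet is missing from this page. Based on the language tables in the Supported Languages section, a mapping of this kind might look like the sketch below. The codes are the ones listed in this document; the names `LANGUAGE_CODES` and `get_language_code` and the error-handling behaviour are assumptions.

```python
# Codes are copied from the Supported Languages tables in this document.
LANGUAGE_CODES = {
    "english": "en-IN",
    "hindi": "hi-IN",
    "kannada": "kn-IN",
    "telugu": "te-IN",
    "tamil": "ta-IN",
    "bengali": "bn-IN",
    "gujarati": "gu-IN",
    "malayalam": "ml-IN",
    "marathi": "mr-IN",
    "odia": "or-IN",
    "punjabi": "pa-IN",
}


def get_language_code(language: str) -> str:
    """Map an English language name to its code, or raise ValueError."""
    key = language.strip().lower()
    try:
        return LANGUAGE_CODES[key]
    except KeyError:
        supported = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(
            f"Unsupported language {language!r}. Supported: {supported}"
        ) from None
```

Listing the supported names in the error message makes it easier to spot a misspelling or a native-script name passed by mistake.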
Frontend: Language Selection UI
Example React component for language selection:
Troubleshooting
Language Not Supported Error
Problem: Error message about unsupported language
Solutions:
- Check spelling of language name (must be lowercase)
- Verify language is in supported list
- Use English name, not native name (“hindi” not “हिन्दी”)
- Check if Sarvam AI supports the language
Audio Quality Issues
Problem: Poor audio quality or robotic voice
Solutions:
- Increase speech_sample_rate to 44100
- Adjust the pace parameter (try 0.9 for better clarity)
- Check the loudness setting (may cause distortion if too high)
- Verify API key has access to latest models
Wrong Voice Gender/Accent
Problem: Voice doesn’t match expected speaker
Solutions:
- Check speaker mapping in config.py
- Verify language is correctly specified
- Consult Sarvam AI docs for available speakers
- Update speaker map if needed
Text Cut Off (500 Character Limit)
Problem: Narration text is truncated
Solutions:
- The system automatically limits narration to 500 characters per slide
- Use generate_complete_audio() for long texts (auto-chunks)
- Reduce narration length in script generation
- Split slides for better pacing
Performance Considerations
Audio Generation Time
Expected generation times per slide:
| Language | Avg. Time | Notes |
|---|---|---|
| English | 2-3 seconds | Fastest, most optimized |
| Hindi | 3-4 seconds | Good performance |
| Regional | 4-5 seconds | Slightly slower due to processing |
Optimization Tips
Parallel Processing
Audio generation runs sequentially by default. Consider parallel processing for faster generation with multiple slides.
Caching
Cache generated audio for repeated presentations. Reuse audio files if the topic and script haven’t changed.
Batch Requests
Generate multiple slides in batches. This reduces API overhead and improves throughput.
Audio Compression
Use compressed formats (MP3) instead of WAV. This reduces storage and bandwidth requirements.
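The parallel processing and caching tips above could be combined roughly as follows. This is a sketch, not part of the existing codebase: `cache_key`, `synthesize_all`, the in-memory dictionary cache, and the injected `synthesize` callable (standing in for the real TTS call) are all hypothetical, and a suitable worker count depends on the API's rate limits.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def cache_key(text: str, language_code: str, speaker: str) -> str:
    """Derive a stable key from everything that affects the audio."""
    raw = "\x1f".join((language_code, speaker, text))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def synthesize_all(narrations: List[str], language_code: str, speaker: str,
                   synthesize: Callable[[str], bytes],
                   cache: Dict[str, bytes],
                   max_workers: int = 4) -> List[bytes]:
    """Return audio for each narration, synthesizing only cache misses."""
    keys = [cache_key(text, language_code, speaker) for text in narrations]

    # Collect each uncached narration once, even if slides repeat text.
    missing: Dict[str, str] = {}
    for key, text in zip(keys, narrations):
        if key not in cache and key not in missing:
            missing[key] = text

    # Run the uncached requests concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(synthesize, missing.values())
        for key, audio in zip(missing, results):
            cache[key] = audio

    return [cache[key] for key in keys]
```

Threads suit this workload because each request spends most of its time waiting on the network. A persistent cache (for example, files named by the key) would let audio survive between runs, at the cost of managing stale entries when voice settings change.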
Next Steps
API Keys Setup
Learn how to obtain your Sarvam AI API key
Environment Setup
Configure your complete development environment
Generation Process
Understand how the full generation pipeline works
API Reference
View complete API documentation for generation endpoint