Overview
The Voice Narration feature converts slide text into audio using Sarvam AI’s TTS (Text-to-Speech) API. It supports 11 languages (English and 10 Indian languages) and generates narration that is synchronized with your video slides.
How It Works
Audio Generation Per Slide
For each slide, the system:
- Extracts Narration Text: Takes the content_text from the slide structure
- Selects Voice: Chooses an appropriate speaker based on the target language
- Generates Audio: Calls the Sarvam AI TTS API to create WAV audio
- Saves Individual Files: Stores audio as <topic>_slide_<number>.wav
- Combines All Audio: Concatenates individual clips into a single track for the final video
The system generates audio per slide first, then combines them. This approach allows for easier debugging and potential future features like slide-level audio editing.
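The per-slide steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes slides are dictionaries with a content_text key, that slide numbering starts at 1, and that a hypothetical synthesize callable returns WAV bytes for a piece of text.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def slide_audio_path(output_dir: Path, topic: str, slide_number: int) -> Path:
    """Build the <topic>_slide_<number>.wav path for one slide."""
    return output_dir / f"{topic}_slide_{slide_number}.wav"


def narrate_slides(
    slides: Iterable[dict],
    topic: str,
    output_dir: Path,
    synthesize: Callable[[str], bytes],  # hypothetical: text in, WAV bytes out
) -> List[Path]:
    """Write one WAV file per slide and return the paths in slide order."""
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, slide in enumerate(slides, start=1):  # 1-based numbering is an assumption
        text = slide.get("content_text", "").strip()
        if not text:
            continue  # assumption: slides without narration text are skipped
        path = slide_audio_path(output_dir, topic, number)
        path.write_bytes(synthesize(text))
        paths.append(path)
    return paths
```

Keeping one file per slide means a single bad clip can be regenerated without redoing the whole narration.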
Supported Languages
Sarvam AI supports 11 languages with native Indian accents:
| Language | Code | Default Speaker |
|---|---|---|
| English | en-IN | Anushka |
| Hindi | hi-IN | Anushka |
| Kannada | kn-IN | Anushka |
| Telugu | te-IN | Anushka |
| Tamil | ta-IN | Anushka |
| Bengali | bn-IN | Anushka |
| Gujarati | gu-IN | Anushka |
| Malayalam | ml-IN | Anushka |
| Marathi | mr-IN | Anushka |
| Odia | or-IN | Anushka |
| Punjabi | pa-IN | Anushka |
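The name-to-code pairs in the table can be expressed as a simple lookup. The sketch below is one possible shape for such a mapping; the names LANGUAGE_CODES and to_language_code are hypothetical, and the case-insensitive matching is an assumption rather than documented behaviour.

```python
# Language name -> Sarvam AI language code, copied from the table above.
LANGUAGE_CODES = {
    "english": "en-IN",
    "hindi": "hi-IN",
    "kannada": "kn-IN",
    "telugu": "te-IN",
    "tamil": "ta-IN",
    "bengali": "bn-IN",
    "gujarati": "gu-IN",
    "malayalam": "ml-IN",
    "marathi": "mr-IN",
    "odia": "or-IN",
    "punjabi": "pa-IN",
}


def to_language_code(language: str) -> str:
    """Return the language code for a language name (case-insensitive)."""
    key = language.strip().lower()
    if key not in LANGUAGE_CODES:
        supported = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(
            f"Unsupported language {language!r}; expected one of: {supported}"
        )
    return LANGUAGE_CODES[key]
```

Failing loudly on an unknown name is a design choice here; silently falling back to English would hide configuration mistakes.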
Language Code Mapping
The system automatically maps language names to the Sarvam AI codes listed in the table above.
Audio Configuration
API Request Parameters
Each TTS request includes voice parameters such as loudness, pace, and enable_preprocessing.
Configuration Options
You can customize voice parameters in config.py.
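As an illustration of how such settings might be grouped and turned into a request body, here is a hedged sketch. The parameter names pace, loudness, and enable_preprocessing come from this page; the VoiceConfig class, the build_tts_payload helper, the default values, and the text, target_language_code, and speech_sample_rate field names are assumptions that should be checked against Sarvam AI's API reference and the project's real config.py.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VoiceConfig:
    """Hypothetical container for voice settings; defaults are placeholders."""
    speaker: str = "Anushka"           # default speaker from the language table
    pace: float = 1.0                  # lower values slow the narration down
    loudness: float = 1.0              # higher values raise the volume
    enable_preprocessing: bool = True  # let the API normalise the input text
    speech_sample_rate: int = 22050    # matches the documented output rate


def build_tts_payload(
    text: str, language_code: str, config: VoiceConfig = VoiceConfig()
) -> dict:
    """Assemble a JSON-serialisable request body (field names are assumed)."""
    if not text.strip():
        raise ValueError("Narration text must not be empty")
    payload = asdict(config)
    payload["text"] = text
    payload["target_language_code"] = language_code
    return payload
```

A slower, louder voice could then be requested with `build_tts_payload("Namaste", "hi-IN", VoiceConfig(pace=0.9, loudness=1.5))`.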
Text Chunking
Character Limit Handling
Sarvam AI’s TTS API has a 500-character limit per request. For longer narrations, the system automatically splits the text into chunks.
Audio Format
Output Specifications
- Format: WAV (Waveform Audio File Format)
- Sample Rate: 22,050 Hz (22.05 kHz)
- Channels: Mono
- Encoding: Base64 (from API) → Binary (saved to disk)
- Codec: PCM signed 16-bit little-endian (when combined)
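The Base64-to-binary step and the properties listed above can be checked with the Python standard library. The sketch below assumes the API response has already been parsed and that audio_b64 holds one Base64-encoded WAV string; save_base64_wav is a hypothetical helper name.

```python
import base64
import io
import wave
from pathlib import Path


def save_base64_wav(audio_b64: str, path: Path) -> dict:
    """Decode Base64 audio, write it to disk, and return basic WAV properties."""
    audio_bytes = base64.b64decode(audio_b64)
    # Parse the header first so that invalid data raises wave.Error
    # before anything is written to disk.
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }
    path.write_bytes(audio_bytes)
    return info
```

For a clip matching the specifications above, the returned dictionary would be expected to report 1 channel and a sample rate of 22050.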
File Naming
Generated audio files follow the pattern <topic>_slide_<number>.wav.
Audio Combination
After generating individual slide audio, the system combines the clips into a single track.
The combined audio duration must match the total video duration for proper synchronization. Any mismatch triggers a warning during video composition.
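Concatenation and the duration check can be sketched with the standard library wave module. This is an illustration under assumptions: the project may well use a different tool for combining audio, and the 0.05-second tolerance is a hypothetical value, not a documented one.

```python
import wave
from pathlib import Path
from typing import Sequence


def combine_wavs(clip_paths: Sequence[Path], output_path: Path) -> float:
    """Concatenate WAV clips in order and return the combined duration in seconds."""
    if not clip_paths:
        raise ValueError("No clips to combine")
    params = None
    frames = []
    for path in clip_paths:
        with wave.open(str(path), "rb") as clip:
            fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                # Mixing formats would corrupt the output, so stop early.
                raise ValueError(f"{path} has format {fmt}, expected {params}")
            frames.append(clip.readframes(clip.getnframes()))
    channels, sample_width, sample_rate = params
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(sample_width)
        out.setframerate(sample_rate)
        for chunk in frames:
            out.writeframes(chunk)
    total_bytes = sum(len(chunk) for chunk in frames)
    return total_bytes / (channels * sample_width * sample_rate)


def durations_match(audio_seconds: float, video_seconds: float,
                    tolerance: float = 0.05) -> bool:
    """Return True when audio and video lengths agree within the tolerance."""
    return abs(audio_seconds - video_seconds) <= tolerance
```

Comparing the value returned by combine_wavs with the planned video length before composition surfaces a mismatch earlier than the composition-time warning would.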
Usage Example
Basic Usage
Multi-Language Support
Error Handling
The voice generator includes robust error handling:
- Invalid API Key: Check your SARVAM_API_KEY environment variable
- Empty Response: Verify the text input is not empty or too short
- Rate Limiting: Sarvam AI may throttle requests; add delays between chunks if needed
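Two of the cases above, the missing API key and rate limiting, can be handled along these lines. This is a sketch only: RateLimitError, require_api_key, and call_with_backoff are hypothetical names, and how throttling is actually signalled (for example an HTTP 429 response) is an assumption to confirm against Sarvam AI's documentation.

```python
import os
import time
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by the caller's request code when throttled."""


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return SARVAM_API_KEY or fail early with a clear message."""
    key = env.get("SARVAM_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SARVAM_API_KEY is not set or is empty")
    return key


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a throttled request, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep as a parameter keeps the retry logic testable without real waiting.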
Best Practices
- Keep Slide Text Concise: Aim for 2-4 sentences per slide (under 500 characters) to avoid chunking
- Use Natural Language: Write conversationally - the TTS engine handles punctuation and pauses naturally
- Test Different Speakers: If available, try different speakers to find the best fit for your content
- Monitor Audio Duration: Ensure generated audio matches expected slide duration
- Handle Special Characters: Remove or spell out symbols that TTS might mispronounce
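To see whether a slide's text will exceed the 500-character limit mentioned in the first practice, and how it might be split if it does, consider the sketch below. It is not the system's actual chunking algorithm: splitting at sentence boundaries, the set of sentence-ending characters, and counting characters with len() are all assumptions.

```python
import re
from typing import List

MAX_CHARS = 500  # per-request limit stated in this document


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after ., !, ? or the Devanagari danda, when followed by whitespace.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single over-long sentence is cut at the last space that fits,
        # or hard-cut when it contains no space at all.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A slide whose text comes back as a single chunk stays within the limit and needs only one request.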
Troubleshooting
Audio Not Generating
Symptom: API request fails or returns an empty response
Solutions:
- Verify SARVAM_API_KEY is set correctly
- Check if text contains only supported characters
- Ensure text is not empty or just whitespace
Audio Duration Mismatch
Symptom: Warning during video composition about duration mismatch
Solutions:
- Check if all slides have audio generated
- Verify slide durations in content_data.json are reasonable
- Regenerate audio if any files are corrupted
Poor Voice Quality
Symptom: Audio sounds robotic or unnatural
Solutions:
- Increase the loudness parameter (try 1.5-2.0)
- Adjust pace to slow down narration (try 0.9)
- Ensure enable_preprocessing is set to True
Related Features
- Content Generation - Creates the narration text that feeds into TTS
- Video Composition - Synchronizes audio with video timeline