Overview
The VoiceGenerator class converts narration scripts into audio files using Sarvam AI's multilingual TTS API. It supports 11 Indian languages and generates high-quality voice narration for presentations.
Class Definition
```python
from generators.voice_generator import VoiceGenerator

voice_gen = VoiceGenerator()
```
Constructor
Initializes the voice generator with Sarvam AI API configuration.
Configuration:
- API Key: `Config.SARVAM_API_KEY`
- API URL: `Config.SARVAM_TTS_URL`
- Model: `Config.SARVAM_MODEL`
- Sample Rate: 22050 Hz
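A minimal sketch of what the constructor might look like, assuming `Config` exposes the attributes listed above. The `Config` values here are placeholders, not the real endpoint or model name:

```python
class Config:
    # Stand-in for the real config.py; all values are placeholders
    SARVAM_API_KEY = "your-api-key"
    SARVAM_TTS_URL = "https://example.invalid/tts"
    SARVAM_MODEL = "model-name"

class VoiceGenerator:
    def __init__(self):
        self.api_key = Config.SARVAM_API_KEY
        self.api_url = Config.SARVAM_TTS_URL
        self.model = Config.SARVAM_MODEL
        self.sample_rate = 22050  # Hz
```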
Methods
generate_voice_for_slide
Generates voice audio for a single slide.
```python
def generate_voice_for_slide(narration_text: str, slide_number: int,
                             topic: str, language: str = "english") -> str
```
Parameters:
- `narration_text`: The narration text to convert to speech (max 500 characters)
- `slide_number`: The slide number (used for the filename)
- `topic`: Presentation topic (used for the filename)
- `language`: Language for voice synthesis. Supported languages:
  - `english` (en-IN)
  - `hindi` (hi-IN)
  - `kannada` (kn-IN)
  - `telugu` (te-IN)
  - `tamil` (ta-IN)
  - `bengali` (bn-IN)
  - `gujarati` (gu-IN)
  - `malayalam` (ml-IN)
  - `marathi` (mr-IN)
  - `odia` (or-IN)
  - `punjabi` (pa-IN)
Returns: Absolute path to the generated audio file (WAV format). Example:
```
"/path/to/audio/Newtons_Laws_of_Motion_slide_1.wav"
```
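The filename is built from a sanitized topic plus the slide number. A hypothetical helper reproducing that convention (the exact sanitization rule is an assumption; the real implementation may differ):

```python
import re

def slide_audio_filename(topic: str, slide_number: int) -> str:
    # Replace runs of non-alphanumeric characters with underscores (assumed rule)
    sanitized = re.sub(r'[^A-Za-z0-9]+', '_', topic).strip('_')
    return f"{sanitized}_slide_{slide_number}.wav"
```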
generate_complete_audio
Generates complete audio for all slides combined into one file.
```python
def generate_complete_audio(script_data: Dict, language: str = "english") -> str
```
Parameters:
- `script_data`: Complete script data from ScriptGenerator containing all slide narrations
- `language`: Language for voice synthesis

Returns: Path to the combined complete audio file
Process:
- Combines all narration texts
- Splits into chunks (max 500 chars each, respecting sentence boundaries)
- Generates audio for each chunk
- Concatenates all audio chunks
- Saves as single WAV file
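The steps above can be sketched as a small pipeline. Here `synthesize` stands in for the per-chunk Sarvam API call and returns raw audio bytes; the fixed-width split is only for illustration, since the real method respects sentence boundaries (see Text Chunking Strategy below):

```python
from typing import Callable, Dict

def generate_complete_audio_sketch(script_data: Dict,
                                   synthesize: Callable[[str], bytes],
                                   max_len: int = 500) -> bytes:
    # 1. Combine all narration texts
    text = " ".join(s["narration_text"] for s in script_data["slide_scripts"])
    # 2. Split into chunks (naive fixed-width split for illustration only)
    chunks = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    # 3-4. Generate audio per chunk, then concatenate the raw bytes
    return b"".join(synthesize(chunk) for chunk in chunks)
```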
combine_slide_audios
Combines individual slide audio files into one complete audio track.
```python
def combine_slide_audios(slide_audio_paths: dict, topic: str) -> str
```
Parameters:
- `slide_audio_paths`: Dictionary mapping slide numbers to their audio file paths, e.g. `{1: "/path/to/slide_1.wav", 2: "/path/to/slide_2.wav"}`
- `topic`: Presentation topic for the output filename

Returns: Path to the combined audio file
Uses MoviePy to concatenate audio clips in order.
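A sketch of that concatenation, assuming MoviePy's v2-style imports (`AudioFileClip`, `concatenate_audioclips`, matching the import used elsewhere in this document). The pure ordering helper is separated out so slide order never depends on dict insertion order:

```python
def ordered_paths(slide_audio_paths: dict) -> list:
    """Audio paths sorted by slide number."""
    return [slide_audio_paths[n] for n in sorted(slide_audio_paths)]

def combine_slide_audios_sketch(slide_audio_paths: dict, output_path: str) -> str:
    # Imported lazily so ordered_paths() is usable without MoviePy installed
    from moviepy import AudioFileClip, concatenate_audioclips

    clips = [AudioFileClip(p) for p in ordered_paths(slide_audio_paths)]
    final = concatenate_audioclips(clips)
    final.write_audiofile(output_path)
    for clip in clips:
        clip.close()
    return output_path
```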
Speaker Configuration
From config.py, speaker voices are mapped per language:
```python
SARVAM_SPEAKER_MAP = {
    "english": "anushka",
    "hindi": "aarav",
    "kannada": "meera",
    # ... other languages
}
```
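A case-insensitive lookup over that map might look like the following. Only the three speakers shown above are reproduced, and the English-voice fallback for unmapped languages is an assumption, not confirmed behavior:

```python
SARVAM_SPEAKER_MAP = {
    "english": "anushka",
    "hindi": "aarav",
    "kannada": "meera",
}

def get_speaker(language: str) -> str:
    # Fall back to the English voice for unmapped languages (assumed behavior)
    return SARVAM_SPEAKER_MAP.get(language.lower(), SARVAM_SPEAKER_MAP["english"])
```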
API Request Parameters
From backend/generators/voice_generator.py:26-36:
```python
payload = {
    "inputs": [narration_text[:500]],  # Text to synthesize (truncated to 500 chars)
    "target_language_code": self._get_language_code(language),
    "speaker": speaker,                # Voice model
    "pitch": 0,                        # Normal pitch
    "pace": 1.0,                       # Normal speed
    "loudness": 1.5,                   # Slightly enhanced volume
    "speech_sample_rate": 22050,       # 22.05 kHz sample rate
    "enable_preprocessing": True,      # Text normalization
    "model": Config.SARVAM_MODEL
}
```
Usage Example
From backend/app.py:251-303:
```python
# Step 3: Generate voice audio PER SLIDE and get actual durations
update_progress(generation_id, 30, "generating_audio",
                "🎤 Generating voice narration per slide...")

voice_gen = VoiceGenerator()
slide_audio_paths = {}
actual_durations = {}
total_slides = len(script_data['slide_scripts'])

# Generate audio for each slide separately
for idx, slide_script in enumerate(script_data['slide_scripts'], 1):
    slide_num = slide_script['slide_number']
    audio_progress = 30 + int((idx / total_slides) * 15)
    update_progress(generation_id, audio_progress, "generating_audio",
                    f"🎤 Generating audio for slide {idx}/{total_slides}...")

    try:
        audio_path = voice_gen.generate_voice_for_slide(
            slide_script['narration_text'],
            slide_num,
            topic,
            request.language
        )
        slide_audio_paths[slide_num] = audio_path

        # Get actual duration from generated audio
        from moviepy import AudioFileClip
        audio_clip = AudioFileClip(audio_path)
        actual_durations[slide_num] = audio_clip.duration
        audio_clip.close()
    except Exception as e:
        print(f"Error generating audio for slide {slide_num}: {e}")
        actual_durations[slide_num] = slide_script['end_time'] - slide_script['start_time']

# Combine all slide audios into one file
update_progress(generation_id, 48, "combining_audio", "🎵 Combining audio tracks...")
audio_path = voice_gen.combine_slide_audios(slide_audio_paths, topic)
```
Text Chunking Strategy
For long narrations, text is split intelligently:
```python
def _split_text_into_chunks(self, text: str, max_length: int = 500) -> list:
    """Split text into chunks respecting sentence boundaries"""
    if len(text) <= max_length:
        return [text]

    chunks = []
    sentences = text.replace('!', '.').replace('?', '.').split('.')
    current_chunk = ""

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        # If adding this sentence would exceed the limit, save the current chunk
        if len(current_chunk) + len(sentence) + 2 > max_length:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
        else:
            current_chunk += sentence + ". "

    # Add remaining text
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks
```
Language Code Mapping
From backend/generators/voice_generator.py:140-155:
```python
def _get_language_code(self, language: str) -> str:
    """Map language name to Sarvam AI language code"""
    language_map = {
        "english": "en-IN",
        "hindi": "hi-IN",
        "kannada": "kn-IN",
        "telugu": "te-IN",
        "tamil": "ta-IN",
        "bengali": "bn-IN",
        "gujarati": "gu-IN",
        "malayalam": "ml-IN",
        "marathi": "mr-IN",
        "odia": "or-IN",
        "punjabi": "pa-IN"
    }
    return language_map.get(language.lower(), "en-IN")
```
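Because the lookup uses `.get` with a default, any unrecognized language silently resolves to `en-IN` rather than raising. The same mapping as a standalone function makes that fallback easy to see:

```python
LANGUAGE_CODES = {
    "english": "en-IN", "hindi": "hi-IN", "kannada": "kn-IN",
    "telugu": "te-IN", "tamil": "ta-IN", "bengali": "bn-IN",
    "gujarati": "gu-IN", "malayalam": "ml-IN", "marathi": "mr-IN",
    "odia": "or-IN", "punjabi": "pa-IN",
}

def get_language_code(language: str) -> str:
    # Case-insensitive; unknown languages fall back to English (en-IN)
    return LANGUAGE_CODES.get(language.lower(), "en-IN")
```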
Audio Response Handling
Sarvam AI returns base64-encoded audio:
```python
response = requests.post(self.api_url, headers=headers, json=payload)
response.raise_for_status()

result = response.json()
if "audios" in result and len(result["audios"]) > 0:
    # Decode base64 audio
    import base64
    audio_data = base64.b64decode(result["audios"][0])

    # Save as WAV file
    audio_filename = f"{topic_name}_slide_{slide_number}.wav"
    audio_path = Config.AUDIO_DIR / audio_filename
    with open(audio_path, 'wb') as f:
        f.write(audio_data)

    return str(audio_path)
```
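The decode step can be exercised without calling the API by faking the response shape. The `audios` key matches the structure above; the WAV bytes here are dummy data, not a playable file:

```python
import base64

# Fake a Sarvam-style response carrying base64-encoded audio bytes
wav_bytes = b"RIFF" + b"\x00" * 8 + b"WAVE"  # dummy header only
result = {"audios": [base64.b64encode(wav_bytes).decode("ascii")]}

audio_data = base64.b64decode(result["audios"][0])
assert audio_data == wav_bytes  # base64 round-trips the bytes exactly
```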
Error Handling
```python
try:
    response = requests.post(self.api_url, headers=headers, json=payload)

    if response.status_code != 200:
        print(f"Sarvam API Error Response: {response.text}")
        print(f"Request payload: {json.dumps(payload, indent=2)}")

    response.raise_for_status()
except Exception as e:
    print(f"Sarvam AI TTS Error: {e}")
    raise
```
File Output
Generated audio files are saved to:
```python
Config.AUDIO_DIR / "{topic_sanitized}_slide_{slide_number}.wav"
Config.AUDIO_DIR / "{topic_sanitized}_complete.wav"  # Combined audio
```
Format specifications:
- Codec: PCM signed 16-bit little-endian
- Sample Rate: 22050 Hz
- Channels: Mono
- Format: WAV
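These specifications map directly onto Python's standard `wave` module parameters, which gives a quick way to sanity-check a generated file; shown here against an in-memory buffer rather than a real output file:

```python
import io
import wave

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit signed PCM (2 bytes per sample)
    w.setframerate(22050)    # 22050 Hz
    w.writeframes(b"\x00\x00" * 22050)  # one second of silence

buf.seek(0)
with wave.open(buf, "rb") as w:
    assert w.getnchannels() == 1
    assert w.getsampwidth() == 2
    assert w.getframerate() == 22050
```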
Related Components
- ScriptGenerator - Provides narration text input
- VideoComposer - Uses generated audio for the final video
- Configuration - API keys and endpoints in `config.py`