
Overview

The VoiceGenerator class converts narration scripts into audio files using Sarvam AI's multilingual TTS API. It supports 11 languages (English plus ten Indian languages) and generates high-quality voice narration for presentations.

Class Definition

from generators.voice_generator import VoiceGenerator

voice_gen = VoiceGenerator()

Constructor

def __init__(self)
Initializes the voice generator with the Sarvam AI API configuration:
  • API Key: Config.SARVAM_API_KEY
  • API URL: Config.SARVAM_TTS_URL
  • Model: Config.SARVAM_MODEL
  • Sample Rate: 22050 Hz

Methods

generate_voice_for_slide

Generates voice audio for a single slide.
def generate_voice_for_slide(narration_text: str, slide_number: int, 
                             topic: str, language: str = "english") -> str
Parameters:
  • narration_text (string, required) - The narration text to convert to speech (max 500 characters)
  • slide_number (int, required) - The slide number (used for filename)
  • topic (string, required) - Presentation topic (used for filename)
  • language (string, default: "english") - Language for voice synthesis. Supported languages:
      • english (en-IN)
      • hindi (hi-IN)
      • kannada (kn-IN)
      • telugu (te-IN)
      • tamil (ta-IN)
      • bengali (bn-IN)
      • gujarati (gu-IN)
      • malayalam (ml-IN)
      • marathi (mr-IN)
      • odia (or-IN)
      • punjabi (pa-IN)
Returns:
  • string - Absolute path to the generated audio file (WAV format), e.g. "/path/to/audio/Newtons_Laws_of_Motion_slide_1.wav"
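The output filename is built from a sanitized topic plus the slide number. A hypothetical sketch of that sanitization step (the actual helper name and regex in voice_generator.py may differ):

```python
import re

def build_audio_filename(topic: str, slide_number: int) -> str:
    """Sanitize the topic (drop punctuation, spaces become underscores)
    and append the slide number. Hypothetical reconstruction."""
    safe_topic = re.sub(r"[^A-Za-z0-9 ]", "", topic)      # drop punctuation
    safe_topic = re.sub(r"\s+", "_", safe_topic.strip())  # spaces -> underscores
    return f"{safe_topic}_slide_{slide_number}.wav"

print(build_audio_filename("Newton's Laws of Motion", 1))
# Newtons_Laws_of_Motion_slide_1.wav
```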

generate_complete_audio

Generates complete audio for all slides combined into one file.
def generate_complete_audio(script_data: Dict, language: str = "english") -> str
Parameters:
  • script_data (Dict, required) - Complete script data from ScriptGenerator containing all slide narrations
  • language (string, default: "english") - Language for voice synthesis
Returns:
  • string - Path to the combined complete audio file
Process:
  1. Combines all narration texts
  2. Splits into chunks (max 500 chars each, respecting sentence boundaries)
  3. Generates audio for each chunk
  4. Concatenates all audio chunks
  5. Saves as single WAV file
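Steps 4 and 5 (concatenating chunks and saving one WAV file) can be sketched with the stdlib wave module, assuming every chunk shares the same sample rate, width, and channel count. This is an illustration, not the project's actual code:

```python
import wave

def concat_wav_files(chunk_paths: list, output_path: str) -> None:
    """Append raw PCM frames from each chunk into a single WAV file.
    Assumes identical sample format across all chunks."""
    params = None
    frames = b""
    for path in chunk_paths:
        with wave.open(path, "rb") as wf:
            if params is None:
                params = wf.getparams()  # take format from first chunk
            frames += wf.readframes(wf.getnframes())
    with wave.open(output_path, "wb") as out:
        out.setparams(params)
        out.writeframes(frames)  # header frame count is fixed up on close
```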

combine_slide_audios

Combines individual slide audio files into one complete audio track.
def combine_slide_audios(slide_audio_paths: dict, topic: str) -> str
Parameters:
  • slide_audio_paths (dict, required) - Dictionary mapping slide numbers to their audio file paths, e.g. {1: "/path/to/slide_1.wav", 2: "/path/to/slide_2.wav"}
  • topic (string, required) - Presentation topic for output filename
Returns:
  • string - Path to the combined audio file
Uses MoviePy to concatenate audio clips in order.
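Because dicts iterate in insertion order, the clips should be ordered by slide number before concatenation. A minimal sketch of that ordering step:

```python
def ordered_audio_paths(slide_audio_paths: dict) -> list:
    """Return audio paths sorted by slide number, so playback order
    matches slide order regardless of dict insertion order."""
    return [path for _, path in sorted(slide_audio_paths.items())]

print(ordered_audio_paths({2: "/tmp/slide_2.wav", 1: "/tmp/slide_1.wav"}))
# ['/tmp/slide_1.wav', '/tmp/slide_2.wav']
```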

Speaker Configuration

From config.py, speaker voices are mapped per language:
SARVAM_SPEAKER_MAP = {
    "english": "anushka",
    "hindi": "aarav",
    "kannada": "meera",
    # ... other languages
}

API Request Parameters

From backend/generators/voice_generator.py:26-36:
payload = {
    "inputs": [narration_text[:500]],  # Text to synthesize
    "target_language_code": self._get_language_code(language),
    "speaker": speaker,  # Voice model
    "pitch": 0,  # Normal pitch
    "pace": 1.0,  # Normal speed
    "loudness": 1.5,  # Slightly enhanced volume
    "speech_sample_rate": 22050,  # 22.05 kHz output
    "enable_preprocessing": True,  # Text normalization
    "model": Config.SARVAM_MODEL
}
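The 500-character cap is enforced by slicing before the request is sent. A standalone sketch of the payload construction (the speaker and model values here are illustrative placeholders):

```python
def build_tts_payload(narration_text: str, language_code: str,
                      speaker: str, model: str) -> dict:
    """Mirror the request parameters above; the input text is
    hard-truncated to 500 characters before synthesis."""
    return {
        "inputs": [narration_text[:500]],
        "target_language_code": language_code,
        "speaker": speaker,
        "pitch": 0,
        "pace": 1.0,
        "loudness": 1.5,
        "speech_sample_rate": 22050,
        "enable_preprocessing": True,
        "model": model,
    }

payload = build_tts_payload("a" * 600, "en-IN", "anushka", "example-model")
print(len(payload["inputs"][0]))  # 500
```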

Usage Example

From backend/app.py:251-303:
# Step 3: Generate voice audio PER SLIDE and get actual durations
update_progress(generation_id, 30, "generating_audio", 
                "🎤 Generating voice narration per slide...")

voice_gen = VoiceGenerator()
slide_audio_paths = {}
actual_durations = {}
total_slides = len(script_data['slide_scripts'])

# Generate audio for each slide separately
for idx, slide_script in enumerate(script_data['slide_scripts'], 1):
    slide_num = slide_script['slide_number']
    
    audio_progress = 30 + int((idx / total_slides) * 15)
    update_progress(generation_id, audio_progress, "generating_audio", 
                  f"🎤 Generating audio for slide {idx}/{total_slides}...")
    
    try:
        audio_path = voice_gen.generate_voice_for_slide(
            slide_script['narration_text'],
            slide_num,
            topic,
            request.language
        )
        slide_audio_paths[slide_num] = audio_path
        
        # Get actual duration from generated audio
        from moviepy import AudioFileClip
        audio_clip = AudioFileClip(audio_path)
        actual_durations[slide_num] = audio_clip.duration
        audio_clip.close()
        
    except Exception as e:
        print(f"Error generating audio for slide {slide_num}: {e}")
        actual_durations[slide_num] = slide_script['end_time'] - slide_script['start_time']

# Combine all slide audios into one file
update_progress(generation_id, 48, "combining_audio", "🎵 Combining audio tracks...")
audio_path = voice_gen.combine_slide_audios(slide_audio_paths, topic)

Text Chunking Strategy

For long narrations, text is split intelligently:
def _split_text_into_chunks(self, text: str, max_length: int = 500) -> list:
    """Split text into chunks respecting sentence boundaries"""
    if len(text) <= max_length:
        return [text]
    
    chunks = []
    sentences = text.replace('!', '.').replace('?', '.').split('.')
    current_chunk = ""
    
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
            
        # If adding this sentence would exceed limit, save current chunk
        if len(current_chunk) + len(sentence) + 2 > max_length:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
        else:
            current_chunk += sentence + ". "
    
    # Add remaining text
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    
    return chunks
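The same logic as a standalone function, to show the behavior on text longer than the limit (a demonstration copy, not the method itself):

```python
def split_text_into_chunks(text: str, max_length: int = 500) -> list:
    """Module-level copy of the chunking method above, for demonstration."""
    if len(text) <= max_length:
        return [text]
    chunks = []
    sentences = text.replace('!', '.').replace('?', '.').split('.')
    current_chunk = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        if len(current_chunk) + len(sentence) + 2 > max_length:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "
        else:
            current_chunk += sentence + ". "
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

# Twelve ~60-character sentences force multiple chunks under the 500-char cap.
text = " ".join(["This sentence is exactly some sixty characters long indeed okay."] * 12)
chunks = split_text_into_chunks(text)
print(len(chunks), max(len(c) for c in chunks))
```

Note that a single sentence longer than max_length would still produce an oversized chunk; the strategy only splits at sentence boundaries.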

Language Code Mapping

From backend/generators/voice_generator.py:140-155:
def _get_language_code(self, language: str) -> str:
    """Map language name to Sarvam AI language code"""
    language_map = {
        "english": "en-IN",
        "hindi": "hi-IN",
        "kannada": "kn-IN",
        "telugu": "te-IN",
        "tamil": "ta-IN",
        "bengali": "bn-IN",
        "gujarati": "gu-IN",
        "malayalam": "ml-IN",
        "marathi": "mr-IN",
        "odia": "or-IN",
        "punjabi": "pa-IN"
    }
    return language_map.get(language.lower(), "en-IN")
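The lookup is case-insensitive and falls back to English for unsupported names. A standalone copy showing that behavior:

```python
def get_language_code(language: str) -> str:
    """Standalone version of the mapping above, for demonstration."""
    language_map = {
        "english": "en-IN", "hindi": "hi-IN", "kannada": "kn-IN",
        "telugu": "te-IN", "tamil": "ta-IN", "bengali": "bn-IN",
        "gujarati": "gu-IN", "malayalam": "ml-IN", "marathi": "mr-IN",
        "odia": "or-IN", "punjabi": "pa-IN",
    }
    return language_map.get(language.lower(), "en-IN")

print(get_language_code("Hindi"))   # hi-IN
print(get_language_code("french"))  # en-IN (unsupported -> English fallback)
```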

Audio Response Handling

Sarvam AI returns base64-encoded audio:
response = requests.post(self.api_url, headers=headers, json=payload)
response.raise_for_status()

result = response.json()

if "audios" in result and len(result["audios"]) > 0:
    # Decode base64 audio
    import base64
    audio_data = base64.b64decode(result["audios"][0])
    
    # Save as WAV file
    audio_filename = f"{topic_name}_slide_{slide_number}.wav"
    audio_path = Config.AUDIO_DIR / audio_filename
    
    with open(audio_path, 'wb') as f:
        f.write(audio_data)
    
    return str(audio_path)
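The decode-and-write step can be exercised without calling the API by simulating the response shape, assuming "audios" holds base64-encoded WAV bytes:

```python
import base64

def save_first_audio(result: dict, audio_path: str) -> str:
    """Decode the first base64 audio payload and write it to disk."""
    audio_data = base64.b64decode(result["audios"][0])
    with open(audio_path, "wb") as f:
        f.write(audio_data)
    return audio_path

# Simulated API response carrying fake WAV bytes.
fake_wav = b"RIFF....WAVEfmt "
result = {"audios": [base64.b64encode(fake_wav).decode("ascii")]}
```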

Error Handling

try:
    response = requests.post(self.api_url, headers=headers, json=payload)
    if response.status_code != 200:
        print(f"Sarvam API Error Response: {response.text}")
        print(f"Request payload: {json.dumps(payload, indent=2)}")
    response.raise_for_status()
    
except Exception as e:
    print(f"Sarvam AI TTS Error: {e}")
    raise

File Output

Generated audio files are saved to:
Config.AUDIO_DIR / "{topic_sanitized}_slide_{slide_number}.wav"
Config.AUDIO_DIR / "{topic_sanitized}_complete.wav"  # Combined audio
Format specifications:
  • Codec: PCM signed 16-bit little-endian
  • Sample Rate: 22050 Hz
  • Channels: Mono
  • Format: WAV
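These specifications can be verified on a generated file with the stdlib wave module. A small sketch (the file path you pass in would come from one of the methods above):

```python
import wave

def check_wav_specs(path: str) -> dict:
    """Read the WAV header and report sample rate, channels, and sample width."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),         # expect 22050
            "channels": wf.getnchannels(),            # expect 1 (mono)
            "sample_width_bytes": wf.getsampwidth(),  # expect 2 (16-bit PCM)
        }
```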

Related Components

  • ScriptGenerator - Provides narration text input
  • VideoComposer - Uses generated audio for final video
  • Configuration - API keys and endpoints in config.py
