
Overview

DecipherIt transforms your research into engaging podcast-style audio overviews. Using CrewAI agents and LemonFox AI text-to-speech, it creates natural conversations between two hosts discussing your research findings.
Audio overviews are generated on-demand and typically ready in 2-4 minutes. The voice and content are AI-generated and may contain inaccuracies or audio glitches.

How It Works

Step 1: Content Preparation

All research content is retrieved from the vector database.

Content Retrieval:
  • All chunks for the notebook fetched from Qdrant
  • Content sorted by chunk index
  • Assembled into complete research text
  • Passed to audio generation crew
Implementation: backend/agents/audio_overview_agent.py:107-115
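The retrieval step above can be sketched as a small helper. The chunk payload shape (`chunk_index`, `text`) is an assumption mirroring typical Qdrant payloads, not the project's actual schema:

```python
def assemble_research_text(chunks: list[dict]) -> str:
    """Sort retrieved chunks by index and join them into one research text.

    Assumes each chunk payload carries a `chunk_index` and `text` field,
    as a vector-store query might return them in arbitrary order.
    """
    ordered = sorted(chunks, key=lambda c: c["chunk_index"])
    return "\n\n".join(c["text"] for c in ordered)

# Chunks can arrive out of order from the vector store
chunks = [
    {"chunk_index": 1, "text": "Key finding B."},
    {"chunk_index": 0, "text": "Key finding A."},
]
print(assemble_research_text(chunks))
# → Key finding A.
#
#   Key finding B.
```

Sorting by chunk index before joining is what preserves the original document order for the downstream agents.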
Step 2: Research Analysis

A Research Analyst agent extracts and organizes key insights.

Analysis Tasks:
  • Identifies main themes and key insights
  • Highlights important supporting details
  • Maintains factual accuracy
  • Organizes points logically for discussion
Implementation: backend/agents/audio_overview_agent.py:37-52
Step 3: Conversation Planning

A Podcast Producer agent structures insights into a conversation outline.

Planning Process:
  • Designs 4-5 minute conversation flow
  • Creates natural transitions between topics
  • Balances thoroughness with brevity
  • Plans for Michael (host) and Sarah (expert)
Implementation: backend/agents/audio_overview_agent.py:54-71
Step 4: Script Writing

A Scriptwriter agent crafts natural podcast dialogue.

Script Requirements:
  • Opens with “The DecipherIt Podcast” welcome
  • 800-1000 words (4-5 minutes)
  • Casual, natural dialogue
  • Authentic reactions and interjections
  • Meaningful back-and-forth discussion
Implementation: backend/agents/audio_overview_agent.py:73-97
Step 5: Text-to-Speech Conversion

LemonFox AI converts the script to audio with different voices.

TTS Process:
  • Michael: “liam” voice (host)
  • Sarah: “jessica” voice (expert)
  • Segments processed concurrently (max 5 at once)
  • 0.5 second pause between segments
  • Combined into single MP3 file
Implementation: backend/services/tts_service.py:84-191
Step 6: Audio Storage

Final audio is uploaded to Cloudflare R2 storage.

Storage Process:
  • Upload to R2 bucket
  • Generate public URL
  • Save URL to database
  • Return for playback
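The public-URL step can be sketched as a tiny helper. The domain and the `audio/{notebook_id}.mp3` key scheme here are illustrative assumptions, not the project's actual R2 configuration:

```python
def build_public_audio_url(public_domain: str, notebook_id: str) -> str:
    """Construct the public playback URL for an uploaded MP3.

    Assumes objects are keyed by notebook ID under an `audio/` prefix;
    the real bucket layout may differ.
    """
    return f"https://{public_domain}/audio/{notebook_id}.mp3"

print(build_public_audio_url("media.example.com", "nb_123"))
# → https://media.example.com/audio/nb_123.mp3
```

The returned URL is what gets persisted to the database and handed to the audio player.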

Generating Audio Overviews

  1. Navigate to your processed notebook
  2. Click the Audio Overview tab
  3. Click Generate Audio Overview button
  4. Wait 2-4 minutes for processing
  5. Audio player appears when ready
The page automatically polls for completion, so you don’t need to refresh.

CrewAI Audio Generation Workflow

Agent Configuration

def get_audio_overview_crew():
    # Research Analyst
    research_analyst = Agent(
        name="Research Analyst",
        role="Content Analyst",
        goal="Extract and organize key insights from research content",
        backstory="""You are an expert research analyst who excels at 
                     distilling complex information into clear, actionable 
                     summaries while maintaining accuracy.""",
        llm=llm,
        verbose=True
    )
    
    # Conversation Planner
    conversation_planner = Agent(
        name="Conversation Planner",
        role="Podcast Producer",
        goal="Structure research insights into engaging podcast conversation",
        backstory="""You are a podcast producer who specializes in 
                     transforming complex topics into natural, flowing 
                     conversations that educate and engage.""",
        llm=llm,
        verbose=True
    )
    
    # Script Writer
    script_writer = Agent(
        name="Script Writer",
        role="Podcast Scriptwriter",
        goal="Write natural podcast dialogue",
        backstory="""You are a scriptwriter who excels at crafting 
                     authentic podcast conversations that balance education 
                     with entertainment.""",
        llm=llm,
        verbose=True
    )
Source: backend/agents/audio_overview_agent.py:7-34

Script Output Format

class TranscriptSegment(BaseModel):
    name: str        # "Michael" or "Sarah"
    transcript: str  # Dialogue text

class AudioOverviewTranscript(BaseModel):
    transcript: List[TranscriptSegment]
Example Output:
[
    {"name": "Michael", "transcript": "Welcome to The DecipherIt Podcast..."},
    {"name": "Sarah", "transcript": "Thanks for having me, Michael..."},
    {"name": "Michael", "transcript": "So let's dive into the research..."}
]
Source: backend/models/audio_overview_models.py
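Given the 800-1000-word target, a transcript's rough spoken length can be estimated from the segment texts. The ~150 words-per-minute rate below is a common rule of thumb for conversational speech, not a project constant:

```python
def estimate_duration_minutes(transcript: list[dict], wpm: int = 150) -> float:
    """Estimate spoken duration from the total word count across segments."""
    total_words = sum(len(seg["transcript"].split()) for seg in transcript)
    return total_words / wpm

segments = [
    {"name": "Michael", "transcript": "Welcome to The DecipherIt Podcast, great to have you here today."},
    {"name": "Sarah", "transcript": "Thanks for having me, Michael."},
]
print(f"{estimate_duration_minutes(segments):.2f} minutes")
```

At 150 wpm, the 800-1000-word script range works out to roughly 5.3-6.7 minutes of raw speech, which the inter-segment pauses and pacing bring in line with the stated 4-5 minute target.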

Text-to-Speech Implementation

TTS Service Architecture

class TTSService:
    def __init__(self):
        self.api_key = os.environ.get("LEMONFOX_API_KEY")
        self.base_url = "https://api.lemonfox.ai/v1/audio/speech"
        self.response_format = "mp3"
        
        # Voice mapping
        self.speaker_voices = {
            "Michael": "liam",    # Host voice
            "Sarah": "jessica"    # Guest voice
        }
        
        # Performance settings
        self.pause_duration = 500  # 0.5 seconds between segments
        self.max_concurrent_requests = 5
        self.semaphore = asyncio.Semaphore(self.max_concurrent_requests)
Source: backend/services/tts_service.py:16-42

Concurrent Generation

async def generate_audio_from_transcript(
    self,
    transcript: List[Dict[str, Any]],
    notebook_id: str
) -> bytes:
    # valid_segments: transcript entries with non-empty text (filtering elided from this excerpt)
    # Process segments concurrently with semaphore for rate limiting
    tasks = []
    for i, segment in enumerate(valid_segments):
        speaker = segment.get("name", "Michael")
        text = segment.get("transcript", "")
        voice = self.speaker_voices.get(speaker, "jessica")
        
        task = self._generate_audio_with_semaphore(
            text, voice, i + 1, len(valid_segments)  # progress: segment i+1 of the valid set
        )
        tasks.append(task)
    
    # Execute all TTS requests concurrently
    audio_bytes_list = await asyncio.gather(*tasks, return_exceptions=True)
    
    # Combine audio segments
    return await self._combine_audio_segments(audio_bytes_list, valid_segments)
Source: backend/services/tts_service.py:84-133

Audio Combination

async def _combine_audio_segments(
    self,
    audio_bytes_list: List[bytes],
    valid_segments: List[tuple]
) -> bytes:
    combined_audio = None
    
    for i, audio_bytes in enumerate(audio_bytes_list):
        async with self._audio_segment_context(audio_bytes) as segment_audio:
            if combined_audio is None:
                combined_audio = segment_audio
            else:
                combined_audio += segment_audio
            
            # Add pause between segments (except for the last one)
            if i < len(audio_bytes_list) - 1:
                pause = AudioSegment.silent(duration=self.pause_duration)
                combined_audio += pause
    
    # Export to MP3
    output_buffer = io.BytesIO()
    combined_audio.export(output_buffer, format="mp3")
    return output_buffer.getvalue()
Source: backend/services/tts_service.py:151-185
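The combined file's length follows directly from that loop: the sum of the segment durations plus one pause between each adjacent pair. A quick check of the arithmetic, independent of pydub:

```python
def combined_duration_ms(segment_durations_ms: list[int], pause_ms: int = 500) -> int:
    """Total duration of concatenated segments with one pause between each pair."""
    if not segment_durations_ms:
        return 0
    return sum(segment_durations_ms) + pause_ms * (len(segment_durations_ms) - 1)

# Three segments of 10s, 8s, and 12s with 0.5s pauses → 30s + 2 pauses = 31s
print(combined_duration_ms([10_000, 8_000, 12_000]))  # 31000
```

Note the pause count is segments minus one, matching the `if i < len(audio_bytes_list) - 1` guard in the service code.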

UI Implementation

Status Management

The audio overview component tracks generation status:
const [audioOverviewUrl, setAudioOverviewUrl] = useState<string | null>(
  initialAudioOverviewUrl || null
);

// Status values:
// - null: Not generated yet
// - "IN_PROGRESS": Currently generating
// - "ERROR": Generation failed
// - "https://...": Audio URL (success)
Source: client/components/notebook/audio-overview-section.tsx:19-22

Polling for Completion

useEffect(() => {
  if (audioOverviewUrl !== "IN_PROGRESS") {
    return;
  }
  
  const pollInterval = setInterval(async () => {
    const response = await fetch(`/api/notebooks/${notebookId}`);
    const notebook = await response.json();
    const newAudioUrl = notebook.output?.audioOverviewUrl;
    
    if (newAudioUrl && newAudioUrl !== "IN_PROGRESS") {
      setAudioOverviewUrl(newAudioUrl);
      
      if (newAudioUrl === "ERROR") {
        toast.error("Audio overview generation failed");
      } else {
        toast.success("Audio overview ready!");
      }
    }
  }, 3000); // Poll every 3 seconds
  
  return () => clearInterval(pollInterval);
}, [audioOverviewUrl, notebookId]);
Source: client/components/notebook/audio-overview-section.tsx:25-68

Audio Quality Features

Natural Voices

High-quality AI voices (Liam and Jessica) create authentic-sounding podcast conversations.

Conversational Flow

Script includes natural reactions, interjections, and back-and-forth discussion for engagement.

Optimized Length

4-5 minute duration balances comprehensiveness with listenability.

Professional Production

Automatic pausing between segments and smooth transitions create polished output.

Use Cases

Listen to research summaries while:
  • Commuting
  • Exercising
  • Doing chores
  • Taking breaks
Audio format provides:
  • Alternative to reading long summaries
  • Accessibility for visual impairments
  • Multi-modal learning options
  • Reduced screen time
Use audio for:
  • Quick refreshers on research
  • Pre-presentation review
  • Sharing insights with colleagues
  • Multi-tasking while learning

Performance Optimizations

Concurrent TTS

Up to 5 segments processed simultaneously for faster generation.

HTTP/2 Support

Connection pooling and HTTP/2 reduce API call overhead.

Memory Efficiency

Context managers ensure proper cleanup of audio buffers during processing.

Progressive Updates

UI polls every 3 seconds to show progress without blocking.

Technical Details

Connection Pooling

async def _get_client(self) -> httpx.AsyncClient:
    """Get or create HTTP client with connection pooling."""
    if self._client is None or self._client.is_closed:
        async with self._client_lock:
            # Re-check inside the lock so concurrent callers don't each create a client
            if self._client is None or self._client.is_closed:
                limits = httpx.Limits(
                    max_keepalive_connections=10,
                    max_connections=20,
                    keepalive_expiry=30.0
                )
                self._client = httpx.AsyncClient(
                    timeout=httpx.Timeout(300.0),  # 5 minute timeout
                    limits=limits,
                    http2=True
                )
    return self._client
Source: backend/services/tts_service.py:44-60

Rate Limiting

self.semaphore = asyncio.Semaphore(self.max_concurrent_requests)

async def _generate_audio_with_semaphore(
    self,
    text: str,
    voice: str,
    segment_num: int,
    total_segments: int
) -> bytes:
    async with self.semaphore:
        return await self._generate_audio(text, voice)
Source: backend/services/tts_service.py:42, 139-149

Limitations

  • AI-Generated Content: Voices and script are AI-generated and may contain inaccuracies
  • Audio Quality: May include occasional glitches or unnatural phrasing
  • Length Constraint: Limited to 4-5 minutes (800-1000 words)
  • Processing Time: Typically 2-4 minutes, varies with content length
  • Storage: Audio files stored on Cloudflare R2, availability depends on storage service

Best Practices

For Best Results:
  • Generate after research is fully processed
  • Use headphones for best audio quality
  • Download for offline listening
  • Verify important information from the written summary
  • Treat as overview, not definitive source

Related Features

  • AI Summaries: read the full written summary
  • Interactive Q&A: ask questions about specific details
  • Mindmaps: visual representation of research structure
