Overview

The Podcast Agent provides a comprehensive suite of tools for processing podcast videos, including automatic transcription with speaker identification, AI-assisted video editing, and semantic knowledge base creation from podcast content.

Features

  • Video Transcription: Convert podcast videos to structured transcripts with speaker identification
  • Speaker Recognition: Automatically identify speakers based on visual characteristics
  • AI Video Editing: Intelligent editing suggestions based on content analysis
  • Knowledge Base: Create searchable embeddings from podcast transcripts
  • Multi-format Support: Process MP4, MOV, AVI, MKV, and WebM formats

Components

The Podcast Agent consists of three main modules:

1. Video Transcription (geminivideo.py)

2. AI Video Editor (aiagenteditor.py)

3. Knowledge Base (podcast_knowledge_base.py)


Video Transcription

Automatic transcription with speaker identification using Google's multimodal Gemini 1.5 Pro model.

Setup

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
Requires a Google Cloud project with Vertex AI API enabled and a service account with appropriate permissions.

Configuration

podcast_agent/geminivideo.py
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
MODEL_ID = "gemini-1.5-pro"

vertexai.init(project=PROJECT_ID, location=LOCATION)
model = GenerativeModel(MODEL_ID)

Basic Usage

from podcast_agent.geminivideo import process_video

# Process a single video
output_path = process_video("path/to/podcast.mp4")
print(f"Transcript saved to: {output_path}")

Speaker Identification

The transcription tool uses visual analysis to identify speakers:
podcast_agent/geminivideo.py
prompt = f"""
The name of this podcast is The Rollup. There are two hosts:
- Andy: Light blonde, curly hair, longer on top with wave, light complexion.
- Rob: Short dark hair, slightly receding, light to medium skin, short beard.

Any other speakers are guests.

Transcribe this interview, identify speakers, and return JSON format:
[
    {{
        "speaker": "Speaker Name",
        "content": "What they said"
    }}
]
"""
Customize the speaker descriptions in the prompt to match your podcast’s hosts and guests for accurate identification.

Output Format

Transcripts are saved as JSON files in the jsonoutputs/ directory:
[
  {
    "speaker": "Andy",
    "content": "Welcome to The Rollup! Today we're discussing..."
  },
  {
    "speaker": "Rob",
    "content": "Thanks for having me. I'm excited to talk about..."
  },
  {
    "speaker": "Guest",
    "content": "It's great to be here. Let me share some insights on..."
  }
]
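Downstream tools can consume these transcripts with the standard json module. A minimal sketch of parsing a transcript in this format; the grouping below is illustrative, not part of the package:

```python
import json
from collections import defaultdict

# Sample transcript in the format shown above
transcript_json = '''
[
  {"speaker": "Andy", "content": "Welcome to The Rollup!"},
  {"speaker": "Rob", "content": "Thanks for having me."},
  {"speaker": "Andy", "content": "Let's dive in."}
]
'''

segments = json.loads(transcript_json)

# Group everything each speaker said
by_speaker = defaultdict(list)
for entry in segments:
    by_speaker[entry["speaker"]].append(entry["content"])

for speaker, lines in by_speaker.items():
    print(f"{speaker}: {len(lines)} segment(s)")
```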

Batch Processing

from podcast_agent.geminivideo import main
import os

# Process all videos in a directory
video_dir = "split_videos"
video_files = [
    os.path.join(video_dir, f)
    for f in os.listdir(video_dir)
    if f.lower().endswith(('.mov', '.mp4', '.avi', '.mkv', '.webm'))
]

for video_path in video_files:
    try:
        process_video(video_path)
    except Exception as e:
        print(f"Failed to process {video_path}: {str(e)}")

Retry Logic

Automatic retry with exponential backoff for API rate limits:
podcast_agent/geminivideo.py
@retry_with_exponential_backoff(max_retries=3, initial_delay=1)
def process_video(video_path):
    # Processing logic with automatic retry on failure
    video_bytes = pathlib.Path(video_path).read_bytes()
    mime_type = get_mime_type(video_path)
    video_file = Part.from_data(video_bytes, mime_type=mime_type)
    
    contents = [video_file, prompt]
    responses = model.generate_content(contents)
    return responses
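The decorator itself is not shown in the excerpt above. A minimal sketch of how a retry_with_exponential_backoff decorator could be written (the actual implementation may differ; delays are shortened here for illustration):

```python
import functools
import time

def retry_with_exponential_backoff(max_retries=3, initial_delay=1):
    """Retry a function with exponentially growing delays between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries, propagate the error
                    time.sleep(delay)
                    delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
        return wrapper
    return decorator

# Demo: a flaky function that succeeds on the third call
calls = {"n": 0}

@retry_with_exponential_backoff(max_retries=3, initial_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(flaky())  # succeeds after two retries
```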

AI Video Editor

Intelligent video editing with automated clip selection and assembly using Gemini’s video analysis.

Features

  • Content Analysis: AI analyzes video content and suggests edits
  • Timestamp Validation: Ensures all edits are within video duration
  • Parallel Processing: Concurrent clip trimming for faster processing
  • FFmpeg Integration: Professional-grade video editing

Basic Usage

from podcast_agent.aiagenteditor import process_video

# Process video with custom instructions
output_path = process_video(
    videopath="podcast_episode.mp4",
    custom_instructions="Focus on technical discussions, remove pauses longer than 2 seconds"
)

if output_path:
    print(f"Edited video saved to: {output_path}")

Edit Analysis Workflow

1. Video Analysis: Gemini analyzes the entire video and identifies segments to keep or remove based on content quality and pacing.

2. Timestamp Generation: The AI generates precise timestamps for each suggested edit in HH:MM:SS format.

3. Validation: All timestamps are validated against the video duration and checked for chronological order.

4. Clip Extraction: Valid segments are extracted using FFmpeg with parallel processing.

5. Concatenation: Approved clips are concatenated into the final edited video.
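Steps 2 and 3 can be sketched as follows. Both helpers below are illustrative stand-ins, not the package's actual functions:

```python
def hms_to_seconds(ts: str) -> int:
    """Convert an HH:MM:SS timestamp to seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

def validate_edits(edits, duration_seconds):
    """Keep only edits that fit the video and run in chronological order."""
    valid = []
    last_end = 0
    for edit in edits:
        start = hms_to_seconds(edit["start_time"])
        end = hms_to_seconds(edit["end_time"])
        if start < end and end <= duration_seconds and start >= last_end:
            valid.append(edit)
            last_end = end
    return valid

edits = [
    {"start_time": "00:00:00", "end_time": "00:05:30", "keep": True},
    {"start_time": "00:05:30", "end_time": "00:06:15", "keep": False},
    {"start_time": "00:06:15", "end_time": "99:00:00", "keep": True},  # past the end
]

valid = validate_edits(edits, duration_seconds=3600)
print(len(valid))  # the out-of-range edit is dropped
```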

Response Schema

The AI returns structured editing suggestions:
podcast_agent/aiagenteditor.py
response_schema = {
    "type": "object",
    "properties": {
        "edits": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start_time": {"type": "string", "pattern": "^[0-9]{2}:[0-9]{2}:[0-9]{2}$"},
                    "end_time": {"type": "string", "pattern": "^[0-9]{2}:[0-9]{2}:[0-9]{2}$"},
                    "keep": {"type": "boolean"},
                    "reason": {"type": "string"}
                },
                "required": ["start_time", "end_time", "keep", "reason"]
            }
        }
    }
}

Example Edit Output

{
  "edits": [
    {
      "start_time": "00:00:00",
      "end_time": "00:05:30",
      "keep": true,
      "reason": "Strong introduction with key points"
    },
    {
      "start_time": "00:05:30",
      "end_time": "00:06:15",
      "keep": false,
      "reason": "Long pause and technical difficulties"
    },
    {
      "start_time": "00:06:15",
      "end_time": "00:15:00",
      "keep": true,
      "reason": "Main discussion with valuable insights"
    }
  ]
}
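The clips to extract are the entries with keep set to true. A small illustrative pass over the example output above, summing the kept footage (to_seconds is a hypothetical helper):

```python
edit_response = {
    "edits": [
        {"start_time": "00:00:00", "end_time": "00:05:30", "keep": True,
         "reason": "Strong introduction with key points"},
        {"start_time": "00:05:30", "end_time": "00:06:15", "keep": False,
         "reason": "Long pause and technical difficulties"},
        {"start_time": "00:06:15", "end_time": "00:15:00", "keep": True,
         "reason": "Main discussion with valuable insights"},
    ]
}

def to_seconds(ts):
    """HH:MM:SS -> total seconds."""
    h, m, s = map(int, ts.split(":"))
    return h * 3600 + m * 60 + s

kept = [e for e in edit_response["edits"] if e["keep"]]
total = sum(to_seconds(e["end_time"]) - to_seconds(e["start_time"]) for e in kept)
print(f"{len(kept)} clips kept, {total} seconds of footage")
```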

Video Processing Functions

podcast_agent/aiagenteditor.py
def get_video_duration(video_path):
    """Get video duration using ffprobe."""
    command = [
        'ffprobe',
        '-v', 'error',
        '-select_streams', 'v:0',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        video_path
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    duration = float(result.stdout.strip())
    return duration

def trim_video(input_path, output_path, start_time, end_time):
    """Trim video using ffmpeg."""
    command = [
        'ffmpeg',
        '-i', input_path,
        '-ss', start_time,
        '-to', end_time,
        '-c', 'copy',
        '-y',
        output_path
    ]
    subprocess.run(command, check=True)

def concatenate_videos(clip_paths, output_path):
    """Concatenate multiple clips into final video."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt") as tmp_file:
        for clip_path in clip_paths:
            tmp_file.write(f"file '{clip_path}'\n")
        tmp_file.flush()
        
        command = [
            'ffmpeg',
            '-f', 'concat',
            '-safe', '0',
            '-i', tmp_file.name,
            '-c', 'copy',
            '-y',
            output_path
        ]
        subprocess.run(command, check=True)

Custom Editing Instructions

# Example 1: Technical content focus
process_video(
    "tech_podcast.mp4",
    custom_instructions="Keep all technical discussions, remove casual banter"
)

# Example 2: Pacing improvement
process_video(
    "interview.mp4",
    custom_instructions="Remove pauses longer than 3 seconds, keep all Q&A segments"
)

# Example 3: Highlight reel
process_video(
    "long_episode.mp4",
    custom_instructions="Extract only the most insightful moments and key takeaways"
)

Podcast Knowledge Base

Create a semantic search engine from podcast transcripts using ChromaDB and sentence transformers.

Setup

pip install chromadb sentence-transformers

Initialization

from podcast_agent.podcast_knowledge_base import PodcastKnowledgeBase

# Initialize with persistent storage
kb = PodcastKnowledgeBase(collection_name="podcast_knowledge")

# Process all JSON transcripts
kb.process_all_json_files(directory="jsonoutputs/")

Architecture

podcast_agent/podcast_knowledge_base.py
class PodcastKnowledgeBase:
    def __init__(self, collection_name: str = "podcast_knowledge"):
        # Persistent ChromaDB client
        self.client = chromadb.PersistentClient(path="./chroma_db")
        
        # Advanced embedding model
        self.embedding_model = SentenceTransformer('all-mpnet-base-v2')
        
        # Embedding function for ChromaDB, backed by the same model
        # (from chromadb.utils import embedding_functions)
        embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name='all-mpnet-base-v2'
        )
        
        # Create collection with custom embedding function
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=embedding_func
        )

Adding Segments

from podcast_agent.podcast_knowledge_base import PodcastSegment

# Create segments manually
segments = [
    PodcastSegment(
        id="ep1_0",
        speaker="Andy",
        content="Today we're discussing the future of AI...",
        source_file="episode_1.json"
    ),
    PodcastSegment(
        id="ep1_1",
        speaker="Rob",
        content="Machine learning has evolved significantly...",
        source_file="episode_1.json"
    )
]

kb.add_segments(segments)

Querying

# Query the knowledge base
results = kb.query_knowledge_base(
    query="What did they say about machine learning?",
    n_results=5
)

# Format and display results
formatted = kb.format_query_results(results)
print(formatted)

Query Results Format

[
    {
        "content": "Machine learning has evolved significantly...",
        "metadata": {
            "speaker": "Rob",
            "source_file": "episode_1.json",
            "timestamp": "2024-01-15T10:30:00"
        },
        "relevance_score": 0.89
    },
    {
        "content": "The applications of ML in healthcare are tremendous...",
        "metadata": {
            "speaker": "Guest",
            "source_file": "episode_3.json",
            "timestamp": "2024-01-20T14:15:00"
        },
        "relevance_score": 0.76
    }
]
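When relevance matters more than result count, results in this format can be post-filtered in plain Python; the cutoff below is an arbitrary example, not a package setting:

```python
results = [
    {"content": "Machine learning has evolved significantly...",
     "metadata": {"speaker": "Rob", "source_file": "episode_1.json"},
     "relevance_score": 0.89},
    {"content": "The applications of ML in healthcare are tremendous...",
     "metadata": {"speaker": "Guest", "source_file": "episode_3.json"},
     "relevance_score": 0.76},
]

MIN_SCORE = 0.8  # arbitrary cutoff for this example

strong = [r for r in results if r["relevance_score"] >= MIN_SCORE]
for r in strong:
    print(f"[{r['metadata']['speaker']}] {r['content'][:40]}")
```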

Processing JSON Files

podcast_agent/podcast_knowledge_base.py
def process_json_file(self, file_path: str):
    """Process a podcast transcript JSON and add to knowledge base."""
    with open(file_path, 'r', encoding='utf-8') as f:
        transcript_data = json.load(f)
    
    segments = []
    for idx, entry in enumerate(transcript_data):
        segment = PodcastSegment(
            id=f"{os.path.basename(file_path)}_{idx}",
            speaker=entry['speaker'],
            content=entry['content'],
            source_file=file_path
        )
        segments.append(segment)
    
    self.add_segments(segments)
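add_segments is not shown in the excerpt. ChromaDB collections take parallel lists of ids, documents, and metadatas; a sketch of that conversion, with a minimal stand-in PodcastSegment dataclass (the real class may differ):

```python
from dataclasses import dataclass

@dataclass
class PodcastSegment:
    id: str
    speaker: str
    content: str
    source_file: str

segments = [
    PodcastSegment("ep1_0", "Andy", "Today we're discussing AI...", "episode_1.json"),
    PodcastSegment("ep1_1", "Rob", "ML has evolved...", "episode_1.json"),
]

# Parallel lists in the shape ChromaDB's collection.add() expects
ids = [s.id for s in segments]
documents = [s.content for s in segments]
metadatas = [{"speaker": s.speaker, "source_file": s.source_file} for s in segments]

# With a real collection this would be:
# self.collection.add(ids=ids, documents=documents, metadatas=metadatas)
print(ids)
```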

Knowledge Base Statistics

# Get collection stats
stats = kb.get_collection_stats()
print(f"Total segments: {stats['count']}")
print(f"Last updated: {stats['last_update']}")

# Check processed files
processed = kb.get_processed_files()
print(f"Processed files: {processed}")

Advanced Queries

# Complex semantic search
queries = [
    "What are the challenges in blockchain scalability?",
    "How does NFT technology work?",
    "What did guests say about DeFi adoption?"
]

for query in queries:
    print(f"\n=== Query: {query} ===")
    results = kb.query_knowledge_base(query, n_results=3)
    
    for i, result in enumerate(results, 1):
        print(f"\n{i}. [{result['metadata']['speaker']}] "
              f"(Score: {result['relevance_score']:.2f})")
        print(f"   {result['content'][:200]}...")

Collection Management

# Clear the entire collection
kb.clear_collection()

# Re-process all files
kb.process_all_json_files()

# Get processed file list
processed_files = kb.get_processed_files()
print(f"Knowledge base contains {len(processed_files)} transcripts")

Integration Example

Complete workflow from video to searchable knowledge base:
from podcast_agent.geminivideo import process_video as transcribe
from podcast_agent.aiagenteditor import process_video as edit
from podcast_agent.podcast_knowledge_base import PodcastKnowledgeBase

def process_podcast_workflow(video_path: str):
    # Step 1: Transcribe the video
    print("Step 1: Transcribing video...")
    transcript_path = transcribe(video_path)
    
    # Step 2: Edit the video (optional)
    print("Step 2: Editing video...")
    edited_path = edit(
        video_path,
        custom_instructions="Keep main discussion, remove technical issues"
    )
    
    # Step 3: Add to knowledge base
    print("Step 3: Adding to knowledge base...")
    kb = PodcastKnowledgeBase()
    kb.process_json_file(transcript_path)
    
    # Step 4: Query the content
    print("Step 4: Testing semantic search...")
    results = kb.query_knowledge_base("main topics discussed", n_results=5)
    print(kb.format_query_results(results))
    
    return {
        "transcript": transcript_path,
        "edited_video": edited_path,
        "query_results": len(results)
    }

# Run the workflow
result = process_podcast_workflow("my_podcast.mp4")
print(f"Workflow complete: {result}")

Configuration

  • PROJECT_ID (string, required): Google Cloud project ID for Vertex AI
  • LOCATION (string, default "us-central1"): Google Cloud region for the Vertex AI API
  • MODEL_ID (string, default "gemini-1.5-pro"): Gemini model version for video analysis
  • MAX_RETRIES (int, default 3): Number of retry attempts for API calls

Best Practices

Speaker Customization

Customize speaker descriptions in the transcription prompt for your specific podcast hosts.

Batch Processing

Process multiple videos in batches with delays between API calls to avoid rate limits.

Edit Instructions

Provide clear, specific editing instructions for best AI-assisted editing results.

Knowledge Base Updates

Regularly update the knowledge base with new episodes for comprehensive search coverage.

The video editor uses FFmpeg. Ensure it is installed: brew install ffmpeg (macOS) or apt-get install ffmpeg (Linux).
