TypeAgent provides specialized support for ingesting and querying podcast transcripts, including WebVTT files with speaker annotations.
Podcast Workflow
Working with podcasts follows a simple two-step workflow:
1. Parse and index the podcast transcript into TypeAgent.
2. Ask questions about what was discussed in the podcast.
Podcasts use conversation messages with speaker and recipient metadata:
from typeagent.knowpro.universal_message import ConversationMessage, ConversationMessageMeta
message = ConversationMessage(
text_chunks=["Welcome to our podcast about AI and science fiction."],
metadata=ConversationMessageMeta(
speaker="host",
recipients=["guest", "audience"]
),
timestamp="1970-01-01T00:00:00Z" # Relative timestamp
)
TypeAgent uses Unix epoch (1970-01-01) as the base timestamp for podcasts when the actual date is unknown, preserving relative timing.
Ingesting Plain Text Transcripts
For simple speaker-prefixed transcripts:
import asyncio
from datetime import timedelta
from dotenv import load_dotenv
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.knowpro.universal_message import format_timestamp_utc, UNIX_EPOCH
from typeagent.podcasts.podcast_ingest import ingest_podcast
load_dotenv()
async def main():
settings = ConversationSettings()
podcast = await ingest_podcast(
transcript_file_path="transcript.txt",
settings=settings,
podcast_name="AI Discussion",
length_minutes=60.0, # Total podcast length
dbname="podcast.db",
verbose=True
)
print(f"Ingested {await podcast.messages.size()} messages")
if __name__ == "__main__":
asyncio.run(main())
Plain-text transcripts should use the SPEAKER: text format:
HOST: Welcome to the AI podcast.
GUEST: Thanks for having me.
HOST: Let's talk about machine learning.
GUEST: Machine learning is fascinating because...
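The SPEAKER: text convention above is simple enough to parse with a short regex. The sketch below is illustrative only, not TypeAgent's actual parser; `parse_plain_transcript` is a hypothetical helper name:

```python
import re

# Matches "SPEAKER: text" lines; anything before the first colon is the speaker.
SPEAKER_LINE = re.compile(r"^([A-Za-z][\w .'-]*):\s*(.+)$")

def parse_plain_transcript(text: str) -> list[tuple[str, str]]:
    """Split a speaker-prefixed transcript into (speaker, utterance) pairs,
    lowercasing speaker names as TypeAgent does."""
    turns = []
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            turns.append((match.group(1).lower(), match.group(2)))
    return turns

transcript = """HOST: Welcome to the AI podcast.
GUEST: Thanks for having me."""
print(parse_plain_transcript(transcript))
# [('host', 'Welcome to the AI podcast.'), ('guest', 'Thanks for having me.')]
```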
Timestamp Assignment
TypeAgent assigns timestamps proportionally based on text length:
# Timestamps are calculated based on:
# - Total podcast length (length_minutes)
# - Relative text length of each message
# - Base date (default: Unix epoch)
podcast = await ingest_podcast(
transcript_file_path="transcript.txt",
settings=settings,
start_date=None, # Uses Unix epoch
length_minutes=60.0
)
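The proportional scheme can be illustrated with stdlib code alone. This is a sketch of the behavior described above, not TypeAgent's implementation; `assign_timestamps` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def assign_timestamps(messages, length_minutes, start_date=None):
    """Spread message start times across the podcast length,
    proportional to each message's share of the total text."""
    base = start_date or UNIX_EPOCH
    total_chars = sum(len(m) for m in messages) or 1
    offset = 0.0
    stamps = []
    for m in messages:
        stamps.append(base + timedelta(minutes=length_minutes * offset / total_chars))
        offset += len(m)
    return stamps

msgs = ["short", "a much longer message than the first one"]
for ts in assign_timestamps(msgs, length_minutes=60.0):
    print(ts.isoformat())
# The first message starts at the epoch; the second starts
# 60 * 5/45 minutes (6m40s) in, since "short" is 5 of 45 total characters.
```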
Ingesting WebVTT Transcripts
For WebVTT files with timing and speaker annotations:
# Basic VTT ingestion
python tools/ingest_vtt.py transcript.vtt -d podcast.db
# With custom name
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--name "Episode 53: Adrian Tchaikovsky"
# Merge consecutive speaker segments
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--merge
# With custom batch size
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--batchsize 10
# Verbose output
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--verbose
TypeAgent supports WebVTT files with voice tags:
WEBVTT
00:00:00.000 --> 00:00:05.000
<v Host>Welcome to Behind the Tech.
00:00:05.000 --> 00:00:12.000
<v Kevin>I'm Kevin Scott, CTO of Microsoft.
00:00:12.000 --> 00:00:18.000
<v Kevin>Today we're talking with Adrian Tchaikovsky.
00:00:18.000 --> 00:00:25.000
<v Adrian>Thanks for having me on the show.
Voice Tag Parsing
TypeAgent parses WebVTT voice annotations:
from typeagent.transcripts.transcript_ingest import parse_voice_tags
# Parse voice-tagged text
text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
segments = parse_voice_tags(text)
# Returns: [
# ("host", "Welcome to the show"),
# ("guest", "Thanks for having me")
# ]
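If you need the same voice-tag splitting outside TypeAgent, the behavior shown above can be approximated with a regex. This is a minimal sketch under the assumption that tags follow the `<v Name>` form; it is not the library's implementation:

```python
import re

# Captures the speaker name inside a WebVTT voice tag: <v Name>
VOICE_TAG = re.compile(r"<v\s+([^>]+)>")

def parse_voice_tags_sketch(text: str) -> list[tuple[str, str]]:
    """Split voice-tagged text into (speaker, utterance) pairs."""
    # With one capture group, re.split yields:
    # [before-first-tag, speaker1, text1, speaker2, text2, ...]
    parts = VOICE_TAG.split(text)
    segments = []
    for i in range(1, len(parts), 2):
        segments.append((parts[i].strip().lower(), parts[i + 1].strip()))
    return segments

text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
print(parse_voice_tags_sketch(text))
# [('host', 'Welcome to the show'), ('guest', 'Thanks for having me')]
```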
Multiple VTT Files
Ingest multiple VTT files as a continuous conversation:
# Ingest multiple files with time continuity
python tools/ingest_vtt.py \
episode1.vtt episode2.vtt episode3.vtt \
-d combined.db \
--name "Complete Series"
Programmatic Podcast Ingestion
Create custom podcast ingestion pipelines:
import asyncio
from datetime import datetime, timezone
from dotenv import load_dotenv
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast
load_dotenv()
async def ingest_podcast_series():
settings = ConversationSettings()
# Configure knowledge extraction
settings.semantic_ref_index_settings.auto_extract_knowledge = True
settings.semantic_ref_index_settings.batch_size = 4
# Ingest podcast
podcast = await ingest_podcast(
transcript_file_path="episode_53.txt",
settings=settings,
podcast_name="Episode 53: Adrian Tchaikovsky",
start_date=datetime(2024, 1, 15, tzinfo=timezone.utc),
length_minutes=45.0,
dbname="episode_53.db",
batch_size=10, # Override batch size
verbose=True
)
print(f"Podcast '{podcast.name_tag}' ingested successfully")
print(f"Messages: {await podcast.messages.size()}")
print(f"Semantic refs: {await podcast.semantic_refs.size()}")
return podcast
if __name__ == "__main__":
asyncio.run(ingest_podcast_series())
Podcast-Specific Features
Participant Alias Resolution
TypeAgent automatically builds aliases for participants:
# "Kevin Scott" is aliased to "Kevin"
# "Adrian Tchaikovsky" is aliased to "Adrian"
# Queries work with either form:
await podcast.query("What did Kevin say?")
await podcast.query("What did Kevin Scott say?") # Same results
Synonym Expansion
Podcasts include verb synonyms from podcastVerbs.json:
[
{
"term": "discuss",
"relatedTerms": ["talk about", "mention", "bring up", "cover"]
},
{
"term": "explain",
"relatedTerms": ["describe", "clarify", "elaborate"]
}
]
This enables queries like:
# All of these find similar results:
await podcast.query("What did they discuss about AI?")
await podcast.query("What did they talk about regarding AI?")
await podcast.query("What did they mention about AI?")
Querying Podcasts
Query ingested podcasts using natural language:
# Interactive query mode
python tools/query.py --database podcast.db
# Single query
python tools/query.py --database podcast.db \
--query "What did Kevin say to Adrian about science fiction?"
Podcast Query Examples
from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings
settings = ConversationSettings()
podcast = await Podcast.read_from_file(
"tests/testdata/Episode_53_AdrianTchaikovsky_index",
settings
)
# Who questions
answer = await podcast.query("Who is Adrian Tchaikovsky?")
answer = await podcast.query("Who spoke about AI ethics?")
# What questions
answer = await podcast.query("What did Kevin say about science fiction?")
answer = await podcast.query("What books were mentioned?")
# How questions
answer = await podcast.query("How was Asimov mentioned?")
answer = await podcast.query("How did they describe the challenges?")
# Topic searches
answer = await podcast.query("What was discussed about AI ethics?")
answer = await podcast.query("Tell me about the robotics discussion")
Podcast Serialization
Save and load podcast data efficiently:
Saving Podcasts
from typeagent.podcasts.podcast import Podcast
# Save to files
await podcast.write_to_file("podcast_index")
# Creates two files:
# - podcast_index_data.json (metadata, messages, indexes)
# - podcast_index_embeddings.bin (embedding vectors)
Loading Podcasts
from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings
settings = ConversationSettings()
# Load from files
podcast = await Podcast.read_from_file(
"podcast_index", # Filename prefix
settings
)
print(f"Loaded {await podcast.messages.size()} messages")
Embedding files are binary and specific to the embedding model used during ingestion.
Advanced Podcast Features
Resuming Interrupted Ingestion
# Resume from message 100 if ingestion was interrupted
podcast = await ingest_podcast(
transcript_file_path="large_transcript.txt",
settings=settings,
dbname="podcast.db",
start_message=100, # Resume from this message
batch_size=50,
verbose=True
)
Custom Timestamp Base
from datetime import datetime, timezone
# Use actual podcast date
podcast = await ingest_podcast(
transcript_file_path="transcript.txt",
settings=settings,
start_date=datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc),
length_minutes=60.0
)
Transcript Analysis
Inspect a VTT file's duration and speakers before ingesting it:
from typeagent.transcripts.transcript_ingest import (
get_transcript_duration,
get_transcript_speakers
)
# Analyze VTT file before ingestion
duration = get_transcript_duration("podcast.vtt")
speakers = get_transcript_speakers("podcast.vtt")
print(f"Duration: {duration:.2f} seconds")
print(f"Speakers: {speakers}")
Knowledge Extraction
Podcasts are enriched with semantic knowledge during indexing:
# Knowledge extracted includes:
# - Entities: Speaker names, mentioned people/organizations
# - Actions: Discussions, explanations, questions
# - Topics: Subjects covered
# - Relationships: Speaker interactions
result = await podcast.add_messages_with_indexing(messages)
print(f"Extracted {result.semrefs_added} semantic references")
Complete Podcast Example
Here’s a complete example from ingestion to query:
import asyncio
from dotenv import load_dotenv
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast
load_dotenv()
async def main():
# 1. Configure settings
settings = ConversationSettings()
settings.semantic_ref_index_settings.batch_size = 4
# 2. Ingest podcast transcript
print("Ingesting podcast...")
podcast = await ingest_podcast(
transcript_file_path="podcast_transcript.txt",
settings=settings,
podcast_name="Tech Talk Episode 1",
length_minutes=45.0,
dbname="tech_talk.db",
verbose=True
)
# 3. Check ingestion results
msg_count = await podcast.messages.size()
ref_count = await podcast.semantic_refs.size()
print(f"\nIngested {msg_count} messages")
print(f"Extracted {ref_count} semantic references")
# 4. Query the podcast
print("\nQuerying podcast...")
questions = [
"Who were the speakers?",
"What topics were discussed?",
"What was said about AI?"
]
for question in questions:
print(f"\nQ: {question}")
answer = await podcast.query(question)
print(f"A: {answer}")
# 5. Interactive mode
print("\n" + "="*50)
print("Entering interactive mode (type 'q' to exit)")
print("="*50)
while True:
try:
question = input("\ntypeagent> ")
if question.strip().lower() in ('q', 'quit', 'exit'):
break
if not question.strip():
continue
answer = await podcast.query(question)
print(answer)
except (EOFError, KeyboardInterrupt):
break
print("\nGoodbye!")
if __name__ == "__main__":
asyncio.run(main())
Adjust the knowledge-extraction batch size based on podcast length:
# Short podcasts (< 30 min): smaller batches
settings.semantic_ref_index_settings.batch_size = 4
# Long podcasts (> 60 min): larger batches
settings.semantic_ref_index_settings.batch_size = 10
Track ingestion with verbose mode:
podcast = await ingest_podcast(
transcript_file_path="long_transcript.txt",
settings=settings,
verbose=True # Shows progress updates
)
For very long transcripts, ingest in stages:
# First batch
podcast = await ingest_podcast(..., start_message=0, batch_size=100)
# Resume if interrupted
podcast = await ingest_podcast(..., start_message=100, batch_size=100)
Troubleshooting
VTT Parsing Errors
import webvtt
# Validate VTT file before ingestion
try:
vtt = webvtt.read("podcast.vtt")
print(f"Valid VTT with {len(vtt)} captions")
except Exception as e:
print(f"Invalid VTT file: {e}")
Speaker Name Normalization
Speaker names are normalized to lowercase:
# In transcript: "KEVIN:", "Kevin:", "kevin:" all become "kevin"
# Queries work case-insensitively
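The normalization rule is small enough to show directly; this sketch mirrors the behavior described above (`normalize_speaker` is an illustrative name, not TypeAgent's API):

```python
def normalize_speaker(raw: str) -> str:
    """Strip a trailing colon and lowercase, so KEVIN:, Kevin:, kevin: agree."""
    return raw.rstrip(":").strip().lower()

print(normalize_speaker("KEVIN:"))  # kevin
print(normalize_speaker("Kevin:"))  # kevin
```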
Missing Timestamps
For transcripts without timing:
# TypeAgent assigns proportional timestamps
# based on text length and total podcast duration
podcast = await ingest_podcast(
transcript_file_path="no_timestamps.txt",
settings=settings,
length_minutes=60.0 # Distribute across 60 minutes
)
Next Steps