TypeAgent provides specialized support for ingesting and querying podcast transcripts, including WebVTT files with speaker annotations.

Podcast Workflow

Working with podcasts follows a simple two-step workflow:
1. Ingest Transcript: parse and index a podcast transcript into TypeAgent
2. Query Content: ask questions about what was discussed in the podcast

Podcast Message Format

Podcasts use conversation messages with speaker and recipient metadata:
from typeagent.knowpro.universal_message import ConversationMessage, ConversationMessageMeta

message = ConversationMessage(
    text_chunks=["Welcome to our podcast about AI and science fiction."],
    metadata=ConversationMessageMeta(
        speaker="host",
        recipients=["guest", "audience"]
    ),
    timestamp="1970-01-01T00:00:00Z"  # Relative timestamp
)
TypeAgent uses Unix epoch (1970-01-01) as the base timestamp for podcasts when the actual date is unknown, preserving relative timing.
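The epoch-based scheme can be sketched with the standard library. This is an illustrative assumption about the behavior that helpers such as `format_timestamp_utc` and `UNIX_EPOCH` provide, not the library's exact code:

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def relative_timestamp(offset_seconds: float) -> str:
    """Format an offset from the start of the podcast as an epoch-based ISO timestamp."""
    ts = UNIX_EPOCH + timedelta(seconds=offset_seconds)
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")

print(relative_timestamp(0))      # 1970-01-01T00:00:00Z (start of the podcast)
print(relative_timestamp(125.0))  # 1970-01-01T00:02:05Z (2 min 5 s in)
```

Because all timestamps share the epoch base, the ordering and spacing of messages is preserved even when the real recording date is unknown.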

Ingesting Plain Text Transcripts

For simple speaker-prefixed transcripts:
import asyncio
from datetime import timedelta
from dotenv import load_dotenv

from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.knowpro.universal_message import format_timestamp_utc, UNIX_EPOCH
from typeagent.podcasts.podcast_ingest import ingest_podcast

load_dotenv()

async def main():
    settings = ConversationSettings()
    
    podcast = await ingest_podcast(
        transcript_file_path="transcript.txt",
        settings=settings,
        podcast_name="AI Discussion",
        length_minutes=60.0,  # Total podcast length
        dbname="podcast.db",
        verbose=True
    )
    
    print(f"Ingested {await podcast.messages.size()} messages")

if __name__ == "__main__":
    asyncio.run(main())

Transcript Format

Plain text transcripts should use SPEAKER: text format:
HOST: Welcome to the AI podcast.
GUEST: Thanks for having me.
HOST: Let's talk about machine learning.
GUEST: Machine learning is fascinating because...
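As a rough sketch of how such a transcript can be split into (speaker, text) pairs using only the standard library (TypeAgent's own parser may differ):

```python
import re

SPEAKER_LINE = re.compile(r"^([A-Za-z ]+):\s*(.*)$")

def parse_transcript(text: str) -> list[tuple[str, str]]:
    """Split SPEAKER: text lines into (speaker, text) pairs, lowercasing speaker names."""
    messages = []
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            speaker, content = match.groups()
            messages.append((speaker.strip().lower(), content))
    return messages

transcript = """HOST: Welcome to the AI podcast.
GUEST: Thanks for having me."""
print(parse_transcript(transcript))
# [('host', 'Welcome to the AI podcast.'), ('guest', 'Thanks for having me.')]
```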

Timestamp Assignment

TypeAgent assigns timestamps proportionally based on text length:
# Timestamps are calculated based on:
# - Total podcast length (length_minutes)
# - Relative text length of each message
# - Base date (default: Unix epoch)

podcast = await ingest_podcast(
    transcript_file_path="transcript.txt",
    settings=settings,
    start_date=None,  # Uses Unix epoch
    length_minutes=60.0
)

Ingesting WebVTT Transcripts

For WebVTT files with timing and speaker annotations:
# Basic VTT ingestion
python tools/ingest_vtt.py transcript.vtt -d podcast.db

# With custom name
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --name "Episode 53: Adrian Tchaikovsky"

# Merge consecutive speaker segments
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --merge

# With custom batch size
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --batchsize 10

# Verbose output
python tools/ingest_vtt.py transcript.vtt \
    -d podcast.db \
    --verbose

WebVTT Format

TypeAgent supports WebVTT files with voice tags:
WEBVTT

00:00:00.000 --> 00:00:05.000
<v Host>Welcome to Behind the Tech.

00:00:05.000 --> 00:00:12.000
<v Kevin>I'm Kevin Scott, CTO of Microsoft.

00:00:12.000 --> 00:00:18.000
<v Kevin>Today we're talking with Adrian Tchaikovsky.

00:00:18.000 --> 00:00:25.000
<v Adrian>Thanks for having me on the show.
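Cue timings like those above convert to seconds with simple arithmetic; a minimal helper, assuming the full HH:MM:SS.mmm form shown here:

```python
def vtt_timestamp_to_seconds(ts: str) -> float:
    """Convert an HH:MM:SS.mmm WebVTT timestamp to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

print(vtt_timestamp_to_seconds("00:00:05.000"))  # 5.0
print(vtt_timestamp_to_seconds("01:02:03.500"))  # 3723.5
```

Note that the WebVTT spec also permits a shorter MM:SS.mmm form, which this sketch does not handle.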

Voice Tag Parsing

TypeAgent parses WebVTT voice annotations:
from typeagent.transcripts.transcript_ingest import parse_voice_tags

# Parse voice-tagged text
text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
segments = parse_voice_tags(text)

# Returns: [
#     ("host", "Welcome to the show"),
#     ("guest", "Thanks for having me")
# ]
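One plausible regex-based implementation of this kind of splitting (an illustrative sketch, not the library's actual code):

```python
import re

VOICE_TAG = re.compile(r"<v\s+([^>]+)>")

def split_voice_tags(text: str) -> list[tuple[str, str]]:
    """Split voice-annotated text into (speaker, utterance) pairs."""
    segments = []
    parts = VOICE_TAG.split(text)
    # parts alternates: ["", speaker1, text1, speaker2, text2, ...]
    for i in range(1, len(parts) - 1, 2):
        segments.append((parts[i].strip().lower(), parts[i + 1].strip()))
    return segments

text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
print(split_voice_tags(text))
# [('host', 'Welcome to the show'), ('guest', 'Thanks for having me')]
```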

Multiple VTT Files

Ingest multiple VTT files as a continuous conversation:
# Ingest multiple files with time continuity
python tools/ingest_vtt.py \
    episode1.vtt episode2.vtt episode3.vtt \
    -d combined.db \
    --name "Complete Series"

Programmatic Podcast Ingestion

Create custom podcast ingestion pipelines:
import asyncio
from datetime import datetime, timezone
from dotenv import load_dotenv

from typeagent.knowpro.convsettings import ConversationSettings  
from typeagent.podcasts.podcast_ingest import ingest_podcast

load_dotenv()

async def ingest_podcast_series():
    settings = ConversationSettings()
    
    # Configure knowledge extraction
    settings.semantic_ref_index_settings.auto_extract_knowledge = True
    settings.semantic_ref_index_settings.batch_size = 4
    
    # Ingest podcast
    podcast = await ingest_podcast(
        transcript_file_path="episode_53.txt",
        settings=settings,
        podcast_name="Episode 53: Adrian Tchaikovsky",
        start_date=datetime(2024, 1, 15, tzinfo=timezone.utc),
        length_minutes=45.0,
        dbname="episode_53.db",
        batch_size=10,  # Override batch size
        verbose=True
    )
    
    print(f"Podcast '{podcast.name_tag}' ingested successfully")
    print(f"Messages: {await podcast.messages.size()}")
    print(f"Semantic refs: {await podcast.semantic_refs.size()}")
    
    return podcast

if __name__ == "__main__":
    asyncio.run(ingest_podcast_series())

Podcast-Specific Features

Participant Alias Resolution

TypeAgent automatically builds aliases for participants:
# "Kevin Scott" is aliased to "Kevin"
# "Adrian Tchaikovsky" is aliased to "Adrian"

# Queries work with either form:
await podcast.query("What did Kevin say?")
await podcast.query("What did Kevin Scott say?")  # Same results
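A minimal sketch of how first-name aliases could be derived (the `build_aliases` helper is hypothetical, for illustration only):

```python
def build_aliases(full_names: list[str]) -> dict[str, str]:
    """Map lowercase first-name aliases to full participant names."""
    aliases = {}
    for name in full_names:
        parts = name.split()
        if len(parts) > 1:
            aliases[parts[0].lower()] = name
    return aliases

print(build_aliases(["Kevin Scott", "Adrian Tchaikovsky"]))
# {'kevin': 'Kevin Scott', 'adrian': 'Adrian Tchaikovsky'}
```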

Synonym Expansion

Podcasts include verb synonyms from podcastVerbs.json:
[
  {
    "term": "discuss",
    "relatedTerms": ["talk about", "mention", "bring up", "cover"]
  },
  {
    "term": "explain",
    "relatedTerms": ["describe", "clarify", "elaborate"]
  }
]
This enables queries like:
# All of these find similar results:
await podcast.query("What did they discuss about AI?")
await podcast.query("What did they talk about regarding AI?")
await podcast.query("What did they mention about AI?")
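A sketch of how such a related-terms table can widen a query's vocabulary (the `expand_terms` helper is hypothetical):

```python
# Mirrors the podcastVerbs.json entries shown above
VERB_SYNONYMS = {
    "discuss": ["talk about", "mention", "bring up", "cover"],
    "explain": ["describe", "clarify", "elaborate"],
}

def expand_terms(term: str) -> list[str]:
    """Return a term plus its related terms, so a query on any form matches the others."""
    return [term] + VERB_SYNONYMS.get(term, [])

print(expand_terms("discuss"))
# ['discuss', 'talk about', 'mention', 'bring up', 'cover']
```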

Querying Podcasts

Query ingested podcasts using natural language:
# Interactive query mode
python tools/query.py --database podcast.db

# Single query
python tools/query.py --database podcast.db \
    --query "What did Kevin say to Adrian about science fiction?"

Podcast Query Examples

from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings

settings = ConversationSettings()
podcast = await Podcast.read_from_file(
    "tests/testdata/Episode_53_AdrianTchaikovsky_index",
    settings
)

# Who questions
answer = await podcast.query("Who is Adrian Tchaikovsky?")
answer = await podcast.query("Who spoke about AI ethics?")

# What questions
answer = await podcast.query("What did Kevin say about science fiction?")
answer = await podcast.query("What books were mentioned?")

# How questions  
answer = await podcast.query("How was Asimov mentioned?")
answer = await podcast.query("How did they describe the challenges?")

# Topic searches
answer = await podcast.query("What was discussed about AI ethics?")
answer = await podcast.query("Tell me about the robotics discussion")

Podcast Serialization

Save and load podcast data efficiently:

Saving Podcasts

from typeagent.podcasts.podcast import Podcast

# Save to files
await podcast.write_to_file("podcast_index")

# Creates two files:
# - podcast_index_data.json (metadata, messages, indexes)
# - podcast_index_embeddings.bin (embedding vectors)

Loading Podcasts

from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings

settings = ConversationSettings()

# Load from files
podcast = await Podcast.read_from_file(
    "podcast_index",  # Filename prefix
    settings
)

print(f"Loaded {await podcast.messages.size()} messages")
Embedding files are binary and specific to the embedding model used during ingestion.

Advanced Podcast Features

Resuming Interrupted Ingestion

# Resume from message 100 if ingestion was interrupted
podcast = await ingest_podcast(
    transcript_file_path="large_transcript.txt",
    settings=settings,
    dbname="podcast.db",
    start_message=100,  # Resume from this message
    batch_size=50,
    verbose=True
)

Custom Timestamp Base

from datetime import datetime, timezone

# Use actual podcast date
podcast = await ingest_podcast(
    transcript_file_path="transcript.txt",
    settings=settings,
    start_date=datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc),
    length_minutes=60.0
)

Extracting Metadata

from typeagent.transcripts.transcript_ingest import (
    get_transcript_duration,
    get_transcript_speakers
)

# Analyze VTT file before ingestion
duration = get_transcript_duration("podcast.vtt")
speakers = get_transcript_speakers("podcast.vtt")

print(f"Duration: {duration:.2f} seconds")
print(f"Speakers: {speakers}")
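For a self-contained picture of what these helpers compute, here is a standard-library sketch that derives duration and speakers directly from raw WebVTT text (an assumption about their behavior, not the actual implementation):

```python
import re

CUE = re.compile(r"(\d\d:\d\d:\d\d\.\d\d\d) --> (\d\d:\d\d:\d\d\.\d\d\d)")
VOICE = re.compile(r"<v\s+([^>]+)>")

def to_seconds(ts: str) -> float:
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def vtt_stats(vtt_text: str) -> tuple[float, set[str]]:
    """Return (duration in seconds, set of speakers) from raw WebVTT text."""
    ends = [to_seconds(end) for _, end in CUE.findall(vtt_text)]
    speakers = {name.strip().lower() for name in VOICE.findall(vtt_text)}
    return (max(ends) if ends else 0.0, speakers)

sample = """WEBVTT

00:00:00.000 --> 00:00:05.000
<v Host>Welcome to Behind the Tech.

00:00:05.000 --> 00:00:12.000
<v Kevin>I'm Kevin Scott."""
duration, speakers = vtt_stats(sample)
print(duration, sorted(speakers))  # 12.0 ['host', 'kevin']
```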

Podcast Knowledge Extraction

Podcasts are enriched with semantic knowledge:
# Knowledge extracted includes:
# - Entities: Speaker names, mentioned people/organizations
# - Actions: Discussions, explanations, questions
# - Topics: Subjects covered
# - Relationships: Speaker interactions

result = await podcast.add_messages_with_indexing(messages)
print(f"Extracted {result.semrefs_added} semantic references")

Complete Podcast Example

Here’s a complete example from ingestion to query:
import asyncio
from dotenv import load_dotenv

from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast

load_dotenv()

async def main():
    # 1. Configure settings
    settings = ConversationSettings()
    settings.semantic_ref_index_settings.batch_size = 4
    
    # 2. Ingest podcast transcript
    print("Ingesting podcast...")
    podcast = await ingest_podcast(
        transcript_file_path="podcast_transcript.txt",
        settings=settings,
        podcast_name="Tech Talk Episode 1",
        length_minutes=45.0,
        dbname="tech_talk.db",
        verbose=True
    )
    
    # 3. Check ingestion results
    msg_count = await podcast.messages.size()
    ref_count = await podcast.semantic_refs.size()
    print(f"\nIngested {msg_count} messages")
    print(f"Extracted {ref_count} semantic references")
    
    # 4. Query the podcast
    print("\nQuerying podcast...")
    
    questions = [
        "Who were the speakers?",
        "What topics were discussed?",
        "What was said about AI?"
    ]
    
    for question in questions:
        print(f"\nQ: {question}")
        answer = await podcast.query(question)
        print(f"A: {answer}")
    
    # 5. Interactive mode
    print("\n" + "="*50)
    print("Entering interactive mode (type 'q' to exit)")
    print("="*50)
    
    while True:
        try:
            question = input("\ntypeagent> ")
            if question.strip().lower() in ('q', 'quit', 'exit'):
                break
            if not question.strip():
                continue
            
            answer = await podcast.query(question)
            print(answer)
        
        except (EOFError, KeyboardInterrupt):
            break
    
    print("\nGoodbye!")

if __name__ == "__main__":
    asyncio.run(main())

Performance Tips

1. Optimize Batch Size

Adjust based on podcast length:

# Short podcasts (< 30 min): smaller batches
settings.semantic_ref_index_settings.batch_size = 4

# Long podcasts (> 60 min): larger batches
settings.semantic_ref_index_settings.batch_size = 10

2. Monitor Progress

Track ingestion with verbose mode:

podcast = await ingest_podcast(
    transcript_file_path="long_transcript.txt",
    settings=settings,
    verbose=True  # Shows progress updates
)

3. Resume Long Ingestions

For very long transcripts, ingest in stages:

# First batch
podcast = await ingest_podcast(..., start_message=0, batch_size=100)

# Resume if interrupted
podcast = await ingest_podcast(..., start_message=100, batch_size=100)
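Staged ingestion amounts to window arithmetic over the message list; a sketch with a hypothetical helper:

```python
def ingest_in_stages(total_messages: int, stage_size: int) -> list[tuple[int, int]]:
    """Compute (start_message, count) windows for ingesting a transcript in stages."""
    stages = []
    start = 0
    while start < total_messages:
        count = min(stage_size, total_messages - start)
        stages.append((start, count))
        start += count
    return stages

print(ingest_in_stages(250, 100))
# [(0, 100), (100, 100), (200, 50)]
```

Each window's start index is what you would pass as `start_message` when resuming.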

Troubleshooting

VTT Parsing Errors

import webvtt

# Validate VTT file before ingestion
try:
    vtt = webvtt.read("podcast.vtt")
    print(f"Valid VTT with {len(vtt)} captions")
except Exception as e:
    print(f"Invalid VTT file: {e}")

Speaker Name Normalization

Speaker names are normalized to lowercase:
# In transcript: "KEVIN:", "Kevin:", "kevin:" all become "kevin"
# Queries work case-insensitively

Missing Timestamps

For transcripts without timing:
# TypeAgent assigns proportional timestamps
# based on text length and total podcast duration
podcast = await ingest_podcast(
    transcript_file_path="no_timestamps.txt",
    settings=settings,
    length_minutes=60.0  # Distribute across 60 minutes
)
