TypeAgent provides specialized support for ingesting and querying podcast transcripts, including WebVTT files with speaker annotations.
Podcast Workflow
Working with podcasts follows a simple two-step workflow:
1. Parse and index the podcast transcript into TypeAgent.
2. Ask questions about what was discussed in the podcast.
Podcasts use conversation messages with speaker and recipient metadata:
from typeagent.knowpro.universal_message import ConversationMessage, ConversationMessageMeta
message = ConversationMessage(
text_chunks=["Welcome to our podcast about AI and science fiction."],
metadata=ConversationMessageMeta(
speaker="host",
recipients=["guest", "audience"]
),
timestamp="1970-01-01T00:00:00Z" # Relative timestamp
)
TypeAgent uses Unix epoch (1970-01-01) as the base timestamp for podcasts when the actual date is unknown, preserving relative timing.
Ingesting Plain Text Transcripts
For simple speaker-prefixed transcripts:
import asyncio
from datetime import timedelta
from dotenv import load_dotenv
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.knowpro.universal_message import format_timestamp_utc, UNIX_EPOCH
from typeagent.podcasts.podcast_ingest import ingest_podcast
load_dotenv()
async def main():
settings = ConversationSettings()
podcast = await ingest_podcast(
transcript_file_path="transcript.txt",
settings=settings,
podcast_name="AI Discussion",
length_minutes=60.0, # Total podcast length
dbname="podcast.db",
verbose=True
)
print(f"Ingested {await podcast.messages.size()} messages")
if __name__ == "__main__":
asyncio.run(main())
Plain-text transcripts should use the SPEAKER: text format:
HOST: Welcome to the AI podcast.
GUEST: Thanks for having me.
HOST: Let's talk about machine learning.
GUEST: Machine learning is fascinating because...
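The SPEAKER: text convention above is simple enough to parse with a short regex. The sketch below is illustrative only, not TypeAgent's actual parser; `parse_plain_transcript` is a hypothetical helper name:

```python
import re

# Matches "SPEAKER: text" lines; anything before the first colon is the speaker.
SPEAKER_LINE = re.compile(r"^([A-Za-z][\w .'-]*):\s*(.+)$")

def parse_plain_transcript(text: str) -> list[tuple[str, str]]:
    """Split a speaker-prefixed transcript into (speaker, utterance) pairs,
    lowercasing speaker names as TypeAgent does."""
    turns = []
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            turns.append((match.group(1).lower(), match.group(2)))
    return turns

transcript = """HOST: Welcome to the AI podcast.
GUEST: Thanks for having me."""
print(parse_plain_transcript(transcript))
# [('host', 'Welcome to the AI podcast.'), ('guest', 'Thanks for having me.')]
```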
Timestamp Assignment
TypeAgent assigns timestamps proportionally based on text length:
# Timestamps are calculated based on:
# - Total podcast length (length_minutes)
# - Relative text length of each message
# - Base date (default: Unix epoch)
podcast = await ingest_podcast(
transcript_file_path="transcript.txt",
settings=settings,
start_date=None, # Uses Unix epoch
length_minutes=60.0
)
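The proportional scheme can be illustrated with stdlib code alone. This is a sketch of the behavior described above, not TypeAgent's implementation; `assign_timestamps` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def assign_timestamps(messages, length_minutes, start_date=None):
    """Spread message start times across the podcast length,
    proportional to each message's share of the total text."""
    base = start_date or UNIX_EPOCH
    total_chars = sum(len(m) for m in messages) or 1
    offset = 0.0
    stamps = []
    for m in messages:
        stamps.append(base + timedelta(minutes=length_minutes * offset / total_chars))
        offset += len(m)
    return stamps

msgs = ["short", "a much longer message than the first one"]
for ts in assign_timestamps(msgs, length_minutes=60.0):
    print(ts.isoformat())
# The first message starts at the epoch; the second starts
# 60 * 5/45 minutes (6m40s) in, since "short" is 5 of 45 total characters.
```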
Ingesting WebVTT Transcripts
For WebVTT files with timing and speaker annotations:
# Basic VTT ingestion
python tools/ingest_vtt.py transcript.vtt -d podcast.db
# With custom name
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--name "Episode 53: Adrian Tchaikovsky"
# Merge consecutive speaker segments
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--merge
# With custom batch size
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--batchsize 10
# Verbose output
python tools/ingest_vtt.py transcript.vtt \
-d podcast.db \
--verbose
TypeAgent supports WebVTT files with voice tags:
WEBVTT
00:00:00.000 --> 00:00:05.000
<v Host>Welcome to Behind the Tech.
00:00:05.000 --> 00:00:12.000
<v Kevin>I'm Kevin Scott, CTO of Microsoft.
00:00:12.000 --> 00:00:18.000
<v Kevin>Today we're talking with Adrian Tchaikovsky.
00:00:18.000 --> 00:00:25.000
<v Adrian>Thanks for having me on the show.
Voice Tag Parsing
TypeAgent parses WebVTT voice annotations:
from typeagent.transcripts.transcript_ingest import parse_voice_tags
# Parse voice-tagged text
text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
segments = parse_voice_tags(text)
# Returns: [
# ("host", "Welcome to the show"),
# ("guest", "Thanks for having me")
# ]
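If you need the same voice-tag splitting outside TypeAgent, the behavior shown above can be approximated with a regex. This is a minimal sketch under the assumption that tags follow the `<v Name>` form; it is not the library's implementation:

```python
import re

# Captures the speaker name inside a WebVTT voice tag: <v Name>
VOICE_TAG = re.compile(r"<v\s+([^>]+)>")

def parse_voice_tags_sketch(text: str) -> list[tuple[str, str]]:
    """Split voice-tagged text into (speaker, utterance) pairs."""
    # With one capture group, re.split yields:
    # [before-first-tag, speaker1, text1, speaker2, text2, ...]
    parts = VOICE_TAG.split(text)
    segments = []
    for i in range(1, len(parts), 2):
        segments.append((parts[i].strip().lower(), parts[i + 1].strip()))
    return segments

text = "<v Host>Welcome to the show<v Guest>Thanks for having me"
print(parse_voice_tags_sketch(text))
# [('host', 'Welcome to the show'), ('guest', 'Thanks for having me')]
```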
Multiple VTT Files
Ingest multiple VTT files as a continuous conversation:
# Ingest multiple files with time continuity
python tools/ingest_vtt.py \
episode1.vtt episode2.vtt episode3.vtt \
-d combined.db \
--name "Complete Series"
Programmatic Podcast Ingestion
Create custom podcast ingestion pipelines:
import asyncio
from datetime import datetime, timezone
from dotenv import load_dotenv
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast
load_dotenv()
async def ingest_podcast_series():
settings = ConversationSettings()
# Configure knowledge extraction
settings.semantic_ref_index_settings.auto_extract_knowledge = True
settings.semantic_ref_index_settings.batch_size = 4
# Ingest podcast
podcast = await ingest_podcast(
transcript_file_path="episode_53.txt",
settings=settings,
podcast_name="Episode 53: Adrian Tchaikovsky",
start_date=datetime(2024, 1, 15, tzinfo=timezone.utc),
length_minutes=45.0,
dbname="episode_53.db",
batch_size=10, # Override batch size
verbose=True
)
print(f"Podcast '{podcast.name_tag}' ingested successfully")
print(f"Messages: {await podcast.messages.size()}")
print(f"Semantic refs: {await podcast.semantic_refs.size()}")
return podcast
if __name__ == "__main__":
asyncio.run(ingest_podcast_series())
Podcast-Specific Features
Participant Alias Resolution
TypeAgent automatically builds aliases for participants:
# "Kevin Scott" is aliased to "Kevin"
# "Adrian Tchaikovsky" is aliased to "Adrian"
# Queries work with either form:
await podcast.query("What did Kevin say?")
await podcast.query("What did Kevin Scott say?") # Same results
Synonym Expansion
Podcasts include verb synonyms from podcastVerbs.json:
[
{
"term": "discuss",
"relatedTerms": ["talk about", "mention", "bring up", "cover"]
},
{
"term": "explain",
"relatedTerms": ["describe", "clarify", "elaborate"]
}
]
This enables queries like:
# All of these find similar results:
await podcast.query("What did they discuss about AI?")
await podcast.query("What did they talk about regarding AI?")
await podcast.query("What did they mention about AI?")
Querying Podcasts
Query ingested podcasts using natural language:
# Interactive query mode
python tools/query.py --database podcast.db
# Single query
python tools/query.py --database podcast.db \
--query "What did Kevin say to Adrian about science fiction?"
Podcast Query Examples
from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings
settings = ConversationSettings()
podcast = await Podcast.read_from_file(
"tests/testdata/Episode_53_AdrianTchaikovsky_index",
settings
)
# Who questions
answer = await podcast.query("Who is Adrian Tchaikovsky?")
answer = await podcast.query("Who spoke about AI ethics?")
# What questions
answer = await podcast.query("What did Kevin say about science fiction?")
answer = await podcast.query("What books were mentioned?")
# How questions
answer = await podcast.query("How was Asimov mentioned?")
answer = await podcast.query("How did they describe the challenges?")
# Topic searches
answer = await podcast.query("What was discussed about AI ethics?")
answer = await podcast.query("Tell me about the robotics discussion")
Podcast Serialization
Save and load podcast data efficiently:
Saving Podcasts
from typeagent.podcasts.podcast import Podcast
# Save to files
await podcast.write_to_file("podcast_index")
# Creates two files:
# - podcast_index_data.json (metadata, messages, indexes)
# - podcast_index_embeddings.bin (embedding vectors)
Loading Podcasts
from typeagent.podcasts.podcast import Podcast
from typeagent.knowpro.convsettings import ConversationSettings
settings = ConversationSettings()
# Load from files
podcast = await Podcast.read_from_file(
"podcast_index", # Filename prefix
settings
)
print(f"Loaded {await podcast.messages.size()} messages")
Embedding files are binary and specific to the embedding model used during ingestion.
Advanced Podcast Features
Resuming Interrupted Ingestion
# Resume from message 100 if ingestion was interrupted
podcast = await ingest_podcast(
transcript_file_path="large_transcript.txt",
settings=settings,
dbname="podcast.db",
start_message=100, # Resume from this message
batch_size=50,
verbose=True
)
Custom Timestamp Base
from datetime import datetime, timezone
# Use actual podcast date
podcast = await ingest_podcast(
transcript_file_path="transcript.txt",
settings=settings,
start_date=datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc),
length_minutes=60.0
)
Transcript Analysis
Inspect a VTT file's duration and speakers before ingesting it:
from typeagent.transcripts.transcript_ingest import (
get_transcript_duration,
get_transcript_speakers
)
# Analyze VTT file before ingestion
duration = get_transcript_duration("podcast.vtt")
speakers = get_transcript_speakers("podcast.vtt")
print(f"Duration: {duration:.2f} seconds")
print(f"Speakers: {speakers}")
Knowledge Extraction
Podcasts are enriched with semantic knowledge during indexing:
# Knowledge extracted includes:
# - Entities: Speaker names, mentioned people/organizations
# - Actions: Discussions, explanations, questions
# - Topics: Subjects covered
# - Relationships: Speaker interactions
result = await podcast.add_messages_with_indexing(messages)
print(f"Extracted {result.semrefs_added} semantic references")
Complete Podcast Example
Here’s a complete example from ingestion to query:
import asyncio
from dotenv import load_dotenv
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.podcasts.podcast_ingest import ingest_podcast
load_dotenv()
async def main():
# 1. Configure settings
settings = ConversationSettings()
settings.semantic_ref_index_settings.batch_size = 4
# 2. Ingest podcast transcript
print("Ingesting podcast...")
podcast = await ingest_podcast(
transcript_file_path="podcast_transcript.txt",
settings=settings,
podcast_name="Tech Talk Episode 1",
length_minutes=45.0,
dbname="tech_talk.db",
verbose=True
)
# 3. Check ingestion results
msg_count = await podcast.messages.size()
ref_count = await podcast.semantic_refs.size()
print(f"\nIngested {msg_count} messages")
print(f"Extracted {ref_count} semantic references")
# 4. Query the podcast
print("\nQuerying podcast...")
questions = [
"Who were the speakers?",
"What topics were discussed?",
"What was said about AI?"
]
for question in questions:
print(f"\nQ: {question}")
answer = await podcast.query(question)
print(f"A: {answer}")
# 5. Interactive mode
print("\n" + "="*50)
print("Entering interactive mode (type 'q' to exit)")
print("="*50)
while True:
try:
question = input("\ntypeagent> ")
if question.strip().lower() in ('q', 'quit', 'exit'):
break
if not question.strip():
continue
answer = await podcast.query(question)
print(answer)
except (EOFError, KeyboardInterrupt):
break
print("\nGoodbye!")
if __name__ == "__main__":
asyncio.run(main())
Adjust the knowledge-extraction batch size based on podcast length:
# Short podcasts (< 30 min): smaller batches
settings.semantic_ref_index_settings.batch_size = 4
# Long podcasts (> 60 min): larger batches
settings.semantic_ref_index_settings.batch_size = 10
Track ingestion with verbose mode:
podcast = await ingest_podcast(
transcript_file_path="long_transcript.txt",
settings=settings,
verbose=True # Shows progress updates
)
For very long transcripts, ingest in stages:
# First batch
podcast = await ingest_podcast(..., start_message=0, batch_size=100)
# Resume if interrupted
podcast = await ingest_podcast(..., start_message=100, batch_size=100)
Troubleshooting
VTT Parsing Errors
import webvtt
# Validate VTT file before ingestion
try:
vtt = webvtt.read("podcast.vtt")
print(f"Valid VTT with {len(vtt)} captions")
except Exception as e:
print(f"Invalid VTT file: {e}")
Speaker Name Normalization
Speaker names are normalized to lowercase:
# In transcript: "KEVIN:", "Kevin:", "kevin:" all become "kevin"
# Queries work case-insensitively
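The normalization rule is small enough to show directly; this sketch mirrors the behavior described above (`normalize_speaker` is an illustrative name, not TypeAgent's API):

```python
def normalize_speaker(raw: str) -> str:
    """Strip a trailing colon and lowercase, so KEVIN:, Kevin:, kevin: agree."""
    return raw.rstrip(":").strip().lower()

print(normalize_speaker("KEVIN:"))  # kevin
print(normalize_speaker("Kevin:"))  # kevin
```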
Missing Timestamps
For transcripts without timing:
# TypeAgent assigns proportional timestamps
# based on text length and total podcast duration
podcast = await ingest_podcast(
transcript_file_path="no_timestamps.txt",
settings=settings,
length_minutes=60.0 # Distribute across 60 minutes
)
Next Steps