TypeAgent provides comprehensive email integration for ingesting and querying email conversations from multiple sources.
Email Workflow Overview
Working with emails follows a three-step workflow:

1. Fetch raw .eml files from your email provider
2. Parse and index emails into a TypeAgent database
3. Ask natural language questions about your email content
TypeAgent represents emails using the EmailMessage class:
```python
from typeagent.emails.email_message import EmailMessage, EmailMessageMeta

message = EmailMessage(
    text_chunks=["Subject: Project Update", "The project is on track..."],
    metadata=EmailMessageMeta(
        sender="alice@example.com",
        recipients=["bob@example.com", "carol@example.com"],
        cc=["dave@example.com"],
        subject="Project Update",
        id="<message-id@example.com>",
    ),
    timestamp="2024-01-15T10:30:00Z",
    src_url="/path/to/email.eml",
)
```
Email metadata fields:

- sender: From address
- recipients: To addresses (list)
- cc: CC addresses (list)
- bcc: BCC addresses (list, if available)
- subject: Email subject line
- id: Message-ID header

Message-level fields:

- timestamp: ISO 8601 timestamp
- src_url: Source file path or identifier
Importing Email Files
TypeAgent provides utilities for importing .eml files:
```python
from typeagent.emails.email_import import import_email_from_file

# Import a single email
email = import_email_from_file("message.eml")

print(f"From: {email.metadata.sender}")
print(f"To: {', '.join(email.metadata.recipients)}")
print(f"Subject: {email.metadata.subject}")
print(f"Body chunks: {len(email.text_chunks)}")
```
Import from Directory
```python
from typeagent.emails.email_import import import_emails_from_dir

# Import all .eml files from a directory
for email in import_emails_from_dir("inbox_dump"):
    print(f"Imported: {email.metadata.subject}")
```
Import from String
```python
from typeagent.emails.email_import import import_email_string

# Import from a raw MIME string
with open("message.eml", "r") as f:
    mime_string = f.read()

email = import_email_string(mime_string)
```
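Under the hood, importing from a string amounts to MIME parsing. As a rough, stdlib-only sketch of what that involves (not TypeAgent's actual implementation):

```python
from email import message_from_string
from email.policy import default

raw = """\
From: alice@example.com
To: bob@example.com
Subject: Project Update
Content-Type: text/plain

The project is on track.
"""

# Parse the MIME string into a structured message object
msg = message_from_string(raw, policy=default)
print(msg["Subject"])             # Project Update
print(msg.get_content().strip())  # The project is on track.
```

TypeAgent's importer layers chunking and metadata extraction on top of this kind of parse.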
Command-Line Ingestion

The ingest_email.py tool provides a complete email ingestion pipeline:
```shell
# Basic ingestion
python tools/ingest_email.py -d emails.db inbox_dump/

# Ingest specific files
python tools/ingest_email.py -d emails.db msg1.eml msg2.eml

# Verbose output
python tools/ingest_email.py -d emails.db inbox_dump/ --verbose
```
Date Filtering
Filter emails by date range:
```shell
# Ingest only January 2024 emails
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --start-date 2024-01-01 \
    --stop-date 2024-02-01

# The date range is [start, stop): start inclusive, stop exclusive
```
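The half-open [start, stop) semantics can be illustrated with a small stdlib-only helper (a sketch; `in_date_range` is a hypothetical name, not part of the tool):

```python
from datetime import datetime

def in_date_range(timestamp: str, start: str, stop: str) -> bool:
    """Half-open check: start <= timestamp < stop (dates only)."""
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00")).date()
    return datetime.fromisoformat(start).date() <= ts < datetime.fromisoformat(stop).date()

print(in_date_range("2024-01-15T10:30:00Z", "2024-01-01", "2024-02-01"))  # True
print(in_date_range("2024-02-01T00:00:00Z", "2024-01-01", "2024-02-01"))  # False: stop is exclusive
```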
Batch Processing

Process emails in batches:

```shell
# Ingest first 20 emails
python tools/ingest_email.py -d emails.db inbox_dump/ --limit 20

# Skip first 100, process next 50
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --offset 100 \
    --limit 50
```
Filter Pipeline
The ingestion tool applies filters in this order:

1. Slice the input file list: files[offset:offset+limit]
2. Skip emails that were previously ingested
3. Filter by --start-date and --stop-date
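The first two stages can be sketched as follows (`select_files` is a hypothetical helper for illustration; the real tool's internals may differ, and the date filter runs later because it needs each email's parsed timestamp):

```python
def select_files(files, offset=0, limit=None, already_ingested=frozenset()):
    # 1. Slice the input list: files[offset : offset + limit]
    end = None if limit is None else offset + limit
    sliced = files[offset:end]
    # 2. Skip sources that were previously ingested
    return [f for f in sliced if f not in already_ingested]

files = [f"msg{i}.eml" for i in range(10)]
print(select_files(files, offset=2, limit=3, already_ingested={"msg3.eml"}))
# ['msg2.eml', 'msg4.eml']
```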
Programmatic Email Ingestion
Create a custom ingestion pipeline:
```python
import asyncio
from pathlib import Path

from dotenv import load_dotenv

from typeagent.emails.email_import import import_email_from_file
from typeagent.emails.email_memory import EmailMemory
from typeagent.emails.email_message import EmailMessage
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.storage.utils import create_storage_provider

load_dotenv()


async def ingest_emails():
    # Create settings
    settings = ConversationSettings()

    # Create storage provider
    settings.storage_provider = await create_storage_provider(
        settings.message_text_index_settings,
        settings.related_term_index_settings,
        "emails.db",
        EmailMessage,
    )

    # Create email memory
    email_memory = await EmailMemory.create(settings)

    # Process email files
    email_dir = Path("inbox_dump")
    for email_file in email_dir.glob("*.eml"):
        source_id = str(email_file)

        # Skip if already ingested
        if await settings.storage_provider.is_source_ingested(source_id):
            print(f"Skipping {email_file.name} (already ingested)")
            continue

        try:
            # Import and ingest
            email = import_email_from_file(str(email_file))
            await email_memory.add_messages_with_indexing(
                [email],
                source_ids=[source_id],
            )
            print(f"Ingested {email_file.name}")
        except Exception as e:
            print(f"Failed to ingest {email_file.name}: {e}")
            # Mark as failed so it is not retried blindly
            async with settings.storage_provider:
                await settings.storage_provider.mark_source_ingested(
                    source_id,
                    status=e.__class__.__name__,
                )


if __name__ == "__main__":
    asyncio.run(ingest_emails())
```
Downloading Emails
TypeAgent includes tools for downloading emails from various sources:
Download emails using the Gmail API:

```shell
# Download the 50 most recent emails (default)
cd tools/mail
python gmail_dump.py

# Download 200 emails
python gmail_dump.py --max-results 200

# Output to a specific directory
python gmail_dump.py --output-dir ~/gmail_export
```
Gmail API Setup
1. Navigate to “Credentials” in the sidebar
2. Click “+ Create Credentials”
3. Select “OAuth client ID”
4. Choose “Desktop app”
5. Download the JSON credentials
6. Save the credentials as tools/mail/client_secret.json
7. Run gmail_dump.py
8. Complete the OAuth flow in your browser
9. The token is saved to tools/mail/token.json
The Gmail API token expires after about a week. Delete token.json to trigger re-authentication.
Download emails using the Microsoft Graph API:

```shell
# Download from Outlook
cd tools/mail
python outlook_dump.py

# Download a specific number of emails
python outlook_dump.py --max-results 100
```
Microsoft Graph Setup
1. Go to the Azure Portal
2. Navigate to “App registrations”
3. Register a new application
4. Add the “Mail.Read” permission
5. Grant admin consent
6. Create a client secret
7. Set environment variables:
   - OUTLOOK_CLIENT_ID
   - OUTLOOK_CLIENT_SECRET
   - OUTLOOK_TENANT_ID
8. Run outlook_dump.py
Extract emails from mbox archives:

```shell
# Extract from a local mbox file
cd tools/mail
python mbox_dump.py archive.mbox

# Extract to a specific directory
python mbox_dump.py archive.mbox --output-dir ~/mbox_export
```
Mbox files are commonly exported from:
- Thunderbird
- Apple Mail
- Gmail Takeout
- Many email servers
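If you want to see what an mbox extractor works with, Python's stdlib `mailbox` module can read and write the format directly. A self-contained demo (independent of mbox_dump.py) that builds a tiny archive and iterates it:

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage as StdEmailMessage  # stdlib class, aliased to avoid clashing with TypeAgent's

# Build a tiny mbox file for the demo
path = os.path.join(tempfile.mkdtemp(), "archive.mbox")
mb = mailbox.mbox(path)
msg = StdEmailMessage()
msg["From"] = "alice@example.com"
msg["Subject"] = "Hello"
msg.set_content("Hi Bob")
mb.add(msg)
mb.flush()

# Iterate the messages, as an extractor would
for m in mailbox.mbox(path):
    print(m["Subject"])  # Hello
```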
Email Features
Reply Detection
TypeAgent automatically detects and extracts only the latest response from email threads:
```python
from typeagent.emails.email_import import is_reply, get_last_response_in_thread

# Check whether the email is a reply
if is_reply(email_message):
    # Extract only the new content; body_text holds the raw message body
    body = get_last_response_in_thread(body_text)
```
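Reply extraction typically hinges on quote markers such as "On &lt;date&gt;, &lt;person&gt; wrote:". A simplified heuristic sketch of the idea (not TypeAgent's actual logic):

```python
import re

# Match quote-introduction lines like "On Mon, Jan 15, 2024, Alice wrote:"
_QUOTE_MARKER = re.compile(r"^On .+ wrote:\s*$", re.MULTILINE)

def last_response(body: str) -> str:
    """Return the text before the first quote marker, or the whole body."""
    match = _QUOTE_MARKER.search(body)
    return body[:match.start()].rstrip() if match else body

body = "Sounds good, see you then.\n\nOn Mon, Jan 15, 2024, Alice wrote:\n> Can we meet Tuesday?"
print(last_response(body))  # Sounds good, see you then.
```

Real-world extractors handle many more marker styles (localized phrases, "-----Original Message-----", top vs. bottom quoting), which is why using the library function is preferable.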
Forward Detection
```python
from typeagent.emails.email_import import is_forwarded, get_forwarded_email_parts

# Check whether the email is forwarded
if is_forwarded(email_message):
    # Split the raw text into its forwarded parts
    parts = get_forwarded_email_parts(email_text)
```
Encoding Handling
TypeAgent properly handles RFC 2047 encoded words:
```python
from typeagent.emails.email_import import decode_encoded_words

# Decode RFC 2047 encoded headers
subject = decode_encoded_words("=?UTF-8?B?SGVsbG8gV29ybGQ=?=")
print(subject)  # "Hello World"
```
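For reference, Python's stdlib performs the same RFC 2047 decoding via `email.header`, which is handy when inspecting raw headers outside TypeAgent:

```python
from email.header import decode_header, make_header

encoded = "=?UTF-8?B?SGVsbG8gV29ybGQ=?="

# decode_header splits into (bytes, charset) pairs; make_header reassembles them
decoded = str(make_header(decode_header(encoded)))
print(decoded)  # Hello World
```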
Querying Emails
Once ingested, query emails using natural language:
```shell
# Interactive query
python tools/query.py --database emails.db

# Single query
python tools/query.py --database emails.db \
    --query "What emails did Alice send about the project?"
```
Email Query Examples
```python
from typeagent import create_conversation
from typeagent.emails.email_message import EmailMessage

conversation = await create_conversation("emails.db", EmailMessage)

# Who questions
answer = await conversation.query("Who sent emails about the meeting?")
answer = await conversation.query("Who did Alice email yesterday?")

# What questions
answer = await conversation.query("What was discussed in the project emails?")
answer = await conversation.query("What action items were mentioned?")

# When questions
answer = await conversation.query("When was the deadline mentioned?")
answer = await conversation.query("What emails were sent last week?")

# Topic searches
answer = await conversation.query("Find emails about budget approval")
answer = await conversation.query("Show me emails related to deployment")
```
Emails are automatically enriched with semantic knowledge:
```python
# EmailMessage.metadata.get_knowledge() extracts:
# - Entities: people (sender, recipients) and email addresses
# - Actions: "sent email", "received email"
# - Topics: the subject line
# - Relationships: sender -> recipient connections

knowledge = email.metadata.get_knowledge()
print(f"Entities: {len(knowledge.entities)}")
print(f"Actions: {len(knowledge.actions)}")
print(f"Topics: {knowledge.topics}")
```
Email addresses are parsed into entities, and email actions capture the communication between sender and recipients.
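For illustration, Python's stdlib `email.utils` shows how an address header splits into a display name and an address (a sketch of the parsing step, not TypeAgent's internal code):

```python
from email.utils import getaddresses, parseaddr

# A single address with a display name
name, addr = parseaddr("Alice Smith <alice@example.com>")
print(name, addr)  # Alice Smith alice@example.com

# Multiple recipients in one header value
pairs = getaddresses(["bob@example.com, Carol <carol@example.com>"])
print(pairs)  # [('', 'bob@example.com'), ('Carol', 'carol@example.com')]
```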
Email ingestion can take 1-2 seconds per message due to LLM-based knowledge extraction.
Batch Size Configuration
```python
from typeagent.knowpro.convsettings import ConversationSettings

settings = ConversationSettings()

# Raise the concurrent-extraction batch size (default: 4)
settings.semantic_ref_index_settings.batch_size = 8
```
Progress Monitoring
```python
import time

start_time = time.time()
success_count = 0
batch_size = 4

# Assumes `emails` is a list of EmailMessage objects and `semref_collection`
# is the conversation's semantic-ref collection (set up as in earlier examples).
for email in emails:
    await email_memory.add_messages_with_indexing([email])
    success_count += 1

    # Print progress periodically
    if (success_count % batch_size) == 0:
        elapsed = time.time() - start_time
        semref_count = await semref_collection.size()
        print(f"{success_count} imported | "
              f"{semref_count} semrefs | "
              f"{elapsed:.1f}s elapsed")
```
Error Handling
Handle common email ingestion errors:
```python
import traceback

import openai

from typeagent.emails.email_import import import_email_from_file

success_count = 0
failed_count = 0
skipped_count = 0

# Assumes `email_files` yields (source_id, path) pairs and that
# `email_matches_date_filter`, `email_memory`, `storage_provider`,
# `start_date`, `stop_date`, and `verbose` are defined as in the
# earlier examples.
for source_id, email_file in email_files:
    try:
        email = import_email_from_file(str(email_file))

        # Apply the date filter
        if not email_matches_date_filter(email.timestamp, start_date, stop_date):
            skipped_count += 1
            continue

        await email_memory.add_messages_with_indexing(
            [email],
            source_ids=[source_id],
        )
        success_count += 1
    except openai.AuthenticationError as e:
        print(f"Authentication error: {e}")
        break  # Fatal error
    except Exception as e:
        failed_count += 1
        print(f"Error processing {source_id}: {e}")

        # Mark as failed
        async with storage_provider:
            await storage_provider.mark_source_ingested(
                source_id,
                status=e.__class__.__name__,
            )
        if verbose:
            traceback.print_exc()

print(f"\nSuccessfully imported {success_count} emails")
print(f"Skipped {skipped_count} emails (date filter)")
print(f"Failed to import {failed_count} emails")
```
Example: Complete Email Pipeline
Here’s a complete example from download to query:
```shell
#!/bin/bash
# complete_email_pipeline.sh

set -e  # Exit on error

# 1. Download emails from Gmail
echo "Downloading emails..."
cd tools/mail
python gmail_dump.py --max-results 100 --output-dir ../../email_dump
cd ../..

# 2. Ingest emails into the database
echo "Ingesting emails..."
python tools/ingest_email.py \
    -d emails.db \
    email_dump/ \
    --start-date 2024-01-01 \
    --verbose

# 3. Query the database
echo "Database ready for queries!"
echo "Run: python tools/query.py --database emails.db"
```
Next Steps