TypeAgent provides comprehensive email integration for ingesting and querying email conversations from multiple sources.
Email Workflow Overview
Working with emails follows a three-step workflow:

1. Fetch raw .eml files from your email provider
2. Parse and index emails into a TypeAgent database
3. Ask natural language questions about your email content
TypeAgent represents emails using the EmailMessage class:
```python
from typeagent.emails.email_message import EmailMessage, EmailMessageMeta

message = EmailMessage(
    text_chunks=["Subject: Project Update", "The project is on track..."],
    metadata=EmailMessageMeta(
        sender="alice@example.com",
        recipients=["bob@example.com", "carol@example.com"],
        cc=["dave@example.com"],
        subject="Project Update",
        id="<message-id@example.com>",
    ),
    timestamp="2024-01-15T10:30:00Z",
    src_url="/path/to/email.eml",
)
```
Email metadata fields:

- sender: From address
- recipients: To addresses (list)
- cc: CC addresses (list)
- bcc: BCC addresses (list, if available)
- subject: Email subject line
- id: Message-ID header

Message-level fields:

- timestamp: ISO 8601 timestamp
- src_url: Source file path or identifier
Importing Email Files
TypeAgent provides utilities for importing .eml files:
```python
from typeagent.emails.email_import import import_email_from_file

# Import a single email
email = import_email_from_file("message.eml")

print(f"From: {email.metadata.sender}")
print(f"To: {', '.join(email.metadata.recipients)}")
print(f"Subject: {email.metadata.subject}")
print(f"Body chunks: {len(email.text_chunks)}")
```
Import from Directory
```python
from typeagent.emails.email_import import import_emails_from_dir

# Import all .eml files from a directory
for email in import_emails_from_dir("inbox_dump"):
    print(f"Imported: {email.metadata.subject}")
```
Import from String
```python
from typeagent.emails.email_import import import_email_string

# Import from a raw MIME string
with open("message.eml", "r") as f:
    mime_string = f.read()

email = import_email_string(mime_string)
```
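Under the hood, importing from a string amounts to MIME parsing. As a rough, stdlib-only sketch of what that involves (not TypeAgent's actual implementation):

```python
from email import message_from_string
from email.policy import default

raw = """\
From: alice@example.com
To: bob@example.com
Subject: Project Update
Content-Type: text/plain

The project is on track.
"""

# Parse the MIME string into a structured message object
msg = message_from_string(raw, policy=default)
print(msg["Subject"])             # Project Update
print(msg.get_content().strip())  # The project is on track.
```

TypeAgent's importer layers chunking and metadata extraction on top of this kind of parse.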
Command-Line Ingestion

The ingest_email.py tool provides a complete email ingestion pipeline:
```shell
# Basic ingestion
python tools/ingest_email.py -d emails.db inbox_dump/

# Ingest specific files
python tools/ingest_email.py -d emails.db msg1.eml msg2.eml

# Verbose output
python tools/ingest_email.py -d emails.db inbox_dump/ --verbose
```
Date Filtering
Filter emails by date range:
```shell
# Ingest only January 2024 emails
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --start-date 2024-01-01 \
    --stop-date 2024-02-01

# The date range is [start, stop): start inclusive, stop exclusive
```
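The half-open [start, stop) semantics can be illustrated with a small stdlib-only helper (a sketch; `in_date_range` is a hypothetical name, not part of the tool):

```python
from datetime import datetime

def in_date_range(timestamp: str, start: str, stop: str) -> bool:
    """Half-open check: start <= timestamp < stop (dates only)."""
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00")).date()
    return datetime.fromisoformat(start).date() <= ts < datetime.fromisoformat(stop).date()

print(in_date_range("2024-01-15T10:30:00Z", "2024-01-01", "2024-02-01"))  # True
print(in_date_range("2024-02-01T00:00:00Z", "2024-01-01", "2024-02-01"))  # False: stop is exclusive
```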
Batch Processing

Process emails in batches:

```shell
# Ingest first 20 emails
python tools/ingest_email.py -d emails.db inbox_dump/ --limit 20

# Skip first 100, process next 50
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --offset 100 \
    --limit 50
```
Filter Pipeline
The ingestion tool applies filters in this order:

1. Slice the input file list: files[offset:offset+limit]
2. Skip emails that were previously ingested
3. Filter by --start-date and --stop-date
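The first two stages can be sketched as follows (`select_files` is a hypothetical helper for illustration; the real tool's internals may differ, and the date filter runs later because it needs each email's parsed timestamp):

```python
def select_files(files, offset=0, limit=None, already_ingested=frozenset()):
    # 1. Slice the input list: files[offset : offset + limit]
    end = None if limit is None else offset + limit
    sliced = files[offset:end]
    # 2. Skip sources that were previously ingested
    return [f for f in sliced if f not in already_ingested]

files = [f"msg{i}.eml" for i in range(10)]
print(select_files(files, offset=2, limit=3, already_ingested={"msg3.eml"}))
# ['msg2.eml', 'msg4.eml']
```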
Programmatic Email Ingestion
Create a custom ingestion pipeline:
```python
import asyncio
from pathlib import Path

from dotenv import load_dotenv

from typeagent.emails.email_import import import_email_from_file
from typeagent.emails.email_memory import EmailMemory
from typeagent.emails.email_message import EmailMessage
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.storage.utils import create_storage_provider

load_dotenv()


async def ingest_emails():
    # Create settings
    settings = ConversationSettings()

    # Create storage provider
    settings.storage_provider = await create_storage_provider(
        settings.message_text_index_settings,
        settings.related_term_index_settings,
        "emails.db",
        EmailMessage,
    )

    # Create email memory
    email_memory = await EmailMemory.create(settings)

    # Process email files
    email_dir = Path("inbox_dump")
    for email_file in email_dir.glob("*.eml"):
        source_id = str(email_file)

        # Skip if already ingested
        if await settings.storage_provider.is_source_ingested(source_id):
            print(f"Skipping {email_file.name} (already ingested)")
            continue

        try:
            # Import and ingest
            email = import_email_from_file(str(email_file))
            await email_memory.add_messages_with_indexing(
                [email],
                source_ids=[source_id],
            )
            print(f"Ingested {email_file.name}")
        except Exception as e:
            print(f"Failed to ingest {email_file.name}: {e}")
            # Mark as failed so it is not retried blindly
            async with settings.storage_provider:
                await settings.storage_provider.mark_source_ingested(
                    source_id,
                    status=e.__class__.__name__,
                )


if __name__ == "__main__":
    asyncio.run(ingest_emails())
```
Downloading Emails
TypeAgent includes tools for downloading emails from various sources:
Download emails using the Gmail API:

```shell
# Download the 50 most recent emails (default)
cd tools/mail
python gmail_dump.py

# Download 200 emails
python gmail_dump.py --max-results 200

# Output to a specific directory
python gmail_dump.py --output-dir ~/gmail_export
```
Gmail API Setup
1. Navigate to “Credentials” in the sidebar
2. Click “+ Create Credentials”
3. Select “OAuth client ID”
4. Choose “Desktop app”
5. Download the JSON credentials
6. Save the credentials as tools/mail/client_secret.json
7. Run gmail_dump.py
8. Complete the OAuth flow in your browser
9. The token is saved to tools/mail/token.json
The Gmail API token expires after about a week. Delete token.json to trigger re-authentication.
Download emails using the Microsoft Graph API:

```shell
# Download from Outlook
cd tools/mail
python outlook_dump.py

# Download a specific number of emails
python outlook_dump.py --max-results 100
```
Microsoft Graph Setup
1. Go to the Azure Portal
2. Navigate to “App registrations”
3. Register a new application
4. Add the “Mail.Read” permission
5. Grant admin consent
6. Create a client secret
7. Set environment variables:
   - OUTLOOK_CLIENT_ID
   - OUTLOOK_CLIENT_SECRET
   - OUTLOOK_TENANT_ID
8. Run outlook_dump.py
Extract emails from mbox archives:

```shell
# Extract from a local mbox file
cd tools/mail
python mbox_dump.py archive.mbox

# Extract to a specific directory
python mbox_dump.py archive.mbox --output-dir ~/mbox_export
```
Mbox files are commonly exported from:
- Thunderbird
- Apple Mail
- Gmail Takeout
- Many email servers
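If you want to see what an mbox extractor works with, Python's stdlib `mailbox` module can read and write the format directly. A self-contained demo (independent of mbox_dump.py) that builds a tiny archive and iterates it:

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage as StdEmailMessage  # stdlib class, aliased to avoid clashing with TypeAgent's

# Build a tiny mbox file for the demo
path = os.path.join(tempfile.mkdtemp(), "archive.mbox")
mb = mailbox.mbox(path)
msg = StdEmailMessage()
msg["From"] = "alice@example.com"
msg["Subject"] = "Hello"
msg.set_content("Hi Bob")
mb.add(msg)
mb.flush()

# Iterate the messages, as an extractor would
for m in mailbox.mbox(path):
    print(m["Subject"])  # Hello
```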
Email Features
Reply Detection
TypeAgent automatically detects and extracts only the latest response from email threads:
```python
from typeagent.emails.email_import import is_reply, get_last_response_in_thread

# Check whether the email is a reply
if is_reply(email_message):
    # Extract only the new content; body_text holds the raw message body
    body = get_last_response_in_thread(body_text)
```
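Reply extraction typically hinges on quote markers such as "On &lt;date&gt;, &lt;person&gt; wrote:". A simplified heuristic sketch of the idea (not TypeAgent's actual logic):

```python
import re

# Match quote-introduction lines like "On Mon, Jan 15, 2024, Alice wrote:"
_QUOTE_MARKER = re.compile(r"^On .+ wrote:\s*$", re.MULTILINE)

def last_response(body: str) -> str:
    """Return the text before the first quote marker, or the whole body."""
    match = _QUOTE_MARKER.search(body)
    return body[:match.start()].rstrip() if match else body

body = "Sounds good, see you then.\n\nOn Mon, Jan 15, 2024, Alice wrote:\n> Can we meet Tuesday?"
print(last_response(body))  # Sounds good, see you then.
```

Real-world extractors handle many more marker styles (localized phrases, "-----Original Message-----", top vs. bottom quoting), which is why using the library function is preferable.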
Forward Detection
```python
from typeagent.emails.email_import import is_forwarded, get_forwarded_email_parts

# Check whether the email is forwarded
if is_forwarded(email_message):
    # Split the raw text into its forwarded parts
    parts = get_forwarded_email_parts(email_text)
```
Encoding Handling
TypeAgent properly handles RFC 2047 encoded words:
```python
from typeagent.emails.email_import import decode_encoded_words

# Decode RFC 2047 encoded headers
subject = decode_encoded_words("=?UTF-8?B?SGVsbG8gV29ybGQ=?=")
print(subject)  # "Hello World"
```
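For reference, Python's stdlib performs the same RFC 2047 decoding via `email.header`, which is handy when inspecting raw headers outside TypeAgent:

```python
from email.header import decode_header, make_header

encoded = "=?UTF-8?B?SGVsbG8gV29ybGQ=?="

# decode_header splits into (bytes, charset) pairs; make_header reassembles them
decoded = str(make_header(decode_header(encoded)))
print(decoded)  # Hello World
```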
Querying Emails
Once ingested, query emails using natural language:
```shell
# Interactive query
python tools/query.py --database emails.db

# Single query
python tools/query.py --database emails.db \
    --query "What emails did Alice send about the project?"
```
Email Query Examples
```python
from typeagent import create_conversation
from typeagent.emails.email_message import EmailMessage

conversation = await create_conversation("emails.db", EmailMessage)

# Who questions
answer = await conversation.query("Who sent emails about the meeting?")
answer = await conversation.query("Who did Alice email yesterday?")

# What questions
answer = await conversation.query("What was discussed in the project emails?")
answer = await conversation.query("What action items were mentioned?")

# When questions
answer = await conversation.query("When was the deadline mentioned?")
answer = await conversation.query("What emails were sent last week?")

# Topic searches
answer = await conversation.query("Find emails about budget approval")
answer = await conversation.query("Show me emails related to deployment")
```
Emails are automatically enriched with semantic knowledge:
```python
# EmailMessage.metadata.get_knowledge() extracts:
# - Entities: people (sender, recipients) and email addresses
# - Actions: "sent email", "received email"
# - Topics: the subject line
# - Relationships: sender -> recipient connections

knowledge = email.metadata.get_knowledge()
print(f"Entities: {len(knowledge.entities)}")
print(f"Actions: {len(knowledge.actions)}")
print(f"Topics: {knowledge.topics}")
```
Email addresses are parsed into entities, and email actions capture the communication between sender and recipients.
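For illustration, Python's stdlib `email.utils` shows how an address header splits into a display name and an address (a sketch of the parsing step, not TypeAgent's internal code):

```python
from email.utils import getaddresses, parseaddr

# A single address with a display name
name, addr = parseaddr("Alice Smith <alice@example.com>")
print(name, addr)  # Alice Smith alice@example.com

# Multiple recipients in one header value
pairs = getaddresses(["bob@example.com, Carol <carol@example.com>"])
print(pairs)  # [('', 'bob@example.com'), ('Carol', 'carol@example.com')]
```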
Email ingestion can take 1-2 seconds per message due to LLM-based knowledge extraction.
Batch Size Configuration
```python
from typeagent.knowpro.convsettings import ConversationSettings

settings = ConversationSettings()

# Raise the concurrent-extraction batch size (default: 4)
settings.semantic_ref_index_settings.batch_size = 8
```
Progress Monitoring
```python
import time

start_time = time.time()
success_count = 0
batch_size = 4

# Assumes `emails` is a list of EmailMessage objects and `semref_collection`
# is the conversation's semantic-ref collection (set up as in earlier examples).
for email in emails:
    await email_memory.add_messages_with_indexing([email])
    success_count += 1

    # Print progress periodically
    if (success_count % batch_size) == 0:
        elapsed = time.time() - start_time
        semref_count = await semref_collection.size()
        print(f"{success_count} imported | "
              f"{semref_count} semrefs | "
              f"{elapsed:.1f}s elapsed")
```
Error Handling
Handle common email ingestion errors:
```python
import traceback

import openai

from typeagent.emails.email_import import import_email_from_file

success_count = 0
failed_count = 0
skipped_count = 0

# Assumes `email_files` yields (source_id, path) pairs and that
# `email_matches_date_filter`, `email_memory`, `storage_provider`,
# `start_date`, `stop_date`, and `verbose` are defined as in the
# earlier examples.
for source_id, email_file in email_files:
    try:
        email = import_email_from_file(str(email_file))

        # Apply the date filter
        if not email_matches_date_filter(email.timestamp, start_date, stop_date):
            skipped_count += 1
            continue

        await email_memory.add_messages_with_indexing(
            [email],
            source_ids=[source_id],
        )
        success_count += 1
    except openai.AuthenticationError as e:
        print(f"Authentication error: {e}")
        break  # Fatal error
    except Exception as e:
        failed_count += 1
        print(f"Error processing {source_id}: {e}")

        # Mark as failed
        async with storage_provider:
            await storage_provider.mark_source_ingested(
                source_id,
                status=e.__class__.__name__,
            )
        if verbose:
            traceback.print_exc()

print(f"\nSuccessfully imported {success_count} emails")
print(f"Skipped {skipped_count} emails (date filter)")
print(f"Failed to import {failed_count} emails")
```
Example: Complete Email Pipeline
Here’s a complete example from download to query:
```shell
#!/bin/bash
# complete_email_pipeline.sh

set -e  # Exit on error

# 1. Download emails from Gmail
echo "Downloading emails..."
cd tools/mail
python gmail_dump.py --max-results 100 --output-dir ../../email_dump
cd ../..

# 2. Ingest emails into the database
echo "Ingesting emails..."
python tools/ingest_email.py \
    -d emails.db \
    email_dump/ \
    --start-date 2024-01-01 \
    --verbose

# 3. Query the database
echo "Database ready for queries!"
echo "Run: python tools/query.py --database emails.db"
```
Next Steps