TypeAgent provides comprehensive email integration for ingesting and querying email conversations from multiple sources.

Email Workflow Overview

Working with emails follows a three-step workflow:
1. Download Emails: fetch raw .eml files from your email provider.
2. Ingest Emails: parse and index emails into a TypeAgent database.
3. Query Emails: ask natural language questions about your email content.

Email Message Format

TypeAgent represents emails using the EmailMessage class:
from typeagent.emails.email_message import EmailMessage, EmailMessageMeta

message = EmailMessage(
    text_chunks=["Subject: Project Update", "The project is on track..."],
    metadata=EmailMessageMeta(
        sender="alice@example.com",
        recipients=["bob@example.com", "carol@example.com"],
        cc=["dana@example.com"],
        subject="Project Update",
        id="<unique-id@example.com>"
    ),
    timestamp="2024-01-15T10:30:00Z",
    src_url="/path/to/email.eml"
)

Metadata Structure

Email metadata includes:
  • sender: From address
  • recipients: To addresses (list)
  • cc: CC addresses (list)
  • bcc: BCC addresses (list, if available)
  • subject: Email subject line
  • id: Message-ID header
  • timestamp: ISO 8601 timestamp
  • src_url: Source file path or identifier

Importing Email Files

TypeAgent provides utilities for importing .eml files:
from typeagent.emails.email_import import import_email_from_file

# Import single email
email = import_email_from_file("message.eml")

print(f"From: {email.metadata.sender}")
print(f"To: {', '.join(email.metadata.recipients)}")
print(f"Subject: {email.metadata.subject}")
print(f"Body chunks: {len(email.text_chunks)}")

Import from Directory

from typeagent.emails.email_import import import_emails_from_dir

# Import all .eml files from directory
for email in import_emails_from_dir("inbox_dump"):
    print(f"Imported: {email.metadata.subject}")

Import from String

from typeagent.emails.email_import import import_email_string

# Import from MIME string
with open("message.eml", "r") as f:
    mime_string = f.read()

email = import_email_string(mime_string)

Email Ingestion Tool

The ingest_email.py tool provides a complete email ingestion pipeline:
# Basic ingestion
python tools/ingest_email.py -d emails.db inbox_dump/

# Ingest specific files
python tools/ingest_email.py -d emails.db msg1.eml msg2.eml

# Verbose output
python tools/ingest_email.py -d emails.db inbox_dump/ --verbose

Date Filtering

Filter emails by date range:
# Ingest only January 2024 emails
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --start-date 2024-01-01 \
    --stop-date 2024-02-01

# Date range is [start, stop) - start inclusive, stop exclusive
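The same half-open convention is easy to replicate when pre-filtering files yourself; a minimal sketch, where `matches_date_range` is a hypothetical helper and not the tool's actual code:

```python
from datetime import datetime

def matches_date_range(timestamp: str, start: str, stop: str) -> bool:
    """True when the timestamp's date falls in [start, stop): start inclusive, stop exclusive."""
    day = datetime.fromisoformat(timestamp.replace("Z", "+00:00")).date()
    return (datetime.fromisoformat(start).date() <= day
            < datetime.fromisoformat(stop).date())

print(matches_date_range("2024-01-15T10:30:00Z", "2024-01-01", "2024-02-01"))  # True
print(matches_date_range("2024-02-01T00:00:00Z", "2024-01-01", "2024-02-01"))  # False
```

Because the stop date is exclusive, `--stop-date 2024-02-01` captures everything through the last instant of January 31.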

Pagination

Process emails in batches:
# Ingest first 20 emails
python tools/ingest_email.py -d emails.db inbox_dump/ --limit 20

# Skip first 100, process next 50
python tools/ingest_email.py -d emails.db inbox_dump/ \
    --offset 100 \
    --limit 50

Filter Pipeline

The ingestion tool applies filters in this order:
1. Offset/Limit Slicing: slice the input file list, files[offset:offset+limit]
2. Already-Ingested Check: skip emails that were previously ingested
3. Date Range Filter: filter by --start-date and --stop-date
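The ordering matters: slicing happens before the other checks, so --limit counts files considered, not files actually ingested. A sketch of the same flow, where `filter_pipeline` and the `keep` predicate are illustrative stand-ins rather than the tool's implementation:

```python
def filter_pipeline(files, offset=0, limit=None, already_ingested=(), keep=lambda f: True):
    """Apply the three ingestion filters in order (stand-in predicates)."""
    # 1. Offset/limit slicing
    end = None if limit is None else offset + limit
    sliced = files[offset:end]
    # 2. Already-ingested check
    fresh = [f for f in sliced if f not in already_ingested]
    # 3. Date range filter (keep stands in for the --start/--stop-date check)
    return [f for f in fresh if keep(f)]

print(filter_pipeline(["a.eml", "b.eml", "c.eml", "d.eml"],
                      offset=1, limit=2, already_ingested={"b.eml"}))
# ['c.eml']
```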

Programmatic Email Ingestion

Create a custom ingestion pipeline:
import asyncio
from pathlib import Path
from dotenv import load_dotenv

from typeagent.emails.email_import import import_email_from_file
from typeagent.emails.email_memory import EmailMemory
from typeagent.emails.email_message import EmailMessage
from typeagent.knowpro.convsettings import ConversationSettings
from typeagent.storage.utils import create_storage_provider

load_dotenv()

async def ingest_emails():
    # Create settings
    settings = ConversationSettings()
    
    # Create storage provider
    settings.storage_provider = await create_storage_provider(
        settings.message_text_index_settings,
        settings.related_term_index_settings,
        "emails.db",
        EmailMessage
    )
    
    # Create email memory
    email_memory = await EmailMemory.create(settings)
    
    # Process email files
    email_dir = Path("inbox_dump")
    for email_file in email_dir.glob("*.eml"):
        source_id = str(email_file)
        
        # Skip if already ingested
        if await settings.storage_provider.is_source_ingested(source_id):
            print(f"Skipping {email_file.name} (already ingested)")
            continue
        
        try:
            # Import and ingest
            email = import_email_from_file(str(email_file))
            await email_memory.add_messages_with_indexing(
                [email],
                source_ids=[source_id]
            )
            print(f"Ingested {email_file.name}")
        
        except Exception as e:
            print(f"Failed to ingest {email_file.name}: {e}")
            # Mark as failed
            async with settings.storage_provider:
                await settings.storage_provider.mark_source_ingested(
                    source_id,
                    status=e.__class__.__name__
                )

if __name__ == "__main__":
    asyncio.run(ingest_emails())

Downloading Emails

TypeAgent includes tools for downloading emails from various sources. To download emails using the Gmail API:
# Download 50 most recent emails (default)
cd tools/mail
python gmail_dump.py

# Download 200 emails
python gmail_dump.py --max-results 200

# Output to specific directory
python gmail_dump.py --output-dir ~/gmail_export

Gmail API Setup

1. Create Google Cloud App
   • Go to the Google Cloud Console
   • Create a new project
   • Enable the Gmail API
2. Create OAuth Client
   • Navigate to “Credentials” in the sidebar
   • Click “+ Create Credentials”
   • Select “OAuth client ID”
   • Choose “Desktop app”
   • Download the JSON credentials
3. Configure Tool
   • Save the credentials as tools/mail/client_secret.json
   • Run gmail_dump.py
   • Complete the OAuth flow in your browser
   • The token is saved to tools/mail/token.json

Note: The Gmail API token expires after about a week. Delete token.json to trigger re-authentication.

    Email Features

    Reply Detection

    TypeAgent automatically detects and extracts only the latest response from email threads:
    from typeagent.emails.email_import import is_reply, get_last_response_in_thread
    
    # Check if the email is a reply; body_text holds the raw message body string
    if is_reply(email_message):
        # Extract only the new content above the quoted history
        body = get_last_response_in_thread(body_text)
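The idea behind extracting the latest response can be approximated in a few lines; this naive sketch (not the library's implementation) cuts at a common "On ... wrote:" marker and drops quoted lines:

```python
import re

def last_response(body: str) -> str:
    """Naively keep only the text above the first quoted-reply marker."""
    marker = re.search(r"^On .+ wrote:$", body, flags=re.MULTILINE)
    top = body[: marker.start()] if marker else body
    # Drop any '>'-quoted lines that remain
    lines = [ln for ln in top.splitlines() if not ln.startswith(">")]
    return "\n".join(lines).strip()

body = "Thanks, see you then.\n\nOn Mon, Jan 15, Alice wrote:\n> Are we still on for 3pm?"
print(last_response(body))  # Thanks, see you then.
```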
    

    Forward Detection

    from typeagent.emails.email_import import is_forwarded, get_forwarded_email_parts
    
    # Check if the email is forwarded; email_text holds the raw message text
    if is_forwarded(email_message):
        # Split into the original and forwarded parts
        parts = get_forwarded_email_parts(email_text)
    

    Encoding Handling

    TypeAgent properly handles RFC 2047 encoded words:
    from typeagent.emails.email_import import decode_encoded_words
    
    # Decode encoded headers
    subject = decode_encoded_words("=?UTF-8?B?SGVsbG8gV29ybGQ=?=")
    print(subject)  # "Hello World"
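Python's standard library performs the same RFC 2047 decoding via email.header, which makes a handy cross-check:

```python
from email.header import decode_header

def decode_words(value: str) -> str:
    # decode_header yields (fragment, charset) pairs; bytes fragments need decoding
    parts = []
    for fragment, charset in decode_header(value):
        if isinstance(fragment, bytes):
            fragment = fragment.decode(charset or "ascii")
        parts.append(fragment)
    return "".join(parts)

print(decode_words("=?UTF-8?B?SGVsbG8gV29ybGQ=?="))  # Hello World
```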
    

    Querying Emails

    Once ingested, query emails using natural language:
    # Interactive query
    python tools/query.py --database emails.db
    
    # Single query
    python tools/query.py --database emails.db \
        --query "What emails did Alice send about the project?"
    

    Email Query Examples

    from typeagent import create_conversation
    from typeagent.emails.email_message import EmailMessage
    
    conversation = await create_conversation("emails.db", EmailMessage)
    
    # Who questions
    answer = await conversation.query("Who sent emails about the meeting?")
    answer = await conversation.query("Who did Alice email yesterday?")
    
    # What questions  
    answer = await conversation.query("What was discussed in the project emails?")
    answer = await conversation.query("What action items were mentioned?")
    
    # When questions
    answer = await conversation.query("When was the deadline mentioned?")
    answer = await conversation.query("What emails were sent last week?")
    
    # Topic searches
    answer = await conversation.query("Find emails about budget approval")
    answer = await conversation.query("Show me emails related to deployment")
    

    Knowledge Extraction from Emails

    Emails are automatically enriched with semantic knowledge:
    # EmailMessage.metadata.get_knowledge() extracts:
    # - Entities: People (sender, recipients), email addresses
    # - Actions: "sent email", "received email"  
    # - Topics: Subject line
    # - Relationships: sender -> recipient connections
    
    knowledge = email.metadata.get_knowledge()
    print(f"Entities: {len(knowledge.entities)}")
    print(f"Actions: {len(knowledge.actions)}")
    print(f"Topics: {knowledge.topics}")
    

    Entity Extraction

    Email addresses are parsed into entities:
    # "Alice Smith <alice@example.com>" becomes:
    # - Entity: "Alice Smith" (type: person)
    #   - Facet: email_address = alice@example.com
    # - Entity: "alice@example.com" (type: email_address, alias)
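The name/address split shown above mirrors what the standard library's email.utils.parseaddr does; a sketch of the parsing idea, not TypeAgent's extractor:

```python
from email.utils import parseaddr

name, address = parseaddr("Alice Smith <alice@example.com>")
print(name)     # Alice Smith
print(address)  # alice@example.com
```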
    

    Action Extraction

    Email actions capture communication:
    # For email from alice@example.com to bob@example.com:
    # - Action: "Alice Smith" sent email to "Bob Jones"
    # - Action: "alice@example.com" sent email to "bob@example.com"
    # - Action: "Bob Jones" received email from "Alice Smith"
    # - Action: "bob@example.com" received email from "alice@example.com"
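Producing those symmetric sent/received pairs is mechanical; a small sketch, where `email_actions` is a hypothetical helper rather than TypeAgent's code:

```python
def email_actions(sender: str, recipients: list[str]) -> list[tuple[str, str, str]]:
    """Emit (subject, verb, object) triples for each sender/recipient pair."""
    triples = []
    for recipient in recipients:
        triples.append((sender, "sent email to", recipient))
        triples.append((recipient, "received email from", sender))
    return triples

for triple in email_actions("alice@example.com", ["bob@example.com"]):
    print(triple)
```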
    

    Performance Tuning

    Email ingestion can take 1-2 seconds per message due to LLM-based knowledge extraction.

    Batch Size Configuration

    from typeagent.knowpro.convsettings import ConversationSettings
    
    settings = ConversationSettings()
    
    # Adjust concurrent extraction (default: 4)
    settings.semantic_ref_index_settings.batch_size = 4
    

    Progress Monitoring

    import time
    
    start_time = time.time()
    success_count = 0
    batch_size = 4
    
    for i, email in enumerate(emails):
        await email_memory.add_messages_with_indexing([email])
        success_count += 1
        
        # Print progress periodically; semref_collection is the conversation's
        # semantic reference collection (obtained from the storage provider)
        if (success_count % batch_size) == 0:
            elapsed = time.time() - start_time
            semref_count = await semref_collection.size()
            print(f"{success_count} imported | "
                  f"{semref_count} semrefs | "
                  f"{elapsed:.1f}s elapsed")
    

    Error Handling

    Handle common email ingestion errors:
    import traceback
    import openai
    
    success_count = 0
    failed_count = 0
    skipped_count = 0
    
    for source_id, email_file in email_files:
        try:
            email = import_email_from_file(str(email_file))
            
            # Apply date filter
            if not email_matches_date_filter(
                email.timestamp,
                start_date,
                stop_date
            ):
                skipped_count += 1
                continue
            
            await email_memory.add_messages_with_indexing(
                [email],
                source_ids=[source_id]
            )
            success_count += 1
        
        except openai.AuthenticationError as e:
            print(f"Authentication error: {e}")
            break  # Fatal error
        
        except Exception as e:
            failed_count += 1
            print(f"Error processing {source_id}: {e}")
            
            # Mark as failed
            async with storage_provider:
                await storage_provider.mark_source_ingested(
                    source_id,
                    status=e.__class__.__name__
                )
            
            if verbose:
                traceback.print_exc()
    
    print(f"\nSuccessfully imported {success_count} emails")
    print(f"Skipped {skipped_count} emails (date filter)")
    print(f"Failed to import {failed_count} emails")
    

    Example: Complete Email Pipeline

    Here’s a complete example from download to query:
    #!/bin/bash
    # complete_email_pipeline.sh
    
    set -e  # Exit on error
    
    # 1. Download emails from Gmail
    echo "Downloading emails..."
    cd tools/mail
    python gmail_dump.py --max-results 100 --output-dir ../../email_dump
    cd ../..
    
    # 2. Ingest emails into database
    echo "Ingesting emails..."
    python tools/ingest_email.py \
        -d emails.db \
        email_dump/ \
        --start-date 2024-01-01 \
        --verbose
    
    # 3. Query the database
    echo "Database ready for queries!"
    echo "Run: python tools/query.py --database emails.db"
    
