Deep dive into TypeAgent’s six specialized indexes and their purposes
TypeAgent maintains six specialized indexes that enable different query patterns and access methods. Each index serves a specific purpose and is updated incrementally as new knowledge is extracted.
class TermToSemanticRefIndex(ITermToSemanticRefIndex):
    """Inverted index mapping search terms to the semantic refs that mention them."""

    # Maps lowercase terms to semantic reference ordinals with scores
    _map: dict[str, list[ScoredSemanticRefOrdinal]]
# Add entity name to indexawait semantic_ref_index.add_term( "Alice", # Term semantic_ref_ordinal # Reference to semantic ref)# Stored as:# _map["alice"] = [ScoredSemanticRefOrdinal(42, 1.0)]
# Find all references to "Alice"scored_refs = await semantic_ref_index.lookup_term("Alice")# Returns: [ScoredSemanticRefOrdinal(42, 1.0), ...]# Retrieve actual semantic referencesfor scored_ref in scored_refs: semantic_ref = await semantic_refs.get_item( scored_ref.semantic_ref_ordinal ) # semantic_ref.knowledge - Entity, Action, or Topic # semantic_ref.range - TextRange with message location
# Remove specific reference from termawait semantic_ref_index.remove_term( "Alice", semantic_ref_ordinal)# Removes only that specific ordinal, not the entire term
from typeagent.storage.memory.semrefindex import TermToSemanticRefIndex

class TermToSemanticRefIndex:
    # In-memory dictionary
    _map: dict[str, list[ScoredSemanticRefOrdinal]]
    # Fast lookups: O(1)
    # No persistence
SQLite Implementation
from typeagent.storage.sqlite.semrefindex import SqliteTermToSemanticRefIndex

# Table: SemanticRefIndex
# Columns: term (text), semantic_ref_ordinal (int), score (real)
# Index: CREATE INDEX idx_semref_term ON SemanticRefIndex(term)
#
# Persistent storage
# Indexed queries
# Transaction support
# Find entities named "Alice"scored_refs = await property_index.lookup_property( PropertyNames.EntityName.value, "Alice")# Find actions with "discuss" verbscored_refs = await property_index.lookup_property( PropertyNames.Verb.value, "discuss")# Find actions where Alice is the subjectscored_refs = await property_index.lookup_property( PropertyNames.Subject.value, "Alice")
from typeagent.storage.memory.propindex import (
    lookup_property_in_property_index
)
from typeagent.knowpro.collections import TextRangesInScope

# Only search within a specific time range or thread
ranges_in_scope = TextRangesInScope(...)
scored_refs = await lookup_property_in_property_index(
    property_index,
    PropertyNames.EntityName.value,
    "Alice",
    semantic_refs,
    ranges_in_scope  # Filter results to this scope
)
The PropertyIndex enables structured queries that the SemanticRef index cannot:
# SemanticRef index: "What mentions 'blue'?"results = await semantic_ref_index.lookup_term("blue")# Returns all semantic refs with "blue" anywhere# Property index: "What entities have color=blue facet?"results = await property_index.lookup_property( PropertyNames.FacetValue.value, "blue")# Returns only entities with blue as a facet value# Property index: "What actions did Alice perform?"results = await property_index.lookup_property( PropertyNames.Subject.value, "Alice")# Returns only actions where Alice is the subject
class TimestampToTextRangeIndex(ITimestampToTextRangeIndex):
    """Chronological index from timestamps to the messages sent at those times."""

    # Maps ISO timestamp strings to message ordinals
    _timestamp_to_ordinals: dict[str, list[MessageOrdinal]]
from datetime import datetime, timezone# Find messages in date rangestart = datetime(2024, 1, 15, tzinfo=timezone.utc)end = datetime(2024, 1, 16, tzinfo=timezone.utc)message_ordinals = await timestamp_index.get_messages_in_range( start, end)# Returns: [0, 1] (messages 0 and 1 fall in range)# Retrieve actual messagesfor ordinal in message_ordinals: message = await messages.get_item(ordinal) print(f"{message.timestamp}: {message.text}")
# Get earliest and latest timestampsearliest, latest = await timestamp_index.get_time_bounds()print(f"Conversation spans {earliest} to {latest}")# Output: Conversation spans 2024-01-15T10:30:00Z to 2024-01-16T09:00:00Z
class MessageTextIndex(IMessageTextIndex):
    """Embedding-based index over message text for semantic similarity search."""

    # Message ordinals with their embedding vectors
    _embeddings: list[tuple[MessageOrdinal, np.ndarray]]
    # Model for generating embeddings
    _embedding_model: IEmbeddingModel
# Add terms for fuzzy matchingterms = ["discuss", "talk", "speak", "converse", "chat"]await related_terms_index.fuzzy_index.add_terms(terms)# Each term is embedded and stored
# User searches for "discuss"original_term = "discuss"# Find related termsrelated = await related_terms_index.find_related( original_term, max_distance=0.3)# Search for original term AND related termsall_terms = [original_term] + [term for term, _ in related]# ["discuss", "talk", "speak", "converse"]# Query all variationsfor term in all_terms: results = await semantic_ref_index.lookup_term(term) # Combine results
# Create new threadthread_id = await threads.create_thread( name="Project Discussion", initial_message_ordinal=0)# Add messages to threadawait threads.add_to_thread( thread_id, message_ordinals=[1, 2, 3])
# Retrieve all messages in threadmessage_ordinals = await threads.get_thread_messages(thread_id)# Load actual messagesthread_messages = [ await messages.get_item(ordinal) for ordinal in message_ordinals]
# Which thread does this message belong to?thread_id = await threads.get_message_thread( message_ordinal=5)if thread_id: print(f"Message 5 is in thread {thread_id}")else: print("Message 5 is not in any thread")
The MessageText and RelatedTerms indexes use linear (brute-force) similarity search, so query cost grows with the number of stored embeddings. For large datasets, consider an approximate nearest-neighbor (ANN) index such as FAISS or Annoy.