TypeAgent uses AI models to extract structured knowledge from unstructured conversation text. This process transforms raw messages into queryable entities, actions, topics, and relationships stored in the semantic reference system.

Knowledge Schema

Extracted knowledge follows a strongly-typed schema defined in knowledge_schema.py:

Core Knowledge Types

ConcreteEntity represents specific, tangible people, places, institutions, or things:
@dataclass
class ConcreteEntity:
    knowledge_type: Literal["entity"] = "entity"
    
    name: str  # "Bach", "Great Gatsby", "frog", "piano"
    
    type: list[str]  # ["person", "composer"], ["book"], ["animal"]
    
    facets: list[Facet] | None  # Defining properties

@dataclass
class Facet:
    name: str   # "color", "weight", "sister"
    value: Value  # str | float | bool | Quantity | Quantifier
Examples:
# Person entity
ConcreteEntity(
    name="Alice",
    type=["person", "engineer"],
    facets=[
        Facet(name="role", value="team_lead"),
        Facet(name="experience", value=Quantity(5, "years"))
    ]
)

# Object entity  
ConcreteEntity(
    name="Model_S",
    type=["vehicle", "car"],
    facets=[
        Facet(name="color", value="blue"),
        Facet(name="electric", value=True)
    ]
)

Extraction Process

Knowledge extraction uses TypeChat to translate conversation text into structured schemas:

1. Initialize Knowledge Extractor

from typeagent.knowpro.convknowledge import KnowledgeExtractor
from typeagent.aitools.model_adapters import create_chat_model

# Create extractor with AI model
extractor = KnowledgeExtractor(
    model=create_chat_model(),
    max_chars_per_chunk=2048,
    merge_action_knowledge=False
)

# The extractor creates a TypeChat translator internally
# that converts text to KnowledgeResponse objects

2. Extract from Message Text

import typechat

message = """
Alice discussed the new API design with Bob yesterday. 
The team decided to use GraphQL instead of REST for better flexibility.
Bob suggested implementing rate limiting to prevent abuse.
"""

# Extract knowledge
result = await extractor.extract(message)

if isinstance(result, typechat.Success):
    knowledge: KnowledgeResponse = result.value
    
    # Extracted entities
    for entity in knowledge.entities:
        print(f"Entity: {entity.name} ({entity.type})")
        # Output:
        # Entity: Alice (['person'])
        # Entity: Bob (['person'])
        # Entity: API (['technology'])
    
    # Extracted actions
    for action in knowledge.actions:
        print(f"Action: {action.subject_entity_name} {action.verbs} {action.object_entity_name}")
        # Output:
        # Action: Alice ['discuss'] API_design
        # Action: team ['decide'] GraphQL
        # Action: Bob ['suggest'] rate_limiting
    
    # Extracted topics
    for topic in knowledge.topics:
        print(f"Topic: {topic}")
        # Output:
        # Topic: API design
        # Topic: GraphQL vs REST
        # Topic: rate limiting
else:
    print(f"Extraction failed: {result.message}")

3. Batch Processing

For multiple messages, use batch extraction:
from typeagent.knowpro.knowledge import extract_knowledge_from_text_batch

text_batch = [
    "Alice reviewed the pull request.",
    "Bob deployed the new feature to staging.",
    "The team celebrated the successful launch."
]

knowledge_results = await extract_knowledge_from_text_batch(
    extractor,
    text_batch,
    max_concurrent=3  # Process 3 at a time
)

for i, result in enumerate(knowledge_results):
    if isinstance(result, typechat.Success):
        print(f"Message {i}: {len(result.value.entities)} entities")
    else:
        print(f"Message {i}: extraction failed")
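The `max_concurrent` parameter suggests semaphore-bounded fan-out. A minimal sketch of that pattern with plain asyncio (`extract_batch` and `fake_extract` are illustrative stand-ins, not TypeAgent's implementation):

```python
import asyncio

async def extract_batch(extract, texts, max_concurrent=3):
    # The semaphore caps the number of in-flight extraction calls
    sem = asyncio.Semaphore(max_concurrent)

    async def one(text):
        async with sem:
            return await extract(text)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(t) for t in texts))

async def demo():
    async def fake_extract(text):  # stand-in for extractor.extract
        await asyncio.sleep(0)
        return len(text.split())   # pretend "entity count"

    return await extract_batch(fake_extract, ["a b", "c d e"], max_concurrent=2)

print(asyncio.run(demo()))  # [2, 3]
```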

Indexing Extracted Knowledge

Extracted knowledge is stored in semantic references and indexed for retrieval:

Adding to Semantic Reference Index

from typeagent.storage.memory.semrefindex import (
    add_knowledge_to_semantic_ref_index
)

message_ordinal = 5  # Position in message collection
chunk_ordinal = 0    # Chunk within message

await add_knowledge_to_semantic_ref_index(
    conversation,
    message_ordinal,
    chunk_ordinal,
    knowledge  # KnowledgeResponse from extraction
)
This process:
  1. Creates SemanticRef objects for each entity, action, and topic
  2. Adds terms to SemanticRefIndex for lookup
  3. Updates PropertyIndex with structured properties
  4. Stores text ranges pointing back to original messages

SemanticRef Structure

@dataclass
class SemanticRef:
    semantic_ref_ordinal: int  # Unique ID
    range: TextRange           # Location in messages
    knowledge: Knowledge       # Entity, Action, or Topic

@dataclass  
class TextRange:
    start: TextLocation
    end: TextLocation | None

@dataclass
class TextLocation:
    message_ordinal: int  # Which message
    chunk_ordinal: int    # Which chunk in message

Entity Indexing

async def add_entity(
    entity: ConcreteEntity,
    semantic_refs: ISemanticRefCollection,
    semantic_ref_index: ITermToSemanticRefIndex,
    message_ordinal: int,
    chunk_ordinal: int
) -> None:
    # Create semantic reference
    semantic_ref_ordinal = await semantic_refs.size()
    await semantic_refs.append(
        SemanticRef(
            semantic_ref_ordinal=semantic_ref_ordinal,
            range=TextRange(
                start=TextLocation(message_ordinal, chunk_ordinal),
                end=None
            ),
            knowledge=entity
        )
    )
    
    # Index entity name
    await semantic_ref_index.add_term(
        entity.name,
        semantic_ref_ordinal
    )
    
    # Index each type
    for type_name in entity.type:
        await semantic_ref_index.add_term(
            type_name,
            semantic_ref_ordinal
        )
    
    # Index facets
    if entity.facets:
        for facet in entity.facets:
            await semantic_ref_index.add_term(
                facet.name,
                semantic_ref_ordinal
            )
            await semantic_ref_index.add_term(
                str(facet.value),
                semantic_ref_ordinal
            )

Action Indexing

async def add_action(
    action: Action,
    semantic_refs: ISemanticRefCollection,
    semantic_ref_index: ITermToSemanticRefIndex,
    message_ordinal: int,
    chunk_ordinal: int
) -> None:
    semantic_ref_ordinal = await semantic_refs.size()
    await semantic_refs.append(
        SemanticRef(
            semantic_ref_ordinal=semantic_ref_ordinal,
            range=TextRange(
                start=TextLocation(message_ordinal, chunk_ordinal),
                end=None
            ),
            knowledge=action
        )
    )
    
    # Index verbs
    await semantic_ref_index.add_term(
        " ".join(action.verbs),
        semantic_ref_ordinal
    )
    
    # Index subject, object, indirect object
    if action.subject_entity_name != "none":
        await semantic_ref_index.add_term(
            action.subject_entity_name,
            semantic_ref_ordinal
        )
    
    if action.object_entity_name != "none":
        await semantic_ref_index.add_term(
            action.object_entity_name,
            semantic_ref_ordinal
        )
    
    if action.indirect_object_entity_name != "none":
        await semantic_ref_index.add_term(
            action.indirect_object_entity_name,
            semantic_ref_ordinal
        )

Property Index Population

Structured properties are separately indexed for precise queries:
from typeagent.storage.memory.propindex import (
    add_entity_properties_to_index,
    add_action_properties_to_index,
    PropertyNames
)

# Entity properties
await property_index.add_property(
    PropertyNames.EntityName.value,  # "name"
    entity.name,
    semantic_ref_ordinal
)

for type_name in entity.type:
    await property_index.add_property(
        PropertyNames.EntityType.value,  # "type"
        type_name,
        semantic_ref_ordinal
    )

if entity.facets:
    for facet in entity.facets:
        await property_index.add_property(
            PropertyNames.FacetName.value,   # "facet.name"
            facet.name,
            semantic_ref_ordinal
        )
        await property_index.add_property(
            PropertyNames.FacetValue.value,  # "facet.value"
            str(facet.value),
            semantic_ref_ordinal
        )

# Action properties
await property_index.add_property(
    PropertyNames.Verb.value,        # "verb"
    " ".join(action.verbs),
    semantic_ref_ordinal
)

await property_index.add_property(
    PropertyNames.Subject.value,     # "subject"
    action.subject_entity_name,
    semantic_ref_ordinal
)

await property_index.add_property(
    PropertyNames.Object.value,      # "object"
    action.object_entity_name,
    semantic_ref_ordinal
)
Property keys take the form "prop.{property_name}@@{value}" ("@@" is the delimiter) and are lowercased for case-insensitive matching.
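The key format can be illustrated with a small helper (the function name is hypothetical; the "prop." prefix, "@@" delimiter, and lowercasing are from the description above):

```python
def make_property_key(property_name: str, value: str) -> str:
    # "prop.{property_name}@@{value}", lowercased for
    # case-insensitive matching
    return f"prop.{property_name}@@{value}".lower()

print(make_property_key("name", "Alice"))    # prop.name@@alice
print(make_property_key("subject", "Bob"))   # prop.subject@@bob
```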

Incremental Indexing

TypeAgent supports incremental knowledge extraction as new messages arrive:
from typeagent.knowpro.conversation_base import ConversationBase

# Add messages with automatic knowledge extraction and indexing
result = await conversation.add_messages_with_indexing(
    messages=[msg1, msg2, msg3],
    source_ids=["email_123", "email_124", "email_125"]
)

print(f"Added {result.messages_added} messages")
print(f"Created {result.semrefs_added} semantic references")
The incremental process:
  1. Metadata extraction: Extract basic knowledge from message metadata
  2. LLM extraction: Extract entities, actions, topics using AI (if enabled)
  3. Update SemanticRefIndex: Add new terms and references
  4. Update PropertyIndex: Add structured properties
  5. Update secondary indexes: Timestamps, embeddings, related terms
All index updates happen within a transaction (for SQLite) or as a sequence (for memory storage). If any step fails, SQLite storage rolls back all changes.
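The all-or-nothing behavior described for SQLite storage can be sketched with the stdlib sqlite3 module (illustrative only, not TypeAgent's storage layer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE semrefs (ordinal INTEGER PRIMARY KEY, term TEXT)")

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("INSERT INTO semrefs VALUES (0, 'alice')")
        raise RuntimeError("simulated index-update failure")
except RuntimeError:
    pass

# The failed step rolled back the insert as well
print(con.execute("SELECT COUNT(*) FROM semrefs").fetchone()[0])  # 0
```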

Extraction Configuration

Control knowledge extraction behavior through settings:
from typeagent.knowpro.convsettings import (
    SemanticRefIndexSettings,
    ConversationSettings
)

semantic_settings = SemanticRefIndexSettings(
    auto_extract_knowledge=True,      # Enable LLM extraction
    batch_size=10,                    # Process 10 chunks per batch
    knowledge_extractor=extractor    # Custom extractor
)

conversation_settings = ConversationSettings(
    semantic_ref_index_settings=semantic_settings,
    # ... other settings
)

conversation = await ConversationBase.create(
    settings=conversation_settings,
    name="my_conversation"
)

Metadata vs LLM Extraction

Metadata extraction derives knowledge from the message structure itself:
class IMessage(Protocol):
    def get_knowledge(self) -> KnowledgeResponse:
        # Returns entities, actions, topics from:
        # - Message sender/participants
        # - Subject lines
        # - Metadata fields
        # - Structured data
        pass
Fast, reliable, no API calls needed.
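A sketch of how a message type might implement metadata-based `get_knowledge` (the `EmailMessage` class and its fields are illustrative, and the schema types are simplified stand-ins):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:                  # simplified stand-in for ConcreteEntity
    name: str
    type: list[str]

@dataclass
class KnowledgeResponse:       # simplified stand-in for the real response
    entities: list[Entity] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)

@dataclass
class EmailMessage:
    sender: str
    recipients: list[str]
    subject: str

    def get_knowledge(self) -> KnowledgeResponse:
        # Participants become person entities; the subject line
        # becomes a topic -- no model call required
        entities = [Entity(p, ["person"])
                    for p in [self.sender, *self.recipients]]
        return KnowledgeResponse(entities=entities, topics=[self.subject])

msg = EmailMessage("alice@example.com", ["bob@example.com"], "API design")
k = msg.get_knowledge()
print([e.name for e in k.entities], k.topics)
```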
LLM extraction uses AI models to understand the content:
# Requires:
# - AI model (OpenAI, Azure, local LLM)
# - API credentials
# - Network access

# Benefits:
# - Understands natural language
# - Extracts implicit relationships
# - Identifies sentiment and nuance
# - Better entity disambiguation
Slower but much more comprehensive.

Next Steps

Indexing

Learn about the six specialized indexes

Structured RAG

Understand how extracted knowledge powers queries

API Reference

Explore the knowledge extraction API

Architecture

See how extraction fits in the overall system
