TypeAgent uses AI models to extract structured knowledge from unstructured conversation text. This process transforms raw messages into queryable entities, actions, topics, and relationships stored in the semantic reference system.

Knowledge Schema

Extracted knowledge follows a strongly-typed schema defined in knowledge_schema.py:

Core Knowledge Types

ConcreteEntity represents specific, tangible people, places, institutions, or things:
@dataclass
class ConcreteEntity:
    knowledge_type: Literal["entity"] = "entity"
    
    name: str  # "Bach", "Great Gatsby", "frog", "piano"
    
    type: list[str]  # ["person", "composer"], ["book"], ["animal"]
    
    facets: list[Facet] | None  # Defining properties

@dataclass
class Facet:
    name: str   # "color", "weight", "sister"
    value: Value  # str | float | bool | Quantity | Quantifier
Examples:
# Person entity
ConcreteEntity(
    name="Alice",
    type=["person", "engineer"],
    facets=[
        Facet(name="role", value="team_lead"),
        Facet(name="experience", value=Quantity(5, "years"))
    ]
)

# Object entity  
ConcreteEntity(
    name="Model_S",
    type=["vehicle", "car"],
    facets=[
        Facet(name="color", value="blue"),
        Facet(name="electric", value=True)
    ]
)

Extraction Process

Knowledge extraction uses TypeChat to translate conversation text into structured schemas:

1. Initialize Knowledge Extractor

from typeagent.knowpro.convknowledge import KnowledgeExtractor
from typeagent.aitools.model_adapters import create_chat_model

# Create extractor with AI model
extractor = KnowledgeExtractor(
    model=create_chat_model(),
    max_chars_per_chunk=2048,
    merge_action_knowledge=False
)

# The extractor creates a TypeChat translator internally
# that converts text to KnowledgeResponse objects

2. Extract from Message Text

import typechat

message = """
Alice discussed the new API design with Bob yesterday. 
The team decided to use GraphQL instead of REST for better flexibility.
Bob suggested implementing rate limiting to prevent abuse.
"""

# Extract knowledge
result = await extractor.extract(message)

if isinstance(result, typechat.Success):
    knowledge: KnowledgeResponse = result.value
    
    # Extracted entities
    for entity in knowledge.entities:
        print(f"Entity: {entity.name} ({entity.type})")
        # Output:
        # Entity: Alice (['person'])
        # Entity: Bob (['person'])
        # Entity: API (['technology'])
    
    # Extracted actions
    for action in knowledge.actions:
        print(f"Action: {action.subject_entity_name} {action.verbs} {action.object_entity_name}")
        # Output:
        # Action: Alice ['discuss'] API_design
        # Action: team ['decide'] GraphQL
        # Action: Bob ['suggest'] rate_limiting
    
    # Extracted topics
    for topic in knowledge.topics:
        print(f"Topic: {topic}")
        # Output:
        # Topic: API design
        # Topic: GraphQL vs REST
        # Topic: rate limiting
else:
    print(f"Extraction failed: {result.message}")

3. Batch Processing

For multiple messages, use batch extraction:
from typeagent.knowpro.knowledge import extract_knowledge_from_text_batch

text_batch = [
    "Alice reviewed the pull request.",
    "Bob deployed the new feature to staging.",
    "The team celebrated the successful launch."
]

knowledge_results = await extract_knowledge_from_text_batch(
    extractor,
    text_batch,
    max_concurrent=3  # Process 3 at a time
)

for i, result in enumerate(knowledge_results):
    if isinstance(result, typechat.Success):
        print(f"Message {i}: {len(result.value.entities)} entities")
    else:
        print(f"Message {i}: extraction failed")
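The `max_concurrent` parameter suggests semaphore-bounded fan-out. A minimal sketch of that pattern with plain asyncio (`extract_batch` and `fake_extract` are illustrative stand-ins, not TypeAgent's implementation):

```python
import asyncio

async def extract_batch(extract, texts, max_concurrent=3):
    # The semaphore caps the number of in-flight extraction calls
    sem = asyncio.Semaphore(max_concurrent)

    async def one(text):
        async with sem:
            return await extract(text)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(t) for t in texts))

async def demo():
    async def fake_extract(text):  # stand-in for extractor.extract
        await asyncio.sleep(0)
        return len(text.split())   # pretend "entity count"

    return await extract_batch(fake_extract, ["a b", "c d e"], max_concurrent=2)

print(asyncio.run(demo()))  # [2, 3]
```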

Indexing Extracted Knowledge

Extracted knowledge is stored in semantic references and indexed for retrieval:

Adding to Semantic Reference Index

from typeagent.storage.memory.semrefindex import (
    add_knowledge_to_semantic_ref_index
)

message_ordinal = 5  # Position in message collection
chunk_ordinal = 0    # Chunk within message

await add_knowledge_to_semantic_ref_index(
    conversation,
    message_ordinal,
    chunk_ordinal,
    knowledge  # KnowledgeResponse from extraction
)
This process:
  1. Creates SemanticRef objects for each entity, action, and topic
  2. Adds terms to SemanticRefIndex for lookup
  3. Updates PropertyIndex with structured properties
  4. Stores text ranges pointing back to original messages

SemanticRef Structure

@dataclass
class SemanticRef:
    semantic_ref_ordinal: int  # Unique ID
    range: TextRange           # Location in messages
    knowledge: Knowledge       # Entity, Action, or Topic

@dataclass  
class TextRange:
    start: TextLocation
    end: TextLocation | None

@dataclass
class TextLocation:
    message_ordinal: int  # Which message
    chunk_ordinal: int    # Which chunk in message

Entity Indexing

async def add_entity(
    entity: ConcreteEntity,
    semantic_refs: ISemanticRefCollection,
    semantic_ref_index: ITermToSemanticRefIndex,
    message_ordinal: int,
    chunk_ordinal: int
) -> None:
    # Create semantic reference
    semantic_ref_ordinal = await semantic_refs.size()
    await semantic_refs.append(
        SemanticRef(
            semantic_ref_ordinal=semantic_ref_ordinal,
            range=TextRange(
                start=TextLocation(message_ordinal, chunk_ordinal),
                end=None
            ),
            knowledge=entity
        )
    )
    
    # Index entity name
    await semantic_ref_index.add_term(
        entity.name,
        semantic_ref_ordinal
    )
    
    # Index each type
    for type_name in entity.type:
        await semantic_ref_index.add_term(
            type_name,
            semantic_ref_ordinal
        )
    
    # Index facets
    if entity.facets:
        for facet in entity.facets:
            await semantic_ref_index.add_term(
                facet.name,
                semantic_ref_ordinal
            )
            await semantic_ref_index.add_term(
                str(facet.value),
                semantic_ref_ordinal
            )

Action Indexing

async def add_action(
    action: Action,
    semantic_refs: ISemanticRefCollection,
    semantic_ref_index: ITermToSemanticRefIndex,
    message_ordinal: int,
    chunk_ordinal: int
) -> None:
    semantic_ref_ordinal = await semantic_refs.size()
    await semantic_refs.append(
        SemanticRef(
            semantic_ref_ordinal=semantic_ref_ordinal,
            range=TextRange(
                start=TextLocation(message_ordinal, chunk_ordinal),
                end=None
            ),
            knowledge=action
        )
    )
    
    # Index verbs
    await semantic_ref_index.add_term(
        " ".join(action.verbs),
        semantic_ref_ordinal
    )
    
    # Index subject, object, indirect object
    if action.subject_entity_name != "none":
        await semantic_ref_index.add_term(
            action.subject_entity_name,
            semantic_ref_ordinal
        )
    
    if action.object_entity_name != "none":
        await semantic_ref_index.add_term(
            action.object_entity_name,
            semantic_ref_ordinal
        )
    
    if action.indirect_object_entity_name != "none":
        await semantic_ref_index.add_term(
            action.indirect_object_entity_name,
            semantic_ref_ordinal
        )

Property Index Population

Structured properties are separately indexed for precise queries:
from typeagent.storage.memory.propindex import (
    add_entity_properties_to_index,
    add_action_properties_to_index,
    PropertyNames
)

# Entity properties
await property_index.add_property(
    PropertyNames.EntityName.value,  # "name"
    entity.name,
    semantic_ref_ordinal
)

for type_name in entity.type:
    await property_index.add_property(
        PropertyNames.EntityType.value,  # "type"
        type_name,
        semantic_ref_ordinal
    )

if entity.facets:
    for facet in entity.facets:
        await property_index.add_property(
            PropertyNames.FacetName.value,   # "facet.name"
            facet.name,
            semantic_ref_ordinal
        )
        await property_index.add_property(
            PropertyNames.FacetValue.value,  # "facet.value"
            str(facet.value),
            semantic_ref_ordinal
        )

# Action properties
await property_index.add_property(
    PropertyNames.Verb.value,        # "verb"
    " ".join(action.verbs),
    semantic_ref_ordinal
)

await property_index.add_property(
    PropertyNames.Subject.value,     # "subject"
    action.subject_entity_name,
    semantic_ref_ordinal
)

await property_index.add_property(
    PropertyNames.Object.value,      # "object"
    action.object_entity_name,
    semantic_ref_ordinal
)
Property keys take the form "prop.{property_name}@@{value}" ("@@" is the delimiter) and are lowercased for case-insensitive matching.
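The key format can be illustrated with a small helper (the function name is hypothetical; the "prop." prefix, "@@" delimiter, and lowercasing are from the description above):

```python
def make_property_key(property_name: str, value: str) -> str:
    # "prop.{property_name}@@{value}", lowercased for
    # case-insensitive matching
    return f"prop.{property_name}@@{value}".lower()

print(make_property_key("name", "Alice"))    # prop.name@@alice
print(make_property_key("subject", "Bob"))   # prop.subject@@bob
```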

Incremental Indexing

TypeAgent supports incremental knowledge extraction as new messages arrive:
from typeagent.knowpro.conversation_base import ConversationBase

# Add messages with automatic knowledge extraction and indexing
result = await conversation.add_messages_with_indexing(
    messages=[msg1, msg2, msg3],
    source_ids=["email_123", "email_124", "email_125"]
)

print(f"Added {result.messages_added} messages")
print(f"Created {result.semrefs_added} semantic references")
The incremental process:
  1. Metadata extraction: Extract basic knowledge from message metadata
  2. LLM extraction: Extract entities, actions, topics using AI (if enabled)
  3. Update SemanticRefIndex: Add new terms and references
  4. Update PropertyIndex: Add structured properties
  5. Update secondary indexes: Timestamps, embeddings, related terms
All index updates happen within a transaction (for SQLite) or as a sequence (for memory storage). If any step fails, SQLite storage rolls back all changes.
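The all-or-nothing behavior described for SQLite storage can be sketched with the stdlib sqlite3 module (illustrative only, not TypeAgent's storage layer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE semrefs (ordinal INTEGER PRIMARY KEY, term TEXT)")

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("INSERT INTO semrefs VALUES (0, 'alice')")
        raise RuntimeError("simulated index-update failure")
except RuntimeError:
    pass

# The failed step rolled back the insert as well
print(con.execute("SELECT COUNT(*) FROM semrefs").fetchone()[0])  # 0
```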

Extraction Configuration

Control knowledge extraction behavior through settings:
from typeagent.knowpro.convsettings import (
    SemanticRefIndexSettings,
    ConversationSettings
)

semantic_settings = SemanticRefIndexSettings(
    auto_extract_knowledge=True,      # Enable LLM extraction
    batch_size=10,                    # Process 10 chunks per batch
    knowledge_extractor=extractor    # Custom extractor
)

conversation_settings = ConversationSettings(
    semantic_ref_index_settings=semantic_settings,
    # ... other settings
)

conversation = await ConversationBase.create(
    settings=conversation_settings,
    name="my_conversation"
)

Metadata vs LLM Extraction

Metadata extraction derives knowledge from the message structure itself:
class IMessage(Protocol):
    def get_knowledge(self) -> KnowledgeResponse:
        # Returns entities, actions, topics from:
        # - Message sender/participants
        # - Subject lines
        # - Metadata fields
        # - Structured data
        pass
Fast, reliable, no API calls needed.
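A sketch of how a message type might implement metadata-based `get_knowledge` (the `EmailMessage` class and its fields are illustrative, and the schema types are simplified stand-ins):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:                  # simplified stand-in for ConcreteEntity
    name: str
    type: list[str]

@dataclass
class KnowledgeResponse:       # simplified stand-in for the real response
    entities: list[Entity] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)

@dataclass
class EmailMessage:
    sender: str
    recipients: list[str]
    subject: str

    def get_knowledge(self) -> KnowledgeResponse:
        # Participants become person entities; the subject line
        # becomes a topic -- no model call required
        entities = [Entity(p, ["person"])
                    for p in [self.sender, *self.recipients]]
        return KnowledgeResponse(entities=entities, topics=[self.subject])

msg = EmailMessage("alice@example.com", ["bob@example.com"], "API design")
k = msg.get_knowledge()
print([e.name for e in k.entities], k.topics)
```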
LLM extraction uses AI models to understand the content:
# Requires:
# - AI model (OpenAI, Azure, local LLM)
# - API credentials
# - Network access

# Benefits:
# - Understands natural language
# - Extracts implicit relationships
# - Identifies sentiment and nuance
# - Better entity disambiguation
Slower but much more comprehensive.

Next Steps

Indexing

Learn about the six specialized indexes

Structured RAG

Understand how extracted knowledge powers queries

API Reference

Explore the knowledge extraction API

Architecture

See how extraction fits in the overall system
