Overview
The TextUnit class represents a chunk of text from a source document. Text units are the atomic pieces of text from which entities, relationships, and claims are extracted. They serve as the bridge between the original source documents and the knowledge graph.
Each text unit maintains links to the entities, relationships, and covariates (claims) that were extracted from it, enabling source attribution and context retrieval.
Text units inherit from the Identified base class, which provides id and short_id fields.
Schema
Core fields
id: Unique identifier for the text unit.
short_id: Human-readable ID used to refer to this text unit in prompts or in text displayed to users.
text: The text content of the unit, that is, the chunk taken from the source document.
Relationships
entity_ids: List of IDs of the entities that were extracted from or mentioned in this text unit. Links the text to entities in the knowledge graph.
relationship_ids: List of IDs of the relationships that were extracted from this text unit. Links the text to relationships in the knowledge graph.
covariate_ids: Dictionary mapping covariate types to lists of covariate IDs. For example, {"claim": ["claim1", "claim2"]} indicates which claims were extracted from this text unit.
Document reference
document_id: ID of the source document from which this text unit was extracted. Enables tracing back to the original document.
Metadata
n_tokens: The number of tokens in the text. Used for chunking strategies, cost estimation, and context window management.
attributes: A dictionary of additional attributes associated with the text unit. May include:
- chunk_id: Position of this chunk in the document
- page_number: Page number in the source document
- section: Section or chapter name
- Custom metadata specific to your use case
Example
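As a minimal sketch of what a populated text unit looks like, the block below defines stand-in dataclasses that mirror the fields described in the schema above. These stand-ins are assumptions made so the example runs on its own; they are not the library's class definitions, and the real field types and defaults may differ. All IDs and values are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Identified:
    """Stand-in for the base class that provides id and short_id."""
    id: str
    short_id: Optional[str]


@dataclass
class TextUnit(Identified):
    """Stand-in mirroring the schema fields; not the library class."""
    text: str
    entity_ids: Optional[List[str]] = None
    relationship_ids: Optional[List[str]] = None
    covariate_ids: Optional[Dict[str, List[str]]] = None
    document_id: Optional[str] = None
    n_tokens: Optional[int] = None
    attributes: Optional[Dict[str, Any]] = None


# Illustrative values only; none of these IDs come from a real index.
unit = TextUnit(
    id="tu-001",
    short_id="1",
    text="Acme Corp acquired Widget Inc in 2021.",
    entity_ids=["entity-acme", "entity-widget"],
    relationship_ids=["rel-acquired"],
    covariate_ids={"claim": ["claim1", "claim2"]},
    document_id="doc-42",
    n_tokens=9,
    attributes={"page_number": 3, "section": "Acquisitions"},
)
```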
Creating from dictionary
The TextUnit class provides a from_dict() class method to create instances from dictionary data.
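The sketch below shows the general shape of such a class method on a simplified stand-in class. It assumes the dictionary keys match the field names one to one; the library's own from_dict() may expect different keys or accept extra parameters, so treat this as an illustration of the pattern rather than its actual signature.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TextUnit:
    """Simplified stand-in for illustration; not the library class."""
    id: str
    short_id: Optional[str]
    text: str
    entity_ids: Optional[List[str]] = None
    relationship_ids: Optional[List[str]] = None
    covariate_ids: Optional[Dict[str, List[str]]] = None
    document_id: Optional[str] = None
    n_tokens: Optional[int] = None
    attributes: Optional[Dict[str, Any]] = None

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "TextUnit":
        # Assumption: keys are named after the fields. Required keys raise
        # KeyError when missing; optional keys fall back to None.
        return cls(
            id=d["id"],
            short_id=d.get("short_id"),
            text=d["text"],
            entity_ids=d.get("entity_ids"),
            relationship_ids=d.get("relationship_ids"),
            covariate_ids=d.get("covariate_ids"),
            document_id=d.get("document_id"),
            n_tokens=d.get("n_tokens"),
            attributes=d.get("attributes"),
        )


# Illustrative record, for example one row loaded from a table of text units.
record = {
    "id": "tu-001",
    "short_id": "1",
    "text": "Acme Corp acquired Widget Inc in 2021.",
    "entity_ids": ["entity-acme", "entity-widget"],
    "document_id": "doc-42",
}
unit = TextUnit.from_dict(record)
```

Keys that are absent from the record, such as n_tokens here, end up as None on the resulting instance.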
Text chunking
Text units are created by chunking source documents into smaller pieces. The chunking strategy affects:
- Extraction quality: Smaller chunks may miss relationships across boundaries; larger chunks may dilute entity detection
- Context window: Chunk size should fit within LLM context windows
- Token count: The n_tokens field helps manage context and costs
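To make the trade-offs above concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name chunk_by_tokens is hypothetical, and whitespace splitting stands in for a real tokenizer, so the counts will not match those of an LLM tokenizer. It is not the chunker used by the library.

```python
from typing import Dict, List, Union


def chunk_by_tokens(
    text: str, max_tokens: int, overlap: int = 0
) -> List[Dict[str, Union[str, int]]]:
    """Split text into windows of at most max_tokens whitespace tokens.

    Consecutive windows share `overlap` tokens, which reduces the chance
    that a relationship spanning a chunk boundary is lost.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    step = max_tokens - overlap
    chunks: List[Dict[str, Union[str, int]]] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append({"text": " ".join(window), "n_tokens": len(window)})
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_by_tokens(
    "one two three four five six seven", max_tokens=3, overlap=1
)
```

With these inputs, each chunk holds three tokens and shares one token with its neighbour; raising overlap trades extra tokens (and cost) for more context around chunk boundaries.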
Role in the knowledge graph
Text units serve several critical functions:
Source attribution
Every entity, relationship, and claim maintains text_unit_ids that point back to the text units from which they were extracted. This enables:
- Verifying extracted information
- Showing evidence for claims
- Providing context for search results
Bidirectional linking
Text units maintain forward links to extracted graph elements via entity_ids, relationship_ids, and covariate_ids, while those elements maintain backward links via their text_unit_ids fields.
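The two directions carry the same information, so one can be derived from the other. The sketch below inverts a forward mapping (text unit to entity IDs) into the backward mapping (entity to text unit IDs). Plain dictionaries stand in for the model objects, and the helper name invert_links and all IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Forward links: text unit ID -> IDs of entities extracted from it.
entity_ids_by_text_unit: Dict[str, List[str]] = {
    "tu-1": ["e-acme", "e-widget"],
    "tu-2": ["e-acme"],
}


def invert_links(forward: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build element ID -> text unit IDs from text unit ID -> element IDs."""
    backward: Dict[str, List[str]] = defaultdict(list)
    for text_unit_id, element_ids in forward.items():
        for element_id in element_ids:
            backward[element_id].append(text_unit_id)
    return dict(backward)


# Backward links: entity ID -> IDs of the text units that mention it.
text_unit_ids_by_entity = invert_links(entity_ids_by_text_unit)
```

The same inversion applies to relationship_ids, and to each list inside covariate_ids.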
Document traceability
The document_id field enables tracing from any graph element back through text units to the original source document.
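As a sketch of that two-hop lookup, the function below follows an element's text_unit_ids to the corresponding text units and collects their document IDs. The lookup tables, the function name source_documents, and all IDs are illustrative assumptions rather than part of the library.

```python
from typing import Dict, List

# Hop 1 data: graph element ID -> its text_unit_ids (backward links).
entity_text_units: Dict[str, List[str]] = {
    "e-acme": ["tu-1", "tu-3"],
    "e-widget": ["tu-2"],
}

# Hop 2 data: text unit ID -> its document_id.
text_unit_document: Dict[str, str] = {
    "tu-1": "doc-a",
    "tu-2": "doc-b",
    "tu-3": "doc-a",
}


def source_documents(
    element_id: str,
    element_to_text_units: Dict[str, List[str]],
    text_unit_to_document: Dict[str, str],
) -> List[str]:
    """Return the sorted, de-duplicated document IDs behind a graph element."""
    documents = {
        text_unit_to_document[text_unit_id]
        for text_unit_id in element_to_text_units.get(element_id, [])
        if text_unit_id in text_unit_to_document  # skip dangling links
    }
    return sorted(documents)


acme_sources = source_documents("e-acme", entity_text_units, text_unit_document)
```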
Use cases
- Context retrieval: Fetch original text for entities or relationships
- Evidence display: Show source text snippets to users
- Quality assurance: Verify extraction accuracy against source
- Incremental updates: Re-process specific text units when documents change
- Token budgeting: Calculate context window usage using n_tokens
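For the token budgeting case, one simple approach is to add text units in ranked order until the next one would exceed the budget. The sketch below assumes the units are already sorted by relevance and represents each one as an (id, n_tokens) pair; fit_to_budget is a hypothetical helper and the token counts are illustrative.

```python
from typing import List, Tuple


def fit_to_budget(units: List[Tuple[str, int]], max_tokens: int) -> List[str]:
    """Select text unit IDs in order until the token budget would be exceeded."""
    selected: List[str] = []
    used = 0
    for unit_id, n_tokens in units:
        if used + n_tokens > max_tokens:
            break  # stop at the first unit that does not fit
        selected.append(unit_id)
        used += n_tokens
    return selected


# (text unit ID, n_tokens), assumed to be ordered from most to least relevant.
ranked_units = [("tu-1", 300), ("tu-2", 500), ("tu-3", 400)]
picked = fit_to_budget(ranked_units, max_tokens=1000)
```

Stopping at the first unit that does not fit keeps the ranking intact; a variant could instead skip that unit and keep looking for smaller ones that still fit.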