Overview
The EpisodicGistExtraction class performs a two-stage extraction to capture both detailed episodic facts and high-level semantic gists. This dual extraction provides both granular information and abstract summaries of episodic content.
Class: EpisodicGistExtraction
Constructor
The language model instance used for extraction
Optional global configuration containing:
- dataset (str): Dataset name for template selection
- seed (int): Random seed for reproducibility
- temperature (float): LLM temperature parameter
Methods
batch_openie()
Extracts both gists and facts from multiple episodes in a two-stage process.
Parameters
Dictionary of chunks to process. Each key is a chunk ID, and each value contains:
- metadata (dict): Episode metadata for passage construction
- Other ChunkInfo fields (see the OpenIE documentation)
Returns
A dictionary of EpisodeRawOutput objects:
Combined extraction results containing:
- chunk_id (str): The chunk identifier
- verbatim (str): Original passage text
- facts (List[str]): Detailed episodic facts extracted
- gists (List[str]): High-level semantic summaries
- response (str): Raw LLM response
- metadata (dict): Token usage and processing information
Example Usage
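A self-contained sketch of the two-stage call pattern. The class, method, and field names come from this page; the stub bodies, the toy LLM, and the chunk layout are assumptions, not the real implementation.

```python
# Illustrative stub of EpisodicGistExtraction; only the names mirror the docs.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EpisodeRawOutput:
    chunk_id: str
    verbatim: str
    facts: List[str] = field(default_factory=list)
    gists: List[str] = field(default_factory=list)
    response: str = ""
    metadata: dict = field(default_factory=dict)

class EpisodicGistExtraction:  # stand-in stub for the documented class
    def __init__(self, llm, global_config=None):
        self.llm = llm
        self.config = global_config or {}

    def batch_openie(self, chunks: Dict[str, dict]) -> Dict[str, EpisodeRawOutput]:
        results = {}
        for chunk_id, info in chunks.items():
            text = info["content"]
            # Stage 1: extract high-level gists
            gists = self.llm(f"Summarize the main points of: {text}")
            # Stage 2: extract detailed facts, with gists as context
            facts = self.llm(f"Given gists {gists}, list specific facts in: {text}")
            results[chunk_id] = EpisodeRawOutput(chunk_id, text, facts=facts, gists=gists)
        return results

# Toy "LLM" that just echoes the passage back as a one-item list.
toy_llm = lambda prompt: [prompt.split(": ", 1)[1]]
extractor = EpisodicGistExtraction(toy_llm, {"dataset": "locomo", "seed": 42, "temperature": 0.0})
out = extractor.batch_openie({"ep-001": {"content": "Alice moved to Lisbon in March 2021.",
                                         "metadata": {"session": 1}}})
```

With the real class, the returned objects would additionally carry the raw LLM response and token-usage metadata described above.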
Two-Stage Extraction Process
Stage 1: Gist Extraction
- Uses the episodic_gist_extraction prompt template
- Extracts high-level semantic summaries
- Captures abstract themes and main points
- Processes all chunks in parallel
Stage 2: Fact Extraction
- Uses the episodic_fact_extraction prompt template
- Includes previously extracted gists as context
- Extracts detailed, specific facts
- Leverages gists to inform fact granularity
Context Enhancement
When extracting facts, the previously extracted gists are included in the prompt to:
- Maintain consistency between abstractions and details
- Extract facts at an appropriate level of granularity
- Avoid redundancy between gists and facts
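A hedged sketch of how the Stage-2 prompt might fold in the Stage-1 gists; the prompt wording and helper name are assumptions, not the library's actual template.

```python
# Hypothetical helper: assemble a fact-extraction prompt that carries the
# previously extracted gists as context, as described above.
def build_fact_prompt(passage: str, gists: list) -> str:
    gist_block = "\n".join(f"- {g}" for g in gists)
    return (
        "Previously extracted gists:\n"
        f"{gist_block}\n\n"
        "Using the gists above for context, extract detailed, specific facts "
        "from the passage. Avoid restating the gists.\n\n"
        f"Passage:\n{passage}"
    )
```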
Template Selection
Templates are selected automatically based on the dataset:
- Wikipedia-based datasets: Uses *_wikipedia templates (menatqa, timeqa, musique, complex_tr, 2wikimultihopqa)
- Conversation datasets: Uses *_locomo templates (default)
- Custom datasets: Matches by dataset prefix
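A sketch of the routing above: the *_wikipedia / *_locomo suffixes and the dataset list come from this page, but the function itself is an assumption about how the selection might look.

```python
# Hypothetical template selector; prefix matching covers custom datasets.
WIKIPEDIA_DATASETS = {"menatqa", "timeqa", "musique", "complex_tr", "2wikimultihopqa"}

def select_template(base: str, dataset: str) -> str:
    # Exact match, or a custom dataset name sharing a known prefix
    if any(dataset == d or dataset.startswith(d) for d in WIKIPEDIA_DATASETS):
        return f"{base}_wikipedia"
    return f"{base}_locomo"  # conversation datasets (default)
```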
Output Structure
Gists
High-level semantic summaries that:
- Capture the main themes or topics
- Abstract away specific details
- Provide context for understanding the episode
- Are typically 1-3 sentences each
Facts
Detailed, specific statements that:
- Reference concrete entities, dates, and numbers
- Capture precise relationships and actions
- Can be verified against the verbatim text
- Are more granular than gists
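A hand-written illustration of the gist/fact contrast for one passage; this is not model output, only the intended difference in granularity.

```python
# Hypothetical passage with a hand-crafted gist/fact split.
verbatim = ("Alice told Bob on 3 March 2021 that she had moved to Lisbon "
            "for a new job at a software startup.")
gists = ["Alice recently relocated for work and shared the news with Bob."]
facts = [
    "Alice moved to Lisbon.",
    "Alice moved for a new job at a software startup.",
    "Alice told Bob about the move on 3 March 2021.",
]
```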
Performance Considerations
- Two-stage processing: Gists extracted first, then facts
- Parallel execution: Each stage processes all chunks concurrently
- Configurable workers: Defaults to max_workers=10 for the ThreadPoolExecutor
- Ordered results: Maintains original chunk ordering
- Progress tracking: Separate progress bars for gists and facts
- Token metrics: Tracks usage for both extraction stages
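The parallel, order-preserving pattern above can be sketched with executor.map, which yields results in submission order. The max_workers=10 default matches this page; process_chunk is a stand-in for the per-chunk LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(item):
    chunk_id, info = item
    return chunk_id, info["content"].upper()  # placeholder for an LLM call

chunks = {"ep-001": {"content": "first"}, "ep-002": {"content": "second"}}
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves submission order even when workers finish out of order
    results = dict(pool.map(process_chunk, chunks.items()))
# results keeps the original chunk ordering
```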
Error Handling
- Graceful handling of JSON parsing errors
- Automatic fixing of truncated JSON responses
- Null value replacement with empty strings
- Error metadata attached to results
- Logging of extraction exceptions
- Returns empty lists on failure while preserving chunk structure
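A hedged sketch of tolerant JSON handling in the spirit of the behaviors listed above: parse if possible, attempt a naive truncation repair, replace nulls with empty strings, and fall back to empty lists. The real implementation may differ.

```python
import json

def safe_parse_lists(raw: str) -> dict:
    """Hypothetical helper: recover gists/facts lists from a raw LLM response."""
    def clean(values):
        # Replace null entries with empty strings
        return ["" if v is None else v for v in values]

    # Try the raw text, then two naive fixes for a truncated JSON object
    for candidate in (raw, raw + '"]}', raw + "]}"):
        try:
            data = json.loads(candidate)
            return {"gists": clean(data.get("gists", [])),
                    "facts": clean(data.get("facts", []))}
        except (json.JSONDecodeError, AttributeError):
            continue
    # On failure, preserve the output structure with empty lists
    return {"gists": [], "facts": []}
```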
Use Cases
- Memory systems: Store both detailed and summarized episode representations
- Question answering: Retrieve relevant gists first, then drill down to facts
- Conversation analysis: Understand both high-level topics and specific details
- Long-term memory: Compress episodic information at multiple granularities
- Hierarchical retrieval: Enable multi-level semantic search