
Overview

The EpisodicGistExtraction class performs a two-stage extraction that captures both detailed episodic facts and high-level semantic gists, yielding granular information alongside abstract summaries of episodic content.

Class: EpisodicGistExtraction

Constructor

from remem.information_extraction.episodic_gist_extraction_openai import EpisodicGistExtraction
from remem.llm.openai_gpt import CacheOpenAI

llm_model = CacheOpenAI(model="gpt-4")
extractor = EpisodicGistExtraction(llm_model=llm_model, global_config=config)
Parameters

llm_model (CacheOpenAI, required)
  The language model instance used for extraction.

global_config (object, default: None)
  Optional global configuration containing:
  • dataset (str): Dataset name for template selection
  • seed (int): Random seed for reproducibility
  • temperature (float): LLM temperature parameter

Methods

batch_openie()

Extract both gists and facts from multiple episodes in a two-stage process.
def batch_openie(
    self,
    chunks: Dict[str, ChunkInfo]
) -> Tuple[Dict[str, EpisodeRawOutput]]

Parameters

chunks (Dict[str, ChunkInfo], required)
  Dictionary of chunks to process. Each key is a chunk ID, and each value contains:
  • metadata (dict): Episode metadata for passage construction
  • Other ChunkInfo fields (see OpenIE documentation)

Returns

A dictionary mapping chunk IDs to EpisodeRawOutput objects, returned as the sole element of a tuple (matching the signature above). Each EpisodeRawOutput combines the results of both extraction stages and contains:
  • chunk_id (str): The chunk identifier
  • verbatim (str): Original passage text
  • facts (List[str]): Detailed episodic facts extracted
  • gists (List[str]): High-level semantic summaries
  • response (str): Raw LLM response
  • metadata (dict): Token usage and processing information
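The shape of EpisodeRawOutput can be sketched as a dataclass. This is an illustration built from the field list above; the actual class definition in remem may differ:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EpisodeRawOutput:
    """Sketch of the combined extraction result (field names from the docs above)."""
    chunk_id: str
    verbatim: str
    facts: List[str] = field(default_factory=list)
    gists: List[str] = field(default_factory=list)
    response: str = ""
    metadata: Dict = field(default_factory=dict)


out = EpisodeRawOutput(chunk_id="episode_1", verbatim="User discussed travel plans...")
```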

Example Usage

from remem.information_extraction.episodic_gist_extraction_openai import EpisodicGistExtraction
from remem.llm.openai_gpt import CacheOpenAI

# Initialize the extractor
llm_model = CacheOpenAI(model="gpt-4")
config = type('Config', (), {
    'dataset': 'locomo',
    'seed': 42,
    'temperature': 0.0
})()

extractor = EpisodicGistExtraction(llm_model=llm_model, global_config=config)

# Prepare episodic chunks
chunks = {
    "episode_1": {
        "metadata": {
            "timestamp": "2024-01-15 14:30",
            "content": "User discussed their travel plans to Japan, mentioning interest in visiting Kyoto temples and trying authentic ramen. They have a budget of $3000 and plan to go in April."
        },
        "num_tokens": 80,
        "content": "User discussed travel plans...",
        "chunk_order": [(0, 1)],
        "full_doc_ids": ["session_456"]
    }
}

# Extract gists and facts
results = extractor.batch_openie(chunks)

# Access results
for chunk_id, episode_output in results.items():
    print(f"\nEpisode: {chunk_id}")
    print(f"Verbatim: {episode_output.verbatim}")
    
    print("\nGists:")
    for gist in episode_output.gists:
        print(f"  - {gist}")
    # Example output (actual LLM output will vary):
    #   - User is planning a trip to Japan
    #   - User wants cultural and culinary experiences
    #   - User has specific budget and timing constraints
    
    print("\nFacts:")
    for fact in episode_output.facts:
        print(f"  - {fact}")
    # Example output (actual LLM output will vary):
    #   - User plans to visit Japan
    #   - User wants to visit Kyoto temples
    #   - User wants to try authentic ramen
    #   - User has a budget of $3000
    #   - User plans to travel in April

Two-Stage Extraction Process

Stage 1: Gist Extraction

  1. Uses episodic_gist_extraction prompt template
  2. Extracts high-level semantic summaries
  3. Captures abstract themes and main points
  4. Processes all chunks in parallel

Stage 2: Fact Extraction

  1. Uses episodic_fact_extraction prompt template
  2. Includes previously extracted gists as context
  3. Extracts detailed, specific facts
  4. Leverages gists to inform fact granularity
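The two stages above can be sketched as follows. Here extract_gists and extract_facts are hypothetical stand-ins for the templated LLM calls, and chunks are simplified to plain passage strings:

```python
from typing import Dict, List, Tuple


def extract_gists(passage: str) -> List[str]:
    # Stage 1 stand-in: would call the LLM with the episodic_gist_extraction template.
    return [f"gist of: {passage[:20]}"]


def extract_facts(passage: str, gists: List[str]) -> List[str]:
    # Stage 2 stand-in: the gists are passed as context to the fact prompt.
    return [f"fact from: {passage[:20]} (given {len(gists)} gists)"]


def two_stage_extract(chunks: Dict[str, str]) -> Dict[str, Tuple[List[str], List[str]]]:
    # Stage 1: extract gists for every chunk first...
    gists = {cid: extract_gists(text) for cid, text in chunks.items()}
    # ...Stage 2: extract facts, informed by each chunk's own gists.
    return {cid: (gists[cid], extract_facts(text, gists[cid]))
            for cid, text in chunks.items()}


result = two_stage_extract({"episode_1": "User discussed travel plans to Japan"})
```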

Context Enhancement

When extracting facts, the previously extracted gists are included in the prompt:
Previously extracted gists for this session:
- [Gist 1]
- [Gist 2]
...
This helps the model:
  • Maintain consistency between abstractions and details
  • Extract facts at an appropriate level of granularity
  • Avoid redundancy between gists and facts
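A minimal sketch of how gists might be folded into the fact-extraction prompt; the exact wording of the real template is an assumption:

```python
from typing import List


def build_fact_prompt(passage: str, gists: List[str]) -> str:
    """Prepend previously extracted gists so facts stay consistent with them."""
    gist_block = "\n".join(f"- {g}" for g in gists)
    return (
        "Previously extracted gists for this session:\n"
        f"{gist_block}\n\n"
        "Extract detailed facts from the passage below.\n"
        f"Passage: {passage}"
    )


prompt = build_fact_prompt("User plans to visit Kyoto.",
                           ["User is planning a trip to Japan"])
```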

Template Selection

Automatic template selection based on dataset:
  • Wikipedia-based datasets: Uses *_wikipedia templates
    • menatqa, timeqa, musique, complex_tr, 2wikimultihopqa
  • Conversation datasets: Uses *_locomo templates (default)
  • Custom datasets: Matches by dataset prefix
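The selection rules can be sketched like this. The dataset names come from the list above; the function name and exact matching logic are assumptions:

```python
from typing import Optional

# Datasets that use the *_wikipedia prompt templates (from the list above).
WIKIPEDIA_DATASETS = {"menatqa", "timeqa", "musique", "complex_tr", "2wikimultihopqa"}


def select_template_suffix(dataset: Optional[str]) -> str:
    """Pick the prompt-template family for a dataset name (sketch of the rules above)."""
    if dataset is None:
        return "locomo"  # conversation templates are the default
    # Match by prefix so custom variants like "timeqa_hard" still resolve.
    for wiki in WIKIPEDIA_DATASETS:
        if dataset.startswith(wiki):
            return "wikipedia"
    return "locomo"
```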

Output Structure

Gists

High-level semantic summaries that:
  • Capture the main themes or topics
  • Abstract away specific details
  • Provide context for understanding the episode
  • Are typically 1-3 sentences each

Facts

Concrete, detailed statements that:
  • Reference specific entities, dates, and numbers
  • Capture precise relationships and actions
  • Can be verified against the verbatim text
  • Are more granular than gists

Performance Considerations

  • Two-stage processing: Gists extracted first, then facts
  • Parallel execution: Each stage processes all chunks concurrently
  • Configurable workers: Default max_workers=10 for ThreadPoolExecutor
  • Ordered results: Maintains original chunk ordering
  • Progress tracking: Separate progress bars for gists and facts
  • Token metrics: Tracks usage for both extraction stages
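One order-preserving parallel stage can be sketched with ThreadPoolExecutor (max_workers=10 from the note above; process_chunk is a hypothetical stand-in for a single extraction call):

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk_id: str) -> str:
    # Stand-in for one LLM extraction call on a single chunk.
    return f"processed:{chunk_id}"


def run_stage(chunk_ids, max_workers: int = 10):
    # executor.map preserves input order even though calls run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(zip(chunk_ids, executor.map(process_chunk, chunk_ids)))


results = run_stage(["episode_1", "episode_2", "episode_3"])
```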

Error Handling

  • Graceful handling of JSON parsing errors
  • Automatic fixing of truncated JSON responses
  • Null value replacement with empty strings
  • Error metadata attached to results
  • Logging of extraction exceptions
  • Returns empty lists on failure while preserving chunk structure
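A simplified sketch of this tolerant parsing; the real repair logic is more involved, and the truncation heuristic here is an assumption:

```python
import json


def parse_extraction(response: str) -> dict:
    """Parse an LLM JSON response, falling back to empty lists on failure."""
    fallback = {"gists": [], "facts": []}
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        # Naive repair for a truncated object: close dangling brackets and retry.
        repaired = response + ']}' if response.rstrip().endswith('"') else response
        try:
            data = json.loads(repaired)
        except json.JSONDecodeError:
            return fallback
    # Replace null values with empty strings, per the error-handling notes above.
    return {k: ("" if v is None else v) for k, v in data.items()}
```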

Use Cases

  • Memory systems: Store both detailed and summarized episode representations
  • Question answering: Retrieve relevant gists first, then drill down to facts
  • Conversation analysis: Understand both high-level topics and specific details
  • Long-term memory: Compress episodic information at multiple granularities
  • Hierarchical retrieval: Enable multi-level semantic search
