
Overview

The EpisodicGistExtraction class performs a two-stage extraction that captures both detailed episodic facts and high-level semantic gists, yielding granular information alongside abstract summaries of episodic content.

Class: EpisodicGistExtraction

Constructor

from remem.information_extraction.episodic_gist_extraction_openai import EpisodicGistExtraction
from remem.llm.openai_gpt import CacheOpenAI

llm_model = CacheOpenAI(model="gpt-4")
extractor = EpisodicGistExtraction(llm_model=llm_model, global_config=config)
Parameters

llm_model (CacheOpenAI, required)
  The language model instance used for extraction.

global_config (object, default: None)
  Optional global configuration containing:
  • dataset (str): Dataset name for template selection
  • seed (int): Random seed for reproducibility
  • temperature (float): LLM temperature parameter

Methods

batch_openie()

Extract both gists and facts from multiple episodes in a two-stage process.
def batch_openie(
    self,
    chunks: Dict[str, ChunkInfo]
) -> Tuple[Dict[str, EpisodeRawOutput]]

Parameters

chunks (Dict[str, ChunkInfo], required)
  Dictionary of chunks to process. Each key is a chunk ID, and each value contains:
  • metadata (dict): Episode metadata for passage construction
  • Other ChunkInfo fields (see OpenIE documentation)

Returns

A dictionary mapping chunk IDs to EpisodeRawOutput objects, returned as the sole element of a tuple (matching the signature above). Each EpisodeRawOutput combines the results of both extraction stages and contains:
  • chunk_id (str): The chunk identifier
  • verbatim (str): Original passage text
  • facts (List[str]): Detailed episodic facts extracted
  • gists (List[str]): High-level semantic summaries
  • response (str): Raw LLM response
  • metadata (dict): Token usage and processing information
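The shape of EpisodeRawOutput can be sketched as a dataclass. This is an illustration built from the field list above; the actual class definition in remem may differ:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EpisodeRawOutput:
    """Sketch of the combined extraction result (field names from the docs above)."""
    chunk_id: str
    verbatim: str
    facts: List[str] = field(default_factory=list)
    gists: List[str] = field(default_factory=list)
    response: str = ""
    metadata: Dict = field(default_factory=dict)


out = EpisodeRawOutput(chunk_id="episode_1", verbatim="User discussed travel plans...")
```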

Example Usage

from remem.information_extraction.episodic_gist_extraction_openai import EpisodicGistExtraction
from remem.llm.openai_gpt import CacheOpenAI

# Initialize the extractor
llm_model = CacheOpenAI(model="gpt-4")
config = type('Config', (), {
    'dataset': 'locomo',
    'seed': 42,
    'temperature': 0.0
})()

extractor = EpisodicGistExtraction(llm_model=llm_model, global_config=config)

# Prepare episodic chunks
chunks = {
    "episode_1": {
        "metadata": {
            "timestamp": "2024-01-15 14:30",
            "content": "User discussed their travel plans to Japan, mentioning interest in visiting Kyoto temples and trying authentic ramen. They have a budget of $3000 and plan to go in April."
        },
        "num_tokens": 80,
        "content": "User discussed travel plans...",
        "chunk_order": [(0, 1)],
        "full_doc_ids": ["session_456"]
    }
}

# Extract gists and facts
results = extractor.batch_openie(chunks)

# Access results
for chunk_id, episode_output in results.items():
    print(f"\nEpisode: {chunk_id}")
    print(f"Verbatim: {episode_output.verbatim}")
    
    print("\nGists:")
    for gist in episode_output.gists:
        print(f"  - {gist}")
    # Example output (actual LLM output will vary):
    #   - User is planning a trip to Japan
    #   - User wants cultural and culinary experiences
    #   - User has specific budget and timing constraints
    
    print("\nFacts:")
    for fact in episode_output.facts:
        print(f"  - {fact}")
    # Example output (actual LLM output will vary):
    #   - User plans to visit Japan
    #   - User wants to visit Kyoto temples
    #   - User wants to try authentic ramen
    #   - User has a budget of $3000
    #   - User plans to travel in April

Two-Stage Extraction Process

Stage 1: Gist Extraction

  1. Uses episodic_gist_extraction prompt template
  2. Extracts high-level semantic summaries
  3. Captures abstract themes and main points
  4. Processes all chunks in parallel

Stage 2: Fact Extraction

  1. Uses episodic_fact_extraction prompt template
  2. Includes previously extracted gists as context
  3. Extracts detailed, specific facts
  4. Leverages gists to inform fact granularity
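The two stages above can be sketched as follows. Here extract_gists and extract_facts are hypothetical stand-ins for the templated LLM calls, and chunks are simplified to plain passage strings:

```python
from typing import Dict, List, Tuple


def extract_gists(passage: str) -> List[str]:
    # Stage 1 stand-in: would call the LLM with the episodic_gist_extraction template.
    return [f"gist of: {passage[:20]}"]


def extract_facts(passage: str, gists: List[str]) -> List[str]:
    # Stage 2 stand-in: the gists are passed as context to the fact prompt.
    return [f"fact from: {passage[:20]} (given {len(gists)} gists)"]


def two_stage_extract(chunks: Dict[str, str]) -> Dict[str, Tuple[List[str], List[str]]]:
    # Stage 1: extract gists for every chunk first...
    gists = {cid: extract_gists(text) for cid, text in chunks.items()}
    # ...Stage 2: extract facts, informed by each chunk's own gists.
    return {cid: (gists[cid], extract_facts(text, gists[cid]))
            for cid, text in chunks.items()}


result = two_stage_extract({"episode_1": "User discussed travel plans to Japan"})
```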

Context Enhancement

When extracting facts, the previously extracted gists are included in the prompt:
Previously extracted gists for this session:
- [Gist 1]
- [Gist 2]
...
This helps the model:
  • Maintain consistency between abstractions and details
  • Extract facts at an appropriate level of granularity
  • Avoid redundancy between gists and facts
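A minimal sketch of how gists might be folded into the fact-extraction prompt; the exact wording of the real template is an assumption:

```python
from typing import List


def build_fact_prompt(passage: str, gists: List[str]) -> str:
    """Prepend previously extracted gists so facts stay consistent with them."""
    gist_block = "\n".join(f"- {g}" for g in gists)
    return (
        "Previously extracted gists for this session:\n"
        f"{gist_block}\n\n"
        "Extract detailed facts from the passage below.\n"
        f"Passage: {passage}"
    )


prompt = build_fact_prompt("User plans to visit Kyoto.",
                           ["User is planning a trip to Japan"])
```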

Template Selection

Automatic template selection based on dataset:
  • Wikipedia-based datasets: Uses *_wikipedia templates
    • menatqa, timeqa, musique, complex_tr, 2wikimultihopqa
  • Conversation datasets: Uses *_locomo templates (default)
  • Custom datasets: Matches by dataset prefix
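The selection rules can be sketched like this. The dataset names come from the list above; the function name and exact matching logic are assumptions:

```python
from typing import Optional

# Datasets that use the *_wikipedia prompt templates (from the list above).
WIKIPEDIA_DATASETS = {"menatqa", "timeqa", "musique", "complex_tr", "2wikimultihopqa"}


def select_template_suffix(dataset: Optional[str]) -> str:
    """Pick the prompt-template family for a dataset name (sketch of the rules above)."""
    if dataset is None:
        return "locomo"  # conversation templates are the default
    # Match by prefix so custom variants like "timeqa_hard" still resolve.
    for wiki in WIKIPEDIA_DATASETS:
        if dataset.startswith(wiki):
            return "wikipedia"
    return "locomo"
```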

Output Structure

Gists

High-level semantic summaries that:
  • Capture the main themes or topics
  • Abstract away specific details
  • Provide context for understanding the episode
  • Are typically 1-3 sentences each

Facts

Concrete, detailed statements that:
  • Reference specific entities, dates, and numbers
  • Capture precise relationships and actions
  • Can be verified against the verbatim text
  • Are more granular than gists

Performance Considerations

  • Two-stage processing: Gists extracted first, then facts
  • Parallel execution: Each stage processes all chunks concurrently
  • Configurable workers: Default max_workers=10 for ThreadPoolExecutor
  • Ordered results: Maintains original chunk ordering
  • Progress tracking: Separate progress bars for gists and facts
  • Token metrics: Tracks usage for both extraction stages
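One order-preserving parallel stage can be sketched with ThreadPoolExecutor (max_workers=10 from the note above; process_chunk is a hypothetical stand-in for a single extraction call):

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk_id: str) -> str:
    # Stand-in for one LLM extraction call on a single chunk.
    return f"processed:{chunk_id}"


def run_stage(chunk_ids, max_workers: int = 10):
    # executor.map preserves input order even though calls run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(zip(chunk_ids, executor.map(process_chunk, chunk_ids)))


results = run_stage(["episode_1", "episode_2", "episode_3"])
```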

Error Handling

  • Graceful handling of JSON parsing errors
  • Automatic fixing of truncated JSON responses
  • Null value replacement with empty strings
  • Error metadata attached to results
  • Logging of extraction exceptions
  • Returns empty lists on failure while preserving chunk structure
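A simplified sketch of this tolerant parsing; the real repair logic is more involved, and the truncation heuristic here is an assumption:

```python
import json


def parse_extraction(response: str) -> dict:
    """Parse an LLM JSON response, falling back to empty lists on failure."""
    fallback = {"gists": [], "facts": []}
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        # Naive repair for a truncated object: close dangling brackets and retry.
        repaired = response + ']}' if response.rstrip().endswith('"') else response
        try:
            data = json.loads(repaired)
        except json.JSONDecodeError:
            return fallback
    # Replace null values with empty strings, per the error-handling notes above.
    return {k: ("" if v is None else v) for k, v in data.items()}
```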

Use Cases

  • Memory systems: Store both detailed and summarized episode representations
  • Question answering: Retrieve relevant gists first, then drill down to facts
  • Conversation analysis: Understand both high-level topics and specific details
  • Long-term memory: Compress episodic information at multiple granularities
  • Hierarchical retrieval: Enable multi-level semantic search
