
Overview

The EpisodicExtraction class extracts episodic facts from conversational or sequential text. It specializes in identifying entities and relationships specific to episodic memory contexts, such as user interactions, conversations, and temporal sequences.

Class: EpisodicExtraction

Constructor

from remem.information_extraction.episodic_extraction_openai import EpisodicExtraction
from remem.llm.openai_gpt import CacheOpenAI

llm_model = CacheOpenAI(model="gpt-4")
extractor = EpisodicExtraction(llm_model=llm_model, global_config=config)
Parameters

llm_model (CacheOpenAI, required)
  The language model instance used for episodic extraction.

global_config (object, default: None)
  Optional global configuration containing:
  • dataset (str): Dataset name used to select the appropriate prompt template
  • seed (int): Random seed for reproducibility
  • temperature (float): LLM temperature parameter

Methods

batch_openie()

Extract episodic facts from multiple chunks using multi-threaded processing.
def batch_openie(
    self,
    chunks: Dict[str, ChunkInfo]
) -> Tuple[Dict[str, NerRawOutput], Dict[str, TripleRawOutput]]

Parameters

chunks (Dict[str, ChunkInfo], required)
  Dictionary of chunks to process. Each key is a chunk ID, and each value contains:
  • metadata (dict): Episode metadata used to construct the passage
  • Other ChunkInfo fields (see the OpenIE documentation)

Returns

A tuple containing two dictionaries:
  1. NER Results (Dict[str, NerRawOutput]): Entities extracted from episodic content
    • chunk_id (str): The chunk identifier
    • response (str): Raw LLM response
    • unique_entities (List[str]): Unique entities found (subjects and objects from triples)
    • metadata (dict): Token usage and cache hit information
  2. Triple Results (Dict[str, TripleRawOutput]): Episodic fact triples
    • chunk_id (str): The chunk identifier
    • response (str): Raw LLM response
    • triples (List[Tuple]): List of (subject, predicate, object) episodic facts
    • metadata (dict): Token usage and cache hit information
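The shape of the two return records can be sketched as plain dataclasses. This mirrors the field descriptions above; it is not the actual remem source, and the real definitions may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative sketches of the two result records; field names follow the
# Returns description above, not necessarily the actual remem definitions.

@dataclass
class NerRawOutput:
    chunk_id: str                       # the chunk identifier
    response: str                       # raw LLM response
    unique_entities: List[str]          # subjects and objects from triples
    metadata: Dict = field(default_factory=dict)  # token usage, cache hits

@dataclass
class TripleRawOutput:
    chunk_id: str                       # the chunk identifier
    response: str                       # raw LLM response
    triples: List[Tuple[str, str, str]]  # (subject, predicate, object) facts
    metadata: Dict = field(default_factory=dict)  # token usage, cache hits
```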

Example Usage

from remem.information_extraction.episodic_extraction_openai import EpisodicExtraction
from remem.llm.openai_gpt import CacheOpenAI

# Initialize the extractor
llm_model = CacheOpenAI(model="gpt-4")
config = type('Config', (), {
    'dataset': 'longmemeval',
    'seed': 42,
    'temperature': 0.0
})()

extractor = EpisodicExtraction(llm_model=llm_model, global_config=config)

# Prepare episodic chunks
chunks = {
    "episode_1": {
        "metadata": {
            "session_id": "conv_123",
            "timestamp": "2024-01-15 10:30:00",
            "speaker": "user",
            "content": "I visited Paris last summer and saw the Eiffel Tower."
        },
        "num_tokens": 50,
        "content": "User: I visited Paris last summer and saw the Eiffel Tower.",
        "chunk_order": [(0, 1)],
        "full_doc_ids": ["conv_123"]
    }
}

# Extract episodic facts
ner_results, triple_results = extractor.batch_openie(chunks)

# Access results
for chunk_id, result in triple_results.items():
    print(f"Episodic facts in {chunk_id}:")
    for triple in result.triples:
        print(f"  {triple}")
    # Output:
    #   ("user", "visited", "Paris")
    #   ("user", "visited in", "last summer")
    #   ("user", "saw", "Eiffel Tower")

for chunk_id, ner in ner_results.items():
    print(f"Entities: {ner.unique_entities}")
    # Output: ["user", "Paris", "last summer", "Eiffel Tower"]

Extraction Process

Template Selection

The extractor automatically selects the appropriate prompt template based on the dataset:
  • Default: episodic_triple_extraction_longmemeval
  • Dataset-specific templates available for different memory evaluation benchmarks
  • Templates are matched by dataset prefix in global_config.dataset
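The prefix matching described above can be sketched as follows. The registry here is illustrative: only episodic_triple_extraction_longmemeval is named on this page, and the other entry is a hypothetical placeholder.

```python
from typing import Optional

# Illustrative registry: only the default template name below appears on this
# page; the second entry is a hypothetical placeholder.
TEMPLATES = {
    "longmemeval": "episodic_triple_extraction_longmemeval",
    "timeqa": "episodic_triple_extraction_timeqa",  # hypothetical
}
DEFAULT_TEMPLATE = "episodic_triple_extraction_longmemeval"

def select_template(dataset: Optional[str]) -> str:
    """Match a dataset name against known prefixes, falling back to the default."""
    if dataset:
        for prefix, template in TEMPLATES.items():
            if dataset.startswith(prefix):
                return template
    return DEFAULT_TEMPLATE
```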

JSON Mode

Extracts structured JSON output containing:
  • triples: List of episodic relationship triples
  • Entities are derived from triple subjects and objects
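Deriving entities from triple subjects and objects amounts to a simple deduplication pass. A minimal sketch (the helper name is hypothetical, not the class's actual method):

```python
from typing import List, Tuple

def entities_from_triples(triples: List[Tuple[str, str, str]]) -> List[str]:
    """Collect unique subjects and objects, preserving first-seen order."""
    seen: List[str] = []
    for subj, _pred, obj in triples:
        for entity in (subj, obj):
            if entity not in seen:
                seen.append(entity)
    return seen

triples = [
    ("user", "visited", "Paris"),
    ("user", "saw", "Eiffel Tower"),
]
entities_from_triples(triples)  # ["user", "Paris", "Eiffel Tower"]
```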

Paraphrasing (Optional)

When paraphrasing=True is passed to the episodic_extraction() method:
  • Returns a third output: ParaphraseRawOutput
  • Contains paraphrased versions of the episodic facts
  • Useful for data augmentation and semantic variation

Specialized Features

Episodic Content Construction

Uses make_chunk_content("episodic", metadata) to construct passages from chunk metadata:
  • Formats episode information (speaker, timestamp, content)
  • Structures conversational context appropriately
  • Handles different episode formats based on metadata structure
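A minimal sketch of what such a formatter might do, assuming the speaker/timestamp/content metadata keys shown in the earlier example (make_episodic_passage is a hypothetical stand-in for make_chunk_content, whose actual formatting may differ):

```python
from typing import Dict

def make_episodic_passage(metadata: Dict) -> str:
    """Hypothetical formatter: turn episode metadata into a passage string."""
    parts = []
    if "timestamp" in metadata:
        parts.append(f"[{metadata['timestamp']}]")     # temporal context first
    if "speaker" in metadata:
        parts.append(f"{metadata['speaker'].capitalize()}:")  # speaker label
    parts.append(metadata.get("content", ""))          # the utterance itself
    return " ".join(parts)
```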

Dataset-Specific Prompts

Supports specialized extraction for various memory benchmarks:
  • LongMemEval
  • MenatQA
  • TimeQA
  • MuSiQue
  • Complex Temporal Reasoning
  • 2WikiMultiHopQA

Performance Considerations

  • Single-stage extraction (combines entity and triple extraction)
  • Parallel processing with ThreadPoolExecutor
  • Progress tracking with token usage metrics
  • Cache-aware processing for repeated chunks
  • Entities derived from extracted triples (no separate NER step)
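The parallel fan-out over chunks with ThreadPoolExecutor can be sketched like this, where extract_one stands in for the per-chunk LLM call (a sketch, not the class's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Tuple

def batch_extract(
    chunks: Dict[str, dict],
    extract_one: Callable[[str, dict], Tuple[dict, dict]],
    max_workers: int = 4,
) -> Tuple[Dict[str, dict], Dict[str, dict]]:
    """Run extract_one over every chunk in parallel, keyed by chunk ID."""
    ner_results: Dict[str, dict] = {}
    triple_results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per chunk and remember which chunk each future belongs to.
        futures = {pool.submit(extract_one, cid, info): cid
                   for cid, info in chunks.items()}
        for fut in as_completed(futures):
            cid = futures[fut]
            ner_results[cid], triple_results[cid] = fut.result()
    return ner_results, triple_results
```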

Error Handling

  • Graceful handling of extraction failures
  • Returns empty results with error metadata on exceptions
  • Logs warnings for debugging
  • Handles malformed JSON responses
  • Replaces null values with empty strings
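A defensive parse along these lines might look like the following sketch; parse_triples is a hypothetical helper illustrating the bullets above, not the class's actual method:

```python
import json
import logging

logger = logging.getLogger(__name__)

def parse_triples(raw_response: str, chunk_id: str) -> dict:
    """Parse an LLM JSON response defensively, returning empty results on failure."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        # Malformed JSON: log a warning and return empty results with error metadata.
        logger.warning("Malformed JSON for chunk %s: %s", chunk_id, exc)
        return {"chunk_id": chunk_id, "triples": [], "metadata": {"error": str(exc)}}
    triples = []
    for raw in data.get("triples", []):
        # Replace null values with empty strings.
        triples.append(tuple("" if value is None else value for value in raw))
    return {"chunk_id": chunk_id, "triples": triples, "metadata": {}}
```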
