
Overview

ReMem supports multiple extraction methods that transform raw documents into structured memory units (entities, facts, gists). You can add custom extraction methods by creating new extraction classes and registering them with the system.

Built-in Extraction Methods

| Method | Module | What It Extracts |
| --- | --- | --- |
| openie | openie_openai.py | Entities + triples |
| episodic | episodic_extraction_openai.py | Episodic facts |
| episodic_gist | episodic_gist_extraction_openai.py | Facts + paraphrased gist memories |
| temporal | temporal_extraction_openai.py | Temporal facts with time anchors |

Each method has both OpenAI and vLLM variants (e.g., openie_openai.py and openie_vllm_offline.py).
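
To use a built-in method, select it by name through the config. A minimal sketch; the fields follow the Testing section at the end of this page:

from remem.utils.config_utils import BaseConfig

# Pick one of the built-in methods from the table above
config = BaseConfig(
    extract_method="episodic_gist",
    llm_name="gpt-4o-mini",
)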

Creating a Custom Extraction Method

1. Basic Extraction Class

Create a new file in src/remem/information_extraction/:
# src/remem/information_extraction/my_custom_extraction_openai.py
from typing import Dict, List, Tuple
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

from remem.llm.openai_gpt import CacheOpenAI
from remem.prompts import PromptTemplateManager
from remem.utils.misc_utils import NerRawOutput, TripleRawOutput
from remem.information_extraction.openie_openai import ChunkInfo

class MyCustomExtraction:
    def __init__(self, llm_model: CacheOpenAI, global_config=None):
        # Initialize prompt template manager
        self.prompt_template_manager = PromptTemplateManager(
            role_mapping={"system": "system", "user": "user", "assistant": "assistant"}
        )
        self.llm_model = llm_model
        self.global_config = global_config

    def extract_chunk(self, chunk_key: str, passage: str) -> Dict:
        """Extract information from a single chunk."""
        # Render your custom prompt template
        input_message = self.prompt_template_manager.render(
            name="my_custom_template", 
            passage=passage
        )
        
        # Call LLM
        raw_response, metadata, cache_hit = self.llm_model.infer(
            messages=input_message,
            response_format={"type": "json_object"}
        )
        
        # Parse response and return structured output
        # ... your parsing logic ...
        
        return {
            "chunk_id": chunk_key,
            "response": raw_response,
            "metadata": metadata,
            # ... your extracted data ...
        }

    def batch_openie(self, chunks: Dict[str, ChunkInfo]) -> Tuple[Dict, ...]:
        """Batch extraction using multi-threading."""
        chunk_passages = {chunk_key: chunk["content"] for chunk_key, chunk in chunks.items()}
        
        results_list = []
        with ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(self.extract_chunk, chunk_key, passage): chunk_key
                for chunk_key, passage in chunk_passages.items()
            }
            
            pbar = tqdm(as_completed(futures), total=len(futures), desc="Custom extraction")
            for future in pbar:
                result = future.result()
                results_list.append(result)
        
        # Convert to the expected output format: a tuple of dicts keyed by chunk_id
        results_dict = {res["chunk_id"]: res for res in results_list}
        return (results_dict,)
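
The parsing logic elided in extract_chunk() depends on what your template asks the model to return. A minimal sketch, assuming the template requests a JSON object with a top-level "items" key (a hypothetical key name, not part of ReMem):

import json

def _parse_items(raw_response: str) -> list:
    """Parse the model's JSON output; return an empty list on malformed responses."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    # "items" is a hypothetical key; match whatever your prompt template specifies
    return data.get("items", [])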

2. Key Methods to Implement

Required: batch_openie(chunks)

The orchestrator (remem.py) calls batch_openie() during indexing:
def batch_openie(self, chunks: Dict[str, ChunkInfo]) -> Tuple[Dict, ...]:
    """
    Args:
        chunks: Dict mapping chunk_id to ChunkInfo with keys:
                - 'content': str
                - 'num_tokens': int
                - 'chunk_order': List[Tuple]
                - 'full_doc_ids': List[str]
    
    Returns:
        Tuple of dicts mapping chunk_id to extraction results
        e.g., (ner_results_dict, triple_results_dict) for OpenIE
    """

Optional: Individual extraction methods

def ner(self, chunk_key: str, passage: str) -> NerRawOutput:
    """Extract named entities."""
    pass

def triple_extraction(self, chunk_key: str, passage: str, entities: List[str]) -> TripleRawOutput:
    """Extract relational triples."""
    pass

3. Example: Episodic Gist Extraction

From episodic_gist_extraction_openai.py:16-64:
class EpisodicGistExtraction:
    def __init__(self, llm_model: CacheOpenAI, global_config=None):
        self.prompt_template_manager = PromptTemplateManager(
            role_mapping={"system": "system", "user": "user", "assistant": "assistant"}
        )
        self.llm_model = llm_model
        self.global_config = global_config

    def batch_openie(self, chunks: Dict[str, ChunkInfo]) -> Tuple[Dict[str, EpisodeRawOutput]]:
        # Map chunk ids to their passages (elided in the original excerpt)
        chunk_passages = {chunk_key: chunk["content"] for chunk_key, chunk in chunks.items()}

        # First extract gists for all chunks
        gist_outputs = self.batch_extraction(
            chunk_passages, 
            template="episodic_gist_extraction", 
            target="gists"
        )
        
        # Create gist map for fact extraction
        gist_map = {output["chunk_id"]: output.get("gists", []) for output in gist_outputs}
        
        # Then extract facts, leveraging the previously extracted gists
        fact_outputs = self.batch_extraction(
            chunk_passages, 
            template="episodic_fact_extraction", 
            target="facts", 
            gist_map=gist_map
        )
        
        # Combine results
        results = []
        for fact_dict, gist_dict in zip(fact_outputs, gist_outputs):
            results.append(EpisodeRawOutput(
                chunk_id=fact_dict["chunk_id"],
                verbatim=fact_dict["verbatim"],
                facts=fact_dict.get("facts", []),
                gists=gist_dict.get("gists", []),
                response=None,
                metadata=None,
            ))
        
        return {item.chunk_id: item for item in results}

Output Data Classes

Use these dataclasses for structured outputs:
from remem.utils.misc_utils import (
    NerRawOutput,        # For entity extraction
    TripleRawOutput,     # For triple/fact extraction
    ParaphraseRawOutput, # For gist/paraphrase extraction
    EpisodeRawOutput,    # For episodic memories
)
NerRawOutput (openie_openai.py:48-79):
NerRawOutput(
    chunk_id=chunk_key,
    response=raw_response,
    unique_entities=["Alan Turing", "Grace Hopper", ...],
    metadata={"prompt_tokens": 150, "completion_tokens": 50, "cache_hit": False}
)
TripleRawOutput (openie_openai.py:81-116):
TripleRawOutput(
    chunk_id=chunk_key,
    response=raw_response,
    triples=[
        ["Alan Turing", "proposed", "Turing Test"],
        ["Turing Test", "proposed in", "1950"]
    ],
    metadata={...}
)

Integrating with ReMem

1. Register in __init__.py

# src/remem/information_extraction/__init__.py
from .my_custom_extraction_openai import MyCustomExtraction

2. Update the Factory in remem.py

In the ReMem.__init__() method, add your extraction method:
if self.global_config.extract_method == "my_custom":
    from remem.information_extraction import MyCustomExtraction
    self.openie = MyCustomExtraction(llm_model=self.llm, global_config=self.global_config)
elif self.global_config.extract_method == "openie":
    from remem.information_extraction import OpenIE
    self.openie = OpenIE(llm_model=self.llm)
# ... other methods

3. Create Prompt Templates

Add your prompt template in src/remem/prompts/templates/:
# src/remem/prompts/templates/my_custom_template.py
from string import Template

system_prompt = """You are an expert at extracting custom information."""

user_prompt = """Extract custom information from the following passage:

$passage

Return a JSON object with your extracted data."""

prompt_template = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": Template(user_prompt)}
]
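
Rendering the template produces the chat messages that extract_chunk() passes to the LLM. A sketch, reusing the role mapping shown earlier:

manager = PromptTemplateManager(
    role_mapping={"system": "system", "user": "user", "assistant": "assistant"}
)
messages = manager.render(
    name="my_custom_template",
    passage="Alan Turing proposed the Turing Test in 1950.",
)
# messages should be a list of {"role": ..., "content": ...} dicts with $passage filled in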
See Custom Prompts for more details.

Advanced: Multi-Backend Support

Follow the pattern of creating both OpenAI and vLLM variants:
  • my_custom_extraction_openai.py - Uses OpenAI API
  • my_custom_extraction_vllm_offline.py - Uses vLLM for local inference
Switch via global_config.llm_backend.
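
In the factory, that means branching on the backend before importing a variant. A sketch, assuming a "vllm_offline" backend value and a hypothetical MyCustomExtractionVLLM class:

if self.global_config.extract_method == "my_custom":
    if self.global_config.llm_backend == "vllm_offline":
        # Hypothetical class name; mirror whatever your vLLM variant defines
        from remem.information_extraction import MyCustomExtractionVLLM
        self.openie = MyCustomExtractionVLLM(llm_model=self.llm, global_config=self.global_config)
    else:
        from remem.information_extraction import MyCustomExtraction
        self.openie = MyCustomExtraction(llm_model=self.llm, global_config=self.global_config)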

Testing Your Extraction Method

from remem.remem import ReMem
from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    dataset="test",
    extract_method="my_custom",  # Your custom method
    llm_name="gpt-4o-mini",
)

rag = ReMem(global_config=config)
docs = ["Alan Turing proposed the Turing Test in 1950."]
rag.index(docs)

# Check extraction results
print(rag.chunk_embedding_store.hash_id_to_row)
