ReMem includes a comprehensive benchmarking suite to evaluate its performance on various long-context question answering tasks. The benchmark framework supports multiple datasets, extraction methods, and baseline comparisons.

Supported Datasets

ReMem has been evaluated on the following datasets:

MuSiQue

Multi-hop question answering requiring reasoning across multiple documents

2WikiMultiHopQA

Wikipedia-based multi-hop reasoning dataset

LoCoMo

Long conversation memory evaluation with episodic and temporal reasoning

LongMemEval

Long-term memory evaluation across conversation sessions

Complex TR

Complex temporal reasoning over time-dependent facts

TimeQA

Temporal question answering dataset

RealTalk

Real conversation analysis and understanding

Semantic QA

Semantic reasoning benchmarks

Benchmark Architecture

Directory Structure

examples/
├── Benchmark Scripts          # Main evaluation scripts
│   ├── locomo.py             # LoCoMo dataset evaluation
│   ├── longmemeval.py        # LongMemEval evaluation
│   ├── semantic_qa.py        # Semantic QA benchmarks
│   ├── complex_tr.py         # Complex temporal reasoning
│   ├── timeqa.py             # TimeQA evaluation
│   └── realtalk.py           # RealTalk conversations
└── Analysis Tools             # Post-processing utilities
    ├── *_overall_eval.py     # Aggregate evaluation
    ├── analyze_*.py          # Result analysis
    └── igraph_graph_*.py     # Graph visualization

baselines/                     # Baseline comparison methods
├── *_dense.py                 # Dense retrieval baselines
├── *_long_context.py          # Long context baselines
└── tiser.py                   # TISER baseline

reproduce/dataset/             # Dataset files
├── musique/
├── locomo/
├── longmemeval/
└── ...

Extraction Methods

ReMem supports multiple extraction methods for building the knowledge graph:

  • Open Information Extraction: LLM-based entity and relation extraction
  • Episode-based extraction: for conversational data
  • Episode-based extraction with gist summarization: the default for conversations
  • Temporal-aware extraction: for time-based reasoning
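To make the temporal-aware case concrete, a sketch of the kind of record such an extractor might emit — the field names here are illustrative, not ReMem's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triple:
    """A hypothetical extracted fact with an optional validity interval."""
    subject: str
    relation: str
    obj: str
    valid_from: Optional[str] = None  # ISO date the fact became true
    valid_to: Optional[str] = None    # ISO date the fact stopped holding

# A time-scoped fact: plain OpenIE would keep only the first three fields.
fact = Triple("Barack Obama", "president_of", "United States",
              valid_from="2009-01-20", valid_to="2017-01-20")
print(fact.relation)  # president_of
```

Attaching validity intervals to triples is what lets time-based queries ("who was X in 2015?") be answered from the graph rather than re-read from raw text.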

Evaluation Metrics

Benchmarks track multiple metrics:

QA Metrics

  • qa_em: Exact Match score
  • qa_f1: Token-level F1 score
  • qa_bleu1: BLEU-1 score
  • qa_mem0_llm_judge: LLM-as-judge evaluation
  • qa_longmemeval: LongMemEval-specific metric
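The first two QA metrics follow standard definitions. A minimal sketch (using a simplified normalizer — real qa_em/qa_f1 implementations typically also strip punctuation and articles):

```python
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (simplified normalization)."""
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                    # 1.0
print(round(token_f1("in Paris France", "Paris"), 2))   # 0.5
```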

Retrieval Metrics

  • retrieval_recall: Recall of gold documents
  • retrieval_recall_all: Recall across all retrieved chunks
  • retrieval_ndcg_any: NDCG score
  • retrieval_recall_locomo: LoCoMo-specific recall
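For reference, a sketch of binary-relevance recall and NDCG@k as they are commonly defined — ReMem's exact variants (e.g. what "any" means in retrieval_ndcg_any) may differ:

```python
import math

def retrieval_recall(retrieved: list, gold: set) -> float:
    """Fraction of gold documents that appear anywhere in the retrieved list."""
    return len(gold & set(retrieved)) / len(gold)

def ndcg(retrieved: list, gold: set, k: int = 10) -> float:
    """Binary-relevance NDCG@k: discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7"]   # d1 is gold but ranked second
gold = {"d1", "d2"}
print(retrieval_recall(retrieved, gold))  # 0.5
print(round(ndcg(retrieved, gold), 3))    # 0.387
```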

Output Structure

Benchmark results are saved to outputs/{dataset}/ with the following structure:
outputs/
└── {dataset}/
    └── {dataset}_{llm}_{embedding}/
        ├── rag_results_*.json          # Individual run results
        ├── overall_results_*.json      # Aggregated results
        ├── retrieval_results.json      # Retrieved passages
        ├── vdb_*.pkl                   # Cached embeddings
        └── graph.pkl                   # Memory graph structure
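Given that layout, aggregated results can be collected per run directory. A sketch — the run directory name below is a hypothetical example of the {dataset}_{llm}_{embedding} pattern, and the qa_f1 key is an assumption about the JSON contents:

```python
import json
from pathlib import Path

def load_overall_results(run_dir: Path) -> list:
    """Read every overall_results_*.json file in a run directory."""
    return [json.loads(p.read_text())
            for p in sorted(run_dir.glob("overall_results_*.json"))]

# Hypothetical run directory following {dataset}_{llm}_{embedding}:
run_dir = Path("outputs/musique/musique_gpt-4o-mini_text-embedding-3-small")
for result in load_overall_results(run_dir):
    print(result.get("qa_f1"))
```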

Performance Optimization

Caching

Benchmarks use extensive caching to speed up repeated runs:
  • Embedding Cache: vdb_*.pkl files store computed embeddings
  • OpenIE Cache: Extracted entities and relations are cached
  • LLM Cache: API responses are cached to reduce costs
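The embedding cache follows a standard pattern: load any existing pickle, compute only the missing entries, and write the cache back. A minimal sketch of that pattern (illustrative only, not ReMem's code):

```python
import pickle
import tempfile
from pathlib import Path

def cached_embeddings(texts, cache_path, embed):
    """Embed `texts`, reusing entries already stored in `cache_path`."""
    cache = pickle.loads(cache_path.read_bytes()) if cache_path.exists() else {}
    missing = [t for t in texts if t not in cache]
    for t in missing:
        cache[t] = embed(t)                           # compute only cache misses
    if missing:
        cache_path.write_bytes(pickle.dumps(cache))   # persist for the next run
    return cache

# Demo with a fake embedder that records how often it is called.
calls = []
fake_embed = lambda t: (calls.append(t), [0.0])[1]
path = Path(tempfile.mkdtemp()) / "vdb_demo.pkl"
cached_embeddings(["a", "b"], path, fake_embed)
cached_embeddings(["a", "b", "c"], path, fake_embed)
print(len(calls))  # 3 — "a" and "b" were served from cache on the second run
```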

Parallel Processing

Many benchmarks support parallel processing:
python examples/semantic_qa.py \
    --llm_name gpt-4o-mini \
    --parallel \
    --num_workers 5
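Under the hood, a --num_workers flag typically maps onto a worker pool that evaluates examples concurrently. A sketch of that pattern, where answer_question is a stand-in for the per-example evaluation call:

```python
from concurrent.futures import ThreadPoolExecutor

def answer_question(example: dict) -> dict:
    """Placeholder for one QA evaluation (e.g. an LLM API call)."""
    return {"id": example["id"], "answer": "..."}

examples = [{"id": i} for i in range(20)]

# Up to 5 examples in flight at once, mirroring --num_workers 5.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(answer_question, examples))

print(len(results))  # 20
```

Thread pools suit API-bound workloads like this, since the workers spend most of their time waiting on network responses rather than the CPU.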

Force Flags

Control cache behavior with force flags:
  • --force_index_from_scratch / -fi: Rebuild index
  • --force_openie_from_scratch / -fo: Rerun extraction
  • --force_rag / -fr: Rerun QA evaluation
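For illustration, here is how these three flags might be wired with argparse — the long and short option names match the docs, but the help text is an assumption:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--force_index_from_scratch", "-fi", action="store_true",
                    help="Rebuild the index, ignoring cached embeddings")
parser.add_argument("--force_openie_from_scratch", "-fo", action="store_true",
                    help="Rerun entity/relation extraction")
parser.add_argument("--force_rag", "-fr", action="store_true",
                    help="Rerun QA evaluation even if results exist")

args = parser.parse_args(["-fi", "-fr"])
print(args.force_index_from_scratch,
      args.force_openie_from_scratch,
      args.force_rag)  # True False True
```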

Next Steps

Dataset Details

Learn about each dataset’s structure and requirements

Running Benchmarks

Step-by-step guide to running benchmarks

Baseline Methods

Compare ReMem against baseline methods
