ReMem includes a comprehensive benchmarking suite to evaluate its performance on various long-context question answering tasks. The benchmark framework supports multiple datasets, extraction methods, and baseline comparisons.

Supported Datasets

ReMem has been evaluated on the following datasets:

MuSiQue

Multi-hop question answering requiring reasoning across multiple documents

2WikiMultiHopQA

Wikipedia-based multi-hop reasoning dataset

LoCoMo

Long conversation memory evaluation with episodic and temporal reasoning

LongMemEval

Long-term memory evaluation across conversation sessions

Complex TR

Complex temporal reasoning over time-dependent facts

TimeQA

Temporal question answering dataset

RealTalk

Real conversation analysis and understanding

Semantic QA

Semantic reasoning benchmarks

Benchmark Architecture

Directory Structure

examples/
├── Benchmark Scripts          # Main evaluation scripts
│   ├── locomo.py             # LoCoMo dataset evaluation
│   ├── longmemeval.py        # LongMemEval evaluation
│   ├── semantic_qa.py        # Semantic QA benchmarks
│   ├── complex_tr.py         # Complex temporal reasoning
│   ├── timeqa.py             # TimeQA evaluation
│   └── realtalk.py           # RealTalk conversations
└── Analysis Tools             # Post-processing utilities
    ├── *_overall_eval.py     # Aggregate evaluation
    ├── analyze_*.py          # Result analysis
    └── igraph_graph_*.py     # Graph visualization

baselines/                     # Baseline comparison methods
├── *_dense.py                 # Dense retrieval baselines
├── *_long_context.py          # Long context baselines
└── tiser.py                   # TISER baseline

reproduce/dataset/             # Dataset files
├── musique/
├── locomo/
├── longmemeval/
└── ...

Extraction Methods

ReMem supports multiple extraction methods for building the knowledge graph:

  • Open Information Extraction: LLM-based entity and relation extraction
  • Episode-based extraction: for conversational data
  • Episode-based extraction with gist summarization: the default for conversations
  • Temporal-aware extraction: for time-based reasoning
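To make the temporal-aware case concrete, a sketch of the kind of record such an extractor might emit — the field names here are illustrative, not ReMem's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triple:
    """A hypothetical extracted fact with an optional validity interval."""
    subject: str
    relation: str
    obj: str
    valid_from: Optional[str] = None  # ISO date the fact became true
    valid_to: Optional[str] = None    # ISO date the fact stopped holding

# A time-scoped fact: plain OpenIE would keep only the first three fields.
fact = Triple("Barack Obama", "president_of", "United States",
              valid_from="2009-01-20", valid_to="2017-01-20")
print(fact.relation)  # president_of
```

Attaching validity intervals to triples is what lets time-based queries ("who was X in 2015?") be answered from the graph rather than re-read from raw text.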

Evaluation Metrics

Benchmarks track multiple metrics:

QA Metrics

  • qa_em: Exact Match score
  • qa_f1: Token-level F1 score
  • qa_bleu1: BLEU-1 score
  • qa_mem0_llm_judge: LLM-as-judge evaluation
  • qa_longmemeval: LongMemEval-specific metric
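The first two QA metrics follow standard definitions. A minimal sketch (using a simplified normalizer — real qa_em/qa_f1 implementations typically also strip punctuation and articles):

```python
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (simplified normalization)."""
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                    # 1.0
print(round(token_f1("in Paris France", "Paris"), 2))   # 0.5
```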

Retrieval Metrics

  • retrieval_recall: Recall of gold documents
  • retrieval_recall_all: Recall across all retrieved chunks
  • retrieval_ndcg_any: NDCG score
  • retrieval_recall_locomo: LoCoMo-specific recall
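For reference, a sketch of binary-relevance recall and NDCG@k as they are commonly defined — ReMem's exact variants (e.g. what "any" means in retrieval_ndcg_any) may differ:

```python
import math

def retrieval_recall(retrieved: list, gold: set) -> float:
    """Fraction of gold documents that appear anywhere in the retrieved list."""
    return len(gold & set(retrieved)) / len(gold)

def ndcg(retrieved: list, gold: set, k: int = 10) -> float:
    """Binary-relevance NDCG@k: discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7"]   # d1 is gold but ranked second
gold = {"d1", "d2"}
print(retrieval_recall(retrieved, gold))  # 0.5
print(round(ndcg(retrieved, gold), 3))    # 0.387
```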

Output Structure

Benchmark results are saved to outputs/{dataset}/ with the following structure:
outputs/
└── {dataset}/
    └── {dataset}_{llm}_{embedding}/
        ├── rag_results_*.json          # Individual run results
        ├── overall_results_*.json      # Aggregated results
        ├── retrieval_results.json      # Retrieved passages
        ├── vdb_*.pkl                   # Cached embeddings
        └── graph.pkl                   # Memory graph structure
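Given that layout, aggregated results can be collected per run directory. A sketch — the run directory name below is a hypothetical example of the {dataset}_{llm}_{embedding} pattern, and the qa_f1 key is an assumption about the JSON contents:

```python
import json
from pathlib import Path

def load_overall_results(run_dir: Path) -> list:
    """Read every overall_results_*.json file in a run directory."""
    return [json.loads(p.read_text())
            for p in sorted(run_dir.glob("overall_results_*.json"))]

# Hypothetical run directory following {dataset}_{llm}_{embedding}:
run_dir = Path("outputs/musique/musique_gpt-4o-mini_text-embedding-3-small")
for result in load_overall_results(run_dir):
    print(result.get("qa_f1"))
```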

Performance Optimization

Caching

Benchmarks use extensive caching to speed up repeated runs:
  • Embedding Cache: vdb_*.pkl files store computed embeddings
  • OpenIE Cache: Extracted entities and relations are cached
  • LLM Cache: API responses are cached to reduce costs
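The embedding cache follows a standard pattern: load any existing pickle, compute only the missing entries, and write the cache back. A minimal sketch of that pattern (illustrative only, not ReMem's code):

```python
import pickle
import tempfile
from pathlib import Path

def cached_embeddings(texts, cache_path, embed):
    """Embed `texts`, reusing entries already stored in `cache_path`."""
    cache = pickle.loads(cache_path.read_bytes()) if cache_path.exists() else {}
    missing = [t for t in texts if t not in cache]
    for t in missing:
        cache[t] = embed(t)                           # compute only cache misses
    if missing:
        cache_path.write_bytes(pickle.dumps(cache))   # persist for the next run
    return cache

# Demo with a fake embedder that records how often it is called.
calls = []
fake_embed = lambda t: (calls.append(t), [0.0])[1]
path = Path(tempfile.mkdtemp()) / "vdb_demo.pkl"
cached_embeddings(["a", "b"], path, fake_embed)
cached_embeddings(["a", "b", "c"], path, fake_embed)
print(len(calls))  # 3 — "a" and "b" were served from cache on the second run
```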

Parallel Processing

Many benchmarks support parallel processing:
python examples/semantic_qa.py \
    --llm_name gpt-4o-mini \
    --parallel \
    --num_workers 5
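Under the hood, a --num_workers flag typically maps onto a worker pool that evaluates examples concurrently. A sketch of that pattern, where answer_question is a stand-in for the per-example evaluation call:

```python
from concurrent.futures import ThreadPoolExecutor

def answer_question(example: dict) -> dict:
    """Placeholder for one QA evaluation (e.g. an LLM API call)."""
    return {"id": example["id"], "answer": "..."}

examples = [{"id": i} for i in range(20)]

# Up to 5 examples in flight at once, mirroring --num_workers 5.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(answer_question, examples))

print(len(results))  # 20
```

Thread pools suit API-bound workloads like this, since the workers spend most of their time waiting on network responses rather than the CPU.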

Force Flags

Control cache behavior with force flags:
  • --force_index_from_scratch / -fi: Rebuild index
  • --force_openie_from_scratch / -fo: Rerun extraction
  • --force_rag / -fr: Rerun QA evaluation
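For illustration, here is how these three flags might be wired with argparse — the long and short option names match the docs, but the help text is an assumption:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--force_index_from_scratch", "-fi", action="store_true",
                    help="Rebuild the index, ignoring cached embeddings")
parser.add_argument("--force_openie_from_scratch", "-fo", action="store_true",
                    help="Rerun entity/relation extraction")
parser.add_argument("--force_rag", "-fr", action="store_true",
                    help="Rerun QA evaluation even if results exist")

args = parser.parse_args(["-fi", "-fr"])
print(args.force_index_from_scratch,
      args.force_openie_from_scratch,
      args.force_rag)  # True False True
```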

Next Steps

Dataset Details

Learn about each dataset’s structure and requirements

Running Benchmarks

Step-by-step guide to running benchmarks

Baseline Methods

Compare ReMem against baseline methods
