Dataset Overview

ReMem supports evaluation on multiple long-context QA datasets, each with unique characteristics and challenges.

MuSiQue

Multi-hop question answering requiring reasoning across multiple documents

Dataset Structure

  • Location: reproduce/dataset/musique/
  • Files:
    • musique.json - Questions and answers
    • musique_corpus.json - Document corpus
  • Format: Each sample contains:
    • question: The question text
    • answer: Gold answer
    • paragraphs: Supporting paragraphs with is_supporting flag
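The fields above can be parsed with a few lines of Python. This is a minimal sketch using a hypothetical in-line sample that mirrors the documented schema (the titles and texts are invented for illustration, not taken from the dataset):

```python
# Hypothetical sample mirroring the documented musique.json fields.
sample = {
    "question": "Who founded the company that makes the iPhone?",
    "answer": "Steve Jobs",
    "paragraphs": [
        {"title": "Apple Inc.", "text": "Apple was founded by Steve Jobs.", "is_supporting": True},
        {"title": "Samsung", "text": "Samsung is a Korean company.", "is_supporting": False},
    ],
}

def supporting_paragraphs(sample):
    """Return only the paragraphs flagged as supporting the gold answer."""
    return [p for p in sample["paragraphs"] if p["is_supporting"]]

gold = supporting_paragraphs(sample)
```

The `is_supporting` flag distinguishes gold evidence from distractor paragraphs, which is useful for computing retrieval recall.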

Running MuSiQue

python main.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method openie

2WikiMultiHopQA

Wikipedia-based multi-hop reasoning dataset with supporting facts

Dataset Structure

  • Location: reproduce/dataset/2wikimultihopqa/
  • Files:
    • 2wikimultihopqa.json - Questions with supporting facts
    • 2wikimultihopqa_corpus.json - Wikipedia passages
  • Format: Includes supporting_facts field with document titles and sentence indices
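The `supporting_facts` pairs can be resolved against the corpus as sketched below; the sample and corpus here are hypothetical stand-ins shaped like the documented format (title plus sentence index):

```python
# Hypothetical sample: supporting_facts pairs a document title
# with the index of the supporting sentence in that document.
sample = {
    "question": "Example multi-hop question",
    "answer": "Example answer",
    "supporting_facts": [["Document A", 0], ["Document B", 2]],
}
corpus = {
    "Document A": ["First sentence.", "Second sentence."],
    "Document B": ["S0.", "S1.", "S2."],
}

def gold_sentences(sample, corpus):
    """Resolve (title, sentence_index) pairs to the sentence text."""
    return [corpus[title][idx] for title, idx in sample["supporting_facts"]]
```

Resolving facts this way yields the exact gold sentences, rather than whole passages, for fine-grained evidence evaluation.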

Running 2WikiMultiHopQA

python main.py \
    --dataset 2wikimultihopqa \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2

LoCoMo (Long Conversation Memory)

Episodic and temporal reasoning over long conversation sessions

Dataset Structure

  • Location: reproduce/dataset/locomo/
  • Files:
    • locomo_episodic.json - Episodic QA samples
    • locomo_temporal.json - Temporal QA samples
    • locomo10.json - 10-sample subset
    • msc_personas_all.json - Persona data

Data Format

Each sample contains:
  • conversation: Multi-session conversation data
    • session_{idx}: Dialogue turns with speaker and text
    • session_{idx}_date_time: Timestamp for each session
  • qa: List of question-answer pairs with:
    • question: Question text
    • answer: Gold answer
    • evidence: Dialogue IDs containing the answer
    • category: Question type (e.g., “factual”, “opinion”)
    • temporal_category: Temporal reasoning type (e.g., “none”, “before”, “after”)
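Because sessions and their timestamps live side by side under `session_{idx}` and `session_{idx}_date_time` keys, iterating them takes a small amount of key matching. A sketch, using a hypothetical two-session sample in the documented layout:

```python
import re

# Hypothetical sample following the documented LoCoMo conversation layout.
sample = {
    "conversation": {
        "session_1": [
            {"speaker": "Alice", "text": "I adopted a dog last week."},
            {"speaker": "Bob", "text": "What breed is it?"},
        ],
        "session_1_date_time": "2023-05-01 10:00",
        "session_2": [
            {"speaker": "Alice", "text": "The dog is a beagle."},
        ],
        "session_2_date_time": "2023-05-08 09:30",
    },
}

def iter_sessions(conversation):
    """Yield (index, timestamp, turns) for each session_{idx} key."""
    for key, turns in conversation.items():
        m = re.fullmatch(r"session_(\d+)", key)
        if m:
            idx = int(m.group(1))
            yield idx, conversation.get(f"session_{idx}_date_time"), turns

sessions = sorted(iter_sessions(sample["conversation"]))
```

Sorting by session index preserves chronological order, which matters for temporal questions that hinge on when something was said.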

Running LoCoMo

python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist

LongMemEval

Long-term memory evaluation across extended conversation sessions

Dataset Structure

  • Location: reproduce/dataset/longmemeval/
  • Files:
    • longmemeval_s.json - Small version (500+ samples)
    • longmemeval_m.json - Medium version
    • longmemeval_oracle.json - Oracle version with gold context

Data Format

Each sample contains:
  • haystack_sessions: List of conversation sessions with:
    • id: Session identifier
    • date: Session timestamp
    • messages: List of messages with role, content, and has_answer flag
  • question: Question to answer
  • question_date: When the question was asked
  • question_type: Type of question
  • answer: Gold answer
  • answer_session_ids: IDs of sessions containing the answer
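Combining `answer_session_ids` with the per-message `has_answer` flag localizes the gold evidence. A minimal sketch over a hypothetical sample in the documented shape:

```python
# Hypothetical sample mirroring the documented LongMemEval fields.
sample = {
    "question": "Where did the user move in March?",
    "question_date": "2023-04-01",
    "question_type": "single-session-user",
    "answer": "Berlin",
    "answer_session_ids": ["s2"],
    "haystack_sessions": [
        {"id": "s1", "date": "2023-02-10", "messages": [
            {"role": "user", "content": "Hi there!", "has_answer": False}]},
        {"id": "s2", "date": "2023-03-05", "messages": [
            {"role": "user", "content": "I just moved to Berlin.", "has_answer": True}]},
    ],
}

def evidence_messages(sample):
    """Collect messages flagged has_answer inside the gold sessions."""
    gold = set(sample["answer_session_ids"])
    return [
        msg["content"]
        for sess in sample["haystack_sessions"]
        if sess["id"] in gold
        for msg in sess["messages"]
        if msg.get("has_answer")
    ]
```

The non-gold sessions form the "haystack": distractor history that a memory system must ignore while retrieving the flagged evidence.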

Running LongMemEval

python examples/longmemeval.py \
    --dataset longmemeval_s.json \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist

Complex TR (Temporal Reasoning)

Complex reasoning over temporal facts and events

Dataset Structure

  • Location: reproduce/dataset/complex-tr/
  • Files:
    • complex_tr_1000.json - 1000 temporal reasoning questions
    • complex_tr_1000_corpus.json - Temporal facts corpus
    • complex_tr_3993.json - Full dataset (3993 questions)

Data Format

  • question: Temporal reasoning question
  • answers: List of valid answers
  • fact_context: Relevant temporal facts for the question
  • Corpus contains temporal statements like “Event A happened before Event B”
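Since `answers` is a list of valid surface forms, scoring a prediction means matching against any of them. A sketch of one plausible scoring rule (case-insensitive exact match; the sample is hypothetical, and the actual evaluation script may normalize differently):

```python
# Hypothetical sample mirroring the documented Complex TR fields.
sample = {
    "question": "Which event happened first, Event A or Event B?",
    "answers": ["Event A", "A"],
    "fact_context": ["Event A happened before Event B."],
}

def is_correct(prediction, sample):
    """A prediction is correct if it matches any valid answer, ignoring case."""
    return prediction.strip().lower() in {a.lower() for a in sample["answers"]}
```

Accepting multiple surface forms avoids penalizing a model for answering "A" where the gold label is "Event A".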

Running Complex TR

python examples/complex_tr.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist

TimeQA

Temporal question answering with time-sensitive information

Dataset Structure

  • Location: reproduce/dataset/timeqa/
  • Files: dev.easy.json, dev.hard.json
  • Format: Questions with temporal context and multiple target answers

Running TimeQA

python examples/timeqa.py \
    --llm_name gpt-4o-mini \
    --dataset_file reproduce/dataset/timeqa/dev.easy.json \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method temporal \
    --qa_top_k 5

RealTalk

Real conversation dataset for natural dialogue understanding

Dataset Structure

  • Location: reproduce/dataset/realtalk/
  • Files: Multiple chat files (Chat_1_*.json, Chat_2_*.json, etc.)
  • Format: Real conversation sessions with participants and messages

Running RealTalk

python examples/realtalk.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist

Semantic QA

Semantic reasoning benchmarks with multi-threaded support

Running Semantic QA

python examples/semantic_qa.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2

Dataset Comparison

| Dataset         | Size          | Type           | Context Length | Key Challenge            |
|-----------------|---------------|----------------|----------------|--------------------------|
| MuSiQue         | 2,417         | Multi-hop      | Medium         | Multi-document reasoning |
| 2WikiMultiHopQA | 12,576        | Multi-hop      | Medium         | Wikipedia knowledge      |
| LoCoMo          | 300+ sessions | Conversational | Long           | Episodic memory          |
| LongMemEval     | 500+          | Conversational | Very Long      | Long-term memory         |
| Complex TR      | 1,000-3,993   | Temporal       | Short          | Temporal reasoning       |
| TimeQA          | Variable      | Temporal       | Medium         | Time-sensitive facts     |
| RealTalk        | 10 chats      | Conversational | Medium         | Natural dialogue         |

Common Dataset Arguments

All benchmark scripts support these common arguments:
  • --llm_name (string, default: "gpt-4o-mini") - LLM model for QA and extraction
  • --embedding_name (string, default: "nvidia/NV-Embed-v2") - Embedding model for dense retrieval
  • --extract_method (string, default: "openie") - Extraction strategy: openie, episodic, episodic_gist, or temporal
  • --llm_base_url (string, default: "https://api.openai.com/v1") - Custom API endpoint for the LLM
  • --qa_top_k (integer, default: 5-10) - Number of top passages to use for QA
  • --linking_top_k (integer, default: 5) - Number of linked passages for graph traversal
