Dataset Overview
ReMem supports evaluation on multiple long-context QA datasets, each with unique characteristics and challenges.

MuSiQue
Multi-hop question answering requiring reasoning across multiple documents
Dataset Structure
- Location: `reproduce/dataset/musique/`
- Files:
  - `musique.json` - Questions and answers
  - `musique_corpus.json` - Document corpus
- Format: Each sample contains:
  - `question`: The question text
  - `answer`: Gold answer
  - `paragraphs`: Supporting paragraphs with an `is_supporting` flag
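The `is_supporting` flag lets you separate gold evidence from distractor paragraphs. A minimal sketch, using a toy sample that mirrors the fields described above (not real MuSiQue data):

```python
def load_supporting_paragraphs(sample: dict) -> list:
    """Return only the paragraphs marked as supporting the answer."""
    return [p for p in sample["paragraphs"] if p["is_supporting"]]

# Toy sample mirroring the documented fields (not real dataset content).
sample = {
    "question": "Who founded the company that makes the iPhone?",
    "answer": "Steve Jobs",
    "paragraphs": [
        {"title": "Apple Inc.", "text": "...", "is_supporting": True},
        {"title": "Unrelated article", "text": "...", "is_supporting": False},
    ],
}

supporting = load_supporting_paragraphs(sample)
print(len(supporting))  # 1
```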
Running MuSiQue
2WikiMultiHopQA
Wikipedia-based multi-hop reasoning dataset with supporting facts
Dataset Structure
- Location: `reproduce/dataset/2wikimultihopqa/`
- Files:
  - `2wikimultihopqa.json` - Questions with supporting facts
  - `2wikimultihopqa_corpus.json` - Wikipedia passages
- Format: Includes a `supporting_facts` field with document titles and sentence indices
Running 2WikiMultiHopQA
LoCoMo (Long Conversation Memory)
Episodic and temporal reasoning over long conversation sessions
Dataset Structure
- Location: `reproduce/dataset/locomo/`
- Files:
  - `locomo_episodic.json` - Episodic QA samples
  - `locomo_temporal.json` - Temporal QA samples
  - `locomo10.json` - 10-sample subset
  - `msc_personas_all.json` - Persona data
Data Format
Each sample contains:
- `conversation`: Multi-session conversation data
  - `session_{idx}`: Dialogue turns with speaker and text
  - `session_{idx}_date_time`: Timestamp for each session
- `qa`: List of question-answer pairs with:
  - `question`: Question text
  - `answer`: Gold answer
  - `evidence`: Dialogue IDs containing the answer
  - `category`: Question type (e.g., "factual", "opinion")
  - `temporal_category`: Temporal reasoning type (e.g., "none", "before", "after")
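Because sessions live under numbered keys rather than in a list, iterating them takes a small helper. A sketch assuming sessions are numbered 1, 2, ... consecutively (the toy conversation below is illustrative, not real LoCoMo data):

```python
def iter_sessions(conversation: dict):
    """Yield (timestamp, turns) per session, assuming keys session_1, session_2, ... are consecutive."""
    idx = 1
    while f"session_{idx}" in conversation:
        yield conversation[f"session_{idx}_date_time"], conversation[f"session_{idx}"]
        idx += 1

# Toy conversation following the field layout above.
conversation = {
    "session_1": [{"speaker": "Alice", "text": "I adopted a cat last week."}],
    "session_1_date_time": "2023-05-01 10:00",
    "session_2": [{"speaker": "Bob", "text": "How is the cat doing?"}],
    "session_2_date_time": "2023-05-08 19:30",
}

sessions = list(iter_sessions(conversation))
print(len(sessions))  # 2
```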
Running LoCoMo
LongMemEval
Long-term memory evaluation across extended conversation sessions
Dataset Structure
- Location: `reproduce/dataset/longmemeval/`
- Files:
  - `longmemeval_s.json` - Small version (500+ samples)
  - `longmemeval_m.json` - Medium version
  - `longmemeval_oracle.json` - Oracle version with gold context
Data Format
Each sample contains:
- `haystack_sessions`: List of conversation sessions with:
  - `id`: Session identifier
  - `date`: Session timestamp
  - `messages`: List of messages with `role`, `content`, and a `has_answer` flag
- `question`: Question to answer
- `question_date`: When the question was asked
- `question_type`: Type of question
- `answer`: Gold answer
- `answer_session_ids`: IDs of sessions containing the answer
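The `answer_session_ids` and `has_answer` fields together pinpoint the gold evidence inside the haystack. A minimal sketch with a toy sample that mirrors the documented layout (not real LongMemEval data):

```python
def evidence_messages(sample: dict) -> list:
    """Collect has_answer message contents from the sessions named in answer_session_ids."""
    wanted = set(sample["answer_session_ids"])
    return [
        msg["content"]
        for sess in sample["haystack_sessions"]
        if sess["id"] in wanted
        for msg in sess["messages"]
        if msg.get("has_answer")
    ]

# Toy sample following the field layout above.
sample = {
    "question": "Where did the user move?",
    "question_date": "2024-01-10",
    "question_type": "single-session",
    "answer": "Berlin",
    "answer_session_ids": ["s2"],
    "haystack_sessions": [
        {"id": "s1", "date": "2023-11-02", "messages": [
            {"role": "user", "content": "I like hiking.", "has_answer": False}]},
        {"id": "s2", "date": "2023-12-15", "messages": [
            {"role": "user", "content": "I just moved to Berlin.", "has_answer": True}]},
    ],
}

print(evidence_messages(sample))  # ['I just moved to Berlin.']
```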
Running LongMemEval
Complex TR (Temporal Reasoning)
Multi-step reasoning over temporal facts and events
Dataset Structure
- Location: `reproduce/dataset/complex-tr/`
- Files:
  - `complex_tr_1000.json` - 1,000 temporal reasoning questions
  - `complex_tr_1000_corpus.json` - Temporal facts corpus
  - `complex_tr_3993.json` - Full dataset (3,993 questions)
Data Format
- `question`: Temporal reasoning question
- `answers`: List of valid answers
- `fact_context`: Relevant temporal facts for the question
- Corpus contains temporal statements like "Event A happened before Event B"
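Because `answers` is a list of valid answers rather than a single gold string, scoring should accept any of them. A sketch of one common approach (exact match after simple normalization; the scoring the repo's scripts actually use may differ):

```python
def is_correct(prediction: str, answers: list) -> bool:
    """Exact-match scoring against multiple valid answers, after lowercasing and whitespace collapse."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) in {norm(a) for a in answers}

# Toy sample mirroring the documented fields.
sample = {
    "question": "Which event happened first?",
    "answers": ["Event A", "the start of Event A"],
    "fact_context": ["Event A happened before Event B"],
}

print(is_correct("event a", sample["answers"]))  # True
```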
Running Complex TR
TimeQA
Temporal question answering with time-sensitive information
Dataset Structure
- Location: `reproduce/dataset/timeqa/`
- Files: `dev.easy.json`, `dev.hard.json`
- Format: Questions with temporal context and multiple target answers
Running TimeQA
RealTalk
Real conversation dataset for natural dialogue understanding
Dataset Structure
- Location: `reproduce/dataset/realtalk/`
- Files: Multiple chat files (`Chat_1_*.json`, `Chat_2_*.json`, etc.)
- Format: Real conversation sessions with participants and messages
Running RealTalk
Semantic QA
Semantic reasoning benchmarks with multi-threaded support
Running Semantic QA
Dataset Comparison
| Dataset | Size | Type | Context Length | Key Challenge |
|---|---|---|---|---|
| MuSiQue | 2,417 | Multi-hop | Medium | Multi-document reasoning |
| 2WikiMultiHopQA | 12,576 | Multi-hop | Medium | Wikipedia knowledge |
| LoCoMo | 300+ sessions | Conversational | Long | Episodic memory |
| LongMemEval | 500+ | Conversational | Very Long | Long-term memory |
| Complex TR | 1,000-3,993 | Temporal | Short | Temporal reasoning |
| TimeQA | Variable | Temporal | Medium | Time-sensitive facts |
| RealTalk | 10 chats | Conversational | Medium | Natural dialogue |
Common Dataset Arguments
All benchmark scripts support these common arguments:
- LLM model for QA and extraction
- Embedding model for dense retrieval
- Extraction strategy: `openie`, `episodic`, `episodic_gist`, or `temporal`
- Custom API endpoint for the LLM
- Number of top passages to use for QA
- Number of linked passages for graph traversal
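The shared options above could be declared once with `argparse`. This is a sketch only: the flag names and defaults below are assumptions for illustration, not the repo's actual CLI.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical declaration of the shared options; real flag names may differ."""
    p = argparse.ArgumentParser(description="Common benchmark arguments")
    p.add_argument("--llm-model", help="LLM model for QA and extraction")
    p.add_argument("--embedding-model", help="Embedding model for dense retrieval")
    p.add_argument("--extraction",
                   choices=["openie", "episodic", "episodic_gist", "temporal"],
                   help="Extraction strategy")
    p.add_argument("--api-base", help="Custom API endpoint for the LLM")
    p.add_argument("--top-k", type=int, default=5,
                   help="Number of top passages to use for QA")
    p.add_argument("--link-top-k", type=int, default=3,
                   help="Number of linked passages for graph traversal")
    return p

args = build_parser().parse_args(["--extraction", "temporal", "--top-k", "10"])
print(args.extraction, args.top_k)  # temporal 10
```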