Supported Datasets
ReMem has been evaluated on the following datasets:

- MuSiQue: Multi-hop question answering requiring reasoning across multiple documents
- 2WikiMultiHopQA: Wikipedia-based multi-hop reasoning dataset
- LoCoMo: Long-conversation memory evaluation with episodic and temporal reasoning
- LongMemEval: Long-term memory evaluation across conversation sessions
- Complex TR: Complex temporal reasoning over temporal facts
- TimeQA: Temporal question answering dataset
- RealTalk: Real conversation analysis and understanding
- Semantic QA: Semantic reasoning benchmarks
Benchmark Architecture
Directory Structure
Extraction Methods
ReMem supports multiple extraction methods for building the knowledge graph:

- openie: Open Information Extraction using LLM-based entity and relation extraction
- episodic: Episode-based extraction for conversational data
- episodic_gist: Episode-based extraction with gist summarization (the default for conversations)
- temporal: Temporal-aware extraction for time-based reasoning
Evaluation Metrics
Benchmarks track multiple metrics:

QA Metrics
- qa_em: Exact Match score
- qa_f1: Token-level F1 score
- qa_bleu1: BLEU-1 score
- qa_mem0_llm_judge: LLM-as-judge evaluation
- qa_longmemeval: LongMemEval-specific metric
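The string-matching metrics above (qa_em, qa_f1) are standard QA measures. A minimal sketch, using a common SQuAD-style normalization that may differ in detail from ReMem's exact implementation:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace
    (a common normalization before QA string matching)."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred, gold):
    """qa_em: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """qa_f1: harmonic mean of token-level precision and recall."""
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p_toks) & Counter(g_toks)).values())
    if common == 0:
        return 0.0
    precision = common / len(p_toks)
    recall = common / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, a verbose answer containing the gold span scores 0 on exact match but partial credit on token F1.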
Retrieval Metrics
- retrieval_recall: Recall of gold documents
- retrieval_recall_all: Recall across all retrieved chunks
- retrieval_ndcg_any: NDCG score
- retrieval_recall_locomo: LoCoMo-specific recall
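Recall and NDCG over retrieved documents can likewise be sketched. This assumes binary relevance against a set of gold document ids, which may not match each benchmark's exact definition:

```python
import math

def retrieval_recall(retrieved, gold):
    """Fraction of gold document ids that appear anywhere in `retrieved`."""
    retrieved_set = set(retrieved)
    return sum(1 for g in gold if g in retrieved_set) / len(gold)

def ndcg(retrieved, gold, k=None):
    """Binary-relevance NDCG: relevance is 1 if a retrieved id is in gold.
    DCG discounts hits by log2(rank + 1); IDCG is the best achievable DCG."""
    if k is not None:
        retrieved = retrieved[:k]
    dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(retrieved) if doc in gold)
    ideal_hits = min(len(gold), len(retrieved))
    idcg = sum(1 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Unlike recall, NDCG rewards ranking gold documents earlier: a gold hit at rank 1 scores higher than the same hit at rank 2.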
Output Structure
Benchmark results are saved to outputs/{dataset}/.
Performance Optimization
Caching
Benchmarks use extensive caching to speed up repeated runs:

- Embedding Cache: vdb_*.pkl files store computed embeddings
- OpenIE Cache: Extracted entities and relations are cached
- LLM Cache: API responses are cached to reduce costs
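A minimal sketch of a pickle-backed embedding cache in the spirit of the vdb_*.pkl files; the class name and on-disk layout here are illustrative assumptions, not ReMem's actual format:

```python
import os
import pickle

class EmbeddingCache:
    """Illustrative pickle-backed cache mapping text -> embedding vector."""

    def __init__(self, path="vdb_cache.pkl"):
        self.path = path
        self.store = {}
        # Load any embeddings persisted by a previous run.
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.store = pickle.load(f)

    def get_or_compute(self, text, embed_fn):
        """Return the cached embedding for `text`; on a miss, compute it
        with `embed_fn` and persist the updated store to disk."""
        if text not in self.store:
            self.store[text] = embed_fn(text)
            with open(self.path, "wb") as f:
                pickle.dump(self.store, f)
        return self.store[text]
```

On a repeated run the cache file is loaded up front, so embedding calls (typically the dominant API cost) are skipped for texts already seen.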
Parallel Processing
Many benchmarks support parallel processing.

Force Flags
Control cache behavior with force flags:

- --force_index_from_scratch / -fi: Rebuild the index
- --force_openie_from_scratch / -fo: Rerun extraction
- --force_rag / -fr: Rerun QA evaluation
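The documented flag names above could be wired up with argparse roughly as follows; this is a sketch, not the actual ReMem CLI code:

```python
import argparse

# Sketch of the documented force flags as argparse options; the real
# ReMem CLI may define defaults and help text differently.
parser = argparse.ArgumentParser(description="Run a ReMem benchmark")
parser.add_argument("--force_index_from_scratch", "-fi", action="store_true",
                    help="Rebuild the index, ignoring any cached version")
parser.add_argument("--force_openie_from_scratch", "-fo", action="store_true",
                    help="Rerun extraction instead of using the OpenIE cache")
parser.add_argument("--force_rag", "-fr", action="store_true",
                    help="Rerun QA evaluation even if results are cached")

# Example invocation: rebuild the index and rerun QA, keep the OpenIE cache.
args = parser.parse_args(["-fi", "-fr"])
```

Each flag defaults to False, so omitting it leaves the corresponding cache in use.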
Next Steps
Dataset Details
Learn about each dataset’s structure and requirements
Running Benchmarks
Step-by-step guide to running benchmarks
Baseline Methods
Compare ReMem against baseline methods