Overview
ReMem provides baseline implementations for comparison against the full graph-based approach. These baselines help validate the benefits of the knowledge graph architecture.

Baseline Methods
- Dense Retrieval: Standard dense passage retrieval without graph structure
- Long Context: Direct long-context prompting with full document context
- TISER: Temporal reasoning with chain-of-thought reflection
- LLM-only: Direct prompting without any retrieval
Dense Retrieval Baseline
Standard dense passage retrieval using embeddings, without knowledge graph linking.
Method Description
The dense baseline:
- Embeds all passages using the same embedding model as ReMem
- Retrieves top-k most similar passages to the query
- Uses retrieved passages directly for QA (no graph traversal)
- Uses the same LLM for answer generation
Running Dense Baselines
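A typical invocation might look like the following; the script name is an assumption for illustration and the flag values are placeholders, but the flags themselves are the documented arguments below.

```bash
# Dense retrieval baseline (script name and flag values are placeholders)
python run_dense_baseline.py \
    --llm_name gpt-4o-mini \
    --embedding_name text-embedding-3-small \
    --qa_top_k 5
```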
Dense Baseline Arguments
| Argument | Description |
|---|---|
| `--llm_name` | LLM for answer generation |
| `--embedding_name` | Embedding model for dense retrieval |
| `--qa_top_k` | Number of passages to retrieve for QA |
| `--gold_retrieval` | Use gold documents directly (oracle retrieval for upper bound) |
Gold Retrieval (Oracle)
Test with perfect retrieval to measure the QA upper bound:
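A sketch of an oracle run, reusing the assumed script name from above with the documented `--gold_retrieval` flag:

```bash
# Oracle retrieval: answer from the gold documents directly (QA upper bound)
python run_dense_baseline.py \
    --llm_name gpt-4o-mini \
    --gold_retrieval
```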
Long Context Baseline
Direct prompting with full document context, testing the long-context capabilities of LLMs.
Method Description
The long context baseline:
- Concatenates all documents into a single long context
- Prompts the LLM with the full context and question
- No retrieval or graph traversal required
- Tests the model’s ability to handle long contexts
Running Long Context Baselines
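A sketch of a long-context run; the script name is assumed, and no retrieval flags are needed since the full context is passed to the model directly:

```bash
# Long-context baseline: prompt with all documents concatenated
python run_long_context_baseline.py \
    --llm_name gpt-4o-mini
```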
TISER Baseline
Temporal Information-aware Sequential Retrieval with chain-of-thought reasoning and reflection.
Method Description
TISER uses:
- Chain-of-Thought Reasoning: step-by-step reasoning within `<reasoning>` tags
- Timeline Construction: identifies relevant temporal events in `<timeline>` tags
- Reflection: self-evaluation and error checking in `<reflection>` tags
- Final Answer: concise answer in `<answer>` tags
TISER Prompt Structure
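The exact prompt is not reproduced here; the sketch below shows one plausible way the four tag sections listed above are requested from the model (the `{...}` fields are placeholders):

```text
Answer the question. Think step by step inside <reasoning> tags, build an
ordered timeline of relevant events inside <timeline> tags, check your work
inside <reflection> tags, and give a concise final answer inside <answer> tags.

Context: {retrieved_passages}
Question: {question}
```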
Running TISER Baselines
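As with the other baselines, the script name below is an assumption; the flags are the documented ones, with placeholder values:

```bash
# TISER baseline (script name is assumed; flag values are placeholders)
python run_tiser_baseline.py \
    --llm_name gpt-4o-mini \
    --qa_top_k 5
```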
TISER Response Format
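An illustrative (not actual) response following the tag structure:

```text
<reasoning>The question asks what happened after the 2019 merger. Passage 2
dates the merger to March 2019; passage 4 describes the rebranding that
followed it.</reasoning>
<timeline>March 2019: merger completed -> June 2019: company rebranded</timeline>
<reflection>Both events are explicitly dated and the ordering is consistent.</reflection>
<answer>The company was rebranded.</answer>
```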
LLM-only Baseline
Direct prompting without any retrieval, testing pure LLM reasoning capability.
Running LLM-only Baselines
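A sketch of an LLM-only run; only the answer-generation model is specified, since nothing is retrieved (script name is assumed):

```bash
# LLM-only baseline: the question goes straight to the model, with no context
python run_llm_only_baseline.py \
    --llm_name gpt-4o-mini
```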
- No retrieval or context is provided
- Tests the parametric knowledge of the LLM
- Useful for measuring memorized knowledge vs. retrieval needs
Comparison Guidelines
Fair Comparison Setup
For a fair comparison, use identical configurations across methods.

Key Configuration Matching
Match these parameters (a matched-run sketch follows this list):
- Same `--llm_name` for answer generation
- Same `--embedding_name` for retrieval baselines
- Same `--qa_top_k` for the number of passages used
- Same dataset split (use `--indices` for consistent sampling)
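For instance, a matched pair of runs might look like the following; the script names and flag values are placeholders, while the flags are the documented arguments:

```bash
# Shared settings, held fixed across methods for a fair comparison
COMMON="--llm_name gpt-4o-mini --qa_top_k 5 --indices eval_indices.json"

python run_remem.py          $COMMON --embedding_name text-embedding-3-small
python run_dense_baseline.py $COMMON --embedding_name text-embedding-3-small
```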
Expected Performance Differences
When ReMem Outperforms Baselines
Multi-hop Reasoning
Datasets: MuSiQue, 2WikiMultiHopQA

ReMem’s graph structure enables:
- Following connections between related passages
- Aggregating information across multiple hops
- Better recall of supporting documents
Temporal Reasoning
Datasets: LoCoMo Temporal, Complex TR, TimeQA

ReMem’s temporal edges:
- Track temporal relationships explicitly
- Enable before/after reasoning
- Better handle time-dependent queries
Long Conversations
Datasets: LoCoMo, LongMemEval, RealTalk

ReMem’s episodic memory:
- Organizes by conversation sessions
- Links related mentions across time
- Better recall of episodic details
When Baselines May Compete
Single-hop Queries
For simple factual questions answerable from one passage, dense retrieval may be sufficient.
Short Contexts
When the full context fits in the model's window, the long-context baseline can leverage all information directly.
Perfect Retrieval
With oracle retrieval (`--gold_retrieval`), QA performance differences shrink as the graph-traversal advantage diminishes.
Output Comparison
Results File Location
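The exact location depends on your run configuration; a per-method layout like the following is a reasonable assumption, with one metrics file per run:

```text
outputs/
  <dataset>/
    <method>/        # e.g. remem, dense, long_context, tiser
      results.json   # contains the metrics listed below
```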
Metrics to Compare
| Metric | Description | Higher is Better |
|---|---|---|
| `qa_em` | Exact Match | ✓ |
| `qa_f1` | Token-level F1 | ✓ |
| `qa_bleu1` | BLEU-1 score | ✓ |
| `qa_mem0_llm_judge` | LLM-as-judge evaluation | ✓ |
| `retrieval_recall` | Gold document recall | ✓ |
| `retrieval_ndcg_any` | NDCG ranking metric | ✓ |
Example Comparison Results
Analysis Scripts
Compare Multiple Methods
Create a comparison script (comparison.py):
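A minimal sketch of such a script; the result-file paths are assumptions (matching the layout sketched above), and the metric keys come from the metrics table in the previous section. Adjust the paths to your actual output directory.

```python
"""Print a side-by-side comparison of baseline metrics.

Minimal sketch: the results-file layout is an assumption; the metric
keys match the "Metrics to Compare" table in this document.
"""
import json
from pathlib import Path

# Metric keys from the "Metrics to Compare" table.
METRICS = ["qa_em", "qa_f1", "qa_bleu1", "retrieval_recall", "retrieval_ndcg_any"]

# Hypothetical mapping from method name to its metrics file.
RESULT_FILES = {
    "remem": Path("outputs/musique/remem/results.json"),
    "dense": Path("outputs/musique/dense/results.json"),
    "long_context": Path("outputs/musique/long_context/results.json"),
    "tiser": Path("outputs/musique/tiser/results.json"),
}


def main() -> None:
    # Header row, tab-separated for easy pasting into a spreadsheet.
    print("\t".join(["method"] + METRICS))
    for method, path in RESULT_FILES.items():
        if not path.exists():
            continue  # skip methods that have not been run yet
        with path.open() as f:
            results = json.load(f)
        row = [method] + [
            f"{results[m]:.4f}" if m in results else "n/a" for m in METRICS
        ]
        print("\t".join(row))


if __name__ == "__main__":
    main()
```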