
Overview

ReMem provides baseline implementations for comparison against the full graph-based approach. These baselines help validate the benefits of the knowledge graph architecture.

Baseline Methods

Dense Retrieval

Standard dense passage retrieval without graph structure

Long Context

Direct long-context prompting with full document context

TISER

Temporal reasoning with chain-of-thought reflection

Dense Retrieval Baseline

Standard dense passage retrieval using embeddings, without knowledge graph linking.

Method Description

The dense baseline:
  1. Embeds all passages using the same embedding model as ReMem
  2. Retrieves top-k most similar passages to the query
  3. Uses retrieved passages directly for QA (no graph traversal)
  4. Uses the same LLM for answer generation
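The retrieval step (2) can be sketched in plain Python. This is an illustrative implementation, not the repository's code: `top_k_passages` and the toy 2-d vectors are hypothetical stand-ins for real embedding-model outputs.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def top_k_passages(query_vec, passage_vecs, passages, k=10):
    """Rank passages by cosine similarity to the query and keep the top-k."""
    scored = [(p, cosine(query_vec, vec)) for p, vec in zip(passages, passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-d vectors standing in for real embedding-model outputs.
passages = ["p0", "p1", "p2"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.1]
print(top_k_passages(query, vecs, passages, k=2))  # p0 ranks first
```

The retrieved passages are then concatenated into the QA prompt with no graph traversal in between, which is exactly what makes this a baseline.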

Running Dense Baselines

python baselines/locomo_dense.py \
    --llm_name gpt-4.1-mini-2025-04-14 \
    --dataset locomo_episodic \
    --embedding_name nvidia/NV-Embed-v2 \
    --qa_top_k 10

Dense Baseline Arguments

--llm_name
string
required
LLM for answer generation
--embedding_name
string
required
Embedding model for dense retrieval
--qa_top_k
integer
default:"10"
Number of passages to retrieve for QA
--gold_retrieval
boolean
default:"false"
Use gold documents directly (oracle retrieval for upper bound)

Gold Retrieval (Oracle)

Test with perfect retrieval to measure QA upper bound:
python baselines/locomo_dense.py \
    --llm_name gpt-4.1-mini-2025-04-14 \
    --dataset locomo_episodic \
    --embedding_name nvidia/NV-Embed-v2 \
    --gold_retrieval  # Use gold documents directly

Long Context Baseline

Direct prompting with full document context, testing long-context capabilities of LLMs.

Method Description

The long context baseline:
  1. Concatenates all documents into a single long context
  2. Prompts the LLM with the full context and question
  3. No retrieval or graph traversal required
  4. Tests the model’s ability to handle long contexts
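The concatenation step above can be sketched as follows. This is an illustrative helper, not the repository's code: `build_long_context_prompt` and its `max_chars` budget are hypothetical, standing in for whatever truncation policy the actual script applies.

```python
def build_long_context_prompt(documents, question, max_chars=400_000):
    """Concatenate all documents into one context and append the question.

    Documents beyond max_chars are dropped with a truncation note, since
    the full corpus may exceed the model's context window.
    """
    context, used = [], 0
    for i, doc in enumerate(documents):
        if used + len(doc) > max_chars:
            context.append(f"[... {len(documents) - i} document(s) truncated ...]")
            break
        context.append(doc)
        used += len(doc)
    body = "\n\n".join(context)
    return f"Context:\n{body}\n\nQuestion: {question}\nAnswer:"

prompt = build_long_context_prompt(["Doc A ...", "Doc B ..."], "Who met whom?")
print(prompt.splitlines()[0])  # "Context:"
```

The resulting string is sent to the LLM in a single call, with no retrieval stage at all.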

Running Long Context Baselines

python baselines/locomo_long_context.py \
    --llm_name gpt-4o-mini \
    --dataset locomo_episodic

Context Length Limits: Long context baselines may exceed model context windows for very long documents. Some datasets may need to be truncated.

TISER Baseline

Temporal Information-aware Sequential Retrieval with chain-of-thought reasoning and reflection.

Method Description

TISER uses:
  1. Chain-of-Thought Reasoning: Step-by-step reasoning within <reasoning> tags
  2. Timeline Construction: Identifies relevant temporal events in <timeline> tags
  3. Reflection: Self-evaluation and error checking in <reflection> tags
  4. Final Answer: Concise answer in <answer> tags

TISER Prompt Structure

tiser_developer = """You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries. Follow these steps:
1. Reason through the problem step by step within the <reasoning> tags.
2. Given your previous reasoning, identify relevant temporal events in the given context for answering the given question within <timeline> tags. Assume relations in the context are unidirectional.
3. Reflect on your reasoning and the timeline to check for any errors or improvements within the <reflection> tags.
4. Make any necessary adjustments based on your reflection. If there is additional reasoning required, go back to Step 1 (reason through the problem step-by-step), otherwise move to the next step (Step 5).
5. Provide your final, concise answer within the <answer> tags.
"""

tiser_user = """Question: {question}
Temporal Context: {temporal_context}"""

Running TISER Baselines

python baselines/complex_tr_tiser.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2

TISER Response Format

<reasoning>
[Step-by-step reasoning process]
<timeline>
[Relevant temporal events for answering the question]
</timeline>
<reflection>
[Self-evaluation of reasoning and timeline]
</reflection>
[Any adjustments based on reflection]
</reasoning>
<answer>
[Final concise answer]
</answer>
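A response in this format can be parsed with a small tag extractor. This is an illustrative sketch, not the repository's parser: `extract_tag` is a hypothetical helper that pulls the last occurrence of a tag, which is useful when the model loops back to Step 1 and emits a tag more than once.

```python
import re

def extract_tag(response, tag):
    """Return the content of the last <tag>...</tag> span, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

# A toy response following the TISER format above.
response = """<reasoning>
Alice moved in 2021; the question asks about 2022.
<timeline>2021: Alice moves. 2022: Alice travels.</timeline>
<reflection>Timeline is consistent with the context.</reflection>
</reasoning>
<answer>Alice was traveling.</answer>"""

print(extract_tag(response, "answer"))   # Alice was traveling.
print(extract_tag(response, "timeline"))
```

Only the `<answer>` span is scored; the other tags exist to structure the model's intermediate reasoning.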

LLM-only Baseline

Direct prompting without any retrieval, testing pure LLM reasoning capability.

Running LLM-only Baselines

python baselines/tot_semantic_llm.py \
    --llm_name gpt-4o-mini

This baseline:
  • Provides no retrieval or external context
  • Tests the LLM's parametric knowledge
  • Is useful for separating knowledge memorization from retrieval needs

Comparison Guidelines

Fair Comparison Setup

For fair comparison, use identical configurations:
python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist \
    --qa_top_k 10

Key Configuration Matching

Match these parameters for fair comparison:
  • Same --llm_name for answer generation
  • Same --embedding_name for retrieval baselines
  • Same --qa_top_k for number of passages used
  • Same dataset split (use --indices for consistent sampling)
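One way to keep these parameters matched is to generate all command lines from a single shared config. This is an illustrative sketch: which flags each entry script actually accepts is an assumption, and the script paths simply follow the examples shown earlier on this page.

```python
# Shared settings reused across ReMem and every baseline run, so only the
# entry script differs between methods.
shared = {
    "--dataset": "locomo_episodic",
    "--llm_name": "gpt-4o-mini",
    "--embedding_name": "nvidia/NV-Embed-v2",
    "--qa_top_k": "10",
}

scripts = {
    "remem": "examples/locomo.py",
    "dense": "baselines/locomo_dense.py",
    "long_context": "baselines/locomo_long_context.py",
}

def build_command(method):
    args = dict(shared)
    if method == "long_context":  # no retrieval, so drop retrieval-only flags
        args.pop("--embedding_name")
        args.pop("--qa_top_k")
    flat = [item for pair in args.items() for item in pair]
    return ["python", scripts[method], *flat]

for m in scripts:
    print(" ".join(build_command(m)))
```

Generating commands this way makes a mismatched `--qa_top_k` or embedding model a code-review error rather than a silent confound.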

Expected Performance Differences

When ReMem Outperforms Baselines

Datasets: MuSiQue, 2WikiMultiHopQA
ReMem's graph structure enables:
  • Following connections between related passages
  • Aggregating information across multiple hops
  • Better recall of supporting documents

Datasets: LoCoMo Temporal, Complex TR, TimeQA
ReMem's temporal edges enable:
  • Tracking temporal relationships explicitly
  • Before/after reasoning over event order
  • Better handling of time-dependent queries

Datasets: LoCoMo, LongMemEval, RealTalk
ReMem's episodic memory:
  • Organizes passages by conversation session
  • Links related mentions across time
  • Improves recall of episodic details

When Baselines May Compete

For simple factual questions answerable from one passage, dense retrieval may be sufficient.
When full context fits in model window, long-context baseline can leverage all information directly.
With oracle retrieval (--gold_retrieval), QA performance differences shrink as the graph traversal advantage diminishes.

Output Comparison

Results File Location

# ReMem results
outputs/locomo/locomo_0_gpt-4o-mini_nvidia_NV-Embed-v2/
    └── rag_results_agent_max_step_3.json

# Dense baseline results
outputs/locomo_dense/locomo_0_gpt-4o-mini_nvidia_NV-Embed-v2/
    └── rag_results.json

Metrics to Compare

Metric                Description
qa_em                 Exact Match
qa_f1                 Token-level F1
qa_bleu1              BLEU-1 score
qa_mem0_llm_judge     LLM-as-judge evaluation
retrieval_recall      Gold document recall
retrieval_ndcg_any    NDCG ranking metric

Higher is better for every metric above.
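The two headline QA metrics can be sketched in a few lines. This is an illustrative implementation of the standard EM and token-level F1 definitions, not the repository's exact scorer; its normalization (lowercasing, stripping punctuation and articles) is a common convention that the actual evaluation code may differ from.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles, and strip punctuation before comparison."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(token_f1("tower in Paris", "eiffel tower"))
```

EM rewards only verbatim answers, while F1 gives partial credit for overlapping tokens, which is why the two can diverge noticeably in the comparison tables.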

Example Comparison Results

{
  "ReMem": {
    "qa_em": 0.7234,
    "qa_f1": 0.8456,
    "retrieval_recall": 0.8901
  },
  "Dense": {
    "qa_em": 0.6512,
    "qa_f1": 0.7823,
    "retrieval_recall": 0.7234
  },
  "Long Context": {
    "qa_em": 0.6890,
    "qa_f1": 0.8012
  }
}

Analysis Scripts

Compare Multiple Methods

Create a comparison script:
comparison.py
import json
import glob

results = {}
for method in ["remem", "dense", "long_context"]:
    pattern = f"outputs/{method}/**/rag_results*.json"
    files = glob.glob(pattern, recursive=True)
    
    metrics = {"qa_em": [], "qa_f1": []}
    for f in files:
        with open(f) as fh:
            data = json.load(fh)
        if "overall_metrics" in data:
            for k in metrics:
                if k in data["overall_metrics"]:
                    metrics[k].append(data["overall_metrics"][k])
    
    results[method] = {
        k: sum(v)/len(v) for k, v in metrics.items() if v
    }

print(json.dumps(results, indent=2))
Run:
python comparison.py

Best Practices

Run Multiple Seeds: For robust comparison, run experiments with different random seeds or data shuffles.
Consistent Infrastructure: Run all methods on the same hardware to ensure fair timing comparisons.
Report All Metrics: Include both retrieval and QA metrics to understand where improvements come from.
Cache Considerations: When comparing timing, ensure fair cache usage. Either disable caching for all methods or enable for all.
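Results from multiple seeds can be summarized with a mean and a sample standard deviation. This is a minimal sketch with hypothetical per-seed `qa_f1` scores; it is not output from an actual run.

```python
from statistics import mean, stdev

def aggregate(runs):
    """Summarize per-seed metric values as mean and sample std dev."""
    return {"mean": round(mean(runs), 4),
            "std": round(stdev(runs), 4) if len(runs) > 1 else 0.0}

# Hypothetical qa_f1 scores from three seeds of the same configuration.
seed_scores = [0.8412, 0.8456, 0.8390]
print(aggregate(seed_scores))
```

Reporting the spread alongside the mean makes it clear whether a gap between ReMem and a baseline exceeds run-to-run noise.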
