
Overview

ReMem provides baseline implementations for comparison against the full graph-based approach. These baselines help validate the benefits of the knowledge graph architecture.

Baseline Methods

Dense Retrieval

Standard dense passage retrieval without graph structure

Long Context

Direct long-context prompting with full document context

TISER

Temporal reasoning with chain-of-thought reflection

Dense Retrieval Baseline

Standard dense passage retrieval using embeddings, without knowledge graph linking.

Method Description

The dense baseline:
  1. Embeds all passages using the same embedding model as ReMem
  2. Retrieves top-k most similar passages to the query
  3. Uses retrieved passages directly for QA (no graph traversal)
  4. Uses the same LLM for answer generation
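The retrieval step (2) can be sketched in plain Python. This is an illustrative implementation, not the repository's code: `top_k_passages` and the toy 2-d vectors are hypothetical stand-ins for real embedding-model outputs.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def top_k_passages(query_vec, passage_vecs, passages, k=10):
    """Rank passages by cosine similarity to the query and keep the top-k."""
    scored = [(p, cosine(query_vec, vec)) for p, vec in zip(passages, passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-d vectors standing in for real embedding-model outputs.
passages = ["p0", "p1", "p2"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.1]
print(top_k_passages(query, vecs, passages, k=2))  # p0 ranks first
```

The retrieved passages are then concatenated into the QA prompt with no graph traversal in between, which is exactly what makes this a baseline.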

Running Dense Baselines

python baselines/locomo_dense.py \
    --llm_name gpt-4.1-mini-2025-04-14 \
    --dataset locomo_episodic \
    --embedding_name nvidia/NV-Embed-v2 \
    --qa_top_k 10

Dense Baseline Arguments

--llm_name
string
required
LLM for answer generation
--embedding_name
string
required
Embedding model for dense retrieval
--qa_top_k
integer
default:"10"
Number of passages to retrieve for QA
--gold_retrieval
boolean
default:"false"
Use gold documents directly (oracle retrieval for upper bound)

Gold Retrieval (Oracle)

Test with perfect retrieval to measure QA upper bound:
python baselines/locomo_dense.py \
    --llm_name gpt-4.1-mini-2025-04-14 \
    --dataset locomo_episodic \
    --embedding_name nvidia/NV-Embed-v2 \
    --gold_retrieval  # Use gold documents directly

Long Context Baseline

Direct prompting with full document context, testing long-context capabilities of LLMs.

Method Description

The long context baseline:
  1. Concatenates all documents into a single long context
  2. Prompts the LLM with the full context and question
  3. No retrieval or graph traversal required
  4. Tests the model’s ability to handle long contexts
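The concatenation step above can be sketched as follows. This is an illustrative helper, not the repository's code: `build_long_context_prompt` and its `max_chars` budget are hypothetical, standing in for whatever truncation policy the actual script applies.

```python
def build_long_context_prompt(documents, question, max_chars=400_000):
    """Concatenate all documents into one context and append the question.

    Documents beyond max_chars are dropped with a truncation note, since
    the full corpus may exceed the model's context window.
    """
    context, used = [], 0
    for i, doc in enumerate(documents):
        if used + len(doc) > max_chars:
            context.append(f"[... {len(documents) - i} document(s) truncated ...]")
            break
        context.append(doc)
        used += len(doc)
    body = "\n\n".join(context)
    return f"Context:\n{body}\n\nQuestion: {question}\nAnswer:"

prompt = build_long_context_prompt(["Doc A ...", "Doc B ..."], "Who met whom?")
print(prompt.splitlines()[0])  # "Context:"
```

The resulting string is sent to the LLM in a single call, with no retrieval stage at all.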

Running Long Context Baselines

python baselines/locomo_long_context.py \
    --llm_name gpt-4o-mini \
    --dataset locomo_episodic

Context Length Limits: Long context baselines may exceed model context windows for very long documents. Some datasets may need to be truncated.

TISER Baseline

Temporal Information-aware Sequential Retrieval with chain-of-thought reasoning and reflection.

Method Description

TISER uses:
  1. Chain-of-Thought Reasoning: Step-by-step reasoning within <reasoning> tags
  2. Timeline Construction: Identifies relevant temporal events in <timeline> tags
  3. Reflection: Self-evaluation and error checking in <reflection> tags
  4. Final Answer: Concise answer in <answer> tags

TISER Prompt Structure

tiser_developer = """You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries. Follow these steps:
1. Reason through the problem step by step within the <reasoning> tags.
2. Given your previous reasoning, identify relevant temporal events in the given context for answering the given question within <timeline> tags. Assume relations in the context are unidirectional.
3. Reflect on your reasoning and the timeline to check for any errors or improvements within the <reflection> tags.
4. Make any necessary adjustments based on your reflection. If there is additional reasoning required, go back to Step 1 (reason through the problem step-by-step), otherwise move to the next step (Step 5).
5. Provide your final, concise answer within the <answer> tags.
"""

tiser_user = """Question: {question}
Temporal Context: {temporal_context}"""

Running TISER Baselines

python baselines/complex_tr_tiser.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2

TISER Response Format

<reasoning>
[Step-by-step reasoning process]
<timeline>
[Relevant temporal events for answering the question]
</timeline>
<reflection>
[Self-evaluation of reasoning and timeline]
</reflection>
[Any adjustments based on reflection]
</reasoning>
<answer>
[Final concise answer]
</answer>
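A response in this format can be parsed with a small tag extractor. This is an illustrative sketch, not the repository's parser: `extract_tag` is a hypothetical helper that pulls the last occurrence of a tag, which is useful when the model loops back to Step 1 and emits a tag more than once.

```python
import re

def extract_tag(response, tag):
    """Return the content of the last <tag>...</tag> span, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

# A toy response following the TISER format above.
response = """<reasoning>
Alice moved in 2021; the question asks about 2022.
<timeline>2021: Alice moves. 2022: Alice travels.</timeline>
<reflection>Timeline is consistent with the context.</reflection>
</reasoning>
<answer>Alice was traveling.</answer>"""

print(extract_tag(response, "answer"))   # Alice was traveling.
print(extract_tag(response, "timeline"))
```

Only the `<answer>` span is scored; the other tags exist to structure the model's intermediate reasoning.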

LLM-only Baseline

Direct prompting without any retrieval, testing pure LLM reasoning capability.

Running LLM-only Baselines

python baselines/tot_semantic_llm.py \
    --llm_name gpt-4o-mini

This baseline:
  • Provides no retrieval or external context
  • Tests the LLM's parametric knowledge
  • Is useful for separating knowledge memorization from retrieval needs

Comparison Guidelines

Fair Comparison Setup

For fair comparison, use identical configurations:
python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist \
    --qa_top_k 10

Key Configuration Matching

Match these parameters for fair comparison:
  • Same --llm_name for answer generation
  • Same --embedding_name for retrieval baselines
  • Same --qa_top_k for number of passages used
  • Same dataset split (use --indices for consistent sampling)
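One way to keep these parameters matched is to generate all command lines from a single shared config. This is an illustrative sketch: which flags each entry script actually accepts is an assumption, and the script paths simply follow the examples shown earlier on this page.

```python
# Shared settings reused across ReMem and every baseline run, so only the
# entry script differs between methods.
shared = {
    "--dataset": "locomo_episodic",
    "--llm_name": "gpt-4o-mini",
    "--embedding_name": "nvidia/NV-Embed-v2",
    "--qa_top_k": "10",
}

scripts = {
    "remem": "examples/locomo.py",
    "dense": "baselines/locomo_dense.py",
    "long_context": "baselines/locomo_long_context.py",
}

def build_command(method):
    args = dict(shared)
    if method == "long_context":  # no retrieval, so drop retrieval-only flags
        args.pop("--embedding_name")
        args.pop("--qa_top_k")
    flat = [item for pair in args.items() for item in pair]
    return ["python", scripts[method], *flat]

for m in scripts:
    print(" ".join(build_command(m)))
```

Generating commands this way makes a mismatched `--qa_top_k` or embedding model a code-review error rather than a silent confound.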

Expected Performance Differences

When ReMem Outperforms Baselines

Datasets: MuSiQue, 2WikiMultiHopQA
ReMem's graph structure enables:
  • Following connections between related passages
  • Aggregating information across multiple hops
  • Better recall of supporting documents

Datasets: LoCoMo Temporal, Complex TR, TimeQA
ReMem's temporal edges enable:
  • Tracking temporal relationships explicitly
  • Before/after reasoning over event order
  • Better handling of time-dependent queries

Datasets: LoCoMo, LongMemEval, RealTalk
ReMem's episodic memory:
  • Organizes passages by conversation session
  • Links related mentions across time
  • Improves recall of episodic details

When Baselines May Compete

For simple factual questions answerable from one passage, dense retrieval may be sufficient.
When full context fits in model window, long-context baseline can leverage all information directly.
With oracle retrieval (--gold_retrieval), QA performance differences shrink as the graph traversal advantage diminishes.

Output Comparison

Results File Location

# ReMem results
outputs/locomo/locomo_0_gpt-4o-mini_nvidia_NV-Embed-v2/
    └── rag_results_agent_max_step_3.json

# Dense baseline results
outputs/locomo_dense/locomo_0_gpt-4o-mini_nvidia_NV-Embed-v2/
    └── rag_results.json

Metrics to Compare

Metric                Description
qa_em                 Exact Match
qa_f1                 Token-level F1
qa_bleu1              BLEU-1 score
qa_mem0_llm_judge     LLM-as-judge evaluation
retrieval_recall      Gold document recall
retrieval_ndcg_any    NDCG ranking metric

Higher is better for every metric above.
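The two headline QA metrics can be sketched in a few lines. This is an illustrative implementation of the standard EM and token-level F1 definitions, not the repository's exact scorer; its normalization (lowercasing, stripping punctuation and articles) is a common convention that the actual evaluation code may differ from.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles, and strip punctuation before comparison."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(token_f1("tower in Paris", "eiffel tower"))
```

EM rewards only verbatim answers, while F1 gives partial credit for overlapping tokens, which is why the two can diverge noticeably in the comparison tables.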

Example Comparison Results

{
  "ReMem": {
    "qa_em": 0.7234,
    "qa_f1": 0.8456,
    "retrieval_recall": 0.8901
  },
  "Dense": {
    "qa_em": 0.6512,
    "qa_f1": 0.7823,
    "retrieval_recall": 0.7234
  },
  "Long Context": {
    "qa_em": 0.6890,
    "qa_f1": 0.8012
  }
}

Analysis Scripts

Compare Multiple Methods

Create a comparison script:
comparison.py
import json
import glob

results = {}
for method in ["remem", "dense", "long_context"]:
    pattern = f"outputs/{method}/**/rag_results*.json"
    files = glob.glob(pattern, recursive=True)
    
    metrics = {"qa_em": [], "qa_f1": []}
    for f in files:
        with open(f) as fh:
            data = json.load(fh)
        if "overall_metrics" in data:
            for k in metrics:
                if k in data["overall_metrics"]:
                    metrics[k].append(data["overall_metrics"][k])
    
    results[method] = {
        k: sum(v)/len(v) for k, v in metrics.items() if v
    }

print(json.dumps(results, indent=2))
Run:
python comparison.py

Best Practices

Run Multiple Seeds: For robust comparison, run experiments with different random seeds or data shuffles.
Consistent Infrastructure: Run all methods on the same hardware to ensure fair timing comparisons.
Report All Metrics: Include both retrieval and QA metrics to understand where improvements come from.
Cache Considerations: When comparing timing, ensure fair cache usage. Either disable caching for all methods or enable for all.
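Results from multiple seeds can be summarized with a mean and a sample standard deviation. This is a minimal sketch with hypothetical per-seed `qa_f1` scores; it is not output from an actual run.

```python
from statistics import mean, stdev

def aggregate(runs):
    """Summarize per-seed metric values as mean and sample std dev."""
    return {"mean": round(mean(runs), 4),
            "std": round(stdev(runs), 4) if len(runs) > 1 else 0.0}

# Hypothetical qa_f1 scores from three seeds of the same configuration.
seed_scores = [0.8412, 0.8456, 0.8390]
print(aggregate(seed_scores))
```

Reporting the spread alongside the mean makes it clear whether a gap between ReMem and a baseline exceeds run-to-run noise.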
