Quick Start

Basic Benchmark Execution

Run a benchmark on MuSiQue dataset:
python main.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2
First Run: The initial execution builds the index and runs information extraction, so it is slower. Subsequent runs reuse cached data.

Command-Line Arguments

Required Arguments

--dataset
string
required
Dataset name (e.g., musique, 2wikimultihopqa, locomo_episodic)

Model Configuration

--llm_name
string
default:"gpt-4o-mini"
LLM model for QA and extraction. Supported models:
  • gpt-4o-mini (recommended)
  • gpt-4o
  • gpt-4-turbo
  • meta-llama/Llama-3.3-70B-Instruct
  • Any OpenAI-compatible model
--embedding_name
string
default:"nvidia/NV-Embed-v2"
Embedding model for dense retrieval:
  • nvidia/NV-Embed-v2 (recommended, requires GPU)
  • openai/text-embedding-3-small
  • openai/text-embedding-3-large
--llm_base_url
string
default:"https://api.openai.com/v1"
Custom API endpoint for LLM (for local models or alternative providers)

Extraction Methods

--extract_method
string
default:"openie"
Information extraction strategy:
  • openie - Open IE for document datasets
  • episodic - Episode-based for conversations
  • episodic_gist - Episodic with summarization (recommended for conversations)
  • temporal - Temporal-aware extraction

Cache Control

--force_index_from_scratch
boolean
default:"false"
Short flag: -fi. Rebuild the entire index from scratch, ignoring cached embeddings.
--force_openie_from_scratch
boolean
default:"false"
Short flag: -fo. Rerun information extraction, ignoring cached results.
--force_rag
boolean
default:"false"
Short flag: -fr. Rerun QA evaluation even if results already exist.
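
These cache flags are ordinary boolean switches with short aliases. A minimal argparse sketch (hypothetical; the actual definitions live in the project's main.py) shows how the long and short forms map to the same destination:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the cache-control flags; the real
    # definitions are in the project's main.py.
    parser = argparse.ArgumentParser()
    parser.add_argument("-fi", "--force_index_from_scratch",
                        action="store_true",
                        help="Rebuild the index, ignoring cached embeddings")
    parser.add_argument("-fo", "--force_openie_from_scratch",
                        action="store_true",
                        help="Rerun information extraction")
    parser.add_argument("-fr", "--force_rag",
                        action="store_true",
                        help="Rerun QA evaluation even if results exist")
    return parser

args = build_parser().parse_args(["-fi", "--force_rag"])
```

With `action="store_true"` each flag defaults to False, matching the defaults documented above.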

Retrieval Configuration

--qa_top_k
integer
default:"5"
Number of top passages to use for answering questions (typical values: 5-10).
--linking_top_k
integer
default:"5"
Number of linked passages to traverse in the knowledge graph.
--retrieval_top_k
integer
default:"200"
Initial retrieval pool size before reranking.
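
These three knobs form a funnel: --retrieval_top_k candidates are fetched first, graph links are traversed up to --linking_top_k, and only the best --qa_top_k passages reach the QA prompt. A simplified sketch of the first and last stages, using dummy score dictionaries rather than the project's actual retriever:

```python
def retrieve_then_rerank(dense_scores, rerank_scores,
                         retrieval_top_k=200, qa_top_k=5):
    """Two-stage funnel: cheap dense retrieval over the whole corpus,
    then a more expensive reranking pass over the surviving pool.

    dense_scores / rerank_scores map passage id -> score (dummies here).
    """
    # Stage 1: keep the retrieval_top_k highest-scoring passages.
    pool = sorted(dense_scores, key=dense_scores.get,
                  reverse=True)[:retrieval_top_k]
    # Stage 2: rerank the pool and keep qa_top_k for the QA prompt.
    return sorted(pool, key=rerank_scores.get, reverse=True)[:qa_top_k]

dense = {f"p{i}": float(i) for i in range(10)}
rerank = {f"p{i}": float(-i) for i in range(10)}
top = retrieve_then_rerank(dense, rerank, retrieval_top_k=5, qa_top_k=2)
```

Lowering --retrieval_top_k trades recall for speed; lowering --qa_top_k shrinks the prompt sent to the LLM.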

Agent Configuration

--agent_max_steps
integer
default:"3"
Maximum reasoning steps for agent:
  • 1 - Retrieve only
  • 2 - Retrieve + answer
  • >2 - Full multi-step reasoning
--agent_fixed_tools
boolean
default:"false"
Use simplified agent with only semantic_retrieve + output_answer tools.
--agent_fixed_retrieval_tool
string
default:"semantic_retrieve"
Retrieval tool for fixed mode: semantic_retrieve or lexical_retrieve.
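
--agent_max_steps caps a simple control loop: the agent spends its budget on tool calls and must emit an answer before steps run out. A toy sketch of that budget logic (hypothetical; the real agent chooses tools with an LLM, and the tool names mirror the fixed-tool mode above):

```python
def run_agent(question, max_steps=3):
    """Toy illustration of the step budget. Behaves like the table above:
    max_steps=1 retrieves only, max_steps=2 retrieves then answers,
    max_steps>2 allows multi-step retrieval before answering."""
    context = []
    for step in range(1, max_steps + 1):
        if step == max_steps and max_steps > 1:
            # Final step: output_answer using the gathered context.
            return {"answer": f"answer({question})", "context": context}
        # Every earlier step: pretend semantic_retrieve call.
        context.append(f"passage_{step}")
    # max_steps == 1: retrieval only, no answer produced.
    return {"answer": None, "context": context}

result = run_agent("who founded X?", max_steps=3)
```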

Advanced Options

--llm_infer_mode
string
default:"online"
LLM inference mode:
  • online - API-based inference
  • offline - VLLM offline batch mode (for local models)
--use_azure
boolean
default:"false"
Use Azure OpenAI Service (requires Azure credentials)
--parallel
boolean
default:"false"
Enable parallel sample processing (for supported benchmarks)
--num_workers
integer
default:"5"
Number of parallel workers when --parallel is enabled
--verbose
boolean
default:"false"
Enable detailed logging output
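
--parallel and --num_workers amount to fanning samples out over a small worker pool instead of processing them one by one. A stdlib sketch of that dispatch (not the project's actual runner):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_samples(samples, process, parallel=False, num_workers=5):
    """Process each sample sequentially, or with a worker pool when
    parallel=True, mirroring the --parallel / --num_workers switches.
    ThreadPoolExecutor.map preserves input order in its results."""
    if not parallel:
        return [process(s) for s in samples]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(process, samples))

results = evaluate_samples(range(5), lambda s: s * s, parallel=True)
```

For API-bound workloads, threads are usually sufficient since the workers spend most of their time waiting on network I/O.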

Dataset-Specific Examples

MuSiQue & 2WikiMultiHopQA

python main.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method openie

LoCoMo (Conversations)

python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist \
    --qa_top_k 10

LongMemEval

python examples/longmemeval.py \
    --dataset longmemeval_s.json \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist

Complex TR (Temporal Reasoning)

python examples/complex_tr.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist \
    --agent_max_steps 5

Semantic QA

python examples/semantic_qa.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2

TimeQA

python examples/timeqa.py \
    --llm_name gpt-4o-mini \
    --dataset_file reproduce/dataset/timeqa/dev.easy.json \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method temporal \
    --qa_top_k 5 \
    --max_samples 100  # Optional: limit samples

Understanding Output

Console Output

During execution, you’ll see:
Processing sample 0
# chunk contents: 142
Current metrics after 1 samples:
  retrieval_recall: 0.8750
  qa_em: 1.0000
  qa_f1: 1.0000
  qa_bleu1: 1.0000
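
qa_em and qa_f1 are the standard exact-match and token-overlap F1 metrics from extractive QA. A self-contained sketch of how such metrics are typically computed (normalization details may differ from the project's implementation):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, drop articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    # Token-overlap F1: harmonic mean of precision and recall over tokens.
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```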

Output Files

Results are saved to outputs/{dataset}/:
outputs/musique/musique_gpt-4o-mini_nvidia_NV-Embed-v2/
├── rag_results_agent_max_step_3.json    # QA results
├── overall_results_*.json               # Aggregated metrics
├── retrieval_results.json               # Retrieved passages
├── vdb_*.pkl                            # Cached embeddings
└── graph.pkl                            # Knowledge graph

Results Format

{
  "num_samples": 100,
  "average_metrics": {
    "qa_em": 0.7234,
    "qa_f1": 0.8123,
    "retrieval_recall": 0.8567
  },
  "configuration": {
    "llm_name": "gpt-4o-mini",
    "embedding_name": "nvidia/NV-Embed-v2",
    "dataset": "musique"
  },
  "samples": [...]
}
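
A small helper for reading one of these result files and printing the headline numbers. The field names below match the format shown above; the file path is whatever overall_results file your run produced:

```python
import json
from pathlib import Path

def summarize(path):
    """Load a results JSON (format shown above) and return a one-line
    summary of the averaged metrics."""
    data = json.loads(Path(path).read_text())
    metrics = data["average_metrics"]
    parts = [f"{name}={value:.4f}" for name, value in sorted(metrics.items())]
    return f"{data['num_samples']} samples | " + " ".join(parts)
```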

Analysis Tools

Aggregate Evaluation

After running benchmarks, aggregate results:
python examples/locomo_overall_eval.py \
    --working_dir outputs/locomo/

Graph Visualization

Visualize the knowledge graph structure:
python examples/igraph_graph_visualization.py \
    --graph_path outputs/musique/musique_gpt-4o-mini_nvidia_NV-Embed-v2/graph.pkl

Graph Information

Get statistics about the constructed graph:
python examples/igraph_graph_info.py \
    --graph_path outputs/musique/musique_gpt-4o-mini_nvidia_NV-Embed-v2/graph.pkl

Performance Tips

Use Caching: Don’t use -fi or -fo flags unless necessary. Reusing cached data significantly speeds up experiments.
Parallel Processing: For datasets supporting parallel processing (Complex TR, Semantic QA), use --parallel --num_workers 5 to speed up evaluation.
GPU Acceleration: Use NVIDIA NV-Embed-v2 with GPU for faster embedding computation:
CUDA_VISIBLE_DEVICES=0 python examples/locomo.py ...
API Costs: Running benchmarks with OpenAI models can incur costs. Use caching and start with small subsets to estimate costs.
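
Before a full run, a back-of-envelope cost estimate helps decide how many samples to start with. The per-million-token prices below are illustrative placeholders, not official pricing; check your provider's current rates:

```python
def estimate_cost(num_samples, tokens_in_per_sample, tokens_out_per_sample,
                  price_in_per_m=0.15, price_out_per_m=0.60):
    """Rough API cost estimate in dollars. Default prices are illustrative
    placeholders, not guaranteed current model pricing."""
    tokens_in = num_samples * tokens_in_per_sample
    tokens_out = num_samples * tokens_out_per_sample
    return (tokens_in * price_in_per_m
            + tokens_out * price_out_per_m) / 1_000_000

# e.g. 100 samples, ~8k prompt tokens and ~300 completion tokens each
cost = estimate_cost(100, 8_000, 300)
```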

Debugging

Enable Verbose Logging

python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --verbose

Test on Small Subset

# LoCoMo: Use locomo10 (10 samples)
python examples/locomo.py --dataset locomo10 --llm_name gpt-4o-mini

# Complex TR: Use start/end indices
python examples/complex_tr.py --start 0 --end 5 --llm_name gpt-4o-mini

# TimeQA: Use max_samples
python examples/timeqa.py --max_samples 5 --llm_name gpt-4o-mini

Environment Variables

# OpenAI API Key
export OPENAI_API_KEY="sk-..."

# Azure OpenAI (if using --use_azure)
export AZURE_OPENAI_ENDPOINT="https://..."
export AZURE_OPENAI_API_KEY="..."

# GPU Configuration
export CUDA_VISIBLE_DEVICES=0
export CUDA_DEVICE_ORDER="PCI_BUS_ID"

# Reduce tokenizer warnings
export TOKENIZERS_PARALLELISM=false
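
A quick preflight check fails fast when a required credential is missing, rather than partway through indexing. This is a hypothetical helper, not part of the project:

```python
import os

def preflight(use_azure=False):
    """Return the list of missing environment variables for the chosen
    provider. An empty list means the run can proceed."""
    required = (["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"]
                if use_azure else ["OPENAI_API_KEY"])
    return [name for name in required if not os.environ.get(name)]
```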

Troubleshooting

Out-of-Memory Errors

  • Reduce --batch_size for the embedding model
  • Reduce --qa_top_k to limit context size
  • Use a smaller embedding model
  • Process the dataset in batches using --start and --end

API Rate Limits

  • Reduce --num_workers for parallel processing
  • Add delays between requests (modify source code)
  • Use offline mode with local models

Slow First Run

This is expected: the first run builds the index and extracts information.
  • Indexing a large corpus can take 30-60 minutes
  • OpenIE extraction adds additional time
  • Subsequent runs reuse cached data and are much faster

Dataset Not Found

Ensure the dataset files exist in reproduce/dataset/:
ls reproduce/dataset/musique/
# Should show: musique.json, musique_corpus.json
