Quick Start

Basic Benchmark Execution

Run a benchmark on MuSiQue dataset:
python main.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2
First Run: The initial execution builds the index and runs information extraction, so it is slower. Subsequent runs reuse cached data.

Command-Line Arguments

Required Arguments

--dataset
string
required
Dataset name (e.g., musique, 2wikimultihopqa, locomo_episodic)

Model Configuration

--llm_name
string
default:"gpt-4o-mini"
LLM model for QA and extraction. Supported models:
  • gpt-4o-mini (recommended)
  • gpt-4o
  • gpt-4-turbo
  • meta-llama/Llama-3.3-70B-Instruct
  • Any OpenAI-compatible model
--embedding_name
string
default:"nvidia/NV-Embed-v2"
Embedding model for dense retrieval:
  • nvidia/NV-Embed-v2 (recommended, requires GPU)
  • openai/text-embedding-3-small
  • openai/text-embedding-3-large
--llm_base_url
string
default:"https://api.openai.com/v1"
Custom API endpoint for LLM (for local models or alternative providers)

Extraction Methods

--extract_method
string
default:"openie"
Information extraction strategy:
  • openie - Open IE for document datasets
  • episodic - Episode-based for conversations
  • episodic_gist - Episodic with summarization (recommended for conversations)
  • temporal - Temporal-aware extraction

Cache Control

--force_index_from_scratch
boolean
default:"false"
Short flag: -fi. Rebuild the entire index from scratch, ignoring cached embeddings.
--force_openie_from_scratch
boolean
default:"false"
Short flag: -fo. Rerun information extraction, ignoring cached results.
--force_rag
boolean
default:"false"
Short flag: -fr. Rerun QA evaluation even if results already exist.
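
These cache flags are ordinary boolean switches with short aliases. A minimal argparse sketch (hypothetical; the actual definitions live in the project's main.py) shows how the long and short forms map to the same destination:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the cache-control flags; the real
    # definitions are in the project's main.py.
    parser = argparse.ArgumentParser()
    parser.add_argument("-fi", "--force_index_from_scratch",
                        action="store_true",
                        help="Rebuild the index, ignoring cached embeddings")
    parser.add_argument("-fo", "--force_openie_from_scratch",
                        action="store_true",
                        help="Rerun information extraction")
    parser.add_argument("-fr", "--force_rag",
                        action="store_true",
                        help="Rerun QA evaluation even if results exist")
    return parser

args = build_parser().parse_args(["-fi", "--force_rag"])
```

With `action="store_true"` each flag defaults to False, matching the defaults documented above.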

Retrieval Configuration

--qa_top_k
integer
default:"5"
Number of top passages to use for answering questions (typical values: 5-10).
--linking_top_k
integer
default:"5"
Number of linked passages to traverse in the knowledge graph.
--retrieval_top_k
integer
default:"200"
Initial retrieval pool size before reranking.
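
These three knobs form a funnel: --retrieval_top_k candidates are fetched first, graph links are traversed up to --linking_top_k, and only the best --qa_top_k passages reach the QA prompt. A simplified sketch of the first and last stages, using dummy score dictionaries rather than the project's actual retriever:

```python
def retrieve_then_rerank(dense_scores, rerank_scores,
                         retrieval_top_k=200, qa_top_k=5):
    """Two-stage funnel: cheap dense retrieval over the whole corpus,
    then a more expensive reranking pass over the surviving pool.

    dense_scores / rerank_scores map passage id -> score (dummies here).
    """
    # Stage 1: keep the retrieval_top_k highest-scoring passages.
    pool = sorted(dense_scores, key=dense_scores.get,
                  reverse=True)[:retrieval_top_k]
    # Stage 2: rerank the pool and keep qa_top_k for the QA prompt.
    return sorted(pool, key=rerank_scores.get, reverse=True)[:qa_top_k]

dense = {f"p{i}": float(i) for i in range(10)}
rerank = {f"p{i}": float(-i) for i in range(10)}
top = retrieve_then_rerank(dense, rerank, retrieval_top_k=5, qa_top_k=2)
```

Lowering --retrieval_top_k trades recall for speed; lowering --qa_top_k shrinks the prompt sent to the LLM.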

Agent Configuration

--agent_max_steps
integer
default:"3"
Maximum reasoning steps for agent:
  • 1 - Retrieve only
  • 2 - Retrieve + answer
  • >2 - Full multi-step reasoning
--agent_fixed_tools
boolean
default:"false"
Use simplified agent with only semantic_retrieve + output_answer tools.
--agent_fixed_retrieval_tool
string
default:"semantic_retrieve"
Retrieval tool for fixed mode: semantic_retrieve or lexical_retrieve.
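
--agent_max_steps caps a simple control loop: the agent spends its budget on tool calls and must emit an answer before steps run out. A toy sketch of that budget logic (hypothetical; the real agent chooses tools with an LLM, and the tool names mirror the fixed-tool mode above):

```python
def run_agent(question, max_steps=3):
    """Toy illustration of the step budget. Behaves like the table above:
    max_steps=1 retrieves only, max_steps=2 retrieves then answers,
    max_steps>2 allows multi-step retrieval before answering."""
    context = []
    for step in range(1, max_steps + 1):
        if step == max_steps and max_steps > 1:
            # Final step: output_answer using the gathered context.
            return {"answer": f"answer({question})", "context": context}
        # Every earlier step: pretend semantic_retrieve call.
        context.append(f"passage_{step}")
    # max_steps == 1: retrieval only, no answer produced.
    return {"answer": None, "context": context}

result = run_agent("who founded X?", max_steps=3)
```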

Advanced Options

--llm_infer_mode
string
default:"online"
LLM inference mode:
  • online - API-based inference
  • offline - VLLM offline batch mode (for local models)
--use_azure
boolean
default:"false"
Use Azure OpenAI Service (requires Azure credentials)
--parallel
boolean
default:"false"
Enable parallel sample processing (for supported benchmarks)
--num_workers
integer
default:"5"
Number of parallel workers when --parallel is enabled
--verbose
boolean
default:"false"
Enable detailed logging output
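
--parallel and --num_workers amount to fanning samples out over a small worker pool instead of processing them one by one. A stdlib sketch of that dispatch (not the project's actual runner):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_samples(samples, process, parallel=False, num_workers=5):
    """Process each sample sequentially, or with a worker pool when
    parallel=True, mirroring the --parallel / --num_workers switches.
    ThreadPoolExecutor.map preserves input order in its results."""
    if not parallel:
        return [process(s) for s in samples]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(process, samples))

results = evaluate_samples(range(5), lambda s: s * s, parallel=True)
```

For API-bound workloads, threads are usually sufficient since the workers spend most of their time waiting on network I/O.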

Dataset-Specific Examples

MuSiQue & 2WikiMultiHopQA

python main.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method openie

LoCoMo (Conversations)

python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist \
    --qa_top_k 10

LongMemEval

python examples/longmemeval.py \
    --dataset longmemeval_s.json \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist

Complex TR (Temporal Reasoning)

python examples/complex_tr.py \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method episodic_gist \
    --agent_max_steps 5

Semantic QA

python examples/semantic_qa.py \
    --dataset musique \
    --llm_name gpt-4o-mini \
    --embedding_name nvidia/NV-Embed-v2

TimeQA

python examples/timeqa.py \
    --llm_name gpt-4o-mini \
    --dataset_file reproduce/dataset/timeqa/dev.easy.json \
    --embedding_name nvidia/NV-Embed-v2 \
    --extract_method temporal \
    --qa_top_k 5 \
    --max_samples 100  # Optional: limit samples

Understanding Output

Console Output

During execution, you’ll see:
Processing sample 0
# chunk contents: 142
Current metrics after 1 samples:
  retrieval_recall: 0.8750
  qa_em: 1.0000
  qa_f1: 1.0000
  qa_bleu1: 1.0000
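
qa_em and qa_f1 are the standard exact-match and token-overlap F1 metrics from extractive QA. A self-contained sketch of how such metrics are typically computed (normalization details may differ from the project's implementation):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, drop articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    # Token-overlap F1: harmonic mean of precision and recall over tokens.
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```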

Output Files

Results are saved to outputs/{dataset}/:
outputs/musique/musique_gpt-4o-mini_nvidia_NV-Embed-v2/
├── rag_results_agent_max_step_3.json    # QA results
├── overall_results_*.json               # Aggregated metrics
├── retrieval_results.json               # Retrieved passages
├── vdb_*.pkl                            # Cached embeddings
└── graph.pkl                            # Knowledge graph

Results Format

{
  "num_samples": 100,
  "average_metrics": {
    "qa_em": 0.7234,
    "qa_f1": 0.8123,
    "retrieval_recall": 0.8567
  },
  "configuration": {
    "llm_name": "gpt-4o-mini",
    "embedding_name": "nvidia/NV-Embed-v2",
    "dataset": "musique"
  },
  "samples": [...]
}
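
A small helper for reading one of these result files and printing the headline numbers. The field names below match the format shown above; the file path is whatever overall_results file your run produced:

```python
import json
from pathlib import Path

def summarize(path):
    """Load a results JSON (format shown above) and return a one-line
    summary of the averaged metrics."""
    data = json.loads(Path(path).read_text())
    metrics = data["average_metrics"]
    parts = [f"{name}={value:.4f}" for name, value in sorted(metrics.items())]
    return f"{data['num_samples']} samples | " + " ".join(parts)
```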

Analysis Tools

Aggregate Evaluation

After running benchmarks, aggregate results:
python examples/locomo_overall_eval.py \
    --working_dir outputs/locomo/

Graph Visualization

Visualize the knowledge graph structure:
python examples/igraph_graph_visualization.py \
    --graph_path outputs/musique/musique_gpt-4o-mini_nvidia_NV-Embed-v2/graph.pkl

Graph Information

Get statistics about the constructed graph:
python examples/igraph_graph_info.py \
    --graph_path outputs/musique/musique_gpt-4o-mini_nvidia_NV-Embed-v2/graph.pkl

Performance Tips

Use Caching: Don’t use -fi or -fo flags unless necessary. Reusing cached data significantly speeds up experiments.
Parallel Processing: For datasets supporting parallel processing (Complex TR, Semantic QA), use --parallel --num_workers 5 to speed up evaluation.
GPU Acceleration: Use NVIDIA NV-Embed-v2 with GPU for faster embedding computation:
CUDA_VISIBLE_DEVICES=0 python examples/locomo.py ...
API Costs: Running benchmarks with OpenAI models can incur costs. Use caching and start with small subsets to estimate costs.
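
Before a full run, a back-of-envelope cost estimate helps decide how many samples to start with. The per-million-token prices below are illustrative placeholders, not official pricing; check your provider's current rates:

```python
def estimate_cost(num_samples, tokens_in_per_sample, tokens_out_per_sample,
                  price_in_per_m=0.15, price_out_per_m=0.60):
    """Rough API cost estimate in dollars. Default prices are illustrative
    placeholders, not guaranteed current model pricing."""
    tokens_in = num_samples * tokens_in_per_sample
    tokens_out = num_samples * tokens_out_per_sample
    return (tokens_in * price_in_per_m
            + tokens_out * price_out_per_m) / 1_000_000

# e.g. 100 samples, ~8k prompt tokens and ~300 completion tokens each
cost = estimate_cost(100, 8_000, 300)
```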

Debugging

Enable Verbose Logging

python examples/locomo.py \
    --dataset locomo_episodic \
    --llm_name gpt-4o-mini \
    --verbose

Test on Small Subset

# LoCoMo: Use locomo10 (10 samples)
python examples/locomo.py --dataset locomo10 --llm_name gpt-4o-mini

# Complex TR: Use start/end indices
python examples/complex_tr.py --start 0 --end 5 --llm_name gpt-4o-mini

# TimeQA: Use max_samples
python examples/timeqa.py --max_samples 5 --llm_name gpt-4o-mini

Environment Variables

# OpenAI API Key
export OPENAI_API_KEY="sk-..."

# Azure OpenAI (if using --use_azure)
export AZURE_OPENAI_ENDPOINT="https://..."
export AZURE_OPENAI_API_KEY="..."

# GPU Configuration
export CUDA_VISIBLE_DEVICES=0
export CUDA_DEVICE_ORDER="PCI_BUS_ID"

# Reduce tokenizer warnings
export TOKENIZERS_PARALLELISM=false
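
A quick preflight check fails fast when a required credential is missing, rather than partway through indexing. This is a hypothetical helper, not part of the project:

```python
import os

def preflight(use_azure=False):
    """Return the list of missing environment variables for the chosen
    provider. An empty list means the run can proceed."""
    required = (["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"]
                if use_azure else ["OPENAI_API_KEY"])
    return [name for name in required if not os.environ.get(name)]
```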

Troubleshooting

Out-of-Memory Errors

  • Reduce --batch_size for the embedding model
  • Reduce --qa_top_k to limit context size
  • Use a smaller embedding model
  • Process the dataset in batches using --start and --end

API Rate Limits

  • Reduce --num_workers for parallel processing
  • Add delays between requests (modify source code)
  • Use offline mode with local models

Slow First Run

This is expected: the first run builds the index and extracts information.
  • Indexing a large corpus can take 30-60 minutes
  • OpenIE extraction adds additional time
  • Subsequent runs reuse cached data and are much faster

Dataset Not Found

Ensure the dataset files exist in reproduce/dataset/:
ls reproduce/dataset/musique/
# Should show: musique.json, musique_corpus.json
