## Quick Start

### Basic Benchmark Execution

Run a benchmark on the MuSiQue dataset:

**First Run:** The initial execution builds the index and extracts information (slower). Subsequent runs reuse cached data.
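The command itself did not survive in this copy; a minimal sketch, assuming the entry point is `main.py` and that the long flag names (`--dataset`, `--llm`, `--embedder`) match the options described below — all of these names are assumptions, not confirmed by the source:

```bash
# Hypothetical invocation - the script name and long flag names are assumptions.
python main.py \
  --dataset musique \
  --llm gpt-4o-mini \
  --embedder nvidia/NV-Embed-v2
```

To force a rebuild, the documented short flags can be appended: `-fi` to rebuild the index, `-fo` to rerun extraction, `-fr` to rerun QA evaluation.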
## Command-Line Arguments

### Required Arguments
- **Dataset name** (e.g., `musique`, `2wikimultihopqa`, `locomo_episodic`)

### Model Configuration
**LLM model** for QA and extraction. Supported models:

- `gpt-4o-mini` (recommended)
- `gpt-4o`
- `gpt-4-turbo`
- `meta-llama/Llama-3.3-70B-Instruct`
- Any OpenAI-compatible model
**Embedding model** for dense retrieval:

- `nvidia/NV-Embed-v2` (recommended; requires a GPU)
- `openai/text-embedding-3-small`
- `openai/text-embedding-3-large`
**Custom API endpoint** for the LLM (for local models or alternative providers).
### Extraction Methods

**Information extraction strategy:**

- `openie` - Open IE for document datasets
- `episodic` - Episode-based extraction for conversations
- `episodic_gist` - Episodic extraction with summarization (recommended for conversations)
- `temporal` - Temporal-aware extraction
### Cache Control

- Short flag `-fi`: rebuild the entire index from scratch, ignoring cached embeddings.
- Short flag `-fo`: rerun information extraction, ignoring cached results.
- Short flag `-fr`: rerun QA evaluation even if results already exist.

### Retrieval Configuration
- Number of top passages used to answer each question.
- Number of linked passages to traverse in the knowledge graph.
- Initial retrieval pool size before reranking.
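These options trade recall for context length. A hedged sketch of tuning them (the entry point and `--dataset` flag are assumptions; `--qa_top_k` is the flag referenced later under Troubleshooting):

```bash
# Keep only the top 5 passages per question to shorten the QA context.
# Script name and --dataset are illustrative assumptions.
python main.py --dataset musique --qa_top_k 5
```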
### Agent Configuration

**Maximum reasoning steps** for the agent:

- `1` - Retrieve only
- `2` - Retrieve + answer
- `>2` - Full multi-step reasoning
Use the simplified agent with only the `semantic_retrieve` and `output_answer` tools.

**Retrieval tool for fixed mode:** `semantic_retrieve` or `lexical_retrieve`.

### Advanced Options
**LLM inference mode:**

- `online` - API-based inference
- `offline` - vLLM offline batch mode (for local models)
- Use Azure OpenAI Service (requires Azure credentials)
- Enable parallel sample processing (for supported benchmarks)
- Number of parallel workers when `--parallel` is enabled
- Enable detailed logging output
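For large benchmarks, parallel processing can be combined with a worker count. A sketch (the script name and `--dataset` flag are assumptions; `--parallel` and `--num_workers` are the flags described in this document):

```bash
# Process samples in parallel with 4 workers - only for supported benchmarks.
python main.py --dataset musique --parallel --num_workers 4
```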
## Dataset-Specific Examples

### MuSiQue & 2WikiMultiHopQA
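The example commands were stripped from this copy; a sketch for the document datasets, assuming a `main.py` entry point and illustrative long-flag names, using the `openie` strategy recommended above for document datasets:

```bash
# MuSiQue - script name and flag names are assumptions.
python main.py --dataset musique --extraction openie --llm gpt-4o-mini

# 2WikiMultiHopQA
python main.py --dataset 2wikimultihopqa --extraction openie --llm gpt-4o-mini
```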
### LoCoMo (Conversations)
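For conversational data, this document recommends the `episodic_gist` strategy. A hedged sketch (script name and flag names are assumptions):

```bash
# LoCoMo episodic benchmark with gist summarization - hypothetical flag names.
python main.py --dataset locomo_episodic --extraction episodic_gist --llm gpt-4o-mini
```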
### LongMemEval

### Complex TR (Temporal Reasoning)

### Semantic QA

### TimeQA
## Understanding Output

### Console Output

During execution, you'll see:

### Output Files

Results are saved to `outputs/{dataset}/`:

### Results Format
## Analysis Tools

### Aggregate Evaluation

After running benchmarks, aggregate the results:

### Graph Visualization

Visualize the knowledge graph structure:

### Graph Information

Get statistics about the constructed graph:

## Performance Tips
## Debugging

### Enable Verbose Logging
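The exact name of the logging flag was not preserved in this copy; a sketch assuming a conventional `--verbose` long flag (both the script name and the flag are assumptions):

```bash
# Enable detailed logging output - flag name is an assumption.
python main.py --dataset musique --verbose
```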
### Test on Small Subset
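A subset run can reuse the `--start` and `--end` batching flags mentioned under Troubleshooting (the entry point and `--dataset` flag are assumptions):

```bash
# Evaluate only the first 10 samples.
python main.py --dataset musique --start 0 --end 10
```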
### Environment Variables
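The variable list was stripped from this copy. For OpenAI-compatible and Azure backends, the usual credentials would be set as follows (names taken from the standard OpenAI and Azure OpenAI SDK conventions, not from this document):

```bash
export OPENAI_API_KEY="sk-..."

# Only needed when using the Azure OpenAI option:
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com/"
```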
## Troubleshooting

### Out of Memory Errors
- Reduce `--batch_size` for the embedding model
- Reduce `--qa_top_k` to limit context size
- Use a smaller embedding model
- Process the dataset in batches using `--start` and `--end`
### API Rate Limits
- Reduce `--num_workers` for parallel processing
- Add delays between requests (requires modifying the source code)
- Use offline mode with local models
### Slow First Run
This is expected: the first run builds the index and extracts information.

- Indexing a large corpus can take 30-60 minutes
- OpenIE extraction adds additional time
- Subsequent runs reuse cached data and are much faster
### Missing Dataset Files
Ensure the dataset files exist in `reproduce/dataset/`: