## Overview
Arcana’s evaluation system includes:

- **Test Cases** - questions paired with their known relevant chunks
- **Evaluation Runs** - execute searches and measure performance
- **Metrics** - standard IR metrics (MRR, Precision, Recall, Hit Rate) + Faithfulness
## Creating Test Cases
### Manual Test Cases
Create test cases manually when you already know which chunks should be retrieved for a given question.

### Synthetic Test Cases

Generate test cases automatically using an LLM, either programmatically or via a Mix task.
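The original code samples for these two flows were not captured here; as a rough sketch (module, function, and option names such as `Arcana.Evals.create_test_case/1` are assumptions, not Arcana's documented API):

```elixir
# Hypothetical sketch - names are illustrative, not Arcana's documented API.

# Manual: pair a question with the chunks you know should be retrieved.
Arcana.Evals.create_test_case(%{
  question: "How do I rotate an API key?",
  relevant_chunk_ids: [chunk_a.id, chunk_b.id]
})

# Synthetic: let an LLM derive question/chunk pairs from indexed documents.
Arcana.Evals.generate_test_cases(count: 50)
```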
### Filtering by Collection
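The example for this section did not survive extraction; as a sketch, assuming generation accepts a `collection:` option (an assumption, not confirmed API):

```elixir
# Hypothetical sketch - the :collection option name is an assumption.
# Restrict synthetic generation to a single document collection.
Arcana.Evals.generate_test_cases(collection: "product-docs", count: 20)
```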
## Running Evaluations

### Basic Evaluation
Run an evaluation against all test cases, either programmatically or via a Mix task.
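The call itself was not captured; a minimal sketch, assuming a hypothetical `Arcana.Evals.run/1` that returns the run with its metrics:

```elixir
# Hypothetical sketch - names are illustrative, not Arcana's documented API.
{:ok, run} = Arcana.Evals.run(search_mode: :hybrid, k: 5)

# Inspect the aggregated retrieval metrics for the run.
IO.inspect(run.metrics) # MRR, Recall@K, Precision@K, Hit Rate@K
```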
### Evaluating Answer Quality

For end-to-end RAG evaluation, also score the quality of the generated answers. When `evaluate_answers: true` is set, each run additionally reports Faithfulness, which measures whether the generated answer is grounded in the retrieved chunks (0 = hallucinated, 10 = fully faithful).
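The `evaluate_answers: true` option is the one named above; the surrounding call is a sketch with assumed names:

```elixir
# Sketch: evaluate_answers comes from the text above; the rest is illustrative.
{:ok, run} = Arcana.Evals.run(search_mode: :hybrid, evaluate_answers: true)

# Faithfulness is reported on a 0 (hallucinated) to 10 (fully faithful) scale.
IO.inspect(run.metrics.faithfulness)
```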
## Understanding Results
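The results listing for this section was not captured; purely as an illustration of what a healthy run looks like against the "good value" thresholds in the tables below, a run's metrics might read (field names are assumptions, not Arcana's schema):

```elixir
# Illustrative values only - field names are assumptions, not Arcana's schema.
%{
  mrr: 0.82,          # first relevant chunk is usually ranked 1st or 2nd
  recall_at_k: 0.88,  # most relevant chunks appear in the top K
  precision_at_k: 0.64,
  hit_rate_at_k: 0.95
}
```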
### Comparing Configurations
Run evaluations with different settings to find the best configuration.

## Managing Test Cases and Runs
### Dashboard
The Arcana Dashboard provides a visual interface for evaluation:

- **Test Cases tab** - view, generate, and delete test cases
- **Run Evaluation tab** - execute evaluations with different search modes
- **History tab** - view past runs with their metrics
## Metrics Explained

### Retrieval Metrics
| Metric | Description | Good Value |
|---|---|---|
| MRR (Mean Reciprocal Rank) | Average of 1/rank for first relevant result | > 0.7 |
| Recall@K | Fraction of relevant chunks found in top K | > 0.8 |
| Precision@K | Fraction of top K results that are relevant | > 0.6 |
| Hit Rate@K | Fraction of queries with at least one relevant result in top K | > 0.9 |
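These four retrieval metrics are standard IR definitions. A minimal self-contained sketch of how they are computed for one query, independent of Arcana's own implementation (MRR is this reciprocal rank averaged over all test cases):

```elixir
defmodule RetrievalMetrics do
  @moduledoc "Standard IR metrics over retrieved vs. known-relevant chunk IDs."

  # Reciprocal rank: 1/position of the first relevant result (0.0 if none).
  def reciprocal_rank(retrieved, relevant) do
    case Enum.find_index(retrieved, &(&1 in relevant)) do
      nil -> 0.0
      idx -> 1.0 / (idx + 1)
    end
  end

  # Fraction of relevant chunks found in the top K results.
  def recall_at_k(retrieved, relevant, k) do
    Enum.count(Enum.take(retrieved, k), &(&1 in relevant)) / length(relevant)
  end

  # Fraction of the top K results that are relevant.
  def precision_at_k(retrieved, relevant, k) do
    Enum.count(Enum.take(retrieved, k), &(&1 in relevant)) / k
  end

  # 1.0 if at least one of the top K results is relevant, else 0.0.
  def hit_rate_at_k(retrieved, relevant, k) do
    if Enum.any?(Enum.take(retrieved, k), &(&1 in relevant)), do: 1.0, else: 0.0
  end
end

# One query: the only relevant chunk "b" is ranked 2nd of the top 3.
RetrievalMetrics.reciprocal_rank(["a", "b", "c"], ["b"])     # => 0.5
RetrievalMetrics.recall_at_k(["a", "b", "c"], ["b", "z"], 3) # => 0.5
```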
### Answer Quality Metrics
| Metric | Description | Good Value |
|---|---|---|
| Faithfulness | How well the answer is grounded in retrieved context (0-10) | > 7.0 |
### Which Metric to Focus On?
#### MRR (Mean Reciprocal Rank)

Best for single-answer scenarios where you need the relevant chunk first. Example: documentation search where the first result should answer the question.
#### Recall@K

Important when you need to find all relevant information. Example: research applications where missing relevant chunks is costly.
#### Precision@K

Matters when you want to minimize irrelevant context. Example: reducing LLM token costs by only including relevant chunks.
#### Hit Rate@K

Good baseline to ensure retrieval is working at all. Example: verifying that your system finds at least one relevant result.
#### Faithfulness

Essential for preventing hallucinations in generated answers. Example: customer support where accuracy is critical.
## Best Practices

- **Diverse test cases** - cover different topics and question types for reliable evaluation
- **Sufficient sample size** - aim for 50+ test cases for statistically reliable metrics
- **Regular evaluation** - re-run after changing embeddings, chunking, or search settings
- **Track over time** - compare runs to ensure changes improve quality
- **Use collection filtering** - evaluate specific document collections separately
- **Test all search modes** - compare semantic, fulltext, and hybrid to find what works best
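The search-mode comparison suggested above can be sketched as a simple loop (hypothetical API names, as elsewhere on this page):

```elixir
# Hypothetical sketch - run one evaluation per search mode and compare MRR.
for mode <- [:semantic, :fulltext, :hybrid] do
  {:ok, run} = Arcana.Evals.run(search_mode: mode)
  IO.puts("#{mode}: MRR=#{run.metrics.mrr}")
end
```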
## Next Steps

- **Search Algorithms** - understand how different search modes affect metrics
- **Re-ranking** - improve precision with second-stage scoring
- **Dashboard** - view evaluation results in the web UI
- **Telemetry** - monitor evaluation performance