Arcana.Evaluation
The Arcana.Evaluation module provides tools for measuring and improving retrieval quality using standard information retrieval metrics.
Overview
Evaluation helps you:
- Generate synthetic test cases from your document chunks
- Measure retrieval performance (MRR, Recall, Precision, NDCG)
- Compare different search modes and configurations
- Track quality improvements over time
Main Functions
generate_test_cases/1
Generates synthetic test cases from existing chunks using an LLM.

Options:
- Your Ecto repo module
- LLM for generating questions; can be a model string, a function, or a module implementing Arcana.LLM
- Number of chunks to sample for test case generation
- Limit to chunks from a specific source
- Limit to chunks from a specific collection
- Custom prompt template function: fn chunk_text -> prompt end

Returns {:ok, [%TestCase{}]} or {:error, reason}.
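A call might look like the following sketch. The option keys (:repo, :llm, :count, :source, :prompt) are assumed names inferred from the option descriptions above, not confirmed API; check the library's docs for the exact keywords.

```elixir
# Illustrative sketch: the option keys (:repo, :llm, :count, :source,
# :prompt) are assumed names matching the descriptions above.
{:ok, test_cases} =
  Arcana.Evaluation.generate_test_cases(
    repo: MyApp.Repo,
    llm: "gpt-4o-mini",
    count: 50,
    source: "docs",
    prompt: fn chunk_text ->
      "Write one question that this passage answers:\n\n#{chunk_text}"
    end
  )
```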
run/1
Runs evaluation against existing test cases and returns metrics.

Options:
- Your Ecto repo module
- Search mode: :semantic, :fulltext, or :hybrid
- Limit evaluation to a specific source
- Limit evaluation to a specific collection
- Also evaluate answer quality (requires an LLM)
- LLM for answer evaluation (required when evaluate_answers: true)
- Number of results to evaluate (for recall@k, precision@k, and NDCG@k)

Returns {:ok, %Run{}} or {:error, reason}.
The returned Run struct contains:
- metrics: Map of metric name to score
- test_case_count: Number of test cases evaluated
- mode: Search mode used
- inserted_at: Timestamp
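A minimal invocation sketch; :repo, :mode, and :k are assumed option keys (only evaluate_answers appears verbatim in this module's docs):

```elixir
# Illustrative sketch: :repo, :mode, and :k are assumed option keys.
{:ok, run} =
  Arcana.Evaluation.run(
    repo: MyApp.Repo,
    mode: :hybrid,
    k: 5
  )

run.metrics          # map of metric name to score
run.test_case_count  # number of test cases evaluated
run.mode             # search mode used, here :hybrid
```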
Test Case Management
list_test_cases/1
Lists all test cases.

Options:
- Your Ecto repo module
- Filter by source ID
- Filter by collection
get_test_case/2
Retrieves a specific test case by ID.
create_test_case/1
Manually creates a test case.

Options:
- Your Ecto repo module
- The test question
- UUID of the chunk that should be retrieved
- Source identifier
- Collection name
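For example, under the assumption that the option keys are :repo, :question, :chunk_id, :source, and :collection (names inferred from the descriptions above):

```elixir
# Illustrative sketch: all option keys here are assumed names.
{:ok, test_case} =
  Arcana.Evaluation.create_test_case(
    repo: MyApp.Repo,
    question: "How do I configure hybrid search?",
    # UUID of the chunk that should be retrieved (placeholder value)
    chunk_id: "00000000-0000-0000-0000-000000000000",
    source: "guides/search.md",
    collection: "docs"
  )
```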
delete_test_case/2
Deletes a test case.
count_test_cases/1
Returns the total number of test cases.
Run Management
list_runs/1
Lists evaluation runs.

Options:
- Your Ecto repo module
- Maximum number of runs to return
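A usage sketch; :repo and :limit are assumed option keys, and the bare-list return shape is also an assumption:

```elixir
# Illustrative sketch: option keys and return shape are assumptions.
runs = Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)

for run <- runs do
  IO.inspect({run.mode, run.metrics}, label: "run")
end
```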
get_run/2
Retrieves a specific evaluation run.
delete_run/2
Deletes an evaluation run.
Metrics
Arcana.Evaluation provides standard information retrieval metrics:
Recall@k
Percentage of relevant documents retrieved in the top k results.
Precision@k
Percentage of retrieved documents in the top k results that are relevant.
MRR (Mean Reciprocal Rank)
The average, across test cases, of the reciprocal rank of the first relevant document.
NDCG@k (Normalized Discounted Cumulative Gain)
Measures ranking quality, giving more weight to relevant documents at higher positions.
Complete Example
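An end-to-end sketch combining the functions documented above. Except for evaluate_answers, every option key shown (:repo, :llm, :count, :mode, :k, :limit) is an assumed name, so treat this as an illustration rather than exact API:

```elixir
# End-to-end sketch; option keys other than evaluate_answers are assumed.

# 1. Generate synthetic test cases from indexed chunks.
{:ok, _test_cases} =
  Arcana.Evaluation.generate_test_cases(
    repo: MyApp.Repo,
    llm: "gpt-4o-mini",
    count: 100
  )

# 2. Evaluate each search mode against the same test cases.
for mode <- [:semantic, :fulltext, :hybrid] do
  {:ok, run} =
    Arcana.Evaluation.run(repo: MyApp.Repo, mode: mode, k: 5)

  IO.inspect(run.metrics, label: "#{mode}")
end

# 3. Review stored runs to track quality over time.
Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 20)
```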
Best Practices
Generate Representative Test Cases
- Sample from diverse documents and topics
- Include both easy and hard questions
- Aim for 50-200 test cases for reliable metrics
- Regenerate periodically as your content evolves
Run Regular Evaluations
- Evaluate after configuration changes
- Compare different search modes
- Track metrics over time
- Test with different chunk sizes and overlap
Interpret Metrics Together
- High recall, low precision: Too many irrelevant results
- Low recall, high precision: Missing relevant results
- High MRR: Relevant results ranked highly
- Use NDCG@k for ranking quality assessment
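The recall/precision trade-off above is easier to see with concrete numbers. This self-contained sketch (no Arcana required) computes recall@k, precision@k, and the reciprocal rank for one test case whose single relevant chunk is returned at rank 2:

```elixir
# Toy metric computation for one test case with one relevant chunk ("c3").
retrieved = ["c7", "c3", "c9", "c1", "c4"]  # ranked search results
relevant = MapSet.new(["c3"])
k = 5

top_k = Enum.take(retrieved, k)
hits = Enum.count(top_k, &MapSet.member?(relevant, &1))

recall_at_k = hits / MapSet.size(relevant)  # 1/1 = 1.0 (found the chunk)
precision_at_k = hits / length(top_k)       # 1/5 = 0.2 (4 of 5 irrelevant)

# Reciprocal rank of the first relevant result; averaging this over all
# test cases gives MRR.
rr =
  case Enum.find_index(retrieved, &MapSet.member?(relevant, &1)) do
    nil -> 0.0
    idx -> 1 / (idx + 1)
  end                                       # rank 2, so 1/2 = 0.5
```

High recall with low precision, as in this toy case, is the "too many irrelevant results" pattern described above.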
Optimize Based on Metrics
- Recall@5 < 0.7: Increase search limit or adjust chunking
- Precision@5 < 0.6: Add re-ranking or adjust thresholds
- MRR < 0.6: Review embedding model or query rewriting
- Low NDCG@5: Improve ranking (hybrid search, re-ranking)
Related
- Evaluation Guide - Comprehensive evaluation guide
- Search Algorithms - Understanding search modes
- Re-ranking - Improving result quality
- Telemetry - Monitoring search performance