Arcana provides tools to evaluate how well your RAG pipeline retrieves relevant information and generates faithful answers.

Overview

Arcana’s evaluation system includes:
  1. Test Cases - Questions paired with their known relevant chunks
  2. Evaluation Runs - Execute searches and measure performance
  3. Metrics - Standard IR metrics (MRR, Precision, Recall, Hit Rate) + Faithfulness

Creating Test Cases

Manual Test Cases

Create test cases when you know which chunks should be retrieved for a question:
# First, find the chunk you want to use as ground truth
chunks = Arcana.search("GenServer state", repo: MyApp.Repo, limit: 1)
chunk = hd(chunks)

# Create a test case linking question to relevant chunk
{:ok, test_case} = Arcana.Evaluation.create_test_case(
  repo: MyApp.Repo,
  question: "How do you manage state in Elixir?",
  relevant_chunk_ids: [chunk.id]
)

Synthetic Test Cases

Generate test cases automatically using an LLM:
{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50
)
The generator samples random chunks and asks the LLM to create questions that should retrieve those chunks.

Filtering by Collection

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50,
  collection: "elixir-docs"
)

Running Evaluations

Basic Evaluation

Run an evaluation against all test cases:
{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic  # or :fulltext, :hybrid
)

Evaluating Answer Quality

For end-to-end RAG evaluation, also score the quality of the generated answers:
{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic,
  evaluate_answers: true,
  llm: Application.get_env(:arcana, :llm)
)

# Includes faithfulness metric
run.metrics.faithfulness  # => 7.8 (0-10 scale)
When evaluate_answers: true is set:
  1. Generate Answer - Generates an answer for each test case using the retrieved chunks
  2. Score Faithfulness - Uses LLM-as-judge to score how faithful the answer is to the context
  3. Aggregate - Aggregates scores into an overall faithfulness metric
Faithfulness measures whether the generated answer is grounded in the retrieved chunks (0 = hallucinated, 10 = fully faithful).
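The judge step can be pictured with a small sketch. This is illustrative only, not Arcana's internal implementation: the prompt wording and the reply parsing here are assumptions. It builds a judging prompt from the question, retrieved context, and generated answer, then parses a 0-10 score from the model's reply:

```elixir
# Illustrative LLM-as-judge sketch; not Arcana's actual prompt or parser.
defmodule FaithfulnessSketch do
  # Build a judging prompt (wording is a made-up example).
  def judge_prompt(question, context, answer) do
    """
    You are grading a RAG answer for faithfulness.
    Question: #{question}
    Context: #{context}
    Answer: #{answer}
    Reply with a single integer from 0 (hallucinated) to 10 (fully faithful).
    """
  end

  # Parse the judge's reply into a 0-10 score, clamping out-of-range values.
  def parse_score(reply) do
    case Integer.parse(String.trim(reply)) do
      {n, _rest} -> n |> max(0) |> min(10)
      :error -> nil
    end
  end
end

FaithfulnessSketch.parse_score("8")   # => 8
FaithfulnessSketch.parse_score("12")  # => 10 (clamped)
```

Per-case scores produced this way would then be averaged into the overall faithfulness metric.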

Understanding Results

# Overall metrics
run.metrics
# => %{
#   recall_at_1: 0.62,
#   recall_at_3: 0.78,
#   recall_at_5: 0.84,
#   recall_at_10: 0.91,
#   precision_at_1: 0.62,
#   precision_at_3: 0.52,
#   precision_at_5: 0.34,
#   precision_at_10: 0.18,
#   mrr: 0.76,
#   hit_rate_at_1: 0.62,
#   hit_rate_at_3: 0.78,
#   hit_rate_at_5: 0.84,
#   hit_rate_at_10: 0.91,
#   faithfulness: 7.8  # If evaluate_answers: true
# }

# Per-case results
run.results
# => %{"case-id" => %{hit: true, rank: 2, ...}, ...}

# Configuration used
run.config
# => %{mode: :semantic, embedding: %{model: "...", dimensions: 384}}
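The per-case map is handy for finding queries the retriever missed entirely. A minimal sketch, assuming each result entry carries the hit flag shown above (the sample data here is made up):

```elixir
# Collect IDs of test cases where no relevant chunk was retrieved.
# `results` mirrors the run.results shape shown above; values are examples.
results = %{
  "case-1" => %{hit: true, rank: 2},
  "case-2" => %{hit: false, rank: nil}
}

misses = for {id, %{hit: false}} <- results, do: id
# => ["case-2"]
```

Inspecting the missed questions directly is often the fastest way to decide whether chunking, embeddings, or the search mode is at fault.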

Comparing Configurations

Run evaluations with different settings to find the best configuration:
# Test semantic search
{:ok, semantic_run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :semantic)

# Test hybrid search
{:ok, hybrid_run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :hybrid)

# Compare
IO.puts("Semantic MRR: #{semantic_run.metrics.mrr}")
IO.puts("Hybrid MRR: #{hybrid_run.metrics.mrr}")

Managing Test Cases and Runs

List existing test cases:
test_cases = Arcana.Evaluation.list_test_cases(repo: MyApp.Repo)

Dashboard

The Arcana Dashboard provides a visual interface for evaluation:
  • Test Cases tab - View, generate, and delete test cases
  • Run Evaluation tab - Execute evaluations with different search modes
  • History tab - View past runs with metrics
See the Dashboard Guide for setup instructions.

Metrics Explained

Retrieval Metrics

Metric                     | Description                                                    | Good Value
MRR (Mean Reciprocal Rank) | Average of 1/rank for the first relevant result                | > 0.7
Recall@K                   | Fraction of relevant chunks found in the top K                 | > 0.8
Precision@K                | Fraction of the top K results that are relevant                | > 0.6
Hit Rate@K                 | Fraction of queries with at least one relevant result in top K | > 0.9
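For intuition, here is a minimal, self-contained sketch of how these standard metrics are computed (illustrative only, not Arcana's implementation). Each query is represented as a list of booleans marking whether the result at that rank was one of the query's ground-truth chunks:

```elixir
# Illustrative retrieval-metric computation; not Arcana's internals.
defmodule MetricsSketch do
  # MRR: mean of 1/rank of the first relevant result (0 when none found).
  def mrr(queries) do
    queries
    |> Enum.map(fn ranked ->
      case Enum.find_index(ranked, & &1) do
        nil -> 0.0
        idx -> 1.0 / (idx + 1)
      end
    end)
    |> average()
  end

  # Recall@K: fraction of a query's relevant chunks found in the top K.
  # `totals` lists how many relevant chunks exist for each query.
  def recall_at_k(queries, totals, k) do
    Enum.zip(queries, totals)
    |> Enum.map(fn {ranked, total} ->
      Enum.count(Enum.take(ranked, k), & &1) / total
    end)
    |> average()
  end

  # Precision@K: fraction of the top K results that are relevant.
  def precision_at_k(queries, k) do
    queries
    |> Enum.map(fn ranked -> Enum.count(Enum.take(ranked, k), & &1) / k end)
    |> average()
  end

  # Hit Rate@K: fraction of queries with >= 1 relevant result in the top K.
  def hit_rate_at_k(queries, k) do
    queries
    |> Enum.map(fn ranked -> if Enum.any?(Enum.take(ranked, k)), do: 1.0, else: 0.0 end)
    |> average()
  end

  defp average([]), do: 0.0
  defp average(list), do: Enum.sum(list) / length(list)
end

# Two queries: first relevant hit at rank 2, then at rank 1.
queries = [[false, true, false], [true, false, false]]
MetricsSketch.mrr(queries)               # => 0.75
MetricsSketch.hit_rate_at_k(queries, 1)  # => 0.5
```

Note how precision and recall pull in opposite directions as K grows, which matches the example numbers in "Understanding Results" above.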

Answer Quality Metrics

Metric       | Description                                                 | Good Value
Faithfulness | How well the answer is grounded in retrieved context (0-10) | > 7.0

Which Metric to Focus On?

  • MRR - Best for single-answer scenarios where you need the relevant chunk first. Example: documentation search where the first result should answer the question.
  • Recall - Important when you need to find all relevant information. Example: research applications where missing relevant chunks is costly.
  • Precision - Matters when you want to minimize irrelevant context. Example: reducing LLM token costs by only including relevant chunks.
  • Hit Rate - Good baseline to ensure retrieval is working at all. Example: verifying that your system finds at least one relevant result.
  • Faithfulness - Essential for preventing hallucinations in generated answers. Example: customer support where accuracy is critical.

Best Practices

Diverse test cases

Cover different topics and question types for reliable evaluation

Sufficient sample size

Aim for 50+ test cases for statistically reliable metrics

Regular evaluation

Re-run after changing embeddings, chunking, or search settings

Track over time

Compare runs to ensure changes improve quality

Use collection filtering

Evaluate specific document collections separately

Test all search modes

Compare semantic, fulltext, and hybrid to find what works best

Next Steps

Search Algorithms

Understand how different search modes affect metrics

Re-ranking

Improve precision with second-stage scoring

Dashboard

View evaluation results in the web UI

Telemetry

Monitor evaluation performance
