Arcana.Evaluation

The Arcana.Evaluation module provides tools for measuring and improving retrieval quality using standard information retrieval metrics.

Overview

Evaluation helps you:
  • Generate synthetic test cases from your document chunks
  • Measure retrieval performance (MRR, Recall, Precision, NDCG)
  • Compare different search modes and configurations
  • Track quality improvements over time

Main Functions

generate_test_cases/1

Generates synthetic test cases from existing chunks using an LLM.
{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: "openai:gpt-4o-mini",
  sample_size: 50
)
Options:
  • repo (module, required): Your Ecto repo module
  • llm (string | function | module, required): LLM for generating questions; a model string, a function, or a module implementing Arcana.LLM
  • sample_size (integer, default: 50): Number of chunks to sample for test case generation
  • source_id (string, optional): Limit to chunks from a specific source
  • collection (string, optional): Limit to chunks from a specific collection
  • prompt (function, optional): Custom prompt template, fn chunk_text -> prompt end
Returns: {:ok, [%TestCase{}]} or {:error, reason}
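
For example, you can steer the style of generated questions with a custom prompt. A minimal sketch using the options documented above; the prompt text itself is illustrative, not a built-in template:

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: "openai:gpt-4o-mini",
  sample_size: 25,
  collection: "docs",
  prompt: fn chunk_text ->
    """
    Write one specific question that the following passage answers directly.
    Return only the question.

    #{chunk_text}
    """
  end
)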

run/1

Runs evaluation against existing test cases and returns metrics.
{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic
)

run.metrics
# => %{
#   recall_at_5: 0.84,
#   precision_at_5: 0.68,
#   mrr: 0.76,
#   ndcg_at_5: 0.81
# }
Options:
  • repo (module, required): Your Ecto repo module
  • mode (atom, default: :semantic): Search mode; one of :semantic, :fulltext, or :hybrid
  • source_id (string, optional): Limit evaluation to a specific source
  • collection (string, optional): Limit evaluation to a specific collection
  • evaluate_answers (boolean, default: false): Also evaluate answer quality (requires an LLM)
  • llm (string | function | module): LLM for answer evaluation; required when evaluate_answers: true
  • k (integer, default: 5): Number of results to evaluate (for recall@k, precision@k, and NDCG@k)
Returns: {:ok, %Run{}} or {:error, reason}

The returned Run struct contains:
  • metrics: Map of metric name to score
  • test_case_count: Number of test cases evaluated
  • mode: Search mode used
  • inserted_at: Timestamp
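
For example, to also score answer quality at a deeper cutoff, combining the options above:

{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :hybrid,
  k: 10,
  evaluate_answers: true,
  llm: "openai:gpt-4o-mini"
)

With k: 10, the @k metrics are computed over the top 10 results.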

Test Case Management

list_test_cases/1

Lists all test cases.
{:ok, test_cases} = Arcana.Evaluation.list_test_cases(repo: MyApp.Repo)
Options:
  • repo (module, required): Your Ecto repo module
  • source_id (string, optional): Filter by source ID
  • collection (string, optional): Filter by collection

get_test_case/2

Retrieves a specific test case by ID.
{:ok, test_case} = Arcana.Evaluation.get_test_case(id, repo: MyApp.Repo)

create_test_case/1

Manually creates a test case.
{:ok, test_case} = Arcana.Evaluation.create_test_case(
  repo: MyApp.Repo,
  question: "What is Elixir?",
  expected_chunk_id: chunk_id,
  source_id: "docs"
)
Options:
  • repo (module, required): Your Ecto repo module
  • question (string, required): The test question
  • expected_chunk_id (string, required): UUID of the chunk that should be retrieved
  • source_id (string, optional): Source identifier
  • collection (string, optional): Collection name

delete_test_case/2

Deletes a test case.
:ok = Arcana.Evaluation.delete_test_case(id, repo: MyApp.Repo)

count_test_cases/1

Returns the total number of test cases.
{:ok, count} = Arcana.Evaluation.count_test_cases(repo: MyApp.Repo)

Run Management

list_runs/1

Lists evaluation runs.
{:ok, runs} = Arcana.Evaluation.list_runs(repo: MyApp.Repo)
Options:
  • repo (module, required): Your Ecto repo module
  • limit (integer, default: 50): Maximum number of runs to return

get_run/2

Retrieves a specific evaluation run.
{:ok, run} = Arcana.Evaluation.get_run(id, repo: MyApp.Repo)

delete_run/2

Deletes an evaluation run.
:ok = Arcana.Evaluation.delete_run(id, repo: MyApp.Repo)

Metrics

Arcana.Evaluation provides standard information retrieval metrics:

Recall@k

The fraction of relevant documents that appear in the top k results.

Precision@k

The fraction of the top k results that are relevant.

MRR (Mean Reciprocal Rank)

The reciprocal rank (1/rank) of the first relevant result, averaged across all test cases.

NDCG@k (Normalized Discounted Cumulative Gain)

Measures ranking quality, giving more weight to relevant documents ranked nearer the top of the results.
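
To make these definitions concrete, here is a standalone sketch of all four metrics for Arcana-style test cases, where each question has a single expected chunk. It is illustrative only, not Arcana's internal implementation:

defmodule MetricsSketch do
  # ranked: chunk IDs returned by search, best first
  # relevant: the expected_chunk_id for the test case

  def recall_at_k(ranked, relevant, k),
    do: if(relevant in Enum.take(ranked, k), do: 1.0, else: 0.0)

  def precision_at_k(ranked, relevant, k),
    do: Enum.count(Enum.take(ranked, k), &(&1 == relevant)) / k

  def reciprocal_rank(ranked, relevant) do
    case Enum.find_index(ranked, &(&1 == relevant)) do
      nil -> 0.0
      idx -> 1.0 / (idx + 1)
    end
  end

  # With one relevant chunk, the ideal ranking puts it first (DCG = 1.0),
  # so NDCG@k reduces to 1 / log2(rank + 1) when the hit is in the top k.
  def ndcg_at_k(ranked, relevant, k) do
    case Enum.find_index(Enum.take(ranked, k), &(&1 == relevant)) do
      nil -> 0.0
      idx -> 1.0 / :math.log2(idx + 2)
    end
  end
end

# Expected chunk "c2" comes back at rank 2 out of 5:
ranked = ["c9", "c2", "c7", "c4", "c1"]
MetricsSketch.recall_at_k(ranked, "c2", 5)    # => 1.0
MetricsSketch.precision_at_k(ranked, "c2", 5) # => 0.2
MetricsSketch.reciprocal_rank(ranked, "c2")   # => 0.5
MetricsSketch.ndcg_at_k(ranked, "c2", 5)      # => 0.631 (1 / log2(3))

# MRR is this reciprocal rank averaged over all test cases.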

Complete Example

defmodule MyApp.Evaluation do
  alias Arcana.Evaluation

  def evaluate_retrieval do
    repo = MyApp.Repo
    llm = "openai:gpt-4o-mini"

    # 1. Generate test cases
    IO.puts("Generating test cases...")
    {:ok, test_cases} = Evaluation.generate_test_cases(
      repo: repo,
      llm: llm,
      sample_size: 100,
      collection: "docs"
    )
    IO.puts("Generated #{length(test_cases)} test cases")

    # 2. Evaluate semantic search
    IO.puts("\nEvaluating semantic search...")
    {:ok, semantic_run} = Evaluation.run(
      repo: repo,
      mode: :semantic,
      collection: "docs"
    )
    print_metrics("Semantic", semantic_run.metrics)

    # 3. Evaluate hybrid search
    IO.puts("\nEvaluating hybrid search...")
    {:ok, hybrid_run} = Evaluation.run(
      repo: repo,
      mode: :hybrid,
      collection: "docs"
    )
    print_metrics("Hybrid", hybrid_run.metrics)

    # 4. Compare results
    compare_runs(semantic_run, hybrid_run)
  end

  defp print_metrics(label, metrics) do
    IO.puts("#{label} Search Metrics:")
    IO.puts("  Recall@5: #{Float.round(metrics.recall_at_5, 3)}")
    IO.puts("  Precision@5: #{Float.round(metrics.precision_at_5, 3)}")
    IO.puts("  MRR: #{Float.round(metrics.mrr, 3)}")
    IO.puts("  NDCG@5: #{Float.round(metrics.ndcg_at_5, 3)}")
  end

  defp compare_runs(run1, run2) do
    IO.puts("\nComparison:")
    improvement = (run2.metrics.mrr - run1.metrics.mrr) / run1.metrics.mrr * 100
    IO.puts("  MRR improvement: #{Float.round(improvement, 1)}%")
  end
end

Best Practices

Generating test cases

  • Sample from diverse documents and topics
  • Include both easy and hard questions
  • Aim for 50-200 test cases for reliable metrics
  • Regenerate periodically as your content evolves

Running evaluations

  • Evaluate after configuration changes
  • Compare different search modes
  • Test with different chunk sizes and overlap
  • Track metrics over time (see the sketch after this list)

Interpreting results

  • High recall, low precision: too many irrelevant results
  • Low recall, high precision: missing relevant results
  • High MRR: relevant results are ranked highly
  • Use NDCG@k for ranking quality assessment

Acting on low scores

  • Recall@5 < 0.7: increase the search limit or adjust chunking
  • Precision@5 < 0.6: add re-ranking or adjust thresholds
  • MRR < 0.6: review the embedding model or query rewriting
  • Low NDCG@5: improve ranking (hybrid search, re-ranking)
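
A minimal sketch for tracking MRR across recent runs, using list_runs/1 and the Run fields documented above:

{:ok, runs} = Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)

for run <- runs do
  IO.puts("#{run.inserted_at}  #{run.mode}  MRR: #{Float.round(run.metrics.mrr, 3)}")
end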
