What is LLM Evaluation?
LLM-as-a-judge evaluation uses language models to assess the quality of LLM outputs based on specific criteria. This approach scales better than human evaluation while maintaining high correlation with human judgments. Phoenix evaluations produce structured results with three components:
- Score: Numeric value (typically 0-1 or boolean) indicating quality
- Label: Categorical classification (e.g., “relevant”, “irrelevant”)
- Explanation: Chain-of-thought reasoning explaining the judgment
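For illustration, a single judgment might look like the following (the values are made up; the underlying result object is described under Evaluation Results below):

```python
# Illustrative example of one structured evaluation result.
judgment = {
    "score": 1.0,        # numeric quality signal
    "label": "relevant",  # categorical classification
    "explanation": "The answer directly addresses the user's question about refunds.",
}
```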
Pre-Built Evaluators
Phoenix includes production-ready evaluators for common LLM evaluation tasks (from src/phoenix/experiments/evaluators/):
LLM Evaluators
- Relevance
- Helpfulness
- Coherence
- Conciseness
For example, the Relevance evaluator judges whether the output is relevant to the input question.
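A minimal sketch of wiring one up, assuming the evaluator class names mirror the list above and that a judge model from phoenix.evals is passed in (import paths and the model keyword may differ across versions):

```python
from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import HelpfulnessEvaluator, RelevanceEvaluator

# The judge needs its own LLM; gpt-4o is only an example choice.
judge = OpenAIModel(model="gpt-4o")

evaluators = [
    RelevanceEvaluator(model=judge),    # is the output relevant to the input?
    HelpfulnessEvaluator(model=judge),  # does the output actually help the user?
]
# These are typically passed to run_experiment(dataset, task, evaluators=evaluators).
```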
Code Evaluators
Code-based evaluators provide deterministic validation without LLM calls:
- Contains Keyword
- Regex Match
- JSON Parsable
- Contains All Keywords
For example, the Contains Keyword evaluators check whether the output contains specific keywords.
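A sketch combining several code evaluators; the constructor arguments shown here are assumptions, so check each class's signature in your installed version:

```python
from phoenix.experiments.evaluators import ContainsKeyword, JSONParsable, MatchesRegex

evaluators = [
    ContainsKeyword("refund"),           # passes if the keyword appears in the output
    MatchesRegex(r"\d{4}-\d{2}-\d{2}"),  # passes if the output contains an ISO-style date
    JSONParsable(),                      # passes if the output parses as valid JSON
]
```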
Custom Evaluators
Custom Criteria Evaluator
Create evaluators for domain-specific criteria using LLMCriteriaEvaluator (from src/phoenix/experiments/evaluators/llm_evaluators.py):
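A sketch of a custom criterion; the exact parameter names of LLMCriteriaEvaluator (name, criteria, description, model) are assumptions, so verify them against the class in your installed version:

```python
from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import LLMCriteriaEvaluator

# Judge whether responses stay courteous and professional.
politeness = LLMCriteriaEvaluator(
    name="politeness",
    criteria="politeness",
    description="The response is courteous and professional in tone.",
    model=OpenAIModel(model="gpt-4o"),
)
```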
Custom Function Evaluator
Wrap any Python function as an evaluator using create_evaluator (from src/phoenix/experiments/evaluators/utils.py):
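A sketch using create_evaluator as a decorator; the kind argument and the injected parameter names (output, expected) are assumptions based on the conventions described here:

```python
from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(name="exact_match", kind="code")
def exact_match(output: str, expected: str) -> float:
    # Score 1.0 when the task output matches the expected answer exactly.
    return float(output.strip() == expected.strip())
```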
Async Evaluators
Async evaluation functions are supported as well:
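For example, an evaluator that awaits an external check can be written as a coroutine (whether it needs to be wrapped with create_evaluator first depends on your version):

```python
import asyncio

async def no_pii(output: str) -> bool:
    # Stand-in for an async call to a PII-detection service.
    await asyncio.sleep(0)
    return "ssn" not in output.lower()
```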
Evaluation Results
Evaluators return EvaluationResult objects (from src/phoenix/experiments/types.py) with optional fields:
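A sketch of returning a fully populated result from a custom evaluator; because the fields are optional, you can return only the ones your criterion supports:

```python
from phoenix.experiments.types import EvaluationResult

def grounded(output: str, reference: str) -> EvaluationResult:
    # Crude token-overlap heuristic standing in for a real groundedness check.
    overlap = len(set(output.lower().split()) & set(reference.lower().split()))
    return EvaluationResult(
        score=min(overlap / 10, 1.0),
        label="grounded" if overlap else "ungrounded",
        explanation=f"{overlap} tokens overlap with the reference text.",
    )
```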
Evaluation Output Formats
Evaluators can return multiple formats that are automatically converted to EvaluationResult:
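For example, plain booleans and floats are common shorthands (the full set of accepted return types is version-dependent):

```python
import json

def is_json(output: str) -> bool:
    # bool -> converted to a 1.0/0.0 score with a matching label.
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def brevity(output: str) -> float:
    # float -> converted to a score-only EvaluationResult.
    return 1.0 if len(output) <= 500 else 0.5
```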
Client-Side vs Server-Side Evaluation
Client-Side Evaluation
Run evaluations during experiments or development:
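A sketch of a client-side run; the dataset name, the dataset-fetch call, and the "question" input field are illustrative assumptions:

```python
import phoenix as px
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsKeyword

dataset = px.Client().get_dataset(name="support-questions")

def task(input):
    # Replace with a call into your own application or LLM.
    return f"Echo: {input['question']}"

experiment = run_experiment(
    dataset,
    task,
    evaluators=[ContainsKeyword("refund")],
)
```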
Server-Side Evaluation
Evaluate production traces automatically using the Phoenix UI: click “Add Evaluator” and select from the pre-built evaluators or define custom criteria. Server-side evaluation is useful for:
- Continuous monitoring of production traces
- Batch evaluation of historical data
- Team collaboration without sharing code
Evaluation on Spans vs Experiments
Span Evaluations
Evaluate individual spans from traces using SpanEvaluations (from src/phoenix/trace/span_evaluations.py):
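A sketch of logging a simple span-level score back to Phoenix; the span dataframe's column names (such as attributes.output.value) are assumptions and depend on how your spans are instrumented:

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

client = px.Client()
spans_df = client.get_spans_dataframe()  # indexed by span id

# A trivial stand-in for a real evaluator: score 1.0 when the span produced any output.
evals_df = spans_df.assign(
    score=spans_df["attributes.output.value"].fillna("").str.len().gt(0).astype(float)
)

client.log_evaluations(
    SpanEvaluations(eval_name="non_empty_output", dataframe=evals_df[["score"]])
)
```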
Experiment Evaluations
Evaluate experiment runs systematically (see Experiments):
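For example, an additional judge can be attached to an experiment that has already run, without re-executing the task (the evaluator class name is assumed from the Conciseness entry above; the experiment object is the one returned by run_experiment in the client-side sketch):

```python
from phoenix.evals import OpenAIModel
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators import ConcisenessEvaluator

evaluate_experiment(
    experiment,  # returned by run_experiment in the client-side example
    evaluators=[ConcisenessEvaluator(model=OpenAIModel(model="gpt-4o"))],
)
```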
Advanced Features
Rate Limiting
Handle API rate limits gracefully:
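How Phoenix throttles its own LLM calls is version-specific; if you wrap your own evaluator calls, a generic exponential-backoff helper (not a Phoenix API) is one way to absorb rate-limit errors:

```python
import random
import time

def call_with_backoff(evaluate, *args, max_retries=5, **kwargs):
    """Retry an evaluator call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return evaluate(*args, **kwargs)
        except Exception:  # ideally catch your provider's specific RateLimitError
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())
```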
Concurrency Control
Control parallel evaluation execution:
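If you drive async evaluators by hand, a generic asyncio semaphore (again, not a Phoenix API) bounds how many run at once:

```python
import asyncio

async def evaluate_all(evaluator, outputs, max_concurrency=5):
    """Run an async evaluator over many outputs with bounded parallelism."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(output):
        async with semaphore:
            return await evaluator(output)

    return await asyncio.gather(*(bounded(output) for output in outputs))
```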
Evaluation Datasets
Save and load evaluation results:
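The persistence helpers vary across versions; a plain pandas round-trip of the evaluation dataframe (an assumption, not the documented API) keeps results reusable between sessions:

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# Persist the span-indexed evaluation dataframe built in the span-evaluation sketch...
evals_df[["score"]].to_parquet("non_empty_output_evals.parquet")

# ...and reload it later to re-log or re-analyze the results.
reloaded = pd.read_parquet("non_empty_output_evals.parquet")
px.Client().log_evaluations(
    SpanEvaluations(eval_name="non_empty_output", dataframe=reloaded)
)
```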
Next Steps
Experiments
Run systematic experiments with evaluators
Datasets
Create evaluation datasets from traces
Pre-Built Evals
Complete reference for all evaluators
Custom Evaluators
Build advanced custom evaluation logic