Evaluation & Testing Overview
Evaluation is critical for building reliable generative AI applications. This guide introduces evaluation concepts, metrics, and best practices for testing AI systems on Google Cloud.
Why Evaluation Matters
Evaluation helps you:
- Measure quality: Quantify how well your AI performs against requirements
- Compare models: Make data-driven decisions when selecting or migrating models
- Track improvements: Monitor performance as you iterate on prompts and configurations
- Ensure reliability: Detect issues like hallucinations, toxicity, and bias
- Build trust: Provide evidence that your system works as intended
Evaluation Approaches
Model-Based Evaluation
Use AI models to evaluate AI outputs: a language model acts as a judge, scoring responses against predefined rubrics. Model-based metrics assess qualities such as:
- Coherence: Logical flow and consistency of responses
- Fluency: Natural language quality and readability
- Safety: Detection of harmful or toxic content
- Groundedness: Alignment with provided context
- Instruction following: Adherence to prompt requirements
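A model-based metric can be sketched as a rubric prompt sent to a judge model. The rubric wording and the `build_judge_prompt` helper below are illustrative assumptions, not the API of any specific evaluation service:

```python
# Sketch of an LLM-as-judge rubric. The rubric text and helper name are
# illustrative; real services define their own rubric formats.
COHERENCE_RUBRIC = """\
Rate the RESPONSE on a 1-5 scale for coherence:
5 = completely logical and self-consistent
1 = contradictory or disjointed
Return only the integer score."""

def build_judge_prompt(rubric: str, prompt: str, response: str) -> str:
    """Assemble the text sent to the judge model for one example."""
    return (
        f"{rubric}\n\n"
        f"PROMPT:\n{prompt}\n\n"
        f"RESPONSE:\n{response}\n\n"
        "SCORE:"
    )

judge_input = build_judge_prompt(
    COHERENCE_RUBRIC,
    prompt="Explain photosynthesis in one sentence.",
    response="Plants convert sunlight, water, and CO2 into glucose and oxygen.",
)
```

The judge model's reply would then be parsed into a numeric score and aggregated across the dataset.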
Reference-Based Evaluation
Compare model outputs against golden reference answers using statistical metrics. Common reference-based metrics:
- ROUGE: Measures overlap with reference text (recall-focused)
- BLEU: Measures n-gram precision against references
- Exact Match: Binary score for perfect matches
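Two of these metrics can be sketched in plain Python. Note the ROUGE-1 variant here is deliberately simplified to unigram recall over unique tokens; production implementations handle token counts, stemming, and n-grams:

```python
def exact_match(candidate: str, reference: str) -> int:
    """Binary score: 1 only if the strings match exactly after stripping."""
    return int(candidate.strip() == reference.strip())

def rouge1_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 recall: fraction of unique reference tokens
    that also appear in the candidate."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)
```

For example, `rouge1_recall("the cat sat down", "the cat")` scores 1.0 because every reference token appears in the candidate, even though the candidate adds extra words.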
Computation-Based Evaluation
Apply deterministic algorithms to evaluate specific properties. Examples:
- Tool call validation: Verify function calling correctness
- Schema compliance: Check JSON structure validity
- Length constraints: Measure response verbosity
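A schema-compliance check can be sketched with the standard library alone; the `required_keys` convention is an illustrative assumption, not a fixed contract:

```python
import json

def check_json_response(text: str, required_keys: set) -> bool:
    """Deterministic check: the response parses as a JSON object
    and contains every required top-level key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

ok = check_json_response('{"answer": "42", "source": "doc1"}',
                         {"answer", "source"})
bad = check_json_response('not json at all', {"answer"})
```

Because checks like this are deterministic, they can run on every example at negligible cost, unlike model-based judges.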
Core Evaluation Concepts
Datasets
Evaluation requires well-structured datasets. For best results, use at least 100 examples; larger datasets yield more statistically reliable metrics.
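An evaluation dataset can be as simple as a list of prompt/reference pairs; the column names below are illustrative, not required by any particular framework:

```python
# One row per evaluation example. "prompt" is the model input and
# "reference" is the golden answer used by reference-based metrics.
eval_dataset = [
    {
        "prompt": "Summarize: The mitochondria is the powerhouse of the cell.",
        "reference": "Mitochondria produce the cell's energy.",
    },
    {
        "prompt": "What year did Apollo 11 land on the Moon?",
        "reference": "1969",
    },
]
```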
Metrics
Metrics quantify different aspects of quality:
Model-Based
Assess subjective qualities like helpfulness and relevance using AI judges
Reference-Based
Compare outputs to ground truth using statistical measures
Custom Metrics
Define domain-specific evaluation criteria for your use case
Computation-Based
Apply deterministic rules and algorithms
Experiments
Organize evaluations into experiments to track and compare results.
Evaluation Frameworks
EvalTask API
The traditional approach for model evaluation, suited to:
- Model comparison and selection
- Prompt engineering evaluation
- RAG system assessment
Gen AI Evaluation Service SDK
The modern SDK for comprehensive evaluation, adding:
- Agent evaluation with traces
- Persistent evaluation runs
- Advanced visualizations
Best Practices
Start with clear objectives
Define what “good” means for your specific use case before selecting metrics.
Use multiple metrics
No single metric captures all aspects of quality. Combine metrics for comprehensive assessment.
Build quality datasets
Invest in diverse, representative evaluation datasets with at least 100 examples.
Evaluation Workflow
- Define metrics: Select appropriate metrics for your use case
- Prepare dataset: Create evaluation examples with prompts and references
- Run evaluation: Execute evaluation using chosen framework
- Analyze results: Review metrics and identify areas for improvement
- Iterate: Refine prompts, models, or configurations based on insights
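The workflow above can be sketched as a minimal loop. All names here are illustrative, and the model call is replaced with an offline stand-in so the sketch is runnable:

```python
def exact_match(response: str, reference: str) -> float:
    return float(response.strip() == reference.strip())

def run_evaluation(dataset, generate, metrics):
    """Steps 2-4 of the workflow: run the model on each example, score
    every metric, and report per-metric averages for analysis."""
    totals = {name: 0.0 for name in metrics}
    for row in dataset:
        response = generate(row["prompt"])
        for name, metric in metrics.items():
            totals[name] += metric(response, row["reference"])
    return {name: total / len(dataset) for name, total in totals.items()}

# Stand-in for a real model call, so the loop runs offline.
def fake_generate(prompt: str) -> str:
    return "1969" if "Apollo" in prompt else "unknown"

dataset = [
    {"prompt": "What year did Apollo 11 land?", "reference": "1969"},
    {"prompt": "Capital of France?", "reference": "Paris"},
]
results = run_evaluation(dataset, fake_generate,
                         {"exact_match": exact_match})
```

Here `results` averages to 0.5 because the stand-in model answers only one of the two examples correctly; in step 5 you would inspect the failing examples and refine the prompt or model.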
Common Evaluation Scenarios
Text Generation
- Metrics: coherence, fluency, text_quality
- Focus: Natural language quality and readability
Question Answering
- Metrics: question_answering_quality, groundedness, relevance
- Focus: Accuracy and context alignment
Summarization
- Metrics: summarization_quality, rouge, verbosity
- Focus: Information coverage and conciseness
RAG Applications
- Metrics: groundedness, relevance, hallucination
- Focus: Context utilization and factuality
Agent Systems
- Metrics: tool_use_quality, final_response_quality, hallucination
- Focus: Tool calling accuracy and response quality
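Tool calling accuracy can be checked deterministically by comparing the agent's emitted call against an expected one. The call representation below (a dict with `name` and `args`) is an illustrative assumption; real agent frameworks each define their own trace format:

```python
def tool_call_matches(expected: dict, actual: dict) -> bool:
    """True if the agent chose the expected tool with the expected args."""
    return (
        actual.get("name") == expected["name"]
        and actual.get("args") == expected["args"]
    )

expected = {"name": "get_weather", "args": {"city": "Paris"}}
good = tool_call_matches(expected,
                         {"name": "get_weather", "args": {"city": "Paris"}})
wrong_args = tool_call_matches(expected,
                               {"name": "get_weather", "args": {"city": "Lyon"}})
```

Checks like this are typically combined with a model-based judge on the final response, since a correct tool call can still produce a poor answer.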
Next Steps
Gen AI Eval SDK
Learn about the modern evaluation SDK with predefined metrics
Agent Evaluation
Evaluate agentic systems with tool use and traces
Model Migration
Compare models to make informed migration decisions
View Evaluation Results
Visualize and analyze evaluation reports