## Why Evaluate?
Evaluations help you:

- **Measure Quality**: Quantify agent accuracy and reliability
- **Catch Regressions**: Detect when changes degrade performance
- **Compare Approaches**: Test different models, prompts, or configurations
- **Track Improvements**: Monitor performance over time
- **Build Confidence**: Ensure production readiness
## Types of Evaluations
Agno provides several evaluation frameworks:

- **Accuracy**: Measure how well agent output matches expected answers
- **Agent-as-Judge**: Use an LLM to evaluate response quality and correctness
- **Performance**: Track speed, token usage, and resource consumption
- **Reliability**: Measure consistency and error rates across runs
## Quick Start: Accuracy Evaluation
Test whether your agent produces correct answers.

### Basic Usage
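A minimal sketch of an accuracy eval, assuming the `AccuracyEval` API from `agno.eval.accuracy` (the `input`/`expected_output` parameter names vary across Agno versions, so check the Evals reference for yours). A judge model scores each run against the expected answer on a 1-10 scale:

```python
from typing import Optional

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat

# The agent under test
agent = Agent(model=OpenAIChat(id="gpt-4o-mini"), markdown=True)

evaluation = AccuracyEval(
    name="capital-cities",
    agent=agent,
    model=OpenAIChat(id="gpt-4o"),  # judge model that scores the agent's answers
    input="What is the capital of France?",
    expected_output="Paris",
    num_iterations=3,  # repeat to smooth out non-determinism
)

result: Optional[AccuracyResult] = evaluation.run(print_results=True)
if result is not None:
    assert result.avg_score >= 8, f"Accuracy too low: {result.avg_score}"
```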
### Custom Evaluator
Customize the judging criteria:
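One way to steer the judge is extra rubric text; this sketch assumes your Agno version supports `additional_guidelines` on `AccuracyEval`, alongside a separate, stronger judge via `model`:

```python
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

evaluation = AccuracyEval(
    agent=Agent(model=OpenAIChat(id="gpt-4o-mini")),
    model=OpenAIChat(id="gpt-4o"),  # stronger model as the judge
    input="Summarize the water cycle in two sentences.",
    expected_output="Water evaporates, condenses into clouds, and falls as precipitation.",
    # Extra instructions layered on top of the default scoring rubric
    additional_guidelines=(
        "Penalize answers longer than two sentences. "
        "Award full marks only if evaporation, condensation, and "
        "precipitation are all mentioned."
    ),
)
evaluation.run(print_results=True)
```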
### Dynamic Test Cases

Use callables for dynamic inputs:
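A sketch that assumes `input` and `expected_output` accept callables, resolved when the eval runs; if your version only takes strings, call the generators yourself before constructing the eval:

```python
import random

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

# Generate a fresh arithmetic problem for every eval run
a, b = random.randint(2, 9), random.randint(2, 9)

def make_input() -> str:
    return f"What is {a} * {b}? Answer with just the number."

def make_expected_output() -> str:
    return str(a * b)

evaluation = AccuracyEval(
    agent=Agent(model=OpenAIChat(id="gpt-4o-mini")),
    model=OpenAIChat(id="gpt-4o"),
    input=make_input,  # resolved at run time
    expected_output=make_expected_output,
)
evaluation.run(print_results=True)
```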
### Batch Testing

Test multiple cases:
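Batch runs need nothing beyond plain Python: loop over (input, expected) pairs and aggregate the scores. The 8/10 pass threshold below is an arbitrary assumption:

```python
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

# (input, expected_output) pairs that reflect real usage
test_cases = [
    ("What is the capital of Japan?", "Tokyo"),
    ("What is 12 * 12?", "144"),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen"),
]

agent = Agent(model=OpenAIChat(id="gpt-4o-mini"))
judge = OpenAIChat(id="gpt-4o")

scores = []
for question, expected in test_cases:
    result = AccuracyEval(
        agent=agent, model=judge, input=question, expected_output=expected
    ).run(print_results=False)
    if result is not None:
        scores.append(result.avg_score)

print(f"Passed {sum(s >= 8 for s in scores)}/{len(test_cases)} cases")
```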
## Agent-as-Judge Evaluation

Use an LLM to evaluate response quality:
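`AccuracyEval` already uses a judge model under the hood; the sketch below hand-rolls the pattern with a second `Agent` so the rubric is fully yours. The prompt wording and the 1-10 scale are illustrative assumptions:

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

question = "Explain recursion to a beginner."

# The agent under test
agent = Agent(model=OpenAIChat(id="gpt-4o-mini"))
response = agent.run(question)

# A second agent acts as the judge
judge = Agent(
    model=OpenAIChat(id="gpt-4o"),
    instructions=(
        "You grade AI responses. Score the response from 1 to 10 for "
        "clarity, correctness, and helpfulness. Reply with only the integer."
    ),
)
verdict = judge.run(f"Question: {question}\n\nResponse:\n{response.content}")
print(f"Judge score: {verdict.content.strip()}/10")
```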
## Performance Evaluation

Measure speed and resource usage:
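A sketch using `PerformanceEval` from `agno.eval.performance`, which times a callable over several iterations and samples memory; the `warmup_runs` parameter is an assumption based on recent releases:

```python
from agno.agent import Agent
from agno.eval.performance import PerformanceEval
from agno.models.openai import OpenAIChat

def run_agent():
    # Instantiate inside the function so setup cost is measured too
    agent = Agent(model=OpenAIChat(id="gpt-4o-mini"), system_message="Be concise.")
    return agent.run("Name three prime numbers.")

performance = PerformanceEval(
    func=run_agent,
    num_iterations=5,  # average over several runs
    warmup_runs=1,     # discard the first run (cold caches, connections)
)
performance.run(print_results=True)
```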
## Reliability Evaluation

Test consistency and error handling:
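Reliability evals check what the agent actually did rather than what it said, for example that the expected tool calls happened. This sketch follows the `ReliabilityEval` shape from recent releases; the calculator toolkit and its `multiply`/`exponentiate` tool names are assumptions from the stock tools:

```python
from typing import Optional

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools

agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    tools=[CalculatorTools(enable_all=True)],
)
response = agent.run("What is 10 * 5, then raised to the power of 2?")

# Verify the agent made the tool calls we expect
evaluation = ReliabilityEval(
    agent_response=response,
    expected_tool_calls=["multiply", "exponentiate"],
)
result: Optional[ReliabilityResult] = evaluation.run(print_results=True)
if result is not None:
    result.assert_passed()  # raises if an expected tool call is missing
```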
## Storing Results

Persist evaluation results in a database:
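Recent Agno releases can persist eval runs natively (look for a `db` parameter in the Evals reference); the version-agnostic sketch below records scores in SQLite with the standard library so trends can be charted later:

```python
import sqlite3
import time

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

conn = sqlite3.connect("eval_results.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS eval_runs (ts REAL, eval_name TEXT, avg_score REAL)"
)

result = AccuracyEval(
    name="capital-cities",
    agent=Agent(model=OpenAIChat(id="gpt-4o-mini")),
    model=OpenAIChat(id="gpt-4o"),
    input="What is the capital of France?",
    expected_output="Paris",
).run(print_results=False)

if result is not None:
    conn.execute(
        "INSERT INTO eval_runs VALUES (?, ?, ?)",
        (time.time(), "capital-cities", result.avg_score),
    )
    conn.commit()
```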
## Comparing Configurations

Compare different agent setups:
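Comparison is just the same eval run against each candidate; keep the judge model fixed so scores are comparable. The model choices below are placeholders:

```python
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

# Candidate configurations: different models, prompts, or tools
configs = {
    "gpt-4o-mini": Agent(model=OpenAIChat(id="gpt-4o-mini")),
    "gpt-4o": Agent(model=OpenAIChat(id="gpt-4o")),
}

for name, candidate in configs.items():
    result = AccuracyEval(
        agent=candidate,
        model=OpenAIChat(id="gpt-4o"),  # same judge for every candidate
        input="What is 15% of 240?",
        expected_output="36",
        num_iterations=3,
    ).run(print_results=False)
    if result is not None:
        print(f"{name}: avg score {result.avg_score:.1f}")
```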
## Regression Testing

Detect performance degradation:
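One lightweight approach: keep the last known-good score on disk and fail when the new score drops past a tolerance. The baseline file format and the 1.0-point tolerance are assumptions to adapt:

```python
import json
from pathlib import Path

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

BASELINE = Path("eval_baseline.json")  # e.g. {"capital-cities": 9.0}
EVAL_NAME = "capital-cities"

result = AccuracyEval(
    agent=Agent(model=OpenAIChat(id="gpt-4o-mini")),
    model=OpenAIChat(id="gpt-4o"),
    input="What is the capital of France?",
    expected_output="Paris",
    num_iterations=3,
).run(print_results=False)

if result is not None:
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    previous = baseline.get(EVAL_NAME)
    if previous is not None and result.avg_score < previous - 1.0:
        raise SystemExit(f"Regression: score dropped {previous} -> {result.avg_score}")
    baseline[EVAL_NAME] = result.avg_score
    BASELINE.write_text(json.dumps(baseline))
```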
## Async Evaluation

Run evaluations asynchronously:
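If your Agno version ships an async runner for evals, prefer it; the portable sketch below fans the synchronous `.run()` calls out to worker threads with `asyncio.to_thread` so independent evals execute concurrently:

```python
import asyncio

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

def make_eval(question: str, expected: str) -> AccuracyEval:
    return AccuracyEval(
        agent=Agent(model=OpenAIChat(id="gpt-4o-mini")),
        model=OpenAIChat(id="gpt-4o"),
        input=question,
        expected_output=expected,
    )

async def main():
    evals = [
        make_eval("What is the capital of Japan?", "Tokyo"),
        make_eval("What is 12 * 12?", "144"),
    ]
    # Run the blocking .run() calls concurrently in worker threads
    results = await asyncio.gather(
        *(asyncio.to_thread(e.run, print_results=False) for e in evals)
    )
    for result in results:
        if result is not None:
            print(result.avg_score)

asyncio.run(main())
```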
## CI/CD Integration

Integrate evals into your pipeline:
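A sketch of a pytest suite you can run as a CI step (`pytest test_agent_evals.py`); the 8/10 threshold and the test cases are placeholders, and the model API key should come from a CI secret:

```python
# test_agent_evals.py
import pytest

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIChat

CASES = [
    ("What is the capital of Japan?", "Tokyo"),
    ("What is 12 * 12?", "144"),
]

@pytest.mark.parametrize("question,expected", CASES)
def test_agent_accuracy(question: str, expected: str):
    result = AccuracyEval(
        agent=Agent(model=OpenAIChat(id="gpt-4o-mini")),
        model=OpenAIChat(id="gpt-4o"),
        input=question,
        expected_output=expected,
    ).run(print_results=False)
    assert result is not None
    assert result.avg_score >= 8, f"Low score on {question!r}: {result.avg_score}"
```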
## Best Practices

- **Representative Tests**: Use test cases that reflect real usage patterns
- **Multiple Iterations**: Run multiple times to account for non-determinism
- **Track Over Time**: Store results in a database to monitor trends
- **Automate Testing**: Integrate evals into your CI/CD pipeline
## Evaluation Schema
Accuracy evaluation results:
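The aggregate fields below reflect recent releases of `AccuracyResult` and are worth verifying against your installed version; each result also keeps the judge's per-iteration output:

```python
result = evaluation.run(print_results=False)

if result is not None:
    # Aggregate statistics across iterations (field names may vary by version)
    print(result.avg_score)      # mean judge score on the 1-10 scale
    print(result.min_score)      # worst iteration
    print(result.max_score)      # best iteration
    print(result.std_dev_score)  # spread across iterations
    # Per-iteration details: the judge's score and reasoning for each run
    for evaluation_run in result.results:
        print(evaluation_run)
```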
## Next Steps

- **Tracing**: Monitor agent execution in detail
- **Learning**: Use eval results to improve agents
- **Guardrails**: Test safety and validation mechanisms
- **Reasoning**: Evaluate reasoning quality