Evaluation
Evaluation helps you measure and improve the quality of your AI applications. Genkit provides tools for scoring outputs, running test datasets, and tracking performance over time.Why Evaluate?
AI outputs can be unpredictable. Evaluation helps you:- Measure quality - Quantify how well your AI performs
- Catch regressions - Detect when changes make outputs worse
- Compare approaches - Test different models, prompts, or parameters
- Track improvements - Monitor quality over time
Built-in Evaluators
Genkit includes several built-in evaluators:DeepEqual
Checks if the output exactly matches an expected value:Regex
Matches output against a regular expression:JSONata
Queries structured output using JSONata:Custom Evaluators
Create custom evaluators for your specific needs:Batch Evaluators
Process multiple test cases efficiently:Using the Developer UI
The Genkit Developer UI provides visual evaluation tools:-
Run the Dev UI:
- Navigate to Evaluate tab
-
Create a test dataset:
- Run evaluation and view results with detailed traces
Programmatic Evaluation
Evaluate flows programmatically in your tests:Evaluation Metrics
Accuracy
Measure exact match rate:TypeScript
Semantic Similarity
Use embeddings to measure semantic similarity:Retrieval Metrics (RAG)
For RAG applications, measure retrieval quality:TypeScript
A/B Testing
Compare different approaches:Best Practices
Create Diverse Test Sets
Cover various scenarios:TypeScript
Track Metrics Over Time
Store evaluation results:TypeScript
Automate Evaluation in CI/CD
Run evaluations automatically:Use Human Evaluation
For subjective qualities, involve humans:TypeScript
Next Steps
- Learn about Flows for production deployment
- Explore Observability for monitoring
- Check out Developer Tools for testing