Overview
The Evaluations feature provides:
- Automated testing of chatflows against datasets
- Multiple evaluation metrics (accuracy, latency, cost, pass rate)
- Version tracking for comparing different configurations
- Detailed result analysis and visualization
- Support for multiple chatflows in a single evaluation
- Historical tracking of evaluation runs
Creating an Evaluation
Evaluations run asynchronously in the background. You can continue using Flowise while evaluations are in progress.
Understanding Evaluation Results
Once an evaluation completes, you can view detailed results.
Overview Metrics
The evaluation summary displays aggregate metrics.
Individual Run Results
Click View Results to see detailed metrics for each test case:
- Input query from dataset
- Generated response
- Expected output (if provided)
- Success/failure status
- Latency and token counts
- Cost breakdown
Version History
Evaluations support versioning to track improvements.
Performance Charts
Visualize trends over time:
- Pass Rate Chart: Track quality improvements
- Latency Chart: Monitor response time trends
- Token Usage Chart: Analyze token consumption
- Cost Chart: Track spending over versions
Evaluation Metrics Explained
Pass Rate
The percentage of test cases that meet success criteria:
- 90%+: Excellent performance (green)
- 50-89%: Acceptable performance (orange)
- Below 50%: Needs improvement (red)
Success is determined by:
- Comparing generated output to expected output
- Custom validation rules
- Error-free execution
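The pass-rate percentage and its color bands can be sketched as a small helper. This is illustrative code, not Flowise's internal implementation; the thresholds follow the bands listed above.

```python
# Sketch of computing a pass rate and mapping it to the color bands
# documented above: >= 90% green, 50-89% orange, below 50% red.

def pass_rate(results: list[bool]) -> float:
    """Percentage of test cases that passed."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

def band(rate: float) -> str:
    """Map a pass rate to its color band."""
    if rate >= 90:
        return "green"
    if rate >= 50:
        return "orange"
    return "red"

runs = [True, True, True, False]   # 3 of 4 test cases passed
rate = pass_rate(runs)             # 75.0
print(rate, band(rate))            # 75.0 orange
```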
Latency
Measures the time taken for the chatflow to generate a response. To reduce latency, consider:
- Using faster LLM models
- Reducing context window size
- Optimizing retrieval queries
- Caching frequent queries
Token Usage
Tracks input and output tokens for cost management.
Cost Analysis
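Cost is essentially token counts multiplied by per-token prices. A minimal sketch of that arithmetic, assuming made-up per-1K-token prices (real pricing varies by model and provider):

```python
# Sketch of deriving run cost from token counts. The prices below are
# illustrative placeholders, not real model pricing.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}  # USD per 1K tokens (made up)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Total cost of one run given its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]

# 1,200 prompt tokens and 400 completion tokens:
print(round(run_cost(1200, 400), 6))  # 0.0012
```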
Calculates total cost based on token usage and model pricing.
Running Evaluations via API
Automate evaluations through the REST API:
Create Evaluation
Get Evaluation Results
Run Again
Re-run an existing evaluation.
Delete Evaluation
See the complete Evaluations API Reference for all available endpoints.
Evaluation Status
Evaluations have one of three statuses:
- Pending (yellow): Evaluation is queued or in progress
- Completed (green): Evaluation finished successfully
- Error (red): Evaluation failed (hover for error details)
Auto-Refresh
Enable auto-refresh to monitor running evaluations in real time:
- Click the Play icon in the top-right corner
- The page will refresh every 5 seconds
- Click the Pause icon to disable auto-refresh
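For scripts, the same polling the auto-refresh button performs can be done in code: re-check the status on an interval until the evaluation leaves Pending. `fetch_status` below is a hypothetical callable you supply (for example, a wrapper around the REST API).

```python
# Sketch of polling an evaluation until it completes or fails.
# fetch_status is a caller-supplied function returning the current
# status string ("Pending", "Completed", or "Error").
import time

def wait_for_completion(fetch_status, interval: float = 5.0,
                        timeout: float = 600.0) -> str:
    """Poll until the status is Completed or Error, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("Completed", "Error"):
            return status
        time.sleep(interval)
    raise TimeoutError("evaluation still Pending after timeout")

# Example with a fake fetcher that completes on the third poll:
states = iter(["Pending", "Pending", "Completed"])
print(wait_for_completion(lambda: next(states), interval=0.01))  # Completed
```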
Managing Evaluations
Delete Individual Versions
Expand an evaluation to see all versions:
- Click the expand arrow next to the version number
- Select one or more versions using checkboxes
- Click Delete to remove selected versions
Delete All Versions
To delete an entire evaluation with all versions:
- Select the evaluation using the checkbox
- Click Delete in the top action bar
- Confirm deletion
Best Practices
Dataset Preparation
Create comprehensive test datasets.
Regular Testing
Establish an evaluation cadence:
- Run evaluations after every chatflow change
- Weekly regression testing
- Before deploying to production
- After dataset updates
Version Management
Use version numbers to track changes:
- Document what changed between versions
- Compare metrics across versions
- Keep historical data for analysis
- Archive old versions when needed
Performance Benchmarks
Set target benchmarks for your use case.
A/B Testing
Compare different chatflow configurations:
- Create an evaluation with multiple chatflows
- Use the same dataset for fair comparison
- Analyze metrics side-by-side
- Choose the best-performing configuration
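The side-by-side comparison in the last two steps can be sketched as follows; the metric field names and numbers are illustrative, not output from a real evaluation.

```python
# Sketch of picking the best configuration from per-chatflow metrics
# gathered in one evaluation over the same dataset (field names made up).
metrics = {
    "chatflow-a": {"pass_rate": 92.0, "avg_latency_ms": 1400, "total_cost": 0.42},
    "chatflow-b": {"pass_rate": 88.0, "avg_latency_ms": 900,  "total_cost": 0.31},
}

# Rank primarily by pass rate, breaking ties with lower latency.
best = max(metrics,
           key=lambda cf: (metrics[cf]["pass_rate"], -metrics[cf]["avg_latency_ms"]))
print(best)  # chatflow-a
```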
Troubleshooting
Evaluation Stuck in Pending
- Check server logs for errors
- Verify dataset is accessible
- Ensure chatflow is not locked or being edited
- Restart the evaluation
Inconsistent Results
- Non-deterministic LLM outputs may vary
- Use temperature=0 for consistent results
- Increase sample size for better averages
- Check for dataset issues
High Failure Rate
- Review failed test cases in detail
- Check error messages in results
- Verify dataset quality and format
- Test chatflow manually with failing inputs
