Evaluations in Flowise allow you to systematically test your chatflows, measure their performance, and track improvements over time. Run evaluations against datasets to ensure your AI agents meet quality standards.

Overview

The Evaluations feature provides:
  • Automated testing of chatflows against datasets
  • Multiple evaluation metrics (accuracy, latency, cost, pass rate)
  • Version tracking for comparing different configurations
  • Detailed result analysis and visualization
  • Support for multiple chatflows in a single evaluation
  • Historical tracking of evaluation runs

Creating an Evaluation

1. Navigate to Evaluations
   From the main navigation menu, click on Evaluations.

2. Start New Evaluation
   Click the New Evaluation button in the top-right corner.

3. Configure Evaluation
   Provide the following information:

   Basic Information
   • Evaluation Name: A descriptive name for tracking purposes
   • Description: Optional details about what you’re testing

   Select Dataset
   Choose a dataset containing test cases:
   • Dataset must have input-output pairs
   • Supports multiple formats (CSV, JSON)
   • Can include expected outputs for comparison

   Select Chatflow(s)
   Choose one or more chatflows to evaluate:
   • Test multiple chatflows simultaneously
   • Compare performance across different configurations
   • Support for both Chatflow and AgentflowV2 types

   Evaluation Metrics
   Select which metrics to track:
   • Accuracy: Compare outputs against expected results
   • Latency: Measure response time
   • Token Usage: Track input/output tokens
   • Cost: Calculate total cost based on token usage
   • Pass Rate: Percentage of successful evaluations

4. Run Evaluation
   Click Start New Evaluation to begin the evaluation process.
   Evaluations run asynchronously in the background. You can continue using Flowise while evaluations are in progress.

    Understanding Evaluation Results

    Once an evaluation completes, you can view detailed results:

    Overview Metrics

    The evaluation summary displays aggregate metrics:
    Evaluation: Customer Support Test v1
    Status: Completed
    Latest Version: 3
    Last Run: Mar 4, 2024, 02:30:15 PM
    
    Average Metrics:
    - Total Runs: 50
    - Pass Rate: 86%
    - Avg Latency: 2,341ms
    - Avg Cost: $0.0423
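
The aggregate figures above are rollups of the individual run results. As a minimal sketch (the field names here are illustrative, not Flowise's actual result schema), they could be computed like this:

```python
def summarize_runs(runs):
    """Roll up per-run results into aggregate evaluation metrics."""
    total = len(runs)
    passed = sum(1 for r in runs if r["passed"])
    return {
        "total_runs": total,
        "pass_rate": round(100 * passed / total),  # percent
        "avg_latency_ms": sum(r["latency_ms"] for r in runs) / total,
        "avg_cost": sum(r["cost"] for r in runs) / total,
    }

# 50 runs with 43 passing reproduces the 86% pass rate shown above
runs = [{"passed": i < 43, "latency_ms": 2341, "cost": 0.0423} for i in range(50)]
print(summarize_runs(runs))
```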
    

    Individual Run Results

    Click View Results to see detailed metrics for each test case:
    • Input query from dataset
    • Generated response
    • Expected output (if provided)
    • Success/failure status
    • Latency and token counts
    • Cost breakdown

    Version History

    Evaluations support versioning to track improvements:
    Version 3 (Latest)
    ├─ Pass Rate: 86%
    ├─ Avg Latency: 2,341ms
    └─ Run Date: Mar 4, 2024
    
    Version 2
    ├─ Pass Rate: 78%
    ├─ Avg Latency: 3,102ms
    └─ Run Date: Mar 1, 2024
    
    Version 1
    ├─ Pass Rate: 71%
    ├─ Avg Latency: 3,845ms
    └─ Run Date: Feb 28, 2024
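
Version-over-version deltas like the improvement from Version 1 to Version 3 above can be computed from the history with a few lines (the dict layout below simply mirrors the tree shown; it is not Flowise's API response format):

```python
def compare_versions(older, newer):
    """Return pass-rate and latency deltas between two evaluation versions."""
    return {
        "pass_rate_delta": newer["pass_rate"] - older["pass_rate"],  # percentage points
        "latency_delta_ms": newer["avg_latency_ms"] - older["avg_latency_ms"],
    }

history = [
    {"version": 1, "pass_rate": 71, "avg_latency_ms": 3845},
    {"version": 2, "pass_rate": 78, "avg_latency_ms": 3102},
    {"version": 3, "pass_rate": 86, "avg_latency_ms": 2341},
]
delta = compare_versions(history[0], history[-1])
print(delta)  # v1 -> v3: +15 points pass rate, 1,504 ms faster
```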
    

    Performance Charts

    Visualize trends over time:
    • Pass Rate Chart: Track quality improvements
    • Latency Chart: Monitor response time trends
    • Token Usage Chart: Analyze token consumption
    • Cost Chart: Track spending over versions

    Evaluation Metrics Explained

    Pass Rate

    The percentage of test cases that meet success criteria:
    • 90%+: Excellent performance (green)
    • 50-89%: Acceptable performance (orange)
    • Below 50%: Needs improvement (red)
    Pass rate is determined by:
    • Comparing generated output to expected output
    • Custom validation rules
    • Error-free execution
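
The color thresholds above map onto a simple banding function; this sketch just encodes the documented cutoffs:

```python
def pass_rate_band(rate_pct):
    """Map a pass rate (percent) to the color band used in the UI."""
    if rate_pct >= 90:
        return "green"   # excellent performance
    if rate_pct >= 50:
        return "orange"  # acceptable performance
    return "red"         # needs improvement

print(pass_rate_band(86))  # falls in the acceptable band
```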

    Latency

    Measures the time taken for the chatflow to generate a response:
    Average Latency: 2,341ms
    ├─ Min: 1,203ms
    ├─ Max: 5,678ms
    └─ P95: 4,102ms
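
The min/max/avg/P95 figures above can be derived from the raw per-run latencies. This sketch uses the nearest-rank percentile convention; the docs do not specify which percentile method Flowise uses, so treat it as one reasonable choice:

```python
def latency_stats(latencies_ms):
    """Compute min, max, average, and P95 from per-run latencies."""
    s = sorted(latencies_ms)
    # nearest-rank P95: smallest value covering 95% of samples
    p95_index = max(0, -(-95 * len(s) // 100) - 1)  # ceil(0.95 * n) - 1
    return {"avg": sum(s) / len(s), "min": s[0], "max": s[-1], "p95": s[p95_index]}

print(latency_stats([1203, 1900, 2341, 2800, 5678]))
```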
    
    Optimize latency by:
    • Using faster LLM models
    • Reducing context window size
    • Optimizing retrieval queries
    • Caching frequent queries

    Token Usage

    Tracks input and output tokens for cost management:
    Token Statistics:
    ├─ Avg Input Tokens: 342
    ├─ Avg Output Tokens: 156
    └─ Total Tokens: 24,900
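
Token statistics are straightforward sums and averages over the runs; a sketch with illustrative field names:

```python
def token_stats(runs):
    """Aggregate per-run token counts into the statistics shown above."""
    n = len(runs)
    return {
        "avg_input": sum(r["input_tokens"] for r in runs) / n,
        "avg_output": sum(r["output_tokens"] for r in runs) / n,
        "total": sum(r["input_tokens"] + r["output_tokens"] for r in runs),
    }

# 50 runs averaging 342 input / 156 output tokens -> 24,900 total
runs = [{"input_tokens": 342, "output_tokens": 156}] * 50
print(token_stats(runs))
```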
    

    Cost Analysis

    Calculates total cost based on token usage and model pricing:
    Cost Breakdown:
    ├─ Model: gpt-4-turbo
    ├─ Input Cost: $0.0342 (3,420 tokens × $0.01/1k)
    ├─ Output Cost: $0.0468 (1,560 tokens × $0.03/1k)
    └─ Total: $0.0810
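
The arithmetic behind the breakdown is tokens divided by 1,000 times the per-1k price for each direction; the prices below are the illustrative ones from the example, not a pricing reference:

```python
def run_cost(input_tokens, output_tokens, in_price_per_1k, out_price_per_1k):
    """Token-based cost: (tokens / 1000) * price per 1k tokens, per direction."""
    input_cost = input_tokens / 1000 * in_price_per_1k
    output_cost = output_tokens / 1000 * out_price_per_1k
    return round(input_cost, 4), round(output_cost, 4), round(input_cost + output_cost, 4)

# reproduces the breakdown above: $0.0342 + $0.0468 = $0.0810
print(run_cost(3420, 1560, 0.01, 0.03))
```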
    

    Running Evaluations via API

    Automate evaluations through the REST API:

    Create Evaluation

    import requests
    
    API_URL = "http://localhost:3000/api/v1/evaluations"
    API_KEY = "your_api_key_here"
    
    payload = {
        "name": "Customer Support Evaluation",
        "datasetId": "dataset-123",
        "chatflowIds": ["chatflow-456"],
        "metrics": ["accuracy", "latency", "cost"]
    }
    
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    
    response = requests.post(API_URL, json=payload, headers=headers)
    print(response.json())
    

    Get Evaluation Results

    curl -X GET http://localhost:3000/api/v1/evaluations/{evaluationId} \
      -H "Authorization: Bearer <your_api_key>"
    

    Run Again

    Re-run an existing evaluation:
    curl -X POST http://localhost:3000/api/v1/evaluations/run-again/{evaluationId} \
      -H "Authorization: Bearer <your_api_key>"
    

    Delete Evaluation

    curl -X DELETE http://localhost:3000/api/v1/evaluations/{evaluationId} \
      -H "Authorization: Bearer <your_api_key>"
    
    See the complete Evaluations API Reference for all available endpoints.

    Evaluation Status

    Evaluations have three statuses:
    • Pending (yellow): Evaluation is queued or in progress
    • Completed (green): Evaluation finished successfully
    • Error (red): Evaluation failed (hover for error details)

    Auto-Refresh

    Enable auto-refresh to monitor running evaluations in real-time:
    1. Click the Play icon in the top-right corner
    2. The page will refresh every 5 seconds
    3. Click the Pause icon to disable auto-refresh

    Managing Evaluations

    Delete Individual Versions

    Expand an evaluation to see all versions:
    1. Click the expand arrow next to the version number
    2. Select one or more versions using checkboxes
    3. Click Delete to remove selected versions

    Delete All Versions

    To delete an entire evaluation with all versions:
    1. Select the evaluation using the checkbox
    2. Click Delete in the top action bar
    3. Confirm deletion
    Deleting evaluations is permanent and cannot be undone.

    Best Practices

    Dataset Preparation

    Create comprehensive test datasets:
    [
      {
        "input": "What is your return policy?",
        "expected_output": "We offer 30-day returns on all items.",
        "category": "policy"
      },
      {
        "input": "How do I track my order?",
        "expected_output": "You can track your order using the tracking number sent to your email.",
        "category": "support"
      }
    ]
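
Before uploading, it can help to validate that every test case carries the input-output pair an evaluation needs. A minimal sketch against the JSON shape above (the required keys follow this sample; check your actual dataset schema, since expected outputs are optional in Flowise):

```python
import json

def validate_dataset(raw):
    """Check that each test case in a JSON dataset has the keys used above."""
    cases = json.loads(raw)
    for i, case in enumerate(cases):
        if "input" not in case:
            raise ValueError(f"case {i}: missing 'input'")
        if "expected_output" not in case:
            raise ValueError(f"case {i}: missing 'expected_output'")
    return cases

raw = """[
  {"input": "What is your return policy?",
   "expected_output": "We offer 30-day returns on all items.",
   "category": "policy"}
]"""
print(len(validate_dataset(raw)), "valid test case(s)")
```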
    

    Regular Testing

    Establish an evaluation cadence:
    • Run evaluations after every chatflow change
    • Weekly regression testing
    • Before deploying to production
    • After dataset updates

    Version Management

    Use version numbers to track changes:
    • Document what changed between versions
    • Compare metrics across versions
    • Keep historical data for analysis
    • Archive old versions when needed

    Performance Benchmarks

    Set target benchmarks for your use case:
    Production Readiness:
    ├─ Pass Rate: > 90%
    ├─ Latency: < 3 seconds
    ├─ Cost per query: < $0.10
    └─ Error Rate: < 1%
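
A benchmark like the one above can be turned into an automated gate, for example in a CI script. This sketch simply encodes the thresholds shown; the metric names are illustrative:

```python
def production_ready(metrics):
    """Check aggregate metrics against the production-readiness targets above."""
    return (
        metrics["pass_rate"] > 90          # percent
        and metrics["avg_latency_ms"] < 3000
        and metrics["avg_cost"] < 0.10     # dollars per query
        and metrics["error_rate"] < 0.01   # fraction of runs
    )

print(production_ready({"pass_rate": 94, "avg_latency_ms": 2341,
                        "avg_cost": 0.0423, "error_rate": 0.004}))  # True
```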
    

    A/B Testing

    Compare different chatflow configurations:
    1. Create evaluation with multiple chatflows
    2. Use the same dataset for fair comparison
    3. Analyze metrics side-by-side
    4. Choose the best-performing configuration
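
Step 4 can be scripted once the side-by-side metrics are in hand. One possible selection rule, highest pass rate with latency as a tiebreaker (chatflow names and fields below are made up for illustration):

```python
def best_chatflow(results):
    """Pick the winner: highest pass rate, lower latency breaks ties."""
    return max(results, key=lambda r: (r["pass_rate"], -r["avg_latency_ms"]))

results = [
    {"chatflow": "support-flow-a", "pass_rate": 86, "avg_latency_ms": 2341},
    {"chatflow": "support-flow-b", "pass_rate": 79, "avg_latency_ms": 1100},
]
print(best_chatflow(results)["chatflow"])
```

Whether pass rate should dominate latency (or cost) depends on your use case; adjust the key tuple accordingly.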

    Troubleshooting

    Evaluation Stuck in Pending

    • Check server logs for errors
    • Verify dataset is accessible
    • Ensure chatflow is not locked or being edited
    • Restart the evaluation

    Inconsistent Results

    • LLM outputs are non-deterministic and can vary between runs
    • Set temperature to 0 for more repeatable results
    • Increase the sample size for more reliable averages
    • Check the dataset for ambiguous or duplicated test cases

    High Failure Rate

    • Review failed test cases in detail
    • Check error messages in results
    • Verify dataset quality and format
    • Test chatflow manually with failing inputs
