
Evaluation Overview

NeMo Guardrails includes comprehensive evaluation tools to measure the accuracy and performance of different types of rails. This guide introduces the evaluation framework and available tools.

Why Evaluate Guardrails?

Evaluating your guardrails configuration helps you:
  • Measure Effectiveness: Quantify how well guardrails protect against vulnerabilities
  • Optimize Performance: Identify and fix accuracy issues
  • Ensure Quality: Validate configurations before production deployment
  • Track Improvements: Monitor performance over time
  • Build Confidence: Demonstrate guardrails effectiveness to stakeholders

Evaluation Types

NeMo Guardrails provides evaluation tools for multiple rail types:

Dialog Rails (Topical Rails)

Evaluates the core conversation guidance mechanism:
  • User Intent Detection: Accuracy of canonical form generation
  • Next Step Generation: Bot intent prediction accuracy
  • Bot Message Generation: Response quality
Learn more →
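Dialog-rail evaluation compares the canonical forms and next steps the LLM produces against the ones defined in your Colang files. As a minimal sketch (the intent and flow names here are hypothetical, not from a shipped dataset), a canonical form with its examples and a flow look like:

```colang
define user ask about pricing
  "how much does it cost"
  "what are your prices"

define flow answer pricing question
  user ask about pricing
  bot respond with pricing details
```

User intent accuracy measures how often the generated canonical form (e.g. ask about pricing) matches the expected one; bot intent and bot message accuracy apply the same idea to the next step and the final response.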

Moderation Rails

Evaluates input and output moderation:
  • Jailbreak Detection: Blocking harmful user inputs
  • Output Moderation: Filtering unsafe LLM responses

Fact-Checking Rails

Evaluates factual grounding:
  • Positive Entailment: Detecting factually correct responses
  • Negative Entailment: Identifying incorrect information

Hallucination Rails

Evaluates detection of fabricated information:
  • Consistency Checking: Comparing multiple responses
  • False Premise Detection: Identifying unanswerable questions

Evaluation CLI Commands

All evaluation tools are accessible through the NeMo Guardrails CLI, with one evaluate subcommand per rail type.
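As a sketch, the subcommands mirror the four rail types above (only --config is shown here; run nemoguardrails evaluate --help for the authoritative flag list):

```shell
# Evaluate dialog (topical) rails
nemoguardrails evaluate topical --config=/path/to/config

# Evaluate input/output moderation rails
nemoguardrails evaluate moderation --config=/path/to/config

# Evaluate fact-checking rails
nemoguardrails evaluate fact-checking --config=/path/to/config

# Evaluate hallucination rails
nemoguardrails evaluate hallucination --config=/path/to/config
```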

Evaluation Workflow

1. Prepare Your Configuration

Ensure your guardrails configuration is complete:
/path/to/config/
├── config.yml
├── config.co
└── kb/

2. Choose Evaluation Type

Select the appropriate evaluation based on your rails:
  • Use topical for dialog flows
  • Use moderation for input/output checking
  • Use fact-checking for factual grounding
  • Use hallucination for consistency checking

3. Run Evaluation

Execute the evaluation command:
nemoguardrails evaluate topical \
  --config=/path/to/config \
  --verbose \
  --output-dir=./results

4. Analyze Results

Review the output metrics and identify areas for improvement:
Processed 226/226 samples!
Num intent errors: 27
User Intent Accuracy: 88.1%
Bot Intent Accuracy: 88.5%
Bot Message Accuracy: 88.9%
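The accuracy figures are plain ratios over the sample set; for instance, the user intent accuracy above follows directly from the reported error count. A stdlib-only sketch using the numbers from this run:

```python
# Reproduce the user-intent accuracy from the sample run above.
total_samples = 226
intent_errors = 27

user_intent_accuracy = (total_samples - intent_errors) / total_samples * 100
print(f"User Intent Accuracy: {user_intent_accuracy:.1f}%")  # 88.1%
```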

5. Iterate and Improve

Based on results:
  • Add more training examples
  • Refine canonical forms
  • Adjust similarity thresholds
  • Update prompts

Key Metrics

Understand the metrics reported by evaluation tools:

Accuracy Metrics

  • User Intent Accuracy: Percentage of correctly identified user intents
  • Bot Intent Accuracy: Percentage of correct next step predictions
  • Bot Message Accuracy: Percentage of appropriate responses

Performance Metrics

  • Latency: Time per evaluation sample (in milliseconds)
  • Throughput: Evaluations processed per second
  • Error Rate: Percentage of failed evaluations
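These three quantities are related: given per-sample latencies and outcomes, throughput and error rate fall out directly. A stdlib-only sketch (the sample values are illustrative, not real evaluation output):

```python
# Derive throughput and error rate from per-sample results.
# Each pair is (latency_ms, succeeded) for one evaluation sample.
results = [(120.0, True), (95.0, True), (210.0, False), (130.0, True)]

avg_latency_ms = sum(ms for ms, _ in results) / len(results)
throughput_per_s = 1000.0 / avg_latency_ms  # samples per second
error_rate = sum(1 for _, ok in results if not ok) / len(results) * 100

print(f"Avg latency: {avg_latency_ms:.2f} ms")
print(f"Throughput:  {throughput_per_s:.2f} samples/s")
print(f"Error rate:  {error_rate:.1f}%")
```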

Protection Metrics

  • Block Rate: Percentage of harmful inputs blocked
  • False Positive Rate: Percentage of legitimate inputs incorrectly blocked
  • False Negative Rate: Percentage of harmful inputs that passed through
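Block rate and the false positive/negative rates come from a standard confusion matrix over labeled test inputs. A stdlib-only sketch (the labels are illustrative):

```python
# Each pair is (is_harmful, was_blocked) for one labeled test input.
samples = [
    (True, True),    # harmful, blocked        -> true positive
    (True, False),   # harmful, passed through -> false negative
    (False, False),  # legitimate, allowed     -> true negative
    (False, True),   # legitimate, blocked     -> false positive
    (True, True),
    (False, False),
]

harmful = [s for s in samples if s[0]]
legit = [s for s in samples if not s[0]]

block_rate = sum(1 for _, blocked in harmful if blocked) / len(harmful) * 100
false_positive_rate = sum(1 for _, blocked in legit if blocked) / len(legit) * 100
false_negative_rate = 100.0 - block_rate  # harmful inputs that passed through

print(f"Block rate:          {block_rate:.1f}%")
print(f"False positive rate: {false_positive_rate:.1f}%")
print(f"False negative rate: {false_negative_rate:.1f}%")
```

Note the trade-off this makes visible: tightening a rail usually raises the block rate at the cost of a higher false positive rate, which is why both are reported.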

Datasets

NeMo Guardrails includes sample datasets for evaluation. You can also use your own datasets in the appropriate format.

Vulnerability Scanning

Beyond standard evaluation, NeMo Guardrails can be tested against known vulnerabilities:

Garak Integration

Garak is an LLM vulnerability scanner that tests for:
  • Jailbreak attempts
  • Prompt injections
  • Malware generation prompts
  • Encoding attacks
  • Known bad signatures
See the Vulnerability Scanning guide for detailed results.

Evaluation Best Practices

Tip: Run evaluations regularly throughout development, not just before deployment.

1. Use Representative Data

  • Include diverse user inputs
  • Cover edge cases and adversarial examples
  • Balance positive and negative samples

2. Establish Baselines

  • Run evaluations on initial configuration
  • Track metrics over time
  • Compare against benchmarks

3. Test in Isolation

  • Evaluate each rail type separately
  • Identify specific weak points
  • Optimize targeted improvements

4. Validate with Real Data

  • Supplement synthetic data with production samples
  • Test with actual user queries
  • Monitor production metrics

5. Iterate Continuously

  • Regular evaluation cycles
  • A/B test configuration changes
  • Document improvements

Common Evaluation Parameters

Most evaluation commands support these parameters:
| Parameter | Description | Default |
| --- | --- | --- |
| --config | Path to guardrails configuration | Required |
| --verbose | Enable verbose output | False |
| --num-samples | Number of samples to evaluate | 50 |
| --dataset-path | Custom dataset path | Built-in |
| --output-dir | Output directory for results | eval_outputs/ |
| --write-outputs | Save results to files | True |

Interpreting Results

Understanding your evaluation results:

High Accuracy (>85%)

  • Configuration is well-tuned
  • Sufficient training examples
  • Appropriate for production

Medium Accuracy (70-85%)

  • May need more training data
  • Consider similarity thresholds
  • Review edge cases

Low Accuracy (<70%)

  • Insufficient training examples
  • Prompt engineering needed
  • Consider different LLM
  • Review canonical forms
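The bands above are easy to encode as a triage helper when processing results in bulk (a hypothetical function, not part of the NeMo Guardrails API):

```python
def triage_accuracy(accuracy_pct: float) -> str:
    """Map an accuracy percentage onto the bands described above."""
    if accuracy_pct > 85:
        return "high: well-tuned, appropriate for production"
    if accuracy_pct >= 70:
        return "medium: add training data, review thresholds and edge cases"
    return "low: revisit examples, prompts, canonical forms, or the LLM"

print(triage_accuracy(88.1))  # high band
```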

Next Steps

Vulnerability Scanning

Test against known attack vectors

Evaluation Metrics

Deep dive into performance metrics
