## Overview

The content safety rail uses specialized content moderation models (such as Llama Guard or NeMo Guard) to classify content against safety policies. It can:

- Check user inputs before processing
- Validate bot outputs before returning them to users
- Support multilingual refusal messages
- Provide reasoning/explanations for safety decisions
## Quick Start
## Configuration

### Basic Configuration

`config.yml`:
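A minimal sketch of such a configuration, assuming an OpenAI main model and an NVIDIA-hosted safety model (both model names are illustrative; substitute your own deployments):

```yaml
models:
  - type: main
    engine: openai
    model: gpt-4o

  # Dedicated content safety model (illustrative engine/model names)
  - type: content_safety
    engine: nvidia_ai_endpoints
    model: nvidia/llama-3.1-nemoguard-8b-content-safety

rails:
  input:
    flows:
      - content safety check input $model=content_safety
  output:
    flows:
      - content safety check output $model=content_safety
```

The `$model=` argument selects which entry under `models` performs the classification.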
### With Reasoning Enabled

Enable the model to provide explanations for safety decisions. `config.yml`:
### Multilingual Support

Provide localized refusal messages for different languages in `config.yml`. Supported languages include:
- English (en)
- Spanish (es)
- Chinese (zh)
- German (de)
- French (fr)
- Hindi (hi)
- Japanese (ja)
- Arabic (ar)
- Thai (th)
Language detection requires the `fast-langdetect` package, installable with `pip install fast-langdetect`.

## Input vs Output Checks
### Input Check

Validates user messages before processing. `config.yml`:
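Assuming a `content_safety` model is defined under `models`, enabling only the input check might look like:

```yaml
rails:
  input:
    flows:
      - content safety check input $model=content_safety
```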
- `$user_message`: The user's input text
### Output Check

Validates bot responses before returning them. `config.yml`:
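Analogously, a sketch enabling only the output check (again assuming a `content_safety` model entry):

```yaml
rails:
  output:
    flows:
      - content safety check output $model=content_safety
```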
- `$user_message`: The original user input
- `$bot_message`: The generated bot response
## Behavior

When unsafe content is detected, the rail's behavior depends on whether rails exceptions are enabled.

### With Rails Exceptions
`config.yml`:
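A sketch of a configuration with rails exceptions switched on via the top-level `enable_rails_exceptions` option (the safety model entry is assumed to exist under `models`):

```yaml
enable_rails_exceptions: True

rails:
  input:
    flows:
      - content safety check input $model=content_safety
```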
When exceptions are enabled, a detected violation raises one of:

- `ContentSafetyCheckInputException`: for input violations
- `ContentSafetyCheckOutputException`: for output violations
### Without Rails Exceptions
The bot refuses to respond and aborts the conversation.

## Using Different Models
You can use various content safety models.

### Llama Guard
`config.yml`:
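A sketch of a Llama Guard setup, assuming a self-hosted model served behind a vLLM OpenAI-compatible endpoint (the URL and model name are illustrative):

```yaml
models:
  - type: llama_guard
    engine: vllm_openai
    parameters:
      openai_api_base: "http://localhost:5000/v1"
      model_name: "meta-llama/LlamaGuard-7b"

rails:
  input:
    flows:
      - content safety check input $model=llama_guard
```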
### OpenAI Moderation

`config.yml`:
## Custom Flows
Create custom content safety flows in `flows.co`:
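A Colang 1.0 sketch of a custom input flow, assuming the check action returns a dict with an `allowed` field (the action signature and result fields here are illustrative; see the library's `actions.py` for the actual interface):

```colang
define subflow custom content safety check input $model
  # Hypothetical invocation and result shape
  $result = execute ContentSafetyCheckInputAction(model_name=$model)
  if not $result["allowed"]
    bot refuse to respond
    abort
```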
### Accessing Policy Violations

The policy violations are stored in global context variables. `flows.co`:
## Caching

Content safety checks support model-level caching.

## Implementation Details
The content safety flows are defined in:

- `/nemoguardrails/library/content_safety/flows.co`
- `/nemoguardrails/library/content_safety/actions.py`
The actions module provides:

- `ContentSafetyCheckInputAction`: checks user input
- `ContentSafetyCheckOutputAction`: checks bot output
- `DetectLanguageAction`: detects the user's language for multilingual support