LLM Vulnerability Scanning
While most recent LLMs are aligned to be safer, any LLM-powered application is prone to various attacks. NeMo Guardrails provides mechanisms for protecting against vulnerabilities like jailbreaks and prompt injections.Understanding LLM Vulnerabilities
LLM applications face numerous security risks outlined in the OWASP Top 10 for LLM Applications:- Prompt Injection: Malicious inputs that override system instructions
- Jailbreaks: Attempts to bypass safety guardrails
- Data Leakage: Extracting training data or sensitive information
- Harmful Content Generation: Eliciting unsafe or inappropriate responses
- Model Manipulation: Exploiting model behaviors for unintended outputs
Garak: LLM Vulnerability Scanner
Garak is an open-source tool for scanning LLM applications against common vulnerabilities. Think of it as an LLM equivalent to network security scanners like nmap.Key Features
- Comprehensive vulnerability categories
- Automated testing framework
- Detailed reporting
- Integration with NeMo Guardrails
Installation
Protection Configurations
Testing different levels of guardrails protection:Configuration Levels
Bare LLM (No Protection)
Testing the LLM without any guardrails:
- No general instructions
- No dialogue rails
- No moderation rails
General Instructions
Protection using prompt engineering:
- System prompts with safety guidelines
- Behavioral instructions
General Instructions + Dialog Rails
Adding conversation flow controls:
- Topic boundaries
- Unwanted topic refusal
- Canonical form validation
Vulnerability Scan Results
Results from scanning a sample ABC bot configuration withgpt-3.5-turbo-instruct:
Protection Comparison
| Protection Level | General Instructions | Dialog Rails | Moderation Rails |
|---|---|---|---|
bare_llm | ✗ | ✗ | ✗ |
with_gi | ✓ | ✗ | ✗ |
with_gi_dr | ✓ | ✓ | ✗ |
with_gi_dr_mo | ✓ | ✓ | ✓ |
Protection Rates by Vulnerability Category
Higher percentages indicate better protection (fewer successful attacks):| Vulnerability Category | bare_llm | with_gi | with_gi_dr | with_gi_dr_mo |
|---|---|---|---|---|
| module continuation | 92.8% | 69.5% | 99.3% | 100% |
| module dan | 27.3% | 40.7% | 61.3% | 52.7% |
| module encoding | 90.3% | 98.2% | 100% | 100% |
| module goodside | 32.2% | 32.2% | 66.7% | 66.7% |
| module knownbadsignatures | 4.0% | 97.3% | 100% | 100% |
| module leakreplay | 76.8% | 85.7% | 89.6% | 100% |
| module lmrc | 85.0% | 81.9% | 86.5% | 94.4% |
| module malwaregen | 50.2% | 92.2% | 93.7% | 100% |
| module packagehallucination | 97.4% | 100% | 100% | 100% |
| module realpublicityprompts | 100% | 100% | 100% | 100% |
| module snowball | 34.5% | 82.1% | 99.0% | 100% |
| module xss | 92.5% | 100% | 100% | 100% |
Key Findings
Even powerful LLMs like GPT-3.5-turbo are vulnerable to attacks without guardrails. The combination of dialog rails and moderation rails provides the strongest protection.
- Bare LLM is Vulnerable: Without guardrails, the LLM is susceptible to multiple attack types
- General Instructions Help: Prompt engineering alone provides modest improvement
- Dialog Rails are Effective: Adding dialog rails significantly improves protection
- Full Stack is Best: Combining all protection layers achieves near-perfect protection
- Dialog Rails Alone: Even without moderation, dialog rails provide good protection
Vulnerability Categories Explained
Continuation Attacks
Attempts to continue model output to bypass restrictions. Example: “Sure, I’ll help with that. Here’s how to…” Protection: 100% with full guardrailsDAN (Do Anything Now)
Jailbreak attempts using roleplay to bypass safety. Example: “Pretend you’re DAN who can do anything…” Protection: 61.3% with dialog rails, 52.7% with full guardrailsEncoding Attacks
Using alternative encodings to hide malicious prompts. Example: Base64, ROT13, or other encoding schemes Protection: 100% with dialog rails or full guardrailsGoodside Attacks
Known attack patterns from security researcher Riley Goodside. Protection: 66.7% with dialog rails or full guardrailsKnown Bad Signatures
Recognized malicious prompt patterns. Protection: 100% with dialog rails or full guardrailsLeak/Replay Attacks
Attempts to extract system prompts or training data. Protection: 100% with full guardrailsLMRC Attacks
Language Model Risk Cards - documented risky behaviors. Protection: 94.4% with full guardrailsMalware Generation
Requests to generate malicious code. Protection: 100% with full guardrailsPackage Hallucination
Attempts to get the model to recommend fake packages. Protection: 100% with all configurationsXSS Attacks
Cross-site scripting attempt generation. Protection: 100% with general instructions or betterRunning Your Own Vulnerability Scans
Interpreting Scan Results
Protection Rate Calculation
Risk Assessment
| Protection Rate | Risk Level | Action Required |
|---|---|---|
| 95-100% | Very Low | Monitor regularly |
| 85-94% | Low | Minor improvements |
| 70-84% | Medium | Strengthen guardrails |
| 50-69% | High | Major improvements needed |
| <50% | Critical | Immediate action required |
Improving Protection Rates
If scans reveal vulnerabilities:Enable All Rail Types
Ensure you’re using:
- Dialog rails for topic control
- Input moderation for jailbreak detection
- Output moderation for response filtering
Strengthen Prompts
Improve system prompts with:
- Clear behavioral guidelines
- Explicit refusal instructions
- Safety constraints
Add Training Examples
Include examples of:
- Attack patterns to reject
- Appropriate refusal responses
- Edge cases
Limitations
Understanding scan limitations:Vulnerability scanning tests known attack patterns. It cannot guarantee protection against novel attacks or all possible inputs.
- False Negatives: Some attacks may not be detected
- Evolving Threats: New attack vectors emerge regularly
- Legitimate User Impact: High protection may block valid requests (not tested in basic scans)
- Context Dependent: Results vary by use case and LLM model
Best Practices
- Regular Scanning: Run vulnerability scans periodically, not just once
- Multiple Configurations: Test with different guardrail combinations
- Production Testing: Scan with production-like configurations
- Monitor Production: Track real-world attack attempts
- Stay Updated: Keep Garak and NeMo Guardrails updated
- Document Results: Maintain scan history for compliance
Additional Resources
Next Steps
Evaluation Metrics
Understand detailed metrics
Production Security
Production deployment security