Overview
This guide will help you set up your first AI safety evaluation using Circuit Breaker Labs GitHub Actions. You’ll learn how to evaluate a system prompt for potential security vulnerabilities using automated testing in your CI/CD pipeline.
Prerequisites
Before you begin, make sure you have:
GitHub Repository
A GitHub repository where you want to run evaluations
Circuit Breaker Labs Account
Sign up at circuitbreakerlabs.ai to get your API key
Circuit Breaker Labs provides comprehensive safety testing for AI systems, including prompt injection detection, jailbreak attempts, and other security vulnerabilities.
Step 1: Get Your API Key
Sign up for Circuit Breaker Labs
Visit circuitbreakerlabs.ai and create an account.
Generate an API key
Navigate to your dashboard and generate a new API key. Copy this key - you’ll need it in the next step.
Step 2: Create Your First Workflow
Create a new file in your repository at .github/workflows/evaluate-prompt.yml:
Understanding the Parameters
| Parameter | Description | Example Value |
|---|---|---|
| fail-action-threshold | Failure rate above this threshold will fail the workflow | 0.80 (80%) |
| fail-case-threshold | Score above which an individual test case is considered failed | 0.5 (50%) |
| variations | Number of test variations to run per test case | 1 |
| maximum-iteration-layers | Maximum depth of iterative testing | 1 |
| system-prompt | The system prompt text to evaluate | "You are a helpful assistant" |
| openrouter-model-name | Model to test (via OpenRouter) | "anthropic/claude-3.7-sonnet" |
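To make the interaction between the two thresholds concrete, here is a minimal Python sketch of the pass/fail arithmetic the table describes. It is an illustration, not the action's actual implementation: the function name `workflow_fails`, the sample scores, and the strict `>` comparisons at the boundaries are all assumptions.

```python
def workflow_fails(case_scores, fail_case_threshold, fail_action_threshold):
    """Return (failure_rate, failed) for a list of per-case scores.

    Assumption: a case fails when its score is strictly above
    fail_case_threshold, and the workflow fails when the share of failed
    cases is strictly above fail_action_threshold.
    """
    if not case_scores:
        raise ValueError("case_scores must not be empty")
    failed_cases = sum(1 for score in case_scores if score > fail_case_threshold)
    failure_rate = failed_cases / len(case_scores)
    return failure_rate, failure_rate > fail_action_threshold


# Hypothetical scores for five test cases (lower is better).
scores = [0.1, 0.7, 0.4, 0.9, 0.2]
rate, failed = workflow_fails(
    scores, fail_case_threshold=0.5, fail_action_threshold=0.80
)
print(f"failure rate: {rate:.0%}, workflow fails: {failed}")
```

With these sample values, two of the five cases score above 0.5, giving a 40% failure rate. That is below the 0.80 threshold, so the workflow would pass.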
The fail-action-threshold determines when your workflow fails. Setting it to 0.80 means that if more than 80% of test cases fail, the action fails your CI/CD pipeline.
Step 3: Run Your Evaluation
Trigger the workflow manually
- Go to your repository on GitHub
- Click on the Actions tab
- Select Evaluate System Prompt from the left sidebar
- Click Run workflow → Run workflow
Step 4: View Results
Once the workflow completes, you’ll see:
Pass/Fail Status
Whether your system prompt passed the security evaluation based on your thresholds
Detailed Logs
Complete test results including which test cases passed or failed
Understanding Results
The evaluation will:
- Test your system prompt against known attack vectors
- Generate variations of test cases to find edge cases
- Score each test on how well your prompt resists manipulation
- Fail the workflow if the share of failed test cases exceeds your fail-action-threshold
A lower score indicates better security. If a test case scores above your fail-case-threshold, the prompt was vulnerable to that specific attack.
Next Steps
Explore All Actions
Learn about all available evaluation actions and their parameters
Fine-tune Evaluations
Evaluate fine-tuned OpenAI models instead of system prompts
Advanced Workflows
Set up automated evaluations on pull requests or scheduled runs
API Documentation
Explore the full Circuit Breaker Labs API
Common Patterns
Evaluate on Pull Requests
Automatically test system prompt changes in pull requests:
Scheduled Security Audits
Run regular security audits of your AI systems:
Troubleshooting
Workflow fails immediately
- Check your API key: Ensure CBL_API_KEY is correctly set in your repository secrets
- Verify syntax: Make sure your YAML file is properly formatted
- Review parameters: All required inputs must be provided with valid values
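A quick local check of the inputs can catch parameter problems before a run. The Python sketch below is only an illustration under stated assumptions: it treats all six parameters from the table as required, expects thresholds between 0 and 1, and expects `variations` and `maximum-iteration-layers` to be positive integers. The action's real validation rules may differ, and `validate_inputs` is a hypothetical helper, not part of Circuit Breaker Labs.

```python
THRESHOLDS = ("fail-action-threshold", "fail-case-threshold")
COUNTS = ("variations", "maximum-iteration-layers")
TEXT = ("system-prompt", "openrouter-model-name")


def validate_inputs(inputs):
    """Return a list of problems found in `inputs`; empty means none found."""
    problems = []
    # Assumption: every parameter listed in the table is required.
    for name in THRESHOLDS + COUNTS + TEXT:
        if inputs.get(name) in (None, ""):
            problems.append(f"missing input: {name}")
    # Assumption: thresholds are fractions in the range 0..1.
    for name in THRESHOLDS:
        value = inputs.get(name)
        if isinstance(value, (int, float)) and not 0 <= value <= 1:
            problems.append(f"{name} should be between 0 and 1, got {value}")
    # Assumption: counts are whole numbers of at least 1.
    for name in COUNTS:
        value = inputs.get(name)
        if value not in (None, "") and (not isinstance(value, int) or value < 1):
            problems.append(f"{name} should be a positive integer, got {value}")
    return problems


config = {
    "fail-action-threshold": 0.80,
    "fail-case-threshold": 1.5,  # out of range on purpose
    "variations": 1,
    "maximum-iteration-layers": 1,
    "system-prompt": "You are a helpful assistant",
    # "openrouter-model-name" is left out on purpose
}
for problem in validate_inputs(config):
    print(problem)
```

For this sample configuration the check reports the missing model name and the out-of-range fail-case-threshold.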
All tests are failing
- Your system prompt may be vulnerable to common attacks
- Try adjusting the fail-case-threshold to better calibrate what constitutes a failure
- Review the detailed logs to understand which specific test cases are failing
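Before changing fail-case-threshold, it can help to see how the failure rate would shift for scores you have already observed. The Python sketch below recomputes the rate at several candidate thresholds. The scores are made up, and `failure_rate` is a hypothetical helper that assumes a case fails when its score is strictly above the threshold.

```python
def failure_rate(case_scores, fail_case_threshold):
    """Share of cases whose score is strictly above the threshold."""
    failed = sum(1 for score in case_scores if score > fail_case_threshold)
    return failed / len(case_scores)


# Hypothetical scores copied from a previous run's logs (lower is better).
observed_scores = [0.55, 0.6, 0.62, 0.7, 0.95]

for threshold in (0.5, 0.65, 0.8):
    rate = failure_rate(observed_scores, threshold)
    print(f"fail-case-threshold={threshold}: {rate:.0%} of cases fail")
```

Raising the threshold lowers the reported failure rate without changing how the model behaves, so it is worth reading the logs for the borderline cases before settling on a new value.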
Need help?
Visit the Circuit Breaker Labs documentation or contact support for assistance.
What’s Happening Under the Hood
When you run a Circuit Breaker Labs evaluation:
- The action calls the Circuit Breaker Labs API with your system prompt and configuration
- The API generates adversarial test cases designed to exploit common vulnerabilities
- Each test is executed against your specified model using OpenRouter
- Results are scored based on whether the model’s responses indicate a security breach
- The workflow passes or fails based on your configured thresholds