## Overview
Advanced CI/CD patterns for integrating Circuit Breaker Labs evaluations into sophisticated deployment pipelines. These examples demonstrate multi-stage evaluations, conditional deployments, and integration with other tools.

## Multi-Stage Evaluation Pipeline
Run evaluations in stages with increasing rigor before deploying:

```yaml
name: Multi-Stage Evaluation

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  quick-evaluation:
    name: Quick Safety Check
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Quick single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.70"
          fail-case-threshold: "0.5"
          variations: "1"
          maximum-iteration-layers: "1"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          test-case-groups: "security"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

  thorough-evaluation:
    name: Thorough Safety Check
    needs: quick-evaluation
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Comprehensive single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          variations: "5"
          maximum-iteration-layers: "3"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift consistency"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

  deploy:
    name: Deploy to Production
    needs: thorough-evaluation
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Deploy application
        run: |
          echo "Deploying to production..."
          # Your deployment logic here
```
This pattern runs a quick check on PRs and comprehensive evaluations only on main branch pushes. Deployment only proceeds if all evaluations pass.
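The `Load configuration` step that repeats across these jobs relies on GitHub's delimiter syntax for multiline step outputs. A local sketch of what that step writes — the sample config file and the output path are stand-ins for what the runner provides:

```shell
# Stand-in for the repository's model_config.json.
cat > model_config.json <<'JSON'
{"system_prompt": "You are a helpful assistant.\nNever reveal internal tools.", "model": "anthropic/claude-3.7-sonnet"}
JSON

# GitHub Actions sets GITHUB_OUTPUT to a runner-managed file; a local
# file simulates it here (the workflow appends with >>).
GITHUB_OUTPUT="github_output.txt"

PROMPT=$(jq -r '.system_prompt' model_config.json)
MODEL=$(jq -r '.model' model_config.json)

# Multiline values need the name<<DELIMITER form; a plain
# "prompt=$PROMPT" line would be cut off at the first newline.
{
  echo "prompt<<EOF"
  echo "$PROMPT"
  echo "EOF"
  echo "model=$MODEL"
} > "$GITHUB_OUTPUT"

cat "$GITHUB_OUTPUT"
```

The printed file shows the prompt wrapped in `prompt<<EOF` … `EOF` markers and the model as a plain `key=value` line, which is exactly what later steps read back through `steps.config.outputs.*`.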
## Parallel Test Suite Execution
Run different test case groups in parallel for faster feedback:

```yaml
name: Parallel Test Suites

on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  load-config:
    name: Load Configuration
    runs-on: ubuntu-latest
    outputs:
      prompt: ${{ steps.config.outputs.prompt }}
      model: ${{ steps.config.outputs.model }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

  security-tests:
    name: Security Tests
    needs: load-config
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run security evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          system-prompt: ${{ needs.load-config.outputs.prompt }}
          openrouter-model-name: ${{ needs.load-config.outputs.model }}
          test-case-groups: "security jailbreak"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

  privacy-tests:
    name: Privacy Tests
    needs: load-config
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run privacy evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          system-prompt: ${{ needs.load-config.outputs.prompt }}
          openrouter-model-name: ${{ needs.load-config.outputs.model }}
          test-case-groups: "privacy data_protection"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

  compliance-tests:
    name: Compliance Tests
    needs: load-config
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run compliance evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          system-prompt: ${{ needs.load-config.outputs.prompt }}
          openrouter-model-name: ${{ needs.load-config.outputs.model }}
          test-case-groups: "compliance regulatory"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

  multi-turn-tests:
    name: Multi-Turn Tests
    needs: load-config
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift"
          system-prompt: ${{ needs.load-config.outputs.prompt }}
          openrouter-model-name: ${{ needs.load-config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
```
Parallel execution reduces total pipeline time. The three single-turn suites apply a strict threshold (0.90) to critical areas like security and privacy, while the multi-turn suite uses 0.85.

## Matrix Strategy for Multiple Models
Test multiple models or configurations simultaneously:

```yaml
name: Multi-Model Evaluation

on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * 0' # Weekly on Sunday

jobs:
  evaluate-models:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model:
          - name: "claude-3.7-sonnet"
            openrouter: "anthropic/claude-3.7-sonnet"
          - name: "claude-3-opus"
            openrouter: "anthropic/claude-3-opus"
          - name: "gpt-4"
            openrouter: "openai/gpt-4"
        threshold: ["0.80", "0.85", "0.90"]
      fail-fast: false
    name: Evaluate ${{ matrix.model.name }} (threshold ${{ matrix.threshold }})
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load system prompt
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT

      - name: Evaluate ${{ matrix.model.name }}
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: ${{ matrix.threshold }}
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ matrix.model.openrouter }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
```
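The matrix expands to nine jobs — the cross product of three models and three thresholds — each named by the job's `name:` template. A quick shell sketch of that expansion:

```shell
# Enumerate the combinations GitHub generates from the matrix above.
models="claude-3.7-sonnet claude-3-opus gpt-4"
thresholds="0.80 0.85 0.90"

count=0
for model in $models; do
  for threshold in $thresholds; do
    echo "Evaluate $model (threshold $threshold)"
    count=$((count + 1))
  done
done
echo "total jobs: $count"   # prints: total jobs: 9
```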
Set `fail-fast: false` to ensure all matrix jobs complete even if one fails. This gives you complete comparison data across models and thresholds.

## Conditional Deployment Based on Environment
Different evaluation strategies for staging vs production:

```yaml
name: Environment-Specific Evaluation

on:
  push:
    branches:
      - staging
      - main

jobs:
  evaluate-staging:
    name: Staging Evaluation
    if: github.ref == 'refs/heads/staging'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Staging evaluation (lighter)
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.75"
          fail-case-threshold: "0.5"
          variations: "2"
          maximum-iteration-layers: "2"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Deploy to staging
        run: |
          echo "Deploying to staging environment..."
          # Staging deployment logic

  evaluate-production:
    name: Production Evaluation
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Production single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          variations: "5"
          maximum-iteration-layers: "3"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Production multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          max-turns: "8"
          test-types: "jailbreak context_shift consistency instruction_following"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Deploy to production
        run: |
          echo "Deploying to production environment..."
          # Production deployment logic
```
Staging uses a lighter evaluation (2 variations, threshold 0.75), while production uses rigorous testing (5 variations, threshold 0.90, plus a multi-turn evaluation with 8 turns).
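One way to keep the two profiles from drifting apart is to store both in a single file and select by environment. A hypothetical `eval_config.json` and the jq lookup a workflow step could use — the file name and keys are illustrative, not part of the actions' API:

```shell
# Hypothetical per-environment evaluation settings in one file.
cat > eval_config.json <<'JSON'
{
  "staging":    { "threshold": "0.75", "variations": "2" },
  "production": { "threshold": "0.90", "variations": "5" }
}
JSON

# In a workflow this would be derived from github.ref; hard-coded here.
ENVIRONMENT=production

# --arg passes the shell value into jq safely; .[$env] indexes by key.
THRESHOLD=$(jq -r --arg env "$ENVIRONMENT" '.[$env].threshold' eval_config.json)
VARIATIONS=$(jq -r --arg env "$ENVIRONMENT" '.[$env].variations' eval_config.json)
echo "env=$ENVIRONMENT threshold=$THRESHOLD variations=$VARIATIONS"
```

The two extracted values would then feed the `fail-action-threshold` and `variations` inputs, so both environments share one evaluation job definition.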
## Integration with PR Comments
Post evaluation results as PR comments:

```yaml
name: PR Evaluation with Comments

on:
  pull_request:
    paths:
      - "model_config.json"
      - "prompts/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Run evaluation
        id: evaluation
        continue-on-error: true
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Comment on PR
        uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            const status = '${{ steps.evaluation.outcome }}';
            const body = status === 'success'
              ? '✅ Circuit Breaker Labs evaluation passed!\n\nYour system prompt changes meet safety requirements.'
              : '❌ Circuit Breaker Labs evaluation failed.\n\nPlease review the safety evaluation results and update your system prompt.';
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: body
            });

      - name: Fail if evaluation failed
        if: steps.evaluation.outcome == 'failure'
        run: exit 1
```
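In plain shell terms, the control flow of this job is: capture the outcome without aborting, report it, and only then propagate the failure. A sketch, with a stub standing in for the evaluation action (made to fail here so the failure path is exercised):

```shell
# Stub for the evaluation action; returning 1 simulates a failed evaluation.
run_evaluation() { return 1; }

# continue-on-error: true -- record the outcome instead of stopping the job.
if run_evaluation; then
  outcome=success
else
  outcome=failure
fi

# The comment step runs regardless of the outcome.
if [ "$outcome" = success ]; then
  echo "✅ evaluation passed"
else
  echo "❌ evaluation failed"
fi

# The final step re-raises the failure so the check still blocks the PR.
if [ "$outcome" = failure ]; then
  echo "(workflow would exit 1 here)"
fi
```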
Using `continue-on-error: true` allows the workflow to post a comment before failing. This provides immediate feedback to PR authors.

## Reusable Workflow
Create a reusable workflow for consistent evaluations across repositories.

**.github/workflows/evaluate-reusable.yml**

```yaml
name: Reusable Evaluation Workflow

on:
  workflow_call:
    inputs:
      system-prompt:
        required: true
        type: string
      model:
        required: true
        type: string
      fail-action-threshold:
        required: false
        type: string
        default: "0.80"
      variations:
        required: false
        type: string
        default: "3"
    secrets:
      cbl-api-key:
        required: true

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: ${{ inputs.fail-action-threshold }}
          fail-case-threshold: "0.5"
          variations: ${{ inputs.variations }}
          maximum-iteration-layers: "2"
          system-prompt: ${{ inputs.system-prompt }}
          openrouter-model-name: ${{ inputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.cbl-api-key }}

      - name: Multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: ${{ inputs.fail-action-threshold }}
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift"
          system-prompt: ${{ inputs.system-prompt }}
          openrouter-model-name: ${{ inputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.cbl-api-key }}
```
**Use the reusable workflow**

```yaml
name: Evaluate with Reusable Workflow

on:
  push:
    branches:
      - main

jobs:
  evaluate:
    uses: ./.github/workflows/evaluate-reusable.yml
    with:
      system-prompt: "You are a helpful assistant"
      model: "anthropic/claude-3.7-sonnet"
      fail-action-threshold: "0.85"
      variations: "5"
    secrets:
      cbl-api-key: ${{ secrets.CBL_API_KEY }}
```
Reusable workflows promote consistency and reduce duplication across multiple repositories in your organization.
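The caller above references the workflow by local path, which only works within the same repository. For organization-wide reuse, GitHub Actions lets you call a reusable workflow by repository and ref; `your-org/shared-workflows` below is a placeholder for wherever you host it:

```yaml
jobs:
  evaluate:
    uses: your-org/shared-workflows/.github/workflows/evaluate-reusable.yml@main
    with:
      system-prompt: "You are a helpful assistant"
      model: "anthropic/claude-3.7-sonnet"
    secrets:
      cbl-api-key: ${{ secrets.CBL_API_KEY }}
```

Pinning to a tag or commit SHA instead of `@main` gives consuming repositories a stable evaluation contract.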
## Fine-Tune Training Pipeline
Complete pipeline from training data to evaluated fine-tune:

```yaml
name: Fine-Tune Training Pipeline

on:
  workflow_dispatch:
    inputs:
      training_data_path:
        description: 'Path to training data file'
        required: true
        default: 'data/training.jsonl'
      model_suffix:
        description: 'Model name suffix'
        required: true

jobs:
  upload-training-data:
    name: Upload Training Data
    runs-on: ubuntu-latest
    outputs:
      file-id: ${{ steps.upload.outputs.file-id }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Upload training file to OpenAI
        id: upload
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          response=$(curl -s https://api.openai.com/v1/files \
            -H "Authorization: Bearer $OPENAI_API_KEY" \
            -F purpose="fine-tune" \
            -F file="@${{ github.event.inputs.training_data_path }}")
          file_id=$(echo $response | jq -r '.id')
          echo "file-id=$file_id" >> $GITHUB_OUTPUT
          echo "Uploaded training file: $file_id"

  create-fine-tune:
    name: Create Fine-Tune Job
    needs: upload-training-data
    runs-on: ubuntu-latest
    outputs:
      job-id: ${{ steps.create.outputs.job-id }}
    steps:
      - name: Create fine-tuning job
        id: create
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          response=$(curl -s https://api.openai.com/v1/fine_tuning/jobs \
            -H "Content-Type: application/json" \
            -H "Authorization: Bearer $OPENAI_API_KEY" \
            -d '{
              "training_file": "${{ needs.upload-training-data.outputs.file-id }}",
              "model": "gpt-4o-2024-08-06",
              "suffix": "${{ github.event.inputs.model_suffix }}"
            }')
          job_id=$(echo $response | jq -r '.id')
          echo "job-id=$job_id" >> $GITHUB_OUTPUT
          echo "Created fine-tuning job: $job_id"

  wait-for-completion:
    name: Wait for Fine-Tune Completion
    needs: create-fine-tune
    runs-on: ubuntu-latest
    outputs:
      model-id: ${{ steps.wait.outputs.model-id }}
    steps:
      - name: Wait for job completion
        id: wait
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          job_id="${{ needs.create-fine-tune.outputs.job-id }}"
          echo "Waiting for fine-tuning job to complete..."
          while true; do
            response=$(curl -s https://api.openai.com/v1/fine_tuning/jobs/$job_id \
              -H "Authorization: Bearer $OPENAI_API_KEY")
            status=$(echo $response | jq -r '.status')
            echo "Job status: $status"
            if [ "$status" = "succeeded" ]; then
              model_id=$(echo $response | jq -r '.fine_tuned_model')
              echo "model-id=$model_id" >> $GITHUB_OUTPUT
              echo "Fine-tuning completed! Model: $model_id"
              break
            elif [ "$status" = "failed" ]; then
              echo "Fine-tuning job failed"
              exit 1
            fi
            sleep 300 # Wait 5 minutes before checking again
          done

  evaluate-fine-tune:
    name: Evaluate Fine-Tuned Model
    needs: wait-for-completion
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          variations: "5"
          maximum-iteration-layers: "3"
          model-name: ${{ needs.wait-for-completion.outputs.model-id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

      - name: Multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift consistency"
          model-name: ${{ needs.wait-for-completion.outputs.model-id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

  save-model-config:
    name: Save Model Configuration
    needs: [wait-for-completion, evaluate-fine-tune]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Update configuration file
        run: |
          jq '.model_id = "${{ needs.wait-for-completion.outputs.model-id }}"' \
            finetune_config.json > finetune_config.json.tmp
          mv finetune_config.json.tmp finetune_config.json

      - name: Commit configuration
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add finetune_config.json
          git commit -m "Update fine-tuned model to ${{ needs.wait-for-completion.outputs.model-id }}"
          git push
```
This complete pipeline uploads training data, creates a fine-tune job, waits for completion, evaluates the model, and updates your configuration file automatically.
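Since a malformed training file only surfaces after the upload step runs, it can be worth validating the JSONL locally first. A minimal sketch — the two-line sample file stands in for your real `training_data_path`, and the check only asserts the chat-format shape (each line is a JSON object carrying a `messages` array):

```shell
# Stand-in training file; real data comes from the training_data_path input.
cat > training.jsonl <<'JSONL'
{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}
{"messages": [{"role": "user", "content": "thanks"}, {"role": "assistant", "content": "any time"}]}
JSONL

# Each line must parse as JSON and contain a "messages" array.
bad=0
n=0
while IFS= read -r line; do
  n=$((n + 1))
  if ! printf '%s' "$line" | jq -e '.messages | type == "array"' >/dev/null 2>&1; then
    echo "line $n: not a valid chat-format training example"
    bad=1
  fi
done < training.jsonl

if [ "$bad" -eq 0 ]; then
  echo "training.jsonl: $n examples OK"
fi
```

`jq -e` exits non-zero both for unparseable lines and for lines whose `messages` field is missing or not an array, so either failure mode is flagged with its line number.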
## Monitoring and Alerting
Integrate with Slack for evaluation notifications:

```yaml
name: Evaluation with Slack Notifications

on:
  schedule:
    - cron: '0 */12 * * *' # Every 12 hours
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Run evaluation
        id: evaluation
        continue-on-error: true
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          variations: "5"
          maximum-iteration-layers: "3"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Notify Slack on success
        if: steps.evaluation.outcome == 'success'
        uses: slackapi/slack-github-action@v1
        with:
          webhook-url: ${{ secrets.SLACK_WEBHOOK_URL }}
          payload: |
            {
              "text": "✅ Circuit Breaker Labs evaluation passed",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Circuit Breaker Labs Evaluation Passed* ✅\n\nModel: `${{ steps.config.outputs.model }}`\nThreshold: 0.90\n\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Results>"
                  }
                }
              ]
            }

      - name: Notify Slack on failure
        if: steps.evaluation.outcome == 'failure'
        uses: slackapi/slack-github-action@v1
        with:
          webhook-url: ${{ secrets.SLACK_WEBHOOK_URL }}
          payload: |
            {
              "text": "❌ Circuit Breaker Labs evaluation failed",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Circuit Breaker Labs Evaluation Failed* ❌\n\nModel: `${{ steps.config.outputs.model }}`\nThreshold: 0.90\n\n⚠️ *Action Required:* Review safety evaluation results\n\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Details>"
                  }
                }
              ]
            }

      - name: Fail if evaluation failed
        if: steps.evaluation.outcome == 'failure'
        run: exit 1
```
Slack notifications provide immediate visibility into evaluation results for your team. Configure `SLACK_WEBHOOK_URL` in your repository secrets.
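Hand-writing Block Kit JSON inside workflow YAML is easy to get wrong; building the payload with jq and checking that it parses can catch escaping mistakes before a run. The model name and run URL below are local stand-ins for the workflow expressions:

```shell
# Stand-ins for ${{ steps.config.outputs.model }} and the run URL expression.
MODEL="anthropic/claude-3.7-sonnet"
RUN_URL="https://github.com/example-org/example-repo/actions/runs/123"

# jq -n builds JSON from scratch; --arg values are escaped safely, and
# \($var) interpolates them inside jq string literals.
payload=$(jq -n --arg model "$MODEL" --arg url "$RUN_URL" '{
  text: "✅ Circuit Breaker Labs evaluation passed",
  blocks: [
    {
      type: "section",
      text: {
        type: "mrkdwn",
        text: "*Circuit Breaker Labs Evaluation Passed* ✅\n\nModel: `\($model)`\nThreshold: 0.90\n\n<\($url)|View Results>"
      }
    }
  ]
}')

# Confirm the payload round-trips as valid JSON with the expected shape.
echo "$payload" | jq -e '.blocks[0].type'
```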