
Overview

This guide provides production-ready examples for evaluating system prompts using Circuit Breaker Labs GitHub Actions. System prompt evaluation is essential for ensuring your AI assistants behave safely and effectively before deploying to production.

Basic Single-Turn Evaluation

The simplest workflow evaluates a system prompt each time you trigger it manually. This is ideal for quick testing during development.
name: Evaluate System Prompt

on:
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run system prompt evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "1"
          maximum-iteration-layers: "1"
          system-prompt: "You are a helpful assistant"
          openrouter-model-name: "anthropic/claude-3.7-sonnet"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
The fail-action-threshold is the maximum allowed failure rate (0.0-1.0); the workflow fails once the overall failure rate exceeds it. Setting it to 0.80 allows up to 80% of test cases to fail before your pipeline is blocked.
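To make the threshold semantics concrete, here is an illustrative sketch (not the action's internal code) of how a failure rate would be compared against fail-action-threshold. The counts are made-up example values:

```shell
# Illustrative only: how a failure-rate threshold gates a pipeline.
failures=9
total=10
threshold="0.80"

# bash has no floating-point arithmetic, so use awk for the division
# and the comparison against the threshold.
rate=$(awk -v f="$failures" -v t="$total" 'BEGIN { printf "%.2f", f / t }')
exceeds=$(awk -v r="$rate" -v th="$threshold" 'BEGIN { if (r > th) print "fail"; else print "pass" }')
echo "failure rate $rate -> $exceeds"
```

With 9 of 10 cases failing, the 0.90 failure rate exceeds the 0.80 threshold, so the run is marked as failing.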

Reading Configuration from Files

For production systems, store your system prompt and model configuration in a JSON file and trigger evaluation automatically when the configuration changes.
Step 1: Create model_config.json

Create a configuration file in your repository root:
model_config.json
{
  "system_prompt": "You are a helpful AI assistant. Always be concise and accurate. Never provide harmful information.",
  "model": "anthropic/claude-3.7-sonnet"
}
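Before committing a config change, it can help to run a quick local sanity check (this assumes jq is installed; the check itself is a suggestion, not part of the action):

```shell
# Write an example config, then confirm it parses as JSON and
# contains both keys the workflow reads.
CONFIG=model_config.json
cat > "$CONFIG" <<'EOF'
{
  "system_prompt": "You are a helpful AI assistant. Always be concise and accurate.",
  "model": "anthropic/claude-3.7-sonnet"
}
EOF

# jq -e sets a non-zero exit status when the expression is false or null,
# so this fails fast on invalid JSON or missing keys.
if jq -e 'has("system_prompt") and has("model")' "$CONFIG" > /dev/null; then
  echo "config OK"
else
  echo "config invalid or missing keys" >&2
fi
```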
Step 2: Create a workflow that reads the config

This workflow triggers on changes to model_config.json and reads the configuration dynamically:
name: Evaluate System Prompt

on:
  push:
    paths:
      - "model_config.json"
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Read configuration from JSON
        id: read-config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Run system prompt evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "1"
          maximum-iteration-layers: "1"
          system-prompt: ${{ steps.read-config.outputs.prompt }}
          openrouter-model-name: ${{ steps.read-config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
Using the heredoc syntax (the prompt<<EOF ... EOF pattern) when writing to GITHUB_OUTPUT is essential for multi-line strings like system prompts. A plain key=value line is cut off at the first line break, which would truncate the prompt and break your workflow.
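You can see the resulting output format locally by simulating GITHUB_OUTPUT with a scratch file (the prompt text here is a placeholder):

```shell
# Simulate the GITHUB_OUTPUT file GitHub Actions provides to each step.
GITHUB_OUTPUT=$(mktemp)
PROMPT="Line one of the prompt.
Line two of the prompt."

# A plain prompt=$PROMPT line would stop at the first newline;
# the name<<EOF ... EOF delimiter form keeps the whole value intact.
{
  echo "prompt<<EOF"
  echo "$PROMPT"
  echo "EOF"
} >> "$GITHUB_OUTPUT"

cat "$GITHUB_OUTPUT"
```

The closing delimiter must appear alone on its own line, exactly matching the delimiter chosen after `<<`.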

Multi-Turn System Prompt Evaluation

For conversational AI systems, use multi-turn evaluation to test how your system prompt performs across multiple conversation exchanges.
name: Evaluate Multi-Turn System Prompt

on:
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run multi-turn system prompt evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          max-turns: "4"
          test-types: "jailbreak context_shift"
          system-prompt: "You are a helpful assistant that maintains safety across conversations"
          openrouter-model-name: "anthropic/claude-3.7-sonnet"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
max-turns must be an even number. Set test-types to a space-separated list of multi-turn test patterns you want to evaluate.
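Since an odd max-turns value rejects the run, one option is a hypothetical pre-flight step (this check is a suggestion, not built into the action) that fails fast before the evaluation starts:

```shell
# Hypothetical pre-flight check: reject an odd max-turns value early,
# since the action requires an even number of conversation turns.
MAX_TURNS=4
if [ $((MAX_TURNS % 2)) -ne 0 ]; then
  echo "max-turns must be even, got $MAX_TURNS" >&2
  exit 1
fi
echo "max-turns=$MAX_TURNS is valid"
```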

Trigger Options

Trigger evaluation whenever code is pushed to specific branches and touches specific paths:
on:
  push:
    branches:
      - main
      - develop
    paths:
      - "model_config.json"
      - "prompts/**"

Filtering Test Case Groups

Use the test-case-groups parameter to run only specific categories of tests:
- name: Run targeted evaluation
  uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
  with:
    fail-action-threshold: "0.80"
    fail-case-threshold: "0.5"
    variations: "1"
    maximum-iteration-layers: "1"
    system-prompt: "You are a helpful assistant"
    openrouter-model-name: "anthropic/claude-3.7-sonnet"
    test-case-groups: "security privacy compliance"
    circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
Provide test-case-groups as a space-separated list. This is useful for running different test suites in parallel or focusing on specific risk categories.

Complete Production Example

Here’s a comprehensive workflow combining multiple best practices:
name: System Prompt CI

on:
  push:
    branches:
      - main
    paths:
      - "config/prompts/**"
      - "model_config.json"
  pull_request:
    paths:
      - "config/prompts/**"
      - "model_config.json"
  schedule:
    - cron: '0 3 * * 1'  # Weekly on Monday at 3 AM
  workflow_dispatch:
    inputs:
      variations:
        description: 'Number of test variations'
        required: false
        default: '3'

jobs:
  evaluate-system-prompt:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Load configuration
        id: config
        run: |
          PROMPT=$(jq -r '.system_prompt' model_config.json)
          MODEL=$(jq -r '.model' model_config.json)
          echo "prompt<<EOF" >> $GITHUB_OUTPUT
          echo "$PROMPT" >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT
          echo "model=$MODEL" >> $GITHUB_OUTPUT

      - name: Evaluate single-turn
        uses: circuitbreakerlabs/actions/singleturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          variations: ${{ github.event.inputs.variations || '3' }}
          maximum-iteration-layers: "2"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}

      - name: Evaluate multi-turn
        uses: circuitbreakerlabs/actions/multiturn-evaluate-system-prompt@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          max-turns: "4"
          test-types: "jailbreak context_shift"
          system-prompt: ${{ steps.config.outputs.prompt }}
          openrouter-model-name: ${{ steps.config.outputs.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
This production example runs both single-turn and multi-turn evaluations, triggers on multiple events, and accepts manual input for the number of variations.
