The singleturn-evaluate-openai-finetune action evaluates an OpenAI fine-tuned model by running it through Circuit Breaker Labs safety tests with single-turn interactions.

Overview

This action is ideal for:
  • Validating OpenAI fine-tuned models before deployment
  • Comparing fine-tune safety against base models
  • Automated testing of model training results
  • Ensuring fine-tuning hasn’t degraded safety properties
API Endpoint: POST /v1/evaluations/single-turn/evaluate-openai-fine-tune
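For reference, the underlying API request can be sketched in Python. The request field names below are assumptions inferred from the action's input parameters, not the documented schema; consult the Circuit Breaker Labs API reference for the exact shape.

```python
# Hypothetical request body for the evaluation endpoint. Field names are
# assumptions mirroring the action's inputs; the real schema may differ.
import json


def build_request(model_name: str, variations: int, max_layers: int) -> dict:
    """Assemble an illustrative request body for the evaluation endpoint."""
    return {
        "model_name": model_name,
        "variations": variations,
        "maximum_iteration_layers": max_layers,
    }


payload = build_request(
    "ft:gpt-4o-mini-2024-07-18:acmecorp:custom-model:AaBbCcDd", 3, 5
)
print(json.dumps(payload))
```

In the action itself, these values come from the `with:` inputs shown below, and both API keys are passed alongside the request.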

Usage

name: Evaluate Fine-Tuned Model
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Evaluate OpenAI Fine-Tune
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.10'
          fail-case-threshold: '0.85'
          variations: '3'
          maximum-iteration-layers: '5'
          model-name: 'ft:gpt-4o-mini-2024-07-18:acmecorp:custom-model:AaBbCcDd'
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Parameters

Required Parameters

fail-action-threshold
string
required
Failure rate threshold that causes the workflow to fail. Expressed as a decimal (e.g., '0.10' for 10%). If the overall test failure rate exceeds this threshold, the action exits with code 1 and fails the workflow. Example: '0.10' means the workflow fails if more than 10% of test cases fail.
fail-case-threshold
string
required
Safety score threshold below which an individual test case is considered failed. Expressed as a decimal between 0 and 1. Test cases with safety scores below this threshold are marked as failures and included in failure rate calculations. Example: '0.85' means any response with a safety score below 0.85 is considered unsafe.
variations
string
required
Number of test variations to run for each test case. Higher values provide more comprehensive testing but increase evaluation time and cost. Example: '3' runs each test case 3 times with different variations.
maximum-iteration-layers
string
required
Maximum number of iteration layers for the evaluation. This controls the depth of adversarial testing by generating increasingly sophisticated attack variations. Example: '5' generates up to 5 layers of test iterations.
model-name
string
required
Fully qualified name of the OpenAI fine-tuned model to evaluate. This must be the complete model identifier from OpenAI, including the fine-tune suffix. Format: ft:<base-model>:<org-name>:<custom-name>:<suffix>. Example: 'ft:gpt-4o-mini-2024-07-18:acmecorp:custom-model:AaBbCcDd'
circuit-breaker-labs-api-key
string
required
Your Circuit Breaker Labs API key. Important: Always store this as a GitHub secret; never commit it to your repository. Example: ${{ secrets.CBL_API_KEY }}
openai-api-key
string
required
Your OpenAI API key with access to the fine-tuned model. This key must have permission to use the specified fine-tuned model. Important: Always store this as a GitHub secret; never commit it to your repository. Example: ${{ secrets.OPENAI_API_KEY }}
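The interaction between fail-case-threshold and fail-action-threshold can be sketched as follows. This is a minimal illustration, assuming each test case yields a safety score between 0 and 1; the names are illustrative, not the action's internals.

```python
def evaluate_results(safety_scores, fail_case_threshold, fail_action_threshold):
    """Return (failure_rate, workflow_failed) for a list of safety scores."""
    # A case fails when its safety score falls below fail-case-threshold.
    failed_cases = [s for s in safety_scores if s < fail_case_threshold]
    failure_rate = len(failed_cases) / len(safety_scores)
    # The action fails when the overall rate exceeds fail-action-threshold.
    return failure_rate, failure_rate > fail_action_threshold


# Example: 2 of 20 cases score below 0.85, a 10% failure rate. That does
# not *exceed* a 0.10 fail-action-threshold, so the workflow still passes.
scores = [0.95] * 18 + [0.80, 0.70]
rate, workflow_failed = evaluate_results(scores, 0.85, 0.10)
print(rate, workflow_failed)  # 0.1 False
```

Note the strict inequality in both comparisons: a failure rate exactly equal to the fail-action-threshold passes.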

Optional Parameters

test-case-groups
string
Space-separated list of test case groups to run. If not specified, all test case groups are executed. This allows you to run specific subsets of tests for targeted evaluation. Example: 'jailbreak prompt_injection'
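A sketch of how such a space-separated list might be parsed and applied. Group names beyond those in the example above are illustrative, and this is not the action's actual implementation.

```python
def select_groups(test_case_groups, all_groups):
    """If the input is empty or unset, run everything; otherwise run only
    the requested groups, in the order the platform defines them."""
    if not test_case_groups:
        return list(all_groups)
    requested = test_case_groups.split()
    return [g for g in all_groups if g in requested]


# 'pii_leakage' is a hypothetical group name used only for illustration.
all_groups = ["jailbreak", "prompt_injection", "pii_leakage"]
print(select_groups("jailbreak prompt_injection", all_groups))
print(select_groups(None, all_groups))
```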

Example Workflows

Post-Training Validation

name: Validate Fine-Tuned Model
on:
  workflow_dispatch:
    inputs:
      model_id:
        description: 'Fine-tuned model ID'
        required: true
        type: string

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Evaluate Model Safety
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.10'
          fail-case-threshold: '0.85'
          variations: '5'
          maximum-iteration-layers: '7'
          model-name: ${{ inputs.model_id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Continuous Model Monitoring

name: Weekly Model Safety Check
on:
  schedule:
    - cron: '0 0 * * 0'  # Every Sunday at midnight UTC

jobs:
  monitor:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model:
          - 'ft:gpt-4o-mini-2024-07-18:acme:support:v1'
          - 'ft:gpt-4o-mini-2024-07-18:acme:sales:v1'
    steps:
      - uses: actions/checkout@v4
      
      - name: Test ${{ matrix.model }}
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.10'
          fail-case-threshold: '0.85'
          variations: '3'
          maximum-iteration-layers: '5'
          model-name: ${{ matrix.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Deployment Gate

name: Pre-Deployment Safety Gate
on:
  push:
    tags:
      - 'v*'

jobs:
  safety-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Extract model ID from tag
        id: model
        run: echo "id=$(cat model-id.txt)" >> "$GITHUB_OUTPUT"
      
      - name: Safety Evaluation
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.05'  # Strict threshold for production
          fail-case-threshold: '0.90'
          variations: '5'
          maximum-iteration-layers: '10'
          model-name: ${{ steps.model.outputs.id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
      
      - name: Deploy to Production
        if: success()
        run: ./deploy.sh ${{ steps.model.outputs.id }}

Targeted Vulnerability Testing

name: Jailbreak Testing
on: [workflow_dispatch]

jobs:
  jailbreak-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run Jailbreak Tests
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.02'
          fail-case-threshold: '0.95'
          variations: '5'
          maximum-iteration-layers: '10'
          model-name: ${{ env.MODEL_ID }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          test-case-groups: 'jailbreak prompt_injection'

Output and Reporting

Success Case

When all tests pass within acceptable thresholds:
All tests passed within the acceptable failure threshold.
The action exits with code 0 and the workflow continues.

Failure Case

When the failure rate exceeds the threshold:
Overall Failure Rate: 12.50%

---- Failed Case ----
    Layer: 0
    Safety Score: 0.78
    User Input: [adversarial prompt]
    Model Response: [unsafe response]

---- Failed Case ----
    Layer: 2
    Safety Score: 0.71
    User Input: [adversarial prompt]
    Model Response: [unsafe response]
The action exits with code 1 and fails the workflow.

Implementation Details

The action performs the following steps:
  1. Installs uv: Uses astral-sh/setup-uv@1e862dfacbd1d6d858c55d9b792c756523627244 for Python environment management
  2. Constructs API Request: Builds a SingleTurnEvaluateOpenAiFinetuneRequest with your parameters
  3. Calls API: POSTs to /v1/evaluations/single-turn/evaluate-openai-fine-tune with both API keys
  4. Processes Results: Parses the SingleTurnRunTestsResponse and calculates failure rates
  5. Reports Failures: If failure rate exceeds threshold, outputs detailed failure information
  6. Exits: Returns appropriate exit code based on test results
The action uses the Circuit Breaker Labs Python SDK internally, calling single_turn_evaluate_openai_fine_tune_post.sync_detailed() from the circuit_breaker_labs.api.evaluations module.
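Steps 4–5 above can be sketched as follows. The result structure here is an assumption modeled on the failure output shown under "Failure Case", not the actual SingleTurnRunTestsResponse schema.

```python
def format_report(results, fail_case_threshold):
    """Compute the overall failure rate and render failed cases in the
    style of the action's failure output. 'results' is a hypothetical
    list of dicts, not the real SDK response type."""
    failed = [r for r in results if r["safety_score"] < fail_case_threshold]
    rate = len(failed) / len(results)
    lines = [f"Overall Failure Rate: {rate:.2%}"]
    for case in failed:
        lines += [
            "",
            "---- Failed Case ----",
            f"    Layer: {case['layer']}",
            f"    Safety Score: {case['safety_score']}",
            f"    User Input: {case['user_input']}",
            f"    Model Response: {case['model_response']}",
        ]
    return "\n".join(lines)


results = [
    {"layer": 0, "safety_score": 0.78, "user_input": "[adversarial prompt]",
     "model_response": "[unsafe response]"},
    {"layer": 1, "safety_score": 0.97, "user_input": "[benign prompt]",
     "model_response": "[safe response]"},
]
print(format_report(results, 0.85))
```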

Best Practices

Fine-Tune Testing Strategy

  1. Baseline Testing: Test the base model before fine-tuning to establish safety baselines
  2. Post-Training Testing: Run comprehensive safety tests immediately after fine-tuning completes
  3. Regression Testing: Compare fine-tune results against base model to detect safety degradation
  4. Continuous Monitoring: Schedule periodic tests to catch any drift or issues

Threshold Configuration

  • Development Models: Use moderate thresholds (e.g., fail-action-threshold: '0.10')
  • Production Models: Use strict thresholds (e.g., fail-action-threshold: '0.05')
  • Safety-Critical Systems: Use very strict thresholds (e.g., fail-action-threshold: '0.01')
Fine-tuning can inadvertently reduce model safety. Always test fine-tuned models before deployment, even if the training data was carefully curated.
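As a quick worked example of what these thresholds mean in absolute terms (the helper below is illustrative):

```python
import math


def max_tolerated_failures(total_cases: int, fail_action_threshold: float) -> int:
    """Largest number of failed cases whose rate does not exceed the
    fail-action-threshold. With 200 cases and '0.05', that is
    floor(200 * 0.05) = 10: an 11th failure pushes the rate past 5%."""
    return math.floor(total_cases * fail_action_threshold)


print(max_tolerated_failures(200, 0.05))  # 10
```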

API Key Management

  • Store both API keys as GitHub secrets
  • Use different keys for different environments (dev/staging/prod)
  • Rotate keys regularly
  • Monitor API key usage for anomalies

Cost Optimization

  • Use lower variations and maximum-iteration-layers for frequent CI checks
  • Reserve comprehensive testing (high values) for pre-deployment gates
  • Use test-case-groups to run targeted tests when debugging specific issues
This action incurs costs from both Circuit Breaker Labs (for safety evaluation) and OpenAI (for model inference). Monitor your usage on both platforms.

Troubleshooting

Authentication Errors

Problem: Error: 401 or authentication failures
Solution:
  • Verify both API keys are correct
  • Ensure secrets are properly configured in GitHub: Settings → Secrets and variables → Actions
  • Check that you’re using ${{ secrets.SECRET_NAME }} syntax

Model Access Issues

Problem: Model not found or permission denied
Solution:
  • Verify the fine-tuned model ID is correct
  • Ensure your OpenAI API key has access to the specified model
  • Check that the model is in a “succeeded” state (not still training or failed)
  • Verify the model hasn’t been deleted

Invalid Model Format

Problem: Invalid model name errors
Solution:
  • Ensure you’re using the full model identifier from OpenAI
  • Check the format: ft:<base-model>:<org>:<name>:<suffix>
  • Copy the exact model ID from OpenAI’s fine-tuning dashboard

High Failure Rates

Problem: Fine-tuned model fails safety tests
Solution:
  • Review failed case details in the action output
  • Compare results with base model testing
  • Review your fine-tuning training data for safety issues
  • Consider adding safety examples to your training data
  • Test with different model sizes or base models

Fine-Tune vs System Prompt

This action differs from singleturn-evaluate-system-prompt in key ways:
| Aspect | Fine-Tune Action | System Prompt Action |
| --- | --- | --- |
| Model Source | OpenAI fine-tuned models | Any OpenRouter model |
| Authentication | Requires both CBL + OpenAI keys | Only requires CBL key |
| Use Case | Testing custom trained models | Testing prompt engineering |
| Model Parameter | model-name (full fine-tune ID) | system-prompt (text) + openrouter-model-name |
| API Endpoint | /single-turn/evaluate-openai-fine-tune | /singleturn/evaluate-system-prompt |

API Reference

For detailed API documentation, see:
