The singleturn-evaluate-openai-finetune action evaluates an OpenAI fine-tuned model by running it through Circuit Breaker Labs safety tests with single-turn interactions.

Overview

This action is ideal for:
  • Validating OpenAI fine-tuned models before deployment
  • Comparing fine-tune safety against base models
  • Automated testing of model training results
  • Ensuring fine-tuning hasn’t degraded safety properties
API Endpoint: POST /v1/evaluations/single-turn/evaluate-openai-fine-tune
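For reference, the underlying API request can be sketched in Python. The request field names below are assumptions inferred from the action's input parameters, not the documented schema; consult the Circuit Breaker Labs API reference for the exact shape.

```python
# Hypothetical request body for the evaluation endpoint. Field names are
# assumptions mirroring the action's inputs; the real schema may differ.
import json


def build_request(model_name: str, variations: int, max_layers: int) -> dict:
    """Assemble an illustrative request body for the evaluation endpoint."""
    return {
        "model_name": model_name,
        "variations": variations,
        "maximum_iteration_layers": max_layers,
    }


payload = build_request(
    "ft:gpt-4o-mini-2024-07-18:acmecorp:custom-model:AaBbCcDd", 3, 5
)
print(json.dumps(payload))
```

In the action itself, these values come from the `with:` inputs shown below, and both API keys are passed alongside the request.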

Usage

name: Evaluate Fine-Tuned Model
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Evaluate OpenAI Fine-Tune
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.10'
          fail-case-threshold: '0.85'
          variations: '3'
          maximum-iteration-layers: '5'
          model-name: 'ft:gpt-4o-mini-2024-07-18:acmecorp:custom-model:AaBbCcDd'
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Parameters

Required Parameters

fail-action-threshold
string
required
Failure rate threshold that causes the workflow to fail. Expressed as a decimal (e.g., '0.10' for 10%). If the overall test failure rate exceeds this threshold, the action exits with code 1 and fails the workflow. Example: '0.10' means the workflow fails if more than 10% of test cases fail.
fail-case-threshold
string
required
Safety score threshold below which an individual test case is considered failed. Expressed as a decimal between 0 and 1. Test cases with safety scores below this threshold are marked as failures and included in failure rate calculations. Example: '0.85' means any response with a safety score below 0.85 is considered unsafe.
variations
string
required
Number of test variations to run for each test case. Higher values provide more comprehensive testing but increase evaluation time and cost. Example: '3' runs each test case 3 times with different variations.
maximum-iteration-layers
string
required
Maximum number of iteration layers for the evaluation. This controls the depth of adversarial testing by generating increasingly sophisticated attack variations. Example: '5' generates up to 5 layers of test iterations.
model-name
string
required
Fully qualified name of the OpenAI fine-tuned model to evaluate. This must be the complete model identifier from OpenAI, including the fine-tune suffix. Format: ft:<base-model>:<org-name>:<custom-name>:<suffix>. Example: 'ft:gpt-4o-mini-2024-07-18:acmecorp:custom-model:AaBbCcDd'
circuit-breaker-labs-api-key
string
required
Your Circuit Breaker Labs API key. Important: Always store this as a GitHub secret; never commit it to your repository. Example: ${{ secrets.CBL_API_KEY }}
openai-api-key
string
required
Your OpenAI API key with access to the fine-tuned model. This key must have permission to use the specified fine-tuned model. Important: Always store this as a GitHub secret; never commit it to your repository. Example: ${{ secrets.OPENAI_API_KEY }}
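The interaction between fail-case-threshold and fail-action-threshold can be sketched as follows. This is a minimal illustration, assuming each test case yields a safety score between 0 and 1; the names are illustrative, not the action's internals.

```python
def evaluate_results(safety_scores, fail_case_threshold, fail_action_threshold):
    """Return (failure_rate, workflow_failed) for a list of safety scores."""
    # A case fails when its safety score falls below fail-case-threshold.
    failed_cases = [s for s in safety_scores if s < fail_case_threshold]
    failure_rate = len(failed_cases) / len(safety_scores)
    # The action fails when the overall rate exceeds fail-action-threshold.
    return failure_rate, failure_rate > fail_action_threshold


# Example: 2 of 20 cases score below 0.85, a 10% failure rate. That does
# not *exceed* a 0.10 fail-action-threshold, so the workflow still passes.
scores = [0.95] * 18 + [0.80, 0.70]
rate, workflow_failed = evaluate_results(scores, 0.85, 0.10)
print(rate, workflow_failed)  # 0.1 False
```

Note the strict inequality in both comparisons: a failure rate exactly equal to the fail-action-threshold passes.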

Optional Parameters

test-case-groups
string
Space-separated list of test case groups to run. If not specified, all test case groups are executed. This allows you to run specific subsets of tests for targeted evaluation. Example: 'jailbreak prompt_injection'
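A sketch of how such a space-separated list might be parsed and applied. Group names beyond those in the example above are illustrative, and this is not the action's actual implementation.

```python
def select_groups(test_case_groups, all_groups):
    """If the input is empty or unset, run everything; otherwise run only
    the requested groups, in the order the platform defines them."""
    if not test_case_groups:
        return list(all_groups)
    requested = test_case_groups.split()
    return [g for g in all_groups if g in requested]


# 'pii_leakage' is a hypothetical group name used only for illustration.
all_groups = ["jailbreak", "prompt_injection", "pii_leakage"]
print(select_groups("jailbreak prompt_injection", all_groups))
print(select_groups(None, all_groups))
```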

Example Workflows

Post-Training Validation

name: Validate Fine-Tuned Model
on:
  workflow_dispatch:
    inputs:
      model_id:
        description: 'Fine-tuned model ID'
        required: true
        type: string

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Evaluate Model Safety
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.10'
          fail-case-threshold: '0.85'
          variations: '5'
          maximum-iteration-layers: '7'
          model-name: ${{ inputs.model_id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Continuous Model Monitoring

name: Weekly Model Safety Check
on:
  schedule:
    - cron: '0 0 * * 0'  # Every Sunday at midnight UTC

jobs:
  monitor:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model:
          - 'ft:gpt-4o-mini-2024-07-18:acme:support:v1'
          - 'ft:gpt-4o-mini-2024-07-18:acme:sales:v1'
    steps:
      - uses: actions/checkout@v4
      
      - name: Test ${{ matrix.model }}
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.10'
          fail-case-threshold: '0.85'
          variations: '3'
          maximum-iteration-layers: '5'
          model-name: ${{ matrix.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

Deployment Gate

name: Pre-Deployment Safety Gate
on:
  push:
    tags:
      - 'v*'

jobs:
  safety-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Extract model ID from tag
        id: model
        run: echo "id=$(cat model-id.txt)" >> "$GITHUB_OUTPUT"
      
      - name: Safety Evaluation
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.05'  # Strict threshold for production
          fail-case-threshold: '0.90'
          variations: '5'
          maximum-iteration-layers: '10'
          model-name: ${{ steps.model.outputs.id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
      
      - name: Deploy to Production
        if: success()
        run: ./deploy.sh ${{ steps.model.outputs.id }}

Targeted Vulnerability Testing

name: Jailbreak Testing
on: [workflow_dispatch]

jobs:
  jailbreak-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run Jailbreak Tests
        uses: circuit-breaker-labs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: '0.02'
          fail-case-threshold: '0.95'
          variations: '5'
          maximum-iteration-layers: '10'
          model-name: ${{ env.MODEL_ID }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          test-case-groups: 'jailbreak prompt_injection'

Output and Reporting

Success Case

When all tests pass within acceptable thresholds:
All tests passed within the acceptable failure threshold.
The action exits with code 0 and the workflow continues.

Failure Case

When the failure rate exceeds the threshold:
Overall Failure Rate: 12.50%

---- Failed Case ----
    Layer: 0
    Safety Score: 0.78
    User Input: [adversarial prompt]
    Model Response: [unsafe response]

---- Failed Case ----
    Layer: 2
    Safety Score: 0.71
    User Input: [adversarial prompt]
    Model Response: [unsafe response]
The action exits with code 1 and fails the workflow.

Implementation Details

The action performs the following steps:
  1. Installs uv: Uses astral-sh/setup-uv@1e862dfacbd1d6d858c55d9b792c756523627244 for Python environment management
  2. Constructs API Request: Builds a SingleTurnEvaluateOpenAiFinetuneRequest with your parameters
  3. Calls API: POSTs to /v1/evaluations/single-turn/evaluate-openai-fine-tune with both API keys
  4. Processes Results: Parses the SingleTurnRunTestsResponse and calculates failure rates
  5. Reports Failures: If failure rate exceeds threshold, outputs detailed failure information
  6. Exits: Returns appropriate exit code based on test results
The action uses the Circuit Breaker Labs Python SDK internally, calling single_turn_evaluate_openai_fine_tune_post.sync_detailed() from the circuit_breaker_labs.api.evaluations module.
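Steps 4–5 above can be sketched as follows. The result structure here is an assumption modeled on the failure output shown under "Failure Case", not the actual SingleTurnRunTestsResponse schema.

```python
def format_report(results, fail_case_threshold):
    """Compute the overall failure rate and render failed cases in the
    style of the action's failure output. 'results' is a hypothetical
    list of dicts, not the real SDK response type."""
    failed = [r for r in results if r["safety_score"] < fail_case_threshold]
    rate = len(failed) / len(results)
    lines = [f"Overall Failure Rate: {rate:.2%}"]
    for case in failed:
        lines += [
            "",
            "---- Failed Case ----",
            f"    Layer: {case['layer']}",
            f"    Safety Score: {case['safety_score']}",
            f"    User Input: {case['user_input']}",
            f"    Model Response: {case['model_response']}",
        ]
    return "\n".join(lines)


results = [
    {"layer": 0, "safety_score": 0.78, "user_input": "[adversarial prompt]",
     "model_response": "[unsafe response]"},
    {"layer": 1, "safety_score": 0.97, "user_input": "[benign prompt]",
     "model_response": "[safe response]"},
]
print(format_report(results, 0.85))
```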

Best Practices

Fine-Tune Testing Strategy

  1. Baseline Testing: Test the base model before fine-tuning to establish safety baselines
  2. Post-Training Testing: Run comprehensive safety tests immediately after fine-tuning completes
  3. Regression Testing: Compare fine-tune results against base model to detect safety degradation
  4. Continuous Monitoring: Schedule periodic tests to catch any drift or issues

Threshold Configuration

  • Development Models: Use moderate thresholds (e.g., fail-action-threshold: '0.10')
  • Production Models: Use strict thresholds (e.g., fail-action-threshold: '0.05')
  • Safety-Critical Systems: Use very strict thresholds (e.g., fail-action-threshold: '0.01')
Fine-tuning can inadvertently reduce model safety. Always test fine-tuned models before deployment, even if the training data was carefully curated.
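As a quick worked example of what these thresholds mean in absolute terms (the helper below is illustrative):

```python
import math


def max_tolerated_failures(total_cases: int, fail_action_threshold: float) -> int:
    """Largest number of failed cases whose rate does not exceed the
    fail-action-threshold. With 200 cases and '0.05', that is
    floor(200 * 0.05) = 10: an 11th failure pushes the rate past 5%."""
    return math.floor(total_cases * fail_action_threshold)


print(max_tolerated_failures(200, 0.05))  # 10
```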

API Key Management

  • Store both API keys as GitHub secrets
  • Use different keys for different environments (dev/staging/prod)
  • Rotate keys regularly
  • Monitor API key usage for anomalies

Cost Optimization

  • Use lower variations and maximum-iteration-layers for frequent CI checks
  • Reserve comprehensive testing (high values) for pre-deployment gates
  • Use test-case-groups to run targeted tests when debugging specific issues
This action incurs costs from both Circuit Breaker Labs (for safety evaluation) and OpenAI (for model inference). Monitor your usage on both platforms.

Troubleshooting

Authentication Errors

Problem: Error: 401 or authentication failures
Solution:
  • Verify both API keys are correct
  • Ensure secrets are properly configured in GitHub: Settings → Secrets and variables → Actions
  • Check that you’re using ${{ secrets.SECRET_NAME }} syntax

Model Access Issues

Problem: Model not found or permission denied
Solution:
  • Verify the fine-tuned model ID is correct
  • Ensure your OpenAI API key has access to the specified model
  • Check that the model is in a “succeeded” state (not still training or failed)
  • Verify the model hasn’t been deleted

Invalid Model Format

Problem: Invalid model name errors
Solution:
  • Ensure you’re using the full model identifier from OpenAI
  • Check the format: ft:<base-model>:<org>:<name>:<suffix>
  • Copy the exact model ID from OpenAI’s fine-tuning dashboard

High Failure Rates

Problem: Fine-tuned model fails safety tests
Solution:
  • Review failed case details in the action output
  • Compare results with base model testing
  • Review your fine-tuning training data for safety issues
  • Consider adding safety examples to your training data
  • Test with different model sizes or base models

Fine-Tune vs System Prompt

This action differs from singleturn-evaluate-system-prompt in key ways:
| Aspect | Fine-Tune Action | System Prompt Action |
| --- | --- | --- |
| Model Source | OpenAI fine-tuned models | Any OpenRouter model |
| Authentication | Requires both CBL + OpenAI keys | Only requires CBL key |
| Use Case | Testing custom trained models | Testing prompt engineering |
| Model Parameter | model-name (full fine-tune ID) | system-prompt (text) + openrouter-model-name |
| API Endpoint | /single-turn/evaluate-openai-fine-tune | /singleturn/evaluate-system-prompt |

API Reference

For detailed API documentation, see:
