## Overview
Evaluate OpenAI fine-tuned models to ensure they meet safety and quality standards before deploying to production. These examples work with any OpenAI fine-tuned model, including GPT-4 and GPT-3.5 fine-tunes.

## Basic Single-Turn Fine-Tune Evaluation

Evaluate a fine-tuned model with a simple workflow:

```yaml
name: Evaluate Fine-Tune

on:
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run fine-tune evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "1"
          maximum-iteration-layers: "1"
          model-name: "ft:gpt-4o-2024-08-06:org-name:model-name:abc123"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
The `model-name` must be the fully qualified fine-tune ID from OpenAI, which looks like `ft:gpt-4o-2024-08-06:org-name:model-name:abc123`.
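If you are unsure of the exact ID, the OpenAI fine-tuning jobs API reports it for each completed job. A minimal lookup step, assuming `jq` is available on the runner (it is preinstalled on `ubuntu-latest`) and your API key is stored as a secret:

```yaml
- name: List recent fine-tuned model IDs
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    # Completed jobs report the fully qualified ft:... ID in the
    # fine_tuned_model field; unfinished jobs report null.
    curl -s "https://api.openai.com/v1/fine_tuning/jobs?limit=10" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      | jq -r '.data[].fine_tuned_model'
```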
## Evaluating After Fine-Tune Completion

Automatically evaluate a model after fine-tuning completes. This workflow assumes you have a fine-tuning job that outputs the model ID:

```yaml
name: Fine-Tune and Evaluate
on:
  workflow_dispatch:
    inputs:
      training_file_id:
        description: 'OpenAI training file ID'
        required: true

jobs:
  fine-tune:
    runs-on: ubuntu-latest
    outputs:
      model-id: ${{ steps.finetune.outputs.model-id }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Create fine-tuning job
        id: finetune
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          # Create the fine-tuning job
          response=$(curl -s https://api.openai.com/v1/fine_tuning/jobs \
            -H "Content-Type: application/json" \
            -H "Authorization: Bearer $OPENAI_API_KEY" \
            -d '{
              "training_file": "${{ github.event.inputs.training_file_id }}",
              "model": "gpt-4o-2024-08-06"
            }')
          job_id=$(echo "$response" | jq -r '.id')
          echo "job-id=$job_id" >> "$GITHUB_OUTPUT"

          # Wait for completion (simplified - production should poll)
          sleep 3600 # Wait 1 hour

          # Get the completed model ID
          job_status=$(curl -s "https://api.openai.com/v1/fine_tuning/jobs/$job_id" \
            -H "Authorization: Bearer $OPENAI_API_KEY")
          model_id=$(echo "$job_status" | jq -r '.fine_tuned_model')
          echo "model-id=$model_id" >> "$GITHUB_OUTPUT"

  evaluate:
    needs: fine-tune
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Evaluate fine-tuned model
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          model-name: ${{ needs.fine-tune.outputs.model-id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
In production, use proper polling logic instead of `sleep` to wait for fine-tuning completion. Consider using OpenAI's webhooks or the fine-tuning events API.
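A sketch of what such a polling step could look like, replacing the `sleep` and the final lookup in the example above (it assumes the job-creation step exports the `job-id` output as shown; the 60-second interval is arbitrary, and the job's `outputs` mapping would then reference `steps.wait.outputs.model-id`):

```yaml
- name: Wait for fine-tuning to complete
  id: wait
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    job_id="${{ steps.finetune.outputs.job-id }}"
    # Poll every 60 seconds until the job reaches a terminal state.
    while true; do
      job=$(curl -s "https://api.openai.com/v1/fine_tuning/jobs/$job_id" \
        -H "Authorization: Bearer $OPENAI_API_KEY")
      status=$(echo "$job" | jq -r '.status')
      case "$status" in
        succeeded) break ;;
        failed|cancelled)
          echo "Fine-tuning job ended with status: $status" >&2
          exit 1 ;;
        *) sleep 60 ;;
      esac
    done
    # Export the completed model ID for downstream jobs.
    echo "model-id=$(echo "$job" | jq -r '.fine_tuned_model')" >> "$GITHUB_OUTPUT"
```

Note that jobs on GitHub-hosted runners are capped at six hours, so very long fine-tunes may need a different trigger design, such as a webhook-driven workflow.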
## Multi-Turn Fine-Tune Evaluation

For conversational fine-tuned models, use multi-turn evaluation:

```yaml
name: Evaluate Multi-Turn Fine-Tune

on:
  workflow_dispatch:
    inputs:
      model_id:
        description: 'Fine-tuned model ID'
        required: true

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run multi-turn fine-tune evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift instruction_following"
          model-name: ${{ github.event.inputs.model_id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
`max-turns` must be an even number (2, 4, 6, 8, etc.) to ensure complete conversation exchanges.

## Comparing Models
Evaluate and compare multiple fine-tuned models in parallel:

```yaml
name: Compare Fine-Tuned Models

on:
  workflow_dispatch:

jobs:
  evaluate-baseline:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Evaluate baseline model
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          model-name: "ft:gpt-4o-2024-08-06:org:baseline:ver1"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

  evaluate-candidate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Evaluate candidate model
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          model-name: "ft:gpt-4o-2024-08-06:org:candidate:ver2"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
Run model comparisons in parallel jobs to save time. Review the Circuit Breaker Labs dashboard to compare results side-by-side.
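The two jobs above differ only in `model-name`, so a job matrix can express the same comparison with less duplication; matrix jobs run in parallel by default. A sketch under that assumption, reusing the model IDs from the example:

```yaml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model:
          - "ft:gpt-4o-2024-08-06:org:baseline:ver1"
          - "ft:gpt-4o-2024-08-06:org:candidate:ver2"
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Evaluate model
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.80"
          fail-case-threshold: "0.5"
          variations: "3"
          maximum-iteration-layers: "2"
          model-name: ${{ matrix.model }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```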
## Reading Model Configuration
Store model IDs and evaluation parameters in a configuration file. First, create `finetune_config.json`:

```json
{
  "model_id": "ft:gpt-4o-2024-08-06:org-name:model-name:abc123",
  "evaluation": {
    "variations": 3,
    "maximum_iteration_layers": 2,
    "fail_action_threshold": 0.85,
    "fail_case_threshold": 0.5
  }
}
```
Then create the workflow that reads it:

```yaml
name: Evaluate Fine-Tune from Config

on:
  push:
    paths:
      - "finetune_config.json"
  workflow_dispatch:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Read configuration
        id: config
        run: |
          # Extract the model ID and evaluation settings from the config file
          MODEL_ID=$(jq -r '.model_id' finetune_config.json)
          VARIATIONS=$(jq -r '.evaluation.variations' finetune_config.json)
          LAYERS=$(jq -r '.evaluation.maximum_iteration_layers' finetune_config.json)
          FAIL_ACTION=$(jq -r '.evaluation.fail_action_threshold' finetune_config.json)
          FAIL_CASE=$(jq -r '.evaluation.fail_case_threshold' finetune_config.json)
          echo "model-id=$MODEL_ID" >> "$GITHUB_OUTPUT"
          echo "variations=$VARIATIONS" >> "$GITHUB_OUTPUT"
          echo "layers=$LAYERS" >> "$GITHUB_OUTPUT"
          echo "fail-action=$FAIL_ACTION" >> "$GITHUB_OUTPUT"
          echo "fail-case=$FAIL_CASE" >> "$GITHUB_OUTPUT"

      - name: Evaluate fine-tuned model
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: ${{ steps.config.outputs.fail-action }}
          fail-case-threshold: ${{ steps.config.outputs.fail-case }}
          variations: ${{ steps.config.outputs.variations }}
          maximum-iteration-layers: ${{ steps.config.outputs.layers }}
          model-name: ${{ steps.config.outputs.model-id }}
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
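A malformed config file would otherwise surface as a confusing failure later in the run, so it can be worth validating it before reading values. A minimal sketch, relying on `jq -e` setting its exit status from the query result:

```yaml
- name: Validate configuration
  run: |
    # Fail fast if any required field is missing or null.
    jq -e '.model_id and .evaluation.fail_action_threshold and .evaluation.fail_case_threshold' \
      finetune_config.json > /dev/null \
      || { echo "finetune_config.json is missing required fields" >&2; exit 1; }
```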
## Scheduled Model Evaluation

Run regular evaluations to catch model regressions:

```yaml
name: Scheduled Model Evaluation

on:
  schedule:
    # Run every 6 hours
    - cron: '0 */6 * * *'
  workflow_dispatch:

jobs:
  evaluate-production-model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          variations: "5"
          maximum-iteration-layers: "3"
          model-name: "ft:gpt-4o-2024-08-06:org:production:current"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

      - name: Multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.90"
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift consistency"
          model-name: "ft:gpt-4o-2024-08-06:org:production:current"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
Scheduled evaluations help detect degradation over time. Set stricter thresholds (`0.90` instead of `0.80`) for production models.
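Scheduled runs fail silently unless someone is watching the Actions tab, so consider surfacing regressions actively. One sketch is a final step that writes to the run's summary page, using the standard `if: failure()` condition and `GITHUB_STEP_SUMMARY` file:

```yaml
- name: Report regression
  if: failure()
  run: |
    # Runs only when an earlier step failed; the summary appears
    # on the workflow run page.
    echo "## Scheduled evaluation failed" >> "$GITHUB_STEP_SUMMARY"
    echo "The production model may have regressed. Check the Circuit Breaker Labs dashboard." >> "$GITHUB_STEP_SUMMARY"
```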
## Complete Production Example

A comprehensive workflow for fine-tuned model evaluation:

```yaml
name: Production Fine-Tune Evaluation
on:
  workflow_dispatch:
    inputs:
      model_id:
        description: 'Fine-tuned model ID'
        required: true
      environment:
        description: 'Environment'
        required: true
        type: choice
        options:
          - staging
          - production
  push:
    branches:
      - main
    paths:
      - "finetune_config.json"

jobs:
  load-config:
    runs-on: ubuntu-latest
    outputs:
      model-id: ${{ steps.get-model.outputs.model-id }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Get model ID
        id: get-model
        run: |
          # Prefer a manually supplied model ID; fall back to the config file
          if [ -n "${{ github.event.inputs.model_id }}" ]; then
            echo "model-id=${{ github.event.inputs.model_id }}" >> "$GITHUB_OUTPUT"
          else
            MODEL_ID=$(jq -r '.model_id' finetune_config.json)
            echo "model-id=$MODEL_ID" >> "$GITHUB_OUTPUT"
          fi

  evaluate-single-turn:
    needs: load-config
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run single-turn evaluation
        uses: circuitbreakerlabs/actions/singleturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          variations: "5"
          maximum-iteration-layers: "3"
          model-name: ${{ needs.load-config.outputs.model-id }}
          test-case-groups: "security compliance safety"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

  evaluate-multi-turn:
    needs: load-config
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Run multi-turn evaluation
        uses: circuitbreakerlabs/actions/multiturn-evaluate-openai-finetune@v1
        with:
          fail-action-threshold: "0.85"
          fail-case-threshold: "0.5"
          max-turns: "6"
          test-types: "jailbreak context_shift instruction_following"
          model-name: ${{ needs.load-config.outputs.model-id }}
          test-case-groups: "security compliance safety"
          circuit-breaker-labs-api-key: ${{ secrets.CBL_API_KEY }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```
This production example supports both manual input and a configuration file, runs single-turn and multi-turn evaluations in parallel, and filters for critical test case groups.
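If this workflow feeds a deployment pipeline, a gate job that depends on both evaluations makes the promotion decision explicit and gives branch protection a single check to require. A sketch added alongside the jobs above (the `echo` is a placeholder for your own release step):

```yaml
deploy-gate:
  needs: [evaluate-single-turn, evaluate-multi-turn]
  runs-on: ubuntu-latest
  steps:
    - name: Evaluations passed
      run: echo "Model cleared for ${{ github.event.inputs.environment || 'staging' }} deployment."
```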