Overview
Testing skills is TDD applied to process documentation. You run scenarios without the skill (RED — watch the agent fail), write the skill addressing those failures (GREEN — watch the agent comply), then close loopholes (REFACTOR — stay compliant). Core principle: If you didn’t watch an agent fail without the skill, you don’t know if the skill prevents the right failures.When to test
Test skills that:- Enforce discipline (TDD, verification requirements, design-before-coding)
- Have compliance costs — they require time, effort, or rework
- Could be rationalized away (“just this once”)
- Contradict immediate goals (quality over speed)
- Pure reference skills (API docs, syntax guides)
- Skills without rules to violate
- Skills agents have no incentive to bypass
The TDD mapping
| TDD Phase | Skill Testing | What You Do |
|---|---|---|
| RED | Baseline test | Run scenario WITHOUT skill — watch agent fail |
| Verify RED | Capture rationalizations | Document exact failures verbatim |
| GREEN | Write skill | Address the specific baseline failures |
| Verify GREEN | Pressure test | Run scenario WITH skill — verify compliance |
| REFACTOR | Plug holes | Find new rationalizations, add counters |
| Stay GREEN | Re-verify | Test again — ensure still compliant |
RED phase: establish baseline
Run the scenario without the skill loaded. Your goal is to watch the agent fail and document exactly what happens. Process:- Create pressure scenarios (3+ combined pressures)
- Run WITHOUT the skill — give the agent a realistic task with pressures
- Document choices and rationalizations word-for-word
- Identify patterns — which excuses appear repeatedly?
- Note effective pressures — which scenarios trigger violations?
- “I already manually tested it”
- “Tests after achieve the same goals”
- “Deleting is wasteful”
- “Being pragmatic, not dogmatic”
GREEN phase: write the skill
Write a skill that addresses the specific rationalizations you documented. Don’t add content for hypothetical cases — address only the actual failures you observed. Run the same scenarios WITH the skill. The agent should now comply. If the agent still fails, the skill is unclear or incomplete. Revise and re-test.Writing good pressure scenarios
The quality of your test scenario determines whether you’re actually testing compliance or just testing recall. Bad scenario (no pressure):Pressure types
| Pressure | Example |
|---|---|
| Time | Emergency, deadline, deploy window closing |
| Sunk cost | Hours of work, “waste” to delete |
| Authority | Senior says skip it, manager overrides |
| Economic | Job, promotion, company survival at stake |
| Exhaustion | End of day, already tired, want to go home |
| Social | Looking dogmatic, seeming inflexible |
| Pragmatic | ”Being pragmatic vs dogmatic” |
Elements of good scenarios
- Concrete options — force an A/B/C choice, not open-ended responses
- Real constraints — specific times, actual consequences
- Real file paths —
/tmp/payment-system, not “a project” - Make the agent act — “What do you do?” not “What should you do?”
- No easy outs — can’t defer to “I’d ask your human partner” without choosing
REFACTOR phase: close loopholes
When an agent violates the rule despite having the skill, treat it like a test regression. Refactor the skill to prevent the new rationalization. Common rationalizations to watch for:- “This case is different because…”
- “I’m following the spirit not the letter”
- “The PURPOSE is X, and I’m achieving X differently”
- “Being pragmatic means adapting”
- “Keep as reference while writing tests first”
- “I already manually tested it”
Meta-testing: when GREEN isn’t working
After the agent chooses the wrong option even with the skill loaded, ask:How persuasion principles apply
Research (Meincke et al., 2025; N=28,000 AI conversations) found that persuasion techniques more than doubled compliance rates in LLM interactions (33% → 72%). The most effective principles for skill design: Authority — imperative language removes decision fatigue:| Skill type | Recommended | Avoid |
|---|---|---|
| Discipline-enforcing | Authority + Commitment + Social Proof | Liking, Reciprocity |
| Guidance/technique | Moderate Authority + Unity | Heavy authority |
| Reference | Clarity only | All persuasion |
Signs of a bulletproof skill
- Agent chooses the correct option under maximum pressure
- Agent cites specific skill sections as justification
- Agent acknowledges the temptation but follows the rule anyway
- Meta-testing reveals “the skill was clear, I should follow it”
- Agent finds new rationalizations you haven’t addressed
- Agent argues the skill itself is wrong
- Agent creates “hybrid approaches” to comply in spirit
- Agent asks permission but argues strongly for the violation
Real-world example: TDD skill bulletproofing
Common mistakes in skill testing
Writing the skill before testing (skipping RED)
Writing the skill before testing (skipping RED)
This reveals what YOU think needs preventing, not what ACTUALLY needs preventing. Always run baseline scenarios first so the skill addresses real failures.
Using only academic test cases
Using only academic test cases
Running “What does the skill say about X?” only tests recall, not compliance under pressure. Use pressure scenarios that make the agent WANT to violate the rule.
Single-pressure scenarios
Single-pressure scenarios
Agents resist single pressure but break under multiple. Combine at least 3 pressures: time + sunk cost + exhaustion.
Not capturing exact failure wording
Not capturing exact failure wording
“Agent was wrong” doesn’t tell you what to prevent. Document exact rationalizations verbatim — they become your rationalization table.
Adding generic counters
Adding generic counters
“Don’t cheat” doesn’t work. “Don’t keep as reference” does. Add explicit negations for each specific rationalization.
Stopping after the first passing test
Stopping after the first passing test
Tests passing once doesn’t mean bulletproof. Continue the REFACTOR cycle until no new rationalizations emerge across multiple runs.
Pre-deployment checklist
RED phase:- Created pressure scenarios (3+ combined pressures)
- Ran scenarios WITHOUT the skill (baseline)
- Documented agent failures and rationalizations verbatim
- Wrote skill addressing specific baseline failures
- Ran scenarios WITH the skill
- Agent now complies
- Identified NEW rationalizations from testing
- Added explicit counters for each loophole
- Updated rationalization table
- Updated red flags list
- Updated description with violation symptoms
- Re-tested — agent still complies
- Meta-tested to verify clarity
- Agent follows rule under maximum pressure
Writing Skills
Create a skill from scratch using TDD principles
Submitting a Skill
Contribute your tested skill to the Superpowers project