
Overview

The Skill Creator skill enables you to build, test, and optimize OpenCode skills through an iterative development process. It supports creating skills from scratch, improving existing skills, running evaluations, benchmarking performance, and optimizing skill descriptions for better triggering accuracy.

When to Use This Skill

Use the Skill Creator skill when you need to:
  • Create a new skill from scratch
  • Update or optimize an existing skill
  • Run evaluations to test skill effectiveness
  • Benchmark skill performance with variance analysis
  • Optimize a skill’s description for better triggering accuracy
  • Convert a manual workflow into a reusable skill

High-Level Workflow

The skill creation process follows an iterative cycle:
  1. Decide and Plan - Determine what the skill should do and roughly how it should work
  2. Write Draft - Create the initial SKILL.md with frontmatter and instructions
  3. Test - Create test prompts and run Claude with access to the skill
  4. Evaluate - Review results qualitatively and quantitatively using the eval viewer
  5. Iterate - Rewrite the skill based on feedback and repeat until satisfied
  6. Optimize - Run description optimization to improve triggering accuracy
The process is flexible; you can jump in at any stage. If a user already has a draft, start with testing and iteration. If they just want to “vibe,” skip formal evaluations.

Creating a Skill

Capture Intent

Start by understanding the user’s intent. The current conversation might already contain a workflow to capture. Key questions to answer:
  1. What should this skill enable Claude to do?
  2. When should this skill trigger? (what user phrases/contexts)
  3. What’s the expected output format?
  4. Should we set up test cases?
Skills with objectively verifiable outputs benefit from test cases:
  • File transforms
  • Data extraction
  • Code generation
  • Fixed workflow steps
Skills with subjective outputs often don’t need them:
  • Writing style improvements
  • Creative content
  • Art generation
Suggest the appropriate default based on the skill type, but let the user decide.

Interview and Research

Proactively ask questions about:
  • Edge cases and error scenarios
  • Input/output formats and examples
  • Example files or data
  • Success criteria
  • Dependencies and required tools
Check available MCPs for relevant research capabilities. Come prepared with context to reduce burden on the user.

Write the SKILL.md

Based on the interview, create a skill with these components.

YAML Frontmatter (Required):
---
name: skill-identifier
description: When to trigger and what it does. Include specific contexts for use.
compatibility: Required tools, dependencies (optional)
---
Important: Claude tends to “undertrigger” skills. Make descriptions slightly “pushy” by including all contexts where the skill would be useful, not just the primary use case.

Example: Instead of “How to build a dashboard”, use “How to build a dashboard to display data. Use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display company data, even if they don’t explicitly ask for a ‘dashboard.’”

Skill Anatomy

skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons)

Progressive Disclosure

Skills use a three-level loading system:
  1. Metadata (name + description) - Always in context (~100 words)
  2. SKILL.md body - In context when skill triggers (<500 lines ideal)
  3. Bundled resources - Loaded as needed (unlimited size)
  • Keep SKILL.md under 500 lines when possible
  • Reference external files clearly with guidance on when to read them
  • For large reference files (>300 lines), include a table of contents
  • Scripts can execute without loading into context
Domain organization - When supporting multiple frameworks:
cloud-deploy/
├── SKILL.md (workflow + selection logic)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
Claude reads only the relevant reference file.

Writing Patterns

Write instructions in the imperative form. To define an output format, give an exact template:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication

**Example 2:**
Input: Fixed bug where login form crashed
Output: fix(auth): prevent crash on empty login form submission

Writing Style

Explain why things are important rather than using heavy-handed “MUST” statements. Use theory of mind to make skills general and not narrowly tied to specific examples. Write a draft, then review with fresh eyes and improve.

Running and Evaluating Test Cases

This is a continuous sequence—don’t stop partway through.

Test Case Format

Save test cases to evals/evals.json:
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": [],
      "assertions": []
    }
  ]
}
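A small sketch for loading and sanity-checking this file before a run (field names follow the format shown above; adjust if your schema differs):

```python
# Sketch: load evals.json and verify each test case has the expected fields.
import json

REQUIRED_FIELDS = {"id", "prompt", "expected_output", "files", "assertions"}

def load_evals(path):
    """Load evals.json and raise if any test case is missing a field."""
    with open(path) as f:
        data = json.load(f)
    for case in data["evals"]:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"eval {case.get('id')}: missing fields {missing}")
    return data
```

Catching a malformed test case here is cheaper than discovering it halfway through a batch of subagent runs.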

Workspace Organization

Put results in <skill-name>-workspace/ as a sibling to the skill directory:
skill-name-workspace/
├── iteration-1/
│   ├── eval-0-descriptive-name/
│   │   ├── with_skill/
│   │   │   ├── outputs/
│   │   │   ├── grading.json
│   │   │   └── timing.json
│   │   ├── without_skill/  (or old_skill/)
│   │   │   ├── outputs/
│   │   │   ├── grading.json
│   │   │   └── timing.json
│   │   └── eval_metadata.json
│   └── benchmark.json
└── iteration-2/
    └── ...

Step-by-Step Evaluation Process

Step 1: Spawn All Runs in Parallel

For each test case, spawn two subagents in the same turn.

With-skill run:
  • Execute with skill path provided
  • Save outputs to iteration-N/eval-ID/with_skill/outputs/
Baseline run:
  • New skill: Run without any skill (without_skill/)
  • Existing skill: Run with snapshot of old version (old_skill/)
Create eval_metadata.json for each test case with descriptive names.
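Before spawning the subagents, the workspace layout above can be scaffolded so both runs have somewhere to write. A minimal sketch (the eval name used in the example call is hypothetical):

```python
# Sketch: pre-create iteration-N/eval-ID-name/{with_skill,baseline}/outputs/
# directories and write eval_metadata.json, following the layout above.
import json
from pathlib import Path

def scaffold_eval(workspace, iteration, eval_id, eval_name,
                  baseline="without_skill"):
    """Create the directory pair for one test case and record its metadata."""
    root = (Path(workspace) / f"iteration-{iteration}"
            / f"eval-{eval_id}-{eval_name}")
    for variant in ("with_skill", baseline):
        (root / variant / "outputs").mkdir(parents=True, exist_ok=True)
    # Record which test case this directory belongs to.
    (root / "eval_metadata.json").write_text(
        json.dumps({"eval_id": eval_id, "eval_name": eval_name}, indent=2))
    return root

# scaffold_eval("skill-name-workspace", 1, 0, "csv-extraction")
```

For an existing skill, pass `baseline="old_skill"` to match the naming convention above.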
Step 2: Draft Assertions While Runs Progress

Don’t sit idle while the runs execute; draft quantitative assertions:
{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Output file exists at expected path",
      "type": "file_exists"
    },
    {
      "text": "Generated CSV contains all required columns",
      "type": "custom"
    }
  ]
}
Good assertions are objectively verifiable with descriptive names. Avoid forcing assertions onto subjective skills.
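A minimal grader for the `file_exists` assertion type might look like the sketch below. The exact check (any file present under `outputs/`) is an assumption for illustration; `custom` assertions are left unevaluated for manual or LLM grading:

```python
# Sketch: grade a run's assertions and return a grading.json-shaped dict.
from pathlib import Path

def grade_run(run_dir, assertions):
    """Check each assertion against a run's outputs/ directory."""
    outputs = Path(run_dir) / "outputs"
    results = []
    for a in assertions:
        if a["type"] == "file_exists":
            # Illustrative check: did the run produce any output file at all?
            found = outputs.exists() and any(outputs.rglob("*"))
            results.append({"text": a["text"], "passed": found,
                            "evidence": f"checked {outputs}"})
        else:
            # "custom" assertions need their own logic or a judge model.
            results.append({"text": a["text"], "passed": None,
                            "evidence": "custom assertion: graded manually"})
    return {"expectations": results}
```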
Step 3: Capture Timing Data

When subagent tasks complete, save the notification data immediately:
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
This data only comes through task notifications and isn’t persisted elsewhere.
Step 4: Grade, Aggregate, and Launch Viewer

Once all runs complete:
  1. Grade each run against assertions, save to grading.json:
    {
      "expectations": [
        {
          "text": "Output file exists",
          "passed": true,
          "evidence": "File found at outputs/result.csv"
        }
      ]
    }
    
  2. Aggregate benchmark data:
    python -m scripts.aggregate_benchmark \
      workspace/iteration-N \
      --skill-name skill-name
    
  3. Analyze results - Surface patterns hidden in aggregate stats
  4. Launch the viewer:
    nohup python scripts/eval-viewer/generate_review.py \
      workspace/iteration-N \
      --skill-name "my-skill" \
      --benchmark workspace/iteration-N/benchmark.json \
      > /dev/null 2>&1 &
    VIEWER_PID=$!
    
    For iteration 2+, add --previous-workspace workspace/iteration-N-1
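As an illustration of what aggregation computes (the real `scripts.aggregate_benchmark` also folds in timing and token data), a minimal pass-rate aggregator over `grading.json` files might look like:

```python
# Illustration only: per-configuration pass rates across all evals in an
# iteration directory, using the workspace layout described above.
import json
from pathlib import Path

def aggregate(iteration_dir):
    """Sum passed/total expectations per configuration and compute pass rates."""
    summary = {}
    for grading in Path(iteration_dir).glob("eval-*/*/grading.json"):
        config = grading.parent.name  # with_skill / without_skill / old_skill
        exps = json.loads(grading.read_text())["expectations"]
        stats = summary.setdefault(config, {"passed": 0, "total": 0})
        stats["passed"] += sum(1 for e in exps if e["passed"])
        stats["total"] += len(exps)
    return {cfg: {**s, "pass_rate": s["passed"] / s["total"] if s["total"] else None}
            for cfg, s in summary.items()}
```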
Step 5: Read Feedback

When the user finishes reviewing, read feedback.json:
{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "..."
    },
    {
      "run_id": "eval-1-with_skill",
      "feedback": "",
      "timestamp": "..."
    }
  ],
  "status": "complete"
}
Empty feedback means the user approved. Focus improvements on cases with specific complaints.
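Reading the feedback file can be sketched as follows (field names follow the example above):

```python
# Sketch: return only the runs the user actually complained about.
import json

def runs_needing_work(feedback_path):
    """Return {run_id: feedback} for reviews with non-empty feedback."""
    with open(feedback_path) as f:
        data = json.load(f)
    if data.get("status") != "complete":
        raise RuntimeError("review is not finished yet")
    # Empty feedback strings mean the user approved that run.
    return {r["run_id"]: r["feedback"]
            for r in data["reviews"] if r["feedback"].strip()}
```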

What Users See in the Viewer

Outputs Tab:
  • Prompt that was given
  • Files produced by the skill, rendered inline
  • Previous output (iteration 2+, collapsed)
  • Formal grades (if grading ran, collapsed)
  • Feedback textbox (auto-saves)
  • Previous feedback (iteration 2+)
Benchmark Tab:
  • Pass rates, timing, and token usage for each configuration
  • Per-eval breakdowns
  • Analyst observations
Headless environments: Use --static <output_path> to write standalone HTML instead of starting a server. Feedback downloads as feedback.json when the user clicks “Submit All Reviews”.

Improving the Skill

This is the heart of the iteration loop.

Key Principles

Skills will be used millions of times across many prompts. Don’t overfit to test examples. Instead of fiddly changes or oppressive constraints, try:
  • Different metaphors or working patterns
  • Explaining the underlying principles
  • Removing unnecessary restrictions
Remove things that aren’t pulling their weight. Read transcripts, not just final outputs. If the skill makes the model waste time on unproductive tasks, remove those parts and test the result.
Today’s LLMs are smart with good theory of mind. When given good context, they can go beyond rote instructions.
  • Explain why behind every instruction
  • If you’re writing ALWAYS or NEVER in all caps, that’s a yellow flag
  • Reframe and explain the reasoning instead
  • Help the model understand why something is important
Read transcripts from test runs. If all test cases resulted in the subagent writing similar helper scripts (e.g., create_docx.py, build_chart.py), bundle that script:
  • Write it once
  • Put it in scripts/
  • Tell the skill to use it
  • Save every future invocation from reinventing the wheel

The Iteration Loop

  1. Apply Improvements - Edit the skill based on feedback and analysis
  2. Rerun Test Cases - Create a new iteration-N+1/ directory with all test cases and baseline runs
  3. Launch Viewer - Use --previous-workspace to show iteration-to-iteration comparison
  4. Review and Repeat - Continue until the user is happy, feedback is empty, or progress plateaus

Description Optimization

The description field in SKILL.md frontmatter determines whether Claude invokes a skill. After creating or improving a skill, optimize the description for better triggering accuracy.

Generate Trigger Eval Queries

Create 20 realistic eval queries—mix of should-trigger and should-not-trigger:
[
  {
    "query": "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think",
    "should_trigger": true
  },
  {
    "query": "another realistic prompt with context",
    "should_trigger": false
  }
]
Good queries are concrete and specific with:
  • File paths and names
  • Personal context about the user’s job or situation
  • Column names, values, company names, URLs
  • Casual speech, lowercase, abbreviations, typos
  • Varying lengths with focus on edge cases
Bad queries are abstract:
  • “Format this data”
  • “Extract text from PDF”
  • “Create a chart”
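Before running the full optimization loop, a query set can be sanity-checked by scoring trigger accuracy directly. In this sketch, `did_trigger` is a placeholder for actually running Claude with the candidate description:

```python
# Sketch: fraction of queries where the observed trigger decision matches
# the labeled should_trigger value.
def trigger_accuracy(eval_set, did_trigger):
    """Score a trigger predicate against labeled queries."""
    correct = sum(1 for q in eval_set
                  if did_trigger(q["query"]) == q["should_trigger"])
    return correct / len(eval_set)
```

A description that scores well on should-trigger queries but poorly on should-not-trigger queries is overtriggering, and vice versa.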

Query Types

Should-trigger queries (8-10):
  • Different phrasings of the same intent (formal and casual)
  • Cases where user doesn’t explicitly name the skill
  • Uncommon use cases
  • Competitive cases where this skill should win
Should-not-trigger queries (8-10):
  • Near-misses sharing keywords but needing something different
  • Adjacent domains
  • Ambiguous phrasing where naive keyword match would trigger but shouldn’t
  • Cases where query touches on skill functionality but another tool is more appropriate

Run Optimization Loop

python -m scripts.run_loop \
  --eval-set path/to/trigger-eval.json \
  --skill-path path/to/skill \
  --model model-id \
  --max-iterations 5 \
  --verbose
This script:
  • Splits eval set into 60% train and 40% test
  • Evaluates current description (3 runs per query for reliability)
  • Calls Claude with extended thinking to propose improvements
  • Re-evaluates on both train and test sets
  • Iterates up to 5 times
  • Opens HTML report showing results
  • Returns JSON with best_description (selected by test score to avoid overfitting)
Skills appear in Claude’s available_skills list with name + description. Claude decides whether to consult a skill based on that description.

Important: Claude only consults skills for tasks it can’t easily handle on its own. Simple one-step queries like “read this PDF” may not trigger even if the description matches perfectly. Your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Complex, multi-step, or specialized queries reliably trigger skills when descriptions match.
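The 60/40 split can be sketched as below. Note this is an illustration, not the run_loop implementation; stratifying by label so both splits keep a mix of should-trigger and should-not-trigger queries is an assumption:

```python
# Sketch: split trigger-eval queries into train/test, stratified by label.
import random

def split_eval_set(queries, train_frac=0.6, seed=0):
    """Return (train, test) lists preserving the should_trigger balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test
```

Selecting the best description by test-set score, as run_loop does, guards against overfitting the wording to the training queries.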

Apply Results

Take best_description from JSON output and update SKILL.md frontmatter. Show the user before/after and report the scores.

Environment-Specific Instructions

Full workflow supported:
  • Parallel subagent execution
  • Browser-based eval viewer
  • Quantitative benchmarking
  • Description optimization with claude -p
  • Blind comparison testing

Advanced Features

Blind Comparison

For rigorous comparison between two skill versions:
  1. Give two outputs to an independent agent without identifying which is which
  2. Let it judge quality
  3. Analyze why the winner won
This is optional and most users won’t need it. The human review loop is usually sufficient.

Packaging Skills

python -m scripts.package_skill path/to/skill-folder
Creates a .skill file ready for installation and distribution.

Key Reminders

Core Loop:
  1. Figure out what the skill is about
  2. Draft or edit the skill
  3. Run Claude-with-skill on test prompts
  4. Create benchmark.json and run eval-viewer/generate_review.py
  5. Evaluate outputs with the user
  6. Run quantitative evals
  7. Repeat until satisfied
  8. Package the final skill
