
Overview

The Skill Creator skill enables you to build, test, and optimize OpenCode skills through an iterative development process. It supports creating skills from scratch, improving existing skills, running evaluations, benchmarking performance, and optimizing skill descriptions for better triggering accuracy.

When to Use This Skill

Use the Skill Creator skill when you need to:
  • Create a new skill from scratch
  • Update or optimize an existing skill
  • Run evaluations to test skill effectiveness
  • Benchmark skill performance with variance analysis
  • Optimize a skill’s description for better triggering accuracy
  • Convert a manual workflow into a reusable skill

High-Level Workflow

The skill creation process follows an iterative cycle:
  1. Decide and Plan - Determine what the skill should do and roughly how it should work
  2. Write Draft - Create the initial SKILL.md with frontmatter and instructions
  3. Test - Create test prompts and run Claude with access to the skill
  4. Evaluate - Review results qualitatively and quantitatively using the eval viewer
  5. Iterate - Rewrite the skill based on feedback and repeat until satisfied
  6. Optimize - Run description optimization to improve triggering accuracy
The process is flexible; you can jump in at any stage. If a user already has a draft, start with testing and iteration. If they just want to “vibe,” skip formal evaluations.

Creating a Skill

Capture Intent

Start by understanding the user’s intent. The current conversation might already contain a workflow to capture. Key questions to answer:
  1. What should this skill enable Claude to do?
  2. When should this skill trigger? (what user phrases/contexts)
  3. What’s the expected output format?
  4. Should we set up test cases?
Skills with objectively verifiable outputs benefit from test cases:
  • File transforms
  • Data extraction
  • Code generation
  • Fixed workflow steps
Skills with subjective outputs often don’t need them:
  • Writing style improvements
  • Creative content
  • Art generation
Suggest the appropriate default based on the skill type, but let the user decide.

Interview and Research

Proactively ask questions about:
  • Edge cases and error scenarios
  • Input/output formats and examples
  • Example files or data
  • Success criteria
  • Dependencies and required tools
Check available MCPs for relevant research capabilities. Come prepared with context to reduce burden on the user.

Write the SKILL.md

Based on the interview, create a skill with these components.

YAML Frontmatter (Required):
---
name: skill-identifier
description: When to trigger and what it does. Include specific contexts for use.
compatibility: Required tools, dependencies (optional)
---
Important: Claude tends to “undertrigger” skills. Make descriptions slightly “pushy” by including all contexts where the skill would be useful, not just the primary use case.

Example: Instead of “How to build a dashboard”, use “How to build a dashboard to display data. Use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display company data, even if they don’t explicitly ask for a ‘dashboard.’”

Skill Anatomy

skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons)

Progressive Disclosure

Skills use a three-level loading system:
  1. Metadata (name + description) - Always in context (~100 words)
  2. SKILL.md body - In context when skill triggers (<500 lines ideal)
  3. Bundled resources - Loaded as needed (unlimited size)
  • Keep SKILL.md under 500 lines when possible
  • Reference external files clearly with guidance on when to read them
  • For large reference files (>300 lines), include a table of contents
  • Scripts can execute without loading into context
Domain organization - When supporting multiple frameworks:
cloud-deploy/
├── SKILL.md (workflow + selection logic)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
Claude reads only the relevant reference file.

Writing Patterns

Write instructions in the imperative form. To define an output format, give an exact template:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication

**Example 2:**
Input: Fixed bug where login form crashed
Output: fix(auth): prevent crash on empty login form submission

Writing Style

Explain why things are important rather than using heavy-handed “MUST” statements. Use theory of mind to make skills general and not narrowly tied to specific examples. Write a draft, then review with fresh eyes and improve.

Running and Evaluating Test Cases

This is a continuous sequence—don’t stop partway through.

Test Case Format

Save test cases to evals/evals.json:
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": [],
      "assertions": []
    }
  ]
}
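A small sketch for loading and sanity-checking this file before a run (field names follow the format shown above; adjust if your schema differs):

```python
# Sketch: load evals.json and verify each test case has the expected fields.
import json

REQUIRED_FIELDS = {"id", "prompt", "expected_output", "files", "assertions"}

def load_evals(path):
    """Load evals.json and raise if any test case is missing a field."""
    with open(path) as f:
        data = json.load(f)
    for case in data["evals"]:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"eval {case.get('id')}: missing fields {missing}")
    return data
```

Catching a malformed test case here is cheaper than discovering it halfway through a batch of subagent runs.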

Workspace Organization

Put results in <skill-name>-workspace/ as a sibling to the skill directory:
skill-name-workspace/
├── iteration-1/
│   ├── eval-0-descriptive-name/
│   │   ├── with_skill/
│   │   │   ├── outputs/
│   │   │   ├── grading.json
│   │   │   └── timing.json
│   │   ├── without_skill/  (or old_skill/)
│   │   │   ├── outputs/
│   │   │   ├── grading.json
│   │   │   └── timing.json
│   │   └── eval_metadata.json
│   └── benchmark.json
└── iteration-2/
    └── ...

Step-by-Step Evaluation Process

Step 1: Spawn All Runs in Parallel

For each test case, spawn two subagents in the same turn.

With-skill run:
  • Execute with skill path provided
  • Save outputs to iteration-N/eval-ID/with_skill/outputs/
Baseline run:
  • New skill: Run without any skill (without_skill/)
  • Existing skill: Run with snapshot of old version (old_skill/)
Create eval_metadata.json for each test case with descriptive names.
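Before spawning the subagents, the workspace layout above can be scaffolded so both runs have somewhere to write. A minimal sketch (the eval name used in the example call is hypothetical):

```python
# Sketch: pre-create iteration-N/eval-ID-name/{with_skill,baseline}/outputs/
# directories and write eval_metadata.json, following the layout above.
import json
from pathlib import Path

def scaffold_eval(workspace, iteration, eval_id, eval_name,
                  baseline="without_skill"):
    """Create the directory pair for one test case and record its metadata."""
    root = (Path(workspace) / f"iteration-{iteration}"
            / f"eval-{eval_id}-{eval_name}")
    for variant in ("with_skill", baseline):
        (root / variant / "outputs").mkdir(parents=True, exist_ok=True)
    # Record which test case this directory belongs to.
    (root / "eval_metadata.json").write_text(
        json.dumps({"eval_id": eval_id, "eval_name": eval_name}, indent=2))
    return root

# scaffold_eval("skill-name-workspace", 1, 0, "csv-extraction")
```

For an existing skill, pass `baseline="old_skill"` to match the naming convention above.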
Step 2: Draft Assertions While Runs Progress

Don’t sit idle while the runs execute; draft quantitative assertions:
{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Output file exists at expected path",
      "type": "file_exists"
    },
    {
      "text": "Generated CSV contains all required columns",
      "type": "custom"
    }
  ]
}
Good assertions are objectively verifiable with descriptive names. Avoid forcing assertions onto subjective skills.
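A minimal grader for the `file_exists` assertion type might look like the sketch below. The exact check (any file present under `outputs/`) is an assumption for illustration; `custom` assertions are left unevaluated for manual or LLM grading:

```python
# Sketch: grade a run's assertions and return a grading.json-shaped dict.
from pathlib import Path

def grade_run(run_dir, assertions):
    """Check each assertion against a run's outputs/ directory."""
    outputs = Path(run_dir) / "outputs"
    results = []
    for a in assertions:
        if a["type"] == "file_exists":
            # Illustrative check: did the run produce any output file at all?
            found = outputs.exists() and any(outputs.rglob("*"))
            results.append({"text": a["text"], "passed": found,
                            "evidence": f"checked {outputs}"})
        else:
            # "custom" assertions need their own logic or a judge model.
            results.append({"text": a["text"], "passed": None,
                            "evidence": "custom assertion: graded manually"})
    return {"expectations": results}
```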
Step 3: Capture Timing Data

When subagent tasks complete, save the notification data immediately:
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
This data only comes through task notifications and isn’t persisted elsewhere.
Step 4: Grade, Aggregate, and Launch Viewer

Once all runs complete:
  1. Grade each run against assertions, save to grading.json:
    {
      "expectations": [
        {
          "text": "Output file exists",
          "passed": true,
          "evidence": "File found at outputs/result.csv"
        }
      ]
    }
    
  2. Aggregate benchmark data:
    python -m scripts.aggregate_benchmark \
      workspace/iteration-N \
      --skill-name skill-name
    
  3. Analyze results - Surface patterns hidden in aggregate stats
  4. Launch the viewer:
    nohup python scripts/eval-viewer/generate_review.py \
      workspace/iteration-N \
      --skill-name "my-skill" \
      --benchmark workspace/iteration-N/benchmark.json \
      > /dev/null 2>&1 &
    VIEWER_PID=$!
    
    For iteration 2+, add --previous-workspace workspace/iteration-N-1
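As an illustration of what aggregation computes (the real `scripts.aggregate_benchmark` also folds in timing and token data), a minimal pass-rate aggregator over `grading.json` files might look like:

```python
# Illustration only: per-configuration pass rates across all evals in an
# iteration directory, using the workspace layout described above.
import json
from pathlib import Path

def aggregate(iteration_dir):
    """Sum passed/total expectations per configuration and compute pass rates."""
    summary = {}
    for grading in Path(iteration_dir).glob("eval-*/*/grading.json"):
        config = grading.parent.name  # with_skill / without_skill / old_skill
        exps = json.loads(grading.read_text())["expectations"]
        stats = summary.setdefault(config, {"passed": 0, "total": 0})
        stats["passed"] += sum(1 for e in exps if e["passed"])
        stats["total"] += len(exps)
    return {cfg: {**s, "pass_rate": s["passed"] / s["total"] if s["total"] else None}
            for cfg, s in summary.items()}
```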
Step 5: Read Feedback

When the user finishes reviewing, read feedback.json:
{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "..."
    },
    {
      "run_id": "eval-1-with_skill",
      "feedback": "",
      "timestamp": "..."
    }
  ],
  "status": "complete"
}
Empty feedback means the user approved. Focus improvements on cases with specific complaints.
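Reading the feedback file can be sketched as follows (field names follow the example above):

```python
# Sketch: return only the runs the user actually complained about.
import json

def runs_needing_work(feedback_path):
    """Return {run_id: feedback} for reviews with non-empty feedback."""
    with open(feedback_path) as f:
        data = json.load(f)
    if data.get("status") != "complete":
        raise RuntimeError("review is not finished yet")
    # Empty feedback strings mean the user approved that run.
    return {r["run_id"]: r["feedback"]
            for r in data["reviews"] if r["feedback"].strip()}
```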

What Users See in the Viewer

Outputs Tab:
  • Prompt that was given
  • Files produced by the skill, rendered inline
  • Previous output (iteration 2+, collapsed)
  • Formal grades (if grading ran, collapsed)
  • Feedback textbox (auto-saves)
  • Previous feedback (iteration 2+)
Benchmark Tab:
  • Pass rates, timing, and token usage for each configuration
  • Per-eval breakdowns
  • Analyst observations
Headless environments: Use --static <output_path> to write standalone HTML instead of starting a server. Feedback downloads as feedback.json when the user clicks “Submit All Reviews”.

Improving the Skill

This is the heart of the iteration loop.

Key Principles

Skills will be used millions of times across many prompts. Don’t overfit to test examples. Instead of fiddly changes or oppressive constraints, try:
  • Different metaphors or working patterns
  • Explaining the underlying principles
  • Removing unnecessary restrictions
Remove things that aren’t pulling their weight. Read transcripts, not just final outputs. If the skill makes the model waste time on unproductive tasks, remove those parts and test the result.
Today’s LLMs are smart with good theory of mind. When given good context, they can go beyond rote instructions.
  • Explain why behind every instruction
  • If you’re writing ALWAYS or NEVER in all caps, that’s a yellow flag
  • Reframe and explain the reasoning instead
  • Help the model understand why something is important
Read transcripts from test runs. If all test cases resulted in the subagent writing similar helper scripts (e.g., create_docx.py, build_chart.py), bundle that script:
  • Write it once
  • Put it in scripts/
  • Tell the skill to use it
  • Save every future invocation from reinventing the wheel

The Iteration Loop

  1. Apply Improvements - Edit the skill based on feedback and analysis
  2. Rerun Test Cases - Create a new iteration-N+1/ directory with all test cases and baseline runs
  3. Launch Viewer - Use --previous-workspace to show iteration-to-iteration comparison
  4. Review and Repeat - Continue until the user is happy, feedback is empty, or progress plateaus

Description Optimization

The description field in SKILL.md frontmatter determines whether Claude invokes a skill. After creating or improving a skill, optimize the description for better triggering accuracy.

Generate Trigger Eval Queries

Create 20 realistic eval queries—mix of should-trigger and should-not-trigger:
[
  {
    "query": "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think",
    "should_trigger": true
  },
  {
    "query": "another realistic prompt with context",
    "should_trigger": false
  }
]
Good queries are concrete and specific with:
  • File paths and names
  • Personal context about the user’s job or situation
  • Column names, values, company names, URLs
  • Casual speech, lowercase, abbreviations, typos
  • Varying lengths with focus on edge cases
Bad queries are abstract:
  • “Format this data”
  • “Extract text from PDF”
  • “Create a chart”
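Before running the full optimization loop, a query set can be sanity-checked by scoring trigger accuracy directly. In this sketch, `did_trigger` is a placeholder for actually running Claude with the candidate description:

```python
# Sketch: fraction of queries where the observed trigger decision matches
# the labeled should_trigger value.
def trigger_accuracy(eval_set, did_trigger):
    """Score a trigger predicate against labeled queries."""
    correct = sum(1 for q in eval_set
                  if did_trigger(q["query"]) == q["should_trigger"])
    return correct / len(eval_set)
```

A description that scores well on should-trigger queries but poorly on should-not-trigger queries is overtriggering, and vice versa.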

Query Types

Should-trigger queries (8-10):
  • Different phrasings of the same intent (formal and casual)
  • Cases where user doesn’t explicitly name the skill
  • Uncommon use cases
  • Competitive cases where this skill should win
Should-not-trigger queries (8-10):
  • Near-misses sharing keywords but needing something different
  • Adjacent domains
  • Ambiguous phrasing where naive keyword match would trigger but shouldn’t
  • Cases where query touches on skill functionality but another tool is more appropriate

Run Optimization Loop

python -m scripts.run_loop \
  --eval-set path/to/trigger-eval.json \
  --skill-path path/to/skill \
  --model model-id \
  --max-iterations 5 \
  --verbose
This script:
  • Splits eval set into 60% train and 40% test
  • Evaluates current description (3 runs per query for reliability)
  • Calls Claude with extended thinking to propose improvements
  • Re-evaluates on both train and test sets
  • Iterates up to 5 times
  • Opens HTML report showing results
  • Returns JSON with best_description (selected by test score to avoid overfitting)
Skills appear in Claude’s available_skills list with name + description. Claude decides whether to consult a skill based on that description.

Important: Claude only consults skills for tasks it can’t easily handle on its own. Simple one-step queries like “read this PDF” may not trigger even if the description matches perfectly. Your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Complex, multi-step, or specialized queries reliably trigger skills when descriptions match.
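The 60/40 split can be sketched as below. Note this is an illustration, not the run_loop implementation; stratifying by label so both splits keep a mix of should-trigger and should-not-trigger queries is an assumption:

```python
# Sketch: split trigger-eval queries into train/test, stratified by label.
import random

def split_eval_set(queries, train_frac=0.6, seed=0):
    """Return (train, test) lists preserving the should_trigger balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test
```

Selecting the best description by test-set score, as run_loop does, guards against overfitting the wording to the training queries.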

Apply Results

Take best_description from JSON output and update SKILL.md frontmatter. Show the user before/after and report the scores.

Environment-Specific Instructions

Full workflow supported:
  • Parallel subagent execution
  • Browser-based eval viewer
  • Quantitative benchmarking
  • Description optimization with claude -p
  • Blind comparison testing

Advanced Features

Blind Comparison

For rigorous comparison between two skill versions:
  1. Give two outputs to an independent agent without identifying which is which
  2. Let it judge quality
  3. Analyze why the winner won
This is optional and most users won’t need it. The human review loop is usually sufficient.

Packaging Skills

python -m scripts.package_skill path/to/skill-folder
Creates a .skill file ready for installation and distribution.

Key Reminders

Core Loop:
  1. Figure out what the skill is about
  2. Draft or edit the skill
  3. Run Claude-with-skill on test prompts
  4. Create benchmark.json and run eval-viewer/generate_review.py
  5. Evaluate outputs with the user
  6. Run quantitative evals
  7. Repeat until satisfied
  8. Package the final skill
