Overview
The Skill Creator skill enables you to build, test, and optimize OpenCode skills through an iterative development process. It supports creating skills from scratch, improving existing skills, running evaluations, benchmarking performance, and optimizing skill descriptions for better triggering accuracy.

When to Use This Skill
Use the Skill Creator skill when you need to:
- Create a new skill from scratch
- Update or optimize an existing skill
- Run evaluations to test skill effectiveness
- Benchmark skill performance with variance analysis
- Optimize a skill’s description for better triggering accuracy
- Convert a manual workflow into a reusable skill
High-Level Workflow
The skill creation process follows an iterative cycle. The process is flexible: you can jump in at any stage. If a user already has a draft, start with testing and iteration. If they just want to “vibe,” skip formal evaluations.
Creating a Skill
Capture Intent
Start by understanding the user’s intent. The current conversation might already contain a workflow to capture. Key questions to answer:
- What should this skill enable Claude to do?
- When should this skill trigger? (what user phrases/contexts)
- What’s the expected output format?
- Should we set up test cases?
When to Use Test Cases
Skills with objectively verifiable outputs benefit from test cases:
- File transforms
- Data extraction
- Code generation
- Fixed workflow steps

Skills with subjective outputs are a poor fit for formal test cases:
- Writing style improvements
- Creative content
- Art generation
Interview and Research
Proactively ask questions about:
- Edge cases and error scenarios
- Input/output formats and examples
- Example files or data
- Success criteria
- Dependencies and required tools
Write the SKILL.md
Based on the interview, create a skill with these components:

YAML Frontmatter (Required)

Important: Claude tends to “undertrigger” skills. Make descriptions slightly “pushy” by including all contexts where the skill would be useful, not just the primary use case.

Example: Instead of “How to build a dashboard”, use “How to build a dashboard to display data. Use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display company data, even if they don’t explicitly ask for a ‘dashboard.’”
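A minimal frontmatter sketch with a suitably “pushy” description (the exact set of supported fields beyond name and description is an assumption):

```yaml
---
name: dashboard-builder
description: >
  How to build a dashboard to display data. Use this skill whenever the
  user mentions dashboards, data visualization, internal metrics, or wants
  to display company data, even if they don't explicitly ask for a "dashboard."
---
```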
Skill Anatomy
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) - Always in context (~100 words)
- SKILL.md body - In context when skill triggers (<500 lines ideal)
- Bundled resources - Loaded as needed (unlimited size)
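A typical skill directory that supports this three-level scheme might look like the following (the layout is illustrative, not prescribed):

```text
my-skill/
├── SKILL.md            # metadata + body
├── references/
│   └── api-guide.md    # bundled resource, loaded only when needed
└── scripts/
    └── build_chart.py  # executed without loading into context
```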
Key Patterns for Organization
- Keep SKILL.md under 500 lines when possible
- Reference external files clearly with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
- Scripts can execute without loading into context
Writing Patterns
Prefer using the imperative form in instructions, and define output formats explicitly in the skill body.

Writing Style
Explain why things are important rather than using heavy-handed “MUST” statements. Use theory of mind to make skills general and not narrowly tied to specific examples. Write a draft, then review with fresh eyes and improve.
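As a concrete instance of defining an output format in the imperative style described above (a hypothetical report skill, not part of this document):

```markdown
Write the summary as exactly three sections, in this order:

## Findings
One bullet per issue, most severe first.

## Evidence
Quote the relevant transcript lines for each finding.

## Recommendation
One paragraph. State the single next action to take.
```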
Running and Evaluating Test Cases
This is a continuous sequence; don’t stop partway through.

Test Case Format

Save test cases to evals/evals.json:
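A sketch of what evals/evals.json could contain (the field names are assumptions; adapt them to whatever schema your eval runner expects):

```json
[
  {
    "id": "csv-to-report",
    "prompt": "Turn data/q3_sales.csv into a one-page summary report",
    "files": ["data/q3_sales.csv"],
    "assertions": [
      "output contains a report.md file",
      "report mentions total revenue"
    ]
  }
]
```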
Workspace Organization
Put results in <skill-name>-workspace/ as a sibling to the skill directory:
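One way the workspace could be laid out, using the paths this document names (other subdirectory names are assumptions):

```text
my-skill/
my-skill-workspace/
└── iteration-1/
    ├── eval-csv-to-report/
    │   ├── with_skill/outputs/
    │   └── without_skill/outputs/
    └── benchmark.json
```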
Step-by-Step Evaluation Process
Spawn All Runs in Parallel
For each test case, spawn two subagents in the same turn:

With-skill run:
- Execute with skill path provided
- Save outputs to iteration-N/eval-ID/with_skill/outputs/

Baseline run:
- New skill: Run without any skill (without_skill/)
- Existing skill: Run with snapshot of old version (old_skill/)

Create eval_metadata.json for each test case with descriptive names.

Draft Assertions While Runs Progress
Don’t sit idle while the runs progress; draft quantitative assertions. Good assertions are objectively verifiable and have descriptive names. Avoid forcing assertions onto subjective skills.
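A sketch of objectively verifiable assertions with descriptive names (the schema is an assumption):

```json
{
  "assertions": [
    {"name": "output_file_exists", "check": "outputs/ contains report.md"},
    {"name": "totals_match_input", "check": "revenue total equals the sum of the CSV column"},
    {"name": "no_placeholder_text", "check": "report contains no 'TODO' or 'lorem ipsum'"}
  ]
}
```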
Capture Timing Data
When subagent tasks complete, save the notification data immediately. This data only comes through task notifications and isn’t persisted elsewhere.
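The saved notification data might be captured as something like the following (field names are assumptions, based on what the viewer later reports: timing and token usage):

```json
{
  "eval_id": "csv-to-report",
  "config": "with_skill",
  "duration_seconds": 142,
  "input_tokens": 18500,
  "output_tokens": 3200
}
```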
Grade, Aggregate, and Launch Viewer
Once all runs complete:
- Grade each run against assertions and save the results to grading.json
- Aggregate benchmark data
- Analyze results: surface patterns hidden in aggregate stats
- Launch the viewer

For iteration 2+, add --previous-workspace workspace/iteration-N-1 so the viewer can compare against the prior iteration.
What Users See in the Viewer
Outputs Tab:
- Prompt that was given
- Files produced by the skill, rendered inline
- Previous output (iteration 2+, collapsed)
- Formal grades (if grading ran, collapsed)
- Feedback textbox (auto-saves)
- Previous feedback (iteration 2+)
Benchmark Tab:
- Pass rates, timing, and token usage for each configuration
- Per-eval breakdowns
- Analyst observations
Headless environments: Use --static <output_path> to write standalone HTML instead of starting a server. Feedback downloads as feedback.json when the user clicks “Submit All Reviews”.

Improving the Skill
This is the heart of the iteration loop.

Key Principles
1. Generalize from Feedback
Skills will be used millions of times across many prompts. Don’t overfit to test examples. Instead of fiddly changes or oppressive constraints, try:
- Different metaphors or working patterns
- Explaining the underlying principles
- Removing unnecessary restrictions
2. Keep the Prompt Lean
Remove things that aren’t pulling their weight. Read transcripts, not just final outputs. If the skill makes the model waste time on unproductive tasks, remove those parts and test the result.
3. Explain the Why
Today’s LLMs are smart with good theory of mind. When given good context, they can go beyond rote instructions.
- Explain the “why” behind every instruction
- If you’re writing ALWAYS or NEVER in all caps, that’s a yellow flag
- Reframe and explain the reasoning instead
- Help the model understand why something is important
4. Look for Repeated Work
Read transcripts from test runs. If all test cases resulted in the subagent writing similar helper scripts (e.g., create_docx.py, build_chart.py), bundle that script:
- Write it once
- Put it in scripts/
- Tell the skill to use it
- Save every future invocation from reinventing the wheel
The Iteration Loop
Description Optimization
The description field in SKILL.md frontmatter determines whether Claude invokes a skill. After creating or improving a skill, optimize the description for better triggering accuracy.

Generate Trigger Eval Queries
Create 20 realistic eval queries, a mix of should-trigger and should-not-trigger. Good queries are concrete and specific, with:
- File paths and names
- Personal context about the user’s job or situation
- Column names, values, company names, URLs
- Casual speech, lowercase, abbreviations, typos
- Varying lengths with focus on edge cases

Avoid generic, keyword-only queries such as:
- “Format this data”
- “Extract text from PDF”
- “Create a chart”
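A sketch of what an eval query set could look like (the file format is an assumption):

```json
[
  {"query": "turn reports/q3_sales.csv into a dashboard my VP can skim", "should_trigger": true},
  {"query": "i need the churn numbers from metrics.db charted by region", "should_trigger": true},
  {"query": "what's a good color palette for my slides?", "should_trigger": false}
]
```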
Query Types
Should-trigger queries (8-10):
- Different phrasings of the same intent (formal and casual)
- Cases where the user doesn’t explicitly name the skill
- Uncommon use cases
- Competitive cases where this skill should win

Should-not-trigger queries (10-12):
- Near-misses sharing keywords but needing something different
- Adjacent domains
- Ambiguous phrasing where a naive keyword match would trigger but shouldn’t
- Cases where the query touches on skill functionality but another tool is more appropriate
Run Optimization Loop
- Splits eval set into 60% train and 40% test
- Evaluates current description (3 runs per query for reliability)
- Calls Claude with extended thinking to propose improvements
- Re-evaluates on both train and test sets
- Iterates up to 5 times
- Opens HTML report showing results
- Returns JSON with best_description (selected by test score to avoid overfitting)
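The returned JSON might look like the following; only best_description is named by this document, so the other fields and values are illustrative assumptions:

```json
{
  "best_description": "How to build a dashboard to display data. Use this skill whenever...",
  "train_score": 0.93,
  "test_score": 0.88,
  "iterations": 4
}
```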
How Skill Triggering Works
Skills appear in Claude’s available_skills list with name + description. Claude decides whether to consult a skill based on that description.

Important: Claude only consults skills for tasks it can’t easily handle on its own. Simple one-step queries like “read this PDF” may not trigger even if the description matches perfectly.

Your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Complex, multi-step, or specialized queries reliably trigger skills when descriptions match.

Apply Results
Take best_description from the JSON output and update the SKILL.md frontmatter. Show the user the before/after and report the scores.
Environment-Specific Instructions
Supported environments:
- Claude Code
- Claude.ai
- Cowork

Full workflow supported:
- Parallel subagent execution
- Browser-based eval viewer
- Quantitative benchmarking
- Description optimization with claude -p
- Blind comparison testing
Advanced Features
Blind Comparison
For rigorous comparison between two skill versions:
- Give two outputs to an independent agent without identifying which is which
- Let it judge quality
- Analyze why the winner won
Packaging Skills
Package the skill into a .skill file ready for installation and distribution.
Key Reminders
Core Loop:
- Figure out what the skill is about
- Draft or edit the skill
- Run Claude-with-skill on test prompts
- Create benchmark.json and run eval-viewer/generate_review.py
- Evaluate outputs with the user
- Run quantitative evals
- Repeat until satisfied
- Package the final skill