This document provides a comprehensive overview of Skill Lab’s architecture, including the tech stack, data flow, core components, and design patterns.
## Tech Stack

### Runtime Dependencies

| Package | Version | Purpose |
|---|---|---|
| Python | ≥3.10 | Runtime (uses modern features like `X \| Y` union syntax) |
| Typer | ≥0.9.0 | CLI framework built on Click with type hints |
| Rich | ≥13.0.0 | Terminal formatting (tables, panels, colors) |
| PyYAML | ≥6.0 | YAML frontmatter parsing |
| anthropic | ≥0.39.0 | LLM-based test generation (optional, `pip install skill-lab[generate]`) |
### Development Dependencies

| Package | Purpose |
|---|---|
| pytest | Test framework |
| pytest-cov | Test coverage reporting |
| mypy | Static type checking (strict mode enabled) |
| ruff | Fast linter (replaces flake8, isort) |
| types-PyYAML | Type stubs for PyYAML |
## Directory Structure

```
src/skill_lab/
├── cli.py                      # Entry point - Typer CLI commands
├── __main__.py                 # Allows `python -m skill_lab`
├── core/
│   ├── models.py               # Data classes (Skill, CheckResult, etc.)
│   ├── registry.py             # Check auto-discovery system
│   ├── constants.py            # Shared constants
│   ├── scoring.py              # Quality score calculation
│   ├── tokens.py               # Token estimation utility
│   ├── utils.py                # Shared utilities (generic Registry[T])
│   └── exceptions.py           # Custom exception hierarchy
├── parsers/
│   ├── skill_parser.py         # SKILL.md parser (YAML + markdown)
│   └── trace_parser.py         # JSONL trace parser
├── checks/
│   ├── base.py                 # StaticCheck abstract base class
│   └── static/                 # Check implementations
│       ├── structure.py        # 7 checks
│       ├── schema.py           # 9 checks (declarative FieldRule)
│       ├── naming.py           # 1 check
│       └── content.py          # 11 checks
├── evaluators/
│   ├── static_evaluator.py     # Orchestrates static check execution
│   └── trace_evaluator.py      # Orchestrates trace check execution
├── tracechecks/                # Trace analysis
│   ├── registry.py             # TraceCheckRegistry
│   ├── trace_check_loader.py   # Load check definitions from YAML
│   └── handlers/               # Trace check handler implementations
│       ├── base.py
│       ├── command_presence.py
│       ├── file_creation.py
│       ├── event_sequence.py
│       ├── loop_detection.py
│       └── efficiency.py
├── exporters/                  # Output format renderers
│   └── prompt_exporter.py      # XML/Markdown/JSON prompt export
├── triggers/                   # Trigger testing
│   ├── generator.py            # LLM-based trigger test generation
│   ├── test_loader.py          # Load test cases from YAML
│   ├── trace_analyzer.py       # Analyze execution traces
│   └── trigger_evaluator.py    # Orchestrates trigger tests
├── runtimes/                   # Runtime adapters
│   ├── base.py                 # RuntimeAdapter abstract base class
│   ├── codex_runtime.py        # OpenAI Codex CLI adapter
│   └── claude_runtime.py       # Claude Code CLI adapter
└── reporters/
    ├── console_reporter.py     # Rich terminal output
    └── json_reporter.py        # JSON output
```
## Data Flow
The following diagram illustrates how data flows through the system during a static evaluation:
```
┌─────────────────────────────────────────────────────────────────┐
│ USER: sklab evaluate ./my-skill                                 │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│ CLI (cli.py)                                                    │
│  • Parses arguments with Typer                                  │
│  • Creates StaticEvaluator                                      │
│  • Dispatches to appropriate reporter                           │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│ StaticEvaluator (evaluators/static_evaluator.py)                │
│  • Imports check modules → triggers @register_check decorators  │
│  • Calls parse_skill() to get Skill object                      │
│  • Retrieves all checks from registry                           │
│  • Executes each check.run(skill)                               │
│  • Calculates score and builds summary                          │
│  • Returns EvaluationReport                                     │
└─────────────────────────────────────────────────────────────────┘
              │                                  │
              ▼                                  ▼
┌───────────────────────────┐   ┌─────────────────────────────────┐
│ SkillParser               │   │ CheckRegistry                   │
│ (parsers/skill_parser.py) │   │ (core/registry.py)              │
│                           │   │                                 │
│ 1. Read SKILL.md          │   │ Global singleton holding all    │
│ 2. Extract YAML           │   │ registered check classes        │
│ 3. Parse with custom      │   │                                 │
│    loader                 │   │ Methods:                        │
│ 4. Detect subfolders      │   │  • register(check_class)        │
│ 5. Return Skill object    │   │  • get_all()                    │
│                           │   │  • get_by_dimension(dim)        │
└───────────────────────────┘   └─────────────────────────────────┘
```
## Core Components

### Data Models (`core/models.py`)
All models use immutable dataclasses with Python 3.10+ union syntax:
#### Enumerations

```python
class Severity(str, Enum):
    ERROR = "error"      # Must fix (weight: 1.0)
    WARNING = "warning"  # Should fix (weight: 0.5)
    INFO = "info"        # Suggestion (weight: 0.25)


class EvalDimension(str, Enum):
    STRUCTURE = "structure"      # 30% weight
    NAMING = "naming"            # 20% weight
    DESCRIPTION = "description"  # 25% weight
    CONTENT = "content"          # 25% weight
    EXECUTION = "execution"      # 0% (evaluated separately)
```
#### Immutable Data Classes

```python
@dataclass(frozen=True)
class Skill:
    path: Path
    metadata: SkillMetadata | None
    body: str
    has_scripts: bool
    has_references: bool
    has_assets: bool
    parse_errors: tuple[str, ...]


@dataclass(frozen=True)
class CheckResult:
    check_id: str
    check_name: str
    passed: bool
    severity: Severity
    dimension: EvalDimension
    message: str
    details: dict | None
    location: str | None
```
All data models are immutable (`frozen=True`) to prevent accidental modification and ensure thread safety.
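A quick illustration of what `frozen=True` buys (a toy class, not one of the real models):

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class Result:
    check_id: str
    passed: bool

r = Result(check_id="structure.skill-md-exists", passed=True)

try:
    r.passed = False  # any attribute assignment on a frozen instance raises
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# "Modification" means building a new instance instead
flipped = replace(r, passed=False)
```

Because instances can never change after construction, check results can be shared freely across threads and reporters without defensive copying.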
## Check Registration Pattern

Skill Lab uses a decorator-based auto-discovery pattern for both static checks and trace handlers.

### 1. Generic Registry Base Class
```python
# core/utils.py
class Registry(Generic[T]):
    """Generic registry for auto-discovery patterns."""

    def __init__(self, id_extractor: Callable[[type[T]], str]) -> None:
        self._items: dict[str, type[T]] = {}
        self._id_extractor = id_extractor

    def register(self, item_class: type[T]) -> type[T]:
        item_id = self._id_extractor(item_class)
        self._items[item_id] = item_class
        return item_class

    def get_all(self) -> list[type[T]]:
        return list(self._items.values())
```
### 2. Specialized Check Registry

```python
# core/registry.py
class CheckRegistry(Registry["StaticCheck"]):
    """Registry for static checks with dimension filtering."""

    def __init__(self) -> None:
        super().__init__(id_extractor=lambda cls: cls.check_id)

    def get_by_dimension(self, dimension: str) -> list[type[StaticCheck]]:
        ...


registry = CheckRegistry()  # Global singleton


def register_check(check_class):
    return registry.register(check_class)
```
### 3. Check Definition

```python
# checks/static/structure.py
@register_check  # ← Adds to global registry when module loads
class SkillMdExistsCheck(StaticCheck):
    check_id = "structure.skill-md-exists"
    check_name = "SKILL.md Exists"
    severity = Severity.ERROR
    dimension = EvalDimension.STRUCTURE

    def run(self, skill: Skill) -> CheckResult:
        if (skill.path / "SKILL.md").exists():
            return self._pass("SKILL.md found")
        return self._fail("SKILL.md not found")
```
### Why This Pattern?

- **Zero manual wiring**: add the decorator and the check is auto-discovered
- **Easy testing**: `registry.clear()` gives per-test isolation
- **Selective execution**: pass `check_ids` to run a subset
- **Type safety**: the generic base class provides type checking
## Two Check Patterns

Skill Lab supports two distinct patterns for defining static checks:

### 1. Behavioral Checks (Hand-written)

For complex validation logic requiring a custom implementation:
```python
@register_check
class BodyNotEmptyCheck(StaticCheck):
    check_id = "content.body-not-empty"
    check_name = "Body Not Empty"
    severity = Severity.ERROR
    dimension = EvalDimension.CONTENT

    def run(self, skill: Skill) -> CheckResult:
        if skill.body and skill.body.strip():
            return self._pass("Skill body contains content")
        return self._fail("Skill body is empty")
```
Used in: `structure.py` (7 checks), `naming.py` (1 check), and `content.py` (11 checks).
### 2. Schema-Based Checks (Declarative)

For simple field validation using declarative rules:
```python
# schema.py
FRONTMATTER_SCHEMA = [
    FieldRule(
        field_name="name",
        required=True,
        expected_type="str",
        max_length=100,
        check_id="schema.name-exists",
        check_name="Name Exists",
        description="Ensures frontmatter has a 'name' field",
    ),
    # ... more rules
]
```
The `_make_schema_check()` factory automatically creates check classes from these rules.

When adding new frontmatter fields, update both `FRONTMATTER_SCHEMA` in `schema.py` and `SPEC_FRONTMATTER_FIELDS` in `structure.py` to keep them in sync.
## Scoring Algorithm (`core/scoring.py`)

### Step 1: Calculate Per-Dimension Score

For each dimension:

```
score = (passed_weight / total_weight) × 100
```

Weights by severity:

- ERROR = 1.0
- WARNING = 0.5
- INFO = 0.25
### Step 2: Calculate Weighted Average

```python
DIMENSION_WEIGHTS = {
    STRUCTURE: 0.30,    # 30%
    NAMING: 0.20,       # 20%
    DESCRIPTION: 0.25,  # 25%
    CONTENT: 0.25,      # 25%
}
```

```
final_score = Σ(dimension_score × dimension_weight)
```
### Example Calculation

```
Structure:   5 checks, all pass        → 100 × 0.30 = 30.0
Naming:      5 checks, 1 ERROR fails   →  80 × 0.20 = 16.0
Description: 5 checks, all pass        → 100 × 0.25 = 25.0
Content:     6 checks, 1 WARNING fails →  90 × 0.25 = 22.5
──────────────────────────────────────────────────────
Final Score: 93.5
```
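The arithmetic above can be reproduced directly from the two steps; the severity-weighted pass ratio and the documented dimension weights are all that is needed:

```python
SEVERITY_WEIGHTS = {"error": 1.0, "warning": 0.5, "info": 0.25}
DIMENSION_WEIGHTS = {"structure": 0.30, "naming": 0.20,
                     "description": 0.25, "content": 0.25}

def dimension_score(results: list[tuple[str, bool]]) -> float:
    """Step 1: severity-weighted pass ratio, scaled to 0-100."""
    total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    return passed / total * 100

# Naming from the example: five ERROR-severity checks, one failing → 80.0
naming = dimension_score([("error", True)] * 4 + [("error", False)])

# Step 2: weighted average over the per-dimension scores from the example
scores = {"structure": 100.0, "naming": naming, "description": 100.0, "content": 90.0}
final_score = sum(scores[d] * w for d, w in DIMENSION_WEIGHTS.items())
print(final_score)  # 93.5
```

Note that a single failing WARNING costs less than a failing ERROR in the same dimension, since its weight (0.5) shrinks both the numerator loss and, proportionally less, the denominator.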
## Parser (`parsers/skill_parser.py`)

The skill parser handles:

- **Frontmatter extraction** via the regex `^---\n(.*?)^---\n`
- **YAML parsing** with a custom `_SkillYAMLLoader` that prevents implicit type coercion
- **Metadata extraction**: pulls the `name` and `description` fields
- **Subfolder detection**: checks for `/scripts`, `/references`, and `/assets`
- **Graceful error handling**: collects errors in the `parse_errors` tuple

The custom YAML loader prevents `yes`→`True` and `null`→`None` coercion, keeping all values as strings for consistent validation.
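PyYAML's built-in `BaseLoader` demonstrates the effect such a loader achieves; the real `_SkillYAMLLoader` may be implemented differently:

```python
import yaml

doc = "name: yes\ndescription: null\nversion: 1.0\n"

# Default loading applies YAML 1.1 implicit typing: yes→True, null→None, 1.0→float
coerced = yaml.safe_load(doc)

# BaseLoader performs no implicit resolution, so every scalar stays a string --
# the behavior the custom loader is described as providing
raw = yaml.load(doc, Loader=yaml.BaseLoader)

print(coerced)  # {'name': True, 'description': None, 'version': 1.0}
print(raw)      # {'name': 'yes', 'description': 'null', 'version': '1.0'}
```

Keeping everything as strings means the schema checks validate exactly what the author wrote, rather than YAML's interpretation of it.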
## Exception Hierarchy (`core/exceptions.py`)

All custom exceptions extend `SkillLabError`:
```python
class SkillLabError(Exception):
    """Base exception with context and suggestions."""

    def __init__(
        self,
        message: str,
        *,
        context: dict[str, Any] | None = None,
        suggestion: str | None = None,
    ) -> None:
        ...


class ParseError(SkillLabError):
    """Errors during parsing (YAML, traces, etc.)."""


class CheckExecutionError(SkillLabError):
    """Errors during check execution."""


class ValidationError(SkillLabError):
    """Errors validating skill structure or content."""
```
Usage:

```python
raise ParseError(
    "Invalid YAML frontmatter",
    context={"line": 5, "file": "SKILL.md"},
    suggestion="Ensure frontmatter starts and ends with '---'",
)
```
## CLI Commands

Built with Typer for automatic help generation and type validation:

```shell
# Global options
sklab -v, --version    # Show version
sklab -h, --help       # Show help

# Main evaluation (defaults to current directory)
sklab evaluate [./my-skill] [-f console|json] [-o file.json] [-V] [-s]

# Quick validation (exit 0 or 1)
sklab validate [./my-skill] [-s]

# List available checks
sklab list-checks [-d structure|naming|description|content] [-s]

# Trigger testing
sklab trigger [./my-skill] [-t explicit|implicit|contextual|negative]

# Generate trigger tests via LLM (requires ANTHROPIC_API_KEY)
sklab generate [./my-skill] [-m MODEL] [--force]

# Skill metadata inspector
sklab info [./my-skill] [--json] [-f FIELD]

# Multi-format prompt export
sklab prompt [./skill-a ./skill-b] [-f xml|markdown|json]
```

Evaluation flags:

- `-V` / `--verbose`: show all checks, not just failures
- `-s` / `--spec-only`: run only the spec-required checks (10 checks)
- `--suggestions-only`: list only the quality-suggestion checks (18 checks)
## Trigger Testing

Trigger testing verifies that skills activate correctly for different prompt types.

### Trigger Types

| Type | Description | Example |
|---|---|---|
| EXPLICIT | Skill named directly with `$` prefix | "`$create-react-app` for a todo list" |
| IMPLICIT | Describes the exact scenario without naming the skill | "I need to scaffold a new React application" |
| CONTEXTUAL | Realistic noisy prompt with domain context | "Building a dashboard, can you set up React?" |
| NEGATIVE | Should NOT trigger (catches false positives) | "How do I fix this useState hook?" |
## Runtime Adapters

Runtime adapters execute skills and capture traces:

```python
class RuntimeAdapter(ABC):
    @abstractmethod
    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int:
        ...

    @abstractmethod
    def parse_trace(self, trace_path: Path) -> Iterator[TraceEvent]:
        ...
```

Supported runtimes:

- `ClaudeRuntime`: executes via `claude --print --output-format stream-json`
- `CodexRuntime`: executes via `codex exec --json --full-auto`
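A toy adapter makes the contract concrete. The real adapters shell out to the CLIs above; this sketch replaces the subprocess with a canned JSONL trace and models events as plain dicts rather than the actual `TraceEvent` type:

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator

class RuntimeAdapter(ABC):
    @abstractmethod
    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int: ...

    @abstractmethod
    def parse_trace(self, trace_path: Path) -> Iterator[dict]: ...

class FakeRuntime(RuntimeAdapter):
    """Stand-in runtime that writes a one-event JSONL trace instead of invoking a CLI."""

    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int:
        trace_path.write_text(json.dumps({"type": "user", "text": prompt}) + "\n")
        return 0  # exit code, as with the real CLI runtimes

    def parse_trace(self, trace_path: Path) -> Iterator[dict]:
        # JSONL: one event object per line
        for line in trace_path.read_text().splitlines():
            yield json.loads(line)

with tempfile.TemporaryDirectory() as tmp:
    trace = Path(tmp) / "trace.jsonl"
    runtime = FakeRuntime()
    exit_code = runtime.execute("scaffold a React app", Path(tmp), trace)
    events = list(runtime.parse_trace(trace))
```

Splitting `execute` from `parse_trace` lets the trace evaluator consume events the same way regardless of which CLI produced them.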
## Design Decisions

| Decision | Rationale |
|---|---|
| Immutable models (`frozen=True`) | Ensures check results can’t be accidentally modified |
| Error collection vs. throwing | Parser collects errors in a tuple; evaluation continues |
| Decorator-based registration | No central file listing all checks is needed |
| Weighted scoring | Different severities and dimensions have different impact |
| Strict typing | mypy strict mode enforced in pyproject.toml |
| Generic `Registry[T]` | Reusable base for CheckRegistry and TraceCheckRegistry |
| Base class helpers | `_require_metadata()`, `_skill_md_location()` reduce repetition |
| `T \| None` over `Optional[T]` | Python 3.10+ union syntax for cleaner type annotations |
| Custom YAML loader | Prevents `yes`→`True`, `null`→`None` coercion |
| NFKC Unicode normalization | Ensures precomposed/decomposed forms match in naming checks |
## Adding a New Check

See the Contributing Guidelines for detailed instructions on adding new checks.

Quick reference:

1. Create a class extending `StaticCheck` in the appropriate `checks/static/` module
2. Define the class attributes: `check_id`, `check_name`, `severity`, `dimension`
3. Set `spec_required = True` if the check is required by the Agent Skills spec
4. Implement `run(skill: Skill) -> CheckResult`
5. Add the `@register_check` decorator
6. Add tests to `tests/test_checks.py`

For schema-based checks, simply add a `FieldRule` to `FRONTMATTER_SCHEMA`.