This document provides a comprehensive overview of Skill Lab’s architecture, including the tech stack, data flow, core components, and design patterns.

Tech Stack

Runtime Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| Python | ≥3.10 | Runtime (uses modern features like `X \| Y` union syntax) |
| Typer | ≥0.9.0 | CLI framework built on Click with type hints |
| Rich | ≥13.0.0 | Terminal formatting (tables, panels, colors) |
| PyYAML | ≥6.0 | YAML frontmatter parsing |
| anthropic | ≥0.39.0 | LLM-based test generation (optional, `pip install skill-lab[generate]`) |

Development Dependencies

| Package | Purpose |
| --- | --- |
| pytest | Test framework |
| pytest-cov | Test coverage reporting |
| mypy | Static type checking (strict mode enabled) |
| ruff | Fast linter (replaces flake8, isort) |
| types-PyYAML | Type stubs for PyYAML |

Directory Structure

src/skill_lab/
├── cli.py                    # Entry point - Typer CLI commands
├── __main__.py               # Allows `python -m skill_lab`
├── core/
│   ├── models.py             # Data classes (Skill, CheckResult, etc.)
│   ├── registry.py           # Check auto-discovery system
│   ├── constants.py          # Shared constants
│   ├── scoring.py            # Quality score calculation
│   ├── tokens.py             # Token estimation utility
│   ├── utils.py              # Shared utilities (generic Registry[T])
│   └── exceptions.py         # Custom exception hierarchy
├── parsers/
│   ├── skill_parser.py       # SKILL.md parser (YAML + markdown)
│   └── trace_parser.py       # JSONL trace parser
├── checks/
│   ├── base.py               # StaticCheck abstract base class
│   └── static/               # Check implementations
│       ├── structure.py      # 7 checks
│       ├── schema.py         # 9 checks (declarative FieldRule)
│       ├── naming.py         # 1 check
│       └── content.py        # 11 checks
├── evaluators/
│   ├── static_evaluator.py   # Orchestrates static check execution
│   └── trace_evaluator.py    # Orchestrates trace check execution
├── tracechecks/              # Trace analysis
│   ├── registry.py           # TraceCheckRegistry
│   ├── trace_check_loader.py # Load check definitions from YAML
│   └── handlers/             # Trace check handler implementations
│       ├── base.py
│       ├── command_presence.py
│       ├── file_creation.py
│       ├── event_sequence.py
│       ├── loop_detection.py
│       └── efficiency.py
├── exporters/                # Output format renderers
│   └── prompt_exporter.py    # XML/Markdown/JSON prompt export
├── triggers/                 # Trigger testing
│   ├── generator.py          # LLM-based trigger test generation
│   ├── test_loader.py        # Load test cases from YAML
│   ├── trace_analyzer.py     # Analyze execution traces
│   └── trigger_evaluator.py  # Orchestrates trigger tests
├── runtimes/                 # Runtime adapters
│   ├── base.py               # RuntimeAdapter abstract base class
│   ├── codex_runtime.py      # OpenAI Codex CLI adapter
│   └── claude_runtime.py     # Claude Code CLI adapter
└── reporters/
    ├── console_reporter.py   # Rich terminal output
    └── json_reporter.py      # JSON output

Data Flow

The following diagram illustrates how data flows through the system during a static evaluation:
┌─────────────────────────────────────────────────────────────────┐
│                USER: sklab evaluate ./my-skill                  │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│  CLI (cli.py)                                                   │
│  • Parses arguments with Typer                                  │
│  • Creates StaticEvaluator                                      │
│  • Dispatches to appropriate reporter                           │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│  StaticEvaluator (evaluators/static_evaluator.py)               │
│  • Imports check modules → triggers @register_check decorators  │
│  • Calls parse_skill() to get Skill object                      │
│  • Retrieves all checks from registry                           │
│  • Executes each check.run(skill)                               │
│  • Calculates score and builds summary                          │
│  • Returns EvaluationReport                                     │
└─────────────────────────────────────────────────────────────────┘
                │                               │
                ▼                               ▼
┌───────────────────────────┐   ┌─────────────────────────────────┐
│  SkillParser              │   │  CheckRegistry                  │
│  (parsers/skill_parser.py)│   │  (core/registry.py)             │
│                           │   │                                 │
│  1. Read SKILL.md         │   │  Global singleton holding all   │
│  2. Extract YAML          │   │  registered check classes       │
│  3. Parse with custom     │   │                                 │
│     loader                │   │  Methods:                       │
│  4. Detect subfolders     │   │  • register(check_class)        │
│  5. Return Skill object   │   │  • get_all()                    │
│                           │   │  • get_by_dimension(dim)        │
└───────────────────────────┘   └─────────────────────────────────┘
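In code, the pipeline above reduces to parse → check → score. The sketch below is a simplified, self-contained stand-in for the real modules (all names and bodies here are illustrative, not Skill Lab's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    body: str

@dataclass(frozen=True)
class CheckResult:
    check_id: str
    passed: bool

def parse_skill(body: str) -> Skill:
    # Stand-in for SkillParser: read SKILL.md, extract YAML, build a Skill.
    return Skill(body=body)

def run_checks(skill: Skill) -> list[CheckResult]:
    # Stand-in for StaticEvaluator pulling every check from the registry.
    return [CheckResult("content.body-not-empty", bool(skill.body.strip()))]

def evaluate(body: str) -> float:
    # Parse, run each check, reduce pass/fail into a percentage score.
    results = run_checks(parse_skill(body))
    return 100.0 * sum(r.passed for r in results) / len(results)

print(evaluate("Do the thing."))  # 100.0
```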

Core Components

Data Models (core/models.py)

All models use immutable dataclasses with Python 3.10+ union syntax:

Enumerations

class Severity(str, Enum):
    ERROR = "error"      # Must fix (weight: 1.0)
    WARNING = "warning"  # Should fix (weight: 0.5)
    INFO = "info"        # Suggestion (weight: 0.25)

class EvalDimension(str, Enum):
    STRUCTURE = "structure"      # 30% weight
    NAMING = "naming"            # 20% weight
    DESCRIPTION = "description"  # 25% weight
    CONTENT = "content"          # 25% weight
    EXECUTION = "execution"      # 0% (evaluated separately)

Immutable Data Classes

@dataclass(frozen=True)
class Skill:
    path: Path
    metadata: SkillMetadata | None
    body: str
    has_scripts: bool
    has_references: bool
    has_assets: bool
    parse_errors: tuple[str, ...]

@dataclass(frozen=True)
class CheckResult:
    check_id: str
    check_name: str
    passed: bool
    severity: Severity
    dimension: EvalDimension
    message: str
    details: dict | None
    location: str | None
All data models are immutable (frozen=True) to prevent accidental modifications and ensure thread safety.
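A quick demonstration of what frozen=True buys, using a pared-down CheckResult (the real class has more fields):

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class CheckResult:  # pared-down version of the real class
    check_id: str
    passed: bool

result = CheckResult(check_id="structure.skill-md-exists", passed=True)

try:
    result.passed = False  # any attribute assignment raises
except FrozenInstanceError:
    print("immutable")

# "Changing" a frozen instance means building a modified copy:
updated = replace(result, passed=False)
print(result.passed, updated.passed)  # True False
```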

Check Registration Pattern

Skill Lab uses a decorator-based auto-discovery pattern for both static checks and trace handlers.

1. Generic Registry Base Class

# core/utils.py
class Registry(Generic[T]):
    """Generic registry for auto-discovery patterns."""

    def __init__(self, id_extractor: Callable[[type[T]], str]) -> None:
        self._items: dict[str, type[T]] = {}
        self._id_extractor = id_extractor

    def register(self, item_class: type[T]) -> type[T]:
        item_id = self._id_extractor(item_class)
        self._items[item_id] = item_class
        return item_class

    def get_all(self) -> list[type[T]]:
        return list(self._items.values())

2. Specialized Check Registry

# core/registry.py
class CheckRegistry(Registry["StaticCheck"]):
    """Registry for static checks with dimension filtering."""

    def __init__(self) -> None:
        super().__init__(id_extractor=lambda cls: cls.check_id)

    def get_by_dimension(self, dimension: str) -> list[type[StaticCheck]]:
        ...

registry = CheckRegistry()  # Global singleton

def register_check(check_class):
    return registry.register(check_class)

3. Check Definition

# checks/static/structure.py
@register_check  # ← Adds to global registry when module loads
class SkillMdExistsCheck(StaticCheck):
    check_id = "structure.skill-md-exists"
    check_name = "SKILL.md Exists"
    severity = Severity.ERROR
    dimension = EvalDimension.STRUCTURE

    def run(self, skill: Skill) -> CheckResult:
        if (skill.path / "SKILL.md").exists():
            return self._pass("SKILL.md found")
        return self._fail("SKILL.md not found")

Why This Pattern?

  • Zero manual wiring - Add decorator, checks are auto-discovered
  • Easy testing - registry.clear() for isolation
  • Selective execution - Pass check_ids to run a subset
  • Type safety - Generic base class provides type checking
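The pattern end to end, using the Registry code from core/utils.py with a bare stand-in check class in place of StaticCheck:

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

class Registry(Generic[T]):
    """Generic registry for auto-discovery patterns."""

    def __init__(self, id_extractor: Callable[[type[T]], str]) -> None:
        self._items: dict[str, type[T]] = {}
        self._id_extractor = id_extractor

    def register(self, item_class: type[T]) -> type[T]:
        self._items[self._id_extractor(item_class)] = item_class
        return item_class

    def get_all(self) -> list[type[T]]:
        return list(self._items.values())

registry: Registry = Registry(id_extractor=lambda cls: cls.check_id)

@registry.register  # runs at import time, so loading the module wires the check
class DemoCheck:
    check_id = "demo.example"

print([cls.check_id for cls in registry.get_all()])  # ['demo.example']
```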

Two Check Patterns

Skill Lab supports two distinct patterns for defining static checks:

1. Behavioral Checks (Hand-written)

For complex validation logic requiring custom implementation:
@register_check
class BodyNotEmptyCheck(StaticCheck):
    check_id = "content.body-not-empty"
    check_name = "Body Not Empty"
    severity = Severity.ERROR
    dimension = EvalDimension.CONTENT

    def run(self, skill: Skill) -> CheckResult:
        if skill.body and skill.body.strip():
            return self._pass("Skill body contains content")
        return self._fail("Skill body is empty")
Used in:
  • structure.py (7 checks)
  • naming.py (1 check)
  • content.py (11 checks)

2. Schema-Based Checks (Declarative)

For simple field validation using declarative rules:
# schema.py
FRONTMATTER_SCHEMA = [
    FieldRule(
        field_name="name",
        required=True,
        expected_type="str",
        max_length=100,
        check_id="schema.name-exists",
        check_name="Name Exists",
        description="Ensures frontmatter has a 'name' field",
    ),
    # ... more rules
]
The _make_schema_check() factory automatically creates check classes from these rules.
When adding new frontmatter fields, update both FRONTMATTER_SCHEMA in schema.py and SPEC_FRONTMATTER_FIELDS in structure.py to keep them in sync.

Scoring Algorithm (core/scoring.py)

Step 1: Calculate Per-Dimension Score

For each dimension:
score = (passed_weight / total_weight) × 100
Weights by severity:
  • ERROR = 1.0
  • WARNING = 0.5
  • INFO = 0.25

Step 2: Calculate Weighted Average

DIMENSION_WEIGHTS = {
    STRUCTURE: 0.30,     # 30%
    NAMING: 0.20,        # 20%
    DESCRIPTION: 0.25,   # 25%
    CONTENT: 0.25,       # 25%
}

final_score = Σ(dimension_score × dimension_weight)

Example Calculation

Structure: 5 checks, all pass       → 100 × 0.30 = 30.0
Naming: 5 checks, 1 ERROR fails     →  80 × 0.20 = 16.0
Description: 5 checks, all pass     → 100 × 0.25 = 25.0
Content: 6 checks, 1 WARNING fails  →  90 × 0.25 = 22.5
──────────────────────────────────────────────────────
Final Score: 93.5
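The same calculation in code, using the dimension weights from core/scoring.py (the per-dimension severity mixes in the comments are one possible breakdown that yields these scores):

```python
DIMENSION_WEIGHTS = {
    "structure": 0.30,
    "naming": 0.20,
    "description": 0.25,
    "content": 0.25,
}

dimension_scores = {
    "structure": 100.0,   # all checks pass
    "naming": 80.0,       # one ERROR (weight 1.0) fails: 4.0 / 5.0 * 100
    "description": 100.0, # all checks pass
    "content": 90.0,      # one WARNING (weight 0.5) fails: 4.5 / 5.0 * 100
}

final_score = sum(score * DIMENSION_WEIGHTS[dim]
                  for dim, score in dimension_scores.items())
print(round(final_score, 1))  # 93.5
```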

Parser (parsers/skill_parser.py)

The skill parser handles:
  1. Frontmatter extraction via regex: ^---\n(.*?)^---\n
  2. YAML parsing with custom _SkillYAMLLoader that prevents implicit type coercion
  3. Metadata extraction - pulls name and description fields
  4. Subfolder detection - checks for /scripts, /references, /assets
  5. Graceful error handling - collects errors in parse_errors tuple
The custom YAML loader prevents yes→True and null→None coercion, keeping all values as strings for consistent validation.
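Step 1 can be demonstrated with the quoted regex (assuming the DOTALL and MULTILINE flags the pattern implies; the sample SKILL.md content is illustrative):

```python
import re

FRONTMATTER_RE = re.compile(r"^---\n(.*?)^---\n", re.DOTALL | re.MULTILINE)

skill_md = """---
name: pdf-tools
description: Extract text from PDFs
---
# PDF Tools

Instructions go here.
"""

match = FRONTMATTER_RE.match(skill_md)
assert match is not None
frontmatter, body = match.group(1), skill_md[match.end():]
print(frontmatter.strip().splitlines()[0])  # name: pdf-tools
```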

Exception Hierarchy (core/exceptions.py)

All custom exceptions extend SkillLabError:
class SkillLabError(Exception):
    """Base exception with context and suggestions."""
    def __init__(
        self,
        message: str,
        *,
        context: dict[str, Any] | None = None,
        suggestion: str | None = None,
    ) -> None:
        ...

class ParseError(SkillLabError):
    """Errors during parsing (YAML, traces, etc.)."""

class CheckExecutionError(SkillLabError):
    """Errors during check execution."""

class ValidationError(SkillLabError):
    """Errors validating skill structure or content."""
Usage:
raise ParseError(
    "Invalid YAML frontmatter",
    context={"line": 5, "file": "SKILL.md"},
    suggestion="Ensure frontmatter starts and ends with '---'"
)

CLI Commands

Built with Typer for automatic help generation and type validation:
# Global options
sklab -v, --version              # Show version
sklab -h, --help                 # Show help

# Main evaluation (defaults to current directory)
sklab evaluate [./my-skill] [-f console|json] [-o file.json] [-V] [-s]

# Quick validation (exit 0 or 1)
sklab validate [./my-skill] [-s]

# List available checks
sklab list-checks [-d structure|naming|description|content] [-s]

# Trigger testing
sklab trigger [./my-skill] [-t explicit|implicit|contextual|negative]

# Generate trigger tests via LLM (requires ANTHROPIC_API_KEY)
sklab generate [./my-skill] [-m MODEL] [--force]

# Skill metadata inspector
sklab info [./my-skill] [--json] [-f FIELD]

# Multi-format prompt export
sklab prompt [./skill-a ./skill-b] [-f xml|markdown|json]
Evaluation Flags:
  • -V / --verbose: Show all checks, not just failures
  • -s / --spec-only: Only run spec-required checks (10 checks)
  • --suggestions-only: List only quality suggestion checks (18 checks)

Trigger Testing

Trigger testing verifies that skills activate correctly for different prompt types.

Trigger Types

| Type | Description | Example |
| --- | --- | --- |
| EXPLICIT | Skill named directly with `$` prefix | "$create-react-app for a todo list" |
| IMPLICIT | Describes exact scenario without naming skill | "I need to scaffold a new React application" |
| CONTEXTUAL | Realistic noisy prompt with domain context | "Building a dashboard, can you set up React?" |
| NEGATIVE | Should NOT trigger (catches false positives) | "How do I fix this useState hook?" |

Runtime Adapters

Runtime adapters execute skills and capture traces:
class RuntimeAdapter(ABC):
    @abstractmethod
    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int:
        ...

    @abstractmethod
    def parse_trace(self, trace_path: Path) -> Iterator[TraceEvent]:
        ...
Supported Runtimes:
  • ClaudeRuntime - Executes via claude --print --output-format stream-json
  • CodexRuntime - Executes via codex exec --json --full-auto
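A new runtime is added by subclassing the ABC. The toy adapter below writes a one-event JSONL trace instead of shelling out to a real CLI; EchoRuntime, its event shape, and the use of plain dicts in place of TraceEvent are all illustrative:

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator

class RuntimeAdapter(ABC):
    @abstractmethod
    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int: ...

    @abstractmethod
    def parse_trace(self, trace_path: Path) -> Iterator[dict]: ...

class EchoRuntime(RuntimeAdapter):
    """Toy adapter: records the prompt as a single JSONL event."""

    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int:
        trace_path.write_text(json.dumps({"type": "prompt", "text": prompt}) + "\n")
        return 0  # exit code, as a real CLI adapter would report

    def parse_trace(self, trace_path: Path) -> Iterator[dict]:
        for line in trace_path.read_text().splitlines():
            yield json.loads(line)

with tempfile.TemporaryDirectory() as tmp:
    trace = Path(tmp) / "trace.jsonl"
    EchoRuntime().execute("scaffold a React app", Path(tmp), trace)
    events = list(EchoRuntime().parse_trace(trace))
    print(events[0]["type"])  # prompt
```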

Design Decisions

| Decision | Rationale |
| --- | --- |
| Immutable models (`frozen=True`) | Ensures check results can’t be accidentally modified |
| Error collection vs. throwing | Parser collects errors in a tuple; evaluation continues |
| Decorator-based registration | No central file listing all checks needed |
| Weighted scoring | Different severities and dimensions have different impact |
| Strict typing | mypy strict mode enforced in pyproject.toml |
| Generic `Registry[T]` | Reusable base for CheckRegistry and TraceCheckRegistry |
| Base class helpers | `_require_metadata()`, `_skill_md_location()` reduce repetition |
| `T \| None` over `Optional[T]` | Python 3.10+ union syntax for cleaner type annotations |
| Custom YAML loader | Prevents yes→True and null→None coercion |
| NFKC Unicode normalization | Ensures precomposed/decomposed forms match in naming checks |
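The NFKC decision is easiest to see with an example: the same visible skill name can arrive as precomposed or decomposed Unicode, and only normalization makes the two forms compare equal:

```python
import unicodedata

precomposed = "caf\u00e9-skill"   # 'é' as one code point
decomposed = "cafe\u0301-skill"   # 'e' + combining acute accent

print(precomposed == decomposed)  # False - raw strings differ

def norm(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

print(norm(precomposed) == norm(decomposed))  # True - equal after NFKC
```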

Adding a New Check

See the Contributing Guidelines for detailed instructions on adding new checks. Quick reference:
  1. Create class extending StaticCheck in appropriate checks/static/ module
  2. Define class attributes: check_id, check_name, severity, dimension
  3. Set spec_required = True if required by Agent Skills spec
  4. Implement run(skill: Skill) -> CheckResult
  5. Add @register_check decorator
  6. Add tests to tests/test_checks.py
For schema-based checks, simply add a FieldRule to FRONTMATTER_SCHEMA.
