This document provides a comprehensive overview of Skill Lab’s architecture, including the tech stack, data flow, core components, and design patterns.

Tech Stack

Runtime Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| Python | ≥3.10 | Runtime (uses modern features like `X \| Y` union syntax) |
| Typer | ≥0.9.0 | CLI framework built on Click with type hints |
| Rich | ≥13.0.0 | Terminal formatting (tables, panels, colors) |
| PyYAML | ≥6.0 | YAML frontmatter parsing |
| anthropic | ≥0.39.0 | LLM-based test generation (optional, `pip install skill-lab[generate]`) |

Development Dependencies

| Package | Purpose |
| --- | --- |
| pytest | Test framework |
| pytest-cov | Test coverage reporting |
| mypy | Static type checking (strict mode enabled) |
| ruff | Fast linter (replaces flake8, isort) |
| types-PyYAML | Type stubs for PyYAML |

Directory Structure

src/skill_lab/
├── cli.py                    # Entry point - Typer CLI commands
├── __main__.py               # Allows `python -m skill_lab`
├── core/
│   ├── models.py             # Data classes (Skill, CheckResult, etc.)
│   ├── registry.py           # Check auto-discovery system
│   ├── constants.py          # Shared constants
│   ├── scoring.py            # Quality score calculation
│   ├── tokens.py             # Token estimation utility
│   ├── utils.py              # Shared utilities (generic Registry[T])
│   └── exceptions.py         # Custom exception hierarchy
├── parsers/
│   ├── skill_parser.py       # SKILL.md parser (YAML + markdown)
│   └── trace_parser.py       # JSONL trace parser
├── checks/
│   ├── base.py               # StaticCheck abstract base class
│   └── static/               # Check implementations
│       ├── structure.py      # 7 checks
│       ├── schema.py         # 9 checks (declarative FieldRule)
│       ├── naming.py         # 1 check
│       └── content.py        # 11 checks
├── evaluators/
│   ├── static_evaluator.py   # Orchestrates static check execution
│   └── trace_evaluator.py    # Orchestrates trace check execution
├── tracechecks/              # Trace analysis
│   ├── registry.py           # TraceCheckRegistry
│   ├── trace_check_loader.py # Load check definitions from YAML
│   └── handlers/             # Trace check handler implementations
│       ├── base.py
│       ├── command_presence.py
│       ├── file_creation.py
│       ├── event_sequence.py
│       ├── loop_detection.py
│       └── efficiency.py
├── exporters/                # Output format renderers
│   └── prompt_exporter.py    # XML/Markdown/JSON prompt export
├── triggers/                 # Trigger testing
│   ├── generator.py          # LLM-based trigger test generation
│   ├── test_loader.py        # Load test cases from YAML
│   ├── trace_analyzer.py     # Analyze execution traces
│   └── trigger_evaluator.py  # Orchestrates trigger tests
├── runtimes/                 # Runtime adapters
│   ├── base.py               # RuntimeAdapter abstract base class
│   ├── codex_runtime.py      # OpenAI Codex CLI adapter
│   └── claude_runtime.py     # Claude Code CLI adapter
└── reporters/
    ├── console_reporter.py   # Rich terminal output
    └── json_reporter.py      # JSON output

Data Flow

The following diagram illustrates how data flows through the system during a static evaluation:
┌─────────────────────────────────────────────────────────────────┐
│                USER: sklab evaluate ./my-skill                  │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│  CLI (cli.py)                                                   │
│  • Parses arguments with Typer                                  │
│  • Creates StaticEvaluator                                      │
│  • Dispatches to appropriate reporter                           │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│  StaticEvaluator (evaluators/static_evaluator.py)               │
│  • Imports check modules → triggers @register_check decorators  │
│  • Calls parse_skill() to get Skill object                      │
│  • Retrieves all checks from registry                           │
│  • Executes each check.run(skill)                               │
│  • Calculates score and builds summary                          │
│  • Returns EvaluationReport                                     │
└─────────────────────────────────────────────────────────────────┘
                │                               │
                ▼                               ▼
┌───────────────────────────┐   ┌─────────────────────────────────┐
│  SkillParser              │   │  CheckRegistry                  │
│  (parsers/skill_parser.py)│   │  (core/registry.py)             │
│                           │   │                                 │
│  1. Read SKILL.md         │   │  Global singleton holding all   │
│  2. Extract YAML          │   │  registered check classes       │
│  3. Parse with custom     │   │                                 │
│     loader                │   │  Methods:                       │
│  4. Detect subfolders     │   │  • register(check_class)        │
│  5. Return Skill object   │   │  • get_all()                    │
│                           │   │  • get_by_dimension(dim)        │
└───────────────────────────┘   └─────────────────────────────────┘
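In code, the pipeline above reduces to parse → check → score. The sketch below is a simplified, self-contained stand-in for the real modules (all names and bodies here are illustrative, not Skill Lab's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    body: str

@dataclass(frozen=True)
class CheckResult:
    check_id: str
    passed: bool

def parse_skill(body: str) -> Skill:
    # Stand-in for SkillParser: read SKILL.md, extract YAML, build a Skill.
    return Skill(body=body)

def run_checks(skill: Skill) -> list[CheckResult]:
    # Stand-in for StaticEvaluator pulling every check from the registry.
    return [CheckResult("content.body-not-empty", bool(skill.body.strip()))]

def evaluate(body: str) -> float:
    # Parse, run each check, reduce pass/fail into a percentage score.
    results = run_checks(parse_skill(body))
    return 100.0 * sum(r.passed for r in results) / len(results)

print(evaluate("Do the thing."))  # 100.0
```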

Core Components

Data Models (core/models.py)

All models use immutable dataclasses with Python 3.10+ union syntax:

Enumerations

class Severity(str, Enum):
    ERROR = "error"      # Must fix (weight: 1.0)
    WARNING = "warning"  # Should fix (weight: 0.5)
    INFO = "info"        # Suggestion (weight: 0.25)

class EvalDimension(str, Enum):
    STRUCTURE = "structure"      # 30% weight
    NAMING = "naming"            # 20% weight
    DESCRIPTION = "description"  # 25% weight
    CONTENT = "content"          # 25% weight
    EXECUTION = "execution"      # 0% (evaluated separately)

Immutable Data Classes

@dataclass(frozen=True)
class Skill:
    path: Path
    metadata: SkillMetadata | None
    body: str
    has_scripts: bool
    has_references: bool
    has_assets: bool
    parse_errors: tuple[str, ...]

@dataclass(frozen=True)
class CheckResult:
    check_id: str
    check_name: str
    passed: bool
    severity: Severity
    dimension: EvalDimension
    message: str
    details: dict | None
    location: str | None
All data models are immutable (frozen=True) to prevent accidental modifications and ensure thread safety.
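A quick demonstration of what frozen=True buys, using a pared-down CheckResult (the real class has more fields):

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class CheckResult:  # pared-down version of the real class
    check_id: str
    passed: bool

result = CheckResult(check_id="structure.skill-md-exists", passed=True)

try:
    result.passed = False  # any attribute assignment raises
except FrozenInstanceError:
    print("immutable")

# "Changing" a frozen instance means building a modified copy:
updated = replace(result, passed=False)
print(result.passed, updated.passed)  # True False
```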

Check Registration Pattern

Skill Lab uses a decorator-based auto-discovery pattern for both static checks and trace handlers.

1. Generic Registry Base Class

# core/utils.py
class Registry(Generic[T]):
    """Generic registry for auto-discovery patterns."""

    def __init__(self, id_extractor: Callable[[type[T]], str]) -> None:
        self._items: dict[str, type[T]] = {}
        self._id_extractor = id_extractor

    def register(self, item_class: type[T]) -> type[T]:
        item_id = self._id_extractor(item_class)
        self._items[item_id] = item_class
        return item_class

    def get_all(self) -> list[type[T]]:
        return list(self._items.values())

2. Specialized Check Registry

# core/registry.py
class CheckRegistry(Registry["StaticCheck"]):
    """Registry for static checks with dimension filtering."""

    def __init__(self) -> None:
        super().__init__(id_extractor=lambda cls: cls.check_id)

    def get_by_dimension(self, dimension: str) -> list[type[StaticCheck]]:
        ...

registry = CheckRegistry()  # Global singleton

def register_check(check_class):
    return registry.register(check_class)

3. Check Definition

# checks/static/structure.py
@register_check  # ← Adds to global registry when module loads
class SkillMdExistsCheck(StaticCheck):
    check_id = "structure.skill-md-exists"
    check_name = "SKILL.md Exists"
    severity = Severity.ERROR
    dimension = EvalDimension.STRUCTURE

    def run(self, skill: Skill) -> CheckResult:
        if (skill.path / "SKILL.md").exists():
            return self._pass("SKILL.md found")
        return self._fail("SKILL.md not found")

Why This Pattern?

  • Zero manual wiring - Add decorator, checks are auto-discovered
  • Easy testing - registry.clear() for isolation
  • Selective execution - Pass check_ids to run a subset
  • Type safety - Generic base class provides type checking
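The pattern end to end, using the Registry code from core/utils.py with a bare stand-in check class in place of StaticCheck:

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

class Registry(Generic[T]):
    """Generic registry for auto-discovery patterns."""

    def __init__(self, id_extractor: Callable[[type[T]], str]) -> None:
        self._items: dict[str, type[T]] = {}
        self._id_extractor = id_extractor

    def register(self, item_class: type[T]) -> type[T]:
        self._items[self._id_extractor(item_class)] = item_class
        return item_class

    def get_all(self) -> list[type[T]]:
        return list(self._items.values())

registry: Registry = Registry(id_extractor=lambda cls: cls.check_id)

@registry.register  # runs at import time, so loading the module wires the check
class DemoCheck:
    check_id = "demo.example"

print([cls.check_id for cls in registry.get_all()])  # ['demo.example']
```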

Two Check Patterns

Skill Lab supports two distinct patterns for defining static checks:

1. Behavioral Checks (Hand-written)

For complex validation logic requiring custom implementation:
@register_check
class BodyNotEmptyCheck(StaticCheck):
    check_id = "content.body-not-empty"
    check_name = "Body Not Empty"
    severity = Severity.ERROR
    dimension = EvalDimension.CONTENT

    def run(self, skill: Skill) -> CheckResult:
        if skill.body and skill.body.strip():
            return self._pass("Skill body contains content")
        return self._fail("Skill body is empty")
Used in:
  • structure.py (7 checks)
  • naming.py (1 check)
  • content.py (11 checks)

2. Schema-Based Checks (Declarative)

For simple field validation using declarative rules:
# schema.py
FRONTMATTER_SCHEMA = [
    FieldRule(
        field_name="name",
        required=True,
        expected_type="str",
        max_length=100,
        check_id="schema.name-exists",
        check_name="Name Exists",
        description="Ensures frontmatter has a 'name' field",
    ),
    # ... more rules
]
The _make_schema_check() factory automatically creates check classes from these rules.
When adding new frontmatter fields, update both FRONTMATTER_SCHEMA in schema.py and SPEC_FRONTMATTER_FIELDS in structure.py to keep them in sync.

Scoring Algorithm (core/scoring.py)

Step 1: Calculate Per-Dimension Score

For each dimension:
score = (passed_weight / total_weight) × 100
Weights by severity:
  • ERROR = 1.0
  • WARNING = 0.5
  • INFO = 0.25

Step 2: Calculate Weighted Average

DIMENSION_WEIGHTS = {
    STRUCTURE: 0.30,     # 30%
    NAMING: 0.20,        # 20%
    DESCRIPTION: 0.25,   # 25%
    CONTENT: 0.25,       # 25%
}

final_score = Σ(dimension_score × dimension_weight)

Example Calculation

Structure: 5 checks, all pass       → 100 × 0.30 = 30.0
Naming: 5 checks, 1 ERROR fails     →  80 × 0.20 = 16.0
Description: 5 checks, all pass     → 100 × 0.25 = 25.0
Content: 6 checks, 1 WARNING fails  →  90 × 0.25 = 22.5
──────────────────────────────────────────────────────
Final Score: 93.5
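The same calculation in code, using the dimension weights from core/scoring.py (the per-dimension severity mixes in the comments are one possible breakdown that yields these scores):

```python
DIMENSION_WEIGHTS = {
    "structure": 0.30,
    "naming": 0.20,
    "description": 0.25,
    "content": 0.25,
}

dimension_scores = {
    "structure": 100.0,   # all checks pass
    "naming": 80.0,       # one ERROR (weight 1.0) fails: 4.0 / 5.0 * 100
    "description": 100.0, # all checks pass
    "content": 90.0,      # one WARNING (weight 0.5) fails: 4.5 / 5.0 * 100
}

final_score = sum(score * DIMENSION_WEIGHTS[dim]
                  for dim, score in dimension_scores.items())
print(round(final_score, 1))  # 93.5
```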

Parser (parsers/skill_parser.py)

The skill parser handles:
  1. Frontmatter extraction via regex: ^---\n(.*?)^---\n
  2. YAML parsing with custom _SkillYAMLLoader that prevents implicit type coercion
  3. Metadata extraction - pulls name and description fields
  4. Subfolder detection - checks for /scripts, /references, /assets
  5. Graceful error handling - collects errors in parse_errors tuple
The custom YAML loader prevents yes→True and null→None coercion, keeping all values as strings for consistent validation.
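Step 1 can be demonstrated with the quoted regex (assuming the DOTALL and MULTILINE flags the pattern implies; the sample SKILL.md content is illustrative):

```python
import re

FRONTMATTER_RE = re.compile(r"^---\n(.*?)^---\n", re.DOTALL | re.MULTILINE)

skill_md = """---
name: pdf-tools
description: Extract text from PDFs
---
# PDF Tools

Instructions go here.
"""

match = FRONTMATTER_RE.match(skill_md)
assert match is not None
frontmatter, body = match.group(1), skill_md[match.end():]
print(frontmatter.strip().splitlines()[0])  # name: pdf-tools
```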

Exception Hierarchy (core/exceptions.py)

All custom exceptions extend SkillLabError:
class SkillLabError(Exception):
    """Base exception with context and suggestions."""
    def __init__(
        self,
        message: str,
        *,
        context: dict[str, Any] | None = None,
        suggestion: str | None = None,
    ) -> None:
        ...

class ParseError(SkillLabError):
    """Errors during parsing (YAML, traces, etc.)."""

class CheckExecutionError(SkillLabError):
    """Errors during check execution."""

class ValidationError(SkillLabError):
    """Errors validating skill structure or content."""
Usage:
raise ParseError(
    "Invalid YAML frontmatter",
    context={"line": 5, "file": "SKILL.md"},
    suggestion="Ensure frontmatter starts and ends with '---'"
)

CLI Commands

Built with Typer for automatic help generation and type validation:
# Global options
sklab -v, --version              # Show version
sklab -h, --help                 # Show help

# Main evaluation (defaults to current directory)
sklab evaluate [./my-skill] [-f console|json] [-o file.json] [-V] [-s]

# Quick validation (exit 0 or 1)
sklab validate [./my-skill] [-s]

# List available checks
sklab list-checks [-d structure|naming|description|content] [-s]

# Trigger testing
sklab trigger [./my-skill] [-t explicit|implicit|contextual|negative]

# Generate trigger tests via LLM (requires ANTHROPIC_API_KEY)
sklab generate [./my-skill] [-m MODEL] [--force]

# Skill metadata inspector
sklab info [./my-skill] [--json] [-f FIELD]

# Multi-format prompt export
sklab prompt [./skill-a ./skill-b] [-f xml|markdown|json]
Evaluation Flags:
  • -V / --verbose: Show all checks, not just failures
  • -s / --spec-only: Only run spec-required checks (10 checks)
  • --suggestions-only: List only quality suggestion checks (18 checks)

Trigger Testing

Trigger testing verifies that skills activate correctly for different prompt types.

Trigger Types

| Type | Description | Example |
| --- | --- | --- |
| EXPLICIT | Skill named directly with `$` prefix | "$create-react-app for a todo list" |
| IMPLICIT | Describes exact scenario without naming skill | "I need to scaffold a new React application" |
| CONTEXTUAL | Realistic noisy prompt with domain context | "Building a dashboard, can you set up React?" |
| NEGATIVE | Should NOT trigger (catches false positives) | "How do I fix this useState hook?" |

Runtime Adapters

Runtime adapters execute skills and capture traces:
class RuntimeAdapter(ABC):
    @abstractmethod
    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int:
        ...

    @abstractmethod
    def parse_trace(self, trace_path: Path) -> Iterator[TraceEvent]:
        ...
Supported Runtimes:
  • ClaudeRuntime - Executes via claude --print --output-format stream-json
  • CodexRuntime - Executes via codex exec --json --full-auto
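A new runtime is added by subclassing the ABC. The toy adapter below writes a one-event JSONL trace instead of shelling out to a real CLI; EchoRuntime, its event shape, and the use of plain dicts in place of TraceEvent are all illustrative:

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator

class RuntimeAdapter(ABC):
    @abstractmethod
    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int: ...

    @abstractmethod
    def parse_trace(self, trace_path: Path) -> Iterator[dict]: ...

class EchoRuntime(RuntimeAdapter):
    """Toy adapter: records the prompt as a single JSONL event."""

    def execute(self, prompt: str, skill_path: Path, trace_path: Path) -> int:
        trace_path.write_text(json.dumps({"type": "prompt", "text": prompt}) + "\n")
        return 0  # exit code, as a real CLI adapter would report

    def parse_trace(self, trace_path: Path) -> Iterator[dict]:
        for line in trace_path.read_text().splitlines():
            yield json.loads(line)

with tempfile.TemporaryDirectory() as tmp:
    trace = Path(tmp) / "trace.jsonl"
    EchoRuntime().execute("scaffold a React app", Path(tmp), trace)
    events = list(EchoRuntime().parse_trace(trace))
    print(events[0]["type"])  # prompt
```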

Design Decisions

| Decision | Rationale |
| --- | --- |
| Immutable models (`frozen=True`) | Ensures check results can’t be accidentally modified |
| Error collection vs. throwing | Parser collects errors in a tuple; evaluation continues |
| Decorator-based registration | No central file listing all checks needed |
| Weighted scoring | Different severities and dimensions have different impact |
| Strict typing | mypy strict mode enforced in pyproject.toml |
| Generic `Registry[T]` | Reusable base for CheckRegistry and TraceCheckRegistry |
| Base class helpers | `_require_metadata()`, `_skill_md_location()` reduce repetition |
| `T \| None` over `Optional[T]` | Python 3.10+ union syntax for cleaner type annotations |
| Custom YAML loader | Prevents yes→True and null→None coercion |
| NFKC Unicode normalization | Ensures precomposed/decomposed forms match in naming checks |
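The NFKC decision is easiest to see with an example: the same visible skill name can arrive as precomposed or decomposed Unicode, and only normalization makes the two forms compare equal:

```python
import unicodedata

precomposed = "caf\u00e9-skill"   # 'é' as one code point
decomposed = "cafe\u0301-skill"   # 'e' + combining acute accent

print(precomposed == decomposed)  # False - raw strings differ

def norm(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

print(norm(precomposed) == norm(decomposed))  # True - equal after NFKC
```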

Adding a New Check

See the Contributing Guidelines for detailed instructions on adding new checks. Quick reference:
  1. Create class extending StaticCheck in appropriate checks/static/ module
  2. Define class attributes: check_id, check_name, severity, dimension
  3. Set spec_required = True if required by Agent Skills spec
  4. Implement run(skill: Skill) -> CheckResult
  5. Add @register_check decorator
  6. Add tests to tests/test_checks.py
For schema-based checks, simply add a FieldRule to FRONTMATTER_SCHEMA.
