Overview

Dream Foundry uses a two-layer judging system to prevent “LLM rubber stamp” failures:
  1. Layer A (Hard Gates): deterministic pass/fail checks, with no LLM involved
  2. Layer B (Rubric Score): weighted scoring across Success, Quality, and Speed
If a candidate fails ANY hard gate, it’s disqualified regardless of rubric score.
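The two-layer flow can be sketched as follows. This is an illustrative sketch, not the actual forge.py implementation; the `judge` function name and the dictionary keys are assumptions:

```python
def judge(candidate: dict) -> dict:
    """Two-layer judging: hard gates first, rubric score second.

    `candidate` is assumed to expose the five boolean gate results and
    the three rubric sub-scores; the key names here are illustrative.
    """
    gates = [
        candidate["produces_artifact"],
        candidate["artifact_not_empty"],
        candidate["has_required_fields"],
        candidate["min_event_count"],
        candidate["no_fatal_errors"],
    ]
    if not all(gates):  # Layer A: any failed gate disqualifies outright
        return {"disqualified": True, "total": 0.0}

    weights = {"success": 0.2, "quality": 0.6, "speed": 0.2}
    total = (           # Layer B: weighted rubric score
        candidate["success_score"] * weights["success"]
        + candidate["quality_score"] * weights["quality"]
        + candidate["speed_score"] * weights["speed"]
    )
    return {"disqualified": False, "total": round(total, 1)}
```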

Scoring Formula

Candidates are scored on three weighted criteria:
# From scoring.py:352
weights = {'success': 0.2, 'quality': 0.6, 'speed': 0.2}

total_score = (
    success_score * weights['success'] +
    quality_score * weights['quality'] +
    speed_score * weights['speed']
)
Success: 20% | Quality: 60% | Speed: 20%

Quality dominates the scoring because a fast, error-free implementation that produces garbage output is worthless.

Layer A: Hard Gates

These deterministic checks run BEFORE any scoring. They’re fast, reliable, and objective.

Gate 1: Produces Artifact

Does the candidate generate output files?
# From scoring_rubric.md:34
artifact_path = output_dir / "events.json"
alt_artifact = output_dir / "events.md"
produces_artifact = artifact_path.exists() or alt_artifact.exists()
Fails if: No output file exists

Gate 2: Artifact Not Empty

Is the output file meaningful?
# From scoring_rubric.md:48
content = artifact.read_text().strip()
not_empty = len(content) > 10
Fails if: Content ≤ 10 characters

Gate 3: Has Required Fields

Does the output contain necessary data?
# From scoring_rubric.md:59
has_fields = all(
    "title" in e and "date" in e and "url" in e
    for e in events
)
Fails if: Missing title, date, or URL fields
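As a runnable illustration of this gate (the sample events are made up for this example):

```python
# Sample events; the second one deliberately omits "date"
events = [
    {"title": "AI Agents Meetup", "date": "2026-01-25", "url": "https://lu.ma/x"},
    {"title": "LLM Study Group", "url": "https://lu.ma/y"},
]

has_fields = all(
    "title" in e and "date" in e and "url" in e
    for e in events
)
print(has_fields)  # → False: the second event is missing "date"
```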

Gate 4: Minimum Event Count

Does the output have enough events?
# From scoring_rubric.md:97
min_count = count >= 3
Fails if: Fewer than 3 events

Gate 5: No Fatal Errors

Did the process complete successfully?
# From forge.py:169
if result.returncode != 0:
    error_occurred = True
Fails if: Non-zero exit code

Agent Delta

Fails Gate 5

Crashes with a divide-by-zero error:
events_found = 0
average = 100 / events_found  # BOOM!
Result: Disqualified

Agent Epsilon

Passes All Gates

Produces output with required fields, but the data is low quality (wrong dates, wrong locations).

Result: Proceeds to scoring (but receives a low quality score)

Layer B: Rubric Scoring

Only candidates that pass ALL hard gates proceed to rubric scoring.

Success Score (20%)

Binary: Did the candidate run without errors?
# From scoring.py:377
success = produced_output and not error_occurred

if not success:
    return 0.0  # Instant disqualification

success_score = 100.0
Calculation:
  • Success = 100 points
  • Failure = 0 points (disqualified)

Quality Score (60%)

Most important criterion. Validates output against strict requirements.

Validation Rules

1. Date Validation

Must be January 24-31, 2026:
# From scoring.py:62
VALID_DATE_START = datetime(2026, 1, 24)
VALID_DATE_END = datetime(2026, 1, 31)

def is_valid_date(date_str: str) -> bool:
    parsed = parse_date(date_str)
    if not parsed:
        return False
    return VALID_DATE_START <= parsed <= VALID_DATE_END
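The snippet relies on a `parse_date` helper that is not shown. A minimal stand-in (an assumption; the real scoring.py version may accept more formats) could be:

```python
from datetime import datetime

VALID_DATE_START = datetime(2026, 1, 24)
VALID_DATE_END = datetime(2026, 1, 31)

def parse_date(date_str: str):
    """Try a few common formats; return None on failure."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None

def is_valid_date(date_str: str) -> bool:
    parsed = parse_date(date_str)
    if not parsed:
        return False
    return VALID_DATE_START <= parsed <= VALID_DATE_END

print(is_valid_date("2026-01-25"))  # → True: inside the window
print(is_valid_date("2026-02-01"))  # → False: after January 31
print(is_valid_date("sometime"))    # → False: unparseable
```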
2. Location Validation

Must be San Francisco, San Jose, Palo Alto, or Mountain View ONLY:
# From scoring.py:66
VALID_LOCATIONS = [
    "san francisco",
    "san jose",
    "palo alto",
    "mountain view",
]

def is_valid_location(location: str) -> bool:
    location_lower = location.lower()
    return any(valid_loc in location_lower for valid_loc in VALID_LOCATIONS)
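Because the check is a substring match, full address strings pass too. A quick demonstration (the example inputs are made up):

```python
VALID_LOCATIONS = [
    "san francisco",
    "san jose",
    "palo alto",
    "mountain view",
]

def is_valid_location(location: str) -> bool:
    location_lower = location.lower()
    return any(valid_loc in location_lower for valid_loc in VALID_LOCATIONS)

print(is_valid_location("San Francisco, CA"))       # → True: substring match
print(is_valid_location("Downtown Mountain View"))  # → True
print(is_valid_location("Oakland, CA"))             # → False: not on the list
```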
3. AI Relevance

Must be AI/ML related:
# From scoring.py:87
AI_KEYWORDS = [
    'ai', 'machine learning', 'ml', 'llm', 'gpt', 'claude',
    'langchain', 'rag', 'neural', 'deep learning', 'nlp',
    'computer vision', 'generative', 'transformer', 'agent',
    'anthropic', 'openai', 'hugging face', 'pytorch',
    'tensorflow', 'data science', 'mlops', 'embedding',
    'vector', 'genai', 'diffusion', 'hackathon',
]

def is_ai_related(title: str) -> bool:
    title_lower = title.lower()
    return any(keyword in title_lower for keyword in AI_KEYWORDS)
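A quick demonstration using a subset of the keyword list (the titles are invented for illustration):

```python
AI_KEYWORDS = ['ai', 'machine learning', 'rag', 'hackathon']  # subset for brevity

def is_ai_related(title: str) -> bool:
    title_lower = title.lower()
    return any(keyword in title_lower for keyword in AI_KEYWORDS)

print(is_ai_related("SF GenAI Hackathon"))     # → True: matches 'ai' and 'hackathon'
print(is_ai_related("Pottery for Beginners"))  # → False: no keyword match
```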
4. URL Validation

Must be from known event platforms:
# From scoring.py:75
VALID_URL_PATTERNS = [
    r'https?://(www\.)?lu\.ma/',
    r'https?://(www\.)?luma\.com/',
    r'https?://(www\.)?meetup\.com/',
    r'https?://(www\.)?eventbrite\.(com|co\.uk)/',
    r'https?://(www\.)?partiful\.com/',
    r'https?://(www\.)?supermomos\.com/',
    r'https?://(www\.)?humanx\.co/',
    r'https?://(compute\.)?daytona\.io/',
]
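The patterns are raw regular expressions anchored at the URL's start. A checking function built on them (the `is_valid_url` name is an assumption, not necessarily the scoring.py API) might look like:

```python
import re

VALID_URL_PATTERNS = [
    r'https?://(www\.)?lu\.ma/',
    r'https?://(www\.)?meetup\.com/',
    r'https?://(www\.)?eventbrite\.(com|co\.uk)/',
]  # abbreviated; the full list appears above

def is_valid_url(url: str) -> bool:
    # re.match anchors at the start, so scheme and domain must lead
    return any(re.match(pattern, url) for pattern in VALID_URL_PATTERNS)

print(is_valid_url("https://lu.ma/ai-builders-night"))  # → True
print(is_valid_url("https://example.com/event"))        # → False: unknown platform
```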
5. Hackathon Requirement

Must include at least one hackathon event:
# From scoring.py:157
def has_hackathon_event(events: list[dict]) -> bool:
    for event in events:
        title = event.get('title', '').lower()
        event_type = event.get('event_type', '').lower()
        if 'hackathon' in title or 'hackathon' in event_type:
            return True
    return False
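Usage example (the event data is invented for illustration):

```python
def has_hackathon_event(events: list[dict]) -> bool:
    for event in events:
        title = event.get('title', '').lower()
        event_type = event.get('event_type', '').lower()
        if 'hackathon' in title or 'hackathon' in event_type:
            return True
    return False

meetups_only = [{"title": "LLM Paper Club", "event_type": "meetup"}]
with_hack = meetups_only + [{"title": "GenAI Weekend Hackathon", "event_type": "hackathon"}]

print(has_hackathon_event(meetups_only))  # → False
print(has_hackathon_event(with_hack))     # → True
```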

Quality Score Calculation

# From scoring.py:276
score = 0.0

# Valid events: up to 40 points (10+ valid = 40 points)
if total_events > 0:
    valid_ratio = valid_events / total_events
    score += valid_ratio * 40

# Hackathon event: 25 points
if hackathon_found:
    score += 25

# Event count bonus: up to 20 points (14+ events = 20 points)
score += min(20, (valid_events / 14) * 20)

# No invalid events bonus: 15 points
if len(invalid_date_events) == 0 and len(invalid_location_events) == 0:
    score += 15
Maximum Score: 100 points
Agent Gamma’s quality score:
  • 10 valid events out of 10 total → 40 points (100% valid ratio × 40)
  • Hackathon event included → 25 points
  • 10 valid events → 14.3 points (10/14 × 20)
  • Zero invalid events → 15 points
Total: 94.3/100
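Agent Gamma's figures can be reproduced by running the formula above with the worked-example inputs:

```python
# Inputs from Agent Gamma's worked example
valid_events, total_events = 10, 10
hackathon_found = True
invalid_date_events, invalid_location_events = [], []

score = 0.0
if total_events > 0:
    score += (valid_events / total_events) * 40          # 40.0
if hackathon_found:
    score += 25                                          # 25
score += min(20, (valid_events / 14) * 20)               # ~14.29
if len(invalid_date_events) == 0 and len(invalid_location_events) == 0:
    score += 15                                          # 15

print(round(score, 1))  # → 94.3
```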

Speed Score (20%)

Faster execution gets higher scores:
# From scoring.py:332
def calculate_speed_score(runtime_seconds: float, max_time: float = 30.0) -> float:
    if runtime_seconds <= 0:
        return 100.0
    if runtime_seconds >= max_time:
        return 0.0
    return round(100 * (1 - runtime_seconds / max_time), 1)
Calculation:
  • 0 seconds = 100 points
  • 30+ seconds = 0 points
  • Linear interpolation between
  • Agent Alpha: 4.2s → 86.0 points
  • Agent Beta: 12.8s → 57.3 points
  • Agent Gamma: 6.5s → 78.3 points
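The agent scores listed above follow directly from the function:

```python
def calculate_speed_score(runtime_seconds: float, max_time: float = 30.0) -> float:
    if runtime_seconds <= 0:
        return 100.0
    if runtime_seconds >= max_time:
        return 0.0
    # Linear interpolation between 0 s (100 points) and max_time (0 points)
    return round(100 * (1 - runtime_seconds / max_time), 1)

print(calculate_speed_score(4.2))   # Agent Alpha  → 86.0
print(calculate_speed_score(12.8))  # Agent Beta   → 57.3
print(calculate_speed_score(6.5))   # Agent Gamma  → 78.3
```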

Final Score Calculation

Putting it all together:
# From scoring.py:359
total = (
    success_score * weights['success'] +
    quality_score * weights['quality'] +
    speed_score * weights['speed']
)

Example: Agent Gamma

success_score = 100.0  # Ran without errors
quality_score = 94.3   # 10 valid events, hackathon included
speed_score = 78.3     # 6.5 seconds execution

total_score = (
    100.0 * 0.2 +    # Success: 20.0
    94.3 * 0.6 +     # Quality: 56.6
    78.3 * 0.2       # Speed: 15.7
)  # = 92.3

Example: Agent Alpha

success_score = 100.0  # Ran without errors
quality_score = 45.0   # Missing hackathon event, only 4 events
speed_score = 86.0     # Very fast: 4.2 seconds

total_score = (
    100.0 * 0.2 +    # Success: 20.0
    45.0 * 0.6 +     # Quality: 27.0
    86.0 * 0.2       # Speed: 17.2
)  # = 64.2

Example: Agent Delta

success_score = 0.0    # Crashed with divide-by-zero
quality_score = 0.0    # No output produced
speed_score = 0.0      # Failed before completion

total_score = 0.0      # Disqualified

Sentry Integration

Sentry provides reliability data for scoring:
# From scoring_rubric.md:252
def calculate_reliability_score(
    exit_code: int,
    exception_count: int,
    sentry_event_count: int,
) -> float:
    if exit_code != 0:
        return 0  # Hard fail

    base = 100
    exception_penalty = exception_count * 10
    sentry_penalty = sentry_event_count * 5

    return max(0, base - exception_penalty - sentry_penalty)
Penalty Structure:
  • Exception: -10 points each
  • Sentry event: -5 points each
  • Non-zero exit code: Instant 0
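Usage example for the penalty structure:

```python
def calculate_reliability_score(
    exit_code: int,
    exception_count: int,
    sentry_event_count: int,
) -> float:
    if exit_code != 0:
        return 0  # Hard fail

    base = 100
    exception_penalty = exception_count * 10
    sentry_penalty = sentry_event_count * 5
    return max(0, base - exception_penalty - sentry_penalty)

print(calculate_reliability_score(0, 0, 1))  # → 95: one Sentry event
print(calculate_reliability_score(0, 2, 3))  # → 65: two exceptions, three events
print(calculate_reliability_score(1, 0, 0))  # → 0: non-zero exit code
```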

Scoreboard Display

Results are displayed in a ranked scoreboard:
╔══════════════════════════════════════════════════════════════╗
║                    THE FORGE - SCOREBOARD                     ║
╠══════════════════════════════════════════════════════════════╣
║  Candidate         │ Gates │ Rel  │ Qual │ Spd  │ Fmt  │ TOTAL║
╠══════════════════════════════════════════════════════════════╣
║  🥇 Agent Gamma    │ ✅ 5/5 │  95  │  94  │  78  │  90  │ 92.3 ║
║  🥈 Agent Beta     │ ✅ 5/5 │  90  │  88  │  57  │  85  │ 84.1 ║
║  🥉 Agent Alpha    │ ✅ 5/5 │  100 │  45  │  86  │  75  │ 64.2 ║
╠══════════════════════════════════════════════════════════════╣
║  ❌ Agent Epsilon  │ ✅ 5/5 │  100 │  12  │  92  │  60  │ 34.4 ║
║  ❌ Agent Delta    │ ❌ 0/5 │  0   │  0   │  0   │  0   │ DQ   ║
╚══════════════════════════════════════════════════════════════╝

Key Insights

Quality Dominates

With 60% weight, quality is the primary differentiator. Fast garbage loses to slow excellence.

Hard Gates Prevent Gaming

Deterministic checks ensure LLMs can’t “rubber stamp” bad outputs. You either pass or you don’t.

Speed Matters, But Less

At 20%, speed is a tiebreaker. Agent Alpha’s speed can’t overcome its quality deficit.

Strict Validation

Every event is validated against 4 criteria: date, location, AI-relevance, URL. No shortcuts.

Next Steps

Agent Strategies

See how each agent's strategy affects its score

Architecture

Understand the system design behind the scoring
