Timepoint Pro provides built-in evaluation metrics to validate simulation quality and consistency.

Running Evaluation

python cli.py mode=evaluate
Evaluates all entities currently stored in the database.

Core Metrics

Three primary metrics assess simulation quality:

  • Temporal Coherence: Consistency of entities across timepoints
  • Knowledge Consistency: Information conservation compliance
  • Biological Plausibility: Constraint enforcement validation

Temporal Coherence Score

Measures behavioral consistency across timepoints.

Formula

violations = 0
for each consecutive timepoint pair:
    if personality_traits_changed_significantly:
        violations += 1

score = 1.0 - (violations / num_transitions)
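The pseudocode above can be made concrete with a minimal Python sketch. The snapshot format, the 0.3 change threshold, and the `temporal_coherence` function name are illustrative assumptions, not Timepoint Pro's actual API:

```python
def temporal_coherence(timepoints, threshold=0.3):
    """Score 1.0 means no significant trait shift between consecutive timepoints.

    `timepoints` is assumed to be a chronological list of trait snapshots,
    each a dict mapping trait name -> value in [0, 1].
    """
    transitions = list(zip(timepoints, timepoints[1:]))
    if not transitions:
        return 1.0
    violations = 0
    for prev, curr in transitions:
        # A trait change larger than the threshold counts as a violation.
        if any(abs(curr[t] - prev[t]) > threshold for t in prev):
            violations += 1
    return 1.0 - violations / len(transitions)

snapshots = [
    {"caution": 0.90, "openness": 0.40},
    {"caution": 0.85, "openness": 0.45},  # gradual drift: no violation
    {"caution": 0.10, "openness": 0.45},  # cautious -> reckless overnight
]
print(temporal_coherence(snapshots))  # 0.5 (1 violation / 2 transitions)
```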

What It Validates

Personality traits should remain stable over time. Checks:
  • Personality trait consistency
  • Character arc plausibility
  • No sudden personality shifts
Example violations:
  • Cautious character becomes reckless overnight
  • Reserved person suddenly becomes extroverted
  • Core values change without cause
Core characteristics persist unless causally justified. Checks:
  • Trait stability across timepoints
  • Gradual vs. sudden changes
  • Causal explanations for shifts

Score Interpretation

1.0 (Perfect Coherence): No behavioral violations detected. Entities maintain consistent personalities across all timepoints.

Knowledge Consistency Score

Validates information conservation: entities can only know what they’ve been exposed to.

Formula

if knowledge_properly_sourced:
    score = 1.0
else:
    score = 0.0
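As a rough Python sketch of this all-or-nothing check, every knowledge item must trace back to a recorded exposure event. The data shapes and the `knowledge_consistency` function name are assumptions for illustration:

```python
def knowledge_consistency(knowledge_items, exposure_events):
    """Return 1.0 if every knowledge item traces to a recorded exposure event, else 0.0."""
    recorded = {event["event_id"] for event in exposure_events}
    properly_sourced = all(
        item.get("source_event") in recorded for item in knowledge_items
    )
    return 1.0 if properly_sourced else 0.0

events = [{"event_id": "letter_1776"}]
known = [{"fact": "treaty terms", "source_event": "letter_1776"}]
print(knowledge_consistency(known, events))  # 1.0

unsourced = [{"fact": "secret plan"}]  # no exposure event recorded
print(knowledge_consistency(unsourced, events))  # 0.0
```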

What It Validates

Every knowledge item must have a source. Checks:
  • All knowledge has recorded exposure event
  • Source entity or event exists
  • Timestamp is causally valid
Example violations:
  • Entity knows information without witnessing it
  • Knowledge appears without source
  • Anachronistic information (knows future events)
Knowledge spreads through valid paths. Checks:
  • Information flows along relationship edges
  • No spontaneous knowledge generation
  • Social network constraints respected
Example violations:
  • Entity knows secrets without connection to source
  • Information spreads faster than possible
  • Knowledge crosses disconnected graph components
Knowledge can only come from past events. Checks:
  • Exposure timestamp < current timepoint
  • No future information leak
  • Proper causal chain
Example violations:
  • Entity knows outcome before it happens
  • Future information influences past decisions
  • Causal chain broken
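The causality rule above (exposure timestamp < current timepoint) can be sketched as a standalone predicate. The `causally_valid` helper and the integer timepoints are hypothetical:

```python
def causally_valid(knowledge_items, current_timepoint):
    """True only if every exposure strictly predates the current timepoint."""
    return all(item["exposed_at"] < current_timepoint for item in knowledge_items)

known = [
    {"fact": "battle outcome", "exposed_at": 3},
    {"fact": "election result", "exposed_at": 7},  # future leak at timepoint 5
]
print(causally_valid(known, current_timepoint=5))  # False
```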

Score Interpretation

1.0 (Valid): All knowledge properly sourced. No information conservation violations.

Biological Plausibility Score

Measures constraint enforcement and physical/resource realism.

Formula

violations = 0
for each action:
    if violates_constraints:
        violations += 1

score = 1.0 - (violations / num_actions)
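This formula mirrors the temporal coherence score but counts per-action constraint violations. A minimal sketch, where the energy budget, distance limit, and action record fields are all illustrative assumptions:

```python
def biological_plausibility(actions, energy_budget=100, max_distance=50):
    """Score 1.0 means every action respects the energy and movement limits."""
    if not actions:
        return 1.0
    violations = sum(
        1 for action in actions
        if action["energy_cost"] > energy_budget or action["distance"] > max_distance
    )
    return 1.0 - violations / len(actions)

actions = [
    {"name": "walk to market", "energy_cost": 10, "distance": 2},
    {"name": "ride 300 miles overnight", "energy_cost": 40, "distance": 300},
]
print(biological_plausibility(actions))  # 0.5 (1 violation / 2 actions)
```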

What It Validates

Actions respect physical limitations. Checks:
  • Movement speed plausible
  • Energy expenditure realistic
  • Physical capabilities within human range
Example violations:
  • Entity travels impossible distance in timespan
  • Action requires more energy than available
  • Superhuman abilities without justification
Actions consume appropriate resources. Checks:
  • Energy budget tracking
  • Resource availability
  • Consumption rates
Example violations:
  • Entity acts without sufficient energy
  • Resource consumption exceeds supply
  • Negative resource balances
Physical and emotional states influence behavior. Checks:
  • Fatigue affects performance
  • Stress influences decisions
  • Physiological needs matter
Example violations:
  • Exhausted entity performs at peak
  • Emotional state ignored in decision-making
  • Physical needs not reflected in behavior

Score Interpretation

1.0 (Fully Plausible): No constraint violations. All actions respect physical and resource limitations.

Example Output

Evaluating 5 entities:

  george_washington:
    Temporal Coherence:      0.95
    Knowledge Consistency:   1.00
    Biological Plausibility: 0.92

  john_adams:
    Temporal Coherence:      0.88
    Knowledge Consistency:   1.00
    Biological Plausibility: 0.87

  thomas_jefferson:
    Temporal Coherence:      0.91
    Knowledge Consistency:   1.00
    Biological Plausibility: 0.89

  alexander_hamilton:
    Temporal Coherence:      0.93
    Knowledge Consistency:   1.00
    Biological Plausibility: 0.94

  james_madison:
    Temporal Coherence:      0.87
    Knowledge Consistency:   1.00
    Biological Plausibility: 0.85

Resolution Distribution:
  SCENE: 3 entities
  DIALOG: 2 entities

Cost: $0.00 (evaluation uses cached data)
Tokens: 0

Resolution Distribution

Evaluation also reports entity resolution levels:
Minimal detail, compressed representation only: ~200 tokens per entity

Generated Reports

Evaluation generates two report files:
{
  "entities_evaluated": 5,
  "resolution_distribution": {
    "SCENE": 3,
    "DIALOG": 2
  },
  "cost": 0.00,
  "tokens": 0,
  "timestamp": "2024-12-07T12:34:56"
}
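The JSON report above is straightforward to consume programmatically. A sketch with the sample payload inlined, since the report's on-disk filename is not specified here:

```python
import json

# Sample payload matching the report structure shown above.
report = json.loads("""
{
  "entities_evaluated": 5,
  "resolution_distribution": {"SCENE": 3, "DIALOG": 2},
  "cost": 0.0,
  "tokens": 0
}
""")

print(f"Entities evaluated: {report['entities_evaluated']}")
for level, count in report["resolution_distribution"].items():
    print(f"  {level}: {count} entities")
```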

Validation Integration

Evaluation metrics use the same validators as training:
  • validate_behavioral_inertia() - Temporal coherence
  • validate_information_conservation() - Knowledge consistency
  • validate_biological_constraints() - Biological plausibility
See Validation for implementation details.

When to Evaluate

Run evaluation after:
1. Training: After mode=train or mode=temporal_train to validate entity quality
2. Simulation: After running templates with ./run.sh to check consistency
3. Debugging: When investigating unexpected entity behavior
4. Before Export: Before exporting data to ensure quality

Next Steps

  • Interactive Queries: Query your evaluated entities
  • Training: Improve entity quality with better training
  • Validation: Learn about the validation system
  • CLI Overview: Back to the CLI overview
