Overview

The Experiment Manager provides a structured way to track experiments, log metrics, manage checkpoints, and maintain experiment history with automatic versioning.

ExperimentManager

Class for managing experiment lifecycle, metrics, and checkpoints.

Constructor

ExperimentManager(log_dir="experiments/logs")
Parameters:
  • log_dir (str): Directory where experiment logs are stored. Defaults to experiments/logs.

Methods

start_experiment

Initiates a new experiment and creates an experiment record.

Parameters:
  • config_name (str, required): Name of the configuration used for this experiment.
  • hyperparameters (dict, required): Dictionary of hyperparameters for the experiment (e.g., learning rate, batch size).
  • metadata (dict, required): Additional metadata about the experiment. The following fields are automatically added if not provided:
      • precision: Precision mode
      • model_size: Model size description
      • dataset_version: Dataset version identifier
      • hardware_constraint_mode: Hardware constraint settings
  • experiment_id (str, optional): Custom experiment ID. If not provided, derived from config_name.

Returns: ExperimentRecord - The created experiment record

Example:
from experiment_manager import ExperimentManager

manager = ExperimentManager(log_dir="my_experiments")

record = manager.start_experiment(
    config_name="mnist_baseline",
    hyperparameters={
        "learning_rate": 0.01,
        "batch_size": 32,
        "epochs": 10
    },
    metadata={
        "precision": "float32",
        "model_size": "medium",
        "dataset_version": "v1.0"
    }
)

print(f"Started experiment {record.experiment_id} v{record.version}")

log_metrics

Logs metrics for the active experiment.

Parameters:
  • metrics (dict): Dictionary mapping metric names to lists of values (e.g., per-epoch losses)

Returns: None

Raises: RuntimeError if no active experiment

Example:
manager.log_metrics({
    "loss": [0.5, 0.4, 0.3, 0.25],
    "accuracy": [0.85, 0.88, 0.90, 0.92],
    "val_loss": [0.52, 0.45, 0.38, 0.33]
})
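
Metric values are written to the experiment's JSON history, and (per the Notes below) are converted to JSON-compatible types automatically. The helper below is a hypothetical sketch of what such a conversion might look like, not the library's actual implementation; the `to_jsonable` name and the `.item()` check for NumPy scalars are assumptions.

```python
import json

def to_jsonable(value):
    """Hypothetical sketch of converting metric values to
    JSON-compatible types before writing the history file."""
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if hasattr(value, "item"):  # NumPy scalars expose .item()
        return value.item()
    return value  # plain int/float/str/bool/None pass through

metrics = {"loss": [0.5, 0.4], "accuracy": (0.85, 0.88)}
print(json.dumps(to_jsonable(metrics)))
# {"loss": [0.5, 0.4], "accuracy": [0.85, 0.88]}
```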

add_checkpoint

Records a checkpoint path for the active experiment.

Parameters:
  • checkpoint_path (str): File path to the saved checkpoint

Returns: None

Raises: RuntimeError if no active experiment

Example:
manager.add_checkpoint("checkpoints/model_epoch_5.npz")
manager.add_checkpoint("checkpoints/model_final.npz")

read_history

Reads the complete history for an experiment ID.

Parameters:
  • experiment_id (str): Experiment identifier

Returns: List of dictionaries, one per version

Example:
history = manager.read_history("mnist_baseline")
for entry in history:
    print(f"Version {entry['version']}: accuracy = {entry['metrics'].get('accuracy', [])}")

ExperimentRecord

Dataclass representing a single experiment run.

Fields

  • experiment_id (str): Unique identifier for the experiment series.
  • version (int): Version number, auto-incremented for each run of the same experiment.
  • created_at (str): ISO 8601 timestamp of when the experiment was created.
  • config_name (str): Name of the configuration used.
  • hyperparameters (dict): Hyperparameters used in the experiment.
  • metadata (dict): Additional metadata about the experiment.
  • metrics (dict): Dictionary mapping metric names to lists of values. Defaults to empty dict.
  • checkpoints (list): List of checkpoint file paths. Defaults to empty list.

Complete Workflow Example

from experiment_manager import ExperimentManager
import numpy as np

# Initialize manager
manager = ExperimentManager(log_dir="experiments/logs")

# Start experiment
record = manager.start_experiment(
    config_name="neural_net_experiment",
    hyperparameters={
        "learning_rate": 0.001,
        "batch_size": 64,
        "epochs": 20,
        "optimizer": "sgd"
    },
    metadata={
        "precision": "float16",
        "model_size": "large",
        "dataset_version": "v2.1",
        "hardware_constraint_mode": "memory_limited"
    }
)

# Simulate training and log metrics
train_losses = []
train_accuracies = []

for epoch in range(20):
    # ... training code ...
    train_losses.append(np.random.uniform(0.1, 0.3))
    train_accuracies.append(np.random.uniform(0.85, 0.95))
    
    # Save checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint_path = f"checkpoints/model_epoch_{epoch+1}.npz"
        # ... save model ...
        manager.add_checkpoint(checkpoint_path)

# Log final metrics
manager.log_metrics({
    "loss": train_losses,
    "accuracy": train_accuracies
})

# View experiment history
history = manager.read_history(record.experiment_id)
print(f"Total versions: {len(history)}")
for h in history:
    print(f"  v{h['version']}: {h['created_at']}")

Notes

  • Each experiment ID maintains its own history file in JSON format
  • Versions are automatically incremented when starting a new experiment with the same ID
  • All metrics are converted to JSON-compatible types automatically
  • The active experiment is persisted after every operation (start, log_metrics, add_checkpoint)
  • Only one experiment can be active at a time per ExperimentManager instance
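
The version auto-increment over a per-experiment JSON history file can be sketched as follows. The file layout (one JSON list per experiment ID) and the `next_version` helper are assumptions for illustration, not the library's actual implementation.

```python
import json
import tempfile
from pathlib import Path

def next_version(history_path: Path) -> int:
    """Hypothetical sketch: read a JSON history file and return
    the next version number (1 if no history exists yet)."""
    if not history_path.exists():
        return 1
    entries = json.loads(history_path.read_text())
    return max(e["version"] for e in entries) + 1

# Simulate two runs of the same experiment ID
log_dir = Path(tempfile.mkdtemp())
path = log_dir / "mnist_baseline.json"

v1 = next_version(path)  # no history yet -> 1
path.write_text(json.dumps([{"version": v1}]))
v2 = next_version(path)  # history already holds v1 -> 2
print(v1, v2)  # 1 2
```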