Overview

The Experiment Manager provides a structured way to track experiments, log metrics, manage checkpoints, and maintain experiment history with automatic versioning.

ExperimentManager

Class for managing experiment lifecycle, metrics, and checkpoints.

Constructor

ExperimentManager(log_dir="experiments/logs")
Parameters:
  • log_dir (str): Directory where experiment logs are stored. Defaults to experiments/logs.

Methods

start_experiment

Initiates a new experiment and creates an experiment record.

Parameters:
  • config_name (str, required): Name of the configuration used for this experiment.
  • hyperparameters (dict, required): Dictionary of hyperparameters for the experiment (e.g., learning rate, batch size).
  • metadata (dict, required): Additional metadata about the experiment. The following fields are automatically added if not provided:
      • precision: Precision mode
      • model_size: Model size description
      • dataset_version: Dataset version identifier
      • hardware_constraint_mode: Hardware constraint settings
  • experiment_id (str, optional): Custom experiment ID. If not provided, derived from config_name.

Returns: ExperimentRecord - The created experiment record

Example:
from experiment_manager import ExperimentManager

manager = ExperimentManager(log_dir="my_experiments")

record = manager.start_experiment(
    config_name="mnist_baseline",
    hyperparameters={
        "learning_rate": 0.01,
        "batch_size": 32,
        "epochs": 10
    },
    metadata={
        "precision": "float32",
        "model_size": "medium",
        "dataset_version": "v1.0"
    }
)

print(f"Started experiment {record.experiment_id} v{record.version}")

log_metrics

Logs metrics for the active experiment.

Parameters:
  • metrics (dict): Dictionary mapping metric names to lists of values (e.g., per-epoch losses)

Returns: None

Raises: RuntimeError if no active experiment

Example:
manager.log_metrics({
    "loss": [0.5, 0.4, 0.3, 0.25],
    "accuracy": [0.85, 0.88, 0.90, 0.92],
    "val_loss": [0.52, 0.45, 0.38, 0.33]
})
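
Metric values are written to the experiment's JSON history, and (per the Notes below) are converted to JSON-compatible types automatically. The helper below is a hypothetical sketch of what such a conversion might look like, not the library's actual implementation; the `to_jsonable` name and the `.item()` check for NumPy scalars are assumptions.

```python
import json

def to_jsonable(value):
    """Hypothetical sketch of converting metric values to
    JSON-compatible types before writing the history file."""
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if hasattr(value, "item"):  # NumPy scalars expose .item()
        return value.item()
    return value  # plain int/float/str/bool/None pass through

metrics = {"loss": [0.5, 0.4], "accuracy": (0.85, 0.88)}
print(json.dumps(to_jsonable(metrics)))
# {"loss": [0.5, 0.4], "accuracy": [0.85, 0.88]}
```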

add_checkpoint

Records a checkpoint path for the active experiment.

Parameters:
  • checkpoint_path (str): File path to the saved checkpoint

Returns: None

Raises: RuntimeError if no active experiment

Example:
manager.add_checkpoint("checkpoints/model_epoch_5.npz")
manager.add_checkpoint("checkpoints/model_final.npz")

read_history

Reads the complete history for an experiment ID.

Parameters:
  • experiment_id (str): Experiment identifier

Returns: List of dictionaries, one per version

Example:
history = manager.read_history("mnist_baseline")
for entry in history:
    print(f"Version {entry['version']}: accuracy = {entry['metrics'].get('accuracy', [])}")

ExperimentRecord

Dataclass representing a single experiment run.

Fields

  • experiment_id (str): Unique identifier for the experiment series.
  • version (int): Version number, auto-incremented for each run of the same experiment.
  • created_at (str): ISO 8601 timestamp of when the experiment was created.
  • config_name (str): Name of the configuration used.
  • hyperparameters (dict): Hyperparameters used in the experiment.
  • metadata (dict): Additional metadata about the experiment.
  • metrics (dict): Dictionary mapping metric names to lists of values. Defaults to empty dict.
  • checkpoints (list): List of checkpoint file paths. Defaults to empty list.

Complete Workflow Example

from experiment_manager import ExperimentManager
import numpy as np

# Initialize manager
manager = ExperimentManager(log_dir="experiments/logs")

# Start experiment
record = manager.start_experiment(
    config_name="neural_net_experiment",
    hyperparameters={
        "learning_rate": 0.001,
        "batch_size": 64,
        "epochs": 20,
        "optimizer": "sgd"
    },
    metadata={
        "precision": "float16",
        "model_size": "large",
        "dataset_version": "v2.1",
        "hardware_constraint_mode": "memory_limited"
    }
)

# Simulate training and log metrics
train_losses = []
train_accuracies = []

for epoch in range(20):
    # ... training code ...
    train_losses.append(np.random.uniform(0.1, 0.3))
    train_accuracies.append(np.random.uniform(0.85, 0.95))
    
    # Save checkpoint every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint_path = f"checkpoints/model_epoch_{epoch+1}.npz"
        # ... save model ...
        manager.add_checkpoint(checkpoint_path)

# Log final metrics
manager.log_metrics({
    "loss": train_losses,
    "accuracy": train_accuracies
})

# View experiment history
history = manager.read_history(record.experiment_id)
print(f"Total versions: {len(history)}")
for h in history:
    print(f"  v{h['version']}: {h['created_at']}")

Notes

  • Each experiment ID maintains its own history file in JSON format
  • Versions are automatically incremented when starting a new experiment with the same ID
  • All metrics are converted to JSON-compatible types automatically
  • The active experiment is persisted after every operation (start, log_metrics, add_checkpoint)
  • Only one experiment can be active at a time per ExperimentManager instance
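
The version auto-increment over a per-experiment JSON history file can be sketched as follows. The file layout (one JSON list per experiment ID) and the `next_version` helper are assumptions for illustration, not the library's actual implementation.

```python
import json
import tempfile
from pathlib import Path

def next_version(history_path: Path) -> int:
    """Hypothetical sketch: read a JSON history file and return
    the next version number (1 if no history exists yet)."""
    if not history_path.exists():
        return 1
    entries = json.loads(history_path.read_text())
    return max(e["version"] for e in entries) + 1

# Simulate two runs of the same experiment ID
log_dir = Path(tempfile.mkdtemp())
path = log_dir / "mnist_baseline.json"

v1 = next_version(path)  # no history yet -> 1
path.write_text(json.dumps([{"version": v1}]))
v2 = next_version(path)  # history already holds v1 -> 2
print(v1, v2)  # 1 2
```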