
Why Structured Training Matters

Training ML models is inherently experimental. You’ll run dozens or hundreds of experiments, tweaking hyperparameters, architectures, and datasets. Without proper workflow management:
  • Lost experiments: “What config gave us 92% accuracy?”
  • Unreproducible results: “The model worked yesterday…”
  • Wasted compute: Re-running failed experiments because you forgot to log something
  • Team chaos: Everyone has their own training scripts with hardcoded values
A good training workflow provides reproducibility, configurability, and observability.

Configuration Management

Hydra

Hydra separates code from configuration, making experiments reproducible and composable:
# config.yaml
model:
  name: bert-base
  dropout: 0.1
training:
  batch_size: 32
  learning_rate: 2e-5
  epochs: 3
# train.py
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config", version_base=None)
def train(cfg: DictConfig):
    model = load_model(cfg.model.name, dropout=cfg.model.dropout)
    # Training code uses cfg instead of hardcoded values
    train_model(model, lr=cfg.training.learning_rate)

if __name__ == "__main__":
    train()
Override from CLI:
python train.py training.learning_rate=5e-5 model.dropout=0.2
Hydra automatically logs the full config with each run, making experiments reproducible.
Hydra’s composition lets you create config variants (e.g., config/model/bert-base.yaml, config/model/roberta.yaml) and mix them: python train.py model=roberta
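A config group is just a directory of interchangeable YAML files plus a defaults list in the root config that selects among them. A minimal sketch (file contents here are illustrative):

```yaml
# config/config.yaml
defaults:
  - model: bert-base   # resolves to config/model/bert-base.yaml
  - _self_             # apply the keys below after the group defaults

training:
  batch_size: 32

# config/model/roberta.yaml — a second variant, selectable with model=roberta
# name: roberta-base
# dropout: 0.1
```

Running python train.py model=roberta swaps in the roberta file without touching any other config.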

Experiment Tracking

Weights & Biases (W&B)

W&B is one of the most widely used tools for tracking experiments:
import wandb

wandb.init(project="my-project", config={"lr": 1e-3, "batch_size": 32})

for epoch in range(num_epochs):
    loss = train_epoch()
    val_acc = evaluate()
    wandb.log({"loss": loss, "val_acc": val_acc, "epoch": epoch})

# log_artifact takes a file path (or an Artifact object), not an in-memory
# model — save the checkpoint to disk first, then log the file
wandb.log_artifact("model.pt", name="final-model", type="model")
Key features:
  • Interactive dashboards with real-time plots
  • Hyperparameter sweeps
  • Model versioning and lineage
  • Collaboration and sharing
  • Free for personal/academic use

MLflow

Open-source, self-hosted, integrates with many frameworks

Neptune.ai

Strong metadata search, good for large teams

Comet ML

Great UI, built-in model registry

TensorBoard

Simple, works offline, but limited features
For production systems, consider MLflow for its model registry and deployment integrations. For research, W&B or Neptune.ai provide the best UX.

Project Structure

A well-organized project makes collaboration easier:
my-ml-project/
├── configs/           # Hydra configs
│   ├── config.yaml
│   ├── model/
│   └── training/
├── src/
│   ├── data/          # Dataloaders
│   ├── models/        # Model definitions
│   ├── training/      # Training loops
│   └── utils/
├── tests/
├── notebooks/         # Exploration (not for production)
├── pyproject.toml     # Dependencies
└── README.md
Use uv or poetry for dependency management instead of raw pip. They create reproducible environments and handle version resolution.
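With uv, the dependency section of pyproject.toml might look like this (package pins are illustrative); uv lock then writes a lockfile and uv sync recreates the exact environment on any machine:

```toml
# pyproject.toml (minimal sketch)
[project]
name = "my-ml-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.2",
    "hydra-core>=1.3",
    "wandb>=0.16",
]
```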

Code Quality

Ruff

Ruff is a fast Python linter and formatter:
# Format code
ruff format .

# Check for issues
ruff check .
Ruff replaces multiple tools (Black, isort, flake8, pylint) with one fast binary written in Rust.
Add ruff format and ruff check to your CI pipeline. Enforce formatting before merging PRs.
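A CI job for this can be as small as the following GitHub Actions sketch (workflow name and triggers are illustrative):

```yaml
# .github/workflows/lint.yml
name: lint
on: [push, pull_request]
jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ruff
      - run: ruff format --check .   # fail if formatting would change any file
      - run: ruff check .            # lint
```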

Classic Example: BERT Fine-tuning

Module 3 includes a complete example of fine-tuning BERT for text classification:
  • Hydra for configuration
  • W&B for experiment tracking
  • Hugging Face Transformers
  • Proper train/val/test splits
  • Metric logging and model checkpointing
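The "proper splits" point deserves emphasis: shuffle once with a fixed seed so the split is reproducible across runs. A minimal sketch in plain Python (fractions are illustrative):

```python
import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle indices once with a fixed seed, then carve off test/val/train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 800 100 100
```

Because the seed is part of the function, re-running the pipeline yields the same split — which is what makes validation numbers comparable across experiments.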

Modern Example: LLM Fine-tuning

Module 3 also covers fine-tuning modern LLMs (Phi-3):
  • LoRA (Low-Rank Adaptation) for parameter-efficient training
  • Quantization (4-bit/8-bit) to fit on consumer GPUs
  • Instruction tuning datasets
  • Evaluation on domain-specific tasks
For LLMs, prefer LoRA or QLoRA over full fine-tuning. They’re faster, use less memory, and often generalize better.
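The memory argument for LoRA is simple arithmetic: instead of updating a full d×d weight matrix, it trains two low-rank factors of shape d×r and r×d. A back-of-the-envelope sketch (the hidden size and rank below are illustrative, not tied to any specific model):

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the LoRA factors A (d_in x rank) and B (rank x d_out)."""
    return d_in * rank + rank * d_out

d = 4096                 # a common transformer hidden size
full = d * d             # full fine-tuning updates every weight of the matrix
lora = lora_trainable_params(d, d, rank=8)
print(full, lora, round(full / lora))  # 16777216 65536 256
```

A ~256x reduction in trainable parameters per matrix is why LoRA fits on consumer GPUs, and pairing it with 4-bit quantization of the frozen weights (QLoRA) shrinks the footprint further.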

Testing LLM Outputs

LLMs introduce non-determinism. Testing requires different strategies:

DeepEval

Evaluate RAG systems, check hallucinations, measure relevance

Promptfoo

CLI for testing prompts across models and configs

Ragas

Metrics for retrieval and generation quality

UpTrain

Monitor prompt performance over time
Example test:
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital.",
    context=["Paris is the capital and largest city of France."]
)

metric = HallucinationMetric(threshold=0.5)
assert_test(test_case, [metric])

Distributed Training

For large models, single-GPU training isn’t enough:
  • Data Parallel: Replicate model on each GPU, split batches
  • Model Parallel: Split model layers across GPUs
  • Pipeline Parallel: Like model parallel, but with pipelining
  • FSDP (Fully Sharded Data Parallel): Shard model parameters across GPUs
PyTorch Lightning and DeepSpeed abstract distributed training. Start with Lightning’s built-in DDP, then move to DeepSpeed ZeRO for 100B+ models.
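The data-parallel strategy reduces to gradient averaging: each replica computes gradients on its shard of the batch, and the all-reduce average equals the full-batch gradient. A framework-free sketch for a scalar least-squares model (equal-sized shards and synchronous updates assumed):

```python
def grad(w, xs, ys):
    """Gradient of mean squared error for the scalar model y_hat = w * x."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
w = 0.5

# "Two GPUs": each computes the gradient on its half of the batch...
g0 = grad(w, xs[:2], ys[:2])
g1 = grad(w, xs[2:], ys[2:])
# ...then an all-reduce averages them, matching the full-batch gradient
assert abs((g0 + g1) / 2 - grad(w, xs, ys)) < 1e-12
```

FSDP goes one step further: besides splitting the batch, it shards the parameters, gradients, and optimizer state themselves, so no single GPU ever holds the whole model.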

Hyperparameter Optimization

Instead of manual tuning, use automated search:
  • Ray Tune: Distributed HPO with early stopping
  • Optuna: Bayesian optimization
  • Weights & Biases Sweeps: Integrated with W&B
  • AutoGluon: AutoML with minimal code
# W&B sweep config
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_acc', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-3},
        'batch_size': {'values': [16, 32, 64]},
    }
}

sweep_id = wandb.sweep(sweep_config, project='my-project')
wandb.agent(sweep_id, function=train)

Model Cards

Document your models for transparency and reproducibility:
  • What: Architecture, dataset, training procedure
  • Why: Intended use case and limitations
  • How: Performance metrics, biases, ethical considerations
Hugging Face popularized model cards. See GPT-4 System Card for a production example.
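A minimal model card skeleton in the Hugging Face style (YAML front matter plus Markdown sections; every value below is a placeholder):

```markdown
---
license: apache-2.0
datasets:
  - my-org/my-dataset
metrics:
  - accuracy
---

# Model Card: my-model

## Model Details
Architecture, base checkpoint, and training procedure.

## Intended Use & Limitations
What the model is for, and where it is known to fail.

## Evaluation
Metrics, test sets, and known biases or ethical considerations.
```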

Hands-On Examples

Explore training workflows in Module 3:
  • BERT fine-tuning with Hydra + W&B
  • Phi-3 fine-tuning with LoRA
  • LLM evaluation with DeepEval
  • Project structure best practices

Next Steps

Pipeline Orchestration

Automate training at scale

Model Serving

Deploy trained models
