
Practice Exercises

Complete hands-on exercises to implement training workflows, experiment tracking, and model testing for your ML project.

Prerequisites

Before starting:
  • Have a training pipeline (from the test task or your own project)
  • Set up a GitHub repository
  • Create a W&B account
  • Review the classic and generative examples
You can use the HuggingFace text classification example as a starting point.

Homework 5: Training & Experiments

Learning Objectives

Experiment Tracking

Set up W&B logging and track training experiments

Hyperparameter Search

Run systematic hyperparameter optimization

Model Documentation

Create comprehensive model cards

Distributed Training

Scale training to multiple GPUs

Reading List

Tasks

1

Update Design Document

Add experiment management section to your Google Doc:
  • Chosen experiment tracking tool (W&B, Neptune, MLflow)
  • Model card template and structure
  • Hyperparameter search strategy
  • Model versioning approach
2

PR1: W&B Experiment Logging

Implement experiment tracking with Weights & Biases:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="results",
    report_to=["wandb"],
    run_name="bert-experiment-1",
    logging_steps=100,
    eval_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
Requirements:
  • Log training/validation metrics
  • Track hyperparameters
  • Save model checkpoints
  • Create W&B project link
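The Trainer above references a compute_metrics callable that is not shown. A minimal sketch for binary classification, using only NumPy (no sklearn dependency); the dict keys you return here are the metric names the Trainer logs (and W&B picks up) with an eval_ prefix:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute accuracy and binary F1 from a transformers EvalPrediction.

    eval_pred unpacks into (logits, labels); argmax over the last axis
    turns logits into class predictions.
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    accuracy = float((preds == labels).mean())

    # Binary F1 for the positive class (label 1)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "f1": f1}
```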
3

PR2: Hyperparameter Search

Implement hyperparameter search with W&B sweeps:
sweep.yaml
program: train.py
method: bayes
metric:
  name: eval_f1
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.0001
  per_device_train_batch_size:
    values: [16, 32, 64]
  num_train_epochs:
    values: [3, 5, 7]
Initialize the sweep, then start an agent:
wandb sweep sweep.yaml
wandb agent your-entity/project/sweep-id
Requirements:
  • Define search space
  • Run at least 10 experiments
  • Document best hyperparameters
  • Compare results in W&B
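During a sweep, the agent injects the sampled values into train.py through wandb.config after wandb.init(). One way to keep that wiring testable is a small pure-Python merge helper (apply_sweep_params is a hypothetical name, not a W&B API):

```python
def apply_sweep_params(defaults: dict, sweep_params: dict) -> dict:
    """Overlay sweep-chosen hyperparameters onto default training kwargs.

    In train.py, sweep_params would be dict(wandb.config); passing it in
    as a plain dict keeps the merge logic unit-testable without W&B.
    """
    unknown = set(sweep_params) - set(defaults)
    if unknown:
        # Fail fast on typos between sweep.yaml and the training code
        raise KeyError(f"sweep defines unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **sweep_params}
```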
4

PR3: Model Card

Create comprehensive model card:
# Automatic generation
trainer.create_model_card(
    finetuned_from=model_args.model_name_or_path,
    tasks="text-classification",
    language="en",
    dataset_tags="sst2",
)
Or use the Model Card Toolkit.
Requirements:
  • Model details (architecture, training data)
  • Intended use cases
  • Performance metrics
  • Limitations and biases
  • Training procedure
  • Evaluation results
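If you assemble the card by hand instead, a tiny generator can guarantee every required section is present (render_model_card and MODEL_CARD_SECTIONS are illustrative helpers, not part of any library):

```python
# Section headings mirroring the requirements list above
MODEL_CARD_SECTIONS = [
    "Model Details",
    "Intended Use",
    "Performance Metrics",
    "Limitations and Biases",
    "Training Procedure",
    "Evaluation Results",
]

def render_model_card(model_name: str, metadata: dict) -> str:
    """Render a minimal markdown model card covering every required section.

    Sections missing from metadata are emitted as _TODO_ so gaps are
    visible in review rather than silently dropped.
    """
    lines = [f"# {model_name}", ""]
    for section in MODEL_CARD_SECTIONS:
        lines.append(f"## {section}")
        lines.append(metadata.get(section, "_TODO_"))
        lines.append("")
    return "\n".join(lines)
```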
5

PR4 (Optional): MosaicBERT Tutorial

Replicate the MosaicBERT tutorial to train BERT from scratch efficiently.
Focus areas:
  • Efficient training techniques
  • Cost optimization
  • Performance benchmarking
6

PR5 (Optional): NNI Hyperparameter Search

Implement hyperparameter search using Microsoft NNI:
import nni

def main():
    # Get the next set of hyperparameters from the NNI tuner
    params = nni.get_next_parameter()
    
    # Train with the sampled params; train_model returns validation accuracy
    accuracy = train_model(
        learning_rate=params['learning_rate'],
        batch_size=params['batch_size'],
    )
    
    # Report the final metric back to NNI
    nni.report_final_result(accuracy)

if __name__ == '__main__':
    main()
Compare with W&B sweeps:
  • Search efficiency
  • Ease of use
  • Visualization capabilities
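Unlike W&B's YAML sweeps, NNI reads its search space from JSON using its documented `_type`/`_value` encoding. A sketch of the equivalent space (search_space.json is the conventional file name):

```python
import json

# NNI search space mirroring the W&B sweep above; NNI encodes each
# hyperparameter as {"_type": <sampler>, "_value": <sampler args>}.
search_space = {
    "learning_rate": {"_type": "loguniform", "_value": [1e-5, 1e-4]},
    "batch_size": {"_type": "choice", "_value": [16, 32, 64]},
}

with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=2)
```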
7

PR6 (Optional): Distributed Training

Implement multi-GPU training.
Option A - PyTorch DDP:
torchrun --nproc_per_node=4 train.py
Option B - Accelerate:
accelerate config
accelerate launch train.py
Option C - Ray:
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_function,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
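Whichever launcher you pick, each worker must see a distinct shard of the data. Trainer and DistributedSampler handle this for you; the underlying round-robin idea can be sketched with the RANK and WORLD_SIZE environment variables that torchrun exports (shard_indices is an illustrative helper, not a library function):

```python
import os

def shard_indices(num_examples: int) -> list:
    """Assign this process its round-robin slice of dataset indices.

    torchrun exports RANK and WORLD_SIZE for every worker; a single
    non-distributed run falls back to rank 0 of 1 (all indices).
    """
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return list(range(rank, num_examples, world_size))
```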

Deliverables

  • Google Doc updated with experiment section
  • PR1: W&B experiment logging
  • PR2: Hyperparameter search
  • PR3: Model card
  • Public W&B project link
  • All PRs merged to main branch

Success Criteria

  • ✅ All required PRs merged
  • ✅ W&B project shows multiple experiments
  • ✅ Model card includes all required sections
  • ✅ Hyperparameter search identifies best config
  • ✅ Design document describes experiment strategy

Homework 6: Testing & CI

Learning Objectives

Code Testing

Write unit tests for training code

Data Validation

Test data quality and schema

Model Testing

Validate model behavior and performance

CI/CD Integration

Automate testing in GitHub Actions

Reading List

Tasks

1

Update Testing Plan

Add testing strategy to your Google Doc:
  • Test coverage goals
  • Data validation approach
  • Model behavioral tests
  • CI/CD pipeline design
2

PR1: Code Tests

Write unit tests for training code:
test_code.py
import numpy as np
from transformers import EvalPrediction

from your_package.utils import compute_metrics

def test_compute_metrics():
    """Test metrics calculation."""
    eval_pred = EvalPrediction(
        predictions=np.array([[0.1, 0.9], [0.8, 0.2]]),
        label_ids=np.array([1, 0])
    )
    
    metrics = compute_metrics(eval_pred)
    
    assert "f1" in metrics
    assert 0 <= metrics["f1"] <= 1
Test coverage:
  • Utility functions
  • Data preprocessing
  • Metric computation
  • Configuration loading
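For configuration loading, a round-trip test catches missing keys before a run is launched. A sketch assuming a JSON config file and a hypothetical load_config helper (adapt the key list to your project):

```python
import json
import os
import tempfile

# Keys every training config must define (illustrative)
REQUIRED_KEYS = ("model_name", "learning_rate", "num_train_epochs")

def load_config(path: str) -> dict:
    """Load a JSON training config and fail fast on missing keys."""
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in cfg]
    if missing:
        raise KeyError(f"config missing keys: {missing}")
    return cfg

def test_load_config_roundtrip():
    """A config written to disk loads back with all required keys."""
    cfg = {"model_name": "bert-base-uncased", "learning_rate": 5e-5,
           "num_train_epochs": 3}
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(cfg, f)
        path = f.name
    try:
        assert load_config(path) == cfg
    finally:
        os.remove(path)
```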
3

PR2: Data Tests

Implement data validation tests:
test_data.py
from great_expectations.dataset import PandasDataset

def test_data_shape(df: PandasDataset):
    """Test dataset dimensions."""
    assert df.shape[0] > 1000
    assert df.shape[1] == 3

def test_data_schema(df: PandasDataset):
    """Test column schema."""
    assert df.expect_table_columns_to_match_ordered_list(
        column_list=["text", "label", "id"]
    )["success"]

def test_data_quality(df: PandasDataset):
    """Test data quality."""
    assert df.expect_column_values_to_not_be_null(
        column="text"
    )["success"]
    
    assert df.expect_column_values_to_be_in_set(
        column="label",
        value_set=[0, 1]
    )["success"]
Use Great Expectations or Deepchecks.
4

PR3: Model Tests

Write model behavioral tests:
test_model.py
def test_overfit_batch(trainer):
    """Test model can overfit small batch."""
    train_result = trainer.train()
    assert train_result.metrics["train_loss"] < 0.01

def test_invariance():
    """Test predictions are invariant to irrelevant changes."""
    text1 = "I am flying to NYC"
    text2 = "I am flying to Toronto"
    
    pred1 = model.predict(text1)
    pred2 = model.predict(text2)
    
    # Sentiment should be similar
    assert abs(pred1 - pred2) < 0.1

def test_directional():
    """Test predictions change in expected direction."""
    positive = "This movie is excellent!"
    negative = "This movie is terrible!"
    
    pred_pos = model.predict(positive)
    pred_neg = model.predict(negative)
    
    assert pred_pos > pred_neg
Reference: Made With ML Testing Guide
5

PR4: Model Registry

Implement model versioning with W&B:
import wandb
from pathlib import Path

def upload_to_registry(model_name: str, model_path: Path):
    """Upload model files to the W&B registry."""
    with wandb.init() as run:
        art = wandb.Artifact(model_name, type="model")
        art.add_file(model_path / "config.json")
        art.add_file(model_path / "model.safetensors")
        art.add_file(model_path / "tokenizer.json")
        art.add_file(model_path / "README.md")
        run.log_artifact(art)

def load_from_registry(model_name: str):
    """Download model from registry."""
    with wandb.init() as run:
        artifact = run.use_artifact(model_name, type="model")
        artifact_dir = artifact.download()
        return artifact_dir
6

PR5 (Optional): Model Interpretability

Use LIT (Learning Interpretability Tool) or similar:
from lit_nlp import notebook
from lit_nlp.examples.datasets import glue
from lit_nlp.examples.models import glue_models

# Load dataset and model
datasets = {'sst2_dev': glue.SST2Data('validation')}
models = {'sst2': glue_models.SST2Model('path/to/model')}

# Start LIT
widget = notebook.LitWidget(models, datasets, height=800)
widget.render()
For other domains (CV, audio, tabular), find equivalent tools.
7

PR6 (Optional): LLM API Testing

Test LLM API with DeepEval or Promptfoo:
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_llm_hallucination():
    """Test for hallucinations."""
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        context=["Paris is the capital of France."]
    )
    
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])

CI/CD Pipeline

Set up GitHub Actions workflow:
.github/workflows/test.yml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-cov
    
    - name: Run code tests
      run: pytest tests/test_code.py --cov=your_package --cov-append
    
    - name: Run data tests
      run: pytest tests/test_data.py --cov=your_package --cov-append
    
    - name: Run model tests
      run: pytest tests/test_model.py --cov=your_package --cov-append --cov-report=xml
    
    - name: Upload coverage
      uses: codecov/codecov-action@v4

Deliverables

  • Google Doc updated with testing plan
  • PR1: Code tests with CI integration
  • PR2: Data validation tests
  • PR3: Model behavioral tests
  • PR4: Model registry integration
  • All tests pass in CI

Success Criteria

  • ✅ All required PRs merged
  • ✅ Tests run automatically in CI
  • ✅ Code coverage > 80%
  • ✅ Data validation catches quality issues
  • ✅ Model tests verify expected behavior
  • ✅ Models versioned in registry

Resources

Classic Example

Reference implementation for BERT training

Generative Example

Reference implementation for Phi-3 training

Made With ML

Comprehensive testing guide

W&B Documentation

Complete Weights & Biases guide

Getting Help

If you get stuck:
  1. Review the reference implementations in module-3/
  2. Check the reading list for relevant articles
  3. Search W&B documentation
  4. Ask questions in course discussion forum
  5. Compare with HuggingFace examples
Remember: The goal is to build production-ready training workflows, not just achieve high accuracy. Focus on reproducibility, testing, and documentation.
