
Practice Exercises

Complete hands-on exercises to implement training workflows, experiment tracking, and model testing for your ML project.

Prerequisites

Before starting:
  • Have a training pipeline (from the test task or your own project)
  • Set up a GitHub repository
  • Create a W&B account
  • Review the classic and generative examples
You can use the HuggingFace text classification example as a starting point.

Homework 5: Training & Experiments

Learning Objectives

Experiment Tracking

Set up W&B logging and track training experiments

Hyperparameter Search

Run systematic hyperparameter optimization

Model Documentation

Create comprehensive model cards

Distributed Training

Scale training to multiple GPUs

Reading List

Tasks

1

Update Design Document

Add experiment management section to your Google Doc:
  • Chosen experiment tracking tool (W&B, Neptune, MLflow)
  • Model card template and structure
  • Hyperparameter search strategy
  • Model versioning approach
2

PR1: W&B Experiment Logging

Implement experiment tracking with Weights & Biases:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="results",
    report_to=["wandb"],
    run_name="bert-experiment-1",
    logging_steps=100,
    eval_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
Requirements:
  • Log training/validation metrics
  • Track hyperparameters
  • Save model checkpoints
  • Create W&B project link
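The Trainer above references a compute_metrics callable that is not shown. A minimal sketch for binary classification, using only NumPy (no sklearn dependency); the dict keys you return here are the metric names the Trainer logs (and W&B picks up) with an eval_ prefix:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute accuracy and binary F1 from a transformers EvalPrediction.

    eval_pred unpacks into (logits, labels); argmax over the last axis
    turns logits into class predictions.
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    accuracy = float((preds == labels).mean())

    # Binary F1 for the positive class (label 1)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "f1": f1}
```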
3

PR2: Hyperparameter Search

Implement hyperparameter search with W&B sweeps:
sweep.yaml
program: train.py
method: bayes
metric:
  name: eval_f1
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.0001
  per_device_train_batch_size:
    values: [16, 32, 64]
  num_train_epochs:
    values: [3, 5, 7]
Initialize the sweep, then start an agent:
wandb sweep sweep.yaml
wandb agent your-entity/project/sweep-id
Requirements:
  • Define search space
  • Run at least 10 experiments
  • Document best hyperparameters
  • Compare results in W&B
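During a sweep, the agent injects the sampled values into train.py through wandb.config after wandb.init(). One way to keep that wiring testable is a small pure-Python merge helper (apply_sweep_params is a hypothetical name, not a W&B API):

```python
def apply_sweep_params(defaults: dict, sweep_params: dict) -> dict:
    """Overlay sweep-chosen hyperparameters onto default training kwargs.

    In train.py, sweep_params would be dict(wandb.config); passing it in
    as a plain dict keeps the merge logic unit-testable without W&B.
    """
    unknown = set(sweep_params) - set(defaults)
    if unknown:
        # Fail fast on typos between sweep.yaml and the training code
        raise KeyError(f"sweep defines unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **sweep_params}
```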
4

PR3: Model Card

Create comprehensive model card:
# Automatic generation
trainer.create_model_card(
    finetuned_from=model_args.model_name_or_path,
    tasks="text-classification",
    language="en",
    dataset_tags="sst2",
)
Or use the Model Card Toolkit.
Requirements:
  • Model details (architecture, training data)
  • Intended use cases
  • Performance metrics
  • Limitations and biases
  • Training procedure
  • Evaluation results
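If you assemble the card by hand instead, a tiny generator can guarantee every required section is present (render_model_card and MODEL_CARD_SECTIONS are illustrative helpers, not part of any library):

```python
# Section headings mirroring the requirements list above
MODEL_CARD_SECTIONS = [
    "Model Details",
    "Intended Use",
    "Performance Metrics",
    "Limitations and Biases",
    "Training Procedure",
    "Evaluation Results",
]

def render_model_card(model_name: str, metadata: dict) -> str:
    """Render a minimal markdown model card covering every required section.

    Sections missing from metadata are emitted as _TODO_ so gaps are
    visible in review rather than silently dropped.
    """
    lines = [f"# {model_name}", ""]
    for section in MODEL_CARD_SECTIONS:
        lines.append(f"## {section}")
        lines.append(metadata.get(section, "_TODO_"))
        lines.append("")
    return "\n".join(lines)
```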
5

PR4 (Optional): MosaicBERT Tutorial

Replicate the MosaicBERT tutorial to train BERT from scratch efficiently.
Focus areas:
  • Efficient training techniques
  • Cost optimization
  • Performance benchmarking
6

PR5 (Optional): NNI Hyperparameter Search

Implement hyperparameter search using Microsoft NNI:
import nni

def main():
    # Get the next set of hyperparameters from the NNI tuner
    params = nni.get_next_parameter()
    
    # Train with the sampled params; train_model returns validation accuracy
    accuracy = train_model(
        learning_rate=params['learning_rate'],
        batch_size=params['batch_size'],
    )
    
    # Report the final metric back to NNI
    nni.report_final_result(accuracy)

if __name__ == '__main__':
    main()
Compare with W&B sweeps:
  • Search efficiency
  • Ease of use
  • Visualization capabilities
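Unlike W&B's YAML sweeps, NNI reads its search space from JSON using its documented `_type`/`_value` encoding. A sketch of the equivalent space (search_space.json is the conventional file name):

```python
import json

# NNI search space mirroring the W&B sweep above; NNI encodes each
# hyperparameter as {"_type": <sampler>, "_value": <sampler args>}.
search_space = {
    "learning_rate": {"_type": "loguniform", "_value": [1e-5, 1e-4]},
    "batch_size": {"_type": "choice", "_value": [16, 32, 64]},
}

with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=2)
```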
7

PR6 (Optional): Distributed Training

Implement multi-GPU training.
Option A - PyTorch DDP:
torchrun --nproc_per_node=4 train.py
Option B - Accelerate:
accelerate config
accelerate launch train.py
Option C - Ray:
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_function,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
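Whichever launcher you pick, each worker must see a distinct shard of the data. Trainer and DistributedSampler handle this for you; the underlying round-robin idea can be sketched with the RANK and WORLD_SIZE environment variables that torchrun exports (shard_indices is an illustrative helper, not a library function):

```python
import os

def shard_indices(num_examples: int) -> list:
    """Assign this process its round-robin slice of dataset indices.

    torchrun exports RANK and WORLD_SIZE for every worker; a single
    non-distributed run falls back to rank 0 of 1 (all indices).
    """
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return list(range(rank, num_examples, world_size))
```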

Deliverables

  • Google Doc updated with experiment section
  • PR1: W&B experiment logging
  • PR2: Hyperparameter search
  • PR3: Model card
  • Public W&B project link
  • All PRs merged to main branch

Success Criteria

  • ✅ All required PRs merged
  • ✅ W&B project shows multiple experiments
  • ✅ Model card includes all required sections
  • ✅ Hyperparameter search identifies best config
  • ✅ Design document describes experiment strategy

Homework 6: Testing & CI

Learning Objectives

Code Testing

Write unit tests for training code

Data Validation

Test data quality and schema

Model Testing

Validate model behavior and performance

CI/CD Integration

Automate testing in GitHub Actions

Reading List

Tasks

1

Update Testing Plan

Add testing strategy to your Google Doc:
  • Test coverage goals
  • Data validation approach
  • Model behavioral tests
  • CI/CD pipeline design
2

PR1: Code Tests

Write unit tests for training code:
test_code.py
import numpy as np
from transformers import EvalPrediction

from your_package.utils import compute_metrics

def test_compute_metrics():
    """Test metrics calculation."""
    eval_pred = EvalPrediction(
        predictions=np.array([[0.1, 0.9], [0.8, 0.2]]),
        label_ids=np.array([1, 0])
    )
    
    metrics = compute_metrics(eval_pred)
    
    assert "f1" in metrics
    assert 0 <= metrics["f1"] <= 1
Test coverage:
  • Utility functions
  • Data preprocessing
  • Metric computation
  • Configuration loading
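For configuration loading, a round-trip test catches missing keys before a run is launched. A sketch assuming a JSON config file and a hypothetical load_config helper (adapt the key list to your project):

```python
import json
import os
import tempfile

# Keys every training config must define (illustrative)
REQUIRED_KEYS = ("model_name", "learning_rate", "num_train_epochs")

def load_config(path: str) -> dict:
    """Load a JSON training config and fail fast on missing keys."""
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in cfg]
    if missing:
        raise KeyError(f"config missing keys: {missing}")
    return cfg

def test_load_config_roundtrip():
    """A config written to disk loads back with all required keys."""
    cfg = {"model_name": "bert-base-uncased", "learning_rate": 5e-5,
           "num_train_epochs": 3}
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(cfg, f)
        path = f.name
    try:
        assert load_config(path) == cfg
    finally:
        os.remove(path)
```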
3

PR2: Data Tests

Implement data validation tests:
test_data.py
from great_expectations.dataset import PandasDataset

def test_data_shape(df: PandasDataset):
    """Test dataset dimensions."""
    assert df.shape[0] > 1000
    assert df.shape[1] == 3

def test_data_schema(df: PandasDataset):
    """Test column schema."""
    assert df.expect_table_columns_to_match_ordered_list(
        column_list=["text", "label", "id"]
    )["success"]

def test_data_quality(df: PandasDataset):
    """Test data quality."""
    assert df.expect_column_values_to_not_be_null(
        column="text"
    )["success"]
    
    assert df.expect_column_values_to_be_in_set(
        column="label",
        value_set=[0, 1]
    )["success"]
Use Great Expectations or Deepchecks.
4

PR3: Model Tests

Write model behavioral tests:
test_model.py
def test_overfit_batch(trainer):
    """Test model can overfit small batch."""
    train_result = trainer.train()
    assert train_result.metrics["train_loss"] < 0.01

def test_invariance():
    """Test predictions are invariant to irrelevant changes."""
    text1 = "I am flying to NYC"
    text2 = "I am flying to Toronto"
    
    pred1 = model.predict(text1)
    pred2 = model.predict(text2)
    
    # Sentiment should be similar
    assert abs(pred1 - pred2) < 0.1

def test_directional():
    """Test predictions change in expected direction."""
    positive = "This movie is excellent!"
    negative = "This movie is terrible!"
    
    pred_pos = model.predict(positive)
    pred_neg = model.predict(negative)
    
    assert pred_pos > pred_neg
Reference: Made With ML Testing Guide
5

PR4: Model Registry

Implement model versioning with W&B:
import wandb
from pathlib import Path

def upload_to_registry(model_name: str, model_path: Path):
    """Upload model files to the W&B registry."""
    with wandb.init() as run:
        art = wandb.Artifact(model_name, type="model")
        art.add_file(model_path / "config.json")
        art.add_file(model_path / "model.safetensors")
        art.add_file(model_path / "tokenizer.json")
        art.add_file(model_path / "README.md")
        run.log_artifact(art)

def load_from_registry(model_name: str):
    """Download model from registry."""
    with wandb.init() as run:
        artifact = run.use_artifact(model_name, type="model")
        artifact_dir = artifact.download()
        return artifact_dir
6

PR5 (Optional): Model Interpretability

Use LIT (Learning Interpretability Tool) or similar:
from lit_nlp import notebook
from lit_nlp.examples.datasets import glue
from lit_nlp.examples.models import glue_models

# Load dataset and model
datasets = {'sst2_dev': glue.SST2Data('validation')}
models = {'sst2': glue_models.SST2Model('path/to/model')}

# Start LIT
widget = notebook.LitWidget(models, datasets, height=800)
widget.render()
For other domains (CV, audio, tabular), find equivalent tools.
7

PR6 (Optional): LLM API Testing

Test LLM API with DeepEval or Promptfoo:
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_llm_hallucination():
    """Test for hallucinations."""
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        context=["Paris is the capital of France."]
    )
    
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])

CI/CD Pipeline

Set up GitHub Actions workflow:
.github/workflows/test.yml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-cov
    
    - name: Run code tests
      run: pytest tests/test_code.py --cov=your_package --cov-append
    
    - name: Run data tests
      run: pytest tests/test_data.py --cov=your_package --cov-append
    
    - name: Run model tests
      run: pytest tests/test_model.py --cov=your_package --cov-append --cov-report=xml
    
    - name: Upload coverage
      uses: codecov/codecov-action@v4

Deliverables

  • Google Doc updated with testing plan
  • PR1: Code tests with CI integration
  • PR2: Data validation tests
  • PR3: Model behavioral tests
  • PR4: Model registry integration
  • All tests pass in CI

Success Criteria

  • ✅ All required PRs merged
  • ✅ Tests run automatically in CI
  • ✅ Code coverage > 80%
  • ✅ Data validation catches quality issues
  • ✅ Model tests verify expected behavior
  • ✅ Models versioned in registry

Resources

Classic Example

Reference implementation for BERT training

Generative Example

Reference implementation for Phi-3 training

Made With ML

Comprehensive testing guide

W&B Documentation

Complete Weights & Biases guide

Getting Help

If you get stuck:
  1. Review the reference implementations in module-3/
  2. Check the reading list for relevant articles
  3. Search W&B documentation
  4. Ask questions in course discussion forum
  5. Compare with HuggingFace examples
Remember: The goal is to build production-ready training workflows, not just achieve high accuracy. Focus on reproducibility, testing, and documentation.
