Train Models with Azure Machine Learning
Azure Machine Learning provides multiple ways to train machine learning models at scale, from interactive development to distributed training on powerful compute clusters.
Azure ML supports training with popular frameworks including PyTorch, TensorFlow, Scikit-learn, XGBoost, and more.
Training Methods
Python SDK: Programmatic job submission with full control
Azure CLI: Command-line training for automation and CI/CD
Studio UI: Visual interface for no-code training
Prerequisites
Before training models, ensure you have:
Development Tools
Python SDK v2: pip install azure-ai-ml
Azure CLI with ML extension: az extension add -n ml
Training Data
Data stored in Azure Storage or registered as data assets
Training Workflow
The typical training workflow in Azure Machine Learning:
Connect to Workspace
Authenticate and connect to your ML workspace
Prepare Data
Load and register training data as assets
Create Environment
Define software dependencies for training
Configure Compute
Select compute target for training job
Define Training Job
Specify training script, parameters, and resources
Submit Job
Execute training and monitor progress
Register Model
Save trained model to model registry
Connect to Workspace
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
# Connect to workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace_name = "<WORKSPACE_NAME>"
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id,
    resource_group,
    workspace_name
)
print(f"Connected to workspace: {ml_client.workspace_name}")
# Log in to Azure
az login
# Set active subscription
az account set --subscription <SUBSCRIPTION_ID>
# Set default workspace
az configure --defaults \
    group=<RESOURCE_GROUP> \
    workspace=<WORKSPACE_NAME>
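With the CLI configured, a command job can be described in YAML and submitted with az ml job create. The following sketch is illustrative (the file name, paths, and compute name are assumptions, not values from your workspace):

```yaml
# job.yml (illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py --data ${{inputs.data}}
inputs:
  data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/iris.csv
environment: azureml://registries/azureml/environments/sklearn-1.5/versions/1
compute: azureml:cpu-cluster
experiment_name: iris-classification
```

Submit it with az ml job create --file job.yml.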
Example: Train Scikit-learn Model
Complete example training an iris classification model:
1. Training Script
Create train.py:
import argparse
import pandas as pd
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data", type=str, help="Path to training data")
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--max_depth", type=int, default=5)
args = parser.parse_args()
# Enable autologging
mlflow.sklearn.autolog()
# Load data
df = pd.read_csv(args.data)
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(
    n_estimators=args.n_estimators,
    max_depth=args.max_depth,
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")

# Log additional metrics
mlflow.log_metric("accuracy", accuracy)
mlflow.log_metric("f1_score", f1)
2. Submit Training Job
from azure.ai.ml import command, Input
job = command(
    code="./src",
    command="python train.py --data ${{inputs.data}} --n_estimators ${{inputs.n_estimators}}",
    inputs={
        "data": Input(
            type="uri_file",
            path="azureml://datastores/workspaceblobstore/paths/iris.csv"
        ),
        "n_estimators": 100,
    },
    environment="azureml://registries/azureml/environments/sklearn-1.5/versions/1",
    compute="cpu-cluster",
    display_name="iris-training",
    experiment_name="iris-classification",
    description="Train Random Forest on Iris dataset"
)
# Submit job
returned_job = ml_client.jobs.create_or_update(job)
print(f"Job submitted: {returned_job.name}")
print(f"Studio URL: {returned_job.studio_url}")
# Wait for completion (optional)
ml_client.jobs.stream(returned_job.name)
Training on Different Compute
Compute Cluster
Serverless Compute
Compute Instance
Use managed compute clusters for scalable training:
job = command(
    code="./src",
    command="python train.py",
    environment="azureml://environments/my-env/versions/1",
    compute="cpu-cluster",  # Existing cluster
    instance_count=1
)
When to use:
Large datasets requiring multiple nodes
Long-running training jobs
Distributed training
Hyperparameter tuning
Use on-demand compute without cluster management:
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    code="./src",
    command="python train.py",
    environment="azureml://environments/my-env/versions/1",
    resources=ResourceConfiguration(
        instance_type="Standard_DS3_v2",
        instance_count=1
    ),
    # No compute parameter = serverless
)
When to use:
Quick experiments
No quota management needed
Variable workloads
Pay-per-use scenarios
Use dedicated development compute:
job = command(
    code="./src",
    command="python train.py",
    environment="azureml://environments/my-env/versions/1",
    compute="my-compute-instance"
)
When to use:
Interactive development
Small-scale training
Debugging
Testing before scaling
Using Curated Environments
Azure ML provides pre-built environments for common frameworks:
# PyTorch GPU environment
environment = "azureml://registries/azureml/environments/pytorch-2.0-cuda11.7/versions/1"
# TensorFlow GPU environment
environment = "azureml://registries/azureml/environments/tensorflow-2.13-cuda11/versions/1"
# Scikit-learn CPU environment
environment = "azureml://registries/azureml/environments/sklearn-1.5/versions/1"
# Custom environment
from azure.ai.ml.entities import Environment
custom_env = Environment(
    name="custom-training-env",
    description="Custom environment with specific packages",
    conda_file="environment.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest"
)
ml_client.environments.create_or_update(custom_env)
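The conda_file referenced above might look like the following; the package list and versions are illustrative, not a recommended pinning:

```yaml
# environment.yml (illustrative)
name: custom-training-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - scikit-learn==1.5.0
      - pandas==2.2.0
      - mlflow==2.14.0
```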
Hyperparameter Tuning
Optimize model hyperparameters with sweep jobs:
from azure.ai.ml.sweep import Choice, Uniform, MedianStoppingPolicy
# Define search space
job_for_sweep = job(
    n_estimators=Choice([50, 100, 200]),
    max_depth=Choice([3, 5, 7, 10]),
    learning_rate=Uniform(0.001, 0.1)
)

# Configure sweep
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="random",
    primary_metric="accuracy",
    goal="maximize",
    max_total_trials=20,
    max_concurrent_trials=4,
    early_termination_policy=MedianStoppingPolicy(
        delay_evaluation=5,
        evaluation_interval=2
    )
)
returned_sweep = ml_client.jobs.create_or_update(sweep_job)
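To make the early-termination behavior concrete, here is a rough pure-Python sketch of the median stopping rule (not Azure ML's implementation, and should_stop is a hypothetical helper): a trial is cancelled when its best metric so far is worse than the median of the running averages reported by all trials over the same number of intervals.

```python
import statistics

def should_stop(trial_metrics, all_trial_metrics, goal="maximize"):
    """Return True if the trial should be cancelled under median stopping.

    trial_metrics: metric values reported so far by the candidate trial.
    all_trial_metrics: metric histories of every trial (including this one).
    """
    n = len(trial_metrics)
    best = max(trial_metrics) if goal == "maximize" else min(trial_metrics)
    # Running average of each trial over its first n reported intervals
    running_avgs = [statistics.mean(h[:n]) for h in all_trial_metrics if len(h) >= n]
    if not running_avgs:
        return False
    median = statistics.median(running_avgs)
    # Cancel when the trial's best result is worse than the median
    return best < median if goal == "maximize" else best > median

# A weak trial is stopped once its best accuracy trails the median
histories = [[0.60, 0.70, 0.75], [0.50, 0.55, 0.58], [0.30, 0.31, 0.32]]
print(should_stop([0.30, 0.31, 0.32], histories))  # True
print(should_stop([0.60, 0.70, 0.75], histories))  # False
```

The delay_evaluation and evaluation_interval settings in the sweep above control when and how often such a check is applied.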
Tracking Experiments
Organize training runs into experiments:
# Submit multiple runs to same experiment
for learning_rate in [0.001, 0.01, 0.1]:
    job = command(
        code="./src",
        command=f"python train.py --lr {learning_rate}",
        environment="azureml://environments/pytorch/versions/1",
        compute="gpu-cluster",
        experiment_name="learning-rate-comparison",  # Group related runs
        display_name=f"lr-{learning_rate}"
    )
    ml_client.jobs.create_or_update(job)
# Query runs in the experiment
runs = [
    j for j in ml_client.jobs.list()
    if j.experiment_name == "learning-rate-comparison"
]
for run in runs:
    print(f"{run.display_name}: {run.status}")
Logging Metrics and Artifacts
Track training progress with MLflow:
import mlflow
# Start MLflow run
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("epochs", 10)
    mlflow.log_param("batch_size", 32)

    # Training loop
    for epoch in range(10):
        train_loss = train_epoch(model, data_loader)
        val_loss = validate(model, val_loader)

        # Log metrics
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    # Log artifacts
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("feature_importance.csv")

    # Log model
    mlflow.sklearn.log_model(model, "model")
Distributed Training
For large models and datasets, see:
Distributed Training Guide: Learn about PyTorch DDP, DeepSpeed, and TensorFlow distributed strategies
Best Practices
Register datasets for versioning and reproducibility:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

data_asset = Data(
    name="training-data",
    version="1.0",
    path="azureml://datastores/data/paths/train/",
    type=AssetTypes.URI_FOLDER
)
ml_client.data.create_or_update(data_asset)
Pin dependencies for reproducible training:
# environment.yml
dependencies:
  - python=3.10
  - pytorch=2.0.0
  - torchvision=0.15.0
  - cudatoolkit=11.7
Avoid storing credentials in code:
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient(credential, subscription_id, rg, workspace)
Track spending with tags:
job = command(
    ...
    tags={"project": "fraud-detection", "cost-center": "ml-team"}
)
Training Examples by Framework
PyTorch: Deep learning with PyTorch on GPU clusters
TensorFlow: Neural networks with TensorFlow distributed
Scikit-learn: Traditional ML algorithms at scale
XGBoost: Gradient boosting for structured data
Next Steps
Distributed Training: Scale training across multiple GPUs
Deploy Models: Deploy trained models to endpoints
MLOps: Automate training pipelines
AutoML: Automated machine learning