
Deploy Models with Azure Machine Learning

After training a machine learning model, deploy it to production for inference using Azure Machine Learning endpoints, either for real-time predictions or for batch processing at scale.
Azure ML provides managed endpoints with automatic scaling, monitoring, and security, so no infrastructure management is required.

Inference and Endpoints

Inference is the process of applying new input data to a machine learning model to generate outputs (predictions, classifications, clusters, etc.). An endpoint is a stable, durable URL that can be used to request predictions from your model.

Online Endpoints

Real-time inference with low latency

Batch Endpoints

Asynchronous processing of large datasets

Endpoint Anatomy

Endpoint

Provides:
  • Stable URL: e.g., https://my-endpoint.eastus.inference.ml.azure.com
  • Authentication: Key-based or Microsoft Entra ID
  • Authorization: Role-based access control

Deployment

Contains:
  • Model: Trained model files
  • Code: Scoring script (optional for MLflow models)
  • Environment: Software dependencies
  • Compute: Resources to run inference
One endpoint can contain multiple deployments, enabling A/B testing and safe rollouts.
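The scoring script listed above follows Azure ML's `init()`/`run()` convention: `init()` runs once when the container starts and loads the model; `run()` handles each request. A minimal sketch, with a stub predictor standing in for a real model load (the `joblib` path shown in the comment is an illustrative assumption):

```python
import json
import os

model = None  # populated once per container by init()

def init():
    """Called once at container startup: load the model into memory."""
    global model
    # AZUREML_MODEL_DIR points at the registered model files
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    # In a real script: model = joblib.load(os.path.join(model_dir, "model.pkl"))
    model = lambda rows: [0 for _ in rows]  # stub predictor for illustration

def run(raw_data: str) -> str:
    """Called per request: parse the JSON payload, predict, return JSON."""
    rows = json.loads(raw_data)["data"]
    predictions = model(rows)
    return json.dumps({"predictions": predictions})
```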

Deployment Types

Best for: Real-time, low-latency inference

Features:
  • Fully managed compute and scaling
  • Built-in monitoring and logging
  • Traffic splitting for A/B testing
  • Zero-downtime updates
  • Cost tracking per deployment
Use when:
  • Response time is critical (<1 second)
  • Request-response pattern
  • Small payloads (fits in HTTP request)
  • Need to scale based on traffic
from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="my-endpoint",
    description="Production inference endpoint",
    auth_mode="key"
)

Quick Start: Deploy a Model

1. Register Your Model

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace>"
)

# Register model
model = Model(
    path="./model",
    name="sklearn-classifier",
    description="Iris classification model",
    type="mlflow_model"  # or "custom_model"
)

registered_model = ml_client.models.create_or_update(model)
print(f"Registered: {registered_model.name} v{registered_model.version}")

2. Create Endpoint

from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="iris-classifier-endpoint",
    description="Iris species classification",
    auth_mode="key",
    tags={"environment": "production", "team": "ml-ops"}
)

ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(f"Endpoint created: {endpoint.name}")

3. Deploy Model

from azure.ai.ml.entities import ManagedOnlineDeployment, Model

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    model=registered_model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    request_settings={
        "request_timeout_ms": 90000,
        "max_concurrent_requests_per_instance": 1
    },
    liveness_probe={
        "initial_delay": 10,
        "period": 10,
        "timeout": 2,
        "failure_threshold": 3
    }
)

ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route 100% traffic to deployment
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

4. Test the Deployment

# Test with sample data
import json

sample_data = {
    "data": [
        [5.1, 3.5, 1.4, 0.2],
        [6.2, 2.9, 4.3, 1.3]
    ]
}

# Write the payload to the file that invoke() reads
with open("sample_request.json", "w") as f:
    json.dump(sample_data, f)

# Invoke endpoint
response = ml_client.online_endpoints.invoke(
    endpoint_name="iris-classifier-endpoint",
    request_file="sample_request.json",
    deployment_name="blue"  # Optional: test a specific deployment
)

print(f"Predictions: {response}")

Deployment Patterns

Blue-Green Deployment

Switch traffic between two deployments instantly:
# Deploy new version to "green"
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="iris-classifier-endpoint",
    model=new_model_version,
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(green_deployment).result()

# Test green deployment
response = ml_client.online_endpoints.invoke(
    endpoint_name="iris-classifier-endpoint",
    request_file="test.json",
    deployment_name="green"
)

# Switch all traffic to green
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Delete old blue deployment
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="iris-classifier-endpoint"
).result()
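The value of blue-green is that the cutover is gated on a successful test of the new deployment. That gate can be made explicit; a hedged sketch with the test call and the traffic update injected as callables (names are illustrative, not SDK API):

```python
def safe_cutover(invoke_test, apply_traffic, old="blue", new="green"):
    """Send a test request to the new deployment; only shift traffic if it succeeds.

    invoke_test(deployment_name) -> bool: runs a smoke test against one deployment.
    apply_traffic(traffic_dict): pushes the traffic split to the endpoint.
    Returns the traffic split that ends up in effect.
    """
    if not invoke_test(new):
        return {old: 100}  # keep serving from the old deployment
    traffic = {new: 100}
    apply_traffic(traffic)
    return traffic
```

With the SDK, `invoke_test` would wrap `ml_client.online_endpoints.invoke(..., deployment_name="green")` and `apply_traffic` would set `endpoint.traffic` and call `begin_create_or_update`.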

Canary Deployment

Gradually shift traffic to test new version:
# Start with 10% traffic to new version
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Monitor metrics, then increase
endpoint.traffic = {"blue": 50, "green": 50}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Complete rollout
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
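The gradual shift above can be generated programmatically. A small helper (the step percentages are illustrative) that yields the traffic dict for each canary stage:

```python
def canary_steps(old, new, percentages=(10, 50, 100)):
    """Yield traffic splits that shift load from `old` to `new` in stages."""
    for pct in percentages:
        traffic = {new: pct}
        if pct < 100:
            traffic[old] = 100 - pct
        yield traffic

# Each step would be applied between monitoring checks via:
#   endpoint.traffic = step
#   ml_client.online_endpoints.begin_create_or_update(endpoint).result()
for step in canary_steps("blue", "green"):
    print(step)
```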

A/B Testing

Compare two model versions in production:
# Split traffic evenly
endpoint.traffic = {
    "model-v1": 50,
    "model-v2": 50
}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Monitor business metrics to choose winner
# Then route 100% to better performing model
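Before declaring a winner, make sure each arm has seen enough traffic to compare fairly. A minimal sketch of such a decision helper (the success-rate metric and the 1000-request threshold are illustrative assumptions, not a substitute for a proper significance test):

```python
def pick_winner(metrics, min_requests=1000):
    """metrics: {deployment_name: (successes, requests)}.

    Returns the deployment with the best success rate, or None if any
    arm lacks enough data, in which case keep the experiment running.
    """
    if any(requests < min_requests for _, requests in metrics.values()):
        return None
    return max(metrics, key=lambda name: metrics[name][0] / metrics[name][1])

print(pick_winner({"model-v1": (900, 1200), "model-v2": (980, 1150)}))  # -> model-v2
```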

Scaling and Performance

Autoscaling

Configure automatic scaling based on metrics:
from azure.ai.ml.entities import TargetUtilizationScaleSettings

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    scale_settings=TargetUtilizationScaleSettings(
        min_instances=1,
        max_instances=10,
        target_utilization_percentage=70,
        polling_interval=10
    )
)

Resource Limits

Control compute resources:
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=2,
    request_settings={
        "request_timeout_ms": 90000,
        "max_concurrent_requests_per_instance": 5,
        "max_queue_wait_ms": 60000
    }
)
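The request settings above bound per-instance concurrency, which gives a rough ceiling on sustainable throughput. A back-of-envelope helper (the formula is a steady-state approximation; the numbers are illustrative):

```python
def max_throughput_rps(instance_count, max_concurrent_per_instance, avg_latency_s):
    """Approximate steady-state requests/second a deployment can absorb:
    each in-flight request slot completes 1/avg_latency_s requests per second."""
    return instance_count * max_concurrent_per_instance / avg_latency_s

# 2 instances x 5 concurrent requests, ~500 ms per prediction
print(max_throughput_rps(2, 5, 0.5))  # -> 20.0
```

If measured traffic approaches this ceiling, either raise `max_concurrent_requests_per_instance` (if the model is I/O-bound) or add instances.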

Monitoring Deployments

View Metrics in Azure Portal

Key metrics to monitor:
  • Request latency (P50, P95, P99)
  • Requests per second
  • HTTP status codes
  • CPU/GPU utilization
  • Memory usage
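The latency percentiles above can also be reproduced locally from raw samples, e.g. when analyzing exported request logs. A sketch using the nearest-rank method (sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [42, 45, 47, 51, 60, 75, 90, 120, 300, 450]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Note how a few slow outliers dominate P95/P99 while leaving P50 untouched, which is why tail percentiles matter more than averages for user-facing latency.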

Query Logs

# Get deployment logs
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    lines=100
)
print(logs)

Application Insights Integration

from applicationinsights import TelemetryClient

tc = TelemetryClient('<instrumentation-key>')

# Log custom events
tc.track_event('PredictionMade', {
    'model_version': '1.2.0',
    'latency_ms': 45
})
tc.flush()

Security

Authentication

# Get endpoint keys
keys = ml_client.online_endpoints.get_keys(
    name="iris-classifier-endpoint"
)

# Fetch the endpoint to read its scoring URI
endpoint = ml_client.online_endpoints.get(name="iris-classifier-endpoint")

# Make authenticated request
import requests

headers = {
    "Authorization": f"Bearer {keys.primary_key}",
    "Content-Type": "application/json"
}

response = requests.post(
    endpoint.scoring_uri,
    headers=headers,
    json=sample_data
)
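Production clients should also retry transient failures (HTTP 429 and 5xx) with exponential backoff. A sketch of such a policy, with the HTTP call injected as a callable so the retry logic stands alone (the status set and delays are illustrative defaults):

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def post_with_retry(send, max_attempts=4, base_delay_s=0.5):
    """send() performs the request and returns an object with .status_code.
    Retries retryable statuses with exponential backoff; returns the last response."""
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in RETRYABLE:
            return response
        if attempt < max_attempts - 1:
            time.sleep(base_delay_s * 2 ** attempt)
    return response
```

Used with the request above, `send` would be `lambda: requests.post(endpoint.scoring_uri, headers=headers, json=sample_data)`.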

Network Security

Deploy with private networking:
endpoint = ManagedOnlineEndpoint(
    name="secure-endpoint",
    public_network_access="disabled",
    identity={
        "type": "SystemAssigned"
    }
)

ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Cost Optimization

Start with smaller instances and scale up:
| Instance Type   | vCPUs | RAM   | Cost (Relative) |
| --------------- | ----- | ----- | --------------- |
| Standard_DS2_v2 | 2     | 7 GB  | 1x              |
| Standard_DS3_v2 | 4     | 14 GB | 2x              |
| Standard_DS4_v2 | 8     | 28 GB | 4x              |
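Given those relative costs, a simple sizing rule is to pick the cheapest instance whose RAM covers the model's working set. An illustrative sketch using the table's numbers:

```python
# (vCPUs, RAM in GB, relative cost) per instance type, from the table above
INSTANCE_TYPES = {
    "Standard_DS2_v2": (2, 7, 1),
    "Standard_DS3_v2": (4, 14, 2),
    "Standard_DS4_v2": (8, 28, 4),
}

def cheapest_fit(required_ram_gb):
    """Return the lowest-cost instance type with enough RAM, or None if none fits."""
    candidates = [(cost, name) for name, (_, ram, cost) in INSTANCE_TYPES.items()
                  if ram >= required_ram_gb]
    return min(candidates)[1] if candidates else None

print(cheapest_fit(10))  # -> Standard_DS3_v2
```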
Scale to zero during low-traffic periods:
scale_settings=TargetUtilizationScaleSettings(
    min_instances=0,  # Scale in to zero when idle (where the compute target supports it)
    max_instances=10
)
Use batch endpoints for large datasets - only pay during job execution:
from azure.ai.ml.entities import BatchDeployment

batch_deployment = BatchDeployment(
    name="default",
    endpoint_name="batch-endpoint",
    model=model,
    compute="batch-cluster",
    instance_count=5  # Parallel processing
)
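With `instance_count=5`, the batch job fans input files out across instances and scores them in parallel. The effect is roughly round-robin sharding; a sketch of the idea (this mimics the distribution, it is not the service's exact scheduling algorithm):

```python
def partition(files, instance_count):
    """Round-robin files across instances, mimicking parallel batch scoring."""
    shards = [[] for _ in range(instance_count)]
    for i, f in enumerate(files):
        shards[i % instance_count].append(f)
    return shards

shards = partition([f"part-{i:03}.csv" for i in range(12)], 5)
print([len(s) for s in shards])  # -> [3, 3, 2, 2, 2]
```

The takeaway for cost: more instances shorten the job roughly linearly (until shards get small), and you pay only while the job runs.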
Track spending in Azure Cost Management:
  • Filter by deployment tags
  • Set budget alerts
  • Analyze cost trends

Troubleshooting

Deployment fails to start? Check:
  1. Model files are valid
  2. Scoring script has no syntax errors
  3. Environment dependencies are correct
  4. Sufficient quota exists for the instance type
View deployment logs:
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="my-endpoint",
    lines=500
)
High latency? Solutions:
  • Use GPU instances for deep learning models
  • Optimize the model (quantization, pruning)
  • Increase concurrent requests per instance
  • Enable request batching
  • Use model caching

Out of memory? Solutions:
  • Switch to a larger instance type
  • Reduce the batch size in the scoring script
  • Optimize model memory usage
  • Use model compression techniques

Next Steps

Online Endpoints

Learn more about real-time inference

Batch Scoring

Deploy models for batch processing

Monitor Deployments

Track performance and costs

MLOps

Automate deployment pipelines
