## Why Monitoring is Critical

ML systems fail in unique ways:

- **Silent degradation**: Accuracy drops but the API keeps returning 200 OK
- **Data drift**: Production data diverges from the training data
- **Concept drift**: The relationship between inputs and outputs changes
- **Infrastructure issues**: Out of memory, slow inference, rate limiting

Unlike traditional software, ML bugs are often statistical; you need specialized monitoring to catch them.
## Observability Layers

| Layer | What it covers |
|---|---|
| Infrastructure | CPU, memory, GPU, network, pod health |
| Application | Request rate, latency, errors, throughput |
| ML-specific | Predictions, confidence scores, feature distributions, drift |

You need all three to understand system health.
## Infrastructure Monitoring

### Prometheus + Grafana

The standard stack for Kubernetes:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack
kubectl port-forward svc/monitoring-grafana 3000:80
# Login: admin / prom-operator
```
What you get:

- Prometheus scrapes metrics from all pods
- Grafana dashboards for cluster/pod health
- Alertmanager for notifications

Key metrics:

- **CPU/Memory**: `container_cpu_usage_seconds_total`, `container_memory_usage_bytes`
- **Network**: `container_network_receive_bytes_total`
- **Disk**: `kubelet_volume_stats_used_bytes`

Prometheus stores metrics in a time-series database. Use PromQL to query them, e.g. `rate(container_cpu_usage_seconds_total[5m])`.
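`rate()` computes the per-second increase of a counter over the window. A toy illustration of the arithmetic, assuming evenly spaced scrapes (the real Prometheus implementation also handles counter resets and extrapolation at window edges):

```python
def prom_rate(samples):
    """Per-second rate of a monotonically increasing counter.

    samples: list of (timestamp_seconds, counter_value) pairs inside the window.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# CPU-seconds counter went from 100 to 160 over a 5-minute window
window = [(0, 100), (60, 112), (120, 124), (180, 136), (240, 148), (300, 160)]
print(prom_rate(window))  # 0.2 CPU-seconds per second, i.e. ~20% of one core
```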
### GPU Monitoring

GPU metrics aren't exposed by default. NVIDIA's DCGM exporter adds them:

```bash
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```

This adds GPU utilization, memory, and temperature metrics to Prometheus.
Custom dashboard queries:

```promql
# GPU utilization
sum(DCGM_FI_DEV_GPU_UTIL) by (gpu)

# GPU memory used
sum(DCGM_FI_DEV_FB_USED) by (gpu)
```

For multi-GPU training, monitor per-GPU utilization to detect stragglers or imbalanced workloads.
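Straggler detection can be as simple as flagging GPUs whose utilization falls well below the rest. A minimal sketch (the tolerance value is illustrative):

```python
def find_stragglers(gpu_util, tolerance=0.2):
    """Flag GPUs whose utilization is more than `tolerance` below the mean.

    gpu_util: dict mapping GPU id -> utilization fraction (0.0-1.0).
    """
    mean = sum(gpu_util.values()) / len(gpu_util)
    return [gpu for gpu, util in gpu_util.items() if util < mean - tolerance]

# GPU 2 lags the others: likely an imbalanced shard or a slow data loader
print(find_stragglers({'0': 0.95, '1': 0.93, '2': 0.40, '3': 0.94}))  # ['2']
```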
## Application Monitoring

### FastAPI Metrics

Instrument your API with the Prometheus client:

```python
from prometheus_client import Counter, Histogram, generate_latest
from fastapi import FastAPI, Response

app = FastAPI()

# Counter tracks totals per method/endpoint; Histogram tracks latency distribution
request_count = Counter('requests_total', 'Total requests', ['method', 'endpoint'])
request_latency = Histogram('request_latency_seconds', 'Request latency')

@app.middleware('http')
async def metrics_middleware(request, call_next):
    with request_latency.time():
        response = await call_next(request)
    request_count.labels(method=request.method, endpoint=request.url.path).inc()
    return response

@app.get('/metrics')
def metrics():
    return Response(generate_latest(), media_type='text/plain')
```

Prometheus scrapes `/metrics` every 15s (the default scrape interval).
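What Prometheus actually scrapes from `/metrics` is a plain-text exposition format. A hand-rolled sketch of how a labeled counter is rendered (the real `prometheus_client` does this for you):

```python
def render_counter(name, help_text, samples):
    """Render a labeled counter in Prometheus text exposition format."""
    lines = [f'# HELP {name} {help_text}', f'# TYPE {name} counter']
    for labels, value in samples:
        # Labels are rendered sorted, as key="value" pairs
        label_str = ','.join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f'{name}{{{label_str}}} {value}')
    return '\n'.join(lines)

text = render_counter('requests_total', 'Total requests',
                      [({'method': 'GET', 'endpoint': '/predict'}, 42.0)])
print(text)
```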
### OpenTelemetry for Traces

Traces show request flow across services:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the provider before creating tracers; the HTTP exporter
# needs the full OTLP traces path
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='http://signoz:4318/v1/traces'))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@app.post('/predict')
def predict(request):
    # Nested spans break down latency per stage
    with tracer.start_as_current_span('predict'):
        with tracer.start_as_current_span('preprocess'):
            inputs = preprocess(request)
        with tracer.start_as_current_span('model_inference'):
            outputs = model(inputs)
        with tracer.start_as_current_span('postprocess'):
            result = postprocess(outputs)
    return result
```

Traces show which span is slow (preprocessing vs. inference vs. postprocessing).
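The same breakdown can be sketched without an OpenTelemetry backend: a toy context manager that times nested stages (illustrative only, not a replacement for real tracing):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def span(name):
    """Record the wall-clock duration of a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

with span('predict'):
    with span('preprocess'):
        time.sleep(0.01)
    with span('model_inference'):
        time.sleep(0.03)
    with span('postprocess'):
        time.sleep(0.01)

# The outer 'predict' span encloses the stages; compare the stages themselves
stages = {k: v for k, v in timings.items() if k != 'predict'}
print(max(stages, key=stages.get))  # model_inference
```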
## LLM Observability

LLMs need special monitoring:

### OpenLLMetry (Traceloop)

Automatically traces LLM calls:

```python
from traceloop.sdk import Traceloop

Traceloop.init(api_endpoint='http://localhost:4318')

# Now all OpenAI/Anthropic calls are traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': 'Explain ML monitoring'}]
)
```

Captured data:

- Prompt and completion
- Token counts
- Latency
- Model and parameters
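Token counts feed directly into cost tracking. A toy sketch of the arithmetic (the per-1K prices here are illustrative placeholders, not real rates; check your provider's pricing page):

```python
def cost_usd(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate request cost from token counts.

    Prices are per 1,000 tokens; input and output are usually priced differently.
    """
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k

# Hypothetical prices: $0.01/1K input tokens, $0.03/1K output tokens
estimate = cost_usd(1200, 300, price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f'${estimate:.3f}')  # roughly $0.021 per request
```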
### LangChain Tracing

```python
import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = 'lsv2_...'

from langchain.chains import LLMChain

chain = LLMChain(...)
result = chain.run('query')  # Automatically traced
```

LangSmith (LangChain's platform) shows:

- Chain execution graph
- Intermediate steps
- Errors and retries

For production, use SigNoz (open-source) or Datadog/New Relic (managed) to aggregate traces from all services.
## SigNoz for Unified Observability

SigNoz combines metrics, traces, and logs:

```bash
helm repo add signoz https://charts.signoz.io
helm install my-release signoz/signoz
kubectl port-forward svc/my-release-signoz-frontend 3301:3301
kubectl port-forward svc/my-release-signoz-otel-collector 4318:4318
```

Now send traces via OTLP:

```bash
export TRACELOOP_BASE_URL="http://localhost:4318"
```
Features:

- Trace aggregation and search
- APM (application performance monitoring)
- Custom dashboards
- Alerts

SigNoz is like an open-source Datadog. Use it if you want full control; for managed solutions, consider Honeycomb or Datadog.
## Data Drift Detection

### Evidently

Evidently compares reference (training) and current (production) data:

```python
from evidently.report import Report
from evidently.metrics import DataDriftTable

report = Report(metrics=[DataDriftTable()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html('drift_report.html')
```

Drift types:

- **Feature drift**: Input distributions change (e.g., user demographics shift)
- **Prediction drift**: Output distributions change (e.g., suddenly predicting 90% negative)
- **Concept drift**: The input-output relationship changes (e.g., a "good" review now uses different language)

Schedule drift reports daily or weekly, and alert if the drift score exceeds a threshold.
### WhyLogs

WhyLogs profiles data in production:

```python
import whylogs as why

result = why.log(pandas=prod_df)
profile = result.profile()
profile.view().to_pandas()  # Summary statistics
```

Use cases:

- Log profiles (not raw data) to save storage
- Compare profiles across time
- Detect anomalies (outliers, nulls, type changes)
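Comparing profiles instead of raw data is cheap because a profile is just summary statistics. A minimal sketch of the idea (the field names and tolerance are illustrative, not the whylogs schema):

```python
def profile(values):
    """Tiny data profile: row count, null count, and mean of non-null values."""
    non_null = [v for v in values if v is not None]
    return {
        'count': len(values),
        'null_count': len(values) - len(non_null),
        'mean': sum(non_null) / len(non_null),
    }

def anomalies(reference, current, mean_tolerance=0.5):
    """Compare two profiles and report suspicious changes."""
    issues = []
    if current['null_count'] > reference['null_count']:
        issues.append('null_increase')
    if abs(current['mean'] - reference['mean']) > mean_tolerance * abs(reference['mean']):
        issues.append('mean_shift')
    return issues

ref = profile([10, 12, 11, 13])      # training-time snapshot
cur = profile([25, None, 24, 26])    # production batch: nulls appeared, mean jumped
print(anomalies(ref, cur))  # ['null_increase', 'mean_shift']
```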
### Alibi Detect

Alibi Detect provides algorithms for drift and outlier detection:

```python
from alibi_detect.cd import TabularDrift

cd = TabularDrift(x_ref=X_train, p_val=0.05)
preds = cd.predict(X_prod)
if preds['data']['is_drift']:
    print('Drift detected!')
```

Methods:

- Kolmogorov-Smirnov test
- Chi-squared test
- Maximum Mean Discrepancy (MMD)

For high-dimensional data (images, embeddings), use MMD or learned drift detectors instead of univariate tests.
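To make the univariate case concrete, here is a from-scratch sketch of the Kolmogorov-Smirnov statistic: the maximum gap between the two samples' empirical CDFs. In practice use scipy or Alibi Detect, which also compute the p-value; this only computes the statistic:

```python
def ks_statistic(ref, cur):
    """Maximum distance between the empirical CDFs of two samples."""
    def ecdf(sample, x):
        # Fraction of the sample at or below x
        return sum(v <= x for v in sample) / len(sample)

    all_points = sorted(set(ref) | set(cur))
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in all_points)

reference = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 9, 10, 11, 12]  # distribution moved to the right

print(ks_statistic(reference, reference))  # 0.0 — identical samples
print(ks_statistic(reference, shifted))    # 0.5 — half the mass no longer overlaps
```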
## Seldon Core v2 for Advanced Monitoring

Seldon Core v2 integrates outlier/drift detection into serving:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income-model
spec:
  storageUri: s3://bucket/income-model
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income-drift
spec:
  storageUri: s3://bucket/drift-detector
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: income-pipeline
spec:
  steps:
    - name: drift-check
      model: income-drift
    - name: prediction
      model: income-model
      inputs: [drift-check.outputs]
```

Requests flow through the drift detector before prediction. If drift is detected, Seldon logs it (or blocks the request).
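The control flow of that pipeline can be sketched in plain Python (`drift_detector`, `model`, and the blocking flag are illustrative stand-ins for the Seldon steps, not its API):

```python
def serve(request, drift_detector, model, block_on_drift=False):
    """Route a request through a drift check before the model, like the pipeline above."""
    drifted = drift_detector(request)
    if drifted and block_on_drift:
        return {'error': 'request rejected: drift detected'}
    prediction = model(request)
    # In Seldon the drift result would be logged/alerted, not returned inline
    return {'prediction': prediction, 'drift': drifted}

# Toy detector: flags inputs outside the training range [0, 100]
detector = lambda x: not (0 <= x <= 100)
model = lambda x: x * 2

print(serve(42, detector, model))                        # {'prediction': 84, 'drift': False}
print(serve(999, detector, model, block_on_drift=True))  # rejected with an error
```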
Seldon v2 is powerful but complex to set up. Use it in regulated industries (finance, healthcare) where explainability and monitoring requirements are strict.
## Model Performance Monitoring

Track prediction quality over time:

```python
import wandb

wandb.init(project='production-monitoring')

for batch in production_stream:
    predictions = model.predict(batch.inputs)

    # Log prediction statistics
    wandb.log({
        'mean_confidence': predictions.mean(),
        'prediction_distribution': wandb.Histogram(predictions)
    })

    # If labels arrive later (e.g., user feedback)
    if batch.labels:
        accuracy = (predictions == batch.labels).mean()
        wandb.log({'accuracy': accuracy})
```

Key metrics:

- **Confidence distribution**: A shift toward low confidence signals growing model uncertainty
- **Prediction skew**: Suddenly predicting 95% class A suggests data drift
- **Accuracy** (if labels are available): A direct measure of performance

For many applications, ground-truth labels arrive with a delay (e.g., click-through rate is measured days later). Use confidence and distribution shifts as early warning signals.
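A concrete early-warning check that needs only the predictions themselves: compare the current class distribution to a reference window using total variation distance (the alert threshold would be tuned per application; this is a sketch):

```python
from collections import Counter

def class_distribution(predictions):
    """Fraction of predictions per class."""
    counts = Counter(predictions)
    total = len(predictions)
    return {cls: n / total for cls, n in counts.items()}

def skew(reference, current):
    """Total variation distance between class distributions (0 = identical, 1 = disjoint)."""
    classes = set(reference) | set(current)
    return 0.5 * sum(abs(reference.get(c, 0) - current.get(c, 0)) for c in classes)

ref = class_distribution(['A'] * 50 + ['B'] * 50)   # balanced at training time
cur = class_distribution(['A'] * 95 + ['B'] * 5)    # production suddenly skewed
print(skew(ref, cur))  # 0.45 — a large shift, worth alerting on
```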
## Alerting

Prometheus Alertmanager rules:

```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(requests_total{status="500"}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: SlowInference
        expr: histogram_quantile(0.95, sum(rate(request_latency_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        annotations:
          summary: "95th percentile latency > 1s"
```

Configure Alertmanager to route notifications to Slack or PagerDuty.
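The `for:` clause means the expression must stay true for the whole window before the alert fires; a brief spike only makes it "pending". A toy sketch of that lifecycle, simplified to one sample per evaluation step:

```python
def alert_states(samples, threshold, for_steps):
    """Toy Prometheus-style alert lifecycle: inactive -> pending -> firing."""
    states, breached = [], 0
    for value in samples:
        if value > threshold:
            breached += 1
            states.append('firing' if breached >= for_steps else 'pending')
        else:
            breached = 0  # any recovery resets the 'for' timer
            states.append('inactive')
    return states

# Error rate spikes briefly, recovers, then stays high long enough to fire
rates = [0.02, 0.08, 0.01, 0.09, 0.07, 0.06]
print(alert_states(rates, threshold=0.05, for_steps=3))
# ['inactive', 'pending', 'inactive', 'pending', 'pending', 'firing']
```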
PagerDuty for on-call:

```python
import pypd

pypd.api_key = 'your-key'

# Events API v2: routing key plus structured payload
pypd.EventV2.create(data={
    'routing_key': 'your-routing-key',
    'event_action': 'trigger',
    'payload': {
        'summary': 'Model accuracy dropped below 80%',
        'severity': 'error',
        'source': 'monitoring-script'
    }
})
```
## Hands-On Examples

Explore monitoring in Module 7:

- Set up Prometheus + Grafana on Kubernetes
- Instrument FastAPI with OpenTelemetry
- Deploy SigNoz for trace aggregation
- Detect drift with Evidently and Alibi Detect
- Configure Seldon Core v2 pipelines

## Next Steps

- **Production Patterns**: Put it all together
- **Containerization**: Review deployment basics

## Further Reading