Module 7: Monitoring
This module covers observability and monitoring strategies for machine learning applications in production. You’ll learn how to instrument your applications with modern observability tools, set up dashboards for system metrics, and detect data drift in your ML models.
What You’ll Learn
In this module, you’ll gain hands-on experience with:
LLM Observability
Instrument LLM applications with OpenTelemetry, LangSmith, and OpenLLMetry
SigNoz Setup
Deploy and configure SigNoz for distributed tracing and metrics
Grafana Dashboards
Set up Prometheus and Grafana for Kubernetes monitoring
Data Monitoring
Detect drift and outliers with Evidently and Seldon
Module Overview
Monitoring is critical for maintaining reliable ML systems in production. This module focuses on three key areas:
System Observability
Learn to instrument your applications with OpenTelemetry for distributed tracing, metrics, and logs. You’ll set up:
- SigNoz for end-to-end observability
- Grafana dashboards for Kubernetes metrics
- Prometheus for metrics collection and alerting
LLM Application Monitoring
Special attention to observability patterns for Large Language Model applications:
- Track token usage and costs
- Monitor latency and throughput
- Trace multi-step reasoning chains
- Compare different observability platforms (AgentOps, LangSmith, OpenLLMetry)
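As a sketch of the first two points, the snippet below tracks token counts, latency, and an estimated cost per call using only the standard library. The per-token rates are made-up placeholders, not any provider's real pricing, and `timed_call` is a hypothetical helper for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class LLMCallStats:
    """Per-call metrics you would attach to a trace span or metrics backend."""
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

    def cost_usd(self, prompt_rate=3e-06, completion_rate=1.5e-05):
        # Rates are illustrative placeholders, not real provider pricing.
        return (self.prompt_tokens * prompt_rate
                + self.completion_tokens * completion_rate)


def timed_call(fn, *args, **kwargs):
    """Run an LLM call (any callable) and measure wall-clock latency."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Platforms like LangSmith and OpenLLMetry capture these same dimensions automatically; the point here is to see which numbers matter before a tool hides them.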
Data Monitoring
Implement monitoring for your ML models and data pipelines:
- Drift detection to identify when input distributions change
- Outlier detection to catch anomalous requests
- Model performance monitoring to track prediction quality
- Integration with Evidently and Seldon for production monitoring
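To build intuition for what Evidently computes under the hood, here is a pure-Python sketch of drift detection on a single numeric feature using the two-sample Kolmogorov-Smirnov statistic. The fixed threshold is illustrative; production libraries derive proper p-values instead.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the reference and current samples."""
    ref, cur = sorted(reference), sorted(current)

    def ecdf(sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


def drift_detected(reference, current, threshold=0.2):
    # Illustrative cutoff; Evidently and Alibi Detect use statistical tests
    # with p-values rather than a hand-picked threshold.
    return ks_statistic(reference, current) > threshold
```

In the module you will run the library versions of these tests across whole feature sets, but each per-feature check reduces to a comparison like this one.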
Prerequisites
Before starting this module, you should have:
- A Kubernetes cluster (kind or similar)
- Basic understanding of Kubernetes concepts
- Familiarity with Python and ML concepts
- Experience with Module 5 (Model Serving) recommended
Architecture
The monitoring stack combines OpenTelemetry instrumentation in your applications, SigNoz and the Prometheus/Grafana stack for traces and system metrics, and Evidently with Seldon Core for data and model monitoring.
Learning Outcomes
By the end of this module, you will be able to:
Instrument applications for observability
- Add OpenTelemetry instrumentation to Python applications
- Configure tracing for LLM applications
- Send traces and metrics to observability backends
- Use decorators and context managers for custom instrumentation
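The last point can be sketched without any dependencies: a decorator that records a span-like record for every call. `SPANS` is a stand-in for a real exporter; with OpenTelemetry you would open `tracer.start_as_current_span` inside the wrapper instead.

```python
import functools
import time

SPANS = []  # stand-in for a real exporter backend


def traced(fn):
    """Decorator sketch: record name and duration for every call,
    even when the wrapped function raises."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            SPANS.append({
                "name": fn.__name__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


@traced
def embed(text):
    # Hypothetical step standing in for a real embedding call.
    return [float(len(text))]
```

The `try/finally` shape is the important part: instrumentation must capture failures too, since error spans are often the ones you need most.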
Deploy and configure monitoring tools
- Install SigNoz on Kubernetes using Helm
- Set up Prometheus and Grafana stack
- Configure service discovery and scraping
- Create custom dashboards and alerts
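On the application side of that setup, exposing metrics for Prometheus to scrape can look like this sketch with the `prometheus_client` library. The metric names and the port are assumptions for illustration, and `predict` is a hypothetical handler.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric names are illustrative; follow your team's naming conventions.
REQUESTS = Counter(
    "inference_requests_total", "Total inference requests", ["model"]
)
LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds"
)


@LATENCY.time()  # observes the duration of each call into the histogram
def predict(model: str, payload):
    REQUESTS.labels(model=model).inc()
    return {"input": payload}  # stand-in for real inference


# start_http_server(8000)  # serves /metrics for Prometheus to scrape
```

Prometheus then discovers the pod via its service-discovery config and scrapes `/metrics`; Grafana dashboards query the resulting time series.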
Monitor ML models in production
- Deploy Seldon Core with drift and outlier detectors
- Configure Evidently for data quality monitoring
- Set up alerting for model degradation
- Build monitoring pipelines for continuous validation
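A minimal sketch of the model-degradation side, assuming ground truth labels eventually arrive: a rolling-accuracy monitor whose window size and threshold are illustrative stand-ins for values you would derive from your SLOs.

```python
from collections import deque


class PerformanceMonitor:
    """Rolling-accuracy degradation check (illustrative sketch)."""

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, label):
        # Requires ground truth, which often arrives with a delay.
        self.window.append(prediction == label)

    @property
    def accuracy(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 1.0

    def degraded(self) -> bool:
        # Only alert once the window is full, to avoid noisy early alarms.
        return (len(self.window) == self.window.maxlen
                and self.accuracy < self.threshold)
```

In production this check would feed an alerting rule rather than a boolean; the deferred-ground-truth problem it sidesteps is exactly why the module also covers drift detection, which needs no labels at all.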
Design monitoring strategies
- Identify key metrics for ML systems
- Plan ground truth collection strategies
- Design alerting thresholds and SLOs
- Document monitoring and incident response procedures
Practice Tasks
This module includes hands-on homework assignments:
Integrate SigNoz Monitoring
Add SigNoz instrumentation to your application and verify traces are being collected.
Implement Drift Detection
Add drift detection logic to your ML pipeline (Kubeflow, Airflow, or Dagster).
Tools and Technologies
This module uses the following tools:
- SigNoz: Open-source observability platform
- Grafana: Visualization and dashboarding
- Prometheus: Metrics collection and alerting
- OpenTelemetry: Instrumentation framework
- LangSmith: LLM application monitoring
- AgentOps: Agent workflow observability
- OpenLLMetry: LLM-specific telemetry
- Evidently: ML monitoring and drift detection
- Seldon Core: Model serving with analytics
- Alibi Detect: Outlier and drift detection algorithms
Reading Materials
Key papers and resources:
- How ML Breaks: A Decade of Outages for One Large ML Pipeline
- Monitoring and explainability of models in production
- Data Distribution Shifts and Monitoring
- Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
Next Steps
Start with Observability
Learn LLM observability concepts and patterns
View Practice Tasks
See homework assignments and criteria