Module 7: Monitoring

This module covers observability and monitoring strategies for machine learning applications in production. You’ll learn how to instrument your applications with modern observability tools, set up dashboards for system metrics, and detect data drift in your ML models.

What You’ll Learn

In this module, you’ll gain hands-on experience with:

LLM Observability

Instrument LLM applications with OpenTelemetry, LangSmith, and OpenLLMetry

SigNoz Setup

Deploy and configure SigNoz for distributed tracing and metrics

Grafana Dashboards

Set up Prometheus and Grafana for Kubernetes monitoring

Data Monitoring

Detect drift and outliers with Evidently and Seldon

Module Overview

Monitoring is critical for maintaining reliable ML systems in production. This module focuses on three key areas:

System Observability

Learn to instrument your applications with OpenTelemetry for distributed tracing, metrics, and logs. You’ll set up:
  • SigNoz for end-to-end observability
  • Grafana dashboards for Kubernetes metrics
  • Prometheus for metrics collection and alerting
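To make the metrics-collection step concrete, here is a minimal pure-Python sketch of the text exposition format that a `/metrics` endpoint serves and Prometheus scrapes. The metric name, labels, and values are hypothetical examples, not part of any specific application; in practice a client library such as `prometheus_client` generates this output for you.

```python
def render_counter(name: str, help_text: str, samples: dict[tuple, float]) -> str:
    """Render one counter metric in Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        # labels is a tuple of (key, value) pairs, e.g. (("method", "GET"),)
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

metrics = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    {(("method", "GET"), ("status", "200")): 1024.0,
     (("method", "POST"), ("status", "500")): 3.0},
)
print(metrics)
```

Prometheus scrapes this plain-text payload on a schedule, which is why instrumented services only need to keep counters in memory and expose them over HTTP.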

LLM Application Monitoring

Special attention to observability patterns for Large Language Model applications:
  • Track token usage and costs
  • Monitor latency and throughput
  • Trace multi-step reasoning chains
  • Compare different observability platforms (AgentOps, LangSmith, OpenLLMetry)
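The token, cost, and latency tracking above can be sketched as a thin wrapper around an LLM call. This is an illustration of the bookkeeping that platforms like LangSmith and OpenLLMetry automate; the per-token prices and the `fake_llm` stub are hypothetical, and real provider SDKs expose usage fields with their own names.

```python
import time
from dataclasses import dataclass

# Hypothetical per-1K-token prices; real costs depend on the model/provider.
PRICE_PER_1K = {"prompt": 0.001, "completion": 0.002}

@dataclass
class LLMCallRecord:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int

    @property
    def cost_usd(self) -> float:
        return (self.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
                + self.completion_tokens / 1000 * PRICE_PER_1K["completion"])

def traced_call(fn, *args, **kwargs):
    """Wrap an LLM call, recording latency and token usage.

    `fn` is assumed to return (text, prompt_tokens, completion_tokens)."""
    start = time.perf_counter()
    text, p_tok, c_tok = fn(*args, **kwargs)
    record = LLMCallRecord(time.perf_counter() - start, p_tok, c_tok)
    return text, record

# Stub standing in for a real provider call.
def fake_llm(prompt: str):
    return f"echo: {prompt}", len(prompt.split()), 5

text, rec = traced_call(fake_llm, "hello observability world")
```

In a multi-step chain, each `LLMCallRecord` would become a span attribute, letting you aggregate cost and latency per request trace.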

Data Monitoring

Implement monitoring for your ML models and data pipelines:
  • Drift detection to identify when input distributions change
  • Outlier detection to catch anomalous requests
  • Model performance monitoring to track prediction quality
  • Integration with Evidently and Seldon for production monitoring
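As a taste of what Evidently and Alibi Detect automate, here is a minimal Population Stability Index (PSI) drift check in pure Python. The binning scheme and the 0.2 threshold are a common rule of thumb, not a fixed standard, and production tools use more robust statistics.

```python
import math

def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.

    Bins are derived from the reference distribution's range; a small
    epsilon avoids log(0) for empty bins."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0
    eps = 1e-6

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1
        return [c / len(sample) + eps for c in counts]

    ref_f, cur_f = fractions(reference), fractions(current)
    return sum((r - c) * math.log(r / c) for r, c in zip(ref_f, cur_f))

reference = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
shifted   = [0.5 + i / 200 for i in range(100)]  # mass shifted to the right

# Common rule of thumb: PSI > 0.2 signals significant drift.
drifted = psi(reference, shifted) > 0.2
```

Running this check on a schedule against each feature's reference window is the core of the drift-detection pipelines you will build in this module.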

Prerequisites

Before starting this module, you should have:
  • A Kubernetes cluster (kind or similar)
  • Basic understanding of Kubernetes concepts
  • Familiarity with Python and ML concepts
  • Experience with Module 5 (Model Serving) recommended

Architecture

The monitoring stack includes:
  • OpenTelemetry instrumentation in the application layer, emitting traces, metrics, and logs
  • SigNoz as the observability backend for end-to-end traces and metrics
  • Prometheus and Grafana for Kubernetes metrics collection, dashboards, and alerting
  • Seldon Core with Alibi Detect and Evidently for drift, outlier, and data quality monitoring

Learning Outcomes

By the end of this module, you will be able to:
  • Add OpenTelemetry instrumentation to Python applications
  • Configure tracing for LLM applications
  • Send traces and metrics to observability backends
  • Use decorators and context managers for custom instrumentation
  • Install SigNoz on Kubernetes using Helm
  • Set up Prometheus and Grafana stack
  • Configure service discovery and scraping
  • Create custom dashboards and alerts
  • Deploy Seldon Core with drift and outlier detectors
  • Configure Evidently for data quality monitoring
  • Set up alerting for model degradation
  • Build monitoring pipelines for continuous validation
  • Identify key metrics for ML systems
  • Plan ground truth collection strategies
  • Design alerting thresholds and SLOs
  • Document monitoring and incident response procedures
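The "decorators and context managers for custom instrumentation" outcome follows this pattern, sketched here with only the standard library. The in-memory `SPANS` list is a stand-in for a real exporter; with OpenTelemetry the equivalent is `tracer.start_as_current_span`, which ships spans to a backend such as SigNoz.

```python
import functools
import time
from contextlib import contextmanager

# Collected spans; a real setup would export these via OpenTelemetry
# instead of keeping them in memory.
SPANS: list[dict] = []

@contextmanager
def span(name: str, **attributes):
    """Context manager that records a named, timed span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"name": name,
                      "duration_s": time.perf_counter() - start,
                      "attributes": attributes})

def traced(name: str):
    """Decorator form of the same pattern."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            with span(name, function=fn.__name__):
                return fn(*args, **kwargs)
        return inner
    return wrap

@traced("preprocess")
def preprocess(batch):
    return [x * 2 for x in batch]

with span("pipeline", stage="demo"):
    result = preprocess([1, 2, 3])
```

Note the ordering: inner spans close before outer ones, which is how a backend reconstructs the parent-child structure of a trace.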

Practice Tasks

This module includes hands-on homework assignments:
1. Integrate SigNoz Monitoring: Add SigNoz instrumentation to your application and verify traces are being collected.
2. Create Grafana Dashboard: Build a custom dashboard showing key metrics for your application.
3. Implement Drift Detection: Add drift detection logic to your ML pipeline (Kubeflow, Airflow, or Dagster).
4. Design Monitoring Strategy: Document your system and ML monitoring plan, including ground truth collection and alert definitions.

See the Practice page for detailed requirements and evaluation criteria.

Tools and Technologies

This module uses the following tools:
  • SigNoz: Open-source observability platform
  • Grafana: Visualization and dashboarding
  • Prometheus: Metrics collection and alerting
  • OpenTelemetry: Instrumentation framework
  • LangSmith: LLM application monitoring
  • AgentOps: Agent workflow observability
  • OpenLLMetry: LLM-specific telemetry
  • Evidently: ML monitoring and drift detection
  • Seldon Core: Model serving with analytics
  • Alibi Detect: Outlier and drift detection algorithms

Reading Materials

Key papers and resources:

Next Steps

Start with Observability

Learn LLM observability concepts and patterns

View Practice Tasks

See homework assignments and criteria
