Module 7: Monitoring
This module covers observability and monitoring strategies for machine learning applications in production. You’ll learn how to instrument your applications with modern observability tools, set up dashboards for system metrics, and detect data drift in your ML models.
What You’ll Learn
In this module, you’ll gain hands-on experience with:
LLM Observability
Instrument LLM applications with OpenTelemetry, LangSmith, and OpenLLMetry
SigNoz Setup
Deploy and configure SigNoz for distributed tracing and metrics
Grafana Dashboards
Set up Prometheus and Grafana for Kubernetes monitoring
Data Monitoring
Detect drift and outliers with Evidently and Seldon
Module Overview
Monitoring is critical for maintaining reliable ML systems in production. This module focuses on three key areas:
System Observability
Learn to instrument your applications with OpenTelemetry for distributed tracing, metrics, and logs. You’ll set up:
- SigNoz for end-to-end observability
- Grafana dashboards for Kubernetes metrics
- Prometheus for metrics collection and alerting
LLM Application Monitoring
Special attention to observability patterns for Large Language Model applications:
- Track token usage and costs
- Monitor latency and throughput
- Trace multi-step reasoning chains
- Compare different observability platforms (AgentOps, LangSmith, OpenLLMetry)
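As a sketch of the first two points, the snippet below tracks token counts, latency, and an estimated cost per call using only the standard library. The per-token rates are made-up placeholders, not any provider's real pricing, and `timed_call` is a hypothetical helper for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class LLMCallStats:
    """Per-call metrics you would attach to a trace span or metrics backend."""
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

    def cost_usd(self, prompt_rate=3e-06, completion_rate=1.5e-05):
        # Rates are illustrative placeholders, not real provider pricing.
        return (self.prompt_tokens * prompt_rate
                + self.completion_tokens * completion_rate)


def timed_call(fn, *args, **kwargs):
    """Run an LLM call (any callable) and measure wall-clock latency."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Platforms like LangSmith and OpenLLMetry capture these same dimensions automatically; the point here is to see which numbers matter before a tool hides them.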
Data Monitoring
Implement monitoring for your ML models and data pipelines:
- Drift detection to identify when input distributions change
- Outlier detection to catch anomalous requests
- Model performance monitoring to track prediction quality
- Integration with Evidently and Seldon for production monitoring
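To build intuition for what Evidently computes under the hood, here is a pure-Python sketch of drift detection on a single numeric feature using the two-sample Kolmogorov-Smirnov statistic. The fixed threshold is illustrative; production libraries derive proper p-values instead.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the reference and current samples."""
    ref, cur = sorted(reference), sorted(current)

    def ecdf(sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


def drift_detected(reference, current, threshold=0.2):
    # Illustrative cutoff; Evidently and Alibi Detect use statistical tests
    # with p-values rather than a hand-picked threshold.
    return ks_statistic(reference, current) > threshold
```

In the module you will run the library versions of these tests across whole feature sets, but each per-feature check reduces to a comparison like this one.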
Prerequisites
Before starting this module, you should have:
- A Kubernetes cluster (kind or similar)
- Basic understanding of Kubernetes concepts
- Familiarity with Python and ML concepts
- Experience with Module 5 (Model Serving) recommended
Architecture
The monitoring stack combines OpenTelemetry instrumentation in your applications, SigNoz and the Prometheus/Grafana stack for traces and system metrics, and Evidently with Seldon Core for data and model monitoring.
Learning Outcomes
By the end of this module, you will be able to:
Instrument applications for observability
- Add OpenTelemetry instrumentation to Python applications
- Configure tracing for LLM applications
- Send traces and metrics to observability backends
- Use decorators and context managers for custom instrumentation
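The last point can be sketched without any dependencies: a decorator that records a span-like record for every call. `SPANS` is a stand-in for a real exporter; with OpenTelemetry you would open `tracer.start_as_current_span` inside the wrapper instead.

```python
import functools
import time

SPANS = []  # stand-in for a real exporter backend


def traced(fn):
    """Decorator sketch: record name and duration for every call,
    even when the wrapped function raises."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            SPANS.append({
                "name": fn.__name__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


@traced
def embed(text):
    # Hypothetical step standing in for a real embedding call.
    return [float(len(text))]
```

The `try/finally` shape is the important part: instrumentation must capture failures too, since error spans are often the ones you need most.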
Deploy and configure monitoring tools
- Install SigNoz on Kubernetes using Helm
- Set up Prometheus and Grafana stack
- Configure service discovery and scraping
- Create custom dashboards and alerts
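On the application side of that setup, exposing metrics for Prometheus to scrape can look like this sketch with the `prometheus_client` library. The metric names and the port are assumptions for illustration, and `predict` is a hypothetical handler.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric names are illustrative; follow your team's naming conventions.
REQUESTS = Counter(
    "inference_requests_total", "Total inference requests", ["model"]
)
LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds"
)


@LATENCY.time()  # observes the duration of each call into the histogram
def predict(model: str, payload):
    REQUESTS.labels(model=model).inc()
    return {"input": payload}  # stand-in for real inference


# start_http_server(8000)  # serves /metrics for Prometheus to scrape
```

Prometheus then discovers the pod via its service-discovery config and scrapes `/metrics`; Grafana dashboards query the resulting time series.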
Monitor ML models in production
- Deploy Seldon Core with drift and outlier detectors
- Configure Evidently for data quality monitoring
- Set up alerting for model degradation
- Build monitoring pipelines for continuous validation
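A minimal sketch of the model-degradation side, assuming ground truth labels eventually arrive: a rolling-accuracy monitor whose window size and threshold are illustrative stand-ins for values you would derive from your SLOs.

```python
from collections import deque


class PerformanceMonitor:
    """Rolling-accuracy degradation check (illustrative sketch)."""

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, label):
        # Requires ground truth, which often arrives with a delay.
        self.window.append(prediction == label)

    @property
    def accuracy(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 1.0

    def degraded(self) -> bool:
        # Only alert once the window is full, to avoid noisy early alarms.
        return (len(self.window) == self.window.maxlen
                and self.accuracy < self.threshold)
```

In production this check would feed an alerting rule rather than a boolean; the deferred-ground-truth problem it sidesteps is exactly why the module also covers drift detection, which needs no labels at all.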
Design monitoring strategies
- Identify key metrics for ML systems
- Plan ground truth collection strategies
- Design alerting thresholds and SLOs
- Document monitoring and incident response procedures
Practice Tasks
This module includes hands-on homework assignments:
Integrate SigNoz Monitoring
Add SigNoz instrumentation to your application and verify traces are being collected.
Implement Drift Detection
Add drift detection logic to your ML pipeline (Kubeflow, Airflow, or Dagster).
Tools and Technologies
This module uses the following tools:
- SigNoz: Open-source observability platform
- Grafana: Visualization and dashboarding
- Prometheus: Metrics collection and alerting
- OpenTelemetry: Instrumentation framework
- LangSmith: LLM application monitoring
- AgentOps: Agent workflow observability
- OpenLLMetry: LLM-specific telemetry
- Evidently: ML monitoring and drift detection
- Seldon Core: Model serving with analytics
- Alibi Detect: Outlier and drift detection algorithms
Reading Materials
Key papers and resources:
- How ML Breaks: A Decade of Outages for One Large ML Pipeline
- Monitoring and explainability of models in production
- Data Distribution Shifts and Monitoring
- Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
Next Steps
Start with Observability
Learn LLM observability concepts and patterns
View Practice Tasks
See homework assignments and criteria