Running LLM applications in production requires proactive monitoring to catch issues before they impact users. This guide shows you how to set up comprehensive production monitoring with Helicone.

The Challenge

Production LLM applications face unique challenges:
  • Unpredictable errors: Provider outages, rate limits, and model changes
  • Cost volatility: Usage spikes from viral features or abuse
  • Quality degradation: Prompt drift, model updates, data issues
  • Performance issues: Latency spikes affecting user experience

Solution Overview

Helicone provides a complete monitoring stack:

Real-time Alerts

Get notified of errors, cost spikes, and latency issues

Request Observability

View every request/response with full context

Usage Analytics

Track costs, token usage, and model performance

User Tracking

Monitor per-user costs and identify abuse

Implementation Guide

1. Instrument Your Application

Add monitoring headers to all production requests:
import { OpenAI } from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Wrapper function with monitoring
async function monitoredLLMCall(
  userId: string,
  feature: string,
  messages: any[],
  options: any = {}
) {
  const startTime = Date.now();
  
  try {
    const response = await client.chat.completions.create(
      {
        model: options.model || "gpt-4o",
        messages,
        ...options,
      },
      {
        headers: {
          // Core monitoring headers
          "Helicone-User-Id": userId,
          "Helicone-Property-Feature": feature,
          "Helicone-Property-Environment": "production",
          
          // Optional: Version tracking
          "Helicone-Property-Version": process.env.APP_VERSION || "unknown",
          
          // Optional: Session tracking for multi-step workflows
          ...(options.sessionId && {
            "Helicone-Session-Id": options.sessionId,
            "Helicone-Session-Name": feature,
            "Helicone-Session-Path": options.sessionPath || "/",
          }),
        },
      }
    );
    
    // Log success metrics
    const duration = Date.now() - startTime;
    console.log(`LLM call succeeded: ${feature} (${duration}ms)`);
    
    return response;
  } catch (error: any) {
    // Log error with context
    console.error('LLM call failed:', {
      feature,
      userId,
      error: error.message,
      status: error.status,
    });
    
    throw error;
  }
}

// Usage
await monitoredLLMCall(
  "user-123",
  "chat",
  [{ role: "user", content: "Hello!" }]
);

2. Set Up Critical Alerts

Create alerts for production issues:
1. Error Rate Alert

Navigate to Settings → Alerts and create:

Alert Configuration:
  • Name: Production Error Rate
  • Metric: Error Rate
  • Threshold: > 5%
  • Time Window: 10 minutes
  • Minimum Requests: 10 (avoid false positives)
Filters:
  • Property: Environment = production
Notifications:
This catches provider outages, rate limit issues, and breaking changes quickly.
2. Cost Spike Alert

Alert Configuration:
  • Name: Production Cost Spike
  • Metric: Cost
  • Threshold: > $100/day
  • Time Window: 1 day
Filters:
  • Property: Environment = production
Notifications:
Prevents unexpected bills from usage spikes or abuse.
3. Latency Alert

Alert Configuration:
  • Name: High Latency
  • Metric: Latency
  • Threshold: P95 > 10000ms
  • Time Window: 30 minutes
  • Minimum Requests: 20
Filters:
  • Property: Environment = production
Notifications:
  • Slack: #performance-alerts
Detects performance degradation affecting user experience.
4. Feature-Specific Alert

For critical features:

Alert Configuration:
  • Name: Chat Feature Errors
  • Metric: Error Rate
  • Threshold: > 2%
  • Time Window: 5 minutes
Filters:
  • Property: Environment = production
  • Property: Feature = chat
More sensitive monitoring for revenue-critical features.

3. Configure User Monitoring

Track per-user usage to identify abuse and understand behavior:
// Track user metadata for segmentation
await client.chat.completions.create(
  { /* ... */ },
  {
    headers: {
      "Helicone-User-Id": user.id,
      
      // Segment by user tier for different thresholds
      "Helicone-Property-UserTier": user.tier, // free, pro, enterprise
      
      // Track signup date to understand cohort behavior
      "Helicone-Property-SignupDate": user.signupDate,
      
      // Geographic segmentation
      "Helicone-Property-Region": user.region,
    },
  }
);
Use Cases:
  • Identify users exceeding quotas
  • Detect potential abuse patterns
  • Understand usage by tier/cohort
  • Calculate customer lifetime value
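The "identify users exceeding quotas" use case above can be sketched as a small aggregation over rows returned by the query API (the same endpoint used in the quota check in step 5). The row shape `{ user_id, cost_usd }` is an assumption based on the fields this guide uses elsewhere, not a documented schema:

```typescript
// Group request rows by user and surface the heaviest spenders.
// Row shape is assumed from the query examples in this guide.
interface RequestRow {
  user_id: string;
  cost_usd?: number;
}

function topUsersByCost(rows: RequestRow[], limit = 5) {
  const totals = new Map<string, { requests: number; cost: number }>();
  for (const row of rows) {
    const entry = totals.get(row.user_id) ?? { requests: 0, cost: 0 };
    entry.requests += 1;
    entry.cost += row.cost_usd ?? 0;
    totals.set(row.user_id, entry);
  }
  // Sort descending by total cost and keep the top N
  return [...totals.entries()]
    .map(([userId, t]) => ({ userId, ...t }))
    .sort((a, b) => b.cost - a.cost)
    .slice(0, limit);
}
```

Run this on a daily schedule and compare the output against each user's tier to flag abuse candidates before they trigger a cost alert.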

4. Implement Session Tracking

For multi-step workflows, track complete user journeys:
import { randomUUID } from "crypto";

// Start of user interaction
const sessionId = randomUUID();

// Step 1: Initial query
await monitoredLLMCall(
  userId,
  "document-qa",
  [{ role: "user", content: "Summarize this document" }],
  {
    sessionId,
    sessionPath: "/summarize",
  }
);

// Step 2: Follow-up
await monitoredLLMCall(
  userId,
  "document-qa",
  [{ role: "user", content: "Extract key points" }],
  {
    sessionId,
    sessionPath: "/summarize/extract",
  }
);

// View complete workflow in Helicone Sessions dashboard
Benefits:
  • See total cost per user interaction
  • Debug failures with full context
  • Identify expensive workflow patterns
  • Measure success rates for complete flows
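Threading `sessionId` and `sessionPath` through every step by hand is error-prone. One way to keep paths consistent is a small helper (not part of the Helicone SDK; the class name is our own) that produces the options object passed to `monitoredLLMCall` above:

```typescript
import { randomUUID } from "crypto";

// Generates one session id per workflow and appends a path segment
// per step, so nested steps render as a tree in the Sessions view.
class WorkflowSession {
  readonly sessionId = randomUUID();
  private path = "";

  // step("summarize") then step("extract") yields
  // "/summarize" then "/summarize/extract".
  step(name: string): { sessionId: string; sessionPath: string } {
    this.path = `${this.path}/${name}`;
    return { sessionId: this.sessionId, sessionPath: this.path };
  }
}
```

Usage: `const session = new WorkflowSession();` then pass `session.step("summarize")` as the options argument for each call in the workflow.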

5. Set Up Cost Controls

Implement rate limiting and quota management:
interface UserQuota {
  dailyLimit: number;
  monthlyLimit: number;
}

const QUOTAS: Record<string, UserQuota> = {
  free: { dailyLimit: 10, monthlyLimit: 100 },
  pro: { dailyLimit: 500, monthlyLimit: 10000 },
  enterprise: { dailyLimit: 10000, monthlyLimit: 1000000 },
};

async function checkQuota(userId: string, tier: string): Promise<void> {
  // Query Helicone for user's usage
  const response = await fetch(
    "https://api.helicone.ai/v1/request/query-clickhouse",
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.HELICONE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        filter: {
          request_response_rmt: {
            user_id: { equals: userId },
            request_created_at: {
              gte: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
            },
          },
        },
      }),
    }
  );
  
  const data = await response.json();
  const dailyRequests = data.data.length;
  const quota = QUOTAS[tier];
  
  if (dailyRequests >= quota.dailyLimit) {
    throw new Error(
      `Daily limit of ${quota.dailyLimit} requests reached. ` +
      `Upgrade or try again tomorrow.`
    );
  }
}

// Use before each LLM call
await checkQuota(userId, userTier);
await monitoredLLMCall(userId, feature, messages);
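Calling the query API on every request adds latency and load. A common refinement is to cache the usage count in memory for a short window; a minimal sketch, where `fetchDailyRequestCount` is a hypothetical stand-in for the query performed in `checkQuota` above:

```typescript
// Cache quota lookups so each LLM call doesn't trigger an API query.
type CacheEntry = { count: number; fetchedAt: number };

const QUOTA_CACHE_TTL_MS = 60_000; // re-check usage at most once a minute
const quotaCache = new Map<string, CacheEntry>();

async function cachedRequestCount(
  userId: string,
  fetchDailyRequestCount: (userId: string) => Promise<number>
): Promise<number> {
  const cached = quotaCache.get(userId);
  if (cached && Date.now() - cached.fetchedAt < QUOTA_CACHE_TTL_MS) {
    return cached.count; // fresh enough, skip the API round trip
  }
  const count = await fetchDailyRequestCount(userId);
  quotaCache.set(userId, { count, fetchedAt: Date.now() });
  return count;
}
```

The trade-off: a user can overshoot their quota by up to one TTL window of requests, which is usually acceptable for abuse prevention.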

6. Enable Caching

Reduce costs and latency for repetitive queries:
// Enable caching for production requests
const cacheableFeatures = ["faq", "help-docs", "translations"];

const headers = cacheableFeatures.includes(feature)
  ? {
      "Helicone-Cache-Enabled": "true",
      "Cache-Control": "max-age=3600", // 1 hour
    }
  : {};

await client.chat.completions.create(
  { /* ... */ },
  { headers: { ...baseHeaders, ...headers } }
);
Monitor Cache Performance:
  • Go to Dashboard → Cache Analytics
  • Track hit rate, savings, and performance
  • Adjust cache TTL based on update frequency

Monitoring Dashboard

Key metrics to watch daily:

Overview Metrics

Today's Snapshot:
├── Total Requests: 42,387
├── Error Rate: 0.8% ✅
├── Avg Latency: 1,247ms ✅
├── Total Cost: $127.45
└── Cache Hit Rate: 34%

Alerts:
✅ No active alerts

Feature Breakdown

Top Features by Volume:
1. chat: 28,432 requests (67%)
2. document-qa: 8,234 requests (19%)
3. search: 5,721 requests (14%)

Top Features by Cost:
1. document-qa: $78.23 (61%)
2. chat: $38.12 (30%)
3. search: $11.10 (9%)

User Insights

Active Users: 1,247

Top Users by Usage:
1. user-789 (enterprise): 523 requests, $12.34
2. user-456 (pro): 412 requests, $8.91
3. user-123 (pro): 387 requests, $7.45

Potential Abuse:
⚠️ user-999 (free): 95 requests (near daily limit)

Incident Response

When an alert fires:
1. Assess Severity

  • Error rate alert = High severity (affects all users)
  • Cost alert = Medium severity (financial impact)
  • Latency alert = Medium severity (poor UX)
  • Feature-specific = Varies by feature criticality
2. Investigate in Helicone

  1. Click alert notification link
  2. Review affected requests
  3. Look for patterns:
    • Specific users affected?
    • Single feature or widespread?
    • Started at specific time?
3. Take Action

For errors:
  • Check provider status pages
  • Review recent deployments
  • Implement fallback/retry logic
For cost spikes:
  • Identify top users/features
  • Implement temporary rate limits
  • Investigate for abuse
For latency:
  • Check model selection
  • Review prompt sizes
  • Consider model switching
4. Document & Follow Up

  • Add incident notes in monitoring system
  • Update runbooks
  • Create tickets for permanent fixes
  • Schedule post-mortem if needed
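The "implement fallback/retry logic" action above can be sketched as a generic wrapper: retry transient failures (429s and 5xxs) with exponential backoff, then fall back if a fallback is provided. Status codes and the backoff schedule here are illustrative assumptions, not prescribed values:

```typescript
// Retry transient errors with exponential backoff; on final failure,
// use the fallback (e.g. a call to a cheaper model) if one is given.
async function callWithRetry<T>(
  fn: () => Promise<T>,
  fallback?: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      const retryable = error.status === 429 || error.status >= 500;
      if (!retryable || attempt === maxRetries) {
        if (fallback) return fallback();
        throw error;
      }
      // exponential backoff: 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}
```

Wrap `monitoredLLMCall` in this during an incident to keep the feature degraded rather than down while you investigate.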

Best Practices

  • Start with broad alerts: Begin with high thresholds and tighten based on actual patterns
  • Use separate environments: Different alert thresholds for dev/staging/production
  • Monitor user behavior: Track per-user costs and usage patterns
  • Regular reviews: Weekly review of dashboards to identify trends
  • Avoid alert fatigue: Set minimum request thresholds to prevent false positives during low traffic

Advanced: Custom Dashboards

Build custom monitoring using Helicone API:
// Fetch daily production metrics
async function getDailyMetrics() {
  const response = await fetch(
    "https://api.helicone.ai/v1/request/query-clickhouse",
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${HELICONE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        filter: {
          request_response_rmt: {
            request_created_at: {
              gte: new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString(),
            },
            properties: {
              Environment: { equals: "production" },
            },
          },
        },
      }),
    }
  );
  
  const data = await response.json();
  
  // Calculate metrics
  const totalRequests = data.data.length;
  const errors = data.data.filter(r => r.status >= 400).length;
  const errorRate = (errors / totalRequests) * 100;
  const totalCost = data.data.reduce((sum, r) => sum + (r.cost_usd || 0), 0);
  const avgLatency = data.data.reduce((sum, r) => sum + r.latency, 0) / totalRequests;
  
  return {
    totalRequests,
    errorRate,
    totalCost,
    avgLatency,
    timestamp: new Date().toISOString(),
  };
}

// Send to your monitoring system (e.g., DataDog, Grafana)
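`getDailyMetrics` reports average latency, but the latency alert in step 2 uses P95. If you want the dashboard metric to match the alert, a nearest-rank percentile over the same latency values is enough (a sketch, not a Helicone API):

```typescript
// Nearest-rank percentile: the value at rank ceil(p/100 * n),
// so percentile(latencies, 95) matches the P95 in the latency alert.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((p / 100) * sorted.length) - 1
  );
  return sorted[idx];
}
```

In `getDailyMetrics`, this would be `percentile(data.data.map(r => r.latency), 95)` alongside the existing average.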

Monitoring Checklist

  • All production requests instrumented with monitoring headers
  • Error rate alert configured (less than 5%)
  • Cost alert configured (appropriate threshold)
  • Feature-specific alerts for critical features
  • User tracking enabled (Helicone-User-Id)
  • Session tracking for multi-step workflows
  • Caching enabled for repetitive queries
  • Rate limiting implemented
  • Daily dashboard review scheduled
  • Incident response playbook documented

Next Steps

Alerts Documentation

Deep dive into alert configuration options

User Metrics

Track and analyze per-user behavior

Debugging Guide

Learn how to investigate production issues

Cost Optimization

Reduce production costs
