LLM Observability

Observability for Large Language Model applications requires specialized tooling to track token usage, latency, multi-step reasoning chains, and costs. This guide covers how to instrument your LLM applications with modern observability platforms.

Why LLM Observability Matters

LLM applications present unique monitoring challenges:
  • Variable costs: Each request has different token usage and costs
  • Long latency: Responses can take seconds to minutes
  • Complex workflows: Multi-step reasoning, tool use, and retrieval
  • Non-determinism: Same input can produce different outputs
  • Quality assessment: Correctness is harder to evaluate automatically
Without proper observability, you can’t debug failures, optimize costs, or improve quality in production LLM applications.

OpenTelemetry Basics

OpenTelemetry provides a vendor-neutral standard for collecting telemetry data:
  • Traces: Request flows through distributed systems
  • Metrics: Numerical measurements over time
  • Logs: Discrete events with context

Key Concepts

A trace represents a request’s journey through your system. It’s composed of spans, where each span represents a unit of work (e.g., an API call, database query, or LLM inference). Spans can be nested to show parent-child relationships, making it easy to see which operations take the most time.
OpenTelemetry automatically propagates context across service boundaries, so you can trace a request through multiple microservices and see the complete picture.
You can attach attributes (key-value pairs) to spans to add context, like model name, prompt length, or token count. Events mark specific points in time within a span.
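To make spans, nesting, and attributes concrete, here is a toy, SDK-free sketch of the idea (the names and data shapes are illustrative, not the OpenTelemetry API):

```python
import contextvars
import time
from contextlib import contextmanager

# Toy tracer for illustration only -- not the OpenTelemetry API.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # spans are recorded as they complete (children first)

@contextmanager
def span(name, **attributes):
    parent = _current_span.get()
    record = {
        "name": name,
        "parent": parent["name"] if parent else None,
        "attributes": attributes,
        "start": time.perf_counter(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration"] = time.perf_counter() - record["start"]
        _current_span.reset(token)
        finished_spans.append(record)

with span("handle_request"):
    with span("llm_call", model="gpt-4o-mini", prompt_tokens=12):
        time.sleep(0.01)
```

After this runs, `llm_call` records `handle_request` as its parent, which is exactly the parent-child relationship a real tracer uses to build the trace tree.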

Observability Platforms

This module demonstrates three observability platforms for LLM applications:

1. AgentOps

AgentOps specializes in monitoring AI agents:
import agentops
from openai import OpenAI

# Initialize AgentOps
agentops.init()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

# End session
agentops.end_all_sessions()
Features:
  • Automatic tracking of OpenAI calls
  • Agent workflow visualization
  • Cost tracking per session
  • Simple integration requiring only a few lines of setup

2. LangSmith

LangSmith from LangChain provides detailed tracing:
from langsmith.wrappers import wrap_openai
import openai
import os

# Set environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."

# Wrap OpenAI client
client = wrap_openai(openai.Client())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
Features:
  • Detailed trace trees for complex chains
  • Prompt and response logging
  • Evaluation and testing tools
  • Dataset management

3. OpenLLMetry

OpenLLMetry (Traceloop) uses OpenTelemetry standards:
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow
import openai
import os

# Configure OpenTelemetry endpoint
os.environ["TRACELOOP_BASE_URL"] = "http://localhost:4318"

# Initialize Traceloop
Traceloop.init(app_name="my-llm-app")

client = openai.Client()

@workflow(name="chat_completion")
def get_completion(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

result = get_completion("Tell me a joke")
Features:
  • OpenTelemetry compatible (works with any backend)
  • Automatic instrumentation for popular frameworks
  • Custom workflow decorators
  • Self-hosted option

Example Applications

The module includes two reference applications:

Text-to-SQL Application

Location: llm-apps/sql_app.py
This application demonstrates observability across three platforms:
import json

import agentops
from openai import OpenAI
from langsmith.wrappers import wrap_openai
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

def get_sql(query: str, context: str, client: OpenAI) -> str:
    """Generate SQL from natural language query."""
    prompt = f"""
    Write the corresponding SQL query based on user requests and database context:
    
    User requests: {query}
    Database context: {context}
    
    Please return in JSON format: {{"sql": ""}}
    """
    
    chat_completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-4o-mini-2024-07-18",
        response_format={"type": "json_object"},
    )
    
    return json.loads(chat_completion.choices[0].message.content)["sql"]

# Example inputs
sql_prompt = "Show the five most recent orders"
sql_context = "Table orders(id, customer_id, created_at, total)"

# Test with AgentOps
agentops.init()
client_agentops = OpenAI()
result = get_sql(query=sql_prompt, context=sql_context, client=client_agentops)
agentops.end_all_sessions()

# Test with LangSmith
client_lang_smith = wrap_openai(OpenAI())
result = get_sql(query=sql_prompt, context=sql_context, client=client_lang_smith)

# Test with OpenLLMetry
Traceloop.init(app_name="text2sql")
client_traceloop = OpenAI()
get_sql_traceloop = workflow(name="get_sql")(get_sql)
result = get_sql_traceloop(query=sql_prompt, context=sql_context, client=client_traceloop)

AI Scientist Paper Reviewer

Location: llm-apps/reviewer.py
This demonstrates observability for a complex multi-step reasoning application:
from ai_scientist.perform_review import load_paper, perform_review
import openai
import agentops
from langsmith.wrappers import wrap_openai
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

def review_paper(paper_pdf_path: str, client: openai.OpenAI) -> str:
    """Review a research paper using AI Scientist."""
    model = "gpt-4o-mini-2024-07-18"
    paper_txt = load_paper(paper_pdf_path)
    
    review = perform_review(
        paper_txt,
        model,
        client,
        num_reflections=5,
        num_fs_examples=1,
        num_reviews_ensemble=5,
        temperature=0.1,
    )
    
    return f"{review['Overall']}\n{review['Decision']}\n{review['Weaknesses']}"
This multi-step workflow includes:
  • Paper parsing
  • Multiple review iterations
  • Ensemble evaluation
  • Reflection and refinement
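The ensemble step combines several independent reviews into one result. A hedged sketch of that idea, averaging the numeric score and majority-voting the decision (the field names mirror the review dict above, but the aggregation logic here is illustrative, not AI Scientist’s actual implementation):

```python
from collections import Counter
from statistics import mean

def aggregate_reviews(reviews: list[dict]) -> dict:
    """Combine an ensemble of reviews: average the overall score,
    majority-vote the accept/reject decision."""
    decisions = Counter(r["Decision"] for r in reviews)
    return {
        "Overall": mean(r["Overall"] for r in reviews),
        "Decision": decisions.most_common(1)[0][0],
    }

reviews = [
    {"Overall": 6, "Decision": "Accept"},
    {"Overall": 5, "Decision": "Reject"},
    {"Overall": 7, "Decision": "Accept"},
]
combined = aggregate_reviews(reviews)
```

Each of the ensemble’s LLM calls shows up as its own span in the trace, which is what makes this kind of workflow worth instrumenting.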

Environment Setup

Required Environment Variables

# OpenAI API
export OPENAI_API_KEY="sk-proj-****"

# OpenTelemetry endpoint (for OpenLLMetry)
export TRACELOOP_BASE_URL="http://localhost:4318"

# LangSmith (optional)
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_****"

# Python path for AI Scientist
export PYTHONPATH="llm-apps/AI-Scientist/"

Running the Examples

python llm-apps/sql_app.py

Custom Instrumentation

Add custom spans to track specific operations:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_data(data):
    with tracer.start_as_current_span("data_processing") as span:
        # Add attributes
        span.set_attribute("data.size", len(data))
        span.set_attribute("data.type", type(data).__name__)
        
        # Add an event
        span.add_event("Processing started")
        
        # Your processing logic
        result = transform(data)
        
        span.add_event("Processing completed")
        return result

Workflow Decorators

Use decorators to automatically track functions:
from traceloop.sdk.decorators import workflow, task

@workflow(name="rag_pipeline")
def retrieve_and_generate(query: str) -> str:
    """Complete RAG pipeline."""
    docs = retrieve_documents(query)
    context = format_context(docs)
    return generate_response(query, context)

@task(name="retrieve")
def retrieve_documents(query: str):
    # Automatically tracked as a span
    return vector_db.search(query, top_k=5)

@task(name="generate")
def generate_response(query: str, context: str):
    # Also tracked automatically
    return llm.generate(query=query, context=context)

Comparing Platforms

| Feature               | AgentOps | LangSmith  | OpenLLMetry |
|-----------------------|----------|------------|-------------|
| Auto-instrumentation  | ✅       | ✅         | ✅          |
| Cost tracking         | ✅       | ✅         | ✅          |
| Custom backends       | ❌       | ❌         | ✅          |
| Evaluation tools      | ⚠️ Basic | ✅ Advanced | ❌          |
| Self-hosted           | ❌       | ⚠️ Paid    | ✅          |
| OpenTelemetry         | ❌       | ❌         | ✅          |
| LangChain integration | ⚠️       | ✅ Native  | ✅          |

Best Practices

Sample Strategically

Don’t trace every request in high-volume production. Use sampling to reduce overhead while maintaining visibility.
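A common approach is ratio sampling keyed on the trace ID, so every service in a distributed trace makes the same keep/drop decision. A minimal sketch of the decision logic (OpenTelemetry SDKs provide this built in as `TraceIdRatioBased`; this standalone version is just to show the mechanism):

```python
import random

def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head sampling: the keep/drop decision is a pure
    function of the trace ID, so all spans of a trace agree."""
    # Compare the low 64 bits of the trace ID against a ratio threshold.
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound

# Roughly 10% of random 128-bit trace IDs are kept.
rng = random.Random(0)
kept = sum(should_sample(rng.getrandbits(128), 0.10) for _ in range(100_000))
```

Because the decision is deterministic, re-running a sampled request reproduces the same sampling outcome, which simplifies debugging.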

Add Context

Include user IDs, session IDs, model versions, and other metadata as span attributes for easier debugging.

Monitor Costs

Track token usage and costs per request, user, or feature to optimize spending.
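Per-request cost can be derived directly from the `usage` field on each response. A sketch with illustrative per-million-token prices (verify current pricing with your provider before relying on these numbers):

```python
# Illustrative prices per 1M tokens -- not authoritative, check your provider.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the dollar cost of one request from its token usage."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

cost = request_cost("gpt-4o-mini", prompt_tokens=1_000, completion_tokens=500)
```

Recording this value as a span attribute lets you aggregate cost per user, per feature, or per session in your observability backend.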

Set Alerts

Configure alerts for high latency, error rates, or cost spikes to catch issues early.
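Before wiring up a full monitoring backend, a rolling check over recent requests is enough to prototype the alert logic (the thresholds and record shape here are placeholders):

```python
def check_alerts(requests: list[dict], max_p95_latency_s: float = 10.0,
                 max_error_rate: float = 0.05) -> list[str]:
    """Return alert messages for p95-latency and error-rate breaches."""
    alerts = []
    latencies = sorted(r["latency_s"] for r in requests)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > max_p95_latency_s:
        alerts.append(f"p95 latency {p95:.1f}s exceeds {max_p95_latency_s}s")
    error_rate = sum(r["error"] for r in requests) / len(requests)
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.0%} exceeds {max_error_rate:.0%}")
    return alerts

# A window where 10% of requests are slow failures trips both alerts.
window = [{"latency_s": 2.0, "error": False}] * 90 + \
         [{"latency_s": 30.0, "error": True}] * 10
alerts = check_alerts(window)
```

In production, the same thresholds would be configured in your observability platform rather than evaluated in application code.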

Troubleshooting

No traces appearing:
  • Check that TRACELOOP_BASE_URL is set correctly
  • Verify the OpenTelemetry collector is running
  • Ensure your application has network access to the collector
  • Check for initialization errors in application logs

Missing spans:
  • Verify the SDK version is compatible
  • Check that instrumentation is initialized before creating the client
  • Ensure exceptions aren’t silently caught

High overhead:
  • Enable sampling to reduce trace volume
  • Disable detailed logging in production
  • Use batch exporters instead of synchronous ones

Next Steps

Install SigNoz

Set up SigNoz as your observability backend for OpenTelemetry traces
