Why Tracing Matters
AI agents are complex systems that make multiple LLM calls, use various tools, and maintain conversation state. When something goes wrong—or right—you need to understand exactly what happened. Tracing gives you:
Complete Visibility
See every step: LLM calls, tool invocations, inputs, outputs, and timing
Debugging Context
Understand why an agent made a particular decision or used a specific tool
Performance Metrics
Measure latency, token usage, and cost per interaction
Pattern Recognition
Identify common failure modes or inefficient behavior patterns
Setting Up Tracing
The OfficeFlow agent uses LangSmith for tracing. The setup involves three steps:
1. Wrap the OpenAI Client
Wrapping the client means every call to client.chat.completions.create() and client.embeddings.create() is traced automatically.
2. Decorate Tools
3. Trace the Main Chat Function
Environment Configuration
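A typical configuration looks like the following (variable names per current LangSmith docs; older SDKs use `LANGCHAIN_`-prefixed equivalents, and the project name here is illustrative):

```shell
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export LANGSMITH_PROJECT="officeflow-agent"   # optional: group traces by project
```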
Set these environment variables to enable LangSmith.
Anatomy of a Trace
When you run the agent and ask a question, LangSmith creates a hierarchical trace:
Key Components
Parent Trace (Emma)
- Represents the entire interaction
- Contains metadata like thread_id for grouping conversations
- Shows total latency and cost for the full response
LLM Calls (ChatOpenAI)
- Includes full prompt (system + history + user message)
- Shows model parameters (temperature, model name, etc.)
- Displays token counts (prompt tokens, completion tokens)
- Records latency and cost
- Contains the raw response including tool calls
Tool Calls
- Shows which tool was called and why
- Displays input arguments (e.g., the SQL query)
- Records the output returned to the LLM
- Captures errors if the tool failed
- Measures tool execution time
Common Debugging Scenarios
Scenario 1: Wrong Tool Called
Problem: Agent uses search_knowledge_base when it should use query_database.
How to diagnose:
- Open the trace in LangSmith
- Look at the first LLM call’s output
- Check the tool_calls array to see which tool was selected
- Examine the LLM’s reasoning by looking at any content before the tool calls
- Review your tool descriptions - are they clear and distinct?
Common causes:
- Tool descriptions are too similar
- System prompt doesn’t clearly delineate when to use each tool
- User query is ambiguous
Scenario 2: Tool Returns Error
Problem: Tool call fails with an error like “no such column: product_name”.
How to diagnose:
Check the Output
See the error message. Does it indicate a schema mismatch, permission issue, or syntax error?
Common causes:
- Agent didn’t discover schema (missing from tool description)
- Agent hallucinated column names
- Database connection issues
Scenario 3: Poor Response Quality
Problem: Agent’s final response is too verbose, inaccurate, or unhelpful.
How to diagnose:
- Open the final LLM call in the trace
- Examine the full prompt passed to the LLM:
- System prompt
- Conversation history
- Tool results
- User question
- Check if tool results contained the right information
- Look for context window issues (truncation, too much irrelevant data)
- Review the completion to see if the LLM properly synthesized the tool results
Common causes:
- System prompt doesn’t include relevant guidelines
- Tool returned too much or too little information
- No examples of good responses in the prompt
- Agent is using stale conversation history
Fixes:
- Add specific instructions to the system prompt (e.g., a conciseness directive)
- Improve tool output formatting
- Add few-shot examples
Scenario 4: Excessive Tool Use
Problem: Agent makes 5+ tool calls for a simple question.
How to diagnose:
- Count tool call nodes in the trace
- Check if tool calls are redundant (same query multiple times)
- Look for schema discovery happening multiple times
- See if agent is exploring different tables unnecessarily
Common causes:
- Agent doesn’t remember it already discovered the schema
- Tool description encourages exploration
- Agent is uncertain and tries multiple approaches
Fixes:
- Cache schema information in conversation history
- Provide schema upfront in system prompt for small databases
- Add instruction to minimize tool calls
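For a small database, the “schema upfront” fix can be as simple as interpolating the schema into the system prompt. A sketch with illustrative table and column names:

```python
# Illustrative schema for a small database; a real agent might dump
# this from the database at startup instead of hardcoding it.
SCHEMA = """\
orders(order_id, customer_id, order_date, status)
products(product_id, name, price)"""

SYSTEM_PROMPT = (
    "You are Emma, the OfficeFlow assistant.\n"
    "Database schema:\n"
    f"{SCHEMA}\n"
    "Use query_database for data questions and avoid redundant tool calls."
)
```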
Scenario 5: Ignoring Tool Results
Problem: Agent calls a tool but doesn’t use the results in its response.
How to diagnose:
- Find the tool call in the trace
- Note what data was returned
- Look at the final LLM call
- Check if the tool result is in the prompt but not referenced in the completion
- See if there’s a prompt engineering issue causing the LLM to ignore tool results
Common causes:
- Tool result format is hard to parse (e.g., deeply nested JSON)
- Tool returned error but system prompt doesn’t handle errors well
- System prompt doesn’t emphasize using tool results
Fixes:
- Format tool results more clearly (e.g., use markdown tables)
- Add explicit instruction: “Base your answer on the tool results”
- Handle tool errors gracefully in the tool function
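As one sketch of the formatting fix, query rows can be rendered as a markdown table before being returned to the LLM (the helper name is hypothetical):

```python
def format_rows(columns, rows):
    """Render query results as a markdown table the LLM can read at a glance."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])
```

For example, `format_rows(["name", "price"], [("Desk", 199)])` yields a header row, a divider, and one data row.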
Analyzing Patterns Across Traces
Filters and Search
LangSmith allows you to:
- Filter by Metadata
- Filter by Status
- Search by Input/Output
Metrics and Analytics
LangSmith provides aggregate metrics:
Latency Distribution
- P50, P95, P99 latency
- Identify slow traces
- Compare versions
Cost Analysis
- Total tokens used
- Cost per trace
- Cost breakdown by model
Error Rate
- Percentage of failed traces
- Common error types
- Trends over time
Tool Usage
- How often each tool is called
- Average calls per trace
- Tool success rate
Comparing Agent Versions
When you improve your agent (e.g., v4 → v5), use traces to measure impact:
1. Create Separate Projects
2. Run the Same Test Cases
Create a test set and run it against both versions:
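A minimal sketch of the replay loop, assuming a `run_agent` entry point and per-version LangSmith projects (both names are hypothetical):

```python
import os

def run_agent(question: str, version: str) -> str:
    # Stand-in for the real agent entry point.
    return f"[{version}] {question}"

TEST_QUESTIONS = [
    "How many orders shipped last week?",
    "What is our refund policy?",
]

results = {}
for version in ("v4", "v5"):
    # Route each version's traces to its own LangSmith project.
    os.environ["LANGSMITH_PROJECT"] = f"officeflow-{version}"
    results[version] = [run_agent(q, version=version) for q in TEST_QUESTIONS]
```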
3. Compare Metrics
| Metric | v4 | v5 | Change |
|---|---|---|---|
| Avg latency | 2.3s | 1.9s | -17% |
| Avg tokens (completion) | 156 | 98 | -37% |
| Avg cost per trace | $0.0023 | $0.0015 | -35% |
| Tool calls per trace | 2.1 | 2.1 | 0% |
| Error rate | 2.3% | 2.1% | -0.2pp |
4. Qualitative Analysis
Beyond metrics, manually review traces:
Checklist for Trace Review
- Do responses sound natural and helpful?
- Is tool usage logical and efficient?
- Are errors handled gracefully?
- Does the agent follow all instructions in the system prompt?
- Are there any unexpected behaviors?
- Does the agent maintain context across turns?
- Are there any prompt injection vulnerabilities?
Debugging Workflow
When investigating an issue:
Identify the Problem Step
- If wrong tool: Look at first LLM call’s tool selection
- If tool error: Find the failing tool node
- If bad output: Examine final LLM call
Inspect Inputs/Outputs
Click on the problematic node and review:
- Input: What data did this step receive?
- Output: What did it produce?
- Metadata: Timing, model parameters, etc.
Form a Hypothesis
Based on the trace, what caused the issue?
- Unclear instructions?
- Missing context?
- Tool implementation bug?
- Model limitation?
Implement a Fix
- Update system prompt
- Improve tool description
- Fix tool implementation
- Add error handling
Advanced: Custom Metadata
Add custom metadata to traces for richer analysis:
- metadata.version = "v5"
- metadata.environment = "production"
- metadata.customer_id = "CUST-12345"
Best Practices
Trace Everything
Wrap all LLM calls, tools, and major functions. More visibility is better.
Use Descriptive Names
Name traces and tools clearly so you can understand traces at a glance.
Add Rich Metadata
Include context like user IDs, versions, and environments for better filtering.
Review Regularly
Don’t just check traces when things break. Review periodically to find improvements.
Share Traces
LangSmith traces are shareable URLs. Use them in bug reports and code reviews.
Combine with Evals
Traces show what happened. Evaluations measure if it was good.
Tracing in Production
Example sampling implementation:
Next Steps
Run the Agent
Get hands-on experience with the OfficeFlow agent and generate your own traces
Build Evaluations
Use traces to create datasets and build automated evaluations