Why Tracing Matters
AI agents are complex systems that make multiple LLM calls, use various tools, and maintain conversation state. When something goes wrong—or right—you need to understand exactly what happened. Tracing gives you:
Complete Visibility
See every step: LLM calls, tool invocations, inputs, outputs, and timing
Debugging Context
Understand why an agent made a particular decision or used a specific tool
Performance Metrics
Measure latency, token usage, and cost per interaction
Pattern Recognition
Identify common failure modes or inefficient behavior patterns
Setting Up Tracing
The OfficeFlow agent uses LangSmith for tracing. The setup involves three steps:
1. Wrap the OpenAI Client
Wrapping the client means every call to client.chat.completions.create() and client.embeddings.create() is traced automatically.
2. Decorate Tools
3. Trace the Main Chat Function
Environment Configuration
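A typical configuration looks like the following (variable names per current LangSmith docs; older SDKs use `LANGCHAIN_`-prefixed equivalents, and the project name here is illustrative):

```shell
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export LANGSMITH_PROJECT="officeflow-agent"   # optional: group traces by project
```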
Set these environment variables to enable LangSmith.
Anatomy of a Trace
When you run the agent and ask a question, LangSmith creates a hierarchical trace:
Key Components
Parent Trace (Emma)
- Represents the entire interaction
- Contains metadata like thread_id for grouping conversations
- Shows total latency and cost for the full response
LLM Calls (ChatOpenAI)
- Includes full prompt (system + history + user message)
- Shows model parameters (temperature, model name, etc.)
- Displays token counts (prompt tokens, completion tokens)
- Records latency and cost
- Contains the raw response including tool calls
Tool Calls
- Shows which tool was called and why
- Displays input arguments (e.g., the SQL query)
- Records the output returned to the LLM
- Captures errors if the tool failed
- Measures tool execution time
Common Debugging Scenarios
Scenario 1: Wrong Tool Called
Problem: Agent uses search_knowledge_base when it should use query_database.
How to diagnose:
- Open the trace in LangSmith
- Look at the first LLM call’s output
- Check the tool_calls array to see which tool was selected
- Examine the LLM’s reasoning by looking at any content before the tool calls
- Review your tool descriptions - are they clear and distinct?
Common causes:
- Tool descriptions are too similar
- System prompt doesn’t clearly delineate when to use each tool
- User query is ambiguous
Scenario 2: Tool Returns Error
Problem: Tool call fails with an error like “no such column: product_name”.
How to diagnose:
Check the Output
See the error message. Does it indicate a schema mismatch, permission issue, or syntax error?
Common causes:
- Agent didn’t discover schema (missing from tool description)
- Agent hallucinated column names
- Database connection issues
Scenario 3: Poor Response Quality
Problem: Agent’s final response is too verbose, inaccurate, or unhelpful.
How to diagnose:
- Open the final LLM call in the trace
- Examine the full prompt passed to the LLM:
- System prompt
- Conversation history
- Tool results
- User question
- Check if tool results contained the right information
- Look for context window issues (truncation, too much irrelevant data)
- Review the completion to see if the LLM properly synthesized the tool results
Common causes:
- System prompt doesn’t include relevant guidelines
- Tool returned too much or too little information
- No examples of good responses in the prompt
- Agent is using stale conversation history
Fixes:
- Add specific instructions to the system prompt (e.g., a conciseness directive)
- Improve tool output formatting
- Add few-shot examples
Scenario 4: Excessive Tool Use
Problem: Agent makes 5+ tool calls for a simple question.
How to diagnose:
- Count tool call nodes in the trace
- Check if tool calls are redundant (same query multiple times)
- Look for schema discovery happening multiple times
- See if agent is exploring different tables unnecessarily
Common causes:
- Agent doesn’t remember it already discovered the schema
- Tool description encourages exploration
- Agent is uncertain and tries multiple approaches
Fixes:
- Cache schema information in conversation history
- Provide schema upfront in system prompt for small databases
- Add instruction to minimize tool calls
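For a small database, the “schema upfront” fix can be as simple as interpolating the schema into the system prompt. A sketch with illustrative table and column names:

```python
# Illustrative schema for a small database; a real agent might dump
# this from the database at startup instead of hardcoding it.
SCHEMA = """\
orders(order_id, customer_id, order_date, status)
products(product_id, name, price)"""

SYSTEM_PROMPT = (
    "You are Emma, the OfficeFlow assistant.\n"
    "Database schema:\n"
    f"{SCHEMA}\n"
    "Use query_database for data questions and avoid redundant tool calls."
)
```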
Scenario 5: Ignoring Tool Results
Problem: Agent calls a tool but doesn’t use the results in its response.
How to diagnose:
- Find the tool call in the trace
- Note what data was returned
- Look at the final LLM call
- Check if the tool result is in the prompt but not referenced in the completion
- See if there’s a prompt engineering issue causing the LLM to ignore tool results
Common causes:
- Tool result format is hard to parse (e.g., deeply nested JSON)
- Tool returned error but system prompt doesn’t handle errors well
- System prompt doesn’t emphasize using tool results
Fixes:
- Format tool results more clearly (e.g., use markdown tables)
- Add explicit instruction: “Base your answer on the tool results”
- Handle tool errors gracefully in the tool function
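As one sketch of the formatting fix, query rows can be rendered as a markdown table before being returned to the LLM (the helper name is hypothetical):

```python
def format_rows(columns, rows):
    """Render query results as a markdown table the LLM can read at a glance."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])
```

For example, `format_rows(["name", "price"], [("Desk", 199)])` yields a header row, a divider, and one data row.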
Analyzing Patterns Across Traces
Filters and Search
LangSmith allows you to:
- Filter by Metadata
- Filter by Status
- Search by Input/Output
Metrics and Analytics
LangSmith provides aggregate metrics:
Latency Distribution
- P50, P95, P99 latency
- Identify slow traces
- Compare versions
Cost Analysis
- Total tokens used
- Cost per trace
- Cost breakdown by model
Error Rate
- Percentage of failed traces
- Common error types
- Trends over time
Tool Usage
- How often each tool is called
- Average calls per trace
- Tool success rate
Comparing Agent Versions
When you improve your agent (e.g., v4 → v5), use traces to measure impact:
1. Create Separate Projects
2. Run the Same Test Cases
Create a test set and run it against both versions:
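A minimal sketch of the replay loop, assuming a `run_agent` entry point and per-version LangSmith projects (both names are hypothetical):

```python
import os

def run_agent(question: str, version: str) -> str:
    # Stand-in for the real agent entry point.
    return f"[{version}] {question}"

TEST_QUESTIONS = [
    "How many orders shipped last week?",
    "What is our refund policy?",
]

results = {}
for version in ("v4", "v5"):
    # Route each version's traces to its own LangSmith project.
    os.environ["LANGSMITH_PROJECT"] = f"officeflow-{version}"
    results[version] = [run_agent(q, version=version) for q in TEST_QUESTIONS]
```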
3. Compare Metrics
| Metric | v4 | v5 | Change |
|---|---|---|---|
| Avg latency | 2.3s | 1.9s | -17% |
| Avg tokens (completion) | 156 | 98 | -37% |
| Avg cost per trace | $0.0023 | $0.0015 | -35% |
| Tool calls per trace | 2.1 | 2.1 | 0% |
| Error rate | 2.3% | 2.1% | -0.2pp |
4. Qualitative Analysis
Beyond metrics, manually review traces:
Checklist for Trace Review
- Do responses sound natural and helpful?
- Is tool usage logical and efficient?
- Are errors handled gracefully?
- Does the agent follow all instructions in the system prompt?
- Are there any unexpected behaviors?
- Does the agent maintain context across turns?
- Are there any prompt injection vulnerabilities?
Debugging Workflow
When investigating an issue:
Identify the Problem Step
- If wrong tool: Look at first LLM call’s tool selection
- If tool error: Find the failing tool node
- If bad output: Examine final LLM call
Inspect Inputs/Outputs
Click on the problematic node and review:
- Input: What data did this step receive?
- Output: What did it produce?
- Metadata: Timing, model parameters, etc.
Form a Hypothesis
Based on the trace, what caused the issue?
- Unclear instructions?
- Missing context?
- Tool implementation bug?
- Model limitation?
Implement a Fix
- Update system prompt
- Improve tool description
- Fix tool implementation
- Add error handling
Advanced: Custom Metadata
Add custom metadata to traces for richer analysis:
- metadata.version = "v5"
- metadata.environment = "production"
- metadata.customer_id = "CUST-12345"
Best Practices
Trace Everything
Wrap all LLM calls, tools, and major functions. More visibility is better.
Use Descriptive Names
Name traces and tools clearly so you can understand traces at a glance.
Add Rich Metadata
Include context like user IDs, versions, and environments for better filtering.
Review Regularly
Don’t just check traces when things break. Review periodically to find improvements.
Share Traces
LangSmith traces are shareable URLs. Use them in bug reports and code reviews.
Combine with Evals
Traces show what happened. Evaluations measure if it was good.
Tracing in Production
Example sampling implementation:
Next Steps
Run the Agent
Get hands-on experience with the OfficeFlow agent and generate your own traces
Build Evaluations
Use traces to create datasets and build automated evaluations