Why Observability?
With 14+ LLM calls per analysis (3 debate rounds × 4 agents, plus synthesis and validation), you need visibility into:

- Latency bottlenecks: Which agent is slowest?
- Token usage: Where are costs coming from?
- Agent outputs: What did each agent actually say?
- Debate progression: How did the debate evolve across rounds?
- Failures: Where did the pipeline break?
LangSmith Features
- Distributed tracing: Full call graph of every agent interaction
- Token tracking: Per-agent and per-round token counts
- Latency breakdown: Identify slow components
- Prompt/response logging: Inspect exact inputs/outputs
- Error tracking: Automatic capture of LLM failures
- Annotations: Add custom metadata to traces
Setup
Create LangSmith Account
- Go to https://smith.langchain.com
- Sign up (free tier available)
- Create a new project (e.g., clinicalpilot)
Get API Key
- Navigate to Settings → API Keys
- Click Create API Key
- Copy the key (starts with lsv2_...)
View Traces
Open https://smith.langchain.com and navigate to your project. You should see a trace for the analysis.
Environment Variables
| Variable | Required | Description |
|---|---|---|
| LANGSMITH_API_KEY | Yes | Your LangSmith API key |
| LANGCHAIN_TRACING_V2 | Yes | Set to true to enable tracing |
| LANGCHAIN_PROJECT | No | Project name (defaults to clinicalpilot) |
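A typical shell setup using these variables might look like the following (the API key value is a placeholder):

```shell
# Enable LangSmith tracing for ClinicalPilot (key value is a placeholder)
export LANGSMITH_API_KEY="lsv2_your_key_here"   # from Settings → API Keys
export LANGCHAIN_TRACING_V2=true                # turn tracing on
export LANGCHAIN_PROJECT=clinicalpilot          # optional; this is the default
```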
Trace Structure
A typical full analysis produces a nested trace hierarchy: a root run for the analysis, with child spans for each debate round, each agent call within that round, and the final synthesis and validation steps.

Viewing Agent Outputs
In LangSmith, click on any agent span to see:

- Inputs: Patient context, previous debate rounds
- Outputs: Agent’s differential diagnoses, reasoning, citations
- Metadata: Model used, temperature, token count, latency
- Prompt: Exact system prompt + user message sent to LLM
This is invaluable for debugging hallucinations, understanding agent behavior, and iterating on system prompts.
Performance Analysis
Latency Breakdown
LangSmith automatically calculates:

- Total analysis time: End-to-end latency
- Per-agent time: Which agent is the bottleneck?
- LLM time vs API time: How much is network overhead?
| Component | Latency | % of Total |
|---|---|---|
| Full Analysis | 98.2s | 100% |
| Debate (3 rounds) | 85.4s | 87% |
| Clinical Agent (avg) | 11.2s | 11% |
| Literature Agent (avg) | 8.7s | 9% |
| Safety Agent (avg) | 9.1s | 9% |
| Critic Agent (avg) | 7.3s | 7% |
| Med Error Panel | 12.1s | 12% (parallel) |
| Synthesis | 8.9s | 9% |
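The "% of Total" column is simply each component's latency divided by the end-to-end time; a minimal sketch of that calculation, using a few numbers from the table above:

```python
# Compute each component's share of total analysis latency (seconds).
latencies = {
    "Debate (3 rounds)": 85.4,
    "Med Error Panel": 12.1,
    "Synthesis": 8.9,
}
total = 98.2  # end-to-end latency for the full analysis

# Share of total, rounded to whole percent, matching the table above.
shares = {name: round(100 * secs / total) for name, secs in latencies.items()}
```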
Token Usage
Track token consumption per agent. Use this data to optimize costs: for example, you might switch the Literature Agent to GPT-4o-mini (cheaper) or reduce debate rounds from 3 to 2.
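As a sketch of that kind of cost accounting (the token counts and the per-1K-token price below are illustrative assumptions, not measured values):

```python
# Estimate per-agent cost from traced token counts (all values assumed).
tokens = {
    "clinical": 18_000,     # total tokens per analysis, per agent
    "literature": 12_000,
    "safety": 10_000,
    "critic": 8_000,
}
price_per_1k = 0.01  # assumed blended $/1K tokens for the current model

cost = {agent: n / 1000 * price_per_1k for agent, n in tokens.items()}
total_cost = sum(cost.values())
# Agents with outsized cost are the candidates for a cheaper model.
```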
Custom Annotations
Add custom metadata to traces, such as a case ID or analysis mode, so runs can be filtered and grouped later.

Debugging Failures
Automatic Error Capture
LangSmith captures exceptions automatically, including:

- Exception type
- Stack trace
- Input that caused the failure
- Retry attempts (if any)
Filtering Traces
Find specific issues:

- By status: `status:error` (all failed traces)
- By agent: `metadata.agent_type:clinical` (Clinical Agent only)
- By latency: `latency_ms:>10000` (traces over 10 seconds)
- By token count: `total_tokens:>50000` (expensive calls)
Alternative: Langfuse
ClinicalPilot also supports Langfuse, an open-source alternative to LangSmith: update `backend/observability/tracing.py` to use the Langfuse SDK instead of LangSmith.
Production Considerations
ClinicalPilot anonymizes all inputs via Microsoft Presidio before tracing. Verify that:

- PHI scrubbing happens before agent calls
- Anonymized data is logged, not raw input
- Your LangSmith project has appropriate access controls
Self-Hosted Tracing
For HIPAA compliance, consider:

- Self-hosted Langfuse: Deploy on your infrastructure
- Local-only logging: Disable cloud tracing, use file-based logs
- VPC-only LangSmith: Enterprise plan with private deployment
Disabling Tracing
To turn off tracing, set `LANGCHAIN_TRACING_V2=false` or unset the `LANGSMITH_API_KEY` variable.
Check status programmatically:
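ClinicalPilot's actual helper isn't shown here, but a minimal stdlib check based on the environment variables above might look like:

```python
import os

def tracing_enabled() -> bool:
    """Tracing is active only when the API key is set and the flag is true."""
    return (
        os.getenv("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.getenv("LANGSMITH_API_KEY"))
    )
```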
Trace Retention
LangSmith free tier:

- 14 days trace retention
- 1M traces/month
Best Practices
- Use descriptive span names: `"Clinical Agent - Round 2"` instead of `"agent_call"`
- Add metadata: Include case ID, patient age range, diagnosis category
- Sample high-volume endpoints: Don't trace every health check; trace only `/api/analyze`
- Set up alerts: Monitor for high error rates or latency spikes
- Review traces regularly: Weekly review of failed/slow traces
Example Queries
- Find slow analyses
- Find cases that needed 3 debate rounds
- Find high-token-usage cases
- Find emergency mode activations
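Using the filter syntax shown above, these queries might look like the following (the `metadata.*` keys and thresholds are assumptions about how ClinicalPilot annotates its runs, not confirmed names):

```
latency_ms:>60000               # slow analyses (over 60s end-to-end)
metadata.debate_rounds:3        # cases that went the full 3 rounds
total_tokens:>100000            # high-token-usage cases
metadata.emergency_mode:true    # emergency mode activations
```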
Troubleshooting
"LANGSMITH_API_KEY not set" Warning

This warning means tracing is disabled because the API key is missing from the environment. Export LANGSMITH_API_KEY (see Setup above); analyses still run, they just aren't traced.
Traces Not Appearing
- Check the API key is valid: https://smith.langchain.com/settings
- Verify `LANGCHAIN_TRACING_V2=true`
- Check that a firewall/proxy isn't blocking the LangSmith API
- Look for errors in app logs: `grep -i langsmith /var/log/clinicalpilot.log`
"Rate limit exceeded" Errors

LangSmith free tier has rate limits. Upgrade to a paid plan or reduce trace volume.

Next Steps
- Testing: Write smoke tests to validate agent behavior
- Production Deployment: Deploy ClinicalPilot to production with observability