ClinicalPilot integrates with LangSmith for comprehensive observability across all agent interactions, debate rounds, and LLM calls.

Why Observability?

With 14+ LLM calls per analysis (3 debate rounds × 4 agents + synthesis + validation), you need visibility into:
  • Latency bottlenecks: Which agent is slowest?
  • Token usage: Where are costs coming from?
  • Agent outputs: What did each agent actually say?
  • Debate progression: How did the debate evolve across rounds?
  • Failures: Where did the pipeline break?

LangSmith Features

  • Distributed tracing: Full call graph of every agent interaction
  • Token tracking: Per-agent and per-round token counts
  • Latency breakdown: Identify slow components
  • Prompt/response logging: Inspect exact inputs/outputs
  • Error tracking: Automatic capture of LLM failures
  • Annotations: Add custom metadata to traces

Setup

Step 1: Create LangSmith Account

  1. Go to https://smith.langchain.com
  2. Sign up (free tier available)
  3. Create a new project (e.g., clinicalpilot)
Step 2: Get API Key

  1. Navigate to Settings → API Keys
  2. Click Create API Key
  3. Copy the key (starts with lsv2_...)
Step 3: Configure Environment

Update your .env file:
LANGSMITH_API_KEY=lsv2_pt_your_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=clinicalpilot
Step 4: Restart Application

python -m uvicorn backend.main:app --reload
Check logs for confirmation:
INFO: LangSmith tracing enabled for project: clinicalpilot
Step 5: Run a Test Analysis

Trigger an analysis to generate traces:
curl -X POST http://localhost:8000/api/analyze \
  -H "Content-Type: application/json" \
  -d '{"text": "45-year-old male with acute chest pain radiating to left arm. PMH: HTN, Type 2 DM. Medications: metformin, lisinopril."}'
Step 6: View Traces

Open https://smith.langchain.com and navigate to your project. You should see a trace for the analysis.

Environment Variables

| Variable | Required | Description |
|---|---|---|
| LANGSMITH_API_KEY | Yes | Your LangSmith API key |
| LANGCHAIN_TRACING_V2 | Yes | Set to true to enable tracing |
| LANGCHAIN_PROJECT | No | Project name (defaults to clinicalpilot) |

Trace Structure

A typical full analysis generates this trace hierarchy:
/api/analyze (root span)

├── Input Parsing (FHIR/EHR/Text)

├── PHI Anonymization (Presidio)

├── Debate Round 1
│   ├── Clinical Agent (GPT-4o)
│   ├── Literature Agent (GPT-4o-mini + PubMed)
│   ├── Safety Agent (GPT-4o + DrugBank/RxNorm)
│   └── Critic Agent (GPT-4o)

├── Debate Round 2
│   ├── Clinical Agent
│   ├── Literature Agent
│   ├── Safety Agent
│   └── Critic Agent

├── Debate Round 3 (if needed)
│   └── ...

├── Medical Error Prevention Panel (parallel)
│   ├── Drug-Drug Interactions
│   ├── Contraindication Checks
│   └── Dosing Alerts

├── Synthesis (GPT-4o)

└── Validation

Viewing Agent Outputs

In LangSmith, click on any agent span to see:
  • Inputs: Patient context, previous debate rounds
  • Outputs: Agent’s differential diagnoses, reasoning, citations
  • Metadata: Model used, temperature, token count, latency
  • Prompt: Exact system prompt + user message sent to LLM
This is invaluable for debugging hallucinations, understanding agent behavior, and iterating on system prompts.

Performance Analysis

Latency Breakdown

LangSmith automatically calculates:
  • Total analysis time: End-to-end latency
  • Per-agent time: Which agent is the bottleneck?
  • LLM time vs API time: How much is network overhead?
Example trace:
| Component | Latency | % of Total |
|---|---|---|
| Full Analysis | 98.2s | 100% |
| Debate (3 rounds) | 85.4s | 87% |
| Clinical Agent (avg) | 11.2s | 11% |
| Literature Agent (avg) | 8.7s | 9% |
| Safety Agent (avg) | 9.1s | 9% |
| Critic Agent (avg) | 7.3s | 7% |
| Med Error Panel | 12.1s | 12% (parallel) |
| Synthesis | 8.9s | 9% |
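Exported span timings like these reduce to a quick bottleneck report. A minimal sketch using the example durations above (note the Med Error Panel runs in parallel, so its share overlaps other spans rather than adding to wall-clock time):

```python
# Per-component latency in seconds, mirroring the example trace.
span_latencies = {
    "Clinical Agent (avg)": 11.2,
    "Literature Agent (avg)": 8.7,
    "Safety Agent (avg)": 9.1,
    "Critic Agent (avg)": 7.3,
    "Med Error Panel": 12.1,
    "Synthesis": 8.9,
}

total = 98.2  # full-analysis wall-clock time in seconds

# The largest single component is the first optimization target.
bottleneck = max(span_latencies, key=span_latencies.get)
share = span_latencies[bottleneck] / total
print(f"{bottleneck}: {share:.0%} of total")
```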

Token Usage

Track token consumption per agent:
Total tokens: 47,823
- Clinical Agent: 18,432 (38%)
- Literature Agent: 12,104 (25%)
- Safety Agent: 9,876 (21%)
- Critic Agent: 5,432 (11%)
- Synthesis: 1,979 (4%)
Use this data to optimize costs. For example, you might switch the Literature Agent to GPT-4o-mini (cheaper) or reduce debate rounds from 3 to 2.
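Token counts like these translate directly into cost. A sketch of a per-agent cost estimate (the per-1K-token prices are placeholders, not real pricing; substitute your provider's current rates):

```python
# Tokens per agent from the example trace above.
tokens = {
    "Clinical Agent": 18_432,
    "Literature Agent": 12_104,
    "Safety Agent": 9_876,
    "Critic Agent": 5_432,
    "Synthesis": 1_979,
}

# Illustrative blended prices per 1K tokens (assumption, not real rates).
price_per_1k = {
    "Clinical Agent": 0.01,      # GPT-4o-class model
    "Literature Agent": 0.0006,  # GPT-4o-mini-class model
    "Safety Agent": 0.01,
    "Critic Agent": 0.01,
    "Synthesis": 0.01,
}

costs = {agent: tokens[agent] / 1000 * price_per_1k[agent] for agent in tokens}
for agent, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{agent}: ${cost:.4f}")
print(f"Total: ${sum(costs.values()):.4f}")
```

Running a calculation like this before and after a change (e.g. dropping a debate round) puts a dollar figure on the optimization.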

Custom Annotations

Add metadata to traces:
from langsmith import traceable

@traceable(
    run_type="llm",
    name="Clinical Agent",
    metadata={"agent_type": "clinical", "round": 1}
)
async def call_clinical_agent(context: PatientContext):  # PatientContext: app-defined model
    # Agent logic here
    ...

Debugging Failures

Automatic Error Capture

LangSmith captures exceptions:
Error in Literature Agent (Round 2):
  PubMed API returned 429 (rate limit exceeded)
  Traceback: ...
Click the failed span to see:
  • Exception type
  • Stack trace
  • Input that caused the failure
  • Retry attempts (if any)
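Rate-limit failures like the PubMed 429 above are typically handled with retry plus exponential backoff, so the retry attempts also show up in the trace. A stdlib-only sketch (the flaky function it wraps is a stand-in for the real PubMed client):

```python
import time

class RateLimitError(Exception):
    """Raised when an upstream API returns HTTP 429."""

def with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry fn on RateLimitError, doubling the delay after each attempt."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except RateLimitError:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the error to the trace
                time.sleep(base_delay * 2 ** attempt)
    return wrapper
```

Wrapping the PubMed call in something like `with_backoff` keeps transient 429s out of the error view while still recording the retries on the span.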

Filtering Traces

Find specific issues:
  • By status: status:error — all failed traces
  • By agent: metadata.agent_type:clinical — Clinical Agent only
  • By latency: latency_ms:>10000 — traces over 10 seconds
  • By token count: total_tokens:>50000 — expensive calls

Alternative: Langfuse

ClinicalPilot also supports Langfuse (open-source alternative to LangSmith):
# .env
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or self-hosted
Modify backend/observability/tracing.py to use Langfuse SDK instead of LangSmith.

Production Considerations

Do not log PHI to external tracing services in production. Ensure patient data is anonymized before LLM calls.
ClinicalPilot anonymizes all inputs via Microsoft Presidio before tracing. Verify:
  1. PHI scrubbing happens before agent calls
  2. Anonymized data is logged, not raw input
  3. Your LangSmith project has appropriate access controls
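A cheap extra guard is to assert, in tests or at span-export time, that obvious identifiers never reach the tracing layer. A regex-based sketch (this is a sanity check only, not a replacement for Presidio's anonymization; the patterns are illustrative):

```python
import re

# Crude patterns for identifiers that must never appear in traced payloads.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-style
    re.compile(r"\b\d{10}\b"),                       # phone-style
    re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),  # medical record number
]

def looks_anonymized(payload: str) -> bool:
    """Return False if any obvious identifier pattern is present."""
    return not any(p.search(payload) for p in PHI_PATTERNS)

assert looks_anonymized("45-year-old male with chest pain")
assert not looks_anonymized("Patient MRN: 8675309, chest pain")
```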

Self-Hosted Tracing

For HIPAA compliance, consider:
  • Self-hosted Langfuse: Deploy on your infrastructure
  • Local-only logging: Disable cloud tracing, use file-based logs
  • VPC-only LangSmith: Enterprise plan with private deployment

Disabling Tracing

To turn off tracing:
# .env
LANGCHAIN_TRACING_V2=false
Or remove the LANGSMITH_API_KEY variable. Check status programmatically:
from backend.observability.tracing import get_tracing_status

status = get_tracing_status()
print(status)
# {"enabled": false, "project": "clinicalpilot", "provider": "langsmith"}
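If you are adapting this to your own codebase, a status helper like this can be little more than a read of the environment. A minimal sketch of what it might look like (the real `backend.observability.tracing` implementation may differ):

```python
import os

def get_tracing_status() -> dict:
    """Report tracing state derived from the LangSmith environment variables."""
    enabled = (
        os.environ.get("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.environ.get("LANGSMITH_API_KEY"))
    )
    return {
        "enabled": enabled,
        "project": os.environ.get("LANGCHAIN_PROJECT", "clinicalpilot"),
        "provider": "langsmith",
    }
```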

Trace Retention

LangSmith free tier:
  • 14 days trace retention
  • 1M traces/month
For longer retention, upgrade to paid plan or use self-hosted Langfuse.

Best Practices

  1. Use descriptive span names: "Clinical Agent - Round 2" instead of "agent_call"
  2. Add metadata: Include case ID, patient age range, diagnosis category
  3. Sample high-volume endpoints: Don’t trace every health check — only /api/analyze
  4. Set up alerts: Monitor for high error rates or latency spikes
  5. Review traces regularly: Weekly review of failed/slow traces
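Practice 3 (sampling) can be a cheap per-request decision made before any span is opened. A sketch using a deterministic hash so every span of a given case is traced together or not at all (the endpoint set and sample rate are illustrative):

```python
import zlib

TRACED_ENDPOINTS = {"/api/analyze"}  # never trace health checks etc.

def should_trace(path: str, case_id: str, sample_rate: float = 0.25) -> bool:
    """Trace only selected endpoints, sampled deterministically by case ID."""
    if path not in TRACED_ENDPOINTS:
        return False
    # Stable hash keeps all spans of one case in or out together.
    bucket = zlib.crc32(case_id.encode()) % 100
    return bucket < sample_rate * 100
```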

Example Queries

Find slow analyses

latency_ms:>120000 AND status:success

Find cases that needed 3 debate rounds

metadata.debate_rounds:3

Find high-token-usage cases

total_tokens:>60000

Find emergency mode activations

name:"Emergency Mode"

Troubleshooting

"LANGSMITH_API_KEY not set" Warning

# Verify .env is loaded
cat .env | grep LANGSMITH

# Restart the app
python -m uvicorn backend.main:app --reload

Traces Not Appearing

  1. Check API key is valid: https://smith.langchain.com/settings
  2. Verify LANGCHAIN_TRACING_V2=true
  3. Check firewall/proxy isn’t blocking LangSmith API
  4. Look for errors in app logs: grep -i langsmith /var/log/clinicalpilot.log

"Rate limit exceeded" Errors

LangSmith free tier has rate limits. Upgrade to paid plan or reduce trace volume.

Next Steps

Testing

Write smoke tests to validate agent behavior

Production Deployment

Deploy ClinicalPilot to production with observability
