Why Observability?
With 14+ LLM calls per analysis (3 debate rounds × 4 agents, plus synthesis and validation), you need visibility into:

- Latency bottlenecks: Which agent is slowest?
- Token usage: Where are costs coming from?
- Agent outputs: What did each agent actually say?
- Debate progression: How did the debate evolve across rounds?
- Failures: Where did the pipeline break?
LangSmith Features
- Distributed tracing: Full call graph of every agent interaction
- Token tracking: Per-agent and per-round token counts
- Latency breakdown: Identify slow components
- Prompt/response logging: Inspect exact inputs/outputs
- Error tracking: Automatic capture of LLM failures
- Annotations: Add custom metadata to traces
Setup
Create LangSmith Account
- Go to https://smith.langchain.com
- Sign up (free tier available)
- Create a new project (e.g., clinicalpilot)
Get API Key
- Navigate to Settings → API Keys
- Click Create API Key
- Copy the key (starts with lsv2_...)
View Traces
Open https://smith.langchain.com and navigate to your project. You should see a trace for the analysis.
Environment Variables
| Variable | Required | Description |
|---|---|---|
| LANGSMITH_API_KEY | Yes | Your LangSmith API key |
| LANGCHAIN_TRACING_V2 | Yes | Set to true to enable tracing |
| LANGCHAIN_PROJECT | No | Project name (defaults to clinicalpilot) |
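A typical shell setup using these variables might look like the following (the API key value is a placeholder):

```shell
# Enable LangSmith tracing for ClinicalPilot (key value is a placeholder)
export LANGSMITH_API_KEY="lsv2_your_key_here"   # from Settings → API Keys
export LANGCHAIN_TRACING_V2=true                # turn tracing on
export LANGCHAIN_PROJECT=clinicalpilot          # optional; this is the default
```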
Trace Structure
A typical full analysis produces a nested trace hierarchy: a root run for the analysis, with child spans for each debate round, each agent call within that round, and the final synthesis and validation steps.

Viewing Agent Outputs
In LangSmith, click on any agent span to see:

- Inputs: Patient context, previous debate rounds
- Outputs: Agent’s differential diagnoses, reasoning, citations
- Metadata: Model used, temperature, token count, latency
- Prompt: Exact system prompt + user message sent to LLM
This is invaluable for debugging hallucinations, understanding agent behavior, and iterating on system prompts.
Performance Analysis
Latency Breakdown
LangSmith automatically calculates:

- Total analysis time: End-to-end latency
- Per-agent time: Which agent is the bottleneck?
- LLM time vs API time: How much is network overhead?
| Component | Latency | % of Total |
|---|---|---|
| Full Analysis | 98.2s | 100% |
| Debate (3 rounds) | 85.4s | 87% |
| Clinical Agent (avg) | 11.2s | 11% |
| Literature Agent (avg) | 8.7s | 9% |
| Safety Agent (avg) | 9.1s | 9% |
| Critic Agent (avg) | 7.3s | 7% |
| Med Error Panel | 12.1s | 12% (parallel) |
| Synthesis | 8.9s | 9% |
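The "% of Total" column is simply each component's latency divided by the end-to-end time; a minimal sketch of that calculation, using a few numbers from the table above:

```python
# Compute each component's share of total analysis latency (seconds).
latencies = {
    "Debate (3 rounds)": 85.4,
    "Med Error Panel": 12.1,
    "Synthesis": 8.9,
}
total = 98.2  # end-to-end latency for the full analysis

# Share of total, rounded to whole percent, matching the table above.
shares = {name: round(100 * secs / total) for name, secs in latencies.items()}
```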
Token Usage
Track token consumption per agent. Use this data to optimize costs: for example, you might switch the Literature Agent to GPT-4o-mini (cheaper) or reduce debate rounds from 3 to 2.
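As a sketch of that kind of cost accounting (the token counts and the per-1K-token price below are illustrative assumptions, not measured values):

```python
# Estimate per-agent cost from traced token counts (all values assumed).
tokens = {
    "clinical": 18_000,     # total tokens per analysis, per agent
    "literature": 12_000,
    "safety": 10_000,
    "critic": 8_000,
}
price_per_1k = 0.01  # assumed blended $/1K tokens for the current model

cost = {agent: n / 1000 * price_per_1k for agent, n in tokens.items()}
total_cost = sum(cost.values())
# Agents with outsized cost are the candidates for a cheaper model.
```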
Custom Annotations
Add custom metadata to traces, such as a case ID or analysis mode, so runs can be filtered and grouped later.

Debugging Failures
Automatic Error Capture
LangSmith captures exceptions automatically, including:

- Exception type
- Stack trace
- Input that caused the failure
- Retry attempts (if any)
Filtering Traces
Find specific issues:

- By status: `status:error` (all failed traces)
- By agent: `metadata.agent_type:clinical` (Clinical Agent only)
- By latency: `latency_ms:>10000` (traces over 10 seconds)
- By token count: `total_tokens:>50000` (expensive calls)
Alternative: Langfuse
ClinicalPilot also supports Langfuse, an open-source alternative to LangSmith: update `backend/observability/tracing.py` to use the Langfuse SDK instead of LangSmith.
Production Considerations
ClinicalPilot anonymizes all inputs via Microsoft Presidio before tracing. Verify that:

- PHI scrubbing happens before agent calls
- Anonymized data is logged, not raw input
- Your LangSmith project has appropriate access controls
Self-Hosted Tracing
For HIPAA compliance, consider:

- Self-hosted Langfuse: Deploy on your infrastructure
- Local-only logging: Disable cloud tracing, use file-based logs
- VPC-only LangSmith: Enterprise plan with private deployment
Disabling Tracing
To turn off tracing, set `LANGCHAIN_TRACING_V2=false` or unset the `LANGSMITH_API_KEY` variable.
Check status programmatically:
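ClinicalPilot's actual helper isn't shown here, but a minimal stdlib check based on the environment variables above might look like:

```python
import os

def tracing_enabled() -> bool:
    """Tracing is active only when the API key is set and the flag is true."""
    return (
        os.getenv("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.getenv("LANGSMITH_API_KEY"))
    )
```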
Trace Retention
LangSmith free tier:

- 14 days trace retention
- 1M traces/month
Best Practices
- Use descriptive span names: `"Clinical Agent - Round 2"` instead of `"agent_call"`
- Add metadata: Include case ID, patient age range, diagnosis category
- Sample high-volume endpoints: Don't trace every health check; trace only `/api/analyze`
- Set up alerts: Monitor for high error rates or latency spikes
- Review traces regularly: Weekly review of failed/slow traces
Example Queries
- Find slow analyses
- Find cases that needed 3 debate rounds
- Find high-token-usage cases
- Find emergency mode activations
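Using the filter syntax shown above, these queries might look like the following (the `metadata.*` keys and thresholds are assumptions about how ClinicalPilot annotates its runs, not confirmed names):

```
latency_ms:>60000               # slow analyses (over 60s end-to-end)
metadata.debate_rounds:3        # cases that went the full 3 rounds
total_tokens:>100000            # high-token-usage cases
metadata.emergency_mode:true    # emergency mode activations
```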
Troubleshooting
"LANGSMITH_API_KEY not set" Warning

This warning means tracing is disabled because the API key is missing from the environment. Export LANGSMITH_API_KEY (see Setup above); analyses still run, they just aren't traced.
Traces Not Appearing
- Check the API key is valid: https://smith.langchain.com/settings
- Verify `LANGCHAIN_TRACING_V2=true`
- Check that a firewall/proxy isn't blocking the LangSmith API
- Look for errors in app logs: `grep -i langsmith /var/log/clinicalpilot.log`
"Rate limit exceeded" Errors

LangSmith free tier has rate limits. Upgrade to a paid plan or reduce trace volume.

Next Steps
- Testing: Write smoke tests to validate agent behavior
- Production Deployment: Deploy ClinicalPilot to production with observability