ClinicalPilot uses a 7-layer architecture to transform clinical inputs into validated SOAP notes through multi-agent debate. Each layer has a specific purpose, from PHI anonymization to observability.

High-Level Overview

Processing Time: The full pipeline takes ~100 seconds for complex cases due to 14+ LLM calls across 3 debate rounds. Emergency mode bypasses debate for <5s response.

Layer 1: Input Gateway

Purpose: Accept clinical data from multiple sources and scrub PHI

FHIR API

HL7 FHIR R4 JSON bundles parsed to extract Patient, Condition, Observation, MedicationRequest resources
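Extracting those resource types from a parsed bundle can be sketched as follows (a simplified stand-in for `fhir_parser.py`; the bundle shape follows the FHIR R4 `Bundle` structure):

```python
# Simplified stand-in for fhir_parser.py: group the resources the
# pipeline cares about out of a FHIR R4 Bundle (already parsed from JSON).
def extract_resources(
    bundle: dict,
    wanted: tuple[str, ...] = ("Patient", "Condition", "Observation", "MedicationRequest"),
) -> dict[str, list[dict]]:
    grouped: dict[str, list[dict]] = {name: [] for name in wanted}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") in grouped:
            grouped[resource["resourceType"]].append(resource)
    return grouped
```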

EHR Upload

PDF/CSV documents parsed via PyPDF2 and Unstructured.io

Free Text

Direct text input or future voice (Whisper STT)

PHI Anonymization

All inputs pass through Microsoft Presidio for PHI scrubbing:
# backend/input_layer/anonymizer.py:87-148
entities = [
    "PERSON",
    "PHONE_NUMBER",
    "EMAIL_ADDRESS",
    "US_SSN",
    "CREDIT_CARD",
    "IP_ADDRESS",
    "LOCATION",  # Filtered to avoid clinical term false positives
    # DATE_TIME excluded - causes false positives on med dosages
]

results = self._presidio_analyzer.analyze(
    text=text, entities=entities, language="en"
)

# Post-process: scrub DOB patterns via targeted regex
result = re.sub(
    r"\b(?:DOB|Date of Birth|Birth\s?date)[:\s]*\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}\b",
    "[DOB REDACTED]",
    result,
    flags=re.I,
)
DATE_TIME entity is intentionally excluded from Presidio to prevent false positives on medication dosages like “20mEq” being misidentified as dates. DOB scrubbing uses targeted regex instead.
Fallback: If Presidio is unavailable, regex-based anonymization catches SSN, phone, email, MRN patterns.
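That fallback can be sketched as a handful of compiled patterns (the pattern set and replacement tokens here are illustrative, not the actual `anonymizer.py` implementation):

```python
import re

# Illustrative fallback patterns; the real anonymizer.py may differ.
FALLBACK_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN REDACTED]"),
    (re.compile(r"\b(?:\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"), "[PHONE REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL REDACTED]"),
    (re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.I), "[MRN REDACTED]"),
]

def fallback_anonymize(text: str) -> str:
    """Regex-only PHI scrub used when Presidio is unavailable."""
    for pattern, replacement in FALLBACK_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```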

Layer 2: Processing & Parsing

Purpose: Convert diverse inputs into a unified PatientContext schema
1. Parse Input

Route to appropriate parser based on input type:
  • fhir_parser.py - FHIR R4 bundles
  • ehr_parser.py - PDF/CSV uploads
  • text_parser.py - Free text normalization
2. Build PatientContext

All parsers output the same Pydantic model:
PatientContext(
    patient_id=...,
    age=...,
    gender=...,
    conditions=[...],
    medications=[...],
    labs=[...],
    allergies=[...],
    current_prompt=...,
    raw_text=...
)
3. Validate Schema

Pydantic validates all fields for type safety before entering agent layer
Key Design: Single unified schema eliminates parsing logic from agents - they all receive the same structured data.

Layer 3: Agent Layer

Four specialized agents generate outputs in parallel (Round 1) or incorporate critique (Round 2+):
| Agent | Model | Purpose | Output | Source |
|---|---|---|---|---|
| Clinical | GPT-4o / MedGemma-2-9b | Generate differential diagnoses, risk scores, SOAP draft | ClinicalAgentOutput | backend/agents/clinical.py:18 |
| Literature | GPT-4o-mini | Search PubMed, Europe PMC, LanceDB RAG for evidence | LiteratureAgentOutput | backend/agents/literature.py:20 |
| Safety | GPT-4o | Check drug interactions, contraindications, dosing alerts | SafetyAgentOutput | backend/agents/safety.py:17 |
| Critic | GPT-4o | Review all outputs, identify contradictions, gaps, errors | CriticOutput | backend/agents/critic.py:22 |
# backend/debate/debate_engine.py:96-120
async def _round_1(patient: PatientContext):
    # Clinical runs first (Literature needs its output)
    clinical = await run_clinical_agent(patient)
    
    if _using_groq():
        # Sequential for Groq free-tier rate limits
        await asyncio.sleep(2.0)
        literature = await run_literature_agent(patient, clinical_output=clinical)
        await asyncio.sleep(2.0)
        safety = await run_safety_agent(patient, proposed_plan=clinical.soap_draft)
    else:
        # Parallel for OpenAI / Ollama
        literature, safety = await asyncio.gather(
            run_literature_agent(patient, clinical_output=clinical),
            run_safety_agent(patient, proposed_plan=clinical.soap_draft),
        )
    
    return clinical, literature, safety
Optimization: Literature and Safety run in parallel when using OpenAI. Groq uses sequential execution to respect free-tier TPM limits.

External API Integration

Agents query external medical databases in parallel:
# backend/agents/literature.py:98-101
tasks = [search_pubmed(q, max_results=3) for q in queries[:3]]
all_hits = await asyncio.gather(*tasks, return_exceptions=True)
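Because `return_exceptions=True` delivers failed searches as exception objects rather than raising, a filtering step follows; a sketch under assumed names (`search_pubmed` is stubbed here, and the hit format is illustrative):

```python
import asyncio

# Stand-in for search_pubmed; the real coroutine queries the PubMed API.
async def search_pubmed(query: str, max_results: int = 3) -> list[str]:
    if not query:
        raise ValueError("empty query")
    return [f"{query}-hit-{i}" for i in range(max_results)]

async def gather_hits(queries: list[str]) -> list[str]:
    tasks = [search_pubmed(q, max_results=3) for q in queries[:3]]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Drop failed searches instead of failing the whole round.
    hits: list[str] = []
    for r in results:
        if isinstance(r, Exception):
            continue
        hits.extend(r)
    return hits
```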

Layer 4: Debate Engine

Purpose: Multi-round iterative refinement with critic feedback
1. Round 1: Independent Generation

All agents generate outputs independently (Clinical → Literature & Safety in parallel)
2. Critic Review

Critic agent reviews all outputs, identifies:
  • EHR contradictions
  • Evidence gaps
  • Safety misses
  • Points of dissent
3. Round 2+: Revision

Agents revise outputs based on critique:
# backend/agents/clinical.py:38-42
if critique:
    user_message += f"""
## Critique from Previous Round (address these issues)
{critique}
"""
4. Consensus Check

After each round, check if consensus_reached == True. If yes, end debate. If not, continue to next round (max 3).
5. Final Output

If no consensus after 3 rounds, flag case for human review but still return best-effort SOAP note
# backend/debate/debate_engine.py:29-84
async def run_debate(patient: PatientContext, max_rounds: int = 3):
    state = DebateState()
    
    for round_num in range(1, max_rounds + 1):
        state.round_number = round_num
        
        if round_num == 1:
            clinical, literature, safety = await _round_1(patient)
        else:
            # Revision based on critique
            critique_text = _format_critique(state.critic_outputs[-1])
            clinical, literature, safety = await _revision_round(
                patient, critique_text, prev_clinical=state.clinical_outputs[-1]
            )
        
        state.clinical_outputs.append(clinical)
        state.literature_outputs.append(literature)
        state.safety_outputs.append(safety)
        
        # Run critic
        critic = await run_critic_agent(patient, clinical, literature, safety)
        state.critic_outputs.append(critic)
        
        if critic.consensus_reached:
            state.final_consensus = True
            break
    
    if not state.final_consensus:
        state.flagged_for_human = True
    
    return state
Fixed Rounds: Debate is deterministic (2-3 rounds max) to avoid infinite loops. This design prioritizes speed and predictability over exhaustive consensus.

Layer 0: Medical Error Prevention Panel

Runs in parallel with the debate pipeline via asyncio.gather:
# backend/main.py (conceptual)
debate_state, med_panel = await asyncio.gather(
    run_debate(patient),
    run_med_error_panel(patient),
)
The panel performs comprehensive safety checks:

Drug-Drug Interactions

Check every medication pair for interactions (contraindicated, major, moderate, minor)

Drug-Disease Contraindications

Cross-reference each drug against patient conditions

Dosing Alerts

Renal/hepatic/weight/age-based adjustments

Population Flags

Pregnancy, pediatric, elderly, lactation concerns
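The pair check at the core of the panel can be sketched with `itertools.combinations` (the two-entry interaction table here is a stand-in for a real drug database lookup):

```python
from itertools import combinations

# Illustrative interaction table; the real panel queries a drug database.
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "major",
    frozenset({"lisinopril", "spironolactone"}): "moderate",
}

def check_drug_pairs(medications: list[str]) -> list[tuple[str, str, str]]:
    """Check every medication pair; return (drug_a, drug_b, severity) alerts."""
    alerts = []
    for a, b in combinations(sorted(m.lower() for m in medications), 2):
        severity = INTERACTIONS.get(frozenset({a, b}))
        if severity:
            alerts.append((a, b, severity))
    return alerts
```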
See Safety System for implementation details.

Layer 5: Validation & Output

1. Synthesize SOAP

synthesizer.py merges final round outputs into a complete SOAP note with citations
2. Validate Completeness

validator.py checks:
  • All 4 SOAP sections populated (Subjective, Objective, Assessment, Plan)
  • ≥2 differential diagnoses
  • Safety flags explicitly addressed
  • No hallucinated medications (cross-ref DrugBank)
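The first two checks can be sketched as a simplified stand-in for `validator.py` (the DrugBank cross-reference and safety-flag checks are omitted here):

```python
# Simplified completeness checks; the real validator.py also
# cross-references medications against DrugBank.
def validate_soap(note: dict) -> list[str]:
    """Return human-readable validation errors (empty list = valid)."""
    errors = []
    for section in ("subjective", "objective", "assessment", "plan"):
        if not note.get(section, "").strip():
            errors.append(f"missing SOAP section: {section}")
    if len(note.get("differentials", [])) < 2:
        errors.append("need at least 2 differential diagnoses")
    return errors
```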
3. Return Structured Output

SOAPNote(
    subjective="...",
    objective="...",
    assessment="...",
    plan="...",
    differentials=[...],
    citations=[...],
    uncertainty="...",
    model_used="gpt-4o",
    latency_ms=102000
)

Layer 6: Observability

LangSmith / Langfuse tracing captures:
  • Every LLM call (model, tokens, latency)
  • Agent inputs/outputs
  • Debate round progression
  • External API calls (PubMed, FDA, RxNorm)
# .env
LANGSMITH_API_KEY=your_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=clinicalpilot
Tracing is automatic when configured - no code changes needed.

Layer 7: Guardrails

Multiple validation layers prevent unsafe outputs.

Pydantic Schema Enforcement

All models use strict typing:
class Differential(BaseModel):
    diagnosis: str
    likelihood: str
    reasoning: str
    confidence: ConfidenceLevel  # Enum: high/medium/low
    supporting_evidence: list[str] = Field(default_factory=list)
Invalid data is rejected before it enters the system.
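Rejection happens at construction time. A self-contained version of the `Differential` model above (the `ConfidenceLevel` enum is reproduced here as an assumption):

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

# Assumed definition of the enum referenced by Differential.
class ConfidenceLevel(str, Enum):
    high = "high"
    medium = "medium"
    low = "low"

class Differential(BaseModel):
    diagnosis: str
    likelihood: str
    reasoning: str
    confidence: ConfidenceLevel
    supporting_evidence: list[str] = Field(default_factory=list)

try:
    Differential(
        diagnosis="CHF",
        likelihood="probable",
        reasoning="elevated BNP",
        confidence="certain",  # not a valid ConfidenceLevel
    )
except ValidationError as exc:
    # The invalid enum value is rejected before any agent sees it.
    print(f"rejected with {len(exc.errors())} validation error(s)")
```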
Cross-reference all medications against DrugBank vocabulary before output
Require ≥2 differential diagnoses to avoid anchoring bias
If Medical Error Panel flags contraindicated interactions, they MUST appear in Plan section
Model Under Certainty (MUC) analysis flags low-confidence outputs for human review

Emergency Mode (Fast Path)

Bypass debate for time-critical cases:
POST /api/emergency
Target: <5 seconds from input to output
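The fast path amounts to a single model pass under a hard deadline, with no debate rounds and no critic; a sketch with a stubbed LLM call (function names and response shape are illustrative):

```python
import asyncio
import time

# Stand-in for one Clinical-agent LLM call; a real version would hit the model.
async def generate_draft(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"SOAP draft for: {prompt}"

async def emergency_note(prompt: str, timeout_s: float = 5.0) -> dict:
    """Single LLM pass with a hard deadline; skips the debate loop entirely."""
    start = time.monotonic()
    draft = await asyncio.wait_for(generate_draft(prompt), timeout=timeout_s)
    return {
        "note": draft,
        "mode": "emergency",
        "latency_ms": int((time.monotonic() - start) * 1000),
    }
```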

Key Design Decisions

LanceDB over Pinecone

Serverless, embedded, no infrastructure cost - perfect for hackathon pace

Async Python (asyncio)

All agent calls are async for maximum parallelization

Pydantic Everywhere

Type safety, auto-validation, JSON schema generation for API docs

Fixed Debate Rounds

Deterministic 2-3 rounds prevent infinite loops while allowing refinement

Next Steps

Multi-Agent Debate

Deep dive into debate mechanics, consensus algorithms, and critique formatting

Agent Types

Detailed breakdown of Clinical, Literature, Safety, and Critic agents

Safety System

Medical Error Prevention Panel implementation and guardrails
