ClinicalPilot uses a 7-layer architecture to transform clinical inputs into validated SOAP notes through multi-agent debate. Each layer has a specific purpose, from PHI anonymization to observability.

High-Level Overview

Processing Time: The full pipeline takes ~100 seconds for complex cases due to 14+ LLM calls across 3 debate rounds. Emergency mode bypasses debate for <5s response.

Layer 1: Input Gateway

Purpose: Accept clinical data from multiple sources and scrub PHI

FHIR API

HL7 FHIR R4 JSON bundles parsed to extract Patient, Condition, Observation, MedicationRequest resources
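Extracting those resource types from a parsed bundle can be sketched as follows (a simplified stand-in for `fhir_parser.py`; the bundle shape follows the FHIR R4 `Bundle` structure):

```python
# Simplified stand-in for fhir_parser.py: group the resources the
# pipeline cares about out of a FHIR R4 Bundle (already parsed from JSON).
def extract_resources(
    bundle: dict,
    wanted: tuple[str, ...] = ("Patient", "Condition", "Observation", "MedicationRequest"),
) -> dict[str, list[dict]]:
    grouped: dict[str, list[dict]] = {name: [] for name in wanted}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") in grouped:
            grouped[resource["resourceType"]].append(resource)
    return grouped
```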

EHR Upload

PDF/CSV documents parsed via PyPDF2 and Unstructured.io

Free Text

Direct text input or future voice (Whisper STT)

PHI Anonymization

All inputs pass through Microsoft Presidio for PHI scrubbing:
# backend/input_layer/anonymizer.py:87-148
entities = [
    "PERSON",
    "PHONE_NUMBER",
    "EMAIL_ADDRESS",
    "US_SSN",
    "CREDIT_CARD",
    "IP_ADDRESS",
    "LOCATION",  # Filtered to avoid clinical term false positives
    # DATE_TIME excluded - causes false positives on med dosages
]

results = self._presidio_analyzer.analyze(
    text=text, entities=entities, language="en"
)

# Post-process: scrub DOB patterns via targeted regex
result = re.sub(
    r"\b(?:DOB|Date of Birth|Birth\s?date)[:\s]*\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}\b",
    "[DOB REDACTED]",
    result,
    flags=re.I,
)
DATE_TIME entity is intentionally excluded from Presidio to prevent false positives on medication dosages like “20mEq” being misidentified as dates. DOB scrubbing uses targeted regex instead.
Fallback: If Presidio is unavailable, regex-based anonymization catches SSN, phone, email, MRN patterns.
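That fallback can be sketched as a handful of compiled patterns (the pattern set and replacement tokens here are illustrative, not the actual `anonymizer.py` implementation):

```python
import re

# Illustrative fallback patterns; the real anonymizer.py may differ.
FALLBACK_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN REDACTED]"),
    (re.compile(r"\b(?:\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"), "[PHONE REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL REDACTED]"),
    (re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.I), "[MRN REDACTED]"),
]

def fallback_anonymize(text: str) -> str:
    """Regex-only PHI scrub used when Presidio is unavailable."""
    for pattern, replacement in FALLBACK_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```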

Layer 2: Processing & Parsing

Purpose: Convert diverse inputs into a unified PatientContext schema
1. Parse Input

Route to appropriate parser based on input type:
  • fhir_parser.py - FHIR R4 bundles
  • ehr_parser.py - PDF/CSV uploads
  • text_parser.py - Free text normalization
2. Build PatientContext

All parsers output the same Pydantic model:
PatientContext(
    patient_id=...,
    age=...,
    gender=...,
    conditions=[...],
    medications=[...],
    labs=[...],
    allergies=[...],
    current_prompt=...,
    raw_text=...
)
3. Validate Schema

Pydantic validates all fields for type safety before entering agent layer
Key Design: Single unified schema eliminates parsing logic from agents - they all receive the same structured data.

Layer 3: Agent Layer

Four specialized agents generate outputs in parallel (Round 1) or incorporate critique (Round 2+):
| Agent | Model | Purpose | Output | Source |
|---|---|---|---|---|
| Clinical | GPT-4o / MedGemma-2-9b | Generate differential diagnoses, risk scores, SOAP draft | ClinicalAgentOutput | backend/agents/clinical.py:18 |
| Literature | GPT-4o-mini | Search PubMed, Europe PMC, LanceDB RAG for evidence | LiteratureAgentOutput | backend/agents/literature.py:20 |
| Safety | GPT-4o | Check drug interactions, contraindications, dosing alerts | SafetyAgentOutput | backend/agents/safety.py:17 |
| Critic | GPT-4o | Review all outputs, identify contradictions, gaps, errors | CriticOutput | backend/agents/critic.py:22 |
# backend/debate/debate_engine.py:96-120
async def _round_1(patient: PatientContext):
    # Clinical runs first (Literature needs its output)
    clinical = await run_clinical_agent(patient)
    
    if _using_groq():
        # Sequential for Groq free-tier rate limits
        await asyncio.sleep(2.0)
        literature = await run_literature_agent(patient, clinical_output=clinical)
        await asyncio.sleep(2.0)
        safety = await run_safety_agent(patient, proposed_plan=clinical.soap_draft)
    else:
        # Parallel for OpenAI / Ollama
        literature, safety = await asyncio.gather(
            run_literature_agent(patient, clinical_output=clinical),
            run_safety_agent(patient, proposed_plan=clinical.soap_draft),
        )
    
    return clinical, literature, safety
Optimization: Literature and Safety run in parallel when using OpenAI. Groq uses sequential execution to respect free-tier TPM limits.

External API Integration

Agents query external medical databases in parallel:
# backend/agents/literature.py:98-101
tasks = [search_pubmed(q, max_results=3) for q in queries[:3]]
all_hits = await asyncio.gather(*tasks, return_exceptions=True)
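Because `return_exceptions=True` delivers failed searches as exception objects rather than raising, a filtering step follows; a sketch under assumed names (`search_pubmed` is stubbed here, and the hit format is illustrative):

```python
import asyncio

# Stand-in for search_pubmed; the real coroutine queries the PubMed API.
async def search_pubmed(query: str, max_results: int = 3) -> list[str]:
    if not query:
        raise ValueError("empty query")
    return [f"{query}-hit-{i}" for i in range(max_results)]

async def gather_hits(queries: list[str]) -> list[str]:
    tasks = [search_pubmed(q, max_results=3) for q in queries[:3]]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Drop failed searches instead of failing the whole round.
    hits: list[str] = []
    for r in results:
        if isinstance(r, Exception):
            continue
        hits.extend(r)
    return hits
```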

Layer 4: Debate Engine

Purpose: Multi-round iterative refinement with critic feedback
1. Round 1: Independent Generation

All agents generate outputs independently (Clinical → Literature & Safety in parallel)
2. Critic Review

Critic agent reviews all outputs, identifies:
  • EHR contradictions
  • Evidence gaps
  • Safety misses
  • Points of dissent
3. Round 2+: Revision

Agents revise outputs based on critique:
# backend/agents/clinical.py:38-42
if critique:
    user_message += f"""
## Critique from Previous Round (address these issues)
{critique}
"""
4. Consensus Check

After each round, check if consensus_reached == True. If yes, end debate. If not, continue to next round (max 3).
5. Final Output

If no consensus after 3 rounds, flag case for human review but still return best-effort SOAP note
# backend/debate/debate_engine.py:29-84
async def run_debate(patient: PatientContext, max_rounds: int = 3):
    state = DebateState()
    
    for round_num in range(1, max_rounds + 1):
        state.round_number = round_num
        
        if round_num == 1:
            clinical, literature, safety = await _round_1(patient)
        else:
            # Revision based on critique
            critique_text = _format_critique(state.critic_outputs[-1])
            clinical, literature, safety = await _revision_round(
                patient, critique_text, prev_clinical=state.clinical_outputs[-1]
            )
        
        state.clinical_outputs.append(clinical)
        state.literature_outputs.append(literature)
        state.safety_outputs.append(safety)
        
        # Run critic
        critic = await run_critic_agent(patient, clinical, literature, safety)
        state.critic_outputs.append(critic)
        
        if critic.consensus_reached:
            state.final_consensus = True
            break
    
    if not state.final_consensus:
        state.flagged_for_human = True
    
    return state
Fixed Rounds: Debate is deterministic (2-3 rounds max) to avoid infinite loops. This design prioritizes speed and predictability over exhaustive consensus.

Layer 0: Medical Error Prevention Panel

Runs in parallel with the debate pipeline via asyncio.gather:
# backend/main.py (conceptual)
debate_state, med_panel = await asyncio.gather(
    run_debate(patient),
    run_med_error_panel(patient),
)
The panel performs comprehensive safety checks:

Drug-Drug Interactions

Check every medication pair for interactions (contraindicated, major, moderate, minor)

Drug-Disease Contraindications

Cross-reference each drug against patient conditions

Dosing Alerts

Renal/hepatic/weight/age-based adjustments

Population Flags

Pregnancy, pediatric, elderly, lactation concerns
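The pair check at the core of the panel can be sketched with `itertools.combinations` (the two-entry interaction table here is a stand-in for a real drug database lookup):

```python
from itertools import combinations

# Illustrative interaction table; the real panel queries a drug database.
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "major",
    frozenset({"lisinopril", "spironolactone"}): "moderate",
}

def check_drug_pairs(medications: list[str]) -> list[tuple[str, str, str]]:
    """Check every medication pair; return (drug_a, drug_b, severity) alerts."""
    alerts = []
    for a, b in combinations(sorted(m.lower() for m in medications), 2):
        severity = INTERACTIONS.get(frozenset({a, b}))
        if severity:
            alerts.append((a, b, severity))
    return alerts
```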
See Safety System for implementation details.

Layer 5: Validation & Output

1. Synthesize SOAP

synthesizer.py merges final round outputs into a complete SOAP note with citations
2. Validate Completeness

validator.py checks:
  • All 4 SOAP sections populated (Subjective, Objective, Assessment, Plan)
  • ≥2 differential diagnoses
  • Safety flags explicitly addressed
  • No hallucinated medications (cross-ref DrugBank)
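The first two checks can be sketched as a simplified stand-in for `validator.py` (the DrugBank cross-reference and safety-flag checks are omitted here):

```python
# Simplified completeness checks; the real validator.py also
# cross-references medications against DrugBank.
def validate_soap(note: dict) -> list[str]:
    """Return human-readable validation errors (empty list = valid)."""
    errors = []
    for section in ("subjective", "objective", "assessment", "plan"):
        if not note.get(section, "").strip():
            errors.append(f"missing SOAP section: {section}")
    if len(note.get("differentials", [])) < 2:
        errors.append("need at least 2 differential diagnoses")
    return errors
```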
3. Return Structured Output

SOAPNote(
    subjective="...",
    objective="...",
    assessment="...",
    plan="...",
    differentials=[...],
    citations=[...],
    uncertainty="...",
    model_used="gpt-4o",
    latency_ms=102000
)

Layer 6: Observability

LangSmith / Langfuse tracing captures:
  • Every LLM call (model, tokens, latency)
  • Agent inputs/outputs
  • Debate round progression
  • External API calls (PubMed, FDA, RxNorm)
# .env
LANGSMITH_API_KEY=your_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=clinicalpilot
Tracing is automatic when configured - no code changes needed.

Layer 7: Guardrails

Multiple validation layers prevent unsafe outputs.

Pydantic Schema Enforcement

All models use strict typing:
class Differential(BaseModel):
    diagnosis: str
    likelihood: str
    reasoning: str
    confidence: ConfidenceLevel  # Enum: high/medium/low
    supporting_evidence: list[str] = Field(default_factory=list)
Invalid data is rejected before it enters the system.
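Rejection happens at construction time. A self-contained version of the `Differential` model above (the `ConfidenceLevel` enum is reproduced here as an assumption):

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

# Assumed definition of the enum referenced by Differential.
class ConfidenceLevel(str, Enum):
    high = "high"
    medium = "medium"
    low = "low"

class Differential(BaseModel):
    diagnosis: str
    likelihood: str
    reasoning: str
    confidence: ConfidenceLevel
    supporting_evidence: list[str] = Field(default_factory=list)

try:
    Differential(
        diagnosis="CHF",
        likelihood="probable",
        reasoning="elevated BNP",
        confidence="certain",  # not a valid ConfidenceLevel
    )
except ValidationError as exc:
    # The invalid enum value is rejected before any agent sees it.
    print(f"rejected with {len(exc.errors())} validation error(s)")
```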
Cross-reference all medications against DrugBank vocabulary before output
Require ≥2 differential diagnoses to avoid anchoring bias
If Medical Error Panel flags contraindicated interactions, they MUST appear in Plan section
Model Under Certainty (MUC) analysis flags low-confidence outputs for human review

Emergency Mode (Fast Path)

Bypass debate for time-critical cases:
POST /api/emergency
Target: <5 seconds from input to output
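The fast path amounts to a single model pass under a hard deadline, with no debate rounds and no critic; a sketch with a stubbed LLM call (function names and response shape are illustrative):

```python
import asyncio
import time

# Stand-in for one Clinical-agent LLM call; a real version would hit the model.
async def generate_draft(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"SOAP draft for: {prompt}"

async def emergency_note(prompt: str, timeout_s: float = 5.0) -> dict:
    """Single LLM pass with a hard deadline; skips the debate loop entirely."""
    start = time.monotonic()
    draft = await asyncio.wait_for(generate_draft(prompt), timeout=timeout_s)
    return {
        "note": draft,
        "mode": "emergency",
        "latency_ms": int((time.monotonic() - start) * 1000),
    }
```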

Key Design Decisions

LanceDB over Pinecone

Serverless, embedded, no infrastructure cost - perfect for hackathon pace

Async Python (asyncio)

All agent calls are async for maximum parallelization

Pydantic Everywhere

Type safety, auto-validation, JSON schema generation for API docs

Fixed Debate Rounds

Deterministic 2-3 rounds prevent infinite loops while allowing refinement

Next Steps

Multi-Agent Debate

Deep dive into debate mechanics, consensus algorithms, and critique formatting

Agent Types

Detailed breakdown of Clinical, Literature, Safety, and Critic agents

Safety System

Medical Error Prevention Panel implementation and guardrails
