ClinicalPilot uses multi-round debate to refine clinical reasoning through iterative critique and revision. This approach reduces single-model bias, catches errors, and produces higher-quality SOAP notes.

Debate Philosophy

Core Insight: A single LLM can miss critical details, anchor on initial diagnoses, or overlook safety issues. By having specialized agents debate and critique each other, we expose blind spots and force deeper reasoning.

Why Debate Works

Specialization

Each agent focuses on its domain (clinical reasoning, literature, safety) without cognitive overload

Adversarial Review

Critic agent actively looks for flaws, contradictions, and missing evidence

Iterative Refinement

Agents revise outputs based on critique, incorporating new evidence and addressing gaps

Debate Architecture


Round-by-Round Breakdown

Round 1: Independent Generation

Goal: Generate initial outputs without bias from other agents
1. Clinical Agent Runs First

Generates differential diagnoses, risk scores, and SOAP draft based on PatientContext:
# backend/agents/clinical.py:18-50
async def run_clinical_agent(
    patient: PatientContext,
    critique: str = "",
) -> ClinicalAgentOutput:
    system_prompt = load_prompt("clinical_system.txt")
    
    user_message = f"""## Patient Context
{patient.to_clinical_summary()}

## Raw Clinical Text
{patient.raw_text or patient.current_prompt}
"""
    
    result = await llm_call(
        system_prompt=system_prompt,
        user_message=user_message,
        json_mode=True,
    )
    
    return _parse_output(result["content"])
2. Literature & Safety Run in Parallel

Literature agent receives Clinical’s differentials to search for evidence:
# backend/debate/debate_engine.py:114-118
literature, safety = await asyncio.gather(
    run_literature_agent(patient, clinical_output=clinical),
    run_safety_agent(patient, proposed_plan=clinical.soap_draft),
)
Optimization: With OpenAI or Ollama, these agents run in parallel. With Groq (free tier), they run sequentially with 2-second delays to stay under rate limits.
3. Critic Reviews All Outputs

Receives Clinical, Literature, and Safety outputs, then identifies:
  • EHR contradictions: Does the DDx conflict with patient history?
  • Evidence gaps: Are claims unsupported by literature?
  • Safety misses: Did Safety agent catch all drug interactions?
  • Dissent log: What points of disagreement exist between agents?
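The document does not show the Critic agent's signature, but its role can be sketched as bundling all three outputs into a single review prompt. A minimal, hypothetical illustration (`run_critic_stub` is not the project's actual function; the real call goes through `llm_call` and parses the reply into `CriticOutput`):

```python
import asyncio

async def run_critic_stub(clinical: str, literature: str, safety: str) -> str:
    # Assemble one review prompt from the three agents' outputs (illustrative).
    prompt = "\n\n".join([
        f"## Clinical Output\n{clinical}",
        f"## Literature Output\n{literature}",
        f"## Safety Output\n{safety}",
    ])
    # The real system sends this through llm_call(..., json_mode=True) and
    # parses the reply into CriticOutput; here we just return the prompt.
    return prompt

prompt = asyncio.run(run_critic_stub("DDx: anemia", "4 PubMed hits", "1 interaction"))
print(prompt.splitlines()[0])  # → ## Clinical Output
```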
4. Consensus Check

# backend/models/agents.py:55-62
class CriticOutput(BaseModel):
    ehr_contradictions: list[str] = Field(default_factory=list)
    evidence_gaps: list[str] = Field(default_factory=list)
    safety_misses: list[str] = Field(default_factory=list)
    overall_assessment: str = ""
    consensus_reached: bool = False  # ← Decision point
    dissent_log: list[str] = Field(default_factory=list)
If consensus_reached == True, debate ends. Otherwise, proceed to Round 2.

Round 2+: Revision Based on Critique

Goal: Address gaps and contradictions identified by Critic
# backend/debate/debate_engine.py:160-179
def _format_critique(critic: CriticOutput) -> str:
    parts = []
    
    if critic.ehr_contradictions:
        parts.append(
            "EHR Contradictions:\n" + 
            "\n".join(f"- {c}" for c in critic.ehr_contradictions)
        )
    
    if critic.evidence_gaps:
        parts.append(
            "Evidence Gaps:\n" + 
            "\n".join(f"- {g}" for g in critic.evidence_gaps)
        )
    
    if critic.safety_misses:
        parts.append(
            "Safety Misses:\n" + 
            "\n".join(f"- {m}" for m in critic.safety_misses)
        )
    
    if critic.overall_assessment:
        parts.append(f"Overall Assessment: {critic.overall_assessment}")
    
    if critic.dissent_log:
        parts.append(
            "Points of Dissent:\n" + 
            "\n".join(f"- {d}" for d in critic.dissent_log)
        )
    
    return "\n\n".join(parts)
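Given a populated critique, the formatter yields one titled section per non-empty field. A self-contained illustration of the same shape, using a dataclass stand-in for the Pydantic model:

```python
from dataclasses import dataclass, field

@dataclass
class CriticOutputStub:
    """Dataclass stand-in for the Pydantic CriticOutput model."""
    ehr_contradictions: list = field(default_factory=list)
    evidence_gaps: list = field(default_factory=list)
    safety_misses: list = field(default_factory=list)
    overall_assessment: str = ""
    dissent_log: list = field(default_factory=list)

def format_critique(critic: CriticOutputStub) -> str:
    # Mirrors _format_critique: one titled section per populated field.
    parts = []
    for title, items in [
        ("EHR Contradictions", critic.ehr_contradictions),
        ("Evidence Gaps", critic.evidence_gaps),
        ("Safety Misses", critic.safety_misses),
    ]:
        if items:
            parts.append(f"{title}:\n" + "\n".join(f"- {i}" for i in items))
    if critic.overall_assessment:
        parts.append(f"Overall Assessment: {critic.overall_assessment}")
    if critic.dissent_log:
        parts.append("Points of Dissent:\n" + "\n".join(f"- {d}" for d in critic.dissent_log))
    return "\n\n".join(parts)

critique = format_critique(CriticOutputStub(
    evidence_gaps=["No citation for orthostatic hypotension claim"],
    overall_assessment="Needs better lab integration.",
))
print(critique)
```

Empty fields produce no section, so a clean critique formats to an empty string.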
1. Clinical Agent Receives Critique

# backend/agents/clinical.py:38-42
if critique:
    user_message += f"""
## Critique from Previous Round (address these issues)
{critique}
"""
Example critique:
EHR Contradictions:
- DDx includes CKD but patient labs show normal creatinine (0.9 mg/dL)

Evidence Gaps:
- Claim that "orthostatic hypotension common in elderly" lacks citation

Safety Misses:
- Did not flag Lisinopril + Potassium hyperkalemia risk

Overall Assessment: Strong differential but needs better integration of labs. Safety review incomplete.
2. Agents Revise Outputs

Clinical revises DDx, Literature searches new queries, Safety re-checks interactions:
# backend/debate/debate_engine.py:123-157
async def _revision_round(
    patient: PatientContext,
    critique_text: str,
    prev_clinical: ClinicalAgentOutput | None = None,
):
    # Clinical revises first
    clinical = await run_clinical_agent(patient, critique=critique_text)
    
    # Literature & Safety revise in parallel
    literature, safety = await asyncio.gather(
        run_literature_agent(
            patient, clinical_output=clinical, critique=critique_text
        ),
        run_safety_agent(
            patient, proposed_plan=clinical.soap_draft, critique=critique_text
        ),
    )
    
    return clinical, literature, safety
3. Critic Reviews Again

Checks if issues were addressed. If yes, sets consensus_reached = True. If not, logs remaining dissent.
4. Max Rounds Check

# backend/debate/debate_engine.py:40-83
for round_num in range(1, max_rounds + 1):
    # ... run round ...
    
    if critic.consensus_reached:
        logger.info(f"Consensus reached after round {round_num}")
        state.final_consensus = True
        break

# After max rounds
if not state.final_consensus:
    state.flagged_for_human = True
    logger.warning("No consensus after max rounds — flagging for human review")
Default: 3 rounds max (configurable via MAX_DEBATE_ROUNDS env var)

Consensus Algorithm

Consensus is qualitative, not quantitative. The Critic agent uses LLM reasoning to determine if outputs are coherent, evidence-backed, and safe.

Consensus Criteria (from Critic System Prompt)

You should set consensus_reached = true ONLY if:
1. All differential diagnoses are supported by patient data (no EHR contradictions)
2. All clinical claims have literature backing (no major evidence gaps)
3. All safety issues have been identified and flagged (no missed interactions)
4. No significant points of dissent remain between agents

If any of the above fail, set consensus_reached = false and provide specific feedback.
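The four criteria reduce to a simple predicate over the critique fields. A sketch of that logic in code (note that in ClinicalPilot the LLM itself sets consensus_reached; this predicate only restates the prompt's criteria):

```python
def consensus_from_critique(ehr_contradictions, evidence_gaps,
                            safety_misses, dissent_log) -> bool:
    # Consensus only when every issue list is empty, mirroring the four
    # criteria in the Critic system prompt.
    return not (ehr_contradictions or evidence_gaps or safety_misses or dissent_log)

print(consensus_from_critique([], [], [], []))                          # → True
print(consensus_from_critique(["DDx conflicts with labs"], [], [], []))  # → False
```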

Example Consensus Decision

{
  "ehr_contradictions": [
    "DDx includes 'Diabetic Nephropathy' but patient has normal HbA1c (5.2%)"
  ],
  "evidence_gaps": [
    "No citation for claim that 'ARBs superior to ACEIs in CKD'"
  ],
  "safety_misses": [
    "Lisinopril + Potassium interaction not flagged"
  ],
  "overall_assessment": "Strong clinical reasoning but needs better lab integration and complete safety review.",
  "consensus_reached": false,
  "dissent_log": [
    "Clinical agent suggests CKD but labs contradict",
    "Safety agent did not check all medication pairs"
  ]
}

Debate State Tracking

The DebateState model captures full history across rounds:
# backend/models/agents.py:64-73
class DebateState(BaseModel):
    """Full state of a debate across rounds."""
    round_number: int = 0
    clinical_outputs: list[ClinicalAgentOutput] = Field(default_factory=list)
    literature_outputs: list[LiteratureAgentOutput] = Field(default_factory=list)
    safety_outputs: list[SafetyAgentOutput] = Field(default_factory=list)
    critic_outputs: list[CriticOutput] = Field(default_factory=list)
    final_consensus: bool = False
    flagged_for_human: bool = False
Usage in UI:
  • Show round-by-round progression in a timeline
  • Display critic feedback for each round
  • Highlight final consensus or human review flag
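One way to drive that UI is to flatten the state into timeline rows. A simplified sketch; `DebateStateStub` and `critic_summaries` are illustrative stand-ins (the real `DebateState` stores full per-round agent outputs, not one-line summaries):

```python
from dataclasses import dataclass, field

@dataclass
class DebateStateStub:
    """Simplified stand-in for DebateState."""
    critic_summaries: list = field(default_factory=list)  # one line per round
    final_consensus: bool = False
    flagged_for_human: bool = False

def timeline_rows(state: DebateStateStub) -> list:
    # One UI row per round, plus a terminal status row.
    rows = [f"Round {i + 1}: {s}" for i, s in enumerate(state.critic_summaries)]
    if state.final_consensus:
        rows.append("Consensus reached")
    elif state.flagged_for_human:
        rows.append("Requires human review")
    return rows

rows = timeline_rows(DebateStateStub(
    critic_summaries=["No consensus: CKD unsupported by labs", "All issues addressed"],
    final_consensus=True,
))
print(rows)
```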

Performance Optimizations

Literature and Safety run in parallel when possible:
# backend/debate/debate_engine.py:114-118
literature, safety = await asyncio.gather(
    run_literature_agent(patient, clinical_output=clinical),
    run_safety_agent(patient, proposed_plan=clinical.soap_draft),
)
Speedup: ~2× faster than sequential (12s → 6s for Round 1)
PubMed searches run 3 queries concurrently:
# backend/agents/literature.py:98-101
tasks = [search_pubmed(q, max_results=3) for q in queries[:3]]
all_hits = await asyncio.gather(*tasks, return_exceptions=True)
Speedup: ~3× faster than sequential (9s → 3s)
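With return_exceptions=True, a failed search comes back as an exception object instead of aborting the whole gather. A self-contained sketch of how those failures can be filtered out (`search_stub` stands in for `search_pubmed`; the exact filtering in the project may differ):

```python
import asyncio

async def search_stub(query: str) -> list:
    # Stand-in for search_pubmed; one query fails to show error handling.
    if query == "bad-query":
        raise RuntimeError("PubMed timeout")
    return [f"{query}-paper-1", f"{query}-paper-2"]

async def gather_hits(queries: list) -> list:
    tasks = [search_stub(q) for q in queries[:3]]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    hits = []
    for r in results:
        if isinstance(r, Exception):
            continue  # drop the failed query instead of failing the round
        hits.extend(r)
    return hits

hits = asyncio.run(gather_hits(["anemia", "bad-query", "hypotension"]))
print(len(hits))  # → 4 (two queries succeeded, two papers each)
```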
When using Groq (free tier), sequential execution with delays:
# backend/debate/debate_engine.py:107-112
if _using_groq():
    await asyncio.sleep(2.0)  # Respect TPM limits
    literature = await run_literature_agent(patient, clinical_output=clinical)
    await asyncio.sleep(2.0)
    safety = await run_safety_agent(patient, proposed_plan=clinical.soap_draft)
Safety panel runs alongside debate, not after:
# Conceptual (in main.py)
(soap, debate_state), med_panel = await asyncio.gather(
    full_pipeline(request),       # returns (soap, debate_state)
    run_med_error_panel(patient),
)
Speedup: No added latency for safety panel

Debate vs. Emergency Mode

| Feature | Debate Mode | Emergency Mode |
| --- | --- | --- |
| Rounds | 2-3 rounds | 0 (single pass) |
| Agents | Clinical, Literature, Safety, Critic | Clinical, Safety only |
| Consensus | Yes (Critic-driven) | No |
| Latency | ~100 seconds | <5 seconds |
| Use Case | Complex cases, outpatient | Time-critical, ED triage |
| Output | Full SOAP note | Top 3 DDx + red flags |
When to Use Emergency Mode:
  • ESI Level 1-2 (immediate/emergent)
  • Chest pain, stroke symptoms, severe trauma
  • Any case where debate latency is unacceptable
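The criteria above can be sketched as a small triage helper. This is a hypothetical function (not shown in the ClinicalPilot source), using ESI level and a red-flag keyword check:

```python
def choose_mode(esi_level: int, chief_complaint: str) -> str:
    # Hypothetical triage helper: ESI 1-2 or a classic red-flag complaint
    # skips the multi-round debate entirely.
    red_flags = ("chest pain", "stroke", "trauma")
    if esi_level <= 2 or any(f in chief_complaint.lower() for f in red_flags):
        return "emergency"
    return "debate"

print(choose_mode(4, "Chronic fatigue"))    # → debate
print(choose_mode(2, "Abdominal pain"))     # → emergency
print(choose_mode(3, "Chest pain, 2 hrs"))  # → emergency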

Handling No Consensus

If consensus is not reached after 3 rounds:
1. Flag for Human Review

state.flagged_for_human = True
UI shows a warning badge: ⚠ Requires Human Review
2. Return Best-Effort SOAP

Use the latest round’s outputs (most refined) even without consensus
3. Log Dissent Points

Display remaining issues in UI:
"dissent_log": [
  "Clinical agent suggests AMI but troponin negative",
  "Literature agent found conflicting studies on statin dosing"
]
4. Human-in-the-Loop

Doctor reviews outputs, makes edits, and can optionally trigger re-debate:
POST /api/human-feedback
{
  "soap_edits": {...},
  "re_debate": true,  // Optional: run 1 more round
  "max_iterations": 1
}

Example Debate Timeline

Round 1 (45s):
  • Clinical: DDx = CKD, Anemia, Orthostatic Hypotension
  • Literature: Found 4 PubMed papers on orthostatic hypotension in elderly
  • Safety: Flagged Lisinopril + Potassium hyperkalemia risk
  • Critic: ✗ No consensus - missing evidence for CKD claim (normal creatinine)
Round 2 (38s):
  • Clinical: Revised DDx = Anemia, Orthostatic Hypotension, Medication Side Effects (removed CKD)
  • Literature: Added 2 citations on ACE inhibitor side effects
  • Safety: Re-confirmed Lisinopril + K interaction
  • Critic: ✓ Consensus - all issues addressed, outputs coherent
Total Time: 83s
Final Output: 4-section SOAP note with 3 differentials, 6 citations, 2 safety flags

Configuration

# .env
MAX_DEBATE_ROUNDS=3          # Default: 3
OPENAI_MODEL=gpt-4o          # Clinical, Safety, Critic
OPENAI_FAST_MODEL=gpt-4o-mini # Literature
GROQ_API_KEY=...             # Optional: use Groq instead
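Reading these variables with their documented defaults can be sketched as follows; the helper name is illustrative, not the project's actual loader:

```python
import os

def load_debate_config(env=None) -> dict:
    # Apply the defaults documented above when a variable is unset.
    env = os.environ if env is None else env
    return {
        "max_rounds": int(env.get("MAX_DEBATE_ROUNDS", "3")),
        "model": env.get("OPENAI_MODEL", "gpt-4o"),
        "fast_model": env.get("OPENAI_FAST_MODEL", "gpt-4o-mini"),
        "use_groq": bool(env.get("GROQ_API_KEY")),
    }

print(load_debate_config({}))  # all defaults, Groq disabled
```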

Next Steps

Agent Types

Deep dive into each agent’s implementation and prompts

Safety System

Medical Error Prevention Panel and guardrails

Architecture

Layer-by-layer system overview
