Two Tiers of Review
- Mechanical (review_loop.py)
- Contextual (human + Opus)
Catches rule violations automatically:
- Multi-question responses
- Forbidden phrases (“before I can proceed”)
- Response length violations
- Empty sms_response fields
From the review system docs: “The mechanical tier is easy. The contextual tier is where the real value lives — things we didn’t know we needed.”
review_loop.py: Mechanical Tier
File: runtime/scripts/review_loop.py (715 lines)
What It Does
Ingests multiple signal sources, evaluates agent behavior against skill guidance, and produces lessons + operational alerts.
Signal Sources
Conversation logs
Path: runtime/conversations/{phone}/*.log
Pairs INBOUND→OUTBOUND exchanges from recent conversations.
PHI audit logs
Path: runtime/phi_access/{date}.log
Tracks blocked responses (PHI leakage) and unknown numbers.
Pending approvals
Path: families/{id}/pending_approvals.json
Flags stale approvals (>18h old).
Poller stdout
Captures errors and warnings from the message polling loop (tmux session caresupport).
Rule Checks
Multi-question check
Counts ? characters in outbound messages. If >1, generates a finding:
Severity: critical
Lesson: “One question per response, always. Never stack multiple questions in a single message.”
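A minimal sketch of the check described above, assuming a finding is a plain dict. The function name `check_multi_question` and the dict keys are illustrative and are not taken from review_loop.py:

```python
from typing import Optional

MULTI_QUESTION_LESSON = (
    "One question per response, always. "
    "Never stack multiple questions in a single message."
)


def check_multi_question(outbound: str) -> Optional[dict]:
    """Return a critical finding if the message contains more than one '?'."""
    question_marks = outbound.count("?")
    if question_marks > 1:
        return {
            "check": "multi_question",  # hypothetical key names
            "severity": "critical",
            "count": question_marks,
            "lesson": MULTI_QUESTION_LESSON,
        }
    return None
```

Counting `?` characters is deliberately crude: a URL with a query string would also trip it, which is the kind of false positive the contextual tier can sort out.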
Forbidden phrases check
Scans for:
- “before I can proceed”
- “before I save”
- “before I can help”
- “I need to know”
Lesson: “Never say ‘[phrase]’. Act on available information per social.md.”
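The phrase scan above could look roughly like this. The phrase list and lesson template come from this page; the function name and the case-insensitive matching are assumptions:

```python
from typing import List

FORBIDDEN_PHRASES = [
    "before I can proceed",
    "before I save",
    "before I can help",
    "I need to know",
]


def check_forbidden_phrases(outbound: str) -> List[str]:
    """Return one lesson per forbidden phrase found in the message.

    Matching is case-insensitive, which is an assumption of this sketch.
    """
    text = outbound.lower()
    return [
        f"Never say '{phrase}'. Act on available information per social.md."
        for phrase in FORBIDDEN_PHRASES
        if phrase.lower() in text
    ]
```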
User feedback check
Detects correction patterns in user messages:
- “that’s wrong” / “that’s not right”
- “don’t do that” / “stop doing”
- “I told you” / “I already said”
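One way to detect these correction patterns is a single case-insensitive regular expression over inbound messages. This is a sketch only: the names are hypothetical, and accepting both straight and curly apostrophes is an assumption about what the logs contain:

```python
import re
from typing import List

# Patterns from the list above; [’'] accepts curly or straight apostrophes.
CORRECTION_PATTERNS = [
    r"that[’']s wrong",
    r"that[’']s not right",
    r"don[’']t do that",
    r"stop doing",
    r"i told you",
    r"i already said",
]
_CORRECTION_RE = re.compile("|".join(CORRECTION_PATTERNS), re.IGNORECASE)


def find_corrections(inbound: str) -> List[str]:
    """Return every correction phrase found in a user message, in order."""
    return [match.group(0) for match in _CORRECTION_RE.finditer(inbound)]
```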
Response length check
Flags responses >500 characters.
Severity: info
Recommendation: “Keep SMS responses under 320 chars (2 segments) when possible.”
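As a sketch, the check reduces to a single length comparison. The 500-character threshold and the recommendation text are from this page; the function name and dict keys are hypothetical:

```python
from typing import Optional

FLAG_THRESHOLD = 500  # characters; responses longer than this are flagged


def check_response_length(outbound: str) -> Optional[dict]:
    """Return an info-level finding for responses over the threshold."""
    if len(outbound) > FLAG_THRESHOLD:
        return {
            "check": "response_length",  # hypothetical key names
            "severity": "info",
            "length": len(outbound),
            "recommendation": (
                "Keep SMS responses under 320 chars (2 segments) when possible."
            ),
        }
    return None
```

Note the gap between the two numbers: a response is only flagged past 500 characters, while the recommendation targets 320.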
Stale approvals check
Flags pending approvals >18h old.
Severity: warning
Recommendation: “Resolve or re-send confirmation SMS.”
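A hedged sketch of the staleness test. The schema of pending_approvals.json is not shown on this page, so the `id` and `created_at` fields (an ISO-8601 timestamp with an explicit UTC offset) are assumptions, as are the function and key names:

```python
import json
from datetime import datetime, timedelta
from typing import List

STALE_AFTER = timedelta(hours=18)


def find_stale_approvals(raw_json: str, now: datetime) -> List[dict]:
    """Return a warning finding for each approval older than 18 hours.

    Assumes the file holds a JSON list of objects with hypothetical
    "id" and "created_at" fields; `now` must be timezone-aware.
    """
    findings = []
    for approval in json.loads(raw_json):
        created = datetime.fromisoformat(approval["created_at"])
        if now - created > STALE_AFTER:
            findings.append({
                "check": "stale_approval",
                "severity": "warning",
                "approval_id": approval["id"],
                "recommendation": "Resolve or re-send confirmation SMS.",
            })
    return findings
```

Passing `now` in as a parameter, rather than calling `datetime.now()` inside the function, keeps the check deterministic under test.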
Usage
Output Format
- Human-readable digest
- JSON output (--json)
Where Lessons Go
Without --stage, lessons are written immediately via append_lessons():
- Family-specific: families/{id}/lessons.md (if --family specified)
- Global: runtime/learning/lessons.md (if no family specified)
With --stage, lessons go to families/{id}/staging/reviews/{timestamp}.json instead.
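The routing above can be summarized as a small path-resolution function. This is an illustrative sketch, not the logic inside append_lessons(): the name `lesson_target` is hypothetical, and raising an error when --stage is used without a family is an assumption, since this page does not say what happens in that case:

```python
from pathlib import PurePosixPath
from typing import Optional


def lesson_target(family_id: Optional[str], stage: bool, timestamp: str) -> PurePosixPath:
    """Resolve where a review run's lessons would be written."""
    if stage:
        if family_id is None:
            # Assumption: staged output lives under a family directory.
            raise ValueError("staged reviews need a family id")
        return (PurePosixPath("families") / family_id / "staging"
                / "reviews" / f"{timestamp}.json")
    if family_id is not None:
        return PurePosixPath("families") / family_id / "lessons.md"
    return PurePosixPath("runtime/learning/lessons.md")
```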
review_staging.py: Staging Workflow
File: runtime/scripts/review_staging.py (843 lines)
The Problem
Every review_loop run writes lessons directly to real files. Testing = mutating production data. Resetting means deleting real corrections.
The Fix
A scratch pad. Nothing touches production until you explicitly say “promote this.” Only things that survive scrutiny become permanent.
Three Piles
- staging/reviews/
- staging/saved/
- staging/proposals/
Disposable test output. Accumulates with each --stage run. Cleared on reset.
Lifecycle: Short-lived. Most reviews are noise.
Commands
snapshot
staging/baseline/. This is your safety net: it locks the starting point before testing.
restore
reviews/. (Use reset instead unless you have a reason.)
save
reviews/ to saved/ with a human-readable name. Flags interesting material before it gets cleared.
promote
families/kano/lessons.md). This is the one-way door: the only moment real files change.
Staged Testing Protocol
Full workflow (follow this exactly)
Setup (once per testing session):
Test loop (repeat as many times as needed):
Then iterate. Change the agent, run more tests, save what’s interesting, reset, repeat. saved/ accumulates across cycles. reviews/ gets cleared each time.
When ready (one-way door):
Pushes approved lessons to production. This is the only moment real files change.
Resist the Shiny Object
From the docs:
During testing you WILL notice gaps — missing member lifecycle states, flows with no protocol, features that seem obvious to build right now. DO NOT stop testing to build them. Save the observation to saved/ and keep going.
Reasons:
- You’re in a test cycle. Momentum matters more than plumbing.
- You don’t have enough signal yet. Real interactions reveal what the abstraction needs to be.
- The observation is more valuable than the implementation right now.
Save these observations to saved/ with "type": "process_observation". They graduate to Linear when there’s enough signal to spec them properly, not before.
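For illustration, a saved observation might be serialized like this. Only the "type": "process_observation" pair comes from this page; the `note` and `observed_at` fields and the function name are hypothetical:

```python
import json


def make_process_observation(note: str, observed_at: str) -> str:
    """Serialize an observation for saving under staging/saved/."""
    record = {
        "type": "process_observation",  # from the docs
        "note": note,                    # hypothetical field
        "observed_at": observed_at,      # hypothetical field
    }
    return json.dumps(record, indent=2)
```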
Graduation Pipeline
Lessons in lessons.md are temporary. They’re meant to graduate to permanent locations:
- Factual lessons → family.md or members/{name}.md
- Behavioral lessons → skills/social.md or skills/scheduling.md
- Operational lessons → capabilities.md
Why Graduate?
- lessons.md is capped at 20-30 entries (to prevent prompt bloat)
- Eviction is lossy — when new lessons arrive, old ones are removed
- Permanent files don’t get evicted — facts belong in profiles, protocols belong in skills
Workflow
Classify lessons
Reads lessons.md and classifies each entry:
- Factual: mentions a member name → members/{name}.md
- Factual: general family context → family.md
- Behavioral: contains scheduling keywords → skills/scheduling.md
- Behavioral: other → skills/social.md
- Operational: system constraints → capabilities.md
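The routing table above can be sketched as one function. The actual scheduling keyword list is not given on this page, so `SCHEDULING_KEYWORDS` is a placeholder; the lowercase member filename follows the members/degitu.md example elsewhere on this page, and the function name is hypothetical:

```python
from typing import Iterable

# Placeholder keywords: the real list is not documented here.
SCHEDULING_KEYWORDS = ("schedule", "appointment", "calendar")


def graduation_target(category: str, text: str, member_names: Iterable[str]) -> str:
    """Map a lesson to the permanent file it would graduate to."""
    lowered = text.lower()
    if category == "factual":
        for name in member_names:
            if name.lower() in lowered:
                return f"members/{name.lower()}.md"
        return "family.md"
    if category == "behavioral":
        if any(keyword in lowered for keyword in SCHEDULING_KEYWORDS):
            return "skills/scheduling.md"
        return "skills/social.md"
    if category == "operational":
        return "capabilities.md"
    raise ValueError(f"unknown category: {category}")
```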
Classification Rules
- [factual] category
- [behavioral] category
- [operational] category
- Mentions a member name → members/{name}.md (Personal Context section)
- No member name → family.md (Care Preferences & Personality section)
- Deduplication: skips if content already exists in target file (>60% word overlap)
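This page does not define how the 60% word overlap is computed. One plausible reading, sketched here as an assumption, is the fraction of the lesson's distinct words that already appear in the target file:

```python
import re


def word_overlap(lesson: str, target_text: str) -> float:
    """Fraction of the lesson's distinct words that appear in the target."""
    lesson_words = set(re.findall(r"\w+", lesson.lower()))
    if not lesson_words:
        return 0.0
    target_words = set(re.findall(r"\w+", target_text.lower()))
    return len(lesson_words & target_words) / len(lesson_words)


def is_duplicate(lesson: str, target_text: str, threshold: float = 0.6) -> bool:
    """Skip the lesson when overlap is strictly above the threshold."""
    return word_overlap(lesson, target_text) > threshold
```

Measuring against the lesson's own word set (rather than the union of both texts) means a short lesson can count as a duplicate of a long file, which fits the stated goal of skipping content that already exists in the target.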
Safety Gates
Age check: Lessons younger than 2 days are skipped (too recent, not enough signal).
Deduplication: If content already exists in target file, the lesson is skipped.
SOUL.md protection: Merging into SOUL.md requires the --force flag (it’s the core identity, changes should be deliberate).
Contextual Review Workflow
The mechanical tier catches rule violations. The contextual tier finds things we didn’t know to look for.
Workflow
Run staged review with full transcript
Output lands in staging/reviews/{timestamp}.json.
Read the transcript (human or Opus)
Look for:
- Contradictions: Agent calling someone both “grandmother” and “aunt”
- Missing context: Confusion caused by incomplete member profiles
- Protocol gaps: Flows that should exist but don’t
- Process observations: Things nobody thought to codify
Write lessons (or proposals)
- For immediate corrections: Add to lessons.md directly
- For structural changes: Write to staging/proposals/ (markdown, not schema)
- For interesting reviews: Save before reset
Why Proposals Stay Markdown
From the docs:
If we constrain what “good output” looks like, we lose these. proposals/ stays markdown (not schema) so Opus can surface whatever it notices. The whole point is to not constrain what gets found.
The most valuable findings aren’t rule violations — they’re things we didn’t know we needed.
Review Outputs
Findings Structure
Lesson Writing
Generated lessons are written via append_lessons():
Eviction on Write
When lessons exceed the max, category-aware eviction removes oldest entries while preserving diversity:
- Each category ([behavioral], [factual], [operational]) is guaranteed at least 1 slot
- Slots are distributed proportionally to how many entries each category has
- Oldest entries within each category are removed first
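The three rules above leave the exact proportional formula open. The sketch below is one possible interpretation, not the implementation in append_lessons(): every category gets one slot, then spare slots go one at a time to the category with the most entries per slot already held. All names are hypothetical, and it assumes the cap is at least the number of categories present:

```python
from collections import Counter


def allocate_slots(counts, max_entries):
    """Give each category 1 slot, then hand out spare slots proportionally.

    Assumes max_entries >= number of categories present.
    """
    slots = {cat: 1 for cat, n in counts.items() if n > 0}
    spare = max_entries - len(slots)
    while spare > 0:
        # Categories that still have more entries than slots.
        open_cats = [c for c in slots if slots[c] < counts[c]]
        if not open_cats:
            break
        # The highest entries-per-slot ratio gets the next slot.
        neediest = max(open_cats, key=lambda c: counts[c] / slots[c])
        slots[neediest] += 1
        spare -= 1
    return slots


def evict(lessons, max_entries):
    """lessons: list of (category, text), oldest first. Returns kept entries."""
    if len(lessons) <= max_entries:
        return list(lessons)
    slots = allocate_slots(Counter(cat for cat, _ in lessons), max_entries)
    kept = []
    for cat, text in reversed(lessons):  # walk newest first
        if slots[cat] > 0:
            kept.append((cat, text))
            slots[cat] -= 1
    kept.reverse()  # restore oldest-first order
    return kept
```

With 5 behavioral, 4 factual, and 1 operational entry and a cap of 6, this allocation keeps 3, 2, and 1 respectively, so the lone operational lesson survives even though it is among the oldest.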
Real Example: Family Tree Confusion
Conversation:
Mechanical tier (review_loop.py):
Output:
No lesson auto-generated; needs contextual review.
Contextual review:
- Agent has "relationship": "grandmother" in member profile
- User corrected it mid-conversation
- Agent used self_corrections to capture the fix
- Lesson written to families/kano/lessons.md:
- Update members/degitu.md with correct relationship
- Graduate the lesson via graduate + merge workflow
Key Principles
Mechanical tier is fast, limited scope. Catches rule violations automatically.
Contextual tier is slow, high value. Finds things we didn’t know to look for.
Staging prevents accidental mutation. Nothing touches production until explicitly promoted.
Graduation moves lessons to permanent homes. Prevents eviction, keeps prompts lean.