The problem: context windows are budget-constrained
GenieHelper runs on a 16GB RAM server. The inference model (Qwen 2.5 7B / Dolphin 3 8B) pins roughly 4.8GB of RAM. The remaining headroom is shared across active sessions, BullMQ job queues, Directus, PostgreSQL, and Redis. This means context windows are not infinitely expandable: injecting everything the retrieval pipeline surfaces would push RAM usage into swap or cause an OOM kill. More importantly, bloating the context window with low-value chunks degrades LLM output quality — a well-documented failure mode known as “Lost in the Middle”, where relevant content buried in a large context gets ignored by the model.

How Shannon entropy scoring works
Shannon entropy measures information density. Applied to text, it answers: how much unique information does this chunk contain?

Character-level entropy:

H_char = -Σ_c p(c) · log₂ p(c)

where p(c) is the probability of each unique character in the text. Range: 0 (a single repeated character) to ~5.5+ (dense, varied data like payout formulas or platform-specific rules).
Token-level entropy (semantic richness):

H_token = -Σ_t p(t) · log₂ p(t)

where p(t) is the probability of each normalized word token. High token entropy means many unique tokens — indicating varied, specific content rather than repetitive prose.
GenieHelper uses a weighted blend of the two — 60% token entropy, 40% character entropy — normalized to a 0–1 score per chunk.
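Under the definitions above, the scoring pipeline can be sketched as follows. This is a minimal sketch, not GenieHelper's actual implementation: the `\w+` tokenizer and the normalization caps (6.0 bits for tokens, 5.5 bits for characters, taken from the ranges quoted in this document) are assumptions; the function names mirror the Key functions table.

```python
import math
import re
from collections import Counter


def _entropy(counts: Counter, total: int) -> float:
    # H = -sum(p * log2(p)) over the frequency distribution
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def calculate_shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (0 to ~5.5+)."""
    if not text:
        return 0.0
    return _entropy(Counter(text), len(text))


def calculate_token_entropy(text: str) -> float:
    """Token-level entropy over lowercased word tokens (0 to ~6+)."""
    tokens = re.findall(r"\w+", text.lower())  # assumed normalization
    if not tokens:
        return 0.0
    return _entropy(Counter(tokens), len(tokens))


def score_chunk(chunk: dict) -> float:
    """Weighted blend (60% token, 40% char), normalized to 0-1.
    The caps (6.0 / 5.5 bits) are assumed normalization constants."""
    text = chunk["text"]
    token_h = min(calculate_token_entropy(text) / 6.0, 1.0)
    char_h = min(calculate_shannon_entropy(text) / 5.5, 1.0)
    return 0.6 * token_h + 0.4 * char_h
```

With this blend, a payout-rate chunk full of distinct numbers scores well above repetitive filler, which is the separation the benchmarks below depend on.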
Entropy benchmarks
What gets a high entropy score?
High-entropy content (score ≥ 0.65) contains unique, specific data:
- Platform payout rate tables ("Slushy: 80% net, weekly, min $25, holds: 14 days")
- Creator-specific scheduling rules ("Post yoga content Tue/Thu 6-8 PM — peak engagement per 90-day analytics")
- Policy details with specific numbers ("OnlyFans subscription price floor: $4.99, ceiling: $49.99")
- Technical configuration ("BullMQ concurrency: 1, Redis maxmemory 2gb, eviction: allkeys-lru")
What gets a low entropy score?
Low-entropy content (score ≤ 0.35) contains repetitive boilerplate:
- Generic greetings and transitions
- Repeated instructional phrases ("To do this, follow these steps. First, open the settings. Then...")
- Redundant summaries of information already in the context window
- Empty or near-empty node content
Context pruning: filling the budget
Once all candidate chunks are scored, `prune_to_budget()` selects the highest-entropy chunks that fit within the token budget.
Token counts are estimated with a simple character heuristic (`CHARS_PER_TOKEN = 4`) to avoid a tokenizer dependency.
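A greedy version of this selection might look like the following. This is a sketch under stated assumptions — chunks are dicts carrying `text` and a precomputed `entropy` score, and the `min_entropy` default is illustrative — not the actual body of `context_pruner.py`.

```python
CHARS_PER_TOKEN = 4  # rough heuristic; avoids a tokenizer dependency


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def prune_to_budget(chunks, max_tokens, min_entropy=0.35):
    """Greedily keep the highest-entropy chunks that fit the budget.

    Chunks are assumed to be dicts with 'text' and 'entropy' keys.
    """
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["entropy"], reverse=True):
        if chunk["entropy"] < min_entropy:
            break  # sorted descending, so everything after is below the floor
        cost = estimate_tokens(chunk["text"])
        if used + cost <= max_tokens:
            selected.append(chunk)
            used += cost
    return selected
```

Greedy-by-entropy is not globally optimal packing, but it is cheap and predictable, which matters on a RAM-constrained box.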
The eviction report
Every pruning pass produces an `eviction_report` — a structured summary of what was kept versus dropped:
The report is written to `retrieval-performance.log` and is fully auditable. If a retrieval result seems wrong — the agent answered something it should have known — you can inspect the eviction report to see whether the relevant node was in the candidate set but pruned for budget reasons.
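A minimal shape for such a report, assuming the same chunk dicts as above (the field names here are illustrative, not the actual log schema):

```python
def eviction_report(chunks, selected):
    """Structured diff of kept vs. evicted chunks, suitable for logging.

    Assumes chunks are dicts with 'text' and 'entropy' keys; 'selected'
    is the subset returned by the pruning pass.
    """
    selected_ids = {id(c) for c in selected}
    kept, evicted = [], []
    for chunk in chunks:
        entry = {
            "entropy": chunk["entropy"],
            "preview": chunk["text"][:60],  # short excerpt for audit logs
        }
        (kept if id(chunk) in selected_ids else evicted).append(entry)
    return {
        "kept": kept,
        "evicted": evicted,
        "kept_count": len(kept),
        "evicted_count": len(evicted),
    }
```

Keeping a per-chunk entropy score and preview in the report is what makes the "was it pruned for budget?" question answerable after the fact.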
CRAG: Corrective RAG
After entropy gating produces the final context set, CRAG (Corrective RAG) validates that context before injection. The agent grades each retrieved chunk for relevance to the actual query:

- High confidence: chunk is clearly relevant → injected normally
- Low confidence: chunk relevance is uncertain → trigger a fallback:
  - Web search fallback — if the information needed exists on the public web (platform policy changes, current events, pricing updates), trigger a web search via the Stagehand or PinchTab MCP tools
  - HITL escalation — if web search is insufficient or the query requires human judgment, push to the `hitl_sessions` collection for human review before responding
CRAG is the mechanism by which GenieHelper avoids confident hallucination. Rather than injecting low-confidence context and letting the LLM generate a plausible-sounding answer, the system explicitly surfaces its uncertainty and routes to a human or a live source.
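The routing described above can be sketched as a simple decision function. This is illustrative only: the threshold, signature, and `web_answerable` flag are assumptions, and in GenieHelper the relevance grading is performed by the agent itself, not a numeric comparison.

```python
from enum import Enum


class Route(Enum):
    INJECT = "inject"          # high confidence: use the chunk as-is
    WEB_SEARCH = "web_search"  # low confidence, answer likely on the public web
    HITL = "hitl"              # needs human judgment before responding


def route_chunk(relevance: float, web_answerable: bool,
                high: float = 0.7) -> Route:
    """Map a relevance grade (0-1) to a CRAG action.

    The 0.7 threshold is a hypothetical value for illustration.
    """
    if relevance >= high:
        return Route.INJECT
    if web_answerable:
        return Route.WEB_SEARCH
    return Route.HITL
```

The key property is that no branch silently injects uncertain context: every low-confidence chunk is either replaced by a live source or escalated.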
Implementation files
Key functions
| Function | File | Description |
|---|---|---|
| `calculate_shannon_entropy(text)` | `shannon_filter.py` | Character-level H score; range 0–5.5+ |
| `calculate_token_entropy(text)` | `shannon_filter.py` | Token-level H score; range 0–6+ |
| `score_chunk(chunk)` | `shannon_filter.py` | Weighted blend (60% token, 40% char); range 0–1 |
| `annotate_chunks(chunks)` | `shannon_filter.py` | Add `entropy` key to all chunks in-place |
| `prune_to_budget(chunks, max_tokens, min_entropy)` | `context_pruner.py` | Select highest-entropy chunks within token budget |
| `eviction_report(chunks, selected)` | `context_pruner.py` | Structured diff of kept vs. evicted chunks |