This section covers how genieChat.js routes requests, how the context window is managed under a 16 GB RAM ceiling, and how JIT skill hydration keeps the agent competent without blowing the context budget.
Installing Ollama and pulling models
Install Ollama
Ollama runs as a systemd service on the server. Install it with the official script, then verify the service is running.
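The command listings were evidently stripped from this page. On a typical systemd host they are (the one-liner is Ollama's documented install script; the `enable --now` line is only needed if the installer did not start the unit):

```shell
# Official install script: downloads the binary and registers a systemd unit
curl -fsSL https://ollama.com/install.sh | sh

# Verify the service is active
systemctl status ollama --no-pager

# Start it on boot if the installer did not already do so
sudo systemctl enable --now ollama
```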
Pull the three GenieHelper models
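The pull commands themselves are missing here; using the model tags named throughout this document, they would be:

```shell
ollama pull dolphin3:8b-llama3.1-q4_K_M
ollama pull dolphin-mistral:7b
ollama pull qwen-2.5:latest

# Confirm all three are available locally
ollama list
```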
The three models in detail
dolphin3:8b-llama3.1-q4_K_M — orchestrator
Role: Multi-step reasoning, tool planning, ACTION tag emission.
Why this model: Dolphin 3 is fine-tuned for instruction following and function-calling-style tasks. It reliably structures multi-step plans and emits the [ACTION:slug:{"params"}] tags that the Action Runner intercepts.
RAM footprint: ~5 GB at q4_K_M quantization.
Typical latency: 3–6 s/call on CPU.
Use cases:
- Decomposing a creator request (“scrape my OnlyFans stats and draft a post”) into sequential ACTION steps
- Deciding which BullMQ job type to enqueue
- Multi-step Stagehand navigation flows
dolphin-mistral:7b — content writer
Role: Uncensored adult content drafting.
Why this model: Dolphin Mistral is a Mistral 7B fine-tune with the safety filters removed. It writes captions, fan messages, custom request responses, and post concepts in the creator’s configured voice without refusing adult content.
RAM footprint: ~4.5 GB.
Typical latency: 2–4 s/call on CPU.
Use cases:
- Caption generation for media posts
- Fan message replies (per-fan voice persona applied via system prompt)
- Custom content request descriptions
- Creator persona drafting
qwen-2.5:latest — primary agent
Role: AnythingLLM agent workspace, code generation, structured JSON.
Why this model: Qwen 2.5 7B has strong instruction following, produces reliable JSON output, and is the default model for MCP tool calls via the ai.ollama plugin.
RAM footprint: ~4.8 GB pinned at idle.
Typical latency: 2–5 s/call on CPU.
Use cases:
- Primary chat agent for creator sessions
- Code generation for automation rules
- JSON-structured data extraction via Stagehand
- Tool call routing through AnythingLLM’s agent workspace
Model selection in genieChat.js
The routing logic lives in server/endpoints/api/genieChat.js. The decision tree is straightforward: the OLLAMA_MODEL environment variable sets the default (fallback: qwen-2.5:latest), and the ai.ollama MCP plugin passes the model explicitly per call.
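The precedence order can be summarized in a short sketch, written here in Python for brevity (the real logic is JavaScript in genieChat.js; the function name and constant are illustrative, not from the source):

```python
import os

DEFAULT_MODEL = "qwen-2.5:latest"  # hard fallback named in this document

def select_model(explicit_model=None):
    """Sketch of the genieChat.js decision tree.

    1. A model passed explicitly per call (e.g. by the ai.ollama MCP plugin) wins.
    2. Otherwise the OLLAMA_MODEL environment variable sets the default.
    3. Otherwise fall back to qwen-2.5:latest.
    """
    if explicit_model:
        return explicit_model
    return os.environ.get("OLLAMA_MODEL", DEFAULT_MODEL)
```

For example, a per-call override such as `select_model("dolphin-mistral:7b")` always returns the explicit tag, regardless of the environment.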
Context window management
The 16 GB RAM ceiling imposes a hard constraint on how much context can be active at once. The system manages this through three mechanisms.
BullMQ concurrency: 1
All three BullMQ queues — media-jobs, scrape-jobs, and onboarding-jobs — run with concurrency: 1. This prevents multiple Ollama inference calls from stacking up and saturating RAM.
Shannon entropy gating
Before injecting retrieved context into the agent prompt, nodes are ranked by information entropy. High-redundancy nodes are evicted. The memory/retrieval/entropy/ module (prune_to_budget, shannon_filter, context_pruner) enforces this budget.
The eviction_report surfaces what was dropped and why, so the agent knows what it doesn’t have access to in the current session.
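A minimal sketch of the idea follows. The real prune_to_budget and shannon_filter live in memory/retrieval/entropy/ and certainly differ in detail; this standalone version assumes nodes are plain strings and that the budget is counted in characters:

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Bits per character of the text's character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prune_to_budget(nodes, budget_chars):
    """Keep the highest-entropy (least redundant) nodes that fit the budget.

    Returns (kept, eviction_report): the report lists what was dropped and
    why, in the spirit of the eviction_report described above.
    """
    ranked = sorted(nodes, key=shannon_entropy, reverse=True)
    kept, evicted, used = [], [], 0
    for node in ranked:
        if used + len(node) <= budget_chars:
            kept.append(node)
            used += len(node)
        else:
            evicted.append({"node": node, "reason": "over context budget"})
    return kept, evicted
```

A highly repetitive node (e.g. `"aaaaaaaaaa"`) scores near zero entropy and is the first to be evicted when the budget is tight.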
CRAG — Corrective RAG
Retrieved context is graded for relevance before injection. Low-confidence retrievals do not enter the prompt — they either trigger a web search fallback or escalate to the HITL queue. The agent does not hallucinate answers from stale or irrelevant context.
JIT skill hydration
GenieHelper stores 191 procedural skills across 11 categories in a DuckDB skill graph at memory/core/agent_memory.duckdb. The graph has 252 nodes and over 12,880 edges connecting skills by semantic proximity.
The problem with preloading: Injecting all 191 skills into every session context would exhaust the context budget before the user types a word.
The solution: Just-In-Time hydration. At session start, surgical_context.py runs stimulus propagation across the graph to surface only the skills relevant to the incoming task.
How the propagation works
Seed node selection
The task description is embedded and matched against the skill graph. The closest node becomes the propagation seed.
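The seed step can be sketched with plain cosine similarity. The embedding function and graph contents here are stand-ins, not what surgical_context.py actually uses:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pick_seed(task_embedding, node_embeddings):
    """Return the skill-graph node closest to the embedded task description."""
    return max(node_embeddings,
               key=lambda n: cosine(task_embedding, node_embeddings[n]))
```

With two hypothetical skill nodes embedded as `[1, 0]` and `[0, 1]`, a task vector of `[0.9, 0.1]` seeds propagation at the first node.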
Stimulus propagation
Activation spreads from the seed node across graph edges using a Leaky Integrate-and-Fire (LIF) neuron model. Edges strengthen with use (Hebbian reinforcement), so frequently co-activated skills cluster together over time.
Top-N selection
The top-N nodes by final activation score are returned as skill keys. Only these skills are loaded into the agent’s context window for the session.
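Starting from a chosen seed, the propagation and top-N steps can be sketched together. This is a toy LIF spread over an adjacency map; the leak rate, firing threshold, step count, and graph shape are all illustrative, and the real model additionally strengthens edges on use (Hebbian reinforcement), which is omitted here:

```python
def propagate(adjacency, seed, steps=3, leak=0.5, threshold=0.2):
    """Leaky Integrate-and-Fire style activation spread.

    adjacency: {node: {neighbor: edge_weight}}
    Each step, every node at or above the threshold 'fires' and pushes
    weight-scaled activation to its neighbors; all nodes then leak.
    """
    activation = {n: 0.0 for n in adjacency}
    activation[seed] = 1.0
    for _ in range(steps):
        incoming = {n: 0.0 for n in adjacency}
        for node, level in activation.items():
            if level >= threshold:  # integrate-and-fire: only firing nodes emit
                for neighbor, weight in adjacency[node].items():
                    incoming[neighbor] += level * weight
        # leak old charge, then integrate new input
        activation = {n: activation[n] * leak + incoming[n] for n in adjacency}
    return activation

def top_n_skills(activation, n=5):
    """Return the n most activated skill keys for context hydration."""
    return sorted(activation, key=activation.get, reverse=True)[:n]
```

On a tiny hypothetical graph, a seed at a scraping skill activates its strongly connected neighbor while an unconnected skill stays at zero and is never hydrated into context.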
Skill categories
The 191 skills span 11 categories:

| Category | Examples |
|---|---|
| Scraping | OnlyFans profile scrape, Fansly subscriber fetch |
| Media | Watermark apply, teaser clip, thumbnail generate |
| Publishing | Schedule post, publish to platform |
| Connect | Platform auth, cookie injection, session rehydration |
| Profile | Creator persona update, content boundaries |
| Analytics | Earnings summary, engagement report |
| Settings | Tier limits, API key rotation |
| Comms | Fan message draft, broadcast segment |
| Memory | Node activation, graph query |
| Taxonomy | Tag content, map term, rebuild graph |
| Admin | RBAC sync, user provisioning |