Plugin ID: ai.ollama | Version: 1.1.0 | Tools: 3

The ai.ollama plugin exposes GenieHelper's local Ollama inference layer to the agent as MCP tools. All inference runs on the local VPS: no tokens leave the server, no API keys are required, and there is no content moderation layer that would reject adult content requests.
Ollama runs as a systemd service on port 11434. It is not a Docker container. The plugin connects to http://127.0.0.1:11434 by default, configurable via OLLAMA_URL.

Tools

generate — single-turn completion

Sends a single prompt to Ollama and returns the completion. This is the right tool for stateless, one-shot generation tasks. Parameters:
  • prompt (required) — the user prompt
  • system (optional) — system instruction injected before the prompt
  • model (optional) — overrides OLLAMA_MODEL for this call
  • temperature (optional) — sampling temperature (default 0.7)
  • max_tokens (optional) — maximum tokens to generate
When to use generate:
  • Generating captions for a single media asset
  • Classifying a piece of content against a schema
  • Summarising scraped platform data
  • Drafting a single-message push notification
  • Any task where conversation history is irrelevant
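Assuming the plugin forwards to Ollama's native POST /api/generate endpoint (where max_tokens corresponds to Ollama's num_predict option), the request a generate call produces can be sketched as follows. Helper names like build_generate_payload are illustrative, not part of the plugin's API:

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default; overridable via OLLAMA_URL env


def build_generate_payload(prompt, system=None, model="dolphin3:8b",
                           temperature=0.7, max_tokens=None):
    """Map the generate tool's parameters onto Ollama's /api/generate body."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
        "options": {"temperature": temperature},
    }
    if system is not None:
        body["system"] = system
    if max_tokens is not None:
        body["options"]["num_predict"] = max_tokens  # Ollama's name for max_tokens
    return body


def generate(prompt, **kwargs):
    """POST the payload and return the completion text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_generate_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Setting "stream": False keeps the tool call simple: Ollama returns a single JSON object whose response field holds the full completion.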

chat — multi-turn conversation

Sends a full messages array ([{role, content}]) to the Ollama chat endpoint. Maintains conversational coherence across multiple exchanges by including prior turns. Parameters:
  • messages (required) — JSON array of {role: "user"|"assistant"|"system", content: string}
  • system (optional) — system prompt prepended to the messages array
  • model (optional) — overrides OLLAMA_MODEL for this call
  • temperature (optional) — sampling temperature
When to use chat:
  • Drafting a back-and-forth fan message conversation where previous replies shape tone
  • Multi-step creative writing where each generation builds on the last
  • Iterative refinement loops where the agent critiques and improves its own output
  • Simulating a creator’s voice across multiple platform reply threads
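Assuming the plugin forwards to Ollama's native POST /api/chat endpoint, and given that the optional system prompt is prepended to the messages array, the payload can be sketched as follows (build_chat_payload is an illustrative helper, not a plugin API):

```python
def build_chat_payload(messages, system=None, model="dolphin3:8b", temperature=None):
    """Map the chat tool's parameters onto Ollama's /api/chat body.

    `system`, when given, is prepended as a system-role message so later
    user/assistant turns are interpreted under that instruction.
    """
    msgs = ([{"role": "system", "content": system}] if system else []) + list(messages)
    body = {"model": model, "messages": msgs, "stream": False}
    if temperature is not None:
        body["options"] = {"temperature": temperature}
    return body
```

Because the caller supplies the full messages array each time, the plugin itself stays stateless: conversational memory lives in the agent, not in the tool.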

list-models — available models

Lists all Ollama models currently pulled on the local server. Returns model name, disk size, and last modified timestamp. When to use list-models:
  • Verifying a required model is available before starting a generation task
  • Selecting the most appropriate model for a task based on what is currently loaded
  • Debugging inference failures when the default model may have been removed
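Ollama exposes the pulled-model inventory at GET /api/tags, where each entry carries name, size (bytes), and modified_at. Assuming list-models is a thin wrapper over that endpoint, reducing the response to the fields the tool reports might look like this (summarize_models is an illustrative helper):

```python
def summarize_models(tags_response):
    """Reduce a GET /api/tags response to (name, size_gb, modified_at) tuples."""
    return [
        (m["name"], round(m["size"] / 1e9, 1), m["modified_at"])
        for m in tags_response["models"]
    ]
```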

Model selection

GenieHelper runs three primary models, each with a specific role:

dolphin3:8b

Role: Orchestrator. Default model (OLLAMA_MODEL env). Used for tool planning, ACTION emission, and agent coordination. Llama 3.1 base with a Dolphin fine-tune for instruction following.

dolphin-mistral:7b

Role: Content writer. Uncensored Mistral fine-tune. Used for caption writing, fan message drafting, and any task requiring creative adult content generation without content policy restrictions.

qwen-2.5:latest

Role: Agent / code. Primary AnythingLLM agent model. Used for JSON generation, code tasks, and structured output. Also the default STAGEHAND_MODEL for browser automation.

Custom selection

Per-call override: both generate and chat accept a model parameter that overrides OLLAMA_MODEL for a single call. Use this to route content writing tasks to dolphin-mistral:7b without changing the global default.
The default model is set by the OLLAMA_MODEL environment variable. If the variable is not set, the plugin falls back to qwen-2.5:latest. The orchestrator model (dolphin3:8b-llama3.1-q4_K_M) must be set explicitly via the env var.
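The resolution order described above (per-call model parameter, then OLLAMA_MODEL, then the hard-coded fallback) can be sketched as follows; resolve_model is an illustrative helper, not part of the plugin:

```python
import os

FALLBACK_MODEL = "qwen-2.5:latest"  # used only when OLLAMA_MODEL is unset


def resolve_model(per_call_model=None):
    """Precedence: per-call `model` param > OLLAMA_MODEL env > fallback."""
    return per_call_model or os.environ.get("OLLAMA_MODEL") or FALLBACK_MODEL
```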

Performance characteristics

All inference is CPU-bound on the IONOS dedicated server (16 GB RAM).
Model                Approx. latency   RAM usage
qwen-2.5:7b (q4)     2–5 s/call        ~4.8 GB
dolphin-mistral:7b   3–6 s/call        ~5.2 GB
dolphin3:8b          4–8 s/call        ~5.8 GB
Running multiple concurrent Ollama calls with large models risks OOM against the 16 GB RAM ceiling, so the plugin is configured with concurrency: 2. The timeout_ms: 120000 (2 minutes) setting accommodates slow CPU inference without false timeouts.
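One way such limits can be enforced on the plugin side is a semaphore plus a timeout wrapper. This is a minimal asyncio sketch of the idea, not the plugin's actual implementation; bounded_call is a hypothetical name:

```python
import asyncio

OLLAMA_CONCURRENCY = 2     # plugin setting: concurrency: 2
OLLAMA_TIMEOUT_S = 120.0   # plugin setting: timeout_ms: 120000

_sem = asyncio.Semaphore(OLLAMA_CONCURRENCY)


async def bounded_call(infer_fn, *args):
    """Hold at most 2 in-flight Ollama requests; fail after 2 minutes."""
    async with _sem:  # queue excess calls instead of stacking models in RAM
        return await asyncio.wait_for(infer_fn(*args), timeout=OLLAMA_TIMEOUT_S)
```

Calls beyond the limit simply wait for a slot rather than erroring, which trades latency for a bounded memory footprint.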

Configuration

Setting         Value
Base URL        http://127.0.0.1:11434 (configurable via OLLAMA_URL)
Default model   OLLAMA_MODEL env (fallback: qwen-2.5:latest)
Timeout         120,000 ms (2 minutes)
Concurrency     2
Auth            None (localhost only)
