BaseChatModel-compatible LLM. You choose at startup whether to use a local model via Ollama or a cloud provider that speaks the OpenAI API — or mix them per agent, as main.py does.
## LLM configuration
- Local (Ollama)
- Cloud (OpenAI-compatible)
To use a local model: install the Ollama integration, start Ollama and pull a model, then configure the model in `.env` or initialize it in code.
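As a sketch, the setup commands might look like the following (the package name `langchain-ollama` is the standard LangChain integration package, and `llama3` matches the `LOCAL_MODEL` default used elsewhere in this document; adjust both to your setup):

```shell
# Install the LangChain <-> Ollama integration package
pip install langchain-ollama

# Start the Ollama server (leave this running in another terminal)
ollama serve

# Pull the model named in LOCAL_MODEL
ollama pull llama3
```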
## Environment variables reference
All variables come from `.env.example`. Copy that file to `.env` and fill in your values:
| Variable | Example value | Purpose |
|---|---|---|
| `LOCAL_MODEL` | `llama3` | Ollama model name passed to `ChatOllama(model=...)` |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `SUMMARY_HOST` | `https://api.routeway.ai/v1` | Base URL for the cloud LLM (OpenAI-compatible) |
| `SUMMARY_MODEL` | `nemotron-nano-9b-v2:free` | Model name for the cloud LLM |
| `SUMMARY_AGENT_API_KEY` | `****` | API key for the cloud provider |
| `REPORT_HOST` | `https://api.llmapi.ai/v1` | Alternative host for a report-generation model |
| `REPORT_MODEL` | `llama-3-8b-instruct` | Model name for the report provider |
| `REPORT_AGENT_API_KEY` | `****` | API key for the report provider |
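Putting the table together, a filled-in `.env` might look like this (the key values are placeholders; everything else mirrors the example values above):

```shell
LOCAL_MODEL=llama3
OLLAMA_HOST=http://localhost:11434
SUMMARY_HOST=https://api.routeway.ai/v1
SUMMARY_MODEL=nemotron-nano-9b-v2:free
SUMMARY_AGENT_API_KEY=your-summary-api-key
REPORT_HOST=https://api.llmapi.ai/v1
REPORT_MODEL=llama-3-8b-instruct
REPORT_AGENT_API_KEY=your-report-api-key
```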
## Mixing LLMs across agents
Each agent constructor accepts its own `llm` instance. This lets you assign fast, cheap models to planning and execution while reserving a more capable cloud model for monitoring:
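A minimal sketch of that wiring is below. The `LocalChat`, `CloudChat`, and `Agent` classes are stand-ins so the sketch runs without Ollama or an API key; in real code you would use `ChatOllama` from `langchain_ollama` and `ChatOpenAI` from `langchain_openai`, and the repo's own agent classes (whose exact names are not shown here).

```python
import os

# Stand-in for ChatOllama (real code: from langchain_ollama import ChatOllama)
class LocalChat:
    def __init__(self, model):
        self.model = model

    def invoke(self, prompt):
        return f"[local:{self.model}] {prompt}"

# Stand-in for ChatOpenAI (real code: from langchain_openai import ChatOpenAI)
class CloudChat:
    def __init__(self, model, base_url, api_key):
        self.model, self.base_url, self.api_key = model, base_url, api_key

    def invoke(self, prompt):
        return f"[cloud:{self.model}] {prompt}"

# Hypothetical agent shape: any class that takes `llm` in its constructor.
class Agent:
    def __init__(self, llm):
        self.llm = llm

    def run(self, task):
        return self.llm.invoke(task)

# Fast local model for planning; stronger cloud model for monitoring.
planner = Agent(llm=LocalChat(os.getenv("LOCAL_MODEL", "llama3")))
monitor = Agent(llm=CloudChat(
    model=os.getenv("SUMMARY_MODEL", "nemotron-nano-9b-v2:free"),
    base_url=os.getenv("SUMMARY_HOST", "https://api.routeway.ai/v1"),
    api_key=os.getenv("SUMMARY_AGENT_API_KEY", ""),
))
```

Because every agent only sees the `llm` object it was handed, swapping a local model for a cloud one is a one-line change at construction time.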
## LocalAgent for context compression
When Ollama is available you can use `LocalAgent` as the compressor passed to `LangGraphOrchestrator` or `SequentialWorkflow`. It calls the local LLM to produce semantically meaningful summaries, which is more token-efficient than the regex-based `CompressContextTool`.
`LocalAgent` exposes an `invoke(user_input)` method. The orchestrator detects this and calls `.invoke()` automatically.
## Fallback pattern
`main.py` tries to create a `LocalAgent` and falls back to `CompressContextTool` if Ollama or `langchain_ollama` is not available:
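A sketch of that fallback, not the repo's actual code: the `CompressContextTool` body and the `LocalAgent` import path and constructor signature below are assumptions for illustration.

```python
import re

# Stand-in for the repo's regex-based CompressContextTool (body is illustrative).
class CompressContextTool:
    def _run(self, text: str) -> str:
        # Crude "compression": collapse runs of whitespace.
        return re.sub(r"\s+", " ", text).strip()

try:
    # Prefer the LLM-backed compressor when the Ollama stack is importable.
    from langchain_ollama import ChatOllama
    from local_agent import LocalAgent  # hypothetical import path

    active_compressor = LocalAgent(llm=ChatOllama(model="llama3"))
except ImportError:
    # Ollama / langchain_ollama not installed: fall back to the regex tool.
    active_compressor = CompressContextTool()
```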
`active_compressor` is then passed directly to `LangGraphOrchestrator(compressor=active_compressor, ...)` and `SequentialWorkflow(tools=[active_compressor, ...])`. Both accept either type because they duck-type on `.invoke()` / `._run()`.
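That duck-typing can be illustrated with a small sketch (the `compress` helper and the two stub classes are illustrative, not the orchestrator's actual code):

```python
class LocalAgentStub:
    """Stands in for LocalAgent: exposes .invoke(user_input)."""
    def invoke(self, user_input):
        return f"summary: {user_input}"

class CompressToolStub:
    """Stands in for CompressContextTool: exposes ._run(text)."""
    def _run(self, text):
        return text

def compress(compressor, text):
    # Mirror of the described detection: prefer .invoke(), else fall
    # back to ._run(). Either object type works interchangeably.
    if hasattr(compressor, "invoke"):
        return compressor.invoke(text)
    return compressor._run(text)
```

Because dispatch is attribute-based rather than type-based, either compressor can be dropped into the orchestrator without changing its code.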
## Choosing between local and cloud
### Local (Ollama)

No API costs. Data stays on your machine. Slower for large models. Requires running `ollama serve`. Best for development and privacy-sensitive workloads.

### Cloud (OpenAI-compatible)

Faster inference on large models. Requires an API key and network access. Incurs per-token costs. Best for production and tasks that need stronger reasoning.