All agents in this project accept any BaseChatModel-compatible LLM. You choose at startup whether to use a local model via Ollama or a cloud provider that speaks the OpenAI API — or mix them per agent, as main.py does.

LLM configuration

Install the Ollama integration:
pip install langchain-ollama
Start Ollama and pull a model:
ollama serve
ollama pull llama3
Configure in .env:
LOCAL_MODEL=llama3
OLLAMA_HOST=http://localhost:11434
Initialize in code:
import os
from langchain_ollama import ChatOllama

llm = ChatOllama(model=os.getenv("LOCAL_MODEL"))
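ChatOllama connects to http://localhost:11434 by default, so the snippet above works for a local server. If your server runs elsewhere, you can resolve both values from the environment before constructing ChatOllama (it accepts a base_url keyword). A minimal sketch; the ollama_settings helper is illustrative, not part of the repo:

```python
import os

def ollama_settings(env=None):
    # Illustrative helper: resolve model and host from the environment,
    # falling back to the defaults used throughout this guide.
    env = os.environ if env is None else env
    return {
        "model": env.get("LOCAL_MODEL", "llama3"),
        "base_url": env.get("OLLAMA_HOST", "http://localhost:11434"),
    }

# llm = ChatOllama(**ollama_settings())
print(ollama_settings({}))
```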

Environment variables reference

All variables come from .env.example. Copy that file to .env and fill in your values:
cp .env.example .env
| Variable | Example value | Purpose |
|---|---|---|
| LOCAL_MODEL | llama3 | Ollama model name passed to ChatOllama(model=...) |
| OLLAMA_HOST | http://localhost:11434 | Ollama server URL |
| SUMMARY_HOST | https://api.routeway.ai/v1 | Base URL for the cloud LLM (OpenAI-compatible) |
| SUMMARY_MODEL | nemotron-nano-9b-v2:free | Model name for the cloud LLM |
| SUMMARY_AGENT_API_KEY | **** | API key for the cloud provider |
| REPORT_HOST | https://api.llmapi.ai/v1 | Alternative host for a report-generation model |
| REPORT_MODEL | llama-3-8b-instruct | Model name for the report provider |
| REPORT_AGENT_API_KEY | **** | API key for the report provider |
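A missing key usually only surfaces when the corresponding agent first calls its model, so a startup check can fail fast instead. A minimal sketch; REQUIRED and missing_vars are illustrative helpers, not part of the repo:

```python
import os

# Illustrative: the variables the cloud-backed summary agent needs at minimum.
REQUIRED = ["SUMMARY_HOST", "SUMMARY_MODEL", "SUMMARY_AGENT_API_KEY"]

def missing_vars(env=None):
    # Return the names of required variables that are unset or empty.
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

print(missing_vars({"SUMMARY_MODEL": "nemotron-nano-9b-v2:free"}))
```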

Mixing LLMs across agents

Each agent constructor accepts its own llm instance. This lets you assign fast, cheap models to planning and execution while reserving a more capable cloud model for monitoring:
import os
from dotenv import load_dotenv
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from agents import PlanningAgent, ExecutionAgent, MonitoringAgent

load_dotenv()

# Local model for planning and execution
llm = ChatOllama(model=os.getenv("LOCAL_MODEL"))

# Cloud model for stricter monitoring/evaluation
llm_cloud = ChatOpenAI(
    model_name=os.getenv("SUMMARY_MODEL"),
    temperature=0.3,
    openai_api_key=os.getenv("SUMMARY_AGENT_API_KEY"),
    base_url=os.getenv("SUMMARY_HOST"),
)

planner  = PlanningAgent(llm=llm)        # local
executor = ExecutionAgent(llm=llm)       # local
monitor  = MonitoringAgent(llm=llm_cloud) # cloud
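If you want the local/cloud split to be configurable rather than hard-coded, one option is a small selector that inspects the environment. This is a sketch under the assumption that cloud credentials imply a cloud preference; choose_backend is illustrative, not part of the repo:

```python
import os

def choose_backend(env=None):
    # Illustrative policy: prefer the cloud model when its credentials are
    # present, otherwise fall back to the local Ollama model.
    env = os.environ if env is None else env
    if env.get("SUMMARY_AGENT_API_KEY") and env.get("SUMMARY_HOST"):
        return "cloud"
    if env.get("LOCAL_MODEL"):
        return "local"
    raise RuntimeError("No LLM configured: set LOCAL_MODEL or the SUMMARY_* variables")
```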

LocalAgent for context compression

When Ollama is available you can use LocalAgent as the compressor passed to LangGraphOrchestrator or SequentialWorkflow. It calls the local LLM to produce semantically meaningful summaries, which is more token-efficient than the regex-based CompressContextTool.
from agents.local_agent import LocalAgent

context_compressor_agent = LocalAgent(llm=llm)
LocalAgent exposes an invoke(user_input) method. The orchestrator detects this and calls .invoke() automatically.

Fallback pattern

main.py tries to create a LocalAgent and falls back to CompressContextTool if the langchain_ollama package cannot be imported (for example, when the Ollama integration is not installed):
from tools.compress_context_tool import CompressContextTool

compressor_tool = CompressContextTool(max_length=10000)
context_compressor_agent = None

if os.getenv("LOCAL_MODEL"):
    try:
        from agents.local_agent import LocalAgent
        context_compressor_agent = LocalAgent(llm=llm)
        print(f"[*] Using LocalAgent with model '{os.getenv('LOCAL_MODEL')}' for context compression.")
    except ImportError:
        print("[!] Unable to import langchain_ollama. Falling back to simple CompressContextTool.")
else:
    print("[*] No 'LOCAL_MODEL' env var found. Using simple CompressContextTool for compression.")

# Resolve the active compressor
active_compressor = context_compressor_agent if context_compressor_agent else compressor_tool
active_compressor is then passed directly to LangGraphOrchestrator(compressor=active_compressor, ...) and SequentialWorkflow(tools=[active_compressor, ...]). Both accept either type because they duck-type on .invoke() / ._run().
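The duck-typing described above can be pictured as one dispatch step. This is an illustrative sketch of the idea, not the orchestrator's actual code; the stub classes exist only for demonstration:

```python
def compress(compressor, text):
    # Agents (e.g. LocalAgent) expose .invoke(); tools
    # (e.g. CompressContextTool) expose ._run().
    # Dispatch on whichever interface is present.
    if hasattr(compressor, "invoke"):
        return compressor.invoke(text)
    return compressor._run(text)

class StubAgent:
    def invoke(self, user_input):
        return "agent:" + user_input

class StubTool:
    def _run(self, text):
        return "tool:" + text

print(compress(StubAgent(), "ctx"))  # takes the agent path
print(compress(StubTool(), "ctx"))   # takes the tool path
```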
If LOCAL_MODEL is set but Ollama is not running when LocalAgent.invoke() is first called, it will raise a connection error at runtime — not at init time. Wrap calls in a try/except or validate connectivity at startup.
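To surface that failure at startup rather than mid-run, you can probe the server before constructing LocalAgent. A minimal sketch using only the standard library; ollama_reachable is an illustrative helper, relying on the fact that Ollama answers a plain HTTP GET on its root URL:

```python
import os
import urllib.request
import urllib.error

def ollama_reachable(host=None, timeout=2.0):
    # Illustrative startup probe: returns True only if the server
    # answers with HTTP 200 within the timeout.
    host = host or os.getenv("OLLAMA_HOST", "http://localhost:11434")
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# if not ollama_reachable():
#     context_compressor_agent = None  # keep the CompressContextTool fallback
```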

Choosing between local and cloud

Local (Ollama)

No API costs, and data stays on your machine. Inference is slower for large models, and you must keep ollama serve running. Best for development and privacy-sensitive workloads.

Cloud (OpenAI-compatible)

Faster inference on large models, but requires an API key and network access and incurs per-token costs. Best for production and for tasks that need stronger reasoning.
