All agents in this project accept any BaseChatModel-compatible LLM. You choose at startup whether to use a local model via Ollama or a cloud provider that speaks the OpenAI API — or mix them per agent, as main.py does.

LLM configuration

Install the Ollama integration:
pip install langchain-ollama
Start Ollama and pull a model:
ollama serve
ollama pull llama3
Configure in .env:
LOCAL_MODEL=llama3
OLLAMA_HOST=http://localhost:11434
Initialize in code:
import os
from langchain_ollama import ChatOllama

llm = ChatOllama(model=os.getenv("LOCAL_MODEL"))
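ChatOllama connects to http://localhost:11434 by default, so the snippet above works for a local server. If your server runs elsewhere, you can resolve both values from the environment before constructing ChatOllama (it accepts a base_url keyword). A minimal sketch; the ollama_settings helper is illustrative, not part of the repo:

```python
import os

def ollama_settings(env=None):
    # Illustrative helper: resolve model and host from the environment,
    # falling back to the defaults used throughout this guide.
    env = os.environ if env is None else env
    return {
        "model": env.get("LOCAL_MODEL", "llama3"),
        "base_url": env.get("OLLAMA_HOST", "http://localhost:11434"),
    }

# llm = ChatOllama(**ollama_settings())
print(ollama_settings({}))
```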

Environment variables reference

All variables come from .env.example. Copy that file to .env and fill in your values:
cp .env.example .env
| Variable | Example value | Purpose |
|---|---|---|
| LOCAL_MODEL | llama3 | Ollama model name passed to ChatOllama(model=...) |
| OLLAMA_HOST | http://localhost:11434 | Ollama server URL |
| SUMMARY_HOST | https://api.routeway.ai/v1 | Base URL for the cloud LLM (OpenAI-compatible) |
| SUMMARY_MODEL | nemotron-nano-9b-v2:free | Model name for the cloud LLM |
| SUMMARY_AGENT_API_KEY | **** | API key for the cloud provider |
| REPORT_HOST | https://api.llmapi.ai/v1 | Alternative host for a report-generation model |
| REPORT_MODEL | llama-3-8b-instruct | Model name for the report provider |
| REPORT_AGENT_API_KEY | **** | API key for the report provider |
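A missing key usually only surfaces when the corresponding agent first calls its model, so a startup check can fail fast instead. A minimal sketch; REQUIRED and missing_vars are illustrative helpers, not part of the repo:

```python
import os

# Illustrative: the variables the cloud-backed summary agent needs at minimum.
REQUIRED = ["SUMMARY_HOST", "SUMMARY_MODEL", "SUMMARY_AGENT_API_KEY"]

def missing_vars(env=None):
    # Return the names of required variables that are unset or empty.
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

print(missing_vars({"SUMMARY_MODEL": "nemotron-nano-9b-v2:free"}))
```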

Mixing LLMs across agents

Each agent constructor accepts its own llm instance. This lets you assign fast, cheap models to planning and execution while reserving a more capable cloud model for monitoring:
import os
from dotenv import load_dotenv
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from agents import PlanningAgent, ExecutionAgent, MonitoringAgent

load_dotenv()

# Local model for planning and execution
llm = ChatOllama(model=os.getenv("LOCAL_MODEL"))

# Cloud model for stricter monitoring/evaluation
llm_cloud = ChatOpenAI(
    model_name=os.getenv("SUMMARY_MODEL"),
    temperature=0.3,
    openai_api_key=os.getenv("SUMMARY_AGENT_API_KEY"),
    base_url=os.getenv("SUMMARY_HOST"),
)

planner  = PlanningAgent(llm=llm)        # local
executor = ExecutionAgent(llm=llm)       # local
monitor  = MonitoringAgent(llm=llm_cloud) # cloud
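If you want the local/cloud split to be configurable rather than hard-coded, one option is a small selector that inspects the environment. This is a sketch under the assumption that cloud credentials imply a cloud preference; choose_backend is illustrative, not part of the repo:

```python
import os

def choose_backend(env=None):
    # Illustrative policy: prefer the cloud model when its credentials are
    # present, otherwise fall back to the local Ollama model.
    env = os.environ if env is None else env
    if env.get("SUMMARY_AGENT_API_KEY") and env.get("SUMMARY_HOST"):
        return "cloud"
    if env.get("LOCAL_MODEL"):
        return "local"
    raise RuntimeError("No LLM configured: set LOCAL_MODEL or the SUMMARY_* variables")
```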

LocalAgent for context compression

When Ollama is available you can use LocalAgent as the compressor passed to LangGraphOrchestrator or SequentialWorkflow. It calls the local LLM to produce semantically meaningful summaries, which is more token-efficient than the regex-based CompressContextTool.
from agents.local_agent import LocalAgent

context_compressor_agent = LocalAgent(llm=llm)
LocalAgent exposes an invoke(user_input) method. The orchestrator detects this and calls .invoke() automatically.

Fallback pattern

main.py tries to create a LocalAgent and falls back to CompressContextTool if the langchain_ollama package cannot be imported (for example, when the Ollama integration is not installed):
from tools.compress_context_tool import CompressContextTool

compressor_tool = CompressContextTool(max_length=10000)
context_compressor_agent = None

if os.getenv("LOCAL_MODEL"):
    try:
        from agents.local_agent import LocalAgent
        context_compressor_agent = LocalAgent(llm=llm)
        print(f"[*] Using LocalAgent with model '{os.getenv('LOCAL_MODEL')}' for context compression.")
    except ImportError:
        print("[!] Unable to import langchain_ollama. Falling back to simple CompressContextTool.")
else:
    print("[*] No 'LOCAL_MODEL' env var found. Using simple CompressContextTool for compression.")

# Resolve the active compressor
active_compressor = context_compressor_agent if context_compressor_agent else compressor_tool
active_compressor is then passed directly to LangGraphOrchestrator(compressor=active_compressor, ...) and SequentialWorkflow(tools=[active_compressor, ...]). Both accept either type because they duck-type on .invoke() / ._run().
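The duck-typing described above can be pictured as one dispatch step. This is an illustrative sketch of the idea, not the orchestrator's actual code; the stub classes exist only for demonstration:

```python
def compress(compressor, text):
    # Agents (e.g. LocalAgent) expose .invoke(); tools
    # (e.g. CompressContextTool) expose ._run().
    # Dispatch on whichever interface is present.
    if hasattr(compressor, "invoke"):
        return compressor.invoke(text)
    return compressor._run(text)

class StubAgent:
    def invoke(self, user_input):
        return "agent:" + user_input

class StubTool:
    def _run(self, text):
        return "tool:" + text

print(compress(StubAgent(), "ctx"))  # takes the agent path
print(compress(StubTool(), "ctx"))   # takes the tool path
```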
If LOCAL_MODEL is set but Ollama is not running when LocalAgent.invoke() is first called, it will raise a connection error at runtime — not at init time. Wrap calls in a try/except or validate connectivity at startup.
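To surface that failure at startup rather than mid-run, you can probe the server before constructing LocalAgent. A minimal sketch using only the standard library; ollama_reachable is an illustrative helper, relying on the fact that Ollama answers a plain HTTP GET on its root URL:

```python
import os
import urllib.request
import urllib.error

def ollama_reachable(host=None, timeout=2.0):
    # Illustrative startup probe: returns True only if the server
    # answers with HTTP 200 within the timeout.
    host = host or os.getenv("OLLAMA_HOST", "http://localhost:11434")
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# if not ollama_reachable():
#     context_compressor_agent = None  # keep the CompressContextTool fallback
```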

Choosing between local and cloud

Local (Ollama)

No API costs, and data stays on your machine. Inference is slower for large models, and you must keep ollama serve running. Best for development and privacy-sensitive workloads.

Cloud (OpenAI-compatible)

Faster inference on large models, but requires an API key and network access and incurs per-token costs. Best for production and for tasks that need stronger reasoning.
