## Why roles matter
Different parts of Lerim’s pipeline have different requirements:

- Orchestration (lead, explorer) needs strong reasoning and tool use
- Extraction needs large context windows to process long transcripts
- Summarization benefits from fast, cost-effective models
Per-role configuration lets you:

- Use expensive, powerful models only where they add value
- Use cheaper models for high-volume tasks like extraction
- Switch providers per-role (OpenRouter for orchestration, OpenAI for extraction)
- Test new models in one role without affecting others
## The four roles
Lerim uses four model roles, each independently configurable:

| Role | Purpose | Config section |
|---|---|---|
| lead | Orchestrates chat, sync, maintain flows (PydanticAI agent) | [roles.lead] |
| explorer | Read-only subagent for candidate gathering | [roles.explorer] |
| extract | DSPy extraction pipeline (decisions and learnings) | [roles.extract] |
| summarize | DSPy summarization pipeline (session summaries) | [roles.summarize] |
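Each role is configured in its own `[roles.*]` section. As a hedged sketch of the shape (the `provider` and `model` key names are assumptions; only the section names come from the table above):

```toml
# Hypothetical shape of one role's configuration; key names assumed.
[roles.lead]
provider = "openrouter"       # assumed key name
model = "x-ai/grok-4.1-fast"  # the documented default for this role
```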
### Lead

The lead agent orchestrates Lerim’s main workflows:

- `lerim ask` — Answers questions about memories
- `lerim sync` — Decides which extracted candidates become memories
- `lerim maintain` — Merges duplicates, archives stale entries

Requirements:

- Strong reasoning and decision-making capabilities
- Reliable tool calling (PydanticAI tools)
- Fast response time for interactive use
Default: `x-ai/grok-4.1-fast` via OpenRouter
### Explorer

The explorer is a read-only subagent that the lead delegates to during sync and maintain:

- Searches existing memories for duplicates
- Gathers context for merge decisions
- Explores related memories
Requirements:

- Fast inference for repeated calls
- Good search and comparison abilities
- No write access (read-only by design)
Default: `x-ai/grok-4.1-fast` via OpenRouter
The explorer is called multiple times per sync run, so using a fast model here improves sync performance.
### Extract

The extract role runs DSPy pipelines to identify decisions and learnings in session transcripts:

- Scans agent transcripts for decision points
- Identifies learnings and patterns
- Outputs structured candidates for the lead agent
Requirements:

- Large context window (`max_window_tokens` ≥ 100K recommended)
- Strong information extraction capabilities
- Cost efficiency (processes many sessions)
Default: `openai/gpt-5-nano` via OpenRouter (300K token window)
### Summarize

The summarize role creates concise summaries of agent sessions:

- Generates session titles and descriptions
- Summarizes key topics and outcomes
- Feeds metadata to the dashboard
Requirements:

- Fast inference (many sessions to summarize)
- Good compression and summarization skills
- Cost efficiency
Default: `openai/gpt-5-nano` via OpenRouter (300K token window)
## Default configuration

Out of the box, Lerim routes through OpenRouter: xAI Grok for orchestration and OpenAI GPT-5 Nano for extraction and summarization. You need an OpenRouter API key:
```bash
export OPENROUTER_API_KEY="sk-or-..."
```

## Supported providers
Lerim supports multiple LLM providers through PydanticAI and DSPy:

| Provider | Config value | API key variable | Best for |
|---|---|---|---|
| OpenRouter | openrouter | OPENROUTER_API_KEY | Access to many models via one API |
| OpenAI | openai | OPENAI_API_KEY | GPT-4o, GPT-4 Turbo, GPT-5 models |
| ZAI | zai | ZAI_API_KEY | ZAI platform models |
| Anthropic | anthropic | ANTHROPIC_API_KEY | Claude models (via PydanticAI) |
| Ollama | ollama | none (local) | Local models (Qwen, Llama, etc.) |
### Setting API keys

API keys are environment variables only, never in config files. You can also put them in a `.env` file in your project root — Lerim loads it automatically.
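A minimal `.env` sketch using the key variables from the provider table (values elided):

```bash
# .env in the project root; loaded automatically
OPENROUTER_API_KEY=sk-or-...
OPENAI_API_KEY=sk-...
```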
## Customizing models

### Switching providers

To use OpenAI instead of OpenRouter for all roles, set the API key and switch each role's provider:

```bash
export OPENAI_API_KEY="sk-..."
```
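In the config, the switch might look like this sketch; the `provider` and `model` key names are assumptions, since only the `[roles.*]` section names are documented:

```toml
# Hypothetical fragment; key names assumed.
[roles.lead]
provider = "openai"
model = "gpt-4o"

[roles.extract]
provider = "openai"
model = "gpt-4o-mini"
```

Repeat for the remaining roles as needed.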
### Using different models per role

You can mix providers and models, for example a setup that:

- Uses GPT-4o for the lead agent (best reasoning)
- Uses GPT-4o Mini for explorer (cheaper, still good)
- Uses Qwen for extraction (great code understanding, large context)
- Uses GPT-5 Nano for summarization (fast and cheap)
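A mixed setup matching the bullets above might look like this sketch (key names and exact model identifiers are assumptions):

```toml
# Hypothetical fragment; key names and model ids are illustrative.
[roles.lead]
provider = "openai"
model = "gpt-4o"             # best reasoning

[roles.explorer]
provider = "openai"
model = "gpt-4o-mini"        # cheaper, still good

[roles.extract]
provider = "openrouter"
model = "qwen/qwen3-coder"   # large context, good code understanding
max_window_tokens = 100000

[roles.summarize]
provider = "openrouter"
model = "openai/gpt-5-nano"  # fast and cheap
```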
### Why different models per role?

Model selection is about matching capabilities to requirements:

- **Lead agent** — Needs strong reasoning to make good memory decisions. Bad decisions = bad memory store. Worth the cost.
- **Explorer** — Called repeatedly, but tasks are simpler (search, compare). A faster/cheaper model can save money without hurting quality.
- **Extract** — Processes many sessions. Context window size matters more than reasoning. A cheaper model with a large window is often better than an expensive model with a small window.
- **Summarize** — High volume, simple task (compression). Use the cheapest model that produces readable summaries.
### Using local models via Ollama

You can run Lerim entirely on local models using Ollama. Ollama models typically have smaller context windows, so adjust `max_window_tokens` to match your model's limit (e.g., 32K for Qwen3:8b).

### Custom API endpoints

You can point roles at custom endpoints:

- LLM proxy services
- Internal corporate endpoints
- Custom OpenAI-compatible APIs
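For instance, a role could target Ollama's local OpenAI-compatible endpoint. This is a hedged sketch in which `base_url` and the other key names are assumptions:

```toml
# Hypothetical fragment; key names assumed.
[roles.extract]
provider = "ollama"
model = "qwen3:8b"
base_url = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint
max_window_tokens = 32000               # match the local model's context limit
```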
## Model-specific settings

### Orchestration roles (lead, explorer)

These settings apply to `[roles.lead]` and `[roles.explorer]`:

- Request timeout in seconds. Increase for slower models or complex tasks.
- Maximum agent iterations per run. The agent can call tools and reason in multiple steps. Increase for complex workflows.
- List of fallback models to try if the primary model fails. Example: `["x-ai/grok-4.1-fast", "openai/gpt-4o"]`
- OpenRouter provider routing preference. Example: `["Together", "Lepton"]` tries Together first, then Lepton.

### DSPy roles (extract, summarize)
These settings apply to `[roles.extract]` and `[roles.summarize]`:

- Maximum tokens per transcript window (`max_window_tokens`). Lerim splits long transcripts into overlapping windows. Increase for large-context models.
- Token overlap between consecutive windows. Ensures context continuity at window boundaries.
- Request timeout in seconds.
- OpenRouter provider routing preference.
### Understanding `max_window_tokens`

The `max_window_tokens` setting is critical for extraction quality:
- Too small: Long sessions get truncated, losing context
- Too large: Model exceeds context limit and fails
- Just right: Full context with room for prompts and outputs
| Model | Context | Recommended `max_window_tokens` |
|---|---|---|
| GPT-5 Nano | 320K | 300K |
| GPT-4o | 128K | 100K |
| Grok 4.1 | 131K | 100K |
| Claude 4.5 Sonnet | 200K | 180K |
| Qwen3 Coder | 131K | 100K |
| Llama 3.3 | 128K | 100K |
## Recommended configurations

### Budget-conscious

Use cheaper models where possible:
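A sketch of a budget setup (key names assumed; model ids taken from the documented defaults):

```toml
# Hypothetical fragment: cheap models for high-volume roles,
# a capable fast model for orchestration. Key names assumed.
[roles.lead]
provider = "openrouter"
model = "x-ai/grok-4.1-fast"

[roles.explorer]
provider = "openrouter"
model = "openai/gpt-5-nano"

[roles.extract]
provider = "openrouter"
model = "openai/gpt-5-nano"

[roles.summarize]
provider = "openrouter"
model = "openai/gpt-5-nano"
```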
### Performance-focused

Use the best models everywhere:
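A sketch of a performance-focused setup (key names and model ids are illustrative, guided by the context-window table above):

```toml
# Hypothetical fragment; key names assumed.
[roles.lead]
provider = "openai"
model = "gpt-4o"

[roles.explorer]
provider = "openai"
model = "gpt-4o"

[roles.extract]
provider = "anthropic"
model = "claude-sonnet-4-5"  # 200K context per the table above
max_window_tokens = 180000

[roles.summarize]
provider = "openai"
model = "gpt-4o"
```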
### Fully local (Ollama)

No API keys, no cloud, all local:
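A sketch of an all-local setup (key names and Ollama model tags are illustrative):

```toml
# Hypothetical fragment; everything served by a local Ollama instance.
[roles.lead]
provider = "ollama"
model = "qwen3:8b"

[roles.explorer]
provider = "ollama"
model = "qwen3:8b"

[roles.extract]
provider = "ollama"
model = "qwen3:8b"
max_window_tokens = 32000  # smaller local context window

[roles.summarize]
provider = "ollama"
model = "llama3.3"
```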
## Testing model changes

You can test model changes without affecting your main config. With a per-project override, `lerim sync` in `my-project/` uses Claude, while other projects use your global default.
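Assuming a project-local config file (the exact path is not specified here), the override might look like:

```toml
# Hypothetical per-project override in my-project/; file location
# and key names are assumptions.
[roles.lead]
provider = "anthropic"
model = "claude-sonnet-4-5"  # illustrative model id
```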
## Troubleshooting

### Model not found

Cause: The configured model identifier isn't recognized by the provider. Fix: Check the provider's model list:
- OpenRouter: https://openrouter.ai/models
- OpenAI: https://platform.openai.com/docs/models
- Use the exact model identifier from the provider docs
### Context length exceeded

Cause: `max_window_tokens` is larger than the model's context limit. Fix: Reduce `max_window_tokens` or switch to a larger-context model:
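For example (section and key names beyond `max_window_tokens` follow the config shown earlier):

```toml
# Cap the window below the model's context limit.
[roles.extract]
max_window_tokens = 100000  # e.g. for a 128K-context model, per the table above
```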
### Slow extraction

Cause: Using a slow model for high-volume extraction. Fix: Switch to a faster model for `[roles.extract]`:
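For example (the `provider`/`model` key names are assumptions; the model id is the documented extract default):

```toml
[roles.extract]
provider = "openrouter"
model = "openai/gpt-5-nano"
```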
### API rate limits

Cause: Too many requests to the provider. Fix: Reduce sync frequency or use a different provider.

## Next steps
- **Config.toml reference** — See all available role configuration options
- **Tracing** — Monitor model usage with OpenTelemetry tracing