Model Tier Routing
Automatic model selection based on prompt complexity. Simple queries use cheap models; complex tasks use premium models.
Configuration
From grip/config/schema.py:136:
Setup
Enable tiered routing in ~/.grip/config.json:
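A sketch of what that configuration might look like. The field names (`tier_routing`, `low_model`, `medium_model`, `high_model`) are assumptions inferred from the tier names on this page, not confirmed against the schema; the model ids are the ones listed below.

```json
{
  "tier_routing": true,
  "low_model": "openrouter/google/gemini-flash-2.0",
  "medium_model": "anthropic/claude-sonnet-4",
  "high_model": "openrouter/anthropic/claude-opus-4"
}
```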
How Complexity Detection Works
The router analyzes prompt characteristics:
Low Complexity (uses the low model):
- Greetings and small talk: “Hello”, “How are you?”
- Simple lookups: “What time is it?”, “Current Bitcoin price”
- Basic regex/formatting: “Extract emails from this text”
- Single-step operations: “List files in /tmp”
Medium Complexity (uses the medium or default model):
- Code changes: “Fix this bug”, “Add error handling”
- Explanations: “Explain how JWT works”
- Multi-step tasks: “Search for X and summarize findings”
- Data analysis: “Analyze this CSV and find trends”
High Complexity (uses the high model):
- System design: “Design a scalable microservices architecture”
- Large refactors: “Refactor this module to use async/await”
- Debugging: “Find why this race condition occurs”
- Research synthesis: “Compare React vs Vue for enterprise apps”
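grip's actual detector is not shown on this page; as an illustration of the idea, here is a minimal keyword-based classifier in the spirit of the tiers above. The keyword lists and function name are invented for this sketch.

```python
# Illustrative complexity routing; keyword lists are invented for this
# sketch and are NOT grip's real heuristics.
HIGH_MARKERS = ("design", "architecture", "refactor", "race condition", "compare")
MEDIUM_MARKERS = ("fix", "add", "explain", "summarize", "analyze")

def classify(prompt: str) -> str:
    """Map a prompt to a model tier: 'low', 'medium', or 'high'."""
    text = prompt.lower()
    if any(marker in text for marker in HIGH_MARKERS):
        return "high"
    if any(marker in text for marker in MEDIUM_MARKERS):
        return "medium"
    return "low"  # greetings, lookups, single-step operations

print(classify("Hello"))         # low
print(classify("Fix this bug"))  # medium
print(classify("Design a scalable microservices architecture"))  # high
```

A real router would weigh more signals (prompt length, tool availability, conversation state) than keyword matching alone.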
Cost Savings Example
Without Tier Routing: every request is billed at the default premium-model rate. Using the per-request costs listed below, 100 simple queries sent to Claude Sonnet (~$0.015 each) cost ~$1.50; routed to Gemini Flash (~$0.0001 each), they cost ~$0.01.
Consolidation Model
Use a cheaper model for session compaction and summarization.
Configuration
From grip/config/schema.py:91:
In ~/.grip/config.json:
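A sketch of the relevant fields. `consolidation_model` and `memory_window` are the names used on this page; the values are illustrative, and the exact schema is not reproduced here.

```json
{
  "consolidation_model": "openrouter/google/gemini-flash-2.0",
  "memory_window": 40
}
```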
How It Works
When conversation history exceeds 2 × memory_window messages, grip automatically:
1. Sends the old messages to the consolidation_model
2. Generates a concise summary (typically 200-500 tokens)
3. Replaces the old messages with the summary
4. Keeps the most recent memory_window messages intact
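The steps above can be sketched as follows. This is a minimal illustration, not grip's implementation; `summarize` stands in for a call to the consolidation_model.

```python
def consolidate(history: list[str], memory_window: int, summarize) -> list[str]:
    """Collapse old messages into a summary once history grows past 2x the window."""
    if len(history) <= 2 * memory_window:
        return history  # nothing to do yet
    old, recent = history[:-memory_window], history[-memory_window:]
    summary = summarize(old)  # in grip, sent to the cheaper consolidation_model
    return [summary] + recent

# Toy usage with a fake summarizer that just counts messages.
fake = lambda msgs: f"[summary of {len(msgs)} messages]"
compacted = consolidate([f"m{i}" for i in range(10)], memory_window=3, summarize=fake)
print(compacted)  # ['[summary of 7 messages]', 'm7', 'm8', 'm9']
```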
Manual Consolidation
From grip/engines/litellm_engine.py:138, consolidation is implemented in the engine layer and works with any model combination.
Choosing Cost-Effective Models
Budget Models
Google Gemini Flash 2.0 (openrouter/google/gemini-flash-2.0)
- Cost: ~$0.0001 per request (100x cheaper than Claude)
- Speed: 2-3x faster than Sonnet
- Best for: Lookups, simple Q&A, data extraction, summarization
- Limitations: Weaker reasoning, less reliable tool use
GPT-4o Mini (openrouter/openai/gpt-4o-mini)
- Cost: ~$0.0005 per request (30x cheaper than Claude)
- Speed: Very fast
- Best for: Code formatting, regex tasks, simple refactors
- Limitations: Shorter context, less creative
Claude Haiku 4 (anthropic/claude-haiku-4)
- Cost: ~$0.003 per request (5x cheaper than Sonnet)
- Speed: Fastest Claude model
- Best for: Tool-heavy workflows, data processing, quick iterations
- Limitations: Not as strong for complex reasoning
Premium Models
Claude Sonnet-4 (anthropic/claude-sonnet-4)
- Cost: ~$0.015 per request
- Best for: General-purpose tasks, code generation, analysis
- Sweet spot: Best quality/cost ratio for most coding tasks
Claude Opus 4 (openrouter/anthropic/claude-opus-4)
- Cost: ~$0.075 per request (5x more than Sonnet)
- Best for: System design, complex debugging, research synthesis
- When to use: Only when task quality justifies the cost
GPT-4 Turbo (openrouter/openai/gpt-4-turbo)
- Cost: ~$0.04 per request
- Best for: Math, structured output, code execution validation
- Trade-off: Cheaper than Opus, slightly different strengths
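The per-request figures above can be lined up in one place. A small sketch using the approximate costs quoted on this page (the prices are estimates, not billing data):

```python
# Approximate per-request costs quoted on this page (USD).
COST_PER_REQUEST = {
    "openrouter/google/gemini-flash-2.0": 0.0001,
    "openrouter/openai/gpt-4o-mini": 0.0005,
    "anthropic/claude-haiku-4": 0.003,
    "anthropic/claude-sonnet-4": 0.015,
    "openrouter/openai/gpt-4-turbo": 0.04,
    "openrouter/anthropic/claude-opus-4": 0.075,
}

def batch_cost(model: str, requests: int) -> float:
    """Estimated cost of sending `requests` prompts to one model."""
    return COST_PER_REQUEST[model] * requests

print(round(batch_cost("anthropic/claude-sonnet-4", 100), 2))           # 1.5
print(round(batch_cost("openrouter/google/gemini-flash-2.0", 100), 2))  # 0.01
```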
Memory Window Optimization
Reduce tokens sent per request by limiting conversation history.
Configuration
Small window (memory_window: 10-20):
- Pros: Very low token usage, fast responses
- Cons: Agent forgets context quickly
- Best for: Single-task sessions, tool-heavy automation
Medium window (memory_window: 30-50, default):
- Pros: Good balance of cost and context retention
- Cons: May need consolidation every 100 messages
- Best for: General interactive use
Large window (memory_window: 100-200):
- Pros: Excellent context retention, fewer consolidations
- Cons: High token usage (2-5K tokens per request)
- Best for: Complex multi-turn debugging or architecture discussions
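As a rough sanity check on the token figures above, assuming ~25 tokens per message (an assumption made here for illustration only):

```python
def estimated_prompt_tokens(memory_window: int, avg_tokens_per_message: int = 25) -> int:
    """Rough per-request token load from conversation history alone."""
    return memory_window * avg_tokens_per_message

# Small, medium, and large windows from the guidance above.
for window in (15, 40, 150):
    print(window, estimated_prompt_tokens(window))
```

At 25 tokens per message, a 100-200 message window yields 2,500-5,000 history tokens per request, matching the 2-5K estimate above.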
Max Tool Iterations Limit
Prevent runaway costs from infinite tool loops.
From grip/config/schema.py:76:
- 0 = unlimited iterations (default, see long-running tasks)
- N = stop after N tool call rounds, even if the task is incomplete
- Simple tasks: max_tool_iterations: 5 (file operations, lookups)
- Code tasks: max_tool_iterations: 15 (build, test, fix)
- Research: max_tool_iterations: 10 (search, fetch, analyze)
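A loop guard along these lines is all the setting needs to be. This is a sketch, not grip's code; `step` stands in for one round of tool calls.

```python
def run_agent(step, max_tool_iterations: int = 0) -> int:
    """Run tool rounds until done; 0 means unlimited, N caps the rounds."""
    iterations = 0
    while True:
        done = step()  # one round of tool calls; returns True when finished
        iterations += 1
        if done:
            return iterations
        if max_tool_iterations and iterations >= max_tool_iterations:
            return iterations  # stop even though the task is incomplete

# Toy step that "finishes" on the 7th round.
counter = {"n": 0}
def step():
    counter["n"] += 1
    return counter["n"] >= 7

print(run_agent(step, max_tool_iterations=5))  # 5 (capped before completion)
```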
Each iteration costs 1K-8K tokens depending on tool output size; a 15-iteration task at 8K tokens per iteration can consume 120K tokens total.
Semantic Caching
Cache identical queries to avoid re-processing.
From grip/config/schema.py:100:
- Identical user messages return cached responses
- Cache expires after semantic_cache_ttl seconds (default 1 hour)
- Saves 100% of tokens for repeated queries
- FAQ-style queries: “What time is it?”, “Show me the logs”
- Repeated analysis: Re-running the same report
- Development: Testing the same prompt multiple times
- Only caches exact message matches (no fuzzy matching)
- Session-specific (cache key includes session_key)
- Stored in the workspace at state/semantic_cache.db
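The behavior described above (exact-match lookups, per-session keys, TTL expiry) can be sketched as follows. This is illustrative only: grip persists its cache to SQLite, while this sketch uses an in-memory dict.

```python
import time

class SemanticCache:
    """Exact-match response cache with TTL expiry, keyed per session."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, session_key: str, message: str):
        entry = self._store.get((session_key, message))
        if entry is None:
            return None  # no exact match (no fuzzy matching)
        response, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._store[(session_key, message)]  # expired entry
            return None
        return response

    def put(self, session_key: str, message: str, response: str):
        self._store[(session_key, message)] = (response, time.time())

cache = SemanticCache(ttl_seconds=3600)
cache.put("session-1", "What time is it?", "It is 12:00 UTC.")
print(cache.get("session-1", "What time is it?"))  # It is 12:00 UTC.
print(cache.get("session-2", "What time is it?"))  # None: different session_key
```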
Token Budget Enforcement
Set daily token limits to prevent cost overruns.
From grip/config/schema.py:110:
- 0 = unlimited (default)
- N = stop all agent runs after N total tokens used today
- Counts both prompt tokens and completion tokens
- Resets at midnight UTC
- Light use: 100,000 tokens/day (~$1.50 with Sonnet)
- Medium use: 500,000 tokens/day (~$7.50 with Sonnet)
- Heavy use: 2,000,000 tokens/day (~$30 with Sonnet)
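Enforcement along these lines is straightforward. A sketch (not grip's implementation): count prompt plus completion tokens, reset at midnight UTC, and block runs once the cap is hit.

```python
from datetime import datetime, timezone

class TokenBudget:
    """Daily token cap; resets at midnight UTC. A limit of 0 means unlimited."""

    def __init__(self, daily_limit: int = 0):
        self.daily_limit = daily_limit
        self.used = 0
        self.day = datetime.now(timezone.utc).date()

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        today = datetime.now(timezone.utc).date()
        if today != self.day:  # midnight UTC has passed: reset the counter
            self.day, self.used = today, 0
        self.used += prompt_tokens + completion_tokens  # both sides count

    def allowed(self) -> bool:
        return self.daily_limit == 0 or self.used < self.daily_limit

budget = TokenBudget(daily_limit=100_000)
budget.record(prompt_tokens=60_000, completion_tokens=45_000)
print(budget.allowed())  # False: 105,000 tokens used today
```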