M18: Intelligent Model Selection

Capability-based model selection that routes actions to optimal LLMs. Key principle: match action type to model capabilities, with automatic fallbacks and license compliance for commercial synthetic data.

The Problem

Different actions have different requirements:

- Dialog synthesis needs conversational fluency
- Mathematical reasoning needs strong logical capabilities
- JSON generation needs structured output reliability
- Temporal reasoning needs causal inference

Core Concepts
16 Action Types
15 Model Capabilities
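The full enumerations live in the codebase and are not reproduced here. As an illustrative sketch only (these member names are assumptions, not the actual definitions), the two concepts might be modeled as:

```python
from enum import Enum

class ActionType(Enum):
    """Illustrative subset of the 16 action types (names are assumed)."""
    DIALOG_SYNTHESIS = "dialog_synthesis"
    MATHEMATICAL_REASONING = "mathematical_reasoning"
    JSON_GENERATION = "json_generation"
    TEMPORAL_REASONING = "temporal_reasoning"

class ModelCapability(Enum):
    """Illustrative subset of the 15 model capabilities (names are assumed)."""
    CONVERSATIONAL_FLUENCY = "conversational_fluency"
    LOGICAL_REASONING = "logical_reasoning"
    STRUCTURED_OUTPUT = "structured_output"
    CAUSAL_INFERENCE = "causal_inference"
```

Each action type declares the capabilities it requires, and the selector matches those against the registry below.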
Model Registry
Only open-source models with licenses permitting commercial synthetic data generation.

| Model | Context | Strengths | License |
|-------|---------|-----------|---------|
| Llama 3.1 8B | 128k | Fast, cost-efficient | Llama 3.1 |
| Llama 3.1 70B | 128k | Balanced quality/cost, dialog | Llama 3.1 |
| Llama 3.1 405B | 128k | Highest quality | Llama 3.1 |
| Llama 4 Scout | 512k | Multimodal, huge context | Llama 4 |
| Qwen 2.5 7B | 32k | JSON, code, fast | Qwen |
| Qwen 2.5 72B | 128k | Structured output, analytical | Qwen |
| QwQ 32B | 32k | Mathematical, logical reasoning | Qwen |
| DeepSeek Chat | 64k | Balanced, analytical | MIT |
| DeepSeek R1 | 64k | Deep reasoning, math | MIT |
| Mistral 7B | 32k | Fast, cost-efficient | Apache 2.0 |
| Mixtral 8x7B | 32k | Balanced MoE | Apache 2.0 |
| Mixtral 8x22B | 64k | High quality MoE | Apache 2.0 |

Castaway Colony Example
The template routes seven task types to four specialized models:

| Task | Model | Why |
|---|---|---|
| O2 depletion calculations | DeepSeek R1 | Mathematical precision |
| Radiation exposure modeling | DeepSeek R1 | Numerical reasoning |
| Crew interpersonal dialog | Llama 70B | Conversational fluency |
| Command decisions | Llama 70B | Natural language generation |
| Supply inventories | Qwen 72B | Reliable structured JSON |
| Flora analysis reports | Qwen 72B | Analytical output |
| Branch outcome judging | Llama 405B | Highest quality evaluation |
Selection Algorithm
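The project's selection code is not reproduced here. As a hedged sketch, capability-based selection can be modeled as picking the cheapest registered model whose capabilities cover the action's requirements (the registry contents, capability names, and cost field below are assumptions for illustration):

```python
# Minimal sketch of capability-based model selection. The registry shape
# and cost values are assumptions, not the project's actual data.
REGISTRY = {
    "llama-3.1-70b": {"capabilities": {"dialog", "general"}, "cost": 0.9},
    "qwen-2.5-72b":  {"capabilities": {"json", "analytical"}, "cost": 1.2},
    "deepseek-r1":   {"capabilities": {"math", "reasoning"}, "cost": 2.2},
}

def select_model(required: set[str]) -> str:
    """Pick the cheapest model whose capabilities cover `required`."""
    candidates = [
        (spec["cost"], name)
        for name, spec in REGISTRY.items()
        if required <= spec["capabilities"]
    ]
    if not candidates:
        raise LookupError(f"no model covers {required}")
    return min(candidates)[1]
```

For example, `select_model({"json"})` returns `"qwen-2.5-72b"` under this toy registry. A linear scan like this is what makes selection O(M) in the registry size, as noted under Performance Characteristics.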
Action → Capability Mappings
Examples from the system:

Fallback Chains
If the primary model fails, automatic retry with alternatives.

Integration with LLMService
Response Parsing
`ResponseParser` in `llm_service/response_parser.py` extracts JSON from LLM responses using a three-stage pipeline:
Stage 1: Markdown Code Blocks
Matches `` ```json ... ``` `` fences first.
Stage 2: Bracket-Depth Matching
Walks the response character-by-character, tracking:

- Bracket depth
- String boundaries (`"..."`)
- Escape sequences (`\"`)

to extract the first complete `{...}` or `[...]` structure.
Stage 3: Whole-Text Fallback
Tries `json.loads()` on the stripped response.
Bracket-depth matching handles common LLM failure modes:
- Text before/after JSON
- Truncated responses
- Brackets inside string values
- Nested structures
Responses that fail all three stages are classified as `INVALID_JSON` by the error handler and retried with exponential backoff.
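As an illustration of Stage 2, a simplified bracket-depth extractor might look like the following (a sketch, not the project's `ResponseParser`, which also handles Stages 1 and 3 and mismatched bracket types):

```python
import json

def extract_json(text: str):
    """Scan for the first balanced {...} or [...] span, respecting
    string boundaries and escape sequences, then parse it."""
    start = None
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if escaped:                 # char after a backslash: skip it
            escaped = False
            continue
        if in_string:
            if ch == "\\":
                escaped = True      # next char is escaped
            elif ch == '"':
                in_string = False   # string closed
            continue                # brackets inside strings are ignored
        if ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i           # first opening bracket of the span
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0 and start is not None:
                return json.loads(text[start : i + 1])
    raise ValueError("no balanced JSON structure found")
```

This handles the failure modes listed above: leading/trailing prose is skipped, and brackets inside string values do not terminate the span.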
License Compliance
All models in the registry permit commercial use. However, not all permit unrestricted use of outputs as training data.

Unrestricted for Training Data
Outputs can train any model:

- MIT (DeepSeek Chat, DeepSeek R1): Most permissive, no restrictions
- Apache 2.0 (Mistral 7B, Mixtral 8x7B, Mixtral 8x22B): Permissive, attribution required
Restricted for Training Data
- Llama 3.1/4: Commercial use allowed, but Meta’s license prohibits using Llama outputs to train non-Llama models
- ✅ Use for simulation
- ✅ Use outputs to fine-tune a Llama model
- ❌ Use outputs to fine-tune DeepSeek/Qwen/Mistral/custom models
- Qwen: Commercial use allowed, permissive for most training uses
- Google Gemini: TOS restricts synthetic data generation entirely (opt-in only via `--gemini-flash`)
Training-Safe Model Selection
If you intend to use simulation outputs as training data, prefer the MIT- and Apache 2.0-licensed models above.

Models Explicitly Excluded
- OpenAI (usage restrictions)
- Anthropic (synthetic data restrictions)
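A training-safe pipeline can gate selection on license metadata. A minimal sketch (the license strings mirror the registry table above; the helper and dict shape are hypothetical):

```python
# Licenses whose outputs are unrestricted as training data (per the
# registry table above). Llama and Qwen carry usage restrictions.
TRAINING_SAFE_LICENSES = {"MIT", "Apache 2.0"}

MODELS = {
    "deepseek-r1": "MIT",
    "mixtral-8x22b": "Apache 2.0",
    "llama-3.1-70b": "Llama 3.1",  # outputs may only train Llama models
    "qwen-2.5-72b": "Qwen",        # permissive, but not unrestricted
}

def training_safe_models(models: dict[str, str]) -> list[str]:
    """Return models whose outputs are unrestricted as training data."""
    return [name for name, lic in models.items()
            if lic in TRAINING_SAFE_LICENSES]
```

Under these assumptions, only the DeepSeek and Mistral/Mixtral families survive the filter.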
Free Model Support
OpenRouter offers a rotating selection of free models (identified by the `:free` suffix).
FreeModelSelector
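The `FreeModelSelector` listing is not shown here; at its core it needs to distinguish free-tier OpenRouter IDs by the `:free` suffix. A sketch of that filtering step (not the actual class):

```python
def free_models(available: list[str]) -> list[str]:
    """Filter OpenRouter model IDs to the free tier (":free" suffix)."""
    return [m for m in available if m.endswith(":free")]
```

Because the free selection rotates, the available list should be refreshed from the OpenRouter model listing rather than hard-coded.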
CLI Usage
Rate Limiting
From `llm.py:17-149`:
RateLimiter Class
Thread-safe token bucket rate limiter for API calls. Two modes:

| Mode | Requests/Min | Burst Size | Use Case |
|---|---|---|---|
| free | 20 | 5 | Conservative limits for free tier |
| paid | 1000 | 50 | Aggressive limits for paid tier (DEFAULT) |
Implementation
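The actual implementation lives in `llm.py:17-149`; a simplified token-bucket sketch (class structure and names here are assumptions, not the project's `RateLimiter`) is:

```python
import threading
import time

class TokenBucket:
    """Simplified thread-safe token-bucket limiter: tokens refill at
    requests_per_min / 60 per second, capped at the burst size."""

    def __init__(self, requests_per_min: int, burst: int):
        self.rate = requests_per_min / 60.0   # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)            # start with a full burst
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.updated
                self.tokens = min(self.capacity,
                                  self.tokens + elapsed * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock
```

With the paid-tier defaults (`TokenBucket(1000, 50)`), the first 50 calls pass immediately and subsequent calls are paced at ~16.7 requests/second.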
Global Controls
OpenRouter Client
Custom HTTP client for the OpenRouter API (replaces the OpenAI client). From `llm.py:152-200`:
Timeout Configuration
- connect: 10s for connection establishment
- read: 120s for slow LLM responses (increased from 60s)
- write: 30s for request body upload
- pool: 10s for getting a connection from the pool
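These four values map onto the standard per-phase timeout split; assuming the client is built on `httpx` (the library choice is an assumption here), the configuration corresponds directly to its `Timeout` parameters:

```python
# Timeout settings mirroring the list above (values in seconds).
TIMEOUTS = {
    "connect": 10.0,  # connection establishment
    "read": 120.0,    # slow LLM responses (raised from 60s)
    "write": 30.0,    # request body upload
    "pool": 10.0,     # acquiring a connection from the pool
}
# With httpx this would be passed as: httpx.Timeout(**TIMEOUTS)
```

The long read timeout matters most: large-model completions can stream slowly without the connection being dead.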
Performance Characteristics
Model Selection Speed
Model selection is O(M), where M is the number of models in the registry (typically ~12). Typical selection time: under 1 ms.

Cost Optimization
Compared to using Llama 405B for everything:

| Action Type | Typical Model | Cost Ratio |
|---|---|---|
| Dialog synthesis | Llama 70B | 6x cheaper |
| Knowledge extraction | Qwen 72B | 6x cheaper |
| Mathematical reasoning | DeepSeek R1 | 8x cheaper |
| JSON generation | Qwen 7B | 50x cheaper |
| High-stakes evaluation | Llama 405B | 1x (baseline) |
Fallback Reliability
With 3-model fallback chains:

- Single-model failure rate: ~2-5%
- Chain failure rate: under 0.1% (assuming independent failures, 0.05³ ≈ 0.0125%)
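The retry behavior described under Fallback Chains can be sketched as follows (the `call(model, prompt)` interface is an assumption; the real code would catch specific API errors and apply backoff between attempts):

```python
def call_with_fallback(chain, prompt, call):
    """Try each model in `chain` in order; return the first success.

    `call(model, prompt)` is an assumed interface; any exception
    triggers fallback to the next model in the chain.
    """
    last_error = None
    for model in chain:
        try:
            return call(model, prompt)
        except Exception as exc:  # real code: catch specific API errors
            last_error = exc
    raise RuntimeError(f"all models in {chain} failed") from last_error
```

If each model fails independently ~5% of the time, a 3-model chain only raises when all three fail, giving the sub-0.1% chain failure rate above.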
Next Steps
Overview
Back to mechanisms overview
Fidelity Management
How fidelity follows attention

