# Ollama Backend (Stub)
The `OllamaBackend` is a placeholder for local model prompt compilation via Ollama. It is scheduled for Phase 2 expansion.

## Status

- **Implementation**: Not yet implemented
- **Priority**: Scheduled for Phase 2 expansion
- **Use Case**: Running AXON programs on local LLMs (Llama, Mistral, etc.)
## About Ollama

Ollama is a tool for running large language models locally. It provides:

- Simple API for model inference
- Support for Llama 2, Mistral, CodeLlama, and more
- Lightweight model management
- No external API keys required

For AXON, this enables:

- **Privacy**: Run AXON programs entirely on-premises
- **Cost**: Zero API costs for development and testing
- **Latency**: No network round-trips
- **Experimentation**: Test with different model sizes and quantizations
## Stub Implementation
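Since the backend is not yet implemented, a minimal stub might look like the following sketch. The method names (`compile_system_prompt`, `compile_step`) come from the roadmap below; the actual interface is defined by `BaseBackend` in the codebase.

```python
class OllamaBackend:
    """Placeholder backend for local model prompt compilation via Ollama.

    NOT YET IMPLEMENTED -- every method raises until Phase 2.
    """

    name = "ollama"

    def compile_system_prompt(self, flow):
        raise NotImplementedError(
            "OllamaBackend is a stub; scheduled for Phase 2 expansion."
        )

    def compile_step(self, step):
        raise NotImplementedError(
            "OllamaBackend is a stub; scheduled for Phase 2 expansion."
        )
```

Raising `NotImplementedError` from every method keeps the backend registerable without silently producing broken prompts.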
## Design Considerations
### 1. Context Window Constraints
Local models typically have smaller context windows:

| Model | Context Window |
|---|---|
| Llama 2 7B | 4096 tokens |
| Mistral 7B | 8192 tokens |
| CodeLlama 34B | 16384 tokens |
| GPT-4 (comparison) | 128000 tokens |
To work within these limits, the backend must:

- Simplify system prompts
- Compress anchor instructions
- Prioritize essential context
- Warn when flows exceed model capacity
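The last point can be sketched as a simple guard. The token estimate (~4 characters per token) and the context-window table below are illustrative assumptions, not values read from Ollama metadata:

```python
# Approximate context windows (tokens); illustrative values only.
CONTEXT_WINDOWS = {
    "llama2:7b": 4096,
    "mistral:7b": 8192,
    "codellama:34b": 16384,
}

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def check_flow_fits(model: str, prompt: str, reserve: int = 512) -> bool:
    """Return True if the prompt plus a response reserve fits the model."""
    window = CONTEXT_WINDOWS.get(model, 4096)
    fits = estimate_tokens(prompt) + reserve <= window
    if not fits:
        print(f"warning: flow exceeds {model} capacity ({window} tokens)")
    return fits
```

A real implementation would tokenize with the model's own tokenizer rather than a character heuristic.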
### 2. Instruction Format
Local models use different instruction templates: Llama 2 Chat wraps prompts in `[INST]` / `<<SYS>>` markers, while Mistral uses a bare `[INST]` block with no system slot.

### 3. Tool Support
Most local models do not support native tool calling. The backend must:

- Detect if the model supports tools (via Ollama metadata)
- For models without tool support:
  - Compile tool invocations as plain text instructions
  - Parse tool results from text output
  - Gracefully degrade functionality
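For models without tool support, tool calls could be round-tripped through plain text. A sketch, assuming a `TOOL_CALL: name(json_args)` convention; the convention itself is an illustration, not part of any spec:

```python
import json
import re

# Hypothetical plain-text convention for models without native tool calling.
TOOL_CALL_RE = re.compile(r"TOOL_CALL:\s*(\w+)\((.*)\)", re.DOTALL)

def compile_tool_instruction(name: str, description: str) -> str:
    """Render a tool as a plain-text instruction for the system prompt."""
    return (
        f"You may use the tool '{name}': {description}\n"
        f"To call it, reply exactly: TOOL_CALL: {name}({{...json args...}})"
    )

def parse_tool_call(output: str):
    """Extract a (name, args) pair from model text, or None if absent."""
    m = TOOL_CALL_RE.search(output)
    if not m:
        return None
    name, raw_args = m.group(1), m.group(2)
    try:
        return name, json.loads(raw_args or "{}")
    except json.JSONDecodeError:
        return name, {}  # degrade gracefully on malformed args
```

Returning `None` when no call is found lets the caller treat the output as a plain answer, which is the graceful-degradation path.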
### 4. Quantization Awareness
Local models are often quantized (4-bit, 8-bit) for efficiency:

- **Impact**: Lower precision may affect reasoning quality
- **Solution**: Adjust confidence thresholds and validation rules
- **Mitigation**: Use clearer, more explicit prompts
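One way to adjust validation by quantization level is a simple threshold table. The labels and values below are illustrative assumptions, not tuned numbers:

```python
# Illustrative only: stricter validation for lower-precision quantizations.
QUANT_CONFIDENCE = {
    "q4": 0.80,    # 4-bit: require the highest confidence before accepting
    "q8": 0.70,    # 8-bit: moderate threshold
    "fp16": 0.60,  # half precision: default threshold
}

def confidence_threshold(quantization: str) -> float:
    """Pick a validation threshold from the quantization level.

    Unknown quantizations get the strictest threshold as a safe default.
    """
    return QUANT_CONFIDENCE.get(quantization, 0.80)
```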
## Planned Features
### System Prompt Simplification
**Goal**: Compress system prompts to fit smaller context windows.

**Strategy**: Simplify boilerplate, compress anchor instructions, and prioritize essential context (see Design Considerations above).

### Instruction Template Detection
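Detection could key off the model family in the Ollama model name (e.g. `llama2:7b`). A sketch using the published Llama 2 chat format; the mapping and the bare-prompt fallback are assumptions:

```python
# Known instruction templates, keyed by model family (illustrative subset).
TEMPLATES = {
    "llama2": "<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]",
    "mistral": "<s>[INST] {user} [/INST]",
    "codellama": "<s>[INST] {user} [/INST]",
}

def detect_template(model: str) -> str:
    """Pick an instruction template from a model name like 'llama2:7b'."""
    family = model.split(":", 1)[0].lower()
    # Fall back to a bare prompt when the family is unknown.
    return TEMPLATES.get(family, "{system}\n\n{user}")
```

In practice Ollama applies the model's own template server-side, so this detection would mainly matter when compiling raw prompts.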
### Tool Graceful Degradation
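Degradation starts with a capability check. Newer Ollama versions report a `capabilities` list in model metadata; the sketch below assumes that shape and falls back to plain-text tooling otherwise:

```python
def supports_tools(model_info: dict) -> bool:
    """Check Ollama model metadata for native tool-calling support.

    Assumes the metadata exposes a 'capabilities' list; anything else
    is treated as no tool support.
    """
    return "tools" in model_info.get("capabilities", [])

def plan_tool_strategy(model_info: dict) -> str:
    """Choose between native tool calls and plain-text degradation."""
    return "native" if supports_tools(model_info) else "text-fallback"
```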
## Implementation Roadmap
### Phase 1: Core Compilation
- Implement `compile_system_prompt()` with prompt compression
- Add instruction template detection
- Implement `compile_step()` for basic steps
### Phase 2: Advanced Features
- Add tool graceful degradation
- Implement context window management
- Add model-specific optimizations (Llama vs Mistral)
### Phase 3: Optimization
- Benchmark prompt efficiency across quantization levels
- Add automatic model selection based on flow complexity
- Implement local memory backend for persistent storage
## Example: Ollama API Integration
### Ollama HTTP API
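The Ollama server listens on port 11434 by default and exposes `POST /api/generate`. A minimal sketch using only the standard library; the payload builder is split out so it can be exercised without a running server:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request for the Ollama HTTP API."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama server, return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=False the whole completion arrives as one JSON object.
        return json.loads(resp.read())["response"]
```

With `"stream": False` the server returns a single JSON object whose `response` field holds the generated text; streaming mode would instead emit one JSON object per token chunk.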
## Contributing
Interested in implementing the Ollama backend?

1. **Study the reference**: Read `anthropic_backend.py` and `gemini_backend.py`
2. **Install Ollama**: https://ollama.ai/download
3. **Test with local models**: `ollama run llama2`
4. **Implement `OllamaBackend`**: Follow the `BaseBackend` interface
5. **Create `OllamaModelClient`**: Implement the `ModelClient` protocol
6. **Submit a PR** with tests and documentation
## Next Steps

- **Backend Overview**: Review backend architecture principles
- **Anthropic Reference**: Study the reference implementation
