
Ollama Backend (Stub)

The OllamaBackend is a placeholder for local model prompt compilation via Ollama. It is scheduled for Phase 2 expansion.

Status

Implementation: NOT YET IMPLEMENTED
Priority: Scheduled for Phase 2 expansion
Use Case: Running AXON programs on local LLMs (Llama, Mistral, etc.)

About Ollama

Ollama is a tool for running large language models locally. It provides:
  • Simple API for model inference
  • Support for Llama 2, Mistral, CodeLlama, and more
  • Lightweight model management
  • No external API keys required

Why an Ollama Backend?
  • Privacy: Run AXON programs entirely on-premises
  • Cost: Zero API costs for development and testing
  • Latency: No network round-trips
  • Experimentation: Test with different model sizes and quantizations

Stub Implementation

from typing import Any

from axon.backends.base_backend import BaseBackend, CompiledStep, CompilationContext
from axon.compiler.ir_nodes import IRNode, IRPersona, IRContext, IRAnchor, IRToolSpec
from axon.compiler.ir_nodes import IRNode, IRPersona, IRContext, IRAnchor, IRToolSpec

class OllamaBackend(BaseBackend):
    """Stub implementation for the Ollama backend."""

    @property
    def name(self) -> str:
        return "ollama"

    def compile_step(
        self, step: IRNode, context: CompilationContext
    ) -> CompiledStep:
        raise NotImplementedError(
            "Ollama backend is not yet implemented. "
            "Scheduled for Phase 2 expansion. "
            "Should adapt prompts for local models with smaller "
            "context windows and optional tool support."
        )

    def compile_system_prompt(
        self,
        persona: IRPersona | None,
        context: IRContext | None,
        anchors: list[IRAnchor],
    ) -> str:
        raise NotImplementedError(
            "Ollama system prompt compilation is not yet implemented. "
            "Should produce simplified system prompts suitable for "
            "local models (Llama, Mistral, etc.)."
        )

    def compile_tool_spec(self, tool: IRToolSpec) -> dict[str, Any]:
        raise NotImplementedError(
            "Ollama tool spec compilation is not yet implemented. "
            "Should produce Ollama-compatible tool format or gracefully "
            "degrade for models without tool support."
        )

Design Considerations

1. Context Window Constraints

Local models typically have smaller context windows:
Model                 Context Window
Llama 2 7B            4,096 tokens
Mistral 7B            8,192 tokens
CodeLlama 34B         16,384 tokens
GPT-4 (comparison)    128,000 tokens

Implication: The Ollama backend must:
  • Simplify system prompts
  • Compress anchor instructions
  • Prioritize essential context
  • Warn when flows exceed model capacity
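The budget check in the last bullet could be sketched as follows. This is a hypothetical helper, not part of the current codebase: `CONTEXT_WINDOWS`, `estimate_tokens`, and `check_context_budget` are illustrative names, and the character-based token estimate is a rough heuristic (a real implementation would use a model-specific tokenizer):

```python
import warnings

# Context limits mirror the table above (assumed defaults).
CONTEXT_WINDOWS = {
    "llama2": 4096,
    "mistral": 8192,
    "codellama": 16384,
}


def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def check_context_budget(prompt: str, model: str, reserve_for_output: int = 512) -> bool:
    """Return True if the prompt likely fits; warn and return False otherwise."""
    limit = CONTEXT_WINDOWS.get(model, 4096)
    used = estimate_tokens(prompt)
    if used + reserve_for_output > limit:
        warnings.warn(
            f"Prompt (~{used} tokens) may exceed {model}'s "
            f"{limit}-token context window."
        )
        return False
    return True
```

A compiler pass could run this check on every compiled step and surface the warning before the flow is executed.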

2. Instruction Format

Local models use different instruction templates.

Llama 2 Chat:
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

User question here [/INST]
Mistral Instruct:
<s>[INST] User instruction [/INST]
Alpaca Format:
### Instruction:
You are a helpful assistant.

### Input:
User question here

### Response:
The backend must detect the model family and apply the correct template.

3. Tool Support

Most local models do not support native tool calling. The backend must:
  • Detect if the model supports tools (via Ollama metadata)
  • For models without tool support:
    • Compile tool invocations as plain text instructions
    • Parse tool results from text output
    • Gracefully degrade functionality

4. Quantization Awareness

Local models are often quantized (4-bit, 8-bit) for efficiency:
  • Impact: Lower precision may affect reasoning quality
  • Solution: Adjust confidence thresholds and validation rules
  • Mitigation: Use clearer, more explicit prompts
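One way to adjust confidence thresholds per quantization level is sketched below. The names and adjustment values are hypothetical, chosen only to illustrate the idea of loosening validation for lower-precision models:

```python
# Assumed adjustment table: heavier quantization gets a looser threshold.
QUANT_CONFIDENCE_ADJUSTMENT = {
    "q4": -0.15,   # 4-bit: loosen the most
    "q8": -0.05,   # 8-bit: small adjustment
    "fp16": 0.0,   # half precision: no adjustment
}


def adjusted_confidence_threshold(base: float, quantization: str) -> float:
    """Shift a base confidence threshold by the quantization penalty, clamped to [0, 1]."""
    delta = QUANT_CONFIDENCE_ADJUSTMENT.get(quantization, 0.0)
    return max(0.0, min(1.0, base + delta))
```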

Planned Features

System Prompt Simplification

Goal: Compress system prompts to fit smaller context windows.

Strategy:
def compile_system_prompt(
    self, persona, context, anchors
) -> str:
    # Simplified format for local models
    parts = []
    
    if persona:
        # Concise persona: "You are X specializing in Y."
        parts.append(f"You are {persona.name} specializing in {', '.join(persona.domain[:2])}.")
    
    if anchors:
        # Compact constraint list
        parts.append("Rules: " + "; ".join(a.require for a in anchors if a.require))
    
    return " ".join(parts)
Example:
You are LegalExpert specializing in contract law, IP. Rules: cite all sources; no hallucination.

Instruction Template Detection

def _detect_instruction_format(self, model_name: str) -> str:
    """Detect instruction template from model name."""
    if "llama" in model_name.lower():
        return "llama2_chat"
    if "mistral" in model_name.lower():
        return "mistral_instruct"
    if "alpaca" in model_name.lower():
        return "alpaca"
    return "generic"

def _apply_instruction_template(
    self, prompt: str, model_format: str
) -> str:
    """Wrap prompt in model-specific instruction template."""
    if model_format == "llama2_chat":
        return f"[INST] {prompt} [/INST]"
    elif model_format == "mistral_instruct":
        return f"<s>[INST] {prompt} [/INST]"
    # ...
    return prompt
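For illustration, the two helpers above can be chained as standalone functions (this sketch drops the `self` parameter and the elided branches; it is not the planned implementation):

```python
def detect_instruction_format(model_name: str) -> str:
    """Guess the instruction template family from the model name."""
    name = model_name.lower()
    if "llama" in name:
        return "llama2_chat"
    if "mistral" in name:
        return "mistral_instruct"
    if "alpaca" in name:
        return "alpaca"
    return "generic"


def apply_instruction_template(prompt: str, model_format: str) -> str:
    """Wrap a prompt in the detected model-specific template."""
    if model_format == "llama2_chat":
        return f"[INST] {prompt} [/INST]"
    if model_format == "mistral_instruct":
        return f"<s>[INST] {prompt} [/INST]"
    return prompt


# Detect and wrap in one step:
wrapped = apply_instruction_template(
    "Summarize this.", detect_instruction_format("mistral:7b-instruct")
)
```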

Tool Graceful Degradation

def compile_tool_spec(self, tool: IRToolSpec) -> dict[str, Any]:
    # Check if model supports tools (hypothetical API)
    if self._model_supports_tools():
        # Return Ollama tool format
        return {"name": tool.name, "description": tool.description}
    else:
        # Degrade to text-based tool calling
        return {
            "name": tool.name,
            "mode": "text_simulation",
            "instruction": (
                f"To use {tool.name}, output: TOOL_CALL[{tool.name}](query=\"...\")\n"
                f"The system will execute it and provide results."
            ),
        }
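The fallback branch can be demonstrated with a stand-in tool object (the `Tool` dataclass and `degrade_tool_spec` function are illustrative, not the actual IRToolSpec or backend method):

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """Stand-in for IRToolSpec."""
    name: str
    description: str


def degrade_tool_spec(tool, supports_tools: bool) -> dict:
    """Mirror the compile_tool_spec logic above for a known capability flag."""
    if supports_tools:
        # Native Ollama tool format
        return {"name": tool.name, "description": tool.description}
    # Degrade to text-based tool calling
    return {
        "name": tool.name,
        "mode": "text_simulation",
        "instruction": (
            f'To use {tool.name}, output: TOOL_CALL[{tool.name}](query="...")\n'
            "The system will execute it and provide results."
        ),
    }


spec = degrade_tool_spec(Tool("web_search", "Search the web"), supports_tools=False)
```

The runtime would then scan model output for the `TOOL_CALL[...]` pattern, execute the tool, and feed results back as plain text.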

Implementation Roadmap

Phase 1: Core Compilation

  • Implement compile_system_prompt() with prompt compression
  • Add instruction template detection
  • Implement compile_step() for basic steps

Phase 2: Advanced Features

  • Add tool graceful degradation
  • Implement context window management
  • Add model-specific optimizations (Llama vs Mistral)

Phase 3: Optimization

  • Benchmark prompt efficiency across quantization levels
  • Add automatic model selection based on flow complexity
  • Implement local memory backend for persistent storage

Example: Ollama API Integration

Ollama HTTP API

import httpx

# Note: ModelResponse is AXON's standard response type from the
# ModelClient protocol; import it from the appropriate AXON module.
class OllamaModelClient:
    """Model client for Ollama API."""
    
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url
    
    async def call(
        self,
        system_prompt: str,
        user_prompt: str,
        model: str = "llama2",
        **kwargs
    ) -> ModelResponse:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": model,
                    "prompt": self._format_prompt(system_prompt, user_prompt),
                    "stream": False,
                },
            )
            data = response.json()
            return ModelResponse(content=data["response"])
    
    def _format_prompt(self, system: str, user: str) -> str:
        # Apply Llama 2 instruction template
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

Contributing

Interested in implementing the Ollama backend?
  1. Study the reference: Read anthropic_backend.py and gemini_backend.py
  2. Install Ollama: https://ollama.ai/download
  3. Test with local models: ollama run llama2
  4. Implement OllamaBackend: Follow the BaseBackend interface
  5. Create OllamaModelClient: Implement the ModelClient protocol
  6. Submit a PR with tests and documentation

Next Steps

Backend Overview

Review backend architecture principles

Anthropic Reference

Study the reference implementation
