
Ollama Backend (Stub)

The OllamaBackend is a placeholder for local model prompt compilation via Ollama. It is scheduled for Phase 2 expansion.

Status

Implementation: NOT YET IMPLEMENTED
Priority: Scheduled for Phase 2 expansion
Use Case: Running AXON programs on local LLMs (Llama, Mistral, etc.)

About Ollama

Ollama is a tool for running large language models locally. It provides:
  • Simple API for model inference
  • Support for Llama 2, Mistral, CodeLlama, and more
  • Lightweight model management
  • No external API keys required

Why an Ollama Backend?
  • Privacy: Run AXON programs entirely on-premises
  • Cost: Zero API costs for development and testing
  • Latency: No network round-trips
  • Experimentation: Test with different model sizes and quantizations

Stub Implementation

from typing import Any

from axon.backends.base_backend import BaseBackend, CompiledStep, CompilationContext
from axon.compiler.ir_nodes import IRNode, IRPersona, IRContext, IRAnchor, IRToolSpec
from axon.compiler.ir_nodes import IRNode, IRPersona, IRContext, IRAnchor, IRToolSpec

class OllamaBackend(BaseBackend):
    """Stub implementation for the Ollama backend."""

    @property
    def name(self) -> str:
        return "ollama"

    def compile_step(
        self, step: IRNode, context: CompilationContext
    ) -> CompiledStep:
        raise NotImplementedError(
            "Ollama backend is not yet implemented. "
            "Scheduled for Phase 2 expansion. "
            "Should adapt prompts for local models with smaller "
            "context windows and optional tool support."
        )

    def compile_system_prompt(
        self,
        persona: IRPersona | None,
        context: IRContext | None,
        anchors: list[IRAnchor],
    ) -> str:
        raise NotImplementedError(
            "Ollama system prompt compilation is not yet implemented. "
            "Should produce simplified system prompts suitable for "
            "local models (Llama, Mistral, etc.)."
        )

    def compile_tool_spec(self, tool: IRToolSpec) -> dict[str, Any]:
        raise NotImplementedError(
            "Ollama tool spec compilation is not yet implemented. "
            "Should produce Ollama-compatible tool format or gracefully "
            "degrade for models without tool support."
        )

Design Considerations

1. Context Window Constraints

Local models typically have smaller context windows:
Model                 Context Window
Llama 2 7B            4,096 tokens
Mistral 7B            8,192 tokens
CodeLlama 34B         16,384 tokens
GPT-4 (comparison)    128,000 tokens

Implication: The Ollama backend must:
  • Simplify system prompts
  • Compress anchor instructions
  • Prioritize essential context
  • Warn when flows exceed model capacity
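The budget check in the last bullet could be sketched as follows. This is a hypothetical helper, not part of the current codebase: `CONTEXT_WINDOWS`, `estimate_tokens`, and `check_context_budget` are illustrative names, and the character-based token estimate is a rough heuristic (a real implementation would use a model-specific tokenizer):

```python
import warnings

# Context limits mirror the table above (assumed defaults).
CONTEXT_WINDOWS = {
    "llama2": 4096,
    "mistral": 8192,
    "codellama": 16384,
}


def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def check_context_budget(prompt: str, model: str, reserve_for_output: int = 512) -> bool:
    """Return True if the prompt likely fits; warn and return False otherwise."""
    limit = CONTEXT_WINDOWS.get(model, 4096)
    used = estimate_tokens(prompt)
    if used + reserve_for_output > limit:
        warnings.warn(
            f"Prompt (~{used} tokens) may exceed {model}'s "
            f"{limit}-token context window."
        )
        return False
    return True
```

A compiler pass could run this check on every compiled step and surface the warning before the flow is executed.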

2. Instruction Format

Local models use different instruction templates.

Llama 2 Chat:
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

User question here [/INST]
Mistral Instruct:
<s>[INST] User instruction [/INST]
Alpaca Format:
### Instruction:
You are a helpful assistant.

### Input:
User question here

### Response:
The backend must detect the model family and apply the correct template.

3. Tool Support

Most local models do not support native tool calling. The backend must:
  • Detect if the model supports tools (via Ollama metadata)
  • For models without tool support:
    • Compile tool invocations as plain text instructions
    • Parse tool results from text output
    • Gracefully degrade functionality

4. Quantization Awareness

Local models are often quantized (4-bit, 8-bit) for efficiency:
  • Impact: Lower precision may affect reasoning quality
  • Solution: Adjust confidence thresholds and validation rules
  • Mitigation: Use clearer, more explicit prompts
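One way to adjust confidence thresholds per quantization level is sketched below. The names and adjustment values are hypothetical, chosen only to illustrate the idea of loosening validation for lower-precision models:

```python
# Assumed adjustment table: heavier quantization gets a looser threshold.
QUANT_CONFIDENCE_ADJUSTMENT = {
    "q4": -0.15,   # 4-bit: loosen the most
    "q8": -0.05,   # 8-bit: small adjustment
    "fp16": 0.0,   # half precision: no adjustment
}


def adjusted_confidence_threshold(base: float, quantization: str) -> float:
    """Shift a base confidence threshold by the quantization penalty, clamped to [0, 1]."""
    delta = QUANT_CONFIDENCE_ADJUSTMENT.get(quantization, 0.0)
    return max(0.0, min(1.0, base + delta))
```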

Planned Features

System Prompt Simplification

Goal: Compress system prompts to fit smaller context windows.

Strategy:
def compile_system_prompt(
    self, persona, context, anchors
) -> str:
    # Simplified format for local models
    parts = []
    
    if persona:
        # Concise persona: "You are X specializing in Y."
        parts.append(f"You are {persona.name} specializing in {', '.join(persona.domain[:2])}.")
    
    if anchors:
        # Compact constraint list
        parts.append("Rules: " + "; ".join(a.require for a in anchors if a.require))
    
    return " ".join(parts)
Example:
You are LegalExpert specializing in contract law, IP. Rules: cite all sources; no hallucination.

Instruction Template Detection

def _detect_instruction_format(self, model_name: str) -> str:
    """Detect instruction template from model name."""
    if "llama" in model_name.lower():
        return "llama2_chat"
    if "mistral" in model_name.lower():
        return "mistral_instruct"
    if "alpaca" in model_name.lower():
        return "alpaca"
    return "generic"

def _apply_instruction_template(
    self, prompt: str, model_format: str
) -> str:
    """Wrap prompt in model-specific instruction template."""
    if model_format == "llama2_chat":
        return f"[INST] {prompt} [/INST]"
    elif model_format == "mistral_instruct":
        return f"<s>[INST] {prompt} [/INST]"
    # ...
    return prompt
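For illustration, the two helpers above can be chained as standalone functions (this sketch drops the `self` parameter and the elided branches; it is not the planned implementation):

```python
def detect_instruction_format(model_name: str) -> str:
    """Guess the instruction template family from the model name."""
    name = model_name.lower()
    if "llama" in name:
        return "llama2_chat"
    if "mistral" in name:
        return "mistral_instruct"
    if "alpaca" in name:
        return "alpaca"
    return "generic"


def apply_instruction_template(prompt: str, model_format: str) -> str:
    """Wrap a prompt in the detected model-specific template."""
    if model_format == "llama2_chat":
        return f"[INST] {prompt} [/INST]"
    if model_format == "mistral_instruct":
        return f"<s>[INST] {prompt} [/INST]"
    return prompt


# Detect and wrap in one step:
wrapped = apply_instruction_template(
    "Summarize this.", detect_instruction_format("mistral:7b-instruct")
)
```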

Tool Graceful Degradation

def compile_tool_spec(self, tool: IRToolSpec) -> dict[str, Any]:
    # Check if model supports tools (hypothetical API)
    if self._model_supports_tools():
        # Return Ollama tool format
        return {"name": tool.name, "description": tool.description}
    else:
        # Degrade to text-based tool calling
        return {
            "name": tool.name,
            "mode": "text_simulation",
            "instruction": (
                f"To use {tool.name}, output: TOOL_CALL[{tool.name}](query=\"...\")\n"
                f"The system will execute it and provide results."
            ),
        }
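The fallback branch can be demonstrated with a stand-in tool object (the `Tool` dataclass and `degrade_tool_spec` function are illustrative, not the actual IRToolSpec or backend method):

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """Stand-in for IRToolSpec."""
    name: str
    description: str


def degrade_tool_spec(tool, supports_tools: bool) -> dict:
    """Mirror the compile_tool_spec logic above for a known capability flag."""
    if supports_tools:
        # Native Ollama tool format
        return {"name": tool.name, "description": tool.description}
    # Degrade to text-based tool calling
    return {
        "name": tool.name,
        "mode": "text_simulation",
        "instruction": (
            f'To use {tool.name}, output: TOOL_CALL[{tool.name}](query="...")\n'
            "The system will execute it and provide results."
        ),
    }


spec = degrade_tool_spec(Tool("web_search", "Search the web"), supports_tools=False)
```

The runtime would then scan model output for the `TOOL_CALL[...]` pattern, execute the tool, and feed results back as plain text.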

Implementation Roadmap

Phase 1: Core Compilation

  • Implement compile_system_prompt() with prompt compression
  • Add instruction template detection
  • Implement compile_step() for basic steps

Phase 2: Advanced Features

  • Add tool graceful degradation
  • Implement context window management
  • Add model-specific optimizations (Llama vs Mistral)

Phase 3: Optimization

  • Benchmark prompt efficiency across quantization levels
  • Add automatic model selection based on flow complexity
  • Implement local memory backend for persistent storage

Example: Ollama API Integration

Ollama HTTP API

import httpx

# Note: ModelResponse is AXON's standard response type from the
# ModelClient protocol; import it from the appropriate AXON module.
class OllamaModelClient:
    """Model client for Ollama API."""
    
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url
    
    async def call(
        self,
        system_prompt: str,
        user_prompt: str,
        model: str = "llama2",
        **kwargs
    ) -> ModelResponse:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": model,
                    "prompt": self._format_prompt(system_prompt, user_prompt),
                    "stream": False,
                },
            )
            data = response.json()
            return ModelResponse(content=data["response"])
    
    def _format_prompt(self, system: str, user: str) -> str:
        # Apply Llama 2 instruction template
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

Contributing

Interested in implementing the Ollama backend?
  1. Study the reference: Read anthropic_backend.py and gemini_backend.py
  2. Install Ollama: https://ollama.ai/download
  3. Test with local models: ollama run llama2
  4. Implement OllamaBackend: Follow the BaseBackend interface
  5. Create OllamaModelClient: Implement the ModelClient protocol
  6. Submit a PR with tests and documentation

Next Steps

Backend Overview

Review backend architecture principles

Anthropic Reference

Study the reference implementation
