Ollama
Ollama provides a simple way to run large language models locally. Once you have Ollama installed and running, you can configure Crush to use it.

Configuration
Add the following to your `crush.json` configuration file:
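A minimal example, assuming a top-level `providers` map; the exact field nesting and the model entry shown here are illustrative, so check them against Crush's configuration reference:

```json
{
  "providers": {
    "ollama": {
      "name": "Ollama",
      "type": "openai-compat",
      "base_url": "http://localhost:11434/v1/",
      "models": [
        {
          "id": "qwen3:30b",
          "name": "Qwen 3 30B",
          "context_window": 65536,
          "default_max_tokens": 4096
        }
      ]
    }
  }
}
```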
Key Configuration Fields
- `type`: Must be `"openai-compat"` for OpenAI-compatible APIs
- `base_url`: The endpoint where Ollama is running (default: `http://localhost:11434/v1/`)
- `id`: The model identifier used by Ollama (e.g., `qwen3:30b`, `llama3:70b`)
- `context_window`: Maximum number of tokens the model can process
- `default_max_tokens`: Default output token limit
No API key is required for local Ollama instances.
LM Studio
LM Studio is a desktop application for running local LLMs with a user-friendly interface.

Configuration
Add the following to your `crush.json` configuration file:
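A sketch of a corresponding entry, again assuming a top-level `providers` map; the model `id` here is a placeholder, so use whatever identifier LM Studio reports for your loaded model:

```json
{
  "providers": {
    "lmstudio": {
      "name": "LM Studio",
      "type": "openai-compat",
      "base_url": "http://localhost:1234/v1/",
      "models": [
        {
          "id": "qwen/qwen3-30b",
          "name": "Qwen 3 30B",
          "context_window": 32768,
          "default_max_tokens": 4096
        }
      ]
    }
  }
}
```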
Key Configuration Fields
- `type`: Must be `"openai-compat"` for OpenAI-compatible APIs
- `base_url`: The endpoint where LM Studio is running (default: `http://localhost:1234/v1/`)
- `id`: The model identifier from LM Studio
- `context_window`: Maximum number of tokens the model can process
- `default_max_tokens`: Default output token limit
Make sure to start the local server in LM Studio before using it with Crush.
Model Context Windows and Token Limits
When configuring local models, it’s important to set appropriate values for context windows and token limits.

Context Window
The `context_window` defines the maximum number of tokens (input + output) the model can handle in a single request. Common values:
- Small models (7B parameters): 4,000 - 8,000 tokens
- Medium models (13-30B parameters): 32,000 - 128,000 tokens
- Large models (70B+ parameters): 128,000 - 256,000 tokens
Default Max Tokens
The `default_max_tokens` sets the maximum number of tokens for the model’s response. This should be:
- Less than the context window
- Appropriate for your use case (e.g., 4,000 for code generation, 20,000 for longer explanations)
- Balanced with input size to stay within the context window
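The arithmetic behind these guidelines: the output budget is whatever the context window leaves after the input. A hypothetical helper (not part of Crush) illustrating the calculation:

```python
def max_output_tokens(context_window: int, input_tokens: int,
                      default_max_tokens: int) -> int:
    """Clamp the response budget so input + output stays within the context window."""
    remaining = context_window - input_tokens
    return max(0, min(default_max_tokens, remaining))

# A 32k window with a 10k-token prompt leaves ample room for a 4k response:
print(max_output_tokens(32_000, 10_000, 4_000))  # 4000
# An 8k window with a 6.5k-token prompt leaves only 1,500 tokens for output:
print(max_output_tokens(8_000, 6_500, 4_000))    # 1500
```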
Performance Considerations
When running local models, keep these factors in mind:

Hardware Requirements
- GPU Memory: Larger models require more VRAM (e.g., 30B model needs ~20GB)
- RAM: System RAM should be at least 2x the model size for smooth operation
- CPU: Multi-core processors improve inference speed, especially without GPU acceleration
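The VRAM figure follows from the quantized weight size plus runtime overhead for the KV cache and activations. A back-of-the-envelope estimate (a rough heuristic, not a measurement; the function and the 20% overhead factor are assumptions for illustration):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: int = 4,
                   overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weight size plus ~20% for KV cache
    and activations. Real usage varies by runtime and context length."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

print(round(approx_vram_gb(30, 4), 1))  # ≈18 GB, in line with the ~20 GB figure above
```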
Response Time
- Local models typically generate 10-50 tokens/second depending on hardware
- GPU acceleration significantly improves speed (5-10x faster than CPU-only)
- Smaller models respond faster but may have reduced capabilities
Model Selection
- Code-focused models: Qwen Coder, CodeLlama, DeepSeek Coder
- General purpose: Llama 3, Mistral, Mixtral
- Balance size vs. capability based on your hardware
Troubleshooting
Connection Errors
If Crush can’t connect to your local model:

- Verify the service is running (Ollama or LM Studio)
- Check the `base_url` matches the actual endpoint
- Ensure no firewall is blocking local connections
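One way to isolate the problem is to query the `/v1/models` endpoint that OpenAI-compatible servers expose, independently of Crush. A standard-library sketch (the helper names are illustrative):

```python
import json
import urllib.request

def model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-compatible /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]

def list_local_models(base_url: str = "http://localhost:11434/v1") -> list:
    """Fetch the model list from a local OpenAI-compatible server."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return model_ids(json.load(resp))
```

If this call fails, the problem is the endpoint or the server itself rather than your Crush configuration.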
Slow Performance
If responses are very slow:

- Check if GPU acceleration is enabled in Ollama/LM Studio
- Consider using a smaller model
- Reduce `default_max_tokens` to generate shorter responses
Out of Memory Errors
If you see memory errors:

- Close other applications to free up RAM/VRAM
- Use a smaller model variant
- Reduce the context window size
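For the last point, a model entry with a deliberately reduced window (surrounding provider configuration omitted; values are illustrative):

```json
{
  "id": "qwen3:30b",
  "context_window": 16384,
  "default_max_tokens": 2048
}
```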
Next Steps
- Custom Providers: Learn about configuring custom OpenAI-compatible providers
- Air-Gapped Environments: Run Crush in restricted network environments