Ollama
Ollama provides a simple way to run large language models locally. Once you have Ollama installed and running, you can configure Crush to use it.

Configuration
Add the following to your `crush.json` configuration file:
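A minimal example, assuming a top-level `providers` map; the exact field nesting and the model entry shown here are illustrative, so check them against Crush's configuration reference:

```json
{
  "providers": {
    "ollama": {
      "name": "Ollama",
      "type": "openai-compat",
      "base_url": "http://localhost:11434/v1/",
      "models": [
        {
          "id": "qwen3:30b",
          "name": "Qwen 3 30B",
          "context_window": 65536,
          "default_max_tokens": 4096
        }
      ]
    }
  }
}
```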
Key Configuration Fields
- `type`: Must be `"openai-compat"` for OpenAI-compatible APIs
- `base_url`: The endpoint where Ollama is running (default: `http://localhost:11434/v1/`)
- `id`: The model identifier used by Ollama (e.g., `qwen3:30b`, `llama3:70b`)
- `context_window`: Maximum number of tokens the model can process
- `default_max_tokens`: Default output token limit
No API key is required for local Ollama instances.
LM Studio
LM Studio is a desktop application for running local LLMs with a user-friendly interface.

Configuration
Add the following to your `crush.json` configuration file:
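A sketch of a corresponding entry, again assuming a top-level `providers` map; the model `id` here is a placeholder, so use whatever identifier LM Studio reports for your loaded model:

```json
{
  "providers": {
    "lmstudio": {
      "name": "LM Studio",
      "type": "openai-compat",
      "base_url": "http://localhost:1234/v1/",
      "models": [
        {
          "id": "qwen/qwen3-30b",
          "name": "Qwen 3 30B",
          "context_window": 32768,
          "default_max_tokens": 4096
        }
      ]
    }
  }
}
```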
Key Configuration Fields
- `type`: Must be `"openai-compat"` for OpenAI-compatible APIs
- `base_url`: The endpoint where LM Studio is running (default: `http://localhost:1234/v1/`)
- `id`: The model identifier from LM Studio
- `context_window`: Maximum number of tokens the model can process
- `default_max_tokens`: Default output token limit
Make sure to start the local server in LM Studio before using it with Crush.
Model Context Windows and Token Limits
When configuring local models, it’s important to set appropriate values for context windows and token limits.

Context Window
The `context_window` defines the maximum number of tokens (input + output) the model can handle in a single request. Common values:
- Small models (7B parameters): 4,000 - 8,000 tokens
- Medium models (13-30B parameters): 32,000 - 128,000 tokens
- Large models (70B+ parameters): 128,000 - 256,000 tokens
Default Max Tokens
The `default_max_tokens` sets the maximum number of tokens for the model’s response. This should be:
- Less than the context window
- Appropriate for your use case (e.g., 4,000 for code generation, 20,000 for longer explanations)
- Balanced with input size to stay within the context window
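The arithmetic behind these guidelines: the output budget is whatever the context window leaves after the input. A hypothetical helper (not part of Crush) illustrating the calculation:

```python
def max_output_tokens(context_window: int, input_tokens: int,
                      default_max_tokens: int) -> int:
    """Clamp the response budget so input + output stays within the context window."""
    remaining = context_window - input_tokens
    return max(0, min(default_max_tokens, remaining))

# A 32k window with a 10k-token prompt leaves ample room for a 4k response:
print(max_output_tokens(32_000, 10_000, 4_000))  # 4000
# An 8k window with a 6.5k-token prompt leaves only 1,500 tokens for output:
print(max_output_tokens(8_000, 6_500, 4_000))    # 1500
```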
Performance Considerations
When running local models, keep these factors in mind:

Hardware Requirements
- GPU Memory: Larger models require more VRAM (e.g., 30B model needs ~20GB)
- RAM: System RAM should be at least 2x the model size for smooth operation
- CPU: Multi-core processors improve inference speed, especially without GPU acceleration
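The VRAM figure follows from the quantized weight size plus runtime overhead for the KV cache and activations. A back-of-the-envelope estimate (a rough heuristic, not a measurement; the function and the 20% overhead factor are assumptions for illustration):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: int = 4,
                   overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weight size plus ~20% for KV cache
    and activations. Real usage varies by runtime and context length."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

print(round(approx_vram_gb(30, 4), 1))  # ≈18 GB, in line with the ~20 GB figure above
```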
Response Time
- Local models typically generate 10-50 tokens/second depending on hardware
- GPU acceleration significantly improves speed (5-10x faster than CPU-only)
- Smaller models respond faster but may have reduced capabilities
Model Selection
- Code-focused models: Qwen Coder, CodeLlama, DeepSeek Coder
- General purpose: Llama 3, Mistral, Mixtral
- Balance size vs. capability based on your hardware
Troubleshooting
Connection Errors
If Crush can’t connect to your local model:

- Verify the service is running (Ollama or LM Studio)
- Check the `base_url` matches the actual endpoint
- Ensure no firewall is blocking local connections
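One way to isolate the problem is to query the `/v1/models` endpoint that OpenAI-compatible servers expose, independently of Crush. A standard-library sketch (the helper names are illustrative):

```python
import json
import urllib.request

def model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-compatible /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]

def list_local_models(base_url: str = "http://localhost:11434/v1") -> list:
    """Fetch the model list from a local OpenAI-compatible server."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return model_ids(json.load(resp))
```

If this call fails, the problem is the endpoint or the server itself rather than your Crush configuration.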
Slow Performance
If responses are very slow:

- Check if GPU acceleration is enabled in Ollama/LM Studio
- Consider using a smaller model
- Reduce `default_max_tokens` to generate shorter responses
Out of Memory Errors
If you see memory errors:

- Close other applications to free up RAM/VRAM
- Use a smaller model variant
- Reduce the context window size
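For the last point, a model entry with a deliberately reduced window (surrounding provider configuration omitted; values are illustrative):

```json
{
  "id": "qwen3:30b",
  "context_window": 16384,
  "default_max_tokens": 2048
}
```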
Next Steps
- Custom Providers: Learn about configuring custom OpenAI-compatible providers
- Air-Gapped Environments: Run Crush in restricted network environments