Overview
Ollama integration allows you to run open-source large language models locally on your machine. It is well suited to development, privacy-sensitive applications, and offline usage.
Setup
Install Ollama
Download and install Ollama from ollama.ai
Configuration
Basic Parameters
URL where Ollama is running. Use default for local installation
Name of the model to use. Must match a pulled model:
- `llama2` - Meta’s Llama 2 (7B, 13B, 70B variants)
- `llama2:13b` - specific size variant
- `mistral` - Mistral 7B
- `mixtral` - Mixtral 8x7B MoE
- `codellama` - code-specialized Llama
- `phi` - Microsoft Phi-2
- `gemma` - Google Gemma
- `vicuna` - Vicuna chat model
Sampling temperature (0.0 to 2.0). Lower = more focused, higher = more creative
Enable token streaming for real-time responses
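With streaming enabled, Ollama returns the response as newline-delimited JSON objects, each carrying a `response` fragment and a final object with `"done": true`. A minimal sketch of a client-side collector (the sample lines below are canned stand-ins for a real stream):

```python
import json

def collect_stream(lines):
    """Join the "response" fragments from a stream of NDJSON lines."""
    text = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals end of generation
            break
    return "".join(text)

# Canned sample of what a /api/generate stream looks like:
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "!", "done": true}',
]
print(collect_stream(sample))  # -> Hello, world!
```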
Advanced Parameters
Nucleus sampling threshold (0.0 to 1.0). Higher = more diverse text
Limit token selection to top K options. Lower = more conservative
Enable Mirostat sampling for controlling perplexity:
- `0` - Disabled (default)
- `1` - Mirostat 1.0
- `2` - Mirostat 2.0
Mirostat learning rate. Controls how quickly algorithm responds to feedback
Mirostat target perplexity. Controls coherence vs diversity balance
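A sketch of how these advanced sampling settings map onto Ollama’s `options` object. The field names follow the Ollama REST API; the defaults and the range checks here are this sketch’s assumptions, not Flowise’s:

```python
def sampling_options(top_p=0.9, top_k=40, mirostat=0,
                     mirostat_eta=0.1, mirostat_tau=5.0):
    """Build the advanced-sampling part of an Ollama "options" object."""
    if mirostat not in (0, 1, 2):
        raise ValueError("mirostat must be 0 (off), 1, or 2")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be within [0.0, 1.0]")
    return {
        "top_p": top_p,
        "top_k": top_k,
        "mirostat": mirostat,
        "mirostat_eta": mirostat_eta,
        "mirostat_tau": mirostat_tau,
    }

# Example: enable Mirostat 2.0 with the default learning rate and target.
sample_opts = sampling_options(mirostat=2)
```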
Context & Performance
Context window size. Larger = more memory but better long conversations
Number of GPU layers to use. Defaults to 1 on macOS (Metal); set to 0 to disable GPU acceleration
Number of CPU threads. Defaults to optimal. Set to physical CPU cores for best performance
How far back to look for repetition prevention (0 = disabled, -1 = num_ctx)
Penalize repetitions (1.0 = no penalty, 1.5 = strong penalty)
Tail free sampling. Higher = reduced impact of low probability tokens
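The context and performance settings above also travel in Ollama’s `options` object. A sketch with defaults taken from the Ollama documentation (omitting `num_gpu`/`num_thread` lets the server pick its own defaults):

```python
import os

def performance_options(num_ctx=2048, num_gpu=None, num_thread=None,
                        repeat_last_n=64, repeat_penalty=1.1, tfs_z=1.0):
    """Build the context/performance part of an Ollama "options" object."""
    opts = {
        "num_ctx": num_ctx,
        "repeat_last_n": repeat_last_n,
        "repeat_penalty": repeat_penalty,
        "tfs_z": tfs_z,
    }
    # Only include GPU/thread overrides when explicitly set, so the
    # server can otherwise choose sensible defaults.
    if num_gpu is not None:
        opts["num_gpu"] = num_gpu
    if num_thread is not None:
        opts["num_thread"] = num_thread
    return opts

# Example: larger context window, threads pinned to the CPU core count.
perf_opts = performance_options(num_ctx=4096, num_thread=os.cpu_count())
```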
Additional Options
How long to keep model loaded. Duration string like “10m” or “24h”
Stop sequences (comma-separated). Generation stops when these appear
Force model to output only JSON. Specify JSON format in system prompt
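A sketch of how these three settings fit into a request: `keep_alive` and `format` are top-level fields in Ollama’s API, while stop sequences are entered comma-separated but sent as a list under `options`. The helper name is this sketch’s own:

```python
def build_request_extras(stop_csv="", keep_alive="5m", json_mode=False):
    """Turn UI-style inputs into the corresponding Ollama request fields."""
    req = {"keep_alive": keep_alive}      # top-level field
    if json_mode:
        req["format"] = "json"            # top-level field
    # Split the comma-separated UI value into a list of stop strings.
    stops = [s.strip() for s in stop_csv.split(",") if s.strip()]
    if stops:
        req["options"] = {"stop": stops}  # stop lives under "options"
    return req

extras = build_request_extras(stop_csv="###, END",
                              keep_alive="10m", json_mode=True)
```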
Vision Models
Enable for vision-capable models such as llava or bakllava
Authentication
Optional Ollama API key credential if authentication is enabled
Usage Examples
Basic Local Setup
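A minimal sketch of the request a basic local setup produces, assuming Ollama is listening on its default port 11434. The `dry_run` flag is a helper of this sketch that returns the payload instead of calling the server:

```python
import json
import urllib.request

def generate(prompt, model="llama2",
             base_url="http://localhost:11434", dry_run=False):
    """Send a non-streaming generate request to a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.7},
    }
    if dry_run:
        return payload
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Why is the sky blue?")  # needs a running `ollama serve`
payload = generate("Why is the sky blue?", dry_run=True)
```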
Code Generation
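Hypothetical settings for a code-generation flow: the code-specialized `codellama` model with a low temperature for more deterministic completions (the exact values are illustrative):

```python
def code_request(prompt):
    """Build a request payload tuned for code generation."""
    return {
        "model": "codellama",
        "prompt": prompt,
        "stream": False,
        # Low temperature keeps completions focused and repeatable.
        "options": {"temperature": 0.1, "top_p": 0.9},
    }

code_req = code_request("Write a Python function that reverses a string.")
```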
JSON Mode
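A JSON-mode sketch: set `format` to `"json"` and also tell the model in the system prompt to respond with JSON, as the format flag alone does not guarantee a useful answer:

```python
def json_request(prompt, model="llama2"):
    """Build a request that constrains the model's output to JSON."""
    return {
        "model": model,
        "system": "Respond only with valid JSON.",
        "prompt": prompt,
        "format": "json",   # constrains decoding to valid JSON
        "stream": False,
    }

json_req = json_request("List three primary colors as a JSON array.")
```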
Vision Model
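A vision sketch: vision-capable models such as `llava` accept base64-encoded images in an `images` list alongside the prompt. The bytes below stand in for a real image file:

```python
import base64

def vision_request(prompt, image_bytes, model="llava"):
    """Build a request that sends an image to a vision-capable model."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

vision_req = vision_request("What is in this picture?", b"fake-image-bytes")
```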
Remote Ollama Server
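For a remote server, only the base URL changes on the client side; the server itself must be started with `OLLAMA_HOST=0.0.0.0` so it accepts non-local connections. A sketch (the remote address below is a placeholder):

```python
def generate_endpoint(base_url):
    """Resolve the /api/generate endpoint for a given Ollama base URL."""
    return base_url.rstrip("/") + "/api/generate"

local_url = generate_endpoint("http://localhost:11434")
remote_url = generate_endpoint("http://192.168.1.50:11434/")  # hypothetical host
```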
Available Models
Popular Models
| Model | Size | Use Case | GPU Memory |
|---|---|---|---|
| llama2 | 7B | General chat | ~8GB |
| llama2:13b | 13B | Better reasoning | ~16GB |
| llama2:70b | 70B | Highest quality | ~64GB |
| mistral | 7B | Fast, capable | ~8GB |
| mixtral | 47B | MoE, very capable | ~48GB |
| codellama | 7B/13B/34B | Code generation | ~8-40GB |
| phi | 2.7B | Small, efficient | ~4GB |
| gemma | 2B/7B | Google’s model | ~4-8GB |
| llava | 7B | Vision + text | ~8GB |
Find More Models
Browse the Ollama Library for 100+ models.
Performance Optimization
GPU Acceleration
- Set `numGpu` to offload layers to the GPU
- Use CUDA on NVIDIA GPUs
- Use Metal on Apple Silicon
- Monitor GPU memory usage
Memory Management
- Reduce `numCtx` if you hit OOM errors
- Use smaller model variants
- Adjust `keepAlive` to free memory sooner
- Use quantized models (Q4, Q5)
Speed
- Use smaller models (7B vs 70B)
- Reduce context window
- Enable GPU acceleration
- Set `numThread` to your CPU core count
Quality
- Use larger models when possible
- Increase context window
- Fine-tune temperature
- Use appropriate model for task
Best Practices
1. Model Selection
   - Start with 7B models for testing
   - Use code-specific models for programming tasks
   - Consider model size vs. available RAM/VRAM
2. Resource Management
   - Set an appropriate `keepAlive` duration
   - Monitor system resources
   - Use GPU when available
   - Close unused models
3. Prompt Engineering
   - Be specific and clear
   - Use system prompts effectively
   - Provide examples for better results
   - Test with different temperatures
4. Production Deployment
   - Use a dedicated GPU server
   - Set up load balancing for multiple instances
   - Monitor performance metrics
   - Consider model quantization for speed
Common Issues
Connection Refused
If Flowise can’t connect to Ollama:
- Verify Ollama is running: `ollama list`
- Check that the base URL is correct
- Ensure the firewall allows port 11434
- Try `ollama serve` to start the server manually
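As a quick reachability check, a plain GET on the base URL of a running Ollama server returns a 200 response ("Ollama is running"). A small sketch using only the standard library:

```python
import urllib.request

def ollama_reachable(base_url="http://localhost:11434", timeout=3):
    """Return True if an Ollama server answers on the given base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection refused, timeouts
        return False
```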
Model Not Found
Error: “model ‘modelname’ not found”
- Pull the model first: `ollama pull modelname`
- Check the exact model name: `ollama list`
- Ensure the spelling matches exactly
Out of Memory
If you get OOM errors:
- Use a smaller model variant (7B instead of 13B)
- Reduce the `numCtx` parameter
- Close other applications
- Use quantized models (Q4_0)
- Reduce `numGpu` to shift more work to the CPU
Slow Performance
To improve speed:
- Enable GPU acceleration
- Use smaller models
- Reduce the context window
- Set `numThread` to your CPU core count
- Use quantized models