
Overview

Ollama integration allows you to run open-source large language models locally on your machine. Perfect for development, privacy-sensitive applications, and offline usage.

Setup

Step 1: Install Ollama

Download and install Ollama from ollama.ai:
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows - download installer
Step 2: Pull a Model

Download a model to use:
# Pull Llama 2
ollama pull llama2

# Or Mistral
ollama pull mistral

# Or CodeLlama
ollama pull codellama
Step 3: Verify Ollama is Running

Ensure Ollama is accessible:
curl http://localhost:11434
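Beyond the root endpoint, Ollama exposes a REST API; GET /api/tags returns the models you have pulled. A minimal sketch of parsing that response in Python (the sample body below is illustrative and abridged):

```python
import json

def list_model_names(tags_json: str) -> list[str]:
    """Extract model names from an Ollama GET /api/tags response body."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]

# Illustrative /api/tags response body (fields abridged)
sample = '{"models": [{"name": "llama2:latest"}, {"name": "mistral:latest"}]}'
print(list_model_names(sample))  # ['llama2:latest', 'mistral:latest']
```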
Step 4: Add to Flowise

Drag the ChatOllama node from the Chat Models category onto your canvas.

Configuration

Basic Parameters

baseUrl (string, required, default: "http://localhost:11434")
URL where Ollama is running. Use the default for a local installation.

modelName (string, required)
Name of the model to use. Must match a pulled model:
  • llama2 - Meta’s Llama 2 (7B, 13B, 70B variants)
  • llama2:13b - Specific size variant
  • mistral - Mistral 7B
  • mixtral - Mixtral 8x7B MoE
  • codellama - Code-specialized Llama
  • phi - Microsoft Phi-2
  • gemma - Google Gemma
  • vicuna - Vicuna chat model
temperature (number)
Sampling temperature (0.0 to 2.0). Lower values produce more focused output; higher values produce more creative output.

streaming (boolean, default: true)
Enable token streaming for real-time responses.
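Under the hood, these settings map onto Ollama's /api/chat request body: the model and messages sit at the top level, while sampling knobs like temperature go under "options". A sketch of that mapping (the field names follow the Ollama REST API; the helper function itself is illustrative, not part of Flowise):

```python
import json

def build_chat_request(model_name: str, prompt: str,
                       temperature: float = 0.7, streaming: bool = True) -> dict:
    """Assemble an Ollama /api/chat request body from the basic parameters."""
    return {
        "model": model_name,                            # must match a pulled model
        "messages": [{"role": "user", "content": prompt}],
        "stream": streaming,                            # token streaming on/off
        "options": {"temperature": temperature},        # sampling knobs live here
    }

body = build_chat_request("llama2", "Hello!", temperature=0.2)
print(json.dumps(body, indent=2))
```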

Advanced Parameters

topP (number)
Nucleus sampling threshold (0.0 to 1.0). Higher values produce more diverse text.

topK (number)
Limits token selection to the top K options. Lower values are more conservative.

mirostat (number)
Enables Mirostat sampling for controlling perplexity:
  • 0 - Disabled (default)
  • 1 - Mirostat 1.0
  • 2 - Mirostat 2.0

mirostatEta (number)
Mirostat learning rate. Controls how quickly the algorithm responds to feedback.

mirostatTau (number)
Mirostat target perplexity. Controls the balance between coherence and diversity.

Context & Performance

numCtx (number, default: 2048)
Context window size. A larger window uses more memory but handles long conversations better.

numGpu (number)
Number of layers to offload to the GPU. On macOS this defaults to 1 to enable Metal; set 0 to disable the GPU.

numThread (number)
Number of CPU threads. Defaults to an optimal value; set it to your physical core count for best performance.

repeatLastN (number, default: 64)
How far back the model looks when penalizing repetition (0 = disabled, -1 = num_ctx).

repeatPenalty (number)
Penalizes repetition (1.0 = no penalty, 1.5 = strong penalty).

tfsZ (number)
Tail-free sampling. Higher values reduce the impact of low-probability tokens; 1.0 disables it.

Additional Options

keepAlive (string, default: "5m")
How long to keep the model loaded in memory, as a duration string such as "10m" or "24h".

stop (string)
Comma-separated stop sequences. Generation stops when any of them appears.

jsonMode (boolean)
Forces the model to output only JSON. Also describe the expected JSON format in the system prompt.
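These advanced, performance, and output settings correspond to snake_case fields in the Ollama REST API: most go under "options" in the request body, while keep_alive and format: "json" sit at the top level. A sketch of the translation (the helper function is illustrative; the API field names are Ollama's):

```python
def build_request_extras(ui: dict) -> dict:
    """Translate camelCase UI settings into Ollama API request fields."""
    option_keys = {  # camelCase UI name -> Ollama "options" field
        "topP": "top_p", "topK": "top_k",
        "mirostat": "mirostat", "mirostatEta": "mirostat_eta",
        "mirostatTau": "mirostat_tau",
        "numCtx": "num_ctx", "numGpu": "num_gpu", "numThread": "num_thread",
        "repeatLastN": "repeat_last_n", "repeatPenalty": "repeat_penalty",
        "tfsZ": "tfs_z",
    }
    options = {api: ui[name] for name, api in option_keys.items() if name in ui}
    if "stop" in ui:  # comma-separated string -> list of stop sequences
        options["stop"] = [s.strip() for s in ui["stop"].split(",")]
    request = {"options": options}
    if "keepAlive" in ui:  # top-level request field, not an option
        request["keep_alive"] = ui["keepAlive"]
    if ui.get("jsonMode"):  # constrain the model to emit valid JSON
        request["format"] = "json"
    return request

req = build_request_extras({"topP": 0.9, "numCtx": 4096, "stop": "###,END",
                            "keepAlive": "10m", "jsonMode": True})
```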

Vision Models

allowImageUploads (boolean, default: false)
Enables image uploads for vision-capable models such as llava or bakllava.

Authentication

credential (credential, optional)
Ollama API key credential, used only if authentication is enabled on your Ollama server.

Usage Examples

Basic Local Setup

// Simple Llama 2 chat
Base URL: http://localhost:11434
Model Name: llama2
Temperature: 0.7
Context Window Size: 4096
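With streaming enabled, Ollama's /api/chat returns one JSON object per line (NDJSON), each carrying a fragment of the reply; the final line has "done": true. A sketch of reassembling the full text from such a stream (the sample lines are illustrative):

```python
import json

def join_stream(ndjson_lines: list[str]) -> str:
    """Concatenate content fragments from a streamed /api/chat response."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):  # final chunk; carries stats rather than content
            break
    return "".join(parts)

stream = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"done": true}',
]
print(join_stream(stream))  # Hello!
```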

Code Generation

// Optimized for coding
Base URL: http://localhost:11434
Model Name: codellama
Temperature: 0.2
Top P: 0.95
Context Window Size: 8192

JSON Mode

// Force JSON output
Base URL: http://localhost:11434
Model Name: mistral
JSON Mode: true
Temperature: 0.1

// System prompt:
"Return all responses as JSON object with 'answer' and 'confidence' fields"

Vision Model

# First pull a vision model
ollama pull llava

// In Flowise
Model Name: llava
Allow Image Uploads: true
Temperature: 0.4

Remote Ollama Server

// Connect to Ollama on another machine
// (the remote host must run Ollama with OLLAMA_HOST=0.0.0.0 to accept external connections)
Base URL: http://192.168.1.100:11434
Model Name: llama2
Keep Alive: 30m

Available Models

| Model      | Size       | Use Case          | GPU Memory |
| ---------- | ---------- | ----------------- | ---------- |
| llama2     | 7B         | General chat      | ~8GB       |
| llama2:13b | 13B        | Better reasoning  | ~16GB      |
| llama2:70b | 70B        | Highest quality   | ~64GB      |
| mistral    | 7B         | Fast, capable     | ~8GB       |
| mixtral    | 47B        | MoE, very capable | ~48GB      |
| codellama  | 7B/13B/34B | Code generation   | ~8-40GB    |
| phi        | 2.7B       | Small, efficient  | ~4GB       |
| gemma      | 2B/7B      | Google’s model    | ~4-8GB     |
| llava      | 7B         | Vision + text     | ~8GB       |

Find More Models

Browse the Ollama Library for 100+ models.

Performance Optimization

GPU Acceleration

  • Set numGpu to offload layers
  • Use CUDA on NVIDIA GPUs
  • Use Metal on Apple Silicon
  • Monitor GPU memory usage

Memory Management

  • Reduce numCtx if OOM errors
  • Use smaller model variants
  • Adjust keepAlive to free memory
  • Use quantized models (Q4, Q5)
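As a rough rule of thumb, model weights need about parameters × bits-per-weight / 8 bytes, so quantizing from 16-bit to 4-bit cuts the weight footprint roughly 4x. A back-of-the-envelope sketch (the 20% overhead factor for KV cache and runtime buffers is an assumption, not a measured value):

```python
def approx_weight_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough memory estimate: weights plus a guessed ~20% runtime overhead."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)

# A 7B model: ~16.8 GB at fp16 vs ~4.2 GB at Q4 (estimated, weights-dominated)
print(approx_weight_gb(7, 16), approx_weight_gb(7, 4))
```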

Speed

  • Use smaller models (7B vs 70B)
  • Reduce context window
  • Enable GPU acceleration
  • Set numThread to CPU cores

Quality

  • Use larger models when possible
  • Increase context window
  • Fine-tune temperature
  • Use appropriate model for task

Best Practices

  1. Model Selection
    • Start with 7B models for testing
    • Use code-specific models for programming tasks
    • Consider model size vs. available RAM/VRAM
  2. Resource Management
    • Set appropriate keepAlive duration
    • Monitor system resources
    • Use GPU when available
    • Close unused models
  3. Prompt Engineering
    • Be specific and clear
    • Use system prompts effectively
    • Provide examples for better results
    • Test with different temperatures
  4. Production Deployment
    • Use dedicated GPU server
    • Set up load balancing for multiple instances
    • Monitor performance metrics
    • Consider model quantization for speed

Common Issues

If Flowise can’t connect to Ollama:
  • Verify Ollama is running: ollama list
  • Check the base URL is correct
  • Ensure firewall allows port 11434
  • Try: ollama serve to start manually
Error: “model ‘modelname’ not found”
  • Pull the model first: ollama pull modelname
  • Check exact model name: ollama list
  • Ensure spelling matches exactly
If getting OOM errors:
  • Use smaller model variant (7B instead of 13B)
  • Reduce numCtx parameter
  • Close other applications
  • Use quantized models (Q4_0)
  • Reduce numGpu to use CPU more
To improve speed:
  • Enable GPU acceleration
  • Use smaller models
  • Reduce context window
  • Set numThread to your CPU cores
  • Use quantized models

Ollama Commands Reference

# List downloaded models
ollama list

# Pull a new model
ollama pull llama2

# Run model in terminal
ollama run llama2

# Remove a model
ollama rm llama2

# Show model info
ollama show llama2

# Start Ollama service
ollama serve
