Overview
Ollama lets you run large language models locally on your own hardware, making it ideal for development, testing, privacy-sensitive applications, and offline use. Access Llama, Mistral, Gemma, and many more models without any API costs.
Base URL: your local Ollama server (default: `http://localhost:11434`)
Supported Features
- ✅ Chat Completions
- ✅ Streaming
- ✅ Embeddings
- ✅ Vision (multimodal models)
- ✅ Custom Models
- ✅ Model Library (100+ models)
- ⚠️ Function Calling (limited support)
- ❌ Image Generation
Prerequisites
Install Ollama
- macOS
- Linux
- Windows
- Docker
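A typical install on each platform looks like this (the Homebrew formula and install script are Ollama's published channels; Windows uses the installer from ollama.com):

```shell
# macOS
brew install ollama            # or download the app from https://ollama.com/download

# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Windows: run the installer from https://ollama.com/download
# Docker: see the Docker Container section below
```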
Pull a Model
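Models are downloaded by name; for example, to fetch a model before first use (names come from the tables below):

```shell
ollama pull llama3.1    # download the 8B model (~4.7 GB)
ollama list             # confirm it is installed
```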
Quick Start
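A quick smoke test with curl against Ollama's native chat endpoint on the default local port (assumes `llama3.1` is already pulled):

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'
```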
Chat Completions
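A minimal sketch of a non-streaming chat call against Ollama's native `/api/chat` endpoint, using only the Python standard library (the model name and messages are placeholders):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local server


def build_chat_request(model, messages):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {"model": model, "messages": messages, "stream": False}


def chat(model, messages, url=OLLAMA_URL):
    """Send one chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, messages)).encode("utf-8")
    req = urllib.request.Request(
        f"{url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running and the model pulled, `chat("llama3.1", [{"role": "user", "content": "Hello"}])` returns the model's reply as a string.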
Streaming
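With `"stream": true`, Ollama returns one JSON object per line as tokens are generated. A sketch of consuming that stream with the standard library:

```python
import json
import urllib.request


def parse_chunk(line):
    """Extract the token text from one streamed NDJSON line, or None when done."""
    obj = json.loads(line)
    if obj.get("done"):
        return None
    return obj["message"]["content"]


def stream_chat(model, messages, url="http://localhost:11434"):
    """Yield reply fragments as Ollama streams them."""
    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode("utf-8")
    req = urllib.request.Request(
        f"{url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the response body is newline-delimited JSON
            if not line.strip():
                continue
            chunk = parse_chunk(line)
            if chunk is None:
                break
            yield chunk
```

Against a running server, `for token in stream_chat("llama3.1", msgs): print(token, end="")` prints the reply incrementally.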
Popular Models
Meta Llama
| Model | Size | Memory | Description |
|---|---|---|---|
| llama3.3 | 43GB | 32GB | Latest Llama 3.3 70B |
| llama3.1 | 4.7GB | 8GB | Llama 3.1 8B |
| llama3.1:70b | 40GB | 48GB | Llama 3.1 70B |
| llama3.1:405b | 231GB | 256GB+ | Largest Llama |
| llama2 | 3.8GB | 8GB | Llama 2 7B |
Mistral & Mixtral
| Model | Size | Memory | Description |
|---|---|---|---|
| mistral | 4.1GB | 8GB | Mistral 7B |
| mistral-large | 40GB | 48GB | Mistral Large |
| mixtral | 26GB | 32GB | Mixtral 8x7B MoE |
Google Gemma
| Model | Size | Memory | Description |
|---|---|---|---|
| gemma2 | 5.4GB | 8GB | Gemma 2 9B |
| gemma2:27b | 16GB | 20GB | Gemma 2 27B |
| gemma | 5.0GB | 8GB | Gemma 7B |
Vision Models
| Model | Size | Memory | Description |
|---|---|---|---|
| llava | 4.7GB | 8GB | Llama with vision |
| llava:34b | 20GB | 24GB | Larger vision model |
| bakllava | 4.7GB | 8GB | Alternative vision model |
Specialized Models
| Model | Size | Purpose |
|---|---|---|
| codellama | 3.8GB | Code generation |
| phi3 | 2.3GB | Microsoft’s small model |
| qwen2.5 | 4.7GB | Multilingual |
| deepseek-coder | 3.8GB | Advanced coding |
| nous-hermes2 | 4.1GB | General purpose |
Ollama excels at:
- Privacy - Data never leaves your machine
- Zero cost - No API fees
- Offline use - Works without internet
- Fast iteration - No network latency
- Customization - Create and modify models
Configuration Options
Remote Ollama Server
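By default Ollama binds to localhost only. A sketch of exposing it to other machines (the hostname below is a placeholder; the name of the base-URL setting depends on your client):

```shell
# On the server: listen on all interfaces, not just loopback
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# On the client: point your base URL at the server, e.g.
# http://ollama-server.internal:11434   (placeholder hostname)
```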
Docker Container
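A typical container setup using Ollama's published image:

```shell
# CPU-only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# NVIDIA GPU (requires the NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull a model inside the running container
docker exec -it ollama ollama pull llama3.1
```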
Advanced Features
System Messages
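A system message is just the first entry in the message list; Ollama folds it into the model's prompt template. A small helper sketch:

```python
def with_system(system_prompt, user_prompt):
    """Build a message list whose first entry is a system instruction."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

For example, `with_system("Answer only in French.", "What color is the sky?")` yields a list ready to pass as the `messages` field of a chat request.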
Vision (Multimodal)
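Ollama's native chat API accepts base64-encoded images in an `images` field on a message, which multimodal models such as `llava` can interpret. A sketch of building such a message:

```python
import base64


def vision_message(prompt, image_bytes):
    """Build an /api/chat message carrying a base64-encoded image."""
    return {
        "role": "user",
        "content": prompt,
        # Ollama expects raw base64 strings (no data: URI prefix)
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }
```

Send the result in the `messages` list with a vision model, e.g. `"model": "llava"`.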
Embeddings
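A sketch against Ollama's native `/api/embeddings` endpoint, plus a cosine-similarity helper for comparing the resulting vectors (`nomic-embed-text` is one of Ollama's embedding models and must be pulled first):

```python
import json
import math
import urllib.request


def embed(text, model="nomic-embed-text", url="http://localhost:11434"):
    """Fetch an embedding vector from Ollama's /api/embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

With the server running, `cosine(embed("cat"), embed("kitten"))` gives a rough semantic-similarity score.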
Temperature Control
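Sampling parameters ride along under the `options` key of a request body; the defaults shown mirror Ollama's documented Modelfile defaults. A sketch:

```python
def chat_request(model, messages, temperature=0.8, top_p=0.9):
    """Build an /api/chat body with sampling knobs under "options"."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        # Lower temperature -> more deterministic; higher -> more varied output
        "options": {"temperature": temperature, "top_p": top_p},
    }
```

For deterministic tasks like extraction, something like `chat_request("llama3.1", msgs, temperature=0.2)` is a common choice.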
Model Management
List Models
Pull Models
Remove Models
Run Interactive
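The four tasks above map to one CLI command each:

```shell
ollama list             # list installed models
ollama pull llama3.1    # download a model (re-run to update it)
ollama rm llama2        # remove a model and free its disk space
ollama run llama3.1     # open an interactive chat in the terminal
```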
Custom Models
Create a Custom Model
- Create a `Modelfile`:
- Create the model:
- Use your custom model:
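A minimal `Modelfile` sketch (the base model, parameter value, and system prompt are illustrative):

```
FROM llama3.1
PARAMETER temperature 0.3
SYSTEM You are a concise assistant that answers in bullet points.
```

Then `ollama create mymodel -f Modelfile` builds it, and `ollama run mymodel` (or `mymodel` as the model name in API calls) uses it; `mymodel` is any name you choose.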
Fallback Configuration
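The idea is to try the local Ollama server first and route to a cloud provider only when the local call fails. A minimal plain-Python sketch of that pattern (the `primary`/`fallback` callables stand in for your Ollama and cloud clients; your gateway's own fallback config will look different):

```python
def with_fallback(primary, fallback):
    """Return a caller that tries `primary` first and uses `fallback` on any error."""
    def call(prompt):
        try:
            return primary(prompt)
        except Exception:
            # Local server down, model missing, timeout, etc. -> go to the cloud
            return fallback(prompt)
    return call
```

For example, `route = with_fallback(local_ollama_chat, cloud_chat)` keeps traffic local whenever possible.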
Use local Ollama first, and fall back to the cloud when needed.
Best Practices
- Choose appropriate model size - Match to your hardware
- Use quantized models - Smaller, faster (q4_0, q5_1)
- Monitor memory usage - Leave headroom for system
- Keep models updated - `ollama pull` to update
- Use GPU if available - Much faster inference
- Warm up models - First request may be slow
- Batch similar requests - Amortize startup cost
- Create custom models - Optimize for your use case
Hardware Requirements
Minimum Specs
- CPU: Modern quad-core
- RAM: 8GB (for 7B models)
- Disk: 10GB free space
Recommended
- CPU: 8+ cores
- RAM: 16GB+ (for 13B models)
- GPU: NVIDIA with 8GB+ VRAM (optional but recommended)
- Disk: 50GB+ SSD
For Larger Models
- 70B models: 48GB+ RAM
- 405B models: 256GB+ RAM or multi-GPU setup
Performance Tips
Use GPU
Ollama automatically uses a GPU when one is available (NVIDIA, Apple Silicon).
Quantization Levels
| Suffix | Size | Quality | Speed |
|---|---|---|---|
| q4_0 | Smallest | Good | Fastest |
| q4_1 | Small | Better | Fast |
| q5_0 | Medium | Good | Medium |
| q5_1 | Medium | Better | Medium |
| q8_0 | Large | Best | Slow |
| (none) | Largest | Full precision | Slowest |
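Quantization levels are selected via the model tag. The exact tags available vary per model, so check the model's page in the library; the tags below are illustrative:

```shell
ollama pull llama3.1:8b-instruct-q4_0   # smallest, fastest variant
ollama pull llama3.1:8b-instruct-q8_0   # larger, higher quality
```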
Use Cases
Development & Testing
Privacy-Sensitive Applications
Offline Applications
Cost Optimization
Pricing
Ollama is completely free!
- No API costs
- No rate limits
- No usage tracking
- Run unlimited requests
Troubleshooting
Model Not Found
Out of Memory
Slow Performance
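Common first checks for the three issues above (the `journalctl` line assumes a Linux install that uses the bundled systemd service):

```shell
# Model not found: verify the exact name/tag, then pull it
ollama list
ollama pull llama3.1

# Out of memory: see what is loaded; switch to a smaller or more
# heavily quantized tag from the tables above
ollama ps

# Slow performance: check whether the server detected a GPU
journalctl -u ollama | grep -i gpu
```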
Related Resources
Model Library
Browse 100+ available models
Fallback Routing
Fallback to cloud when needed
Cost Optimization
Optimize AI costs
Privacy
Private AI deployments