Overview
Ollama lets you run large language models locally. LiteLLM provides seamless integration with Ollama, supporting chat, embeddings, function calling, and reasoning models.
Quick Start
Popular Models
- Llama
- Mistral
- Phi
- Code Models
Meta’s Llama models.
Configuration
- Default Localhost
- Custom Host
- Environment Variable
Streaming
Function Calling
Ollama 0.4+ supports native function calling.
Reasoning Models
Use reasoning capabilities with compatible models.
- GPT-OSS (DeepSeek)
- Other Models
JSON Mode
- JSON Object
- JSON Schema
Vision Models
Use vision-capable models with images.
Embeddings
Advanced Configuration
Supported Parameters
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Randomness (0-1) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Maps to repeat_penalty |
| stop | list | Stop sequences |
| seed | int | Reproducibility |
| num_ctx | int | Context window size |
| num_predict | int | Max tokens to generate |
| repeat_penalty | float | Penalize repetition |
| top_k | int | Top-k sampling |
| mirostat | int | Mirostat mode (0/1/2) |
| keep_alive | str | Keep model loaded duration |
Error Handling
LiteLLM Proxy
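Ollama models can also be served through the LiteLLM Proxy. A sketch of a `config.yaml` (model alias and host are examples):

```yaml
model_list:
  - model_name: local-llama        # alias clients will request
    litellm_params:
      model: ollama/llama3.1       # underlying Ollama model
      api_base: http://localhost:11434
```

Start the proxy with `litellm --config config.yaml`, then send OpenAI-format requests to the proxy's endpoint using `local-llama` as the model name.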
Best Practices
Model Management
- Pull models before use: `ollama pull model-name`
- Use `keep_alive` to keep frequently-used models loaded
- Monitor system resources (RAM, GPU memory)
Performance
- Use GPU acceleration when available
- Adjust `num_ctx` based on your needs
- Use smaller models (7B/8B) for speed, larger models (70B+) for quality
Function Calling
- Requires Ollama 0.4+
- Not all models support function calling equally
- Test with your specific model before production
Troubleshooting
Connection Errors
If requests fail to connect, make sure the Ollama server is running (`ollama serve`) and that the configured host and port are reachable.
Model Not Found
If a model is reported as not found, pull it first with `ollama pull model-name`.
Out of Memory
- Use smaller models or quantized versions
- Reduce `num_ctx` to lower memory usage
- Close other applications