## Included Services

### Ollama

Local LLM inference for chat and embeddings.

### Whisper

Speech-to-text transcription.
## Skills Provided

### Ollama Local LLM

Capabilities:

- Chat completion
- Text generation
- Code generation
- Text embeddings for RAG
- JSON-structured output
- Multi-turn conversations
- Streaming responses
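A minimal chat-completion call can be made against Ollama's HTTP API with nothing but the standard library. This sketch assumes the service is reachable on Ollama's default port `11434` and that a model such as `llama3.2` has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; adjust for your setup


def build_chat_request(model: str, messages: list, temperature: float = 0.7) -> dict:
    """Build a non-streaming payload for POST /api/chat."""
    return {
        "model": model,
        "messages": messages,          # [{"role": "user"|"assistant"|"system", "content": ...}]
        "stream": False,               # return one complete JSON object, not chunks
        "options": {"temperature": temperature},
    }


def chat(model: str, messages: list) -> str:
    """Send a chat request and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(model, messages)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


# Usage (requires a running Ollama server):
#   reply = chat("llama3.2", [{"role": "user", "content": "Say hello in one word."}])
```

Appending the reply to `messages` and sending the list again is all that multi-turn conversation requires.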
### Whisper Transcribe

Capabilities:

- Audio transcription
- Multiple language support
- Speaker diarization
- Timestamp generation
- Various audio formats
- Subtitle generation (SRT, VTT)
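As a sketch of how transcription and subtitle generation fit together, the helpers below convert Whisper-style segments (dicts with `start`, `end`, `text`) into SRT; the `transcribe` function uses the `openai-whisper` Python package rather than this pack's service API, so treat it as an illustration of the data shape, not the pack's exact interface:

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 83.5 -> '00:01:23,500'."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper segments as an SRT subtitle file."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{format_srt_timestamp(seg['start'])} --> "
                     f"{format_srt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)


def transcribe(path: str, model_name: str = "base"):
    """Transcribe an audio file; result contains 'text', 'segments', 'language'."""
    import whisper  # openai-whisper, imported lazily so the helpers above work without it
    model = whisper.load_model(model_name)
    return model.transcribe(path)


# Usage (requires openai-whisper installed):
#   result = transcribe("talk.mp3")
#   srt = segments_to_srt(result["segments"])
```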
## Use Cases

### RAG (Retrieval-Augmented Generation)

Build a complete local RAG system.
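A minimal local RAG loop needs only the Ollama embeddings endpoint and a similarity search. This sketch assumes the default port and the `nomic-embed-text` model from the table below; the retrieval and prompt-building helpers are plain Python:

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port


def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Embed a single text via POST /api/embeddings."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec, docs, k: int = 3):
    """docs: list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_rag_prompt(question: str, contexts) -> str:
    """Stuff retrieved chunks into a grounded prompt."""
    ctx = "\n\n".join(contexts)
    return f"Answer using only this context:\n\n{ctx}\n\nQuestion: {question}"


# Usage (requires a running Ollama server):
#   vecs = [(d, embed(d)) for d in documents]
#   contexts = top_k(embed(question), vecs)
#   prompt = build_rag_prompt(question, contexts)  # send to /api/generate
```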
### Video Transcription Pipeline

Combine with the Video Creator pack.
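One plausible shape for the pipeline: extract a 16 kHz mono WAV track with `ffmpeg`, then feed it to Whisper. The `ffmpeg` flags here are standard; the `transcribe_video` wrapper uses the `openai-whisper` package as a stand-in for the pack's Whisper service:

```python
import subprocess


def extract_audio_cmd(video_path: str, wav_path: str) -> list:
    """ffmpeg command to pull 16 kHz mono WAV audio out of a video file."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",            # drop the video stream
            "-ac", "1",       # mono
            "-ar", "16000",   # 16 kHz, Whisper's preferred sample rate
            wav_path]


def transcribe_video(video_path: str, wav_path: str = "audio.wav") -> str:
    """Extract audio, then transcribe it; returns the full transcript text."""
    subprocess.run(extract_audio_cmd(video_path, wav_path), check=True)
    import whisper  # openai-whisper, imported lazily
    return whisper.load_model("base").transcribe(wav_path)["text"]


# Usage (requires ffmpeg and openai-whisper):
#   text = transcribe_video("clip.mp4")
```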
### Code Assistant

Local code generation and review.
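Code review can be driven through Ollama's `/api/generate` endpoint with a code-tuned model. This sketch assumes the default port and the `codellama` model from the table below:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port


def build_review_prompt(code: str) -> str:
    """Wrap a code snippet in a review instruction."""
    return ("Review the following code for bugs and style issues, "
            "then suggest improvements:\n\n" + code)


def generate(prompt: str, model: str = "codellama") -> str:
    """Non-streaming completion via POST /api/generate."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


# Usage (requires a running Ollama server with codellama pulled):
#   review = generate(build_review_prompt("def f(x): return x+1"))
```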
### Chatbot with Memory

Build a stateful chatbot.

## Recommended Models
### General Purpose

| Model | Size | Use Case | Memory |
|---|---|---|---|
| llama3.2 | 3B | Fast chat and reasoning | 4 GB |
| llama3.2:70b | 70B | Complex reasoning | 40 GB |
| mistral | 7B | Balanced performance | 5 GB |
| phi3 | 3.8B | Efficient reasoning | 4 GB |
### Code Generation

| Model | Size | Use Case | Memory |
|---|---|---|---|
| codellama | 7B | Code generation | 5 GB |
| codellama:13b | 13B | Advanced code tasks | 8 GB |
| deepseek-coder | 6.7B | Multi-language coding | 5 GB |
### Embeddings

| Model | Size | Dimensions | Memory |
|---|---|---|---|
| nomic-embed-text | 137M | 768 | 1 GB |
| mxbai-embed-large | 335M | 1024 | 2 GB |
| all-minilm | 23M | 384 | 512 MB |
## Managing Models
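Models can be listed and pulled programmatically through Ollama's management endpoints (`GET /api/tags`, `POST /api/pull`); the default port is assumed here:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port


def list_models() -> list:
    """Names of models already downloaded (GET /api/tags)."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
        return [m["name"] for m in json.load(resp)["models"]]


def pull_model(name: str) -> None:
    """Download a model (POST /api/pull); blocks until the pull finishes."""
    payload = json.dumps({"name": name, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/pull", data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()


# Usage (requires a running Ollama server):
#   if "llama3.2:latest" not in list_models():
#       pull_model("llama3.2")
```

The same operations are available from inside the container via the CLI (`ollama list`, `ollama pull`, `ollama rm`).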
## Configuration

### Environment Variables
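The pack's exact variable names aren't reproduced here; the Ollama entries below are standard server variables, while the Whisper entry is a hypothetical placeholder to check against the pack's own compose file:

```yaml
services:
  ollama:
    environment:
      - OLLAMA_HOST=0.0.0.0:11434   # bind address and port
      - OLLAMA_KEEP_ALIVE=5m        # how long a model stays loaded after use
      - OLLAMA_NUM_PARALLEL=2       # concurrent requests per loaded model
  whisper:
    environment:
      - WHISPER_MODEL=base          # hypothetical: name may differ in this pack
```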
### Volume Mounts
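A typical mount for this setup persists Ollama's model directory (`/root/.ollama` in the official image) in a named volume; the service name is an assumption:

```yaml
services:
  ollama:
    volumes:
      - ollama_models:/root/.ollama   # downloaded models live here

volumes:
  ollama_models:
```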
Models persist across restarts.

## Memory Requirements
### Ollama

Memory depends on model size:

- Small models (3B-7B): 4-6 GB
- Medium models (13B-30B): 10-20 GB
- Large models (70B+): 40+ GB
### Whisper

Memory depends on model variant:

- tiny: ~1 GB
- base: ~1 GB
- small: ~2 GB
- medium: ~5 GB
- large: ~10 GB
## Performance Tips

### Ollama

- Use GPU if available: `docker run --gpus all`
- Set `num_gpu` (GPU layers to offload) in the model config
- Lower `temperature` for more consistent output
- Set `seed` for reproducible results
- Set `stream: false` to receive full responses
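The sampling tips above are set per request in the `options` field. A `/api/generate` body combining them might look like this (model and prompt are examples):

```python
# A /api/generate request body pinning sampling options for repeatable output:
request_body = {
    "model": "llama3.2",
    "prompt": "Summarize retrieval-augmented generation in one sentence.",
    "stream": False,          # one complete JSON response instead of chunks
    "options": {
        "temperature": 0.2,   # lower temperature -> more consistent output
        "seed": 42,           # fixed seed -> reproducible sampling
    },
}
```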
### Whisper

- Use the `base` or `small` model for real-time transcription
- Convert audio to 16 kHz mono WAV for best performance
- Use the `tiny` model for quick drafts and `medium` for accuracy
- Enable GPU acceleration for large models
### Embedding Generation

Batch embeddings for efficiency.
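A batching sketch, assuming the newer `/api/embed` endpoint (which accepts a list of inputs) is available in your Ollama version; the batch size is an example:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port


def chunked(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def embed_batch(texts, model="nomic-embed-text", batch_size=32):
    """Embed many texts in batches via POST /api/embed."""
    vectors = []
    for batch in chunked(texts, batch_size):
        payload = json.dumps({"model": model, "input": batch}).encode()
        req = urllib.request.Request(
            f"{OLLAMA_URL}/api/embed", data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            vectors.extend(json.load(resp)["embeddings"])
    return vectors


# Usage (requires a running Ollama server):
#   vecs = embed_batch(["first chunk", "second chunk"])
```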
## GPU Acceleration

### NVIDIA GPU
Enable GPU support in docker-compose.
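With the NVIDIA Container Toolkit installed, the standard compose syntax for reserving GPUs looks like this (the service name is an assumption):

```yaml
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # or an integer to reserve specific GPUs
              capabilities: [gpu]
```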
Knowledge Base Pack
Build RAG systems with vector search
Video Creator Pack
Add video transcription workflows