The Local AI skill pack enables completely local AI inference without external API dependencies. Run language models and speech-to-text on your own hardware.

Included Services

Ollama

Local LLM inference for chat and embeddings

Whisper

Speech-to-text transcription

Skills Provided

Ollama Local LLM

Capabilities:
  • Chat completion
  • Text generation
  • Code generation
  • Text embeddings for RAG
  • JSON-structured output
  • Multi-turn conversations
  • Streaming responses
Example Usage:
# Chat completion
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "stream": false
  }'

# Text generation
curl -X POST "http://ollama:11434/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama",
    "prompt": "Write a Python function to calculate fibonacci numbers",
    "stream": false
  }'

# Generate embeddings
curl -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Text to embed", "Another text"]
  }'

# JSON output mode
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "List 3 programming languages"}
    ],
    "format": "json",
    "stream": false
  }'
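
The capabilities list also mentions streaming responses. A minimal sketch of consuming the stream (assumes jq is available; with "stream": true the API returns one JSON object per line):

```shell
PAYLOAD='{"model": "llama3.2", "messages": [{"role": "user", "content": "Tell me a joke"}], "stream": true}'

# -N disables curl buffering; jq -rj prints each token fragment without adding newlines
curl -sN -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | jq -rj '.message.content // empty'
echo
```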

Whisper Transcribe

Capabilities:
  • Audio transcription
  • Multiple language support
  • Speaker diarization
  • Timestamp generation
  • Various audio formats
  • Subtitle generation (SRT, VTT)
Example Usage:
# Transcribe audio file
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=json" \
  -F "audio_file=@/data/audio/recording.mp3"

# Response:
{
  "text": "Hello, this is a test recording.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test recording."
    }
  ]
}

# Transcribe with timestamps (SRT format)
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=srt" \
  -F "audio_file=@/data/audio/recording.mp3"

# Translate to English
curl -X POST "http://whisper:9000/asr?task=translate&output=json" \
  -F "audio_file=@/data/audio/spanish.mp3"
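
The capabilities list mentions VTT alongside SRT; assuming the service follows the same pattern as the SRT example, only the output parameter changes:

```shell
# WebVTT subtitles from the same /asr endpoint (output=vtt instead of output=srt)
ASR_URL="http://whisper:9000/asr?task=transcribe&language=en&output=vtt"
curl -X POST "$ASR_URL" \
  -F "audio_file=@/data/audio/recording.mp3"
```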

Use Cases

RAG (Retrieval-Augmented Generation)

Build a complete local RAG system:
# 1. Generate embeddings for documents
curl -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["Document content here..."]
  }'

# 2. Store in Qdrant (from Knowledge Base pack)
curl -X PUT "http://qdrant:6333/collections/docs/points" \
  -H "Content-Type: application/json" \
  -d '{
    "points": [{
      "id": 1,
      "vector": [...embedding...],
      "payload": {"text": "Document content"}
    }]
  }'

# 3. Query: Generate query embedding
# 3. Query: Generate query embedding (jq -c keeps the array on one line)
QUERY_EMBEDDING=$(curl -s -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": ["What is quantum computing?"]
  }' | jq -c '.embeddings[0]')

# 4. Search Qdrant
RESULTS=$(curl -s -X POST "http://qdrant:6333/collections/docs/points/search" \
  -H "Content-Type: application/json" \
  -d "{
    \"vector\": $QUERY_EMBEDDING,
    \"limit\": 5
  }")

# 5. Generate answer with context (jq builds valid JSON; $RESULTS itself contains quotes)
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg results "$RESULTS" '{
    model: "llama3.2",
    messages: [
      {role: "system", content: ("Answer based on this context: " + $results)},
      {role: "user", content: "What is quantum computing?"}
    ],
    stream: false
  }')"

Video Transcription Pipeline

Combine with Video Creator pack:
# 1. Extract audio from video (FFmpeg)
ffmpeg -i /data/videos/lecture.mp4 \
  -vn -ar 16000 -ac 1 \
  /data/audio/lecture.wav

# 2. Transcribe with Whisper
curl -X POST "http://whisper:9000/asr?task=transcribe&language=en&output=srt" \
  -F "audio_file=@/data/audio/lecture.wav" \
  -o /data/subtitles/lecture.srt

# 3. Burn subtitles into video
ffmpeg -i /data/videos/lecture.mp4 \
  -vf "subtitles=/data/subtitles/lecture.srt" \
  /data/output/lecture_subtitled.mp4

# 4. Generate summary with Ollama (jq --rawfile escapes the quotes and newlines in the transcript)
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --rawfile transcript /data/subtitles/lecture.srt '{
    model: "llama3.2",
    messages: [
      {role: "system", content: "Summarize this lecture transcript."},
      {role: "user", content: $transcript}
    ],
    stream: false
  }')"

Code Assistant

Local code generation and review:
# Generate code
curl -X POST "http://ollama:11434/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama",
    "prompt": "Write a REST API endpoint in Python using FastAPI for user registration",
    "stream": false
  }'

# Code review (jq --rawfile safely embeds the file contents in the JSON payload)
curl -X POST "http://ollama:11434/api/chat" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --rawfile code app.py '{
    model: "codellama",
    messages: [
      {role: "system", content: "You are a code reviewer. Find bugs and suggest improvements."},
      {role: "user", content: ("Review this code: " + $code)}
    ],
    stream: false
  }')"

Chatbot with Memory

Build a stateful chatbot:
// Store conversation in Redis (from DevOps pack)
const conversationKey = `chat:${userId}:history`;

// Add user message
await redis.rpush(conversationKey, JSON.stringify({
  role: 'user',
  content: userMessage
}));

// Get conversation history
const history = await redis.lrange(conversationKey, -10, -1);
const messages = history.map((entry) => JSON.parse(entry));

// Generate response
const response = await fetch('http://ollama:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: messages,
    stream: false
  })
});

// Store assistant response
const answer = await response.json();
await redis.rpush(conversationKey, JSON.stringify({
  role: 'assistant',
  content: answer.message.content
}));

// Set expiry (24 hours)
await redis.expire(conversationKey, 86400);

Recommended Models

General Purpose

Model          Size   Use Case                  Memory
llama3.2       3B     Fast chat and reasoning   4 GB
llama3.2:70b   70B    Complex reasoning         40 GB
mistral        7B     Balanced performance      5 GB
phi3           3.8B   Efficient reasoning       4 GB

Code Generation

Model            Size   Use Case                Memory
codellama        7B     Code generation         5 GB
codellama:13b    13B    Advanced code tasks     8 GB
deepseek-coder   6.7B   Multi-language coding   5 GB

Embeddings

Model               Size   Dimensions   Memory
nomic-embed-text    137M   768          1 GB
mxbai-embed-large   335M   1024         2 GB
all-minilm          23M    384          512 MB
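
The Dimensions column matters when pairing a model with vector storage: the embedding size must match the collection's configured vector size. A sketch creating the "docs" collection from the RAG example above, sized for nomic-embed-text (the Qdrant service comes from the Knowledge Base pack):

```shell
# 768 must equal the embedding model's output dimension
COLLECTION_CONFIG='{"vectors": {"size": 768, "distance": "Cosine"}}'
curl -X PUT "http://qdrant:6333/collections/docs" \
  -H "Content-Type: application/json" \
  -d "$COLLECTION_CONFIG"
```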

Managing Models

# List installed models
curl "http://ollama:11434/api/tags"

# Pull a new model
curl -X POST "http://ollama:11434/api/pull" \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'

# Delete a model
curl -X DELETE "http://ollama:11434/api/delete" \
  -H "Content-Type: application/json" \
  -d '{"name": "old-model"}'

# Show model info
curl -X POST "http://ollama:11434/api/show" \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'
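
When scripting, it can help to pull a model only if it is not already installed. A sketch combining the list and pull endpoints above (assumes jq is available):

```shell
MODEL="llama3.2"
# /api/tags lists installed models; pull only when the model is missing
if ! curl -s "http://ollama:11434/api/tags" \
    | jq -e --arg m "$MODEL" '.models[]? | select(.name | startswith($m))' > /dev/null; then
  curl -X POST "http://ollama:11434/api/pull" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"$MODEL\"}"
fi
```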

Configuration

Environment Variables

# Ollama
OLLAMA_HOST=ollama
OLLAMA_PORT=11434
OLLAMA_MODELS=/data/ollama/models  # Model storage

# Whisper
WHISPER_HOST=whisper
WHISPER_PORT=9000
WHISPER_MODEL=base  # tiny, base, small, medium, large

Volume Mounts

Models persist across restarts:
services:
  ollama:
    volumes:
      - ollama_models:/root/.ollama
  
  whisper:
    volumes:
      - whisper_models:/root/.cache/whisper

volumes:
  ollama_models:
  whisper_models:

Memory Requirements

Ollama

Memory depends on model size:
  • Small models (3B-7B): 4-6 GB
  • Medium models (13B-30B): 10-20 GB
  • Large models (70B+): 40+ GB
GPU acceleration recommended for larger models.

Whisper

Memory depends on model variant:
  • tiny: ~1 GB
  • base: ~1 GB
  • small: ~2 GB
  • medium: ~5 GB
  • large: ~10 GB
Total Pack: ~4-8 GB minimum (with small models)

Performance Tips

Ollama

  • Use GPU if available: docker run --gpus all
  • Set the num_gpu option to control how many layers are offloaded to the GPU
  • Lower temperature for more consistent output
  • Set seed for reproducible results
  • Set "stream": false to receive the full response in a single message
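
The temperature and seed tips can be applied per request through the options field; a sketch:

```shell
# temperature 0 for consistent output; a fixed seed makes results reproducible
PAYLOAD='{
  "model": "llama3.2",
  "prompt": "Name one prime number.",
  "options": {"temperature": 0, "seed": 42},
  "stream": false
}'
curl -s -X POST "http://ollama:11434/api/generate" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```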

Whisper

  • Use base or small model for real-time
  • Convert audio to 16kHz mono WAV for best performance
  • Use tiny model for quick drafts, medium for accuracy
  • Enable GPU acceleration for large models

Embedding Generation

Batch embeddings for efficiency:
# Single request with multiple inputs
curl -X POST "http://ollama:11434/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": [
      "Document 1 text",
      "Document 2 text",
      "Document 3 text"
    ]
  }'

GPU Acceleration

NVIDIA GPU

Enable GPU support in docker-compose:
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Verify GPU is detected:
docker exec ollama nvidia-smi

Next Steps

Knowledge Base Pack

Build RAG systems with vector search

Video Creator Pack

Add video transcription workflows
