Local AI Models

Run AI models locally for privacy, cost savings, and offline operation. Includes LLM inference, image generation, and speech-to-text.

Available Services

Ollama

Port: 11434 | Memory: 2048 MB | Maturity: Stable

Run large language models locally with an easy-to-use API. Supports Llama, Mistral, Gemma, and many more open-source models.

Features:
  • 100+ open-source models
  • Simple REST API
  • Model management CLI
  • Streaming responses
  • OpenAI-compatible API
  • CPU and GPU support
Supported Models:
  • Llama 3.3, Llama 3.2, Llama 3.1
  • Mistral, Mixtral
  • Gemma 2, CodeGemma
  • Phi-3, Qwen 2.5
  • DeepSeek-Coder
OpenClaw Integration:
  • Skill: ollama-local-llm
  • Environment: OLLAMA_HOST, OLLAMA_PORT
Documentation
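
The Ollama API can be called directly from the skill or from your own code. A minimal sketch using only the Python standard library, assuming the default host and port and an illustrative model name (`llama3.2`):

```python
import json
import urllib.request

# Default Ollama endpoint; adjust host/port to your deployment
# (see OLLAMA_HOST / OLLAMA_PORT above).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt, stream=False):
    """Build the JSON body Ollama's /api/generate endpoint expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(model, prompt):
    """Send a non-streaming generate request and return the completion text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With "stream": false, the server returns a single JSON object
        # whose "response" field holds the full completion.
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server with the model pulled):
#   print(generate("llama3.2", "Why is the sky blue?"))
```

Setting `"stream": true` instead yields newline-delimited JSON chunks, which is what the streaming-responses feature above refers to.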

ComfyUI

Port: 8188 | Memory: 4096 MB | Maturity: Experimental

Node-based visual workflow editor for Stable Diffusion and other generative AI models. Design complex image/video generation pipelines.

Features:
  • Node-based workflow editor
  • Stable Diffusion support
  • ControlNet, LoRA, VAE support
  • Custom nodes ecosystem
  • REST API
  • Batch processing
Requirements:
  • NVIDIA GPU with CUDA
  • nvidia-docker2 installed
  • Minimum 4 GB VRAM (8 GB+ recommended)
OpenClaw Integration:
  • Skill: comfyui-generate
  • Environment: COMFYUI_HOST, COMFYUI_PORT
⚠️ GPU Required

Documentation
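
Workflows built in the editor can be exported in API format and queued over the REST API. A hedged sketch against ComfyUI's `/prompt` endpoint; the URL and the `workflow_api.json` filename are assumptions for illustration:

```python
import json
import urllib.request

# Default ComfyUI endpoint; adjust host/port to your deployment
# (see COMFYUI_HOST / COMFYUI_PORT above).
COMFYUI_URL = "http://localhost:8188/prompt"

def build_prompt_body(workflow):
    """ComfyUI's /prompt endpoint expects {"prompt": <node graph>}."""
    return json.dumps({"prompt": workflow}).encode()

def queue_workflow(path):
    """Load a workflow exported in API format and queue it for execution."""
    with open(path) as f:
        workflow = json.load(f)
    req = urllib.request.Request(
        COMFYUI_URL,
        data=build_prompt_body(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response includes a prompt_id you can use to poll for results.
        return json.loads(resp.read())

# Usage (requires a running ComfyUI instance):
#   print(queue_workflow("workflow_api.json"))
```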

Stable Diffusion WebUI

Port: 7860 | Memory: 4096 MB | Maturity: Experimental

Local AI image generation with a web interface. Generate images from text prompts using Stable Diffusion.

Features:
  • Text-to-image generation
  • Image-to-image transformation
  • Inpainting and outpainting
  • Model management
  • Extensions support
  • Batch processing
Requirements:
  • NVIDIA GPU with CUDA
  • nvidia-docker2 installed
  • Minimum 4 GB VRAM
⚠️ GPU Required

Documentation
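
The WebUI also exposes a REST API when launched with the `--api` flag. A sketch of a text-to-image call; the URL and generation parameters here are illustrative assumptions:

```python
import base64
import json
import urllib.request

# AUTOMATIC1111-style txt2img endpoint; only available when the
# WebUI is started with --api. Adjust host/port to your deployment.
WEBUI_URL = "http://localhost:7860/sdapi/v1/txt2img"

def build_txt2img_body(prompt, steps=20, width=512, height=512):
    """Build the JSON body for a txt2img request."""
    return json.dumps({
        "prompt": prompt,
        "steps": steps,
        "width": width,
        "height": height,
    }).encode()

def txt2img(prompt, out_path="out.png"):
    """Generate one image and write it to disk."""
    req = urllib.request.Request(
        WEBUI_URL,
        data=build_txt2img_body(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    # Images come back base64-encoded in the "images" list.
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result["images"][0]))

# Usage (requires a running WebUI with --api enabled):
#   txt2img("a watercolor fox in a forest")
```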

Faster Whisper Server

Port: 8001 | Memory: 1024 MB | Maturity: Beta

Self-hosted speech-to-text transcription service using the Faster Whisper engine for high-performance audio transcription.

Features:
  • OpenAI Whisper models
  • Fast inference (CTranslate2)
  • Multiple languages
  • OpenAI-compatible API
  • Timestamp support
  • CPU and GPU support
Supported Models:
  • tiny, base, small, medium, large
  • Multilingual and English-only variants
OpenClaw Integration:
  • Skill: whisper-transcribe
  • Environment: WHISPER_HOST, WHISPER_PORT
Documentation
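
Because the API is OpenAI-compatible, transcription is a multipart POST of an audio file. A stdlib-only sketch; the host, port, model name, and filename are assumptions to adapt to your deployment:

```python
import json
import urllib.request
import uuid

# OpenAI-compatible transcription endpoint; adjust host/port to your
# deployment (see WHISPER_HOST / WHISPER_PORT above).
WHISPER_URL = "http://localhost:8001/v1/audio/transcriptions"

def build_multipart(audio_bytes, filename, model="small"):
    """Build a multipart/form-data body with a model field and a file part."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path):
    """Upload an audio file and return the transcribed text."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(f.read(), path)
    req = urllib.request.Request(
        WHISPER_URL, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# Usage (requires a running Whisper server):
#   print(transcribe("meeting.wav"))
```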

Usage Examples

Local LLM Stack

npx create-better-openclaw --services ollama,open-webui --yes

Image Generation Stack (GPU Required)

npx create-better-openclaw --services comfyui,stable-diffusion --yes

Complete Local AI Stack

npx create-better-openclaw --preset local-ai --yes

Audio Transcription Stack

npx create-better-openclaw --services whisper,redis --yes

Model Management

Ollama Models

Pull models into Ollama:
# Access Ollama container
docker exec -it ollama bash

# Pull a model
ollama pull llama3.3
ollama pull mistral
ollama pull codellama
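
To confirm which models are installed without entering the container, query Ollama's `/api/tags` endpoint. A small sketch assuming the default host and port:

```python
import json
import urllib.request

# Default Ollama endpoint; adjust host/port to your deployment.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"

def parse_model_names(raw):
    """Extract model names from the JSON that /api/tags returns."""
    return [m["name"] for m in json.loads(raw)["models"]]

# Usage (requires a running Ollama server):
#   with urllib.request.urlopen(OLLAMA_TAGS_URL) as resp:
#       print(parse_model_names(resp.read()))
```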

ComfyUI Models

Download Stable Diffusion checkpoints to the comfyui-models volume:
# Models go in: /opt/ComfyUI/models/checkpoints/
# LoRAs go in: /opt/ComfyUI/models/loras/
# VAEs go in: /opt/ComfyUI/models/vae/

Hardware Requirements

CPU-Only (LLMs)

Model Size | RAM Required | Performance
7B params | 8 GB | Good
13B params | 16 GB | Moderate
34B params | 32 GB | Slow
70B params | 64 GB+ | Very Slow

GPU (Image Generation)

VRAM | Supported Models | Performance
4 GB | SD 1.5 | Slow
6 GB | SD 1.5, SDXL (low res) | Moderate
8 GB | SDXL | Good
12 GB+ | SDXL, SD3 | Excellent

Performance Tips

Ollama Optimization

  1. Model Selection: Start with smaller models (7B) for faster inference
  2. Context Length: Reduce context window for speed
  3. Quantization: Use Q4 or Q5 quantized models
  4. GPU Acceleration: Add NVIDIA GPU support with nvidia-docker2

Image Generation Optimization

  1. GPU Required: CPU inference is extremely slow (minutes per image)
  2. VRAM Management: Close other GPU applications
  3. Batch Size: Reduce for lower VRAM usage
  4. Resolution: Start with 512x512, then scale up

Privacy & Offline Operation

Local models provide:
  • Data Privacy: All processing happens on your infrastructure
  • Offline Operation: No internet required after model download
  • Cost Savings: No API costs for inference
  • Customization: Fine-tune models for specific tasks
  • Low Latency: No network round trips

Integration Patterns

Local LLM + RAG

npx create-better-openclaw \
  --services ollama,open-webui,qdrant,redis \
  --yes

Multi-Modal Local AI

npx create-better-openclaw \
  --services ollama,comfyui,whisper \
  --yes

Local AI Development

npx create-better-openclaw \
  --services ollama,opencode,redis,postgresql \
  --yes
