## Why Local LLMs?
- **Privacy**: News headlines never sent to third-party APIs
- **Cost**: Zero API fees, unlimited usage
- **Speed**: No network latency for inference
- **Offline**: Works without an internet connection (after model download)
- **Control**: Choose your own models and parameters
## Ollama Setup
### 1. Install Ollama
Download the installer from https://ollama.com/download (on Linux: `curl -fsSL https://ollama.com/install.sh | sh`).

### 2. Download a Model
Recommended models for summarization are listed in the Model Selection Guide below. Model size is the approximate disk and RAM usage; 8GB+ RAM is recommended for 7-8B models.
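The recommended models can be pulled with the Ollama CLI; the tags below match the Model Selection Guide table:

```shell
# Pull models from the Ollama library (first run downloads several GB)
ollama pull llama3.1:8b   # recommended default
ollama pull mistral       # lighter alternative for low-RAM systems
```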
### 3. Start Ollama Server
Ollama runs as a background service after installation. Verify it's running: `curl http://localhost:11434` should respond with "Ollama is running".

### 4. Configure World Monitor
**Desktop App**

- Open Settings (Cmd+, or Ctrl+,)
- Navigate to the AI & Summarization tab
- Enter the Ollama URL: `http://localhost:11434`
- Select a model from the dropdown (auto-discovered)
- Click Save & Verify

**Web / Self-Hosted:** set `OLLAMA_API_URL` in your environment instead of using the Settings UI.
Clicking **Save & Verify**:

- Discovers available models
- Filters out embedding-only models
- Validates the endpoint
- Sets the model as the primary provider
## LM Studio Setup
### 1. Install LM Studio
Download from https://lmstudio.ai/ (available for macOS, Windows, and Linux).

### 2. Download a Model
- Open LM Studio
- Navigate to Discover tab
- Search for models:
  - `llama-3.1-8b-instruct` (recommended)
  - `mistral-7b-instruct`
  - `qwen2.5-7b-instruct`
- Click Download
### 3. Start Local Server
- Navigate to Local Server tab (icon in left sidebar)
- Select your downloaded model
- Click Start Server
- The server starts on `http://localhost:1234` by default
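Once the server is started, you can confirm it is reachable from a terminal (default port assumed):

```shell
# List the models exposed by LM Studio's OpenAI-compatible API
curl http://localhost:1234/v1/models
```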
### 4. Configure World Monitor
**Desktop App**

- Open Settings (Cmd+, or Ctrl+,)
- Navigate to the AI & Summarization tab
- Enter the LM Studio URL: `http://localhost:1234`
- Select a model from the dropdown (auto-discovered via `/v1/models`)
- Click Save & Verify

**Web / Self-Hosted:** point the dashboard at the server URL via environment configuration instead.
LM Studio uses the OpenAI-compatible `/v1/chat/completions` endpoint, same as Ollama. The dashboard auto-detects the server type.

## Model Selection Guide
| Model | Size | RAM Required | Speed | Quality | Best For |
|---|---|---|---|---|---|
| `llama3.1:8b` | 4.7GB | 8GB+ | Fast | Excellent | Recommended |
| `mistral` | 4.1GB | 6GB+ | Very Fast | Good | Low-resource systems |
| `qwen2.5:7b` | 5.5GB | 8GB+ | Medium | Excellent | High-quality summaries |
| `gemma2:2b` | 1.6GB | 4GB+ | Very Fast | Fair | Ultra-lightweight |
| `gemma2:9b` | 5.4GB | 10GB+ | Slow | Excellent | Maximum quality |
## Advanced Configuration
### Custom Ollama Port
If Ollama is running on a different port, point `OLLAMA_API_URL` at it (e.g. `http://localhost:11500`).

### Remote Ollama Server
Run Ollama on a different machine: start it there with `OLLAMA_HOST=0.0.0.0 ollama serve` so it listens on the network, then set `OLLAMA_API_URL` to that machine's address.

### Custom Token Limit
The maximum token count for summaries can be overridden in your configuration.

### Model Parameters
Ollama models use default parameters optimized for summarization:

- Temperature: 0.3 (factual, low creativity)
- Max Tokens: 300 (concise summaries)
- Stop Sequences: None
These defaults are defined in `server/worldmonitor/news/v1/_shared.ts:166`.
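As an illustration (not the dashboard's actual code), an equivalent Ollama request with these parameters would look like the following; `num_predict` is Ollama's name for the max-token option:

```shell
# Sketch of a summarization request using the defaults above
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{ "role": "user", "content": "Summarize this headline: ..." }],
  "options": { "temperature": 0.3, "num_predict": 300 },
  "stream": false
}'
```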
## Desktop Settings
The desktop app provides a visual model selector:

- Open Settings (Cmd+, or Ctrl+,)
- Navigate to AI & Summarization
- Enter Ollama/LM Studio URL
- Click outside the input field
- Model dropdown populates automatically
- Select your preferred model
- Click Save & Verify
Model discovery works as follows:

- Tries the Ollama native endpoint: `GET /api/tags`
- Falls back to the OpenAI-compatible endpoint: `GET /v1/models`
- Filters out embedding models (name contains `embed`)
- Populates the dropdown with valid models
- If discovery fails, shows a manual text input
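The same discovery can be reproduced manually from a terminal (default ports assumed):

```shell
# Ollama-native model listing (tried first)
curl http://localhost:11434/api/tags
# OpenAI-compatible fallback (LM Studio and other servers)
curl http://localhost:1234/v1/models
```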
Secrets are stored in the OS keychain:

- macOS: Keychain Access (`secrets-vault` entry)
- Windows: Credential Manager
- Linux: Secret Service API

Changes are propagated via a `localStorage` event; the main dashboard hot-reloads secrets without a restart.
## Fallback Chain
AI summarization uses a 4-tier fallback chain. Tier 1 (local) is always attempted first when `OLLAMA_API_URL` is configured, even if cloud keys are present.

## Performance Tuning
### GPU Acceleration
Ollama automatically uses the GPU if available:

- NVIDIA: CUDA (automatic)
- Apple Silicon: Metal (automatic)
- AMD: ROCm (requires manual setup)
### RAM Optimization
If you see OOM errors, pull a smaller quantization of the model (quantized tags such as `q4` variants are listed on each model's page in the Ollama library).

### Concurrent Requests
Ollama handles one request at a time by default. For higher concurrency, start the server with `OLLAMA_NUM_PARALLEL` set (e.g. `OLLAMA_NUM_PARALLEL=4 ollama serve`).

## Troubleshooting
### "Ollama endpoint unreachable"
- Verify Ollama is running
- Check firewall settings
- Ensure the correct port in `OLLAMA_API_URL`
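The first check can be run from a terminal (default port assumed):

```shell
# The root endpoint replies "Ollama is running" when the server is up
curl http://localhost:11434
# If this also fails, the CLI cannot reach the server either
ollama list
```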
### "No models available"
- Download at least one model
- Verify the models are listed
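For example:

```shell
ollama pull llama3.1:8b   # download at least one model
ollama list               # verify it appears in the listing
```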
### "Model not found"
The model name in your config doesn't match what Ollama reports; compare it against the exact names shown by `ollama list`.

### Slow inference
- Check GPU utilization (`ollama ps` shows whether a loaded model is running on CPU or GPU)
- Use a smaller model (`mistral` vs `llama3.1:8b`)
- Enable GPU acceleration if not already active
### High memory usage
Ollama keeps models in RAM. To unload a model, run `ollama stop <model>` (available in recent Ollama versions) or lower the `OLLAMA_KEEP_ALIVE` duration.

## Security Considerations
Recommended setup:

- Bind to `localhost` only (default)
- Use SSH tunneling for remote access
- Run behind a reverse proxy with auth (Nginx, Caddy)
- Sidecar API protected by session token
- Token rotates on each app launch
- Secrets stored in OS keychain, never in plaintext
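For the SSH tunneling recommendation above, a minimal sketch (`user@remote-host` is a placeholder for your own remote machine):

```shell
# Forward local port 11434 to Ollama on the remote machine; -N opens no shell
ssh -N -L 11434:localhost:11434 user@remote-host
```

The dashboard then talks to `http://localhost:11434` as usual, with traffic encrypted over SSH.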
## OpenAI-Compatible Servers
Any server implementing `/v1/chat/completions` works:
- llama.cpp server: `./server -m model.gguf --port 8080`
- vLLM: `vllm serve model_name --port 8080`
- text-generation-webui: enable the OpenAI extension
- LocalAI: compatible out of the box
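A minimal request against any of these servers looks like the following; the port and model name depend on your setup:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model_name",
    "messages": [{ "role": "user", "content": "Summarize this headline: ..." }],
    "max_tokens": 300
  }'
```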