
GPU & OpenCL Crashes

Issue: App crashes during GPU layer offloading

Symptoms:
  • App crashes immediately after loading model with GPU layers > 0
  • Native crash (SIGSEGV, SIGABRT) with llama.cpp stack trace
  • OpenCL initialization failure on Android
Cause: The OpenCL backend can crash on some Qualcomm devices during layer-offload initialization. The crash typically happens before JavaScript can catch the error.

Solution:
Always begin with 0 GPU layers and incrementally increase while monitoring stability.
// In model settings
gpuLayers: 0  // Start here
Then test:
  1. Load model with 0 layers (CPU-only) — verify stability
  2. Increase to 10 layers — test generation
  3. Increase to 20 layers — test generation
  4. Continue until crash or diminishing returns
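The stepping procedure above can be sketched as a small helper. This is illustrative only (not part of the app's codebase): it produces the list of GPU layer counts to try in order, starting at 0 and increasing by 10 up to the model's layer count.

```typescript
// Illustrative helper: build the incremental GPU-layer test schedule
// described above (0, 10, 20, ... up to maxLayers).
function gpuLayerSchedule(maxLayers: number, step = 10): number[] {
  const schedule: number[] = [];
  for (let layers = 0; layers <= maxLayers; layers += step) {
    schedule.push(layers);
  }
  return schedule;
}
```

Test each value with a real generation before moving to the next; a native crash at a given step means the previous value is your device's ceiling.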
The app automatically falls back to CPU if GPU initialization fails, but this only works when the failure surfaces at a stage the app can intercept; a hard native crash during offload kills the process before the fallback can run.
If crashes persist, disable GPU acceleration completely:
// In ModelSettingsScreen or appStore
enableGpu: false,
gpuLayers: 0,
CPU inference uses ARM NEON, i8mm, and dotprod SIMD instructions — still performant on modern devices.
Flash attention is automatically disabled when GPU layers > 0 on Android due to llama.cpp compatibility issues.

Metal Crashes on Low-RAM Devices

Issue: App crashes on iPhone XS, iPhone 8, or other ≤4GB RAM devices

Symptoms:
  • App killed instantly during model load (no JavaScript error)
  • Metal buffer allocation crash
  • CLIP warmup crash during vision model initialization
Cause: On devices with ≤4GB RAM, Metal buffer allocation for LLM inference and CLIP can call abort(), killing the app before JavaScript can catch the error. This is a POSIX signal that bypasses try/catch.

Solution:
Off Grid automatically applies RAM-based caps before any native call:
Device RAM | GPU Layers   | Context Cap | CLIP GPU
≤4GB       | 0 (CPU-only) | 2048        | Off
4-6GB      | Requested    | 2048        | On
6-8GB      | Requested    | 4096        | On
>8GB       | Requested    | 8192        | On
Helpers in src/services/llmHelpers.ts:
  • getMaxContextForDevice(totalMemoryBytes) — caps auto-scaled context length
  • getGpuLayersForDevice(totalMemoryBytes, requestedLayers) — disables Metal on ≤4GB devices
The GPU check runs before initContextWithFallback() so the dangerous Metal allocation is never attempted.
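A sketch of what these helpers do, derived from the cap table above. The function names come from `src/services/llmHelpers.ts`, but the bodies here are assumptions; the real implementations may differ.

```typescript
const GB = 1024 ** 3;

// Sketch: cap the auto-scaled context length by device RAM,
// per the cap table above (assumed thresholds, inclusive upper bounds).
function getMaxContextForDevice(totalMemoryBytes: number): number {
  if (totalMemoryBytes <= 6 * GB) return 2048;
  if (totalMemoryBytes <= 8 * GB) return 4096;
  return 8192;
}

// Sketch: force CPU-only inference (0 GPU layers) on ≤4GB devices,
// otherwise honor the requested layer count.
function getGpuLayersForDevice(
  totalMemoryBytes: number,
  requestedLayers: number
): number {
  return totalMemoryBytes <= 4 * GB ? 0 : requestedLayers;
}
```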
If you’re developing on a low-RAM device and still experiencing crashes, verify that getGpuLayersForDevice() is being called before model initialization.

Model Loading Failures

Issue: “Insufficient RAM” or “Memory budget exceeded”

Symptoms:
  • Red error message when trying to load a model
  • Warning banner showing memory usage
  • Model load blocked before attempting
Cause: Pre-load memory checks prevent loading models that would exceed the device’s safe RAM budget (60% of total RAM).

Solution:
Unload the currently active model before loading a new one:
// In HomeScreen or ModelSelector
await activeModelService.unloadTextModel();
await activeModelService.loadTextModel(newModelId);
Or use the UI:
  1. Go to Home screen
  2. Tap “Unload” on active model card
  3. Return to Models screen and load new model
Select a model with lower quantization or smaller parameter count:
  • Instead of Qwen3-7B-Q5_K_M (~5.5GB), try Qwen3-7B-Q4_K_M (~4.0GB)
  • Instead of 7B models, try 3B models (Llama 3.2 3B, Qwen 3 3B)
  • For very constrained devices, use 0.6B-1.5B models (SmolLM3, Qwen3-0.6B)
See Quantization Guide for size/quality tradeoffs.
Lower the context window to free up RAM:
// Settings → Model Settings → Context Length
contextLength: 2048  // Down from 4096 or 8192
Each halving of context length saves ~1-2GB RAM depending on model size.

Memory Budget Calculation

// Text models
requiredRAM = fileSize * 1.5  // KV cache, activations

// Vision models
requiredRAM = (modelFileSize + mmProjSize) * 1.5

// Image models
requiredRAM = fileSize * 1.8  // MNN/QNN runtime overhead

// Safe limit
budget = deviceTotalRAM * 0.60
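The budget calculation above, written out as a runnable sketch (the function names are illustrative, not the app's actual API):

```typescript
type ModelKind = "text" | "vision" | "image";

// Estimated RAM needed to run a model, per the multipliers above.
function requiredRamBytes(
  kind: ModelKind,
  fileSizeBytes: number,
  mmProjSizeBytes = 0
): number {
  switch (kind) {
    case "text":
      return fileSizeBytes * 1.5; // KV cache, activations
    case "vision":
      return (fileSizeBytes + mmProjSizeBytes) * 1.5;
    case "image":
      return fileSizeBytes * 1.8; // MNN/QNN runtime overhead
  }
}

// A load is allowed only if the estimate fits within 60% of device RAM.
function fitsMemoryBudget(
  kind: ModelKind,
  fileSizeBytes: number,
  deviceTotalRamBytes: number,
  mmProjSizeBytes = 0
): boolean {
  return (
    requiredRamBytes(kind, fileSizeBytes, mmProjSizeBytes) <=
    deviceTotalRamBytes * 0.6
  );
}
```

For example, a 4GB text model on an 8GB device needs ~6GB but the budget is only ~4.8GB, so the load is blocked.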

Download Issues

Issue: Model download fails or stalls

Symptoms:
  • Download progress stuck at 0% or mid-download
  • “Download failed” error
  • Partial file in storage
Solutions:
  • Ensure stable Wi-Fi or cellular connection
  • Large models (>4GB) may take 10-30 minutes on slower connections
  • Hugging Face CDN can be slow during peak hours
Pre-download storage check should prevent this, but verify:
# Android
adb shell df /data

# iOS
# Settings → General → iPhone Storage
Ensure at least 2x model size free space for safe extraction.
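The 2x rule is simple enough to express directly; this is an illustrative check, not the app's actual implementation:

```typescript
// Pre-download storage check: require at least 2x the model size free,
// so the download plus any extraction scratch space both fit.
function hasRoomForDownload(
  modelSizeBytes: number,
  freeStorageBytes: number
): boolean {
  return freeStorageBytes >= 2 * modelSizeBytes;
}
```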
Android’s DownloadManager retries automatically, but you can also retry manually:
  1. Go to Download Manager screen
  2. Tap failed download → Remove
  3. Return to Models screen and re-download
Interrupted downloads may leave orphaned GGUF files:
  1. Go to Settings → Storage Settings
  2. Tap “Scan for orphaned files”
  3. Review list and tap “Delete all orphaned files”
  4. Retry download
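The orphaned-file scan can be thought of as a set difference: any `.gguf` file on disk that the model registry doesn't know about is orphaned. A minimal Node-style sketch (the function name and registry shape are assumptions, not the app's actual code):

```typescript
import * as fs from "fs";
import * as path from "path";

// Illustrative orphan scan: list .gguf files in a directory that are
// not present in the set of registered model filenames.
function findOrphanedGguf(
  modelDir: string,
  registeredFiles: Set<string>
): string[] {
  return fs
    .readdirSync(modelDir)
    .filter((name) => name.endsWith(".gguf") && !registeredFiles.has(name))
    .map((name) => path.join(modelDir, name));
}
```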

Vision Model Performance

Issue: Vision inference extremely slow after prompt enhancement

Symptoms:
  • First vision inference fast (~7s)
  • Second vision inference slow (~30-60s)
  • Happens after using AI prompt enhancement for image generation
Cause: The generation service used to clear the KV cache after prompt enhancement, which forced the next vision inference to rebuild the entire cache from scratch.

Solution:
Off Grid now only calls stopGeneration() after prompt enhancement, not clearKvCache(). This preserves the KV cache and keeps vision inference fast.
// After enhancement completes
await llmService.stopGeneration();  // Clear generating flag
// Note: KV cache NOT cleared to preserve vision inference speed
If you’re modifying the codebase, ensure you never clear the KV cache unless explicitly requested by the user.
Clearing the KV cache can make vision inference 30-60s slower. Only clear when:
  • User explicitly taps “Clear KV Cache” in settings
  • Model is being unloaded
  • Switching to a different model
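The three allowed cases above can be captured in a single guard. This is an illustrative sketch (the function and field names are hypothetical), useful if you're adding code paths that touch the cache:

```typescript
// Conditions under which clearing the KV cache is acceptable,
// per the rules above. Everywhere else, preserve the cache.
interface KvCacheClearContext {
  userRequested: boolean;   // user tapped "Clear KV Cache" in settings
  unloadingModel: boolean;  // model is being unloaded
  switchingModel: boolean;  // switching to a different model
}

function shouldClearKvCache(ctx: KvCacheClearContext): boolean {
  return ctx.userRequested || ctx.unloadingModel || ctx.switchingModel;
}
```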

Image Generation Issues

Issue: Image generation fails or produces black images

Symptoms:
  • Generation completes but image is blank/black
  • Native crash during generation
  • “Model not loaded” error
Solutions:
Ensure you’re using the correct model for your chipset:
MNN (CPU) models — Work on all ARM64 devices
  • Anything V5, Absolute Reality, QteaMix, ChilloutMix, CuteYukiMix
QNN (NPU) models — Require Snapdragon 8 Gen 1+
  • All MNN models plus DreamShaper, Realistic Vision, etc.
  • Variants: min, 8gen1, 8gen2
The app auto-detects NPU support, but you can verify:
// In DeviceInfoScreen or debug logs
const hasQNN = await localDreamGeneratorService.supportsQNN();
On iOS, ensure you downloaded a Core ML model (not MNN/QNN):
  • SD 1.5 Palettized (~1GB)
  • SD 2.1 Palettized (~1GB)
  • SDXL iOS (~2GB)
  • SD 1.5 Full (~4GB)
  • SD 2.1 Base Full (~4GB)
Models from Apple’s official HuggingFace repos only.
Lower the generation parameters:
steps: 15         // Down from 20-25
guidanceScale: 7  // Down from 10+
Higher values = longer generation, more memory usage.

Performance Tuning

Issue: Text generation is slow

Symptoms:
  • Less than 5 tok/s on flagship device
  • Long first token time (over 3s)
Solutions:
More threads = faster inference (to a point):
// Settings → Model Settings → CPU Threads
cpuThreads: 6  // Optimal for most devices (up to 8 on flagships)
Recommendations:
  • Mid-range: 4-6 threads
  • Flagship: 6-8 threads
  • Diminishing returns beyond 8
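One way to turn the ranges above into a default is a heuristic like the following. This is illustrative only, not the app's actual logic: leave a couple of cores for the OS and UI, stay within the 4-8 sweet spot, and never exceed the physical core count.

```typescript
// Illustrative thread-count heuristic based on the ranges above.
function recommendedThreads(coreCount: number): number {
  const target = Math.max(4, coreCount - 2); // reserve ~2 cores for OS/UI
  return Math.min(8, coreCount, target);     // diminishing returns beyond 8
}
```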
// Settings → Model Settings → Batch Size
batchSize: 256  // Balanced default
Tradeoffs:
  • Smaller (32-128): Faster first token
  • Larger (256-512): Better throughput
// Settings → Model Settings → Context Length
contextLength: 2048  // Down from 4096+
Shorter context = faster inference + less memory.
Switch from Q5_K_M/Q6_K to Q4_K_M:
  • Q4_K_M: Faster, smaller, slightly lower quality
  • Q5_K_M: Balanced
  • Q6_K: Slower, larger, highest quality
See Quantization Guide.

Development Issues

Issue: Pre-commit hooks fail

Symptoms:
  • git commit fails with linting/type errors
  • Tests fail during commit
Solution:
Pre-commit hooks run quality gates scoped to staged files:
Staged file type  | Checks
.ts/.tsx/.js/.jsx | ESLint, tsc --noEmit, npm test
.swift            | SwiftLint, npm run test:ios
.kt/.kts          | compileDebugKotlin, lintDebug, npm run test:android
Fix the errors and recommit. Never skip with --no-verify.

Issue: SwiftLint not installed

Solution:
brew install swiftlint
The hook skips SwiftLint with a warning if not installed, but it’s required for iOS development.

Additional Resources

Architecture Reference — Full technical documentation

GitHub Issues — Report bugs and request features
If you’re experiencing an issue not covered here, check the GitHub Issues or ask in the Slack community.
