Overview
Off Grid provides extensive performance tuning options to optimize AI inference for your device. All settings are global and affect every model uniformly. Access settings from Settings → Model Settings. Refer to ARCHITECTURE.md:272-297 for the complete settings reference.
CPU Threads Settings
Optimal Thread Count
CPU threads control how many processor cores are used for inference. From ARCHITECTURE.md:346-350:
- More threads = faster inference (up to a point)
- Optimal: 4-6 threads on most devices
- Flagship devices: 6-8 threads
- Diminishing returns beyond 8 threads
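The guidance above can be sketched as a small helper that maps a device's core count into the recommended 4-8 band. This is an illustrative sketch only; the function name and the 0.75 headroom factor are assumptions, not the app's actual logic.

```typescript
// Hypothetical helper: derive a thread count from the device's core count.
// Leaves headroom for the UI thread; never goes below 4 or above 8,
// matching the "optimal 4-6, flagship 6-8" guidance above.
function recommendThreads(coreCount: number): number {
  const candidate = Math.floor(coreCount * 0.75);
  return Math.min(8, Math.max(4, candidate));
}
```

For a typical 8-core flagship this lands on 6 threads, in line with the diminishing returns observed beyond 8.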
How to Configure
Batch Size Tuning
What is Batch Size?
Batch size determines how many tokens are processed in a single inference pass. From ARCHITECTURE.md:352-355:
- Smaller (32-128): Faster first token (lower TTFT)
- Larger (256-512): Better throughput (higher tok/s)
- Default: 512 (balanced for mobile)
Trade-offs
| Batch Size | First Token Speed | Overall Throughput | Memory Usage |
|---|---|---|---|
| 32-64 | Very Fast | Lower | Low |
| 128-256 | Fast | Good | Medium |
| 512 | Medium | Best | Higher |
Configuration
The default of 512 is optimized for mobile devices and provides the best balance of throughput and memory efficiency.
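The trade-off table above can be expressed as a small chooser. This is a hedged sketch, not the app's selection logic; the function name, parameters, and the specific values picked from each table row are illustrative assumptions.

```typescript
// Illustrative chooser over the trade-off table above.
// prioritizeTTFT: favor a fast first token; otherwise favor throughput.
// lowMemory: stay in the medium-memory row of the table.
function chooseBatchSize(prioritizeTTFT: boolean, lowMemory: boolean): number {
  if (lowMemory) return 128;        // fast TTFT, good throughput, medium memory
  return prioritizeTTFT ? 64 : 512; // very fast TTFT vs. best throughput
}
```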
Context Length Trade-offs
Memory and Speed Impact
Context length controls how much conversation history the model remembers. From ARCHITECTURE.md:357-363:
- Longer context = more memory + slower inference
- Automatic truncation when context limit exceeded
Recommended Settings
| Context Length | RAM Impact | Use Case |
|---|---|---|
| 512 | Minimal | Short Q&A, one-shot prompts |
| 2048 | Low | Standard conversations (default) |
| 4096 | Medium | Long conversations, document analysis |
| 8192 | High | Multi-turn complex tasks (requires 8GB+ RAM) |
See ARCHITECTURE.md:327-342 for the RAM-aware context capping logic.
Configuration
Choose Based on RAM
Select a value appropriate for your device’s RAM:
- 4GB RAM: 512 or 2048
- 6GB RAM: 2048 or 4096
- 8GB+ RAM: 4096 or 8192
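The RAM-based guidance above amounts to capping the requested context at a per-tier limit. The sketch below illustrates that idea under assumed thresholds; it is not the app's actual capping code (ARCHITECTURE.md:327-342 has that).

```typescript
// Minimal sketch of RAM-aware context capping, mirroring the table above.
// Thresholds and the function name are assumptions for illustration.
function capContextLength(requested: number, totalRamGb: number): number {
  const cap =
    totalRamGb >= 8 ? 8192 :
    totalRamGb >= 6 ? 4096 :
    2048; // 4GB devices: 512 or 2048
  return Math.min(requested, cap);
}
```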
GPU/NPU Offloading
Platform Support
Android: OpenCL GPU offloading on Qualcomm Adreno GPUs (experimental)
iOS: Metal GPU acceleration (enabled by default)
How GPU Layers Work
From ARCHITECTURE.md:365-372:
- Specify number of transformer layers to offload (0-99)
- More layers = faster (if stable)
- OpenCL backend is experimental; it can crash on some devices
- Start with 0, incrementally increase
- Automatic fallback to CPU if GPU initialization fails
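The automatic CPU fallback described above follows a simple try/catch pattern. The sketch below shows the shape of that pattern; `initModel` and its options are hypothetical stand-ins for the real native binding, not Off Grid's actual API.

```typescript
// Hypothetical model-init options; stand-in for the real binding's type.
type InitOptions = { gpuLayers: number };

// Try GPU offloading first; on failure, retry with 0 layers (CPU-only).
// Returns the layer count actually in effect.
function initWithFallback(
  initModel: (opts: InitOptions) => void,
  gpuLayers: number
): number {
  try {
    initModel({ gpuLayers });
    return gpuLayers;            // GPU path succeeded
  } catch {
    initModel({ gpuLayers: 0 }); // automatic CPU fallback
    return 0;
  }
}
```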
Safety on Low-RAM Devices
From ARCHITECTURE.md:327-342:
Devices with ≤4GB RAM (iPhone XS, iPhone 8, etc.):
- GPU layers forced to 0 (CPU-only)
- Reason: Metal/OpenCL buffer allocation can call abort() before JavaScript catches errors, crashing the app
- CLIP GPU also disabled for vision models
Devices with more than 4GB RAM:
- Metal/OpenCL enabled normally
- Configurable GPU layers (0-99)
Configuration (Android)
Increment Gradually
Increase by 5-10 layers at a time:
- Test with a conversation
- Monitor for crashes
- If stable, increase further
Configuration (iOS)
On iOS, Metal GPU is enabled by default:
- 6GB+ RAM devices: Full Metal support, adjustable GPU layers
- 4GB RAM devices: GPU disabled automatically for safety
You can adjust GPU layers in Settings → Model Settings → GPU Layers, but the app will enforce device-specific safety limits.
Flash Attention
What is Flash Attention?
Flash attention is an optimized attention mechanism that speeds up inference. From ARCHITECTURE.md:284:
- User-configurable toggle for faster inference
- Automatically disabled when GPU layers > 0 on Android (llama.cpp compatibility)
- Default: Enabled
Compatibility
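The auto-disable rule above reduces to one check. This sketch is illustrative only; the function name and parameters are assumptions, not the app's settings code.

```typescript
// Illustrative check: flash attention is forced off when GPU layers
// are offloaded on Android (llama.cpp compatibility); otherwise the
// user's toggle applies.
function effectiveFlashAttention(
  userEnabled: boolean,
  platform: 'android' | 'ios',
  gpuLayers: number
): boolean {
  if (platform === 'android' && gpuLayers > 0) return false;
  return userEnabled;
}
```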
Configuration
KV Cache Type
Memory vs Quality Tradeoff
KV cache stores attention key-value pairs. You can quantize the cache to save memory. From ARCHITECTURE.md:285:
Configurable cache quantization options:
- f16 - Full precision (highest quality, most memory)
- q8_0 - 8-bit quantization (balanced, default)
- q4_0 - 4-bit quantization (lowest memory, slight quality loss)
When to Use Each Type
| Cache Type | Quality | Memory Savings | Use Case |
|---|---|---|---|
| f16 | Best | None | 8GB+ RAM devices, maximum quality |
| q8_0 | Excellent | ~50% | Default, best balance |
| q4_0 | Good | ~75% | Low RAM devices, very long contexts |
Configuration
Choose Based on Device
- 4GB RAM: Use q4_0 to maximize context length
- 6-8GB RAM: Use q8_0 (default)
- 8GB+ RAM: Use f16 for maximum quality
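The RAM guidance above maps cleanly onto a cache-type chooser. The thresholds and function name below are assumptions for illustration; the app's actual default is q8_0 regardless of device.

```typescript
// Cache types documented above.
type KvCacheType = 'f16' | 'q8_0' | 'q4_0';

// Illustrative mapping of total RAM onto the recommended cache type.
function chooseKvCacheType(totalRamGb: number): KvCacheType {
  if (totalRamGb <= 4) return 'q4_0'; // maximize context on low RAM
  if (totalRamGb >= 8) return 'f16';  // quality headroom available
  return 'q8_0';                      // default balance
}
```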
The default q8_0 setting (from src/stores/appStore.ts:117) provides excellent quality with significant memory savings compared to f16.
Device-Specific Recommendations
Low-End Devices (4GB RAM)
Optimal Settings:
- CPU Threads: 4
- Batch Size: 256
- Context Length: 2048 (hard cap)
- GPU Layers: 0 (forced, safety)
- Flash Attention: On
- KV Cache Type: q4_0
- Model: Q4_K_M quantization, under 2B parameters
Mid-Range Devices (6-8GB RAM)
Optimal Settings:
- CPU Threads: 6
- Batch Size: 512
- Context Length: 4096
- GPU Layers: 0-20 (test incrementally)
- Flash Attention: On
- KV Cache Type: q8_0
- Model: Q4_K_M or Q5_K_M, up to 7B parameters
Flagship Devices (8GB+ RAM)
Optimal Settings:
- CPU Threads: 6-8
- Batch Size: 512
- Context Length: 8192
- GPU Layers: 20-40 (test incrementally)
- Flash Attention: On
- KV Cache Type: q8_0 or f16
- Model: Q5_K_M or Q6_K, up to 8B parameters
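The three tiers above can be collected into one preset table. Field names below are illustrative, not the app's actual settings keys; for GPU layers the value is the upper end of each tier's "test incrementally" range, not a safe starting point.

```typescript
// Hedged preset table mirroring the three device tiers above.
// maxGpuLayers is the tier's upper test bound; always start from 0
// and increase incrementally as described in the GPU section.
const PRESETS = {
  lowEnd:   { threads: 4, batch: 256, context: 2048, maxGpuLayers: 0,  kvCache: 'q4_0' },
  midRange: { threads: 6, batch: 512, context: 4096, maxGpuLayers: 20, kvCache: 'q8_0' },
  flagship: { threads: 8, batch: 512, context: 8192, maxGpuLayers: 40, kvCache: 'q8_0' },
} as const;
```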
Image Generation Settings
Steps (Quality vs Speed)
From ARCHITECTURE.md:288:
- Range: 4-50 steps
- Default: 20
- More steps = better quality, slower generation
- Fewer steps = faster, lower quality
Guidance Scale (Prompt Adherence)
From ARCHITECTURE.md:289:
- Range: 1-20
- Default: 7.5
- Higher values = stricter adherence to prompt, more contrast
- Lower values = more creative freedom, softer images
Image Generation Threads
From ARCHITECTURE.md:296:
- Range: 1-8 CPU threads
- Default: 4
- Only affects CPU-based image generation (MNN backend on Android)
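The three image-generation settings above all have documented ranges and defaults, which suggests a simple clamping helper. This is a sketch under those documented ranges; the function and field names are assumptions.

```typescript
// Clamp image-generation settings to the documented ranges:
// steps 4-50 (default 20), guidance 1-20 (default 7.5), threads 1-8 (default 4).
function clampImageSettings(s: { steps?: number; guidance?: number; threads?: number }) {
  const clamp = (v: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, v));
  return {
    steps: clamp(s.steps ?? 20, 4, 50),
    guidance: clamp(s.guidance ?? 7.5, 1, 20),
    threads: clamp(s.threads ?? 4, 1, 8),
  };
}
```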
Platform-Specific Performance
From ARCHITECTURE.md:138-141:
| Platform (Backend) | Chipset / Variant | Performance (512×512, 20 steps) |
|---|---|---|
| Android CPU (MNN) | Snapdragon 8 Gen 3 | ~15s |
| Android CPU (MNN) | Snapdragon 7 series | ~30s |
| Android NPU (QNN) | Snapdragon 8 Gen 1+ | ~5-10s (chipset-dependent) |
| iOS ANE (Core ML) | A17 Pro/M-series | ~8-15s |
| iOS ANE (Core ML) | Palettized models | ~2x slower (dequantization overhead) |
Performance Monitoring
Generation Details
Enable detailed performance metrics:
View Metrics During Generation
See real-time metrics in chat:
- tok/s - Overall tokens per second
- Decode tok/s - Decode-only speed
- TTFT - Time to first token
- Token count - Total tokens generated
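The metrics above relate to each other through three timestamps: generation start, first token, and end. The sketch below shows those relationships; timestamp names and the function itself are illustrative, not the app's metrics code.

```typescript
// Compute the metrics listed above from three timestamps (milliseconds).
// Overall tok/s counts the full wall time; decode tok/s excludes the
// prefill phase, which ends when the first token arrives.
function computeMetrics(startMs: number, firstTokenMs: number, endMs: number, tokens: number) {
  const ttftSec = (firstTokenMs - startMs) / 1000;
  const totalSec = (endMs - startMs) / 1000;
  const decodeSec = (endMs - firstTokenMs) / 1000;
  return {
    ttftSec,                                   // time to first token
    tokPerSec: tokens / totalSec,              // overall throughput
    decodeTokPerSec: (tokens - 1) / decodeSec, // first token belongs to prefill
  };
}
```

This is why decode tok/s is usually higher than overall tok/s: the prefill time is amortized over zero decoded tokens.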
Device Info
Check current hardware status:
Expected Performance
From ARCHITECTURE.md:908-926:
Text Generation
Flagship devices (Snapdragon 8 Gen 2+):
- CPU: 15-30 tok/s (4-8 threads)
- GPU (OpenCL): 20-40 tok/s (experimental, stability varies)
- TTFT: 0.5-2s depending on context length
Mid-range devices:
- CPU: 5-15 tok/s
- TTFT: 1-3s
Factors affecting speed:
- Model size (larger = slower)
- Quantization (lower bits = faster)
- Context length (more tokens = slower)
- Thread count (4-8 threads optimal)
- GPU layers (more = faster if stable)
Vision Inference
SmolVLM 500M:
- Flagship: ~7s per image
- Mid-range: ~15s per image
Larger vision models:
- Flagship: ~10-15s per image
- Mid-range: ~25-35s per image
Troubleshooting
App Crashes During Inference
- Reduce GPU layers to 0 (disable GPU offloading)
- Lower context length to 2048
- Switch to smaller model or lower quantization
- Reduce batch size to 256
- Use q4_0 KV cache type
Slow Inference Speed
- Increase CPU threads to 6 or 8
- Enable flash attention
- Increase batch size to 512
- Test GPU offloading (if stable on your device)
- Switch to lower quantization (Q4_K_M instead of Q6_K)
High Memory Usage
- Reduce context length
- Switch to q4_0 KV cache type
- Use smaller model
- Clear conversation history (Settings → Clear All Conversations)
Thermal Throttling
- Reduce CPU threads to 4
- Disable GPU offloading
- Use smaller batch size
- Let device cool down between long generations