Overview
Off Grid uses aggressive memory management to prevent crashes and ensure smooth operation on mobile devices with limited RAM. The system calculates memory budgets, estimates model memory usage, and enforces runtime safeguards before loading models.
Exceeding device RAM limits can cause app crashes, system instability, or out-of-memory errors. Always respect memory warnings and recommendations.
Memory Budget Calculation
Off Grid treats 60% of total device RAM as the safe maximum budget for AI models.
const totalDeviceRAM = await DeviceInfo.getTotalMemory(); // bytes
const memoryBudget = totalDeviceRAM * 0.60; // 60% of total RAM
Warning Thresholds
| Threshold | Status | UI Indicator | Action |
|---|---|---|---|
| < 50% | Safe | Green/None | Model loads normally |
| 50-60% | Warning | Yellow warning | Model loads with warning message |
| ≥ 60% | Critical | Red error | Model load blocked |
The 60% limit accounts for OS overhead, background processes, and React Native’s JavaScript runtime. Exceeding this limit risks system memory pressure and app termination.
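The thresholds above can be sketched as a small classifier. This is an illustrative sketch, not the app's actual implementation; the type and function names are assumptions:

```typescript
// Sketch of the warning-threshold logic (names are illustrative).
type MemoryStatus = "safe" | "warning" | "critical";

// ratio = estimated model RAM / total device RAM
function classifyMemoryRatio(ratio: number): MemoryStatus {
  if (ratio >= 0.6) return "critical"; // load blocked
  if (ratio >= 0.5) return "warning";  // load allowed with a warning
  return "safe";                       // load proceeds normally
}
```

A "critical" result corresponds to the blocked-load path described in the pre-load checks below.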
Example Budget by Device
| Device RAM | Safe Budget (60%) | Typical Use |
|---|---|---|
| 4GB | ~2.4GB | Qwen3 0.6B Q4_K_M, SmolLM3 135M |
| 6GB | ~3.6GB | Llama 3.2 3B Q4_K_M, Qwen3 1.6B |
| 8GB | ~4.8GB | Qwen3 7B Q4_K_M, Llama 3.2 3B Q6_K |
| 12GB | ~7.2GB | Qwen3 14B Q4_K_M, Llama 3.3 8B Q5_K_M |
| 16GB | ~9.6GB | Qwen3 14B Q8_0, Llama 3.3 8B Q8_0 |
Model Memory Estimates
Before loading a model, the system estimates total RAM usage based on model type and file size.
Text Models
Formula:
requiredRAM = fileSize × 1.5
Overhead includes:
- Model weights loaded into memory
- KV (key-value) cache for context window
- Activations during inference
- llama.cpp runtime buffers
Example:
- Qwen3 0.6B Q4_K_M: 395 MB × 1.5 = ~593 MB
- Llama 3.2 3B Q4_K_M: 2.0 GB × 1.5 = ~3.0 GB
- Qwen3 7B Q4_K_M: 4.0 GB × 1.5 = ~6.0 GB
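Combining this formula with the 60% budget gives a simple fit check. The helper below is an illustrative sketch, not the app's actual API:

```typescript
// A text model fits when fileSize × 1.5 stays within 60% of device RAM.
function textModelFits(fileSizeGB: number, deviceRamGB: number): boolean {
  const requiredGB = fileSizeGB * 1.5; // text-model overhead multiplier
  const budgetGB = deviceRamGB * 0.6;  // safe memory budget
  return requiredGB <= budgetGB;
}

// Llama 3.2 3B Q4_K_M (2.0 GB file) needs ~3.0 GB:
// it fits a 6 GB device (budget ~3.6 GB) but not a 4 GB device (budget ~2.4 GB).
```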
Vision Models
Formula:
requiredRAM = (modelFileSize + mmProjSize) × 1.5
Vision models require:
- Main GGUF model file
- mmproj (multimodal projector) companion file
- Combined size × 1.5 for overhead
Example:
- SmolVLM 500M:
  - Model: 475 MB
  - mmproj: 125 MB
  - Total: (475 + 125) × 1.5 = ~900 MB
- Qwen3-VL 2B:
  - Model: 1.2 GB
  - mmproj: 350 MB
  - Total: (1.2 + 0.35) × 1.5 = ~2.3 GB
Vision models automatically download required mmproj files. If mmproj wasn’t linked during download, the system searches the model directory at load time.
Image Generation Models
Formula:
requiredRAM = fileSize × 1.8
Overhead includes:
- Model weights (UNet, VAE, text encoder)
- MNN/QNN runtime allocations
- Intermediate tensors during denoising
- Image preview buffers
Example:
- SD 1.5 Palettized (iOS): 1.0 GB × 1.8 = ~1.8 GB
- Anything V5 MNN (Android): 1.2 GB × 1.8 = ~2.2 GB
- SD 1.5 Full (iOS): 4.0 GB × 1.8 = ~7.2 GB
Image generation models have higher overhead (1.8x) due to MNN/QNN runtime and intermediate tensor storage. Full-precision Core ML models (~4GB) require flagship devices with 8GB+ RAM.
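Pulling the three formulas together, the per-type estimate can be sketched as one dispatch function. The `ModelKind` type, field names, and function name are illustrative assumptions, not the app's actual API:

```typescript
type ModelKind = "text" | "vision" | "image";

interface ModelFiles {
  fileSizeGB: number;
  mmProjSizeGB?: number; // only present for vision models
}

// Apply the per-type overhead multiplier described above:
// text 1.5x, vision 1.5x (model + mmproj), image generation 1.8x.
function estimateModelRamGB(kind: ModelKind, files: ModelFiles): number {
  switch (kind) {
    case "text":
      return files.fileSizeGB * 1.5;
    case "vision":
      return (files.fileSizeGB + (files.mmProjSizeGB ?? 0)) * 1.5;
    case "image":
      return files.fileSizeGB * 1.8;
  }
}
```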
Pre-Load Memory Checks
Before loading any model, the system performs RAM validation.
Check Process
// 1. Get device memory info
const deviceInfo = await hardwareService.getDeviceInfo();
const totalRAM = deviceInfo.totalMemory / (1024 ** 3); // Convert to GB
const memoryBudget = totalRAM * 0.60;
// 2. Estimate model RAM requirement
const model = getModelById(modelId);
const requiredRAM = estimateModelMemory(model, modelType);
// 3. Check if currently loaded models fit with new model
const currentlyLoaded = getCurrentlyLoadedMemory();
const totalRequired = requiredRAM + currentlyLoaded;
// 4. Compare against budget
if (totalRequired > memoryBudget) {
  return {
    canLoad: false,
    severity: 'critical',
    message: `Cannot load ${model.name} (~${requiredRAM.toFixed(1)}GB required) - would exceed device safe limit of ${memoryBudget.toFixed(1)}GB. Unload current model or choose smaller.`
  };
}
Memory Check Results
| Status | Severity | Action | Example Message |
|---|---|---|---|
| Can load safely | none | Proceed | Model loads without warning |
| Can load with warning | warning | Proceed with caution | "Model will use 55% of available RAM" |
| Cannot load | critical | Block load | "Cannot load Qwen3-7B-Q4_K_M (~5.5GB required) - would exceed device safe limit of 4.8GB" |
Pre-load checks prevent out-of-memory crashes by blocking loads that would exceed the 60% RAM budget. Users must unload the current model or choose a smaller model.
RAM-Aware Runtime Safeguards
On low-RAM devices, llama.cpp and CLIP can call abort() during Metal/OpenCL GPU allocation, bypassing JavaScript error handling and crashing the app instantly. To prevent this, the system applies device-RAM-based caps before any native call.
Device-Specific Caps
| Device RAM | GPU Layers | Context Cap | CLIP GPU | Rationale |
|---|---|---|---|---|
| ≤4GB | 0 (CPU-only) | 2048 | Off | Prevent Metal/OpenCL abort() crashes |
| 4-6GB | Requested | 2048 | On | Limited headroom for GPU buffers |
| 6-8GB | Requested | 4096 | On | Sufficient RAM for moderate contexts |
| >8GB | Requested | 8192 | On | Full feature set available |
Implementation
GPU Layer Override:
// src/services/llmHelpers.ts
export function getGpuLayersForDevice(
  totalMemoryBytes: number,
  requestedLayers: number
): number {
  const totalGB = totalMemoryBytes / (1024 ** 3);

  // Force CPU-only on ≤4GB devices (prevent Metal abort)
  if (totalGB <= 4) {
    return 0;
  }

  return requestedLayers;
}
Context Length Cap:
// src/services/llmHelpers.ts
export function getMaxContextForDevice(
  totalMemoryBytes: number
): number {
  const totalGB = totalMemoryBytes / (1024 ** 3);

  if (totalGB <= 4) return 2048;
  if (totalGB <= 6) return 2048;
  if (totalGB <= 8) return 4096;
  return 8192;
}
CLIP GPU Override:
// Vision models disable CLIP GPU on ≤4GB devices
const useGpuForClip = totalMemoryGB > 4;
await llmService.initializeMultimodal({
  modelPath,
  mmProjPath,
  useGpuForClip // false on ≤4GB devices
});
Critical for iOS devices with ≤4GB RAM (iPhone XS, iPhone 8): Metal GPU allocation can call abort() before JavaScript catches the error, killing the app instantly. GPU layers are automatically forced to 0 on these devices.
Why These Safeguards Matter
Without safeguards:
- iPhone XS (4GB RAM) attempts Metal GPU allocation
- Metal buffer allocation fails and calls abort()
- POSIX signal bypasses JavaScript try/catch
- App terminates immediately with SIGABRT
With safeguards:
- System detects 4GB RAM before native call
- GPU layers forced to 0 (CPU-only mode)
- Context length capped at 2048
- CLIP GPU disabled
- App runs stably without crashes
These runtime caps are applied automatically. Users on low-RAM devices will see GPU options disabled in settings with explanatory messages.
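The three caps compose into a single set of runtime limits applied before any native call. The aggregation below is an illustrative sketch (the wrapper function and `RuntimeCaps` shape are assumptions); the two helpers are reproduced from the snippets above so the example is self-contained:

```typescript
// Helpers from src/services/llmHelpers.ts (reproduced for a self-contained sketch).
function getGpuLayersForDevice(totalMemoryBytes: number, requestedLayers: number): number {
  const totalGB = totalMemoryBytes / (1024 ** 3);
  return totalGB <= 4 ? 0 : requestedLayers; // CPU-only on ≤4GB devices
}

function getMaxContextForDevice(totalMemoryBytes: number): number {
  const totalGB = totalMemoryBytes / (1024 ** 3);
  if (totalGB <= 4) return 2048;
  if (totalGB <= 6) return 2048;
  if (totalGB <= 8) return 4096;
  return 8192;
}

// Illustrative aggregation of all three caps before a native model init.
interface RuntimeCaps {
  gpuLayers: number;
  maxContext: number;
  useGpuForClip: boolean;
}

function getRuntimeCaps(
  totalMemoryBytes: number,
  requestedLayers: number,
  requestedContext: number
): RuntimeCaps {
  return {
    gpuLayers: getGpuLayersForDevice(totalMemoryBytes, requestedLayers),
    maxContext: Math.min(requestedContext, getMaxContextForDevice(totalMemoryBytes)),
    useGpuForClip: totalMemoryBytes / (1024 ** 3) > 4,
  };
}
```

On a 4GB device this yields CPU-only, a 2048 context, and CLIP GPU off, matching the table above.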
Memory Monitoring
The hardware service provides real-time memory tracking.
Get Current Memory Usage
import { hardwareService } from '../services/hardware';
// Get app-specific memory usage
const memoryUsage = await hardwareService.getAppMemoryUsage();
console.log('Used:', memoryUsage.used / (1024 ** 3), 'GB');
console.log('Available:', memoryUsage.available / (1024 ** 3), 'GB');
console.log('Total:', memoryUsage.total / (1024 ** 3), 'GB');
// Refresh memory info (force fresh fetch)
const deviceInfo = await hardwareService.refreshMemoryInfo();
console.log('Available RAM:', deviceInfo.availableMemory);
Model Memory Helpers
// Get combined size (model + mmproj for vision models)
const totalSize = hardwareService.getModelTotalSize(model);
// Format for display
const sizeString = hardwareService.formatModelSize(model); // "1.2 GB"
// Estimate RAM usage
const ramEstimate = hardwareService.estimateModelRam(model, 1.5);
const ramString = hardwareService.formatModelRam(model, 1.5); // "~1.8 GB"
Memory monitoring uses react-native-device-info for system-level memory stats. Native allocations (llama.cpp, MNN/QNN) may not be immediately reflected in the JavaScript layer.
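For UI thresholding, a pure helper over the raw stats keeps the logic testable. The interface mirrors the usage object returned above; the helper names are illustrative assumptions:

```typescript
// Field names mirror the getAppMemoryUsage() result shown above.
interface MemoryUsage {
  used: number;  // bytes
  total: number; // bytes
}

function usedFraction(usage: MemoryUsage): number {
  return usage.used / usage.total;
}

// Illustrative: flag memory pressure once usage crosses the 60% critical threshold.
function isUnderMemoryPressure(usage: MemoryUsage): boolean {
  return usedFraction(usage) >= 0.6;
}
```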
User-Friendly Error Messages
When memory checks fail, the system provides actionable error messages.
Example Messages
Critical (blocked load):
Cannot load Qwen3-7B-Q4_K_M (~5.5GB required) - would exceed
device safe limit of 4.8GB. Unload current model or choose smaller.
Warning (allowed but cautioned):
Loading Llama 3.2 3B Q4_K_M will use ~3.0GB (~55% of available RAM).
Performance may degrade if other apps are running.
Low-RAM device advisory:
Your device has limited memory (4GB). Only the smallest models will
work well. GPU acceleration disabled automatically for stability.
Emulator warning:
Running in emulator. Performance may be significantly slower than
real device. Memory estimates may be inaccurate.
Best Practices
1. Choose Models Based on Device RAM
- 4GB RAM: Qwen3 0.6B, SmolLM3 135M (Q2_K, Q3_K_M)
- 6GB RAM: Llama 3.2 3B, Qwen3 1.6B (Q4_K_M)
- 8GB RAM: Qwen3 7B, Llama 3.2 3B (Q4_K_M, Q5_K_M)
- 12GB+ RAM: Qwen3 14B, Llama 3.3 8B (Q5_K_M, Q6_K, Q8_0)
2. Unload Models When Switching
Always unload the current model before loading a new one to free RAM:
await activeModelService.unloadTextModel();
await activeModelService.loadTextModel(newModelId);
3. Monitor Memory During Development
Use the DeviceInfoScreen to check real-time memory usage:
- Total RAM
- Used RAM
- Available RAM
- Current model memory estimate
4. Test on Low-RAM Devices
Always test new features on devices with 4-6GB RAM to ensure safeguards work correctly.
5. Respect Warning Thresholds
If a model triggers a 50-60% warning:
- Close background apps
- Reduce context length
- Consider smaller quantization
- Monitor for performance degradation
Text Model Overhead (1.5x multiplier)
| Component | Percentage | Purpose |
|---|---|---|
| Model weights | ~67% | GGUF file loaded in RAM |
| KV cache | ~20% | Context window storage |
| Activations | ~8% | Intermediate layer outputs |
| Runtime buffers | ~5% | llama.cpp allocations |
Image Model Overhead (1.8x multiplier)
| Component | Percentage | Purpose |
|---|---|---|
| Model weights | ~56% | UNet, VAE, text encoder |
| MNN/QNN runtime | ~25% | Framework allocations |
| Intermediate tensors | ~15% | Denoising step buffers |
| Preview buffers | ~4% | Progressive image display |
Vision Model Overhead (1.5x multiplier)
| Component | Percentage | Purpose |
|---|---|---|
| Model weights | ~50% | Main GGUF file |
| mmproj weights | ~17% | Multimodal projector |
| KV cache | ~20% | Context window |
| Image embeddings | ~8% | CLIP encodings |
| Runtime buffers | ~5% | llama.cpp allocations |
These overhead estimates are conservative to prevent out-of-memory crashes. Actual memory usage may be slightly lower depending on model architecture and settings.
Troubleshooting Memory Issues
App Crashes on Model Load
Symptoms: App terminates immediately when loading a model
Causes:
- Model exceeds device RAM budget
- GPU allocation failure on low-RAM device
- Corrupted model file
Solutions:
- Check RAM budget: deviceRAM × 0.60 > estimatedModelRAM
- Disable GPU offloading (set layers to 0)
- Re-download model file
- Choose smaller model or lower quantization
App Runs Slowly
Symptoms: App becomes sluggish, inference is very slow
Causes:
- Model using 50-60% of RAM (memory pressure)
- System swapping memory to disk
- Thermal throttling
Solutions:
- Close background apps
- Reduce context length (2048 → 512)
- Choose smaller model
- Wait for device to cool down
"Cannot Load Model" Error
Symptoms: Red error message, load blocked
Causes:
- Estimated RAM usage exceeds 60% budget
- Another model already loaded
Solutions:
- Unload current model first
- Choose smaller model or lower quantization
- Close other apps to free system RAM
Vision Model Missing mmproj
Symptoms: Vision model fails to load, mmproj not found
Causes:
- mmproj file not downloaded
- mmproj file in wrong directory
Solutions:
- System automatically searches model directory for mmproj
- Re-download model (automatic mmproj download)
- Manually place mmproj file in same directory as model GGUF