
Overview

Vision AI brings multimodal understanding to Off Grid. Point your camera at anything — receipts, documents, photos, scenes — and ask questions. The model sees what you see and responds with detailed analysis, all processed entirely on your device.

What Can It Do?

  • Document Analysis — Extract text from receipts, invoices, forms, and contracts
  • Scene Description — Describe photos and images in detail
  • Visual Q&A — Ask questions about what’s in an image
  • OCR — Read text from images, signs, and screenshots
  • Object Recognition — Identify objects, animals, plants, and more
  • Multilingual — Read and understand text in multiple languages (Qwen models)

Supported Models

Off Grid supports vision-language models (VLMs) that combine image understanding with text generation:

SmolVLM (500M, 2.2B)

  • Fastest — 7-10s inference on flagship devices
  • Compact — 500M model is ~600MB total (including mmproj)
  • Best for: Quick document scans, receipt extraction, general scene description

Qwen3-VL (2B, 8B)

  • Thinking mode — Shows reasoning process before answering
  • Multilingual — Excellent understanding of non-English text
  • Best for: Complex visual reasoning, multilingual documents, detailed analysis

Gemma 3n E4B

  • Multimodal — Vision + audio support
  • Mobile-optimized — Selective activation for efficiency
  • Best for: Advanced use cases requiring multiple modalities

LLaVA

  • General-purpose — Large Language and Vision Assistant
  • Best for: Detailed image descriptions and visual conversations

MiniCPM-V

  • Efficient — Balanced speed and quality
  • Best for: Resource-constrained devices

All vision models automatically download an mmproj (multimodal projector) companion file. This file is required for vision capabilities and is typically 200-400MB. Off Grid handles this automatically — you’ll see combined progress during download.
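As an illustration, combined progress across the two files can be computed with a simple weighted sum. This is a sketch only; the interface and function names below are not Off Grid's actual API.

```typescript
// Illustrative sketch: combined download progress across the model
// file and its mmproj companion (names are hypothetical).
interface DownloadPart {
  totalBytes: number;
  receivedBytes: number;
}

function combinedProgress(parts: DownloadPart[]): number {
  const total = parts.reduce((sum, p) => sum + p.totalBytes, 0);
  const received = parts.reduce((sum, p) => sum + p.receivedBytes, 0);
  return total === 0 ? 0 : received / total;
}

// Example: a 600MB model plus a 300MB mmproj, half of each downloaded
const progress = combinedProgress([
  { totalBytes: 600_000_000, receivedBytes: 300_000_000 },
  { totalBytes: 300_000_000, receivedBytes: 150_000_000 },
]);
// progress === 0.5
```

Weighting by bytes rather than averaging the two percentages keeps the bar honest when the model file is much larger than the mmproj.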

How to Use

1. Download a Vision Model

  1. Open Models → Text Models
  2. Filter by Model Type → Vision
  3. Select a vision model (SmolVLM-500M recommended for first-time users)
  4. Tap Download and wait for both the model and mmproj to download

2. Load the Model

  1. Open a conversation
  2. Tap the model selector at the top
  3. Select your vision model
  4. Wait for it to load (takes 10-20 seconds for vision models)

3. Add an Image

To add an image:
  1. Tap the camera icon in the chat input
  2. Take a photo of what you want to analyze
  3. Review the photo and tap Use Photo
  4. Type your question about the image
  5. Send

4. Get Analysis

The model processes the image and your question, then streams the response in real-time. For complex questions, you may see a thinking indicator (on models that support it) before the final answer.

Example Use Cases

Receipt Extraction

Photo: Receipt from a restaurant
Prompt: “Extract all line items and the total amount.”
Response: The model lists each item, price, and calculates the total.

Document Q&A

Photo: Screenshot of a contract
Prompt: “What is the cancellation policy?”
Response: The model reads the document and extracts the relevant clause.

Scene Description

Photo: Landscape photo
Prompt: “Describe this scene in detail.”
Response: The model describes the setting, objects, colors, mood, and composition.

Visual Reasoning

Photo: Math problem on a whiteboard
Prompt: “Solve this equation step by step.”
Response: The model reads the equation, shows its reasoning (thinking mode), and provides the solution.

Performance

Inference Times

Model | Device Class | Inference Time
--- | --- | ---
SmolVLM 500M | Flagship | ~7s
SmolVLM 500M | Mid-range | ~15s
SmolVLM 2.2B | Flagship | ~10-15s
SmolVLM 2.2B | Mid-range | ~25-35s
Qwen3-VL 2B | Flagship | ~10-20s
Qwen3-VL 8B | Flagship (8GB+ RAM) | ~30-60s

Vision inference is slower than text-only generation because the model must process both the image and your question. The first inference after loading a vision model may take longer due to CLIP warmup.

Factors Affecting Speed

  • Model size — Larger models (2B+) are slower but more accurate
  • Image resolution — Higher resolution images take longer to process
  • Question complexity — Complex reasoning takes more time
  • Device RAM — More RAM allows faster processing

Technical Details

How It Works

Vision models use a multimodal projector (mmproj) to bridge image and text understanding:
  1. Image encoding — Your photo is processed by a vision encoder (CLIP)
  2. Projection — Visual features are projected into the language model’s embedding space
  3. Joint reasoning — The LLM reasons about both the image and your text prompt
  4. Response generation — Streams the answer just like text-only models
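The four stages above can be sketched as function composition. Everything below is a stand-in for illustration, not the real CLIP or llama.cpp implementation:

```typescript
// Hypothetical sketch of the vision pipeline; all functions are stubs.
type Embedding = number[];

// 1. Vision encoder (CLIP) turns image pixels into feature vectors.
// A real encoder returns hundreds of patch features; two stand in here.
function encodeImage(imagePath: string): Embedding[] {
  return [[0.1, 0.2], [0.3, 0.4]];
}

// 2. mmproj projects visual features into the LLM's embedding space
// (a learned linear/MLP mapping; a toy transform stands in here).
function projectToTextSpace(features: Embedding[]): Embedding[] {
  return features.map((f) => f.map((x) => x * 2));
}

// 3-4. The LLM reasons jointly over image embeddings and the prompt,
// then streams text (a canned string stands in here).
function generate(imageEmbeds: Embedding[], prompt: string): string {
  return `[${imageEmbeds.length} image tokens] response to: ${prompt}`;
}

function answerAboutImage(imagePath: string, prompt: string): string {
  const features = encodeImage(imagePath);
  const imageEmbeds = projectToTextSpace(features);
  return generate(imageEmbeds, prompt);
}
```

The key design point is stage 2: the projector is what lets an off-the-shelf vision encoder and an off-the-shelf language model share one embedding space, which is why the mmproj file is mandatory.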

mmproj Files

  • Automatically downloaded — Off Grid detects when a model needs mmproj and downloads it
  • Combined tracking — Model size estimates include mmproj overhead
  • Runtime discovery — If mmproj wasn’t linked during download, Off Grid searches the model directory on load
  • Storage location — Stored alongside the model file

OpenAI-Compatible Format

Off Grid uses llama.rn’s OpenAI-compatible message format for vision:
```json
{
  "role": "user",
  "content": [
    { "type": "image_url", "image_url": { "url": "file:///path/to/image.jpg" } },
    { "type": "text", "text": "What's in this image?" }
  ]
}
```

This allows seamless integration with llama.cpp’s multimodal inference.
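For illustration, a message in this shape can be built programmatically. The types below mirror the JSON structure shown in this section; the builder function itself is hypothetical, not part of any API:

```typescript
// Build an OpenAI-compatible multimodal message (shape mirrors the
// JSON above; visionMessage is an illustrative helper, not a real API).
type ContentPart =
  | { type: "image_url"; image_url: { url: string } }
  | { type: "text"; text: string };

interface ChatMessage {
  role: "user" | "assistant" | "system";
  content: ContentPart[];
}

function visionMessage(imagePath: string, question: string): ChatMessage {
  return {
    role: "user",
    content: [
      // Image first, then the question, as in the example above
      { type: "image_url", image_url: { url: `file://${imagePath}` } },
      { type: "text", text: question },
    ],
  };
}

const msg = visionMessage("/path/to/image.jpg", "What's in this image?");
// msg.content[0].image_url.url === "file:///path/to/image.jpg"
```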

GPU Acceleration

iOS:
  • CLIP GPU acceleration is enabled by default on devices with >4GB RAM
  • Disabled automatically on ≤4GB RAM devices to prevent crashes
Android:
  • OpenCL GPU offloading available (experimental)
  • Configure via GPU layers setting

On devices with ≤4GB RAM (iPhone XS, older Androids), CLIP GPU is disabled to prevent abort() crashes during Metal buffer allocation. Vision models still work, just slower.
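The RAM gate described above amounts to a simple threshold check. A sketch, with illustrative names (the 4GB cutoff comes from this section):

```typescript
// Sketch of the CLIP GPU gating rule: enabled only above 4GB of RAM.
const CLIP_GPU_MIN_RAM_BYTES = 4 * 1024 ** 3; // 4GB

function shouldEnableClipGpu(deviceRamBytes: number): boolean {
  // ≤4GB devices stay on CPU to avoid abort() during Metal buffer allocation
  return deviceRamBytes > CLIP_GPU_MIN_RAM_BYTES;
}

shouldEnableClipGpu(6 * 1024 ** 3); // true  (6GB flagship)
shouldEnableClipGpu(4 * 1024 ** 3); // false (4GB device, e.g. iPhone XS class)
```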

Tips

Getting the Best Results

  1. Clear, well-lit photos — Better image quality = better understanding
  2. Specific questions — Ask exactly what you want to know
  3. One question at a time — Break complex queries into steps
  4. Use thinking mode — Qwen models show reasoning for complex questions

Choosing the Right Model

  • Speed priority: SmolVLM-500M
  • Quality priority: Qwen3-VL 2B or SmolVLM-2.2B
  • Multilingual: Qwen3-VL
  • Advanced reasoning: Qwen3-VL (thinking mode)

Memory Management

Vision models require more RAM than text-only models:
  • RAM estimate = (model file size + mmproj size) × 1.5
  • SmolVLM-500M + mmproj ≈ 900MB RAM required
  • Qwen3-VL 2B + mmproj ≈ 3.5GB RAM required
Off Grid checks available RAM before loading and warns if insufficient.
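The estimate above can be written out as a small check. The ×1.5 multiplier comes from this section; the function names are illustrative, and the 400MB/200MB split for SmolVLM-500M is an assumed example consistent with the ~600MB total quoted earlier:

```typescript
// Sketch of the RAM heuristic: (model file + mmproj) × 1.5,
// plus the pre-load sufficiency check (names are hypothetical).
const RAM_MULTIPLIER = 1.5;

function estimateRamBytes(modelBytes: number, mmprojBytes: number): number {
  return (modelBytes + mmprojBytes) * RAM_MULTIPLIER;
}

function canLoad(
  modelBytes: number,
  mmprojBytes: number,
  availableRamBytes: number
): boolean {
  return estimateRamBytes(modelBytes, mmprojBytes) <= availableRamBytes;
}

// Assumed split: ~400MB model + ~200MB mmproj → ~900MB estimated RAM
const estimate = estimateRamBytes(400_000_000, 200_000_000);
// estimate === 900_000_000
```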

Troubleshooting

Vision inference is very slow:
  • Try a smaller model (SmolVLM-500M)
  • Reduce image resolution before sending
  • Ensure you’re on a flagship device for best performance
Model fails to load:
  • Check if mmproj downloaded correctly (visible in download progress)
  • Verify sufficient RAM (see Settings → Device Info)
  • Try unloading the current model first
Inference hangs after prompt enhancement:
  • This is a known issue — Off Grid explicitly resets LLM state after enhancement
  • If it still hangs, disable prompt enhancement in image generation settings

Privacy

All vision inference happens 100% on-device:
  • Your photos never leave your device
  • No cloud processing
  • No data collection
  • Works completely offline (after model download)
You can enable airplane mode and use vision models indefinitely.
