Use vision models to analyze images alongside text prompts.

Setup

Vision models require a model file and a multimodal projector (mmproj):
1. Place model and projector

Put both files in your models directory:
models/
├── qwen-vl.gguf
└── qwen-vl-mmproj.gguf
The mmproj file must use the model's base name with a -mmproj suffix.
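The naming convention can be derived mechanically from the model filename with shell parameter expansion. A minimal sketch, using the filenames from the example tree above:

```shell
# Derive the expected projector filename from the model filename
# (the <model>-mmproj.gguf convention described above).
model="qwen-vl.gguf"
mmproj="${model%.gguf}-mmproj.gguf"   # strip .gguf, append -mmproj.gguf
echo "$mmproj"   # qwen-vl-mmproj.gguf
```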
2. Start the daemon

Launch with the vision model:
nrvnad qwen-vl.gguf ./workspace
The daemon auto-detects the mmproj file.

Basic Usage

Submit jobs with images using the --image flag:
wrk ./workspace "What's in this image?" --image photo.jpg

CLI Reference

From work.hpp:47, the submit method accepts image paths:
SubmitResult submit(const std::string& prompt,
                    const std::vector<std::filesystem::path>& imagePaths);
Jobs with images are automatically marked as JobType::Vision.

Batch Vision Processing

Process multiple images:
# Submit 100 image jobs
for img in photos/*.jpg; do
  wrk ./workspace "Caption this image" --image "$img" >> jobs.txt
done

# Check progress
ls ./workspace/output/ | wc -l

# Collect all results
for job in $(cat jobs.txt); do
  echo "=== $job ==="
  flw ./workspace $job
done
Vision processing is serialized with a mutex to prevent corruption. Jobs queue up and run sequentially, even with multiple workers.
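The progress check above can be extended into a completion summary by comparing the number of submitted job IDs in jobs.txt against the files in output/. A sketch under the batch layout shown above; dummy files stand in for a real run (the demo/ directory is hypothetical) so the snippet is self-contained:

```shell
# Simulate a batch run: 3 submitted job IDs, 2 finished outputs.
mkdir -p demo/output
printf 'job1\njob2\njob3\n' > demo/jobs.txt   # IDs captured from wrk
touch demo/output/job1 demo/output/job2        # results written so far

# Compare submitted vs completed counts (tr strips wc's padding on BSD).
submitted=$(wc -l < demo/jobs.txt | tr -d ' ')
completed=$(ls demo/output | wc -l | tr -d ' ')
echo "$completed/$submitted complete"   # 2/3 complete
```

Since vision jobs run sequentially, this ratio also gives a rough time-remaining estimate once a few jobs have finished.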

Multi-Image Jobs

Pass multiple images to a single job:
wrk ./workspace "Compare these images" \
  --image photo1.jpg \
  --image photo2.jpg \
  --image photo3.jpg
The model receives all images in context.
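For long image lists, the repeated --image flags can be built in a loop rather than typed by hand. A sketch assuming bash arrays and the --image flag shown above; echo performs a dry run instead of actually submitting (the photos/ fixtures are created only to make the snippet self-contained):

```shell
# Create sample inputs so the glob below matches something.
mkdir -p photos
touch photos/a.jpg photos/b.jpg

# Collect one --image flag per file into an argument array.
args=()
for img in photos/*.jpg; do
  args+=(--image "$img")
done

# Dry run: print the command that would be submitted.
echo wrk ./workspace "Compare these images" "${args[@]}"
```

Drop the echo to submit for real; quoting the array as "${args[@]}" keeps filenames with spaces intact.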

Vision + Text Workflows

Combine vision analysis with text processing:
# Analyze image
caption=$(wrk ./ws-vision "Describe this image" --image photo.jpg | xargs flw ./ws-vision)

# Use result in text task
wrk ./ws-text "Write a story about: $caption"
Route to different workspaces for specialized models:
nrvnad qwen-vl.gguf  ./ws-vision &
nrvnad llama-3.gguf  ./ws-text   &

Job Types

From work.hpp:15-19:
enum class JobType : uint8_t {
    Text = 0,
    Embed = 1,
    Vision = 2
};
Vision jobs are automatically assigned JobType::Vision when images are provided.

Configuration

Tune vision model performance:
# Large context for detailed analysis
export NRVNA_MAX_CTX=16384

# Fewer workers (vision is serialized)
export NRVNA_WORKERS=2

# GPU acceleration
export NRVNA_GPU_LAYERS=99

nrvnad qwen-vl.gguf ./workspace

Tips

  • mmproj auto-detection — name it <model>-mmproj.gguf
  • Serialized execution — vision jobs run one at a time
  • Image formats — supports JPEG, PNG, and other common formats
  • Context overhead — images consume significant context tokens
  • Workspace routing — dedicate a workspace to vision tasks
