Use vision models to analyze images alongside text prompts.

Setup

Vision models require a model file and a multimodal projector (mmproj):
1. Place model and projector

Put both files in your models directory:
models/
├── qwen-vl.gguf
└── qwen-vl-mmproj.gguf
The mmproj file must use the model's base name with a -mmproj suffix.
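The naming convention can be derived mechanically from the model filename with shell parameter expansion. A minimal sketch, using the filenames from the example tree above:

```shell
# Derive the expected projector filename from the model filename
# (the <model>-mmproj.gguf convention described above).
model="qwen-vl.gguf"
mmproj="${model%.gguf}-mmproj.gguf"   # strip .gguf, append -mmproj.gguf
echo "$mmproj"   # qwen-vl-mmproj.gguf
```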
2. Start the daemon

Launch with the vision model:
nrvnad qwen-vl.gguf ./workspace
The daemon auto-detects the mmproj file.

Basic Usage

Submit jobs with images using the --image flag:
wrk ./workspace "What's in this image?" --image photo.jpg

CLI Reference

From work.hpp:47, the submit method accepts image paths:
SubmitResult submit(const std::string& prompt,
                    const std::vector<std::filesystem::path>& imagePaths);
Jobs with images are automatically marked as JobType::Vision.

Batch Vision Processing

Process multiple images:
# Submit 100 image jobs
for img in photos/*.jpg; do
  wrk ./workspace "Caption this image" --image "$img" >> jobs.txt
done

# Check progress
ls ./workspace/output/ | wc -l

# Collect all results
for job in $(cat jobs.txt); do
  echo "=== $job ==="
  flw ./workspace $job
done
Vision processing is serialized with a mutex to prevent corruption. Jobs queue up and run sequentially, even with multiple workers.
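The progress check above can be extended into a completion summary by comparing the number of submitted job IDs in jobs.txt against the files in output/. A sketch under the batch layout shown above; dummy files stand in for a real run (the demo/ directory is hypothetical) so the snippet is self-contained:

```shell
# Simulate a batch run: 3 submitted job IDs, 2 finished outputs.
mkdir -p demo/output
printf 'job1\njob2\njob3\n' > demo/jobs.txt   # IDs captured from wrk
touch demo/output/job1 demo/output/job2        # results written so far

# Compare submitted vs completed counts (tr strips wc's padding on BSD).
submitted=$(wc -l < demo/jobs.txt | tr -d ' ')
completed=$(ls demo/output | wc -l | tr -d ' ')
echo "$completed/$submitted complete"   # 2/3 complete
```

Since vision jobs run sequentially, this ratio also gives a rough time-remaining estimate once a few jobs have finished.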

Multi-Image Jobs

Pass multiple images to a single job:
wrk ./workspace "Compare these images" \
  --image photo1.jpg \
  --image photo2.jpg \
  --image photo3.jpg
The model receives all images in context.
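For long image lists, the repeated --image flags can be built in a loop rather than typed by hand. A sketch assuming bash arrays and the --image flag shown above; echo performs a dry run instead of actually submitting (the photos/ fixtures are created only to make the snippet self-contained):

```shell
# Create sample inputs so the glob below matches something.
mkdir -p photos
touch photos/a.jpg photos/b.jpg

# Collect one --image flag per file into an argument array.
args=()
for img in photos/*.jpg; do
  args+=(--image "$img")
done

# Dry run: print the command that would be submitted.
echo wrk ./workspace "Compare these images" "${args[@]}"
```

Drop the echo to submit for real; quoting the array as "${args[@]}" keeps filenames with spaces intact.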

Vision + Text Workflows

Combine vision analysis with text processing:
# Analyze image
caption=$(wrk ./ws-vision "Describe this image" --image photo.jpg | xargs flw ./ws-vision)

# Use result in text task
wrk ./ws-text "Write a story about: $caption"
Route to different workspaces for specialized models:
nrvnad qwen-vl.gguf  ./ws-vision &
nrvnad llama-3.gguf  ./ws-text   &

Job Types

From work.hpp:15-19:
enum class JobType : uint8_t {
    Text = 0,
    Embed = 1,
    Vision = 2
};
Vision jobs are automatically assigned JobType::Vision when images are provided.

Configuration

Tune vision model performance:
# Large context for detailed analysis
export NRVNA_MAX_CTX=16384

# Fewer workers (vision is serialized)
export NRVNA_WORKERS=2

# GPU acceleration
export NRVNA_GPU_LAYERS=99

nrvnad qwen-vl.gguf ./workspace

Tips

  • mmproj auto-detection — name it <model>-mmproj.gguf
  • Serialized execution — vision jobs run one at a time
  • Image formats — supports JPEG, PNG, and other common formats
  • Context overhead — images consume significant context tokens
  • Workspace routing — dedicate a workspace to vision tasks
