Local GPT supports vision models that can analyze images embedded in your notes. This allows you to ask questions about diagrams, screenshots, photos, and other visual content.
Example: Image analysis with bakllava model on MacBook Pro 13, M1, 16GB
How It Works
When you select text containing image references, Local GPT:
1. Detects image syntax in your selection
2. Extracts the images and converts them to base64
3. Sends both text and images to a vision-capable model
4. Receives AI-generated analysis or descriptions
Vision support automatically activates when your selection contains images and you have a vision provider configured.
private async extractImagesFromSelection(
  selectedText: string,
): Promise<{ cleanedText: string; imagesInBase64: string[] }> {
  // Match wiki-style embeds of PNG/JPG/JPEG files: ![[name.png]]
  const regexp = /!\[\[(.+?\.(?:png|jpe?g))\]\]/gi;
  const fileNames = Array.from(
    selectedText.matchAll(regexp),
    (match) => match[1],
  );
  // Strip the embeds so only the textual prompt remains.
  const cleanedText = selectedText.replace(regexp, "");
  const imagesInBase64 = (
    await Promise.all<string>(
      fileNames.map((fileName) => this.readImageAsDataUrl(fileName)),
    )
  ).filter(Boolean); // drop images that failed to load
  return { cleanedText, imagesInBase64 };
}
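As a standalone sketch, the extraction regex above behaves like this (the sample note text and logging are illustrative, not part of the plugin):

```typescript
// Standalone demo of the extraction regex used above; the sample text is made up.
const regexp = /!\[\[(.+?\.(?:png|jpe?g))\]\]/gi;
const selectedText = "Compare ![[diagram1.png]] and ![[diagram2.jpg]]";

// Collect the embedded file names (capture group 1 of each match).
const fileNames = Array.from(selectedText.matchAll(regexp), (m) => m[1]);
// Strip the embeds so only the textual prompt is sent as text.
const cleanedText = selectedText.replace(regexp, "").trim();

console.log(fileNames); // -> [ 'diagram1.png', 'diagram2.jpg' ]
console.log(cleanedText); // -> "Compare  and" (embeds removed)
```

Note that the regex only captures files ending in .png, .jpg, or .jpeg; other formats are left in the text untouched.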
Setup
1. Install a Vision Model
For Ollama users, install a vision-capable model, for example with ollama pull bakllava:
bakllava (Recommended): A capable vision model that works well on consumer hardware.
llava: Another popular vision model option.
2. Configure the Vision Provider
Open Local GPT settings, find Vision Provider, and select your vision model provider from the dropdown.
// Settings UI: dropdown for choosing the vision provider.
new Setting(containerEl)
  .setName(I18n.t("settings.visionProvider"))
  .setClass("local-gpt-ai-providers-select")
  .setDesc(I18n.t("settings.visionProviderDesc"))
  .addDropdown((dropdown) =>
    dropdown
      .addOptions(providers)
      .setValue(String(this.plugin.settings.aiProviders.vision))
      .onChange(async (value) => {
        this.plugin.settings.aiProviders.vision = value;
        await this.plugin.saveSettings();
        await this.display(); // re-render settings with the new value
      }),
  );
Image Reference Syntax
Use Obsidian’s standard image embedding syntax:
Wiki-Style Embed
![[image.png]]
Multiple Images
Compare these diagrams:
![[diagram1.png]]
![[diagram2.jpg]]
What are the key differences?
You can include multiple images in a single selection. All images will be sent to the vision model.
Example Use Cases
![[ui-screenshot.png]]
Describe the user interface elements in this screenshot.
Select both the image and your prompt, then run an action like “General Help”.
![[system-architecture.png]]
Explain this system architecture diagram in simple terms.
Before: ![[before.jpg]]
After: ![[after.jpg]]
What changed between these two images?
![[photo.png]]
List all the objects visible in this photo.
How Images Are Processed
Local GPT converts images to base64-encoded data URLs for transmission:
private async readImageAsDataUrl(fileName: string): Promise<string> {
  // Resolve the embed's file name relative to the active note.
  const filePath = this.app.metadataCache.getFirstLinkpathDest(
    fileName,
    this.app.workspace.getActiveFile().path,
  );
  if (!filePath) {
    return ""; // missing files are dropped by the caller's filter(Boolean)
  }
  const buffer = await this.app.vault.adapter.readBinary(filePath.path);
  const extension = filePath.extension.toLowerCase();
  // Data URLs use the "jpeg" MIME subtype for .jpg files.
  const mimeType = extension === "jpg" ? "jpeg" : extension;
  const blob = new Blob([buffer], { type: `image/${mimeType}` });
  // FileReader encodes the blob as a base64 data URL.
  return new Promise<string>((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result as string);
    reader.readAsDataURL(blob);
  });
}
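Outside Obsidian's FileReader-based flow, the same encoding can be sketched with plain Node buffers. Here toDataUrl and the sample bytes are illustrative stand-ins, not plugin API:

```typescript
// Illustrative Node-only sketch of base64 data-URL encoding; not the plugin's actual helper.
function toDataUrl(buffer: Uint8Array, extension: string): string {
  // Same normalization as above: ".jpg" files are labeled image/jpeg.
  const mimeType = extension === "jpg" ? "jpeg" : extension;
  const base64 = Buffer.from(buffer).toString("base64");
  return `data:image/${mimeType};base64,${base64}`;
}

const fakePng = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // first four PNG magic bytes
console.log(toDataUrl(fakePng, "png")); // -> data:image/png;base64,iVBORw==
```

The resulting string can be passed anywhere an image URL is accepted, which is why the plugin uses this representation for provider requests.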
Provider Selection Logic
When images are detected, Local GPT automatically switches to your vision provider:
private selectProvider(
  aiProviders: IAIProvidersService,
  hasImages: boolean,
  overrideProviderId?: string | null,
): IAIProvider {
  // Prefer the configured vision provider when the selection contains images.
  const visionCandidate = hasImages
    ? aiProviders.providers.find(
        (p: IAIProvider) => p.id === this.settings.aiProviders.vision,
      )
    : undefined;
  // Otherwise fall back to an explicit override or the main provider.
  const preferredProviderId =
    overrideProviderId || this.settings.aiProviders.main;
  const fallback = aiProviders.providers.find(
    (p) => p.id === preferredProviderId,
  );
  const provider = visionCandidate || fallback;
  if (!provider) {
    throw new Error("No AI provider found");
  }
  return provider;
}
If images are present in your selection, the vision provider takes precedence over your main provider.
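The precedence rule reduces to a few lines. In this sketch, the Provider type and the pick helper are simplified stand-ins, not the plugin's real types:

```typescript
// Simplified stand-in for the selection logic above; types and names are illustrative.
interface Provider {
  id: string;
}

function pick(
  providers: Provider[],
  hasImages: boolean,
  visionId: string,
  mainId: string,
): Provider | undefined {
  // The vision provider wins only when the selection actually contains images.
  const vision = hasImages ? providers.find((p) => p.id === visionId) : undefined;
  return vision ?? providers.find((p) => p.id === mainId);
}

const providers = [{ id: "main-model" }, { id: "vision-model" }];
console.log(pick(providers, true, "vision-model", "main-model")?.id); // -> vision-model
console.log(pick(providers, false, "vision-model", "main-model")?.id); // -> main-model
```

Keeping the fallback to the main provider means text-only requests are unaffected by your vision configuration.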
Vision models are computationally intensive. Processing large images or multiple images may take longer than text-only requests.
For faster results, resize large images before embedding them in your notes.
Combining Vision with RAG
You can combine vision support with Enhanced Actions (RAG):
Based on [[Project Context]] and this mockup:
![[ui-mockup.png]]
What improvements should we prioritize?
This will:
1. Process the image with the vision model
2. Retrieve context from “Project Context” using RAG
3. Generate a response informed by both the visual and textual context
const { cleanedText, imagesInBase64 } =
  await this.extractImagesFromSelection(selectedTextRef.value);
selectedTextRef.value = cleanedText;

// Retrieve RAG context from linked notes (Enhanced Actions).
const context = await this.enhanceWithContext(
  cleanedText,
  aiProviders,
  embeddingProvider,
  abortController,
  params.selectedFiles,
);

// Switch to the vision provider when images were extracted.
const provider = this.selectProvider(
  aiProviders,
  imagesInBase64.length > 0,
  params.overrideProviderId,
);

const fullText = await this.executeProviderRequest(
  aiProviders,
  provider,
  params,
  cleanedText,
  context,
  imagesInBase64, // Images sent to provider
  abortController,
  onUpdate,
);
Troubleshooting
Images not being processed
Verify your vision provider is configured in settings
Check that images use the correct syntax: ![[image.png]]
Ensure image files exist in your vault
Confirm image format is PNG or JPEG
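To check whether a given file name would be picked up at all, the extension rule can be tested in isolation (isSupported is an illustrative helper, not part of the plugin):

```typescript
// Illustrative helper mirroring the plugin's png/jpg/jpeg extension rule.
const isSupported = (name: string): boolean => /\.(?:png|jpe?g)$/i.test(name);

console.log(isSupported("photo.PNG")); // -> true (matching is case-insensitive)
console.log(isSupported("photo.webp")); // -> false (unsupported format)
```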
Processing is slow
Vision models require more compute resources:
Consider using smaller/optimized models
Reduce image file sizes
Process fewer images at once
Model errors or empty responses
Ensure your vision model is properly installed
Check that the provider service is running
Verify the model supports image inputs
Next Steps
Community Actions: Browse and install community-contributed actions
Enhanced Actions: Learn about RAG for context-aware responses