Local GPT supports vision models that can analyze images embedded in your notes, letting you ask questions about diagrams, screenshots, photos, and other visual content.

Example: image analysis with the bakllava model on a MacBook Pro 13 (M1, 16 GB).

How It Works

When you select text containing image references, Local GPT:
  1. Detects image syntax in your selection
  2. Extracts the images and converts them to base64
  3. Sends both text and images to a vision-capable model
  4. Receives AI-generated analysis or descriptions
Vision support automatically activates when your selection contains images and you have a vision provider configured.

Supported Image Formats

  • PNG: .png files
  • JPEG: .jpg and .jpeg files

Image references are detected and extracted from your selection with the following logic:
```typescript
private async extractImagesFromSelection(
  selectedText: string,
): Promise<{ cleanedText: string; imagesInBase64: string[] }> {
  // Match Obsidian wiki-style embeds such as ![[diagram.png]]
  const regexp = /!\[\[(.+?\.(?:png|jpe?g))\]\]/gi;
  const fileNames = Array.from(
    selectedText.matchAll(regexp),
    (match) => match[1],
  );

  // Strip the embeds from the prompt text sent to the model
  const cleanedText = selectedText.replace(regexp, "");
  const imagesInBase64 = (
    await Promise.all<string>(
      fileNames.map((fileName) => this.readImageAsDataUrl(fileName)),
    )
  ).filter(Boolean);

  return { cleanedText, imagesInBase64 };
}
```

Setup

1. Install a Vision Model

For Ollama users, install a vision-capable model:

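For example, llava and bakllava (the model used in the demo above) are vision-capable models in the Ollama library; either can be pulled with the Ollama CLI:

```shell
# Pull a vision-capable model from the Ollama library
ollama pull bakllava

# or
ollama pull llava
```

Model availability can change; check the Ollama library for current names and sizes.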
2. Configure Vision Provider

  1. Open Local GPT Settings
  2. Find Vision Provider
  3. Select your vision model provider from the dropdown
Vision Provider Settings
```typescript
new Setting(containerEl)
  .setName(I18n.t("settings.visionProvider"))
  .setClass("local-gpt-ai-providers-select")
  .setDesc(I18n.t("settings.visionProviderDesc"))
  .addDropdown((dropdown) =>
    dropdown
      .addOptions(providers)
      .setValue(
        String(this.plugin.settings.aiProviders.vision),
      )
      .onChange(async (value) => {
        this.plugin.settings.aiProviders.vision = value;
        await this.plugin.saveSettings();
        await this.display();
      }),
  );
```

Image Reference Syntax

Use Obsidian’s standard image embedding syntax:

Wiki-Style Embed

![[screenshot.png]]

Multiple Images

Compare these diagrams:

![[diagram1.png]]
![[diagram2.jpg]]

What are the key differences?
You can include multiple images in a single selection. All images will be sent to the vision model.
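The embed-matching pattern shown earlier handles this case directly. A minimal standalone sketch (the selection text here is just sample data):

```typescript
// The same embed-matching pattern used by extractImagesFromSelection
const embedPattern = /!\[\[(.+?\.(?:png|jpe?g))\]\]/gi;

const selection = [
  "Compare these diagrams:",
  "",
  "![[diagram1.png]]",
  "![[diagram2.jpg]]",
  "",
  "What are the key differences?",
].join("\n");

// Collect every embedded file name in order of appearance
const fileNames = Array.from(selection.matchAll(embedPattern), (m) => m[1]);
console.log(fileNames); // [ 'diagram1.png', 'diagram2.jpg' ]
```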

Example Use Cases

Screenshot Analysis

![[ui-screenshot.png]]

Describe the user interface elements in this screenshot.

Select both the image and your prompt, then run an action like “General Help”.

Architecture Diagrams

![[system-architecture.png]]

Explain this system architecture diagram in simple terms.

Before/After Comparison

Before: ![[before.jpg]]
After: ![[after.jpg]]

What changed between these two images?

Handwriting Transcription

![[handwritten-notes.jpg]]

Transcribe the handwritten text in this image.

Object Identification

![[photo.png]]

List all the objects visible in this photo.

How Images Are Processed

Local GPT converts images to base64-encoded data URLs for transmission:
```typescript
private async readImageAsDataUrl(fileName: string): Promise<string> {
  // Resolve the wiki link relative to the active note
  const filePath = this.app.metadataCache.getFirstLinkpathDest(
    fileName,
    this.app.workspace.getActiveFile().path,
  );

  if (!filePath) {
    return "";
  }

  return this.app.vault.adapter
    .readBinary(filePath.path)
    .then((buffer) => {
      const extension = filePath.extension.toLowerCase();
      // Normalize .jpg to the standard image/jpeg MIME type
      const mimeType = extension === "jpg" ? "jpeg" : extension;
      const blob = new Blob([buffer], {
        type: `image/${mimeType}`,
      });
      return new Promise<string>((resolve) => {
        const reader = new FileReader();
        reader.onloadend = () => resolve(reader.result as string);
        reader.readAsDataURL(blob);
      });
    });
}
```
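Outside Obsidian, the same encoding can be sketched in Node, where `Buffer` stands in for the browser-only `FileReader` (the bytes below are just the PNG magic-number prefix, used as sample data):

```typescript
// Encode raw image bytes as a base64 data URL, Node-style
const bytes = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // PNG magic bytes
const base64 = Buffer.from(bytes).toString("base64");
const dataUrl = `data:image/png;base64,${base64}`;
console.log(dataUrl); // data:image/png;base64,iVBORw==
```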

Provider Selection Logic

When images are detected, Local GPT automatically switches to your vision provider:
```typescript
private selectProvider(
  aiProviders: IAIProvidersService,
  hasImages: boolean,
  overrideProviderId?: string | null,
): IAIProvider {
  // Prefer the configured vision provider when images are present
  const visionCandidate = hasImages
    ? aiProviders.providers.find(
        (p: IAIProvider) =>
          p.id === this.settings.aiProviders.vision,
      )
    : undefined;
  const preferredProviderId =
    overrideProviderId || this.settings.aiProviders.main;
  const fallback = aiProviders.providers.find(
    (p) => p.id === preferredProviderId,
  );

  const provider = visionCandidate || fallback;
  if (!provider) {
    throw new Error("No AI provider found");
  }
  return provider;
}
```
If images are present in your selection, the vision provider takes precedence over your main provider.
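The precedence rule can be sketched with plain objects (the provider ids below are made up for illustration):

```typescript
interface Provider {
  id: string;
}

const providers: Provider[] = [{ id: "main-llm" }, { id: "llava" }];
const settings = { main: "main-llm", vision: "llava" };

// Vision provider wins when images are present; otherwise fall back to main
function pick(hasImages: boolean): Provider {
  const vision = hasImages
    ? providers.find((p) => p.id === settings.vision)
    : undefined;
  const fallback = providers.find((p) => p.id === settings.main);
  const provider = vision ?? fallback;
  if (!provider) throw new Error("No AI provider found");
  return provider;
}

console.log(pick(true).id); // "llava"
console.log(pick(false).id); // "main-llm"
```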

Performance Considerations

Vision models are computationally intensive. Processing large images or multiple images may take longer than text-only requests.
For faster results, resize large images before embedding them in your notes.
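For instance, with ImageMagick (assuming it is installed; the trailing `>` only shrinks images that exceed the target size):

```shell
# Shrink an image to fit within 1280x1280, preserving aspect ratio
magick screenshot.png -resize "1280x1280>" screenshot-small.png
```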

Combining Vision with RAG

You can combine vision support with Enhanced Actions (RAG):
Based on [[Project Context]] and this mockup:

![[ui-mockup.png]]

What improvements should we prioritize?
This will:
  1. Process the image with the vision model
  2. Retrieve context from “Project Context” using RAG
  3. Generate a response informed by both the visual and textual context
```typescript
const { cleanedText, imagesInBase64 } =
  await this.extractImagesFromSelection(selectedTextRef.value);
selectedTextRef.value = cleanedText;

const context = await this.enhanceWithContext(
  cleanedText,
  aiProviders,
  embeddingProvider,
  abortController,
  params.selectedFiles,
);
```
```typescript
const provider = this.selectProvider(
  aiProviders,
  imagesInBase64.length > 0,
  params.overrideProviderId,
);

const fullText = await this.executeProviderRequest(
  aiProviders,
  provider,
  params,
  cleanedText,
  context,
  imagesInBase64, // Images sent to provider
  abortController,
  onUpdate,
);
```

Troubleshooting

Images not detected:
  • Verify your vision provider is configured in settings
  • Check that images use the correct syntax: ![[image.png]]
  • Ensure image files exist in your vault
  • Confirm image format is PNG or JPEG

Slow responses:
  • Vision models require more compute resources
  • Consider using smaller/optimized models
  • Reduce image file sizes
  • Process fewer images at once

Model errors:
  • Ensure your vision model is properly installed
  • Check that the provider service is running
  • Verify the model supports image inputs
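If you are running Ollama, a quick check (using the llava model name from the setup step) is to confirm the model is installed and supports images:

```shell
# List installed models
ollama list

# Recent Ollama versions include a capabilities section in the
# model details; vision-capable models report image support there
ollama show llava
```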

Next Steps

Community Actions

Browse and install community-contributed actions

Enhanced Actions

Learn about RAG for context-aware responses
