Local GPT supports vision models that can analyze images embedded in your notes. This allows you to ask questions about diagrams, screenshots, photos, and other visual content.
Example: Image analysis with bakllava model on MacBook Pro 13, M1, 16GB
How It Works
When you select text containing image references, Local GPT:
1. Detects image syntax in your selection
2. Extracts the images and converts them to base64
3. Sends both text and images to a vision-capable model
4. Receives AI-generated analysis or descriptions
Vision support automatically activates when your selection contains images and you have a vision provider configured.
private async extractImagesFromSelection(
  selectedText: string,
): Promise<{ cleanedText: string; imagesInBase64: string[] }> {
  // Match wiki-style embeds of PNG/JPG/JPEG files: ![[name.png]]
  const regexp = /!\[\[(.+?\.(?:png|jpe?g))\]\]/gi;
  const fileNames = Array.from(
    selectedText.matchAll(regexp),
    (match) => match[1],
  );
  // Strip the embeds so only the textual prompt remains.
  const cleanedText = selectedText.replace(regexp, "");
  const imagesInBase64 = (
    await Promise.all<string>(
      fileNames.map((fileName) => this.readImageAsDataUrl(fileName)),
    )
  ).filter(Boolean); // drop images that failed to load
  return { cleanedText, imagesInBase64 };
}
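As a standalone sketch, the extraction regex above behaves like this (the sample note text and logging are illustrative, not part of the plugin):

```typescript
// Standalone demo of the extraction regex used above; the sample text is made up.
const regexp = /!\[\[(.+?\.(?:png|jpe?g))\]\]/gi;
const selectedText = "Compare ![[diagram1.png]] and ![[diagram2.jpg]]";

// Collect the embedded file names (capture group 1 of each match).
const fileNames = Array.from(selectedText.matchAll(regexp), (m) => m[1]);
// Strip the embeds so only the textual prompt is sent as text.
const cleanedText = selectedText.replace(regexp, "").trim();

console.log(fileNames); // -> [ 'diagram1.png', 'diagram2.jpg' ]
console.log(cleanedText); // -> "Compare  and" (embeds removed)
```

Note that the regex only captures files ending in .png, .jpg, or .jpeg; other formats are left in the text untouched.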
Setup
1. Install a Vision Model
For Ollama users, install a vision-capable model, for example with ollama pull bakllava:
bakllava (Recommended): A capable vision model that works well on consumer hardware.
llava: Another popular vision model option.
2. Configure the Vision Provider
Open Local GPT settings, find Vision Provider, and select your vision model provider from the dropdown.
// Settings UI: dropdown for choosing the vision provider.
new Setting(containerEl)
  .setName(I18n.t("settings.visionProvider"))
  .setClass("local-gpt-ai-providers-select")
  .setDesc(I18n.t("settings.visionProviderDesc"))
  .addDropdown((dropdown) =>
    dropdown
      .addOptions(providers)
      .setValue(String(this.plugin.settings.aiProviders.vision))
      .onChange(async (value) => {
        this.plugin.settings.aiProviders.vision = value;
        await this.plugin.saveSettings();
        await this.display(); // re-render settings with the new value
      }),
  );
Image Reference Syntax
Use Obsidian’s standard image embedding syntax:
Wiki-Style Embed
![[image.png]]
Multiple Images
Compare these diagrams:
![[diagram1.png]]
![[diagram2.jpg]]
What are the key differences?
You can include multiple images in a single selection. All images will be sent to the vision model.
Example Use Cases
![[ui-screenshot.png]]
Describe the user interface elements in this screenshot.
Select both the image and your prompt, then run an action like “General Help”.
![[system-architecture.png]]
Explain this system architecture diagram in simple terms.
Before: ![[before.jpg]]
After: ![[after.jpg]]
What changed between these two images?
![[photo.png]]
List all the objects visible in this photo.
How Images Are Processed
Local GPT converts images to base64-encoded data URLs for transmission:
private async readImageAsDataUrl(fileName: string): Promise<string> {
  // Resolve the embed's file name relative to the active note.
  const filePath = this.app.metadataCache.getFirstLinkpathDest(
    fileName,
    this.app.workspace.getActiveFile().path,
  );
  if (!filePath) {
    return ""; // missing files are dropped by the caller's filter(Boolean)
  }
  const buffer = await this.app.vault.adapter.readBinary(filePath.path);
  const extension = filePath.extension.toLowerCase();
  // Data URLs use the "jpeg" MIME subtype for .jpg files.
  const mimeType = extension === "jpg" ? "jpeg" : extension;
  const blob = new Blob([buffer], { type: `image/${mimeType}` });
  // FileReader encodes the blob as a base64 data URL.
  return new Promise<string>((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result as string);
    reader.readAsDataURL(blob);
  });
}
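Outside Obsidian's FileReader-based flow, the same encoding can be sketched with plain Node buffers. Here toDataUrl and the sample bytes are illustrative stand-ins, not plugin API:

```typescript
// Illustrative Node-only sketch of base64 data-URL encoding; not the plugin's actual helper.
function toDataUrl(buffer: Uint8Array, extension: string): string {
  // Same normalization as above: ".jpg" files are labeled image/jpeg.
  const mimeType = extension === "jpg" ? "jpeg" : extension;
  const base64 = Buffer.from(buffer).toString("base64");
  return `data:image/${mimeType};base64,${base64}`;
}

const fakePng = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // first four PNG magic bytes
console.log(toDataUrl(fakePng, "png")); // -> data:image/png;base64,iVBORw==
```

The resulting string can be passed anywhere an image URL is accepted, which is why the plugin uses this representation for provider requests.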
Provider Selection Logic
When images are detected, Local GPT automatically switches to your vision provider:
private selectProvider(
  aiProviders: IAIProvidersService,
  hasImages: boolean,
  overrideProviderId?: string | null,
): IAIProvider {
  // Prefer the configured vision provider when the selection contains images.
  const visionCandidate = hasImages
    ? aiProviders.providers.find(
        (p: IAIProvider) => p.id === this.settings.aiProviders.vision,
      )
    : undefined;
  // Otherwise fall back to an explicit override or the main provider.
  const preferredProviderId =
    overrideProviderId || this.settings.aiProviders.main;
  const fallback = aiProviders.providers.find(
    (p) => p.id === preferredProviderId,
  );
  const provider = visionCandidate || fallback;
  if (!provider) {
    throw new Error("No AI provider found");
  }
  return provider;
}
If images are present in your selection, the vision provider takes precedence over your main provider.
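The precedence rule reduces to a few lines. In this sketch, the Provider type and the pick helper are simplified stand-ins, not the plugin's real types:

```typescript
// Simplified stand-in for the selection logic above; types and names are illustrative.
interface Provider {
  id: string;
}

function pick(
  providers: Provider[],
  hasImages: boolean,
  visionId: string,
  mainId: string,
): Provider | undefined {
  // The vision provider wins only when the selection actually contains images.
  const vision = hasImages ? providers.find((p) => p.id === visionId) : undefined;
  return vision ?? providers.find((p) => p.id === mainId);
}

const providers = [{ id: "main-model" }, { id: "vision-model" }];
console.log(pick(providers, true, "vision-model", "main-model")?.id); // -> vision-model
console.log(pick(providers, false, "vision-model", "main-model")?.id); // -> main-model
```

Keeping the fallback to the main provider means text-only requests are unaffected by your vision configuration.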
Vision models are computationally intensive. Processing large images or multiple images may take longer than text-only requests.
For faster results, resize large images before embedding them in your notes.
Combining Vision with RAG
You can combine vision support with Enhanced Actions (RAG):
Based on [[Project Context]] and this mockup:
![[ui-mockup.png]]
What improvements should we prioritize?
This will:
1. Process the image with the vision model
2. Retrieve context from “Project Context” using RAG
3. Generate a response informed by both the visual and textual context
const { cleanedText, imagesInBase64 } =
  await this.extractImagesFromSelection(selectedTextRef.value);
selectedTextRef.value = cleanedText;

// Retrieve RAG context from linked notes (Enhanced Actions).
const context = await this.enhanceWithContext(
  cleanedText,
  aiProviders,
  embeddingProvider,
  abortController,
  params.selectedFiles,
);

// Switch to the vision provider when images were extracted.
const provider = this.selectProvider(
  aiProviders,
  imagesInBase64.length > 0,
  params.overrideProviderId,
);

const fullText = await this.executeProviderRequest(
  aiProviders,
  provider,
  params,
  cleanedText,
  context,
  imagesInBase64, // Images sent to provider
  abortController,
  onUpdate,
);
Troubleshooting
Images not being processed
Verify your vision provider is configured in settings
Check that images use the correct syntax: ![[image.png]]
Ensure image files exist in your vault
Confirm image format is PNG or JPEG
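To check whether a given file name would be picked up at all, the extension rule can be tested in isolation (isSupported is an illustrative helper, not part of the plugin):

```typescript
// Illustrative helper mirroring the plugin's png/jpg/jpeg extension rule.
const isSupported = (name: string): boolean => /\.(?:png|jpe?g)$/i.test(name);

console.log(isSupported("photo.PNG")); // -> true (matching is case-insensitive)
console.log(isSupported("photo.webp")); // -> false (unsupported format)
```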
Processing is slow
Vision models require more compute resources:
Consider using smaller/optimized models
Reduce image file sizes
Process fewer images at once
Model errors or empty responses
Ensure your vision model is properly installed
Check that the provider service is running
Verify the model supports image inputs
Next Steps
Community Actions: Browse and install community-contributed actions
Enhanced Actions: Learn about RAG for context-aware responses