Multimodal Looker Agent

Overview

Multimodal Looker is a specialized agent for analyzing media files that cannot be read as plain text. It interprets PDFs, images, diagrams, and other visual content to extract specific information or provide summaries.

Identity: a media interpretation specialist that saves context tokens by analyzing files and returning only the requested information.

model (string, default: "kimi-k2.5-free") - Multimodal-capable model optimized for vision and document analysis

mode (string, default: "subagent") - Invoked by other agents when media analysis is needed

temperature (number, default: 0.1) - Very low temperature for consistent, accurate extraction

Model Configuration

Default Model

{
  "model": "kimi-k2.5-free",
  "temperature": 0.1
}

Gemini Variant

{
  "model": "gemini-3-flash",
  "temperature": 0.1
}

GPT Variant

{
  "model": "gpt-5.2",
  "temperature": 0.1
}

Fallback Chain

Multimodal Looker uses a five-model fallback chain, the deepest of any agent, so that a vision-capable model remains available:

Primary (string) - opencode/kimi-k2.5-free

Fallback 1 (string) - google/gemini-3-flash

Fallback 2 (string) - openai/gpt-5.2

Fallback 3 (string) - zai-coding-plan/glm-4.6v

Fallback 4 (string) - openai/gpt-5-nano
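
One way to picture the chain is as an ordered list that is walked until an available model is found. The sketch below is illustrative only: `resolve_model` and the `available` argument are hypothetical names, not part of oh-my-opencode, and the actual resolution logic may differ.

```python
from typing import Iterable, Optional

# Fallback order as listed above.
FALLBACK_CHAIN = [
    "opencode/kimi-k2.5-free",
    "google/gemini-3-flash",
    "openai/gpt-5.2",
    "zai-coding-plan/glm-4.6v",
    "openai/gpt-5-nano",
]


def resolve_model(available: Iterable[str], chain=FALLBACK_CHAIN) -> Optional[str]:
    """Return the first model in the chain that is available, else None.

    Hypothetical helper: `available` stands in for whatever models the
    user's configured providers can serve.
    """
    available_set = set(available)
    for model in chain:
        if model in available_set:
            return model
    return None


# Only OpenAI models are configured, so the first two entries are skipped.
print(resolve_model({"openai/gpt-5.2", "openai/gpt-5-nano"}))  # openai/gpt-5.2
```

Because the list is ordered, the lighter models at the end of the chain are only reached when every earlier entry is unavailable.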

Tool Permissions

Allowed Tools (Read Only)

read - Read and analyze files

Blocked Tools (All Others)

write (string, default: "deny") - Cannot create files

edit (string, default: "deny") - Cannot modify files

bash (string, default: "deny") - Cannot execute commands

grep (string, default: "deny") - Cannot search file contents (uses read only)

glob (string, default: "deny") - Cannot search for files by pattern

task (string, default: "deny") - Cannot delegate to other agents

Multimodal Looker has the strictest tool restrictions - only read is allowed. This ensures it focuses solely on interpreting the provided file.
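
The restriction amounts to a deny-by-default allowlist. The following is a minimal sketch of that idea, not the project's implementation: the `PERMISSIONS` table, the `"allow"` value for read, and the `is_allowed` helper are assumptions made for illustration.

```python
# Permission table mirroring the lists above. The "deny" values come from
# the documentation; "allow" for read and the helper are hypothetical.
PERMISSIONS = {
    "read": "allow",
    "write": "deny",
    "edit": "deny",
    "bash": "deny",
    "grep": "deny",
    "glob": "deny",
    "task": "deny",
}


def is_allowed(tool: str) -> bool:
    """Deny by default: any tool not explicitly allowed is blocked."""
    return PERMISSIONS.get(tool, "deny") == "allow"


for tool in ("read", "edit", "some_other_tool"):
    print(tool, is_allowed(tool))
```

Treating unknown tools as denied keeps the agent read-only even if new tools are added later.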

When to Use Multimodal Looker

Recommended Scenarios

PDF analysis - Extract text, tables, or structure from documents

Image interpretation - Describe layouts, UI elements, charts, or diagrams

Diagram analysis - Explain relationships, flows, or architecture depicted

Specific data extraction - Pull particular information from visual content

Context token optimization - You need the analyzed data, not the entire raw file

Avoid Multimodal Looker For

Plain text files - Use the read tool directly instead

Source code - Use read for the exact contents needed for editing

Files needing modification - Looker only extracts; it can’t edit

Simple file reading - No interpretation is needed, so use regular tools
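
The guidance above can be condensed into a small routing rule. This is a sketch under stated assumptions: `choose_tool` and the `MEDIA_EXTENSIONS` set are hypothetical, and the extension list is only an example of what might count as media.

```python
from pathlib import Path

# Assumed set of media extensions for illustration; adjust as needed.
MEDIA_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".gif", ".webp"}


def choose_tool(path: str, will_edit: bool = False) -> str:
    """Return "multimodal-looker" for media needing interpretation, else "read"."""
    if will_edit:
        # Editing needs the exact original content, which Looker does not return.
        return "read"
    if Path(path).suffix.lower() in MEDIA_EXTENSIONS:
        return "multimodal-looker"
    return "read"


print(choose_tool("pricing-2024.pdf"))            # multimodal-looker
print(choose_tool("src/main.py"))                 # read
print(choose_tool("diagram.PNG", will_edit=True)) # read
```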

How It Works

Multimodal Looker follows a focused four-step process:

1. Receive request - Gets the file path and a specific extraction goal
2. Read and analyze - Deeply interprets the visual content
3. Extract target information - Returns ONLY what was requested
4. Pass to main agent - The main agent continues work without processing the raw file

Key Principle

Context token efficiency: The main agent never processes the raw media file. Looker extracts and summarizes instead, which can save thousands of tokens on large files.
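
The flow described above can be sketched as follows. Everything here is hypothetical scaffolding (`LookerRequest`, `run_looker`, `main_agent`, and the stubbed `fake_analyze` model call); it only illustrates that the main agent's context receives the extraction and never the raw file.

```python
from dataclasses import dataclass


@dataclass
class LookerRequest:
    file_path: str  # step 1: which file to analyze
    goal: str       # step 1: what to extract


def run_looker(request: LookerRequest, analyze) -> str:
    """Steps 2-3: analyze the file and return only the extracted text.

    `analyze` is a stand-in for the vision model call.
    """
    return analyze(request.file_path, request.goal).strip()


def main_agent(context: list, request: LookerRequest, analyze) -> list:
    """Step 4: only the extraction is appended to the main agent's context."""
    context.append(run_looker(request, analyze))
    return context


def fake_analyze(path: str, goal: str) -> str:
    # Stub for demonstration; a real call would read and interpret the file.
    return f"[extracted from {path}] {goal}: ...\n"


ctx = main_agent([], LookerRequest("report.pdf", "pricing tiers"), fake_analyze)
print(ctx)
```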

Response Rules

No preamble (boolean) - Returns extracted information directly, without introduction

Clear when missing (boolean) - States explicitly what information wasn’t found

Match request language (boolean) - Responds in the same language as the request

Thorough on goal (boolean) - Comprehensive on the specific extraction goal

Concise on everything else (boolean) - Brief on context or unrelated content

Usage Examples

Example 1: PDF Data Extraction

task(
  subagent_type="multimodal-looker",
  description="Extract pricing table from PDF",
  prompt="Analyze pricing-2024.pdf and extract all pricing tiers, 
          their monthly costs, and included features into a structured format."
)

// Multimodal Looker response:
"Pricing Tiers:

1. Starter - $29/month
   - 10 users
   - 100GB storage
   - Email support
   - Basic analytics

2. Professional - $99/month
   - 50 users
   - 1TB storage
   - Priority support
   - Advanced analytics
   - API access

3. Enterprise - Custom pricing
   - Unlimited users
   - Unlimited storage
   - 24/7 phone support
   - Custom analytics
   - Dedicated account manager
   - SLA guarantee"

Example 2: Architecture Diagram Analysis

task(
  subagent_type="multimodal-looker",
  description="Explain system architecture from diagram",
  prompt="Examine architecture-diagram.png and describe:
          1. Main components and their responsibilities
          2. Data flow between components
          3. External services integrated"
)

// Multimodal Looker response:
"Architecture Components:

1. Frontend (React SPA)
   - User interface
   - Communicates via REST API
   - Hosted on Cloudflare Pages

2. API Gateway (Node.js/Express)
   - Request routing
   - Authentication middleware
   - Rate limiting
   - Routes to microservices

3. User Service (Python/FastAPI)
   - User management
   - Authentication
   - Writes to PostgreSQL

4. Payment Service (Node.js)
   - Payment processing
   - Integrates with Stripe API
   - Publishes events to message queue

5. Notification Service (Go)
   - Consumes queue events
   - Sends emails via SendGrid
   - Sends SMS via Twilio

Data Flow:
Frontend → API Gateway → Service → Database
                      ↓
           External APIs (Stripe, SendGrid, Twilio)
                      ↓
          Message Queue (RabbitMQ) between services

External Services:
- Stripe (payments)
- SendGrid (email)
- Twilio (SMS)
- PostgreSQL (primary database)
- Redis (caching and sessions)"

Example 3: UI Screenshot Analysis

task(
  subagent_type="multimodal-looker",
  description="Identify UI components in mockup",
  prompt="Analyze dashboard-mockup.png and list all UI components 
          with their positions and purposes."
)

// Multimodal Looker response:
"UI Components:

Header (Top):
- Logo (left): Brand identity
- Search bar (center): Global search
- User avatar (right): Profile menu
- Notifications bell (right): Alert center

Sidebar (Left):
- Dashboard icon: Overview page
- Projects icon: Project list
- Team icon: Team management
- Settings icon: Configuration
- Help icon: Documentation

Main Content (Center):
- Welcome heading: Page title
- Stats cards (row of 4):
  * Total users (blue)
  * Active projects (green)
  * Revenue (yellow)
  * Pending tasks (red)
- Activity feed: Recent events list
- Quick actions button (bottom right): Floating action button

Right Panel:
- Calendar widget: Schedule overview
- Team members list: Online status
- Recent files: Document quick access"

Example 4: Chart Data Extraction

task(
  subagent_type="multimodal-looker",
  description="Extract sales data from chart image",
  prompt="Read sales-chart-q4.png and extract the monthly sales figures 
          for October, November, and December."
)

// Multimodal Looker response:
"Q4 Sales Data:

October: $142,500
November: $198,300
December: $256,700

Total Q4: $597,500
Average: $199,167/month
Growth: 80% from Oct to Dec"
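
Derived figures such as totals and growth rates are cheap to recompute from the extracted values, which is a useful cross-check on anything a vision model reports. A short sketch using the three monthly figures from the example response:

```python
# Monthly figures as reported in the example response above.
sales = {"October": 142_500, "November": 198_300, "December": 256_700}

total = sum(sales.values())
average = total / len(sales)
growth = (sales["December"] - sales["October"]) / sales["October"]

print(f"Total Q4: ${total:,}")            # Total Q4: $597,500
print(f"Average: ${average:,.0f}/month")  # Average: $199,167/month
print(f"Growth: {growth:.0%}")            # Growth: 80%
```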

When NOT to Use

Multimodal Looker is specialized - avoid using it when simpler tools work:

Source code files: Use the read tool directly. Looker interprets and summarizes, but editing requires the exact original content.

Plain text documents: No interpretation needed - regular read is faster and more accurate.

Files you’ll modify later: Looker’s extracted summary can’t be edited back into the original file.

When you need the full file: If the entire content is needed, not just specific data, use read directly.

Best Practices

Be specific about what to extract - Clear goals produce better results

Request structured output - Ask for tables, lists, or specific formats

One file at a time - Focused analysis is more accurate

Use for token optimization - When you need data from large PDFs

Don’t expect perfection - Vision models may misread text or miss details

Don’t rely on it for critical exact data - If precision is critical, verify the extracted values manually
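
Several of these practices (one file, specific targets, a requested output format) can be baked into a small prompt builder. This is only a sketch: `build_looker_prompt` is a hypothetical helper, and the wording of the generated prompt is an assumption rather than a required format.

```python
from typing import Sequence


def build_looker_prompt(
    file_path: str,
    targets: Sequence[str],
    output_format: str = "a bulleted list",
) -> str:
    """Compose a focused extraction prompt for a single file."""
    if not targets:
        raise ValueError("name at least one thing to extract")
    lines = [f"Analyze {file_path} and extract:"]
    lines += [f"{i}. {target}" for i, target in enumerate(targets, start=1)]
    lines.append(f"Return the result as {output_format}.")
    lines.append("State explicitly if any item cannot be found.")
    return "\n".join(lines)


print(build_looker_prompt(
    "sales-chart-q4.png",
    ["October sales", "November sales", "December sales"],
    output_format="a table",
))
```

The resulting string would be passed as the prompt of a task call like those in the usage examples.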

Configuration

Customize Multimodal Looker in oh-my-opencode.jsonc:

{
  "agents": {
    "multimodal-looker": {
      "model": "opencode/kimi-k2.5-free",
      "temperature": 0.1,
      "prompt_append": "Additional extraction guidelines...",
      "disable": false
    }
  }
}
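
To reason about what an override changes, it can help to picture the user's settings layered over the defaults. The sketch below assumes a simple key-by-key merge, which may not match how oh-my-opencode actually resolves configuration; `DEFAULTS`, `CONFIG_TEXT`, and `effective_settings` are hypothetical. It also uses `json.loads`, which only works when the file contains no comments; a real JSONC file would need a comment-aware parser.

```python
import json

# Defaults taken from the parameter list at the top of this page.
DEFAULTS = {"model": "kimi-k2.5-free", "temperature": 0.1}

# Example override text (hypothetical values, no comments so plain JSON parses).
CONFIG_TEXT = """
{
  "agents": {
    "multimodal-looker": {
      "temperature": 0.0,
      "prompt_append": "Always answer in tables."
    }
  }
}
"""


def effective_settings(config_text: str, agent: str = "multimodal-looker") -> dict:
    """Layer the agent's overrides on top of the defaults (assumed merge rule)."""
    overrides = json.loads(config_text).get("agents", {}).get(agent, {})
    return {**DEFAULTS, **overrides}


print(effective_settings(CONFIG_TEXT))
```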

Related Agents

Sisyphus - Orchestrator that uses Looker for media analysis
Librarian - Searches external docs (Looker analyzes local files)
Explore - Searches codebase (Looker interprets visual content)
