Vision Agent

Learn how to create agents that can analyze images, process visual content, and combine vision with language capabilities for powerful multimodal applications.

Overview

Vision agents can:

Analyze and describe images
Extract information from visual content
Answer questions about images
Combine visual analysis with tools
Process batches of images
Generate insights from charts and diagrams

Basic Vision Agent

Here’s how to create a simple vision agent:

from swarms import Agent

# Create a vision-enabled agent
vision_agent = Agent(
    agent_name="Vision-Analyst",
    agent_description="An agent that analyzes images and provides detailed descriptions",
    model_name="gpt-4o",  # Vision-capable model
    multi_modal=True,  # Enable multimodal processing
    max_loops=1,
)

# Analyze an image
response = vision_agent.run(
    task="Describe what you see in this image in detail",
    img="path/to/image.jpg",  # Path to image file
)

print(response)

Image Input Formats

Vision agents support multiple image input formats:

1. File Path

response = agent.run(
    task="Analyze this image",
    img="/home/user/images/photo.jpg",
)

2. URL

response = agent.run(
    task="What's in this image?",
    img="https://example.com/image.jpg",
)

3. Base64 Encoded String

import base64

# Read and encode image
with open("image.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode("utf-8")

response = agent.run(
    task="Analyze this image",
    img=img_base64,
)

4. Data URI

response = agent.run(
    task="Describe the image",
    img="data:image/jpeg;base64,/9j/4AAQSkZJRg...",
)
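The base64 and data URI formats above can be produced from a local file with the standard library alone. The following is a minimal sketch; `to_data_uri` is a hypothetical helper name, not part of Swarms, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(image_path: str) -> str:
    """Encode a local image file as a data URI (hypothetical helper)."""
    path = Path(image_path)
    # Guess the MIME type from the file extension; fall back to a generic type
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        mime_type = "application/octet-stream"
    # Base64 output is pure ASCII, so decoding as ASCII is safe
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The result can then be passed as `img=to_data_uri("image.jpg")` in the same way as the literal data URI shown above.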

Real-World Example: Quality Control Agent

Here’s a fuller example for factory quality control that combines vision with a custom tool:

import logging
from swarms import Agent
from swarms.prompts.logistics import Quality_Control_Agent_Prompt

# Set up logging
logging.basicConfig(level=logging.DEBUG)

def security_analysis(danger_level: str) -> str:
    """
    Analyzes security danger level and returns appropriate response.
    
    Args:
        danger_level (str): The level of danger ("low", "medium", "high")
        
    Returns:
        str: Detailed security analysis based on danger level
    """
    if danger_level == "low":
        return """SECURITY ANALYSIS - LOW DANGER LEVEL:
        ✅ Environment appears safe and well-controlled
        ✅ Standard security measures are adequate
        ✅ Low risk of accidents or security breaches
        ✅ Normal operational protocols can continue
        
        Recommendations: Maintain current security standards."""
    
    elif danger_level == "medium":
        return """SECURITY ANALYSIS - MEDIUM DANGER LEVEL:
        ⚠️  Moderate security concerns identified
        ⚠️  Enhanced monitoring recommended
        ⚠️  Some security measures may need strengthening
        
        Recommendations: Implement additional safety protocols."""
    
    elif danger_level == "high":
        return """SECURITY ANALYSIS - HIGH DANGER LEVEL:
        🚨 CRITICAL SECURITY CONCERNS DETECTED
        🚨 Immediate action required
        🚨 High risk of accidents or security breaches
        
        Recommendations: Immediate intervention required, evacuate if necessary."""
    
    return f"ERROR: Invalid danger level '{danger_level}'"

# Custom system prompt
custom_system_prompt = f"""
{Quality_Control_Agent_Prompt}

You have access to tools that can help with your analysis. When you need to
perform a security analysis, use the security_analysis function with an
appropriate danger level (low, medium, or high) based on your observations.
"""

# Quality control agent with vision and tools
quality_control_agent = Agent(
    agent_name="Quality-Control-Agent",
    agent_description="Analyzes images and provides detailed quality control reports",
    model_name="gpt-4.1",
    system_prompt=custom_system_prompt,
    multi_modal=True,  # Enable vision
    max_loops=1,
    output_type="str-all-except-first",
    tools=[security_analysis],  # Combine vision with tools
)

response = quality_control_agent.run(
    task="Analyze the image and perform a security analysis. Determine the danger level and call the security_analysis function.",
    img="factory_image.png",
)

print(response)
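Because the model chooses the `danger_level` argument, it may arrive with different casing or stray whitespace (for example "High"), which the exact string comparisons in `security_analysis` would reject. A small normalizer, sketched below under that assumption, could be called at the top of the tool. `normalize_danger_level` and `VALID_LEVELS` are hypothetical names introduced for illustration.

```python
# Levels taken from the system prompt above: low, medium, or high
VALID_LEVELS = ("low", "medium", "high")


def normalize_danger_level(raw: str) -> str:
    """Return a canonical danger level or raise ValueError (hypothetical helper)."""
    level = raw.strip().lower()
    if level not in VALID_LEVELS:
        raise ValueError(
            f"Invalid danger level {raw!r}; expected one of {VALID_LEVELS}"
        )
    return level
```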

Vision with Multiple Images

Process a batch of images by running the agent once per image:

from swarms import Agent

# Create vision agent
agent = Agent(
    agent_name="Multi-Image-Analyst",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

# Process batch of images
images = [
    "image1.jpg",
    "image2.jpg",
    "image3.jpg",
]

for idx, img in enumerate(images, 1):
    response = agent.run(
        task=f"Analyze image {idx} and describe key features",
        img=img,
    )
    print(f"\n=== Image {idx} Analysis ===")
    print(response)
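When the results are needed later rather than printed immediately, the loop can be wrapped in a function that collects one result per image and keeps going if a single image fails. This is a sketch, not a Swarms API: `analyze_images` is a hypothetical helper, and it only assumes that the callable it receives accepts `task` and `img` keyword arguments, as `agent.run` does in the examples above.

```python
from typing import Callable, Dict, List


def analyze_images(
    run: Callable[..., str],
    image_paths: List[str],
    task_template: str = "Analyze image {idx} and describe key features",
) -> Dict[str, str]:
    """Run one analysis per image and collect results keyed by image path."""
    results: Dict[str, str] = {}
    for idx, path in enumerate(image_paths, 1):
        try:
            results[path] = run(task=task_template.format(idx=idx), img=path)
        except Exception as exc:
            # Record the failure and continue with the remaining images
            results[path] = f"ERROR: {exc}"
    return results
```

With the agent defined above, a call would look like `results = analyze_images(agent.run, images)`.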

Advanced Vision Patterns

Document Analysis

from swarms import Agent

# Create document analysis agent
doc_agent = Agent(
    agent_name="Document-Analyzer",
    system_prompt="""You are an expert at analyzing documents, invoices,
    and forms. Extract all relevant information accurately.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

response = doc_agent.run(
    task="""Extract the following information from this invoice:
    - Invoice number
    - Date
    - Total amount
    - Line items with quantities and prices
    - Vendor name and address
    """,
    img="invoice.png",  # Image of the invoice (export PDF pages to PNG or JPEG first)
)

print(response)

Chart and Graph Analysis

from swarms import Agent

# Create data visualization analyst
chart_agent = Agent(
    agent_name="Chart-Analyst",
    system_prompt="""You are an expert at analyzing charts, graphs, and
    data visualizations. Provide insights about trends and patterns.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

response = chart_agent.run(
    task="""Analyze this chart and provide:
    1. Key trends and patterns
    2. Notable data points
    3. Statistical insights
    4. Recommendations based on the data
    """,
    img="sales_chart.png",
)

Medical Image Analysis

from swarms import Agent

# Create medical imaging agent
medical_agent = Agent(
    agent_name="Medical-Imaging-Analyst",
    system_prompt="""You are a medical imaging analyst assistant.
    Provide detailed observations about medical images. Note: This is for
    educational purposes only and not a substitute for professional diagnosis.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

response = medical_agent.run(
    task="""Analyze this X-ray image and describe:
    1. What anatomical structures are visible
    2. Any notable features or anomalies
    3. Image quality and clarity
    """,
    img="xray.jpg",
)

Vision + Tools Integration

Combine vision capabilities with external tools:

from swarms import Agent

def search_product_database(product_name: str) -> str:
    """
    Search product database for information
    
    Args:
        product_name (str): Name or description of product
        
    Returns:
        str: Product information from database
    """
    # Implementation
    return f"Product info for {product_name}"

def check_inventory(product_id: str) -> str:
    """
    Check inventory levels for a product
    
    Args:
        product_id (str): Product ID or SKU
        
    Returns:
        str: Current inventory status
    """
    # Implementation
    return f"Inventory status for {product_id}"

# Create agent with vision and tools
product_agent = Agent(
    agent_name="Product-Recognition-Agent",
    system_prompt="""You analyze product images, identify products,
    and use tools to look up information about them.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=2,
    tools=[search_product_database, check_inventory],
)

response = product_agent.run(
    task="""Identify the products in this image, search the database
    for each product, and check inventory levels.""",
    img="warehouse_shelf.jpg",
)
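The two tools above return placeholder strings. As an illustration of what a filled-in tool might look like, here is a sketch of `check_inventory` backed by an in-memory dictionary. The SKUs and stock figures are made up for the example, and a real tool would query a database or API instead. Returning a JSON string is a design choice that gives the model a predictable structure to read.

```python
import json

# Hypothetical in-memory inventory used only for illustration
_INVENTORY = {
    "SKU-001": {"name": "Safety helmet", "in_stock": 42},
    "SKU-002": {"name": "Work gloves", "in_stock": 0},
}


def check_inventory(product_id: str) -> str:
    """
    Check inventory levels for a product

    Args:
        product_id (str): Product ID or SKU

    Returns:
        str: JSON string describing the current inventory status
    """
    # Normalize the ID so "sku-001" and " SKU-001 " resolve to the same record
    record = _INVENTORY.get(product_id.strip().upper())
    if record is None:
        return json.dumps({"product_id": product_id, "found": False})
    return json.dumps({"product_id": product_id, "found": True, **record})
```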

Supported Vision Models

Swarms supports multiple vision-capable models:

from swarms import Agent

# OpenAI GPT-4o
agent_gpt4v = Agent(
    model_name="gpt-4o",
    multi_modal=True,
)

# OpenAI GPT-4o mini (cost-effective)
agent_gpt4o_mini = Agent(
    model_name="gpt-4o-mini",
    multi_modal=True,
)

# Anthropic Claude with vision
agent_claude = Agent(
    model_name="claude-sonnet-4-5",
    multi_modal=True,
)

# Groq with LLaVA
agent_groq = Agent(
    model_name="groq/llava-v1.5-7b-4096-preview",
    multi_modal=True,
)

Best Practices

1. Specific Task Instructions

# Bad: Vague instruction
response = agent.run(task="Look at this image", img="photo.jpg")

# Good: Specific instruction
response = agent.run(
    task="""Identify all vehicles in this image, count them by type
    (cars, trucks, motorcycles), and describe their colors and positions.""",
    img="traffic.jpg",
)

2. Image Quality

# Ensure images are:
# - Clear and well-lit
# - High enough resolution (min 512x512 recommended)
# - In supported formats (JPEG, PNG, WebP)
# - Not too large (under 20MB)

import os
from PIL import Image

def validate_image(image_path: str) -> bool:
    """Validate image before processing"""
    if not os.path.exists(image_path):
        return False
    
    try:
        # Use a context manager so the file handle is closed promptly
        with Image.open(image_path) as img:
            width, height = img.size
        
        # Check minimum resolution
        if width < 512 or height < 512:
            print("Warning: Image resolution is low")
        
        # Check file size
        file_size = os.path.getsize(image_path) / (1024 * 1024)  # MB
        if file_size > 20:
            print("Warning: Image file is large")
        
        return True
    except Exception as e:
        print(f"Image validation failed: {e}")
        return False
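`validate_image` checks resolution and size but not the file format listed in the comments above. A lightweight pre-check that needs only the standard library could cover the extension and size limits before the image is opened. The sketch below assumes the limits stated in this section (JPEG, PNG, or WebP, under 20MB); `precheck_image` is a hypothetical helper, and it checks the file extension only, not the actual file contents.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_SIZE_MB = 20


def precheck_image(image_path: str) -> List[str]:
    """Return a list of problems; an empty list means the file passed."""
    path = Path(image_path)
    if not path.is_file():
        return [f"File not found: {image_path}"]

    problems: List[str] = []
    # Compare the extension case-insensitively against the supported formats
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported extension: {path.suffix or '(none)'}")
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        problems.append(f"File is {size_mb:.1f} MB; limit is {MAX_SIZE_MB} MB")
    return problems
```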

3. Structured Output

import json
from typing import List

from pydantic import BaseModel, Field
from swarms import Agent

class ImageAnalysis(BaseModel):
    description: str = Field(..., description="Overall image description")
    objects_detected: List[str] = Field(..., description="List of detected objects")
    dominant_colors: List[str] = Field(..., description="Main colors in image")
    scene_type: str = Field(..., description="Type of scene (indoor, outdoor, etc)")

agent = Agent(
    model_name="gpt-4o",
    multi_modal=True,
    output_type="json",
)

# Serialize the schema as real JSON; interpolating the dict directly would
# embed a Python repr (single quotes) in the prompt
schema = json.dumps(ImageAnalysis.model_json_schema(), indent=2)

response = agent.run(
    task=f"""Analyze this image and return a JSON response matching this schema:
    {schema}""",
    img="scene.jpg",
)

result = ImageAnalysis.model_validate_json(response)
print(result)
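Models sometimes wrap JSON in a Markdown code fence or surround it with whitespace, in which case validating the raw reply fails. The sketch below strips an optional fence before validation. It assumes the reply contains at most one fenced block, and `extract_json` is a hypothetical helper, not part of Swarms or Pydantic.

```python
import json
import re

# Build the fence marker programmatically to keep this snippet easy to embed
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)\s*" + _FENCE, re.DOTALL)


def extract_json(text: str) -> str:
    """Return the JSON payload from a model reply, stripping an optional code fence."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text.strip()
    # Fail early with json.JSONDecodeError if the payload is not valid JSON
    json.loads(candidate)
    return candidate
```

With this helper, the validation step above could become `ImageAnalysis.model_validate_json(extract_json(response))`.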

4. Error Handling

import logging
import os

from swarms import Agent

logger = logging.getLogger(__name__)

def process_image_safely(agent: Agent, task: str, img_path: str) -> str:
    """Process a local image file with error handling"""
    try:
        # Validate image exists (local paths only; URLs and base64 strings would fail this check)
        if not os.path.exists(img_path):
            return f"Error: Image not found at {img_path}"
        
        # Process image
        response = agent.run(task=task, img=img_path)
        return response
        
    except Exception as e:
        logger.error(f"Image processing failed: {e}")
        return f"Image processing error: {e}"

result = process_image_safely(
    agent=vision_agent,
    task="Analyze this image",
    img_path="photo.jpg",
)
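Some failures, such as rate limits or network timeouts, are transient and may succeed on a second attempt. One common approach is retrying with exponential backoff, sketched below. `with_retries` is a hypothetical helper, and the `sleep` parameter exists only so the delay can be replaced in tests. The sketch retries on any exception, which is broader than a real deployment would likely want.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Because `process_image_safely` converts exceptions into error strings, the retry should wrap the agent call directly, for example `with_retries(lambda: vision_agent.run(task="Analyze this image", img="photo.jpg"))`.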

Output Examples

Typical vision agent output:

🤖 Agent: Vision-Analyst
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📸 Image Analysis:

This image shows a modern factory floor with the following elements:

1. **Equipment**: 
   - 3 robotic arms in the center
   - Conveyor belt system running left to right
   - Control panels on the far wall

2. **Safety Features**:
   - Yellow safety barriers around robotic area
   - Emergency stop buttons visible
   - Proper lighting throughout

3. **Personnel**:
   - 2 workers wearing safety vests and hard hats
   - Both maintaining safe distance from robotic area

4. **Overall Assessment**:
   - Clean and organized workspace
   - Safety protocols appear to be followed
   - No visible hazards or concerns

Common Use Cases

Retail and E-commerce

# Product catalog generation
product_agent = Agent(
    agent_name="Product-Cataloger",
    system_prompt="Generate product descriptions from images",
    model_name="gpt-4o",
    multi_modal=True,
)

description = product_agent.run(
    task="Create a detailed product description for an e-commerce listing",
    img="product_photo.jpg",
)

Manufacturing and QA

# Defect detection
qa_agent = Agent(
    agent_name="QA-Inspector",
    system_prompt="Inspect products for defects and quality issues",
    model_name="gpt-4o",
    multi_modal=True,
)

inspection = qa_agent.run(
    task="Inspect this product for defects, scratches, or quality issues",
    img="product_inspection.jpg",
)

Healthcare

# Medical documentation
med_doc_agent = Agent(
    agent_name="Medical-Documentation",
    system_prompt="Extract information from medical documents and forms",
    model_name="gpt-4o",
    multi_modal=True,
)

extracted_data = med_doc_agent.run(
    task="Extract patient information and medical data from this form",
    img="patient_form.jpg",
)

Next Steps

Streaming - Stream vision analysis in real-time
Multi-Agent Vision - Coordinate vision agents
Agent Output Types - Structure vision outputs
Model Providers - Explore vision-capable models
