Vision Agent
Learn how to create agents that can analyze images, process visual content, and combine vision with language capabilities for powerful multimodal applications.
Overview
Vision agents can:
- Analyze and describe images
- Extract information from visual content
- Answer questions about images
- Combine visual analysis with tools
- Process multiple images simultaneously
- Generate insights from charts and diagrams
Basic Vision Agent
Here’s how to create a simple vision agent:
from swarms import Agent

# Create a vision-enabled agent
vision_agent = Agent(
    agent_name="Vision-Analyst",
    agent_description="An agent that analyzes images and provides detailed descriptions",
    model_name="gpt-4o",  # Vision-capable model
    multi_modal=True,  # Enable multimodal processing
    max_loops=1,
)

# Analyze an image
response = vision_agent.run(
    task="Describe what you see in this image in detail",
    img="path/to/image.jpg",  # Path to image file
)
print(response)
Image Input Formats
Vision agents support multiple image input formats:
1. File Path
response = agent.run(
    task="Analyze this image",
    img="/home/user/images/photo.jpg",
)
2. URL
response = agent.run(
    task="What's in this image?",
    img="https://example.com/image.jpg",
)
3. Base64 Encoded String
import base64

# Read and encode image
with open("image.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode("utf-8")

response = agent.run(
    task="Analyze this image",
    img=img_base64,
)
4. Data URI
response = agent.run(
    task="Describe the image",
    img="data:image/jpeg;base64,/9j/4AAQSkZJRg...",
)
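The Base64 and data URI formats above can be produced from a local file with the standard library alone. The following is a minimal sketch; `to_data_uri` is a hypothetical helper name, not part of Swarms, and the JPEG fallback for unknown extensions is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Encode a local image file as a data URI (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to a default.
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime
    # Base64 output is pure ASCII, so decoding as ASCII is safe.
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string could then be passed as the `img` argument in the same way as the data URI example above.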
Real-World Example: Quality Control Agent
Here’s a production-ready example for factory quality control:
import logging

from swarms import Agent
from swarms.prompts.logistics import Quality_Control_Agent_Prompt

# Set up logging
logging.basicConfig(level=logging.DEBUG)


def security_analysis(danger_level: str) -> str:
    """
    Analyzes security danger level and returns appropriate response.

    Args:
        danger_level (str): The level of danger ("low", "medium", "high")

    Returns:
        str: Detailed security analysis based on danger level
    """
    if danger_level == "low":
        return """SECURITY ANALYSIS - LOW DANGER LEVEL:
        ✅ Environment appears safe and well-controlled
        ✅ Standard security measures are adequate
        ✅ Low risk of accidents or security breaches
        ✅ Normal operational protocols can continue
        Recommendations: Maintain current security standards."""
    elif danger_level == "medium":
        return """SECURITY ANALYSIS - MEDIUM DANGER LEVEL:
        ⚠️ Moderate security concerns identified
        ⚠️ Enhanced monitoring recommended
        ⚠️ Some security measures may need strengthening
        Recommendations: Implement additional safety protocols."""
    elif danger_level == "high":
        return """SECURITY ANALYSIS - HIGH DANGER LEVEL:
        🚨 CRITICAL SECURITY CONCERNS DETECTED
        🚨 Immediate action required
        🚨 High risk of accidents or security breaches
        Recommendations: Immediate intervention required, evacuate if necessary."""
    return f"ERROR: Invalid danger level '{danger_level}'"


# Custom system prompt
custom_system_prompt = f"""
{Quality_Control_Agent_Prompt}
You have access to tools that can help with your analysis. When you need to
perform a security analysis, use the security_analysis function with an
appropriate danger level (low, medium, or high) based on your observations.
"""

# Quality control agent with vision and tools
quality_control_agent = Agent(
    agent_name="Quality-Control-Agent",
    agent_description="Analyzes images and provides detailed quality control reports",
    model_name="gpt-4.1",
    system_prompt=custom_system_prompt,
    multi_modal=True,  # Enable vision
    max_loops=1,
    output_type="str-all-except-first",
    tools=[security_analysis],  # Combine vision with tools
)

response = quality_control_agent.run(
    task="Analyze the image and perform a security analysis. Determine the danger level and call the security_analysis function.",
    img="factory_image.png",
)
print(response)
Vision with Multiple Images
Process a batch of images by calling the agent once per image:
from swarms import Agent

# Create vision agent
agent = Agent(
    agent_name="Multi-Image-Analyst",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

# Process batch of images
images = [
    "image1.jpg",
    "image2.jpg",
    "image3.jpg",
]

for idx, img in enumerate(images, 1):
    response = agent.run(
        task=f"Analyze image {idx} and describe key features",
        img=img,
    )
    print(f"\n=== Image {idx} Analysis ===")
    print(response)
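When a batch is large, it can help to collect the results instead of printing them, and to keep going when a single image fails. The sketch below assumes nothing about Swarms itself: `analyze_batch` is a hypothetical helper, and `run_fn` is any callable that takes a task and an image path and returns text.

```python
from typing import Callable, Dict, List


def analyze_batch(
    run_fn: Callable[[str, str], str],
    images: List[str],
    task_template: str = "Analyze image {idx} and describe key features",
) -> Dict[str, Dict[str, str]]:
    """Run `run_fn(task, img)` once per image and collect results by path."""
    results: Dict[str, Dict[str, str]] = {}
    for idx, img in enumerate(images, 1):
        task = task_template.format(idx=idx)
        try:
            results[img] = {"status": "ok", "output": str(run_fn(task, img))}
        except Exception as exc:  # keep going if one image fails
            results[img] = {"status": "error", "output": str(exc)}
    return results


# Possible usage with the agent defined in this section (assumed call shape):
# results = analyze_batch(lambda task, img: agent.run(task=task, img=img), images)
```

Keying the results by image path makes it straightforward to retry only the entries whose status is "error".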
Advanced Vision Patterns
Document Analysis
from swarms import Agent

# Create document analysis agent
doc_agent = Agent(
    agent_name="Document-Analyzer",
    system_prompt="""You are an expert at analyzing documents, invoices,
    and forms. Extract all relevant information accurately.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

response = doc_agent.run(
    task="""Extract the following information from this invoice:
    - Invoice number
    - Date
    - Total amount
    - Line items with quantities and prices
    - Vendor name and address
    """,
    img="invoice.png",  # Render PDF pages to a supported image format (JPEG, PNG, WebP) first
)
print(response)
Chart and Graph Analysis
from swarms import Agent

# Create data visualization analyst
chart_agent = Agent(
    agent_name="Chart-Analyst",
    system_prompt="""You are an expert at analyzing charts, graphs, and
    data visualizations. Provide insights about trends and patterns.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

response = chart_agent.run(
    task="""Analyze this chart and provide:
    1. Key trends and patterns
    2. Notable data points
    3. Statistical insights
    4. Recommendations based on the data
    """,
    img="sales_chart.png",
)
Medical Image Analysis
from swarms import Agent

# Create medical imaging agent
medical_agent = Agent(
    agent_name="Medical-Imaging-Analyst",
    system_prompt="""You are a medical imaging analyst assistant.
    Provide detailed observations about medical images. Note: This is for
    educational purposes only and not a substitute for professional diagnosis.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=1,
)

response = medical_agent.run(
    task="""Analyze this X-ray image and describe:
    1. What anatomical structures are visible
    2. Any notable features or anomalies
    3. Image quality and clarity
    """,
    img="xray.jpg",
)
Vision + Tools Integration
Combine vision capabilities with external tools:
from swarms import Agent
import httpx
import json


def search_product_database(product_name: str) -> str:
    """
    Search product database for information

    Args:
        product_name (str): Name or description of product

    Returns:
        str: Product information from database
    """
    # Implementation
    return f"Product info for {product_name}"


def check_inventory(product_id: str) -> str:
    """
    Check inventory levels for a product

    Args:
        product_id (str): Product ID or SKU

    Returns:
        str: Current inventory status
    """
    # Implementation
    return f"Inventory status for {product_id}"


# Create agent with vision and tools
product_agent = Agent(
    agent_name="Product-Recognition-Agent",
    system_prompt="""You analyze product images, identify products,
    and use tools to look up information about them.""",
    model_name="gpt-4o",
    multi_modal=True,
    max_loops=2,
    tools=[search_product_database, check_inventory],
)

response = product_agent.run(
    task="""Identify the products in this image, search the database
    for each product, and check inventory levels.""",
    img="warehouse_shelf.jpg",
)
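The tool bodies above are placeholders. As one illustration of what an implementation might look like, here is a sketch of `check_inventory` backed by an in-memory dictionary. The SKUs, product names, and stock counts are invented for the example, and a real tool would query a database or API instead. It returns a JSON string so the model receives structured, unambiguous text.

```python
import json

# Hypothetical in-memory inventory; a real tool would query a database or API.
_INVENTORY = {
    "SKU-001": {"name": "Widget", "in_stock": 42},
    "SKU-002": {"name": "Gadget", "in_stock": 0},
}


def check_inventory(product_id: str) -> str:
    """
    Check inventory levels for a product.

    Args:
        product_id (str): Product ID or SKU

    Returns:
        str: JSON string with the inventory status
    """
    # Normalize the ID so "sku-001 " and "SKU-001" hit the same record.
    record = _INVENTORY.get(product_id.strip().upper())
    if record is None:
        return json.dumps({"product_id": product_id, "error": "unknown product"})
    status = "in stock" if record["in_stock"] > 0 else "out of stock"
    return json.dumps({"product_id": product_id, "status": status, **record})
```

Returning an explicit error payload for unknown IDs, rather than raising, gives the agent something it can reason about in its next loop.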
Supported Vision Models
Swarms supports multiple vision-capable models:
from swarms import Agent

# OpenAI GPT-4o
agent_gpt4o = Agent(
    model_name="gpt-4o",
    multi_modal=True,
)

# OpenAI GPT-4o mini (cost-effective)
agent_gpt4o_mini = Agent(
    model_name="gpt-4o-mini",
    multi_modal=True,
)

# Anthropic Claude with vision
agent_claude = Agent(
    model_name="claude-sonnet-4-5",
    multi_modal=True,
)

# Groq with LLaVA
agent_groq = Agent(
    model_name="groq/llava-v1.5-7b-4096-preview",
    multi_modal=True,
)
Best Practices
1. Specific Task Instructions
# Bad: Vague instruction
response = agent.run(task="Look at this image", img="photo.jpg")

# Good: Specific instruction
response = agent.run(
    task="""Identify all vehicles in this image, count them by type
    (cars, trucks, motorcycles), and describe their colors and positions.""",
    img="traffic.jpg",
)
2. Image Quality
# Ensure images are:
# - Clear and well-lit
# - High enough resolution (min 512x512 recommended)
# - In supported formats (JPEG, PNG, WebP)
# - Not too large (under 20MB)
import os

from PIL import Image


def validate_image(image_path: str) -> bool:
    """Validate image before processing"""
    if not os.path.exists(image_path):
        return False
    try:
        # Open in a context manager so the file handle is closed afterwards
        with Image.open(image_path) as img:
            width, height = img.size
        # Check minimum resolution
        if width < 512 or height < 512:
            print("Warning: Image resolution is low")
        # Check file size
        file_size = os.path.getsize(image_path) / (1024 * 1024)  # MB
        if file_size > 20:
            print("Warning: Image file is large")
        return True
    except Exception as e:
        print(f"Image validation failed: {e}")
        return False
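The checklist above names JPEG, PNG, and WebP as supported formats, but a file extension does not guarantee the contents. The following sketch checks the file header instead, using only the standard library. `sniff_image_format` is a hypothetical helper and covers just those three formats.

```python
from typing import Optional


def sniff_image_format(image_path: str) -> Optional[str]:
    """Return "jpeg", "png", or "webp" based on the file header, else None."""
    with open(image_path, "rb") as f:
        header = f.read(12)
    # JPEG files start with the bytes FF D8 FF.
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # PNG files start with a fixed 8-byte signature.
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # WebP files are RIFF containers: "RIFF", a 4-byte size, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

A check like this could run before the resolution and size checks, so unsupported files are rejected without being decoded.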
3. Structured Output
import json
from typing import List

from pydantic import BaseModel, Field
from swarms import Agent


class ImageAnalysis(BaseModel):
    description: str = Field(..., description="Overall image description")
    objects_detected: List[str] = Field(..., description="List of detected objects")
    dominant_colors: List[str] = Field(..., description="Main colors in image")
    scene_type: str = Field(..., description="Type of scene (indoor, outdoor, etc)")


agent = Agent(
    model_name="gpt-4o",
    multi_modal=True,
    output_type="json",
)

# Serialize the schema as JSON (not a Python dict repr) before embedding it in the prompt
schema = json.dumps(ImageAnalysis.model_json_schema(), indent=2)

response = agent.run(
    task=f"""Analyze this image and return a JSON response matching this schema:
{schema}""",
    img="scene.jpg",
)

# Raises pydantic.ValidationError if the response is not valid JSON for this schema
result = ImageAnalysis.model_validate_json(response)
print(result)
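Models sometimes wrap JSON in a Markdown code fence or surround it with explanatory prose, which makes strict parsing fail. The sketch below shows one way to recover the JSON payload before validation. `extract_json` is a hypothetical helper, and the fallback of taking the outermost braces is a heuristic, not a guarantee.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence marker


def extract_json(response: str) -> Any:
    """Parse JSON from a response that may wrap it in a code fence or prose."""
    text = response.strip()
    # Prefer the contents of a fenced block if one is present.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if match:
        text = match.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the text.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start : end + 1])
```

The parsed dictionary could then be passed to `ImageAnalysis.model_validate(...)` in place of validating the raw response string.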
4. Error Handling
import logging
import os

from swarms import Agent

logger = logging.getLogger(__name__)


def process_image_safely(agent: Agent, task: str, img_path: str) -> str:
    """Process image with error handling"""
    try:
        # Validate image exists
        if not os.path.exists(img_path):
            return f"Error: Image not found at {img_path}"
        # Process image
        response = agent.run(task=task, img=img_path)
        return response
    except Exception as e:
        logger.error(f"Image processing failed: {e}")
        return f"Image processing error: {str(e)}"


result = process_image_safely(
    agent=vision_agent,
    task="Analyze this image",
    img_path="photo.jpg",
)
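Some failures, such as rate limits or network timeouts, are transient and may succeed on a second attempt. The following is a minimal sketch of retrying with exponential backoff. `run_with_retries` is a hypothetical helper, not a Swarms feature, and it retries on any exception, which is an assumption that may be too broad for real use.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Possible usage with an agent (assumed call shape):
# response = run_with_retries(lambda: vision_agent.run(task="Analyze this image", img="photo.jpg"))
```

The `sleep` parameter is injectable so the delay schedule can be exercised in tests without waiting.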
Output Examples
Typical vision agent output:
🤖 Agent: Vision-Analyst
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📸 Image Analysis:
This image shows a modern factory floor with the following elements:
1. **Equipment**:
   - 3 robotic arms in the center
   - Conveyor belt system running left to right
   - Control panels on the far wall
2. **Safety Features**:
   - Yellow safety barriers around robotic area
   - Emergency stop buttons visible
   - Proper lighting throughout
3. **Personnel**:
   - 2 workers wearing safety vests and hard hats
   - Both maintaining safe distance from robotic area
4. **Overall Assessment**:
   - Clean and organized workspace
   - Safety protocols appear to be followed
   - No visible hazards or concerns
Common Use Cases
Retail and E-commerce
# Product catalog generation
product_agent = Agent(
    agent_name="Product-Cataloger",
    system_prompt="Generate product descriptions from images",
    model_name="gpt-4o",
    multi_modal=True,
)

description = product_agent.run(
    task="Create a detailed product description for an e-commerce listing",
    img="product_photo.jpg",
)
Manufacturing and QA
# Defect detection
qa_agent = Agent(
    agent_name="QA-Inspector",
    system_prompt="Inspect products for defects and quality issues",
    model_name="gpt-4o",
    multi_modal=True,
)

inspection = qa_agent.run(
    task="Inspect this product for defects, scratches, or quality issues",
    img="product_inspection.jpg",
)
Healthcare
# Medical documentation
med_doc_agent = Agent(
    agent_name="Medical-Documentation",
    system_prompt="Extract information from medical documents and forms",
    model_name="gpt-4o",
    multi_modal=True,
)

extracted_data = med_doc_agent.run(
    task="Extract patient information and medical data from this form",
    img="patient_form.jpg",
)
Next Steps
- Streaming - Stream vision analysis in real-time
- Multi-Agent Vision - Coordinate vision agents
- Agent Output Types - Structure vision outputs
- Model Providers - Explore vision-capable models