Overview
The Vision Agent is MedMitra’s specialized component for analyzing medical images, particularly radiology scans such as X-rays, CT scans, and MRIs. It uses multimodal vision-language models to extract clinical findings directly from images.Architecture
The Vision Agent is implemented inbackend/agents/vision_agent.py and uses Groq’s Llama Vision model for image analysis.
Key Functions
backend/agents/vision_agent.py
Image Analysis Process
Single Image Extraction
Theimage_extraction function processes individual radiology images:
backend/agents/vision_agent.py:19-54
Message Format
The function uses OpenAI’s multimodal message format with two content blocks:- Text prompt: Clinical instructions for analysis
- Image URL: Direct link to the radiology image
Case-Level Processing
Thevision_agent function orchestrates analysis for all radiology files in a case:
backend/agents/vision_agent.py:57-89
Processing Flow
Model Configuration
Vision Model
- Multimodal capabilities: Can process both images and text
- Medical imaging: Fine-tuned for visual understanding
- Fast inference: Optimized 17B parameter model
- Structured output: Generates JSON-formatted findings
Generation Parameters
Radiology Analysis Prompt
TheRADIOLOGY_ANALYSIS_PROMPT guides the model to extract relevant clinical information:
backend/utils/medical_prompts.py
Output Format
The Vision Agent returns structured JSON:Integration with Medical Insights Agent
The Vision Agent’s output is consumed by the Medical Insights Agent:backend/agents/medical_ai_agent.py:94-119
Database Storage
Vision analysis results are stored in the file metadata:Metadata Schema
Usage in Orchestration
The Vision Agent is called during the file processing stage:backend/agentic.py:69-73
Processing Timeline
Error Handling
The Vision Agent includes robust error handling:Common Error Scenarios
- Invalid image URL: Returns error if file URL is inaccessible
- Model timeout: Groq API timeout after extended processing
- JSON parsing errors: Handled by
extract_json_from_stringutility - Database update failures: Logged but don’t block other file processing
Supported Image Formats
The Vision Agent can process:- X-rays: Chest, bone, dental
- CT scans: All body regions
- MRI scans: Brain, spine, joints
- Ultrasound: When uploaded as images
- DICOM: After conversion to web-compatible formats
Performance Considerations
Processing Time
- Single image: ~2-5 seconds
- Case with 5 images: ~10-25 seconds
- Processing is sequential per file
Optimization Opportunities
JSON Extraction Utility
Theextract_json_from_string function handles response parsing:
backend/utils/extractjson.py
Future Enhancements
Planned Features
- DICOM support: Direct processing of medical imaging format
- Multi-view analysis: Comparing multiple angles of the same region
- Temporal comparison: Detecting changes between studies
- Annotation extraction: Reading radiologist markups and measurements
- 3D reconstruction: For CT/MRI volumetric data
Model Upgrades
Potential future models:- Specialized radiology vision models
- Larger context windows for whole-study analysis
- Fine-tuned models for specific imaging modalities
Next Steps
Medical Insights Agent
Learn how vision outputs are used in diagnosis
Complete Workflow
See the full processing pipeline
