Overview
The Ingestion module is VIGIA’s multi-channel data intake system that captures adverse event reports from diverse sources including email (POP3/IMAP), web forms, WhatsApp, and document uploads with OCR capabilities.
Architecture
The ingestion pipeline consists of three main components:
- Connectors: Protocol-specific adapters for each data source
- Processors: OCR and text extraction engines
- Payload Builder: Normalizes data into ICSR-compliant format
The ingestion service automatically detects the source protocol and routes data through the appropriate processing pipeline.
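The routing step can be pictured as a lookup from source name to pipeline. The following is a minimal sketch only: the handler functions, the source keys, and the returned dictionary shape are all hypothetical and are not taken from the VIGIA codebase.

```python
from typing import Callable, Dict


# Hypothetical pipeline handlers; the real connectors and processors
# are not shown in this document.
def handle_email(raw: bytes) -> dict:
    return {"source": "email", "text": raw.decode("utf-8", errors="replace")}


def handle_web_form(raw: bytes) -> dict:
    return {"source": "web_form", "text": raw.decode("utf-8", errors="replace")}


def handle_whatsapp(raw: bytes) -> dict:
    return {"source": "whatsapp", "text": raw.decode("utf-8", errors="replace")}


# Assumed source keys: both email protocols share one pipeline.
PIPELINES: Dict[str, Callable[[bytes], dict]] = {
    "pop3": handle_email,
    "imap": handle_email,
    "web_form": handle_web_form,
    "whatsapp": handle_whatsapp,
}


def route(source: str, raw: bytes) -> dict:
    """Send raw input through the pipeline registered for its source."""
    try:
        handler = PIPELINES[source]
    except KeyError:
        raise ValueError(f"unsupported ingestion source: {source!r}") from None
    return handler(raw)
```

A table-driven dispatcher like this keeps each connector isolated, so adding a channel means registering one more entry rather than editing a branch chain.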
Email Connectors
POP3 Connector
The POP3 connector (backend/app/services/email_pop.py:346) implements stateful polling with UIDL tracking to prevent duplicate processing.
- Duplicate detection using Message-ID fingerprinting
- Automatic retry with exponential backoff
- Large attachment handling (images >15KB, PDFs)
- HTML email parsing with fallback to plain text
- DNS resolution and TCP probing for diagnostics
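The duplicate-prevention and retry ideas above can be sketched with the standard library. This is an illustration under assumptions, not the code in email_pop.py: the function names, the SHA-256 choice, and the backoff parameters are hypothetical. The only library fact relied on is that `poplib.POP3.uidl()` returns lines shaped like `b"1 <uid>"`.

```python
import hashlib
from email import message_from_bytes
from typing import Iterable, List, Set, Tuple


def new_messages(uidl_lines: Iterable[bytes], seen: Set[str]) -> List[Tuple[int, str]]:
    """Return (message_number, uid) pairs whose UID has not been seen yet.

    `uidl_lines` is the line list returned by poplib.POP3.uidl().
    """
    fresh = []
    for line in uidl_lines:
        number, uid = line.decode("ascii", errors="replace").split(None, 1)
        if uid not in seen:
            fresh.append((int(number), uid))
    return fresh


def message_fingerprint(raw: bytes) -> str:
    """Hash the Message-ID header; fall back to the whole message if absent."""
    message_id = (message_from_bytes(raw).get("Message-ID") or "").strip()
    basis = message_id.encode("utf-8") if message_id else raw
    return hashlib.sha256(basis).hexdigest()


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Delays in seconds for each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]
```

Tracking UIDs catches messages the server still holds, while the Message-ID fingerprint also catches the same email arriving again under a new UID.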
IMAP Connector
The IMAP connector (backend/app/services/email_imap.py:95) provides folder-based filtering and optional mark-as-read functionality.
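As a rough sketch of how folder selection, the UNSEEN search, and optional mark-as-read fit together with Python's `imaplib`: the function name and parameters below are hypothetical, and the actual connector may be structured differently. Selecting the folder read-only is one way to leave the read flag untouched.

```python
import imaplib
from typing import List


def fetch_unseen(conn: imaplib.IMAP4, folder: str = "INBOX",
                 mark_as_read: bool = False) -> List[bytes]:
    """Return the raw RFC 822 bytes of every unseen message in `folder`."""
    # A read-only SELECT prevents the server from setting the \Seen flag
    # when messages are fetched.
    status, _ = conn.select(folder, readonly=not mark_as_read)
    if status != "OK":
        raise RuntimeError(f"cannot select folder {folder!r}")

    # Server-side search: only message numbers without \Seen come back.
    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        raise RuntimeError("UNSEEN search failed")

    messages = []
    for num in data[0].split():
        status, parts = conn.fetch(num.decode("ascii"), "(RFC822)")
        # A successful fetch yields [(envelope, raw_bytes), b")"].
        if status == "OK" and parts and isinstance(parts[0], tuple):
            messages.append(parts[0][1])
    return messages
```

In practice `conn` would be an `imaplib.IMAP4_SSL` instance that has already logged in.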
IMAP Configuration
| Feature | IMAP | POP3 |
|---|---|---|
| Folder support | ✅ Yes | ❌ No |
| Server-side search | ✅ UNSEEN flag | ❌ Client-side |
| State management | 🔵 Server | 🟡 Local UIDL |
| Performance | ⚡ Faster | 🐢 Requires full download |
Web Form Ingestion
The web form router (backend/app/routers/ingest.py:64) exposes REST endpoints for direct ICSR creation.
- Text Input
- File Upload
POST /api/v1/ingest/text/create
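A client call to this endpoint might be prepared as follows. The request schema is not shown in this document, so the `text` field name, the JSON content type, and the base URL are assumptions for illustration only; authentication headers are omitted.

```python
import json
from urllib.request import Request

# Hypothetical deployment URL.
BASE_URL = "https://vigia.example.com"


def build_text_create_request(text: str) -> Request:
    """Build (but do not send) a POST request for the text ingestion endpoint."""
    # Assumed body shape: a single "text" field carrying the narrative.
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/api/v1/ingest/text/create",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the result to `urllib.request.urlopen` would send it; check the API schema for the actual field names first.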
WhatsApp Integration
WhatsApp connector configuration is stored in tenant settings.
Connector Status (GET /api/v1/ingest/connectors)
OCR Processing
The OCR engine (backend/app/services/ocr_service.py:130) supports images and PDFs with multiple backends.
Image Processing
Images are preprocessed with OpenCV and then passed to Tesseract:
- Grayscale conversion
- Otsu’s binarization threshold
- Tesseract OCR (spa/eng language packs)
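To show what the Otsu step computes, here is a dependency-free sketch in plain Python. It is an illustration of the technique, not the service's code: in OpenCV the same result is normally obtained with `cv2.threshold` and the `cv2.THRESH_OTSU` flag on a grayscale image. The function names are hypothetical.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance.

    Pixels with value <= threshold form the dark class.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # summed intensity of the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map each grayscale pixel to 0 (dark) or 255 (light)."""
    return [255 if value > threshold else 0 for value in pixels]
```

Because the threshold is derived from the image histogram, scans with different brightness levels are binarized without a hand-tuned constant.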
OCR Configuration
- Images: PNG, JPG, JPEG, WebP, TIFF
- Documents: PDF (multi-page)
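A simple gate on the supported formats could look like the sketch below. The function and constant names are hypothetical; the extension sets mirror the list above, so variants that are not listed (for example `.tif`) are rejected here.

```python
from pathlib import PurePath

# Extensions taken from the supported-format list.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".tiff"}
DOCUMENT_EXTENSIONS = {".pdf"}


def classify_upload(filename: str) -> str:
    """Return "image" or "document", or raise for unsupported file types."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    raise ValueError(f"unsupported file type: {suffix or filename!r}")
```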
Payload Construction
The ingestion service (backend/app/services/ingesta_service.py:198) normalizes all inputs into standardized ICSR payloads.
Field Extraction Logic
Two-stage extraction:
- AI Extraction (if enabled):
  - Gemini API with structured output
  - Entity recognition for patient data
  - Automatic email extraction from sender
- Rule-Based Fallback:
  - Regex patterns for Spanish medical terms
  - Heuristic age/weight/date parsing
  - Symptom keyword detection
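The two-stage flow can be sketched as an optional AI extractor with a rule-based fallback. Everything here is illustrative: the regex patterns, the keyword list, the output field names, and the `ai_extractor` callable are assumptions, not the patterns used by the ingestion service.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical patterns for Spanish narratives.
AGE_RE = re.compile(r"\b(\d{1,3})\s*a[ñn]os\b")
WEIGHT_RE = re.compile(r"\b(\d{1,3}(?:[.,]\d+)?)\s*kg\b")
# Hypothetical symptom keywords.
SYMPTOM_KEYWORDS = ("náusea", "vómito", "mareo", "fiebre", "erupción", "dolor de cabeza")


def extract_with_rules(text: str) -> Dict[str, object]:
    """Rule-based extraction of age, weight, and symptom keywords."""
    lowered = text.lower()
    age = AGE_RE.search(lowered)
    weight = WEIGHT_RE.search(lowered)
    return {
        "age_years": int(age.group(1)) if age else None,
        # Spanish text often uses a decimal comma ("70,5 kg").
        "weight_kg": float(weight.group(1).replace(",", ".")) if weight else None,
        "symptoms": [kw for kw in SYMPTOM_KEYWORDS if kw in lowered],
    }


def extract(text: str,
            ai_extractor: Optional[Callable[[str], Dict[str, object]]] = None
            ) -> Dict[str, object]:
    """Try the AI extractor first; fall back to rules if it is off or fails."""
    if ai_extractor is not None:
        try:
            result = ai_extractor(text)
            if result:
                return result
        except Exception:
            pass  # any AI failure falls through to the rule-based path
    return extract_with_rules(text)
```

Keeping the fallback independent of the AI stage means ingestion still yields a usable payload when the external API is disabled or unavailable.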
Data Normalization
Key Normalizations (backend/app/services/ingesta_service.py:49)
Full Narrative Storage
STORE_FULL_NARRATIVE=true behavior
Configuration
AI-Powered Extraction
Enable Gemini Extractor
Connector Management
Toggle Connector (POST /api/v1/ingest/connectors/gmail/toggle)
Monitoring
Ingestion items are tracked per source.
List Recent Items (GET /api/v1/ingest/items?source=gmail&limit=10)
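For scripted monitoring, the query string can be built rather than concatenated by hand. This sketch uses only the `source` and `limit` parameters that appear in the endpoint above; the helper name and base URL are hypothetical.

```python
from urllib.parse import urlencode


def items_url(base_url: str, source: str, limit: int = 10) -> str:
    """Build the URL for listing recent ingestion items of one source."""
    query = urlencode({"source": source, "limit": limit})
    return f"{base_url.rstrip('/')}/api/v1/ingest/items?{query}"
```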
Best Practices
Email Setup
- Use dedicated inbox for adverse event reports
- Configure email filters to move reports to specific folder (IMAP)
- Enable app passwords instead of account passwords
- Set reasonable polling intervals (5-15 minutes)
OCR Optimization
- Request 300+ DPI scanned documents from reporters
- Use PDF format when possible (preserves quality)
- Enable preprocessing for low-quality images
- Review OCR confidence scores in audit logs
Data Quality
- Enable AI extraction for complex narratives
- Configure sender email whitelist for auto-approval
- Set required fields in overrides for form submissions
- Regularly review extraction accuracy metrics
Performance
- Limit POP3 fetch to 10-20 emails per poll
- Use IMAP UNSEEN filter for faster queries
- Set MAX_BYTES to prevent memory issues
- Enable TOP preview to skip large emails
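The POP3 limits above can be applied before any message body is downloaded, using the sizes reported by the LIST command. This is a sketch under assumptions: the function name is hypothetical, and treating MAX_BYTES as a per-message size cap is a guess about its meaning. `poplib.POP3.list()` does return lines shaped like `b"1 1520"`, and `poplib.POP3.top(which, howmuch)` can then fetch only headers plus a few body lines for a preview.

```python
from typing import Iterable, List


def plan_fetch(list_lines: Iterable[bytes], max_bytes: int, limit: int = 20) -> List[int]:
    """Choose which message numbers to download in this poll.

    `list_lines` is the line list from poplib.POP3.list(); each line holds
    a message number and its size in octets.
    """
    chosen: List[int] = []
    for line in list_lines:
        number_text, size_text = line.decode("ascii").split()[:2]
        if int(size_text) > max_bytes:
            continue  # too large, skip without downloading
        chosen.append(int(number_text))
        if len(chosen) >= limit:
            break  # per-poll cap reached
    return chosen
```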
Troubleshooting
POP3 Connection Fails
Diagnostic Steps
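A connection check in the spirit of the connector's DNS resolution and TCP probing can be reproduced with the standard `socket` module. The function name and result keys below are hypothetical; this is a sketch, not the diagnostic code shipped with the connector.

```python
import socket
from typing import Dict


def probe(host: str, port: int, timeout: float = 5.0) -> Dict[str, object]:
    """Resolve `host`, then attempt a TCP connection to `port`."""
    result: Dict[str, object] = {
        "host": host, "port": port,
        "addresses": [], "tcp_ok": False, "error": None,
    }
    # Step 1: DNS resolution.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["addresses"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result
    # Step 2: TCP probe (no POP3 commands are sent).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result
```

Separating the two steps shows whether a failure comes from name resolution or from a blocked or closed port, for example when probing a mail host on port 995.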
OCR Returns Empty Text
Common causes:
- Missing Tesseract language packs (apt install tesseract-ocr-spa)
- Poppler not installed for PDF processing
- Image resolution too low (<150 DPI)
- Handwritten text (not supported)
OCR_DEBUG=true
Duplicate Emails Processed
POP3 UIDL storage may be cleared. Check:
API Reference
/api/v1/ingest/text/preview
Preview ICSR payload from text without creating record
/api/v1/ingest/file/create
Upload file (PDF/image), run OCR, create ICSR
/api/v1/ingest/email/pull
Fetch unseen emails (does not process)
/api/v1/ingest/email/process
Fetch and extract fields from emails
/api/v1/ingest/connectors
List all configured ingestion connectors
Related Documentation
OCR/NLP Module
Deep dive into AI extraction and translation
ICSR Management
Working with Individual Case Safety Reports