Ingestion Module

Overview

The Ingestion module is VIGIA’s multi-channel intake system. It captures adverse event reports from email (POP3/IMAP), web forms, WhatsApp, and uploaded documents, which are processed with OCR.

Architecture

The ingestion pipeline consists of three main components:

Connectors: Protocol-specific adapters for each data source
Processors: OCR and text extraction engines
Payload Builder: Normalizes data into ICSR-compliant format

The ingestion service automatically detects the source protocol and routes data through the appropriate processing pipeline.

Email Connectors

POP3 Connector

The POP3 connector (backend/app/services/email_pop.py:346) implements stateful polling with UIDL tracking to prevent duplicate processing.
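
The UIDL tracking described above can be sketched as follows. This is a minimal illustration rather than the code in email_pop.py: the helpers load_seen, save_seen, and new_messages are hypothetical names, and storing the state as a JSON list is an assumption. The uidl_lines argument has the shape of the listing returned by poplib.POP3.uidl(), a list of byte strings such as b"1 uid-abc".

```python
import json
from pathlib import Path
from typing import List, Set, Tuple


def load_seen(path: Path) -> Set[str]:
    """Load previously processed UIDLs; a missing file means nothing was seen."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_seen(path: Path, seen: Set[str]) -> None:
    """Persist the UIDL set as a sorted JSON list."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")


def new_messages(uidl_lines: List[bytes], seen: Set[str]) -> List[Tuple[int, str]]:
    """Return (message_number, uidl) pairs that have not been processed yet."""
    fresh = []
    for line in uidl_lines:
        number, uid = line.decode("ascii", errors="replace").split(maxsplit=1)
        if uid not in seen:
            fresh.append((int(number), uid))
    return fresh
```

After a message is processed successfully, its UIDL would be added to the set and saved, so the next poll skips it.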

# Environment variables
POP_HOST="mail.example.com"
POP_PORT="995"
POP_USER="[email protected]"
POP_PASS="secure_password"
POP_SSL="true"
POP_TIMEOUT_SECONDS="10"
POP_CONNECT_RETRIES="3"

# Size limits
POP_MAX_BYTES="26214400"  # 25MB default

# Processing behavior
POP_TOP_PREVIEW_LINES="60"  # Fast header preview
POP_FALLBACK_LAST="0"  # Don't re-download old emails
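
One way to read these variables into a typed object is sketched below. PopConfig and load_pop_config are hypothetical names, and using the example values above as defaults is an assumption (the source only states that 25MB is the default for POP_MAX_BYTES).

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PopConfig:
    host: str
    port: int
    user: str
    password: str
    use_ssl: bool
    timeout_seconds: int
    connect_retries: int
    max_bytes: int
    top_preview_lines: int
    fallback_last: int


def load_pop_config(env: Mapping[str, str] = os.environ) -> PopConfig:
    """Build a config from POP_* variables; host, user, and password are required."""
    return PopConfig(
        host=env["POP_HOST"],
        port=int(env.get("POP_PORT", "995")),
        user=env["POP_USER"],
        password=env["POP_PASS"],
        use_ssl=_as_bool(env.get("POP_SSL", "true")),
        timeout_seconds=int(env.get("POP_TIMEOUT_SECONDS", "10")),
        connect_retries=int(env.get("POP_CONNECT_RETRIES", "3")),
        max_bytes=int(env.get("POP_MAX_BYTES", str(25 * 1024 * 1024))),
        top_preview_lines=int(env.get("POP_TOP_PREVIEW_LINES", "60")),
        fallback_last=int(env.get("POP_FALLBACK_LAST", "0")),
    )
```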

Key Features:

Duplicate detection using Message-ID fingerprinting
Automatic retry with exponential backoff
Large attachment handling (images >15KB, PDFs)
HTML email parsing with fallback to plain text
DNS resolution and TCP probing for diagnostics
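
The retry behavior listed above could look roughly like this. It is a sketch under assumptions: with_retries is a hypothetical helper, the base delay of one second and the doubling factor are illustrative, and only OSError (which covers socket timeouts and connection errors) is retried. Protocol errors raised as poplib.error_proto are not OSError subclasses and would propagate immediately.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`; on OSError wait base_delay * 2**attempt and try again."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return action()
        except OSError:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

Passing the sleep function as a parameter keeps the helper testable without real delays.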

The POP3 connector increases poplib._MAXLINE to 1MB to handle long lines in email bodies. This is applied automatically on import.
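
For reference, the override amounts to a single module-level assignment. The exact value used by the connector is not shown in the source; 1 MiB is assumed here for "1MB".

```python
import poplib

# poplib rejects response lines longer than _MAXLINE (2048 bytes by default)
# with a "line too long" protocol error. Raising the limit at import time
# lets long body lines through. The 1 MiB figure is an assumption.
poplib._MAXLINE = 1024 * 1024
```

Because _MAXLINE is a private module attribute, the override affects every POP3 connection in the process.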

IMAP Connector

The IMAP connector (backend/app/services/email_imap.py:95) provides folder-based filtering and optional mark-as-read functionality.

IMAP Configuration

IMAP_HOST="imap.gmail.com"
IMAP_PORT="993"
IMAP_USER="[email protected]"
IMAP_PASS="app_password"
IMAP_FOLDER="INBOX"  # Or "Reports/Adverse Events"
IMAP_MARK_SEEN="true"  # Mark as read after processing
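
A minimal sketch of how these settings might drive a fetch with the standard imaplib module. The function fetch_unseen is a hypothetical name, not the connector's API. It expects an already authenticated connection, for example one created with imaplib.IMAP4_SSL(host, port) followed by login(user, password).

```python
import imaplib
from typing import List


def fetch_unseen(conn: imaplib.IMAP4, folder: str = "INBOX", mark_seen: bool = True) -> List[bytes]:
    """Return the raw bytes of every message in `folder` that is not flagged as seen."""
    # Quote the mailbox name so folders containing spaces are accepted.
    status, _ = conn.select('"%s"' % folder)
    if status != "OK":
        raise RuntimeError(f"cannot select folder {folder!r}")
    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        raise RuntimeError("UNSEEN search failed")
    messages = []
    for num in data[0].split():
        # BODY.PEEK[] downloads the message without implicitly marking it read.
        status, parts = conn.fetch(num, "(BODY.PEEK[])")
        if status != "OK":
            continue
        messages.append(parts[0][1])
        if mark_seen:
            # Mirrors IMAP_MARK_SEEN="true": flag only after a successful fetch.
            conn.store(num, "+FLAGS", "\\Seen")
    return messages
```

Fetching with PEEK and setting the flag explicitly keeps the read state under the caller's control, which matches the optional nature of IMAP_MARK_SEEN.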

IMAP vs POP3:

Feature	IMAP	POP3
Folder support	✅ Yes	❌ No
Server-side search	✅ UNSEEN flag	❌ Client-side
State management	🔵 Server	🟡 Local UIDL
Performance	⚡ Faster	🐢 Requires full download

Web Form Ingestion

The web form router (backend/app/routers/ingest.py:64) exposes REST endpoints for direct ICSR creation.

Text Input

POST /api/v1/ingest/text/create

{
  "text": "Paciente de 45 años presentó urticaria tras tomar Ibuprofeno 400mg...",
  "overrides": {
    "gravedad": "Moderada",
    "reportante_nombre": "Dr. García"
  }
}

# Response: ICSRResponse with auto-generated fields

File Upload

POST /api/v1/ingest/file/create

# multipart/form-data
file: <PDF/image with adverse event report>
overrides: '{"producto_sospechoso": "Paracetamol 500mg"}'

# Pipeline: OCR → NLP extraction → ICSR creation
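
A client for the text endpoint can be sketched with the standard library alone. The helper name build_text_create_request and the base URL are assumptions; authentication headers are omitted because the source does not describe them.

```python
import json
import urllib.request
from typing import Optional


def build_text_create_request(
    base_url: str, text: str, overrides: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/ingest/text/create."""
    body = {"text": text}
    if overrides:
        body["overrides"] = overrides
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/ingest/text/create",
        # ensure_ascii=False keeps accented Spanish characters readable in the payload.
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(request) and the response body parsed with json.loads.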

WhatsApp Integration

The WhatsApp connector configuration is stored in tenant settings and is reported by the connector status endpoint:

Connector Status (GET /api/v1/ingest/connectors)

{
  "connectors": [
    {
      "key": "whatsapp",
      "enabled": true,
      "last_sync": "2025-03-03T22:00:00Z",
      "details": {
        "phone_number": "+51999999999",
        "api_token": "EAAx..."
      }
    }
  ]
}

OCR Processing

The OCR engine (backend/app/services/ocr_service.py:130) supports images and PDFs with multiple backends.

Image Processing

Images pass through three steps, with OpenCV handling the preprocessing:

Grayscale conversion (OpenCV)
Otsu’s binarization threshold (OpenCV)
Tesseract OCR (spa/eng language packs)
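
To show what the binarization step computes, here is a dependency-free sketch of Otsu's method over a flat list of grayscale pixel values. It is for illustration only and is not the service's code. With OpenCV the same preprocessing is typically written as cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) followed by cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU).

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map pixels above the Otsu threshold to 255 and all others to 0."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]
```

The threshold is chosen per image, which is why the method copes with scans of varying brightness without a hand-tuned cutoff.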

PDF Handling

Two-tier approach:

Primary: pdf2image + Poppler (higher quality)
Fallback: PyMuPDF rasterization at 200 DPI
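
The two-tier approach could be organized as below. This is a sketch under assumptions, not the code in ocr_service.py: the function names are hypothetical, the third-party imports are deferred so a missing dependency only disables its own tier, and the primary tier uses pdf2image's default resolution because the source does not state one.

```python
import io
from typing import Callable, List, Sequence

Rasterizer = Callable[[bytes], List[bytes]]  # PDF bytes -> one PNG per page


def rasterize_with_poppler(pdf_bytes: bytes) -> List[bytes]:
    """Primary tier: pdf2image, which delegates rendering to Poppler."""
    from pdf2image import convert_from_bytes  # third-party, needs Poppler installed

    pages = []
    for image in convert_from_bytes(pdf_bytes):
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        pages.append(buf.getvalue())
    return pages


def rasterize_with_pymupdf(pdf_bytes: bytes) -> List[bytes]:
    """Fallback tier: PyMuPDF at 200 DPI (PDF user space is 72 units per inch)."""
    import fitz  # third-party: PyMuPDF

    zoom = 200 / 72
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return [page.get_pixmap(matrix=fitz.Matrix(zoom, zoom)).tobytes("png") for page in doc]


def rasterize_pdf(
    pdf_bytes: bytes,
    tiers: Sequence[Rasterizer] = (rasterize_with_poppler, rasterize_with_pymupdf),
) -> List[bytes]:
    """Try each tier in order and return the first non-empty result."""
    errors = []
    for tier in tiers:
        try:
            pages = tier(pdf_bytes)
            if pages:
                return pages
        except Exception as exc:  # missing dependency or unreadable file: try the next tier
            errors.append(f"{tier.__name__}: {exc}")
    raise RuntimeError("all PDF rasterizers failed: " + "; ".join(errors))
```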

OCR Configuration

TESSERACT_CMD="/usr/bin/tesseract"  # Auto-detected
POPPLER_PATH="/usr/bin"  # Required for pdf2image
OCR_CONFIG="--oem 3 --psm 6"  # LSTM + assume uniform block
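
As an illustration of how these settings map onto a Tesseract invocation, the hypothetical helper below builds the command line. Whether the service calls the CLI directly or goes through a wrapper library is not stated in the source.

```python
import shlex
from typing import List, Sequence


def tesseract_argv(
    image_path: str,
    tesseract_cmd: str = "tesseract",
    languages: Sequence[str] = ("spa", "eng"),
    ocr_config: str = "--oem 3 --psm 6",
) -> List[str]:
    """Build a Tesseract command that prints the recognized text to stdout."""
    return [
        tesseract_cmd,
        image_path,
        "stdout",                   # output base; "stdout" prints instead of writing a file
        "-l", "+".join(languages),  # e.g. "spa+eng"
        *shlex.split(ocr_config),   # OCR_CONFIG split into separate arguments
    ]
```

The resulting list can be passed to subprocess.run(argv, capture_output=True, text=True, check=True), whose stdout attribute then holds the extracted text.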

Supported Formats:

Images: PNG, JPG, JPEG, WebP, TIFF
Documents: PDF (multi-page)

For best OCR results, ensure source documents have at least 300 DPI resolution. The system automatically enhances lower quality images.

Payload Construction

The ingestion service (backend/app/services/ingesta_service.py:198) normalizes all inputs into standardized ICSR payloads.

Field Extraction Logic

Two-stage extraction:

AI Extraction (if enabled):
- Gemini API with structured output
- Entity recognition for patient data
- Automatic email extraction from sender
Rule-Based Fallback:
- Regex patterns for Spanish medical terms
- Heuristic age/weight/date parsing
- Symptom keyword detection
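
The two stages can be combined as sketched below. Everything here is illustrative: the field names (edad, peso_kg, sintomas), the keyword list, and the regular expressions are assumptions, and falling back when the AI extractor raises or returns nothing is an assumed policy (the source only states that the rule-based path is used when AI extraction is disabled).

```python
import re
from typing import Callable, Dict, Optional

AGE_RE = re.compile(r"(\d{1,3})\s*años", re.IGNORECASE)
WEIGHT_RE = re.compile(r"(\d{1,3}(?:[.,]\d+)?)\s*kg\b", re.IGNORECASE)
SYMPTOM_KEYWORDS = ("urticaria", "náuseas", "cefalea", "mareo")  # illustrative only


def rule_based_extract(text: str) -> Dict[str, object]:
    """Second stage: regex and keyword heuristics for Spanish narratives."""
    fields: Dict[str, object] = {}
    age = AGE_RE.search(text)
    if age:
        fields["edad"] = int(age.group(1))
    weight = WEIGHT_RE.search(text)
    if weight:
        fields["peso_kg"] = float(weight.group(1).replace(",", "."))
    lowered = text.lower()
    symptoms = [kw for kw in SYMPTOM_KEYWORDS if kw in lowered]
    if symptoms:
        fields["sintomas"] = symptoms
    return fields


def extract_fields(
    text: str, ai_extractor: Optional[Callable[[str], dict]] = None
) -> Dict[str, object]:
    """First stage: AI extractor if one is supplied; otherwise, or on failure, use rules."""
    if ai_extractor is not None:
        try:
            result = ai_extractor(text)
            if result:
                return result
        except Exception:
            pass  # any AI failure falls through to the rule-based stage
    return rule_based_extract(text)
```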

Data Normalization

Key Normalizations (backend/app/services/ingesta_service.py:49)

# Sex normalization
"M" → "Masculino"
"F gestante" → "Femenino gestante"
"F no gestante" → "Femenino no gestante"

# Age validation
< 0 or > 120 → 30 (default)

# Dose spacing fix
"10 0 mg" → "100 mg"
"5 ml" → "5 mL"

# Date parsing
"10/08/2025" → "2025-08-10"
"este año" → current year inference

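The normalizations listed above can be expressed as small pure functions. This is a sketch that reproduces only the documented examples. The function names are hypothetical, unknown sex values are passed through unchanged as an assumption, and the relative-date inference ("este año") is left out.

```python
import re
from datetime import datetime
from typing import Optional

SEX_MAP = {
    "m": "Masculino",
    "f gestante": "Femenino gestante",
    "f no gestante": "Femenino no gestante",
}


def normalize_sex(value: str) -> str:
    """Expand the documented abbreviations; other values pass through unchanged."""
    return SEX_MAP.get(value.strip().lower(), value)


def normalize_age(age: int, default: int = 30) -> int:
    """Replace ages outside 0-120 with the documented default of 30."""
    return age if 0 <= age <= 120 else default


def normalize_dose(value: str) -> str:
    """Repair digit groups split by spaces and use the canonical "mL" casing."""
    joined = re.sub(r"(?<=\d)\s+(?=\d)", "", value)  # "10 0 mg" -> "100 mg"
    return re.sub(r"(?<=\d)\s*ml\b", " mL", joined, flags=re.IGNORECASE)


def normalize_date(value: str) -> Optional[str]:
    """Convert DD/MM/YYYY to ISO 8601; return None when the value does not parse."""
    try:
        return datetime.strptime(value.strip(), "%d/%m/%Y").date().isoformat()
    except ValueError:
        return None
```

Note that joining digit groups is lossy: an input such as "2 500 mg" would also become "2500 mg", so this rule suits OCR spacing errors rather than free-form text.
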
Full Narrative Storage

STORE_FULL_NARRATIVE=true behavior

# Stores complete email body in descripcion_evento field
# Separate short summary in ea_evento (String 255)
# Enables AI-powered follow-up question generation

Configuration

AI-Powered Extraction

Enable Gemini Extractor

GEMINI_ENABLED="true"
GEMINI_API_KEY="AIzaSy..."
GEMINI_MODEL="gemini-1.5-pro"
STORE_FULL_NARRATIVE="true"  # Store complete email text

When GEMINI_ENABLED=false, the system falls back to rule-based NLP extraction with lower accuracy but no API costs.

Connector Management

Toggle Connector (POST /api/v1/ingest/connectors/gmail/toggle)

{
  "enabled": true
}

# Test connection (POST /api/v1/ingest/connectors/gmail/test)
# Response:
{
  "ok": true,
  "message": "Test ok for gmail"
}

Monitoring

Ingestion items are tracked per source:

List Recent Items (GET /api/v1/ingest/items?source=gmail&limit=10)

{
  "items": [
    {
      "id": "msg-12345",
      "source": "gmail",
      "subject": "Reporte RAM - Urticaria",
      "status": "processed",
      "created_at": "2025-03-03T10:30:00Z"
    }
  ]
}

Best Practices

Email Setup

Use a dedicated inbox for adverse event reports
Configure email filters to move reports to a specific folder (IMAP)
Enable app passwords instead of account passwords
Set reasonable polling intervals (5-15 minutes)

OCR Optimization

Request 300+ DPI scanned documents from reporters
Use PDF format when possible (preserves quality)
Enable preprocessing for low-quality images
Review OCR confidence scores in audit logs

Data Quality

Enable AI extraction for complex narratives
Configure a sender email whitelist for auto-approval
Set required fields in overrides for form submissions
Regularly review extraction accuracy metrics

Performance

Limit POP3 fetch to 10-20 emails per poll
Use IMAP UNSEEN filter for faster queries
Set POP_MAX_BYTES to prevent memory issues
Enable TOP preview (POP_TOP_PREVIEW_LINES) to skip large emails

Troubleshooting

POP3 Connection Fails

Diagnostic Steps

# 1. Check DNS resolution
nslookup mail.example.com

# 2. Test TCP connectivity
telnet mail.example.com 995

# 3. Verify SSL/TLS
openssl s_client -connect mail.example.com:995

# 4. Review logs
grep "POP connect" /var/log/vigia/app.log
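
Steps 1 and 2 can also be run from Python, which is useful on hosts without nslookup or telnet. The probe function below is a hypothetical helper built on the standard socket module. It is not the connector's built-in diagnostic.

```python
import socket
from typing import Dict


def probe(host: str, port: int, timeout: float = 10.0) -> Dict[str, object]:
    """Resolve `host`, then attempt a TCP connection to `port`."""
    result: Dict[str, object] = {
        "host": host, "port": port, "addresses": [], "tcp_ok": False, "error": None,
    }
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Each entry ends with a sockaddr tuple whose first item is the IP address.
        result["addresses"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result
```

For example, probe("mail.example.com", 995) separates a DNS failure from a blocked port, which the connection error alone does not reveal.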

OCR Returns Empty Text

Common causes:

Missing Tesseract language packs (apt install tesseract-ocr-spa)
Poppler not installed for PDF processing
Image resolution too low (<150 DPI)
Handwritten text (not supported)

Solution: Enable debug logging with OCR_DEBUG=true

Duplicate Emails Processed

The POP3 UIDL storage may have been cleared. Check the state files:

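# Location: storage/pop_uidls.json
# Location: storage/pop_seen_keys.json

# Manually mark as seen. This overwrites the stored list, so include
# every UID that should be skipped ("uid1" and "uid2" are placeholders).
import json
from pathlib import Path

UIDL_DB = Path("storage/pop_uidls.json")
UIDL_DB.write_text(json.dumps(["uid1", "uid2"]))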

API Reference

/api/v1/ingest/text/preview

Preview ICSR payload from text without creating record

/api/v1/ingest/file/create

Upload file (PDF/image), run OCR, create ICSR

/api/v1/ingest/email/pull

Fetch unseen emails (does not process)

/api/v1/ingest/email/process

Fetch and extract fields from emails

/api/v1/ingest/connectors

List all configured ingestion connectors

Related Documentation

OCR/NLP Module

Deep dive into AI extraction and translation

ICSR Management

Working with Individual Case Safety Reports
