Process Document with OCR

Overview

This endpoint accepts an uploaded document (PDF or image) and extracts its text with Tesseract OCR. The service detects the file type automatically and applies the matching processing method.

Supported File Formats

Images: PNG, JPG, JPEG, WEBP, TIF, TIFF
Documents: PDF (multi-page support)

Request

POST /api/v1/ocr/preview

file (required)

The document or image file to process, sent as multipart form data. Must be one of the supported formats.

Example Request

curl -X POST https://api.vigia.com/api/v1/ocr/preview \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

import requests

url = "https://api.vigia.com/api/v1/ocr/preview"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Open the file in binary mode; the context manager closes it after the upload
with open("document.pdf", "rb") as f:
    response = requests.post(url, files={"file": f}, headers=headers)

print(response.json())

Response

text (string)

The extracted text from the document. For multi-page PDFs, the text of all pages is concatenated, separated by newlines.

Example Response

{
  "text": "Paciente: C.R.\nEdad: 45 años\nSexo: Masculino\nPeso: 72 kg\n\nProducto sospechoso: Ibuprofeno 400 mg\nFecha de inicio: 2025-08-10\n\nEvento adverso: Presentó urticaria generalizada tras administración del medicamento.\n\nReportante: Dra. María González"
}

OCR Processing Details

Image Processing

For image files, the service:

Loads the image using OpenCV (if available) or PIL
Converts to grayscale
Applies Otsu’s thresholding for better text detection
Runs Tesseract OCR with Spanish and English language support
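
To make the grayscale and thresholding steps concrete, here is a minimal pure-Python sketch of what they compute. It is an illustration only: the service performs these steps with OpenCV or PIL, and the function names `to_gray`, `otsu_threshold`, and `binarize` are hypothetical. The grayscale weights are the standard BT.601 luma coefficients.

```python
def to_gray(rgb_pixels):
    """Convert (R, G, B) tuples to 0-255 gray levels using BT.601 luma weights."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def otsu_threshold(gray_pixels):
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in gray_pixels:
        hist[p] += 1

    total = len(gray_pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no pixels in the lower class yet
        if w_fg == 0:
            break     # every pixel is in the lower class; nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray_pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in gray_pixels]
```

The binarized image is what gets passed to Tesseract. Separating dark text from a light background this way is why high-contrast scans tend to produce better results.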

PDF Processing

For PDF files, the service uses two strategies:

Primary Method: pdf2image + Poppler

Converts each PDF page to high-resolution images
Applies OCR to each page individually
Concatenates results from all pages

Fallback Method: PyMuPDF (fitz)

Rasterizes PDF pages at 200 DPI
Processes each page with Tesseract
Used when Poppler is not available
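
The primary/fallback arrangement above can be sketched as a loop over rasterizers. This is not the service's code: `ocr_pdf` and its parameters are hypothetical, and the rasterizers and the per-page OCR function are passed in as callables so the sketch does not depend on any library. In practice the first callable might wrap pdf2image's `convert_from_path` and the second PyMuPDF rendering at 200 DPI.

```python
def ocr_pdf(path, rasterizers, ocr_page):
    """Try each (name, rasterize) pair in order and OCR the pages of the first that works.

    rasterizers: list of (name, callable) where callable(path) returns page images.
    ocr_page: callable(page_image) returning the text of one page.
    """
    failures = []
    for name, rasterize in rasterizers:
        try:
            pages = rasterize(path)
        except Exception as exc:  # e.g. Poppler is not installed
            failures.append(f"{name}: {exc}")
            continue
        # One OCR call per page, results joined with newline separators
        return "\n".join(ocr_page(page) for page in pages)
    raise RuntimeError("PDF processing failed: " + "; ".join(failures))
```

Only a rasterization failure triggers the fallback here; an OCR error on a page propagates to the caller. Whether the service behaves the same way is not stated in this page.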

Language Support

The OCR engine attempts text extraction in the following order:

Spanish (spa)
Spanish + English (spa+eng)
English only (eng)
Default language (fallback)
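
A minimal sketch of this ordering, assuming that "attempt" means moving to the next option when an OCR call raises (for example because a language pack is missing). The names `LANGUAGE_ORDER` and `ocr_with_language_fallback` are hypothetical, and `run_ocr` stands in for whatever function actually invokes Tesseract.

```python
# Order taken from the list above; None means "let the engine use its default"
LANGUAGE_ORDER = ("spa", "spa+eng", "eng", None)


def ocr_with_language_fallback(image, run_ocr):
    """Call run_ocr(image, lang) for each language option until one succeeds."""
    last_error = None
    for lang in LANGUAGE_ORDER:
        try:
            return run_ocr(image, lang)
        except Exception as exc:  # assumed failure signal, e.g. missing language data
            last_error = exc
    raise RuntimeError("OCR failed for every language option") from last_error
```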

Configuration

OCR behavior is configured via environment variables:

TESSERACT_CMD: Path to Tesseract executable
POPPLER_PATH: Path to Poppler utilities (for PDF processing)
OCR_CONFIG: Tesseract configuration (default: --oem 3 --psm 6)
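
As an illustration of how these variables might be read, the sketch below loads them into a small settings object. `OcrSettings` and `load_ocr_settings` are hypothetical names; only the variable names and the default come from the list above. In Tesseract's own terms, `--oem 3` selects the default OCR engine mode and `--psm 6` treats the page as a single uniform block of text.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_OCR_CONFIG = "--oem 3 --psm 6"


@dataclass(frozen=True)
class OcrSettings:
    tesseract_cmd: Optional[str]  # None: rely on the executable found on PATH
    poppler_path: Optional[str]   # None: rely on Poppler found on PATH
    ocr_config: str


def load_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Read OCR settings from environment variables, treating empty values as unset."""
    return OcrSettings(
        tesseract_cmd=env.get("TESSERACT_CMD") or None,
        poppler_path=env.get("POPPLER_PATH") or None,
        ocr_config=env.get("OCR_CONFIG") or DEFAULT_OCR_CONFIG,
    )
```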

Debug Endpoint

To check OCR service configuration and available languages:

GET /api/v1/ocr/debug

Returns:

{
  "TESSERACT_CMD_env": "/usr/bin/tesseract",
  "pytesseract_cmd": "/usr/bin/tesseract",
  "tesseract_version": "tesseract 5.3.0",
  "langs": ["eng", "spa", "spa_old"],
  "POPPLER_PATH": "/usr/bin"
}
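
One use of this payload is a startup check that the expected language packs are installed. The sketch below assumes the field names shown in the example; `missing_languages` is a hypothetical helper, and the choice of `spa` and `eng` as required languages is an assumption based on the Language Support section.

```python
def missing_languages(debug_payload, required=("spa", "eng")):
    """Return the required Tesseract languages absent from the debug payload's `langs`."""
    installed = set(debug_payload.get("langs", []))
    return [lang for lang in required if lang not in installed]
```

Pass it the parsed JSON body of the debug response. An empty list means every required language is available.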

Error Handling

error (string)

Error message returned if OCR processing fails.

Common Errors

Unsupported file format: The uploaded file type is not supported
OCR disabled: Tesseract is not properly installed or configured
PDF processing failed: Poppler/PyMuPDF dependencies missing
Invalid file: The uploaded file is corrupted or cannot be read
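
A client can treat the presence of `error` as a failure signal. The sketch below assumes that a failed request returns a JSON body containing the `error` field; this page does not document HTTP status codes, so none are checked. `OcrError` and `text_from_response` are hypothetical names.

```python
class OcrError(Exception):
    """Raised when the OCR response carries an error message."""


def text_from_response(payload):
    """Return the extracted text, or raise OcrError if the payload reports a failure."""
    error = payload.get("error")
    if error:
        raise OcrError(error)
    return payload.get("text", "")
```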

Best Practices

Image Quality: Upload high-resolution images (300 DPI minimum) for best results
File Size: Keep files under 10 MB for optimal processing time
Document Orientation: Ensure text is properly oriented (not rotated)
Contrast: High contrast between text and background improves accuracy
Multi-page PDFs: Processing time increases linearly with page count
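
Some of these checks can run on the client before uploading. The sketch below validates the file extension against the supported formats and flags files over the 10 MB guideline. `upload_warnings` is a hypothetical helper, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Extensions from the Supported File Formats section
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".tif", ".tiff", ".pdf"}
RECOMMENDED_MAX_BYTES = 10 * 1024 * 1024  # assumed interpretation of "10 MB"


def upload_warnings(filename, size_bytes):
    """Return a list of human-readable warnings; an empty list means no issues found."""
    warnings = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        warnings.append(f"unsupported file format: {ext or 'no extension'}")
    if size_bytes > RECOMMENDED_MAX_BYTES:
        warnings.append("file is larger than 10 MB; expect slower processing")
    return warnings
```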

Performance

Single page: ~2-3 seconds
Multi-page PDF: ~2-3 seconds per page
Large images: ~3-5 seconds depending on resolution
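
These figures can be turned into a rough client-side estimate, for example to choose a request timeout. The helper below simply multiplies the approximate 2-3 seconds per page by the page count; `estimate_processing_seconds` is a hypothetical name, and real timings depend on the deployment.

```python
def estimate_processing_seconds(page_count):
    """Return a (low, high) estimate in seconds from the ~2-3 s per page figure."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return (2 * page_count, 3 * page_count)
```

For a 10-page PDF this gives an estimate of 20 to 30 seconds, so a client timeout comfortably above the upper bound is a reasonable starting point.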

Next Steps

After extracting text with OCR, you can:

Use Extract Data to parse structured fields from the text
Use Translate to translate the extracted text to other languages
