
Overview

This guide explains how to create your own OCR provider to use with meikipop. This allows you to integrate any OCR engine you prefer, whether it’s an offline model, a web service, or a commercial API.
The best way to start is to copy the entire /src/ocr/providers/dummy/ directory, rename it, and modify its contents. The dummy provider is a fully commented template designed for this purpose.

Automatic discovery

Meikipop automatically discovers and loads any valid OCR provider. To be discovered, your provider must meet two conditions:
1. Create a subdirectory

Your provider must be in its own subdirectory inside /src/ocr/providers/. For example: /src/ocr/providers/my_cool_ocr/.
2. Create provider.py

Your subdirectory must have a provider.py file containing a class that inherits from OcrProvider.
Once these conditions are met, meikipop will automatically detect and load your provider on startup.

Implementation steps

Step 1: Set up the directory structure

Create your provider directory:
mkdir src/ocr/providers/my_cool_ocr
touch src/ocr/providers/my_cool_ocr/__init__.py
touch src/ocr/providers/my_cool_ocr/provider.py

Step 2: Define your provider class

In provider.py, create a class that inherits from OcrProvider:
src/ocr/providers/my_cool_ocr/provider.py
import logging
from typing import List, Optional

from PIL import Image

from src.ocr.interface import OcrProvider, Paragraph, Word, BoundingBox

logger = logging.getLogger(__name__)


class MyCoolOcrProvider(OcrProvider):
    """
    A custom OCR provider using My Cool OCR engine.
    """
    # The NAME is displayed in the settings and tray menu
    NAME = "My Cool OCR"
    
    def __init__(self):
        """Initialize your OCR client here."""
        # Import and initialize your OCR library
        # self.client = my_cool_ocr.Client(api_key="...")
        pass
    
    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        """
        Performs OCR on the given image.
        
        This method must:
        1. Get OCR data from your engine
        2. Convert it to meikipop's format
        3. Return a list of Paragraphs
        """
        try:
            # Your implementation here
            return self._process_image(image)
        except Exception as e:
            logger.error(f"Error in {self.NAME}: {e}", exc_info=True)
            return None
    
    def _process_image(self, image: Image.Image) -> List[Paragraph]:
        # Your OCR processing logic
        pass
The NAME property is required and must be a unique, user-friendly string. This name appears in the settings and tray menu.

Step 3: Implement the scan method

Your scan method must perform three key tasks:
1. Obtain OCR data

Call your OCR engine to get raw results. This could be:
  • A Python library call
  • A REST API request
  • A command-line tool execution
  • A local model inference
2. Transform the data

Convert your OCR engine’s proprietary format into meikipop’s standard data model using BoundingBox, Word, and Paragraph objects.
3. Return the results

Return a List[Paragraph] on success, an empty list [] if no text is found, or None if a critical error occurs.
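Put together, these three tasks give scan a simple overall shape. The sketch below shows that flow as a standalone function; the obtain_raw and transform callables are hypothetical stand-ins for your engine call and your conversion code:

```python
import logging
from typing import Any, Callable, List, Optional

logger = logging.getLogger(__name__)

def scan_shape(obtain_raw: Callable, transform: Callable, image) -> Optional[List[Any]]:
    """Sketch of the scan flow: obtain -> transform -> return.

    obtain_raw: runs the OCR engine on the image, returns raw results.
    transform:  converts raw results into a list of Paragraphs.
    """
    try:
        raw = obtain_raw(image)          # 1. obtain OCR data
        if not raw:
            return []                    # no text found -> empty list
        return transform(raw)            # 2. transform, 3. return Paragraphs
    except Exception as exc:
        logger.error("OCR failed: %s", exc, exc_info=True)
        return None                      # critical error -> None
```

Catching the exception inside scan keeps a flaky engine from crashing the whole application; meikipop treats None as "this scan failed" and moves on.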

Complete example: Dummy provider

Here’s the complete dummy provider that demonstrates all required transformations:
src/ocr/providers/dummy/provider.py
import logging
from typing import List, Optional

from PIL import Image

from src.ocr.interface import OcrProvider, Paragraph, Word, BoundingBox

logger = logging.getLogger(__name__)


class DummyProvider(OcrProvider):
    NAME = "Dummy OCR (Developer Template)"

    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        logger.info(f"{self.NAME} received an image of size {image.size}. Returning mock data.")
        
        try:
            # --- 1. OBTAIN OCR DATA ---
            # Simulated output from a fictional OCR engine with pixel coordinates
            mock_ocr_result = [
                {
                    "text": "これは横書きテキストです",
                    "bbox": {"x": 100, "y": 150, "w": 400, "h": 40},
                    "words": [
                        {"text": "これは", "bbox": {"x": 100, "y": 150, "w": 90, "h": 40}},
                        {"text": "横書き", "bbox": {"x": 200, "y": 150, "w": 90, "h": 40}},
                        {"text": "テキストです", "bbox": {"x": 300, "y": 150, "w": 200, "h": 40}},
                    ]
                },
                {
                    "text": "縦書き",
                    "bbox": {"x": 600, "y": 200, "w": 50, "h": 300},
                    "words": [
                        {"text": "縦", "bbox": {"x": 600, "y": 200, "w": 50, "h": 95}},
                        {"text": "書", "bbox": {"x": 600, "y": 305, "w": 50, "h": 95}},
                        {"text": "き", "bbox": {"x": 600, "y": 405, "w": 50, "h": 95}},
                    ]
                }
            ]
            
            # --- 2. PROCESS AND TRANSFORM THE DATA ---
            paragraphs: List[Paragraph] = []
            img_width, img_height = image.size
            
            if img_width == 0 or img_height == 0:
                logger.error("Invalid image dimensions received.")
                return None
            
            for ocr_line in mock_ocr_result:
                line_text = ocr_line.get("text")
                line_bbox_data = ocr_line.get("bbox")
                
                # Convert pixel bbox to normalized coordinates (0.0 to 1.0)
                center_x = (line_bbox_data['x'] + line_bbox_data['w'] / 2) / img_width
                center_y = (line_bbox_data['y'] + line_bbox_data['h'] / 2) / img_height
                norm_w = line_bbox_data['w'] / img_width
                norm_h = line_bbox_data['h'] / img_height
                
                line_box = BoundingBox(center_x, center_y, norm_w, norm_h)
                
                # Infer text direction from aspect ratio
                is_vertical = line_bbox_data['h'] > line_bbox_data['w']
                
                # Process words within the line
                words_in_para: List[Word] = []
                for word_data in ocr_line.get("words", []):
                    word_bbox_data = word_data.get("bbox")
                    
                    # Convert word coordinates
                    word_center_x = (word_bbox_data['x'] + word_bbox_data['w'] / 2) / img_width
                    word_center_y = (word_bbox_data['y'] + word_bbox_data['h'] / 2) / img_height
                    word_norm_w = word_bbox_data['w'] / img_width
                    word_norm_h = word_bbox_data['h'] / img_height
                    
                    word_box = BoundingBox(word_center_x, word_center_y, word_norm_w, word_norm_h)
                    words_in_para.append(Word(text=word_data['text'], separator="", box=word_box))
                
                # Assemble the Paragraph object
                paragraph = Paragraph(
                    full_text=line_text,
                    words=words_in_para,
                    box=line_box,
                    is_vertical=is_vertical
                )
                paragraphs.append(paragraph)
            
            # --- 3. RETURN THE RESULT ---
            return paragraphs
            
        except Exception as e:
            logger.error(f"An error occurred in {self.NAME}: {e}", exc_info=True)
            return None
You can provide the interface file, your provider template, and sample JSON output from your OCR engine to an AI assistant (like GPT-4 or Claude) and ask it to write the adapter code for you. This can get you 90% of the way there.

Data transformation patterns

Converting bounding boxes

Your OCR engine likely returns pixel coordinates. You must normalize them:
# From pixel coordinates (top-left corner + dimensions)
raw_box = {'x': 50, 'y': 100, 'w': 200, 'h': 40}
img_width, img_height = 1000, 800

# To normalized center-based coordinates
center_x = (raw_box['x'] + raw_box['w'] / 2) / img_width  # 0.15
center_y = (raw_box['y'] + raw_box['h'] / 2) / img_height  # 0.15
width = raw_box['w'] / img_width  # 0.2
height = raw_box['h'] / img_height  # 0.05

meiki_box = BoundingBox(center_x, center_y, width, height)
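The granularity snippets below call a convert_bbox helper; a minimal version of that conversion looks like this. The NormBox namedtuple is a stand-in for meikipop's BoundingBox, which takes the same four values in the same order:

```python
from collections import namedtuple

# Stand-in for meikipop's BoundingBox(center_x, center_y, width, height)
NormBox = namedtuple('NormBox', ['center_x', 'center_y', 'width', 'height'])

def convert_bbox(raw: dict, img_width: int, img_height: int) -> NormBox:
    """Convert a pixel bbox {'x', 'y', 'w', 'h'} (top-left corner + size)
    into normalized, center-based coordinates in the range 0.0 to 1.0."""
    return NormBox(
        center_x=(raw['x'] + raw['w'] / 2) / img_width,
        center_y=(raw['y'] + raw['h'] / 2) / img_height,
        width=raw['w'] / img_width,
        height=raw['h'] / img_height,
    )

# convert_bbox({'x': 50, 'y': 100, 'w': 200, 'h': 40}, 1000, 800)
# -> NormBox(center_x=0.15, center_y=0.15, width=0.2, height=0.05)
```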

Determining text direction

If your OCR engine doesn’t provide text direction, infer it from the bounding box:
# Vertical text typically has height > width
is_vertical = bounding_box.height > bounding_box.width

# Or from pixel dimensions before normalization
is_vertical = raw_bbox['h'] > raw_bbox['w']

Handling word vs. character granularity

Meikipop works well with both word-level and character-level boxes. Character-level boxes often provide more precise lookups.
# Character-level (recommended for Japanese)
for char_info in line_chars:
    words_in_line.append(Word(
        text=char_info['char'],  # Single character
        separator="",
        box=convert_bbox(char_info['bbox'])
    ))

# Word-level (also works)
for word_info in line_words:
    words_in_line.append(Word(
        text=word_info['text'],  # Full word
        separator="",
        box=convert_bbox(word_info['bbox'])
    ))

Common OCR integration patterns

Python library integration

import my_cool_ocr_library

class MyCoolOcrProvider(OcrProvider):
    def __init__(self):
        self.client = my_cool_ocr_library.Client(api_key="...")
    
    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        raw_results = self.client.recognize(image)
        return self._transform_results(raw_results)

REST API integration

import requests
import io

class ApiOcrProvider(OcrProvider):
    def __init__(self):
        self.api_url = "https://api.myocr.com/v1/scan"
        self.api_key = "your-api-key"
    
    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        # Convert image to bytes
        buffer = io.BytesIO()
        image.save(buffer, format='PNG')
        
        # Make API request
        response = requests.post(
            self.api_url,
            files={'image': ('image.png', buffer.getvalue(), 'image/png')},
            headers={'Authorization': f'Bearer {self.api_key}'},
            timeout=30  # Don't hang meikipop if the API is unreachable
        )
        
        if response.status_code != 200:
            return None
        
        return self._transform_results(response.json())

Command-line tool integration

import subprocess
import json
import os
import tempfile

class CliOcrProvider(OcrProvider):
    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        # Save to a temp file (delete=False so the CLI tool can reopen it)
        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
            image.save(tmp.name)
        
        try:
            # Run the CLI tool
            result = subprocess.run(
                ['ocr-tool', '--json', tmp.name],
                capture_output=True,
                text=True
            )
            
            if result.returncode != 0:
                return None
            
            raw_results = json.loads(result.stdout)
            return self._transform_results(raw_results)
        finally:
            os.unlink(tmp.name)  # Clean up the temp file

Activating your provider

Once your provider is implemented:
1. Run meikipop

Start the application. Your provider will be automatically discovered.
2. Open the tray menu

Right-click the meikipop tray icon.
3. Select OCR provider

Navigate to OCR Provider in the menu.
4. Choose your provider

Select your provider by its NAME from the list. Meikipop will now use your class for all OCR operations.
Make sure your provider’s NAME is unique to avoid conflicts with existing providers.

Testing and debugging

Enable debug logging

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

class MyCoolOcrProvider(OcrProvider):
    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        logger.debug(f"Received image of size {image.size}")
        paragraphs = self._process_image(image)
        logger.debug(f"Found {len(paragraphs)} paragraphs")
        return paragraphs

Validate coordinates

def _validate_bbox(self, box: BoundingBox) -> bool:
    """Ensure all coordinates are in valid range."""
    if not (0.0 <= box.center_x <= 1.0):
        logger.warning(f"Invalid center_x: {box.center_x}")
        return False
    if not (0.0 <= box.center_y <= 1.0):
        logger.warning(f"Invalid center_y: {box.center_y}")
        return False
    if not (0.0 <= box.width <= 1.0):
        logger.warning(f"Invalid width: {box.width}")
        return False
    if not (0.0 <= box.height <= 1.0):
        logger.warning(f"Invalid height: {box.height}")
        return False
    return True
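A quick way to exercise a provider during development is to feed scan() a synthetic image, for example Image.new('RGB', (800, 600), 'white'), and run the result through a checker like the one below. This is a sketch; the box attribute names follow the dummy provider above:

```python
from typing import List, Optional

def check_scan_output(paragraphs: Optional[List]) -> int:
    """Sanity-check a scan() return value; returns the paragraph count."""
    assert paragraphs is not None, "scan() returned None (critical error)"
    for i, para in enumerate(paragraphs):
        box = para.box
        for name in ('center_x', 'center_y', 'width', 'height'):
            value = getattr(box, name)
            assert 0.0 <= value <= 1.0, (
                f"Paragraph {i}: {name}={value} outside [0.0, 1.0]"
            )
    return len(paragraphs)
```

Run it once against a screenshot you know contains text and once against a blank image: the first should return a positive count, the second an empty list rather than None.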

Next steps

OCR provider interface

Complete reference for the interface and data models

Available providers

Explore the built-in OCR providers for more examples
