Overview

The OcrProvider interface defines the contract that all OCR providers must implement to work with meikipop. This abstraction allows you to swap different OCR backends without modifying the core application logic. All interface definitions are located in src/ocr/interface.py.

OcrProvider abstract class

Your custom provider must inherit from OcrProvider and implement its abstract methods:
src/ocr/interface.py
import abc
from typing import List, Optional

from PIL import Image

class OcrProvider(abc.ABC):
    """
    Abstract base class for an OCR provider.

    Any class that implements this interface can be used by the application's
    OcrProcessor. This allows for easily swapping out different OCR backends.
    """

    @property
    @abc.abstractmethod
    def NAME(self) -> str:
        """A user-friendly name for this provider."""
        raise NotImplementedError

    @abc.abstractmethod
    def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
        """
        Performs OCR on the given image.

        Args:
            image: A PIL Image object to perform OCR on.

        Returns:
            A list of Paragraph objects found in the image, or None if an
            error occurred. Returns an empty list if no text is found.
        """
        raise NotImplementedError

Required properties

NAME (str, required): A unique, user-friendly string for your provider (e.g., "My Cool OCR"). This name appears in the settings and tray icon menus.

Required methods

scan (method, required)
The core method where all OCR processing happens.
Parameters:
  • image (PIL.Image.Image): The screen region to scan
Returns:
  • List[Paragraph]: if OCR succeeds (return an empty list [] if no text is found)
  • None: if a critical error occurred
The scan method receives a PIL Image object and must return data in meikipop’s standard format. Your main task is converting your OCR engine’s output into this format.
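Putting the NAME property and the scan method together, a minimal provider might look like the sketch below. The dataclasses and OcrProvider base class are inlined as stand-ins so the example is self-contained; in meikipop you would import them from src.ocr.interface. EchoProvider and its hard-coded output are hypothetical.

```python
import abc
from dataclasses import dataclass
from typing import List, Optional

# Stand-ins mirroring src/ocr/interface.py so this sketch runs on its own.
@dataclass(frozen=True)
class BoundingBox:
    center_x: float
    center_y: float
    width: float
    height: float

@dataclass(frozen=True)
class Word:
    text: str
    separator: str
    box: BoundingBox

@dataclass(frozen=True)
class Paragraph:
    full_text: str
    words: List[Word]
    box: BoundingBox
    is_vertical: bool

class OcrProvider(abc.ABC):
    @property
    @abc.abstractmethod
    def NAME(self) -> str: ...
    @abc.abstractmethod
    def scan(self, image) -> Optional[List[Paragraph]]: ...

class EchoProvider(OcrProvider):
    """A toy provider that 'recognizes' one hard-coded paragraph."""

    @property
    def NAME(self) -> str:
        return "Echo OCR"

    def scan(self, image) -> Optional[List[Paragraph]]:
        # In a real provider, `image` is a PIL Image passed to your engine.
        box = BoundingBox(center_x=0.5, center_y=0.5, width=0.4, height=0.1)
        word = Word(text="日本語", separator="", box=box)
        return [Paragraph(full_text="日本語", words=[word],
                          box=box, is_vertical=False)]
```

A real provider follows the same shape: construct BoundingBox, Word, and Paragraph objects from your engine's raw output and return them from scan.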

Data models

Your scan method must return data using these three immutable dataclasses:

BoundingBox

Represents the location and size of text with normalized coordinates.
src/ocr/interface.py
@dataclass(frozen=True)
class BoundingBox:
    """A normalized bounding box. All coordinates are floats between 0.0 and 1.0."""
    center_x: float
    center_y: float
    width: float
    height: float
center_x (float, required): Horizontal center position, normalized to the 0.0-1.0 range (0.0 is the left edge)
center_y (float, required): Vertical center position, normalized to the 0.0-1.0 range (0.0 is the top edge)
width (float, required): Width of the bounding box, normalized to the 0.0-1.0 range
height (float, required): Height of the bounding box, normalized to the 0.0-1.0 range
All coordinates and dimensions must be normalized to a 0.0-1.0 float range, relative to the input image’s dimensions. (0.0, 0.0) represents the top-left corner.

Converting pixel coordinates to normalized format

If your OCR engine returns absolute pixel coordinates, you need to convert them:
# Raw data from your OCR engine
raw_box = {'x': 50, 'y': 100, 'w': 200, 'h': 40}
img_width, img_height = 1000, 800

# Conversion to normalized center_x, center_y, width, height
center_x = (raw_box['x'] + raw_box['w'] / 2) / img_width  # 0.15
center_y = (raw_box['y'] + raw_box['h'] / 2) / img_height  # 0.15
width = raw_box['w'] / img_width  # 0.2
height = raw_box['h'] / img_height  # 0.05

# Create the meikipop object
meiki_box = BoundingBox(center_x, center_y, width, height)
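The conversion above can be wrapped in a reusable helper; the later example implementation calls one named convert_to_bounding_box. Here is a sketch of such a helper (the BoundingBox stand-in mirrors the dataclass from src/ocr/interface.py):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:  # stand-in for src.ocr.interface.BoundingBox
    center_x: float
    center_y: float
    width: float
    height: float

def convert_to_bounding_box(raw_box: dict, img_width: int,
                            img_height: int) -> BoundingBox:
    """Convert a pixel-space {'x', 'y', 'w', 'h'} dict (top-left origin)
    into a normalized, center-based BoundingBox."""
    return BoundingBox(
        center_x=(raw_box['x'] + raw_box['w'] / 2) / img_width,
        center_y=(raw_box['y'] + raw_box['h'] / 2) / img_height,
        width=raw_box['w'] / img_width,
        height=raw_box['h'] / img_height,
    )
```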

Word

Represents a single recognized text element.
src/ocr/interface.py
@dataclass(frozen=True)
class Word:
    """Represents a single word recognized by the OCR."""
    text: str  # this can be either a word or a single character
    separator: str  # The separator that follows the word (e.g., a space) - optional
    box: BoundingBox
text (str, required): The recognized text. Can be a full word ("日本語") or a single character ("日").
separator (str, required): The character that follows the word. Usually an empty string "" for Japanese text.
box (BoundingBox, required): The bounding box for this specific word or character.
Meikipop’s hit-scanning works well with both word-level and character-level boxes. Providing single-character boxes often leads to more precise dictionary lookups.
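If your engine only returns line-level boxes, one rough way to get per-character Word objects for horizontal text is to subdivide the line box evenly. This is an approximation that assumes roughly uniform glyph widths (reasonable for CJK text, poor for proportional Latin fonts); split_into_char_words is a hypothetical helper, and the dataclass stand-ins mirror src/ocr/interface.py.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class BoundingBox:  # stand-in for src.ocr.interface.BoundingBox
    center_x: float
    center_y: float
    width: float
    height: float

@dataclass(frozen=True)
class Word:  # stand-in for src.ocr.interface.Word
    text: str
    separator: str
    box: BoundingBox

def split_into_char_words(text: str, line_box: BoundingBox) -> List[Word]:
    """Subdivide a horizontal line's box evenly into one Word per character.
    Assumes roughly equal character widths; a rough approximation only."""
    char_w = line_box.width / len(text)
    left_edge = line_box.center_x - line_box.width / 2
    return [
        Word(text=ch, separator="",
             box=BoundingBox(center_x=left_edge + (i + 0.5) * char_w,
                             center_y=line_box.center_y,
                             width=char_w,
                             height=line_box.height))
        for i, ch in enumerate(text)
    ]
```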

Paragraph

Represents a block of text composed of words.
src/ocr/interface.py
@dataclass(frozen=True)
class Paragraph:
    """Represents a block of text, composed of words."""
    full_text: str
    words: List[Word]
    box: BoundingBox
    is_vertical: bool  # True if text is top-to-bottom - optional
full_text (str, required): The complete, reconstructed text of the paragraph.
words (List[Word], required): A list of Word objects that form this paragraph.
box (BoundingBox, required): The bounding box encompassing the entire paragraph.
is_vertical (bool, required): Set this to True if the text is written top-to-bottom (vertical Japanese text). If your OCR engine doesn’t provide this information, you can infer it from the bounding box aspect ratio (height > width).
For Japanese text, correctly setting is_vertical is crucial for proper text rendering and lookups.
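For engines that don't report text direction, the aspect-ratio heuristic mentioned above can be written as a one-line helper. This is a sketch with a hypothetical name; comparing pixel dimensions (as the example implementation below does) avoids the distortion that normalized coordinates introduce on non-square images.

```python
def infer_is_vertical(pixel_w: float, pixel_h: float) -> bool:
    """Heuristic: a text block taller than it is wide is most likely
    vertical (top-to-bottom) Japanese text."""
    return pixel_h > pixel_w
```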

Example implementation

Here’s how a typical scan method transforms OCR data:
src/ocr/providers/dummy/provider.py
def scan(self, image: Image.Image) -> Optional[List[Paragraph]]:
    try:
        # 1. Get OCR data from your engine
        # (your_ocr_engine is a placeholder for your actual OCR library)
        raw_ocr_results = your_ocr_engine.recognize(image)
        
        # 2. Transform to meikipop format
        paragraphs: List[Paragraph] = []
        img_width, img_height = image.size
        
        for ocr_line in raw_ocr_results:
            # Convert pixel bbox to normalized BoundingBox
            line_bbox_data = ocr_line.get("bbox")
            center_x = (line_bbox_data['x'] + line_bbox_data['w'] / 2) / img_width
            center_y = (line_bbox_data['y'] + line_bbox_data['h'] / 2) / img_height
            norm_w = line_bbox_data['w'] / img_width
            norm_h = line_bbox_data['h'] / img_height
            
            line_box = BoundingBox(center_x, center_y, norm_w, norm_h)
            
            # Determine text direction
            is_vertical = line_bbox_data['h'] > line_bbox_data['w']
            
            # Process words
            words_in_para: List[Word] = []
            for word_data in ocr_line.get("words", []):
                # Convert word coordinates
                word_box = convert_to_bounding_box(word_data['bbox'], img_width, img_height)
                words_in_para.append(Word(text=word_data['text'], separator="", box=word_box))
            
            # Create Paragraph object
            paragraph = Paragraph(
                full_text=ocr_line.get("text"),
                words=words_in_para,
                box=line_box,
                is_vertical=is_vertical
            )
            paragraphs.append(paragraph)
        
        # 3. Return the results
        return paragraphs
        
    except Exception as e:
        logger.error(f"Error in OCR: {e}", exc_info=True)
        return None  # Return None on critical errors

Next steps

Create a custom provider

Step-by-step guide to building your own OCR provider

Available providers

Explore the built-in OCR providers in meikipop
