Architecture

Overview

Retto follows a three-tier architecture designed for high-performance OCR inference:

Session Layer - Orchestrates the entire OCR pipeline
Processor Layer - Handles preprocessing and postprocessing for each stage
Worker Layer - Executes ONNX model inference

This separation of concerns allows for flexibility in backend selection, efficient processing, and clear data flow through the OCR pipeline.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                      RettoSession                           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Image Input (raw bytes)                             │  │
│  └────────────────┬─────────────────────────────────────┘  │
│                   │                                         │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  ImageHelper: Resize & Normalize                   │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │                                         │
│  ┌────────────────▼───────────────────────────────────┐   │
│  │  DetProcessor: Text Detection                      │   │
│  │    ├─ Preprocess: Normalize, BGR conversion        │   │
│  │    ├─ Worker: det(Array4<f32>) → Array4<f32>       │   │
│  │    └─ Postprocess: Box extraction, filtering       │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │  [Detected text boxes]                 │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  ImageHelper: Crop regions based on boxes          │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │                                         │
│  ┌────────────────▼───────────────────────────────────┐   │
│  │  ClsProcessor: Text Direction Classification       │   │
│  │    ├─ Batch processing (default: 6 images/batch)   │   │
│  │    ├─ Worker: cls(Array4<f32>) → Array2<f32>       │   │
│  │    └─ Postprocess: Rotate 180° if needed           │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │  [Oriented text regions]               │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  RecProcessor: Text Recognition                    │   │
│  │    ├─ Batch processing (default: 6 images/batch)   │   │
│  │    ├─ Worker: rec(Array4<f32>) → Array3<f32>       │   │
│  │    └─ Postprocess: Character decoding with CTC     │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │                                         │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  RettoWorkerResult                                 │   │
│  │    ├─ det_result: Bounding boxes & scores          │   │
│  │    ├─ cls_result: Rotation labels & confidence     │   │
│  │    └─ rec_result: Recognized text & scores         │   │
│  └────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Session Layer

The RettoSession struct is the main entry point for OCR operations. It manages the complete pipeline and coordinates between processors and workers.

Key Components

From retto-core/src/session.rs:9:

pub struct RettoSession<W: RettoWorker> {
    worker: W,
    rec_character: RecCharacter,
    config: RettoSessionConfig<W>,
}
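The generic parameter is what lets the backend be swapped without touching the pipeline. A minimal sketch of that pattern follows; `Worker`, `EchoWorker`, and `Session` are hypothetical stand-ins (not Retto types), and tensors are simplified to flat `Vec<f32>` values instead of ndarray arrays.

```rust
/// Hypothetical stand-in for the worker trait (assumption: the real
/// `RettoWorker` exchanges ndarray tensors, simplified here to Vec<f32>).
trait Worker {
    fn det(&mut self, input: Vec<f32>) -> Result<Vec<f32>, String>;
}

/// A mock backend that echoes its input and counts calls.
struct EchoWorker {
    calls: usize,
}

impl Worker for EchoWorker {
    fn det(&mut self, input: Vec<f32>) -> Result<Vec<f32>, String> {
        self.calls += 1;
        Ok(input)
    }
}

/// The session owns its worker and is generic over it, so the backend
/// is selected at compile time.
struct Session<W: Worker> {
    worker: W,
}

impl<W: Worker> Session<W> {
    fn new(worker: W) -> Self {
        Session { worker }
    }

    /// Forward the input to whichever backend was plugged in.
    fn run(&mut self, pixels: Vec<f32>) -> Result<Vec<f32>, String> {
        self.worker.det(pixels)
    }
}
```

Because the session only depends on the trait, a mock such as `EchoWorker` can stand in for a real inference backend in tests.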

Session Configuration

The session is configured with parameters for image resizing and all three processors:

pub struct RettoSessionConfig<W: RettoWorker> {
    pub worker_config: W::RettoWorkerConfig,
    pub max_side_len: usize,           // Default: 2000
    pub min_side_len: usize,           // Default: 30
    pub det_processor_config: DetProcessorConfig,
    pub cls_processor_config: ClsProcessorConfig,
    pub rec_processor_config: RecProcessorConfig,
}
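One common way to expose defaults like these is a `Default` implementation combined with struct-update syntax. The sketch below assumes that approach; `SessionConfigSketch` is a hypothetical, reduced type, and whether `RettoSessionConfig` itself implements `Default` is not confirmed here. Only the two default values come from the comments above.

```rust
/// Hypothetical, reduced configuration holding only the two size limits.
#[derive(Debug, Clone, PartialEq)]
struct SessionConfigSketch {
    max_side_len: usize,
    min_side_len: usize,
}

impl Default for SessionConfigSketch {
    fn default() -> Self {
        // Defaults taken from the field comments in the struct above.
        Self {
            max_side_len: 2000,
            min_side_len: 30,
        }
    }
}

/// Override one field and keep the remaining defaults.
fn config_for_small_inputs() -> SessionConfigSketch {
    SessionConfigSketch {
        max_side_len: 960, // illustrative value, not a recommended setting
        ..Default::default()
    }
}
```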

Processing Pipeline

The main processing flow is implemented in process_pipeline (session.rs:75):

1. Image Preprocessing

let mut image = ImageHelper::new_from_raw_img_flow(input)?;
let (ori_h, ori_w) = image.size();
let (ratio_h, ratio_w) = image.resize_both(
    self.config.max_side_len, 
    self.config.min_side_len
)?;

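Purpose: Load the input image and resize it so that its sides fit within max_side_len and min_side_len. Each dimension is then rounded to a multiple of 32 pixels for optimal model performance, so the aspect ratio is preserved only approximately. The per-axis scale factors are returned as ratio_h and ratio_w.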
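The size calculation can be sketched as follows. This is an illustration under stated assumptions, not the `resize_both` implementation: the helper names are hypothetical, and the rule used here (scale down when the longest side exceeds the maximum, scale up when the shortest side is below the minimum, then round each side to the nearest multiple of 32) is a guess at the behavior described above.

```rust
/// Round to the nearest multiple of 32, never below 32.
fn round_to_32(v: f64) -> usize {
    let rounded = (v / 32.0).round() as usize * 32;
    rounded.max(32)
}

/// Compute the target (height, width) for an image of size `h` x `w`.
/// Assumption: one scale factor is applied to both axes before rounding.
fn resized_dims(h: usize, w: usize, max_side: usize, min_side: usize) -> (usize, usize) {
    let (hf, wf) = (h as f64, w as f64);
    let longest = hf.max(wf);
    let shortest = hf.min(wf);
    let scale = if longest > max_side as f64 {
        max_side as f64 / longest // shrink oversized images
    } else if shortest < min_side as f64 {
        min_side as f64 / shortest // enlarge tiny images
    } else {
        1.0
    };
    (round_to_32(hf * scale), round_to_32(wf * scale))
}
```

With the default limits, a 1000×1500 image needs no scaling and only gets snapped to 992×1504. Note that rounding to the nearest multiple can land slightly above `max_side`; a stricter variant would round down instead.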

2. Detection Stage

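// `after_h` and `after_w` are the image dimensions after resizing; `arr` is the
// resized image in array form. Both are set up earlier in process_pipeline.
let det = DetProcessor::new(&self.config.det_processor_config, after_h, after_w)?;
let mut det_res = det.process(arr, |i| self.worker.det(i))?;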

Output: Bounding boxes (as PointBox quadrilaterals) and confidence scores for detected text regions.
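The call above passes inference to the processor as a closure, which keeps preprocessing and postprocessing independent of the backend. A minimal sketch of that shape follows; `DetSketch` and its threshold-based postprocessing are hypothetical simplifications, and real box extraction is considerably more involved.

```rust
/// Hypothetical processor: preprocess, call the injected inference
/// closure, then postprocess the raw scores.
struct DetSketch {
    threshold: f32,
}

impl DetSketch {
    fn process<F>(&self, input: Vec<f32>, mut infer: F) -> Result<Vec<usize>, String>
    where
        F: FnMut(Vec<f32>) -> Result<Vec<f32>, String>,
    {
        // Preprocess: scale pixel values into [0, 1].
        let normalized: Vec<f32> = input.iter().map(|p| p / 255.0).collect();
        // Inference: delegated to whatever backend the caller supplies.
        let scores = infer(normalized)?;
        // Postprocess: keep the indices whose score passes the threshold.
        Ok(scores
            .iter()
            .enumerate()
            .filter(|(_, s)| **s >= self.threshold)
            .map(|(i, _)| i)
            .collect())
    }
}
```

A caller would invoke it much like the excerpt above, for example `det.process(pixels, |x| worker.det(x))`, and any backend error propagates through the `?` inside `process`.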

3. Region Cropping

let mut crop_images = det_res.0
    .iter()
    .map(|res| ImageHelper::new_from_rgb_image(
        image.get_crop_img(&res.boxes)
    ))
    .collect::<Vec<_>>();

Purpose: Extract individual text regions using perspective transformation. The cropped regions are automatically rotated by 270° if they are taller than 1.5× their width.
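The tall-region rule can be illustrated on a plain 2D grid. In this sketch the function names are hypothetical, the treatment of the exact 1.5 boundary is an assumption, and so is the rotation direction: 270° is read as clockwise, which is the same as 90° counter-clockwise.

```rust
/// Decide whether a crop counts as "tall" (assumption: the boundary
/// value of exactly 1.5 is included).
fn should_rotate(h: usize, w: usize) -> bool {
    h as f64 >= 1.5 * w as f64
}

/// Rotate a row-major grid by 270° clockwise (90° counter-clockwise).
/// An input of h rows by w columns becomes w rows by h columns.
fn rotate_270_cw<T: Copy>(grid: &[Vec<T>]) -> Vec<Vec<T>> {
    let h = grid.len();
    let w = if h == 0 { 0 } else { grid[0].len() };
    (0..w)
        .map(|r| (0..h).map(|c| grid[c][w - 1 - r]).collect())
        .collect()
}
```

Applied to a 3×1 column, the rotation produces a single 1×3 row, turning a vertical strip into the horizontal layout that a recognition model typically expects.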

4. Classification Stage

let cls = ClsProcessor::new(&self.config.cls_processor_config);
let cls_res = cls.process(&mut crop_images, |i| self.worker.cls(i))?;

Output: Text orientation labels (0° or 180°) with confidence scores. Regions labeled 180° with high confidence are rotated back in place before recognition.
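The correction step amounts to a guarded 180° rotation. The sketch below assumes a row-major pixel buffer and a caller-supplied confidence threshold; `ClsOutput`, the function names, and the threshold parameter are hypothetical and do not reflect Retto's actual types.

```rust
/// Hypothetical classifier output for one cropped region.
struct ClsOutput {
    is_rotated_180: bool,
    score: f32,
}

/// Rotate a row-major pixel buffer by 180° in place. Reversing the
/// flat buffer maps pixel (r, c) to (h-1-r, w-1-c).
fn rotate_180_in_place(pixels: &mut [[u8; 3]]) {
    pixels.reverse();
}

/// Flip the region only when the classifier is confident enough.
/// Returns whether a rotation was applied.
fn correct_orientation(pixels: &mut [[u8; 3]], cls: &ClsOutput, threshold: f32) -> bool {
    let flip = cls.is_rotated_180 && cls.score >= threshold;
    if flip {
        rotate_180_in_place(pixels);
    }
    flip
}
```

Requiring a minimum score avoids flipping regions on an uncertain prediction, since a wrongly flipped crop would degrade recognition.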

5. Recognition Stage

let rec = RecProcessor::new(&self.config.rec_processor_config, &self.rec_character);
let rec_res = rec.process(&crop_images, |i| self.worker.rec(i))?;

Output: Recognized text strings with confidence scores, decoded using CTC (Connectionist Temporal Classification).
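For readers unfamiliar with CTC, greedy decoding takes the most likely class at each time step, collapses consecutive repeats, and drops the blank symbol. The sketch below shows that idea; the blank index of 0, the offset-by-one character table, and the mean-score confidence are assumptions for illustration, not details confirmed for RecProcessor.

```rust
/// Assumption: class 0 is the CTC blank, class i maps to charset[i - 1].
const BLANK: usize = 0;

/// Greedy CTC decoding over per-step class probabilities.
/// Returns the text and the mean score of the emitted characters.
fn ctc_greedy_decode(probs: &[Vec<f32>], charset: &[char]) -> (String, f32) {
    let mut text = String::new();
    let mut scores: Vec<f32> = Vec::new();
    let mut prev: Option<usize> = None;
    for step in probs {
        // Pick the most likely class for this time step.
        let Some((best, &score)) = step
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
        else {
            continue; // skip empty rows
        };
        // Emit only non-blank classes that differ from the previous step.
        if best != BLANK && prev != Some(best) {
            if let Some(&ch) = charset.get(best - 1) {
                text.push(ch);
                scores.push(score);
            }
        }
        prev = Some(best);
    }
    let confidence = if scores.is_empty() {
        0.0
    } else {
        scores.iter().sum::<f32>() / scores.len() as f32
    };
    (text, confidence)
}
```

The blank symbol is what allows genuine double letters: the class sequence `h, blank, h` decodes to "hh", while `h, h` collapses to a single "h".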

Coordinate Transformation

An important aspect of the architecture is coordinate space management. From session.rs:94:

for res in &mut det_res.0 {
    res.boxes.scale_and_clip(
        after_w as f64, after_h as f64, 
        ori_w as f64, ori_h as f64
    );
}

Detected boxes are transformed from the resized image space back to the original image coordinates, ensuring that output bounding boxes match the input image dimensions.
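The transformation itself is a per-axis scale followed by a clamp. The following sketch mirrors the argument order of the call above but is otherwise an assumption: `Point` is a hypothetical type, and clamping to the inclusive range from 0 to the target width or height is a guess at the clipping rule.

```rust
/// Hypothetical corner point of a quadrilateral box.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// Map the four corners from a `from_w` x `from_h` image into a
/// `to_w` x `to_h` image and clamp them to the target bounds.
fn scale_and_clip(points: &mut [Point; 4], from_w: f64, from_h: f64, to_w: f64, to_h: f64) {
    let (sx, sy) = (to_w / from_w, to_h / from_h);
    for p in points.iter_mut() {
        p.x = (p.x * sx).clamp(0.0, to_w);
        p.y = (p.y * sy).clamp(0.0, to_h);
    }
}
```

Scaling each axis separately matters because rounding to multiples of 32 can change the two axes by slightly different factors.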

Execution Modes

Synchronous Mode

The run() method returns all results at once:

pub fn run(&mut self, input: impl AsRef<[u8]>) -> RettoResult<RettoWorkerResult>

Returns RettoWorkerResult containing det_result, cls_result, and rec_result.

Streaming Mode

The run_stream() method sends results as each stage completes:

pub fn run_stream(
    &mut self,
    input: impl AsRef<[u8]>,
    sender: mpsc::Sender<RettoWorkerStageResult>,
) -> RettoResult<()>

This mode is useful for progressive rendering, or for acting on detection results before recognition has finished.
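The producer/consumer shape of streaming mode can be sketched with a standard-library channel. This is an illustration only: it assumes `std::sync::mpsc` (the signature above does not say which `mpsc` is used), and `StageResult`, the stand-in `run_stream`, and the payloads are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical per-stage message, one variant per pipeline stage.
#[derive(Debug, PartialEq)]
enum StageResult {
    Det(usize),       // number of detected boxes
    Cls(usize),       // number of classified regions
    Rec(Vec<String>), // recognized strings
}

/// Stand-in pipeline that emits a message as each stage "completes".
fn run_stream(sender: mpsc::Sender<StageResult>) -> Result<(), mpsc::SendError<StageResult>> {
    sender.send(StageResult::Det(2))?;
    sender.send(StageResult::Cls(2))?;
    sender.send(StageResult::Rec(vec!["hello".into(), "world".into()]))?;
    Ok(())
}

/// Run the pipeline on a worker thread and consume results as they arrive.
fn collect_stages() -> Vec<StageResult> {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || run_stream(tx));
    // The iterator ends once the sender is dropped at the end of run_stream.
    let stages: Vec<StageResult> = rx.iter().collect();
    producer
        .join()
        .expect("producer thread panicked")
        .expect("receiver was dropped early");
    stages
}
```

A real consumer would handle each message inside the receive loop, for example drawing boxes as soon as the detection message arrives, rather than collecting everything first.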

Image Processing Helper

The ImageHelper struct (image_helper.rs:13) provides utilities for:

Loading: From raw bytes or RGB arrays
Resizing: With configurable min/max constraints, always aligned to 32-pixel boundaries
Color conversion: RGB to BGR for model input
Cropping: Perspective transformation for detected regions
Rotation: In-place 180° rotation for orientation correction

All image dimensions are automatically rounded to multiples of 32 pixels to match the model’s requirements and optimize inference performance.
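As an illustration of the color conversion item in the list above, swapping RGB to BGR only exchanges the first and third channel of every pixel. The helper names below are hypothetical, and scaling to the 0 to 1 range is shown as one plausible normalization, not necessarily the one ImageHelper applies.

```rust
/// Swap the red and blue channels of every pixel in place.
fn rgb_to_bgr_in_place(pixels: &mut [[u8; 3]]) {
    for px in pixels.iter_mut() {
        px.swap(0, 2);
    }
}

/// Convert 8-bit channels to floats in [0, 1]
/// (assumption: the model's actual mean/std normalization may differ).
fn to_unit_range(pixels: &[[u8; 3]]) -> Vec<[f32; 3]> {
    pixels
        .iter()
        .map(|px| px.map(|c| c as f32 / 255.0))
        .collect()
}
```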

Error Handling

The architecture uses a unified error type RettoError (error.rs:2) that wraps:

I/O errors
Image decoding errors
Array shape errors
ONNX Runtime errors
HuggingFace Hub API errors
Model loading errors

All operations return RettoResult<T> for consistent error propagation.
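A unified error type of this kind is usually an enum with `From` conversions, so that `?` can lift lower-level errors automatically. The sketch below shows the pattern with two variants; `SketchError`, `SketchResult`, and the helper functions are hypothetical and much smaller than `RettoError`.

```rust
use std::fmt;
use std::io;

/// Hypothetical unified error with two of the wrapped categories.
#[derive(Debug)]
enum SketchError {
    Io(io::Error),
    Shape(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Io(e) => write!(f, "I/O error: {e}"),
            SketchError::Shape(msg) => write!(f, "shape error: {msg}"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Lets `?` convert an io::Error into the unified type.
impl From<io::Error> for SketchError {
    fn from(e: io::Error) -> Self {
        SketchError::Io(e)
    }
}

type SketchResult<T> = Result<T, SketchError>;

/// Read a file; any io::Error is lifted by `?`.
fn read_model(path: &str) -> SketchResult<Vec<u8>> {
    Ok(std::fs::read(path)?)
}

/// Check that a flat buffer matches the expected h x w shape.
fn check_shape(len: usize, h: usize, w: usize) -> SketchResult<()> {
    if len == h * w {
        Ok(())
    } else {
        Err(SketchError::Shape(format!("expected {} elements, got {len}", h * w)))
    }
}
```

Callers then match on a single type regardless of which layer failed, which is the consistency benefit described above.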

Performance Considerations

Batch Processing: Classification and recognition stages process multiple regions in batches (default 6) to maximize throughput
Memory Efficiency: Images are processed in-place where possible
Lazy Evaluation: Workers are only initialized when needed
Size Optimization: Automatic image resizing prevents memory exhaustion on large inputs
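The batch processing point above can be sketched with slice chunking. `process_in_batches` is a hypothetical helper, and the closure stands in for one inference call per batch; only the default batch size of 6 comes from the text.

```rust
/// Run `infer` once per batch of at most `batch_size` items and
/// concatenate the per-batch outputs in order.
fn process_in_batches<T, R, F>(items: &[T], batch_size: usize, mut infer: F) -> Vec<R>
where
    F: FnMut(&[T]) -> Vec<R>,
{
    let mut out = Vec::with_capacity(items.len());
    // `chunks` panics on a size of 0, so fall back to 1.
    for batch in items.chunks(batch_size.max(1)) {
        out.extend(infer(batch));
    }
    out
}
```

With 14 cropped regions and a batch size of 6, the closure runs three times, on batches of 6, 6, and 2 regions, instead of 14 separate inference calls.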

Next Steps

Processors

Workers
