Overview

Retto follows a three-tier architecture designed for high-performance OCR inference:
  1. Session Layer - Orchestrates the entire OCR pipeline
  2. Processor Layer - Handles preprocessing and postprocessing for each stage
  3. Worker Layer - Executes ONNX model inference
This separation of concerns keeps backend selection independent of pipeline logic: the session is generic over the worker, processors own all pre- and postprocessing, and data moves through the OCR pipeline in one direction.
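To make the layering concrete, here is a minimal sketch of the same shape using only the standard library. Every name in it (Worker, Processor, Session, DoublingWorker) is a hypothetical stand-in, and flat slices replace the ndarray tensors the real crate uses; it illustrates the division of responsibility, not Retto's actual API.

```rust
/// Worker layer: runs a "model" on an input tensor (flattened here).
/// Hypothetical trait, not the real RettoWorker.
trait Worker {
    fn infer(&mut self, input: &[f32]) -> Vec<f32>;
}

/// Processor layer: owns preprocessing and postprocessing around one
/// inference call, and receives the inference step as a closure.
struct Processor {
    scale: f32,
}

impl Processor {
    fn process(&self, raw: &[u8], mut infer: impl FnMut(&[f32]) -> Vec<f32>) -> Vec<f32> {
        // Preprocess: map bytes into [0, 1], then apply the configured scale.
        let input: Vec<f32> = raw.iter().map(|&b| b as f32 / 255.0 * self.scale).collect();
        // Inference is delegated to whatever backend the caller supplies.
        let output = infer(&input);
        // Postprocess: clamp every value into a valid score range.
        output.into_iter().map(|v| v.clamp(0.0, 1.0)).collect()
    }
}

/// Session layer: owns the worker and wires the processor to it.
struct Session<W: Worker> {
    worker: W,
    processor: Processor,
}

impl<W: Worker> Session<W> {
    fn run(&mut self, raw: &[u8]) -> Vec<f32> {
        let worker = &mut self.worker;
        self.processor.process(raw, |i| worker.infer(i))
    }
}

/// A toy backend that doubles its input.
struct DoublingWorker;

impl Worker for DoublingWorker {
    fn infer(&mut self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|v| v * 2.0).collect()
    }
}
```

Swapping DoublingWorker for another type that implements the trait changes the backend without touching Processor or Session.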

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                      RettoSession                           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Image Input (raw bytes)                             │  │
│  └────────────────┬─────────────────────────────────────┘  │
│                   │                                         │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  ImageHelper: Resize & Normalize                   │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │                                         │
│  ┌────────────────▼───────────────────────────────────┐   │
│  │  DetProcessor: Text Detection                      │   │
│  │    ├─ Preprocess: Normalize, BGR conversion        │   │
│  │    ├─ Worker: det(Array4<f32>) → Array4<f32>       │   │
│  │    └─ Postprocess: Box extraction, filtering       │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │  [Detected text boxes]                 │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  ImageHelper: Crop regions based on boxes          │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │                                         │
│  ┌────────────────▼───────────────────────────────────┐   │
│  │  ClsProcessor: Text Direction Classification       │   │
│  │    ├─ Batch processing (default: 6 images/batch)   │   │
│  │    ├─ Worker: cls(Array4<f32>) → Array2<f32>       │   │
│  │    └─ Postprocess: Rotate 180° if needed           │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │  [Oriented text regions]               │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  RecProcessor: Text Recognition                    │   │
│  │    ├─ Batch processing (default: 6 images/batch)   │   │
│  │    ├─ Worker: rec(Array4<f32>) → Array3<f32>       │   │
│  │    └─ Postprocess: Character decoding with CTC     │   │
│  └────────────────┬───────────────────────────────────┘   │
│                   │                                         │
│                   ▼                                         │
│  ┌────────────────────────────────────────────────────┐   │
│  │  RettoWorkerResult                                 │   │
│  │    ├─ det_result: Bounding boxes & scores          │   │
│  │    ├─ cls_result: Rotation labels & confidence     │   │
│  │    └─ rec_result: Recognized text & scores         │   │
│  └────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Session Layer

The RettoSession struct is the main entry point for OCR operations. It manages the complete pipeline and coordinates between processors and workers.

Key Components

From retto-core/src/session.rs:9:
pub struct RettoSession<W: RettoWorker> {
    worker: W,
    rec_character: RecCharacter,
    config: RettoSessionConfig<W>,
}

Session Configuration

The session is configured with parameters for image resizing and all three processors:
pub struct RettoSessionConfig<W: RettoWorker> {
    pub worker_config: W::RettoWorkerConfig,
    pub max_side_len: usize,           // Default: 2000
    pub min_side_len: usize,           // Default: 30
    pub det_processor_config: DetProcessorConfig,
    pub cls_processor_config: ClsProcessorConfig,
    pub rec_processor_config: RecProcessorConfig,
}

Processing Pipeline

The main processing flow is implemented in process_pipeline (session.rs:75):

1. Image Preprocessing

let mut image = ImageHelper::new_from_raw_img_flow(input)?;
let (ori_h, ori_w) = image.size();
let (ratio_h, ratio_w) = image.resize_both(
    self.config.max_side_len, 
    self.config.min_side_len
)?;
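Purpose: Load the input image and resize it to fit within the configured size constraints while approximately preserving aspect ratio. Each side is rounded to a multiple of 32 pixels to match the model's requirements, which is why separate ratio_h and ratio_w values are returned.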
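The size computation can be sketched as follows. This is an assumption about how such a resize might work, not a copy of resize_both: the function name resized_dims, the round-to-nearest policy, and the direction of the returned ratios (new size divided by original size) are all hypothetical.

```rust
/// Rounds `v` to the nearest multiple of 32, never going below 32.
fn round_to_32(v: f64) -> usize {
    let m = (v / 32.0).round() as usize * 32;
    m.max(32)
}

/// Returns (new_h, new_w, ratio_h, ratio_w) for an image of size h x w.
/// Assumes h and w are non-zero.
fn resized_dims(
    h: usize,
    w: usize,
    max_side_len: usize,
    min_side_len: usize,
) -> (usize, usize, f64, f64) {
    let (hf, wf) = (h as f64, w as f64);
    let longest = hf.max(wf);
    let shortest = hf.min(wf);

    // Shrink when the longest side exceeds the maximum.
    let mut scale = 1.0;
    if longest > max_side_len as f64 {
        scale = max_side_len as f64 / longest;
    }
    // Enlarge when the shortest side would fall below the minimum.
    if shortest * scale < min_side_len as f64 {
        scale = min_side_len as f64 / shortest;
    }

    // Round each side independently; rounding to nearest can land slightly
    // above max_side_len, and the two ratios can differ from each other.
    let new_h = round_to_32(hf * scale);
    let new_w = round_to_32(wf * scale);
    (new_h, new_w, new_h as f64 / hf, new_w as f64 / wf)
}
```

With the documented defaults (2000 and 30), a 100×50 input comes out as 96×64 in this sketch, so the two ratios are 0.96 and 1.28 rather than a single shared scale.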

2. Detection Stage

let det = DetProcessor::new(&self.config.det_processor_config, after_h, after_w)?;
let mut det_res = det.process(arr, |i| self.worker.det(i))?;
Output: Bounding boxes (as PointBox quadrilaterals) and confidence scores for detected text regions. Here after_h and after_w are the image dimensions after resizing.

3. Region Cropping

let mut crop_images = det_res.0
    .iter()
    .map(|res| ImageHelper::new_from_rgb_image(
        image.get_crop_img(&res.boxes)
    ))
    .collect::<Vec<_>>();
Purpose: Extract individual text regions using perspective transformation. The cropped regions are automatically rotated by 270° if they are taller than 1.5× their width.

4. Classification Stage

let cls = ClsProcessor::new(&self.config.cls_processor_config);
let cls_res = cls.process(&mut crop_images, |i| self.worker.cls(i))?;
Output: Text orientation labels (0° or 180°) with confidence scores. Crops classified as rotated by 180° with high confidence are rotated back automatically before recognition.
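A minimal sketch of that correction step, under stated assumptions: ClsOutput, correct_orientation, and the thresh parameter are hypothetical names, and each crop is modelled as a row-major buffer of RGB pixels. Reversing such a buffer is equivalent to rotating the image by 180°.

```rust
/// Hypothetical classification output for one crop.
struct ClsOutput {
    label_180: bool, // true when the classifier predicts a 180° rotation
    score: f32,      // confidence of that prediction
}

/// Rotates a row-major pixel buffer by 180° in place.
/// Pixel (r, c) moves to (h - 1 - r, w - 1 - c), which is a plain reversal.
fn rotate_180_in_place<T>(pixels: &mut [T]) {
    pixels.reverse();
}

/// Rotates every crop whose prediction is "180°" with a score at or above
/// `thresh`, and returns how many crops were rotated.
fn correct_orientation(
    crops: &mut [Vec<[u8; 3]>],
    results: &[ClsOutput],
    thresh: f32,
) -> usize {
    let mut rotated = 0;
    for (crop, res) in crops.iter_mut().zip(results) {
        if res.label_180 && res.score >= thresh {
            rotate_180_in_place(crop);
            rotated += 1;
        }
    }
    rotated
}
```

Low-confidence 180° predictions are left untouched, which avoids flipping crops that were already upright.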

5. Recognition Stage

let rec = RecProcessor::new(&self.config.rec_processor_config, &self.rec_character);
let rec_res = rec.process(&crop_images, |i| self.worker.rec(i))?;
Output: Recognized text strings with confidence scores, decoded using CTC (Connectionist Temporal Classification).
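For readers unfamiliar with CTC, the sketch below shows greedy CTC decoding in its simplest form: take the best class at each timestep, collapse consecutive repeats, and drop blanks. The function name, the choice of class 0 as the blank, and the mean-probability score are assumptions for illustration; the real RecProcessor may differ in all three.

```rust
/// Greedy CTC decoding.
/// `probs[t][c]` is the probability of class `c` at timestep `t`.
/// Assumption: class 0 is the CTC blank, and class c > 0 maps to charset[c - 1].
/// Returns the decoded text and the mean probability of the kept characters.
fn ctc_greedy_decode(probs: &[Vec<f32>], charset: &[char]) -> (String, f32) {
    let mut text = String::new();
    let mut score_sum = 0.0f32;
    let mut kept = 0usize;
    let mut prev = 0usize; // treat the position before the first step as blank

    for step in probs {
        // Argmax over classes for this timestep.
        let mut best = 0usize;
        let mut best_p = f32::MIN;
        for (c, &p) in step.iter().enumerate() {
            if p > best_p {
                best = c;
                best_p = p;
            }
        }
        // Emit only non-blank classes that differ from the previous step.
        if best != 0 && best != prev {
            if let Some(&ch) = charset.get(best - 1) {
                text.push(ch);
                score_sum += best_p;
                kept += 1;
            }
        }
        prev = best;
    }

    let score = if kept > 0 { score_sum / kept as f32 } else { 0.0 };
    (text, score)
}
```

A blank between two identical predictions is what allows doubled letters: the sequence a, a, blank, a decodes to "aa", not "a".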

Coordinate Transformation

An important aspect of the architecture is coordinate space management. From session.rs:94:
for res in &mut det_res.0 {
    res.boxes.scale_and_clip(
        after_w as f64, after_h as f64, 
        ori_w as f64, ori_h as f64
    );
}
Detected boxes are transformed from the resized image space back to the original image coordinates, ensuring that output bounding boxes match the input image dimensions.
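The mapping itself is a per-axis scale followed by a clamp. The sketch below uses a hypothetical Point type and a free function in place of PointBox::scale_and_clip, keeping the same argument order as the call above (source width and height first, then target width and height).

```rust
/// Hypothetical 2D point; the real PointBox type may be laid out differently.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// Maps the four corners of a box from a `from_w` x `from_h` image to a
/// `to_w` x `to_h` image, then clamps them to the target image bounds.
fn scale_and_clip(points: &mut [Point; 4], from_w: f64, from_h: f64, to_w: f64, to_h: f64) {
    let sx = to_w / from_w;
    let sy = to_h / from_h;
    for p in points.iter_mut() {
        p.x = (p.x * sx).clamp(0.0, to_w);
        p.y = (p.y * sy).clamp(0.0, to_h);
    }
}
```

Clamping matters because detection postprocessing can produce corners slightly outside the image, and those would otherwise scale to coordinates beyond the original bounds.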

Execution Modes

Synchronous Mode

The run() method returns all results at once:
pub fn run(&mut self, input: impl AsRef<[u8]>) -> RettoResult<RettoWorkerResult>
Returns RettoWorkerResult containing det_result, cls_result, and rec_result.

Streaming Mode

The run_stream() method sends results as each stage completes:
pub fn run_stream(
    &mut self,
    input: impl AsRef<[u8]>,
    sender: mpsc::Sender<RettoWorkerStageResult>,
) -> RettoResult<()>
Useful for progressive rendering or early processing of detection results.
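The consuming side of a streaming run might look like the sketch below. It assumes the sender is a std::sync::mpsc::Sender (the source does not say which mpsc is used), and StageResult and run_stream_mock are hypothetical stand-ins for RettoWorkerStageResult and run_stream.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for RettoWorkerStageResult.
#[derive(Debug, PartialEq)]
enum StageResult {
    Det(usize),       // number of detected boxes
    Cls(usize),       // number of classified crops
    Rec(Vec<String>), // recognized strings
}

/// Mock producer: sends one message per pipeline stage, in order.
fn run_stream_mock(
    sender: mpsc::Sender<StageResult>,
) -> Result<(), mpsc::SendError<StageResult>> {
    sender.send(StageResult::Det(2))?;
    sender.send(StageResult::Cls(2))?;
    sender.send(StageResult::Rec(vec!["hello".to_string(), "world".to_string()]))?;
    Ok(())
}

/// Runs the producer on a worker thread and collects stages as they arrive.
fn collect_stages() -> Vec<StageResult> {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || run_stream_mock(tx));
    // The iterator ends once the sender is dropped when the producer returns.
    let stages: Vec<StageResult> = rx.iter().collect();
    producer
        .join()
        .expect("producer thread panicked")
        .expect("receiver hung up early");
    stages
}
```

In a real consumer, the loop over rx would act on each stage immediately (for example, drawing boxes as soon as the detection message arrives) instead of collecting everything first.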

Image Processing Helper

The ImageHelper struct (image_helper.rs:13) provides utilities for:
  • Loading: From raw bytes or RGB arrays
  • Resizing: With configurable min/max constraints, always aligned to 32-pixel boundaries
  • Color conversion: RGB to BGR for model input
  • Cropping: Perspective transformation for detected regions
  • Rotation: In-place 180° rotation for orientation correction
Resized image dimensions are always rounded to multiples of 32 pixels to match the model's input requirements.

Error Handling

The architecture uses a unified error type RettoError (error.rs:2) that wraps:
  • I/O errors
  • Image decoding errors
  • Array shape errors
  • ONNX Runtime errors
  • HuggingFace Hub API errors
  • Model loading errors
All operations return RettoResult<T> for consistent error propagation.
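The general pattern behind a unified error type is an enum with one variant per source plus From conversions, so that the ? operator can propagate any of them. The sketch below shows that pattern with hypothetical names (SketchError, SketchResult, read_model, check_shape) and only three of the variants listed above; it is not the definition of RettoError.

```rust
use std::fmt;
use std::io;

/// Hypothetical unified error type with a subset of the listed variants.
#[derive(Debug)]
enum SketchError {
    Io(io::Error),
    Shape(String),
    ModelLoad(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Io(e) => write!(f, "I/O error: {}", e),
            SketchError::Shape(msg) => write!(f, "shape error: {}", msg),
            SketchError::ModelLoad(msg) => write!(f, "model loading error: {}", msg),
        }
    }
}

impl std::error::Error for SketchError {}

/// Lets `?` convert an io::Error into the unified type automatically.
impl From<io::Error> for SketchError {
    fn from(e: io::Error) -> Self {
        SketchError::Io(e)
    }
}

type SketchResult<T> = Result<T, SketchError>;

/// Reads model bytes from disk; I/O failures propagate through `?`.
fn read_model(path: &str) -> SketchResult<Vec<u8>> {
    let bytes = std::fs::read(path)?;
    if bytes.is_empty() {
        return Err(SketchError::ModelLoad(format!("{} is empty", path)));
    }
    Ok(bytes)
}

/// Checks that a flat buffer holds exactly h * w * 3 values.
fn check_shape(len: usize, h: usize, w: usize) -> SketchResult<()> {
    if len != h * w * 3 {
        return Err(SketchError::Shape(format!(
            "expected {} values, got {}",
            h * w * 3,
            len
        )));
    }
    Ok(())
}
```

Callers can then match on a single error type regardless of which stage failed.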

Performance Considerations

  1. Batch Processing: Classification and recognition stages process multiple regions in batches (default 6) to maximize throughput
  2. Memory Efficiency: Images are processed in-place where possible
  3. Lazy Evaluation: Workers are only initialized when needed
  4. Size Optimization: Automatic image resizing prevents memory exhaustion on large inputs
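Batching as described in item 1 can be sketched with slice chunking. The helper name process_in_batches and the closure-based inference step are assumptions for illustration; the default of 6 comes from the text above.

```rust
/// Runs `infer` over `items` in chunks of at most `batch_size` and
/// concatenates the per-batch outputs in their original order.
fn process_in_batches<T, R>(
    items: &[T],
    batch_size: usize,
    mut infer: impl FnMut(&[T]) -> Vec<R>,
) -> Vec<R> {
    let mut out = Vec::with_capacity(items.len());
    // `chunks` panics on a size of zero, so fall back to 1 in that case.
    for batch in items.chunks(batch_size.max(1)) {
        out.extend(infer(batch));
    }
    out
}
```

With 14 regions and a batch size of 6, the inference closure is called three times with 6, 6, and 2 items, so the final batch is simply smaller and no padding is needed in this sketch.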

Next Steps

Processors

Learn about the three processor types and their configurations

Workers

Understand worker backends and model loading strategies
