Processors - Retto

Overview

Retto uses three specialized processors that handle the OCR pipeline stages:

DetProcessor - Text detection using the DB (Differentiable Binarization) algorithm
ClsProcessor - Text orientation classification
RecProcessor - Text recognition with CTC decoding

Each processor follows a consistent three-step pattern:

Preprocess: Transform input into model-compatible format
Worker Inference: Execute ONNX model
Postprocess: Convert model output into usable results

Processor Architecture

All processors implement the Processor trait (processor.rs:33):

pub(crate) trait Processor: ProcessorInner {
    type Config;
    type ProcessInput<'pl>;
    fn process<'a, F>(
        &self,
        input: Self::ProcessInput<'a>,
        worker_fun: F,
    ) -> RettoResult<Self::FinalResult>
    where
        F: FnMut(Self::PreProcessOutput<'a>) -> RettoResult<Self::PostProcessInput<'a>>;
}

This design allows each processor to define its own input/output types while maintaining a consistent interface.
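
The three-step flow can be illustrated with a reduced, hypothetical trait. This is a sketch only: `MiniProcessor` and `Toy` are invented for illustration, and the lifetimes and `RettoResult` of the real trait are replaced with plain generics and `Result<_, String>`.

```rust
/// Hypothetical, simplified stand-in for the three-step pattern.
/// Not Retto's actual trait: associated types are reduced to the minimum
/// needed to show the control flow.
trait MiniProcessor {
    type Input;
    type ModelInput;
    type ModelOutput;
    type Output;

    fn preprocess(&self, input: Self::Input) -> Result<Self::ModelInput, String>;
    fn postprocess(&self, raw: Self::ModelOutput) -> Result<Self::Output, String>;

    /// Preprocess, hand the tensor to the caller-supplied worker, postprocess.
    fn process<F>(&self, input: Self::Input, mut worker_fun: F) -> Result<Self::Output, String>
    where
        F: FnMut(Self::ModelInput) -> Result<Self::ModelOutput, String>,
    {
        let model_input = self.preprocess(input)?;
        let raw = worker_fun(model_input)?;
        self.postprocess(raw)
    }
}

/// Toy processor: scales bytes to [0, 1] and reports the largest model output.
struct Toy;

impl MiniProcessor for Toy {
    type Input = Vec<u8>;
    type ModelInput = Vec<f32>;
    type ModelOutput = Vec<f32>;
    type Output = f32;

    fn preprocess(&self, input: Vec<u8>) -> Result<Vec<f32>, String> {
        Ok(input.into_iter().map(|b| b as f32 / 255.0).collect())
    }

    fn postprocess(&self, raw: Vec<f32>) -> Result<f32, String> {
        raw.into_iter()
            .reduce(f32::max)
            .ok_or_else(|| "empty model output".to_string())
    }
}
```

Passing the inference step as a closure keeps the processor independent of the runtime that executes the model; in this sketch the "model" can be any function from `Vec<f32>` to `Vec<f32>`.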

DetProcessor - Text Detection

The detection processor locates text regions in images using the DB algorithm.

Configuration

From det_processor.rs:44:

pub struct DetProcessorConfig {
    // Preprocessing
    pub limit_side_len: usize,        // Default: 736
    pub limit_type: LimitType,        // Default: Min
    pub mean: Array1<f32>,            // Default: [0.5, 0.5, 0.5]
    pub std: Array1<f32>,             // Default: [0.5, 0.5, 0.5]
    pub scale: f32,                   // Default: 1/255
    
    // Postprocessing
    pub thresh: f32,                  // Default: 0.3
    pub box_thresh: f32,              // Default: 0.5
    pub max_candidates: usize,        // Default: 1000
    pub unclip_ratio: f32,            // Default: 1.6
    pub use_dilation: bool,           // Default: true
    pub score_mode: ScoreMode,        // Default: Fast
    pub min_mini_box_size: usize,     // Default: 3
    pub dilation_kernel: Option<Array2<usize>>, // Default: 2×2 kernel
}

Preprocessing Pipeline

From det_processor.rs:256:

1. Image Resizing

let mut rs_helper = ImageHelper::new_from_rgb_image_flow(input, h, w);
rs_helper.resize_either(&self.config.limit_type, self.config.limit_side_len)?;

LimitType Options:

Min: Ensures shortest side ≥ limit_side_len (default behavior)
Max: Ensures longest side ≤ limit_side_len

Dimensions are aligned to 32-pixel boundaries for optimal model performance.
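
The size computation can be sketched as follows. `det_target_size` is a hypothetical helper, not part of Retto, and the rounding rule (nearest multiple of 32, with a floor of 32) is an assumption modeled on the PaddleOCR reference preprocessing; `resize_either` may differ in detail.

```rust
/// Which side the limit applies to (mirrors the two options described above).
#[derive(Clone, Copy)]
enum LimitType {
    Min,
    Max,
}

/// Hypothetical helper: computes the resized (height, width) for detection.
/// Assumption: each side is rounded to the nearest multiple of 32, with a
/// floor of 32.
fn det_target_size(h: u32, w: u32, limit_type: LimitType, limit_side_len: u32) -> (u32, u32) {
    let (h_f, w_f) = (h as f32, w as f32);
    let limit = limit_side_len as f32;
    let ratio = match limit_type {
        // Upscale only when the shortest side is below the limit.
        LimitType::Min if h.min(w) < limit_side_len => limit / h_f.min(w_f),
        // Downscale only when the longest side exceeds the limit.
        LimitType::Max if h.max(w) > limit_side_len => limit / h_f.max(w_f),
        _ => 1.0,
    };
    // Scale one side, then snap it to a multiple of 32.
    let align = |side: f32| (((side * ratio) / 32.0).round() as u32 * 32).max(32);
    (align(h_f), align(w_f))
}
```

Under these assumptions, a 480 × 640 (height × width) image with `LimitType::Min` and the default limit of 736 becomes 736 × 992.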

2. Color Space Conversion

let input = rs_helper.rgb2bgr()?;

Converts RGB to the BGR channel order expected by PaddleOCR models.

3. Normalization

fn normalize(&self, input: &Array3<u8>) -> RettoResult<Array3<f32>> {
    let normalized = (input.mapv(|x| x as f32) * self.config.scale 
                      - &self.config.mean) / &self.config.std;
    Ok(Array3::from(normalized))
}

Applies standard normalization: (pixel * scale - mean) / std
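
As a worked example of the formula, here is a dependency-free sketch over an interleaved HWC byte buffer. `normalize_hwc` is a hypothetical name, and it uses plain slices instead of `ndarray`.

```rust
/// Hypothetical sketch: applies (pixel * scale - mean) / std per channel
/// to interleaved 3-channel data (…, c0, c1, c2, c0, c1, c2, …).
fn normalize_hwc(pixels: &[u8], scale: f32, mean: [f32; 3], std: [f32; 3]) -> Vec<f32> {
    pixels
        .iter()
        .enumerate()
        .map(|(i, &p)| {
            let c = i % 3; // channel index within the interleaved layout
            (p as f32 * scale - mean[c]) / std[c]
        })
        .collect()
}
```

With the defaults (scale 1/255, mean 0.5, std 0.5), a pixel value of 0 maps to -1.0 and 255 maps to approximately 1.0, so the model input lies in roughly [-1, 1].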

4. Channel Permutation

fn permute(&self, input: Array3<f32>) -> RettoResult<Array3<f32>> {
    let permuted = input.permuted_axes((2, 0, 1));
    Ok(permuted)
}

Converts HWC (Height × Width × Channels) to CHW format required by the model.
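
The index mapping behind this permutation can be shown on a flat buffer. `hwc_to_chw` is a hypothetical helper; unlike `permuted_axes`, which only reorders the axes of the array, this sketch copies the data into the new layout.

```rust
/// Hypothetical sketch: copies an interleaved H x W x C buffer into
/// planar C x H x W order.
fn hwc_to_chw(data: &[f32], h: usize, w: usize, c: usize) -> Vec<f32> {
    assert_eq!(data.len(), h * w * c, "buffer length must equal h * w * c");
    let mut out = vec![0.0; data.len()];
    for y in 0..h {
        for x in 0..w {
            for ch in 0..c {
                // Source: pixel-major, channels adjacent.
                // Destination: one full H x W plane per channel.
                out[ch * h * w + y * w + x] = data[(y * w + x) * c + ch];
            }
        }
    }
    out
}
```

For a 1 × 2 image with pixels (1, 2, 3) and (4, 5, 6), the result is the three planes [1, 4], [2, 5], [3, 6] laid out back to back.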

Postprocessing Pipeline

From det_processor.rs:279:

1. Thresholding

let mut mask = GrayImage::from_fn(w, h, |x, y| {
    let v = input[[0, 0, y as usize, x as usize]];
    Luma([if v > self.config.thresh { 255 } else { 0 }])
});

Creates a binary mask where pixels > thresh (default 0.3) are considered text.

2. Morphological Dilation

if let Some(ref k) = self.dilation_kernel {
    mask = grayscale_dilate(&mask, k);
}

Optionally expands text regions to merge nearby components using a 2×2 kernel.

3. Contour Detection

let mut boxes_res: Vec<_> = find_contours::<i32>(&mask)
    .iter()
    .filter_map(|contour| {
        let (points, sside) = self.get_mini_boxes(&contour.points);
        // Filter by size and score...
    })
    .collect();

Finds contours in the binary mask and computes minimum area rectangles.

4. Box Scoring

let mean_score = self.box_score_fast(&pred, &points);
if mean_score < self.config.box_thresh {
    return None;
}

Two scoring modes (score_mode):

Fast (default): Average score within bounding rectangle
Slow: Average score within exact polygon (more accurate)
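
A minimal sketch of the fast mode as described above: the mean probability inside the axis-aligned bounding rectangle of the box corners. The signature is hypothetical (a flat row-major probability map and `(x, y)` tuples), and it is not Retto's `box_score_fast`.

```rust
/// Hypothetical sketch of "fast" scoring over a row-major H x W map.
/// Returns None for empty input or a mismatched buffer length.
fn box_score_fast(pred: &[f32], width: usize, height: usize, corners: &[(f32, f32)]) -> Option<f32> {
    if corners.is_empty() || width == 0 || height == 0 || pred.len() != width * height {
        return None;
    }
    // Clamp a coordinate into the valid pixel range of one axis.
    let clamp = |v: f32, len: usize| (v.max(0.0) as usize).min(len - 1);
    let xs = corners.iter().map(|p| p.0);
    let ys = corners.iter().map(|p| p.1);
    let x_min = clamp(xs.clone().fold(f32::INFINITY, f32::min).floor(), width);
    let x_max = clamp(xs.fold(f32::NEG_INFINITY, f32::max).ceil(), width);
    let y_min = clamp(ys.clone().fold(f32::INFINITY, f32::min).floor(), height);
    let y_max = clamp(ys.fold(f32::NEG_INFINITY, f32::max).ceil(), height);

    // Average every probability inside the rectangle (bounds inclusive).
    let mut sum = 0.0f32;
    let mut count = 0u32;
    for y in y_min..=y_max {
        for x in x_min..=x_max {
            sum += pred[y * width + x];
            count += 1;
        }
    }
    Some(sum / count as f32)
}
```

The rectangle can include background pixels around a rotated box, which is why polygon-based scoring tends to be more accurate for slanted text.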

5. Box Unclipping

fn unclip<T>(&self, point_box: &PointBox<T>) -> Vec<ImagePoint<OrderedFloat<f32>>> {
    let polygon = Polygon::new(LineString(exterior_coords), vec![]);
    let area = polygon.unsigned_area();
    let perimeter = /* ... */;
    let distance = area * (self.config.unclip_ratio) / perimeter;
    let offset_polys = polygon.offset(distance, JoinType::Round(0.5), 
                                       EndType::ClosedPolygon, 1.0);
    // ...
}

Expands detected boxes using the Vatti clipping algorithm with unclip_ratio (default 1.6) to ensure complete text capture.
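
The offset distance follows distance = area × unclip_ratio / perimeter. The sketch below computes it for a simple polygon with the shoelace formula; `unclip_distance` is a hypothetical helper, and the offsetting itself is left to the polygon library.

```rust
/// Hypothetical sketch: offset distance for a simple polygon given as
/// (x, y) vertices in order. Returns None for degenerate input.
fn unclip_distance(points: &[(f32, f32)], unclip_ratio: f32) -> Option<f32> {
    let n = points.len();
    if n < 3 {
        return None;
    }
    let mut twice_area = 0.0f32;
    let mut perimeter = 0.0f32;
    for i in 0..n {
        let (x1, y1) = points[i];
        let (x2, y2) = points[(i + 1) % n];
        // Shoelace term for the signed area, plus the edge length.
        twice_area += x1 * y2 - x2 * y1;
        perimeter += ((x2 - x1).powi(2) + (y2 - y1).powi(2)).sqrt();
    }
    if perimeter == 0.0 {
        return None;
    }
    Some(twice_area.abs() / 2.0 * unclip_ratio / perimeter)
}
```

For a 100 × 20 box with the default ratio of 1.6, the area is 2000 and the perimeter 240, giving an offset of about 13.3 pixels on every side.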

6. Box Filtering

From det_processor.rs:298-316:

if sside < self.config.min_mini_box_size as f32 {
    return None;  // Box too small
}

if pb_h <= OrderedFloat(3f32) || pb_w <= OrderedFloat(3f32) {
    return None;  // Final box dimensions too small
}

7. Box Sorting

boxes_res.sort_by(|r1, r2| {
    let (c1, c2) = (r1.boxes.center_point(), r2.boxes.center_point());
    let (x1, x2) = (c1.x.into_inner(), c2.x.into_inner());
    let (y1, y2) = (c1.y.into_inner(), c2.y.into_inner());
    if (y1 - y2).abs() < 10f32 {
        // Same line: sort by x-coordinate
        x1.partial_cmp(&x2).unwrap()
    } else {
        // Different lines: sort by y-coordinate
        y1.partial_cmp(&y2).unwrap()
    }
});

Sorts boxes top-to-bottom, left-to-right. Boxes within 10 pixels vertically are considered on the same line.
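
A pairwise "within 10 pixels" test is not guaranteed to be transitive, while the standard library's sort documentation asks for a total order. As an alternative illustration (not Retto's implementation), the sketch below reaches the same reading order by grouping centers into lines first; `sort_reading_order` and its tuple input are hypothetical.

```rust
/// Hypothetical sketch: orders box centers (x, y) in reading order by first
/// grouping them into lines, then sorting each line left to right.
fn sort_reading_order(mut centers: Vec<(f32, f32)>, line_tol: f32) -> Vec<(f32, f32)> {
    // Sort top to bottom; total_cmp is a total order over all f32 values.
    centers.sort_by(|a, b| a.1.total_cmp(&b.1));

    let mut sorted = Vec::with_capacity(centers.len());
    let mut line: Vec<(f32, f32)> = Vec::new();
    for c in centers {
        // Start a new line when this center is too far below the line's first box.
        if let Some(&first) = line.first() {
            if (c.1 - first.1).abs() >= line_tol {
                line.sort_by(|a, b| a.0.total_cmp(&b.0));
                sorted.append(&mut line);
            }
        }
        line.push(c);
    }
    // Flush the last line.
    line.sort_by(|a, b| a.0.total_cmp(&b.0));
    sorted.append(&mut line);
    sorted
}
```

Each line is anchored to its topmost box, so a slowly drifting baseline can still split into two lines; the tolerance plays the same role as the 10-pixel threshold above.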

Output Format

pub struct DetProcessorResult(pub Vec<DetProcessorInnerResult>);

pub struct DetProcessorInnerResult {
    pub boxes: PointBox<OrderedFloat<f32>>,  // 4 corner points
    pub score: f32,                           // Confidence score
}

Each detected region is represented as a quadrilateral with four corner points, allowing for rotated and skewed text detection.

ClsProcessor - Text Orientation

The classification processor determines if text is upright (0°) or upside-down (180°).

Configuration

From cls_processor.rs:14:

pub struct ClsProcessorConfig {
    pub image_shape: [usize; 3],   // Default: [3, 48, 192]
    pub batch_num: usize,          // Default: 6
    pub thresh: f32,               // Default: 0.9
    pub label: Vec<u16>,           // Default: [0, 180]
}

Processing Pipeline

From cls_processor.rs:127:

1. Batch Preparation

let mut image_index_asc_size: Vec<usize> = (0..crop_images.len()).collect();
image_index_asc_size.sort_by_key(|&i| Reverse(OrderedFloat(crop_images[i].ori_ratio())));

let batched = image_index_asc_size
    .chunks(self.config.batch_num)
    .map(|batch| {
        // Process batch...
    })
    .collect();

Key insight: Images are sorted by aspect ratio (descending) before batching. This groups similar-sized images together, minimizing padding and improving efficiency.
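
The sort-then-chunk idea can be sketched without any image types. `batch_by_ratio` is a hypothetical helper that takes `(width, height)` pairs and returns batches of original indices, so results can later be mapped back to their crops.

```rust
/// Hypothetical sketch: returns batches of image indices, widest first.
/// `sizes` holds (width, height) per crop; heights are assumed non-zero.
fn batch_by_ratio(sizes: &[(u32, u32)], batch_num: usize) -> Vec<Vec<usize>> {
    assert!(batch_num > 0, "batch_num must be positive");
    let ratio = |i: usize| sizes[i].0 as f32 / sizes[i].1 as f32;
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    // Descending aspect ratio; the stable sort keeps ties in input order.
    order.sort_by(|&a, &b| ratio(b).total_cmp(&ratio(a)));
    order.chunks(batch_num).map(|c| c.to_vec()).collect()
}
```

Keeping indices rather than moving the images means the classification output can be written back to the original positions after inference.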

2. Image Resizing

crop_images[i].resize_norm_image(self.config.image_shape, None)

Resizes to fixed dimensions (3 × 48 × 192) with padding to maintain aspect ratio.

3. Postprocessing

fn postprocess<'a>(
    &self,
    input: Self::PostProcessInput<'a>,  // Array2<f32>
    _: Self::PostProcessInputExtra<'a>,
) -> RettoResult<Self::PostProcessOutput<'a>> {
    let pred_idxs = input.map_axis(Axis(1), |row| row.argmax().unwrap());
    let mut out = Vec::with_capacity(pred_idxs.len());
    for (i, &class_idx) in pred_idxs.iter().enumerate() {
        let score = input[(i, class_idx)];
        let label = self.config.label[class_idx];
        out.push(ClsPostProcessLabel { label, score });
    }
    Ok(out)
}

4. Automatic Rotation

if label.label == 180 && label.score >= self.config.thresh {
    crop_images[idx].rotate_180_in_place()?;
}

Images detected as upside-down with high confidence (≥ 0.9) are automatically rotated before recognition.
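
The argmax and rotation decision can be combined in a small sketch. `classify_rows` is hypothetical and uses nested `Vec`s in place of `Array2<f32>`; the third tuple field reports whether the crop would be rotated.

```rust
/// Hypothetical sketch: picks the best class per row and decides on rotation.
/// Returns (label, score, rotate) per row; empty rows and class indices
/// without a matching label are skipped.
fn classify_rows(probs: &[Vec<f32>], labels: &[u16], thresh: f32) -> Vec<(u16, f32, bool)> {
    probs
        .iter()
        .filter_map(|row| {
            // argmax over the row
            let (idx, &score) = row
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))?;
            let label = *labels.get(idx)?;
            // Rotate only when the 180° class wins with enough confidence.
            Some((label, score, label == 180 && score >= thresh))
        })
        .collect()
}
```

A crop classified as 180° with a score of 0.6 is left untouched under the default threshold of 0.9, which avoids flipping text on a weak prediction.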

Output Format

pub struct ClsProcessorResult(pub Vec<ClsProcessorSingleResult>);

pub struct ClsProcessorSingleResult {
    pub label: ClsPostProcessLabel,
}

pub struct ClsPostProcessLabel {
    pub label: u16,    // 0 or 180
    pub score: f32,    // Confidence
}

RecProcessor - Text Recognition

The recognition processor converts text images into strings using CTC decoding.

Configuration

From rec_processor.rs:102:

pub struct RecProcessorConfig {
    pub character_source: RecCharacterDictProvider,
    pub image_shape: [usize; 3],   // Default: [3, 48, 320]
    pub batch_num: usize,          // Default: 6
}

Character Dictionary

The RecCharacter struct (rec_processor.rs:23) manages the character vocabulary:

pub(crate) struct RecCharacter {
    inner: Vec<String>,           // Character dictionary
    ignored_tokens: Vec<usize>,   // Tokens to skip (e.g., blank)
}

Dictionary sources:

HuggingFace: Downloads from pk5ls20/PaddleModel/retto/onnx/ppocr_keys_v1.txt
Local Path: Reads from file system
Blob: Uses embedded data (WebAssembly)

The dictionary is initialized with special tokens (rec_processor.rs:38):

dict.push(" ".to_string());      // Space character
dict.insert(0, "blank".to_string()); // CTC blank token
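
The effect on indexing can be seen in a small sketch. `build_dict` is a hypothetical function that assumes one token per line in the dictionary text; because the blank token is inserted at position 0, the first dictionary line ends up at index 1.

```rust
/// Hypothetical sketch: builds the index-to-token table from dictionary text.
/// One token per line; "blank" goes to index 0 and a space is appended last.
fn build_dict(source: &str) -> Vec<String> {
    let mut dict: Vec<String> = source.lines().map(|l| l.to_string()).collect();
    dict.push(" ".to_string());
    dict.insert(0, "blank".to_string());
    dict
}
```

For the input lines `a`, `b`, `c`, the resulting table is `blank`, `a`, `b`, `c`, and a space, so a model output of index 1 decodes to `a`.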

Processing Pipeline

From rec_processor.rs:214:

1. Dynamic Width Calculation

let mut max_wh_ratio = OrderedFloat(w as f32 / h as f32);
image_index_asc_size.chunks(self.config.batch_num).try_for_each(|batch_idx| {
    let mut wh_ratios = Vec::with_capacity(batch_idx.len());
    batch_idx.iter().for_each(|&i| {
        let img = &images[i];
        let (img_h, img_w) = img.size();
        let wh_ratio = OrderedFloat(img_w as f32 / img_h as f32);
        wh_ratios.push(wh_ratio);
        max_wh_ratio = max(max_wh_ratio, wh_ratio);
    });
    // ...
});

Dynamically adjusts image width based on aspect ratio to minimize padding while maintaining fixed height (48px).

2. Batch Processing

let mats = batch_idx
    .iter()
    .map(|&i| {
        images[i]
            .resize_norm_image(
                self.config.image_shape,
                Some(max_wh_ratio.into_inner()),
            )
            .insert_axis(Axis(0))
    })
    .collect::<Vec<_>>();

All images in a batch are resized to the same width (determined by max aspect ratio).
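
The width arithmetic for one batch can be sketched as below. This is an assumption modeled on the PaddleOCR reference resize rather than a description of `resize_norm_image`: the batch width is the fixed height times the largest aspect ratio (never below the configured width), and each image is scaled to the fixed height and then padded up to the batch width. `batch_widths` is a hypothetical helper.

```rust
/// Hypothetical sketch of the width computation for one recognition batch.
/// `sizes` holds (width, height) per crop, with non-zero heights assumed.
/// Returns (batch_width, per-image resized widths before padding).
fn batch_widths(sizes: &[(u32, u32)], img_h: u32, base_w: u32) -> (u32, Vec<u32>) {
    // Largest width-to-height ratio among the images in this batch.
    let max_img_ratio = sizes
        .iter()
        .map(|&(w, h)| w as f32 / h as f32)
        .fold(0.0, f32::max);
    // The configured width (e.g. 320) acts as a lower bound.
    let batch_w = base_w.max((img_h as f32 * max_img_ratio) as u32);
    let widths = sizes
        .iter()
        .map(|&(w, h)| {
            // Scale to the fixed height, keeping the aspect ratio.
            let scaled = (img_h as f32 * w as f32 / h as f32).ceil() as u32;
            scaled.min(batch_w)
        })
        .collect();
    (batch_w, widths)
}
```

With a height of 48 and a configured width of 320, a batch containing a 100 × 50 crop and a 480 × 24 crop gets a batch width of 960; the first crop is resized to 96 pixels wide and padded, the second fills the full width.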

3. CTC Decoding

From rec_processor.rs:48:

fn decode(
    &self,
    text_index: &Array2<usize>,    // Predicted character indices
    text_prob: &Array2<f32>,       // Prediction probabilities
    wh_ratio_list: &[OrderedFloat<f32>],
    max_wh_ratio: OrderedFloat<f32>,
    remove_duplicate: bool,         // CTC duplicate removal
    return_word_box: bool,          // TODO: Word-level boxes
) -> Vec<(String, f32)>

CTC Duplicate Removal (rec_processor.rs:62):

if remove_duplicate {
    Zip::from(selection.slice_mut(s![1..]))
        .and(token_indices.slice(s![1..]))
        .and(token_indices.slice(s![..-1]))
        .for_each(|sel, &curr, &prev| {
            *sel = *sel && curr != prev;
        });
}

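Removes consecutive duplicate indices by comparing each index with its predecessor in the raw sequence. A blank between two identical characters therefore preserves a genuine double letter: “hh-e-ll-l-oo” (where “-” is the blank token) decodes to “hello”, whereas “hheelllloo” without a blank would collapse to “helo”.

Ignored Token Filtering (rec_processor.rs:70):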

self.ignored_tokens.iter().for_each(|ignored| {
    Zip::from(&mut selection)
        .and(token_indices)
        .for_each(|sel, &idx| {
            *sel = *sel && idx != *ignored;
        });
});

Filters out CTC blank tokens (index 0).

Score Calculation (rec_processor.rs:87):

.filter_map(|((&sel, &idx), &p)| match sel {
    true => Some((self.inner[idx].clone(), p)),
    false => None,
})
.fold(
    (String::with_capacity(text_len), 0.0, 0u32),
    |(mut acc_str, acc_sum, sum), (seg, p)| {
        acc_str.push_str(&seg);
        (acc_str, acc_sum + p, sum + 1)
    },
);
(pre_res.0, pre_res.1 / pre_res.2 as f32)  // Average score

The final score is the average confidence across the characters that survive duplicate removal and blank filtering.
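
The three steps above (duplicate removal, blank filtering, score averaging) can be combined into one greedy decoding pass. The sketch below is a plain-`Vec` illustration for a single sequence, not Retto's `decode`; `ctc_decode` is a hypothetical name, and the guard against an empty result is an addition of this sketch.

```rust
/// Hypothetical sketch of greedy CTC decoding for one sequence.
/// `indices[t]` is the argmax class at time step t, `probs[t]` its probability,
/// and `dict[0]` is the blank token. Returns (text, mean confidence).
fn ctc_decode(indices: &[usize], probs: &[f32], dict: &[String]) -> (String, f32) {
    let mut text = String::new();
    let mut sum = 0.0f32;
    let mut count = 0u32;
    let mut prev: Option<usize> = None;
    for (&idx, &p) in indices.iter().zip(probs) {
        // Keep a step only if it differs from the previous step and is not blank.
        let keep = prev != Some(idx) && idx != 0;
        prev = Some(idx);
        if keep {
            if let Some(token) = dict.get(idx) {
                text.push_str(token);
                sum += p;
                count += 1;
            }
        }
    }
    // Guard the empty case so the score is 0.0 rather than NaN.
    let score = if count == 0 { 0.0 } else { sum / count as f32 };
    (text, score)
}
```

With the dictionary `blank, h, e, l, o`, the index sequence `1 1 2 3 3 0 3 4 4` decodes to "hello": the blank at position 5 separates the two runs of `l`, so both survive.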

Output Format

pub struct RecProcessorResult(pub Vec<RecProcessorSingleResult>);

pub struct RecProcessorSingleResult {
    pub text: String,   // Recognized text
    pub score: f32,     // Average confidence
}

Configuration Examples

High Precision Detection

DetProcessorConfig {
    thresh: 0.2,              // Lower threshold for better recall
    box_thresh: 0.6,          // Higher threshold for precision
    unclip_ratio: 1.8,        // More generous box expansion
    score_mode: ScoreMode::Slow,  // Accurate polygon scoring
    ..Default::default()
}

Fast Processing

ClsProcessorConfig {
    batch_num: 12,            // Larger batches
    thresh: 0.7,              // Lower confidence threshold
    ..Default::default()
}

RecProcessorConfig {
    batch_num: 12,
    image_shape: [3, 32, 256], // Smaller dimensions
    ..Default::default()
}

Large Images

DetProcessorConfig {
    limit_side_len: 1280,     // Higher resolution
    limit_type: LimitType::Min,
    max_candidates: 2000,     // More boxes
    ..Default::default()
}

Detection input dimensions must be multiples of 32 pixels. The detection preprocessing step handles this alignment automatically when it resizes the image.

Performance Tips

Batch Size: Increase batch_num for better GPU utilization, but watch memory usage
Image Resolution: Higher limit_side_len improves accuracy but increases processing time
Score Mode: Use ScoreMode::Fast for speed, ScoreMode::Slow for accuracy
Dilation: Disable use_dilation for sharp, well-separated text
Unclip Ratio: Reduce for tightly-cropped text, increase for text with large spacing
