Overview

Retto uses three specialized processors that handle the OCR pipeline stages:
  1. DetProcessor - Text detection using the DB (Differentiable Binarization) algorithm
  2. ClsProcessor - Text orientation classification
  3. RecProcessor - Text recognition with CTC decoding
Each processor follows a consistent three-step pattern:
  1. Preprocess: Transform input into model-compatible format
  2. Worker Inference: Execute ONNX model
  3. Postprocess: Convert model output into usable results

Processor Architecture

All processors implement the Processor trait (processor.rs:33):
pub(crate) trait Processor: ProcessorInner {
    type Config;
    type ProcessInput<'pl>;
    fn process<'a, F>(
        &self,
        input: Self::ProcessInput<'a>,
        worker_fun: F,
    ) -> RettoResult<Self::FinalResult>
    where
        F: FnMut(Self::PreProcessOutput<'a>) -> RettoResult<Self::PostProcessInput<'a>>;
}
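This design lets each processor define its own input and output types while keeping a consistent interface. The associated types PreProcessOutput, PostProcessInput, and FinalResult come from the ProcessorInner supertrait, and the worker_fun closure is where the caller runs model inference between the preprocess and postprocess steps.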
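The closure-based pattern can be illustrated with a much smaller stand-in. The sketch below is not Retto's trait: ToyProcessor, ArgmaxProcessor, and the String error type are hypothetical names chosen for illustration, and lifetimes are omitted. It only shows how process can chain preprocess, a caller-supplied worker, and postprocess.

```rust
// Simplified, hypothetical stand-in for the processor pattern (not Retto's actual trait).
trait ToyProcessor {
    type Input;
    type PreOut;
    type PostIn;
    type Output;

    fn preprocess(&self, input: Self::Input) -> Result<Self::PreOut, String>;
    fn postprocess(&self, raw: Self::PostIn) -> Result<Self::Output, String>;

    // The caller supplies the inference step as a closure, so the processor
    // itself does not depend on any particular backend.
    fn process<F>(&self, input: Self::Input, mut worker_fun: F) -> Result<Self::Output, String>
    where
        F: FnMut(Self::PreOut) -> Result<Self::PostIn, String>,
    {
        let pre = self.preprocess(input)?;
        let raw = worker_fun(pre)?;
        self.postprocess(raw)
    }
}

// Toy processor: scales bytes to [0, 1] and returns the index of the largest output.
struct ArgmaxProcessor;

impl ToyProcessor for ArgmaxProcessor {
    type Input = Vec<u8>;
    type PreOut = Vec<f32>;
    type PostIn = Vec<f32>;
    type Output = usize;

    fn preprocess(&self, input: Vec<u8>) -> Result<Vec<f32>, String> {
        Ok(input.into_iter().map(|b| b as f32 / 255.0).collect())
    }

    fn postprocess(&self, raw: Vec<f32>) -> Result<usize, String> {
        raw.iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .ok_or_else(|| "empty model output".to_string())
    }
}
```

Calling `ArgmaxProcessor.process(bytes, |x| Ok(x))` uses an identity closure in place of a real model; a real worker would run inference inside that closure.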

DetProcessor - Text Detection

The detection processor locates text regions in images using the DB algorithm.

Configuration

From det_processor.rs:44:
pub struct DetProcessorConfig {
    // Preprocessing
    pub limit_side_len: usize,        // Default: 736
    pub limit_type: LimitType,        // Default: Min
    pub mean: Array1<f32>,            // Default: [0.5, 0.5, 0.5]
    pub std: Array1<f32>,             // Default: [0.5, 0.5, 0.5]
    pub scale: f32,                   // Default: 1/255
    
    // Postprocessing
    pub thresh: f32,                  // Default: 0.3
    pub box_thresh: f32,              // Default: 0.5
    pub max_candidates: usize,        // Default: 1000
    pub unclip_ratio: f32,            // Default: 1.6
    pub use_dilation: bool,           // Default: true
    pub score_mode: ScoreMode,        // Default: Fast
    pub min_mini_box_size: usize,     // Default: 3
    pub dilation_kernel: Option<Array2<usize>>, // Default: 2×2 kernel
}

Preprocessing Pipeline

From det_processor.rs:256:

1. Image Resizing

let mut rs_helper = ImageHelper::new_from_rgb_image_flow(input, h, w);
rs_helper.resize_either(&self.config.limit_type, self.config.limit_side_len)?;
LimitType Options:
  • Min: Ensures shortest side ≥ limit_side_len (default behavior)
  • Max: Ensures longest side ≤ limit_side_len
Dimensions are aligned to 32-pixel boundaries for optimal model performance.
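As a rough illustration of how a limit type and 32-pixel alignment interact, the sketch below computes a target size. It assumes the common PaddleOCR convention (scale only when the limit is violated, then round each side to the nearest multiple of 32, with a floor of 32); Retto's resize_either may differ in rounding details. The names target_size and align_to_32 are hypothetical.

```rust
#[derive(Clone, Copy)]
enum LimitType {
    Min,
    Max,
}

/// Rounds a side length to the nearest multiple of 32, never below 32.
fn align_to_32(side: f32) -> u32 {
    let blocks = (side / 32.0).round() as u32;
    (blocks * 32).max(32)
}

/// Returns (height, width) after applying the side limit and 32-pixel alignment.
/// Assumption: this mirrors the usual PaddleOCR behavior, not Retto's exact code.
fn target_size(h: u32, w: u32, limit_type: LimitType, limit_side_len: u32) -> (u32, u32) {
    let (h_f, w_f) = (h as f32, w as f32);
    let limit = limit_side_len as f32;
    let ratio = match limit_type {
        // Upscale only when the shortest side is below the limit.
        LimitType::Min => {
            let short = h_f.min(w_f);
            if short < limit { limit / short } else { 1.0 }
        }
        // Downscale only when the longest side exceeds the limit.
        LimitType::Max => {
            let long = h_f.max(w_f);
            if long > limit { limit / long } else { 1.0 }
        }
    };
    (align_to_32(h_f * ratio), align_to_32(w_f * ratio))
}
```

Under these assumptions, a 480 × 640 (height × width) image with LimitType::Min and limit_side_len 736 is scaled so that the short side reaches 736, and the long side lands on the nearest multiple of 32.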

2. Color Space Conversion

let input = rs_helper.rgb2bgr()?;
Converts RGB to BGR format expected by PaddleOCR models.

3. Normalization

fn normalize(&self, input: &Array3<u8>) -> RettoResult<Array3<f32>> {
    let normalized = (input.mapv(|x| x as f32) * self.config.scale 
                      - &self.config.mean) / &self.config.std;
    Ok(Array3::from(normalized))
}
Applies standard normalization: (pixel * scale - mean) / std
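To make the formula concrete, here is a dependency-free sketch that applies it to an interleaved 3-channel pixel buffer. The function names are hypothetical, and plain slices stand in for the ndarray types used above.

```rust
/// Normalizes one channel value: (pixel * scale - mean) / std.
fn normalize_value(pixel: u8, scale: f32, mean: f32, std: f32) -> f32 {
    (pixel as f32 * scale - mean) / std
}

/// Normalizes an interleaved 3-channel buffer with per-channel mean and std.
fn normalize_pixels(pixels: &[u8], scale: f32, mean: [f32; 3], std: [f32; 3]) -> Vec<f32> {
    pixels
        .iter()
        .enumerate()
        // i % 3 selects the channel because the buffer is interleaved.
        .map(|(i, &p)| normalize_value(p, scale, mean[i % 3], std[i % 3]))
        .collect()
}
```

With the default configuration (scale 1/255, mean 0.5, std 0.5), a pixel value of 0 maps to about -1.0 and 255 maps to about 1.0, so the model input lies roughly in [-1, 1].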

4. Channel Permutation

fn permute(&self, input: Array3<f32>) -> RettoResult<Array3<f32>> {
    let permuted = input.permuted_axes((2, 0, 1));
    Ok(permuted)
}
Converts HWC (Height × Width × Channels) to CHW format required by the model.
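The index arithmetic behind this layout change can be sketched on a flat buffer. This is an illustration only: hwc_to_chw is a hypothetical helper that copies the data, whereas ndarray's permuted_axes reorders the axes of the existing array without moving elements.

```rust
/// Reorders an interleaved HWC buffer into planar CHW order.
fn hwc_to_chw(data: &[f32], h: usize, w: usize, c: usize) -> Vec<f32> {
    assert_eq!(data.len(), h * w * c, "buffer length must equal h * w * c");
    let mut out = vec![0.0; data.len()];
    for y in 0..h {
        for x in 0..w {
            for ch in 0..c {
                // HWC index: pixel-major; CHW index: channel-plane-major.
                out[ch * h * w + y * w + x] = data[(y * w + x) * c + ch];
            }
        }
    }
    out
}
```

For a 1 × 2 image with pixels (1, 2, 3) and (4, 5, 6), the result groups each channel into its own plane: [1, 4, 2, 5, 3, 6].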

Postprocessing Pipeline

From det_processor.rs:279:

1. Thresholding

let mut mask = GrayImage::from_fn(w, h, |x, y| {
    let v = input[[0, 0, y as usize, x as usize]];
    Luma([if v > self.config.thresh { 255 } else { 0 }])
});
Creates a binary mask: pixels whose predicted probability is strictly greater than thresh (default 0.3) are marked as text (255), and all others become background (0).
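The same comparison on a flat probability buffer looks like this. The helper name binarize is hypothetical; the point is that the comparison is strict, so a value exactly equal to thresh is treated as background.

```rust
/// Converts a probability map into a 0/255 mask using a strict threshold.
fn binarize(prob: &[f32], thresh: f32) -> Vec<u8> {
    prob.iter()
        .map(|&v| if v > thresh { 255 } else { 0 })
        .collect()
}
```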

2. Morphological Dilation

if let Some(ref k) = self.dilation_kernel {
    mask = grayscale_dilate(&mask, k);
}
Optionally expands text regions to merge nearby components using a 2×2 kernel.

3. Contour Detection

let mut boxes_res: Vec<_> = find_contours::<i32>(&mask)
    .iter()
    .filter_map(|contour| {
        let (points, sside) = self.get_mini_boxes(&contour.points);
        // Filter by size and score...
    })
    .collect();
Finds contours in the binary mask and computes minimum area rectangles.

4. Box Scoring

let mean_score = self.box_score_fast(&pred, &points);
if mean_score < self.config.box_thresh {
    return None;
}
Two scoring modes (score_mode):
  • Fast (default): Average score within bounding rectangle
  • Slow: Average score within exact polygon (more accurate)
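A minimal sketch of the Fast mode as described above: average the probability map over the axis-aligned bounding rectangle of the box corners. The signature and the clamping behavior are assumptions for illustration; Retto's box_score_fast may handle rounding or masking differently.

```rust
/// Mean probability inside the axis-aligned bounding rectangle of `points`.
/// `prob` is a row-major map of size h * w; `points` are (x, y) corners.
/// Hypothetical helper, not Retto's implementation.
fn box_score_fast(prob: &[f32], h: usize, w: usize, points: &[(f32, f32)]) -> Option<f32> {
    if points.is_empty() || h == 0 || w == 0 || prob.len() != h * w {
        return None;
    }
    // Clamp a coordinate into the valid index range [0, max - 1].
    let clamp = |v: f32, max: usize| -> usize { v.max(0.0).min((max - 1) as f32) as usize };

    let (mut x_min, mut x_max) = (f32::MAX, f32::MIN);
    let (mut y_min, mut y_max) = (f32::MAX, f32::MIN);
    for &(x, y) in points {
        x_min = x_min.min(x);
        x_max = x_max.max(x);
        y_min = y_min.min(y);
        y_max = y_max.max(y);
    }
    let (x0, x1) = (clamp(x_min.floor(), w), clamp(x_max.ceil(), w));
    let (y0, y1) = (clamp(y_min.floor(), h), clamp(y_max.ceil(), h));

    let mut sum = 0.0;
    for y in y0..=y1 {
        for x in x0..=x1 {
            sum += prob[y * w + x];
        }
    }
    let count = (y1 - y0 + 1) * (x1 - x0 + 1);
    Some(sum / count as f32)
}
```

A box whose rectangle also covers background pixels scores lower than one that tightly encloses high-probability pixels, which is why a rotated box can be penalized in this mode compared with exact polygon scoring.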

5. Box Unclipping

fn unclip<T>(&self, point_box: &PointBox<T>) -> Vec<ImagePoint<OrderedFloat<f32>>> {
    let polygon = Polygon::new(LineString(exterior_coords), vec![]);
    let area = polygon.unsigned_area();
    let perimeter = /* ... */;
    let distance = area * (self.config.unclip_ratio) / perimeter;
    let offset_polys = polygon.offset(distance, JoinType::Round(0.5), 
                                       EndType::ClosedPolygon, 1.0);
    // ...
}
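Expands each detected box outward by distance = area × unclip_ratio / perimeter (unclip_ratio defaults to 1.6), using polygon offsetting based on the Vatti clipping algorithm, so that the final box covers the complete text.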
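The offset distance itself is simple arithmetic. The sketch below computes it for a polygon given as (x, y) vertices, using the shoelace formula for the area; the helper names are hypothetical and the actual offsetting step (done by the polygon library above) is not reproduced.

```rust
/// Polygon area via the shoelace formula (absolute value).
fn polygon_area(pts: &[(f32, f32)]) -> f32 {
    let n = pts.len();
    let twice: f32 = (0..n)
        .map(|i| {
            let (x1, y1) = pts[i];
            let (x2, y2) = pts[(i + 1) % n];
            x1 * y2 - x2 * y1
        })
        .sum();
    twice.abs() / 2.0
}

/// Sum of edge lengths of the closed polygon.
fn polygon_perimeter(pts: &[(f32, f32)]) -> f32 {
    let n = pts.len();
    (0..n)
        .map(|i| {
            let (x1, y1) = pts[i];
            let (x2, y2) = pts[(i + 1) % n];
            ((x2 - x1).powi(2) + (y2 - y1).powi(2)).sqrt()
        })
        .sum()
}

/// Offset distance used for unclipping: area * unclip_ratio / perimeter.
/// Returns 0.0 for degenerate polygons (an assumption made for this sketch).
fn unclip_distance(pts: &[(f32, f32)], unclip_ratio: f32) -> f32 {
    let perimeter = polygon_perimeter(pts);
    if perimeter == 0.0 {
        return 0.0;
    }
    polygon_area(pts) * unclip_ratio / perimeter
}
```

For a 100 × 20 box, the area is 2000 and the perimeter is 240, so with the default ratio of 1.6 the box grows by about 13.3 pixels on every side.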

6. Box Filtering

From det_processor.rs:298-316:
if sside < self.config.min_mini_box_size as f32 {
    return None;  // Box too small
}

if pb_h <= OrderedFloat(3f32) || pb_w <= OrderedFloat(3f32) {
    return None;  // Final box dimensions too small
}

7. Box Sorting

boxes_res.sort_by(|r1, r2| {
    let (c1, c2) = (r1.boxes.center_point(), r2.boxes.center_point());
    let (x1, x2) = (c1.x.into_inner(), c2.x.into_inner());
    let (y1, y2) = (c1.y.into_inner(), c2.y.into_inner());
    if (y1 - y2).abs() < 10f32 {
        // Same line: sort by x-coordinate
        x1.partial_cmp(&x2).unwrap()
    } else {
        // Different lines: sort by y-coordinate
        y1.partial_cmp(&y2).unwrap()
    }
});
Sorts boxes top-to-bottom, left-to-right. Boxes within 10 pixels vertically are considered on the same line.
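The ordering rule can be tried in isolation on plain (x, y) center points. This is a sketch with hypothetical names (reading_order, sort_centers) and a configurable tolerance in place of the hard-coded 10 pixels.

```rust
use std::cmp::Ordering;

/// Compares two box centers (x, y) in reading order: centers whose y values
/// differ by less than `line_tol` are ordered by x, otherwise by y.
fn reading_order(a: &(f32, f32), b: &(f32, f32), line_tol: f32) -> Ordering {
    if (a.1 - b.1).abs() < line_tol {
        a.0.total_cmp(&b.0)
    } else {
        a.1.total_cmp(&b.1)
    }
}

/// Sorts centers top-to-bottom, then left-to-right within a line.
fn sort_centers(centers: &mut [(f32, f32)], line_tol: f32) {
    centers.sort_by(|a, b| reading_order(a, b, line_tol));
}
```

One caveat with any tolerance-based comparison: it is not transitive in general (a chain of boxes each a few pixels below the previous one can link lines that are far apart), and the documentation for slice sort_by states that ordering is unspecified, and may panic, when the comparator is not a total order. Bucketing boxes into lines first is a common alternative when layouts are dense.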

Output Format

pub struct DetProcessorResult(pub Vec<DetProcessorInnerResult>);

pub struct DetProcessorInnerResult {
    pub boxes: PointBox<OrderedFloat<f32>>,  // 4 corner points
    pub score: f32,                           // Confidence score
}
Each detected region is represented as a quadrilateral with four corner points, allowing for rotated and skewed text detection.

ClsProcessor - Text Orientation

The classification processor determines if text is upright (0°) or upside-down (180°).

Configuration

From cls_processor.rs:14:
pub struct ClsProcessorConfig {
    pub image_shape: [usize; 3],   // Default: [3, 48, 192]
    pub batch_num: usize,          // Default: 6
    pub thresh: f32,               // Default: 0.9
    pub label: Vec<u16>,           // Default: [0, 180]
}

Processing Pipeline

From cls_processor.rs:127:

1. Batch Preparation

let mut image_index_asc_size: Vec<usize> = (0..crop_images.len()).collect();
image_index_asc_size.sort_by_key(|&i| Reverse(OrderedFloat(crop_images[i].ori_ratio())));

let batched = image_index_asc_size
    .chunks(self.config.batch_num)
    .map(|batch| {
        // Process batch...
    })
    .collect();
Key insight: Images are sorted by aspect ratio (descending) before batching. Images with similar aspect ratios therefore land in the same batch, which reduces the padding needed to bring them to a common width.
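The index-sorting and chunking idea can be sketched without the image types. Here batch_by_ratio is a hypothetical helper that takes precomputed width/height ratios and uses f32::total_cmp instead of OrderedFloat.

```rust
/// Returns batches of image indices, ordered by descending width/height ratio.
fn batch_by_ratio(ratios: &[f32], batch_num: usize) -> Vec<Vec<usize>> {
    assert!(batch_num > 0, "batch_num must be positive");
    let mut order: Vec<usize> = (0..ratios.len()).collect();
    // Stable sort: images with equal ratios keep their original relative order.
    order.sort_by(|&a, &b| ratios[b].total_cmp(&ratios[a]));
    order.chunks(batch_num).map(|chunk| chunk.to_vec()).collect()
}
```

Because batches are built from sorted indices rather than sorted images, each result can still be written back to the position of the original crop.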

2. Image Resizing

crop_images[i].resize_norm_image(self.config.image_shape, None)
Resizes to fixed dimensions (3 × 48 × 192) with padding to maintain aspect ratio.

3. Postprocessing

fn postprocess<'a>(
    &self,
    input: Self::PostProcessInput<'a>,  // Array2<f32>
    _: Self::PostProcessInputExtra<'a>,
) -> RettoResult<Self::PostProcessOutput<'a>> {
    let pred_idxs = input.map_axis(Axis(1), |row| row.argmax().unwrap());
    let mut out = Vec::with_capacity(pred_idxs.len());
    for (i, &class_idx) in pred_idxs.iter().enumerate() {
        let score = input[(i, class_idx)];
        let label = self.config.label[class_idx];
        out.push(ClsPostProcessLabel { label, score });
    }
    Ok(out)
}

4. Automatic Rotation

if label.label == 180 && label.score >= self.config.thresh {
    crop_images[idx].rotate_180_in_place()?;
}
Images detected as upside-down with high confidence (≥ 0.9) are automatically rotated before recognition.
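The argmax and rotation decision can be sketched on a single row of class scores. The names classify and should_rotate are hypothetical, and a plain slice stands in for one row of the Array2 output.

```rust
/// Picks the highest-scoring class for one row and maps it to its label.
/// Returns None for an empty row or when no label exists for the index.
fn classify(row: &[f32], labels: &[u16]) -> Option<(u16, f32)> {
    let (idx, &score) = row
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    labels.get(idx).map(|&label| (label, score))
}

/// Rotation is applied only for the 180 label at or above the threshold.
fn should_rotate(label: u16, score: f32, thresh: f32) -> bool {
    label == 180 && score >= thresh
}
```

With the default threshold of 0.9, a crop classified as 180 with score 0.8 is left untouched, which favors leaving text as-is when the classifier is unsure.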

Output Format

pub struct ClsProcessorResult(pub Vec<ClsProcessorSingleResult>);

pub struct ClsProcessorSingleResult {
    pub label: ClsPostProcessLabel,
}

pub struct ClsPostProcessLabel {
    pub label: u16,    // 0 or 180
    pub score: f32,    // Confidence
}

RecProcessor - Text Recognition

The recognition processor converts text images into strings using CTC decoding.

Configuration

From rec_processor.rs:102:
pub struct RecProcessorConfig {
    pub character_source: RecCharacterDictProvider,
    pub image_shape: [usize; 3],   // Default: [3, 48, 320]
    pub batch_num: usize,          // Default: 6
}

Character Dictionary

The RecCharacter struct (rec_processor.rs:23) manages the character vocabulary:
pub(crate) struct RecCharacter {
    inner: Vec<String>,           // Character dictionary
    ignored_tokens: Vec<usize>,   // Tokens to skip (e.g., blank)
}
Dictionary sources:
  • HuggingFace: Downloads from pk5ls20/PaddleModel/retto/onnx/ppocr_keys_v1.txt
  • Local Path: Reads from file system
  • Blob: Uses embedded data (WebAssembly)
The dictionary is initialized with special tokens (rec_processor.rs:38):
dict.push(" ".to_string());      // Space character
dict.insert(0, "blank".to_string()); // CTC blank token
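The effect of these two lines on index layout can be seen in a small sketch. It assumes the dictionary source is a text file with one character per line (build_dict is a hypothetical helper, not Retto's loader).

```rust
/// Builds a recognition dictionary from newline-separated characters.
/// Index 0 is the CTC blank; the space character is appended at the end.
fn build_dict(source: &str) -> Vec<String> {
    let mut dict: Vec<String> = source.lines().map(|line| line.to_string()).collect();
    dict.push(" ".to_string());
    dict.insert(0, "blank".to_string());
    dict
}
```

Under that assumption, the character on line n of the file (counting from zero) ends up at index n + 1, which is why model output index 0 never corresponds to a printable character.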

Processing Pipeline

From rec_processor.rs:214:

1. Dynamic Width Calculation

let mut max_wh_ratio = OrderedFloat(w as f32 / h as f32);
image_index_asc_size.chunks(self.config.batch_num).try_for_each(|batch_idx| {
    let mut wh_ratios = Vec::with_capacity(batch_idx.len());
    batch_idx.iter().for_each(|&i| {
        let img = &images[i];
        let (img_h, img_w) = img.size();
        let wh_ratio = OrderedFloat(img_w as f32 / img_h as f32);
        wh_ratios.push(wh_ratio);
        max_wh_ratio = max(max_wh_ratio, wh_ratio);
    });
    // ...
});
Dynamically adjusts image width based on aspect ratio to minimize padding while maintaining fixed height (48px).

2. Batch Processing

let mats = batch_idx
    .iter()
    .map(|&i| {
        images[i]
            .resize_norm_image(
                self.config.image_shape,
                Some(max_wh_ratio.into_inner()),
            )
            .insert_axis(Axis(0))
    })
    .collect::<Vec<_>>();
All images in a batch are resized to the same width (determined by max aspect ratio).
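The width arithmetic can be sketched as follows. This assumes the usual PaddleOCR convention (batch width = height × max ratio, each image resized to at most that width and padded on the right); resize_norm_image in Retto may differ in detail, and batch_widths is a hypothetical helper.

```rust
/// Returns (batch_width, per-image resized widths) for a fixed height.
/// `base_ratio` is the ratio of the configured shape (e.g. 320 / 48).
/// Assumption: follows the common PaddleOCR resizing convention.
fn batch_widths(height: u32, wh_ratios: &[f32], base_ratio: f32) -> (u32, Vec<u32>) {
    // The batch ratio never drops below the configured shape's ratio.
    let max_ratio = wh_ratios.iter().copied().fold(base_ratio, f32::max);
    let batch_w = (height as f32 * max_ratio) as u32;
    let widths = wh_ratios
        .iter()
        // Keep each image's aspect ratio, capped at the batch width.
        .map(|&r| ((height as f32 * r).ceil() as u32).min(batch_w))
        .collect();
    (batch_w, widths)
}
```

For height 48 and ratios 2.0, 10.0, and 5.5, the batch width becomes 480 and the images are resized to widths 96, 480, and 264; the narrower ones are padded up to 480.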

3. CTC Decoding

From rec_processor.rs:48:
fn decode(
    &self,
    text_index: &Array2<usize>,    // Predicted character indices
    text_prob: &Array2<f32>,       // Prediction probabilities
    wh_ratio_list: &[OrderedFloat<f32>],
    max_wh_ratio: OrderedFloat<f32>,
    remove_duplicate: bool,         // CTC duplicate removal
    return_word_box: bool,          // TODO: Word-level boxes
) -> Vec<(String, f32)>
CTC Duplicate Removal (rec_processor.rs:62):
if remove_duplicate {
    Zip::from(selection.slice_mut(s![1..]))
        .and(token_indices.slice(s![1..]))
        .and(token_indices.slice(s![..-1]))
        .for_each(|sel, &curr, &prev| {
            *sel = *sel && curr != prev;
        });
}
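Collapses runs of consecutive identical indices, keeping the first element of each run. A genuinely doubled letter survives only when the model emits a blank between the two occurrences: “hheel-lloo” (with “-” as the blank) decodes to “hello”, whereas “hheelllloo” would collapse to “helo”.
Ignored Token Filtering (rec_processor.rs:70):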
self.ignored_tokens.iter().for_each(|ignored| {
    Zip::from(&mut selection)
        .and(token_indices)
        .for_each(|sel, &idx| {
            *sel = *sel && idx != *ignored;
        });
});
Filters out CTC blank tokens (index 0).
Score Calculation (rec_processor.rs:87):
.filter_map(|((&sel, &idx), &p)| match sel {
    true => Some((self.inner[idx].clone(), p)),
    false => None,
})
.fold(
    (String::with_capacity(text_len), 0.0, 0u32),
    |(mut acc_str, acc_sum, sum), (seg, p)| {
        acc_str.push_str(&seg);
        (acc_str, acc_sum + p, sum + 1)
    },
);
(pre_res.0, pre_res.1 / pre_res.2 as f32)  // Average score
The final score is the mean confidence of the characters that remain after duplicate removal and blank filtering.
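The three decoding steps fit into one short loop when written without ndarray. The sketch below is a plain greedy CTC decoder for a single sequence; ctc_decode is a hypothetical helper, and returning a score of 0.0 for an empty result is an assumption of this sketch rather than documented behavior.

```rust
/// Greedy CTC decoding for one sequence.
/// `indices[t]` is the argmax class at time step t and `probs[t]` its probability.
fn ctc_decode(indices: &[usize], probs: &[f32], dict: &[&str], blank: usize) -> (String, f32) {
    let mut text = String::new();
    let (mut sum, mut count) = (0.0f32, 0u32);
    for (t, (&idx, &p)) in indices.iter().zip(probs).enumerate() {
        // Step 1: drop an index that repeats the previous time step.
        let repeated = t > 0 && indices[t - 1] == idx;
        // Step 2: drop the blank token.
        if repeated || idx == blank {
            continue;
        }
        if let Some(s) = dict.get(idx) {
            text.push_str(s);
            sum += p;
            count += 1;
        }
    }
    // Step 3: average the probabilities of the kept characters.
    // Assumption: an empty result gets score 0.0 to avoid dividing by zero.
    let score = if count == 0 { 0.0 } else { sum / count as f32 };
    (text, score)
}
```

With the dictionary ["blank", "h", "e", "l", "o"], the index sequence 1, 1, 2, 3, 3, 0, 3, 4, 4 decodes to "hello" because the blank separates the two runs of "l"; without it, the same letters collapse to "helo".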

Output Format

pub struct RecProcessorResult(pub Vec<RecProcessorSingleResult>);

pub struct RecProcessorSingleResult {
    pub text: String,   // Recognized text
    pub score: f32,     // Average confidence
}

Configuration Examples

High Precision Detection

DetProcessorConfig {
    thresh: 0.2,              // Lower threshold for better recall
    box_thresh: 0.6,          // Higher threshold for precision
    unclip_ratio: 1.8,        // More generous box expansion
    score_mode: ScoreMode::Slow,  // Accurate polygon scoring
    ..Default::default()
}

Fast Processing

ClsProcessorConfig {
    batch_num: 12,            // Larger batches
    thresh: 0.7,              // Lower confidence threshold
    ..Default::default()
}

RecProcessorConfig {
    batch_num: 12,
    image_shape: [3, 32, 256], // Smaller dimensions
    ..Default::default()
}

Large Images

DetProcessorConfig {
    limit_side_len: 1280,     // Higher resolution
    limit_type: LimitType::Min,
    max_candidates: 2000,     // More boxes
    ..Default::default()
}
The detection model expects input dimensions that are multiples of 32 pixels. DetProcessor handles this alignment automatically when it resizes the image.

Performance Tips

  1. Batch Size: Increase batch_num for better GPU utilization, but watch memory usage
  2. Image Resolution: Higher limit_side_len improves accuracy but increases processing time
  3. Score Mode: Use ScoreMode::Fast for speed, ScoreMode::Slow for accuracy
  4. Dilation: Disable use_dilation for sharp, well-separated text
  5. Unclip Ratio: Reduce for tightly-cropped text, increase for text with large spacing

Next Steps

Architecture

Understand how processors fit into the overall architecture

Workers

Learn about backend selection and model loading
