Overview
Retto uses three specialized processors that handle the OCR pipeline stages:
DetProcessor - Text detection using DB (Differentiable Binarization) algorithm
ClsProcessor - Text orientation classification
RecProcessor - Text recognition with CTC decoding
Each processor follows a consistent three-step pattern:
Preprocess: Transform input into model-compatible format
Worker Inference: Execute ONNX model
Postprocess: Convert model output into usable results
Processor Architecture
All processors implement the Processor trait (processor.rs:33):
pub(crate) trait Processor: ProcessorInner {
    type Config;
    type ProcessInput<'pl>;
    fn process<'a, F>(
        &self,
        input: Self::ProcessInput<'a>,
        worker_fun: F,
    ) -> RettoResult<Self::FinalResult>
    where
        F: FnMut(Self::PreProcessOutput<'a>) -> RettoResult<Self::PostProcessInput<'a>>;
}
This design allows each processor to define its own input/output types while maintaining a consistent interface.
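The three-step pattern can be sketched with a much smaller trait. Everything below (MiniProcessor, ArgmaxProcessor, the String error type) is hypothetical and only mirrors the shape of the interface; it is not Retto's actual trait, which splits these pieces across Processor and ProcessorInner.

```rust
/// Hypothetical, simplified stand-in for the preprocess -> worker -> postprocess pattern.
trait MiniProcessor {
    type Input;
    type PreOut;
    type PostIn;
    type Final;

    fn preprocess(&self, input: Self::Input) -> Result<Self::PreOut, String>;
    fn postprocess(&self, raw: Self::PostIn) -> Result<Self::Final, String>;

    /// Runs preprocess, hands the result to the caller-supplied worker, then postprocesses.
    fn process<F>(&self, input: Self::Input, mut worker_fun: F) -> Result<Self::Final, String>
    where
        F: FnMut(Self::PreOut) -> Result<Self::PostIn, String>,
    {
        let pre = self.preprocess(input)?;
        let raw = worker_fun(pre)?;
        self.postprocess(raw)
    }
}

/// Toy processor: scales bytes to [0, 1] and reports the index of the largest output.
struct ArgmaxProcessor;

impl MiniProcessor for ArgmaxProcessor {
    type Input = Vec<u8>;
    type PreOut = Vec<f32>;
    type PostIn = Vec<f32>;
    type Final = usize;

    fn preprocess(&self, input: Vec<u8>) -> Result<Vec<f32>, String> {
        Ok(input.into_iter().map(|b| b as f32 / 255.0).collect())
    }

    fn postprocess(&self, raw: Vec<f32>) -> Result<usize, String> {
        raw.iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .ok_or_else(|| "empty model output".to_string())
    }
}
```

The worker closure is where a real backend would run the ONNX model; in this sketch an identity closure such as `|x| Ok(x)` is enough to exercise the flow.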
DetProcessor - Text Detection
The detection processor locates text regions in images using the DB algorithm.
Configuration
From det_processor.rs:44:
pub struct DetProcessorConfig {
    // Preprocessing
    pub limit_side_len: usize,                  // Default: 736
    pub limit_type: LimitType,                  // Default: Min
    pub mean: Array1<f32>,                      // Default: [0.5, 0.5, 0.5]
    pub std: Array1<f32>,                       // Default: [0.5, 0.5, 0.5]
    pub scale: f32,                             // Default: 1/255
    // Postprocessing
    pub thresh: f32,                            // Default: 0.3
    pub box_thresh: f32,                        // Default: 0.5
    pub max_candidates: usize,                  // Default: 1000
    pub unclip_ratio: f32,                      // Default: 1.6
    pub use_dilation: bool,                     // Default: true
    pub score_mode: ScoreMode,                  // Default: Fast
    pub min_mini_box_size: usize,               // Default: 3
    pub dilation_kernel: Option<Array2<usize>>, // Default: 2×2 kernel
}
Preprocessing Pipeline
From det_processor.rs:256:
1. Image Resizing
let mut rs_helper = ImageHelper::new_from_rgb_image_flow(input, h, w);
rs_helper.resize_either(&self.config.limit_type, self.config.limit_side_len)?;
LimitType options:
Min: Ensures shortest side ≥ limit_side_len (default behavior)
Max: Ensures longest side ≤ limit_side_len
Dimensions are aligned to 32-pixel boundaries for optimal model performance.
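A minimal sketch of how the target size could be computed from the limit type and the 32-pixel alignment. The function name target_size and the rounding rule (nearest multiple of 32, never below 32) are assumptions borrowed from the usual PaddleOCR convention, not taken from resize_either itself.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum LimitType {
    Min,
    Max,
}

/// Returns the (height, width) an image would be resized to.
/// Assumed rounding: nearest multiple of 32, with a floor of 32.
fn target_size(h: usize, w: usize, limit_type: LimitType, limit_side_len: usize) -> (usize, usize) {
    let limit = limit_side_len as f32;
    let ratio = match limit_type {
        // Upscale only when the shortest side is below the limit.
        LimitType::Min => {
            let short = h.min(w) as f32;
            if short < limit { limit / short } else { 1.0 }
        }
        // Downscale only when the longest side exceeds the limit.
        LimitType::Max => {
            let long = h.max(w) as f32;
            if long > limit { limit / long } else { 1.0 }
        }
    };
    let align = |side: usize| -> usize {
        let scaled = side as f32 * ratio;
        (((scaled / 32.0).round() as usize) * 32).max(32)
    };
    (align(h), align(w))
}
```

With the defaults (Min, 736), a 480 × 640 (height × width) image would come out as 736 × 992 under these assumptions.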
2. Color Space Conversion
let input = rs_helper.rgb2bgr()?;
Converts RGB to BGR format expected by PaddleOCR models.
3. Normalization
fn normalize(&self, input: &Array3<u8>) -> RettoResult<Array3<f32>> {
    let normalized = (input.mapv(|x| x as f32) * self.config.scale
        - &self.config.mean) / &self.config.std;
    Ok(Array3::from(normalized))
}
Applies standard normalization: (pixel * scale - mean) / std
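As a worked example of the formula, the sketch below normalizes a single three-channel pixel with plain arrays instead of ndarray. The helper name normalize_pixel is hypothetical.

```rust
/// Applies (pixel * scale - mean) / std to each channel of one pixel.
fn normalize_pixel(pixel: [u8; 3], scale: f32, mean: [f32; 3], std_dev: [f32; 3]) -> [f32; 3] {
    std::array::from_fn(|c| (pixel[c] as f32 * scale - mean[c]) / std_dev[c])
}
```

With the default scale of 1/255 and mean and std of 0.5, the byte range 0..=255 maps onto roughly -1.0..=1.0: a channel value of 0 becomes about -1.0, 255 becomes about 1.0, and 51 becomes about -0.6.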
4. Channel Permutation
fn permute(&self, input: Array3<f32>) -> RettoResult<Array3<f32>> {
    let permuted = input.permuted_axes((2, 0, 1));
    Ok(permuted)
}
Converts HWC (Height × Width × Channels) to CHW format required by the model.
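To make the index mapping concrete, here is a sketch of the same reordering on a flat buffer. It copies the data, whereas ndarray's permuted_axes only changes how the existing buffer is viewed; the function name hwc_to_chw is hypothetical.

```rust
/// Reorders a flat, row-major HWC buffer into CHW order.
fn hwc_to_chw(data: &[f32], h: usize, w: usize, c: usize) -> Vec<f32> {
    assert_eq!(data.len(), h * w * c, "buffer length must equal h * w * c");
    let mut out = vec![0.0; data.len()];
    for y in 0..h {
        for x in 0..w {
            for ch in 0..c {
                // HWC index: (y * w + x) * c + ch; CHW index: ch * h * w + y * w + x.
                out[ch * h * w + y * w + x] = data[(y * w + x) * c + ch];
            }
        }
    }
    out
}
```

For a 1 × 2 image with pixels (1, 2, 3) and (4, 5, 6), the result is [1, 4, 2, 5, 3, 6]: all values of channel 0 first, then channel 1, then channel 2.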
Postprocessing Pipeline
From det_processor.rs:279:
1. Thresholding
let mut mask = GrayImage::from_fn(w, h, |x, y| {
    let v = input[[0, 0, y as usize, x as usize]];
    Luma([if v > self.config.thresh { 255 } else { 0 }])
});
Creates a binary mask where pixels > thresh (default 0.3) are considered text.
2. Morphological Dilation
if let Some(ref k) = self.dilation_kernel {
    mask = grayscale_dilate(&mask, k);
}
Optionally expands text regions to merge nearby components using a 2×2 kernel.
3. Contour Detection
let mut boxes_res: Vec<_> = find_contours::<i32>(&mask)
    .iter()
    .filter_map(|contour| {
        let (points, sside) = self.get_mini_boxes(&contour.points);
        // Filter by size and score...
    })
    .collect();
Finds contours in the binary mask and computes minimum area rectangles.
4. Box Scoring
let mean_score = self.box_score_fast(&pred, &points);
if mean_score < self.config.box_thresh {
    return None;
}
Two scoring modes (score_mode):
Fast (default): Average score within bounding rectangle
Slow: Average score within exact polygon (more accurate)
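The Fast mode, as described above, can be sketched as a mean over the axis-aligned bounding rectangle of the box corners. The function name box_score_rect, the nested Vec probability map, and the clamping of coordinates to the map edges are assumptions for illustration; the real box_score_fast may differ in detail.

```rust
/// Mean probability inside the bounding rectangle of `points` (x, y).
/// `pred` is a rectangular, row-major probability map.
/// Returns None when the map or the point list is empty.
fn box_score_rect(pred: &[Vec<f32>], points: &[(f32, f32)]) -> Option<f32> {
    let h = pred.len();
    let w = pred.first()?.len();
    if w == 0 || points.is_empty() {
        return None;
    }
    // Clamp a coordinate into the valid index range [0, len - 1].
    let clamp = |v: f32, len: usize| v.clamp(0.0, (len - 1) as f32) as usize;
    let xs = || points.iter().map(|p| p.0);
    let ys = || points.iter().map(|p| p.1);
    let xmin = clamp(xs().fold(f32::INFINITY, f32::min).floor(), w);
    let xmax = clamp(xs().fold(f32::NEG_INFINITY, f32::max).ceil(), w);
    let ymin = clamp(ys().fold(f32::INFINITY, f32::min).floor(), h);
    let ymax = clamp(ys().fold(f32::NEG_INFINITY, f32::max).ceil(), h);

    let (mut sum, mut count) = (0.0f32, 0u32);
    for row in &pred[ymin..=ymax] {
        for &v in &row[xmin..=xmax] {
            sum += v;
            count += 1;
        }
    }
    Some(sum / count as f32)
}
```

A box whose score falls below box_thresh (default 0.5) would then be discarded, as in the snippet above.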
5. Box Unclipping
fn unclip<T>(&self, point_box: &PointBox<T>) -> Vec<ImagePoint<OrderedFloat<f32>>> {
    let polygon = Polygon::new(LineString(exterior_coords), vec![]);
    let area = polygon.unsigned_area();
    let perimeter = /* ... */;
    let distance = area * (self.config.unclip_ratio) / perimeter;
    let offset_polys = polygon.offset(distance, JoinType::Round(0.5),
        EndType::ClosedPolygon, 1.0);
    // ...
}
Expands detected boxes using the Vatti clipping algorithm with unclip_ratio (default 1.6) to ensure complete text capture.
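The offset distance from the snippet above, area * unclip_ratio / perimeter, can be computed by hand for a simple polygon. The sketch below uses the shoelace formula for the area; the function name unclip_distance is hypothetical, and the polygon offsetting itself is left out.

```rust
/// Offset distance used for unclipping: area * unclip_ratio / perimeter.
/// Returns None for degenerate polygons with zero perimeter.
fn unclip_distance(pts: &[(f32, f32)], unclip_ratio: f32) -> Option<f32> {
    let mut twice_area = 0.0f32;
    let mut perimeter = 0.0f32;
    for (i, &(x1, y1)) in pts.iter().enumerate() {
        let (x2, y2) = pts[(i + 1) % pts.len()];
        twice_area += x1 * y2 - x2 * y1; // shoelace term
        perimeter += (x2 - x1).hypot(y2 - y1);
    }
    if perimeter == 0.0 {
        return None;
    }
    Some(twice_area.abs() * 0.5 * unclip_ratio / perimeter)
}
```

For a 100 × 20 rectangle with the default ratio of 1.6, the area is 2000 and the perimeter is 240, so each edge would move outward by about 13.3 pixels.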
6. Box Filtering
From det_processor.rs:298-316:
if sside < self.config.min_mini_box_size as f32 {
    return None; // Box too small
}
if pb_h <= OrderedFloat(3f32) || pb_w <= OrderedFloat(3f32) {
    return None; // Final box dimensions too small
}
7. Box Sorting
boxes_res.sort_by(|r1, r2| {
    let (c1, c2) = (r1.boxes.center_point(), r2.boxes.center_point());
    let (y1, y2) = (c1.y.into_inner(), c2.y.into_inner());
    // x-coordinates, assumed to be read the same way as y (not shown in the excerpt)
    let (x1, x2) = (c1.x.into_inner(), c2.x.into_inner());
    if (y1 - y2).abs() < 10f32 {
        // Same line: sort by x-coordinate
        x1.partial_cmp(&x2).unwrap()
    } else {
        // Different lines: sort by y-coordinate
        y1.partial_cmp(&y2).unwrap()
    }
});
Sorts boxes top-to-bottom, left-to-right. Boxes within 10 pixels vertically are considered on the same line.
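The same reading order can also be produced in two passes: sort by y, split into lines using the tolerance, then sort each line by x. This is a different technique from the pairwise comparator shown above, offered only as a sketch; a comparator with a tolerance is not guaranteed to form a total order, which the standard library documents as a requirement for sort_by. The function name reading_order and the (x, y) tuple representation are hypothetical.

```rust
/// Returns indices into `centers` (x, y) in top-to-bottom, left-to-right order.
/// A box joins the current line when its y is within `line_tol` of the line's first box.
fn reading_order(centers: &[(f32, f32)], line_tol: f32) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..centers.len()).collect();
    // Pass 1: a plain total order on y.
    idx.sort_by(|&a, &b| centers[a].1.total_cmp(&centers[b].1));

    let mut out = Vec::with_capacity(idx.len());
    let mut start = 0;
    while start < idx.len() {
        let line_y = centers[idx[start]].1;
        let mut end = start + 1;
        while end < idx.len() && (centers[idx[end]].1 - line_y).abs() < line_tol {
            end += 1;
        }
        // Pass 2: within one line, order by x.
        let line = &mut idx[start..end];
        line.sort_by(|&a, &b| centers[a].0.total_cmp(&centers[b].0));
        out.extend_from_slice(line);
        start = end;
    }
    out
}
```

Note that the grouping rule here (distance to the first box of the line) is an assumption and can assign borderline boxes differently from the pairwise version.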
pub struct DetProcessorResult(pub Vec<DetProcessorInnerResult>);
pub struct DetProcessorInnerResult {
    pub boxes: PointBox<OrderedFloat<f32>>, // 4 corner points
    pub score: f32,                         // Confidence score
}
Each detected region is represented as a quadrilateral with four corner points, allowing for rotated and skewed text detection.
ClsProcessor - Text Orientation
The classification processor determines if text is upright (0°) or upside-down (180°).
Configuration
From cls_processor.rs:14:
pub struct ClsProcessorConfig {
    pub image_shape: [usize; 3], // Default: [3, 48, 192]
    pub batch_num: usize,        // Default: 6
    pub thresh: f32,             // Default: 0.9
    pub label: Vec<u16>,         // Default: [0, 180]
}
Processing Pipeline
From cls_processor.rs:127:
1. Batch Preparation
let mut image_index_asc_size: Vec<usize> = (0..crop_images.len()).collect();
image_index_asc_size.sort_by_key(|&i| Reverse(OrderedFloat(crop_images[i].ori_ratio())));
let batched = image_index_asc_size
    .chunks(self.config.batch_num)
    .map(|batch| {
        // Process batch...
    })
    .collect();
Key insight: Images are sorted by aspect ratio (descending) before batching. This groups images with similar aspect ratios into the same batch, which minimizes padding and improves efficiency.
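The sort-then-chunk step can be sketched on bare aspect ratios. The function name batch_by_ratio is hypothetical, and a batch size of 0 is treated as 1 here to avoid a panic in chunks.

```rust
/// Groups image indices into batches after sorting by aspect ratio, widest first.
fn batch_by_ratio(ratios: &[f32], batch_num: usize) -> Vec<Vec<usize>> {
    let mut idx: Vec<usize> = (0..ratios.len()).collect();
    // Descending order: compare b against a.
    idx.sort_by(|&a, &b| ratios[b].total_cmp(&ratios[a]));
    idx.chunks(batch_num.max(1)).map(|c| c.to_vec()).collect()
}
```

For ratios [2.0, 8.0, 3.0, 7.5, 1.0] and a batch size of 2, the two widest images (indices 1 and 3) share a batch, so neither needs much padding to match the other.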
2. Image Resizing
crop_images[i].resize_norm_image(self.config.image_shape, None)
Resizes to fixed dimensions (3 × 48 × 192) with padding to maintain aspect ratio.
3. Postprocessing
fn postprocess<'a>(
    &self,
    input: Self::PostProcessInput<'a>, // Array2<f32>
    _: Self::PostProcessInputExtra<'a>,
) -> RettoResult<Self::PostProcessOutput<'a>> {
    let pred_idxs = input.map_axis(Axis(1), |row| row.argmax().unwrap());
    let mut out = Vec::with_capacity(pred_idxs.len());
    for (i, &class_idx) in pred_idxs.iter().enumerate() {
        let score = input[(i, class_idx)];
        let label = self.config.label[class_idx];
        out.push(ClsPostProcessLabel { label, score });
    }
    Ok(out)
}
4. Automatic Rotation
if label.label == 180 && label.score >= self.config.thresh {
    crop_images[idx].rotate_180_in_place()?;
}
Images detected as upside-down with high confidence (≥ 0.9) are automatically rotated before recognition.
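Postprocessing and the rotation decision together amount to an argmax followed by a threshold check. The sketch below shows both on plain arrays for the two-class case; classify and should_rotate are hypothetical names.

```rust
/// Picks the higher-scoring class for each row of two-class probabilities.
fn classify(probs: &[[f32; 2]], labels: [u16; 2]) -> Vec<(u16, f32)> {
    probs
        .iter()
        .map(|row| {
            let idx = if row[1] > row[0] { 1 } else { 0 };
            (labels[idx], row[idx])
        })
        .collect()
}

/// Rotate only when the 180° label wins with at least `thresh` confidence.
fn should_rotate(label: u16, score: f32, thresh: f32) -> bool {
    label == 180 && score >= thresh
}
```

With the default threshold of 0.9, a crop classified as 180° with a score of 0.92 would be rotated, while one classified as 180° with a score of 0.7 would be left as it is.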
pub struct ClsProcessorResult(pub Vec<ClsProcessorSingleResult>);
pub struct ClsProcessorSingleResult {
    pub label: ClsPostProcessLabel,
}
pub struct ClsPostProcessLabel {
    pub label: u16, // 0 or 180
    pub score: f32, // Confidence
}
RecProcessor - Text Recognition
The recognition processor converts text images into strings using CTC decoding.
Configuration
From rec_processor.rs:102:
pub struct RecProcessorConfig {
    pub character_source: RecCharacterDictProvider,
    pub image_shape: [usize; 3], // Default: [3, 48, 320]
    pub batch_num: usize,        // Default: 6
}
Character Dictionary
The RecCharacter struct (rec_processor.rs:23) manages the character vocabulary:
pub(crate) struct RecCharacter {
    inner: Vec<String>,         // Character dictionary
    ignored_tokens: Vec<usize>, // Tokens to skip (e.g., blank)
}
Dictionary sources:
HuggingFace: Downloads from pk5ls20/PaddleModel/retto/onnx/ppocr_keys_v1.txt
Local Path: Reads from file system
Blob: Uses embedded data (WebAssembly)
The dictionary is initialized with special tokens (rec_processor.rs:38):
dict.push(" ".to_string());        // Space character
dict.insert(0, "blank".to_string()); // CTC blank token
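The effect of those two calls on the index layout is easy to check in isolation. The sketch below assumes the key file holds one character per line; build_dict is a hypothetical helper, not part of RecCharacter.

```rust
/// Builds a dictionary from newline-separated characters,
/// appending a space and placing the CTC blank token at index 0.
fn build_dict(keys: &str) -> Vec<String> {
    let mut dict: Vec<String> = keys.lines().map(|l| l.to_string()).collect();
    dict.push(" ".to_string()); // space goes last
    dict.insert(0, "blank".to_string()); // shifts every character up by one
    dict
}
```

Because the blank is inserted at the front, the first character in the key file ends up at index 1, and index 0 is what the decoder later treats as the ignored token.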
Processing Pipeline
From rec_processor.rs:214:
1. Dynamic Width Calculation
let mut max_wh_ratio = OrderedFloat(w as f32 / h as f32);
image_index_asc_size.chunks(self.config.batch_num).try_for_each(|batch_idx| {
    let mut wh_ratios = Vec::with_capacity(batch_idx.len());
    batch_idx.iter().for_each(|&i| {
        let img = &images[i];
        let (img_h, img_w) = img.size();
        let wh_ratio = OrderedFloat(img_w as f32 / img_h as f32);
        wh_ratios.push(wh_ratio);
        max_wh_ratio = max(max_wh_ratio, wh_ratio);
    });
    // ...
});
Dynamically adjusts image width based on aspect ratio to minimize padding while maintaining fixed height (48px).
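A sketch of how the batch width might follow from the largest aspect ratio. The formulas used here (batch width = height × max ratio, truncated; per-image width = ceil(height × ratio), capped at the batch width) are assumptions modeled on the usual PaddleOCR recognition preprocessing, and batch_widths is a hypothetical name; resize_norm_image may differ.

```rust
/// Returns (batch_width, per-image resized widths) for images given as (height, width).
/// `image_shape` is [channels, height, width]; heights are assumed to be non-zero.
fn batch_widths(sizes: &[(usize, usize)], image_shape: [usize; 3]) -> (usize, Vec<usize>) {
    let [_, img_h, img_w] = image_shape;
    // Start from the configured shape so narrow batches keep the default width.
    let mut max_wh_ratio = img_w as f32 / img_h as f32;
    for &(h, w) in sizes {
        max_wh_ratio = max_wh_ratio.max(w as f32 / h as f32);
    }
    let batch_w = (img_h as f32 * max_wh_ratio) as usize;
    let widths = sizes
        .iter()
        .map(|&(h, w)| {
            let scaled = (img_h as f32 * (w as f32 / h as f32)).ceil() as usize;
            scaled.min(batch_w) // the remainder up to batch_w would be padding
        })
        .collect();
    (batch_w, widths)
}
```

Under these assumptions, a batch holding a 24 × 240 crop and a 48 × 96 crop gets a width of 480: the first image fills it completely and the second occupies 96 pixels, with the rest padded.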
2. Batch Processing
let mats = batch_idx
    .iter()
    .map(|&i| {
        images[i]
            .resize_norm_image(
                self.config.image_shape,
                Some(max_wh_ratio.into_inner()),
            )
            .insert_axis(Axis(0))
    })
    .collect::<Vec<_>>();
All images in a batch are resized to the same width (determined by max aspect ratio).
3. CTC Decoding
From rec_processor.rs:48:
fn decode(
    &self,
    text_index: &Array2<usize>, // Predicted character indices
    text_prob: &Array2<f32>,    // Prediction probabilities
    wh_ratio_list: &[OrderedFloat<f32>],
    max_wh_ratio: OrderedFloat<f32>,
    remove_duplicate: bool,     // CTC duplicate removal
    return_word_box: bool,      // TODO: Word-level boxes
) -> Vec<(String, f32)>
CTC Duplicate Removal (rec_processor.rs:62):
if remove_duplicate {
    Zip::from(selection.slice_mut(s![1..]))
        .and(token_indices.slice(s![1..]))
        .and(token_indices.slice(s![..-1]))
        .for_each(|sel, &curr, &prev| {
            *sel = *sel && curr != prev;
        });
}
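Removes consecutive duplicate indices, as CTC decoding requires. For example, the raw sequence “hheelllloo” collapses to “helo”; a genuine double letter such as the “ll” in “hello” survives only when the model emits a blank token between the two occurrences (e.g., “hheell_lloo”, with “_” standing for the blank).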
Ignored Token Filtering (rec_processor.rs:70):
self.ignored_tokens.iter().for_each(|ignored| {
    Zip::from(&mut selection)
        .and(token_indices)
        .for_each(|sel, &idx| {
            *sel = *sel && idx != *ignored;
        });
});
Filters out CTC blank tokens (index 0).
Score Calculation (rec_processor.rs:87):
    .filter_map(|((&sel, &idx), &p)| match sel {
        true => Some((self.inner[idx].clone(), p)),
        false => None,
    })
    .fold(
        (String::with_capacity(text_len), 0.0, 0u32),
        |(mut acc_str, acc_sum, sum), (seg, p)| {
            acc_str.push_str(&seg);
            (acc_str, acc_sum + p, sum + 1)
        },
    );
(pre_res.0, pre_res.1 / pre_res.2 as f32) // Average score
Final score is the average confidence across all predicted characters.
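The three decoding steps (duplicate removal, blank filtering, score averaging) fit into one loop for a single sequence. This is a minimal greedy CTC sketch on plain slices rather than the ndarray-based decode above; ctc_decode is a hypothetical name, and returning a score of 0.0 when no character survives is an assumption made here to avoid dividing by zero.

```rust
/// Greedy CTC decode of one sequence: drop repeats of the previous index,
/// drop the blank token, and average the probabilities of what remains.
fn ctc_decode(indices: &[usize], probs: &[f32], dict: &[String], blank: usize) -> (String, f32) {
    let mut text = String::new();
    let mut sum = 0.0f32;
    let mut kept = 0u32;
    let mut prev: Option<usize> = None;
    for (&idx, &p) in indices.iter().zip(probs) {
        // Compare against the raw previous index, blanks included.
        let is_repeat = prev == Some(idx);
        prev = Some(idx);
        if is_repeat || idx == blank {
            continue;
        }
        text.push_str(&dict[idx]);
        sum += p;
        kept += 1;
    }
    let score = if kept == 0 { 0.0 } else { sum / kept as f32 };
    (text, score)
}
```

With a dictionary of ["blank", "h", "e", "l", "o"], the index sequence [1, 1, 2, 3, 3, 0, 3, 4, 4] decodes to "hello" because a blank separates the two runs of "l"; without that blank the output would be "helo".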
pub struct RecProcessorResult(pub Vec<RecProcessorSingleResult>);
pub struct RecProcessorSingleResult {
    pub text: String, // Recognized text
    pub score: f32,   // Average confidence
}
Configuration Examples
High Precision Detection
DetProcessorConfig {
    thresh: 0.2,                 // Lower threshold for better recall
    box_thresh: 0.6,             // Higher threshold for precision
    unclip_ratio: 1.8,           // More generous box expansion
    score_mode: ScoreMode::Slow, // Accurate polygon scoring
    ..Default::default()
}
Fast Processing
ClsProcessorConfig {
    batch_num: 12, // Larger batches
    thresh: 0.7,   // Lower confidence threshold
    ..Default::default()
}
RecProcessorConfig {
    batch_num: 12,
    image_shape: [3, 32, 256], // Smaller dimensions
    ..Default::default()
}
Large Images
DetProcessorConfig {
    limit_side_len: 1280, // Higher resolution
    limit_type: LimitType::Min,
    max_candidates: 2000, // More boxes
    ..Default::default()
}
Detection input dimensions must be multiples of 32 pixels. The detection processor handles this alignment automatically during resizing.
Batch Size: Increase batch_num for better GPU utilization, but watch memory usage
Image Resolution: Higher limit_side_len improves accuracy but increases processing time
Score Mode: Use ScoreMode::Fast for speed, ScoreMode::Slow for accuracy
Dilation: Disable use_dilation for sharp, well-separated text
Unclip Ratio: Reduce for tightly-cropped text, increase for text with large spacing
Next Steps
Architecture: Understand how processors fit into the overall architecture
Workers: Learn about backend selection and model loading