Overview

The Detection Processor (DetProcessor) implements the DB (Differentiable Binarization) algorithm for text detection in images. It identifies text regions and returns their bounding boxes with confidence scores. Source: retto-core/src/processor/det_processor.rs

DetProcessor

The main detection processor structure that handles text region detection.

Constructor

pub fn new(config: &DetProcessorConfig, ori_h: usize, ori_w: usize) -> RettoResult<Self>
config
&DetProcessorConfig
required
Detection processor configuration
ori_h
usize
required
Original image height after initial resize
ori_w
usize
required
Original image width after initial resize

Process Method

fn process<F>(
    &self,
    input: ArrayView3<u8>,
    worker_fun: F,
) -> RettoResult<DetProcessorResult>
where
    F: FnMut(Array4<f32>) -> RettoResult<Array4<f32>>
Processes an input image to detect text regions.
input
ArrayView3<u8>
required
Input image as a 3D array (height × width × channels) in RGB format
worker_fun
F
required
Worker function that runs model inference on preprocessed data
DetProcessorResult
struct
Detection results containing bounding boxes and scores

DetProcessorConfig

Configuration structure for the detection processor implementing the DB algorithm.

Fields

Preprocessing

limit_side_len
usize
default:"736"
Side length limit for the input image. The image is resized against this limit, as controlled by limit_type, before processing.
limit_type
LimitType
default:"LimitType::Min"
Input image side length restriction type. Controls how limit_side_len is applied.
mean
Array1<f32>
default:"[0.5, 0.5, 0.5]"
Channel-wise mean values for image normalization (RGB channels).
std
Array1<f32>
default:"[0.5, 0.5, 0.5]"
Channel-wise standard deviation values for image normalization (RGB channels).
scale
f32
default:"1.0 / 255.0"
Initial scale factor applied to pixel values before normalization.
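
As a rough illustration of how scale, mean, and std interact, the sketch below applies the formula (pixel * scale - mean) / std from the Processing Pipeline section to a single RGB pixel. The function names normalize_pixel and normalize_rgb are hypothetical and are not part of retto-core; the crate itself operates on ndarray arrays rather than fixed-size arrays.

```rust
/// Normalizes one 8-bit channel value as `(pixel * scale - mean) / std`.
/// Hypothetical helper for illustration, not a retto-core function.
fn normalize_pixel(pixel: u8, scale: f32, mean: f32, std: f32) -> f32 {
    (pixel as f32 * scale - mean) / std
}

/// Normalizes one RGB pixel using per-channel mean and std values.
fn normalize_rgb(pixel: [u8; 3], scale: f32, mean: [f32; 3], std: [f32; 3]) -> [f32; 3] {
    let mut out = [0.0f32; 3];
    for c in 0..3 {
        out[c] = normalize_pixel(pixel[c], scale, mean[c], std[c]);
    }
    out
}
```

With the documented defaults (scale = 1.0 / 255.0, mean = 0.5, std = 0.5), a channel value of 0 maps to about -1.0 and 255 maps to about 1.0, so the model input lies roughly in the range [-1, 1].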

Postprocessing

threch
f32
default:"0.3"
In the probability map output by DB, only pixels with scores greater than this threshold are considered to be text pixels. Lower values detect more regions but may include false positives.
box_thresh
f32
default:"0.5"
If the average score of all pixels inside a detected box is greater than this threshold, the box is kept as a text area. Higher values require more confident detections.
max_candidates
usize
default:"1000"
Maximum number of text boxes to output. Limits the number of detected regions.
unclip_ratio
f32
default:"1.6"
Expansion coefficient for the Vatti clipping algorithm. This method expands the detected text area to ensure complete text coverage. Larger values expand the region further.
use_dilation
bool
default:"true"
Whether to expand the segmentation results using morphological dilation. Helps connect nearby text regions.
score_mode
ScoreMode
default:"ScoreMode::Fast"
DB detection result scoring method. Determines how confidence scores are calculated.
min_mini_box_size
usize
default:"3"
Minimum side length threshold for text boxes. Boxes smaller than this are filtered out.
dilation_kernel
Option<Array2<usize>>
default:"Some([[1, 1], [1, 1]])"
Morphological dilation kernel. Used when use_dilation is true. A 2×2 kernel of ones by default.
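
To make the role of unclip_ratio more concrete, the sketch below computes the offset distance used by the DB paper and common DB implementations: distance = area × unclip_ratio / perimeter. It is an assumption that retto-core follows the same formula. All function names here are hypothetical, and the polygon offsetting step itself (the Vatti clipping) is not shown.

```rust
/// Absolute area of a simple polygon via the shoelace formula.
fn polygon_area(pts: &[(f32, f32)]) -> f32 {
    let n = pts.len();
    let mut twice_area = 0.0f32;
    for i in 0..n {
        let (x0, y0) = pts[i];
        let (x1, y1) = pts[(i + 1) % n];
        twice_area += x0 * y1 - x1 * y0;
    }
    twice_area.abs() / 2.0
}

/// Sum of the edge lengths of a closed polygon.
fn polygon_perimeter(pts: &[(f32, f32)]) -> f32 {
    let n = pts.len();
    (0..n)
        .map(|i| {
            let (x0, y0) = pts[i];
            let (x1, y1) = pts[(i + 1) % n];
            ((x1 - x0).powi(2) + (y1 - y0).powi(2)).sqrt()
        })
        .sum()
}

/// Offset distance by which the polygon would be grown outward.
/// Assumed formula (standard DB): area * unclip_ratio / perimeter.
fn unclip_distance(pts: &[(f32, f32)], unclip_ratio: f32) -> f32 {
    let perimeter = polygon_perimeter(pts);
    if perimeter == 0.0 {
        return 0.0; // degenerate polygon: nothing to expand
    }
    polygon_area(pts) * unclip_ratio / perimeter
}
```

For a 100 × 20 box with the default unclip_ratio of 1.6, this gives 2000 × 1.6 / 240 ≈ 13.3 pixels of outward offset. Under this formula the distance grows linearly with unclip_ratio.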

Example

use retto_core::processor::DetProcessorConfig;
use retto_core::processor::{LimitType, ScoreMode};
use ndarray::Array1;

// Use default configuration
let config = DetProcessorConfig::default();

// Custom configuration for high-precision detection
let custom_config = DetProcessorConfig {
    limit_side_len: 960,
    limit_type: LimitType::Max,
    threch: 0.2,
    box_thresh: 0.6,
    unclip_ratio: 2.0,
    use_dilation: true,
    score_mode: ScoreMode::Slow,
    min_mini_box_size: 5,
    ..Default::default()
};

DetProcessorResult

Result structure containing all detected text regions.
pub struct DetProcessorResult(pub Vec<DetProcessorInnerResult>);
0
Vec<DetProcessorInnerResult>
Vector of individual detection results, sorted by position (top-to-bottom, left-to-right)

DetProcessorInnerResult

Individual detection result for a single text region.
boxes
PointBox<OrderedFloat<f32>>
Bounding box of the detected text region as a quadrilateral. Points are ordered clockwise starting from the top-left corner.
score
f32
Confidence score for this detection (0.0 to 1.0). Higher values indicate more confident detections.

PointBox Structure

A four-point quadrilateral (point box) representing the detected text region.
tl()
&Point<T>
Top-left corner of the bounding box
tr()
&Point<T>
Top-right corner of the bounding box
br()
&Point<T>
Bottom-right corner of the bounding box
bl()
&Point<T>
Bottom-left corner of the bounding box
points()
&[Point<T>; 4]
All four corner points as an array (clockwise from top-left)
center_point()
Point<T>
Center point of the bounding box
width_tlc()
T
Width of the bounding box calculated from top-left corner
height_tlc()
T
Height of the bounding box calculated from top-left corner

LimitType

Enum defining how the limit_side_len parameter is applied during preprocessing.
pub enum LimitType {
    Min,  // default
    Max,
}
Min
enum variant
default:true
Ensure that the shortest side of the image is not less than limit_side_len. Use this to guarantee minimum resolution.
Max
enum variant
Ensure that the longest side of the image does not exceed limit_side_len. Use this to limit maximum processing size.
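
The two variants can be sketched as a scale-factor computation. This is a minimal illustration of the descriptions above, not the crate's code: the local LimitType enum is a stand-in for the real one, resize_ratio and resized_dims are hypothetical names, and any additional rounding a real implementation may apply (for example to a multiple of the network stride) is omitted. Dimensions are assumed to be non-zero.

```rust
/// Local stand-in for the crate's `LimitType`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LimitType {
    Min,
    Max,
}

/// Scale factor implied by `limit_type` for an image of size h × w.
fn resize_ratio(h: usize, w: usize, limit_side_len: usize, limit_type: LimitType) -> f32 {
    let limit = limit_side_len as f32;
    match limit_type {
        // Upscale only when the shortest side is below the limit.
        LimitType::Min => {
            let shortest = h.min(w) as f32;
            if shortest < limit { limit / shortest } else { 1.0 }
        }
        // Downscale only when the longest side exceeds the limit.
        LimitType::Max => {
            let longest = h.max(w) as f32;
            if longest > limit { limit / longest } else { 1.0 }
        }
    }
}

/// Resized (height, width), preserving the aspect ratio.
fn resized_dims(h: usize, w: usize, limit_side_len: usize, limit_type: LimitType) -> (usize, usize) {
    let r = resize_ratio(h, w, limit_side_len, limit_type);
    ((h as f32 * r).round() as usize, (w as f32 * r).round() as usize)
}
```

Under these assumptions, a 368 × 1000 image with limit_side_len = 736 and LimitType::Min is scaled up by 2, while a 1920 × 1080 image with LimitType::Max is scaled down so that its longest side becomes 736.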

ScoreMode

Enum defining the scoring method for detection results.
pub enum ScoreMode {
    Slow,
    Fast,  // default
}
Fast
enum variant
default:true
Calculate the average score for all pixels within the bounding rectangle of the polygon. This is faster but less accurate as it includes pixels outside the actual text region.
Slow
enum variant
Calculate the average score based on all pixels within the original polygon only. This method is relatively slow but more accurate as it only considers actual text pixels.
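
The difference between the two modes can be illustrated on a small probability map. The sketch below follows the descriptions above literally: score_fast averages every pixel in the polygon's axis-aligned bounding rectangle, while score_slow averages only pixels whose centers fall inside the polygon. All names are hypothetical, the pixel-center convention (x + 0.5, y + 0.5) is an assumption, and the map is assumed to be non-empty and rectangular.

```rust
type Point = (f32, f32); // (x, y)

/// Inclusive bounding rectangle (x0, y0, x1, y1), clamped to the map size.
fn bounds(poly: &[Point], w: usize, h: usize) -> (usize, usize, usize, usize) {
    let (mut min_x, mut min_y) = (f32::MAX, f32::MAX);
    let (mut max_x, mut max_y) = (f32::MIN, f32::MIN);
    for &(x, y) in poly {
        min_x = min_x.min(x);
        min_y = min_y.min(y);
        max_x = max_x.max(x);
        max_y = max_y.max(y);
    }
    let clamp = |v: f32, len: usize| (v.max(0.0) as usize).min(len - 1);
    (clamp(min_x.floor(), w), clamp(min_y.floor(), h),
     clamp(max_x.ceil(), w), clamp(max_y.ceil(), h))
}

/// Even-odd ray casting point-in-polygon test.
fn contains(poly: &[Point], px: f32, py: f32) -> bool {
    let mut inside = false;
    let mut j = poly.len() - 1;
    for i in 0..poly.len() {
        let ((xi, yi), (xj, yj)) = (poly[i], poly[j]);
        if (yi > py) != (yj > py) && px < (xj - xi) * (py - yi) / (yj - yi) + xi {
            inside = !inside;
        }
        j = i;
    }
    inside
}

/// Mean of the map over bounding-rectangle pixels accepted by `keep`.
fn mean_where(map: &[Vec<f32>], poly: &[Point], keep: impl Fn(usize, usize) -> bool) -> f32 {
    let (x0, y0, x1, y1) = bounds(poly, map[0].len(), map.len());
    let (mut sum, mut count) = (0.0f32, 0usize);
    for y in y0..=y1 {
        for x in x0..=x1 {
            if keep(x, y) {
                sum += map[y][x];
                count += 1;
            }
        }
    }
    if count == 0 { 0.0 } else { sum / count as f32 }
}

/// Fast mode: average over the whole bounding rectangle.
fn score_fast(map: &[Vec<f32>], poly: &[Point]) -> f32 {
    mean_where(map, poly, |_, _| true)
}

/// Slow mode: average over pixels whose centers lie inside the polygon.
fn score_slow(map: &[Vec<f32>], poly: &[Point]) -> f32 {
    mean_where(map, poly, |x, y| contains(poly, x as f32 + 0.5, y as f32 + 0.5))
}
```

For a triangular region, the bounding rectangle contains many background pixels, so the fast score is pulled well below the slow score. This is why a non-rectangular region can pass box_thresh in Slow mode yet fail it in Fast mode.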

Processing Pipeline

The detection processor follows this pipeline:
  1. Preprocessing:
    • Resize input image according to limit_type and limit_side_len
    • Convert RGB to BGR color space
    • Normalize pixel values: (pixel * scale - mean) / std
    • Permute dimensions from HWC to CHW format
    • Add batch dimension
  2. Model Inference:
    • Pass preprocessed data to the DB model via worker_fun
    • Model outputs probability map for text regions
  3. Postprocessing:
    • Threshold probability map using threch to create binary mask
    • Apply morphological dilation if use_dilation is enabled
    • Find contours in the binary mask
    • For each contour:
      • Get minimum area bounding box
      • Filter by min_mini_box_size
      • Calculate confidence score using score_mode
      • Filter by box_thresh
      • Expand region using Vatti clipping with unclip_ratio
      • Get final bounding box and scale to original image coordinates
    • Sort results by position (top-to-bottom, left-to-right)
    • Limit to max_candidates results
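
The first two postprocessing steps can be sketched as follows. This is a minimal illustration using nested Vecs instead of the crate's ndarray types; threshold and dilate_2x2 are hypothetical names. The anchor convention for the 2×2 kernel (each output pixel looks at itself and its upper and left neighbours, as in OpenCV's default) is an assumption about the implementation.

```rust
/// Binary mask: true where the probability is strictly greater than `thresh`.
fn threshold(prob: &[Vec<f32>], thresh: f32) -> Vec<Vec<bool>> {
    prob.iter()
        .map(|row| row.iter().map(|&p| p > thresh).collect())
        .collect()
}

/// Morphological dilation with a 2×2 kernel of ones.
/// Output (y, x) is set if any input pixel in rows y-1..=y and
/// columns x-1..=x is set (assumed anchor convention).
fn dilate_2x2(mask: &[Vec<bool>]) -> Vec<Vec<bool>> {
    let h = mask.len();
    let w = if h == 0 { 0 } else { mask[0].len() };
    let mut out = vec![vec![false; w]; h];
    for y in 0..h {
        for x in 0..w {
            let y0 = y.saturating_sub(1);
            let x0 = x.saturating_sub(1);
            out[y][x] = (y0..=y).any(|yy| (x0..=x).any(|xx| mask[yy][xx]));
        }
    }
    out
}
```

With a single pixel above the threshold, the dilated mask covers a 2×2 block. This is how dilation can bridge one-pixel gaps between neighbouring text fragments before contours are extracted.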

Example Usage

use retto_core::processor::{DetProcessor, DetProcessorConfig};
use ndarray::ArrayView3;

// Create configuration
let config = DetProcessorConfig::default();

// Load image as RGB array (height × width × 3).
// `load_image` is a placeholder for your own image loader.
let image: ArrayView3<u8> = load_image();
let (height, width) = (image.shape()[0], image.shape()[1]);

// Create processor
let processor = DetProcessor::new(&config, height, width)?;

// Process image with model inference function
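let results = processor.process(image, |preprocessed| {
    // Run your model inference here. `model` is a placeholder for your
    // inference backend; the closure must return RettoResult<Array4<f32>>.
    model.run(preprocessed)
})?;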

// Access detection results
for detection in results.0.iter() {
    println!("Text region found with score: {}", detection.score);
    println!("  Top-left: {:?}", detection.boxes.tl());
    println!("  Top-right: {:?}", detection.boxes.tr());
    println!("  Bottom-right: {:?}", detection.boxes.br());
    println!("  Bottom-left: {:?}", detection.boxes.bl());
}

Performance Considerations

  • limit_side_len: Smaller values process faster but may miss small text. Larger values are more accurate but slower.
  • score_mode: Fast is recommended for most cases. Use Slow only when accuracy is critical.
  • use_dilation: Disabling dilation improves performance but may separate connected text regions.
  • unclip_ratio: Lower values (1.2-1.5) produce tighter boxes but may clip text edges. Higher values (1.6-2.0) ensure fuller text coverage.
