Detection Processor

Overview

The Detection Processor (DetProcessor) implements the DB (Differentiable Binarization) algorithm for text detection in images. It identifies text regions and returns their bounding boxes with confidence scores. Source: retto-core/src/processor/det_processor.rs

DetProcessor

The main detection processor structure that handles text region detection.

Constructor

pub fn new(config: &DetProcessorConfig, ori_h: usize, ori_w: usize) -> RettoResult<Self>

config

&DetProcessorConfig

required

Detection processor configuration

ori_h

usize

required

Original image height after initial resize

ori_w

usize

required

Original image width after initial resize

Process Method

fn process<F>(
    &self,
    input: ArrayView3<u8>,
    worker_fun: F,
) -> RettoResult<DetProcessorResult>
where
    F: FnMut(Array4<f32>) -> RettoResult<Array4<f32>>

Processes an input image to detect text regions.

input

ArrayView3<u8>

required

Input image as a 3D array (height × width × channels) in RGB format

worker_fun

F: FnMut(Array4<f32>) -> RettoResult<Array4<f32>>

required

Worker function that runs model inference on the preprocessed tensor and returns the model output

DetProcessorResult

struct

Detection results containing bounding boxes and scores
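The worker_fun parameter keeps the inference backend out of the processor. The following is a minimal std-only sketch of that calling pattern, not retto's code: run_with_worker, Tensor, and SketchResult are hypothetical stand-ins for process, Array4<f32>, and RettoResult. It shows why an FnMut bound is convenient: the caller can pass a closure that carries mutable state, such as a session handle or a call counter.

```rust
// Hypothetical stand-ins for Array4<f32> and RettoResult (assumptions).
type Tensor = Vec<f32>;
type SketchResult<T> = Result<T, String>;

/// Preprocesses the input, hands it to the caller's worker, and returns
/// whatever the worker produced.
fn run_with_worker<F>(input: &[u8], mut worker_fun: F) -> SketchResult<Tensor>
where
    F: FnMut(Tensor) -> SketchResult<Tensor>,
{
    // Stand-in for preprocessing: scale bytes into [0, 1].
    let preprocessed: Tensor = input.iter().map(|&p| p as f32 / 255.0).collect();
    // The worker owns every model detail; its errors propagate via `?`.
    let output = worker_fun(preprocessed)?;
    Ok(output)
}
```

Because the closure is only borrowed for the duration of the call, any state it mutates is visible to the caller afterwards.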

DetProcessorConfig

Configuration structure for the detection processor implementing the DB algorithm.

Fields

Preprocessing

limit_side_len

usize

default:"736"

Limit side length of input image. Used to resize the input image before processing.

limit_type

LimitType

default:"LimitType::Min"

Input image side length restriction type. Controls how limit_side_len is applied.

mean

Array1<f32>

default:"[0.5, 0.5, 0.5]"

Channel-wise mean values for image normalization (RGB channels).

std

Array1<f32>

default:"[0.5, 0.5, 0.5]"

Channel-wise standard deviation values for image normalization (RGB channels).

scale

f32

default:"1.0 / 255.0"

Initial scale factor applied to pixel values before normalization.
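To make the interaction of scale, mean, and std concrete, here is a minimal sketch of the normalization (pixel * scale - mean) / std combined with the HWC to CHW layout change. It uses plain slices instead of ndarray, the function name normalize_hwc_to_chw is hypothetical, and the color-channel swap is left out. With the defaults, a pixel value of 0 maps to -1.0 and 255 maps to 1.0.

```rust
/// Normalizes an interleaved HWC u8 image and returns planar CHW f32 data.
/// Hypothetical helper for illustration; not part of retto.
fn normalize_hwc_to_chw(
    pixels: &[u8],
    h: usize,
    w: usize,
    scale: f32,
    mean: [f32; 3],
    std: [f32; 3],
) -> Vec<f32> {
    assert_eq!(pixels.len(), h * w * 3, "expected h * w * 3 bytes");
    let mut out = vec![0.0f32; 3 * h * w];
    for y in 0..h {
        for x in 0..w {
            for c in 0..3 {
                // Source index: channels are interleaved per pixel (HWC).
                let v = pixels[(y * w + x) * 3 + c] as f32;
                // Destination index: one full plane per channel (CHW).
                out[c * h * w + y * w + x] = (v * scale - mean[c]) / std[c];
            }
        }
    }
    out
}
```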

Postprocessing

threch

f32

default:"0.3"

In the probability map output by DB, only pixels with scores greater than this threshold are considered to be text pixels. Lower values detect more regions but may include false positives.
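As a small illustration of this threshold, the sketch below turns a probability map into a binary mask using a strict greater-than comparison, as the description states. The binarize function and the nested Vec representation are assumptions made for the example.

```rust
/// Marks each pixel whose probability is strictly greater than `thresh`.
/// Hypothetical helper; the map is stored as rows of f32 values.
fn binarize(prob_map: &[Vec<f32>], thresh: f32) -> Vec<Vec<bool>> {
    prob_map
        .iter()
        .map(|row| row.iter().map(|&p| p > thresh).collect())
        .collect()
}
```

Lowering the threshold can only add pixels to the mask, which is why smaller values find more regions and more false positives.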

box_thresh

f32

default:"0.5"

If the average score of all pixels inside a candidate box is greater than this threshold, the box is kept as a text region. Higher values require more confident detections.

max_candidates

usize

default:"1000"

Maximum number of text boxes to output. Limits the number of detected regions.

unclip_ratio

f32

default:"1.6"

Expansion coefficient for the Vatti clipping algorithm. This method expands the detected text area to ensure complete text coverage. Values > 1.0 expand the region.
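In the DB method, the expansion distance is commonly derived from the polygon itself as area * unclip_ratio / perimeter, and the polygon is then offset outward by that distance. The sketch below computes only that distance. It assumes retto follows the same formula, which this page does not confirm, and the function names are hypothetical.

```rust
/// Shoelace area of a simple polygon given as (x, y) vertices.
fn polygon_area(points: &[(f32, f32)]) -> f32 {
    let n = points.len();
    let mut twice_area = 0.0f32;
    for i in 0..n {
        let (x0, y0) = points[i];
        let (x1, y1) = points[(i + 1) % n];
        twice_area += x0 * y1 - x1 * y0;
    }
    twice_area.abs() / 2.0
}

/// Sum of edge lengths, including the closing edge.
fn polygon_perimeter(points: &[(f32, f32)]) -> f32 {
    let n = points.len();
    (0..n)
        .map(|i| {
            let (x0, y0) = points[i];
            let (x1, y1) = points[(i + 1) % n];
            (x1 - x0).hypot(y1 - y0)
        })
        .sum()
}

/// Outward offset distance used to expand a detected polygon (assumed formula).
fn unclip_distance(points: &[(f32, f32)], unclip_ratio: f32) -> f32 {
    let perimeter = polygon_perimeter(points);
    if perimeter == 0.0 {
        return 0.0; // degenerate polygon: nothing to expand
    }
    polygon_area(points) * unclip_ratio / perimeter
}
```

For a 100 × 20 rectangle and the default ratio of 1.6, the distance is 2000 * 1.6 / 240, or roughly 13.3 pixels on every side.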

use_dilation

bool

default:"true"

Whether to expand the segmentation results using morphological dilation. Helps connect nearby text regions.

score_mode

ScoreMode

default:"ScoreMode::Fast"

DB detection result scoring method. Determines how confidence scores are calculated.

min_mini_box_size

usize

default:"3"

Minimum side length threshold for text boxes. Boxes smaller than this are filtered out.

dilation_kernel

Option<Array2<usize>>

default:"Some([[1, 1], [1, 1]])"

Morphological dilation kernel. Used when use_dilation is true. A 2×2 kernel of ones by default.
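To show what a 2×2 kernel of ones does to a mask, here is a minimal binary dilation sketch. The dilate function is hypothetical, and the kernel anchor is assumed to sit at (rows / 2, cols / 2), mirroring the common OpenCV default; retto's actual anchor is not stated on this page.

```rust
/// Binary dilation: an output pixel is set if any source pixel under a
/// non-zero kernel cell is set. Anchor at (kh / 2, kw / 2) is an assumption.
fn dilate(mask: &[Vec<bool>], kernel: &[Vec<usize>]) -> Vec<Vec<bool>> {
    let h = mask.len();
    let w = mask.first().map_or(0, |r| r.len());
    let kh = kernel.len();
    let kw = kernel.first().map_or(0, |r| r.len());
    let (ay, ax) = ((kh / 2) as isize, (kw / 2) as isize);
    let mut out = vec![vec![false; w]; h];
    for y in 0..h {
        for x in 0..w {
            out[y][x] = (0..kh).any(|ky| {
                (0..kw).any(|kx| {
                    if kernel[ky][kx] == 0 {
                        return false;
                    }
                    // Source pixel covered by this kernel cell.
                    let sy = y as isize + ky as isize - ay;
                    let sx = x as isize + kx as isize - ax;
                    sy >= 0
                        && sx >= 0
                        && (sy as usize) < h
                        && (sx as usize) < w
                        && mask[sy as usize][sx as usize]
                })
            });
        }
    }
    out
}
```

Under this anchor convention a single set pixel grows into a 2×2 block extending down and to the right, which is enough to bridge one-pixel gaps between neighboring strokes.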

Example

use retto_core::processor::{DetProcessorConfig, LimitType, ScoreMode};

// Use default configuration
let config = DetProcessorConfig::default();

// Custom configuration for high-precision detection
let custom_config = DetProcessorConfig {
    limit_side_len: 960,
    limit_type: LimitType::Max,
    threch: 0.2,
    box_thresh: 0.6,
    unclip_ratio: 2.0,
    use_dilation: true,
    score_mode: ScoreMode::Slow,
    min_mini_box_size: 5,
    ..Default::default()
};

DetProcessorResult

Result structure containing all detected text regions.

pub struct DetProcessorResult(pub Vec<DetProcessorInnerResult>);

Vec<DetProcessorInnerResult>

Vector of individual detection results, sorted by position (top-to-bottom, left-to-right)
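A top-to-bottom, left-to-right ordering can be sketched as a sort on the top-left corner, first by y and then by x. The Detection struct below is a hypothetical simplification, and the sketch omits any tolerance for grouping boxes that sit on the same visual line, which some DB pipelines apply; whether retto does so is not stated here.

```rust
/// Hypothetical, simplified detection: top-left corner as (x, y) plus a score.
#[derive(Debug, Clone, PartialEq)]
struct Detection {
    tl: (f32, f32),
    score: f32,
}

/// Sorts detections by top-left y, breaking ties by top-left x.
fn sort_reading_order(dets: &mut [Detection]) {
    dets.sort_by(|a, b| {
        a.tl.1
            .total_cmp(&b.tl.1)
            .then(a.tl.0.total_cmp(&b.tl.0))
    });
}
```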

DetProcessorInnerResult

Individual detection result for a single text region.

boxes

PointBox<OrderedFloat<f32>>

Bounding box of the detected text region as a quadrilateral. Points are ordered clockwise starting from the top-left corner.

score

f32

Confidence score for this detection (0.0 to 1.0). Higher values indicate more confident detections.

PointBox Structure

A four-point frame (quadrilateral) representing the detected text region.

tl()

&Point<T>

Top-left corner of the bounding box

tr()

&Point<T>

Top-right corner of the bounding box

br()

&Point<T>

Bottom-right corner of the bounding box

bl()

&Point<T>

Bottom-left corner of the bounding box

points()

&[Point<T>; 4]

All four corner points as an array (clockwise from top-left)

center_point()

Point<T>

Center point of the bounding box

width_tlc()

Width of the bounding box calculated from top-left corner

height_tlc()

Height of the bounding box calculated from top-left corner
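The accessors above can be pictured with a small stand-in type. This is only a sketch: Pt and Quad are hypothetical, not retto's Point and PointBox, and it assumes that width and height "from the top-left corner" mean the lengths of the edges leaving that corner (top-left to top-right, and top-left to bottom-left).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pt {
    x: f32,
    y: f32,
}

/// Hypothetical stand-in for PointBox: four corners, clockwise from top-left.
struct Quad {
    points: [Pt; 4],
}

impl Quad {
    fn tl(&self) -> &Pt { &self.points[0] }
    fn tr(&self) -> &Pt { &self.points[1] }
    fn br(&self) -> &Pt { &self.points[2] }
    fn bl(&self) -> &Pt { &self.points[3] }

    /// Mean of the four corners.
    fn center_point(&self) -> Pt {
        let sx: f32 = self.points.iter().map(|p| p.x).sum();
        let sy: f32 = self.points.iter().map(|p| p.y).sum();
        Pt { x: sx / 4.0, y: sy / 4.0 }
    }

    /// Length of the top edge (assumed meaning of width_tlc).
    fn width_from_tl(&self) -> f32 {
        (self.tr().x - self.tl().x).hypot(self.tr().y - self.tl().y)
    }

    /// Length of the left edge (assumed meaning of height_tlc).
    fn height_from_tl(&self) -> f32 {
        (self.bl().x - self.tl().x).hypot(self.bl().y - self.tl().y)
    }
}
```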

LimitType

Enum defining how the limit_side_len parameter is applied during preprocessing.

pub enum LimitType {
    Min,  // default
    Max,
}

Min

enum variant

default:true

Ensure that the shortest side of the image is not less than limit_side_len. Use this to guarantee minimum resolution.

Max

enum variant

Ensure that the longest side of the image does not exceed limit_side_len. Use this to limit maximum processing size.
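The two variants can be read as two ways of choosing a resize ratio. The sketch below follows the descriptions above: Min scales up only when the shortest side is below the limit, and Max scales down only when the longest side exceeds it. The resize_ratio function is hypothetical, the enum is redeclared locally so the example is self-contained, and any further rounding of the resized dimensions is not covered.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum LimitType {
    Min,
    Max,
}

/// Returns the factor by which both sides would be multiplied.
/// Hypothetical helper that follows the variant descriptions only.
fn resize_ratio(h: usize, w: usize, limit_side_len: usize, limit_type: LimitType) -> f32 {
    let limit = limit_side_len as f32;
    match limit_type {
        LimitType::Min => {
            let shortest = h.min(w) as f32;
            if shortest < limit { limit / shortest } else { 1.0 }
        }
        LimitType::Max => {
            let longest = h.max(w) as f32;
            if longest > limit { limit / longest } else { 1.0 }
        }
    }
}
```

With limit_side_len = 736 and Min, a 368 × 1000 image is doubled; with limit_side_len = 960 and Max, a 1920 × 1080 image is halved.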

ScoreMode

Enum defining the scoring method for detection results.

pub enum ScoreMode {
    Slow,
    Fast,  // default
}

Fast

enum variant

default:true

Calculate the average score for all pixels within the bounding rectangle of the polygon. This is faster but less accurate as it includes pixels outside the actual text region.

Slow

enum variant

Calculate the average score based on all pixels within the original polygon only. This method is relatively slow but more accurate as it only considers actual text pixels.
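The difference between the two modes can be sketched as follows, based only on the descriptions above. Fast averages every pixel in the polygon's axis-aligned bounding rectangle, while Slow averages only pixels whose coordinates pass a point-in-polygon test. All names are hypothetical, the enum is redeclared locally, and a real implementation would more likely rasterize a polygon mask than test each pixel.

```rust
type Point = (f32, f32); // (x, y)

#[derive(Clone, Copy)]
enum ScoreMode {
    Slow,
    Fast,
}

/// Ray-casting point-in-polygon test.
fn point_in_polygon(px: f32, py: f32, poly: &[Point]) -> bool {
    let n = poly.len();
    let mut inside = false;
    for i in 0..n {
        let (xi, yi) = poly[i];
        let (xj, yj) = poly[(i + n - 1) % n];
        // Toggle when a horizontal ray from the point crosses this edge.
        if (yi > py) != (yj > py) && px < (xj - xi) * (py - yi) / (yj - yi) + xi {
            inside = !inside;
        }
    }
    inside
}

/// Mean probability for a polygon; assumes a non-empty, rectangular map.
fn box_score(prob: &[Vec<f32>], poly: &[Point], mode: ScoreMode) -> f32 {
    let (h, w) = (prob.len(), prob[0].len());
    let (mut x0, mut y0, mut x1, mut y1) = (f32::MAX, f32::MAX, f32::MIN, f32::MIN);
    for &(x, y) in poly {
        x0 = x0.min(x);
        y0 = y0.min(y);
        x1 = x1.max(x);
        y1 = y1.max(y);
    }
    // Clamp the bounding rectangle to valid pixel indices.
    let clamp = |v: f32, len: usize| v.max(0.0).min((len - 1) as f32) as usize;
    let (x0, x1) = (clamp(x0.floor(), w), clamp(x1.ceil(), w));
    let (y0, y1) = (clamp(y0.floor(), h), clamp(y1.ceil(), h));
    let (mut sum, mut count) = (0.0f32, 0u32);
    for y in y0..=y1 {
        for x in x0..=x1 {
            let keep = match mode {
                ScoreMode::Fast => true,
                ScoreMode::Slow => point_in_polygon(x as f32, y as f32, poly),
            };
            if keep {
                sum += prob[y][x];
                count += 1;
            }
        }
    }
    if count == 0 { 0.0 } else { sum / count as f32 }
}
```

For a diamond-shaped region whose interior pixels all score 1.0, Slow reports 1.0, while Fast is pulled down by the low-scoring corners of the bounding rectangle. That dilution is the accuracy cost the Fast description refers to.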

Processing Pipeline

The detection processor follows this pipeline:

Preprocessing:
- Resize input image according to limit_type and limit_side_len
- Convert RGB to BGR color space
- Normalize pixel values: (pixel * scale - mean) / std
- Permute dimensions from HWC to CHW format
- Add batch dimension
Model Inference:
- Pass preprocessed data to the DB model via worker_fun
- Model outputs probability map for text regions
Postprocessing:
- Threshold probability map using threch to create binary mask
- Apply morphological dilation if use_dilation is enabled
- Find contours in the binary mask
- For each contour:
  - Get minimum area bounding box
  - Filter by min_mini_box_size
  - Calculate confidence score using score_mode
  - Filter by box_thresh
  - Expand region using Vatti clipping with unclip_ratio
  - Get final bounding box and scale to original image coordinates
- Sort results by position (top-to-bottom, left-to-right)
- Limit to max_candidates results
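The step that scales boxes back to original image coordinates can be sketched as a per-axis rescale from the probability map size to the original size. The function name and the clamping to the range [0, original size] are assumptions for illustration; retto's exact rounding and clipping are not described on this page.

```rust
/// Maps points from probability-map coordinates to original image coordinates.
/// Hypothetical helper; clamping bounds are an assumption.
fn scale_to_original(
    points: &[(f32, f32)],
    map_w: usize,
    map_h: usize,
    ori_w: usize,
    ori_h: usize,
) -> Vec<(f32, f32)> {
    let sx = ori_w as f32 / map_w as f32;
    let sy = ori_h as f32 / map_h as f32;
    points
        .iter()
        .map(|&(x, y)| {
            (
                (x * sx).clamp(0.0, ori_w as f32),
                (y * sy).clamp(0.0, ori_h as f32),
            )
        })
        .collect()
}
```

Clamping matters because the unclip expansion can push corners slightly outside the image.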

Example Usage

use retto_core::processor::{DetProcessor, DetProcessorConfig};
use ndarray::ArrayView3;

// Create configuration
let config = DetProcessorConfig::default();

// Load image as RGB array (height × width × 3).
// `load_image` is a placeholder for your own image-loading code.
let image: ArrayView3<u8> = load_image();
let (height, width) = (image.shape()[0], image.shape()[1]);

// Create processor
let processor = DetProcessor::new(&config, height, width)?;

// Process image with model inference function
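let results = processor.process(image, |preprocessed| {
    // Run your model inference here. `model` is a placeholder for your
    // inference backend; the closure must return RettoResult<Array4<f32>>.
    model.run(preprocessed)
})?;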

// Access detection results
for detection in results.0.iter() {
    println!("Text region found with score: {}", detection.score);
    println!("  Top-left: {:?}", detection.boxes.tl());
    println!("  Top-right: {:?}", detection.boxes.tr());
    println!("  Bottom-right: {:?}", detection.boxes.br());
    println!("  Bottom-left: {:?}", detection.boxes.bl());
}

Performance Considerations

- limit_side_len: Smaller values process faster but may miss small text. Larger values are more accurate but slower.
- score_mode: Fast is recommended for most cases. Use Slow only when accuracy is critical.
- use_dilation: Disabling dilation improves performance but may separate connected text regions.
- unclip_ratio: Lower values (1.2-1.5) are faster but may clip text edges. Higher values (1.6-2.0) ensure full text coverage.
