
Overview

The Recognition Processor (RecProcessor) performs text recognition (OCR) on text images. It converts images of text into readable strings using character dictionaries and CTC (Connectionist Temporal Classification) decoding.

Source: retto-core/src/processor/rec_processor.rs

RecProcessor

The main recognition processor that extracts text from images.

Constructor

pub fn new(config: &RecProcessorConfig, character: &RecCharacter) -> Self
config
&RecProcessorConfig
required
Recognition processor configuration
character
&RecCharacter
required
Character dictionary for text decoding. Maps model outputs to text characters.

Process Method

fn process<F>(
    &self,
    images: &Vec<ImageHelper>,
    worker_fun: F,
) -> RettoResult<RecProcessorResult>
where
    F: FnMut(Array4<f32>) -> RettoResult<Array3<f32>>
Processes a batch of text images to extract recognized text.
images
&Vec<ImageHelper>
required
Immutable reference to a vector of images to recognize. These should be cropped text regions (e.g., from detection results).
worker_fun
F
required
Worker function that runs model inference on preprocessed batches. Takes a 4D tensor (batch × channels × height × width) and returns a 3D tensor (batch × sequence × classes).
RecProcessorResult
struct
Recognition results containing recognized text and confidence scores for each image

RecProcessorConfig

Configuration structure for the recognition processor.

Fields

character_source
RecCharacterDictProvider
required
Identifies the provider of the model dictionary source. Specifies where to load the character dictionary from.
image_shape
[usize; 3]
default:"[3, 48, 320]"
Image size during recognition as [channels, height, width]. Text images are resized to this shape before processing.
batch_num
usize
default:"6"
Batch size for recognition. Images are processed in batches of this size for efficiency.

Default Configuration

The default configuration varies by platform and features:
// With hf-hub feature (Hugging Face Hub)
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::HuggingFace {
            repo: "pk5ls20/PaddleModel".into(),
            model: "retto/onnx/ppocr_keys_v1.txt".into(),
        }
    ),
    image_shape: [3, 48, 320],
    batch_num: 6,
};

// Without hf-hub feature (local file)
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("ppocr_keys_v1.txt".into())
    ),
    image_shape: [3, 48, 320],
    batch_num: 6,
};

// WebAssembly with download-models feature (embedded)
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Blob(
            include_bytes!("../../models/ppocr_keys_v1.txt").to_vec()
        )
    ),
    image_shape: [3, 48, 320],
    batch_num: 6,
};

Example

use retto_core::processor::RecProcessorConfig;
use retto_core::processor::RecCharacterDictProvider;
use retto_core::worker::RettoWorkerModelSource;

// Use default configuration
let config = RecProcessorConfig::default();

// Custom configuration with local dictionary
let custom_config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("./my_dict.txt".into())
    ),
    image_shape: [3, 32, 256],
    batch_num: 8,
};

// Configuration with embedded dictionary
let embedded_config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Blob(
            include_bytes!("../dicts/chinese.txt").to_vec()
        )
    ),
    image_shape: [3, 48, 320],
    batch_num: 6,
};

RecCharacterDictProvider

Enum specifying the source of the character dictionary.
pub enum RecCharacterDictProvider {
    OutSide(RettoWorkerModelSource),
    Inline(),  // TODO: Load from ONNX model
}
OutSide
enum variant
Load dictionary from an external source (local file, Hugging Face Hub, or embedded blob). Requires a RettoWorkerModelSource parameter.
Inline
enum variant
Load dictionary from the ONNX model itself (not yet implemented).

RettoWorkerModelSource

Specifies the actual source location:
pub enum RettoWorkerModelSource {
    // Load from local file path
    Path(PathBuf),
    
    // Download from Hugging Face Hub
    HuggingFace { repo: String, model: String },
    
    // Load from embedded byte array
    Blob(Vec<u8>),
}

RecCharacter

Internal structure managing the character dictionary and decoding logic.

Constructor

pub fn new(
    dict: RecCharacterDictProvider,
    ignored_tokens: Vec<usize>,
) -> RettoResult<Self>
dict
RecCharacterDictProvider
required
Source provider for the character dictionary
ignored_tokens
Vec<usize>
required
List of token indices to ignore during decoding (e.g., padding tokens)

Dictionary Format

Character dictionaries are plain text files with one character per line:





...
The processor automatically adds special characters:
  • "blank" is inserted at index 0 (CTC blank token)
  • " " (space) is appended at the end
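The layout described above can be sketched as follows. This is a minimal illustration and not the crate's actual loader: `build_char_table` is a hypothetical helper, and it assumes that every line of the file becomes one table entry without any filtering.

```rust
/// Builds the index-to-character table from dictionary text.
/// Hypothetical helper: one entry per line, with "blank" prepended
/// and " " appended, as described in the Dictionary Format section.
fn build_char_table(dict_text: &str) -> Vec<String> {
    let mut table = Vec::with_capacity(dict_text.lines().count() + 2);
    table.push("blank".to_string()); // index 0: CTC blank token
    // `lines()` strips both "\n" and "\r\n" line endings.
    table.extend(dict_text.lines().map(|line| line.to_string()));
    table.push(" ".to_string()); // last index: space
    table
}
```

With a three-line dictionary the table has five entries, so the first dictionary character sits at index 1, not 0.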

Decode Method

fn decode(
    &self,
    text_index: &Array2<usize>,
    text_prob: &Array2<f32>,
    wh_ratio_list: &[OrderedFloat<f32>],
    max_wh_ratio: OrderedFloat<f32>,
    remove_duplicate: bool,
    return_word_box: bool,  // Not yet implemented
) -> Vec<(String, f32)>
Internal method that decodes model outputs into text strings.

RecProcessorResult

Result structure containing recognition results for all processed images.
pub struct RecProcessorResult(pub Vec<RecProcessorSingleResult>);
0
Vec<RecProcessorSingleResult>
Vector of recognition results, one per input image in the same order as input

RecProcessorSingleResult

Recognition result for a single image.
pub struct RecProcessorSingleResult {
    pub text: String,
    pub score: f32,
}
text
String
The recognized text string extracted from the image
score
f32
Average confidence score for the recognized text (0.0 to 1.0). Calculated as the mean of all character probabilities. Higher values indicate more confident recognition.
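As a rough sketch of how such a score can be computed, the helper below averages the probabilities of the characters that survive decoding. `mean_confidence` is a hypothetical name, and returning 0.0 for an empty result is an assumption; the crate may handle that case differently.

```rust
/// Mean of the per-character probabilities kept after decoding.
/// Assumption: an empty result scores 0.0.
fn mean_confidence(char_probs: &[f32]) -> f32 {
    if char_probs.is_empty() {
        return 0.0;
    }
    char_probs.iter().sum::<f32>() / char_probs.len() as f32
}
```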

Processing Pipeline

The recognition processor follows this pipeline:
  1. Batch Preparation:
    • Sort images by aspect ratio (width/height) in descending order
    • Group images into batches of size batch_num
    • Calculate maximum width/height ratio for each batch
  2. Preprocessing (per batch):
    • Resize each image to fit image_shape while maintaining aspect ratio
    • Images narrower than the widest image in the batch are padded to match its width
    • Normalize pixel values
    • Stack images into a batch tensor (4D array)
  3. Model Inference:
    • Pass preprocessed batch to the recognition model via worker_fun
    • Model outputs character probability sequence (3D array: batch × sequence × classes)
  4. Postprocessing (CTC Decoding):
    • For each image in the batch:
      • Find the most likely character at each time step (argmax)
      • Remove duplicate consecutive characters
      • Remove blank tokens (index 0)
      • Remove ignored tokens
      • Map token indices to characters using the dictionary
      • Calculate average confidence score
    • Store results maintaining original input order
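The batch preparation step can be sketched with plain indices. This is an illustration under stated assumptions, not the crate's code: `plan_batches` is a hypothetical helper, sizes are given as (width, height) pairs with non-zero heights, and the original index of every image is kept so that results can be written back in input order.

```rust
/// Returns batches of original image indices, sorted by width/height ratio
/// in descending order, together with each batch's maximum ratio.
/// Hypothetical helper; `sizes` holds (width, height) pairs.
fn plan_batches(sizes: &[(u32, u32)], batch_num: usize) -> Vec<(Vec<usize>, f32)> {
    assert!(batch_num > 0, "batch_num must be positive");
    let ratio = |i: usize| sizes[i].0 as f32 / sizes[i].1 as f32;
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    // Descending by aspect ratio; total_cmp gives a total order on f32.
    order.sort_by(|&a, &b| ratio(b).total_cmp(&ratio(a)));
    order
        .chunks(batch_num)
        .map(|chunk| {
            // The widest image in the batch determines the padded width.
            let max_ratio = chunk.iter().map(|&i| ratio(i)).fold(0.0_f32, f32::max);
            (chunk.to_vec(), max_ratio)
        })
        .collect()
}
```

Because each batch carries the original indices, a result for position `k` of a batch can be stored at `results[batch[k]]`, which preserves the input order regardless of sorting.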

CTC Decoding

The processor uses CTC (Connectionist Temporal Classification) decoding:
  1. Argmax Selection: Select the most probable character at each time step
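  2. Duplicate Removal: Consecutive identical characters are merged, so “hheelllloo” collapses to “helo”; a doubled letter survives only when a blank separates the two runs (“hheel_lloo”, with “_” as the blank, yields “hello”)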
  3. Blank Removal: CTC blank tokens are removed
  4. Character Mapping: Token indices are mapped to actual characters

Example Decoding

Model output indices: [0, 15, 15, 8, 0, 12, 12, 0, 0]
After dedup:          [0, 15, 8, 0, 12, 0]
After blank removal:  [15, 8, 12]
Mapped to characters: ['h', 'e', 'l']
Final text:           "hel"
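A minimal greedy CTC decoder for a single sequence might look like the sketch below. It uses plain vectors instead of ndarray arrays, and `ctc_greedy_decode` is a hypothetical function, not the crate's `decode` method. Repeats are collapsed before blanks are dropped, which is what lets a blank separate two identical characters.

```rust
/// Greedy CTC decoding of one sequence (hypothetical sketch).
/// `probs` has one row per time step and one column per class (blank is class 0).
/// `table` maps class indices to characters; `ignored` lists extra indices to drop.
fn ctc_greedy_decode(probs: &[Vec<f32>], table: &[&str], ignored: &[usize]) -> (String, f32) {
    let mut text = String::new();
    let mut kept_probs = Vec::new();
    let mut prev: Option<usize> = None;
    for step in probs {
        // Argmax over classes for this time step.
        let (idx, p) = step
            .iter()
            .copied()
            .enumerate()
            .fold((0, f32::MIN), |best, cur| if cur.1 > best.1 { cur } else { best });
        // Collapse repeats first, then drop blank and ignored tokens.
        let is_repeat = prev == Some(idx);
        prev = Some(idx);
        if is_repeat || idx == 0 || ignored.contains(&idx) {
            continue;
        }
        text.push_str(table[idx]);
        kept_probs.push(p);
    }
    // Score is the mean probability of the characters that were kept.
    let score = if kept_probs.is_empty() {
        0.0
    } else {
        kept_probs.iter().sum::<f32>() / kept_probs.len() as f32
    };
    (text, score)
}
```

With the table `["blank", "h", "e", "l", "o", " "]`, the argmax sequence 1, 1, 2, 3, 0, 3, 4 decodes to "hello", while 1, 1, 2, 3, 3, 4 (no blank between the two l runs) decodes to "helo".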

Example Usage

use retto_core::processor::{RecProcessor, RecProcessorConfig, RecCharacter};
use retto_core::processor::RecCharacterDictProvider;
use retto_core::image_helper::ImageHelper;

// Create configuration
let config = RecProcessorConfig::default();

// Load character dictionary
let character = RecCharacter::new(
    config.character_source.clone(),
    vec![],  // No ignored tokens
)?;

// Create processor
let processor = RecProcessor::new(&config, &character);

// Prepare images (e.g., cropped and oriented text regions)
let images: Vec<ImageHelper> = vec![/* ... */];

// Process images with model inference function
let results = processor.process(&images, |batch| {
    // Run your recognition model inference here
    model.run(batch)
})?;

// Access recognition results
for (i, result) in results.0.iter().enumerate() {
    println!("Image {}: text = '{}', confidence = {:.2}",
        i,
        result.text,
        result.score
    );
}

Integration with Detection and Classification

The recognition processor is typically the final stage in an OCR pipeline:
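use retto_core::processor::{DetProcessor, ClsProcessor, RecProcessor};
use retto_core::image_helper::ImageHelper;

// Assumes the three processors, the model_* inference closures, the input
// `image`, and a `crop_region` helper are defined elsewhere.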

// 1. Detect text regions
let det_results = det_processor.process(image, model_det)?;

// 2. Crop detected regions
let mut crop_images: Vec<ImageHelper> = det_results.0
    .iter()
    .map(|det| crop_region(image, &det.boxes))
    .collect();

// 3. Classify and correct orientation
let cls_results = cls_processor.process(&mut crop_images, model_cls)?;

// 4. Recognize text
let rec_results = rec_processor.process(&crop_images, model_rec)?;

// 5. Combine results
for (det, rec) in det_results.0.iter().zip(rec_results.0.iter()) {
    println!("Text at {:?}: '{}' (confidence: {:.2})",
        det.boxes,
        rec.text,
        rec.score
    );
}

Character Dictionaries

Standard Dictionaries

ppocr_keys_v1.txt: Default Chinese + English + numbers (6623 characters)
  • Includes simplified Chinese characters
  • English letters (uppercase and lowercase)
  • Numbers and common punctuation
  • Suitable for general Chinese OCR

Custom Dictionaries

Create custom dictionaries for specific use cases:
# English letters only
A
B
C
...
Z
a
b
c
...
z
# Numbers only
0
1
2
3
4
5
6
7
8
9
# Specific domain (e.g., license plates)
A
B
C
...



...

Loading Custom Dictionaries

// From local file
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("./custom_dict.txt".into())
    ),
    ..Default::default()
};

// From Hugging Face Hub
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::HuggingFace {
            repo: "my-org/my-models".into(),
            model: "dicts/my_custom_dict.txt".into(),
        }
    ),
    ..Default::default()
};

// Embedded in binary
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Blob(
            include_bytes!("../dicts/custom.txt").to_vec()
        )
    ),
    ..Default::default()
};

Performance Considerations

  • batch_num: Larger batches improve throughput but require more memory. Adjust based on your hardware.
  • image_shape:
    • Height: Smaller heights (e.g., 32) are faster but may reduce accuracy for complex characters
    • Width: Wider images (e.g., 320-640) support longer text sequences
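  • Dictionary Size: Smaller dictionaries (e.g., English only) shrink the model's output layer, which reduces inference cost; with fewer candidate classes, accuracy on the covered characters is often better than with large dictionaries (e.g., full CJK). The model must be trained with the same dictionary.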
  • Aspect Ratio Sorting: The processor automatically sorts images by aspect ratio to minimize padding waste
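The effect of aspect ratio sorting on padding can be estimated with a small sketch. `padding_waste` is a hypothetical helper that measures padding in units of image height, assuming each batch is padded to its widest member.

```rust
/// Total padding (in units of image height) when every image in a batch
/// is padded to the widest aspect ratio in that batch. Hypothetical helper.
fn padding_waste(ratios: &[f32], batch_num: usize) -> f32 {
    assert!(batch_num > 0, "batch_num must be positive");
    ratios
        .chunks(batch_num)
        .map(|batch| {
            let widest = batch.iter().copied().fold(0.0_f32, f32::max);
            batch.iter().map(|r| widest - r).sum::<f32>()
        })
        .sum()
}
```

For the ratios 1, 8, 2, 7 with a batch size of 2, batching in input order pads by 12 height units in total, while batching after a descending sort (8, 7, 2, 1) pads by only 2.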

Common Use Cases

Chinese OCR

let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("ppocr_keys_v1.txt".into())
    ),
    image_shape: [3, 48, 320],
    batch_num: 6,
};

English-Only OCR

let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("english_dict.txt".into())
    ),
    image_shape: [3, 32, 256],  // Smaller for Latin characters
    batch_num: 8,
};

Number Recognition (e.g., Credit Cards)

let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("digits_dict.txt".into())
    ),
    image_shape: [3, 32, 192],  // Smaller for simple digits
    batch_num: 16,  // Larger batches for simple cases
};

License Plate Recognition

let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("license_plate_dict.txt".into())
    ),
    image_shape: [3, 48, 168],  // License plates are short
    batch_num: 12,
};

Error Handling

Common errors and solutions:

Dictionary Not Found

// Error: Failed to load character dictionary
// Solution: Check that the dictionary file exists and is readable
let config = RecProcessorConfig {
    character_source: RecCharacterDictProvider::OutSide(
        RettoWorkerModelSource::Path("./dicts/ppocr_keys_v1.txt".into())
    ),
    ..Default::default()
};

Invalid UTF-8 in Dictionary

// Error: Invalid UTF-8 in dictionary blob
// Solution: Ensure dictionary file is UTF-8 encoded
// Use a text editor to re-save with UTF-8 encoding
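Before handing a blob to the processor, the bytes can be checked with the standard library alone. The sketch below is not part of the crate: `dict_from_blob` and its error message are hypothetical, and it only shows how to locate the first invalid byte.

```rust
/// Validates that a dictionary blob is UTF-8 and reports where decoding fails.
/// Hypothetical helper; the error text is illustrative only.
fn dict_from_blob(blob: Vec<u8>) -> Result<String, String> {
    String::from_utf8(blob).map_err(|e| {
        format!(
            "dictionary is not valid UTF-8 (first invalid byte at offset {})",
            e.utf8_error().valid_up_to()
        )
    })
}
```

The reported offset points at the first byte that could not be decoded, which helps locate the offending line in a large dictionary.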

Model Output Shape Mismatch

// Error: Model output classes don't match dictionary size
// Solution: Ensure your model is trained with the same dictionary
// Model output classes = dictionary lines + 2 (blank token at index 0, space at the end)
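A mismatch can be caught early by comparing the class dimension of the model output with the length of the decoding table. The check below is a hypothetical sketch. It assumes the table is built as described under Dictionary Format (blank inserted first, space appended), so a 6623-line dictionary gives 6625 entries.

```rust
/// Compares the model's class dimension with the decoding table length.
/// Hypothetical helper; `table_len` already includes blank and space.
fn check_class_count(num_classes: usize, table_len: usize) -> Result<(), String> {
    if num_classes == table_len {
        Ok(())
    } else {
        Err(format!(
            "model outputs {num_classes} classes but the decoding table has {table_len} entries"
        ))
    }
}
```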
