The Sanitizer Worker cleans tax document data to prevent security issues and ensure data quality.
Overview
The Sanitizer Worker processes document data to:
- Remove potentially harmful content
- Clean and trim whitespace
- Escape special characters
- Remove invalid or dangerous data
- Ensure data safety and integrity
Methods
sanitize
Sanitizes tax document data for security and quality.
Parameters:
- bag (DocumentBagInterface, required): Container with the document data to sanitize.
Returns: Sanitized document data array.
Throws: SanitizerException if sanitization fails.
use libredte\lib\Core\Service\ServiceFactory;
$factory = new ServiceFactory();
$documentComponent = $factory->make('billing.document');
$sanitizer = $documentComponent->getSanitizerWorker();
// Sanitize the document data ($bag is an existing DocumentBagInterface instance)
$sanitizedData = $sanitizer->sanitize($bag);
Accessing the Sanitizer Worker
Access the Sanitizer Worker through the Document Component:
use libredte\lib\Core\Service\ServiceFactory;
$factory = new ServiceFactory();
$documentComponent = $factory->make('billing.document');
$sanitizer = $documentComponent->getSanitizerWorker();
Usage Example
Manual sanitization workflow:
use libredte\lib\Core\Service\ServiceFactory;
use libredte\lib\Core\Package\Billing\Component\Document\Support\DocumentBag;
use libredte\lib\Core\Package\Billing\Component\Document\Exception\SanitizerException;
$factory = new ServiceFactory();
$documentComponent = $factory->make('billing.document');
// Create a bag with potentially unsafe data
$bag = new DocumentBag(
    inputData: [
        'Encabezado' => [
            'IdDoc' => ['TipoDTE' => 33],
            'Emisor' => [
                'RUTEmisor' => '12345678-9',
                'RznSoc' => ' Company Name <script>alert(1)</script> '
            ]
        ],
        'Detalle' => [
            [
                'NmbItem' => 'Product <b>1</b>',
                'PrcItem' => 1000
            ]
        ]
    ],
    options: []
);
// Sanitize the data
try {
    $sanitizer = $documentComponent->getSanitizerWorker();
    $sanitizedData = $sanitizer->sanitize($bag);
    // $sanitizedData contains cleaned, safe data
    print_r($sanitizedData);
} catch (SanitizerException $e) {
    echo "Sanitization failed: " . $e->getMessage();
}
Sanitization Operations
The Sanitizer's cleaning operations fall into three groups:
String Cleaning
- Trim leading/trailing whitespace
- Remove excessive whitespace
- Strip HTML/XML tags from text fields
- Remove control characters
- Clean special characters
Security
- Prevent XSS attacks
- Remove script tags
- Escape dangerous characters
- Validate character encodings
- Remove null bytes
Data Quality
- Normalize line endings
- Remove invisible characters
- Fix encoding issues
- Clean corrupted data
Before sanitization:
[
    'Encabezado' => [
        'Emisor' => [
            'RznSoc' => ' Company <script>alert(1)</script> Name ',
            'GiroEmis' => "Retail\x00Store"
        ]
    ],
    'Detalle' => [
        [
            'NmbItem' => 'Product   with   spaces',
            'DscItem' => '<b>Bold</b> description'
        ]
    ]
]
After sanitization:
[
    'Encabezado' => [
        'Emisor' => [
            'RznSoc' => 'Company Name',
            'GiroEmis' => 'Retail Store'
        ]
    ],
    'Detalle' => [
        [
            'NmbItem' => 'Product with spaces',
            'DscItem' => 'Bold description'
        ]
    ]
]
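To make the operations listed above more concrete, here is a minimal sketch of how similar cleaning could be done with plain PHP functions. It is not LibreDTE's implementation: cleanString() and cleanArray() are hypothetical helpers, and the exact rules (for example, replacing control characters with a space) are assumptions chosen to mirror the before/after example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of LibreDTE): cleans a single string value.
 */
function cleanString(string $value): string
{
    // Replace control characters, including null bytes, with a space.
    $value = preg_replace('/[\x00-\x1F\x7F]/', ' ', $value) ?? $value;
    // Drop <script> elements together with their contents.
    $value = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $value) ?? $value;
    // Strip any remaining HTML/XML tags, keeping their inner text.
    $value = strip_tags($value);
    // Collapse runs of whitespace into one space, then trim both ends.
    $value = preg_replace('/\s+/', ' ', $value) ?? $value;

    return trim($value);
}

/**
 * Hypothetical helper: applies cleanString() to every string in a nested array.
 * Non-string scalars (integers, floats, booleans) are left untouched.
 */
function cleanArray(array $data): array
{
    foreach ($data as $key => $value) {
        if (is_array($value)) {
            $data[$key] = cleanArray($value);
        } elseif (is_string($value)) {
            $data[$key] = cleanString($value);
        }
    }

    return $data;
}
```

Control characters are replaced before tags are stripped because strip_tags() also removes null bytes, which would otherwise join the neighbouring words together.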
Integration with Document Pipeline
Sanitization typically occurs early in the document processing pipeline, often after parsing and before normalization:
// Standard processing order:
// 1. Parse input data
// 2. Sanitize data (remove unsafe content)
// 3. Normalize data (apply business rules)
// 4. Build document
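The ordering above matters because later stages usually assume clean input. The sketch below illustrates this with three stand-in closures. $parse, $sanitize and $normalize are hypothetical and do not call LibreDTE's workers, and the "normalization" shown (casting digit-only strings to integers) is only an assumed example of a business rule.

```php
<?php

declare(strict_types=1);

// 1. Parse: decode the raw input (a JSON string in this sketch) into an array.
$parse = fn (string $json): array => json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// 2. Sanitize: strip tags and surrounding whitespace from string values.
$sanitize = fn (array $data): array => array_map(
    fn ($value) => is_string($value) ? trim(strip_tags($value)) : $value,
    $data
);

// 3. Normalize: cast digit-only strings to integers (stand-in business rule).
$normalize = fn (array $data): array => array_map(
    fn ($value) => is_string($value) && ctype_digit($value) ? (int) $value : $value,
    $data
);

$input = '{"NmbItem": " Product <b>1</b> ", "PrcItem": " 1000 "}';

// Stages run in the documented order: parse, sanitize, normalize.
$result = $normalize($sanitize($parse($input)));
// $result is ['NmbItem' => 'Product 1', 'PrcItem' => 1000]
```

If normalization ran before sanitization here, ' 1000 ' would still contain whitespace, ctype_digit() would reject it, and the price would stay a string.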
Strategy Pattern
The Sanitizer Worker implements StrategiesAwareInterface, allowing different sanitization strategies for:
- Various field types (text, numbers, dates)
- Different document types
- Security levels
- Industry-specific requirements
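As an illustration of the strategy idea, the following sketch selects a cleaning rule per field and falls back to a default. All names in it (FieldSanitizerStrategy, TextStrategy, IntegerStrategy, StrategySanitizer) are hypothetical. It does not reproduce StrategiesAwareInterface or any other LibreDTE contract, whose actual methods may differ.

```php
<?php

declare(strict_types=1);

// Hypothetical strategy contract; LibreDTE's real interfaces may differ.
interface FieldSanitizerStrategy
{
    public function sanitize(mixed $value): mixed;
}

// Text fields: strip tags and surrounding whitespace.
final class TextStrategy implements FieldSanitizerStrategy
{
    public function sanitize(mixed $value): mixed
    {
        return trim(strip_tags((string) $value));
    }
}

// Integer fields: drop every non-digit character, then cast to int.
final class IntegerStrategy implements FieldSanitizerStrategy
{
    public function sanitize(mixed $value): mixed
    {
        return (int) preg_replace('/\D/', '', (string) $value);
    }
}

// Picks a strategy by field name and falls back to a default strategy.
final class StrategySanitizer
{
    /** @param array<string, FieldSanitizerStrategy> $strategies keyed by field name */
    public function __construct(
        private array $strategies,
        private FieldSanitizerStrategy $default
    ) {
    }

    public function sanitize(array $data): array
    {
        foreach ($data as $field => $value) {
            $strategy = $this->strategies[$field] ?? $this->default;
            $data[$field] = $strategy->sanitize($value);
        }

        return $data;
    }
}

$sanitizer = new StrategySanitizer(
    ['PrcItem' => new IntegerStrategy()],
    new TextStrategy()
);
$clean = $sanitizer->sanitize(['NmbItem' => ' <b>Item</b> ', 'PrcItem' => ' 1.000 ']);
// $clean is ['NmbItem' => 'Item', 'PrcItem' => 1000]
```

Keying strategies by field name is only one option. The same structure works when the key is a document type or a security level.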