--dedup). Both levels operate before any data leaves your machine.
File-level deduplication (default)
Every binary file is hashed with a salted SHA-256 digest before upload. The hash becomes the blob name in Azure. Before performing an upload, Arius checks whether a blob with that hash already exists in the container.

- If it does exist, Arius skips the upload entirely and only writes or updates the local pointer file.
- If it does not exist, Arius compresses, encrypts, and uploads the file, then records its `BinaryProperties` so future runs can skip it.
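The check-before-upload logic can be sketched as follows. This is a minimal Python illustration, not the actual .NET implementation; the function names and the `existing_blobs` set (standing in for a blob-existence query against the Azure container) are assumptions for the example.

```python
import hashlib

def salted_sha256(path: str, salt: bytes) -> str:
    """Hash a file with a salted SHA-256 digest, reading in 1 MiB blocks."""
    h = hashlib.sha256(salt)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def archive_file(path: str, salt: bytes, existing_blobs: set[str]) -> str:
    """Return 'skipped' if a blob with this hash already exists, else 'uploaded'."""
    digest = salted_sha256(path, salt)
    if digest in existing_blobs:
        return "skipped"        # only the local pointer file is (re)written
    existing_blobs.add(digest)  # compress + encrypt + upload would happen here
    return "uploaded"
```

The salt ensures that an attacker who knows a file's plain SHA-256 cannot confirm its presence in your container by hash alone.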
Block-level deduplication (architecture)
For large files that change partially between archive runs — virtual machine images, database dumps, video projects — file-level deduplication is insufficient: a single changed byte produces a new hash and forces a full re-upload. The Arius architecture supports variable block-size content-defined chunking. With this approach, large files are split into variable-size chunks. Each chunk is hashed independently. On subsequent runs, only chunks whose content has changed need to be re-uploaded. A chunklist blob (stored in the `chunklist/` folder of the container) maps the file's hash to its ordered list of chunk hashes, enabling full reconstruction at restore time.
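Content-defined chunking can be illustrated with a toy sketch like the one below. This is not the algorithm or parameters Arius uses — the rolling hash, mask, and size bounds here are invented for demonstration — but it shows the key property: chunk boundaries depend on content, so an edit only invalidates the chunks it touches, and hashing resynchronizes afterward.

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0x3FF,
               min_size: int = 256, max_size: int = 8192) -> list[bytes]:
    """Split data at content-defined boundaries: declare a boundary when a
    toy rolling hash over roughly the last 32 bytes matches the mask."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF      # toy rolling hash
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks

def chunklist(data: bytes) -> list[str]:
    """Ordered list of chunk hashes; unchanged chunks hash identically
    across runs, so only new hashes would need uploading."""
    return [hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)]
```

Because boundaries are chosen by content rather than fixed offsets, inserting or changing bytes in the middle of a file shifts only the nearby chunks; everything before the edit, and most chunks after it, keep their hashes.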
Block-level deduplication is part of the Arius architecture and is used internally for TAR archive handling. The current CLI exposes file-level deduplication by default. Refer to the Architecture: Archive Pipeline page for technical details.
When block-level deduplication helps
| Scenario | Notes |
|---|---|
| Files that rarely change (documents, photos, archives) | File-level deduplication is sufficient |
| Large files that partially change (VM images, databases) | Block-level chunking reduces re-upload size significantly |
| Many small files | TAR batching handles these efficiently at the blob level |
TAR batching for small files
Small files incur disproportionate blob transaction costs if each is stored as its own blob. Arius addresses this by batching small files into in-memory TAR archives before upload.

- Files are collected until the TAR reaches its size threshold or the input channel drains.
- The TAR is then compressed, encrypted, and uploaded as a single blob.
- Each individual file inside the TAR still gets its own pointer file and its own `BinaryProperties` entry. This means individual files can be referenced and restored independently.
- Duplicate small files (same hash) are deferred: only one is added to the TAR; the others receive pointer files after the TAR completes.
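The batching behavior described above can be sketched in a few lines. This is an illustrative Python model, not the Arius implementation: the function name, the 1 MiB threshold default, and naming TAR entries by content hash are assumptions for the example.

```python
import hashlib
import io
import tarfile

def batch_small_files(files: dict[str, bytes], threshold: int = 1 << 20):
    """Batch small files into in-memory TAR archives. Duplicate content
    (same hash) enters the TAR once; every file still gets a pointer."""
    tars, pointers, seen = [], {}, set()
    buf = io.BytesIO()
    tar = tarfile.open(fileobj=buf, mode="w")
    entries = 0
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        pointers[name] = digest                 # each file's pointer references its hash
        if digest in seen:
            continue                            # duplicate deferred: pointer only
        seen.add(digest)
        info = tarfile.TarInfo(name=digest)     # TAR entry named by content hash
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        entries += 1
        if buf.tell() >= threshold:             # threshold reached: seal this TAR
            tar.close()
            tars.append(buf.getvalue())         # compress + encrypt + upload here
            buf, entries = io.BytesIO(), 0
            tar = tarfile.open(fileobj=buf, mode="w")
    tar.close()
    if entries:
        tars.append(buf.getvalue())
    return tars, pointers
```

Note that two files with identical content produce one TAR entry but two pointer entries, matching the deferred-duplicate behavior above.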
InFlightGate: preventing duplicate concurrent uploads
When Arius processes files in parallel, two workers may hash the same content at nearly the same time. Without coordination, both would attempt to upload the same blob. Arius uses an InFlightGate keyed on the content hash:

- The first worker to enter the gate for a given hash becomes the owner and performs the upload.
- All subsequent workers with the same hash wait for the owner to complete, then proceed directly to pointer creation — without uploading.
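A gate with these semantics can be sketched with standard synchronization primitives. This Python version is a model of the behavior described above, not Arius's actual .NET code; the method names `enter`/`leave` are invented for the example.

```python
import threading

class InFlightGate:
    """First caller for a given hash becomes the owner and uploads; later
    callers with the same hash wait, then skip straight to pointer creation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}

    def enter(self, digest: str) -> bool:
        with self._lock:
            ev = self._inflight.get(digest)
            if ev is None:
                self._inflight[digest] = threading.Event()
                return True       # owner: perform the upload, then call leave()
        ev.wait()                 # waiter: block until the owner finishes
        return False              # proceed directly to pointer creation

    def leave(self, digest: str) -> None:
        with self._lock:
            self._inflight.pop(digest).set()
```

Once the owner calls `leave`, the hash is removed from the in-flight map; the existence check against the container then prevents any later worker from re-uploading the now-present blob.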
Full archive flow
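At a high level, each file moves through the pipeline as: salted hash, existence check, routing (TAR batch for small files, direct upload for large), pointer creation. The sketch below models that per-file flow in Python; it is illustrative only — `remote` stands in for the container's blob listing, the size threshold is arbitrary, and the pointer filename convention is an assumption.

```python
import hashlib

def archive_one(path: str, salt: bytes, remote: set[str],
                small_threshold: int = 1 << 20):
    """Per-file flow: salted hash -> existence check -> route -> pointer."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(salt + data).hexdigest()
    if digest in remote:
        action = "skip"            # blob exists: pointer only
    elif len(data) < small_threshold:
        action = "tar-batch"       # queued into an in-memory TAR
        remote.add(digest)
    else:
        action = "upload"          # compressed, encrypted, uploaded on its own
        remote.add(digest)
    pointer = {"path": path + ".pointer.arius", "hash": digest}
    return action, pointer
```

Whatever route a file takes, the pointer is always written, so the local tree remains a complete, restorable index of the archive.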
Storage savings example
Consider a project directory with the following files archived weekly:

| Week | Files | Unique content | Blobs uploaded | Blobs skipped |
|---|---|---|---|---|
| 1 | 500 | 500 | 500 | 0 |
| 2 | 520 | 30 changed | 30 | 490 |
| 3 | 520 | 5 changed | 5 | 515 |
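The savings follow directly from the ratio of skipped blobs to total blobs, which can be checked against the table's numbers:

```python
def skip_ratio(uploaded: int, skipped: int) -> float:
    """Fraction of blobs that deduplication kept from being re-uploaded."""
    return skipped / (uploaded + skipped)
```

For week 2 this gives 490/520 ≈ 94% of blobs skipped, and for week 3, 515/520 ≈ 99% — after the initial upload, steady-state transfer cost tracks the volume of changed content, not the size of the directory.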