--dedup). Both levels operate before any data leaves your machine.
File-level deduplication (default)
Every binary file is hashed with a salted SHA-256 digest before upload. The hash becomes the blob name in Azure. Before performing an upload, Arius checks whether a blob with that hash already exists in the container.

- If it does exist, Arius skips the upload entirely and only writes or updates the local pointer file.
- If it does not exist, Arius compresses, encrypts, and uploads the file, then records its `BinaryProperties` so future runs can skip it.
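The check-before-upload logic can be sketched as follows. This is a minimal Python illustration, not the actual .NET implementation; the function names and the `existing_blobs` set (standing in for a blob-existence query against the Azure container) are assumptions for the example.

```python
import hashlib

def salted_sha256(path: str, salt: bytes) -> str:
    """Hash a file with a salted SHA-256 digest, reading in 1 MiB blocks."""
    h = hashlib.sha256(salt)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def archive_file(path: str, salt: bytes, existing_blobs: set[str]) -> str:
    """Return 'skipped' if a blob with this hash already exists, else 'uploaded'."""
    digest = salted_sha256(path, salt)
    if digest in existing_blobs:
        return "skipped"        # only the local pointer file is (re)written
    existing_blobs.add(digest)  # compress + encrypt + upload would happen here
    return "uploaded"
```

The salt ensures that an attacker who knows a file's plain SHA-256 cannot confirm its presence in your container by hash alone.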
Block-level deduplication (architecture)
For large files that change partially between archive runs — virtual machine images, database dumps, video projects — file-level deduplication is insufficient: a single changed byte produces a new hash and forces a full re-upload. The Arius architecture supports variable block-size content-defined chunking. With this approach, large files are split into variable-size chunks. Each chunk is hashed independently. On subsequent runs, only chunks whose content has changed need to be re-uploaded. A chunklist blob (stored in the `chunklist/` folder of the container) maps the file's hash to its ordered list of chunk hashes, enabling full reconstruction at restore time.
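Content-defined chunking can be illustrated with a toy sketch like the one below. This is not the algorithm or parameters Arius uses — the rolling hash, mask, and size bounds here are invented for demonstration — but it shows the key property: chunk boundaries depend on content, so an edit only invalidates the chunks it touches, and hashing resynchronizes afterward.

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0x3FF,
               min_size: int = 256, max_size: int = 8192) -> list[bytes]:
    """Split data at content-defined boundaries: declare a boundary when a
    toy rolling hash over roughly the last 32 bytes matches the mask."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF      # toy rolling hash
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks

def chunklist(data: bytes) -> list[str]:
    """Ordered list of chunk hashes; unchanged chunks hash identically
    across runs, so only new hashes would need uploading."""
    return [hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)]
```

Because boundaries are chosen by content rather than fixed offsets, inserting or changing bytes in the middle of a file shifts only the nearby chunks; everything before the edit, and most chunks after it, keep their hashes.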
Block-level deduplication is part of the Arius architecture and is used internally for TAR archive handling. The current CLI exposes file-level deduplication by default. Refer to the Architecture: Archive Pipeline page for technical details.
When block-level deduplication helps
| Scenario | Notes |
|---|---|
| Files that rarely change (documents, photos, archives) | File-level deduplication is sufficient |
| Large files that partially change (VM images, databases) | Block-level chunking reduces re-upload size significantly |
| Many small files | TAR batching handles these efficiently at the blob level |
TAR batching for small files
Small files incur disproportionate blob transaction costs if each is stored as its own blob. Arius addresses this by batching small files into in-memory TAR archives before upload.

- Files are collected until the TAR reaches its size threshold or the input channel drains.
- The TAR is then compressed, encrypted, and uploaded as a single blob.
- Each individual file inside the TAR still gets its own pointer file and its own `BinaryProperties` entry. This means individual files can be referenced and restored independently.
- Duplicate small files (same hash) are deferred: only one is added to the TAR; the others receive pointer files after the TAR completes.
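The batching behavior described above can be sketched in a few lines. This is an illustrative Python model, not the Arius implementation: the function name, the 1 MiB threshold default, and naming TAR entries by content hash are assumptions for the example.

```python
import hashlib
import io
import tarfile

def batch_small_files(files: dict[str, bytes], threshold: int = 1 << 20):
    """Batch small files into in-memory TAR archives. Duplicate content
    (same hash) enters the TAR once; every file still gets a pointer."""
    tars, pointers, seen = [], {}, set()
    buf = io.BytesIO()
    tar = tarfile.open(fileobj=buf, mode="w")
    entries = 0
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        pointers[name] = digest                 # each file's pointer references its hash
        if digest in seen:
            continue                            # duplicate deferred: pointer only
        seen.add(digest)
        info = tarfile.TarInfo(name=digest)     # TAR entry named by content hash
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        entries += 1
        if buf.tell() >= threshold:             # threshold reached: seal this TAR
            tar.close()
            tars.append(buf.getvalue())         # compress + encrypt + upload here
            buf, entries = io.BytesIO(), 0
            tar = tarfile.open(fileobj=buf, mode="w")
    tar.close()
    if entries:
        tars.append(buf.getvalue())
    return tars, pointers
```

Note that two files with identical content produce one TAR entry but two pointer entries, matching the deferred-duplicate behavior above.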
InFlightGate: preventing duplicate concurrent uploads
When Arius processes files in parallel, two workers may hash the same content at nearly the same time. Without coordination, both would attempt to upload the same blob. Arius uses an InFlightGate keyed on the content hash:

- The first worker to enter the gate for a given hash becomes the owner and performs the upload.
- All subsequent workers with the same hash wait for the owner to complete, then proceed directly to pointer creation — without uploading.
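A gate with these semantics can be sketched with standard synchronization primitives. This Python version is a model of the behavior described above, not Arius's actual .NET code; the method names `enter`/`leave` are invented for the example.

```python
import threading

class InFlightGate:
    """First caller for a given hash becomes the owner and uploads; later
    callers with the same hash wait, then skip straight to pointer creation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}

    def enter(self, digest: str) -> bool:
        with self._lock:
            ev = self._inflight.get(digest)
            if ev is None:
                self._inflight[digest] = threading.Event()
                return True       # owner: perform the upload, then call leave()
        ev.wait()                 # waiter: block until the owner finishes
        return False              # proceed directly to pointer creation

    def leave(self, digest: str) -> None:
        with self._lock:
            self._inflight.pop(digest).set()
```

Once the owner calls `leave`, the hash is removed from the in-flight map; the existence check against the container then prevents any later worker from re-uploading the now-present blob.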
Full archive flow
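At a high level, each file moves through the pipeline as: salted hash, existence check, routing (TAR batch for small files, direct upload for large), pointer creation. The sketch below models that per-file flow in Python; it is illustrative only — `remote` stands in for the container's blob listing, the size threshold is arbitrary, and the pointer filename convention is an assumption.

```python
import hashlib

def archive_one(path: str, salt: bytes, remote: set[str],
                small_threshold: int = 1 << 20):
    """Per-file flow: salted hash -> existence check -> route -> pointer."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(salt + data).hexdigest()
    if digest in remote:
        action = "skip"            # blob exists: pointer only
    elif len(data) < small_threshold:
        action = "tar-batch"       # queued into an in-memory TAR
        remote.add(digest)
    else:
        action = "upload"          # compressed, encrypted, uploaded on its own
        remote.add(digest)
    pointer = {"path": path + ".pointer.arius", "hash": digest}
    return action, pointer
```

Whatever route a file takes, the pointer is always written, so the local tree remains a complete, restorable index of the archive.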
Storage savings example
Consider a project directory with the following files archived weekly:

| Week | Files | Unique content | Blobs uploaded | Blobs skipped |
|---|---|---|---|---|
| 1 | 500 | 500 | 500 | 0 |
| 2 | 520 | 30 changed | 30 | 490 |
| 3 | 520 | 5 changed | 5 | 515 |
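The savings follow directly from the ratio of skipped blobs to total blobs, which can be checked against the table's numbers:

```python
def skip_ratio(uploaded: int, skipped: int) -> float:
    """Fraction of blobs that deduplication kept from being re-uploaded."""
    return skipped / (uploaded + skipped)
```

For week 2 this gives 490/520 ≈ 94% of blobs skipped, and for week 3, 515/520 ≈ 99% — after the initial upload, steady-state transfer cost tracks the volume of changed content, not the size of the directory.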