The archive command archives files into Azure Blob Storage with client-side compression, encryption, and deduplication. It implements a multi-stage pipeline with parallel processing, TAR batching for small files, and intelligent storage tier management. Identical file content is uploaded only once, blob transactions are minimised through TAR batching, and all data is secured on the client before it leaves the local system.

Key features

  • Deduplication — SHA256 hashes ensure identical files are stored only once.
  • Optimised storage — TAR batching reduces blob transactions for small files.
  • Tiering — Storage tier policies are applied automatically after upload.
  • Client-side security — Compression occurs before AES256 encryption, ensuring both efficiency and privacy.
  • Resilient orchestration — Linked cancellation, error handling, and per-hash gates prevent deadlocks and duplicate uploads.

Orchestration

The handler creates four concurrent tasks and waits for all of them to finish:
  1. Index — enumerate files from the file system.
  2. Hash — compute SHA256 hashes and route files by size.
  3. Upload large files — upload each large file individually.
  4. Batch small files — aggregate small files into TAR archives and upload each batch.
After all tasks complete, stale pointer entries are removed. If the state repository was modified, it is vacuumed and re-uploaded to blob storage. If nothing changed, the local state file is discarded.
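The concurrent-task structure can be sketched in Python's asyncio (standing in for the C#/.NET implementation; two stages instead of four, and all names here are hypothetical):

```python
import asyncio

async def index_task(out_q: asyncio.Queue) -> None:
    # Stand-in for the Index task: enumerate items and signal completion.
    for name in ["a.txt", "b.bin"]:
        await out_q.put(name)
    await out_q.put(None)  # completion marker, akin to completing the channel

async def hash_task(in_q: asyncio.Queue) -> list[str]:
    # Stand-in for a downstream stage draining the channel until completion.
    seen = []
    while (item := await in_q.get()) is not None:
        seen.append(item)
    return seen

async def run_pipeline() -> list[str]:
    q = asyncio.Queue(maxsize=100)  # bounded, like the .NET channels
    # The handler awaits all stage tasks together; gather propagates the
    # first failure, which the real code couples with linked cancellation.
    _, hashed = await asyncio.gather(index_task(q), hash_task(q))
    return hashed

print(asyncio.run(run_pipeline()))  # ['a.txt', 'b.bin']
```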

Channel-based pipeline design

The four tasks communicate through typed .NET channels. Each channel is a bounded queue that applies back-pressure when consumers cannot keep up.
  Channel                    Producer      Consumer
  indexedFilesChannel        Index task    Hash task
  hashedLargeFilesChannel    Hash task     Large file upload task
  hashedSmallFilesChannel    Hash task     Small file TAR task
This design allows high throughput with controlled parallelism — hashing and uploading run concurrently without blocking the indexer.
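The back-pressure behaviour of a bounded channel can be observed with a small asyncio sketch (an analogy for the .NET channels, not the actual implementation):

```python
import asyncio

async def demo() -> int:
    q = asyncio.Queue(maxsize=2)  # bounded queue: producers block when full
    await q.put(1)
    await q.put(2)
    assert q.full()
    # A third put suspends until a consumer frees a slot: this is the
    # back-pressure that keeps a fast indexer from racing ahead of hashing.
    producer = asyncio.create_task(q.put(3))
    await asyncio.sleep(0)        # let the producer run: it stays blocked
    assert not producer.done()
    await q.get()                 # the consumer drains one item...
    await producer                # ...and the blocked put now completes
    return q.qsize()

print(asyncio.run(demo()))  # 2
```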

Stage 1: Index task

The indexer enumerates FilePair objects from the local file system and writes them into indexedFilesChannel. When enumeration is complete, the channel is marked complete so downstream stages can drain and exit cleanly.
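A rough sketch of the pairing logic, in Python. The pointer-file suffix used here is an illustrative assumption; the real naming scheme may differ:

```python
import os
import tempfile

POINTER_SUFFIX = ".pointer.arius"  # hypothetical suffix, for illustration only

def enumerate_file_pairs(root: str):
    """Yield (pointer_path, binary_path) pairs the way the indexer pairs files.

    Either element may be None: pointer-only files have no local binary, and
    not-yet-archived binaries have no pointer.
    """
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            full = os.path.join(dirpath, name)
            if name.endswith(POINTER_SUFFIX):
                binary = full[: -len(POINTER_SUFFIX)]
                if not os.path.exists(binary):
                    yield (full, None)  # pointer-only: already archived
            else:
                pointer = full + POINTER_SUFFIX
                yield (pointer if os.path.exists(pointer) else None, full)

with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "new.txt"), "w").close()                    # binary only
    open(os.path.join(root, "old.txt" + POINTER_SUFFIX), "w").close()   # pointer only
    pairs = list(enumerate_file_pairs(root))

print([(p is not None, b is not None) for p, b in pairs])  # [(False, True), (True, False)]
```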

Stage 2: Hash task

Hashers read FilePairs from indexedFilesChannel in parallel. For each file:
  • Pointer-only files (already archived, no binary present) are skipped.
  • Binary files are hashed with SHA256.
  • Files below the small-file size boundary are routed to hashedSmallFilesChannel.
  • Files at or above the boundary are routed to hashedLargeFilesChannel.
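The hash-and-route step amounts to a size comparison against a configured boundary. A minimal sketch (the 1 MiB boundary is an illustrative value, not the documented default):

```python
import hashlib

SMALL_FILE_BOUNDARY = 1024 * 1024  # illustrative 1 MiB; the real value is configurable

def hash_and_route(content: bytes) -> tuple[str, str]:
    """Hash a file's content and pick its downstream channel, as the hash task does."""
    digest = hashlib.sha256(content).hexdigest()
    if len(content) < SMALL_FILE_BOUNDARY:
        return digest, "hashedSmallFilesChannel"
    return digest, "hashedLargeFilesChannel"

print(hash_and_route(b"tiny")[1])                    # hashedSmallFilesChannel
print(hash_and_route(b"\0" * (2 * 1024 * 1024))[1])  # hashedLargeFilesChannel
```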

Stage 3: Large file uploads

Large files are uploaded individually. Multiple parallel readers consume hashedLargeFilesChannel.

InFlightGate: preventing duplicate uploads

When multiple files share the same hash (exact duplicates), only one upload should happen. The InFlightGate is a per-hash concurrency primitive that enforces this:
  • The first task to arrive for a given hash becomes the owner and performs the upload.
  • All subsequent tasks for the same hash become non-owners and await the owner’s completion.
  • Once the owner finishes, all waiters are released and proceed to write their pointer files.
This guarantees exactly-once upload without locks or pessimistic synchronisation.
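The owner/non-owner protocol can be sketched with a per-hash future (a Python analogy for the .NET primitive; class and method names are hypothetical):

```python
import asyncio

class InFlightGate:
    """Per-hash gate sketch: the first caller for a hash becomes the owner;
    later callers for the same hash await the owner instead of re-uploading."""

    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Future] = {}

    def acquire(self, file_hash: str) -> tuple[bool, asyncio.Future]:
        fut = self._in_flight.get(file_hash)
        if fut is None:
            fut = asyncio.get_running_loop().create_future()
            self._in_flight[file_hash] = fut
            return True, fut   # owner: must resolve fut once the upload is done
        return False, fut      # non-owner: await fut before writing pointers

async def archive_one(gate: InFlightGate, uploads: list, file_hash: str) -> str:
    owner, done = gate.acquire(file_hash)
    if owner:
        uploads.append(file_hash)   # only the owner performs the blob upload
        await asyncio.sleep(0)      # simulated upload
        done.set_result(None)       # release every waiter for this hash
    else:
        await done                  # wait for the owner's upload to finish
    return f"pointer:{file_hash}"   # every caller then writes its own pointer

async def main() -> tuple[list, list]:
    gate = InFlightGate()
    uploads: list = []
    results = await asyncio.gather(*[archive_one(gate, uploads, "abc123") for _ in range(3)])
    return results, uploads

results, uploads = asyncio.run(main())
print(results, uploads)  # three pointers, exactly one upload
```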

Transformation pipeline

Every large file is transformed before reaching blob storage:
Original File → GZip → AES256 → Blob Storage
Compression runs before encryption so that encrypted ciphertext (which is not compressible) does not inflate the stored size.
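This ordering is easy to verify empirically. The snippet below uses random bytes as a stand-in for AES256 output, since well-encrypted data is statistically indistinguishable from random:

```python
import gzip
import os

# Highly compressible plaintext, and random bytes standing in for ciphertext.
plaintext = b"hello archive " * 1000
ciphertext_like = os.urandom(len(plaintext))

gzip_then_encrypt = len(gzip.compress(plaintext))        # compress first: shrinks a lot
encrypt_then_gzip = len(gzip.compress(ciphertext_like))  # compress ciphertext: no gain

print(gzip_then_encrypt, len(plaintext), encrypt_then_gzip)
assert gzip_then_encrypt < len(plaintext)
assert encrypt_then_gzip >= len(plaintext)   # gzip only adds header overhead here
```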

Stage 4: Small file TAR batching

Small files are aggregated into in-memory TAR archives to reduce the number of blob write operations. Unlike the large file stage, the small file stage uses a single reader to maintain TAR ordering.
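The single-reader accumulator can be sketched with Python's tarfile module. The 4 KiB threshold is an illustrative value; the real threshold and flush conditions are configuration-driven:

```python
import io
import tarfile

def batch_into_tars(files, threshold: int = 4096) -> list[bytes]:
    """Append small files to an in-memory TAR; flush when the accumulated
    payload crosses the threshold, and once more when the input is drained."""
    batches: list[bytes] = []
    buf = io.BytesIO()
    tar = tarfile.open(fileobj=buf, mode="w")
    size = 0

    def flush() -> None:
        nonlocal buf, tar, size
        tar.close()
        batches.append(buf.getvalue())
        buf = io.BytesIO()
        tar = tarfile.open(fileobj=buf, mode="w")
        size = 0

    for name, data in files:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        size += len(data)
        if size >= threshold:
            flush()          # size threshold reached
    if size > 0:
        flush()              # input drained: flush the partial batch
    else:
        tar.close()
    return batches

files = [(f"f{i}.txt", b"x" * 1500) for i in range(5)]
batches = batch_into_tars(files)
print(len(batches))  # 2 batches: three files, then the remaining two
```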

TAR processing details

When the accumulator decides to flush (either because it reached the size threshold or because the input channel is drained):
  1. Hash the TAR stream: the complete in-memory TAR stream is hashed to produce a canonical identifier for the batch.
  2. Upload the TAR blob: the TAR is compressed with GZip, encrypted with AES256, and written to Azure Blob Storage as a single blob.
  3. Set storage tier: the blob's access tier is set according to the configured tier policy.
  4. Record BinaryProperties: BinaryProperties are recorded for each child file and for the parent TAR blob.
  5. Write pointers: pointer files and PointerFileEntry records are written for every file in the batch, including deferred duplicates.
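The flush sequence above can be sketched end to end. A dict stands in for Blob Storage and for the BinaryProperties/pointer repositories, AES256 encryption is omitted for brevity, and the "Cool" tier is an illustrative choice:

```python
import gzip
import hashlib

def flush_tar(tar_bytes: bytes, child_hashes: list[str], store: dict) -> str:
    """Hypothetical sketch of the five flush steps against an in-memory store."""
    tar_hash = hashlib.sha256(tar_bytes).hexdigest()       # 1. hash the TAR stream
    store["blobs"][tar_hash] = gzip.compress(tar_bytes)    # 2. GZip and upload as one blob
    store["tiers"][tar_hash] = "Cool"                      # 3. apply the tier policy
    store["binary_properties"][tar_hash] = child_hashes    # 4. parent and child properties
    for child in child_hashes:                             # 5. pointer records per child
        store["pointers"][child] = tar_hash
    return tar_hash

store = {"blobs": {}, "tiers": {}, "binary_properties": {}, "pointers": {}}
tar_hash = flush_tar(b"tar-bytes", ["hash-a", "hash-b"], store)
print(sorted(store["pointers"]))  # ['hash-a', 'hash-b']
```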

Transformation pipeline for small files

Multiple Files → TAR → GZip → AES256 → Blob Storage
Duplicate small files (same hash as an already-processed file) are handled by the same InFlightGate mechanism used for large files. Non-owners register a continuation on the owner’s task so their pointer files are written as soon as the TAR flush completes.
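The continuation pattern for deferred duplicates can be sketched with a future callback (a Python analogy; the real code attaches continuations to the owner's .NET task):

```python
import asyncio

async def deferred_duplicates() -> list[str]:
    # The owner's pending TAR flush, modeled as a future.
    flush = asyncio.get_running_loop().create_future()
    written: list[str] = []
    # Non-owners register a continuation instead of blocking a reader slot.
    for name in ["dup1.txt", "dup2.txt"]:
        flush.add_done_callback(lambda _f, n=name: written.append(f"pointer:{n}"))
    assert written == []        # nothing is written before the flush completes
    flush.set_result("tar-hash")
    await asyncio.sleep(0)      # let the scheduled continuations run
    return written

print(asyncio.run(deferred_duplicates()))  # ['pointer:dup1.txt', 'pointer:dup2.txt']
```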
