Key features
- Deduplication — SHA256 hashes ensure identical files are stored only once.
- Optimised storage — TAR batching reduces blob transactions for small files.
- Tiering — Storage tier policies are applied automatically after upload.
- Client-side security — Compression occurs before AES256 encryption, ensuring both efficiency and privacy.
- Resilient orchestration — Linked cancellation, error handling, and per-hash gates prevent deadlocks and duplicate uploads.
Orchestration
The handler creates four concurrent tasks and waits for all of them to finish:
- Index — enumerate files from the file system.
- Hash — compute SHA256 hashes and route files by size.
- Upload large files — upload each large file individually.
- Batch small files — aggregate small files into TAR archives and upload each batch.
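The start-all, wait-for-all pattern above can be sketched as follows. This is a minimal illustration in Python with asyncio, used here as a stand-in for .NET tasks; `run_pipeline` and the four stage bodies are hypothetical placeholders, not the handler's actual code.

```python
import asyncio


async def run_pipeline(stages):
    """Start every stage concurrently and wait for all of them.

    If any stage raises, the remaining stages are cancelled so that no
    task stays blocked on a queue that will never be drained.
    """
    tasks = [asyncio.create_task(stage(), name=stage.__name__) for stage in stages]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        for task in tasks:
            task.cancel()
        # Let the cancellations settle before re-raising the original error.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise


# Placeholder stages; the real ones would read and write channels.
async def index():
    return "index done"


async def hash_files():
    return "hash done"


async def upload_large():
    return "upload done"


async def batch_small():
    return "batch done"


results = asyncio.run(run_pipeline([index, hash_files, upload_large, batch_small]))
```

Cancelling the sibling tasks on the first failure is one way to approximate linked cancellation: a failed stage cannot leave the others waiting forever.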
Channel-based pipeline design
The four tasks communicate through typed .NET channels. Each channel is a bounded queue that applies back-pressure when consumers cannot keep up.

| Channel | Producer | Consumer |
|---|---|---|
| indexedFilesChannel | Index task | Hash task |
| hashedLargeFilesChannel | Hash task | Large file upload task |
| hashedSmallFilesChannel | Hash task | Small file TAR task |
Stage 1: Index task
The indexer enumerates FilePair objects from the local file system and writes them into indexedFilesChannel. When enumeration is complete, the channel is marked complete so downstream stages can drain and exit cleanly.
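The sketch below shows a bounded queue with a completion signal, again in Python with asyncio rather than .NET channels. The capacity of 2, the `DONE` marker, and the function names are assumptions for illustration. A sentinel object plays the role of marking the channel complete.

```python
import asyncio

DONE = object()  # hypothetical completion marker (stands in for channel completion)


async def index_task(paths, indexed_files):
    for path in paths:
        # put() suspends while the queue is full: this is the back-pressure.
        await indexed_files.put(path)
    # Signal that no more items will arrive.
    await indexed_files.put(DONE)


async def hash_task(indexed_files, seen):
    while True:
        item = await indexed_files.get()
        if item is DONE:
            return  # queue drained, exit cleanly
        seen.append(item)


async def main():
    indexed_files = asyncio.Queue(maxsize=2)  # assumed capacity
    seen = []
    await asyncio.gather(
        index_task(["a.txt", "b.txt", "c.txt", "d.txt"], indexed_files),
        hash_task(indexed_files, seen),
    )
    return seen


seen = asyncio.run(main())
```

With several parallel consumers, a single sentinel is not enough in this sketch; one marker per consumer would be needed. A real channel's completion state is visible to all readers.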
Stage 2: Hash task
Hashers read FilePair objects from indexedFilesChannel in parallel. For each file:
- Pointer-only files (already archived, no binary present) are skipped.
- Binary files are hashed with SHA256.
- Files below the small-file size boundary are routed to hashedSmallFilesChannel.
- Files at or above the boundary are routed to hashedLargeFilesChannel.
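The routing rules above can be expressed as a small function. This is a sketch under stated assumptions: the 1 MiB boundary is an invented value (the actual boundary is not given here), and the `FilePair` shape and `route` helper are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple

SMALL_FILE_BOUNDARY = 1024 * 1024  # assumed value; the real boundary is not stated


@dataclass
class FilePair:
    # Hypothetical shape: a pointer-only pair carries no binary content.
    name: str
    binary: Optional[bytes]


def route(pair: FilePair, boundary: int = SMALL_FILE_BOUNDARY) -> Optional[Tuple[str, str]]:
    """Return (channel name, SHA256 hex digest), or None for pointer-only pairs."""
    if pair.binary is None:
        return None  # already archived, nothing to hash
    digest = hashlib.sha256(pair.binary).hexdigest()
    if len(pair.binary) < boundary:
        return ("hashedSmallFilesChannel", digest)
    # At or above the boundary counts as large.
    return ("hashedLargeFilesChannel", digest)
```

For example, `route(FilePair("notes.txt", b"abc"))` selects the small-file channel, while a pair whose `binary` is `None` is skipped.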
Stage 3: Large file uploads
Large files are uploaded individually. Multiple parallel readers consume hashedLargeFilesChannel.
InFlightGate: preventing duplicate uploads
When multiple files share the same hash (exact duplicates), only one upload should happen. The InFlightGate is a per-hash concurrency primitive that enforces this:
- The first task to arrive for a given hash becomes the owner and performs the upload.
- All subsequent tasks for the same hash become non-owners and await the owner’s completion.
- Once the owner finishes, all waiters are released and proceed to write their pointer files.
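The owner and non-owner behaviour described above can be sketched with one future per hash. This is a Python asyncio illustration, not the actual InFlightGate implementation; the `run` method, its return value, and the failure handling are assumptions.

```python
import asyncio


class InFlightGate:
    """Hypothetical per-hash gate: the first caller owns, later callers wait."""

    def __init__(self):
        self._in_flight = {}  # hash -> Future resolved by the owner

    async def run(self, file_hash, upload):
        """Return True if this caller performed the upload, False if it waited."""
        existing = self._in_flight.get(file_hash)
        if existing is not None:
            # Non-owner: wait for the owner's upload to finish.
            if not await existing:
                raise RuntimeError("owner upload failed for " + file_hash)
            return False
        done = asyncio.get_running_loop().create_future()
        # No await between the lookup and this assignment, so only one owner exists.
        self._in_flight[file_hash] = done
        try:
            await upload(file_hash)
        except BaseException:
            done.set_result(False)  # release waiters even on failure
            raise
        done.set_result(True)
        return True


async def main():
    gate = InFlightGate()
    uploads = []

    async def upload(file_hash):
        await asyncio.sleep(0.01)  # simulate network latency
        uploads.append(file_hash)

    owners = await asyncio.gather(*(gate.run("abc", upload) for _ in range(3)))
    return uploads, owners


uploads, owners = asyncio.run(main())
```

Three callers arrive with the same hash, but `upload` runs once. Releasing waiters in the failure path as well is what keeps a failed owner from deadlocking the non-owners.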
Transformation pipeline
Every large file is transformed before reaching blob storage: compression is applied first, followed by AES256 encryption, both on the client side.
Stage 4: Small file TAR batching
Small files are aggregated into in-memory TAR archives to reduce the number of blob write operations. Unlike the large file stage, the small file stage uses a single reader to maintain TAR ordering.
TAR processing details
When the accumulator decides to flush (either because it reached the size threshold or because the input channel is drained):
Hash the TAR stream
The complete in-memory TAR stream is hashed to produce a canonical identifier for the batch.
Upload the TAR blob
The TAR is compressed with GZip, encrypted with AES256, and written to Azure Blob Storage as a single blob.
Record BinaryProperties
BinaryProperties are recorded for each child file and for the parent TAR blob.
Transformation pipeline for small files
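The accumulate-and-flush sequence described under TAR processing details can be sketched with the Python standard library. Several things are assumed: the 64 KiB threshold is invented, SHA256 is assumed for the TAR hash, and the helper names are hypothetical. AES256 encryption, the blob upload, and the BinaryProperties bookkeeping are omitted.

```python
import gzip
import hashlib
import io
import tarfile

FLUSH_THRESHOLD = 64 * 1024  # assumed value; the real threshold is not stated


def build_tar(files):
    """Pack (name, content) pairs into an uncompressed in-memory TAR."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, content in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def flush(files):
    """Hash the complete TAR stream, then compress it with GZip."""
    tar_bytes = build_tar(files)
    tar_hash = hashlib.sha256(tar_bytes).hexdigest()  # identifier for the batch
    return tar_hash, gzip.compress(tar_bytes)


def batch(small_files, threshold=FLUSH_THRESHOLD):
    """Yield one flushed batch per threshold crossing, plus a final partial batch."""
    pending, pending_size = [], 0
    for name, content in small_files:
        pending.append((name, content))
        pending_size += len(content)
        if pending_size >= threshold:
            yield flush(pending)
            pending, pending_size = [], 0
    if pending:  # input drained: flush whatever is left
        yield flush(pending)
```

Because a single loop consumes the input, files land in each TAR in arrival order, which mirrors the single-reader design of this stage.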
Duplicate small files (same hash as an already-processed file) are handled by the same InFlightGate mechanism used for large files. Non-owners register a continuation on the owner’s task so their pointer files are written as soon as the TAR flush completes.