Core Architecture Principles
The system is designed around 18 core principles (see principles.md). The most fundamental are:
File First
State is primarily described by files synced between peers. All other storage (databases, caches) is secondary and optional.
Offline First
Datasites can go offline/online freely. Messages are cached in transport layers until peers reconnect.
Shell First
Functions are shell scripts (run.sh) inside resource folders. Minimal dependencies beyond filesystem + shell + internet.
Transport Agnostic
Works with any transport layer (Google Drive, Dropbox, etc.). No lock-in to specific platforms.
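As a rough illustration of what transport agnosticism can look like in code, the sketch below defines a small interface and one stand-in backend. The names `Transport`, `InMemoryTransport`, and `send_file` are hypothetical and are not taken from Syft's API; a real backend would wrap a service such as Google Drive or Dropbox behind the same two methods.

```python
from typing import Dict, Protocol


class Transport(Protocol):
    """Hypothetical interface: anything that can store and fetch bytes by path."""

    def upload(self, path: str, data: bytes) -> None: ...

    def download(self, path: str) -> bytes: ...


class InMemoryTransport:
    """Stand-in backend for illustration; a real one would call a cloud API."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def upload(self, path: str, data: bytes) -> None:
        self._store[path] = data

    def download(self, path: str) -> bytes:
        return self._store[path]


def send_file(transport: Transport, path: str, data: bytes) -> None:
    # Calling code depends only on the interface, never on a specific platform.
    transport.upload(path, data)


backend = InMemoryTransport()
send_file(backend, "outbox/hello.txt", b"hi")
print(backend.download("outbox/hello.txt"))
```

Swapping platforms then means adding another class with the same two methods, while `send_file` and its callers stay unchanged.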
High-Level Architecture
Syft Client consists of several modular components that work together:
Core Components
1. Datasite Syncers
Two main syncer components handle different roles:
Datasite Watcher Syncer
Handles the data scientist role - pushing proposed changes and pulling from peer outboxes.
Datasite Owner Syncer
Handles the data owner role - downloading files and checking permissions.
2. Connection Router
Abstracts away transport layer details, allowing the same code to work with multiple platforms.
3. Event System
All state changes are represented as file change events:
- Immutable: Once created, events never change
- Ordered: Timestamp-based ordering ensures consistency
- Compressed: Stored as .tar.gz files for efficiency
- Cacheable: Can be replayed to rebuild state
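The four properties above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not Syft's actual event format: the `FileChangeEvent` fields, the JSON encoding, and the helper names `pack_events` and `replay` are all hypothetical.

```python
import io
import json
import tarfile
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen=True: fields cannot be reassigned after creation
class FileChangeEvent:
    timestamp: float
    path: str
    content: str


def pack_events(events: List[FileChangeEvent]) -> bytes:
    """Serialize events in timestamp order into an in-memory .tar.gz archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        ordered = sorted(events, key=lambda e: e.timestamp)
        for i, ev in enumerate(ordered):
            payload = json.dumps(asdict(ev)).encode("utf-8")
            info = tarfile.TarInfo(name=f"{i:06d}.json")  # name encodes the order
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def replay(archive: bytes) -> Dict[str, str]:
    """Rebuild state (path -> latest content) by replaying archived events."""
    state: Dict[str, str] = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            ev = json.loads(tar.extractfile(member).read())
            state[ev["path"]] = ev["content"]
    return state


events = [
    FileChangeEvent(2.0, "a.txt", "second"),
    FileChangeEvent(1.0, "a.txt", "first"),
    FileChangeEvent(3.0, "b.txt", "hello"),
]
state = replay(pack_events(events))
print(state)
```

Because events are sorted by timestamp before archiving, replaying the archive yields the same state regardless of the order in which events were collected.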
4. Permission Engine
File-based access control using syft.pub.yaml files. See Permissions for details.
5. Job System
Shell-first job execution framework:
- run.sh - The shell script to execute
- config.yaml - Job metadata (name, submitted_by, dependencies)
- Additional resources (Python files, data, etc.)
- Status markers: approved, done
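The job layout above can be sketched as follows. Several details here are assumptions made for illustration: that status markers are empty files named `approved` and `done` inside the job folder, that `config.yaml` uses exactly these keys, and that `sh` is available on the host. The helpers `create_job` and `run_if_approved` are hypothetical and not part of syft-job.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def create_job(root: Path, name: str, script: str, submitted_by: str) -> Path:
    """Lay out a job folder: run.sh plus a config.yaml with basic metadata."""
    job_dir = root / name
    job_dir.mkdir(parents=True)
    (job_dir / "run.sh").write_text(script)
    # Written by hand to avoid a YAML dependency; the exact schema is assumed.
    (job_dir / "config.yaml").write_text(
        f"name: {name}\nsubmitted_by: {submitted_by}\ndependencies: []\n"
    )
    return job_dir


def run_if_approved(job_dir: Path) -> Optional[str]:
    """Run run.sh only when an 'approved' marker exists, then create 'done'."""
    if not (job_dir / "approved").exists():
        return None
    result = subprocess.run(
        ["sh", "run.sh"], cwd=job_dir, capture_output=True, text=True, check=True
    )
    (job_dir / "done").touch()
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    job = create_job(
        Path(tmp), "hello-job", "echo hello from job\n", "alice@example.com"
    )
    skipped = run_if_approved(job)   # no marker yet, so nothing runs
    (job / "approved").touch()       # the data owner approves the job
    output = run_if_approved(job)
    finished = (job / "done").exists()
```

Keeping approval and completion as plain files means the job's status travels with the folder and can be synced like any other file change.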
State Management
File-First State
All state is stored as files on the local filesystem.
Checkpointing System
To optimize sync performance, Syft uses a multi-tier checkpoint system:
Incremental Checkpoints
Periodic snapshots of rolling state. Created when event count exceeds threshold (default: 50 events).
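A minimal sketch of the threshold rule, assuming "exceeds" means strictly more than 50 events since the last checkpoint. The `RollingState` class is hypothetical and keeps snapshots in memory; a real implementation would presumably write them to files.

```python
from dataclasses import dataclass, field


@dataclass
class RollingState:
    threshold: int = 50  # default event count from the description above
    state: dict = field(default_factory=dict)
    pending_events: int = 0  # events applied since the last checkpoint
    checkpoints: list = field(default_factory=list)

    def apply(self, path: str, content: str) -> None:
        """Apply one file change event and checkpoint if the threshold is passed."""
        self.state[path] = content
        self.pending_events += 1
        # "Exceeds" is read here as strictly greater than the threshold.
        if self.pending_events > self.threshold:
            self.checkpoints.append(dict(self.state))  # snapshot a copy
            self.pending_events = 0


rolling = RollingState()
for i in range(120):
    rolling.apply(f"file_{i % 10}.txt", f"v{i}")
print(len(rolling.checkpoints), rolling.pending_events)
```

With 120 events, checkpoints are taken at the 51st and 102nd event, leaving 18 events pending. A syncing peer would then only need the latest snapshot plus those pending events rather than the full history.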
Modular Package Structure
Syft Client is composed of optional modules:
- syft-datasets: Dataset management and sharing. Handles collections of files with permission tracking.
- syft-job: Job submission and execution. Shell-first computation framework.
- syft-permissions: Permission system for datasites. File-first access control.
- syft-perm: User-facing permission API. High-level interface for granting/revoking access.
- syft-bg: Background services TUI dashboard. Monitors and auto-approves jobs.
- syft-notebook-ui: Jupyter notebook display utilities. Rich HTML rendering for jobs and datasets.
Following Principle 10: Modular-first - upgrades to one module don’t require upgrades to the rest of the system.
Peer Discovery
Following Principle 9: Peer-first, Syft assumes discovery happens elsewhere (e.g., SyftHub). The protocol itself is privacy-preserving like Signal:
- If they’re not in your contact list, you don’t know they exist
- Nobody outside your contact list knows you exist
- Communication only happens between explicitly authorized peers
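The contact-list rule can be illustrated with a small sketch. It assumes that delivery requires each side to have explicitly added the other; the `Peer` class and `deliver` function are hypothetical and do not describe Syft's actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    email: str
    contacts: set = field(default_factory=set)  # explicitly authorized peers
    inbox: list = field(default_factory=list)


def deliver(sender: Peer, receiver: Peer, message: str) -> bool:
    """Deliver only if each side has explicitly added the other (assumed rule)."""
    mutual = (
        receiver.email in sender.contacts and sender.email in receiver.contacts
    )
    if not mutual:
        return False
    receiver.inbox.append((sender.email, message))
    return True


alice = Peer("alice@example.com")
bob = Peer("bob@example.com")
stranger = Peer("eve@example.com")

alice.contacts.add(bob.email)
bob.contacts.add(alice.email)

ok = deliver(alice, bob, "hi")           # both sides authorized each other
blocked = deliver(stranger, bob, "hi")   # no contact relationship, not delivered
print(ok, blocked, bob.inbox)
```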
Next Steps
- P2P Network: Learn about transport layers and offline-first sync
- Datasites: Understand data owner and data scientist roles
- Permissions: Explore the file-first permission system
- 18 Principles: Read the complete design principles