Syft Client is built on a file-first, modular, and offline-first architecture that enables peer-to-peer data science across decentralized networks.

Core Architecture Principles

The system is designed around 18 core principles (see principles.md). The most fundamental are:

File First

State is primarily described by files synced between peers. All other storage (databases, caches) is secondary and optional.

Offline First

Datasites can go offline/online freely. Messages are cached in transport layers until peers reconnect.

Shell First

Functions are shell scripts (run.sh) inside resource folders. Dependencies are kept minimal: a filesystem, a shell, and internet access.

Transport Agnostic

Works with any transport layer (Google Drive, Dropbox, etc.). No lock-in to specific platforms.

High-Level Architecture

Syft Client consists of several modular components that work together:

Core Components

1. Datasite Syncers

Two main syncer components handle different roles:

Datasite Watcher Syncer

Handles the data scientist role - pushing proposed changes and pulling from peer outboxes.
class DatasiteWatcherSyncer(BaseModelCallbackMixin):
    """Handles both pushing proposed file changes and pulling from datasite outboxes."""
    
    def on_file_change(self, relative_path: Path | str, content: str | None = None):
        """Queue file changes for syncing to peers"""
        
    def sync_down(self, peer_emails: list[str]):
        """Pull messages and datasets from peer outboxes"""

Datasite Owner Syncer

Handles the data owner role - downloading files and checking permissions.
class DatasiteOwnerSyncer(BaseModelCallbackMixin):
    """Responsible for downloading files and checking permissions"""
    
    def sync(self, peer_emails: list[str], recompute_hashes: bool = True):
        """Pull proposed file changes and process them"""
        
    def check_write_permission(self, sender_email: str, path: str) -> bool:
        """Check if sender has write access to the given path"""
        
    def check_read_permissions(self, recipient_email: str, path: str) -> bool:
        """Check if recipient has read access to the given path"""

2. Connection Router

Abstracts away transport layer details, allowing the same code to work with multiple platforms.
class ConnectionRouter:
    """Routes connections to appropriate transport layers"""
    
    def send_proposed_file_changes_message(self, recipient: str, message: ProposedFileChangesMessage):
        """Send changes via configured transport layer"""
        
    def get_next_proposed_filechange_message(self, sender_email: str) -> ProposedFileChangesMessage | None:
        """Pull next message from inbox"""
See Peer-to-Peer Network for transport layer details.

3. Event System

All state changes are represented as file change events:
class FileChangeEvent(BaseModel):
    id: UUID
    path_in_datasite: Path
    datasite_email: str
    content: str | bytes | None  # None for deletions
    old_hash: str | None
    new_hash: str | None  # None for deletions
    is_deleted: bool
    submitted_timestamp: float
    timestamp: float
Events are:
  • Immutable: Once created, events never change
  • Ordered: Timestamp-based ordering ensures consistency
  • Compressed: Stored as .tar.gz files for efficiency
  • Cacheable: Can be replayed to rebuild state

4. Permission Engine

File-based access control using syft.pub.yaml files. See Permissions for details.

5. Job System

Shell-first job execution framework:
class JobClient:
    """Client for submitting jobs to SyftBox."""
    
    def submit_bash_job(self, user: str, script: str, job_name: str = "") -> Path:
        """Submit a bash job for a user"""
        
    def submit_python_job(self, user: str, code_path: str, job_name: str = "",
                          dependencies: list[str] | None = None) -> Path:
        """Submit a Python job (wraps code in bash script)"""
Jobs are folders containing:
  • run.sh - The shell script to execute
  • config.yaml - Job metadata (name, submitted_by, dependencies)
  • Additional resources (Python files, data, etc.)
  • Status markers: approved, done
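The folder layout above can be sketched with plain filesystem calls. This is an illustration only: `create_job_folder` and `job_status` are hypothetical helpers, the `config.yaml` contents are a guess at the metadata fields listed above, the status markers are assumed to be empty files, and the "pending" label is invented for jobs without a marker.

```python
import tempfile
from pathlib import Path


def create_job_folder(jobs_root: Path, job_name: str, script: str, submitted_by: str) -> Path:
    """Write a job folder containing run.sh and config.yaml (hypothetical helper)."""
    job_dir = jobs_root / submitted_by / job_name
    job_dir.mkdir(parents=True)
    run_sh = job_dir / "run.sh"
    run_sh.write_text("#!/bin/bash\n" + script + "\n")
    run_sh.chmod(0o755)  # make the script executable
    # Assumed metadata layout; the real schema may contain more fields.
    (job_dir / "config.yaml").write_text(
        f"name: {job_name}\nsubmitted_by: {submitted_by}\n"
    )
    return job_dir


def job_status(job_dir: Path) -> str:
    """Derive status from marker files; 'pending' is a label assumed for this sketch."""
    if (job_dir / "done").exists():
        return "done"
    if (job_dir / "approved").exists():
        return "approved"
    return "pending"


with tempfile.TemporaryDirectory() as tmp:
    job_dir = create_job_folder(Path(tmp), "job_123", "echo hello", "bob@example.com")
    status_before = job_status(job_dir)
    (job_dir / "approved").touch()  # the data owner approves the job
    status_after = job_status(job_dir)
    files = sorted(p.name for p in job_dir.iterdir())
```

Keeping status in marker files means any tool that can list a directory can inspect a job, which fits the file-first and shell-first principles.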

State Management

File-First State

State is stored as files on the local filesystem; only the event cache is optional:
~/syftbox/
├── [email protected]/           # Datasite
│   ├── public/                  # Public folder
│   │   ├── syft.pub.yaml       # Permission file
│   │   └── data.csv            # Shared data
│   ├── private/                 # Private folder
│   └── jobs/                    # Job queue
│       └── [email protected]/     # Jobs from Bob
│           └── job_123/
│               ├── run.sh
│               ├── config.yaml
│               └── approved     # Status marker
└── .syftbox-events/            # Event cache (optional)

Checkpointing System

To optimize sync performance, Syft uses a multi-tier checkpoint system:
1. Rolling State

In-memory accumulation of recent events. Uploaded to transport layer after threshold.

2. Incremental Checkpoints

Periodic snapshots of rolling state. Created when event count exceeds threshold (default: 50 events).

3. Full Checkpoints

Complete state snapshots. Created by compacting incremental checkpoints (default: after 10 incremental checkpoints).
# From datasite_owner_syncer.py
def try_create_checkpoint(self, threshold: int = 50, compacting_threshold: int = 10):
    """Try to create incremental checkpoint and/or compact if thresholds exceeded."""
    if self.should_create_checkpoint(threshold):
        result = self.create_incremental_checkpoint()
        
        if self.should_compact_checkpoints(compacting_threshold):
            result = self.compact_checkpoints()  # Merge into full checkpoint
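To make the interaction between the two thresholds concrete, here is a small self-contained simulation. The `CheckpointTiers` class is hypothetical and much simpler than the real syncer: it stores event labels instead of file state, and it assumes a checkpoint is cut once a count reaches its threshold (the source only says "exceeds", so the exact boundary is a guess).

```python
class CheckpointTiers:
    """Toy model of rolling state, incremental checkpoints, and full checkpoints."""

    def __init__(self, threshold: int = 50, compacting_threshold: int = 10) -> None:
        self.threshold = threshold
        self.compacting_threshold = compacting_threshold
        self.rolling: list[str] = []            # recent events, not yet checkpointed
        self.incremental: list[list[str]] = []  # one entry per incremental checkpoint
        self.full: list[str] = []               # compacted history

    def record(self, event: str) -> None:
        self.rolling.append(event)
        if len(self.rolling) >= self.threshold:
            # Cut an incremental checkpoint from the rolling state.
            self.incremental.append(self.rolling)
            self.rolling = []
            if len(self.incremental) >= self.compacting_threshold:
                # Merge all incremental checkpoints into the full checkpoint.
                for checkpoint in self.incremental:
                    self.full.extend(checkpoint)
                self.incremental = []


# Small thresholds keep the example easy to follow.
tiers = CheckpointTiers(threshold=3, compacting_threshold=2)
for i in range(7):
    tiers.record(f"event-{i}")
```

With these settings, the first six events form two incremental checkpoints that are then compacted, and the seventh event stays in the rolling state. With the defaults of 50 and 10, compaction would happen after roughly 500 events.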

Modular Package Structure

Syft Client is composed of optional modules:
  • Dataset management and sharing. Handles collections of files with permission tracking.
  • Job submission and execution. Shell-first computation framework.
  • Permission system for datasites. File-first access control.
  • User-facing permission API. High-level interface for granting/revoking access.
  • Background services TUI dashboard. Monitors and auto-approves jobs.
  • Jupyter notebook display utilities. Rich HTML rendering for jobs and datasets.
This follows Principle 10 (Modular-first): upgrading one module does not require upgrading the rest of the system.

Peer Discovery

Following Principle 9 (Peer-first), Syft assumes discovery happens elsewhere (e.g., SyftHub). Like Signal, the protocol itself is privacy-preserving:
  • If they’re not in your contact list, you don’t know they exist
  • Nobody outside your contact list knows you exist
  • Communication only happens between explicitly authorized peers
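The contact-list rule can be sketched as a simple mutual check. The `Peer` class and `can_communicate` function are hypothetical, and the sketch assumes that "explicitly authorized" means both peers have added each other; the actual authorization mechanism is not described here.

```python
class Peer:
    """Hypothetical peer that only knows the contacts it has explicitly added."""

    def __init__(self, email: str) -> None:
        self.email = email
        self.contacts: set[str] = set()

    def add_contact(self, email: str) -> None:
        self.contacts.add(email)


def can_communicate(a: Peer, b: Peer) -> bool:
    """Assumed rule: both sides must list each other before any exchange."""
    return b.email in a.contacts and a.email in b.contacts


alice = Peer("alice@example.com")
bob = Peer("bob@example.com")

alice.add_contact(bob.email)
one_sided = can_communicate(alice, bob)  # Bob has not added Alice yet

bob.add_contact(alice.email)
mutual = can_communicate(alice, bob)
```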

Next Steps

  • P2P Network: Learn about transport layers and offline-first sync
  • Datasites: Understand data owner and data scientist roles
  • Permissions: Explore the file-first permission system
  • 18 Principles: Read the complete design principles
