Overview
SyftBox uses a bi-directional sync system built on Google Drive to enable secure data sharing between Data Owners (DO) and Data Scientists (DS). The sync system manages:
- File change detection and propagation
- Permission-based access control
- Event-driven state synchronization
- Connection routing and management
Architecture
The sync system consists of two main components:
DatasiteOwnerSyncer (Data Owner)
Handles downloads, permission checking, and outbox management for Data Owners.
- Pull file change events from peer inboxes
- Verify permissions before accepting changes
- Write approved files to local filesystem
- Track file hashes in event cache
- Send file change events to peer outboxes
- Manage checkpoints for efficient syncing
DatasiteWatcherSyncer (Data Scientist)
Handles file change detection and pushing to Data Owner inboxes.
- Monitor local file changes
- Create ProposedFileChange messages
- Send changes to Data Owner inboxes
- Pull approved changes from Data Owner outboxes
- Update local cache of file states
Sync Flow
Data Scientist → Data Owner
- DS detects change: File written to ~/SyftBox_{email}/{dataowner_email}/...
- DS creates proposal: ProposedFileChange with path, content, old hash
- DS sends to inbox: Message placed in DO’s Google Drive inbox
- DO syncs: Pulls from inbox during sync()
- DO checks permissions: Verifies DS has write access via SyftPerm
- DO accepts/rejects: Writes file locally or discards based on permissions
- DO sends event: Places FileChangeEvent in DS’s outbox
- DS pulls event: Updates cache with new hash during next sync()
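The steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not SyftBox's implementation: `ToyDataOwner`, the field names on the two message classes, the `writers` set standing in for a SyftPerm write check, and the example email addresses are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def content_hash(content: bytes) -> str:
    """SHA-256 hex digest of the file content (hash choice is an assumption)."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class ProposedFileChange:
    # Hypothetical fields, modelled on "path, content, old hash" above.
    sender: str
    path: str
    content: bytes
    old_hash: Optional[str]


@dataclass
class FileChangeEvent:
    path: str
    new_hash: str


class ToyDataOwner:
    """In-memory stand-in for the Data Owner side of the flow."""

    def __init__(self, writers: set):
        self.writers = writers  # stand-in for a SyftPerm write-access check
        self.inbox: list = []
        self.files: dict = {}

    def sync(self) -> list:
        """Pull proposals, apply the permitted ones, return events for the outbox."""
        events = []
        for proposal in self.inbox:
            if proposal.sender not in self.writers:
                continue  # rejected: sender has no write access
            self.files[proposal.path] = proposal.content
            events.append(FileChangeEvent(proposal.path, content_hash(proposal.content)))
        self.inbox.clear()
        return events


owner = ToyDataOwner(writers={"ds@example.org"})
owner.inbox.append(ProposedFileChange("ds@example.org", "notes/a.txt", b"hello", None))
owner.inbox.append(ProposedFileChange("other@example.org", "notes/b.txt", b"nope", None))

events = owner.sync()
# The Data Scientist updates its cache from the events it pulls.
ds_cache = {event.path: event.new_hash for event in events}
```

Only the proposal from the permitted sender is written and produces an event; the other is discarded without touching the Data Owner's files.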
Data Owner → Data Scientist
- DO creates dataset: client.create_dataset(name="data", users=["[email protected]"])
- DO shares via collection: Files uploaded to Google Drive collection folder
- DS syncs: Discovers collection during sync_down()
- DS downloads: Pulls dataset files to local cache
- DS accesses: Reads files via client.datasets.get(name="data")
Sync Methods
sync()
Primary sync method on SyftboxManager.
Auto-Sync Behavior
Certain properties trigger automatic sync before returning data:
- client.peers: Syncs before returning peer list
- client.jobs: Syncs before returning job list
- client.datasets: Syncs before returning dataset manager
client.process_approved_jobs() also syncs, but after job execution rather than before.
The PRE_SYNC setting controls auto-sync:
- PRE_SYNC=true (default): Auto-sync enabled
- PRE_SYNC=false: Auto-sync disabled
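A minimal sketch of the sync-before-read pattern described above. It assumes PRE_SYNC is read from the process environment, which the text does not state; `ToyClient` and `pre_sync_enabled` are hypothetical names, not part of SyftBox.

```python
import os


def pre_sync_enabled() -> bool:
    # Assumption: PRE_SYNC is an environment variable; unset means enabled.
    return os.environ.get("PRE_SYNC", "true").lower() == "true"


class ToyClient:
    """Hypothetical client whose `peers` property syncs before returning."""

    def __init__(self):
        self.sync_count = 0
        self._peers = ["peer-a"]

    def sync(self) -> None:
        # A real sync would talk to the remote platform; here we only count calls.
        self.sync_count += 1

    @property
    def peers(self) -> list:
        if pre_sync_enabled():
            self.sync()
        return list(self._peers)


client = ToyClient()

os.environ["PRE_SYNC"] = "true"
peers_with_sync = client.peers      # triggers one sync

os.environ["PRE_SYNC"] = "false"
peers_without_sync = client.peers   # returns cached data, no sync
```

With auto-sync disabled, the caller is responsible for invoking `sync()` explicitly before reading, which trades freshness for fewer remote calls.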
Connection Management
ConnectionRouter
Manages multiple platform connections (Google Drive, future: Dropbox, S3, etc.)
- write_event_messages_to_inbox(): Send messages to peer inbox
- write_event_messages_to_outbox_do(): Send messages to peer outbox (DO)
- pull_from_inbox(): Retrieve messages from inbox
- pull_from_outbox(): Retrieve messages from outbox (DS)
- create_dataset_collection_folder(): Create shared dataset folder
- share_dataset_collection(): Grant access to users
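One way to picture the routing idea is a router that delegates to whichever platform connection is selected. The sketch below is an assumption-laden illustration: the method names `write_event_messages_to_inbox` and `pull_from_inbox` come from the list above, but their signatures, the `PlatformConnection` protocol, `InMemoryConnection`, and `ToyRouter` are hypothetical.

```python
from typing import Optional, Protocol


class PlatformConnection(Protocol):
    """Hypothetical interface each platform backend would satisfy."""

    def write_to_inbox(self, peer: str, messages: list) -> None: ...
    def pull_from_inbox(self, peer: str) -> list: ...


class InMemoryConnection:
    """Test double standing in for a real backend such as Google Drive."""

    def __init__(self):
        self.inboxes: dict = {}

    def write_to_inbox(self, peer: str, messages: list) -> None:
        self.inboxes.setdefault(peer, []).extend(messages)

    def pull_from_inbox(self, peer: str) -> list:
        # Pulling drains the inbox so messages are delivered once.
        return self.inboxes.pop(peer, [])


class ToyRouter:
    """Delegates each call to the chosen platform, or to the default one."""

    def __init__(self, connections: dict, default: str):
        self.connections = connections
        self.default = default

    def _pick(self, platform: Optional[str]) -> PlatformConnection:
        return self.connections[platform or self.default]

    def write_event_messages_to_inbox(self, peer, messages, platform=None):
        self._pick(platform).write_to_inbox(peer, messages)

    def pull_from_inbox(self, peer, platform=None):
        return self._pick(platform).pull_from_inbox(peer)


router = ToyRouter({"gdrive": InMemoryConnection()}, default="gdrive")
router.write_event_messages_to_inbox("do@example.org", ["m1", "m2"])
first_pull = router.pull_from_inbox("do@example.org")
second_pull = router.pull_from_inbox("do@example.org")
```

Keeping callers on the router's interface means a second backend could be registered under another key without changing sync code.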
GDriveConnection
Google Drive implementation of platform connection.
- OAuth2 authentication
- File upload/download via Drive API
- Permission management (share with users)
- Folder creation and hierarchy management
- Batch operations for efficiency
Event Caching
DataSiteOwnerEventCache (Data Owner)
Tracks file hashes and events for Data Owners.
Storage:
- In-memory: Dict of file paths → hashes
- Filesystem: {syftbox_folder}-events/ directory (unless use_in_memory_cache=True)
Tracked data:
- File hashes (current state)
- Event history (for checkpointing)
- Peer permissions
DataSiteWatcherCache (Data Scientist)
Tracks file states for Data Scientists.
Storage:
- In-memory: Dict of file paths → hashes
- Filesystem: {syftbox_folder}-event-messages/ directory
Tracked data:
- File hashes (to detect changes)
- Outbox message timestamps
- Dataset collection metadata
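Hash-based change detection, as used by the caches above, can be sketched as follows. This is a simplified illustration, not the actual cache classes: `detect_changes` and `hash_file` are hypothetical helpers, SHA-256 is an assumed hash function, and deletions are not handled.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_changes(root: Path, cache: dict) -> list:
    """Return relative paths whose hash differs from the cache, updating the cache."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        digest = hash_file(path)
        if cache.get(rel) != digest:
            cache[rel] = digest
            changed.append(rel)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("v1")

    cache = {}
    first = detect_changes(root, cache)    # new file is reported
    second = detect_changes(root, cache)   # nothing changed since last scan

    (root / "a.txt").write_text("v2")
    third = detect_changes(root, cache)    # modified content is reported
```

Comparing content hashes rather than modification times avoids reporting a file that was rewritten with identical content.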
Checkpoints
Checkpoints enable fast initial sync by providing a snapshot of all files and hashes.
Creating Checkpoints
Checkpoint Structure
Checkpoint Benefits
Without checkpoint:
- New peer downloads all historical events (could be thousands)
- Replays each event to build current state
- Slow for long-running Data Owners
With checkpoint:
- New peer downloads single checkpoint file
- Instantly has current state
- Only processes events since checkpoint
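The difference between the two paths can be shown with a toy event log. The `Event` and `Checkpoint` shapes below are assumptions made for illustration; the actual checkpoint structure is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    seq: int                  # hypothetical monotonically increasing sequence number
    path: str
    new_hash: Optional[str]   # None marks a deletion


@dataclass
class Checkpoint:
    last_seq: int             # last event folded into the snapshot
    hashes: dict              # file path -> hash at that point


def apply_events(state: dict, events: list) -> dict:
    """Replay events in order on top of a copy of `state`."""
    state = dict(state)
    for event in sorted(events, key=lambda e: e.seq):
        if event.new_hash is None:
            state.pop(event.path, None)
        else:
            state[event.path] = event.new_hash
    return state


def make_checkpoint(events: list) -> Checkpoint:
    last_seq = max((e.seq for e in events), default=0)
    return Checkpoint(last_seq, apply_events({}, events))


def restore(checkpoint: Checkpoint, events: list) -> dict:
    """Start from the snapshot and replay only the events after it."""
    tail = [e for e in events if e.seq > checkpoint.last_seq]
    return apply_events(checkpoint.hashes, tail)


history = [Event(1, "a.csv", "h1"), Event(2, "b.csv", "h2"), Event(3, "a.csv", "h3")]
checkpoint = make_checkpoint(history[:2])

full_replay = apply_events({}, history)   # without checkpoint: replay everything
fast_path = restore(checkpoint, history)  # with checkpoint: replay one event
```

Both paths reach the same state; the checkpoint path simply does less work as the history grows.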
Version Compatibility
Sync operations check version compatibility to prevent protocol mismatches.
Version Manager
Version Checks
During sync:
- Data Owner only syncs with compatible peers
- Incompatible peers receive warnings
- Events from incompatible peers are ignored
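A rough sketch of the filtering behavior listed above. The compatibility rule used here (equal major version) is an assumption chosen for the example, and `PeerVersion`, `is_compatible`, and `compatible_peers` are hypothetical names rather than SyftBox's version manager.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerVersion:
    email: str
    protocol_version: str


def is_compatible(local: str, remote: str) -> bool:
    # Assumption: versions are compatible when their major components match.
    return local.split(".")[0] == remote.split(".")[0]


def compatible_peers(local_version: str, peers: list) -> list:
    """Keep compatible peers; warn about and skip the rest."""
    accepted = []
    for peer in peers:
        if is_compatible(local_version, peer.protocol_version):
            accepted.append(peer)
        else:
            warnings.warn(
                f"Skipping {peer.email}: protocol {peer.protocol_version} "
                f"is incompatible with {local_version}"
            )
    return accepted


peers = [
    PeerVersion("a@example.org", "1.4.0"),
    PeerVersion("b@example.org", "2.0.1"),
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    accepted = compatible_peers("1.2.3", peers)
```

Events would then be pulled only for the peers in `accepted`, so messages from incompatible peers never reach the permission check.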
Performance Considerations
Batch Operations
Checkpoint Frequency
Manual Sync Control
Troubleshooting
Sync Not Detecting Changes
Orphaned Files After Deletion
Version Incompatibility Issues
See Also
- SyftboxManager - Main client interface
- Login Functions - Creating authenticated clients
- Datasets Guide - Working with shared datasets
- Jobs Guide - Submitting and processing jobs