Overview
The SyftboxManager class is the central entry point for interacting with the SyftBox ecosystem. It provides different capabilities depending on whether you're a Data Owner (DO) or a Data Scientist (DS).
Data Owner capabilities:
- Create and manage datasets
- Approve/reject peer requests
- Process and execute approved jobs
- Create checkpoints for efficient syncing
Data Scientist capabilities:
- Access shared datasets
- Submit jobs to Data Owners
- Connect with Data Owners as peers
Creating Instances
Instances are typically created via login functions.
Class Methods
for_colab()
Create a SyftboxManager instance for a Google Colab environment.
Parameters:
- User email address.
- only_ds: Initialize as Data Scientist (cannot be True with only_datasite_owner).
- only_datasite_owner: Initialize as Data Owner (cannot be True with only_ds).
Returns: Configured instance with Colab-specific settings.
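The mutual exclusion between only_ds and only_datasite_owner can be sketched as a simple guard. This is a hypothetical helper, not the real constructor's validation; the behavior when neither flag is set (acting as both roles) is an assumption.

```python
def validate_role_flags(only_ds: bool = False, only_datasite_owner: bool = False) -> str:
    """Resolve the role implied by the two flags; they are mutually exclusive."""
    if only_ds and only_datasite_owner:
        raise ValueError("only_ds and only_datasite_owner cannot both be True")
    if only_ds:
        return "DS"
    if only_datasite_owner:
        return "DO"
    # Assumed default: with neither flag set, the instance acts as both roles.
    return "both"
```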
for_jupyter()
Create a SyftboxManager instance for a Jupyter environment.
Parameters:
- User email address.
- only_ds: Initialize as Data Scientist.
- only_datasite_owner: Initialize as Data Owner.
- Path to Google Drive OAuth token file.
Properties
Email address of the current user.
syftbox_folder
Base directory for SyftBox files and datasets.
peers
Get the list of connected peers. Automatically syncs before returning if PRE_SYNC=true (default).
- For Data Owners: List of approved peers + pending requests (approved first)
- For Data Scientists: List of connected Data Owners (all marked as ACCEPTED)
- Default: PRE_SYNC=true syncs before returning
- Disable: set the PRE_SYNC=false environment variable
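The PRE_SYNC switch follows a common environment-variable pattern; a minimal sketch of how such a check could work (the exact parsing rules SyftBox applies are an assumption):

```python
import os

def pre_sync_enabled() -> bool:
    # PRE_SYNC defaults to "true"; setting PRE_SYNC=false disables the
    # automatic sync before returning peers, jobs, or datasets.
    return os.environ.get("PRE_SYNC", "true").strip().lower() != "false"
```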
jobs
Get the list of jobs. Automatically syncs before returning if PRE_SYNC=true (default).
List of job objects with status, submitter, and execution details.
datasets
Get the dataset manager. Automatically syncs before returning if PRE_SYNC=true (default).
Dataset manager for querying and accessing datasets.
is_do
True if this is a Data Owner instance, False if Data Scientist.
Sync Methods
sync()
Sync local state with Google Drive.
Parameters:
- Automatically create a checkpoint when the event count exceeds the threshold (DO only).
- Create a checkpoint when events since the last checkpoint >= this value.
For Data Owners:
- Loads peer list
- Filters to version-compatible peers (warns about incompatible)
- Syncs with compatible peers
- Optionally creates checkpoint if threshold exceeded
For Data Scientists:
- Loads peer list
- Warns if all connected peers are incompatible
- Syncs down from connected peers
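The version-compatibility filtering performed during sync can be sketched as a partition over the peer list. The actual rule SyftBox uses to decide compatibility is not documented here, so matching on the major.minor version prefix is purely an illustrative assumption:

```python
def split_peers_by_compatibility(peers, local_version):
    """Partition (email, version) pairs into compatible and incompatible emails.

    Assumes compatibility means an identical major.minor version prefix.
    """
    def major_minor(version):
        return tuple(version.split(".")[:2])

    compatible, incompatible = [], []
    for email, version in peers:
        if major_minor(version) == major_minor(local_version):
            compatible.append(email)
        else:
            incompatible.append(email)  # sync would warn about these peers
    return compatible, incompatible
```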
load_peers()
Load the peer list from the connection router. Called automatically by sync() and when accessing the peers property.
Peer Management
add_peer()
Add a peer connection request.
Parameters:
- Email address of the peer to add.
- Re-add the peer even if it already exists.
- Print status messages.
approve_peer_request()
Approve a pending peer request. Data Owner only.
Parameters:
- Email address or Peer object to approve.
- Print approval status.
- Require the peer request to exist before approving.
On approval:
- Sets up a DS job folder for the approved peer
- Shares all "any"-permission datasets with the peer
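Sharing all "any"-permission datasets with a newly approved peer amounts to a filter over the dataset list. A sketch under an assumed data shape (the shared_with field and dict layout are hypothetical, not the library's real structures):

```python
def datasets_to_autoshare(datasets):
    # Datasets created with the public "any" sharing marker are shared with
    # every newly approved peer; "shared_with" is an assumed field name.
    return [d["name"] for d in datasets if d.get("shared_with") == "any"]
```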
reject_peer_request()
Reject a pending peer request. Data Owner only.
Parameters:
- Email address or Peer object to reject.
Dataset Management
create_dataset()
Create and optionally share a dataset. Data Owner only.
Parameters:
- Dataset name.
- Path to mock/sample data file.
- Path to private data file (optional).
- List of user emails to share with, or "any" for public sharing.
- Upload private data to an owner-only collection.
- Sync after dataset creation.
Returns: Created dataset object.
share_dataset()
Share an existing dataset with additional users. Data Owner only.
Parameters:
- Dataset name.
- List of email addresses, or "any".
- Sync after sharing.
delete_dataset()
Delete a dataset. Data Owner only.
Parameters:
- Dataset name to delete.
- Sync after deletion.
Job Management
submit_python_job()
Submit a Python job to a Data Owner. Data Scientist only.
Parameters:
- Data Owner email to submit the job to.
- Path to Python script file.
- Job description.
- Sync after submission.
- Skip version compatibility check.
submit_bash_job()
Submit a Bash job to a Data Owner. Data Scientist only.
Parameters:
- Data Owner email to submit the job to.
- Path to Bash script file.
- Job description.
- Sync after submission.
- Skip version compatibility check.
process_approved_jobs()
Execute all approved jobs. Data Owner only.
Parameters:
- Stream job output in real time (False = capture at end).
- Timeout in seconds per job (default: 300, or the SYFT_DEFAULT_JOB_TIMEOUT_SECONDS env var).
- force_execution: Process all jobs regardless of version compatibility.
- Grant submitter read access to job outputs.
- Grant submitter read access to job logs.
Notes:
- Automatically syncs after processing if PRE_SYNC=true (default)
- Skips jobs from version-incompatible peers unless force_execution=True
- Prints warnings for skipped jobs
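The per-job timeout resolution described above (explicit argument, then the SYFT_DEFAULT_JOB_TIMEOUT_SECONDS variable, then 300 seconds) can be sketched as follows; the helper name is hypothetical:

```python
import os

DEFAULT_JOB_TIMEOUT_SECONDS = 300

def resolve_job_timeout(explicit_timeout=None):
    # Precedence: explicit argument > environment variable > built-in default.
    if explicit_timeout is not None:
        return explicit_timeout
    return int(os.environ.get("SYFT_DEFAULT_JOB_TIMEOUT_SECONDS",
                              DEFAULT_JOB_TIMEOUT_SECONDS))
```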
Checkpoint Management
create_checkpoint()
Create a checkpoint of current state. Data Owner only.
Returns: Checkpoint object containing a snapshot of all files and hashes.
should_create_checkpoint()
Check whether a checkpoint should be created based on the event count.
Parameters:
- Create a checkpoint if events since the last checkpoint >= this value.
Returns: True if a checkpoint should be created, False otherwise.
try_create_checkpoint()
Automatically create a checkpoint if the threshold is exceeded.
Parameters:
- Event count threshold.
Returns: Created checkpoint if the threshold was exceeded, None otherwise.
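The two helpers above reduce to a threshold comparison plus a conditional build. A sketch under stated assumptions: the standalone function shapes and the `create` callable are illustrative stand-ins for the real methods.

```python
def should_create_checkpoint(events_since_last_checkpoint, threshold):
    # A checkpoint is due once the events accumulated since the last
    # checkpoint reach the configured threshold.
    return events_since_last_checkpoint >= threshold

def try_create_checkpoint(events_since_last_checkpoint, threshold, create):
    # `create` is a callable standing in for the real checkpoint builder;
    # returns the new checkpoint, or None when the threshold is not met.
    if should_create_checkpoint(events_since_last_checkpoint, threshold):
        return create()
    return None
```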
Cleanup Methods
delete_syftbox()
Delete all SyftBox state: Google Drive files, local caches, and folders.
Parameters:
- Print deletion progress.
- Broadcast is_deleted events to peers before deleting (DO only).
- Gathers all files from folder hierarchy
- Finds orphaned files by name pattern
- Deletes all files from Google Drive
- Broadcasts delete events to peers (if DO)
- Clears in-memory and filesystem caches
- Deletes local SyftBox folder and cache directories
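The final cleanup step, removing the local SyftBox folder and cache directories, can be sketched with the standard library; the function name and path handling are illustrative, not the library's internals:

```python
import shutil
from pathlib import Path

def remove_local_state(syftbox_folder, cache_dirs=()):
    # Delete the local SyftBox folder and any cache directories,
    # skipping paths that are already gone so the call is idempotent.
    for folder in (syftbox_folder, *cache_dirs):
        path = Path(folder)
        if path.exists():
            shutil.rmtree(path)
```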
See Also
- Login Functions - Creating SyftboxManager instances
- Sync Operations - Understanding sync mechanics