
Overview

The SyftboxManager is the central class for interacting with the SyftBox ecosystem. It provides different capabilities depending on whether you're a Data Owner (DO) or a Data Scientist (DS).
Data Owner capabilities:
  • Create and manage datasets
  • Approve/reject peer requests
  • Process and execute approved jobs
  • Create checkpoints for efficient syncing
Data Scientist capabilities:
  • Access shared datasets
  • Submit jobs to Data Owners
  • Connect with Data Owners as peers

Creating Instances

Instances are typically created via login functions:
import syft_client as sc

# Data Scientist
ds_client = sc.login_ds(email="[email protected]")

# Data Owner
do_client = sc.login_do(email="[email protected]")

Class Methods

for_colab()

Create a SyftboxManager instance for a Google Colab environment.
from syft_client.sync.syftbox_manager import SyftboxManager

# Data Scientist
client = SyftboxManager.for_colab(
    email="[email protected]",
    only_ds=True
)

# Data Owner
client = SyftboxManager.for_colab(
    email="[email protected]",
    only_datasite_owner=True
)
  • email (str, required): User email address.
  • only_ds (bool, default False): Initialize as Data Scientist (cannot be True together with only_datasite_owner).
  • only_datasite_owner (bool, default False): Initialize as Data Owner (cannot be True together with only_ds).
  • Returns (SyftboxManager): Configured instance with Colab-specific settings.

for_jupyter()

Create a SyftboxManager instance for a Jupyter environment.
from syft_client.sync.syftbox_manager import SyftboxManager

client = SyftboxManager.for_jupyter(
    email="[email protected]",
    only_ds=True,
    token_path="/path/to/token.json"
)
  • email (str, required): User email address.
  • only_ds (bool, default False): Initialize as Data Scientist.
  • only_datasite_owner (bool, default False): Initialize as Data Owner.
  • token_path (Path | None, default None): Path to the Google Drive OAuth token file.

Properties

email

print(client.email)  # "[email protected]"
  • email (str): Email address of the current user.

syftbox_folder

print(client.syftbox_folder)  # PosixPath('/home/user/[email protected]')
  • syftbox_folder (Path): Base directory for SyftBox files and datasets.

peers

Get list of connected peers. Automatically syncs before returning if PRE_SYNC=true (default).
# Auto-sync enabled (default)
for peer in client.peers:
    print(f"{peer.email}: {peer.state}")

# Disable auto-sync
import os
os.environ["PRE_SYNC"] = "false"
peers = client.peers
  • peers (PeerList): For Data Owners, the list of approved peers plus pending requests (approved first). For Data Scientists, the list of connected Data Owners (all marked as ACCEPTED).
Auto-sync behavior:
  • Default: PRE_SYNC=true - syncs before returning
  • Disable: Set PRE_SYNC=false environment variable
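The PRE_SYNC gate can be sketched as a simple environment check. This is a hypothetical helper, not the client's real implementation; the exact parsing rules are an assumption:

```python
import os

def pre_sync_enabled() -> bool:
    # Hypothetical helper: mirrors the documented behavior where
    # PRE_SYNC defaults to "true" and "false" disables auto-sync
    # (case-insensitive parsing is an assumption).
    return os.environ.get("PRE_SYNC", "true").lower() == "true"

os.environ.pop("PRE_SYNC", None)      # unset -> auto-sync on
print(pre_sync_enabled())             # True

os.environ["PRE_SYNC"] = "false"      # opt out
print(pre_sync_enabled())             # False
```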

jobs

Get list of jobs. Automatically syncs before returning if PRE_SYNC=true (default).
# Review jobs (auto-syncs by default)
for job in client.jobs:
    print(f"{job.name}: {job.status}")
  • jobs (JobsList): List of job objects with status, submitter, and execution details.

datasets

Get dataset manager. Automatically syncs before returning if PRE_SYNC=true (default).
# Access datasets (auto-syncs by default)
for dataset in client.datasets.get_all():
    print(f"{dataset.name} by {dataset.owner}")
  • datasets (SyftDatasetManager): Dataset manager for querying and accessing datasets.

is_do

if client.is_do:
    print("This is a Data Owner client")
  • is_do (bool): True for a Data Owner instance, False for a Data Scientist.

Sync Methods

sync()

Sync local state with Google Drive.
# Basic sync
client.sync()

# Sync with custom checkpoint settings
client.sync(auto_checkpoint=True, checkpoint_threshold=100)
  • auto_checkpoint (bool, default True): Automatically create a checkpoint when the event count exceeds the threshold (DO only).
  • checkpoint_threshold (int, default 50): Create a checkpoint when events since the last checkpoint >= this value.
Behavior:
For Data Owners:
  1. Loads peer list
  2. Filters to version-compatible peers (warns about incompatible)
  3. Syncs with compatible peers
  4. Optionally creates checkpoint if threshold exceeded
For Data Scientists:
  1. Loads peer list
  2. Warns if all connected peers are incompatible
  3. Syncs down from connected peers
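The Data Owner steps above can be sketched in plain Python. This is a toy illustration of the control flow only; the pair-based peer representation is an assumption standing in for real Peer objects and Google Drive transport:

```python
import warnings

def do_sync(peers, events_since_checkpoint,
            auto_checkpoint=True, checkpoint_threshold=50):
    # Hypothetical sketch of the DO sync flow; `peers` is a list of
    # (email, is_version_compatible) pairs.
    compatible = [email for email, ok in peers if ok]
    incompatible = [email for email, ok in peers if not ok]
    if incompatible:                                   # step 2: warn
        warnings.warn(f"version-incompatible peers skipped: {incompatible}")
    synced = compatible                                # step 3: sync with them
    made_checkpoint = (auto_checkpoint and             # step 4: checkpoint
                       events_since_checkpoint >= checkpoint_threshold)
    return synced, made_checkpoint

peers = [("[email protected]", True), ("[email protected]", False)]
synced, checkpointed = do_sync(peers, events_since_checkpoint=120,
                               checkpoint_threshold=100)
print(synced)        # ['[email protected]']
print(checkpointed)  # True
```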

load_peers()

Load peer list from connection router.
client.load_peers()
Refreshes the peer list from Google Drive. Called automatically by sync() and when accessing the peers property.

Peer Management

add_peer()

Add a peer connection request.
# Add a peer
client.add_peer("[email protected]")

# Force re-add even if already exists
client.add_peer("[email protected]", force=True, verbose=False)
  • peer_email (str, required): Email address of the peer to add.
  • force (bool, default False): Re-add the peer even if it already exists.
  • verbose (bool, default True): Print status messages.

approve_peer_request()

Approve a pending peer request. Data Owner only.
# Approve by email
client.approve_peer_request("[email protected]")

# Approve by peer object
peer = client.peers[0]
client.approve_peer_request(peer, verbose=False)

# Skip existence check (for testing)
client.approve_peer_request(
    "[email protected]",
    peer_must_exist=False
)
  • email_or_peer (str | Peer, required): Email address or Peer object to approve.
  • verbose (bool, default True): Print approval status.
  • peer_must_exist (bool, default True): Require the peer request to exist before approving.
Side effects:
  • Sets up DS job folder for the approved peer
  • Shares all "any"-permission datasets with the peer
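The second side effect can be sketched as a filter over the owner's datasets. This is a hypothetical illustration; the (name, users) pairs stand in for real Dataset objects, and the actual Drive-level sharing and job-folder setup are not shown:

```python
def datasets_to_share(datasets):
    # Hypothetical sketch: on approval, every dataset whose permission
    # is "any" gets shared with the newly approved peer.
    return [name for name, users in datasets if users == "any"]

datasets = [("public_data", "any"),
            ("medical_data", ["[email protected]"])]
print(datasets_to_share(datasets))  # ['public_data']
```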

reject_peer_request()

Reject a pending peer request. Data Owner only.
client.reject_peer_request("[email protected]")
  • email_or_peer (str | Peer, required): Email address or Peer object to reject.

Dataset Management

create_dataset()

Create and optionally share a dataset. Data Owner only.
# Create dataset shared with specific users
dataset = client.create_dataset(
    name="medical_data",
    mock_path="mock_data.csv",
    private_path="private_data.csv",
    users=["[email protected]", "[email protected]"]
)

# Create dataset shared with anyone
dataset = client.create_dataset(
    name="public_data",
    mock_path="mock.csv",
    users="any"
)

# Create with private data uploaded
dataset = client.create_dataset(
    name="sensitive_data",
    mock_path="mock.csv",
    private_path="private.csv",
    users="any",
    upload_private=True
)
  • name (str, required): Dataset name.
  • mock_path (str | Path, required): Path to the mock/sample data file.
  • private_path (str | Path, optional): Path to the private data file.
  • users (list[str] | str | None, default None): List of user emails to share with, or "any" for public sharing.
  • upload_private (bool, default False): Upload private data to the owner-only collection.
  • sync (bool, default True): Sync after dataset creation.
  • Returns (Dataset): Created dataset object.

share_dataset()

Share an existing dataset with additional users. Data Owner only.
# Share with specific users
client.share_dataset("medical_data", ["[email protected]"])

# Share with anyone
client.share_dataset("public_data", "any")
  • tag (str, required): Dataset name.
  • users (list[str] | str, required): List of email addresses, or "any".
  • sync (bool, default True): Sync after sharing.

delete_dataset()

Delete a dataset. Data Owner only.
client.delete_dataset(name="old_data", sync=True)
  • name (str, required): Dataset name to delete.
  • sync (bool, default True): Sync after deletion.

Job Management

submit_python_job()

Submit a Python job to a Data Owner. Data Scientist only.
client.submit_python_job(
    user="[email protected]",
    script="analysis.py",
    description="Analyze medical data"
)

# Force submission even if version incompatible
client.submit_python_job(
    user="[email protected]",
    script="analysis.py",
    force_submission=True
)
  • user (str, required): Data Owner email to submit the job to.
  • script (str | Path, required): Path to the Python script file.
  • description (str, optional): Job description.
  • sync (bool, default True): Sync after submission.
  • force_submission (bool, default False): Skip the version compatibility check.

submit_bash_job()

Submit a Bash job to a Data Owner. Data Scientist only.
client.submit_bash_job(
    user="[email protected]",
    script="process.sh",
    description="Process data files"
)
  • user (str, required): Data Owner email to submit the job to.
  • script (str | Path, required): Path to the Bash script file.
  • description (str, optional): Job description.
  • sync (bool, default True): Sync after submission.
  • force_submission (bool, default False): Skip the version compatibility check.

process_approved_jobs()

Execute all approved jobs. Data Owner only.
# Process with default settings
client.process_approved_jobs()

# Custom settings
client.process_approved_jobs(
    stream_output=True,
    timeout=600,
    force_execution=False,
    share_outputs_with_submitter=True,
    share_logs_with_submitter=True
)
  • stream_output (bool, default True): Stream job output in real time (False = capture at the end).
  • timeout (int | None, default None): Timeout in seconds per job (default 300, or the SYFT_DEFAULT_JOB_TIMEOUT_SECONDS env var).
  • force_execution (bool, default False): Process all jobs regardless of version compatibility.
  • share_outputs_with_submitter (bool, default False): Grant the submitter read access to job outputs.
  • share_logs_with_submitter (bool, default False): Grant the submitter read access to job logs.
Behavior:
  • Automatically syncs after processing if PRE_SYNC=true (default)
  • Skips jobs from version-incompatible peers unless force_execution=True
  • Prints warnings for skipped jobs
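The skip rule can be sketched as a selection pass over approved jobs. This is a hypothetical illustration of the documented behavior, with (name, compatible) pairs standing in for real job objects:

```python
def select_jobs(jobs, force_execution=False):
    # Hypothetical sketch: jobs from version-incompatible peers are
    # skipped with a warning unless force_execution is set.
    to_run, skipped = [], []
    for name, compatible in jobs:
        (to_run if compatible or force_execution else skipped).append(name)
    for name in skipped:
        print(f"Warning: skipping {name} (incompatible peer version)")
    return to_run

jobs = [("analysis", True), ("legacy_job", False)]
print(select_jobs(jobs))                        # ['analysis']
print(select_jobs(jobs, force_execution=True))  # ['analysis', 'legacy_job']
```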

Checkpoint Management

create_checkpoint()

Create a checkpoint of current state. Data Owner only.
checkpoint = client.create_checkpoint()
print(f"Created checkpoint with {len(checkpoint.file_hashes)} files")
  • Returns (Checkpoint): Checkpoint object containing a snapshot of all files and hashes.
Purpose: Checkpoints allow new peers to sync quickly by downloading a snapshot instead of replaying all historical events.
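A toy illustration of why bootstrapping from a snapshot beats replaying history: both paths reach the same state, but the checkpoint path only replays events newer than the snapshot. The event format here is an assumption for illustration, not the real sync protocol:

```python
def apply(state, events):
    # Apply write/delete events to a {filename: content} dict.
    for op, name, content in events:
        if op == "write":
            state[name] = content
        else:  # "delete"
            state.pop(name, None)
    return state

history = [("write", "a.csv", "v1"), ("write", "b.csv", "v1"),
           ("delete", "a.csv", None), ("write", "b.csv", "v2")]

# New peer, no checkpoint: replay the entire history.
full_replay = apply({}, history)

# New peer, with checkpoint: download the snapshot taken after the
# first two events, then replay only the events that came after it.
snapshot = {"a.csv": "v1", "b.csv": "v1"}
from_checkpoint = apply(dict(snapshot), history[2:])

print(full_replay == from_checkpoint)  # True
```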

should_create_checkpoint()

Check if checkpoint should be created based on event count.
if client.should_create_checkpoint(threshold=100):
    client.create_checkpoint()
  • threshold (int, default 50): Create a checkpoint if events since the last checkpoint >= this value.
  • Returns (bool): True if a checkpoint should be created, False otherwise.

try_create_checkpoint()

Automatically create checkpoint if threshold exceeded.
# Create checkpoint if >= 100 events since last one
checkpoint = client.try_create_checkpoint(threshold=100)
if checkpoint:
    print("Checkpoint created")
  • threshold (int, default 50): Event count threshold.
  • Returns (Checkpoint | None): The created checkpoint if the threshold was exceeded, None otherwise.
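Conceptually, try_create_checkpoint() composes the two previous calls. A sketch under that assumption, where the event counter and the create callback are stand-ins for the client's internal state:

```python
def try_create_checkpoint(events_since_last, threshold=50,
                          create=lambda: "checkpoint"):
    # Hypothetical sketch: should_create_checkpoint() ...
    if events_since_last >= threshold:
        return create()                 # ... then create_checkpoint()
    return None                         # threshold not reached

print(try_create_checkpoint(120, threshold=100))  # checkpoint
print(try_create_checkpoint(10, threshold=100))   # None
```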

Cleanup Methods

delete_syftbox()

Delete all SyftBox state: Google Drive files, local caches, and folders.
# Full cleanup with event broadcasting
client.delete_syftbox(verbose=True)

# Quick cleanup without broadcasting (testing)
client.delete_syftbox(broadcast_delete_events=False)
  • verbose (bool, default True): Print deletion progress.
  • broadcast_delete_events (bool, default True): Broadcast is_deleted events to peers before deleting (DO only).
Cleanup process:
  1. Gathers all files from folder hierarchy
  2. Finds orphaned files by name pattern
  3. Deletes all files from Google Drive
  4. Broadcasts delete events to peers (if DO)
  5. Clears in-memory and filesystem caches
  6. Deletes local SyftBox folder and cache directories
This operation is irreversible. All datasets, jobs, and sync history will be permanently deleted.
