Data Owner Guide

This guide covers everything you need to know as a data owner using Syft Client to securely share datasets and manage computational jobs.

Overview

As a data owner, you’ll use Syft Client to:
  • Host private datasets with mock data previews
  • Approve or reject peer connection requests
  • Review and approve computational jobs
  • Execute jobs on your private data
  • Share results with data scientists

Getting Started

Login

Use login_do() to authenticate as a data owner:
import syft_client as sc

# For Google Colab (auto-detects your Google account)
client = sc.login_do()

# For Jupyter (requires token file)
client = sc.login_do(
    email="[email protected]",
    token_path="path/to/token_do.json"
)
The login_do() function configures your client as a data owner with additional permissions for dataset management and job execution.

Login Parameters

Parameter     Type                Default   Description
email         str | None          None      Your email address. Auto-detected in Colab.
sync          bool                True      Sync with Google Drive on login
load_peers    bool                True      Load peer connections on login
token_path    str | Path | None   None      Path to OAuth token file (Jupyter only)
Source: syft_client/sync/login.py:56-89

Managing Peer Connections

View Peer Requests

# See all peers (approved and pending)
peers = client.peers
print(peers)

# Filter by state
for peer in peers:
    if peer.is_pending:
        print(f"Pending: {peer.email}")
    elif peer.is_approved:
        print(f"Approved: {peer.email}")

Approve Peer Requests

1. Load latest peer requests

client.load_peers()
peers = client.peers
2. Review pending requests

pending = [p for p in peers if p.is_pending]
for peer in pending:
    print(f"Request from: {peer.email}")
3. Approve or reject

# Approve a peer
client.approve_peer_request("[email protected]")

# Reject a peer
client.reject_peer_request("[email protected]")
Source: syft_client/sync/syftbox_manager.py:765-808
When you approve a peer, they automatically gain access to:
  • Datasets you’ve shared with “any” permission
  • The ability to submit jobs to your datasite

Creating and Managing Datasets

Create a Dataset

# Create dataset with mock and private data
client.create_dataset(
    name="patient-records",
    mock_path="/path/to/mock_data.csv",
    private_path="/path/to/real_data.csv",
    summary="Anonymized patient records for research",
    readme_path="/path/to/README.md",
    tags=["healthcare", "research"],
    users=["[email protected]"]  # Specific users
)

Dataset Permission Models

Share with Specific Users

# Share with a list of email addresses
client.create_dataset(
    name="restricted-data",
    mock_path="mock.csv",
    private_path="private.csv",
    summary="Restricted access dataset",
    users=["[email protected]", "[email protected]"]
)

Share with All Approved Peers

# Share with anyone who has been approved
client.create_dataset(
    name="public-research-data",
    mock_path="mock.csv",
    private_path="private.csv",
    summary="Publicly available research data",
    users="any"  # Available to all approved peers
)
Source: syft_client/sync/syftbox_manager.py:920-1005

Upload Private Data Separately

For sensitive datasets, you can upload private data to a separate owner-only collection:
client.create_dataset(
    name="sensitive-data",
    mock_path="mock.csv",
    private_path="private.csv",
    summary="Sensitive dataset with separate private storage",
    users="any",
    upload_private=True  # Stores private data separately
)
When upload_private=True, private data is stored in a separate collection folder with owner-only access, providing an extra layer of security.
Source: Tests in tests/unit/test_dataset_upload_private.py

Share Existing Datasets

Update permissions on existing datasets:
# Share with additional users
client.share_dataset(
    tag="patient-records",
    users=["[email protected]"]
)

# Make available to all approved peers
client.share_dataset(
    tag="patient-records",
    users="any"
)
Source: syft_client/sync/syftbox_manager.py:1047-1099

Delete Datasets

client.delete_dataset(name="old-dataset")

List Your Datasets

# View all datasets you've created
datasets = client.datasets.get_all()
for dataset in datasets:
    print(f"{dataset.name}: {dataset.summary}")

Managing Jobs

View Submitted Jobs

# See all jobs from data scientists
jobs = client.jobs

for job in jobs:
    print(f"Job: {job.name}")
    print(f"Status: {job.status}")
    print(f"Submitted by: {job.submitted_by}")
    print(f"Submitted at: {job.submitted_at}")
    print("---")

Review and Approve Jobs

1. Sync to get latest jobs

client.sync()
jobs = client.jobs
2. Review pending jobs

pending_jobs = [j for j in jobs if j.status == "pending"]

for job in pending_jobs:
    print(f"Job: {job.name}")
    print(f"From: {job.submitted_by}")
    
    # Review the code
    code_path = job.job_dir / "analysis.py"
    with open(code_path, "r") as f:
        print(f.read())
3. Approve or reject

# Approve a job
job = jobs[0]
job.approve()

# Or reject with a reason
job.reject(reason="Code accesses unauthorized resources")

Execute Approved Jobs

# Process all approved jobs
client.process_approved_jobs(
    stream_output=True,           # Show real-time output
    timeout=300,                  # 5 minute timeout per job
    share_outputs_with_submitter=True,  # Share results
    share_logs_with_submitter=True      # Share execution logs
)

Execution Parameters

Parameter                      Type         Default   Description
stream_output                  bool         True      Stream output in real-time vs. capture at end
timeout                        int | None   300       Timeout in seconds per job
force_execution                bool         False     Execute jobs even from incompatible client versions
share_outputs_with_submitter   bool         False     Grant submitter read access to outputs
share_logs_with_submitter      bool         False     Grant submitter read access to logs
Source: syft_client/sync/syftbox_manager.py:823-890

Version Compatibility

By default, jobs from data scientists with incompatible Syft Client versions are skipped:
# Force execution of all jobs (bypass version checks)
client.process_approved_jobs(force_execution=True)
Use force_execution=True with caution. Jobs from incompatible versions may fail or produce unexpected results.

Advanced Job Management

Job Execution Flow

When you run process_approved_jobs(), the following happens:
  1. Sync - Pull latest job submissions from Google Drive
  2. Version Check - Verify submitter’s client version is compatible
  3. Environment Setup - Create isolated virtual environment
  4. Dependency Installation - Install required packages
  5. Code Execution - Run the job script
  6. Result Storage - Save outputs to outputs/ folder
  7. Sync Results - Upload results to Google Drive (if sharing enabled)
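The flow above can be sketched as the following loop. This is an illustrative stand-in, not the actual implementation: the helper name and the job fields (`status`, `version_compatible`) are assumptions, and the environment/execution steps are stubbed out.

```python
# Hypothetical sketch of the process_approved_jobs() flow.
# All names here are illustrative, not the real Syft Client internals.
def process_approved_jobs_sketch(jobs, force_execution=False):
    results = []
    for job in jobs:
        # Only approved jobs are considered
        if job["status"] != "approved":
            continue
        # Step 2: version check; incompatible submitters are skipped unless forced
        if not job.get("version_compatible", False) and not force_execution:
            results.append((job["name"], "skipped"))
            continue
        # Steps 3-5: environment setup, dependency install, execution (stubbed)
        output = f"ran {job['name']}"
        # Step 6: record the result (the real client writes to outputs/)
        results.append((job["name"], output))
    return results

jobs = [
    {"name": "analysis", "status": "approved", "version_compatible": True},
    {"name": "old-job", "status": "approved", "version_compatible": False},
]
print(process_approved_jobs_sketch(jobs))
# → [('analysis', 'ran analysis'), ('old-job', 'skipped')]
```

The key design point is that the version check happens per job, so one incompatible submission does not block the rest of the queue.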

Share Job Results Manually

You can also share results after execution:
# Execute jobs without auto-sharing
client.process_approved_jobs(
    share_outputs_with_submitter=False,
    share_logs_with_submitter=False
)

# Review results, then share manually
client.job_runner.share_job_results(
    job_name="Analysis Job",
    share_outputs=True,
    share_logs=True
)
Source: Test examples in tests/unit/test_sync_manager.py:545-558

Job Directory Structure

Executed jobs create the following structure:
SyftBox_{your_email}/
└── [email protected]/
    └── jobs/
        └── [email protected]/
            └── Analysis Job/
                ├── config.yaml      # Job metadata
                ├── run.sh          # Execution script
                ├── analysis.py     # Submitted code
                ├── outputs/        # Computation results
                │   └── result.json
                └── logs/           # Execution logs
                    ├── stdout.txt
                    └── stderr.txt

Performance Optimization

Checkpoints

Checkpoints speed up sync operations by creating snapshots of your datasite state:
# Manually create a checkpoint
client.create_checkpoint()

# Check if checkpoint is needed (>= 50 events since last checkpoint)
if client.should_create_checkpoint(threshold=50):
    client.create_checkpoint()

# Auto-checkpoint during sync (default behavior)
client.sync(auto_checkpoint=True, checkpoint_threshold=50)
Checkpoints are automatically created every 50 events during sync() operations. This significantly reduces initial sync time for new peers.
Source: syft_client/sync/syftbox_manager.py:1246-1294

Disable Auto-Sync

For better performance when making multiple API calls:
import os
os.environ["PRE_SYNC"] = "false"

# Now these won't auto-sync
datasets = client.datasets.get_all()
jobs = client.jobs

# Manually sync when ready
client.sync()

Best Practices

Dataset Security

  1. Always use mock data - Never include real private data in mock files
  2. Review job code carefully - Ensure jobs only access authorized datasets
  3. Use specific permissions - Prefer explicit user lists over “any”
  4. Enable private upload - Use upload_private=True for highly sensitive data
  5. Monitor access - Regularly review approved peers and revoke access when needed

Job Review Checklist

Before approving a job, verify:
  • Code only accesses datasets the submitter has permission for
  • No attempts to access network resources (if prohibited)
  • No attempts to access system files outside the job directory
  • Dependencies are from trusted sources
  • Output volume is reasonable (won’t fill disk)
  • Execution time is acceptable (set appropriate timeout)
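Part of this checklist can be mechanized with a static scan of the submitted code for imports that suggest network or system access. The sketch below uses only the standard library; the helper name and module list are illustrative and should be tuned to your own policy, and a clean scan never replaces reading the code yourself.

```python
import ast

# Modules that often indicate network, process, or filesystem access.
# This list is illustrative; adjust it to your own review policy.
SUSPICIOUS_MODULES = {"socket", "requests", "urllib", "subprocess", "shutil"}

def flag_suspicious_imports(source: str) -> set:
    """Return the suspicious top-level modules imported by the given source."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & SUSPICIOUS_MODULES

print(flag_suspicious_imports("import pandas\nimport socket"))
# → {'socket'}
```

You could run this over each file from `job.job_dir.rglob("*.py")` before opening it for manual review, and give flagged files extra scrutiny.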

Example Job Review Code

import os
os.environ["PRE_SYNC"] = "false"  # Faster when reviewing multiple jobs

client.sync()
jobs = client.jobs

for job in jobs:
    if job.status == "pending":
        print(f"\n{'='*60}")
        print(f"Job: {job.name}")
        print(f"From: {job.submitted_by}")
        print(f"Submitted: {job.submitted_at}")
        print(f"{'='*60}")
        
        # Show the code
        code_files = list(job.job_dir.rglob("*.py"))
        for code_file in code_files:
            print(f"\n--- {code_file.name} ---")
            with open(code_file, "r") as f:
                print(f.read())
        
        # Manual approval decision
        decision = input("Approve? (y/n): ")
        if decision.lower() == "y":
            job.approve()
            print("✓ Approved")
        else:
            reason = input("Rejection reason: ")
            job.reject(reason=reason)
            print("✗ Rejected")

# Execute all approved jobs
client.process_approved_jobs(
    timeout=300,
    share_outputs_with_submitter=True,
    share_logs_with_submitter=True
)

Automated Workflows

Auto-Approve Trusted Peers

For trusted collaborators, you can set up auto-approval:
TRUSTED_PEERS = [
    "[email protected]",
    "[email protected]"
]

client.load_peers()
for peer in client.peers:
    if peer.is_pending and peer.email in TRUSTED_PEERS:
        client.approve_peer_request(peer.email)
        print(f"Auto-approved: {peer.email}")
Auto-approval scripts should be used with extreme caution and only for highly trusted collaborators.
Source: Example script in scripts/auto_approve_peers_and_share.py

Cleanup and Maintenance

Delete Your Syftbox

To completely remove all Syft data:
# Delete all Google Drive files, local caches, and folders
client.delete_syftbox(
    verbose=True,                    # Show deletion progress
    broadcast_delete_events=True     # Notify peers of deletion
)
This operation is irreversible and will delete:
  • All Google Drive files and folders
  • Local SyftBox directory and caches
  • Event history and checkpoints
Peers will be notified that your datasite is being deleted.
Source: syft_client/sync/syftbox_manager.py:1170-1240

Environment Variables

Variable                           Default   Description
PRE_SYNC                           "true"    Auto-sync before accessing datasets/jobs/peers
SYFTCLIENT_TOKEN_PATH              None      Default token path for authentication
SYFTCLIENT_DEV_MODE                False     Enable development mode features
SYFT_DEFAULT_JOB_TIMEOUT_SECONDS   300       Default job execution timeout
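Environment variables are read at runtime, so set them before creating the client. For example, to disable auto-sync and raise the default job timeout (variable names are from the table above; the values shown are just examples):

```python
import os

# Set before calling sc.login_do() so the client picks these up.
os.environ["PRE_SYNC"] = "false"                        # skip auto-sync on access
os.environ["SYFT_DEFAULT_JOB_TIMEOUT_SECONDS"] = "600"  # 10-minute default timeout

print(os.environ["PRE_SYNC"], os.environ["SYFT_DEFAULT_JOB_TIMEOUT_SECONDS"])
```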

Common Issues

Peer approval not visible to data scientist

Ensure both parties sync:
client.approve_peer_request("[email protected]")
client.sync()  # Push approval to Google Drive
The data scientist should also sync:
client.sync()  # Pull approval

Job execution fails with missing dependencies

Check that all dependencies are specified in the job submission:
# Data scientist should include all dependencies
client.submit_python_job(
    user="[email protected]",
    code_path="analysis.py",
    dependencies=["pandas==2.0.0", "numpy", "scikit-learn"]
)

“Version unknown” warnings when processing jobs

The data scientist’s client version couldn’t be determined. Either:
  1. Have them upgrade to a newer version of syft-client
  2. Use force_execution=True to bypass version checks (use with caution)

Next Steps

Authentication Setup

Set up OAuth tokens for Jupyter environments

Data Scientist Guide

Understand the data scientist workflow

Notebooks Guide

Learn notebook-specific workflows

API Reference

Explore the full API documentation
