
Data Scientist Guide

This guide covers everything you need to know as a data scientist using Syft Client to collaborate on private data through peer-to-peer connections.

Overview

As a data scientist, you’ll use Syft Client to:
  • Connect to data owners’ datasites
  • Discover and access shared datasets
  • Submit computational jobs to run on private data
  • Retrieve results from approved jobs

Getting Started

Login

Use login_ds() to authenticate and connect to the Syft network:
import syft_client as sc

# For Google Colab (auto-detects your Google account)
client = sc.login_ds()

# For Jupyter (requires token file)
client = sc.login_ds(
    email="[email protected]",
    token_path="path/to/token_ds.json"
)
The login_ds() function automatically detects your environment (Colab or Jupyter) and configures the appropriate authentication method.

Login Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `email` | `str \| None` | `None` | Your email address. Auto-detected in Colab. |
| `sync` | `bool` | `True` | Sync with Google Drive on login |
| `load_peers` | `bool` | `True` | Load peer connections on login |
| `token_path` | `str \| Path \| None` | `None` | Path to OAuth token file (Jupyter only) |
Source: syft_client/sync/login.py:19-53

Working with Peers

Adding a Data Owner

Before you can access datasets or submit jobs, you need to connect with a data owner:
# Add a peer connection (sends a request to the data owner)
client.add_peer("[email protected]")

Viewing Your Peers

# See all your peer connections
peers = client.peers
print(peers)
Peers will show as “outstanding” until the data owner approves your request.
By default, client.peers automatically syncs before returning results. To disable auto-sync, set the environment variable: PRE_SYNC=false
Source: syft_client/sync/syftbox_manager.py:407-432
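A quick way to spot requests that are still awaiting approval is a small filter over `client.peers`. This is a hypothetical helper, not part of the syft_client API, and it assumes each peer object exposes `email` and `status` attributes, with `"outstanding"` meaning not yet approved (as described above):

```python
def outstanding_peers(peers):
    """Return the emails of peers whose requests are still pending approval."""
    return [p.email for p in peers if p.status == "outstanding"]
```

Usage: `outstanding_peers(client.peers)` gives you the list of data owners you may still need to follow up with.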

Discovering Datasets

List Available Datasets

Once a data owner approves your peer request and shares datasets with you:
# View all datasets you have access to
datasets = client.datasets.get_all()
print(datasets)

Access a Specific Dataset

# Get a dataset by name and owner
dataset = client.datasets.get(
    name="my dataset",
    datasite="[email protected]"
)

# Access mock data files
mock_file = dataset.mock_files[0]
with open(mock_file, "r") as f:
    data = f.read()

# View dataset metadata
print(dataset.name)
print(dataset.summary)
print(dataset.tags)
You can only access mock data directly. To work with the real private data, you must submit a job that runs on the data owner’s machine.
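A practical pattern is to keep your analysis in a plain function that takes the file contents as input, so the same function can be exercised against mock data locally and then shipped unchanged in a job. A toy sketch (the analysis itself is illustrative):

```python
def summarize(data: str) -> dict:
    """Toy analysis that works identically on mock and private data."""
    lines = data.splitlines()
    return {"length": len(data), "lines": len(lines)}

# Locally, exercise it on a mock file:
#   with open(dataset.mock_files[0], "r") as f:
#       print(summarize(f.read()))
```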

Resolve Dataset Paths in Jobs

Use sc.resolve_dataset_file_path() in your job code to reference datasets:
import syft_client as sc

# Automatically resolves to the correct path
data_path = sc.resolve_dataset_file_path("my dataset")
with open(data_path, "r") as f:
    data = f.read()
Source: syft_client/utils.py and test examples in tests/unit/test_sync_manager.py:584-606

Submitting Jobs

Python Jobs

Step 1: Write your analysis code

Create a Python file that performs your analysis:
# analysis.py
import syft_client as sc
import json

# Load dataset
data_path = sc.resolve_dataset_file_path("my dataset")
with open(data_path, "r") as f:
    data = f.read()

# Perform analysis
result = {"length": len(data), "summary": "Analyzed data"}

# Write results to outputs/ folder
with open("outputs/result.json", "w") as f:
    f.write(json.dumps(result))
Step 2: Submit the job

client.submit_python_job(
    user="[email protected]",
    code_path="/path/to/analysis.py",
    job_name="My Analysis",
    dependencies=["pandas==2.0.0", "numpy"]
)
Step 3: Wait for approval

The data owner will review and approve your job. You can check status:
jobs = client.jobs
for job in jobs:
    print(f"{job.name}: {job.status}")
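If you want to block until a job leaves the pending state, the status check above can be wrapped in a polling loop. This is a sketch, not part of the syft_client API: it assumes job objects expose `name` and `status` attributes and that `"pending"` is the pre-approval state, and it takes the job fetcher as a callable (e.g. `lambda: client.jobs`) so each poll triggers a fresh sync:

```python
import time

def wait_for_job(get_jobs, job_name, poll_seconds=30, timeout=3600):
    """Poll until the named job leaves the 'pending' state, then return its status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for job in get_jobs():
            if job.name == job_name and job.status != "pending":
                return job.status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still pending after {timeout}s")
```

Usage: `wait_for_job(lambda: client.jobs, "My Analysis")`.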
Step 4: Retrieve results

Once the job is executed, sync to get results:
import json

client.sync()

# Access job outputs
job = client.jobs[0]
output_path = job.output_paths[0]
with open(output_path, "r") as f:
    result = json.load(f)
print(result)

Submit Folder-Based Jobs

For complex projects with multiple files:
client.submit_python_job(
    user="[email protected]",
    code_path="/path/to/project_folder",
    job_name="Complex Analysis",
    entrypoint="main.py",  # Entry point within the folder
    dependencies=["scikit-learn", "matplotlib"]
)
Source: packages/syft-job/src/syft_job/client.py:308-415

Bash Jobs

Submit shell scripts for simple tasks:
bash_script = """\
#!/bin/bash
echo "Starting analysis..."
python analyze.py
echo "Done!"
"""

client.submit_bash_job(
    user="[email protected]",
    script=bash_script,
    job_name="Bash Analysis"
)
Source: packages/syft-job/src/syft_job/client.py:106-169

Job Workflow Details

Job Directory Structure

When you submit a job, the following structure is created:
SyftBox_{your_email}/
└── [email protected]/
    └── jobs/
        └── [email protected]/
            └── My Analysis/
                ├── config.yaml      # Job metadata
                ├── run.sh          # Execution script
                ├── analysis.py     # Your code
                ├── outputs/        # Results folder (created during execution)
                └── logs/           # Execution logs
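The layout above can be expressed as a small path helper if you want to inspect a job's folder directly. This is purely an illustration of the structure shown, not an official helper; the argument names are hypothetical:

```python
from pathlib import Path

def job_dir(syftbox_root, owner_email, submitter_email, job_name):
    """Reconstruct a job's folder path from the directory layout above."""
    return Path(syftbox_root) / owner_email / "jobs" / submitter_email / job_name
```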

Job Configuration

Each job includes a config.yaml with metadata:
name: My Analysis
submitted_by: [email protected]
submitted_at: '2026-03-02T10:30:00Z'
type: python
code_path: /path/to/analysis.py
entry_point: analysis.py
dependencies:
  - syft-client
  - pandas==2.0.0
  - numpy

Dependencies

Your job automatically includes syft-client as a dependency. The data owner’s machine will:
  1. Create a virtual environment
  2. Install all specified dependencies
  3. Execute your code
Source: packages/syft-job/src/syft_job/client.py:387-408

Syncing Data

Syft Client syncs data with Google Drive to coordinate with peers:
# Manually trigger a sync
client.sync()

# Disable auto-sync for performance
import os
os.environ["PRE_SYNC"] = "false"

# Now you need to manually sync
client.sync()
datasets = client.datasets.get_all()
Auto-sync behavior:
  • client.datasets - syncs before returning
  • client.jobs - syncs before returning
  • client.peers - syncs before returning
Set PRE_SYNC=false to disable this behavior for better performance when making multiple calls.
Source: syft_client/sync/syftbox_manager.py:727-755

Best Practices

Job Submission

  1. Test locally first - Verify your code works with mock data before submitting
  2. Use specific dependency versions - Pin versions to avoid compatibility issues
  3. Write outputs to the outputs/ folder - This is the standard location for results
  4. Handle errors gracefully - Include try/except blocks in your code
  5. Keep jobs focused - Break complex analyses into smaller jobs
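Practice 2 (pinning dependency versions) is easy to check mechanically before you submit. A minimal sketch, not part of the syft_client API, that flags dependencies lacking an exact `==` pin:

```python
def unpinned(dependencies):
    """Return dependencies that lack an exact '==' version pin."""
    return [d for d in dependencies if "==" not in d]
```

For example, `unpinned(["pandas==2.0.0", "numpy"])` would flag `numpy` as worth pinning before submission.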

Code Example with Error Handling

# robust_analysis.py
import syft_client as sc
import json
import sys

try:
    # Load dataset
    data_path = sc.resolve_dataset_file_path("my dataset")
    with open(data_path, "r") as f:
        data = f.read()
    
    # Perform analysis
    result = {"status": "success", "length": len(data)}
    
except Exception as e:
    # Write error to outputs
    result = {"status": "error", "message": str(e)}
    print(f"Error: {e}", file=sys.stderr)

finally:
    # Always write a result
    with open("outputs/result.json", "w") as f:
        f.write(json.dumps(result))

Working with Multiple Data Owners

# Connect to multiple data owners
client.add_peer("[email protected]")
client.add_peer("[email protected]")

# Access datasets from different owners
dataset_a = client.datasets.get("patient-records", datasite="[email protected]")
dataset_b = client.datasets.get("patient-records", datasite="[email protected]")

# Submit jobs to different owners
client.submit_python_job(
    user="[email protected]",
    code_path="analyze_a.py",
    job_name="Analysis A"
)

client.submit_python_job(
    user="[email protected]",
    code_path="analyze_b.py",
    job_name="Analysis B"
)
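When the same analysis goes to several owners, building the submission arguments in one place keeps the calls consistent. A small sketch (the spec-builder is hypothetical; only `submit_python_job` comes from the document):

```python
def job_specs(owners, code_path, base_name):
    """Build one submission spec per data owner."""
    return [
        {"user": owner, "code_path": code_path, "job_name": f"{base_name} ({owner})"}
        for owner in owners
    ]

# for spec in job_specs(owners, "analysis.py", "Analysis"):
#     client.submit_python_job(**spec)
```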

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PRE_SYNC` | `"true"` | Auto-sync before accessing datasets/jobs/peers |
| `SYFTCLIENT_TOKEN_PATH` | `None` | Default token path for authentication |
| `SYFTCLIENT_DEV_MODE` | `False` | Enable development mode features |
Source: syft_client/sync/config/config.py:1-13

Common Issues

“Email is required for Jupyter login”

When using Jupyter, you must provide your email:
client = sc.login_ds(
    email="[email protected]",
    token_path="token_ds.json"
)

“Token path is required for Jupyter login”

See the Authentication Guide for setting up OAuth tokens.

Job stuck in pending

The data owner hasn’t approved your job yet. Contact them or wait for approval.

Cannot find dataset

Ensure:
  1. The data owner has shared the dataset with you
  2. You’ve synced recently: client.sync()
  3. The dataset name and owner email are correct

Next Steps

  • Authentication Setup: Set up OAuth tokens for Jupyter environments
  • Notebooks Guide: Learn notebook-specific workflows for Colab and Jupyter
  • API Reference: Explore the full API documentation
  • Data Owner Guide: Learn how to share data and manage jobs
