
Data Scientist Guide

This guide covers everything you need to know as a data scientist using Syft Client to collaborate on private data through peer-to-peer connections.

Overview

As a data scientist, you’ll use Syft Client to:
  • Connect to data owners’ datasites
  • Discover and access shared datasets
  • Submit computational jobs to run on private data
  • Retrieve results from approved jobs

Getting Started

Login

Use login_ds() to authenticate and connect to the Syft network:
import syft_client as sc

# For Google Colab (auto-detects your Google account)
client = sc.login_ds()

# For Jupyter (requires token file)
client = sc.login_ds(
    email="[email protected]",
    token_path="path/to/token_ds.json"
)
The login_ds() function automatically detects your environment (Colab or Jupyter) and configures the appropriate authentication method.

Login Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `email` | `str \| None` | `None` | Your email address. Auto-detected in Colab. |
| `sync` | `bool` | `True` | Sync with Google Drive on login |
| `load_peers` | `bool` | `True` | Load peer connections on login |
| `token_path` | `str \| Path \| None` | `None` | Path to OAuth token file (Jupyter only) |
Source: syft_client/sync/login.py:19-53

Working with Peers

Adding a Data Owner

Before you can access datasets or submit jobs, you need to connect with a data owner:
# Add a peer connection (sends a request to the data owner)
client.add_peer("[email protected]")

Viewing Your Peers

# See all your peer connections
peers = client.peers
print(peers)
Peers will show as “outstanding” until the data owner approves your request.
By default, client.peers automatically syncs before returning results. To disable auto-sync, set the environment variable: PRE_SYNC=false
Source: syft_client/sync/syftbox_manager.py:407-432
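A quick way to spot requests that are still awaiting approval is a small filter over `client.peers`. This is a hypothetical helper, not part of the syft_client API, and it assumes each peer object exposes `email` and `status` attributes, with `"outstanding"` meaning not yet approved (as described above):

```python
def outstanding_peers(peers):
    """Return the emails of peers whose requests are still pending approval."""
    return [p.email for p in peers if p.status == "outstanding"]
```

Usage: `outstanding_peers(client.peers)` gives you the list of data owners you may still need to follow up with.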

Discovering Datasets

List Available Datasets

Once a data owner approves your peer request and shares datasets with you:
# View all datasets you have access to
datasets = client.datasets.get_all()
print(datasets)

Access a Specific Dataset

# Get a dataset by name and owner
dataset = client.datasets.get(
    name="my dataset",
    datasite="[email protected]"
)

# Access mock data files
mock_file = dataset.mock_files[0]
with open(mock_file, "r") as f:
    data = f.read()

# View dataset metadata
print(dataset.name)
print(dataset.summary)
print(dataset.tags)
You can only access mock data directly. To work with the real private data, you must submit a job that runs on the data owner’s machine.
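A practical pattern is to keep your analysis in a plain function that takes the file contents as input, so the same function can be exercised against mock data locally and then shipped unchanged in a job. A toy sketch (the analysis itself is illustrative):

```python
def summarize(data: str) -> dict:
    """Toy analysis that works identically on mock and private data."""
    lines = data.splitlines()
    return {"length": len(data), "lines": len(lines)}

# Locally, exercise it on a mock file:
#   with open(dataset.mock_files[0], "r") as f:
#       print(summarize(f.read()))
```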

Resolve Dataset Paths in Jobs

Use sc.resolve_dataset_file_path() in your job code to reference datasets:
import syft_client as sc

# Automatically resolves to the correct path
data_path = sc.resolve_dataset_file_path("my dataset")
with open(data_path, "r") as f:
    data = f.read()
Source: syft_client/utils.py and test examples in tests/unit/test_sync_manager.py:584-606

Submitting Jobs

Python Jobs

Step 1: Write your analysis code

Create a Python file that performs your analysis:
# analysis.py
import syft_client as sc
import json

# Load dataset
data_path = sc.resolve_dataset_file_path("my dataset")
with open(data_path, "r") as f:
    data = f.read()

# Perform analysis
result = {"length": len(data), "summary": "Analyzed data"}

# Write results to outputs/ folder
with open("outputs/result.json", "w") as f:
    f.write(json.dumps(result))
Step 2: Submit the job

client.submit_python_job(
    user="[email protected]",
    code_path="/path/to/analysis.py",
    job_name="My Analysis",
    dependencies=["pandas==2.0.0", "numpy"]
)
Step 3: Wait for approval

The data owner will review and approve your job. You can check status:
jobs = client.jobs
for job in jobs:
    print(f"{job.name}: {job.status}")
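If you want to block until a job leaves the pending state, the status check above can be wrapped in a polling loop. This is a sketch, not part of the syft_client API: it assumes job objects expose `name` and `status` attributes and that `"pending"` is the pre-approval state, and it takes the job fetcher as a callable (e.g. `lambda: client.jobs`) so each poll triggers a fresh sync:

```python
import time

def wait_for_job(get_jobs, job_name, poll_seconds=30, timeout=3600):
    """Poll until the named job leaves the 'pending' state, then return its status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for job in get_jobs():
            if job.name == job_name and job.status != "pending":
                return job.status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still pending after {timeout}s")
```

Usage: `wait_for_job(lambda: client.jobs, "My Analysis")`.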
Step 4: Retrieve results

Once the job is executed, sync to get results:
import json

client.sync()

# Access job outputs
job = client.jobs[0]
output_path = job.output_paths[0]
with open(output_path, "r") as f:
    result = json.load(f)
print(result)

Submit Folder-Based Jobs

For complex projects with multiple files:
client.submit_python_job(
    user="[email protected]",
    code_path="/path/to/project_folder",
    job_name="Complex Analysis",
    entrypoint="main.py",  # Entry point within the folder
    dependencies=["scikit-learn", "matplotlib"]
)
Source: packages/syft-job/src/syft_job/client.py:308-415

Bash Jobs

Submit shell scripts for simple tasks:
bash_script = """\
#!/bin/bash
echo "Starting analysis..."
python analyze.py
echo "Done!"
"""

client.submit_bash_job(
    user="[email protected]",
    script=bash_script,
    job_name="Bash Analysis"
)
Source: packages/syft-job/src/syft_job/client.py:106-169

Job Workflow Details

Job Directory Structure

When you submit a job, the following structure is created:
SyftBox_{your_email}/
└── [email protected]/
    └── jobs/
        └── [email protected]/
            └── My Analysis/
                ├── config.yaml      # Job metadata
                ├── run.sh          # Execution script
                ├── analysis.py     # Your code
                ├── outputs/        # Results folder (created during execution)
                └── logs/           # Execution logs
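The layout above can be expressed as a small path helper if you want to inspect a job's folder directly. This is purely an illustration of the structure shown, not an official helper; the argument names are hypothetical:

```python
from pathlib import Path

def job_dir(syftbox_root, owner_email, submitter_email, job_name):
    """Reconstruct a job's folder path from the directory layout above."""
    return Path(syftbox_root) / owner_email / "jobs" / submitter_email / job_name
```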

Job Configuration

Each job includes a config.yaml with metadata:
name: My Analysis
submitted_by: [email protected]
submitted_at: '2026-03-02T10:30:00Z'
type: python
code_path: /path/to/analysis.py
entry_point: analysis.py
dependencies:
  - syft-client
  - pandas==2.0.0
  - numpy

Dependencies

Your job automatically includes syft-client as a dependency. The data owner’s machine will:
  1. Create a virtual environment
  2. Install all specified dependencies
  3. Execute your code
Source: packages/syft-job/src/syft_job/client.py:387-408

Syncing Data

Syft Client syncs data with Google Drive to coordinate with peers:
# Manually trigger a sync
client.sync()

# Disable auto-sync for performance
import os
os.environ["PRE_SYNC"] = "false"

# Now you need to manually sync
client.sync()
datasets = client.datasets.get_all()
Auto-sync behavior:
  • client.datasets - syncs before returning
  • client.jobs - syncs before returning
  • client.peers - syncs before returning
Set PRE_SYNC=false to disable this behavior for better performance when making multiple calls.
Source: syft_client/sync/syftbox_manager.py:727-755

Best Practices

Job Submission

  1. Test locally first - Verify your code works with mock data before submitting
  2. Use specific dependency versions - Pin versions to avoid compatibility issues
  3. Write outputs to the outputs/ folder - This is the standard location for results
  4. Handle errors gracefully - Include try/except blocks in your code
  5. Keep jobs focused - Break complex analyses into smaller jobs
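Practice 2 (pinning dependency versions) is easy to check mechanically before you submit. A minimal sketch, not part of the syft_client API, that flags dependencies lacking an exact `==` pin:

```python
def unpinned(dependencies):
    """Return dependencies that lack an exact '==' version pin."""
    return [d for d in dependencies if "==" not in d]
```

For example, `unpinned(["pandas==2.0.0", "numpy"])` would flag `numpy` as worth pinning before submission.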

Code Example with Error Handling

# robust_analysis.py
import syft_client as sc
import json
import sys

try:
    # Load dataset
    data_path = sc.resolve_dataset_file_path("my dataset")
    with open(data_path, "r") as f:
        data = f.read()
    
    # Perform analysis
    result = {"status": "success", "length": len(data)}
    
except Exception as e:
    # Write error to outputs
    result = {"status": "error", "message": str(e)}
    print(f"Error: {e}", file=sys.stderr)

finally:
    # Always write a result
    with open("outputs/result.json", "w") as f:
        f.write(json.dumps(result))

Working with Multiple Data Owners

# Connect to multiple data owners
client.add_peer("[email protected]")
client.add_peer("[email protected]")

# Access datasets from different owners
dataset_a = client.datasets.get("patient-records", datasite="[email protected]")
dataset_b = client.datasets.get("patient-records", datasite="[email protected]")

# Submit jobs to different owners
client.submit_python_job(
    user="[email protected]",
    code_path="analyze_a.py",
    job_name="Analysis A"
)

client.submit_python_job(
    user="[email protected]",
    code_path="analyze_b.py",
    job_name="Analysis B"
)
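When the same analysis goes to several owners, building the submission arguments in one place keeps the calls consistent. A small sketch (the spec-builder is hypothetical; only `submit_python_job` comes from the document):

```python
def job_specs(owners, code_path, base_name):
    """Build one submission spec per data owner."""
    return [
        {"user": owner, "code_path": code_path, "job_name": f"{base_name} ({owner})"}
        for owner in owners
    ]

# for spec in job_specs(owners, "analysis.py", "Analysis"):
#     client.submit_python_job(**spec)
```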

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PRE_SYNC` | `"true"` | Auto-sync before accessing datasets/jobs/peers |
| `SYFTCLIENT_TOKEN_PATH` | `None` | Default token path for authentication |
| `SYFTCLIENT_DEV_MODE` | `False` | Enable development mode features |
Source: syft_client/sync/config/config.py:1-13

Common Issues

“Email is required for Jupyter login”

When using Jupyter, you must provide your email:
client = sc.login_ds(
    email="[email protected]",
    token_path="token_ds.json"
)

“Token path is required for Jupyter login”

See the Authentication Guide for setting up OAuth tokens.

Job stuck in pending

The data owner hasn’t approved your job yet. Contact them or wait for approval.

Cannot find dataset

Ensure:
  1. The data owner has shared the dataset with you
  2. You’ve synced recently: client.sync()
  3. The dataset name and owner email are correct

Next Steps

  • Authentication Setup: Set up OAuth tokens for Jupyter environments
  • Notebooks Guide: Learn notebook-specific workflows for Colab and Jupyter
  • API Reference: Explore the full API documentation
  • Data Owner Guide: Learn how to share data and manage jobs
