# Data Scientist Guide
This guide covers everything you need to know as a data scientist using Syft Client to collaborate on private data through peer-to-peer connections.

## Overview

As a data scientist, you’ll use Syft Client to:

- Connect to data owners’ datasites
- Discover and access shared datasets
- Submit computational jobs to run on private data
- Retrieve results from approved jobs
## Getting Started

### Login

Use `login_ds()` to authenticate and connect to the Syft network:
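As a pseudocode-level sketch of a login call (the import path and the literal argument values are assumptions; `login_ds()` and its parameters are described in the table below):

```python
# Sketch only: import path and values are illustrative assumptions.
from syft_client import login_ds

# In Colab, the email is auto-detected:
# client = login_ds()

# In Jupyter, email and an OAuth token path must be supplied:
client = login_ds(
    email="you@example.com",
    token_path="~/.syft/token.json",
)
```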
The `login_ds()` function automatically detects your environment (Colab or Jupyter) and configures the appropriate authentication method.

#### Login Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `email` | `str \| None` | `None` | Your email address. Auto-detected in Colab. |
| `sync` | `bool` | `True` | Sync with Google Drive on login |
| `load_peers` | `bool` | `True` | Load peer connections on login |
| `token_path` | `str \| Path \| None` | `None` | Path to OAuth token file (Jupyter only) |

Source: `syft_client/sync/login.py:19-53`
## Working with Peers

### Adding a Data Owner

Before you can access datasets or submit jobs, you need to connect with a data owner.

### Viewing Your Peers
By default, `client.peers` automatically syncs before returning results. To disable auto-sync, set the environment variable `PRE_SYNC=false`.

Source: `syft_client/sync/syftbox_manager.py:407-432`
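Put together, the peer workflow might look like the following sketch; `add_peer` is an assumed method name, not confirmed by this guide, while `client.peers` is documented above:

```python
# Hypothetical peer workflow; the add_peer method name is an assumption.
client.add_peer("owner@example.com")  # send a peer request to a data owner

# List current peers (auto-syncs first unless PRE_SYNC=false):
client.peers
```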
## Discovering Datasets

### List Available Datasets

Once a data owner approves your peer request and shares datasets with you, you can list the available datasets.

### Access a Specific Dataset
### Resolve Dataset Paths in Jobs

Use `sc.resolve_dataset_file_path()` in your job code to reference datasets.

Source: `syft_client/utils.py` and test examples in `tests/unit/test_sync_manager.py:584-606`
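A pseudocode-level sketch of the discovery steps above; apart from `client.datasets` and `resolve_dataset_file_path()`, which this guide names, the lookup style, dataset name, and import alias are assumptions:

```python
# Discover datasets shared with you (auto-syncs unless PRE_SYNC=false):
client.datasets

# Access one dataset by name -- the lookup style is an assumption:
dataset = client.datasets["heart_study"]

# Inside submitted job code, resolve a file path on the owner's machine
# (the argument form shown here is an assumption):
import syft_client as sc

path = sc.resolve_dataset_file_path("heart_study/data.csv")
```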
## Submitting Jobs

### Python Jobs

#### Submit Folder-Based Jobs

For complex projects with multiple files, you can submit an entire folder.

Source: `packages/syft-job/src/syft_job/client.py:308-415`
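A hedged sketch of a folder-based submission; the method name and parameters are assumptions pieced together from the job workflow described in this guide, not the confirmed API:

```python
# Hypothetical call; adapt names to the real syft-job client API.
job = client.submit_job(
    owner="owner@example.com",
    code="./my_analysis/",           # folder containing main.py and helpers
    dependencies=["pandas==2.2.0"],  # pinned versions, per Best Practices
)
```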
### Bash Jobs

Submit shell scripts for simple tasks.

Source: `packages/syft-job/src/syft_job/client.py:106-169`
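The script itself can be as simple as the following sketch. A mock input is generated inline so the sketch is self-contained; in a real job the input path would be a dataset shared by the data owner (the `inputs/` layout is an assumption, while `outputs/` is the standard results location named in Best Practices):

```shell
#!/bin/bash
set -euo pipefail

# Mock input so the sketch is self-contained; a real job would read the
# dataset shared by the data owner instead.
mkdir -p inputs outputs
printf 'row1\nrow2\nrow3\n' > inputs/data.csv

# The actual work: count rows and write the result to outputs/.
wc -l < inputs/data.csv > outputs/row_count.txt
```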
## Job Workflow Details

### Job Directory Structure

When you submit a job, a standard directory structure is created for it.

### Job Configuration

Each job includes a `config.yaml` with metadata:
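A hedged sketch of what such a `config.yaml` might contain; the field names here are assumptions, not the confirmed schema:

```yaml
# Hypothetical job metadata; real field names may differ.
name: heart-study-analysis
requester: you@example.com
entrypoint: main.py
dependencies:
  - pandas==2.2.0
status: pending
```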
### Dependencies

Your job automatically includes `syft-client` as a dependency. The data owner’s machine will:
- Create a virtual environment
- Install all specified dependencies
- Execute your code
Source: `packages/syft-job/src/syft_job/client.py:387-408`
## Syncing Data

Syft Client syncs data with Google Drive to coordinate with peers.

Auto-sync behavior:

- `client.datasets` syncs before returning
- `client.jobs` syncs before returning
- `client.peers` syncs before returning

Set `PRE_SYNC=false` to disable this behavior for better performance when making multiple calls.

Source: `syft_client/sync/syftbox_manager.py:727-755`
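One way to apply this from Python is a small helper that flips `PRE_SYNC` around a batch of calls; the helper itself is illustrative, and only the `PRE_SYNC` variable comes from this guide:

```python
import os
from contextlib import contextmanager


@contextmanager
def no_presync():
    """Temporarily set PRE_SYNC=false, restoring the old value afterwards."""
    old = os.environ.get("PRE_SYNC")
    os.environ["PRE_SYNC"] = "false"
    try:
        yield
    finally:
        if old is None:
            del os.environ["PRE_SYNC"]
        else:
            os.environ["PRE_SYNC"] = old


# Usage (the client calls need a live connection, so they are sketched):
# with no_presync():
#     datasets = client.datasets
#     jobs = client.jobs
# client.sync()  # one explicit sync afterwards
```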
## Best Practices

### Job Submission
- **Test locally first** - Verify your code works with mock data before submitting
- **Use specific dependency versions** - Pin versions to avoid compatibility issues
- **Write outputs to the `outputs/` folder** - This is the standard location for results
- **Handle errors gracefully** - Include try/except blocks in your code
- **Keep jobs focused** - Break complex analyses into smaller jobs
### Code Example with Error Handling
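A hedged sketch of a job body that follows the practices above; the one-number-per-line input format is an invented example, and only the `outputs/` convention comes from this guide:

```python
import json
from pathlib import Path


def run_analysis(input_path: str, out_dir: str = "outputs") -> dict:
    """Compute the mean of one-number-per-line input, recording failures
    in outputs/ instead of crashing the job."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    try:
        lines = Path(input_path).read_text().splitlines()
        values = [float(line) for line in lines if line.strip()]
        result = {"status": "ok", "mean": sum(values) / len(values)}
    except Exception as exc:
        # Leave a readable error for the data scientist to retrieve,
        # rather than a bare traceback on the data owner's machine.
        result = {"status": "error", "message": str(exc)}
    (out / "result.json").write_text(json.dumps(result))
    return result
```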
### Working with Multiple Data Owners
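A sketch of fanning the same job out to several data owners; `submit_job` is an assumed method name, so the helper takes the client as a parameter and should be adapted to the real API:

```python
def submit_to_all(client, owners, code_path, **job_kwargs):
    """Submit the same job folder to each data owner and return the
    resulting job handles keyed by owner email."""
    jobs = {}
    for owner in owners:
        jobs[owner] = client.submit_job(owner=owner, code=code_path, **job_kwargs)
    return jobs


# Later, poll each job for results (attribute names are assumptions):
# for owner, job in jobs.items():
#     print(owner, job.status)
```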
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PRE_SYNC` | `"true"` | Auto-sync before accessing datasets/jobs/peers |
| `SYFTCLIENT_TOKEN_PATH` | `None` | Default token path for authentication |
| `SYFTCLIENT_DEV_MODE` | `False` | Enable development mode features |

Source: `syft_client/sync/config/config.py:1-13`
## Common Issues

### "Email is required for Jupyter login"

When using Jupyter, you must provide your email explicitly via the `email` parameter.

### "Token path is required for Jupyter login"

See the Authentication Guide for setting up OAuth tokens.

### Job stuck in pending

The data owner hasn’t approved your job yet. Contact them or wait for approval.

### Cannot find dataset
Ensure:

- The data owner has shared the dataset with you
- You’ve synced recently: `client.sync()`
- The dataset name and owner email are correct
## Next Steps

- **Authentication Setup** - Set up OAuth tokens for Jupyter environments
- **Notebooks Guide** - Learn notebook-specific workflows for Colab and Jupyter
- **API Reference** - Explore the full API documentation
- **Data Owner Guide** - Learn how to share data and manage jobs