A datasite is a peer in the Syft network - a folder on a user’s local filesystem that contains their data, permissions, and shared state.

What is a Datasite?

Each datasite is identified by an email address and contains:
  • Private data: Files accessible only to the owner
  • Public/shared data: Files shared with specific peers or everyone
  • Permission files (syft.pub.yaml): Define who can access what
  • Job queues: Incoming computation requests from peers
  • Event logs: History of all state changes

Datasite Structure

~/syftbox/
├── [email protected]/              # Alice's datasite
│   ├── public/                     # Public folder
│   │   ├── syft.pub.yaml          # Permissions: read = ["*"]
│   │   ├── datasets/              # Shared datasets
│   │   └── mock_data.csv          # Mock data for testing
│   ├── private/                    # Private folder (owner only)
│   │   ├── syft.pub.yaml          # Permissions: read = ["[email protected]"]
│   │   └── sensitive_data.csv
│   ├── jobs/                       # Job queue
│   │   └── [email protected]/       # Jobs from Bob
│   │       └── analysis_job_123/
│   │           ├── run.sh
│   │           ├── config.yaml
│   │           └── approved       # Status marker
│   └── api_data/                  # Shared with specific peers
│       ├── syft.pub.yaml          # Permissions: read = ["[email protected]"]
│       └── results.json
└── [email protected]/                # Bob's datasite (synced copy)
    └── public/
        └── his_mock_data.csv
Following Principle 1: File-first, the datasite folder is the state. No database required - everything is visible as files.
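
A runnable sketch of this idea, using a throwaway temporary directory in place of ~/syftbox (the file names and the permission line are illustrative, not the real syft.pub.yaml schema):

```python
import tempfile
from pathlib import Path

# Build a toy datasite layout in a temporary directory
# (stands in for ~/syftbox; contents are illustrative).
root = Path(tempfile.mkdtemp())
datasite = root / "[email protected]"
(datasite / "public" / "datasets").mkdir(parents=True)
(datasite / "private").mkdir()
(datasite / "public" / "syft.pub.yaml").write_text('read = ["*"]\n')
(datasite / "private" / "sensitive_data.csv").write_text("id,value\n1,42\n")

# "The folder is the state": enumerating files reveals everything.
state = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
for path in state:
    print(path)
```

Any tool that can walk a directory tree - `ls`, a file manager, or three lines of Python - can audit the complete state.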

Two Core Roles

Syft defines two personas that collaborate through datasites:

Data Owner

Has private data and receives computation requests. Reviews and approves jobs before execution.

Data Scientist

Wants to run analysis on others’ data. Submits jobs and receives results/mocks.
The same user can be both a data owner (for their own data) and a data scientist (when requesting others’ data). Roles are contextual, not fixed.

Data Owner Workflow

As a data owner, you:
  1. Store private data in your datasite
  2. Create mocks for data scientists to develop against
  3. Set permissions controlling who can access what
  4. Receive job requests from data scientists
  5. Review and approve/reject jobs manually or via policies
  6. Execute approved jobs in isolated environments
  7. Share results back through the network

Data Owner Components

class DatasiteOwnerSyncer(BaseModelCallbackMixin):
    """Responsible for downloading files and checking permissions"""
    
    def sync(self, peer_emails: list[str], recompute_hashes: bool = True):
        """Pull proposed file changes from peers"""
        for peer_email in peer_emails:
            msg = self.pull_and_process_next_proposed_filechange(peer_email)
            if msg:
                self.handle_proposed_filechange_events_message(peer_email, msg)
    
    def check_write_permission(self, sender_email: str, path: str) -> bool:
        """Check if sender has write access to the given path"""
        self.perm_context._reload()
        return self.perm_context.open(path).has_write_access(sender_email)
    
    def handle_proposed_filechange_events_message(
        self, sender_email: str, proposed_events_message: ProposedFileChangesMessage
    ):
        # Filter to only changes sender has permission to make
        allowed_changes = [
            change for change in proposed_events_message.proposed_file_changes
            if self.check_write_permission(sender_email, str(change.path_in_datasite))
        ]
        
        if allowed_changes:
            # Process and accept allowed changes
            self.event_cache.process_proposed_events_message(...)
Key responsibilities:
  • Pull incoming messages from peers
  • Check write permissions on all proposed changes
  • Process only allowed file changes
  • Maintain event history and checkpoints

Job Approval

Jobs can be approved manually or automatically via policies. Following Principle 7: Manual-review-first, the default is manual approval:
from syft_job import get_client

client = get_client("/path/to/syftbox", "[email protected]")

# View pending jobs
for job in client.jobs:
    if job.status == "inbox":
        print(f"Job from {job.submitted_by}: {job.name}")
        # Review job contents
        print((job.location / "run.sh").read_text())
        
        # Approve if safe
        job.approve()
Following Principle 2: File-permission-first, job-policy second, no other permission system exists. All access control is through file permissions or job policies.

Data Scientist Workflow

As a data scientist, you:
  1. Discover data owners (through SyftHub or other channels)
  2. Request peer connection and get approved
  3. Access mock data to develop and test your analysis
  4. Submit jobs to run analysis on real data
  5. Wait for approval (manual or policy-based)
  6. Receive results when job completes

Data Scientist Components

class DatasiteWatcherSyncer(BaseModelCallbackMixin):
    """Handles both pushing proposed file changes and pulling from datasite outboxes."""
    
    def on_file_change(
        self, relative_path: Path | str, content: str | None = None, process_now=True
    ):
        """Queue file change for syncing to peer"""
        relative_path = Path(relative_path)
        self.queue.put((relative_path, content))
        if process_now:
            self.process_file_changes_queue()
    
    def sync_down(self, peer_emails: list[str]):
        """Pull messages and datasets from peer outboxes"""
        for peer_email in peer_emails:
            # Sync messages with parallel download
            self.datasite_watcher_cache.sync_down_parallel(
                peer_email,
                self._executor,
                self.download_events_message_with_new_connection,
            )
Key responsibilities:
  • Monitor local file changes
  • Push proposed changes to peer inboxes
  • Pull results from peer outboxes
  • Cache remote datasets locally

Submitting Jobs

from syft_job import get_client

client = get_client("/path/to/syftbox", "[email protected]")

# Submit bash job to data owner
script = """#!/bin/bash
set -e
python analysis.py
"""

job_dir = client.submit_bash_job(
    user="[email protected]",
    script=script,
    job_name="My Analysis"
)
print(f"Job submitted to: {job_dir}")
Jobs are written to [email protected]/jobs/[email protected]/job_name/ and synced via the transport layer.
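
Since job state lives in files, a scientist can watch for completion by polling marker files. A toy sketch: the approved marker comes from the datasite tree above, while the done and rejected markers and the inbox default are assumptions based on the statuses shown elsewhere on this page:

```python
import tempfile
from pathlib import Path

# Toy job directory standing in for a synced jobs/ folder.
job_dir = Path(tempfile.mkdtemp()) / "analysis_job_123"
job_dir.mkdir(parents=True)

def job_status(job_dir: Path) -> str:
    """Derive status purely from marker files - file-first, no database."""
    for marker in ("done", "approved", "rejected"):
        if (job_dir / marker).exists():
            return marker
    return "inbox"

print(job_status(job_dir))      # inbox
(job_dir / "approved").touch()  # owner approves
print(job_status(job_dir))      # approved
(job_dir / "done").touch()      # execution finishes
print(job_status(job_dir))      # done
```

Because status is just the presence of files, the transport layer needs no special job-state protocol: syncing the folder syncs the state.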

Mock Data

Following Principle 15: Mock-always, every piece of state comes with a mock:

Step 1: Data Owner Creates Mock

Owner provides mock data for scientists to develop against:
# Real data (private)
~/syftbox/alice@example.com/private/customers.csv

# Mock data (public)
~/syftbox/alice@example.com/public/customers_mock.csv

Step 2: Data Scientist Develops Locally

Scientists write and test code against mock data:
import pandas as pd

# Develop using mock data
df = pd.read_csv("~/syftbox/[email protected]/public/customers_mock.csv")
result = df.groupby("region").sum()
result.to_csv("output.csv")

Step 3: Submit Job to Real Data

Submit the tested code to run on real private data:
client.submit_python_job(
    user="[email protected]",
    code_path="analysis.py",
    job_name="Regional Analysis"
)
The same code runs, but reads from private/customers.csv on Alice’s machine.
Following Principle 16: Automock-first, mock generation should be automatic. Manual mocks are fallback when privacy norms are unclear.
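
The mock-to-real handoff works because the analysis never hard-codes which dataset it reads. A minimal sketch, assuming the script takes its data directory as a parameter (file and column names are illustrative):

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def regional_totals(data_dir: Path) -> dict:
    """The analysis itself never knows whether it reads mock or real data."""
    totals = defaultdict(int)
    with open(data_dir / "customers.csv", newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += int(row["amount"])
    return dict(totals)

# Locally: point at a mock dataset (toy stand-in for public/).
mock_dir = Path(tempfile.mkdtemp())
(mock_dir / "customers.csv").write_text(
    "region,amount\nnorth,10\nnorth,5\nsouth,7\n"
)
print(regional_totals(mock_dir))  # {'north': 15, 'south': 7}
# On Alice's machine, the approved job would call
# regional_totals(private_dir) against the real data instead.
```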

MapReduce Model

Following Principle 11: MapReduce-first, all interactions are viewed through the MapReduce lens:
  • Map phase: Submit the same job to multiple data owners
  • Reduce phase: Aggregate results locally
# Map: Submit to multiple data owners
data_owners = ["[email protected]", "[email protected]", "[email protected]"]
jobs = []

for owner in data_owners:
    job_dir = client.submit_python_job(
        user=owner,
        code_path="count_customers.py",
        job_name=f"Count for {owner}"
    )
    jobs.append((owner, job_dir))

# Wait for results...

# Reduce: Aggregate results
total = 0
for owner, job_dir in jobs:
    result = pd.read_csv(job_dir / "outputs" / "count.csv")
    total += result["count"].sum()

print(f"Total customers across all owners: {total}")

Single Gateway

Following Principle 17: Single-gateway only, there is only one job queue per datasite:
  • Data owners can see everything entering/leaving their datasite
  • No hidden channels or backdoors
  • Full transparency for data owners
[email protected]/
└── jobs/                    # THE ONLY GATEWAY
    ├── [email protected]/     # All jobs from Bob
    ├── [email protected]/   # All jobs from Carol
    └── [email protected]/    # All jobs from Dave
Even things that look like RPC calls or state queries go through the job queue under the hood (Principle 18: Job-only).

Client-First Debugging

Following Principle 8: DataScientist-first-debugging, when something goes wrong, the data scientist does the work:

Step 1: Job Fails

Job executes but produces an error:
job = client.jobs["My Analysis"]
print(job.status)  # "done"
print(job.stderr.text)
# Traceback: KeyError: 'customer_id'

Step 2: Review Error Locally

Data scientist reviews error and fixes code:
# Fix the code
df = pd.read_csv("data.csv")
df = df.rename(columns={"id": "customer_id"})  # Fix!

Step 3: Resubmit

Submit fixed version:
client.submit_python_job(
    user="[email protected]",
    code_path="analysis_fixed.py",
    job_name="My Analysis v2"
)
It is not the data owner’s job to:
  • Normalize their data to match scientist’s expectations
  • Debug scientist’s code
  • Coordinate with other data owners
The scientist adapts to each data owner’s schema (using mocks).

Fail Softly

Following Principle 6: Fail-softly, jobs don’t get rejected - they produce errors:
# Traditional: Hard failure
response = api.query("/customers/count")
# 404 Not Found - endpoint doesn't exist
# ❌ No debugging info
With data owner’s permission, the scientist can see why it failed and fix it.

Datasite Configuration

Datasites are created and configured through the permission system:
from pathlib import Path

from syft_perm import SyftPermContext

# Initialize datasite
datasite = Path("/path/to/syftbox/[email protected]")
ctx = SyftPermContext(datasite=datasite)

# Set up folder permissions
public_folder = ctx.open("public/")
public_folder.grant_read_access("*")  # Everyone can read

private_folder = ctx.open("private/")
private_folder.grant_read_access("[email protected]")  # Owner only

api_folder = ctx.open("api_data/")
api_folder.grant_read_access("[email protected]")  # Specific peer
api_folder.grant_write_access("[email protected]")  # Can write back results
See Permissions for details.

Next Steps

Permissions

Learn about file-first access control

Job System

Submit and manage jobs

Datasets

Share datasets between datasites

P2P Network

Understand transport layers and sync
