Overview

SyftDatasetManager is the primary interface for creating, retrieving, and managing datasets in SyftBox. It handles dataset storage, permissions, and synchronization across datasites.

Constructor

from syft_datasets import SyftDatasetManager

manager = SyftDatasetManager(
    syftbox_folder_path="/path/to/syftbox",
    email="[email protected]"
)
syftbox_folder_path
PathLike
required
Path to the SyftBox folder on the local filesystem
email
str
required
Email address associated with the datasite
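
This reference does not say whether the constructor expands a leading `~` in syftbox_folder_path. A minimal sketch, assuming you prefer not to rely on that, resolves the path first with the standard library and passes the result in:

```python
from pathlib import Path

# Expand "~" to the current user's home directory before handing the path
# to SyftDatasetManager. Whether the manager expands it on its own is not
# documented here, so this is a defensive step, not a requirement.
syftbox_folder = Path("~/SyftBox").expanduser()

print(syftbox_folder)
```

The resulting `Path` satisfies the PathLike type expected by syftbox_folder_path.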

Class Methods

from_config

Create a SyftDatasetManager from an existing SyftBoxConfig.
manager = SyftDatasetManager.from_config(config)
config
SyftBoxConfig
required
SyftBox configuration object
return
SyftDatasetManager
Configured dataset manager instance

Methods

create

Create a new dataset with mock and private data.
dataset = manager.create(
    name="my_dataset",
    mock_path="./data/mock",
    private_path="./data/private",
    summary="Sample dataset for analysis",
    readme_path="./README.md",
    tags=["healthcare", "research"],
    users=["[email protected]"]
)
name
str
required
Unique identifier for the dataset. Only alphanumeric characters, underscores, and hyphens are allowed.
mock_path
PathLike
required
Path to the mock data (file or directory) that will be shared publicly
private_path
PathLike
required
Path to the private data (file or directory) that remains local
summary
str | None
Short summary describing the dataset
readme_path
Path | None
Path to a markdown README file to include in the dataset
location
str | None
Location identifier for datasets hosted on remote locations requiring manual syncing (e.g., "high-side-1234")
tags
list[str] | None
Tags for categorizing and discovering the dataset
users
list[str] | str | None
Users to share the dataset with. Can be:
  • List of email addresses
  • "any" to share with all users
  • None (default) to share with no one
return
Dataset
The created Dataset object with metadata and file URLs
Raises:
  • ValueError: If dataset name contains invalid characters
  • FileNotFoundError: If mock_path or readme_path doesn’t exist
  • FileExistsError: If dataset directory already exists and is not empty
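
The naming rule for name can be checked before calling create, so that a bad name is caught early. The sketch below is an assumption about how such a check might look: the helper `is_valid_dataset_name` and its ASCII-only pattern are hypothetical and not taken from the syft_datasets source, which may accept a different character set.

```python
import re

# Assumption: this pattern mirrors the documented rule (alphanumeric
# characters, underscores, hyphens) using ASCII only. It is not the
# library's own validation code.
_DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name uses only letters, digits, underscores, and hyphens."""
    return _DATASET_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_dataset_name("my_dataset"))
print(is_valid_dataset_name("my dataset"))
```

A pre-check like this does not replace the ValueError raised by create; it only lets calling code reject a name before touching the filesystem.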

get

Retrieve a dataset by name.
dataset = manager.get("my_dataset")

# Get dataset from another datasite
dataset = manager.get("my_dataset", datasite="[email protected]")
name
str
required
Name of the dataset to retrieve
datasite
str | None
Email of the datasite owner. Defaults to the current user’s email
return
Dataset
The requested Dataset object
Raises:
  • FileNotFoundError: If dataset doesn’t exist
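
Because get raises FileNotFoundError for a missing dataset, calling code that treats "not found" as a normal outcome may want a small wrapper. The helper `get_dataset_or_none` below is hypothetical; it relies only on the get signature and the exception documented above.

```python
def get_dataset_or_none(manager, name, datasite=None):
    """Return the dataset, or None if manager.get raises FileNotFoundError."""
    try:
        return manager.get(name, datasite=datasite)
    except FileNotFoundError:
        # Documented behaviour for a dataset that does not exist.
        return None
```

Usage would look like `dataset = get_dataset_or_none(manager, "my_dataset")`, followed by an `if dataset is None:` branch.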

get_all

Retrieve all accessible datasets with optional filtering and pagination.
# Get all datasets
all_datasets = manager.get_all()

# Get datasets from specific datasite
datasets = manager.get_all(datasite="[email protected]")

# Get datasets with pagination and sorting
datasets = manager.get_all(
    limit=10,
    offset=0,
    order_by="created_at",
    sort_order="desc"
)
datasite
str | None
Filter datasets by datasite owner email
limit
int | None
Maximum number of datasets to return
offset
int | None
Number of datasets to skip (for pagination)
order_by
str | None
Field name to sort by (e.g., "created_at", "name")
sort_order
Literal['asc', 'desc']
default:"asc"
Sort order: ascending or descending
return
list[Dataset]
List of Dataset objects, returned as a TableList so the result displays as a readable table
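
The limit and offset parameters can be combined to walk through a large result set one page at a time. The generator below is a sketch under two assumptions that this reference does not confirm: that a page shorter than limit marks the end of the results, and that an offset past the end returns an empty list. The name `iter_datasets_in_pages` is hypothetical.

```python
from typing import Any, Iterator


def iter_datasets_in_pages(manager: Any, page_size: int = 10, **filters: Any) -> Iterator[Any]:
    """Yield datasets page by page using the documented limit and offset parameters."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    offset = 0
    while True:
        page = manager.get_all(limit=page_size, offset=offset, **filters)
        yield from page
        # Assumption: a page shorter than page_size means nothing is left.
        if len(page) < page_size:
            break
        offset += page_size
```

Passing a fixed order_by (for example `order_by="name"`) through `filters` is a sensible precaution, since paging over an unspecified order could skip or repeat items.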

delete

Delete a dataset from the datasite.
# Delete with confirmation prompt
manager.delete("my_dataset")

# Delete without confirmation
manager.delete("my_dataset", require_confirmation=False)
name
str
required
Name of the dataset to delete
datasite
str | None
Email of the datasite owner. Defaults to the current user’s email. Only datasets in your own datasite can be deleted.
require_confirmation
bool
default:"True"
Whether to prompt for confirmation before deleting
Raises:
  • ValueError: If attempting to delete another user’s dataset
  • FileNotFoundError: If dataset doesn’t exist
Deleting a dataset removes both mock and private metadata directories. Private data files are only deleted if they’re managed by SyftBox.
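
In scripts and test teardown, a confirmation prompt would block execution and a missing dataset is usually not an error. The hypothetical helper `delete_if_exists` sketches that pattern using only the require_confirmation flag and the FileNotFoundError documented above.

```python
def delete_if_exists(manager, name):
    """Delete a dataset without prompting; return False if it was not found."""
    try:
        manager.delete(name, require_confirmation=False)
    except FileNotFoundError:
        # Nothing to delete: treat as a no-op rather than a failure.
        return False
    return True
```

The ValueError raised for another user's dataset is deliberately left uncaught, since that case signals a mistake in the calling code.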

share_dataset

Share an existing dataset with users.
# Share with specific users
manager.share_dataset("my_dataset", users=["[email protected]", "[email protected]"])

# Share with everyone
manager.share_dataset("my_dataset", users="any")
name
str
required
Name of the dataset to share
users
list[str] | str
required
List of email addresses or "any" to share with all users
Raises:
  • ValueError: If dataset doesn’t exist
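
The users argument accepts several shapes (a list of addresses, the string "any", and for create also None). A caller that collects recipients from user input may want to tidy them first. The helper `normalize_users` below is hypothetical, and wrapping a single bare address in a list is a choice made by this sketch, not documented library behaviour.

```python
from typing import List, Union

ANY_USER = "any"  # the documented value for sharing with all users


def normalize_users(users: Union[List[str], str, None]) -> Union[List[str], str]:
    """Turn the accepted `users` forms into either "any" or a clean list of emails."""
    if users is None:
        return []
    if isinstance(users, str):
        # A bare string is either the "any" marker or a single address.
        return ANY_USER if users == ANY_USER else [users.strip()]
    seen = set()
    cleaned = []
    for email in users:
        email = email.strip()
        # Drop blanks and duplicates while keeping the original order.
        if email and email not in seen:
            seen.add(email)
            cleaned.append(email)
    return cleaned
```

The return value can be passed directly as users to share_dataset, provided it is not an empty list.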

Special Methods

Indexing

Access datasets by name or index.
# Access by name
dataset = manager["my_dataset"]

# Access by index
first_dataset = manager[0]

Iteration

Iterate over all datasets.
for dataset in manager:
    print(dataset.name)

Length

Get the total number of datasets.
total = len(manager)
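
For readers unfamiliar with how one object can support indexing by name, indexing by position, iteration, and `len()`, the following stand-in shows the underlying Python protocol. `DatasetIndex` is a hypothetical class written for illustration; it is not the library's implementation, and the exception types it raises are assumptions.

```python
from typing import Any, Iterator, List, Union


class DatasetIndex:
    """Hypothetical stand-in showing name-or-position lookup over named items."""

    def __init__(self, datasets: List[Any]) -> None:
        self._datasets = list(datasets)

    def __getitem__(self, key: Union[str, int]) -> Any:
        if isinstance(key, str):
            # Name lookup: return the first item whose .name matches.
            for dataset in self._datasets:
                if dataset.name == key:
                    return dataset
            raise KeyError(key)
        if isinstance(key, int):
            # Positional lookup: delegates to list indexing (IndexError if out of range).
            return self._datasets[key]
        raise TypeError(f"unsupported key type: {type(key).__name__}")

    def __iter__(self) -> Iterator[Any]:
        return iter(self._datasets)

    def __len__(self) -> int:
        return len(self._datasets)
```

Which ordering the real manager uses for positional access is not stated in this reference, so code that depends on `manager[0]` being a particular dataset should look it up by name instead.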

Properties

syftbox_config
SyftBoxConfig
The SyftBox configuration used by this manager

Usage Example

from syft_datasets import SyftDatasetManager

# Initialize manager
manager = SyftDatasetManager(
    syftbox_folder_path="~/SyftBox",
    email="[email protected]"
)

# Create a new dataset
dataset = manager.create(
    name="patient_records",
    mock_path="./synthetic_data",
    private_path="./real_data",
    summary="Synthetic patient records for model training",
    tags=["healthcare", "synthetic"],
    users=["[email protected]"]
)

print(f"Created dataset: {dataset.name}")
print(f"Mock data location: {dataset.mock_dir}")

# List all datasets
all_datasets = manager.get_all()
print(f"Total datasets: {len(all_datasets)}")

# Retrieve a specific dataset
my_dataset = manager.get("patient_records")
print(f"Dataset summary: {my_dataset.summary}")
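
The example above assumes that the mock and private paths already exist; create raises FileNotFoundError when mock_path is missing. For a quick local trial, the sketch below builds two throwaway directories with the standard library. The file name `records.csv`, the column names, and the row values are fabricated placeholders, not part of SyftBox.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(directory: Path, rows: list) -> Path:
    """Create the directory if needed and write a small CSV file into it."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "records.csv"
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["patient_id", "age"])
        writer.writerows(rows)
    return target


# Temporary workspace so the trial does not touch real project folders.
workspace = Path(tempfile.mkdtemp())
mock_dir = workspace / "synthetic_data"
private_dir = workspace / "real_data"

write_sample_csv(mock_dir, [(1, 34), (2, 58)])          # placeholder mock rows
write_sample_csv(private_dir, [(101, 41), (102, 67)])   # stand-in for private rows

print(mock_dir, private_dir)
```

The two directories can then be passed as mock_path and private_path in the create call.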

Constants

FOLDER_NAME
str
Default folder name for storing datasets
METADATA_FILENAME
str
Filename for dataset metadata
SHARE_WITH_ANY
str
Constant for sharing datasets with all users
