Datasets represent configured instances of vector databases that store and search your private data. Each dataset is backed by a dataset type that defines how data is stored, searched, and managed.

Dataset entity

A dataset is defined by the following properties:
class Dataset:
    id: UUID                    # Unique identifier
    tenant_id: UUID             # Tenant isolation
    name: str                   # Unique name per tenant
    dtype: str                  # Dataset type (e.g., "weaviate", "chromadb_local")
    configuration: dict         # Type-specific config
    summary: str                # Brief description
    tags: str                   # Comma-separated tags
    provisioner_state_id: UUID  # Link to shared provisioner (for local types)
    created_at: datetime
    updated_at: datetime
Location: backend/syft_space/components/datasets/entities.py:47

Dataset types

Dataset types implement the BaseDatasetType protocol and provide:

Configuration schema

Each type defines required fields:
@classmethod
def configuration_schema(cls) -> dict[str, Any]:
    """Return configuration schema for this dataset type."""
    return {
        "httpPort": {"type": "integer", "required": True},
        "grpcPort": {"type": "integer", "required": True},
        "collectionName": {"type": "string", "required": True}
    }
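A schema in this shape can be checked with a small validator. The following is an illustrative sketch, assuming a simple `{"type": ..., "required": ...}` field format; it is not the library's actual validation code.

```python
from typing import Any

def validate_configuration(schema: dict[str, Any], config: dict[str, Any]) -> list[str]:
    """Return a list of validation errors (empty if the config is valid).

    Illustrative sketch only -- the real validator may differ.
    """
    type_map = {"integer": int, "string": str, "boolean": bool}
    errors = []
    for field, spec in schema.items():
        if field not in config:
            if spec.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        expected = type_map.get(spec["type"])
        if expected and not isinstance(config[field], expected):
            errors.append(f"{field}: expected {spec['type']}")
    return errors

schema = {
    "httpPort": {"type": "integer", "required": True},
    "grpcPort": {"type": "integer", "required": True},
    "collectionName": {"type": "string", "required": True},
}
print(validate_configuration(schema, {"httpPort": 8080, "grpcPort": 50051}))
# → ['missing required field: collectionName']
```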

Search interface

All dataset types implement search functionality:
async def search(
    self,
    ctx: SearchContext,
    query: str,
    params: SearchParameters | None = None
) -> SearchResult:
    """Search the dataset for matching documents."""
SearchParameters (from interfaces.py:25):
  • similarity_threshold (float): Minimum similarity score (0.0-1.0)
  • limit (int): Maximum number of results
  • include_metadata (bool): Whether to include document metadata
SearchResult contains:
  • documents: List of SearchedDocument objects
    • document_id: Unique document identifier
    • content: Document text
    • metadata: Custom metadata dict
    • similarity_score: Relevance score (0.0-1.0)
Location: backend/syft_space/components/dataset_types/interfaces.py:55
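A hypothetical caller might exercise this interface as below. The stand-in dataclasses and the in-memory keyword-overlap scoring are illustrative stubs, not the real Weaviate/ChromaDB implementations or the actual interface definitions.

```python
import asyncio
from dataclasses import dataclass, field

# Minimal stand-ins for the real interfaces (assumed shapes, for illustration)
@dataclass
class SearchParameters:
    similarity_threshold: float = 0.0
    limit: int = 10
    include_metadata: bool = True

@dataclass
class SearchedDocument:
    document_id: str
    content: str
    similarity_score: float
    metadata: dict = field(default_factory=dict)

@dataclass
class SearchResult:
    documents: list

class InMemoryDatasetType:
    """Toy dataset type: scores documents by naive keyword overlap."""

    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    async def search(self, ctx, query, params=None):
        params = params or SearchParameters()
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self.docs.items():
            words = set(text.lower().split())
            score = len(terms & words) / len(terms) if terms else 0.0
            if score >= params.similarity_threshold:
                hits.append(SearchedDocument(doc_id, text, score))
        hits.sort(key=lambda d: d.similarity_score, reverse=True)
        return SearchResult(documents=hits[: params.limit])

dt = InMemoryDatasetType({"d1": "vector database search", "d2": "cooking recipes"})
result = asyncio.run(dt.search(None, "vector search", SearchParameters(similarity_threshold=0.5)))
print([d.document_id for d in result.documents])  # → ['d1']
```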

Available dataset types

Weaviate (remote)

Type name: weaviate
Cloud or self-hosted Weaviate vector database. Configuration:
{
  "httpPort": 8080,
  "grpcPort": 50051,
  "collectionName": "MyCollection",
  "ingestionPath": "/data/documents"
}
No provisioner: Weaviate runs externally, you provide connection details.

ChromaDB (local)

Type name: chromadb_local
Local ChromaDB instance managed by Syft Space. Configuration:
{
  "collectionName": "MyCollection",
  "ingestionPath": "/data/documents"
}
Has provisioner: Syft Space automatically starts/stops ChromaDB containers.

Provisioners

Provisioners manage the lifecycle of local dataset infrastructure (containers, processes). They are shared across all datasets of the same type.

Provisioner lifecycle

Location: backend/syft_space/components/datasets/entities.py:16

Provisioner state

Shared state is tracked in the database:
class ProvisionerState:
    id: UUID
    dtype: str              # One provisioner per dataset type
    state: dict             # Connection config + runtime info (container_id, ports)
    status: str             # STOPPED, STARTING, RUNNING, STOPPING, ERROR
    started_at: datetime
    stopped_at: datetime
    error: str | None
Location: backend/syft_space/components/datasets/entities.py:113
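The status values above suggest a small state machine. The transition table below is an assumption for illustration, not copied from the handlers:

```python
# Plausible transitions for ProvisionerState.status -- an assumption,
# not the actual rules enforced by the handlers.
TRANSITIONS = {
    "STOPPED": {"STARTING"},
    "STARTING": {"RUNNING", "ERROR"},
    "RUNNING": {"STOPPING", "ERROR"},
    "STOPPING": {"STOPPED", "ERROR"},
    "ERROR": {"STARTING"},
}

def can_transition(current: str, target: str) -> bool:
    """Check whether a status change is legal under the table above."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("STOPPED", "STARTING"))  # → True
print(can_transition("STOPPED", "RUNNING"))   # → False
```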

Key provisioner behaviors

Shared provisioners: Multiple datasets of the same type share one provisioner. When you create a second ChromaDB dataset, it reuses the existing ChromaDB container.
Connection field override: When a provisioner is running, new datasets inherit connection fields (ports, URLs) from the provisioner state, ignoring user-provided values.
Location: backend/syft_space/components/datasets/handlers.py:377
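The override behavior can be sketched as a simple merge, with provisioner state taking precedence for connection fields. Function and field names here are illustrative, not the handler's actual code:

```python
def apply_connection_overrides(
    user_config: dict,
    connection_fields: list[str],
    provisioner_state: dict,
) -> dict:
    """Replace connection fields in the user's config with values from
    the running provisioner's shared state (illustrative sketch)."""
    merged = dict(user_config)
    for f in connection_fields:
        if f in provisioner_state:
            merged[f] = provisioner_state[f]
    return merged

user = {"httpPort": 9999, "grpcPort": 1234, "collectionName": "Mine"}
state = {"httpPort": 8080, "grpcPort": 50051, "container_id": "abc123"}
print(apply_connection_overrides(user, ["httpPort", "grpcPort"], state))
# → {'httpPort': 8080, 'grpcPort': 50051, 'collectionName': 'Mine'}
```

Note that non-connection fields like `collectionName` pass through untouched, and runtime-only state such as `container_id` is not copied into the dataset configuration.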

Startup/shutdown

Provisioners are automatically managed:
  1. On app startup (startup_all_provisioners):
    • Finds all provisioners with attached datasets
    • Starts them if not already running
    • Recovers from stuck STARTING/STOPPING states
  2. On app shutdown (shutdown_all_provisioners):
    • Stops all running provisioners
    • Best-effort (continues on errors)
Location: backend/syft_space/components/datasets/handlers.py:194
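The startup pass described above can be sketched as follows. The in-memory dicts and the recovery rule for stuck states are assumptions for illustration; the real `startup_all_provisioners` works against the database and actual containers.

```python
import asyncio

async def startup_all_provisioners(provisioners: list[dict]) -> list[str]:
    """Start every provisioner with attached datasets, recovering ones
    stuck in STARTING/STOPPING from a previous crash (illustrative sketch)."""
    started = []
    for p in provisioners:
        if not p["has_datasets"]:
            continue  # only provisioners that back at least one dataset
        if p["status"] in ("STARTING", "STOPPING"):
            p["status"] = "STOPPED"  # recover a stuck transitional state
        if p["status"] != "RUNNING":
            p["status"] = "RUNNING"  # stand-in for the real container start
            started.append(p["dtype"])
    return started

provs = [
    {"dtype": "chromadb_local", "status": "STARTING", "has_datasets": True},
    {"dtype": "weaviate", "status": "STOPPED", "has_datasets": False},
]
print(asyncio.run(startup_all_provisioners(provs)))  # → ['chromadb_local']
```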

Data ingestion

Datasets that implement IngestableDatasetType support file uploads:
async def ingest(
    self,
    ctx: IngestContext,
    request: IngestRequest
) -> None:
    """Ingest files into the dataset."""
IngestRequest contains:
  • files: List of IngestFile objects
    • file_handle: File-like object (BytesIO, SpooledTemporaryFile)
    • filename: Original filename
    • content_type: MIME type
    • file_size: Size in bytes
Location: backend/syft_space/components/dataset_types/interfaces.py:206
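Building an `IngestRequest` from in-memory bytes might look like this. The dataclass definitions are assumed stand-ins for the real `IngestFile`/`IngestRequest` types:

```python
from dataclasses import dataclass
from io import BytesIO
from typing import BinaryIO

# Illustrative stand-ins for IngestFile / IngestRequest (assumed shapes)
@dataclass
class IngestFile:
    file_handle: BinaryIO
    filename: str
    content_type: str
    file_size: int

@dataclass
class IngestRequest:
    files: list

def make_ingest_request(payloads: dict[str, bytes]) -> IngestRequest:
    """Wrap raw bytes as IngestFile objects, the way an upload handler might."""
    files = [
        IngestFile(BytesIO(data), name, "application/octet-stream", len(data))
        for name, data in payloads.items()
    ]
    return IngestRequest(files=files)

req = make_ingest_request({"notes.txt": b"hello world"})
print(req.files[0].filename, req.files[0].file_size)  # → notes.txt 11
```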

File watching

Datasets can monitor directories for new files:
class FileIngestableDatasetType:
    def watched_paths(self) -> list[str]:
        """Paths to monitor for new files."""
    
    def allowed_extensions(self) -> set[str]:
        """File extensions to accept (e.g., {".pdf", ".txt"})."""
Location: backend/syft_space/components/dataset_types/interfaces.py:233
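The extension filtering a watcher would apply before queueing files can be sketched in a few lines. The case-insensitive comparison is an assumption for illustration:

```python
from pathlib import Path

def filter_ingestable(paths: list[str], allowed_extensions: set[str]) -> list[str]:
    """Keep only files whose suffix is in the allowed set -- the filtering
    a watcher might apply before queueing files for ingestion (sketch)."""
    return [p for p in paths if Path(p).suffix.lower() in allowed_extensions]

print(filter_ingestable(
    ["/data/a.pdf", "/data/b.TXT", "/data/c.exe"],
    {".pdf", ".txt"},
))  # → ['/data/a.pdf', '/data/b.TXT']
```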

Dataset operations

Create dataset

async def create_dataset(
    request: CreateDatasetRequest,
    tenant: Tenant
) -> DatasetResponse:
    """
    1. Validates dataset type exists
    2. Validates configuration
    3. Ensures provisioner is running (starts if needed)
    4. Overrides connection fields from provisioner state
    5. Creates dataset entity
    """
Location: backend/syft_space/components/datasets/handlers.py:333
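The five steps in the docstring can be sketched as one orchestration function. All names, the in-memory registry, and the provisioner bootstrap are assumptions for illustration, not the real handler:

```python
def create_dataset_flow(dtype: str, config: dict, registry: dict, provisioners: dict) -> dict:
    """Illustrative orchestration of the create steps above (not the real handler)."""
    if dtype not in registry:                              # 1. validate type exists
        raise ValueError(f"unknown dataset type: {dtype}")
    spec = registry[dtype]
    for field, rules in spec["schema"].items():            # 2. validate configuration
        if rules.get("required") and field not in config:
            raise ValueError(f"missing field: {field}")
    state = provisioners.setdefault(                       # 3. ensure provisioner running
        dtype, {"status": "RUNNING", "httpPort": 8000}
    )
    for f in spec["connection_fields"]:                    # 4. override connection fields
        if f in state:
            config[f] = state[f]
    return {"dtype": dtype, "configuration": config}       # 5. create entity (persisted in reality)

registry = {"chromadb_local": {"schema": {"collectionName": {"required": True}},
                               "connection_fields": ["httpPort"]}}
ds = create_dataset_flow("chromadb_local", {"collectionName": "Docs", "httpPort": 1}, registry, {})
print(ds["configuration"]["httpPort"])  # → 8000
```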

Delete dataset

async def delete_dataset(name: str, tenant: Tenant) -> dict:
    """
    1. Deletes dataset type resources (collections, files)
    2. Deletes database record
    3. DOES NOT stop provisioner (it may be shared)
    """
Deleting a dataset does NOT stop its provisioner. Use admin endpoints to manually stop/delete provisioners if needed.
Location: backend/syft_space/components/datasets/handlers.py:494

Healthcheck

Check if a dataset’s connection is healthy:
async def healthcheck(name: str, tenant: Tenant) -> HealthcheckResponse:
    """
    Returns:
    - dataset_type_status: Connection health (HEALTHY/UNHEALTHY)
    - provisioner_status: Provisioner state (if local type)
    - message: Details
    """
Location: backend/syft_space/components/datasets/handlers.py:544

Connection fields

Dataset types define which configuration fields are connection-related:
@classmethod
def connection_fields(cls) -> list[str]:
    """Fields shared across datasets via provisioner.
    
    Example for Weaviate: ["httpPort", "grpcPort"]
    """
These fields are:
  • Shared across all datasets of the same type
  • Overridden from provisioner state when creating new datasets
  • Stored in ProvisionerState.state
Non-connection fields (e.g., collectionName, ingestionPath) remain unique per dataset.
Location: backend/syft_space/components/dataset_types/interfaces.py:188

Relationships

  • Tenant: Each dataset belongs to one tenant
  • Endpoints: One dataset can be used by multiple endpoints
  • ProvisionerState: Local datasets link to shared provisioner state

Example workflow

1. Create ChromaDB dataset
   POST /api/v1/datasets with dtype: "chromadb_local"
   Backend starts the ChromaDB provisioner (first time only)

2. Ingest documents
   POST /api/v1/datasets/{name}/ingest with PDF files
   Files are chunked and embedded into the collection

3. Create second dataset
   POST /api/v1/datasets with a different collectionName
   Reuses the existing ChromaDB provisioner

4. Query via endpoint
   Create an endpoint linking to the dataset
   Query endpoint → searches dataset → returns results

Next steps

Models

Learn how to connect AI models for response generation

Endpoints

Combine datasets and models into queryable endpoints
