Dataset entity
A dataset is defined by the following properties:
backend/syft_space/components/datasets/entities.py:47
Dataset types
Dataset types implement the BaseDatasetType protocol and provide:
Configuration schema
Each type defines required fields:
Search interface
All dataset types implement search functionality.
Parameters (interfaces.py:25):
- similarity_threshold (float): Minimum similarity score (0.0-1.0)
- limit (int): Maximum number of results
- include_metadata (bool): Whether to include document metadata
Returns:
- documents: List of SearchedDocument objects
  - document_id: Unique document identifier
  - content: Document text
  - metadata: Custom metadata dict
  - similarity_score: Relevance score (0.0-1.0)
backend/syft_space/components/dataset_types/interfaces.py:55
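The search parameters and result fields above can be sketched as follows. This is a minimal illustration, not the code in interfaces.py: the field names come from the lists above, while the `filter_results` helper, its default values, and the in-memory candidate list are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SearchedDocument:
    """One search hit, using the field names listed above."""
    document_id: str
    content: str
    similarity_score: float  # relevance score in [0.0, 1.0]
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_results(
    candidates: list[SearchedDocument],
    similarity_threshold: float = 0.5,  # assumed default, not from the source
    limit: int = 5,                     # assumed default, not from the source
    include_metadata: bool = True,
) -> list[SearchedDocument]:
    """Hypothetical helper: apply the three search parameters to scored hits."""
    # Keep only hits at or above the minimum similarity score.
    kept = [d for d in candidates if d.similarity_score >= similarity_threshold]
    # Most relevant first, then cap at the requested number of results.
    kept.sort(key=lambda d: d.similarity_score, reverse=True)
    kept = kept[:limit]
    if not include_metadata:
        # Return copies without metadata so the input documents stay untouched.
        kept = [
            SearchedDocument(d.document_id, d.content, d.similarity_score)
            for d in kept
        ]
    return kept


hits = [
    SearchedDocument("a", "alpha", 0.91, {"source": "a.pdf"}),
    SearchedDocument("b", "beta", 0.42, {"source": "b.pdf"}),
    SearchedDocument("c", "gamma", 0.77, {"source": "c.pdf"}),
]
top = filter_results(hits, similarity_threshold=0.5, limit=2, include_metadata=False)
print([d.document_id for d in top])  # ['a', 'c']
```

A real dataset type would delegate scoring to its vector store; the sketch only shows how the three parameters shape the returned list.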
Available dataset types
Weaviate (remote)
Type name: weaviate
Cloud or self-hosted Weaviate vector database.
Configuration:
ChromaDB (local)
Type name: chromadb_local
Local ChromaDB instance managed by Syft Space.
Configuration:
Provisioners
Provisioners manage the lifecycle of local dataset infrastructure (containers, processes). They are shared across all datasets of the same type.
Provisioner lifecycle
Location: backend/syft_space/components/datasets/entities.py:16
Provisioner state
Shared state is tracked in the database:
backend/syft_space/components/datasets/entities.py:113
Key provisioner behaviors
Shared provisioners: Multiple datasets of the same type share one provisioner. When you create a second ChromaDB dataset, it reuses the existing ChromaDB container.
backend/syft_space/components/datasets/handlers.py:377
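The sharing behavior can be pictured with a small in-memory registry keyed by dataset type. This is only a sketch under stated assumptions: `ProvisionerRecord`, `ProvisionerRegistry`, and `attach` are hypothetical names, and the real handlers persist this state in the database rather than in a dict.

```python
from dataclasses import dataclass, field


@dataclass
class ProvisionerRecord:
    """Hypothetical stand-in for the shared state of one dataset type."""
    dtype: str
    dataset_names: list[str] = field(default_factory=list)


class ProvisionerRegistry:
    """Illustrative only: one provisioner per dataset type, reused across datasets."""

    def __init__(self) -> None:
        self._by_dtype: dict[str, ProvisionerRecord] = {}
        self.starts = 0  # how many times infrastructure was actually started

    def attach(self, dtype: str, dataset_name: str) -> ProvisionerRecord:
        record = self._by_dtype.get(dtype)
        if record is None:
            # First dataset of this type: start the infrastructure once.
            record = ProvisionerRecord(dtype=dtype)
            self._by_dtype[dtype] = record
            self.starts += 1
        # Later datasets of the same type reuse the existing record.
        record.dataset_names.append(dataset_name)
        return record


registry = ProvisionerRegistry()
first = registry.attach("chromadb_local", "docs")
second = registry.attach("chromadb_local", "notes")
print(first is second, registry.starts)  # True 1
```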
Startup/shutdown
Provisioners are automatically managed:
- On app startup (startup_all_provisioners):
  - Finds all provisioners with attached datasets
  - Starts them if not already running
  - Recovers from stuck STARTING/STOPPING states
- On app shutdown (shutdown_all_provisioners):
  - Stops all running provisioners
  - Best-effort (continues on errors)
backend/syft_space/components/datasets/handlers.py:194
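The startup and shutdown rules above might look roughly like this. The function names match the ones mentioned in the text, but the bodies are a guess at the logic rather than the code in handlers.py. The `Provisioner` class, the RUNNING and STOPPED status names, and the `fail_on_stop` flag are assumptions added so the example runs on its own.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    STOPPED = "STOPPED"    # assumed name
    STARTING = "STARTING"
    RUNNING = "RUNNING"    # assumed name
    STOPPING = "STOPPING"


@dataclass
class Provisioner:
    """Hypothetical in-memory provisioner used only for this sketch."""
    name: str
    status: Status
    dataset_count: int
    fail_on_stop: bool = False  # lets the example simulate a shutdown error

    def start(self) -> None:
        self.status = Status.RUNNING

    def stop(self) -> None:
        if self.fail_on_stop:
            raise RuntimeError("simulated stop failure")
        self.status = Status.STOPPED


def startup_all_provisioners(provisioners: list[Provisioner]) -> list[str]:
    started = []
    for p in provisioners:
        if p.dataset_count == 0:
            continue  # only provisioners with attached datasets are started
        if p.status in (Status.STARTING, Status.STOPPING):
            p.status = Status.STOPPED  # recover from a stuck transitional state
        if p.status is not Status.RUNNING:
            p.start()
            started.append(p.name)
    return started


def shutdown_all_provisioners(provisioners: list[Provisioner]) -> list[str]:
    errors = []
    for p in provisioners:
        if p.status is not Status.RUNNING:
            continue
        try:
            p.stop()
        except Exception as exc:
            # Best-effort: record the failure and keep stopping the rest.
            errors.append(f"{p.name}: {exc}")
    return errors


fleet = [
    Provisioner("chroma", Status.STARTING, dataset_count=2),
    Provisioner("idle", Status.STOPPED, dataset_count=0),
    Provisioner("flaky", Status.RUNNING, dataset_count=1, fail_on_stop=True),
]
started = startup_all_provisioners(fleet)
errors = shutdown_all_provisioners(fleet)
print(started, errors)
```

Collecting errors instead of raising is what makes shutdown best-effort: one failing provisioner does not prevent the others from stopping.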
Data ingestion
Datasets that implement IngestableDatasetType support file uploads:
- files: List of IngestFile objects
  - file_handle: File-like object (BytesIO, SpooledTemporaryFile)
  - filename: Original filename
  - content_type: MIME type
  - file_size: Size in bytes
backend/syft_space/components/dataset_types/interfaces.py:206
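As an illustration of the upload payload, the sketch below models IngestFile with the four fields listed above and splits decoded text into overlapping chunks. The `make_ingest_file` and `chunk_text` helpers, the chunk size, and the overlap are assumptions for the example; the source does not describe how Syft Space chunks files.

```python
from dataclasses import dataclass
from io import BytesIO
from typing import BinaryIO


@dataclass
class IngestFile:
    """Upload payload with the four fields listed above."""
    file_handle: BinaryIO
    filename: str
    content_type: str
    file_size: int


def make_ingest_file(data: bytes, filename: str, content_type: str) -> IngestFile:
    """Hypothetical convenience constructor that wraps raw bytes in a BytesIO."""
    return IngestFile(BytesIO(data), filename, content_type, len(data))


def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Assumed strategy: fixed-size character chunks with a small overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


upload = make_ingest_file(b"abcdefghij" * 4, "notes.txt", "text/plain")
text = upload.file_handle.read().decode("utf-8")
chunks = chunk_text(text, chunk_size=20, overlap=5)
print(upload.file_size, len(chunks))  # 40 3
```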
File watching
Datasets can monitor directories for new files:
backend/syft_space/components/dataset_types/interfaces.py:233
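One simple way to detect new files is to poll the directory and remember which names have already been handled. The source does not say how watching is implemented, so the polling approach and the `scan_new_files` helper below are assumptions, shown only to make the idea concrete.

```python
import tempfile
from pathlib import Path


def scan_new_files(directory: Path, seen: set[str]) -> list[Path]:
    """Hypothetical helper: return files not seen before and record them in `seen`."""
    new_files = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    watch_dir = Path(tmp)
    seen: set[str] = set()

    (watch_dir / "a.pdf").write_bytes(b"first")
    first_pass = [p.name for p in scan_new_files(watch_dir, seen)]

    (watch_dir / "b.pdf").write_bytes(b"second")
    second_pass = [p.name for p in scan_new_files(watch_dir, seen)]

print(first_pass, second_pass)  # ['a.pdf'] ['b.pdf']
```

In a long-running service each batch of new files would be handed to the ingestion step; an event-based watcher would serve the same purpose with lower latency.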
Dataset operations
Create dataset
backend/syft_space/components/datasets/handlers.py:333
Delete dataset
backend/syft_space/components/datasets/handlers.py:494
Healthcheck
Check if a dataset’s connection is healthy:
backend/syft_space/components/datasets/handlers.py:544
Connection fields
Dataset types define which configuration fields are connection-related. Connection fields are:
- Shared across all datasets of the same type
- Overridden from provisioner state when creating new datasets
- Stored in ProvisionerState.state
Other fields (collectionName, ingestionPath) remain unique per dataset.
Location: backend/syft_space/components/dataset_types/interfaces.py:188
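The override rule can be sketched as a dictionary merge. The helper name `apply_connection_fields` and the `host` and `port` field names are hypothetical; only collectionName and ingestionPath appear in the text above.

```python
def apply_connection_fields(
    requested_config: dict,
    provisioner_state: dict,
    connection_fields: set[str],
) -> dict:
    """Hypothetical helper: overwrite connection fields with the shared state."""
    merged = dict(requested_config)  # copy so the caller's config is not mutated
    for name in connection_fields:
        if name in provisioner_state:
            merged[name] = provisioner_state[name]
    return merged


# "host" and "port" are assumed connection fields, used for illustration only.
shared_state = {"host": "127.0.0.1", "port": 8100}
requested = {
    "host": "example.invalid",
    "port": 9999,
    "collectionName": "notes",
    "ingestionPath": "/data/notes",
}
final = apply_connection_fields(requested, shared_state, {"host", "port"})
print(final["host"], final["port"], final["collectionName"])  # 127.0.0.1 8100 notes
```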
Relationships
- Tenant: Each dataset belongs to one tenant
- Endpoints: One dataset can be used by multiple endpoints
- ProvisionerState: Local datasets link to shared provisioner state
Example workflow
Create ChromaDB dataset
- POST /api/v1/datasets with dtype: "chromadb_local"
- Backend starts ChromaDB provisioner (first time)
Ingest documents
- POST /api/v1/datasets/{name}/ingest with PDF files
- Files are chunked and embedded into collection
Create second dataset
- POST /api/v1/datasets with different collectionName
- Reuses existing ChromaDB provisioner
Next steps
Models
Learn how to connect AI models for response generation
Endpoints
Combine datasets and models into queryable endpoints