Datasets are the foundation of your Syft Space endpoints. They store and index your documents, making them searchable through vector similarity. This guide walks you through creating and configuring datasets.

Understanding dataset types

Syft Space supports multiple dataset types, each designed for different use cases:

Local file dataset

Stores documents locally using ChromaDB for vector search. Best for getting started quickly with your own documents.
Key features:
  • Automatic file watching and ingestion
  • Supports PDF, TXT, HTML, XLSX, DOCX, MD, CSV, JSON
  • Built-in ChromaDB provisioning
  • Local storage with persistent data
Configuration:
  • collectionName - Name for your document collection (alphanumeric and underscores only)
  • httpPort - ChromaDB server HTTP port (default: 8100)
  • filePaths - List of directories or files to watch for automatic ingestion
  • ingestFileTypeOptions - File extensions to ingest (e.g., .pdf, .txt)
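The extension filter above can be sketched conceptually: the watcher only queues files whose suffix appears in ingestFileTypeOptions. The helper below is a hypothetical illustration, not Syft Space's actual implementation:

```python
from pathlib import Path

def should_ingest(path: str, ingest_file_type_options: list[str]) -> bool:
    # Hypothetical sketch: only queue files whose extension is on the
    # configured allow-list (compared case-insensitively).
    allowed = {ext.lower() for ext in ingest_file_type_options}
    return Path(path).suffix.lower() in allowed

watched = ["/docs/report.PDF", "/docs/notes.txt", "/docs/image.png"]
queued = [p for p in watched if should_ingest(p, [".pdf", ".txt"])]
# image.png is skipped; report.PDF matches despite its uppercase suffix
```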

Remote Weaviate

Connects to an existing Weaviate instance for vector search. Ideal if you already have data in Weaviate.
Key features:
  • Connect to hosted or self-hosted Weaviate
  • Query existing collections
  • Custom filters and metadata mapping
  • Third-party embedding API support
Configuration:
  • http_url - HTTP URL of your Weaviate server
  • grpc_url - gRPC URL of your Weaviate server
  • api_key - API key for authentication
  • collection_name - Name of the Weaviate collection
  • content_property - Property to use as main content (optional)
  • metadata_properties - Properties to include in metadata (optional)
  • filters - Query filters to apply when searching (optional)
  • headers - Additional HTTP headers for third-party APIs (optional)
  • default_similarity_threshold - Default similarity threshold (default: 0.5)

Creating a dataset

1. Open the Datasets page
   From your Syft Space dashboard, click Datasets in the sidebar, then click Add Dataset.
2. Choose dataset type
   Select the dataset type that matches your needs:
   • Local file - For uploading and managing files locally
   • Remote Weaviate - For connecting to an existing Weaviate instance
3. Configure dataset settings
   Fill in the required configuration fields based on your selected type.
Local file

Basic settings:
1. Name - A unique identifier for your dataset (e.g., “research-papers”)
2. Collection name - Name for the ChromaDB collection (e.g., “ResearchDocs”)
3. Summary - Brief description of what this dataset contains
4. Tags - Comma-separated tags for organization (e.g., “research,papers,ai”)

File watching:
1. Click Add file path to specify directories or files to watch
2. Enter the absolute path (e.g., /home/user/documents)
3. Add a description for this path (e.g., “Research papers from 2024”)
4. Add multiple paths if needed

File type options:
Select which file types to ingest from your watched paths:
• .pdf - PDF documents
• .txt - Plain text files
• .md - Markdown files
• .docx - Word documents
• .html - HTML files
• .csv - CSV data files
• .json - JSON data files
• .xlsx - Excel spreadsheets
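The collection-name rule from the configuration reference (alphanumeric characters and underscores only) can be checked before you submit the form. This validator is a hypothetical sketch, not part of Syft Space:

```python
import re

# Hypothetical validator for the documented rule: collection names may
# contain alphanumeric characters and underscores only.
COLLECTION_NAME_RE = re.compile(r"[A-Za-z0-9_]+")

def is_valid_collection_name(name: str) -> bool:
    return COLLECTION_NAME_RE.fullmatch(name) is not None

valid = is_valid_collection_name("ResearchDocs")     # True
invalid = is_valid_collection_name("research-docs")  # False: hyphen not allowed
```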
Remote Weaviate

Connection settings:
1. Name - A unique identifier for your dataset (e.g., “production-kb”)
2. HTTP URL - Full HTTP URL (e.g., https://my-cluster.weaviate.network)
3. gRPC URL - Full gRPC URL (e.g., https://my-cluster.weaviate.network)
4. API Key - Your Weaviate API key
5. Collection name - Name of the collection in Weaviate
6. Summary - Brief description of this dataset
7. Tags - Comma-separated tags (e.g., “production,knowledge-base”)

Content mapping (optional):
1. Content property - Specify which property contains the main text (e.g., “body”, “description”)
   • If not specified, all properties are JSON-serialized as content
2. Metadata properties - List properties to include in metadata (e.g., [“title”, “author”, “date”])
   • If not specified, all properties are included
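The mapping rules above, including both fallbacks, can be illustrated with a short sketch. The function name and shape are assumptions for illustration only, not Syft Space's code:

```python
import json

def map_object(props: dict, content_property=None, metadata_properties=None):
    # Hypothetical sketch of the content-mapping rules described above.
    if content_property:
        content = str(props[content_property])
    else:
        # Fallback: no content_property, so JSON-serialize all properties.
        content = json.dumps(props)
    if metadata_properties is None:
        # Fallback: no metadata_properties, so include all properties.
        metadata = dict(props)
    else:
        metadata = {k: props[k] for k in metadata_properties if k in props}
    return content, metadata

obj = {"title": "Intro", "body": "Hello world", "author": "Ada"}
content, meta = map_object(obj, content_property="body",
                           metadata_properties=["title", "author"])
# content == "Hello world"; meta keeps only title and author
```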
Advanced options (optional):
1. Default similarity threshold - Minimum similarity score (0.0-1.0, default: 0.5)
2. Headers - Additional HTTP headers for third-party embedding APIs:

   {
     "X-Cohere-Api-Key": "your-cohere-key",
     "X-OpenAI-Api-Key": "your-openai-key"
   }

3. Filters - Query filters to apply when searching, in Weaviate filter format
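The default similarity threshold acts as a floor on search results: hits scoring below it are discarded. A minimal stdlib sketch of that behavior (the function and sample scores are hypothetical):

```python
def apply_threshold(hits, threshold=0.5):
    # Hypothetical sketch: keep only hits at or above the similarity
    # threshold, mirroring default_similarity_threshold (default 0.5).
    return [(doc, score) for doc, score in hits if score >= threshold]

hits = [("doc-a", 0.82), ("doc-b", 0.61), ("doc-c", 0.34)]
strict = apply_threshold(hits, threshold=0.7)  # only doc-a survives
default = apply_threshold(hits)                # doc-c falls below 0.5
```

Raising the threshold trades recall for precision: a stricter floor returns fewer but more relevant documents.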
4. Save and provision
   Click Create Dataset. For local file datasets, Syft Space automatically:
   • Starts a ChromaDB Docker container
   • Creates the collection with your specified settings
   • Begins watching your specified file paths
   • Ingests existing files automatically
   For remote Weaviate datasets, Syft Space validates the connection and collection access.

Managing file ingestion

For local file datasets, Syft Space provides automatic file watching and ingestion.

Starting ingestion

1. View dataset details
   Click on your dataset to view its details page.
2. Check ingestion status
   The dataset detail page shows:
   • Whether file watching is active
   • Total ingestion jobs (pending, in progress, completed, failed)
   • List of recent ingestion jobs with status
3. Start watching
   If file watching is not active, click Start Ingestion to:
   • Begin monitoring your configured file paths
   • Create ingestion jobs for all existing files
   • Automatically process new files as they appear

Monitoring ingestion jobs

Each file ingestion creates a job that you can monitor:
• Pending - Waiting to be processed
• In progress - Currently being parsed and indexed
• Completed - Successfully ingested
• Failed - Ingestion failed (see error message)
• Cancelled - Manually cancelled
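The job lifecycle can be pictured as a small state machine in which only failed jobs are re-queued. This model is purely illustrative; the field names and transitions are assumptions, not Syft Space internals:

```python
# Illustrative model of the ingestion-job lifecycle described above.
def retry_failed(jobs):
    """Re-queue every failed job, mirroring the Retry Failed Jobs button."""
    for job in jobs:
        if job["state"] == "failed":
            job["state"] = "pending"  # back to the start of the lifecycle
            job["error"] = None       # clear the previous error message
    return jobs

jobs = [
    {"file": "a.pdf", "state": "completed", "error": None},
    {"file": "b.pdf", "state": "failed", "error": "corrupted file"},
]
retry_failed(jobs)
# b.pdf is pending again with its error cleared; a.pdf is untouched
```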

Retrying failed ingestions

If ingestion jobs fail:
1. Review the error message to understand the issue
2. Fix the underlying problem (e.g., corrupted file, permissions)
3. Click Retry Failed Jobs to reprocess failed files

Stopping ingestion

To stop file watching, click Stop Ingestion on the dataset detail page. Any in-progress jobs are cancelled, and new files are no longer processed automatically.

While active, file watching continues even after a server restart: Syft Space remembers your configuration and resumes monitoring.

Checking dataset health

Before using a dataset in an endpoint, verify it’s healthy:

1. View provisioner status
   For local file datasets, check the provisioner status on the dataset detail page:
   • Running - ChromaDB container is active and healthy
   • Stopped - Container is not running
   • Starting - Container is being provisioned
   • Error - Provisioning failed (see error message)
2. Test dataset connection
   Click Test Connection to verify:
   • The dataset service is responding
   • Authentication is working (for remote datasets)
   • The collection exists and is accessible

For local file datasets, the provisioner automatically restarts if the container stops. You can manually restart it from the dataset detail page if needed.

Dataset configuration examples

Research papers dataset

{
  "name": "research-papers",
  "dtype": "local_file",
  "configuration": {
    "collectionName": "ResearchPapers",
    "httpPort": 8100,
    "filePaths": [
      {
        "path": "/home/user/research/papers",
        "description": "Academic papers on machine learning"
      },
      {
        "path": "/home/user/research/preprints",
        "description": "ArXiv preprints"
      }
    ],
    "ingestFileTypeOptions": [".pdf", ".txt", ".md"]
  },
  "summary": "Collection of ML research papers and preprints",
  "tags": "research,ml,papers"
}

Production knowledge base

{
  "name": "production-kb",
  "dtype": "remote_weaviate",
  "configuration": {
    "http_url": "https://kb-cluster.weaviate.network",
    "grpc_url": "https://kb-cluster.weaviate.network",
    "api_key": "your-api-key-here",
    "collection_name": "KnowledgeArticles",
    "content_property": "body",
    "metadata_properties": ["title", "author", "category", "published_at"],
    "default_similarity_threshold": 0.7
  },
  "summary": "Production knowledge base articles",
  "tags": "production,kb,articles"
}

Updating datasets

You can update certain dataset properties after creation:

1. Select the dataset
   Click on the dataset you want to update.
2. Edit properties
   Click Edit to modify:
   • Name - Change the dataset identifier
   • Summary - Update the description
   • Tags - Modify the tag list
   You cannot change the dataset type or core configuration (like collection name or connection settings) after creation. To change these, you must create a new dataset.
3. Save changes
   Click Save to apply your changes.

Deleting datasets

Deleting a dataset permanently removes all associated data:

1. Check connected endpoints
   Before deleting, verify no endpoints are using this dataset. The dataset detail page shows all connected endpoints.
2. Delete dataset
   Click Delete Dataset and confirm the action.
3. What gets deleted
   For local file datasets:
   • The ChromaDB collection and all vectors
   • All ingestion jobs and history
   • Associated page images and extracted content
   • The Docker container (if not shared)
   For remote Weaviate datasets:
   • Only the dataset configuration in Syft Space
   • Your actual Weaviate collection is NOT affected

Deletion is permanent and cannot be undone. Make sure to back up any important data before deleting.

Next steps

Connect models

Add AI models to generate responses from your dataset

Build endpoints

Create queryable endpoints that combine datasets and models
