Understanding dataset types
Syft Space supports multiple dataset types, each designed for different use cases.

Local file dataset
Stores documents locally using ChromaDB for vector search. Best for getting started quickly with your own documents. Icon: 🎨

Key features:
- Automatic file watching and ingestion
- Supports PDF, TXT, HTML, XLSX, DOCX, MD, CSV, JSON
- Built-in ChromaDB provisioning
- Local storage with persistent data
Configuration options:
- collectionName - Name for your document collection (alphanumeric and underscores only)
- httpPort - ChromaDB server HTTP port (default: 8100)
- filePaths - List of directories or files to watch for automatic ingestion
- ingestFileTypeOptions - File extensions to ingest (e.g., .pdf, .txt)
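Put together, the options above might look like the following configuration object. This is only a sketch using the option names listed on this page; the exact format Syft Space expects may differ.

```python
import re

# Hypothetical local-file dataset configuration, using the option
# names documented above. The real Syft Space config format may differ.
local_dataset_config = {
    "collectionName": "ResearchDocs",            # alphanumeric and underscores only
    "httpPort": 8100,                            # ChromaDB server HTTP port (default)
    "filePaths": ["/home/user/documents"],       # directories/files to watch
    "ingestFileTypeOptions": [".pdf", ".txt"],   # extensions to ingest
}

# Sanity check mirroring the "alphanumeric and underscores only" rule:
assert re.fullmatch(r"\w+", local_dataset_config["collectionName"])
```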
Remote Weaviate
Connects to an existing Weaviate instance for vector search. Ideal if you already have data in Weaviate. Icon: 🌐

Key features:
- Connect to hosted or self-hosted Weaviate
- Query existing collections
- Custom filters and metadata mapping
- Third-party embedding API support
Configuration options:
- http_url - HTTP URL of your Weaviate server
- grpc_url - gRPC URL of your Weaviate server
- api_key - API key for authentication
- collection_name - Name of the Weaviate collection
- content_property - Property to use as main content (optional)
- metadata_properties - Properties to include in metadata (optional)
- filters - Query filters to apply when searching (optional)
- headers - Additional HTTP headers for third-party APIs (optional)
- default_similarity_threshold - Default similarity threshold (default: 0.5)
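For comparison, a remote Weaviate dataset configuration might look like this. The collection and property names here are placeholders, and the structure is a hypothetical sketch based on the option names above, not Syft Space's confirmed format.

```python
# Hypothetical remote Weaviate dataset configuration, using the option
# names documented above. Values like "Articles" and "body" are placeholders.
weaviate_dataset_config = {
    "http_url": "https://my-cluster.weaviate.network",
    "grpc_url": "https://my-cluster.weaviate.network",
    "api_key": "YOUR_API_KEY",
    "collection_name": "Articles",
    "content_property": "body",                  # optional: main content field
    "metadata_properties": ["title", "author"],  # optional
    "filters": None,                             # optional, Weaviate filter format
    "headers": {},                               # optional, e.g. embedding-API keys
    "default_similarity_threshold": 0.5,         # range 0.0-1.0
}

assert 0.0 <= weaviate_dataset_config["default_similarity_threshold"] <= 1.0
```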
Creating a dataset
Local file
Basic settings:
- Name - A unique identifier for your dataset (e.g., “research-papers”)
- Collection name - Name for the ChromaDB collection (e.g., “ResearchDocs”)
- Summary - Brief description of what this dataset contains
- Tags - Comma-separated tags for organization (e.g., “research,papers,ai”)
- Click Add file path to specify directories or files to watch
- Enter the absolute path (e.g., /home/user/documents)
- Add a description for this path (e.g., “Research papers from 2024”)
- Add multiple paths if needed
Supported file types:
- .pdf - PDF documents
- .txt - Plain text files
- .md - Markdown files
- .docx - Word documents
- .html - HTML files
- .csv - CSV data files
- .json - JSON data files
- .xlsx - Excel spreadsheets
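Conceptually, the watcher picks up files under your configured paths whose extension is in the ingest list. The helper below is a minimal sketch of that selection logic, not Syft Space's actual implementation.

```python
from pathlib import Path

# The supported extensions listed above.
INGEST_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".html", ".csv", ".json", ".xlsx"}

def files_to_ingest(watch_path: str) -> list[Path]:
    """Return files under watch_path whose extension is in the ingest list."""
    root = Path(watch_path)
    if root.is_file():
        return [root] if root.suffix.lower() in INGEST_EXTENSIONS else []
    # Recurse into directories, matching extensions case-insensitively.
    return [p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() in INGEST_EXTENSIONS]
```

In practice you would narrow `INGEST_EXTENSIONS` to whatever you set in `ingestFileTypeOptions`.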
Remote Weaviate
Connection settings:
- Name - A unique identifier for your dataset (e.g., “production-kb”)
- HTTP URL - Full HTTP URL (e.g., https://my-cluster.weaviate.network)
- gRPC URL - Full gRPC URL (e.g., https://my-cluster.weaviate.network)
- API Key - Your Weaviate API key
- Collection name - Name of the collection in Weaviate
- Summary - Brief description of this dataset
- Tags - Comma-separated tags (e.g., “production,knowledge-base”)
- Content property - Specify which property contains the main text (e.g., “body”, “description”)
- If not specified, all properties are JSON-serialized as content
- Metadata properties - List properties to include in metadata (e.g., [“title”, “author”, “date”])
- If not specified, all properties are included
- Default similarity threshold - Minimum similarity score (0.0-1.0, default: 0.5)
- Headers - Additional HTTP headers for third-party embedding APIs
- Filters - Query filters in Weaviate filter format
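The similarity threshold acts as a minimum score: results scoring below it are dropped. The snippet below illustrates that behavior with made-up scores; it is not how Syft Space implements the check internally.

```python
# Sketch of minimum-similarity filtering. Scores are illustrative;
# Syft Space applies default_similarity_threshold when a query
# doesn't specify its own threshold.
def apply_threshold(hits: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only hits whose similarity meets the minimum score."""
    return [h for h in hits if h["similarity"] >= threshold]

hits = [
    {"id": "a", "similarity": 0.82},
    {"id": "b", "similarity": 0.47},
]
assert [h["id"] for h in apply_threshold(hits)] == ["a"]
```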
Managing file ingestion
For local file datasets, Syft Space provides automatic file watching and ingestion.

Starting ingestion
Monitoring ingestion jobs
Each file ingestion creates a job that you can monitor:
- Pending - Waiting to be processed
- In progress - Currently being parsed and indexed
- Completed - Successfully ingested
- Failed - Ingestion failed (see error message)
- Cancelled - Manually cancelled
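The job lifecycle above can be mirrored in a small state enum. The names below are hypothetical; Syft Space's internal state identifiers may differ, but the logic matches the statuses listed: only failed jobs are eligible for retry.

```python
from enum import Enum

# Hypothetical mirror of the ingestion-job statuses listed above.
class JobStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

def is_retryable(status: JobStatus) -> bool:
    """Only failed jobs are candidates for 'Retry Failed Jobs'."""
    return status is JobStatus.FAILED
```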
Retrying failed ingestions
If ingestion jobs fail:
- Review the error message to understand the issue
- Fix the underlying problem (e.g., corrupted file, permissions)
- Click Retry Failed Jobs to reprocess failed files
Stopping ingestion
To stop file watching:
- Click Stop Ingestion on the dataset detail page
- Any in-progress jobs will be cancelled
- New files will not be automatically processed
File watching continues even after a server restart. Syft Space remembers your configuration and resumes monitoring.
Checking dataset health
Before using a dataset in an endpoint, verify it’s healthy.

Dataset configuration examples
Research papers dataset
Production knowledge base
Updating datasets
You can update certain dataset properties after creation.

You cannot change the dataset type or core configuration (like collection name or connection settings) after creation. To change these, you must create a new dataset.
Deleting datasets
Deleting a dataset permanently removes all associated data.

Before deleting, verify no endpoints are using this dataset. The dataset detail page shows all connected endpoints.
Next steps
Connect models
Add AI models to generate responses from your dataset
Build endpoints
Create queryable endpoints that combine datasets and models