Create a new dataset with a specific type and configuration. For file-ingestable dataset types, ingestion automatically starts after creation.

Request

POST /datasets/

Authentication

Requires authentication with tenant credentials. The example request below passes an API key as a Bearer token in the Authorization header.

Request body

name
string
required
Name for the dataset; must be unique within the tenant
dtype
string
required
Dataset type name. Available types:
  • local_file - Local filesystem with ChromaDB for vector search
  • remote_weaviate - Remote Weaviate server connection
configuration
object
required
Configuration schema specific to the dataset type. Use the list dataset types endpoint to get the schema for each type.
For local_file:
  • collectionName (string) - ChromaDB collection name (alphanumeric and underscores only)
  • httpPort (integer) - ChromaDB server HTTP port (default: 8100)
  • ingestionPaths (array) - List of files/directories to watch and ingest; the example request passes each entry as an object with path and description
  • fileTypes (array) - Allowed file extensions (e.g., [".pdf", ".txt", ".md"])
For remote_weaviate:
  • http_url (string) - HTTP URL of the Weaviate server
  • grpc_url (string) - gRPC URL of the Weaviate server
  • api_key (string) - API key for authentication
  • collection_name (string) - Name of the Weaviate collection
  • headers (object, optional) - Additional headers for third-party API keys
  • content_property (string, optional) - Property name to use as main content
  • metadata_properties (array, optional) - Properties to include in metadata
  • filters (object, optional) - Filter conditions for search queries
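
The two configuration shapes listed above can be assembled client-side before the request is sent. The following is a minimal sketch, not part of the API: the helper names `local_file_config` and `remote_weaviate_config` are hypothetical, the ASCII-only reading of "alphanumeric and underscores only" is an assumption, and the authoritative schema is whatever the list dataset types endpoint returns.

```python
import re

# Assumed reading of "alphanumeric and underscores only" (ASCII letters/digits).
_COLLECTION_NAME = re.compile(r"[A-Za-z0-9_]+")


def local_file_config(collection_name, ingestion_paths, file_types, http_port=8100):
    """Build a local_file configuration dict (hypothetical helper)."""
    if not _COLLECTION_NAME.fullmatch(collection_name):
        raise ValueError("collectionName must be alphanumeric/underscores only")
    return {
        "collectionName": collection_name,
        "httpPort": http_port,  # 8100 is the documented default
        "ingestionPaths": list(ingestion_paths),
        "fileTypes": list(file_types),
    }


def remote_weaviate_config(http_url, grpc_url, api_key, collection_name, **optional):
    """Build a remote_weaviate configuration dict (hypothetical helper)."""
    # Only the optional fields named in this reference are accepted here.
    allowed = {"headers", "content_property", "metadata_properties", "filters"}
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError(f"unknown optional fields: {sorted(unknown)}")
    config = {
        "http_url": http_url,
        "grpc_url": grpc_url,
        "api_key": api_key,
        "collection_name": collection_name,
    }
    config.update(optional)
    return config
```

Rejecting a bad collection name locally avoids a round trip, but the server's own validation remains the source of truth.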
summary
string
default:""
Brief summary describing the dataset
tags
string
default:""
Comma-separated tags (e.g., "legal,documents,analysis")
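
Because tags travel as a single comma-separated string rather than an array, a client that keeps tags in a list needs to convert in both directions. A small sketch follows; `join_tags` and `split_tags` are hypothetical helpers, and the whitespace trimming and de-duplication are assumptions about sensible client behavior, not documented server rules.

```python
def join_tags(tags):
    """Join tag strings into the comma-separated form used by the tags field."""
    cleaned = []
    for tag in tags:
        tag = tag.strip()
        if "," in tag:
            # A comma inside a tag would be indistinguishable from a separator.
            raise ValueError(f"tag may not contain a comma: {tag!r}")
        if tag and tag not in cleaned:
            cleaned.append(tag)
    return ",".join(cleaned)


def split_tags(value):
    """Split a comma-separated tags string into a list; "" yields []."""
    return [part.strip() for part in value.split(",") if part.strip()]
```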

Response

id
string
Unique identifier (UUID) for the dataset
name
string
Dataset name
dtype
string
Dataset type name
configuration
object
Full configuration for the dataset
summary
string
Dataset summary
tags
string
Comma-separated tags
provisioner_state
object
Provisioner state information for local dataset types; in the example response it carries status, state, started_at, stopped_at, and error
created_at
string
ISO 8601 timestamp of creation
updated_at
string
ISO 8601 timestamp of last update
connected_endpoints
array
List of endpoints connected to this dataset

Example request

curl -X POST https://your-domain.com/datasets/ \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "name": "legal-docs",
    "dtype": "local_file",
    "configuration": {
      "collectionName": "LegalDocuments",
      "httpPort": 8100,
      "ingestionPaths": [
        {
          "path": "/home/user/documents/legal",
          "description": "Legal documents directory"
        }
      ],
      "fileTypes": [".pdf", ".txt", ".docx"]
    },
    "summary": "Legal documents for RAG analysis",
    "tags": "legal,documents,analysis"
  }'
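
For clients that do not shell out to curl, the same request can be put together with Python's standard library. This is a sketch under the same assumptions as the curl example (the `https://your-domain.com` base URL and `YOUR_API_KEY` placeholder); `build_create_dataset_request` is a hypothetical helper, and it only constructs the request so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_create_dataset_request(base_url, api_key, name, dtype, configuration,
                                 summary="", tags=""):
    """Return an unsent urllib Request mirroring the curl example."""
    body = {
        "name": name,
        "dtype": dtype,
        "configuration": configuration,
        "summary": summary,  # optional, defaults to ""
        "tags": tags,        # optional, comma-separated, defaults to ""
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/datasets/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_create_dataset_request(
    "https://your-domain.com",
    "YOUR_API_KEY",
    name="legal-docs",
    dtype="local_file",
    configuration={
        "collectionName": "LegalDocuments",
        "httpPort": 8100,
        "ingestionPaths": [
            {"path": "/home/user/documents/legal",
             "description": "Legal documents directory"}
        ],
        "fileTypes": [".pdf", ".txt", ".docx"],
    },
    summary="Legal documents for RAG analysis",
    tags="legal,documents,analysis",
)
# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as response:
#       dataset = json.load(response)
```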

Example response

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "legal-docs",
  "dtype": "local_file",
  "configuration": {
    "collectionName": "LegalDocuments",
    "httpPort": 8100,
    "ingestionPaths": [
      {
        "path": "/home/user/documents/legal",
        "description": "Legal documents directory"
      }
    ],
    "fileTypes": [".pdf", ".txt", ".docx"]
  },
  "summary": "Legal documents for RAG analysis",
  "tags": "legal,documents,analysis",
  "provisioner_state": {
    "status": "running",
    "state": {
      "httpPort": 8100,
      "container_id": "abc123def456"
    },
    "started_at": "2024-03-15T10:30:00Z",
    "stopped_at": null,
    "error": null
  },
  "created_at": "2024-03-15T10:30:00Z",
  "updated_at": "2024-03-15T10:30:00Z",
  "connected_endpoints": []
}
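
A caller will usually want to pull a few fields out of this response, such as the dataset id, the creation time, and whether the provisioner started. The sketch below shows one way to do that; `parse_timestamp` and `summarize_dataset` are hypothetical helpers, and treating a missing or null `provisioner_state` as "no provisioner" is an assumption based on the field being described as applying to local dataset types.

```python
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as "2024-03-15T10:30:00Z"."""
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def summarize_dataset(response_text):
    """Extract commonly needed fields from a create-dataset response body."""
    data = json.loads(response_text)
    # Assumption: provisioner_state may be absent or null for remote types.
    provisioner = data.get("provisioner_state") or {}
    return {
        "id": data["id"],
        "created_at": parse_timestamp(data["created_at"]),
        "provisioner_status": provisioner.get("status"),
        "provisioner_error": provisioner.get("error"),
        "endpoint_count": len(data.get("connected_endpoints", [])),
    }
```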
