Overview
A dataset configuration is the simplest type of Snuba configuration file. It primarily serves to:
- Define a logical grouping of entities
- Provide a namespace for queries
- Organize related data models
- Enable dataset-level routing and permissions
Schema
Dataset configurations follow the v1 schema:
- version: Schema version. Must be v1.
- kind: Component type. Must be dataset.
- name: Unique name for the dataset. Used for routing queries and referencing the dataset programmatically.
- entities: Array of entity names associated with this dataset. These must correspond to entity configuration files.
Basic Example
Here’s a simple dataset configuration (dataset.yaml). This configuration:
- Creates a dataset named events
- Associates it with a single entity named events
The entity must be defined in a separate entity configuration file.
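The configuration described above might look like the following. This is a minimal sketch that assumes the four schema fields are spelled version, kind, name, and entities.

```yaml
# dataset.yaml (sketch; field names are assumed from the v1 schema)
version: v1        # schema version, must be v1
kind: dataset      # component type, must be dataset
name: events       # unique dataset name
entities:
  - events         # must match an entity configuration file
```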
Complete Examples
Multi-Entity Datasets
Some datasets contain multiple entities that represent different views or subsets of the data.
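A multi-entity dataset.yaml could be sketched as follows. The dataset and entity names are hypothetical placeholders, and the field names assume the v1 schema uses version, kind, name, and entities.

```yaml
# dataset.yaml (sketch; all names below are hypothetical)
version: v1
kind: dataset
name: my_dataset
entities:
  - my_dataset_events        # hypothetical entity for one view of the data
  - my_dataset_transactions  # hypothetical entity for another view
```

Each listed name would need its own entity configuration file.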
When to Use Multiple Entities
Use multiple entities in a dataset when:
- Different data types: Events vs transactions, each with unique schemas
- Performance optimization: Separate hot and cold data paths
- Different query patterns: Read-only vs read-write access
- Access control: Different permission levels for different views
Each entity in a dataset can have its own storage backend, query processors, and validators. This allows fine-grained control over data access and query optimization.
Dataset Registration
Datasets are loaded and registered at startup by the configuration loader.
Querying Datasets
Queries are routed to datasets via the Snuba API.
Directory Structure
Dataset configurations should follow this structure: snuba/datasets/configuration/{dataset_name}/dataset.yaml
Validation
Dataset configurations are validated against the JSON schema at startup.
Common Validation Errors
Missing Required Fields
'version' is a required property
Fix: Add the version field.
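As a sketch, the fix amounts to one top-level line in dataset.yaml, using the only schema version this page names:

```yaml
version: v1   # required; the schema version must be v1
```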
Invalid Kind Value
data.kind must be equal to constant 'dataset'
Fix: Use the correct kind.
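A sketch of the corrected line, based on the constant named in the error message:

```yaml
kind: dataset   # for a dataset configuration, kind must be exactly "dataset"
```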
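Entity Not Found
This error arises when a name in the dataset’s entity list has no matching entity configuration file. Fix: Define the entity in its own configuration file, or correct the name in the list.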
Dataset Naming Conventions
Follow these conventions when naming datasets:
- Use lowercase with underscores: events, generic_metrics
- Be descriptive: Name should indicate the data type
- Be consistent: Use similar patterns across datasets
- Avoid abbreviations unless widely understood
Integration with Entities
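A dataset groups one or more entities. Each entity is defined in its own entity configuration file and can have its own storage backend, query processors, and validators.
Migration from Code-Based Configuration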
If you’re migrating from code-based dataset definitions:
- Create dataset.yaml: Create a new YAML file in snuba/datasets/configuration/{dataset_name}/dataset.yaml
- Extract configuration: Extract the dataset name and entity list from the old Python class and move them into the new YAML config.
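The result of the migration steps above might resemble this sketch. The dataset and entity names are hypothetical stand-ins for whatever the Python class declared, and the field names assume the v1 schema uses version, kind, name, and entities.

```yaml
# snuba/datasets/configuration/{dataset_name}/dataset.yaml (sketch)
version: v1
kind: dataset
name: my_dataset   # hypothetical; copy the dataset name used by the Python class
entities:          # hypothetical; one item per entity the Python class listed
  - my_entity
```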
Best Practices
Keep It Simple
Datasets should be simple groupings. Complex logic belongs in entities and storages.
Logical Grouping
Group entities that are commonly queried together or represent related data.
Clear Naming
Use descriptive, unambiguous names that indicate the dataset’s purpose.
Document Purpose
Add YAML comments to explain the dataset’s purpose and entity relationships.
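For example, a commented dataset.yaml could look like this sketch. The names and the comment wording are hypothetical, and the field names assume the v1 schema uses version, kind, name, and entities.

```yaml
# Purpose: groups the entities queried together for a (hypothetical) product area.
version: v1
kind: dataset
name: my_dataset
entities:
  # Primary view of the data; most queries target this entity.
  - my_dataset_events
  # Secondary view with its own schema; see its entity configuration file.
  - my_dataset_transactions
```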
Related Configuration
Entities
Configure entity schemas and query logic
Storages
Set up storage backends for entities