Datasets are the top-level organizational units in Snuba that group related entities. They provide a logical namespace for collections of data that are queried together.

Overview

A dataset configuration is the simplest type of Snuba configuration file. It primarily serves to:
  • Define a logical grouping of entities
  • Provide a namespace for queries
  • Organize related data models
  • Enable dataset-level routing and permissions

Schema

Dataset configurations follow the v1 schema:
  • version (string, required): Schema version. Must be v1.
  • kind (string, required): Component type. Must be dataset.
  • name (string, required): Unique name for the dataset. Used for routing queries and referencing the dataset programmatically.
  • entities (array): Array of entity names associated with this dataset. These must correspond to entity configuration files.

Basic Example

Here’s a simple dataset configuration:
dataset.yaml
version: v1
kind: dataset
name: events

entities:
  - events
This configuration:
  • Creates a dataset named events
  • Associates it with a single entity named events
  • Requires that entity to be defined in a separate entity configuration file

Complete Examples

version: v1
kind: dataset
name: events

entities:
  - events

Multi-Entity Datasets

Some datasets contain multiple entities that represent different views or subsets of the data:
dataset.yaml
version: v1
kind: dataset
name: discover

entities:
  - discover              # Combined view of events and transactions
  - discover_events       # Events-only view
  - discover_transactions # Transactions-only view

When to Use Multiple Entities

Use multiple entities in a dataset when:
  • Different data types: Events vs transactions, each with unique schemas
  • Performance optimization: Separate hot and cold data paths
  • Different query patterns: Read-only vs read-write access
  • Access control: Different permission levels for different views
Each entity in a dataset can have its own storage backend, query processors, and validators. This allows fine-grained control over data access and query optimization.
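To make the grouping concrete, here is a minimal sketch of a dataset as a named collection of entity names, using the discover example above. `DatasetConfig` and `has_entity` are hypothetical names chosen for illustration; they are not Snuba classes or methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical in-memory model of a dataset.yaml file (not a Snuba class)."""

    name: str
    entities: tuple[str, ...] = ()

    def has_entity(self, entity_name: str) -> bool:
        # A query target is only meaningful if the dataset lists that entity.
        return entity_name in self.entities


discover = DatasetConfig(
    name="discover",
    entities=("discover", "discover_events", "discover_transactions"),
)
```

A membership check like `discover.has_entity("discover_events")` is the kind of lookup a dataset-level namespace makes possible.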

Dataset Registration

Datasets are loaded and registered at startup by the configuration loader:
from snuba.datasets.configuration.dataset_builder import build_dataset_from_config
from snuba.datasets.factory import get_dataset

# Load from configuration
dataset = build_dataset_from_config("path/to/dataset.yaml")

# Access registered dataset
events_dataset = get_dataset("events")

Querying Datasets

Queries are routed to datasets via the Snuba API:
curl -X POST http://localhost:1218/events/snql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "MATCH (events) SELECT count() WHERE project_id IN (1, 2)",
    "dataset": "events"
  }'
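The same request can be assembled from Python with only the standard library. This is a sketch that mirrors the curl call above: the helper name `build_snql_request` is hypothetical, and the URL and payload are copied from the example rather than taken from an API reference. The request is built but not sent, so the snippet runs without a Snuba instance.

```python
import json
import urllib.request


def build_snql_request(base_url: str, dataset: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the dataset's snql endpoint."""
    body = json.dumps({"query": query, "dataset": dataset}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{dataset}/snql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_snql_request(
    "http://localhost:1218",
    "events",
    "MATCH (events) SELECT count() WHERE project_id IN (1, 2)",
)

# To send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```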

Directory Structure

Dataset configurations should follow this structure:
snuba/datasets/configuration/
└── {dataset_name}/
    ├── dataset.yaml              # Dataset configuration
    ├── entities/
    │   ├── {entity_1}.yaml      # Entity configurations
    │   ├── {entity_2}.yaml
    │   └── ...
    └── storages/
        ├── {storage_1}.yaml     # Storage configurations
        ├── {storage_2}.yaml
        └── ...
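Given this layout, dataset files can be located with a simple glob. The sketch below assumes only the directory structure shown above; `find_dataset_configs` is a hypothetical helper, not part of Snuba, and it runs against a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path


def find_dataset_configs(config_root: Path) -> dict[str, Path]:
    """Map each dataset directory name to the dataset.yaml inside it."""
    return {
        path.parent.name: path
        for path in sorted(config_root.glob("*/dataset.yaml"))
    }


# Build a throwaway tree that follows the documented structure.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("events", "discover"):
        (root / name / "entities").mkdir(parents=True)
        (root / name / "storages").mkdir()
        (root / name / "dataset.yaml").write_text(
            f"version: v1\nkind: dataset\nname: {name}\n"
        )
    found = sorted(find_dataset_configs(root))
```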

Validation

Dataset configurations are validated against the JSON schema at startup:
from snuba.datasets.configuration.json_schema import V1_DATASET_SCHEMA

# Schema structure
V1_DATASET_SCHEMA = {
    "title": "Dataset Schema",
    "type": "object",
    "properties": {
        "version": {"const": "v1"},
        "kind": {"const": "dataset"},
        "name": {"type": "string"},
        "entities": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["version", "kind", "name"],
    "additionalProperties": False
}
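For experimentation, the rules in this schema can be mirrored in plain Python. The following is a minimal sketch, not Snuba's validator: `check_dataset_config` is a hypothetical function, and its messages are illustrative and will not match the errors Snuba reports.

```python
from typing import Any

ALLOWED_KEYS = {"version", "kind", "name", "entities"}
REQUIRED_KEYS = ("version", "kind", "name")


def check_dataset_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the config passes."""
    problems: list[str] = []
    # "required": every listed key must be present.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing required field: {key}")
    # "additionalProperties": False means unknown keys are rejected.
    for key in sorted(set(config) - ALLOWED_KEYS):
        problems.append(f"unexpected field: {key}")
    # "const" constraints on version and kind.
    if "version" in config and config["version"] != "v1":
        problems.append("version must be 'v1'")
    if "kind" in config and config["kind"] != "dataset":
        problems.append("kind must be 'dataset'")
    if "name" in config and not isinstance(config["name"], str):
        problems.append("name must be a string")
    # entities is optional, but when present must be an array of strings.
    entities = config.get("entities", [])
    if not isinstance(entities, list) or not all(isinstance(e, str) for e in entities):
        problems.append("entities must be an array of strings")
    return problems
```

Returning a list instead of raising on the first failure makes it easy to report every problem in a file at once.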

Common Validation Errors

# ❌ Missing 'version' field
kind: dataset
name: events
Error: 'version' is a required property
Fix: Add the version field:
# ✅ Correct
version: v1
kind: dataset
name: events
# ❌ Wrong kind value
version: v1
kind: entity
name: events
Error: data.kind must be equal to constant 'dataset'
Fix: Use the correct kind:
# ✅ Correct
version: v1
kind: dataset
name: events
version: v1
kind: dataset
name: events
entities:
  - nonexistent_entity  # ❌ Entity file doesn't exist
Error: Runtime error when loading entity
Fix: Ensure the entity configuration exists:
snuba/datasets/configuration/events/entities/nonexistent_entity.yaml
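Because a missing entity file only surfaces at load time, it can help to check for it beforehand. The sketch below assumes each entity lives at entities/{entity_name}.yaml inside the dataset directory, as the path above suggests; `missing_entity_files` is a hypothetical helper, not a Snuba function.

```python
import tempfile
from pathlib import Path


def missing_entity_files(dataset_dir: Path, entities: list[str]) -> list[str]:
    """Return the entity names that have no entities/{name}.yaml file."""
    return [
        name
        for name in entities
        if not (dataset_dir / "entities" / f"{name}.yaml").is_file()
    ]


# Demonstrate on a temporary dataset directory with one entity file present.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "events"
    (dataset_dir / "entities").mkdir(parents=True)
    (dataset_dir / "entities" / "events.yaml").write_text("kind: entity\n")
    missing = missing_entity_files(dataset_dir, ["events", "nonexistent_entity"])
```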

Dataset Naming Conventions

Follow these conventions when naming datasets:
  • Use lowercase with underscores: events, generic_metrics
  • Be descriptive: Name should indicate the data type
  • Be consistent: Use similar patterns across datasets
  • Avoid abbreviations unless widely understood
Examples:
events
transactions
generic_metrics
sessions
replays
profiles
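The lowercase-with-underscores convention is easy to check mechanically. The pattern below is one possible reading of it: allowing digits and requiring a leading letter are assumptions, and `is_conventional_name` is a hypothetical helper rather than anything Snuba enforces.

```python
import re

# Lowercase words joined by single underscores. Digits are allowed after the
# first character (an assumption; the convention itself does not mention them).
DATASET_NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_conventional_name(name: str) -> bool:
    """Return True if the name is lowercase with underscore-separated words."""
    return DATASET_NAME_PATTERN.fullmatch(name) is not None
```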

Integration with Entities

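A dataset references its entities by name, and each name must correspond to an entity configuration file. Each entity, in turn, can have its own storage backend, query processors, and validators.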

Migration from Code-Based Configuration

If you’re migrating from code-based dataset definitions:
1. Create dataset.yaml

Create a new YAML file at snuba/datasets/configuration/{dataset_name}/dataset.yaml.
2. Extract configuration

Extract dataset name and entity list from the Python class:
Old Python Code
class EventsDataset(Dataset):
    def __init__(self):
        self.name = "events"
        self.entities = [EventsEntity()]
New YAML Config
version: v1
kind: dataset
name: events
entities:
  - events
3. Create entity configs

Create a corresponding entity configuration file for each entity listed in the dataset.
4. Test configuration

Start Snuba and verify the dataset loads correctly:
snuba devserver
# Check logs for "Loading dataset: events"
5. Remove old code

Once the new configuration is validated, remove the old Python-based dataset definition.
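When migrating several datasets, the extraction step can be scripted. This sketch renders the YAML shown in that step from a dataset name and a list of entity names. `render_dataset_yaml` is a hypothetical helper, and it builds the text by hand, so it assumes names need no YAML quoting.

```python
def render_dataset_yaml(name: str, entities: list[str]) -> str:
    """Render a v1 dataset configuration as YAML text."""
    lines = ["version: v1", "kind: dataset", f"name: {name}", "entities:"]
    # One list item per entity, indented to match the examples in this page.
    lines.extend(f"  - {entity}" for entity in entities)
    return "\n".join(lines) + "\n"


events_yaml = render_dataset_yaml("events", ["events"])
```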

Best Practices

Keep It Simple

Datasets should be simple groupings. Complex logic belongs in entities and storages.

Logical Grouping

Group entities that are commonly queried together or represent related data.

Clear Naming

Use descriptive, unambiguous names that indicate the dataset’s purpose.

Document Purpose

Add YAML comments to explain the dataset’s purpose and entity relationships.

Entities

Configure entity schemas and query logic

Storages

Set up storage backends for entities
