Datasets are the primary organizational unit in Snuba, providing a logical grouping of entities and storages that represent different types of data ingested into ClickHouse.

What are Datasets?

A dataset in Snuba represents a logical collection of related data and defines the interface through which clients can query that data. Each dataset contains:
  • Entities: Logical query interfaces that define schema and available columns
  • Storages: Physical ClickHouse tables where data is stored
  • Query Processors: Components that optimize and transform queries
  • Validators: Rules that ensure query correctness and performance

Core Datasets

Snuba provides several production datasets that power different Sentry features:

Events

Error and exception data with stack traces, user context, and tags

Transactions

Performance monitoring data with timing information and span details

Metrics

Time-series metrics including counters, distributions, and sets

Replays

Session replay data with user interactions and browser events

Dataset Architecture

Configuration Structure

Datasets are defined using YAML configuration files located in snuba/datasets/configuration/. Each dataset requires:
version: v1
kind: dataset
name: dataset_name

entities:
  - entity_name_1
  - entity_name_2

Entity-Storage Relationship

Entities provide the query interface while storages handle physical data organization. This separation allows for flexible query patterns and storage optimizations.

Entities

Entities define:
  • Column schema with types
  • Available query processors
  • Validation rules
  • Subscription capabilities
  • Join relationships

Storages

Storages define:
  • Physical ClickHouse table names
  • Partitioning strategies
  • Data retention policies
  • Allocation policies for rate limiting
  • Stream loader configuration

Query Flow

  1. Query Reception: Client submits a query through SnQL or the HTTP API
  2. Entity Selection: Dataset routes to appropriate entity
  3. Query Parsing: Entity parses and validates query
  4. Query Processing: Processors optimize and transform query
  5. Storage Selection: Storage selector chooses optimal storage
  6. Query Execution: Translated to ClickHouse SQL and executed
  7. Result Processing: Results formatted and returned to client
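As an illustration of the flow above, a client-side SnQL query against the events dataset might look like the following (a sketch: entity and column names assume the events schema, not a verbatim production query):

```
MATCH (events)
SELECT count() AS event_count BY project_id
WHERE project_id IN array(1, 2)
  AND timestamp >= toDateTime('2024-01-01T00:00:00')
  AND timestamp < toDateTime('2024-01-02T00:00:00')
```

Snuba parses this against the events entity, runs the entity's validators and query processors, translates it to ClickHouse SQL against the selected storage, and returns per-project counts.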

Data Ingestion

Each writable storage defines a stream loader that handles Kafka message processing:
stream_loader:
  processor: ProcessorClass
  default_topic: topic-name
  commit_log_topic: commit-log-topic
  subscription_scheduler_mode: partition|global
  subscription_delay_seconds: 60

  • processor: Message processor class that transforms Kafka messages
  • default_topic: Primary Kafka topic for data ingestion
  • commit_log_topic: Topic for tracking processed offsets
  • subscription_scheduler_mode: How subscriptions are scheduled (partition or global)
  • subscription_delay_seconds: Delay before subscription execution
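Conceptually, the message processor is the piece that turns a decoded Kafka payload into rows ready for ClickHouse insertion. The sketch below mirrors that shape but is simplified; the class and method names here are illustrative, not Snuba's real processor interface:

```python
# Hedged sketch of a stream-loader message processor: decode a Kafka
# payload and map its fields onto storage columns. Not Snuba's API.
import json

class SketchProcessor:
    def process_message(self, raw: bytes):
        payload = json.loads(raw)
        # Map incoming fields onto storage columns; ignore unknown fields.
        return [{
            "project_id": payload["project_id"],
            "event_id": payload["event_id"],
            "timestamp": payload["timestamp"],
        }]

rows = SketchProcessor().process_message(
    b'{"project_id": 1, "event_id": "abc", "timestamp": "2024-01-01T00:00:00"}'
)
```

The returned rows are what the consumer batches and writes to the storage's ClickHouse table.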

Common Configuration Patterns

Time-Based Partitioning

Most datasets use retention-based partitioning:
partition_format:
  - retention_days
  - date

This allows efficient data deletion when retention periods expire.
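With this partition format, the generated ClickHouse table might declare a partition key along these lines (illustrative; the exact expression depends on the storage definition):

```
PARTITION BY (retention_days, toMonday(timestamp))
```

Because each partition holds one retention period and date bucket, expired data can be dropped partition-by-partition instead of row-by-row.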

Allocation Policies

Datasets define resource allocation policies:
allocation_policies:
  - name: ConcurrentRateLimitAllocationPolicy
    args:
      required_tenant_types:
        - organization_id
        - referrer
        - project_id
  - name: BytesScannedWindowAllocationPolicy
    args:
      required_tenant_types:
        - organization_id

Query Processors

Processors optimize queries before execution:
query_processors:
  - processor: UUIDColumnProcessor
    args:
      columns: [event_id, trace_id]
  - processor: MappingOptimizer
    args:
      column_name: tags
      hash_map_name: _tags_hash_map
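To see why the MappingOptimizer's hash map helps, compare a per-row scan of parallel tag arrays against a precomputed set of per-pair hashes. This is a minimal Python sketch of the idea only (the names and the hash function stand in for the real ClickHouse columns and cityHash64), not Snuba's implementation:

```python
# Illustrative sketch: why a precomputed tag hash map speeds up
# equality filters on nested tags. All names here are hypothetical.

def tag_hash(key, value):
    # Stand-in for ClickHouse's cityHash64 over "key=value"
    return hash(f"{key}={value}")

class Row:
    def __init__(self, tags):
        self.tags_key = list(tags)                   # tags.key column
        self.tags_value = [tags[k] for k in tags]    # tags.value column
        # _tags_hash_map-style precomputed column
        self.tags_hash_map = {tag_hash(k, v) for k, v in tags.items()}

def naive_match(row, key, value):
    # Like tags.value[indexOf(tags.key, key)] = value: scans arrays per row
    return key in row.tags_key and \
        row.tags_value[row.tags_key.index(key)] == value

def optimized_match(row, key, value):
    # Like has(_tags_hash_map, cityHash64('key=value')): one set lookup
    return tag_hash(key, value) in row.tags_hash_map

row = Row({"environment": "prod", "release": "1.2.0"})
assert naive_match(row, "environment", "prod")
assert optimized_match(row, "environment", "prod")
assert not optimized_match(row, "environment", "dev")
```

Both predicates agree, but the optimized form replaces an array scan with a single membership test on a precomputed column.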

Schema Design

Snuba schemas use ClickHouse types with additional metadata:
schema:
  - name: project_id
    type: UInt
    args: { size: 64 }
  - name: timestamp
    type: DateTime
  - name: tags
    type: Nested
    args:
      subcolumns:
        - { name: key, type: String }
        - { name: value, type: String }

Type System

Numeric

UInt, Int, Float with size parameters (8, 16, 32, 64)

String

Variable-length strings stored as ClickHouse String

DateTime

DateTime and DateTime64 with optional precision

UUID

128-bit unique identifiers

Nested

Nested structures with subcolumns

Array

Arrays of any base type
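As a sketch of how these type specs expand into concrete ClickHouse column types, consider the following (hedged: the function name and the "inner_type" key are hypothetical, not Snuba's actual schema code):

```python
# Hedged sketch (not Snuba's code): expanding a schema entry's type and
# args into a ClickHouse column type string.
def render_type(spec):
    t = spec["type"]
    args = spec.get("args", {})
    if t in ("UInt", "Int", "Float"):
        return f"{t}{args['size']}"                         # e.g. UInt64
    if t == "DateTime64":
        return f"DateTime64({args.get('precision', 3)})"    # optional precision
    if t == "Array":
        return f"Array({render_type(args['inner_type'])})"  # recurse on base type
    return t                                                # String, UUID, DateTime

print(render_type({"type": "UInt", "args": {"size": 64}}))  # prints UInt64
print(render_type({"type": "Array", "args": {"inner_type": {"type": "String"}}}))  # prints Array(String)
```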

Schema Modifiers

  • nullable: Column can contain NULL values
  • readonly: Column computed by storage, not queryable directly
  • lowcardinality: Use ClickHouse LowCardinality optimization
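In a column definition, modifiers might appear alongside the type like this (the exact field name for modifiers is an assumption; it follows the schema example above):

```
- name: environment
  type: String
  args:
    schema_modifiers: [nullable, lowcardinality]
```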

Validation System

Datasets enforce query validation at multiple levels:
validators:
  - validator: EntityRequiredColumnValidator
    args:
      required_filter_columns:
        - project_id
  - validator: DatetimeConditionValidator
    args: {}
  - validator: TagConditionValidator
    args: {}

Common Validators

  • EntityRequiredColumnValidator: Ensures required columns in WHERE clause
  • DatetimeConditionValidator: Validates time range queries
  • TagConditionValidator: Validates conditions on tag columns
  • GranularityValidator: Enforces minimum granularity for aggregations

Best Practices

  • Most datasets require project_id in the WHERE clause for performance and data isolation; this is enforced by validators.
  • Limit queries to reasonable time ranges (typically < 90 days) to avoid scanning excessive data.
  • Use promoted tag columns (environment, release, etc.) directly rather than querying nested tag structures.
  • Some entities use multiple storages; the storage selector chooses between them based on query characteristics.
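For example, a query following these practices filters on project_id, uses the promoted environment column rather than the nested tags structure, and bounds the time range (illustrative SnQL against the events entity):

```
MATCH (events)
SELECT count() AS failures
WHERE project_id = 1
  AND environment = 'production'
  AND timestamp >= toDateTime('2024-06-01T00:00:00')
  AND timestamp < toDateTime('2024-06-08T00:00:00')
```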

Migration System

Datasets support schema evolution through migrations:
# List available migrations
snuba migrations list

# Run pending migrations
snuba migrations migrate --dataset events

# Rollback migrations
snuba migrations reverse --dataset events
See Migrations Overview for detailed information.

Next Steps

Events Dataset

Learn about error and exception data storage

Transactions Dataset

Explore performance monitoring data

Query with SnQL

Write queries using Snuba Query Language

Storage Architecture

Deep dive into ClickHouse storage design
