Datasets are the primary organizational unit in Snuba, providing a logical grouping of entities and storages that represent different types of data ingested into ClickHouse.
What are Datasets?
A dataset in Snuba represents a logical collection of related data and defines the interface through which clients can query that data. Each dataset contains:
- Entities: Logical query interfaces that define schema and available columns
- Storages: Physical ClickHouse tables where data is stored
- Query Processors: Components that optimize and transform queries
- Validators: Rules that ensure query correctness and performance
Core Datasets
Snuba provides several production datasets that power different Sentry features:
Events
Error and exception data with stack traces, user context, and tags
Transactions
Performance monitoring data with timing information and span details
Metrics
Time-series metrics including counters, distributions, and sets
Replays
Session replay data with user interactions and browser events
Dataset Architecture
Configuration Structure
Datasets are defined using YAML configuration files located in snuba/datasets/configuration/. Each dataset configuration names the dataset and lists the entities it exposes.
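A minimal dataset definition can be sketched as follows. This is illustrative only — the exact field names follow the style of the v1 configuration format but may differ between Snuba versions:

```yaml
# Hypothetical sketch of a dataset config file; field names are
# modeled on Snuba's YAML configuration format, not authoritative.
version: v1
kind: dataset
name: events
entities:
  - events
```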
Entity-Storage Relationship
Entities provide the query interface while storages handle physical data organization. This separation allows for flexible query patterns and storage optimizations.
Entities
Entities define:
- Column schema with types
- Available query processors
- Validation rules
- Subscription capabilities
- Join relationships
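The shape of an entity definition can be sketched as follows. All names here are illustrative — real entities use different processors, validators, and columns, and key names vary by Snuba version:

```yaml
# Hypothetical sketch of an entity config; names are illustrative.
version: v1
kind: entity
name: events
schema:
  - { name: project_id, type: UInt, args: { size: 64 } }
  - { name: timestamp, type: DateTime }
  - { name: group_id, type: UInt, args: { size: 64 } }
storages:
  - storage: errors
required_time_column: timestamp
query_processors:
  - processor: BasicFunctionsProcessor   # illustrative processor name
validators:
  - validator: EntityRequiredColumnValidator
    args:
      required_filter_columns: [project_id]
```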
Storages
Storages define:
- Physical ClickHouse table names
- Partitioning strategies
- Data retention policies
- Allocation policies for rate limiting
- Stream loader configuration
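A writable storage definition might be sketched like this. Table, policy, and processor names are illustrative assumptions, not the actual production configuration:

```yaml
# Hypothetical sketch of a writable storage config; all names illustrative.
version: v1
kind: writable_storage
name: errors
storage:
  key: errors
  set_key: events
schema:
  local_table_name: errors_local   # physical ClickHouse tables
  dist_table_name: errors_dist
allocation_policies:
  - name: ConcurrentRateLimitAllocationPolicy   # illustrative policy name
    args:
      required_tenant_types: [project_id]
stream_loader:
  processor: ErrorsProcessor
  default_topic: events
```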
Query Flow
- Query Reception: Client submits query through SnQL or API
- Entity Selection: Dataset routes to appropriate entity
- Query Parsing: Entity parses and validates query
- Query Processing: Processors optimize and transform query
- Storage Selection: Storage selector chooses optimal storage
- Query Execution: Translated to ClickHouse SQL and executed
- Result Processing: Results formatted and returned to client
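To make the flow concrete, a client might submit a SnQL query like the following (entity and column names are illustrative):

```
MATCH (events)
SELECT count() AS event_count BY group_id
WHERE project_id = 1
  AND timestamp >= toDateTime('2024-01-01T00:00:00')
  AND timestamp < toDateTime('2024-01-02T00:00:00')
LIMIT 5
```

The dataset routes this to its entity, validators check the project_id and timestamp conditions, query processors rewrite the expression tree, and the storage selector picks the ClickHouse table against which the translated SQL is executed.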
Data Ingestion
Each writable storage defines a stream loader that handles Kafka message processing:
Stream Loader Configuration
- processor: Message processor class that transforms Kafka messages
- default_topic: Primary Kafka topic for data ingestion
- commit_log_topic: Topic for tracking processed offsets
- subscription_scheduler_mode: How subscriptions are scheduled (partition or global)
- subscription_delay_seconds: Delay before subscription execution
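Put together, a stream loader block might look like this. The keys mirror the list above; the processor and topic names are illustrative:

```yaml
stream_loader:
  processor: ErrorsProcessor            # transforms Kafka messages into rows
  default_topic: events                 # primary ingestion topic
  commit_log_topic: snuba-commit-log    # tracks processed offsets
  subscription_scheduler_mode: global   # or: partition
  subscription_delay_seconds: 60
```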
Common Configuration Patterns
Time-Based Partitioning
Most datasets use retention-based partitioning, with partitions keyed by a time column and the row's retention period.
Allocation Policies
Datasets define resource allocation policies that rate-limit queries per tenant.
Query Processors
Processors optimize and transform queries before execution.
Schema Design
Snuba schemas use ClickHouse types with additional metadata.
Type System
Numeric
UInt, Int, Float with size parameters (8, 16, 32, 64)
String
Variable-length strings stored as ClickHouse String
DateTime
DateTime and DateTime64 with optional precision
UUID
128-bit unique identifiers
Nested
Nested structures with subcolumns
Array
Arrays of any base type
Schema Modifiers
- nullable: Column can contain NULL values
- readonly: Column computed by storage, not queryable directly
- lowcardinality: Use ClickHouse LowCardinality optimization
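A column list combining these types and modifiers might be sketched as follows. Column names are illustrative, and the exact modifier spellings depend on the Snuba version:

```yaml
# Hypothetical column definitions; names and modifier spellings illustrative.
columns:
  - { name: event_id, type: UUID }
  - { name: message, type: String }
  - { name: environment, type: String, args: { schema_modifiers: [nullable, lowcardinality] } }
  - { name: retention_days, type: UInt, args: { size: 16 } }
```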
Validation System
Datasets enforce query validation at multiple levels:
Common Validators
- EntityRequiredColumnValidator: Ensures required columns in WHERE clause
- DatetimeConditionValidator: Validates time range queries
- TagConditionValidator: Optimizes tag-based queries
- GranularityValidator: Enforces minimum granularity for aggregations
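In entity configs these appear as a validators list. A sketch, with illustrative argument names and values:

```yaml
validators:
  - validator: EntityRequiredColumnValidator
    args:
      required_filter_columns: [project_id]
  - validator: GranularityValidator
    args:
      minimum: 3600   # e.g. reject aggregations finer than one hour
```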
Best Practices
Always filter by project_id
Most datasets require project_id in the WHERE clause for performance and data isolation. This is enforced by validators.
Use appropriate time ranges
Limit queries to reasonable time ranges (typically < 90 days) to avoid scanning excessive data.
Leverage promoted tags
Frequently queried tags are promoted to dedicated ClickHouse columns; filtering on a promoted tag is significantly faster than searching the generic tags map.
Understand storage selectors
Some entities use multiple storages. The storage selector chooses based on query characteristics.
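A query that follows these practices might look like the following (entity and columns are illustrative, and release is assumed here to be a promoted tag):

```
MATCH (events)
SELECT uniq(user) AS unique_users
WHERE project_id = 42
  AND release = '1.0.3'
  AND timestamp >= toDateTime('2024-03-01T00:00:00')
  AND timestamp < toDateTime('2024-03-15T00:00:00')
```

It filters by project_id, restricts the scan to a two-week window, and conditions on a promoted tag rather than the generic tags map.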
Migration System
Datasets support schema evolution through migrations, which apply versioned changes to ClickHouse tables and columns.
Next Steps
Events Dataset
Learn about error and exception data storage
Transactions Dataset
Explore performance monitoring data
Query with SnQL
Write queries using Snuba Query Language
Storage Architecture
Deep dive into ClickHouse storage design