Snuba uses a declarative YAML-based configuration system to define datasets, entities, storages, and subscriptions. This approach allows for flexible, maintainable configuration of the data layer without requiring code changes.

Architecture

The configuration system is built around four primary components:

  • Datasets - Top-level organizational units that group related entities
  • Entities - Logical data models that define schemas and query logic
  • Storages - Physical data storage abstractions (ClickHouse tables)
  • Subscriptions - Real-time query subscriptions with validation rules

Configuration Files

Configuration files are located in the snuba/datasets/configuration/ directory and organized by dataset:
snuba/datasets/configuration/
├── events/
│   ├── dataset.yaml           # Dataset definition
│   ├── entities/
│   │   └── events.yaml        # Entity definitions
│   └── storages/
│       ├── errors.yaml        # Writable storage
│       └── errors_ro.yaml     # Readable storage
├── discover/
├── metrics/
└── ...

Schema Version

All configuration files use schema version v1 and must specify the following fields:
  • version (string, required) - Schema version. Currently always v1.
  • kind (string, required) - Component type: dataset, entity, readable_storage, writable_storage, or cdc_storage.
  • name (string, required) - Unique identifier for the component.
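
As an illustration of these three header fields, here is a minimal sketch that checks them on a configuration already parsed into a dictionary. The function name check_header and the ALLOWED_KINDS constant are hypothetical helpers for this example, not part of Snuba; the allowed kinds are simply the ones listed above.

```python
from typing import Any

# Component kinds listed in this section (illustrative constant, not Snuba's own).
ALLOWED_KINDS = {
    "dataset",
    "entity",
    "readable_storage",
    "writable_storage",
    "cdc_storage",
}


def check_header(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the version/kind/name fields."""
    problems = []
    for field in ("version", "kind", "name"):
        if field not in config:
            problems.append(f"missing required field: {field}")
        elif not isinstance(config[field], str):
            problems.append(f"{field} must be a string")

    version = config.get("version")
    if isinstance(version, str) and version != "v1":
        problems.append(f"unsupported version: {version!r}")

    kind = config.get("kind")
    if isinstance(kind, str) and kind not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {kind!r}")
    return problems
```

An empty list means the header is acceptable; for example, check_header({"version": "v1", "kind": "entity"}) reports only the missing name field.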

Configuration Loading

Snuba loads and validates configuration files at startup using JSON Schema validation. The loading process:
  1. Discovery: Scans configuration directories for YAML files
  2. Parsing: Parses YAML into Python dictionaries
  3. Validation: Validates against JSON schemas defined in json_schema.py
  4. Building: Constructs Python objects from validated configuration
  5. Registration: Registers components in the global registry
Configuration validation is controlled by the VALIDATE_DATASET_YAMLS_ON_STARTUP setting. In production, validation is enabled to catch errors early.
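
The five steps can be sketched with the standard library alone. This is not Snuba's loader: REGISTRY, validate, build, and load_all are hypothetical stand-ins, validation only checks the header fields rather than a full JSON schema, and the YAML parser is passed in as a function so the sketch does not depend on a particular YAML library.

```python
from pathlib import Path
from typing import Any, Callable

# Hypothetical in-memory registry keyed by (kind, name).
REGISTRY: dict[tuple[str, str], dict[str, Any]] = {}


def validate(config: dict[str, Any]) -> None:
    """Stand-in for JSON Schema validation: only checks the header fields."""
    for field in ("version", "kind", "name"):
        if not isinstance(config.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")


def build(config: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for a builder; a real one would construct a typed object."""
    return {"kind": config["kind"], "name": config["name"], "raw": config}


def load_all(root: Path, parse: Callable[[str], dict[str, Any]]) -> int:
    """Load every *.yaml file under root and return how many were registered."""
    count = 0
    for path in sorted(root.rglob("*.yaml")):  # 1. Discovery
        config = parse(path.read_text())       # 2. Parsing
        validate(config)                       # 3. Validation
        component = build(config)              # 4. Building
        key = (component["kind"], component["name"])
        REGISTRY[key] = component              # 5. Registration
        count += 1
    return count
```

Assuming PyYAML is installed, a call might look like load_all(Path("snuba/datasets/configuration"), yaml.safe_load). Because validation runs before building, a malformed file stops the load before anything from that file reaches the registry.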

Configuration Validation

The system validates configurations with fastjsonschema, which compiles each JSON schema into a Python validation function for speed:
from snuba.datasets.configuration.loader import load_configuration_data
from snuba.datasets.configuration.json_schema import ENTITY_VALIDATORS

# Load and validate entity configuration
config = load_configuration_data(
    "path/to/entity.yaml",
    ENTITY_VALIDATORS
)

Validation Errors

Common validation errors include:
  • Missing required fields: All required properties must be present
  • Invalid types: Values must match the expected type (string, integer, array, etc.)
  • Unknown properties: Properties that are not defined in the schema are rejected
  • Invalid references: References to non-existent storages, processors, or validators
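
To make the four categories concrete, the sketch below hand-rolls a tiny checker that produces one message per category. It is an illustration only: SCHEMA, KNOWN_STORAGES, and find_errors are hypothetical, the schema format is far simpler than JSON Schema, and the actual messages produced by Snuba or fastjsonschema will differ.

```python
from typing import Any

# Hypothetical, simplified schema: property name -> (expected type, required?).
SCHEMA: dict[str, tuple[type, bool]] = {
    "version": (str, True),
    "kind": (str, True),
    "name": (str, True),
    "storages": (list, False),
}

# Hypothetical set of storage names that have already been registered.
KNOWN_STORAGES = {"errors", "errors_ro"}


def find_errors(config: dict[str, Any]) -> list[str]:
    """Collect validation errors instead of stopping at the first one."""
    errors = []
    for key, (expected, required) in SCHEMA.items():
        if key not in config:
            if required:
                errors.append(f"missing required field: {key}")
        elif not isinstance(config[key], expected):
            errors.append(f"invalid type for {key}: expected {expected.__name__}")

    for key in config:
        if key not in SCHEMA:
            errors.append(f"unknown property: {key}")

    storages = config.get("storages")
    if isinstance(storages, list):
        for storage in storages:
            if not isinstance(storage, str) or storage not in KNOWN_STORAGES:
                errors.append(f"invalid reference: storage {storage!r}")
    return errors
```

Note that reference checks need information from outside the file (here, the set of known storages), which is why they are a separate pass from the structural checks.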

Builder Pattern

Configuration files are transformed into Python objects using builder functions:
from snuba.datasets.configuration.dataset_builder import build_dataset_from_config

dataset = build_dataset_from_config("events/dataset.yaml")

Registered Classes

Many configuration properties reference registered Python classes:
  • Query Processors: Transform queries before execution
  • Validators: Enforce query constraints and security rules
  • Mappers: Translate column and function names
  • Storage Selectors: Choose which storage to query
  • Allocation Policies: Control resource allocation per query
These classes are registered using decorators:
from snuba.query.processors.logical import LogicalQueryProcessor

@LogicalQueryProcessor.register("TimeSeriesProcessor")
class TimeSeriesProcessor(LogicalQueryProcessor):
    def process_query(self, query, query_settings):
        # Transform query logic
        ...

Column Schemas

Columns are defined with a consistent structure across all components:
schema:
  - name: project_id
    type: UInt
    args:
      size: 64
  - name: timestamp
    type: DateTime
  - name: tags
    type: Nested
    args:
      subcolumns:
        - name: key
          type: String
        - name: value
          type: String
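
To show how such entries map onto ClickHouse types, the sketch below renders a parsed column entry as a ClickHouse-style type string. The function render_type is hypothetical and covers only the shapes used in the example above (sized numeric types, Nested with subcolumns, and argument-free types); it is not how Snuba builds its column objects.

```python
from typing import Any


def render_type(column: dict[str, Any]) -> str:
    """Render one parsed column entry as a ClickHouse-style type string."""
    kind = column["type"]
    args = column.get("args", {})

    if kind in ("UInt", "Int", "Float"):
        # The bit width comes from args.size, e.g. UInt + 64 -> UInt64.
        return f"{kind}{args['size']}"

    if kind == "Nested":
        # Each subcolumn is rendered recursively as "<name> <type>".
        inner = ", ".join(
            f"{sub['name']} {render_type(sub)}" for sub in args["subcolumns"]
        )
        return f"Nested({inner})"

    # Types without arguments in this sketch (String, DateTime, ...) pass through.
    return kind


schema = [
    {"name": "project_id", "type": "UInt", "args": {"size": 64}},
    {"name": "timestamp", "type": "DateTime"},
    {
        "name": "tags",
        "type": "Nested",
        "args": {
            "subcolumns": [
                {"name": "key", "type": "String"},
                {"name": "value", "type": "String"},
            ]
        },
    },
]
rendered = {col["name"]: render_type(col) for col in schema}
```

For the example schema, rendered maps project_id to UInt64, timestamp to DateTime, and tags to Nested(key String, value String).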

Supported Column Types

  • UInt - Unsigned integers (8, 16, 32, 64 bit)
  • Int - Signed integers (8, 16, 32, 64 bit)
  • Float - Floating point (32, 64 bit)
  • String - Variable-length strings
  • FixedString - Fixed-length strings
  • DateTime - Unix timestamp with second precision
  • DateTime64 - Unix timestamp with configurable precision
  • Date - Calendar date
  • Array - Arrays of any supported type
  • Nested - Nested structures (like arrays of structs)
  • Map - Key-value mappings
  • Tuple - Fixed-size tuples
  • JSON - JSON objects (ClickHouse experimental)
  • UUID - Universally unique identifiers
  • IPv4 / IPv6 - IP addresses
  • Enum - Enumeration types
  • AggregateFunction - Aggregation state
  • SimpleAggregateFunction - Simple aggregation state

Schema Modifiers

Columns can carry schema modifiers through the schema_modifiers argument, an array of modifiers that affect column behavior:
  • nullable - Column can contain NULL values
  • readonly - Column is computed/derived, not directly writable
  • lowcardinality - Optimize for columns with few distinct values
Example:
- name: environment
  type: String
  args:
    schema_modifiers: [nullable]
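
One way to picture the effect of modifiers is as wrappers around the base ClickHouse type. The sketch below is an assumption-laden illustration, not Snuba's code: apply_modifiers is a hypothetical helper, it follows ClickHouse's type syntax (where LowCardinality wraps Nullable, not the other way round), and it treats readonly as having no effect on the type itself.

```python
KNOWN_MODIFIERS = {"nullable", "readonly", "lowcardinality"}


def apply_modifiers(base_type: str, modifiers: list[str]) -> str:
    """Wrap a ClickHouse type string according to the given modifiers."""
    unknown = set(modifiers) - KNOWN_MODIFIERS
    if unknown:
        raise ValueError(f"unknown schema modifiers: {sorted(unknown)}")

    rendered = base_type
    # Nullable is applied first so that LowCardinality ends up outermost,
    # which is the nesting order ClickHouse accepts.
    if "nullable" in modifiers:
        rendered = f"Nullable({rendered})"
    if "lowcardinality" in modifiers:
        rendered = f"LowCardinality({rendered})"
    # "readonly" restricts writes; it does not change the stored type.
    return rendered
```

For the environment column above, apply_modifiers("String", ["nullable"]) yields Nullable(String); adding lowcardinality yields LowCardinality(Nullable(String)) regardless of the order in which the modifiers are listed.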

Best Practices

  • Keep related configurations in the same directory
  • Use consistent naming conventions (dataset_name, entity_name)
  • Document complex configurations with YAML comments
  • Separate read-only and writable storages
  • Always validate configurations locally before deploying
  • Use descriptive names for query processors and validators
  • Test configuration changes with integration tests
  • Monitor startup logs for validation warnings
  • Configure appropriate allocation policies
  • Use query processors to optimize common patterns
  • Set up proper indexes in storage schemas
  • Enable PREWHERE optimization where applicable
  • Always include mandatory condition checkers
  • Enforce project_id filtering on multi-tenant storages
  • Use validators to prevent expensive queries
  • Configure rate limits via allocation policies

Next Steps

  • Configure Datasets - Learn how to define and organize datasets
  • Configure Entities - Create entity schemas and query logic
  • Configure Storages - Set up ClickHouse storage abstractions
  • Configure Subscriptions - Define subscription validation rules
