Architecture
The configuration system is built around four primary components:
Datasets
Top-level organizational units that group related entities
Entities
Logical data models that define schemas and query logic
Storages
Physical data storage abstractions (ClickHouse tables)
Subscriptions
Real-time query subscriptions with validation rules
Configuration Files
Configuration files are located in the snuba/datasets/configuration/ directory and are organized by dataset.
Schema Version
All configuration files use version v1 and must specify:
- Schema version. Currently always v1.
- Component type: dataset, entity, readable_storage, writable_storage, or cdc_storage.
- Unique identifier for the component.
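As a rough illustration of these three required properties, the sketch below checks an already-parsed configuration (a Python dictionary) for them. The key names `version`, `kind`, and `name` and the helper `check_header` are assumptions made for this example, not a copy of Snuba's schema.

```python
# Hypothetical key names, assumed for illustration only.
REQUIRED_KEYS = ("version", "kind", "name")
ALLOWED_KINDS = {
    "dataset",
    "entity",
    "readable_storage",
    "writable_storage",
    "cdc_storage",
}


def check_header(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the header looks valid."""
    problems = [
        f"missing required key: {key}" for key in REQUIRED_KEYS if key not in config
    ]
    if "version" in config and config["version"] != "v1":
        problems.append(f"unsupported version: {config['version']!r}")
    if "kind" in config and config["kind"] not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {config['kind']!r}")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every header issue in one pass.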
Configuration Loading
Snuba loads and validates configuration files at startup using JSON Schema validation. The loading process:
- Discovery: Scans configuration directories for YAML files
- Parsing: Parses YAML into Python dictionaries
- Validation: Validates against JSON schemas defined in json_schema.py
- Building: Constructs Python objects from validated configuration
- Registration: Registers components in the global registry
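The five steps above can be sketched as a small pipeline. This is a simplified stand-in rather than Snuba's loader: `REGISTRY`, `validate`, `build`, and `load_all` are hypothetical names, validation is reduced to a key check, and parsing defaults to `json.loads` only so the sketch needs nothing beyond the standard library.

```python
import json
from pathlib import Path
from typing import Any, Callable

# Hypothetical global registry keyed by (kind, name).
REGISTRY: dict[tuple[str, str], dict[str, Any]] = {}


def validate(config: dict[str, Any]) -> None:
    """Stand-in for JSON Schema validation: checks three assumed keys only."""
    missing = [key for key in ("version", "kind", "name") if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")


def build(config: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for object construction: returns a copy of the validated config."""
    return dict(config)


def load_all(root: Path, parse: Callable[[str], Any] = json.loads) -> int:
    """Run the five loading steps in order and return the number of files loaded."""
    loaded = 0
    for path in sorted(root.rglob("*.yaml")):      # 1. discovery
        config = parse(path.read_text())           # 2. parsing
        validate(config)                           # 3. validation
        component = build(config)                  # 4. building
        REGISTRY[(config["kind"], config["name"])] = component  # 5. registration
        loaded += 1
    return loaded
```

To read real YAML, a parser such as PyYAML's `yaml.safe_load` could be passed as `parse`. Because validation runs before building, a malformed file fails before any object is constructed or registered.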
Configuration validation is controlled by the VALIDATE_DATASET_YAMLS_ON_STARTUP setting. In production, validation is enabled to catch errors early.
Configuration Validation
The system validates configurations using fastjsonschema for performance.
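A minimal sketch of this style of validation follows. `EXAMPLE_SCHEMA` and `make_validator` are hypothetical and much smaller than the schemas in json_schema.py; the only library calls used are `fastjsonschema.compile` and the `fastjsonschema.JsonSchemaException` base class. The import sits inside the function so the snippet can be loaded even where the third-party package is not installed.

```python
from typing import Callable, Optional

# Hypothetical schema, far smaller than a real component schema.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "version": {"const": "v1"},
        "kind": {"enum": ["dataset", "entity"]},
        "name": {"type": "string"},
    },
    "required": ["version", "kind", "name"],
    "additionalProperties": False,
}


def make_validator(schema: dict) -> Callable[[dict], Optional[str]]:
    """Compile the schema once; the returned function gives an error message or None."""
    import fastjsonschema  # third-party dependency

    compiled = fastjsonschema.compile(schema)

    def check(config: dict) -> Optional[str]:
        try:
            compiled(config)
        except fastjsonschema.JsonSchemaException as exc:
            return str(exc)
        return None

    return check
```

Compiling once and reusing the resulting function is where the performance benefit comes from, since each schema is typically applied to many files.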
Validation Errors
Common validation errors include:
- Missing required fields: All required properties must be present
- Invalid types: Values must match the expected type (string, integer, array, etc.)
- Unknown properties: Properties not defined in the schema are not allowed
- Invalid references: References to non-existent storages, processors, or validators
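To make the four categories concrete, here is a hand-rolled checker that produces one message per category. It is not how fastjsonschema reports errors: `SCHEMA`, `KNOWN_STORAGES`, and `find_errors` are hypothetical, and the property names are invented for the example.

```python
from typing import Any

# Hypothetical mini-schema: property name -> (expected Python type, required?)
SCHEMA = {"name": (str, True), "storage": (str, True), "partitions": (int, False)}
KNOWN_STORAGES = {"errors", "transactions"}  # stand-in for registered storages


def find_errors(config: dict[str, Any]) -> list[str]:
    """Classify problems in a config into the four common error categories."""
    errors = []
    for key, (expected, required) in SCHEMA.items():
        if key not in config:
            if required:
                errors.append(f"missing required field: {key}")
        elif not isinstance(config[key], expected):
            errors.append(f"invalid type for {key}: expected {expected.__name__}")
    for key in config:
        if key not in SCHEMA:
            errors.append(f"unknown property: {key}")
    storage = config.get("storage")
    if isinstance(storage, str) and storage not in KNOWN_STORAGES:
        errors.append(f"invalid reference: storage {storage!r} does not exist")
    return errors
```

The reference check is separate from the structural checks because it depends on what has been registered, not on the shape of the file.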
Builder Pattern
Configuration files are transformed into Python objects using builder functions.
Registered Classes
Many configuration properties reference registered Python classes:
- Query Processors: Transform queries before execution
- Validators: Enforce query constraints and security rules
- Mappers: Translate column and function names
- Storage Selectors: Choose which storage to query
- Allocation Policies: Control resource allocation per query
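The builder pattern and registered classes fit together roughly as follows: a builder function walks the validated configuration and looks up each referenced class by name. Everything here is hypothetical, including the `register` decorator, the `LimitProcessor` class, the `Entity` dataclass, and the `processor` and `args` keys.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical registry mapping a configured name to a Python class.
PROCESSORS: dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator that makes a processor addressable by its class name."""
    PROCESSORS[cls.__name__] = cls
    return cls


@register
class LimitProcessor:
    """Illustrative query processor; not one of Snuba's real processors."""

    def __init__(self, max_rows: int = 1000) -> None:
        self.max_rows = max_rows


@dataclass
class Entity:
    name: str
    query_processors: list[Any] = field(default_factory=list)


def build_entity(config: dict[str, Any]) -> Entity:
    """Builder: turn a validated config dict into an Entity with live processors."""
    processors = []
    for spec in config.get("query_processors", []):
        class_name = spec["processor"]
        if class_name not in PROCESSORS:
            raise ValueError(f"unknown processor: {class_name}")
        processors.append(PROCESSORS[class_name](**spec.get("args", {})))
    return Entity(name=config["name"], query_processors=processors)
```

Resolving classes through a registry keeps the configuration files free of import paths, and a misspelled class name surfaces as an invalid-reference error at build time.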
Column Schemas
Columns are defined with a consistent structure across all components.
Supported Column Types
Numeric Types
- UInt: Unsigned integers (8, 16, 32, 64 bit)
- Int: Signed integers (8, 16, 32, 64 bit)
- Float: Floating point (32, 64 bit)
String Types
- String: Variable-length strings
- FixedString: Fixed-length strings
Date/Time Types
- DateTime: Unix timestamp with second precision
- DateTime64: Unix timestamp with configurable precision
- Date: Calendar date
Complex Types
- Array: Arrays of any supported type
- Nested: Nested structures (like arrays of structs)
- Map: Key-value mappings
- Tuple: Fixed-size tuples
- JSON: JSON objects (experimental in ClickHouse)
Special Types
- UUID: Universally unique identifiers
- IPv4/IPv6: IP addresses
- Enum: Enumeration types
- AggregateFunction: Aggregation state
- SimpleAggregateFunction: Simple aggregation state
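As a sketch of how such column definitions might map onto ClickHouse type names, the function below renders a subset of the types listed above. The dictionary layout (`name`, `type`, `args`) and the argument keys `size`, `length`, `precision`, and `inner_type` are assumptions made for this example, and `render_type` is a hypothetical helper.

```python
from typing import Any


def render_type(column: dict[str, Any]) -> str:
    """Render a column definition as a ClickHouse type name (subset of types only)."""
    kind = column["type"]
    args = column.get("args", {})
    if kind in ("UInt", "Int", "Float"):
        return f"{kind}{args['size']}"  # bit width is appended to the type name
    if kind == "FixedString":
        return f"FixedString({args['length']})"
    if kind == "DateTime64":
        return f"DateTime64({args.get('precision', 3)})"
    if kind == "Array":
        # Arrays wrap another column definition, so render it recursively.
        return f"Array({render_type(args['inner_type'])})"
    if kind in ("String", "DateTime", "Date", "UUID", "IPv4", "IPv6"):
        return kind
    raise ValueError(f"unsupported column type in this sketch: {kind}")
```

For instance, a `UInt` column with `size` 64 renders as `UInt64`, and an `Array` whose `inner_type` is `String` renders as `Array(String)`.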
Schema Modifiers
Columns can have schema modifiers, an array of modifiers that affect column behavior:
- nullable: Column can contain NULL values
- readonly: Column is computed/derived, not directly writable
- lowcardinality: Optimize for columns with few distinct values
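One way to picture the effect of these modifiers is as wrappers around a ClickHouse type name. The helper `apply_modifiers` is hypothetical and only illustrates the idea; it assumes `readonly` affects writes and not the rendered type.

```python
KNOWN_MODIFIERS = {"nullable", "readonly", "lowcardinality"}


def apply_modifiers(base_type: str, modifiers: list[str]) -> str:
    """Wrap a ClickHouse type name according to the column's schema modifiers."""
    unknown = set(modifiers) - KNOWN_MODIFIERS
    if unknown:
        raise ValueError(f"unknown schema modifiers: {sorted(unknown)}")
    rendered = base_type
    if "nullable" in modifiers:
        rendered = f"Nullable({rendered})"
    if "lowcardinality" in modifiers:
        # ClickHouse nests these as LowCardinality(Nullable(T)), so this wraps last.
        rendered = f"LowCardinality({rendered})"
    # "readonly" is assumed to affect writes only, so the type is left unchanged.
    return rendered
```

The wrapping order is fixed regardless of the order in the array, so `["lowcardinality", "nullable"]` and `["nullable", "lowcardinality"]` produce the same type.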
Best Practices
Organization
- Keep related configurations in the same directory
- Use consistent naming conventions (dataset_name, entity_name)
- Document complex configurations with YAML comments
- Separate read-only and writable storages
Validation
- Always validate configurations locally before deploying
- Use descriptive names for query processors and validators
- Test configuration changes with integration tests
- Monitor startup logs for validation warnings
Performance
- Configure appropriate allocation policies
- Use query processors to optimize common patterns
- Set up proper indexes in storage schemas
- Enable PREWHERE optimization where applicable
Security
- Always include mandatory condition checkers
- Enforce project_id filtering on multi-tenant storages
- Use validators to prevent expensive queries
- Configure rate limits via allocation policies
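The idea behind a mandatory condition check on project_id can be sketched as below. This is a simplified illustration, not Snuba's checker: queries are modeled as a flat list of hypothetical `(column, operator, value)` tuples, and `has_mandatory_condition` is an invented helper.

```python
def has_mandatory_condition(
    conditions: list[tuple[str, str, object]], column: str = "project_id"
) -> bool:
    """Return True if some condition constrains `column` with an equality or IN filter."""
    return any(lhs == column and op in ("=", "IN") for lhs, op, _ in conditions)
```

A query whose conditions lack such a filter would be rejected before execution, which is what keeps one tenant's query from scanning another tenant's rows.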
Next Steps
Configure Datasets
Learn how to define and organize datasets
Configure Entities
Create entity schemas and query logic
Configure Storages
Set up ClickHouse storage abstractions
Configure Subscriptions
Define subscription validation rules