The Datasets API provides programmatic access to Snuba’s data models and entities.

Dataset Class

A Dataset represents a data model that can be queried in Snuba.
from snuba.datasets.dataset import Dataset
from snuba.datasets.factory import get_dataset

# Get a dataset by name
dataset = get_dataset("events")

# Access entities in the dataset
entities = dataset.get_all_entities()
for entity in entities:
    print(entity)
Source: snuba/datasets/dataset.py

Dataset Methods

get_all_entities()
method
Returns all entities belonging to this dataset.
Returns: Sequence[Entity]

Dataset Factory

The factory provides centralized access to all datasets.
from snuba.datasets.factory import (
    get_dataset,
    get_dataset_name,
    get_enabled_dataset_names,
    InvalidDatasetError
)

# Get dataset by name
try:
    dataset = get_dataset("events")
except InvalidDatasetError as e:
    print(f"Dataset not found: {e}")

# Get dataset name from instance
name = get_dataset_name(dataset)
print(f"Dataset name: {name}")

# List all enabled datasets
all_datasets = get_enabled_dataset_names()
print(f"Available datasets: {all_datasets}")
Source: snuba/datasets/factory.py

Factory Functions

get_dataset
function
Get a dataset by name.
Parameters:
  • name (str): Dataset name (e.g., “events”, “transactions”)
Returns: Dataset
Raises: InvalidDatasetError if the dataset doesn’t exist or is disabled
get_dataset_name
function
Get the name of a dataset instance.
Parameters:
  • dataset (Dataset): Dataset instance
Returns: str
get_enabled_dataset_names
function
List all enabled dataset names.
Returns: list[str]
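Conceptually, the factory is a name-to-dataset registry: known names resolve to Dataset instances, and unknown or disabled names raise InvalidDatasetError. The following is a minimal stand-in sketch of that pattern with simplified types — the registry contents and the disabled set are illustrative assumptions, not Snuba's real implementation:

```python
class InvalidDatasetError(Exception):
    """Raised when a dataset name is unknown or disabled."""

class Dataset:
    def __init__(self, name: str) -> None:
        self.name = name

# Hypothetical registry contents, for illustration only.
_DATASETS = {"events": Dataset("events"), "transactions": Dataset("transactions")}
_DISABLED = {"legacy_dataset"}

def get_dataset(name: str) -> Dataset:
    # Disabled names are rejected before the registry lookup.
    if name in _DISABLED:
        raise InvalidDatasetError(f"dataset {name!r} is disabled in this environment")
    try:
        return _DATASETS[name]
    except KeyError:
        raise InvalidDatasetError(f"dataset {name!r} does not exist") from None

def get_dataset_name(dataset: Dataset) -> str:
    return dataset.name

def get_enabled_dataset_names() -> list:
    return sorted(_DATASETS)
```

The key property of the pattern: callers never distinguish "unknown" from "disabled" by exception type, only by message, so both paths raise the same InvalidDatasetError.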

Entities

Entities represent queryable data models within datasets.
from snuba.datasets.entities.factory import get_entity
from snuba.datasets.entities.entity_key import EntityKey

# Get entity by key
entity = get_entity(EntityKey.EVENTS)

# Access entity properties
print(f"Storages: {entity.get_all_storages()}")
print(f"Required time column: {entity.required_time_column}")

# Get writable storage
writable_storage = entity.get_writable_storage()
if writable_storage:
    table_writer = writable_storage.get_table_writer()

Entity Methods

get_all_storages()
method
Get all storages for this entity.
Returns: Sequence[Storage]
get_writable_storage()
method
Get the writable storage for this entity.
Returns: Optional[WritableTableStorage]
get_subscription_processors()
method
Get processors for subscription queries.
Returns: Optional[Sequence[SubscriptionProcessor]]
get_subscription_validators()
method
Get validators for subscription creation.
Returns: Optional[Sequence[SubscriptionValidator]]

Pluggable Datasets

Pluggable datasets are defined in YAML configuration and materialized at load time as PluggableDataset instances:
from snuba.datasets.entities.entity_key import EntityKey
from snuba.datasets.pluggable_dataset import PluggableDataset

# Create a pluggable dataset (EntityKey.MY_ENTITY is a placeholder key)
dataset = PluggableDataset(
    name="my_dataset",
    all_entities=[EntityKey.MY_ENTITY]
)
Source: snuba/datasets/pluggable_dataset.py

Storage

Storages represent the physical tables in ClickHouse.
from snuba.datasets.storage import Storage, WritableTableStorage

# Get storage from entity
entity = get_entity(EntityKey.EVENTS)
storages = entity.get_all_storages()

for storage in storages:
    # Get storage key
    key = storage.get_storage_key()
    
    # Get schema
    schema = storage.get_schema()
    
    # Get cluster
    cluster = storage.get_cluster()
    
    # Check if writable
    if isinstance(storage, WritableTableStorage):
        table_writer = storage.get_table_writer()
        stream_loader = table_writer.get_stream_loader()

Storage Methods

get_storage_key()
method
Get the storage identifier.
Returns: StorageKey
get_schema()
method
Get the table schema.
Returns: TableSchema
get_cluster()
method
Get the ClickHouse cluster.
Returns: ClickhouseCluster

Complete Example

Here’s a complete example using the datasets API:
from snuba.datasets.factory import get_dataset, get_enabled_dataset_names
from snuba.datasets.entities.entity_key import EntityKey
from snuba.datasets.entities.factory import get_entity

# List all available datasets
print("Available datasets:")
for dataset_name in get_enabled_dataset_names():
    print(f"  - {dataset_name}")

# Get events dataset
events_dataset = get_dataset("events")
print("\nDataset: events")

# Get all entities
entities = events_dataset.get_all_entities()
print(f"Entities: {len(entities)}")

for entity in entities:
    print(f"\nEntity: {entity}")
    
    # Get storages
    storages = entity.get_all_storages()
    print(f"  Storages: {len(storages)}")
    
    for storage in storages:
        key = storage.get_storage_key()
        schema = storage.get_schema()
        print(f"    - {key.value}")
        
        # Get table info
        if hasattr(schema, 'get_local_table_name'):
            table = schema.get_local_table_name()
            print(f"      Table: {table}")
    
    # Check subscription support
    sub_processors = entity.get_subscription_processors()
    if sub_processors:
        print(f"  Subscription processors: {len(sub_processors)}")
    
    # Get time column
    time_col = entity.required_time_column
    if time_col:
        print(f"  Time column: {time_col}")
Output:
Available datasets:
  - events
  - transactions
  - metrics
  - profiles

Dataset: events
Entities: 1

Entity: <Entity: events>
  Storages: 2
    - events
      Table: sentry_local
    - errors
      Table: errors_local
  Time column: timestamp

Dataset Configuration

Datasets and their underlying storages are configured via YAML files. The example below is a readable storage definition:
version: v1
kind: readable_storage
name: events

storage:
  key: events
  set_key: events

readiness_state: complete

schema:
  columns:
    - name: event_id
      type: String
    - name: project_id
      type: UInt64
    - name: timestamp
      type: DateTime
Configuration files are loaded from: settings.DATASET_CONFIG_FILES_GLOB
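Once loaded, a configuration file like the one above is just a nested mapping. The sketch below shows the parsed shape of that YAML and a simple sanity check over it — the required-key list here is an illustrative assumption, not Snuba's actual config validation:

```python
# The YAML example above, as it looks after parsing into Python.
config = {
    "version": "v1",
    "kind": "readable_storage",
    "name": "events",
    "storage": {"key": "events", "set_key": "events"},
    "readiness_state": "complete",
    "schema": {
        "columns": [
            {"name": "event_id", "type": "String"},
            {"name": "project_id", "type": "UInt64"},
            {"name": "timestamp", "type": "DateTime"},
        ]
    },
}

def check_config(config: dict) -> list:
    """Return the declared column names, raising if required keys are absent."""
    for key in ("version", "kind", "name", "schema"):
        if key not in config:
            raise ValueError(f"missing required key: {key}")
    return [col["name"] for col in config["schema"]["columns"]]
```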

Entity Keys

Common entity keys:
from snuba.datasets.entities.entity_key import EntityKey

# Available entity keys
EntityKey.EVENTS
EntityKey.TRANSACTIONS
EntityKey.METRICS_COUNTERS
EntityKey.METRICS_DISTRIBUTIONS
EntityKey.METRICS_SETS
EntityKey.PROFILES
EntityKey.PROFILE_FUNCTIONS
EntityKey.REPLAYS
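Entity keys behave like string-valued enum members: each key carries a lowercase string `value` (e.g., "events") and can be looked up from that string. A stand-in sketch using a plain `enum.Enum` — an approximation for illustration, since Snuba's actual EntityKey members come from configuration:

```python
from enum import Enum

class EntityKey(Enum):
    EVENTS = "events"
    TRANSACTIONS = "transactions"
    METRICS_COUNTERS = "metrics_counters"

# Look up a key from its string value; unknown values raise ValueError.
key = EntityKey("events")
print(key.value)
```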

Error Handling

from snuba.datasets.factory import InvalidDatasetError

try:
    dataset = get_dataset("nonexistent")
except InvalidDatasetError as e:
    print(f"Error: {e}")
    # Error: dataset 'nonexistent' does not exist

try:
    dataset = get_dataset("disabled_dataset")
except InvalidDatasetError as e:
    print(f"Error: {e}")
    # Error: dataset 'disabled_dataset' is disabled in this environment

See also:
  • Query Builder — build queries programmatically
  • Processors — the query processing pipeline
