Database Architecture
ClickHouse
Analytics & Time-Series Data
- Events (page views, custom events)
- Web Vitals metrics
- Error tracking
- Revenue transactions
- AI usage logs
- Uptime monitoring
PostgreSQL
Application Metadata
- Users & organizations
- Websites & domains
- API keys & authentication
- Feature flags
- Goals & funnels
- Links & annotations
Why Two Databases?
This separation provides several benefits:
- Performance: ClickHouse excels at analytical queries on billions of events
- Scalability: Analytics data can grow independently from metadata
- Cost: ClickHouse compresses data 10-100x, reducing storage costs
- Query Speed: Columnar storage makes aggregations lightning-fast
- Isolation: Critical user data is separated from high-volume analytics
ClickHouse Schema
ClickHouse stores all time-series and event data across three databases:
- analytics - Main analytics events and aggregations
- uptime - Uptime monitoring data
- observability - AI call spans and traces
Events Table
The main events table stores all page views and interactions. It is partitioned by month and ordered by client_id and time for optimal query performance.
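As a sketch of the shape described above (the column names and types here are illustrative assumptions, not the exact production schema):

```sql
-- Hypothetical events table; the real column list may differ
CREATE TABLE analytics.events
(
    client_id   String,
    event_name  LowCardinality(String),
    session_id  String,
    path        String,
    time        DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)   -- monthly partitions, as described above
ORDER BY (client_id, time);   -- sort key for fast per-site time-range queries
```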
Web Vitals Tables
Web Vitals use a spans-oriented design for efficient storage. Tracked metrics include:
- LCP (Largest Contentful Paint)
- FCP (First Contentful Paint)
- CLS (Cumulative Layout Shift)
- INP (Interaction to Next Paint)
- TTFB (Time to First Byte)
- FPS (Frames Per Second)
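A spans-oriented design typically stores one row per metric observation. A rough illustration, with assumed table and column names:

```sql
-- Hypothetical spans-style table: one row per Web Vital observation
CREATE TABLE analytics.web_vitals_spans
(
    client_id  String,
    session_id String,
    metric     LowCardinality(String),  -- 'LCP', 'FCP', 'CLS', 'INP', 'TTFB', 'FPS'
    value      Float64,
    timestamp  DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)  -- daily partitions, as used for spans
ORDER BY (client_id, timestamp);
```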
Hourly Aggregations
Web Vitals are pre-aggregated hourly for fast dashboard queries.
Error Tracking Tables
Errors are stored in a lightweight spans table.
Custom Events Table
Custom events tracked via the SDK are stored separately.
Revenue Tracking
Revenue transactions from Stripe/Paddle are stored for attribution. The ReplacingMergeTree engine handles updates (e.g., refunds) by replacing rows with matching transaction_id.
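A minimal sketch of such a table (names and types are assumptions). Because ReplacingMergeTree deduplicates rows sharing the same sorting key at merge time, transaction_id must be part of ORDER BY:

```sql
-- Hypothetical revenue table; a refund re-inserted with the same
-- transaction_id replaces the earlier row when parts are merged
CREATE TABLE analytics.revenue_transactions
(
    client_id      String,
    transaction_id String,
    amount         Decimal(18, 2),
    currency       LowCardinality(String),
    status         LowCardinality(String),  -- e.g. 'paid', 'refunded'
    time           DateTime
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (client_id, transaction_id);
```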
AI Call Spans
For AI observability, Databuddy tracks LLM API calls.
PostgreSQL Schema
PostgreSQL stores application metadata using Drizzle ORM. Key tables include:
Users & Organizations
Websites
API Keys
Feature Flags
Data Flow
Compression & Storage
ClickHouse uses aggressive compression:
- ZSTD(1): General-purpose compression for strings
- Delta encoding: Efficient for timestamps and sequential numbers
- Gorilla codec: Optimized for floating-point metrics
- LowCardinality: Dictionary encoding for enum-like columns
Typical compression ratios by table:
- Events: 15-20x compression
- Web Vitals: 30-40x compression
- Errors: 10-15x compression
1 billion events typically use 50-100 GB of disk space after compression.
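The codecs above are declared per column. An illustrative (not actual) set of column definitions showing each one:

```sql
-- Illustrative column definitions using the codecs described above
CREATE TABLE analytics.example_codecs
(
    time       DateTime CODEC(Delta, ZSTD(1)),   -- delta encoding for timestamps
    event_name LowCardinality(String),           -- dictionary encoding for enum-like values
    path       String CODEC(ZSTD(1)),            -- general-purpose string compression
    lcp_value  Float64 CODEC(Gorilla, ZSTD(1))   -- float-optimized codec for metrics
)
ENGINE = MergeTree
ORDER BY time;
```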
Partitioning Strategy
All ClickHouse tables are partitioned by time:
- Monthly partitions for events: PARTITION BY toYYYYMM(time)
- Daily partitions for spans: PARTITION BY toDate(timestamp)
Partitioning enables:
- Efficient TTL deletion of old data
- Fast queries on recent time ranges
- Parallel query execution across partitions
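For example, expiring an old month becomes a cheap metadata operation rather than a row-by-row delete (table and partition value shown are illustrative):

```sql
-- Drop an entire monthly partition in one operation
ALTER TABLE analytics.events DROP PARTITION '202401';
```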
Indexing
ClickHouse uses specialized indexes:
Primary Key (ORDER BY)
The primary sorting key determines data layout on disk. Queries filtering by client_id and time are extremely fast.
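A sketch of such a query, assuming an events table sorted by (client_id, time) as described above (the client_id value is a placeholder):

```sql
-- Filters match the sort key, so ClickHouse prunes partitions
-- and seeks directly to the relevant granules
SELECT count()
FROM analytics.events
WHERE client_id = 'site_123'            -- first sort-key column
  AND time >= now() - INTERVAL 7 DAY;   -- second sort-key column
```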
Bloom Filters
Probabilistic indexes for string matching, used for filtering by session_id, event_name, etc.
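A hedged sketch of how such a skip index could be declared (index name and target table are assumptions):

```sql
-- Bloom filter skip index on session_id with a 1% false-positive rate
ALTER TABLE analytics.events
    ADD INDEX idx_session session_id TYPE bloom_filter(0.01) GRANULARITY 4;
```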
MinMax Indexes
Store min/max values for range queries. Useful for filtering numeric ranges (e.g., LCP > 2500).
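Illustratively, on the hypothetical spans table sketched earlier:

```sql
-- minmax skip index on a numeric metric column
ALTER TABLE analytics.web_vitals_spans
    ADD INDEX idx_value value TYPE minmax GRANULARITY 4;

-- Lets ClickHouse skip granules where no row can satisfy value > 2500
SELECT count() FROM analytics.web_vitals_spans WHERE value > 2500;
```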
Token Bloom Filters
Used for full-text search in error messages; enables fast substring searches.
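A sketch using ClickHouse's tokenbf_v1 index type (table and column names are assumptions):

```sql
-- Token bloom filter over error messages:
-- (filter size in bytes, hash functions, seed)
ALTER TABLE analytics.errors
    ADD INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Speeds up token-based searches such as:
SELECT count() FROM analytics.errors WHERE message ILIKE '%timeout%';
```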
Query Examples
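As an example of a typical dashboard aggregation over the event data described above (table, columns, and the client_id value are illustrative):

```sql
-- Daily page views and sessions for one site over the last 30 days
SELECT
    toDate(time) AS day,
    count() AS page_views,
    uniq(session_id) AS sessions
FROM analytics.events
WHERE client_id = 'site_123'
  AND time >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```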
Data Retention
Configure TTL (Time To Live) policies to expire old data automatically.
Best Practices
Use Aggregates
Query hourly/daily aggregates instead of raw spans for date range queries.
Filter Early
Always filter by client_id and time range first:
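For instance (assuming the sort key (client_id, time) discussed earlier):

```sql
-- Good: client_id and time match the table's sort key, so ClickHouse
-- reads only the relevant partitions and granules
SELECT count()
FROM analytics.events
WHERE client_id = 'site_123'
  AND time >= now() - INTERVAL 1 DAY
  AND event_name = 'screen_view';  -- secondary filters come after
```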
Limit Results
Always use LIMIT on large tables:
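For example, when inspecting raw rows (names are illustrative):

```sql
-- Cap the result set so the query cannot return millions of rows
SELECT *
FROM analytics.events
WHERE client_id = 'site_123'
  AND time >= now() - INTERVAL 1 HOUR
ORDER BY time DESC
LIMIT 100;
```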
Materialize Views
Create materialized views for frequently-run aggregations to pre-compute results.
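A hedged sketch of such a view, pre-computing hourly page views (view name, engine choice, and columns are assumptions):

```sql
-- Hypothetical materialized view: hourly page-view counts per site
CREATE MATERIALIZED VIEW analytics.events_hourly_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (client_id, hour)
AS SELECT
    client_id,
    toStartOfHour(time) AS hour,
    count() AS page_views
FROM analytics.events
GROUP BY client_id, hour;
```

Dashboards can then query the view instead of scanning raw events for every date-range request.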
Learn More
Privacy First
Understand data anonymization and privacy features
Event Tracking
Learn how events flow through the system
Sessions & Users
Explore user and session identification