PostHog uses a dual-database architecture: PostgreSQL for application metadata and ClickHouse for high-volume analytics data.
## Database Architecture

PostgreSQL stores application metadata and configuration in tables such as:

- `posthog_team` (teams and projects)
- `posthog_user` (user accounts)
- `posthog_organization` (organizations)
- `posthog_featureflag` (feature flag definitions)
- `posthog_dashboard` (dashboard configs)
- `posthog_cohort` (cohort definitions)
## Core Concepts

### Team-Based Isolation

All data in PostHog is scoped to a `team_id`:
```python
# ✅ Always filter by team_id
FeatureFlag.objects.filter(team_id=team_id)

# ✅ In serializers, get team from context
class FeatureFlagSerializer(serializers.ModelSerializer):
    def create(self, validated_data):
        team = self.context["get_team"]()
        return FeatureFlag.objects.create(team_id=team.id, **validated_data)
```
**Security critical:** Never query without filtering by `team_id`. This is enforced in code review and testing.
### Team vs Organization

- **Organization**: Billing entity; contains multiple teams
- **Team**: Data isolation boundary (formerly called "Project")
- **User**: Can belong to multiple organizations
```
Organization
├── Team 1 (Production)
│   ├── Events
│   ├── Feature Flags
│   └── Dashboards
└── Team 2 (Staging)
    ├── Events
    └── Feature Flags
```
## PostgreSQL Schema

### Core Models
```python
class Team(models.Model):
    organization = models.ForeignKey("Organization", on_delete=models.CASCADE)
    name = models.CharField(max_length=200)
    api_token = models.CharField(max_length=200, unique=True)

    # Settings
    timezone = models.CharField(max_length=240, default="UTC")
    session_recording_opt_in = models.BooleanField(default=False)
    autocapture_opt_out = models.BooleanField(default=False)

    # Metadata
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
```
Key points:

- Each team has a unique API token for event ingestion
- Teams are the primary data isolation boundary
- Team settings control product behavior
```python
class Organization(models.Model):
    name = models.CharField(max_length=200)
    created_at = models.DateTimeField(auto_now_add=True)

class OrganizationMembership(models.Model):
    organization = models.ForeignKey("Organization", on_delete=models.CASCADE)
    user = models.ForeignKey("User", on_delete=models.CASCADE)
    level = models.IntegerField(default=1)  # Admin, Member, etc.
```
Key points:

- Users belong to organizations via `OrganizationMembership`
- Access control is organization-level
- Billing is per organization
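To make the organization-level access model concrete, here is a pure-Python sketch of a membership check. The level constants and helper name are assumptions for illustration, not PostHog's actual API:

```python
# Hypothetical membership levels; PostHog's real values may differ.
MEMBER, ADMIN = 1, 8

def can_administer(memberships, user_id, organization_id):
    """True if the user holds an admin-level membership in the organization."""
    return any(
        m["user_id"] == user_id
        and m["organization_id"] == organization_id
        and m["level"] >= ADMIN
        for m in memberships
    )

memberships = [
    {"user_id": 1, "organization_id": 10, "level": ADMIN},
    {"user_id": 2, "organization_id": 10, "level": MEMBER},
]
```

In the real system the membership rows live in `posthog_organizationmembership` and checks go through permission classes, but the shape of the decision is the same.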
### Product Models

Product-specific models (feature flags, dashboards, cohorts, and so on) live in `products/<product>/backend/models.py`:
```python
# products/feature_flags/backend/models.py
class FeatureFlag(models.Model):
    team = models.ForeignKey("posthog.Team", on_delete=models.CASCADE)
    key = models.CharField(max_length=400)
    name = models.TextField(blank=True)
    enabled = models.BooleanField(default=True)
    filters = models.JSONField(default=dict)

    class Meta:
        unique_together = [("team", "key")]
```
**Do not** add domain-specific fields to the `Team` model. Use a Team Extension model instead; see `posthog/models/team/README.md` for the pattern.
## ClickHouse Schema

### Events Table

The core analytics table storing all events:
```sql
CREATE TABLE sharded_events (
    uuid UUID,
    event VARCHAR,
    properties VARCHAR,  -- JSON string
    timestamp DateTime64(6, 'UTC'),
    team_id Int64,
    distinct_id VARCHAR,
    elements_chain VARCHAR,
    created_at DateTime64(6, 'UTC'),

    -- Indexes
    INDEX team_id_index team_id TYPE minmax GRANULARITY 1,
    INDEX event_index event TYPE bloom_filter GRANULARITY 1,
    INDEX timestamp_index timestamp TYPE minmax GRANULARITY 1
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/events', '{replica}', _timestamp)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))
```
Key characteristics:

- Partitioned by month for efficient pruning
- Replicated for high availability
- ReplacingMergeTree for deduplication
- Indexed on `team_id`, `event`, and `timestamp`
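To make the deduplication behavior concrete, here is a toy Python sketch of ReplacingMergeTree's merge-time semantics (an illustration only, not how ClickHouse is implemented): rows sharing the same sorting key collapse to the row with the highest version column, which for `sharded_events` is `_timestamp`.

```python
def replacing_merge(rows, key_columns, version_column):
    """Keep only the highest-version row per sorting key, as a
    ReplacingMergeTree does when its data parts are merged."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in latest or row[version_column] > latest[key][version_column]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"uuid": "a", "team_id": 1, "_timestamp": 1, "event": "$pageview"},
    {"uuid": "a", "team_id": 1, "_timestamp": 2, "event": "$pageview"},  # newer duplicate
]
deduped = replacing_merge(rows, key_columns=("team_id", "uuid"), version_column="_timestamp")
```

Note that ClickHouse applies this only when parts merge in the background, so queries can still observe duplicates until a merge runs; that is why query-time dedup strategies (e.g. `FINAL` or `max(version)` aggregation) exist.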
### Person & Distinct IDs

PostHog maintains identity resolution across devices:
```sql
-- Person records
CREATE TABLE sharded_person (
    id UUID,
    team_id Int64,
    properties VARCHAR,  -- JSON string
    is_identified Boolean,
    created_at DateTime64(6, 'UTC')
    -- ... version fields for merging
)

-- Identity mapping
CREATE TABLE sharded_person_distinct_id (
    distinct_id VARCHAR,
    person_id UUID,
    team_id Int64,
    is_deleted Boolean,
    version Int64,
    -- Enables fast lookups
    PRIMARY KEY (team_id, distinct_id)
)
```
Identity resolution flow:

1. Event arrives with a `distinct_id` (e.g. a device ID)
2. Look up `person_id` in the `person_distinct_id` table
3. Merge person properties on `$identify` events
4. Associate all events with the resolved `person_id`
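The flow above can be sketched as an in-memory illustration (function names and the dict-based mapping are hypothetical; the real system uses the ClickHouse tables shown above plus a PostgreSQL source of truth):

```python
import uuid

def resolve_person(mapping, team_id, distinct_id):
    """Return the person_id for a distinct_id, creating a person on first sight."""
    key = (team_id, distinct_id)
    if key not in mapping:
        mapping[key] = str(uuid.uuid4())  # first-time user: create a new person
    return mapping[key]

def identify(mapping, team_id, anon_distinct_id, user_distinct_id):
    """On $identify, point both distinct_ids at the same person."""
    person_id = resolve_person(mapping, team_id, anon_distinct_id)
    mapping[(team_id, user_distinct_id)] = person_id
    return person_id

mapping = {}
anon_person = resolve_person(mapping, 1, "device_abc")       # anonymous pageviews
merged_person = identify(mapping, 1, "device_abc", "user_123")  # user logs in
```

After the `$identify`, events sent with either `distinct_id` resolve to the same person, which is what links pre-login and post-login activity.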
### Session Recordings

Recording data uses a hybrid approach:
```sql
-- Metadata in ClickHouse
CREATE TABLE session_recording_events (
    session_id VARCHAR,
    team_id Int64,
    distinct_id VARCHAR,
    timestamp DateTime64(6, 'UTC'),
    snapshot_data VARCHAR,  -- Small snapshots inline
    -- Large recordings reference S3
    has_full_snapshot Boolean,
    events_summary VARCHAR  -- JSON
)
```
- **Small snapshots**: Stored inline in ClickHouse
- **Large recordings**: Stored in S3, referenced by `session_id`
- **Metadata**: Always in ClickHouse for fast queries
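The inline-vs-S3 split can be illustrated with a simple size threshold. The cutoff value and function name here are assumptions for illustration, not PostHog's actual constants:

```python
INLINE_LIMIT_BYTES = 512 * 1024  # hypothetical cutoff for inline storage

def storage_target(snapshot: bytes) -> str:
    """Decide where a snapshot body lives; metadata always goes to ClickHouse."""
    return "clickhouse" if len(snapshot) <= INLINE_LIMIT_BYTES else "s3"
```

The point of the split is query economics: metadata stays in ClickHouse so recording lists filter fast, while bulky snapshot payloads go to cheap object storage and are fetched only on playback.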
## Data Flow

### Event Capture

Client SDKs send events to the `/e/` endpoint:

```json
{
  "api_key": "phc_...",
  "event": "$pageview",
  "properties": {
    "$current_url": "https://example.com",
    "$browser": "Chrome"
  },
  "distinct_id": "user_123"
}
```
### Ingestion & Enrichment

1. Validate the API key → resolve `team_id`
2. Enrich with GeoIP data
3. Add server-side properties
4. Queue for processing
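A pure-Python sketch of these four steps (the token map, enrichment fields, and in-memory queue are stand-ins for the real capture service, GeoIP lookup, and Kafka):

```python
def ingest(raw_event, token_to_team, queue, country=None):
    """Toy ingestion pipeline: validate, resolve team, enrich, queue."""
    team_id = token_to_team.get(raw_event["api_key"])  # step 1: resolve team_id
    if team_id is None:
        raise ValueError("invalid API key")
    properties = dict(raw_event.get("properties", {}))
    if country:
        properties["$geoip_country_code"] = country    # step 2: GeoIP enrichment
    properties["$plugins_enabled"] = True              # step 3: server-side property (illustrative)
    event = {
        "team_id": team_id,
        "event": raw_event["event"],
        "distinct_id": raw_event["distinct_id"],
        "properties": properties,
    }
    queue.append(event)                                # step 4: queue for processing
    return event
```

Rejecting unknown API keys before enrichment matters: the token is the only thing tying an incoming event to a team, so nothing past step 1 can run without a resolved `team_id`.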
### Identity Resolution

1. Look up the person by `distinct_id`
2. Create a person if this is a first-time user
3. Merge persons on `$identify` events
### Storage

1. Write the event to the ClickHouse `sharded_events` table
2. Update person properties if changed
3. Trigger webhooks/exports if configured
## Querying Data

### HogQL

PostHog's SQL-like query language for ClickHouse:
```sql
-- HogQL query
SELECT
    event,
    count() AS event_count
FROM events
WHERE
    timestamp >= now() - INTERVAL 7 DAY
    AND team_id = {team_id}  -- Auto-injected
GROUP BY event
ORDER BY event_count DESC
LIMIT 10
```
Security features:

- Automatic `team_id` injection
- SQL injection prevention via AST parsing
- Query timeout limits
- Result size limits
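As a toy illustration of the auto-injection idea (the real HogQL layer parses queries into an AST and rewrites it, precisely because string splicing is not injection-safe):

```python
def inject_team_id(hogql: str, team_id) -> str:
    """Naive placeholder substitution, illustration only. Refusing
    non-integer values mimics the guarantee the AST rewrite provides:
    only a literal team id can ever reach the WHERE clause."""
    if not isinstance(team_id, int):
        raise TypeError("team_id must be an integer")
    return hogql.replace("{team_id}", str(team_id))
```

Because the server supplies `team_id` from the authenticated request context, a user-written HogQL query can never read another team's rows even if it omits the filter.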
### Query Runners

Python code that executes queries:
```python
# posthog/queries/trends/trends.py
from posthog.queries.query_runner import QueryRunner

class TrendsQueryRunner(QueryRunner):
    def calculate(self):
        query = self.build_query()
        results = self.run_query(query)
        return self.format_results(results)
```
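The runner pattern reduces to a small build/run/format pipeline with the executor injected. This self-contained sketch uses hypothetical names, not PostHog's actual class surface:

```python
class ToyQueryRunner:
    """build_query -> run_query -> format_results, executor injected for testability."""

    def __init__(self, run_query):
        self.run_query = run_query  # e.g. a ClickHouse client call in production

    def build_query(self) -> str:
        return "SELECT event, count() FROM events GROUP BY event"

    def format_results(self, rows):
        return [{"event": event, "count": count} for event, count in rows]

    def calculate(self):
        return self.format_results(self.run_query(self.build_query()))

# A stubbed executor makes the runner trivially unit-testable without a database.
runner = ToyQueryRunner(run_query=lambda query: [("$pageview", 42)])
```

Injecting `run_query` is the key design choice: the query-building and result-shaping logic can be tested against canned rows, while production wires in the real client.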
## Migrations

### PostgreSQL Migrations

Standard Django migrations:

```bash
# Create migrations
python manage.py makemigrations

# Apply migrations
python manage.py migrate
```
### ClickHouse Migrations

Managed migrations for ClickHouse schema changes:

```python
# posthog/clickhouse/migrations/0001_initial_schema.py
from posthog.clickhouse.client.migration_tools import run_sql_with_exceptions

def run(client):
    run_sql_with_exceptions(
        """
        CREATE TABLE IF NOT EXISTS new_table (
            ...
        ) ENGINE = ReplicatedReplacingMergeTree(...)
        """
    )
```
ClickHouse migrations require careful coordination in distributed environments. Always test on staging first.
## PostgreSQL Optimization

- Use `select_related()` and `prefetch_related()` to avoid N+1 queries
- Add indexes on frequently queried fields
- Use `only()` to fetch specific fields
```python
# ✅ Optimized
feature_flags = FeatureFlag.objects.filter(team_id=team_id).select_related("team")

# ❌ N+1 query problem
for flag in FeatureFlag.objects.filter(team_id=team_id):
    print(flag.team.name)  # Separate query for each flag
```
## ClickHouse Optimization

- Filter on `team_id` and `timestamp` first (the partition and sort keys)
- Use materialized columns for frequently accessed JSON properties
- Limit result sizes with `LIMIT`
- Use sampling for approximate results on large datasets
```sql
-- ✅ Efficient: filters on the partition and sort keys
SELECT count() FROM events
WHERE team_id = 1 AND timestamp > now() - INTERVAL 1 DAY

-- ❌ Inefficient: full table scan
SELECT count() FROM events WHERE properties LIKE '%purchase%'
```
## Next Steps

- **Development Setup**: Set up your local environment
- **Testing**: Learn about testing strategies