PostHog’s Data Warehouse lets you sync external data sources, query them with SQL, and join them with your product analytics data. Build unified views across your entire data stack without copying data to separate warehouses.
Overview
The Data Warehouse provides:
External data syncing - Connect to databases, SaaS tools, and data warehouses
SQL querying - Use HogQL to query external tables alongside PostHog events
Saved queries and views - Create reusable data models and materialized views
Real-time joins - Combine external data with product analytics
Usage tracking - Monitor rows synced and materialized
Supported Data Sources
Connect to databases, SaaS platforms, and data warehouses:
Stripe
PostgreSQL
Snowflake
BigQuery
from posthog import Posthog
posthog = Posthog('<ph_project_api_key>')
# Stripe data automatically syncs customers, invoices, and charges
# Auto-creates joins between persons and Stripe customers
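As a sketch, a source like this can also be created programmatically against the warehouse sources endpoint shown in the API Reference below. This uses only the standard library; the field names inside the Stripe `payload` (`stripe_account_id`, `client_secret`) are illustrative assumptions, not confirmed parameter names.

```python
# Hypothetical sketch: creating a Stripe source via the warehouse API.
# Field names inside "payload" are illustrative assumptions.
import json
import urllib.request

def create_stripe_source(host, project_id, api_key, account_id, client_secret):
    body = {
        "source_type": "Stripe",
        "prefix": "stripe_",
        "payload": {
            "stripe_account_id": account_id,  # assumed field name
            "client_secret": client_secret,   # assumed field name
        },
    }
    return urllib.request.Request(
        f"{host}/api/projects/{project_id}/warehouse_sources/",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# The caller would pass the request to urllib.request.urlopen to send it.
```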
Sync Configuration
Incremental Syncing
Use incremental syncing to only sync new or updated rows:
{
  "name": "orders",
  "sync_type": "incremental",
  "incremental_field": "updated_at",
  "incremental_field_type": "timestamp",
  "sync_time_of_day": "02:00"
}
Incremental fields - Must be monotonically increasing (a timestamp or integer ID) and present on every row for reliable syncing.
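To illustrate why the field must be monotonic, here is a minimal sketch of cursor-based incremental syncing (function and variable names are illustrative): each run fetches only rows past the saved cursor, then advances the cursor to the largest value seen.

```python
# Minimal sketch of cursor-based incremental syncing.
# If the incremental field ever decreased, rows could land behind the
# cursor and be silently skipped on the next run.
def incremental_sync(rows, incremental_field, cursor):
    """Return the rows newer than the cursor, plus the advanced cursor."""
    new_rows = [r for r in rows if cursor is None or r[incremental_field] > cursor]
    new_cursor = max((r[incremental_field] for r in new_rows), default=cursor)
    return new_rows, new_cursor
```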
Full Table Syncs
For smaller tables or when you need complete refreshes:
{
  "name": "products",
  "sync_type": "full",
  "sync_time_of_day": "03:00"
}
Append-Only Syncs
For event streams and immutable logs:
{
  "name": "activity_logs",
  "sync_type": "append",
  "incremental_field": "event_timestamp",
  "incremental_field_type": "timestamp"
}
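To make the three sync types concrete, a toy sketch of how each one treats an incoming batch (the merge-by-key behavior shown for incremental is an illustrative simplification):

```python
# Toy model of the three sync types:
#   full        - replace the table with the batch
#   incremental - upsert batch rows by primary key
#   append      - add batch rows unconditionally
def apply_batch(table, batch, sync_type, key="id"):
    if sync_type == "full":
        return list(batch)
    if sync_type == "incremental":
        merged = {row[key]: row for row in table}
        merged.update({row[key]: row for row in batch})
        return list(merged.values())
    if sync_type == "append":
        return table + list(batch)
    raise ValueError(f"unknown sync_type: {sync_type}")
```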
Querying External Data
Basic SQL Queries
Query external tables using HogQL:
SELECT
  customer_email,
  sum(amount_paid) AS total_revenue,
  count(*) AS invoice_count
FROM stripe_invoice
WHERE status = 'paid'
  AND created >= now() - INTERVAL 30 DAY
GROUP BY customer_email
ORDER BY total_revenue DESC
LIMIT 100
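The same query can also be issued programmatically. A hedged sketch of the request body for PostHog's query endpoint (`POST /api/projects/{project_id}/query/`), assuming the `HogQLQuery` payload kind:

```python
import json

# Sketch of a HogQLQuery request body for PostHog's query endpoint.
def hogql_query_payload(sql):
    return {"query": {"kind": "HogQLQuery", "query": sql}}

body = json.dumps(hogql_query_payload(
    "SELECT customer_email, sum(amount_paid) AS total_revenue "
    "FROM stripe_invoice WHERE status = 'paid' "
    "GROUP BY customer_email ORDER BY total_revenue DESC LIMIT 100"
))
```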
Joining with Product Data
Combine external data with PostHog events:
SELECT
  persons.properties.email,
  stripe_customer.name,
  stripe_customer.created AS customer_since,
  count(DISTINCT events.uuid) AS event_count
FROM events
JOIN persons ON events.person_id = persons.id
JOIN stripe_customer ON persons.properties.email = stripe_customer.email
WHERE events.timestamp >= now() - INTERVAL 7 DAY
GROUP BY persons.properties.email, stripe_customer.name, stripe_customer.created
Saved Queries and Views
Creating Views
Save reusable queries as views:
from posthog.api import DataWarehouseSavedQuery

view = DataWarehouseSavedQuery.objects.create(
    team_id=team.id,
    name="active_customers",
    query={
        "query": """
            SELECT
              c.email,
              c.created,
              count(DISTINCT e.uuid) AS events_30d
            FROM stripe_customer c
            LEFT JOIN persons p ON c.email = p.properties.email
            LEFT JOIN events e ON p.id = e.person_id
              AND e.timestamp >= now() - INTERVAL 30 DAY
            GROUP BY c.email, c.created
            HAVING events_30d > 0
        """
    },
    created_by=user,
)
Materialized Views
Enable materialization for faster queries on large datasets:
Create the view
First create a standard saved query with your SQL logic.
Enable materialization
view.is_materialized = True
view.sync_frequency_interval = "24hour" # Refresh daily
view.save()
view.schedule_materialization()
Query materialized table
Once materialized, query the view like any other table:

SELECT * FROM active_customers
WHERE events_30d > 100
Materialized views are refreshed on schedule and stored as ClickHouse tables.
They’re ideal for complex aggregations that power dashboards.
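To reason about refresh timing, a small sketch that parses a sync_frequency_interval string such as "24hour" into a timedelta; the set of accepted interval strings here is an assumption for illustration.

```python
import re
from datetime import timedelta

# Illustrative parser for interval strings like "30minute", "24hour", "7day".
# The accepted unit names are assumptions, not a confirmed list.
def parse_interval(value):
    match = re.fullmatch(r"(\d+)(minute|hour|day)", value)
    if not match:
        raise ValueError(f"unrecognized interval: {value}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{unit + "s": amount})
```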
Schema Management
Refreshing Schemas
Fetch the latest table list from a source:
# Refresh schemas from source
response = client.post(
    f'/api/projects/{project_id}/warehouse_sources/{source_id}/refresh_schemas/'
)
# Returns: {"added": 3, "deleted": 1}
Managing Table Prefixes
Use prefixes when connecting multiple instances of the same source:
{
  "source_type": "Postgres",
  "prefix": "prod_",  # Tables appear as prod_users, prod_orders, etc.
  "host": "prod-db.example.com",
  ...
}
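The prefix is applied to every table the source exposes, which a one-line sketch makes explicit:

```python
# Each synced table is exposed under the source's prefix.
def prefixed_tables(prefix, tables):
    return [f"{prefix}{name}" for name in tables]
```

So two Postgres sources with prefixes `prod_` and `staging_` can both sync a `users` table without colliding.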
Monitoring and Usage
Sync Statistics
Track data warehouse usage:
GET /api/projects/{project_id}/warehouse/total_rows_stats/

{
  "billing_period_start": "2024-01-01T00:00:00Z",
  "billing_period_end": "2024-02-01T00:00:00Z",
  "total_rows": 1500000,
  "materialized_rows_in_billing_period": 500000,
  "breakdown_of_rows_by_source": {
    "abc-123": 1000000,
    "def-456": 500000
  }
}
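Working from the sample response above, a small helper for checking synced rows against a plan limit (the limit value is illustrative):

```python
# How many rows remain before hitting a plan limit, given a
# total_rows_stats response. The plan_limit figure is illustrative.
def rows_remaining(stats, plan_limit):
    return plan_limit - stats["total_rows"]

stats = {
    "total_rows": 1_500_000,
    "breakdown_of_rows_by_source": {"abc-123": 1_000_000, "def-456": 500_000},
}
```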
Job Status
Monitor sync jobs:
GET /api/projects/{project_id}/warehouse/running_activity/

{
  "results": [
    {
      "id": "job-123",
      "type": "Postgres",
      "name": "users",
      "status": "Running",
      "rows": 45000,
      "created_at": "2024-01-15T10:30:00Z"
    }
  ]
}
Data Health Issues
Get failed syncs and materializations:
GET /api/projects/{project_id}/warehouse/data_health_issues/

{
  "results": [
    {
      "id": "schema-789",
      "name": "orders",
      "type": "external_data_sync",
      "status": "failed",
      "error": "Connection timeout",
      "failed_at": "2024-01-15T11:00:00Z"
    }
  ],
  "count": 1
}
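A sketch of filtering that response down to failed external syncs, e.g. to feed an alerting job:

```python
# Pull failed external-data syncs out of a data_health_issues response.
def failed_syncs(response):
    return [
        issue for issue in response["results"]
        if issue["type"] == "external_data_sync" and issue["status"] == "failed"
    ]
```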
SSH Tunneling
Connect to databases behind firewalls using SSH tunnels:
{
  "source_type": "Postgres",
  "host": "localhost",  # Internal address
  "port": 5432,
  "ssh_tunnel": {
    "enabled": true,
    "host": "bastion.example.com",
    "port": 22,
    "auth": {
      "selection": "password",  # or "private_key"
      "username": "tunnel_user",
      "password": "***"
    }
  }
}
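A minimal sketch validating the ssh_tunnel block before submitting a source config; the required fields mirror the example above, and the `private_key` field name is an assumption.

```python
# Check that an ssh_tunnel config carries the credentials its auth
# selection needs. The "private_key" field name is an assumption.
def validate_ssh_tunnel(cfg):
    if not cfg.get("enabled"):
        return True  # tunnel disabled: nothing to validate
    auth = cfg.get("auth", {})
    if auth.get("selection") == "password":
        return bool(auth.get("username") and auth.get("password"))
    if auth.get("selection") == "private_key":
        return bool(auth.get("username") and auth.get("private_key"))
    return False
```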
Revenue Analytics (Stripe)
For Stripe sources, enable automatic revenue analytics:
PATCH /api/projects/{project_id}/warehouse_sources/{source_id}/revenue_analytics_config/

{
  "enabled": true,
  "include_invoiceless_charges": false
}
This creates managed views for:
Customer lifetime value
Revenue by cohort
Churn analysis
API Reference
Create External Data Source
POST /api/projects/{project_id}/warehouse_sources/

{
  "source_type": "Postgres",
  "prefix": "analytics_",
  "description": "Production analytics database",
  "payload": {
    "host": "db.example.com",
    "port": 5432,
    "database": "analytics",
    "user": "readonly",
    "password": "***",
    "schema": "public",
    "schemas": [
      {
        "name": "users",
        "should_sync": true,
        "sync_type": "incremental",
        "incremental_field": "updated_at",
        "incremental_field_type": "timestamp"
      }
    ]
  }
}
Trigger Manual Sync
POST /api/projects/{project_id}/warehouse_sources/{source_id}/reload/
Delete Source
DELETE /api/projects/{project_id}/warehouse_sources/{source_id}/

Soft-deletes the source and its schemas, and schedules cleanup jobs.
Best Practices
Incremental syncing - Use incremental syncing for large tables. Ensure incremental fields are indexed and monotonically increasing (timestamps, auto-increment IDs).
Materialized views - Materialize complex aggregations that power dashboards. Schedule refreshes during off-peak hours to minimize impact on source databases.
Schema prefixes - Use prefixes when connecting multiple instances (prod/staging) of the same source to avoid naming conflicts.
SSH tunnels - For production databases, use SSH tunneling with key-based auth instead of exposing databases directly to the internet.
Troubleshooting
Connection Timeouts
If syncs fail with connection timeouts:
Verify firewall rules allow PostHog IPs
Check database connection limits
Consider using SSH tunneling for stable connections
Incremental Field Issues
Common mistake: using non-monotonic fields like last_login, which can decrease. Always use created_at, updated_at, or an auto-increment ID.
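A quick sanity-check sketch: sample successive values of a candidate incremental field and reject the field if any value decreases.

```python
# A field is safe for incremental syncing only if sampled values
# never decrease between successive rows.
def is_monotonic(values):
    return all(later >= earlier for earlier, later in zip(values, values[1:]))
```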
Billing Limits
If syncs are paused due to billing limits:
GET /api/projects/{project_id}/warehouse/total_rows_stats/
# Check total_rows against your plan limit
# Increase the limit in billing settings, or pause non-critical syncs