PostHog’s Data Warehouse lets you sync external data sources, query them with SQL, and join them with your product analytics data. Build unified views across your entire data stack without copying data to separate warehouses.

Overview

The Data Warehouse provides:
  • External data syncing - Connect to databases, SaaS tools, and data warehouses
  • SQL querying - Use HogQL to query external tables alongside PostHog events
  • Saved queries and views - Create reusable data models and materialized views
  • Real-time joins - Combine external data with product analytics
  • Usage tracking - Monitor rows synced and materialized

Supported Data Sources

Connect to databases (such as Postgres), SaaS platforms (such as Stripe), and other data warehouses. Some sources come with built-in modeling: a Stripe source, for example, automatically syncs customers, invoices, and charges, and auto-creates joins between PostHog persons and Stripe customers.

Sync Configuration

Incremental Syncing

Use incremental syncing to only sync new or updated rows:
{
  "name": "orders",
  "sync_type": "incremental",
  "incremental_field": "updated_at",
  "incremental_field_type": "timestamp",
  "sync_time_of_day": "02:00"
}
Incremental fields - Must be monotonically increasing (timestamp, integer ID) and present on every row for reliable syncing.
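The cursor logic behind incremental syncing is worth seeing concretely. A minimal sketch (the `fetch`-style filtering below is a stand-in; a real connector pushes the cursor into the source query as `WHERE updated_at > cursor`):

```python
from datetime import datetime

def incremental_sync(rows, cursor):
    """Return rows newer than the cursor, plus the advanced cursor value."""
    new_rows = [r for r in rows if cursor is None or r["updated_at"] > cursor]
    # Advance the cursor to the newest row seen; keep it unchanged if no new rows.
    new_cursor = max((r["updated_at"] for r in new_rows), default=cursor)
    return new_rows, new_cursor

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]
synced, cursor = incremental_sync(rows, None)   # first run: everything syncs
rows.append({"id": 3, "updated_at": datetime(2024, 1, 3)})
delta, cursor = incremental_sync(rows, cursor)  # second run: only the new row
```

This is also why the field must be monotonic: if a value ever moves backwards, rows land behind the cursor and are silently skipped.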

Full Table Syncs

For smaller tables or when you need complete refreshes:
{
  "name": "products",
  "sync_type": "full",
  "sync_time_of_day": "03:00"
}

Append-Only Syncs

For event streams and immutable logs:
{
  "name": "activity_logs",
  "sync_type": "append",
  "incremental_field": "event_timestamp",
  "incremental_field_type": "timestamp"
}
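The difference between the two incremental modes comes down to merge semantics: incremental syncs upsert by key, while append-only syncs simply concatenate batches. A sketch of the distinction (hypothetical helpers, not PostHog internals):

```python
def apply_incremental(table, batch, key="id"):
    """Incremental sync merges: new rows are inserted, changed rows replaced."""
    merged = {row[key]: row for row in table}
    merged.update({row[key]: row for row in batch})
    return list(merged.values())

def apply_append(table, batch):
    """Append-only sync never rewrites history: batches are concatenated as-is."""
    return table + batch

logs = [{"id": 1, "v": "a"}]
batch = [{"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
```

For immutable event streams, append-only is cheaper because no key lookup or dedup pass is needed.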

Querying External Data

Basic SQL Queries

Query external tables using HogQL:
SELECT 
  customer_email,
  sum(amount_paid) as total_revenue,
  count(*) as invoice_count
FROM stripe_invoice
WHERE status = 'paid'
  AND created >= now() - INTERVAL 30 DAY
GROUP BY customer_email
ORDER BY total_revenue DESC
LIMIT 100
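Queries like this can also be issued programmatically. A sketch that builds the request body for PostHog's query endpoint (POST /api/projects/:id/query/ with a HogQLQuery payload — verify the endpoint shape against your PostHog version; `project_id` and `api_key` below are placeholders):

```python
def hogql_query_payload(sql: str) -> dict:
    """Wrap a HogQL string in the body shape used by the query endpoint."""
    return {"query": {"kind": "HogQLQuery", "query": sql}}

payload = hogql_query_payload(
    "SELECT customer_email, sum(amount_paid) AS total_revenue "
    "FROM stripe_invoice WHERE status = 'paid' GROUP BY customer_email"
)

# To actually run it (requires a personal API key):
# requests.post(f"https://us.posthog.com/api/projects/{project_id}/query/",
#               headers={"Authorization": f"Bearer {api_key}"}, json=payload)
```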

Joining with Product Data

Combine external data with PostHog events:
SELECT 
  persons.properties.email,
  stripe_customer.name,
  stripe_customer.created as customer_since,
  count(DISTINCT events.uuid) as event_count
FROM events
JOIN persons ON events.person_id = persons.id
JOIN stripe_customer ON persons.properties.email = stripe_customer.email
WHERE events.timestamp >= now() - INTERVAL 7 DAY
GROUP BY persons.properties.email, stripe_customer.name, stripe_customer.created

Saved Queries and Views

Creating Views

Save reusable queries as views:
from posthog.warehouse.models import DataWarehouseSavedQuery

view = DataWarehouseSavedQuery.objects.create(
    team_id=team.id,
    name="active_customers",
    query={
        "query": """
            SELECT 
                c.email,
                c.created,
                count(DISTINCT e.uuid) as events_30d
            FROM stripe_customer c
            LEFT JOIN persons p ON c.email = p.properties.email
            LEFT JOIN events e ON p.id = e.person_id 
                AND e.timestamp >= now() - INTERVAL 30 DAY
            GROUP BY c.email, c.created
            HAVING events_30d > 0
        """
    },
    created_by=user
)

Materialized Views

Enable materialization for faster queries on large datasets:

1. Create the view - First create a standard saved query with your SQL logic.

2. Enable materialization:

view.is_materialized = True
view.sync_frequency_interval = "24hour"  # Refresh daily
view.save()
view.schedule_materialization()

3. Query the materialized table - Once materialized, query the view like any table:

SELECT * FROM active_customers
WHERE events_30d > 100

Materialized views are refreshed on schedule and stored as ClickHouse tables. They’re ideal for complex aggregations that power dashboards.
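The refresh decision itself is straightforward. A minimal sketch of what a scheduler checks, assuming the interval is available as a timedelta (hypothetical helper, not PostHog's actual scheduler):

```python
from datetime import datetime, timedelta

def needs_refresh(last_refreshed_at, sync_frequency_interval, now=None):
    """A materialized view is stale once a full interval has elapsed."""
    now = now or datetime.now()
    if last_refreshed_at is None:  # never materialized yet
        return True
    return now - last_refreshed_at >= sync_frequency_interval

now = datetime(2024, 1, 15, 12, 0)
daily = timedelta(hours=24)
```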

Schema Management

Refreshing Schemas

Fetch the latest table list from a source:

# Refresh schemas from source
response = client.post(
    f'/api/projects/{project_id}/warehouse_sources/{source_id}/refresh_schemas/'
)

# Returns: {"added": 3, "deleted": 1}

Managing Table Prefixes

Use prefixes when connecting multiple instances of the same source:
{
  "source_type": "Postgres",
  "prefix": "prod_",  # Tables appear as prod_users, prod_orders, etc.
  "host": "prod-db.example.com",
  ...
}
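The effect of a prefix is purely a naming one: every table from the source is addressed as `<prefix><table>` in HogQL. A sketch of how distinct prefixes keep two instances of the same source from colliding (hypothetical helper):

```python
def resolve_table_names(sources):
    """Map (prefix, tables) pairs to warehouse table names, rejecting collisions."""
    seen = {}
    for prefix, tables in sources:
        for table in tables:
            name = f"{prefix}{table}"
            if name in seen:
                raise ValueError(f"duplicate warehouse table {name!r}")
            seen[name] = (prefix, table)
    return seen

# Two Postgres instances both exposing a users table coexist cleanly:
names = resolve_table_names([("prod_", ["users"]), ("staging_", ["users"])])
```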

Monitoring and Usage

Sync Statistics

Track data warehouse usage:
GET /api/projects/{project_id}/warehouse/total_rows_stats/

{
  "billing_period_start": "2024-01-01T00:00:00Z",
  "billing_period_end": "2024-02-01T00:00:00Z",
  "total_rows": 1500000,
  "materialized_rows_in_billing_period": 500000,
  "breakdown_of_rows_by_source": {
    "abc-123": 1000000,
    "def-456": 500000
  }
}
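A small helper for interpreting this payload, e.g. to alert before hitting a plan limit (the limit value is something you supply from your billing settings; the field names mirror the response above):

```python
def rows_remaining(stats: dict, plan_limit: int) -> int:
    """How many synced rows are left in the billing period (never negative)."""
    return max(plan_limit - stats["total_rows"], 0)

stats = {
    "total_rows": 1_500_000,
    "materialized_rows_in_billing_period": 500_000,
    "breakdown_of_rows_by_source": {"abc-123": 1_000_000, "def-456": 500_000},
}
```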

Job Status

Monitor sync jobs:
GET /api/projects/{project_id}/warehouse/running_activity/

{
  "results": [
    {
      "id": "job-123",
      "type": "Postgres",
      "name": "users",
      "status": "Running",
      "rows": 45000,
      "created_at": "2024-01-15T10:30:00Z"
    }
  ]
}

Data Health Issues

Get failed syncs and materializations:
GET /api/projects/{project_id}/warehouse/data_health_issues/

{
  "results": [
    {
      "id": "schema-789",
      "name": "orders",
      "type": "external_data_sync",
      "status": "failed",
      "error": "Connection timeout",
      "failed_at": "2024-01-15T11:00:00Z"
    }
  ],
  "count": 1
}
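For alerting, you typically only care about the failed sync entries in this response. A minimal sketch of filtering them out (field names mirror the payload above):

```python
def failed_syncs(payload: dict) -> list:
    """Filter a data_health_issues response down to failed external-data syncs."""
    return [
        issue for issue in payload.get("results", [])
        if issue["status"] == "failed" and issue["type"] == "external_data_sync"
    ]

payload = {
    "results": [
        {
            "id": "schema-789",
            "name": "orders",
            "type": "external_data_sync",
            "status": "failed",
            "error": "Connection timeout",
        }
    ],
    "count": 1,
}
```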

SSH Tunneling

Connect to databases behind firewalls using SSH tunnels:
{
  "source_type": "Postgres",
  "host": "localhost",  # Internal address
  "port": 5432,
  "ssh_tunnel": {
    "enabled": true,
    "host": "bastion.example.com",
    "port": 22,
    "auth": {
      "selection": "password",  # or "private_key"
      "username": "tunnel_user",
      "password": "***"
    }
  }
}
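Before saving a tunnel config, it can help to sanity-check it client-side. A sketch whose field names mirror the JSON above (the `private_key` field name is an assumption):

```python
def validate_ssh_tunnel(cfg: dict) -> list:
    """Return a list of problems with an ssh_tunnel block (empty means OK)."""
    errors = []
    if not cfg.get("enabled"):
        return errors  # tunnel disabled, nothing to check
    for field in ("host", "port"):
        if not cfg.get(field):
            errors.append(f"missing {field}")
    auth = cfg.get("auth", {})
    selection = auth.get("selection")
    if selection == "password" and not auth.get("password"):
        errors.append("password auth selected but no password given")
    elif selection == "private_key" and not auth.get("private_key"):
        errors.append("private_key auth selected but no key given")
    elif selection not in ("password", "private_key"):
        errors.append(f"unknown auth selection {selection!r}")
    return errors
```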

Revenue Analytics (Stripe)

For Stripe sources, enable automatic revenue analytics:
PATCH /api/projects/{project_id}/warehouse_sources/{source_id}/revenue_analytics_config/

{
  "enabled": true,
  "include_invoiceless_charges": false
}
This creates managed views for:
  • Customer lifetime value
  • Revenue by cohort
  • Churn analysis

API Reference

Create External Data Source

POST /api/projects/{project_id}/warehouse_sources/

{
  "source_type": "Postgres",
  "prefix": "analytics_",
  "description": "Production analytics database",
  "payload": {
    "host": "db.example.com",
    "port": 5432,
    "database": "analytics",
    "user": "readonly",
    "password": "***",
    "schema": "public",
    "schemas": [
      {
        "name": "users",
        "should_sync": true,
        "sync_type": "incremental",
        "incremental_field": "updated_at",
        "incremental_field_type": "timestamp"
      }
    ]
  }
}

Trigger Manual Sync

POST /api/projects/{project_id}/warehouse_sources/{source_id}/reload/

Delete Source

DELETE /api/projects/{project_id}/warehouse_sources/{source_id}/
Soft-deletes the source and its schemas, and schedules cleanup jobs.

Best Practices

Incremental Syncing

Use incremental syncing for large tables. Ensure incremental fields are indexed and monotonically increasing (timestamps, auto-increment IDs).

Materialized Views

Materialize complex aggregations that power dashboards. Schedule refreshes during off-peak hours to minimize impact on source databases.

Schema Prefixes

Use prefixes when connecting multiple instances (prod/staging) of the same source to avoid naming conflicts.

SSH Tunnels

For production databases, use SSH tunneling with key-based auth instead of exposing databases directly to the internet.

Troubleshooting

Connection Timeouts

If syncs fail with connection timeouts:
  1. Verify firewall rules allow PostHog IPs
  2. Check database connection limits
  3. Consider using SSH tunneling for stable connections
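If you are driving syncs from your own tooling, transient timeouts are also worth retrying with backoff before escalating. A minimal sketch (hypothetical wrapper, not part of PostHog's API):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky call with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulate a connection that fails twice, then succeeds:
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("connection timeout")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)  # no real sleeping in the demo
```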

Incremental Field Issues

Common mistake: using non-monotonic fields like last_login, which can move backwards and cause rows to be skipped. Always use created_at, updated_at, or auto-increment IDs.
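A quick way to vet a candidate column before configuring it as the incremental field is to check that its values never decrease across rows:

```python
def is_monotonic_non_decreasing(values):
    """An incremental field must never go backwards between rows."""
    return all(a <= b for a, b in zip(values, values[1:]))

created_at = [1, 2, 2, 5]   # auto-increment-style: safe (ties are fine)
last_login = [10, 42, 7]    # can decrease: unsafe as an incremental field
```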

Billing Limits

If syncs are paused due to billing limits:
GET /api/projects/{project_id}/warehouse/total_rows_stats/

# Check total_rows vs your plan limit
# Increase limit in billing settings or pause non-critical syncs
