How experiments work
An experiment is a feature flag plus measurement:
- Create a feature flag with multiple variants (control and test)
- Define metrics to measure (conversion, retention, revenue)
- PostHog assigns users to variants randomly
- Track metrics for each variant
- Calculate statistical significance to determine a winner
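The random-but-sticky assignment in the steps above can be sketched as deterministic hash bucketing. This is an illustrative stdlib-only sketch, not PostHog's exact algorithm:

```python
import hashlib

def assign_variant(distinct_id: str, flag_key: str, variants: list[tuple[str, float]]) -> str:
    """Deterministically bucket a user into a variant.

    `variants` is a list of (variant_key, rollout_fraction) pairs summing to 1.0.
    Hashing flag_key + distinct_id gives every user a stable position in [0, 1),
    so the same user always sees the same variant.
    """
    digest = hashlib.sha1(f"{flag_key}.{distinct_id}".encode()).hexdigest()
    position = int(digest[:15], 16) / 16**15  # uniform in [0, 1)
    cumulative = 0.0
    for key, fraction in variants:
        cumulative += fraction
        if position < cumulative:
            return key
    return variants[-1][0]  # guard against float rounding

# Same user, same variant, every time:
split = [("control", 0.5), ("test", 0.5)]
print(assign_variant("user-42", "cta-copy-test", split))
```

Because assignment is a pure function of user and flag key, no per-user state needs to be stored to keep the experience consistent across sessions.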
Experiments use Bayesian statistics by default. Results show the probability of each variant being best, rather than just p-values.
Creating an experiment
Define your hypothesis
Start with a clear hypothesis:
- “Changing the CTA button from ‘Start’ to ‘Try for free’ will increase signups by 10%”
- “Showing pricing upfront will improve trial-to-paid conversion”
- “Reducing onboarding steps will increase activation rate”
Create the experiment
- Go to Experiments → New experiment
- Name it (e.g., “CTA button copy test”)
- Create or link a feature flag
- Define variants:
- Control: Current button text
- Test: New button text
Set metrics
Choose metrics to measure:
Primary metric: The main goal (e.g., signup conversion)
Secondary metrics: Watch for unintended effects:
- Time on page
- Bounce rate
- Support tickets
Defining metrics
Experiments support multiple metric types:
- Conversion rate
- Trend count
- Funnel conversion
Conversion rate measures the percentage of users who complete an event. Metric config:
- Type: Conversion rate
- Event: signup_completed
- Conversion window: 7 days
Implementing variants
Use feature flags to show different variants. PostHog automatically tracks experiment exposure; you don't need to manually capture $experiment_started unless you want custom timing.
Reading results
Experiment results show statistical analysis:
Bayesian analysis (default)
- Probability of being best: Chance this variant is the winner (e.g., 94%)
- Credible interval: Range where the true effect likely lies
- Expected loss: Potential downside if you pick the wrong variant
When to declare a winner:
- Probability of being best > 90% and expected loss is acceptable
- Typically 100+ conversions per variant are needed for confidence
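These quantities can be approximated by Monte Carlo sampling from each variant's Beta posterior. A stdlib-only sketch with uniform priors, not PostHog's exact computation:

```python
import random

def bayesian_summary(conversions_a, n_a, conversions_b, n_b, draws=20000, seed=0):
    """Estimate P(B is best) and B's expected loss via posterior sampling.

    Each variant's conversion rate gets a Beta(1 + conversions, 1 + failures)
    posterior; comparing paired draws approximates the probabilities.
    """
    rng = random.Random(seed)
    post_a = [rng.betavariate(1 + conversions_a, 1 + n_a - conversions_a) for _ in range(draws)]
    post_b = [rng.betavariate(1 + conversions_b, 1 + n_b - conversions_b) for _ in range(draws)]
    p_b_best = sum(b > a for a, b in zip(post_a, post_b)) / draws
    # Expected loss: average conversion rate given up if we ship B but A was actually better
    expected_loss_b = sum(max(a - b, 0.0) for a, b in zip(post_a, post_b)) / draws
    return p_b_best, expected_loss_b

# 10% vs 15% conversion over 1,000 users each: B is almost certainly the winner
p_best, loss = bayesian_summary(100, 1000, 150, 1000)
print(f"P(B best) = {p_best:.3f}, expected loss = {loss:.5f}")
```

With evenly matched variants the same function returns a probability near 50% and a non-trivial expected loss, which is why "probability > 90%" is the shipping bar.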
Frequentist analysis
Switch to frequentist in settings for traditional hypothesis testing:
- P-value: Probability of seeing a difference at least this large if there were no real effect (want < 0.05)
- Confidence interval: Range of likely effect sizes
- Sample size recommendation: How many more users needed
When to declare a winner:
- P-value < 0.05 (95% confidence)
- Effect size is practically significant (not just statistically)
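For conversion metrics, the frequentist comparison boils down to a two-proportion z-test, which you can sanity-check yourself. A stdlib sketch of the standard formula, not PostHog's exact implementation:

```python
import math

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 10% vs 15% over 1,000 users each is comfortably significant
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Run the same test on 10% vs 10.5% and the p-value is far above 0.05: small lifts need much more traffic, which is what the sample-size guidance below quantifies.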
Experiment duration
How long to run experiments:
Wait for full cycles
Run for at least one full cycle of user behavior. If users return weekly, run for 1-2 weeks minimum. This accounts for day-of-week effects.
Reach minimum sample size
Need enough conversions for statistical power:
- Small effects (2-5% lift): 5,000+ users per variant
- Medium effects (5-10% lift): 1,000+ users per variant
- Large effects (10%+ lift): 500+ users per variant
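Rules of thumb like these come from the standard sample-size formula for comparing two proportions at 80% power and 5% significance. A stdlib sketch; the exact numbers depend on your baseline conversion rate:

```python
import math

def sample_size_per_variant(baseline, lifted, alpha_z=1.96, power_z=0.84):
    """Users per variant to detect a baseline -> lifted conversion change.

    alpha_z = 1.96 is two-sided alpha = 0.05; power_z = 0.84 gives 80% power.
    """
    variance = baseline * (1 - baseline) + lifted * (1 - lifted)
    return math.ceil((alpha_z + power_z) ** 2 * variance / (lifted - baseline) ** 2)

# Detecting a 10% -> 12% move needs far more traffic than 10% -> 15%
print(sample_size_per_variant(0.10, 0.12))  # small effect
print(sample_size_per_variant(0.10, 0.15))  # larger effect
```

Because the required sample size scales with the inverse square of the effect size, halving the lift you want to detect roughly quadruples the traffic you need.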
Don't peek too early
Looking at results daily and stopping when p < 0.05 inflates false positives. Decide duration beforehand or use Bayesian analysis which handles peeking better.
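You can see the inflation directly by simulating A/A tests (both arms identical) and checking a z-test at every peek. A stdlib sketch; the exact false-positive rate depends on how often you peek and how much traffic arrives between peeks:

```python
import math
import random

def z_test_p(c_a, n_a, c_b, n_b):
    """Two-sided p-value for a difference in conversion rates."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) or 1e-9
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulate(peeks=10, batch=200, rate=0.3, sims=300, seed=0):
    """Compare false-positive rates: stop at any significant peek vs. one final look."""
    rng = random.Random(seed)
    stopped_early = finished = 0
    for _ in range(sims):
        c_a = c_b = n = 0
        significant_at_any_peek = False
        for _ in range(peeks):
            c_a += sum(rng.random() < rate for _ in range(batch))
            c_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if z_test_p(c_a, n, c_b, n) < 0.05:
                significant_at_any_peek = True
        stopped_early += significant_at_any_peek
        finished += z_test_p(c_a, n, c_b, n) < 0.05
    return stopped_early / sims, finished / sims

peek_rate, final_rate = simulate()
print(f"false positives: peeking {peek_rate:.0%} vs single look {final_rate:.0%}")
```

Even though both arms are identical, stopping at the first significant peek flags a "winner" several times more often than the single planned look does.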
Multivariate experiments
Test more than two variants:
- Control: 33%
- Test A: 33%
- Test B: 33%
More variants = longer test duration. Each variant needs enough traffic for statistical power.
Holdout groups
Measure long-term impact with holdout groups:
Create a holdout
- After experiment wins, create a holdout group
- Keep 5-10% of users on the control variant
- Ship winning variant to everyone else
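Holdout membership has to be sticky so the same users stay on the old experience over time. That again comes down to deterministic hashing; an illustrative sketch, not PostHog's exact mechanism:

```python
import hashlib

def in_holdout(distinct_id: str, holdout_key: str, fraction: float = 0.05) -> bool:
    """Keep a stable `fraction` of users on the old experience.

    Hashing gives each user a fixed position in [0, 1); users below
    `fraction` stay in the holdout and keep seeing the control variant.
    """
    digest = hashlib.sha1(f"{holdout_key}.{distinct_id}".encode()).hexdigest()
    return int(digest[:15], 16) / 16**15 < fraction

# Gate the winning variant on holdout membership (key is illustrative):
variant = "control" if in_holdout("user-42", "cta-holdout") else "test"
print(variant)
```

Comparing metrics between the holdout and everyone else months later tells you whether the winning variant's lift persisted.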
Common workflows
Optimize conversion
Test changes to signup flow, pricing pages, or CTAs. Measure conversion rate as primary metric. Watch bounce rate as secondary.
Improve retention
Test onboarding flows or feature changes. Use 7-day retention as primary metric. Monitor activation rate as secondary.
Increase engagement
Test UI changes or new features. Measure daily active usage or feature adoption as primary metric.
Reduce churn
Test interventions for at-risk users. Measure churn rate over 30 days. Track support tickets as secondary metric.
Sequential testing
Run multiple experiments on the same flow:
- Experiment 1: Test headline copy (winner increases signups 8%)
- Experiment 2: Test form length on winning headline (winner increases 5%)
- Experiment 3: Test button color on winning form (no significant change)
Run experiments sequentially, not simultaneously, on the same user flow. Simultaneous experiments create interaction effects that skew results.
Statistical rigor
Avoid p-hacking
Don’t:
- Stop experiments early when results look good
- Run multiple experiments and only report winners
- Change metrics after seeing results
- Add more traffic just to reach p < 0.05
Do:
- Decide duration and sample size before starting
- Report all experiments, even failures
- Define metrics beforehand
- Use Bayesian analysis if you need to peek
Understand statistical power
Power = probability of detecting an effect if it exists. Need 80%+ power for reliable experiments.
Power increases with:
- Larger sample size
- Larger effect size
- Lower variance in metrics
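Power for a conversion experiment can be approximated in closed form, which makes those three factors concrete. A stdlib sketch using the normal approximation:

```python
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(p_control, p_test, n_per_variant, alpha_z=1.96):
    """Approximate power of a two-sided test at alpha = 0.05.

    Probability of detecting the p_control -> p_test difference
    with n_per_variant users in each arm.
    """
    se = math.sqrt(p_control * (1 - p_control) / n_per_variant
                   + p_test * (1 - p_test) / n_per_variant)
    return normal_cdf(abs(p_test - p_control) / se - alpha_z)

# Same effect, different traffic: power grows with sample size
print(f"n=200:  {power(0.10, 0.15, 200):.0%}")
print(f"n=1000: {power(0.10, 0.15, 1000):.0%}")
```

The standard error shrinks as sample size grows and as metric variance falls, and the numerator grows with effect size, which is exactly the list above.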
Check for novelty effects
New features often see inflated engagement initially. Use holdout groups to measure if the effect persists beyond the first week.
Best practices
Start with high-impact changes
Don’t test button colors first. Test major changes like pricing structure, core flows, or key value props. Optimize details after you’ve optimized fundamentals.
Define success metrics upfront
Write down your hypothesis, primary metric, and decision criteria before launching. Prevents moving goalposts when results arrive.
Monitor secondary metrics
A test that increases signups but doubles support tickets isn’t a real win. Always track potential negative side effects.
Ship learnings, not just winners
Document why variants won or lost. Build institutional knowledge about what works for your users.
Combine with qualitative data
Watch session replays of experiment participants. Numbers tell you what happened, replays show you why.