Experiments let you test changes scientifically. Compare variants with statistical rigor to know which version actually performs better, not which one you think performs better.

How experiments work

An experiment is a feature flag plus measurement:
  1. Create a feature flag with multiple variants (control and test)
  2. Define metrics to measure (conversion, retention, revenue)
  3. PostHog assigns users to variants randomly
  4. Track metrics for each variant
  5. Calculate statistical significance to determine a winner
Experiments use Bayesian statistics by default. Results show probability of each variant being best, not just p-values.
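Behind the random assignment in step 3, flag-based experimentation platforms typically bucket users deterministically: hashing the user's ID together with the flag key means the same user always lands in the same variant without storing any state. A minimal sketch of that idea (the hash and weights here are illustrative, not PostHog's internal algorithm):

```javascript
// Deterministically bucket a user into a variant.
// NOTE: illustrative only -- this is NOT PostHog's internal hash.
function assignVariant(userId, flagKey, variants) {
  const input = `${flagKey}.${userId}`;
  // djb2-style hash with a murmur3-style finalizer for good spread
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    hash = (Math.imul(hash, 33) ^ input.charCodeAt(i)) >>> 0;
  }
  hash ^= hash >>> 16;
  hash = Math.imul(hash, 0x85ebca6b) >>> 0;
  hash ^= hash >>> 13;
  hash = Math.imul(hash, 0xc2b2ae35) >>> 0;
  hash = (hash ^ (hash >>> 16)) >>> 0;
  // Map the hash to [0, 1) and walk the cumulative weights
  const bucket = hash / 0x100000000;
  let cumulative = 0;
  for (const { key, weight } of variants) {
    cumulative += weight;
    if (bucket < cumulative) return key;
  }
  return variants[variants.length - 1].key;
}

// The same user always gets the same variant on every call:
const split = [{ key: 'control', weight: 0.5 }, { key: 'test', weight: 0.5 }];
const v1 = assignVariant('user-42', 'cta-button-test', split);
const v2 = assignVariant('user-42', 'cta-button-test', split);
// v1 === v2
```

Deterministic bucketing is what makes experiments consistent across page loads and devices tied to the same ID, with no per-user assignment table needed.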

Creating an experiment

Step 1: Define your hypothesis

Start with a clear hypothesis:
  • “Changing the CTA button from ‘Start’ to ‘Try for free’ will increase signups by 10%”
  • “Showing pricing upfront will improve trial-to-paid conversion”
  • “Reducing onboarding steps will increase activation rate”
Step 2: Create the experiment

  1. Go to Experiments → New experiment
  2. Name it (e.g., “CTA button copy test”)
  3. Create or link a feature flag
  4. Define variants:
    • Control: Current button text
    • Test: New button text
Step 3: Set metrics

Choose metrics to measure:
Primary metric: The main goal (e.g., signup conversion)
Secondary metrics: Watch for unintended effects:
  • Time on page
  • Bounce rate
  • Support tickets
Step 4: Launch

Set rollout percentage (typically 50/50 split) and launch. PostHog starts tracking results immediately.

Defining metrics

Experiments support multiple metric types. A conversion rate metric, for example, measures the percentage of users who complete an event:
// Track the conversion event
posthog.capture('signup_completed', {
  plan: 'pro',
  source: 'experiment'
})
Metric config:
  • Type: Conversion rate
  • Event: signup_completed
  • Conversion window: 7 days
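The conversion window means a signup only counts toward the metric if it happens within 7 days of the user's first exposure to the experiment. A rough sketch of that computation over raw per-user data (the record shape here is hypothetical, not PostHog's event schema):

```javascript
// Conversion rate with a conversion window.
// Record shape is hypothetical: { exposedAt, convertedAt } as epoch ms,
// convertedAt === null if the user never converted.
const WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7-day conversion window

function conversionRate(users) {
  let exposed = 0;
  let converted = 0;
  for (const u of users) {
    exposed++;
    // Only count conversions that happened within the window
    if (u.convertedAt !== null && u.convertedAt - u.exposedAt <= WINDOW_MS) {
      converted++;
    }
  }
  return exposed === 0 ? 0 : converted / exposed;
}
```

A user who converts on day 8 is counted as exposed but not converted, which keeps slow conversions from inflating the rate of whichever variant launched first.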

Implementing variants

Use feature flags to show different variants:
// Get the experiment variant for this user
const variant = posthog.getFeatureFlag('cta-button-test')

if (variant === 'test') {
  // Test variant: new copy
  button.textContent = 'Try for free'
} else {
  // Control variant: original copy
  button.textContent = 'Start'
}

// Track when users see the experiment
posthog.capture('$experiment_started', {
  experiment: 'cta-button-test',
  variant: variant
})

// Track the conversion event
button.addEventListener('click', () => {
  posthog.capture('cta_clicked', {
    button_text: button.textContent
  })
})
PostHog automatically tracks experiment exposure. You don’t need to manually track $experiment_started unless you want custom timing.

Reading results

Experiment results show statistical analysis:

Bayesian analysis (default)

  • Probability of being best: Chance this variant is the winner (e.g., 94%)
  • Credible interval: Range where the true effect likely lies
  • Expected loss: Potential downside if you pick the wrong variant
When to ship:
  • Probability > 90% and expected loss is acceptable
  • Typically need 100+ conversions per variant for confidence
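For conversion metrics, "probability of being best" can be estimated with a Beta-Binomial model: draw samples from each variant's posterior and count how often each one wins. A Monte Carlo sketch of that idea, assuming a uniform Beta(1,1) prior (PostHog's exact model may differ):

```javascript
// Monte Carlo estimate of P(variant A's true conversion rate beats B's)
// under independent Beta(1,1) priors.

// Standard normal sample via Box-Muller
function randNormal() {
  const u = 1 - Math.random();
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Marsaglia-Tsang gamma sampler (valid for shape >= 1)
function randGamma(shape) {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = randNormal();
    let v = 1 + c * x;
    if (v <= 0) continue;
    v = v * v * v;
    const u = Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) sample as a ratio of gamma samples
function randBeta(a, b) {
  const x = randGamma(a);
  return x / (x + randGamma(b));
}

// Posterior for each variant is Beta(conversions + 1, failures + 1)
function probabilityABeatsB(a, b, samples = 20000) {
  let wins = 0;
  for (let i = 0; i < samples; i++) {
    const pA = randBeta(a.conversions + 1, a.exposures - a.conversions + 1);
    const pB = randBeta(b.conversions + 1, b.exposures - b.conversions + 1);
    if (pA > pB) wins++;
  }
  return wins / samples;
}
```

With 120/1000 conversions versus 80/1000, this returns a probability near 1; with identical data it hovers around 0.5, which is why low-volume experiments stay inconclusive.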

Frequentist analysis

Switch to frequentist in settings for traditional hypothesis testing:
  • P-value: Probability of seeing results at least this extreme if there were no real difference (want < 0.05)
  • Confidence interval: Range of likely effect sizes
  • Sample size recommendation: How many more users needed
When to ship:
  • P-value < 0.05 (95% confidence)
  • Effect size is practically significant (not just statistically)
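For reference, the standard frequentist comparison of two conversion rates is a two-proportion z-test. A sketch (this is the textbook test, not necessarily PostHog's exact implementation):

```javascript
// Two-proportion z-test for conversion rates, returning a two-sided p-value.
function twoProportionTest(convA, nA, convB, nB) {
  const pA = convA / nA;
  const pB = convB / nB;
  // Pooled rate under the null hypothesis of no difference
  const pPool = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nA + 1 / nB));
  const z = (pB - pA) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, pValue };
}

// Standard normal CDF via the Abramowitz-Stegun polynomial approximation
function normalCdf(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const d = 0.3989423 * Math.exp(-x * x / 2);
  const tail = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 +
               t * (-1.821256 + t * 1.330274))));
  return x > 0 ? 1 - tail : tail;
}
```

For example, 120/1000 versus 80/1000 conversions gives z ≈ 3 and p ≈ 0.003, comfortably below the 0.05 threshold; identical rates give p = 1.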

Experiment duration

How long to run experiments:
Run for at least one full cycle of user behavior. If users return weekly, run for 1-2 weeks minimum. This accounts for day-of-week effects.
Need enough conversions for statistical power:
  • Small effects (2-5% lift): 5,000+ users per variant
  • Medium effects (5-10% lift): 1,000+ users per variant
  • Large effects (10%+ lift): 500+ users per variant
PostHog shows recommended sample size in the results.
Looking at results daily and stopping when p < 0.05 inflates false positives. Decide duration beforehand or use Bayesian analysis which handles peeking better.
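The per-variant numbers above come from the standard power calculation for comparing two proportions: n ≈ 2(z_α/2 + z_β)² · p̄(1 − p̄) / δ². A sketch of that textbook approximation (not PostHog's exact calculation):

```javascript
// Approximate per-variant sample size for a two-proportion test
// at two-sided alpha = 0.05 and 80% power.
function sampleSizePerVariant(baselineRate, minDetectableLift) {
  const zAlpha = 1.96; // two-sided 95% significance
  const zBeta = 0.84;  // 80% power
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minDetectableLift); // lift is relative
  const pBar = (p1 + p2) / 2;
  const delta = p2 - p1;
  const n = 2 * Math.pow(zAlpha + zBeta, 2) * pBar * (1 - pBar) /
            (delta * delta);
  return Math.ceil(n);
}
```

Note how sensitive this is to the baseline: detecting a 10% relative lift on a 5% baseline conversion rate requires roughly 31,000 users per variant, which is why small lifts on low-traffic pages take so long to resolve.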

Multivariate experiments

Test more than two variants:
const variant = posthog.getFeatureFlag('pricing-test')

switch (variant) {
  case 'control':
    showPricing([10, 20, 50])  // Current prices
    break
  case 'test-a':
    showPricing([15, 25, 55])  // 50% higher
    break
  case 'test-b':
    showPricing([12, 22, 52])  // 20% higher
    break
}
Split traffic equally (rounded so percentages sum to 100):
  • Control: 34%
  • Test A: 33%
  • Test B: 33%
More variants = longer test duration. Each variant needs enough traffic for statistical power.

Holdout groups

Measure long-term impact with holdout groups:
Step 1: Create a holdout

  1. After experiment wins, create a holdout group
  2. Keep 5-10% of users on the control variant
  3. Ship winning variant to everyone else
Step 2: Measure over time

Track metrics for months to see if the effect persists or if novelty wore off.
Step 3: Sunset the holdout

After confirming sustained impact, remove the holdout and ship to 100%.
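In code, a holdout is just another flag check: holdout users keep the original experience while everyone else gets the winner. A sketch, continuing the CTA example above (the `cta-button-holdout` flag key is hypothetical):

```javascript
// Decide which experience a user gets once the winner has shipped.
// Holdout users stay on the original copy for long-term comparison.
function buttonTextFor(holdoutVariant) {
  if (holdoutVariant === 'holdout') {
    return 'Start'; // control experience, kept for 5-10% of users
  }
  return 'Try for free'; // winning variant for everyone else
}

// With PostHog it would be wired up roughly like this
// ('cta-button-holdout' is a hypothetical flag key):
// button.textContent = buttonTextFor(posthog.getFeatureFlag('cta-button-holdout'));
```

When the holdout is sunset, deleting the flag check leaves everyone on the winning variant.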

Common workflows

Optimize conversion

Test changes to signup flow, pricing pages, or CTAs. Measure conversion rate as primary metric. Watch bounce rate as secondary.

Improve retention

Test onboarding flows or feature changes. Use 7-day retention as primary metric. Monitor activation rate as secondary.

Increase engagement

Test UI changes or new features. Measure daily active usage or feature adoption as primary metric.

Reduce churn

Test interventions for at-risk users. Measure churn rate over 30 days. Track support tickets as secondary metric.

Sequential testing

Run multiple experiments on the same flow:
  1. Experiment 1: Test headline copy (winner increases signups 8%)
  2. Experiment 2: Test form length on winning headline (winner increases 5%)
  3. Experiment 3: Test button color on winning form (no significant change)
Cumulative impact: lifts compound multiplicatively, so 1.08 × 1.05 ≈ 1.134, a 13.4% total improvement
Run experiments sequentially, not simultaneously, on the same user flow. Simultaneous experiments create interaction effects that skew results.
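Because sequential lifts compound rather than add, a small helper makes the cumulative arithmetic explicit:

```javascript
// Total relative improvement from a sequence of experiment lifts.
// Lifts compound: 1.08 * 1.05 = 1.134, not 8% + 5% = 13%.
function compoundLift(lifts) {
  return lifts.reduce((total, lift) => total * (1 + lift), 1) - 1;
}

const total = compoundLift([0.08, 0.05]);
// total ≈ 0.134, i.e. a 13.4% cumulative improvement
```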

Statistical rigor

Don’t:
  • Stop experiments early when results look good
  • Run multiple experiments and only report winners
  • Change metrics after seeing results
  • Add more traffic just to reach p < 0.05
Do:
  • Decide duration and sample size before starting
  • Report all experiments, even failures
  • Define metrics beforehand
  • Use Bayesian analysis if you need to peek
Power is the probability of detecting an effect if it exists. You need 80%+ power for reliable experiments. Power increases with:
  • Larger sample size
  • Larger effect size
  • Lower variance in metrics
PostHog calculates required sample size for 80% power.
New features often see inflated engagement initially. Use holdout groups to measure if the effect persists beyond the first week.

Best practices

Don’t test button colors first. Test major changes like pricing structure, core flows, or key value props. Optimize details after you’ve optimized fundamentals.
Write down your hypothesis, primary metric, and decision criteria before launching. Prevents moving goalposts when results arrive.
A test that increases signups but doubles support tickets isn’t a real win. Always track potential negative side effects.
Document why variants won or lost. Build institutional knowledge about what works for your users.
Watch session replays of experiment participants. Numbers tell you what happened, replays show you why.
