How experiments work
An experiment is a feature flag plus measurement:
- Create a feature flag with multiple variants (control and test)
- Define metrics to measure (conversion, retention, revenue)
- PostHog assigns users to variants randomly
- Track metrics for each variant
- Calculate statistical significance to determine a winner
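The random-but-sticky assignment in the steps above can be sketched as deterministic hash bucketing. This is an illustrative stdlib-only sketch, not PostHog's exact algorithm:

```python
import hashlib

def assign_variant(distinct_id: str, flag_key: str, variants: list[tuple[str, float]]) -> str:
    """Deterministically bucket a user into a variant.

    `variants` is a list of (variant_key, rollout_fraction) pairs summing to 1.0.
    Hashing flag_key + distinct_id gives every user a stable position in [0, 1),
    so the same user always sees the same variant.
    """
    digest = hashlib.sha1(f"{flag_key}.{distinct_id}".encode()).hexdigest()
    position = int(digest[:15], 16) / 16**15  # uniform in [0, 1)
    cumulative = 0.0
    for key, fraction in variants:
        cumulative += fraction
        if position < cumulative:
            return key
    return variants[-1][0]  # guard against float rounding

# Same user, same variant, every time:
split = [("control", 0.5), ("test", 0.5)]
print(assign_variant("user-42", "cta-copy-test", split))
```

Because assignment is a pure function of user and flag key, no per-user state needs to be stored to keep the experience consistent across sessions.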
Experiments use Bayesian statistics by default. Results show the probability of each variant being best, rather than just p-values.
Creating an experiment
Define your hypothesis
Start with a clear hypothesis:
- “Changing the CTA button from ‘Start’ to ‘Try for free’ will increase signups by 10%”
- “Showing pricing upfront will improve trial-to-paid conversion”
- “Reducing onboarding steps will increase activation rate”
Create the experiment
- Go to Experiments → New experiment
- Name it (e.g., “CTA button copy test”)
- Create or link a feature flag
- Define variants:
- Control: Current button text
- Test: New button text
Set metrics
Choose metrics to measure:
Primary metric: The main goal (e.g., signup conversion)
Secondary metrics: Watch for unintended effects:
- Time on page
- Bounce rate
- Support tickets
Defining metrics
Experiments support multiple metric types:
- Conversion rate
- Trend count
- Funnel conversion
Conversion rate measures the percentage of users who complete an event. Metric config:
- Type: Conversion rate
- Event: signup_completed
- Conversion window: 7 days
Implementing variants
Use feature flags to show different variants. PostHog automatically tracks experiment exposure; you don't need to manually capture $experiment_started unless you want custom timing.
Reading results
Experiment results show statistical analysis:
Bayesian analysis (default)
- Probability of being best: Chance this variant is the winner (e.g., 94%)
- Credible interval: Range where the true effect likely lies
- Expected loss: Potential downside if you pick the wrong variant
When to declare a winner:
- Probability of being best > 90% and expected loss is acceptable
- Typically 100+ conversions per variant are needed for confidence
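These quantities can be approximated by Monte Carlo sampling from each variant's Beta posterior. A stdlib-only sketch with uniform priors, not PostHog's exact computation:

```python
import random

def bayesian_summary(conversions_a, n_a, conversions_b, n_b, draws=20000, seed=0):
    """Estimate P(B is best) and B's expected loss via posterior sampling.

    Each variant's conversion rate gets a Beta(1 + conversions, 1 + failures)
    posterior; comparing paired draws approximates the probabilities.
    """
    rng = random.Random(seed)
    post_a = [rng.betavariate(1 + conversions_a, 1 + n_a - conversions_a) for _ in range(draws)]
    post_b = [rng.betavariate(1 + conversions_b, 1 + n_b - conversions_b) for _ in range(draws)]
    p_b_best = sum(b > a for a, b in zip(post_a, post_b)) / draws
    # Expected loss: average conversion rate given up if we ship B but A was actually better
    expected_loss_b = sum(max(a - b, 0.0) for a, b in zip(post_a, post_b)) / draws
    return p_b_best, expected_loss_b

# 10% vs 15% conversion over 1,000 users each: B is almost certainly the winner
p_best, loss = bayesian_summary(100, 1000, 150, 1000)
print(f"P(B best) = {p_best:.3f}, expected loss = {loss:.5f}")
```

With evenly matched variants the same function returns a probability near 50% and a non-trivial expected loss, which is why "probability > 90%" is the shipping bar.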
Frequentist analysis
Switch to frequentist in settings for traditional hypothesis testing:
- P-value: Probability of seeing a difference at least this large if there were no real effect (want < 0.05)
- Confidence interval: Range of likely effect sizes
- Sample size recommendation: How many more users needed
When to declare a winner:
- P-value < 0.05 (95% confidence)
- Effect size is practically significant (not just statistically)
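For conversion metrics, the frequentist comparison boils down to a two-proportion z-test, which you can sanity-check yourself. A stdlib sketch of the standard formula, not PostHog's exact implementation:

```python
import math

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 10% vs 15% over 1,000 users each is comfortably significant
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Run the same test on 10% vs 10.5% and the p-value is far above 0.05: small lifts need much more traffic, which is what the sample-size guidance below quantifies.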
Experiment duration
How long to run experiments:
Wait for full cycles
Run for at least one full cycle of user behavior. If users return weekly, run for 1-2 weeks minimum. This accounts for day-of-week effects.
Reach minimum sample size
Need enough conversions for statistical power:
- Small effects (2-5% lift): 5,000+ users per variant
- Medium effects (5-10% lift): 1,000+ users per variant
- Large effects (10%+ lift): 500+ users per variant
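Rules of thumb like these come from the standard sample-size formula for comparing two proportions at 80% power and 5% significance. A stdlib sketch; the exact numbers depend on your baseline conversion rate:

```python
import math

def sample_size_per_variant(baseline, lifted, alpha_z=1.96, power_z=0.84):
    """Users per variant to detect a baseline -> lifted conversion change.

    alpha_z = 1.96 is two-sided alpha = 0.05; power_z = 0.84 gives 80% power.
    """
    variance = baseline * (1 - baseline) + lifted * (1 - lifted)
    return math.ceil((alpha_z + power_z) ** 2 * variance / (lifted - baseline) ** 2)

# Detecting a 10% -> 12% move needs far more traffic than 10% -> 15%
print(sample_size_per_variant(0.10, 0.12))  # small effect
print(sample_size_per_variant(0.10, 0.15))  # larger effect
```

Because the required sample size scales with the inverse square of the effect size, halving the lift you want to detect roughly quadruples the traffic you need.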
Don't peek too early
Looking at results daily and stopping when p < 0.05 inflates false positives. Decide duration beforehand or use Bayesian analysis which handles peeking better.
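You can see the inflation directly by simulating A/A tests (both arms identical) and checking a z-test at every peek. A stdlib sketch; the exact false-positive rate depends on how often you peek and how much traffic arrives between peeks:

```python
import math
import random

def z_test_p(c_a, n_a, c_b, n_b):
    """Two-sided p-value for a difference in conversion rates."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) or 1e-9
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulate(peeks=10, batch=200, rate=0.3, sims=300, seed=0):
    """Compare false-positive rates: stop at any significant peek vs. one final look."""
    rng = random.Random(seed)
    stopped_early = finished = 0
    for _ in range(sims):
        c_a = c_b = n = 0
        significant_at_any_peek = False
        for _ in range(peeks):
            c_a += sum(rng.random() < rate for _ in range(batch))
            c_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if z_test_p(c_a, n, c_b, n) < 0.05:
                significant_at_any_peek = True
        stopped_early += significant_at_any_peek
        finished += z_test_p(c_a, n, c_b, n) < 0.05
    return stopped_early / sims, finished / sims

peek_rate, final_rate = simulate()
print(f"false positives: peeking {peek_rate:.0%} vs single look {final_rate:.0%}")
```

Even though both arms are identical, stopping at the first significant peek flags a "winner" several times more often than the single planned look does.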
Multivariate experiments
Test more than two variants:
- Control: 33%
- Test A: 33%
- Test B: 33%
More variants = longer test duration. Each variant needs enough traffic for statistical power.
Holdout groups
Measure long-term impact with holdout groups:
Create a holdout
- After experiment wins, create a holdout group
- Keep 5-10% of users on the control variant
- Ship winning variant to everyone else
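Holdout membership has to be sticky so the same users stay on the old experience over time. That again comes down to deterministic hashing; an illustrative sketch, not PostHog's exact mechanism:

```python
import hashlib

def in_holdout(distinct_id: str, holdout_key: str, fraction: float = 0.05) -> bool:
    """Keep a stable `fraction` of users on the old experience.

    Hashing gives each user a fixed position in [0, 1); users below
    `fraction` stay in the holdout and keep seeing the control variant.
    """
    digest = hashlib.sha1(f"{holdout_key}.{distinct_id}".encode()).hexdigest()
    return int(digest[:15], 16) / 16**15 < fraction

# Gate the winning variant on holdout membership (key is illustrative):
variant = "control" if in_holdout("user-42", "cta-holdout") else "test"
print(variant)
```

Comparing metrics between the holdout and everyone else months later tells you whether the winning variant's lift persisted.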
Common workflows
Optimize conversion
Test changes to signup flow, pricing pages, or CTAs. Measure conversion rate as primary metric. Watch bounce rate as secondary.
Improve retention
Test onboarding flows or feature changes. Use 7-day retention as primary metric. Monitor activation rate as secondary.
Increase engagement
Test UI changes or new features. Measure daily active usage or feature adoption as primary metric.
Reduce churn
Test interventions for at-risk users. Measure churn rate over 30 days. Track support tickets as secondary metric.
Sequential testing
Run multiple experiments on the same flow:
- Experiment 1: Test headline copy (winner increases signups 8%)
- Experiment 2: Test form length on winning headline (winner increases 5%)
- Experiment 3: Test button color on winning form (no significant change)
Run experiments sequentially, not simultaneously, on the same user flow. Simultaneous experiments create interaction effects that skew results.
Statistical rigor
Avoid p-hacking
Don’t:
- Stop experiments early when results look good
- Run multiple experiments and only report winners
- Change metrics after seeing results
- Add more traffic just to reach p < 0.05
Do:
- Decide duration and sample size before starting
- Report all experiments, even failures
- Define metrics beforehand
- Use Bayesian analysis if you need to peek
Understand statistical power
Power = probability of detecting an effect if it exists. Need 80%+ power for reliable experiments.
Power increases with:
- Larger sample size
- Larger effect size
- Lower variance in metrics
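Power for a conversion experiment can be approximated in closed form, which makes those three factors concrete. A stdlib sketch using the normal approximation:

```python
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(p_control, p_test, n_per_variant, alpha_z=1.96):
    """Approximate power of a two-sided test at alpha = 0.05.

    Probability of detecting the p_control -> p_test difference
    with n_per_variant users in each arm.
    """
    se = math.sqrt(p_control * (1 - p_control) / n_per_variant
                   + p_test * (1 - p_test) / n_per_variant)
    return normal_cdf(abs(p_test - p_control) / se - alpha_z)

# Same effect, different traffic: power grows with sample size
print(f"n=200:  {power(0.10, 0.15, 200):.0%}")
print(f"n=1000: {power(0.10, 0.15, 1000):.0%}")
```

The standard error shrinks as sample size grows and as metric variance falls, and the numerator grows with effect size, which is exactly the list above.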
Check for novelty effects
New features often see inflated engagement initially. Use holdout groups to measure if the effect persists beyond the first week.
Best practices
Start with high-impact changes
Don’t test button colors first. Test major changes like pricing structure, core flows, or key value props. Optimize details after you’ve optimized fundamentals.
Define success metrics upfront
Write down your hypothesis, primary metric, and decision criteria before launching. Prevents moving goalposts when results arrive.
Monitor secondary metrics
A test that increases signups but doubles support tickets isn’t a real win. Always track potential negative side effects.
Ship learnings, not just winners
Document why variants won or lost. Build institutional knowledge about what works for your users.
Combine with qualitative data
Watch session replays of experiment participants. Numbers tell you what happened, replays show you why.