Hypothesis Testing
Hypothesis testing is the process of using sample data to make decisions about population parameters. This is the culmination of statistical inference - where probability, distributions, and sampling theory come together.
Learning Objectives
By the end of this lesson, you will be able to:
- Construct and interpret confidence intervals for means
- Formulate null and alternative hypotheses
- Conduct one-sample and two-sample t-tests
- Interpret p-values correctly
- Understand Type I and Type II errors
- Apply hypothesis tests to real research questions
Confidence Intervals
A confidence interval provides a range of plausible values for a population parameter (like the mean).
Understanding Confidence Intervals
A 95% confidence interval means: if we repeated our sampling process many times and calculated a 95% CI each time, approximately 95% of those intervals would contain the true population parameter.
It does NOT mean: “There’s a 95% probability the true mean is in this interval” (the true mean is fixed, not random).
Formula for CI of the Mean
When the population standard deviation (σ) is unknown, we use the t-distribution:
x̄ ± t(α/2, n−1) · s/√n
where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t(α/2, n−1) is the critical value from the t-distribution with n − 1 degrees of freedom.
Example 1: Sleep Hours Confidence Interval
This example uses the sleep-hours data from our student health study (n = 150).
Example 2: Stress Scores Confidence Interval
- Higher confidence level → wider interval
- This is the trade-off: more confidence requires more uncertainty (wider range)
- All intervals suggest mean stress is around 18-19 on the 0-40 scale
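As a sketch of how these intervals are computed (the mean and standard deviation below are assumed summary values for the stress scores, not the study's actual output):

```python
from math import sqrt
from scipy import stats

# Assumed summary statistics for stress scores on the 0-40 scale
n, mean, sd = 150, 18.5, 8.0
sem = sd / sqrt(n)  # standard error of the mean

for conf in (0.90, 0.95, 0.99):
    lo, hi = stats.t.interval(conf, df=n - 1, loc=mean, scale=sem)
    print(f"{conf:.0%} CI: ({lo:.2f}, {hi:.2f})  width = {hi - lo:.2f}")
```

The printed widths increase with the confidence level, matching the trade-off described above.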
Why does the interval get wider with higher confidence?
Think about it intuitively: to be MORE confident you’ve captured the true value, you need to cast a WIDER net.
- 90% confidence: willing to be wrong 10% of the time → narrower interval
- 99% confidence: only willing to be wrong 1% of the time → must expand the interval
Effect of Sample Size on Confidence Intervals
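As a numeric sketch (the standard deviation below is an assumed value for sleep hours), the margin of error shrinks as the sample size grows:

```python
from math import sqrt
from scipy import stats

# Assumed sleep-hours SD of 1.3; 95% confidence level
sd, conf = 1.3, 0.95
for n in (25, 100, 400):
    t_crit = stats.t.ppf((1 + conf) / 2, df=n - 1)  # two-sided critical value
    margin = t_crit * sd / sqrt(n)                  # margin of error
    print(f"n = {n:4d}: margin of error = ±{margin:.3f} hours")
```

Each quadrupling of n roughly halves the margin of error, since the standard error scales as 1/√n.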
Comparing intervals at different sample sizes shows that the width shrinks in proportion to 1/√n: quadrupling the sample size halves the margin of error.
The Hypothesis Testing Framework
Step 1: State Hypotheses
Every hypothesis test involves two competing statements:
- Null Hypothesis (H₀): The status quo or “no effect” hypothesis
- Alternative Hypothesis (H₁ or Hₐ): What we’re trying to find evidence for
Structure of hypotheses:
Null hypothesis (H₀): usually contains “=”
- H₀: μ = μ₀
- H₀: μ₁ = μ₂
- H₀: p = p₀
Alternative hypothesis (H₁): states the effect we seek evidence for:
- H₁: μ ≠ μ₀ (two-sided)
- H₁: μ < μ₀ (one-sided, left tail)
- H₁: μ > μ₀ (one-sided, right tail)
Step 2: Choose Significance Level (α)
- α = 0.05 is most common (5% risk of Type I error)
- More conservative: α = 0.01
- Less conservative: α = 0.10
Step 3: Calculate Test Statistic
The test statistic measures how many standard errors the sample estimate is from the null hypothesis value.
Step 4: Find P-value
The p-value is the probability of obtaining results at least as extreme as those observed, assuming H₀ is true.
Step 5: Make Decision
- If p-value < α: Reject H₀ (statistically significant)
- If p-value ≥ α: Fail to reject H₀ (not statistically significant)
One-Sample T-Test
Tests whether a population mean equals a specific value.
Example: Are Students Sleep Deprived?
Research Question: Do students sleep less than the recommended 7 hours per night on average?
Hypotheses:
- H₀: μ = 7
- H₁: μ < 7 (one-sided, left tail)
Test for Proportions
Tests whether a population proportion equals a specific value.
Example: Sleep Quality
Research Question: Is good sleep quality rare among students (reported by less than 50%)?
Hypotheses:
- H₀: p = 0.5
- H₁: p < 0.5 (one-sided)
When to use z vs. t:
- Proportions: Use z-test (normal approximation)
- Means with σ known: Use z-test
- Means with σ unknown: Use t-test (almost always in practice)
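Both tests above can be sketched from summary statistics. The sample values below (mean, SD, and count) are assumed for illustration, chosen to be consistent with the very small p-values reported later in this lesson:

```python
from math import sqrt
from scipy import stats

# --- One-sample t-test: H0: mu = 7 vs H1: mu < 7 (assumed summary stats) ---
n, xbar, s = 150, 6.4, 1.3
t_stat = (xbar - 7.0) / (s / sqrt(n))
p_t = stats.t.cdf(t_stat, df=n - 1)       # left-tailed p-value
print(f"t = {t_stat:.2f}, p = {p_t:.2e}")

# --- One-sample z-test for a proportion: H0: p = 0.5 vs H1: p < 0.5 ---
x = 42                                     # assumed count reporting good sleep
phat = x / n
z = (phat - 0.5) / sqrt(0.5 * 0.5 / n)     # SE uses the null value p0 = 0.5
p_z = stats.norm.cdf(z)                    # left-tailed p-value
print(f"z = {z:.2f}, p = {p_z:.2e}")
```

Note that the proportion test's standard error is computed from the null value p₀ = 0.5, not from the sample proportion.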
Two-Sample T-Test
Compares means between two independent groups.
Example: Physical Activity and Stress
Research Question: Do students with high physical activity have lower stress than students with low activity?
Hypotheses:
- H₀: μ_high = μ_low
- H₁: μ_high < μ_low (one-sided)
One-sided vs. Two-sided Tests
Two-sided test (H₁: μ₁ ≠ μ₂):
- Use when you want to detect ANY difference (either direction)
- More conservative (harder to reject H₀)
- Reports if groups differ, without specifying direction
One-sided test (H₁: μ₁ < μ₂ or H₁: μ₁ > μ₂):
- Use when you have a specific directional hypothesis
- Based on theory, prior research, or research design
- More powerful for detecting effects in the predicted direction
- Should be decided BEFORE looking at the data
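The activity-stress comparison above can be sketched from group summaries. The group means, SDs, and sizes below are assumed values, chosen to be roughly consistent with the p = 0.0014 reported later:

```python
from scipy import stats

# Assumed group summaries (not the study's actual numbers):
# Low activity:  n = 50, mean stress 21.1, SD 7.5
# High activity: n = 50, mean stress 16.5, SD 7.5
t_stat, p_two = stats.ttest_ind_from_stats(
    mean1=21.1, std1=7.5, nobs1=50,
    mean2=16.5, std2=7.5, nobs2=50,
    equal_var=True,
)
# One-sided p-value (H1: low-activity mean > high-activity mean); the
# observed difference is in the hypothesized direction, so halve the
# two-sided p-value
p_one = p_two / 2
print(f"t = {t_stat:.2f}, one-sided p = {p_one:.4f}")
```

Halving the two-sided p-value is valid only because the sample difference points in the direction stated by H₁ before the data were examined.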
Understanding P-Values
The p-value is one of the most misunderstood concepts in statistics.
What P-Value Actually Means
P-value definition: the probability of observing data as extreme as what we obtained (or more extreme), assuming the null hypothesis is true.
P-value does NOT tell you:
- ❌ The probability that H₀ is true
- ❌ The probability that H₁ is true
- ❌ The size or importance of an effect
- ❌ Whether results are practically meaningful
P-value DOES tell you:
- ✓ How compatible your data is with H₀
- ✓ Whether results are statistically significant at α level
- ✓ The strength of evidence against H₀ (smaller p = stronger evidence)
Interpreting Different P-Values
| P-value Range | Interpretation | Evidence Against H₀ |
|---|---|---|
| p > 0.10 | Not significant | Little to none |
| 0.05 < p ≤ 0.10 | Marginally significant | Weak |
| 0.01 < p ≤ 0.05 | Significant | Moderate |
| 0.001 < p ≤ 0.01 | Very significant | Strong |
| p ≤ 0.001 | Highly significant | Very strong |
Example: Comparing Evidence Strength
From our three hypothesis tests:
- Sleep < 7 hours: p < 0.000001 → Very strong evidence
- Good sleep quality < 50%: p < 0.000001 → Very strong evidence
- High activity = lower stress: p = 0.0014 → Strong evidence
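The table's categories can be expressed as a small helper function (a sketch; the thresholds are exactly those tabulated above):

```python
def evidence_against_h0(p: float) -> str:
    """Map a p-value to the evidence categories in the table above."""
    if p > 0.10:
        return "little to none"
    elif p > 0.05:
        return "weak"
    elif p > 0.01:
        return "moderate"
    elif p > 0.001:
        return "strong"
    return "very strong"

print(evidence_against_h0(0.0014))     # the activity-stress test
print(evidence_against_h0(0.000001))   # the sleep test
```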
Type I and Type II Errors
Hypothesis testing involves uncertainty, which means we can make errors.
The Two Types of Errors
| | H₀ is Actually True | H₀ is Actually False |
|---|---|---|
| Reject H₀ | ❌ Type I Error (False Positive) - Probability = α | ✓ Correct Decision - Probability = 1 - β (Power) |
| Fail to Reject H₀ | ✓ Correct Decision - Probability = 1 - α | ❌ Type II Error (False Negative) - Probability = β |
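The claim that the Type I error rate equals α can be checked by simulation: draw many samples where H₀ really is true and count how often the test falsely rejects. This sketch uses assumed values (μ = 7, σ = 1.3) for the sleep example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, trials = 0.05, 30, 20_000

# Simulate many studies where H0 is TRUE (mu really is 7) and count
# how often a one-sample t-test falsely rejects at alpha = 0.05
data = rng.normal(loc=7.0, scale=1.3, size=(trials, n))
result = stats.ttest_1samp(data, popmean=7.0, axis=1)
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"False-positive rate: {false_positive_rate:.3f}")  # close to 0.05
```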
Type I Error (α)
Definition: Rejecting H₀ when it’s actually true (false positive).
Example - Sleep Study:
- Truth: Students actually DO sleep 7+ hours on average
- Our conclusion: We conclude they sleep less than 7 hours
- Consequence: Implement expensive sleep programs unnecessarily
Type II Error (β)
Definition: Failing to reject H₀ when it’s actually false (false negative).
Example - Physical Activity Study:
- Truth: High activity DOES reduce stress
- Our conclusion: We fail to find significant evidence
- Consequence: Don’t implement beneficial activity programs
The Trade-off Between Type I and Type II Errors
There’s an inherent trade-off:
If you decrease α (Type I error):
- Harder to reject H₀ (more conservative)
- Increased β (Type II error)
- Less likely to detect true effects
If you increase α (Type I error):
- Easier to reject H₀
- Decreased β (Type II error)
- More likely to detect true effects
- But more false positives
If you increase the sample size (n):
- Can decrease BOTH error types
- More expensive/time-consuming
- This is why power analysis guides sample size planning
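How sample size drives down the Type II error can be sketched with a normal approximation. The σ below is an assumed value; the true mean of 6.8 hours is the alternative we want the test of H₀: μ = 7 to detect:

```python
from math import sqrt
from scipy import stats

# Power of a left-tailed one-sample z-test of H0: mu = 7
# against an assumed true mean of 6.8 (sigma assumed known = 1.3)
mu0, mu_true, sigma, alpha = 7.0, 6.8, 1.3, 0.05

for n in (50, 150, 400):
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean falls below this cutoff
    cutoff = mu0 + stats.norm.ppf(alpha) * se
    power = stats.norm.cdf((cutoff - mu_true) / se)  # P(reject | mu = mu_true)
    print(f"n = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```

Power rises (and β falls) as n grows, while α stays fixed at 0.05 - exactly the "decrease both errors" effect described above.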
Contextualizing Errors in Student Health
Type I Error Example: Sleep Intervention
Scenario: We conclude students are sleep deprived (reject H₀: μ = 7) when they’re actually not.
Consequences:
- Waste resources on unnecessary sleep programs
- Divert attention from real problems
- Possibly create anxiety about sleep when unnecessary
Type II Error Example: Missing Stress-Activity Link
Scenario: We fail to detect that exercise reduces stress (fail to reject H₀) when it actually does.
Consequences:
- Don’t implement physical activity programs
- Students continue suffering from high stress
- Miss opportunity to improve mental health and academic performance
Complete Hypothesis Test Example
Let’s walk through a complete analysis combining everything we’ve learned.
Research Question
Do students with healthy nutrition (score ≥ 7) have better academic performance than those with poor nutrition (score < 7)?
Step 1: Formulate Hypotheses
- H₀: μ_healthy = μ_poor
- H₁: μ_healthy > μ_poor (one-sided)
Step 2: Choose α
α = 0.05 (standard for educational research)
Step 3: Check Assumptions
For an independent-samples t-test: independent observations, roughly normal outcomes within each group (or a large enough sample), and similar group variances.
Step 4: Descriptive Statistics
Step 5: Conduct Test
Step 6: Make Decision
Step 7: Calculate Effect Size
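Cohen's d can be computed from the same group summaries using the pooled standard deviation:

```python
from math import sqrt

# Group summaries from the example write-up in Step 8
n1, m1, s1 = 59, 7.18, 0.89   # healthy nutrition group
n2, m2, s2 = 91, 6.95, 0.94   # poor nutrition group

# Pooled standard deviation across the two groups
pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / pooled_sd
print(f"Cohen's d = {d:.2f}")  # a small effect by conventional benchmarks
```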
Step 8: Report Results
Example write-up: “We conducted an independent-samples t-test to compare academic performance between students with healthy nutrition (score ≥ 7, n = 59, M = 7.18, SD = 0.89) and poor nutrition (score < 7, n = 91, M = 6.95, SD = 0.94). Students with healthy nutrition scored somewhat higher, but the difference was not statistically significant at α = 0.05, t(148) = 1.53, p = 0.064, and the effect was small (Cohen’s d = 0.25).”
Complete reporting includes:
- Test used and why
- Descriptive statistics for each group (n, M, SD)
- Test statistic value and degrees of freedom
- P-value
- Effect size
- Interpretation in context
Statistical Significance vs. Practical Significance
A result can be:
- Statistically significant but practically trivial
- Practically important but not statistically significant
Example: Large Sample Paradox
With n = 10,000 students, a 0.1-hour difference in sleep might be statistically significant (p < 0.05) but practically meaningless - 6 minutes is negligible.
Example: Small Sample Problem
With n = 15 students, a 1.0-hour difference might not reach statistical significance (p = 0.08) but could be very important practically.
Common Mistakes and How to Avoid Them
1. Confusing “Not Significant” with “No Effect”
❌ Wrong: “We found no effect of exercise on stress (p = 0.12)”
✓ Right: “We found no statistically significant evidence that exercise reduces stress (p = 0.12), though the observed difference was in the expected direction. A larger sample size may be needed.”
2. Multiple Testing Problem
If you test 20 hypotheses at α = 0.05, you expect 1 false positive by chance alone! Solution: use corrections like:
- Bonferroni correction (α divided by the number of tests)
- False Discovery Rate control
- Pre-registered primary outcomes
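The arithmetic behind the multiple-testing problem is simple (the family-wise rate below assumes the 20 tests are independent):

```python
m, alpha = 20, 0.05

expected_false_positives = m * alpha  # about 1 by chance alone
fwer = 1 - (1 - alpha) ** m           # P(at least one false positive),
                                      # assuming independent tests
bonferroni_alpha = alpha / m          # corrected per-test threshold

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Family-wise error rate:   {fwer:.2f}")
print(f"Bonferroni threshold:     {bonferroni_alpha:.4f}")
```

With 20 tests there is roughly a 64% chance of at least one false positive, which is why the per-test threshold must be tightened.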
3. Stopping When You Get p < 0.05
❌ P-hacking: Keep collecting data and testing until p < 0.05
✓ Right: Determine sample size beforehand (power analysis) and collect all data before testing
4. Interpreting P-Values as Effect Size
p = 0.001 doesn’t mean a “bigger effect” than p = 0.04 - it means stronger evidence, often due to larger sample size!
Key Takeaways
- Confidence intervals quantify uncertainty in parameter estimates
  - 95% CI means the procedure captures the true value 95% of the time
  - Wider CI = more uncertainty; narrower CI = more precision
  - Width decreases with larger samples
- Hypothesis testing provides a framework for making decisions
  - State H₀ and H₁ before analyzing data
  - Choose α (usually 0.05)
  - Calculate test statistic and p-value
  - Make decision: reject or fail to reject H₀
- P-values measure evidence against H₀
  - NOT the probability H₀ is true
  - NOT the importance of an effect
  - Smaller p = stronger evidence against H₀
- Two types of errors:
  - Type I (α): False positive - rejecting true H₀
  - Type II (β): False negative - failing to reject false H₀
  - Trade-off between them; both decrease with larger n
- Context is crucial:
  - Statistical significance ≠ practical significance
  - Report effect sizes, not just p-values
  - Consider costs and benefits of errors
  - Interpret findings in light of the research question
Real-World Applications
Our student health study demonstrates how hypothesis testing informs policy:
- Sleep intervention needed: strong evidence (p < 0.001) that students are sleep-deprived
  - Policy: Limit early classes, educate on sleep hygiene
- Promote physical activity: evidence (p = 0.0014) that it reduces stress
  - Policy: Subsidize gym memberships, create campus rec programs
- Nutrition programs: weak evidence (p = 0.064) for grade improvement
  - Policy: More research needed before major investment
Statistical thinking in action: these analyses transformed raw data into actionable insights. The same framework applies to:
- Medical research (drug efficacy)
- A/B testing (website conversion rates)
- Quality control (manufacturing defects)
- Policy evaluation (program effectiveness)
- Social science (behavior interventions)
Next Steps in Your Statistical Journey
You now have the core tools for statistical inference! To go further:
- Learn more advanced tests:
  - ANOVA (comparing 3+ groups)
  - Chi-square tests (categorical data)
  - Regression analysis (multiple predictors)
  - Non-parametric tests (when assumptions are violated)
- Dive deeper into study design:
  - Experimental vs. observational studies
  - Controlling for confounding variables
  - Power analysis and sample size planning
- Explore modern approaches:
  - Bootstrap methods
  - Bayesian inference
  - Machine learning vs. traditional statistics
- Practice with real data:
  - Replicate published studies
  - Analyze datasets in your field
  - Collaborate on research projects
Practice Exercises
- Calculate 90%, 95%, and 99% confidence intervals for academic averages
- Test whether students with good sleep quality have lower stress (use appropriate test)
- Compare grades between low, moderate, and high physical activity groups (requires ANOVA, but try with t-tests first)
- Calculate the probability of Type II error for the sleep test (assume true μ=6.8 hours)
- Write a complete results section reporting one of the hypothesis tests
Congratulations! You’ve completed the Statistical Inference module. You now understand how to move from data to conclusions with quantified uncertainty - a critical skill for data-driven decision making.