Probability Distributions
Probability distributions are mathematical models that describe how probabilities are distributed across the possible values of a random variable. Understanding distributions is essential for statistical inference.
Learning Objectives
By the end of this lesson, you will be able to:
- Identify common probability distributions in real data
- Work with the normal distribution and calculate probabilities
- Understand discrete distributions (binomial, Poisson)
- Apply the Central Limit Theorem to sampling distributions
- Visualize and interpret distribution properties
Types of Distributions
Continuous Distributions
For continuous random variables (can take any value in a range)
Discrete Distributions
For discrete random variables (specific countable values)
In our student health study, we have examples of both types:
- Continuous: sleep hours, stress score, academic average
- Discrete: age, nutrition score, activity level
The Normal Distribution
The normal distribution (or Gaussian distribution) is the most important probability distribution in statistics.
Key Properties
- Bell-shaped and symmetric around the mean
- Characterized by two parameters:
  - μ (mu): the mean (center of the distribution)
  - σ (sigma): the standard deviation (spread)
- Notation: X ~ Normal(μ, σ)
Example: Sleep Hours Distribution
From our student health dataset of 150 students, sleep hours are approximately normally distributed with mean μ ≈ 6.55 hours and standard deviation σ ≈ 1.21 hours.
Calculating Probabilities with the Normal Distribution
Example 1: Probability of adequate sleep
Question: What’s the probability a randomly selected student sleeps at least 7 hours?
Example 2: Probability within a range
Question: What’s the probability a student sleeps between 5 and 8 hours?
Visualizing the Normal Distribution
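Both calculations, and the shaded-area picture behind them, can be sketched with `scipy.stats.norm`. This assumes sleep hours follow Normal(6.55, 1.21), using the sample mean and standard deviation reported elsewhere in this lesson:

```python
import numpy as np
from scipy.stats import norm
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

mu, sigma = 6.55, 1.21  # sample mean and SD of sleep hours from this lesson

# Example 1: P(X >= 7) via the survival function (1 - CDF)
p_at_least_7 = norm.sf(7, loc=mu, scale=sigma)

# Example 2: P(5 <= X <= 8) as a difference of CDF values
p_5_to_8 = norm.cdf(8, mu, sigma) - norm.cdf(5, mu, sigma)

print(f"P(sleep >= 7)      = {p_at_least_7:.3f}")
print(f"P(5 <= sleep <= 8) = {p_5_to_8:.3f}")

# Visualize: density curve with the 5-to-8-hour region shaded
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
plt.plot(x, norm.pdf(x, mu, sigma))
plt.fill_between(x, norm.pdf(x, mu, sigma), where=(x >= 5) & (x <= 8), alpha=0.3)
plt.xlabel("Sleep hours")
plt.ylabel("Density")
plt.savefig("sleep_normal.png")
```

`norm.sf` is numerically preferable to `1 - norm.cdf(...)` for far-tail probabilities, though either works here.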
Why does sleep follow a normal distribution?
Many biological and behavioral variables follow approximately normal distributions due to the combined effect of many small, independent factors. Sleep duration is influenced by:
- Individual biological needs
- Academic workload
- Social activities
- Personal habits
- Environmental factors
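This "many small, independent factors" idea can be demonstrated with a toy simulation (the numbers below are invented purely for illustration, not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: each student's sleep is a 6.5-hour baseline plus 20 small,
# independent effects, each uniform on [-0.3, 0.3] hours.
effects = rng.uniform(-0.3, 0.3, size=(10_000, 20))
sleep = 6.5 + effects.sum(axis=1)

# The sum of many small independent effects is approximately bell-shaped,
# even though each individual effect is uniform, not normal.
print(round(sleep.mean(), 2), round(sleep.std(), 2))
```

A histogram of `sleep` would look close to a normal curve, which is the same mechanism the Central Limit Theorem formalizes later in this lesson.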
Discrete Distributions
The Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent trials.
Parameters:
- n: number of trials
- p: probability of success on each trial
Example: Nutrition Habits
Suppose we measure 10 different healthy eating behaviors (eating vegetables, avoiding fast food, eating breakfast regularly, etc.).
- Each behavior is a “trial” (n = 10)
- Each student has some probability of following each behavior (p ≈ 0.5)
- Nutrition score = number of healthy behaviors followed
Real data vs. theoretical model: A Binomial(n = 10, p = 0.5) model predicts P(score ≥ 7) ≈ 17.2%, but in our simulated dataset 39.3% of students have a nutrition score ≥ 7. This difference suggests the real probability per behavior might be higher than 0.5, or that the behaviors aren’t independent.
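The theoretical tail probability under the binomial model can be checked directly with `scipy.stats.binom`:

```python
from scipy.stats import binom

n, p = 10, 0.5
# sf(6) = P(X > 6) = P(X >= 7) for a discrete distribution
p_theory = binom.sf(6, n, p)
print(f"P(score >= 7) under Binomial(10, 0.5): {p_theory:.3f}")
```

This is far below the 39.3% observed in the data, which is what motivates questioning the p = 0.5 and independence assumptions.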
When to Use Binomial Distribution
Use binomial when:
- There is a fixed number of trials (n)
- Each trial has two outcomes (success/failure)
- Trials are independent
- The probability of success (p) is constant
Examples:
- Number of days per week a student exercises (n = 7 days)
- Number of healthy meals in a day (n = 3 meals)
- Survey responses (n = 20 questions, yes/no answers)
The Poisson Distribution
The Poisson distribution models the number of events occurring in a fixed interval of time or space.
Parameter:
- λ (lambda): average rate of events
Example: Study Break Frequency
Suppose students take study breaks at an average rate of λ = 3 breaks per study session.
When to Use Poisson Distribution
Use Poisson when counting:
- Events per unit time (emails per hour, visits per day)
- Events per unit space (typos per page, bacteria per sample)
- Rare events with many opportunities
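The study-break example above can be sketched with `scipy.stats.poisson`:

```python
from scipy.stats import poisson

lam = 3  # average breaks per study session

p_exactly_3 = poisson.pmf(3, lam)  # P(X = 3)
p_5_or_more = poisson.sf(4, lam)   # sf(4) = P(X > 4) = P(X >= 5)

print(f"P(exactly 3 breaks)  = {p_exactly_3:.3f}")
print(f"P(5 or more breaks)  = {p_5_or_more:.3f}")
```

As with the binomial, `sf` on a discrete distribution excludes its argument, so `sf(4)` gives the probability of 5 or more events.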
The Central Limit Theorem (CLT)
The Central Limit Theorem is one of the most important concepts in statistics.
The Theorem
Central Limit Theorem: When you take repeated random samples from ANY population and calculate the sample mean each time:
- The distribution of those sample means approaches a normal distribution
- The mean of the sample means equals the population mean (μ)
- The standard deviation of sample means = σ/√n (standard error)
Why This Matters
The CLT is why we can:
- Use normal distribution theory for inference
- Calculate confidence intervals
- Perform hypothesis tests
- Make predictions about populations from samples
Simulation: Demonstrating the CLT
Let’s simulate the sampling distribution of mean sleep hours.
Process:
- Take a random sample of n students
- Calculate the mean sleep hours for that sample
- Repeat 1000 times
- Plot the distribution of those 1000 sample means
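The steps above can be sketched as follows. Since the raw dataset isn't reproduced here, a simulated Normal(6.55, 1.21) population stands in for the real sleep-hours column:

```python
import numpy as np

rng = np.random.default_rng(123)
# Stand-in population with the lesson's mean and SD; the CLT works
# regardless of the population's actual shape.
population = rng.normal(6.55, 1.21, 10_000)

def sampling_distribution(n, reps=1000):
    """Mean sleep hours from `reps` random samples of size n."""
    return np.array([rng.choice(population, size=n).mean() for _ in range(reps)])

for n in (10, 30, 50):
    means = sampling_distribution(n)
    print(f"n = {n:2d}: mean of means = {means.mean():.2f}, "
          f"SD of means = {means.std():.3f}")
```

The printed SDs should track σ/√n (about 0.38, 0.22, and 0.17), matching the table of results below.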
Results: Effect of Sample Size
| Sample Size | Mean of Means | Std Dev of Means |
|---|---|---|
| n = 10 | 6.52 | 0.372 |
| n = 30 | 6.56 | 0.219 |
| n = 50 | 6.55 | 0.169 |
- Mean stays constant (~6.55) regardless of sample size
  - Sample mean is an unbiased estimator of the population mean
- Standard deviation decreases as sample size increases
  - Larger samples give more precise estimates
  - Follows the formula: SE = σ/√n
- Shape becomes more normal as n increases
  - Even with n = 10, the distribution looks fairly normal
  - By n = 30, it’s very close to normal
Visualizing the Central Limit Theorem
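One way to see the theorem is to plot the sampling distribution for several sample sizes side by side. As above, a simulated Normal(6.55, 1.21) population stands in for the real data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
population = rng.normal(6.55, 1.21, 10_000)  # stand-in for the real column

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)
for ax, n in zip(axes, (10, 30, 50)):
    # 1000 sample means for this sample size
    means = [rng.choice(population, size=n).mean() for _ in range(1000)]
    ax.hist(means, bins=30)
    ax.set_title(f"Sample means, n = {n}")
    ax.set_xlabel("Mean sleep hours")
fig.savefig("clt_sampling_distributions.png")
```

All three histograms center on the same value while the spread visibly shrinks from left to right.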
Applying Distributions to Research Questions
Example: Stress Levels
From our dataset, stress scores approximately follow:
- Stress Score ~ Normal(18.85, 7.03)
What if the distribution isn't normal?
Many real-world variables aren’t perfectly normal. Options include:
- Transform the data (log, square root) to make it more normal
- Use non-parametric methods that don’t assume normality
- Rely on the Central Limit Theorem for sample means (works even if population isn’t normal)
- Use the actual empirical distribution from your sample
Standard Error and Sampling Distributions
The standard error (SE) measures how much sample means vary from sample to sample.
Example Calculation
For sleep hours with σ = 1.21:
| Sample Size | Standard Error |
|---|---|
| n = 10 | 1.21/√10 = 0.38 |
| n = 30 | 1.21/√30 = 0.22 |
| n = 50 | 1.21/√50 = 0.17 |
| n = 100 | 1.21/√100 = 0.12 |
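The table is just SE = σ/√n evaluated at each sample size, which takes a few lines to reproduce:

```python
import math

sigma = 1.21  # sample SD of sleep hours from this lesson
for n in (10, 30, 50, 100):
    se = sigma / math.sqrt(n)
    print(f"n = {n:3d}: SE = {se:.2f}")
```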
Why standard error matters:
- Smaller SE = more precise estimates
- SE is used to calculate confidence intervals
- SE determines the width of hypothesis test regions
- SE appears in formulas for t-tests, z-tests, etc.
Checking Distribution Assumptions
Visual Methods
- Histogram with density curve
  - Does the shape look normal (bell-shaped)?
  - Are there outliers or multiple peaks?
- Q-Q plot (Quantile-Quantile plot)
  - Points should fall on a straight line for normal data
  - Deviations indicate non-normality
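`scipy.stats.probplot` computes the Q-Q coordinates and the straight-line fit; here simulated data stands in for the real sleep-hours column:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sleep = rng.normal(6.55, 1.21, 150)  # stand-in for the real sleep-hours column

# probplot returns the theoretical vs. ordered sample quantiles plus a
# least-squares fit; r close to 1 means the points lie near a straight line.
(osm, osr), (slope, intercept, r) = stats.probplot(sleep)
print(f"Q-Q correlation r = {r:.3f}")
```

Passing `plot=plt` (a matplotlib Axes or module) to `probplot` draws the plot directly instead of just returning the coordinates.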
Statistical Tests
- Shapiro-Wilk test: Tests if data comes from a normal distribution
- Anderson-Darling test: Another normality test
- Kolmogorov-Smirnov test: General goodness-of-fit test
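The Shapiro-Wilk test is available as `scipy.stats.shapiro`; again, simulated data stands in for the real column:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sleep = rng.normal(6.55, 1.21, 150)  # stand-in for the real sleep-hours column

stat, p_value = stats.shapiro(sleep)
# Large p-value: no evidence against normality.
# Small p-value (< 0.05): the data are unlikely to be from a normal distribution.
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p_value:.3f}")
```

Note that with large samples these tests flag even tiny, practically irrelevant departures from normality, so pair them with the visual methods above.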
Key Takeaways
- Normal distribution is central to statistical inference - characterized by mean (μ) and standard deviation (σ)
- Binomial distribution models count of successes in fixed trials
- Poisson distribution models rare events or counts per interval
- Central Limit Theorem explains why the normal distribution is so important:
  - Sample means are approximately normal, regardless of the population distribution
  - Larger samples produce more normal sampling distributions with less variability
- Standard error quantifies precision of sample means and decreases with √n
- Real data (like our student health study) often approximates theoretical distributions
Next Steps
Now that you understand probability distributions and the Central Limit Theorem, you’re ready for hypothesis testing: using sample data to make decisions about population parameters.
In the next module, we’ll use these distribution concepts to:
- Build confidence intervals for population means
- Test hypotheses (Is average sleep really less than 7 hours?)
- Compare groups (Does exercise reduce stress?)
- Interpret p-values and statistical significance
Practice Exercises
- Calculate the probability that stress score is between 15 and 25
- Create a Q-Q plot for academic averages
- Simulate the sampling distribution for nutrition scores
- Compare the spread of sampling distributions for n=20 vs n=80