Statistical Hypothesis Testing and p-values Explained Simply
Key Terms (Read This First)
Why “alpha” and “beta”? These are the first two letters of the Greek alphabet. In statistics, “alpha” (α) is used for the cutoff for significance (Type I error rate), and “beta” (β) is used for the chance of missing a real effect (Type II error rate). “Alpha” means “first” and “beta” means “second”—they’re just labels, not special words.
- Null hypothesis (H0): The “nothing special” or “no effect” claim. Example: “The new drug does nothing.”
- Alternative hypothesis (H1): The “something is happening” claim. Example: “The new drug works.”
- p-value: The chance of getting results as extreme as yours (or more extreme) if H0 is true. Small p-value = data is surprising if H0 is true.
- α (alpha): The cutoff for “statistically significant” (often 0.05). If p < α, you reject H0.
- Type I error (α): False positive — you say there’s an effect when there isn’t. Controlled by your α.
- Type II error (β): False negative — you miss a real effect. Lower β (higher power) is better.
- Power (1-β): The chance you’ll detect a real effect if it exists. Aim for power of at least 0.8 (80%).
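You can see α and power in action by simulation. This is a minimal Python sketch (the sample size, effect size, and trial count are made-up assumptions): it draws many pairs of samples when H0 is true to estimate the Type I error rate, then repeats with a real effect to estimate power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n, trials = 30, 2000   # hypothetical sample size per group and number of simulations

# Under H0 (no effect): both groups come from the same distribution,
# so any "significant" result is a false positive.
false_positives = 0
for _ in range(trials):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
type1_rate = false_positives / trials   # should land near alpha (about 0.05)

# Under H1 (a real effect of 0.8 standard deviations): how often do we catch it?
detections = 0
for _ in range(trials):
    a = rng.normal(0, 1, n)
    b = rng.normal(0.8, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        detections += 1
power = detections / trials             # estimated power, i.e. 1 - beta

print(f"Type I error rate ~ {type1_rate:.3f}, power ~ {power:.3f}")
```

Notice that the false-positive rate hovers near your chosen α, while power depends on the effect size and sample size.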
How Hypothesis Testing Works (The Flow)
- Write down H0 (no effect) and H1 (what you want to test).
- Choose α (how much risk of a false positive you’ll accept, like 0.05).
- Collect your data and calculate the p-value.
- If p < α, you reject H0 (evidence for H1). If p ≥ α, you fail to reject H0 (not enough evidence for H1).
- Remember: “Reject H0” means you found evidence for an effect, but it’s not proof. “Fail to reject H0” means you didn’t find strong evidence, not that H0 is true.
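The whole flow can be sketched in a few lines of Python with SciPy. The scores below are made-up numbers for a hypothetical coffee-vs-no-coffee comparison, using a two-sample t-test (explained later in this guide):

```python
from scipy import stats

# Hypothetical data: test scores for two groups (made-up numbers).
coffee    = [78, 85, 90, 73, 88, 81, 92, 79, 86, 84]
no_coffee = [72, 80, 75, 70, 78, 74, 82, 69, 77, 73]

alpha = 0.05                                 # chosen BEFORE looking at the data
result = stats.ttest_ind(coffee, no_coffee)  # collect data, compute the p-value
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("Reject H0: evidence of a difference")
else:
    print("Fail to reject H0: not enough evidence")
```

The decision at the end is exactly step 4 of the flow: compare p to α, nothing more.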
Picking H0 vs H1 — Student Guide
- Write both hypotheses before you look at the data: H0 = “no effect” (default), H1 = the effect you want to test for.
- The test only evaluates H0: you either reject H0 (data support H1) or fail to reject H0 (not enough evidence). You never “prove” H1 outright.
- Choose one-sided vs two-sided in advance — switching after seeing data inflates false positives.
- Don’t run lots of primary tests without planning corrections (Bonferroni, false-discovery rate). Multiple unplanned tests increase the chance of false positives.
- Practical rule: pick one main question to test up front; treat other analyses as exploratory and report them as such.
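The simplest planned correction mentioned above is Bonferroni: divide α by the number of tests you run. A minimal sketch (the four p-values are hypothetical):

```python
# Bonferroni correction: divide alpha by the number of planned tests.
alpha = 0.05
p_values = [0.003, 0.020, 0.041, 0.300]   # hypothetical p-values from 4 tests
m = len(p_values)
bonferroni_alpha = alpha / m              # 0.05 / 4 = 0.0125

for p in p_values:
    verdict = "significant" if p < bonferroni_alpha else "not significant"
    print(f"p = {p:.3f}: {verdict} at corrected alpha = {bonferroni_alpha}")
```

Note that 0.020 and 0.041 would have passed the uncorrected 0.05 cutoff but fail the corrected one; that is the correction doing its job.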
What is a t-test?
- A t-test is a way to compare averages (means) between groups and see if they are really different, or if the difference could just be luck.
- Example: You want to know if people who drink coffee score higher on a test than people who don’t. You measure both groups and compare their average scores.
- The t-test gives you a p-value. If the p-value is small (like 0.05 or less), the difference is unlikely to be just luck.
- If the p-value is big, the difference you see might just be random noise.
Types of t-tests:
- One-sample t-test: Is my group different from a specific number? (e.g., is the average height different from 5’7″?)
- Two-sample t-test: Are two groups different? (e.g., coffee vs. no coffee)
- Paired t-test: Did the same people change after something happened? (e.g., did their test scores improve after a class?)
In summary
- If p < 0.05, you have evidence of a real effect, but it’s not proof.
- If p ≥ 0.05, you can’t be sure: the difference could just be luck.
Which Software Should You Use: R or Python?
- R
- Designed for statistics and data analysis from the ground up
- Excellent for academic work, research, and specialized statistical methods
- Huge library of statistical packages (CRAN)
- Great for making publication-quality plots (ggplot2, etc.)
- Can be less intuitive for general programming or large projects
- Python
- General-purpose language that’s also great for data science and machine learning
- Easy to learn and use for scripting, automation, and web development
- Powerful libraries for stats and data (pandas, numpy, scipy, statsmodels, scikit-learn)
- Better for integrating with other software and production systems
- Some advanced statistical methods may require extra packages or more code than R
Bottom line: Both are excellent. If you’re just starting, pick one (R or Python) and stick with it for your stats work. Mastering one is better than mixing both. If you want to focus on statistics and academic analysis, R is a great choice. If you want to combine stats with programming, automation, or machine learning, Python is ideal.
Practical note: Your choice will also depend on your work environment and what is approved or preferred. Your department may standardize on Python, or on R, or be flexible and let you use both. Both languages are free and open source (commercial distributions and paid support exist for each), but consider how your organization views open-source software. If you end up working for a bank or a government agency, don’t be surprised if you end up using something like SAS.
What Should You Learn Next?
- ANOVA (Analysis of Variance): For comparing means across more than two groups. Example: Comparing test scores across three teaching methods.
- Correlation & Simple Linear Regression: Correlation measures how two variables move together; regression predicts one variable from another (e.g., predicting height from age).
- Chi-Square Tests: For categorical data—tests if distributions differ from expected (e.g., is there an association between gender and voting preference?).
- Confidence Intervals: Learn how to estimate a range for means, proportions, etc., not just a single value.
- Non-parametric Tests: For data that doesn’t meet t-test/ANOVA assumptions (e.g., Mann-Whitney U, Wilcoxon, Kruskal-Wallis).
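As a taste of what’s next, here is a chi-square test of independence in Python. The 2×2 table is hypothetical (counts of voting preference by gender, made-up numbers):

```python
from scipy import stats

# Hypothetical 2x2 table: gender (rows) vs. voting preference (columns).
table = [[40, 10],
         [20, 40]]

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

A small p-value here suggests the two categorical variables are associated, the chi-square analogue of "the groups differ."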
Mastering these will give you a solid foundation for most undergraduate statistics and data analysis tasks!