This post is based on the videos by StatQuest with Josh Starmer.
Are you familiar with the concept of $p$-hacking? What insights do you think $p$-hacking provides? Similarly, have you encountered terms like data dredging, data fishing, data snooping, or data butchery? These terms all refer to the practice of misusing data analysis to find results that appear statistically significant, even when they aren’t truly meaningful. Such misleading outcomes are known as false positives. Before diving into an example of $p$-hacking, let’s review some key concepts that will help us understand this issue better.
- Hypothesis Testing
- Null Hypothesis ($H_0$): The default assumption that there is no effect or no difference between groups.
- Alternative Hypothesis ($H_1$): The hypothesis that there is an effect or a difference, which we aim to test.
- $P$-value
- The $p$-value is the probability of obtaining a result equal to or more extreme than the observed outcome under the assumption that the null hypothesis is true. In simpler terms, it quantifies how likely it is to get the observed results (or more extreme) purely by chance if the null hypothesis holds.
- Low $p$-value (typically < 0.05): Suggests that the observed data are unlikely under the null hypothesis, leading to the rejection of $H_0$ in favor of $H_1$. A common threshold for significance is 0.05, which caps at 5% the probability of rejecting $H_0$ when it is actually true.
- High $p$-value (typically >= 0.05): Indicates that the observed data are consistent with the null hypothesis, meaning there is not enough evidence to reject $H_0$. However, it does not prove $H_0$ to be true.
- False Positives
- False positives occur when a study identifies a statistically significant effect that does not actually exist.
- We incorrectly reject the null hypothesis.
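To make these ideas concrete, here is a minimal sketch (using `scipy.stats` on made-up data; the group names and values are hypothetical) of how a two-sample $t$-test produces a $p$-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measurements for two groups drawn from the *same* population,
# so the null hypothesis (no difference in means) is actually true here.
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=10.0, scale=2.0, size=30)

# Two-sample t-test: H0 says the two groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# A p-value below 0.05 here would be a false positive of exactly the kind
# discussed in the rest of this post.
```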
Example of $P$-Hacking
Now that we’ve reviewed these concepts, let’s consider an example where p-hacking might occur. Humans have around 20,000 to 25,000 genes, and identical twins share essentially all of their genetic material. Suppose we randomly select two men and compare their 20,000 genes to determine whether they are identical twins. With a significance level of 0.05, conducting 20,000 tests means that roughly 5% of them (about 1,000 tests) could yield $p$-values below 0.05 purely by chance. This could lead to the incorrect conclusion that there is a statistically significant difference between the two men, falsely suggesting they are not identical twins.
As the number of tests increases, so does the number of false positives. This situation is known as the multiple testing problem, which arises when conducting numerous hypothesis tests simultaneously, as in our 20,000 tests.
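A small simulation sketch makes the multiple testing problem tangible: every pair of samples below is drawn from the same distribution, so each "significant" result is a false positive. The sample sizes and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_tests = 20_000          # one test per "gene", as in the example above

false_positives = 0
for _ in range(n_tests):
    # Both samples come from the same distribution, so H0 is true every time.
    a = rng.normal(0, 1, size=20)
    b = rng.normal(0, 1, size=20)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests ({false_positives / n_tests:.1%}) "
      "came out 'significant' -- every one of them a false positive.")
```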
If you perform $m$ independent hypothesis tests, each with a significance level of $\alpha$, the probability of at least one false positive (Type I error) among the $m$ tests can be approximated as:
\[1-(1-\alpha)^m\]
For example, if we perform 20,000 tests with $\alpha=0.05$, the probability of getting at least one significant result purely by chance is approximately:
\[1-(0.95)^{20000} \approx 1\]
This means that there is nearly a 100% chance of observing at least one false positive among the 20,000 tests.
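This family-wise error probability is easy to check numerically; a quick sketch:

```python
alpha = 0.05  # per-test significance level

for m in (1, 10, 100, 20_000):
    p_at_least_one = 1 - (1 - alpha) ** m
    print(f"m = {m:>6}: P(at least one false positive) = {p_at_least_one:.6f}")

# m =      1: 0.050000
# m =     10: 0.401263
# m =    100: 0.994079
# m =  20000: 1.000000 (to six decimal places)
```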
Addressing the Multiple Testing Problem
One common approach to addressing this issue is to control the False Discovery Rate (FDR). The FDR is the expected proportion of false positives (incorrectly rejected null hypotheses) among all the rejected hypotheses.
Before delving into the FDR, it’s crucial to understand how p-values behave under different scenarios: when the null hypothesis is true and when it is not. Under the null hypothesis (e.g., that the two men share the same DNA), the distribution of $p$-values would be uniformly spread between 0 and 1. If $H_0$ is true, the observed test statistic is just a random draw from the null distribution. Since the test statistic is randomly sampled from this distribution, the $p$-value, being the cumulative probability associated with this test statistic, is uniformly distributed between 0 and 1.
Conversely, when the alternative hypothesis is true, meaning that the two samples come from distinct populations, the distribution of $p$-values will be right-skewed, with most of its mass piled up near 0. Most $p$-values will fall below 0.05; those that exceed the 0.05 threshold represent false negatives, where true effects go undetected, often because the two distributions overlap. Increasing the sample size can reduce the number of false negatives by making it easier to detect true differences.
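The two scenarios are easy to simulate. The sketch below uses hypothetical normal populations (a mean shift of 0.8 when $H_1$ is true, chosen only for illustration): under the null the $p$-values spread out roughly uniformly, while under the alternative they pile up near zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n = 5_000, 30

def simulate_pvalues(mean_shift):
    """Run many two-sample t-tests; a non-zero mean_shift means H1 is true."""
    return np.array([
        stats.ttest_ind(rng.normal(0, 1, size=n),
                        rng.normal(mean_shift, 1, size=n)).pvalue
        for _ in range(n_sims)
    ])

p_null = simulate_pvalues(0.0)   # H0 true: p-values roughly uniform on [0, 1]
p_alt  = simulate_pvalues(0.8)   # H1 true: p-values heavily skewed toward 0

print(f"H0 true: {np.mean(p_null < 0.05):.1%} of p-values fall below 0.05")
print(f"H1 true: {np.mean(p_alt  < 0.05):.1%} of p-values fall below 0.05")
```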
To distinguish the true positives from the false positives, one strategy is to order the $p$-values in ascending order and focus on those within a certain range, starting with the smallest values. When the null hypothesis is false, the $p$-values in the 0 to 0.05 range will be far more heavily concentrated near 0 than under the null hypothesis, where they are evenly distributed.
The Benjamini-Hochberg (BH) method builds on this idea, offering a computational approach that adjusts $p$-values, slightly increasing them to control the occurrence of false positives. The procedure is as follows:
- Order the $p$-values
- Suppose you’ve conducted $m$ independent hypothesis tests, resulting in $p$-values $p_1, p_2, …, p_m$. Sort these $p$-values in ascending order: $p_{(1)}\le p_{(2)} \le … \le p_{(m)}$.
- Choose a significance level
- Decide on a desired FDR, denoted by $q$ (e.g., $q=0.05$).
- Find the largest $p$-value that meets the threshold
- For each sorted $p_{(i)}$ (where $i$ is the rank after sorting), calculate the threshold $\frac{i}{m}q$.
- Identify the largest $i$ such that $p_{(i)} \le \frac{i}{m}q$. Let’s call this $i$ the critical rank $i^*$.
- Reject the null hypothesis
- Reject the null hypotheses corresponding to the $p$-values $p_{(1)}, p_{(2)}, … , p_{(i^*)}$. These are the hypotheses whose $p$-values are small enough to be considered statistically significant under the BH procedure. If no such $i^*$ exists (i.e., all $p$-values are larger than their respective thresholds), then none of the null hypotheses are rejected.
By using the Benjamini-Hochberg method, the adjusted $p$-values effectively control the false discovery rate, leading to a more accurate analysis of statistical significance.
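Here is a minimal, hand-rolled sketch of the BH procedure described above (the example $p$-values are hypothetical; in practice a tested implementation such as `statsmodels.stats.multitest.multipletests` with `method='fdr_bh'` is preferable).

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of which null hypotheses to reject at FDR level q."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)                       # rank the p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m    # (i / m) * q for i = 1, ..., m
    below = p[order] <= thresholds              # which p_(i) meet their threshold

    reject = np.zeros(m, dtype=bool)
    if below.any():
        i_star = np.nonzero(below)[0].max()     # critical rank i* (0-based here)
        reject[order[: i_star + 1]] = True      # reject H0 for p_(1), ..., p_(i*)
    return reject

# Hypothetical p-values: a few small ones mixed with clearly non-significant ones.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.47, 0.62, 0.81]
print(benjamini_hochberg(pvals, q=0.05))
# -> only the two smallest p-values survive the BH thresholds in this example.
```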
Another Form of $P$-Hacking and How to Prevent It
Consider a scenario similar to before, where we’re testing whether two samples originate from the same distribution. If we obtain a $p$-value that is close to 0.05 but slightly greater, there might be a temptation to collect additional data, hoping to push the $p$-value below 0.05 and declare a significant difference between the samples. This practice is problematic because it substantially increases the risk of a false positive: by computing a $p$-value after each round of data collection, we give ourselves multiple chances to cross the threshold. To avoid this error, it’s crucial to determine the appropriate sample size before conducting the experiment, using a technique called power analysis.
Power Analysis
As the name suggests, power refers to the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. Power is primarily determined by two factors: the sample size and the degree of overlap between the distributions of the two populations. For a fixed number of observations, if the two populations overlap substantially, power decreases because it becomes harder to distinguish between them. Conversely, with a sufficiently large sample size, power can remain high even when the populations overlap. A larger sample size allows for better estimation of population parameters, such as the mean, so the estimated means overlap less and power increases. This principle holds regardless of the underlying distribution, thanks to the central limit theorem, which states that the distribution of sample means approaches normality as the sample size increases. Thus, having a large enough sample size enables confident decision-making, whatever $p$-value the experiment produces.
So, how is power analysis conducted?
- Determine the desired power level: Power is typically set between 0 and 1, with 0.8 being a commonly chosen value.
- Set the significance threshold: This is often set to 0.05.
- Estimate the extent of overlap between the distributions.
- Overlap is influenced by factors such as the distance between the population means and their standard deviations. These are combined into a single measure called the effect size ($d$), which is the estimated difference in means divided by the pooled estimated standard deviation:
\[d=\frac{|\bar{x}_1-\bar{x}_2|}{s_{pooled}},\qquad s_{pooled}=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\]
- Use an online statistical power calculator: This tool helps determine the minimum number of subjects needed to achieve the desired power level.
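As a rough sketch of the last two steps, the required sample size per group for a two-sample comparison can be approximated from the effect size with the standard normal-approximation formula (the pilot numbers below are hypothetical; an online calculator or a library power routine would give essentially the same answer).

```python
import math
from scipy.stats import norm

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance threshold
    z_power = norm.ppf(power)           # z-score corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical pilot estimates: group means of 10 and 12, pooled standard deviation of 4.
d = abs(10 - 12) / 4                    # effect size d = 0.5
print(sample_size_per_group(d))         # about 63 subjects per group
```

A $t$-test-based power calculator gives a slightly larger answer (around 64 per group) because it also accounts for estimating the standard deviation, but the approximation is close for moderate sample sizes.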
By conducting a power analysis before starting an experiment, you can ensure that your study has an appropriate sample size, reducing the risk of false positives and enabling confident decision-making based on statistical results.
To conclude, understanding and addressing p-hacking and the multiple testing problem is crucial for ensuring the integrity and reliability of statistical analyses. By applying methods such as the Benjamini-Hochberg procedure to control the False Discovery Rate (FDR) and conducting power analyses to determine appropriate sample sizes, researchers can minimize the risk of false positives and make more confident, accurate conclusions.