Why Only 50% of Findings Can Be Replicated: Exploratory Thoughts

Thinking about the “replication crisis” in the social sciences, I did a back-of-the-envelope analysis of the probability that a statistically significant finding is true. The result I get is that there is only a 50% probability that a result is true, given a finding of statistical significance. Moreover, this 50% probability of being true is not affected by raising the threshold for statistical significance.

Imagine, for example, one is examining whether variable X1 causes variable Y to vary. X1 is one of many possible variables, X2, X3, X4, …, that could cause variable Y to vary. To assess whether X1 is related to Y, one would apply some statistical analysis and conclude that X1 has a statistically significant relationship to Y if the result of the analysis would occur .05 of the time or less by chance. This means that, in the universe of potential causes (all the Xs), .05 of them will be deemed statistically significant.

What, then, is the probability a finding is true (T) given the existence of a statistically significant relationship (S)? What we’re interested in is P(T | S).

I start with this basic probability statement:

P(T | S) * P(S) = P(S | T) * P(T)

which gives us Bayes' rule:

P(T | S) = [P(S | T) * P(T)] / P(S)

The bottom half of the equation above needs to be expanded. The probability that a relationship will be deemed statistically significant, P(S), is equal to the sum of the probability of true positive findings, P(S | T) * P(T), and the probability of false positive findings, P(S | ~T) * P(~T). Incorporating this into the equation:

P(T | S) = [P(S | T) * P(T)] / [P(S | T) * P(T) + P(S | ~T) * P(~T)]

Based on the definition of statistical significance, .05 of X’s are considered true causes of Y, and .95 of X’s are not true causes of Y. Thus, P(T) = .05 and P(~T) = .95. The probability of a false positive, P(S | ~T), is the Type 1 error rate of .05. The true positive rate, P(S | T), is the power of the statistical test used. I assume it is relatively high: let’s say .95 of true relationships will be correctly identified as statistically significant. This gives us all the values needed to calculate P(T | S) using the formula above.

P(T | S) = [.95 * .05] / [.95 * .05 + .05 * .95]

= .0475 / (.0475 + .0475)

= .50
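The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name prob_true_given_significant and its parameter names are illustrative labels chosen here, and the inputs are the assumed values from the text (P(T) = .05, power = .95, Type 1 error rate = .05).

```python
def prob_true_given_significant(prior, power, alpha):
    """Return P(T | S) using Bayes' rule.

    prior: P(T), assumed share of candidate relationships that are true
    power: P(S | T), chance a true relationship is found significant
    alpha: P(S | ~T), the Type 1 error rate
    """
    true_positives = power * prior            # P(S | T) * P(T)
    false_positives = alpha * (1.0 - prior)   # P(S | ~T) * P(~T)
    return true_positives / (true_positives + false_positives)


# Values assumed in the text above.
print(round(prob_true_given_significant(prior=0.05, power=0.95, alpha=0.05), 4))
```

Keeping the three inputs as separate parameters makes it easy to see which assumption drives the result.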

The probability that a finding is true given it is statistically significant, based on the foregoing, is .50. The ability to detect true relationships with statistical significance tests is fairly low, but this is consistent with the “replication crisis” literature, which indicates that only about 50% of statistically significant research findings can be replicated.

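Even if one raises the threshold for statistical significance, deeming relationships statistically significant only if they would occur 1% of the time or less by chance, it’s easy to see that the value of P(T | S) remains .50. By the same reasoning as before, P(T) becomes .01, and I take the power to be .99.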

P(T | S) = [.99 * .01] / [.99 * .01 + .01 * .99] = .50
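To see that this is not specific to the 1% threshold, the calculation can be repeated for several thresholds. The sketch below keeps the same assumptions used in the text, namely that P(T) equals the threshold and that power equals one minus the threshold. The function name and the list of thresholds are illustrative choices, not part of the original analysis.

```python
def prob_true_given_significant(prior, power, alpha):
    """Return P(T | S) using Bayes' rule."""
    true_positives = power * prior            # P(S | T) * P(T)
    false_positives = alpha * (1.0 - prior)   # P(S | ~T) * P(~T)
    return true_positives / (true_positives + false_positives)


# Assumption carried over from the text: P(T) = threshold, power = 1 - threshold.
for threshold in (0.10, 0.05, 0.01, 0.001):
    result = prob_true_given_significant(
        prior=threshold, power=1.0 - threshold, alpha=threshold
    )
    print(f"threshold={threshold}: P(T | S) = {result:.2f}")
```

Under these tied assumptions the numerator and the false-positive term are the same product, so the ratio comes out to one half at every threshold.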

This is my back-of-the-envelope analysis. I made some simplifying assumptions that are not entirely accurate, but I think they are accurate enough to generate an interesting result. Someone has probably developed this further in the literature on the replication crisis and publication bias. To me, P(T) = .05 seems like the most questionable assumption, because researchers should not be investigating relationships without some strong basis to believe a true relationship exists.
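Since P(T) is the assumption I trust least, a rough sensitivity check shows how much the answer depends on it. In the sketch below, power and the Type 1 error rate stay at the values used above (.95 and .05), and only P(T) changes. The alternative values of P(T) are hypothetical, picked purely for illustration, and the function name is again just a label.

```python
def prob_true_given_significant(prior, power, alpha):
    """Return P(T | S) using Bayes' rule."""
    true_positives = power * prior            # P(S | T) * P(T)
    false_positives = alpha * (1.0 - prior)   # P(S | ~T) * P(~T)
    return true_positives / (true_positives + false_positives)


# Hypothetical priors; power and alpha are held at the values from the text.
for prior in (0.05, 0.10, 0.25, 0.50):
    result = prob_true_given_significant(prior=prior, power=0.95, alpha=0.05)
    print(f"P(T)={prior}: P(T | S) = {result:.3f}")
```

If researchers do have good reason to expect a true relationship before testing, so that P(T) is larger, P(T | S) rises above .50.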
