Imagine if there were a single, simple statistical measure everybody could use with any set of data, one that would reliably separate true from false. Oh, the things we would know! Unrealistic to expect such wizardry, though, huh?
Yet, statistical significance is commonly treated as though it is that magic wand. Take a null hypothesis or look for any association between factors in a data set and abracadabra! Get a “p value” over or under 0.05 and you can be 95% certain it’s either a fluke or it isn’t. You can eliminate the play of chance! You can separate the signal from the noise!
Except that you can’t. That’s not really what testing for statistical significance does. And therein lies the rub.
Testing for statistical significance estimates the probability of getting a result at least as extreme as the one observed, assuming the underlying hypothesis (usually the null hypothesis of no effect) is true. It can't on its own tell you whether that assumption was right, or whether the results would hold up in different circumstances. It provides a limited picture of probability, because it takes only limited information about the data into account.
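To see what this means concretely, here is a minimal simulation sketch (not from the original article; it assumes a two-sample t-test and normally distributed data purely for illustration). When the null hypothesis really is true, the test still flags about 5% of experiments as "significant" at the 0.05 level, by design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many experiments where the null hypothesis is TRUE:
# both groups are drawn from the same distribution, so any
# "significant" difference is pure chance.
p_values = []
for _ in range(2000):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    p_values.append(stats.ttest_ind(a, b).pvalue)

# Even with no real effect, roughly 5% of runs dip below 0.05.
false_positive_rate = np.mean(np.array(p_values) < 0.05)
print(false_positive_rate)
```

The point of the sketch: a p value below 0.05 tells you how surprising the data would be *if* there were no effect; it doesn't tell you the probability that there is one.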
What’s more, a finding of statistical significance can itself be a “fluke,” and that becomes more likely with bigger data sets and when you run the test on multiple comparisons within the same data. You can read more about that here.
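The multiple-comparisons problem can also be sketched in a few lines (again a hypothetical illustration, not the article's own analysis; the choice of 20 comparisons and a t-test is an assumption). Testing many unrelated questions in the same data makes at least one chance "hit" very likely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 1000
n_comparisons = 20  # e.g. testing 20 unrelated outcomes in one data set
hit_at_least_once = 0

for _ in range(n_experiments):
    # Every comparison is null: both groups come from the same distribution.
    found_significant = any(
        stats.ttest_ind(rng.normal(size=25), rng.normal(size=25)).pvalue < 0.05
        for _ in range(n_comparisons)
    )
    if found_significant:
        hit_at_least_once += 1

# With 20 independent null comparisons, the chance of at least one
# "significant" finding is about 1 - 0.95**20, i.e. roughly 64%.
rate = hit_at_least_once / n_experiments
print(rate)
```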
Statistical significance testing can easily sound as though it sorts the wheat from the chaff, but it can’t do that on its own – and it can break down in the face of many challenges. Nor do all tests of statistical significance work the same way on all data sets. And “significant” doesn’t mean important, either: a sliver of an effect can clear the less-than-5% threshold. We’ll come back to what all this means in practice shortly.
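A quick sketch of that last point, under assumed conditions (a tiny 0.03-standard-deviation difference between groups, an enormous sample): with enough data, even a practically meaningless effect becomes statistically significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# A sliver of an effect: groups differ by only 0.03 standard deviations,
# far too small to matter in most practical settings.
n = 100_000
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.03, scale=1.0, size=n)

result = stats.ttest_ind(a, b)
# With n this large, the standard error shrinks so much that even
# this negligible difference crosses the 0.05 threshold.
print(result.pvalue < 0.05)
```

Statistical significance answers "is this unlikely under the null?", not "is this big enough to care about?" – effect size and significance are separate questions.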
The common approach to statistical significance testing was so simple to grasp, though, and so easy to do even before there were computers, that it took the science world by storm. As Stephen Stigler explains in his piece on Fisher and the 5% level, “it opened the arcane domain of statistical calculation to a world of experimenters and research workers.”
But it also led to something of an avalanche of abuses. The over-simplistic approach to statistical significance has a lot to answer for. As John Ioannidis points out here, it is a serious player in science’s failure to replicate results.