Words fail us. Numbers, too, for they are only condensed words. Words and numbers are tools of communication and calculation. As tools, they cannot make certain that which is uncertain, though they often convey a false sense of certainty.

Yes, arithmetic seems certain: 2 + 2 = 4 is always and ever (in base-10 notation). But that is only because the conventions of arithmetic require 2 + 2 to equal 4. Neither arithmetic nor any higher form of mathematics reveals the truth about the world around us, though mathematics (and statistics) can be used to find approximate truths — approximations that are useful in practical applications like building bridges, finding effective medicines, and sending rockets into space (though the practicality of that has always escaped me).

But such practical things are possible only because the uncertainty surrounding them (e.g., the stresses that may cause a bridge to fail) is hedged against by making things more robust than they would need to be under perfect conditions. And, even then, things sometimes fail: bridges collapse, medicines have unforeseen side effects, rockets blow up, etc.

I was reminded of uncertainty by a recent post by Timothy Taylor (*Conversable Economist*):

For the uninitiated, “statistical significance” is a way of summarizing whether a certain statistical result is likely to have happened by chance, or not. For example, if I flip a coin 10 times and get six heads and four tails, this could easily happen by chance even with a fair and evenly balanced coin. But if I flip a coin 10 times and get 10 heads, this is extremely unlikely to happen by chance. Or if I flip a coin 10,000 times, with a result of 6,000 heads and 4,000 tails (essentially, repeating the 10-flip coin experiment 1,000 times), I can be quite confident that the coin is not a fair one. A common rule of thumb has been that if the probability of an outcome occurring by chance is 5% or less–in the jargon, has a p-value of 5% or less–then the result is statistically significant. However, it’s also pretty common to see studies that report a range of other p-values like 1% or 10%.

Given the omnipresence of “statistical significance” in pedagogy and the research literature, it was interesting last year when the American Statistical Association made an official statement “ASA Statement on Statistical Significance and P-Values” (discussed here) which includes comments like: “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. … A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. … By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”

Now, the ASA has followed up with a special supplemental issue of its journal *The American Statistician* on the theme “Statistical Inference in the 21st Century: A World Beyond p < 0.05” (January 2019). The issue has a useful overview essay, “Moving to a World Beyond ‘p < 0.05’,” by Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar. They write:

We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way. Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. … In sum, ‘statistically significant’—don’t say it and don’t use it.

. . .

So let’s accept that the “statistical significance” label has some severe problems, as Wasserstein, Schirm, and Lazar write:

[A] label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern (2006) famously observed, the difference between “significant” and “not significant” is not itself statistically significant.
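Taylor’s coin-flip examples can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `fair_coin_tail` is made up for the purpose, and it assumes that “by chance” means the one-sided probability of getting at least that many heads from a fair coin.

```python
from math import comb

def fair_coin_tail(n_flips, min_heads):
    """Probability of at least `min_heads` heads in `n_flips` flips of a fair coin.

    Counts outcomes with exact integers so that 10,000 flips do not underflow.
    """
    favorable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favorable / 2 ** n_flips

print(fair_coin_tail(10, 6))           # six or more heads in 10 flips
print(fair_coin_tail(10, 10))          # ten heads in 10 flips
print(fair_coin_tail(10_000, 6_000))   # 6,000 or more heads in 10,000 flips
```

Six or more heads in ten flips turns up in 386 of the 1,024 equally likely sequences (about 38% of the time), ten straight heads in just one of them (about 0.1%), and the 10,000-flip result has a probability so small that it is effectively zero.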

In the middle of the post, Taylor quotes Edward Leamer’s 1983 article, “Taking the Con out of Econometrics” (*American Economic Review*, March 1983, pp. 31-43).

Leamer wrote:

The econometric art as it is practiced at the computer terminal involves fitting many, perhaps thousands, of statistical models. One or several that the researcher finds pleasing are selected for reporting purposes. This searching for a model is often well intentioned, but there can be no doubt that such a specification search invalidates the traditional theories of inference. … [I]n fact, all the concepts of traditional theory, utterly lose their meaning by the time an applied researcher pulls from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose. The consuming public is hardly fooled by this chicanery. The econometrician’s shabby art is humorously and disparagingly labelled “data mining,” “fishing,” “grubbing,” “number crunching.” A joke evokes the Inquisition: “If you torture the data long enough, Nature will confess” … This is a sad and decidedly unscientific state of affairs we find ourselves in. Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analyses seriously.

Economists and other social scientists have become much more aware of these issues over the decades, but Leamer was still writing in 2010 (“Tantalus on the Road to Asymptopia,” *Journal of Economic Perspectives*, 24: 2, pp. 31-46):

Since I wrote my “con in econometrics” challenge much progress has been made in economic theory and in econometric theory and in experimental design, but there has been little progress technically or procedurally on this subject of sensitivity analyses in econometrics. Most authors still support their conclusions with the results implied by several models, and they leave the rest of us wondering how hard they had to work to find their favorite outcomes … It’s like a court of law in which we hear only the experts on the plaintiff’s side, but are wise enough to know that there are abundant for the defense.
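Leamer’s complaint about specification searches can be illustrated with a small simulation. What follows is a rough sketch, not anything taken from Leamer’s articles. The function names, the sample size of 50, and the choice of 20 candidate regressors are all assumptions made for illustration, and the p-value relies on the Fisher z-transformation, which is a large-sample approximation. Every regressor is pure noise, so any “significant” finding is spurious by construction.

```python
import math
import random

def correlation_p_value(x, y):
    """Approximate two-sided p-value for the correlation between x and y.

    Uses the Fisher z-transformation (a large-sample approximation).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def share_of_searches_with_a_hit(n_specs, n_searches=1000, n_obs=50, seed=0):
    """Fraction of searches in which the best of `n_specs` noise regressors
    reaches p < 0.05 against a pure-noise outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_searches):
        # One outcome series per search, unrelated to every regressor tried.
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        best_p = min(
            correlation_p_value([rng.gauss(0, 1) for _ in range(n_obs)], y)
            for _ in range(n_specs)
        )
        hits += best_p < 0.05
    return hits / n_searches

print(share_of_searches_with_a_hit(1))    # one pre-chosen specification
print(share_of_searches_with_a_hit(20))   # report the best of 20
```

With a single specification chosen in advance, a spurious “finding” should appear about 5% of the time. When the researcher keeps the best of 20 independent tries, theory puts the figure near 1 − 0.95^20, or roughly 64%, which is the sense in which the search invalidates the nominal 5% threshold.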

Taylor wisely adds this:

Taken together, these issues suggest that a lot of the findings in social science research shouldn’t be believed with too much firmness. The results might be true. They might be a result of a researcher pulling out “from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose.” And given the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is “significant,” while if the result had a 5.2% probability of happening by chance it is “not significant.”

*Uncertainty is a continuum, not a black-and-white difference* [emphasis added].

The italicized sentence expresses my long-held position.
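Taylor’s 4.8%-versus-5.2% example can be made concrete by translating the two p-values into test statistics. The sketch below assumes a two-sided test with a normally distributed statistic, which Taylor does not specify; the function name `z_for_two_sided_p` is invented for illustration.

```python
from statistics import NormalDist

def z_for_two_sided_p(p_value):
    """Absolute z-score whose two-sided normal tail probability equals p_value."""
    return NormalDist().inv_cdf(1 - p_value / 2)

z_significant = z_for_two_sided_p(0.048)
z_not_significant = z_for_two_sided_p(0.052)
print(z_significant, z_not_significant, z_significant - z_not_significant)
```

Under those assumptions both results sit a little under two standard errors from zero, and the gap between the “significant” one and the “not significant” one is only a few hundredths of a standard error.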

But there is a deeper issue here, to which I alluded above in my brief comments about the nature of mathematics. The deeper issue is the complete dependence of logical systems on the underlying axioms (assumptions) of those systems, which Kurt Gödel addressed in his incompleteness theorems:

Gödel’s incompleteness theorems are two theorems of mathematical logic that demonstrate the inherent limitations of every formal axiomatic system capable of modelling basic arithmetic….

The first incompleteness theorem states that no consistent system of axioms whose theorems can be listed by an effective procedure (i.e., an algorithm) is capable of proving all truths about the arithmetic of natural numbers. For any such consistent formal system, there will always be statements about natural numbers that are true, but that are unprovable within the system. The second incompleteness theorem, an extension of the first, shows that the system cannot demonstrate its own consistency.

This is very deep stuff. I own the book in which Gödel proves his theorems, and I admit that I have to take the proofs on faith. (Which simply means that I have been too lazy to work my way through the proofs.) But there seem to be no serious or fatal criticisms of the theorems, so my faith is justified (thus far).

There is also the view that the theorems aren’t applicable in fields outside of mathematical logic. But any quest for certainty about the physical world necessarily uses mathematical logic (which includes statistics).

This doesn’t mean that the results of computational exercises are useless. It simply means that they are only as good as the assumptions that underlie them: for example, assumptions about relationships between variables, assumptions about the values of the variables, and assumptions about whether the correct variables have been chosen (and properly defined) in the first place.

There is nothing new in that, certainly nothing that requires Gödel’s theorems by way of proof. It has long been understood that a logical argument may be valid — the conclusion follows from the premises — but untrue if the premises (axioms) are untrue.

But it bears repeating and repeating — especially in the age of “climate change”. That CO2 is a dominant determinant of “global” temperatures is taken as axiomatic. Everything else flows from that assumption, including the downward revision of historical (actual) temperature readings, to ensure that the “observed” rise in temperatures agrees with — and therefore “proves” — the validity of climate models that take CO2 as the dominant variable. How circular can you get?

Check your assumptions at the door.