Expressing Certainty (or Uncertainty)

I have waged war on the misuse of probability for a long time. As I say in the post at the link:

A probability is a statement about a very large number of like events, each of which has an unpredictable (random) outcome. Probability, properly understood, says nothing about the outcome of an individual event. It certainly says nothing about what will happen next.

From a later post:

It is a logical fallacy to ascribe a probability to a single event. A probability represents the observed or computed average value of a very large number of like events. A single event cannot possess that average value. A single event has a finite number of discrete and mutually exclusive outcomes. Those outcomes will not “average out” — only one of them will obtain, like Schrödinger’s cat.

To say that the outcomes will average out — which is what a probability implies — is tantamount to saying that Jack Sprat and his wife were neither skinny nor fat because their body-mass indices averaged to a normal value. It is tantamount to saying that one can’t drown by walking across a pond with an average depth of 1 foot, when that average conceals the existence of a 100-foot-deep hole.

But what about hedge words that imply “probability” without saying it: certain, uncertain, likely, unlikely, confident, not confident, sure, unsure, and the like? I admit to using such words, which are common in discussions about possible future events and the causes of past events. But what do I, and presumably others, mean by them?

Hedge words are statements about the validity of hypotheses about phenomena or causal relationships. There are two ways of looking at such hypotheses, frequentist and Bayesian:

While for the frequentist, a hypothesis is a proposition (which must be either true or false) so that the frequentist probability of a hypothesis is either 0 or 1, in Bayesian statistics, the probability that can be assigned to a hypothesis can also be in a range from 0 to 1 if the truth value is uncertain.

Further, as discussed above, there is no such thing as the probability of a single event. For example, the Mafia either did or didn’t have JFK killed, and that’s all there is to say about that. One might claim to be “certain” that the Mafia had JFK killed, but one can be certain only if one is in possession of incontrovertible evidence to that effect. But that certainty isn’t a probability, which can refer only to the frequency with which many events of the same kind have occurred and can be expected to occur.

A Bayesian view about the “probability” of the Mafia having JFK killed is nonsensical. Even if a Bayesian is certain, based on incontrovertible evidence, that the Mafia had JFK killed, there is no probability attached to the occurrence. It simply happened, and that’s that.

Lacking such evidence, a Bayesian (or an unwitting “man on the street”) might say “I believe there’s a 50-50 chance that the Mafia had JFK killed”. Does that mean (1) that there’s some evidence to support the hypothesis, but it isn’t conclusive, or (2) that the speaker would bet X amount of money, at even odds, that if incontrovertible evidence ever surfaces it will prove that the Mafia had JFK killed? In the first case, attaching a 50-percent probability to the hypothesis is nonsensical; how does the existence of some evidence translate into a statement about the probability of a one-off event that either occurred or didn’t occur? In the second case, the speaker’s willingness to bet on the occurrence of an event at certain odds tells us something about the speaker’s preference for risk-taking but nothing at all about whether or not the event occurred.

What about the familiar use of “probability” (a.k.a., “chance”) in weather forecasts? Here’s my take:

[W]hen you read or hear a statement like “the probability of rain tomorrow is 80 percent”, you should mentally translate it into language like this:

X guesses that Y will (or will not) happen at time Z, and the “probability” that he attaches to his guess indicates his degree of confidence in it.

The guess may be well-informed by systematic observation of relevant events, but it remains a guess, as most Americans have learned and relearned over the years when rain has failed to materialize or has spoiled an outdoor event that was supposed to be rain-free.

Further, it is true that some things happen more often than other things, but only one thing will happen at a given time and place.

[A] clever analyst could concoct a probability of a person’s being shot by writing an equation that includes such variables as his size, the speed with which he walks, the number of shooters, their rate of fire, and the distance across the shooting range.

What would the probability estimate mean? It would mean that if a very large number of persons walked across the shooting range under identical conditions, approximately S percent of them would be shot. But the clever analyst cannot specify which of the walkers would be among the S percent.

Here’s another way to look at it. One person wearing head-to-toe bullet-proof armor could walk across the range a large number of times and expect to be hit by a bullet on S percent of his crossings. But the hardy soul wouldn’t know on which of the crossings he would be hit.

Suppose the hardy soul became a foolhardy one and made a bet that he could cross the range — one time — without being hit. Further, suppose that S is estimated to be 0.75; that is, 75 percent of a string of walkers would be hit, or a single (bullet-proof) walker would be hit on 75 percent of his crossings. Knowing the value of S, the foolhardy fellow stakes $1 million on the proposition: if he is shot, he (or his estate) forfeits the $1 million; if he crosses unscathed, he claims $4 million (his stake plus $3 million in winnings). With odds of 3 to 1 against a 1-in-4 chance of success, that’s a fair bet, isn’t it?

No it isn’t….

The bet should be understood for what it is: an either-or proposition. The foolhardy walker will either forfeit his $1 million stake or net a $3 million gain. The bettor (or bettors) who take the other side of the bet will either win $1 million or lose $3 million.

As anyone with elementary reading and reasoning skills should be able to tell, those possible outcomes are not the same as the outcome that would obtain (approximately) if the foolhardy fellow could walk across the shooting range 1,000 times. If he could, he would come very close to breaking even, as would those who bet against him.
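
To make the contrast concrete, here is a minimal simulation sketch (my illustration, not part of the original post). It assumes the probability and stakes given above; the seed and function name are arbitrary. Each crossing is all-or-nothing, while the average over 1,000 crossings comes out close to break-even.

    # A toy Monte Carlo of the walker's bet, assuming P(shot) = 0.75 and the
    # 3-to-1 stakes described above. Pure standard-library Python.
    import random

    random.seed(7)
    P_SHOT = 0.75
    LOSS_IF_SHOT = 1_000_000        # the walker forfeits his stake
    GAIN_IF_UNSCATHED = 3_000_000   # the walker's net winnings

    def one_crossing() -> int:
        """Net result of a single crossing: one of two outcomes, never an average."""
        return -LOSS_IF_SHOT if random.random() < P_SHOT else GAIN_IF_UNSCATHED

    # A single crossing has exactly two possible outcomes:
    print(one_crossing())                    # either -1,000,000 or +3,000,000

    # Over 1,000 crossings the average result hovers near zero (a fair bet) ...
    results = [one_crossing() for _ in range(1_000)]
    print(sum(results) / len(results))       # close to 0 per crossing
    # ... but no individual crossing ever comes out "near zero".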

I omitted from the preceding quotation a sentence in which I used “more likely”:

If a person walks across a shooting range where live ammunition is being used, he is more likely to be killed than if he walks across the same patch of ground when no one is shooting.

Inasmuch as “more likely” is a hedge word, I seem to have contradicted my own position about the probability of a single event, such as being shot while walking across a shooting range. In that context, however, “more likely” means that something could happen (getting shot) that wouldn’t happen in a different situation. That’s not really a probabilistic statement. It’s a statement about opportunity; thus:

  • Crossing a firing range generates many opportunities to be shot.
  • Going into a crime-ridden neighborhood certainly generates some opportunities to be shot, but their number and frequency depend on many variables: which neighborhood, where in the neighborhood, the time of day, who else is present, etc.
  • Sitting by oneself, unarmed, in a heavy-gauge steel enclosure generates no opportunities to be shot.

In those three situations, being shot is, in turn, “more likely”, “likely”, and “unlikely” — or a similar ordinal pattern that uses “certain”, “confident”, “sure”, etc. But the ordinal pattern, in any case, can never (logically) include statements like “completely certain”, “completely confident”, etc.

An ordinal pattern is logically valid only if it conveys the relative number of opportunities to attain a given kind of outcome — being shot, in the example under discussion.

Ordinal statements about different types of outcome are meaningless. Consider, for example, the claim that the probability that the Mafia had JFK killed is higher than (or lower than or the same as) the probability that the Moon is made of green cheese. First, and to repeat myself for the nth time, the phenomena in question are one-of-a-kind and do not lend themselves to statements about their probability, nor even about the frequency of opportunities for the occurrence of the phenomena. Second, the use of “probability” is just a highfalutin way of saying that the Mafia could have had a hand in the killing of JFK, whereas it is known (based on ample scientific evidence, including eye-witness accounts) that the Moon isn’t made of green cheese. So the ordinal statement is just a cheap rhetorical trick that is meant to (somehow) support the subjective belief that the Mafia “must” have had a hand in the killing of JFK.

Similarly, it is meaningless to say that the “average person” is “more certain” of being killed in an auto accident than in a plane crash, even though one may have many opportunities to die in an auto accident or a plane crash. There is no “average person”; the incidence of auto travel and plane travel varies enormously from person to person; and the conditions that conduce to fatalities in auto travel and plane travel vary just as enormously.

Other examples abound. Be on the lookout for them, and avoid emulating them.

Certainty about Uncertainty

Words fail us. Numbers, too, for they are only condensed words. Words and numbers are tools of communication and calculation. As tools, they cannot make certain that which is uncertain, though they often convey a false sense of certainty.

Yes, arithmetic seems certain: 2 + 2 = 4 is always and ever (in base-10 notation). But that is only because the conventions of arithmetic require 2 + 2 to equal 4. Neither arithmetic nor any higher form of mathematics reveals the truth about the world around us, though mathematics (and statistics) can be used to find approximate truths — approximations that are useful in practical applications like building bridges, finding effective medicines, and sending rockets into space (though the practicality of that has always escaped me).

But such practical things are possible only because the uncertainty surrounding them (e.g., the stresses that may cause a bridge to fail) is hedged against by making things more robust than they would need to be under perfect conditions. And, even then, things sometimes fail: bridges collapse, medicines have unforeseen side effects, rockets blow up, etc.

I was reminded of uncertainty by a recent post by Timothy Taylor (Conversable Economist):

For the uninitiated, “statistical significance” is a way of summarizing whether a certain statistical result is likely to have happened by chance, or not. For example, if I flip a coin 10 times and get six heads and four tails, this could easily happen by chance even with a fair and evenly balanced coin. But if I flip a coin 10 times and get 10 heads, this is extremely unlikely to happen by chance. Or if I flip a coin 10,000 times, with a result of 6,000 heads and 4,000 tails (essentially, repeating the 10-flip coin experiment 1,000 times), I can be quite confident that the coin is not a fair one. A common rule of thumb has been that if the probability of an outcome occurring by chance is 5% or less–in the jargon, has a p-value of 5% or less–then the result is statistically significant. However, it’s also pretty common to see studies that report a range of other p-values like 1% or 10%.
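
For readers who want to check Taylor’s coin-flip arithmetic, here is a minimal sketch (mine, not Taylor’s) using only the Python standard library: an exact binomial calculation for the 10-flip cases and a normal approximation for the 10,000-flip case.

    # Exact two-sided binomial p-values for a fair coin, plus a normal
    # approximation for the 10,000-flip example. Standard library only.
    from math import comb, erf, sqrt

    def two_sided_p(heads: int, flips: int) -> float:
        """Probability, under a fair coin, of an outcome at least as far
        from flips/2 as the observed number of heads."""
        dev = abs(heads - flips / 2)
        return sum(
            comb(flips, k) * 0.5 ** flips
            for k in range(flips + 1)
            if abs(k - flips / 2) >= dev
        )

    print(two_sided_p(6, 10))    # ~0.754: 6 heads in 10 flips is unremarkable
    print(two_sided_p(10, 10))   # ~0.002: 10 heads in 10 flips is very unlikely by chance

    # 6,000 heads in 10,000 flips is about 20 standard deviations from fair:
    z = (6000 - 5000) / sqrt(10000 * 0.5 * 0.5)
    p_approx = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    print(z, p_approx)           # z = 20.0, p effectively zero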

Given the omnipresence of “statistical significance” in pedagogy and the research literature, it was interesting last year when the American Statistical Association made an official statement “ASA Statement on Statistical Significance and P-Values” (discussed here) which includes comments like: “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. … A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. … By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”

Now, the ASA has followed up with a special supplemental issue of its journal The American Statistician on the theme “Statistical Inference in the 21st Century: A World Beyond p < 0.05” (January 2019). The issue has a useful overview essay, “Moving to a World Beyond ‘p < 0.05’,” by Ronald L. Wasserstein, Allen L. Schirm, and Nicole A. Lazar. They write:

We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way. Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. … In sum, “statistically significant”—don’t say it and don’t use it.

. . .

So let’s accept that the “statistical significance” label has some severe problems, as Wasserstein, Schirm, and Lazar write:

[A] label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern (2006) famously observed, the difference between “significant” and “not significant” is not itself statistically significant.
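
The Gelman and Stern observation is easy to demonstrate with a toy calculation. The sketch below is my illustration, with made-up effect sizes and standard errors: one estimate clears the 5 percent threshold, the other does not, and the difference between them is nowhere near “significant”.

    # Gelman-Stern in miniature: two hypothetical estimates with the same
    # standard error, one "significant" and one not, whose difference is
    # itself not statistically distinguishable from zero.
    from math import erf, sqrt

    def two_sided_p(estimate: float, std_error: float) -> float:
        """Two-sided p-value for a normally distributed estimate against zero."""
        z = abs(estimate) / std_error
        return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

    effect_a, se_a = 25.0, 10.0   # hypothetical study A
    effect_b, se_b = 10.0, 10.0   # hypothetical study B

    diff = effect_a - effect_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)

    print(two_sided_p(effect_a, se_a))   # ~0.012 -> "significant"
    print(two_sided_p(effect_b, se_b))   # ~0.317 -> "not significant"
    print(two_sided_p(diff, se_diff))    # ~0.29  -> the difference is not "significant"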

In the middle of the post, Taylor quotes Edward Leamer’s 1983 article, “Taking the Con out of Econometrics” (American Economic Review, March 1983, pp. 31-43).

Leamer wrote:

The econometric art as it is practiced at the computer terminal involves fitting many, perhaps thousands, of statistical models. One or several that the researcher finds pleasing are selected for reporting purposes. This searching for a model is often well intentioned, but there can be no doubt that such a specification search invalidates the traditional theories of inference. … [I]n fact, all the concepts of traditional theory, utterly lose their meaning by the time an applied researcher pulls from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose. The consuming public is hardly fooled by this chicanery. The econometrician’s shabby art is humorously and disparagingly labelled “data mining,” “fishing,” “grubbing,” “number crunching.” A joke evokes the Inquisition: “If you torture the data long enough, Nature will confess” … This is a sad and decidedly unscientific state of affairs we find ourselves in. Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analyses seriously.
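
Leamer’s point about specification search lends itself to a toy simulation (mine, not his): regress pure noise on enough candidate variables and something will look “significant” by chance. The sketch below uses made-up data and a simple correlation test as a stand-in for a full regression.

    # Specification search on pure noise: try many unrelated regressors and
    # report only the best-looking p-value. Standard library only.
    import random
    from math import atanh, erf, sqrt

    random.seed(42)
    N_OBS, N_SPECS = 100, 200   # hypothetical sample size and number of specifications tried

    def correlation_p(x, y):
        """Two-sided p-value for the correlation of x and y (Fisher z approximation)."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sx = sqrt(sum((v - mx) ** 2 for v in x))
        sy = sqrt(sum((v - my) ** 2 for v in y))
        r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)
        z = atanh(abs(r)) * sqrt(n - 3)
        return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

    y = [random.gauss(0, 1) for _ in range(N_OBS)]      # outcome: pure noise
    p_values = []
    for _ in range(N_SPECS):                            # many unrelated candidate regressors
        x = [random.gauss(0, 1) for _ in range(N_OBS)]
        p_values.append(correlation_p(x, y))

    print(min(p_values))                     # the "thorn portrayed as a rose": almost surely < 0.05
    print(sum(p < 0.05 for p in p_values))   # on the order of 5% of the specs cross the threshold by chance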

Economists and other social scientists have become much more aware of these issues over the decades, but Leamer was still writing in 2010 (“Tantalus on the Road to Asymptopia,” Journal of Economic Perspectives, 24: 2, pp. 31-46):

Since I wrote my “con in econometrics” challenge much progress has been made in economic theory and in econometric theory and in experimental design, but there has been little progress technically or procedurally on this subject of sensitivity analyses in econometrics. Most authors still support their conclusions with the results implied by several models, and they leave the rest of us wondering how hard they had to work to find their favorite outcomes … It’s like a court of law in which we hear only the experts on the plaintiff’s side, but are wise enough to know that there are abundant [experts] for the defense.

Taylor wisely adds this:

Taken together, these issues suggest that a lot of the findings in social science research shouldn’t be believed with too much firmness. The results might be true. They might be a result of a researcher pulling out “from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose.” And given the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is “significant,” while if the result had a 5.2% probability of happening by chance it is “not significant.” Uncertainty is a continuum, not a black-and-white difference [emphasis added].

The italicized sentence expresses my long-held position.

But there is a deeper issue here, to which I alluded above in my brief comments about the nature of mathematics. The deeper issue is the complete dependence of logical systems on the underlying axioms (assumptions) of those systems, which Kurt Gödel addressed in his incompleteness theorems:

Gödel’s incompleteness theorems are two theorems of mathematical logic that demonstrate the inherent limitations of every formal axiomatic system capable of modelling basic arithmetic….

The first incompleteness theorem states that no consistent system of axioms whose theorems can be listed by an effective procedure (i.e., an algorithm) is capable of proving all truths about the arithmetic of natural numbers. For any such consistent formal system, there will always be statements about natural numbers that are true, but that are unprovable within the system. The second incompleteness theorem, an extension of the first, shows that the system cannot demonstrate its own consistency.

This is very deep stuff. I own the book in which Gödel proves his theorems, and I admit that I have to take the proofs on faith. (Which simply means that I have been too lazy to work my way through the proofs.) But there seem to be no serious or fatal criticisms of the theorems, so my faith is justified (thus far).

There is also the view that the theorems aren’t applicable in fields outside of mathematical logic. But any quest for certainty about the physical world necessarily uses mathematical logic (which includes statistics).

This doesn’t mean that the results of computational exercises are useless. It simply means that they are only as good as the assumptions that underlie them; for example, assumptions about relationships between variables, assumptions about the values of the variables, and assumptions as to whether the correct variables have been chosen (and properly defined) in the first place.

There is nothing new in that, certainly nothing that requires Gödel’s theorems by way of proof. It has long been understood that a logical argument may be valid — the conclusion follows from the premises — but untrue if the premises (axioms) are untrue.

But it bears repeating and repeating — especially in the age of “climate change”. That CO2 is a dominant determinant of “global” temperatures is taken as axiomatic. Everything else flows from that assumption, including the downward revision of historical (actual) temperature readings, to ensure that the “observed” rise in temperatures agrees with — and therefore “proves” — the validity of climate models that take CO2 as the dominant variable. How circular can you get?

Check your assumptions at the door.

More Stock-Market Analysis (II)

Today’s trading on U.S. stock markets left the Wilshire 5000 Total Market Full-Cap index 17 percent below its September high. How low will the market go? When will it bounce back? There’s no way to know, which is the main message of “Shiller’s Folly” and “More Stock-Market Analysis“.

Herewith are three relevant exhibits based on the S&P Composite index as reconstructed by Robert Shiller (commentary follows):

In the following notes, price refers to the value of the index; real price is the inflation-adjusted value of the index; total return is the value with dividends reinvested; real total return is the inflation-adjusted value of total return. (The arithmetic behind the notes is sketched after the list.)

  • The real price trend represents an annualized gain of 1.8 percent (through November 2018).
  • The real total return trend represents an annualized gain of 6.5 percent (through September 2018).
  • In month-to-month changes, real price has gone up 56 percent of the time; real total return has gone up 61 percent of the time.
  • Real price has been in a major decline about 24 percent of the time, where a major decline is defined as a real price drop of more than 25 percent over a span of at least 6 months.
  • The picture is a bit less bleak for total returns (about 20 percent of the time) because the reinvestment of dividends somewhat offsets price drops.
  • Holding a broad-market index fund is never a sure thing. Returns fluctuate wildly. Impressive real returns (e.g., 20 percent and higher) are possible in the shorter run (e.g., 5-10 years), but so are significantly negative returns. Holding a fund longer reduces the risk of a negative return while also suppressing potential gains.
  • Long-run real returns of greater than 5 percent a year are not to be scoffed at. It takes a lot of research, patience, and luck to do better than that with individual stocks and specialized mutual funds.
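
For what it is worth, here is a minimal sketch of that arithmetic (with a made-up toy series, not Shiller’s data): annualized return, the share of up months, and one plausible reading of the “major decline” definition used above.

    # Toy calculations on a hypothetical monthly real-price series.
    from typing import List

    def annualized_return(index: List[float]) -> float:
        """Annualized growth rate of a monthly index series."""
        months = len(index) - 1
        return (index[-1] / index[0]) ** (12 / months) - 1

    def share_of_up_months(index: List[float]) -> float:
        """Fraction of month-to-month changes that are positive."""
        ups = sum(1 for prev, cur in zip(index, index[1:]) if cur > prev)
        return ups / (len(index) - 1)

    def in_major_decline(index: List[float], drop: float = 0.25, min_months: int = 6) -> List[bool]:
        """Flag months sitting more than `drop` below the prior peak, keeping only
        drawdown spells that last at least `min_months` months (one plausible reading
        of the definition in the notes above)."""
        peak = float("-inf")
        raw = []
        for value in index:
            peak = max(peak, value)
            raw.append(value < peak * (1 - drop))
        flags = [False] * len(raw)
        i = 0
        while i < len(raw):
            if raw[i]:
                j = i
                while j < len(raw) and raw[j]:
                    j += 1
                if j - i >= min_months:
                    for k in range(i, j):
                        flags[k] = True
                i = j
            else:
                i += 1
        return flags

    # Hypothetical three-year toy series, purely to exercise the functions:
    toy = [100, 102, 99, 104, 108, 70, 68, 66, 72, 71, 78, 80,
           95, 105, 110, 112, 108, 115, 120, 118, 125, 130, 128, 135,
           140, 138, 142, 145, 150, 148, 155, 160, 158, 165, 170, 175, 180]
    print(annualized_return(toy), share_of_up_months(toy), sum(in_major_decline(toy)))
    # prints the annualized return, the share of up months, and the months spent in major decline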

More Stock-Market Analysis

I ended “Shiller’s Folly” with the Danish proverb: it is difficult to make predictions, especially about the future.

Here’s more in that vein. Shiller uses a broad market index, the S&P Composite (S&P), which he has reconstructed back to January 1871. I keep a record of the Wilshire 5000 Full-Cap Total-Return Index (WLX), which dates back to December 1970. When dividends for stocks in the S&P index are reinvested, its performance since December 1970 is almost identical to that of the WLX:

It is a reasonable assumption that if the WLX extended back to January 1871 its track record would nearly match that of the S&P. Therefore, one might assume that past returns on the WLX are a good indicator of future returns. In fact, the relationship between successive 15-year periods is rather strong:

But that seemingly strong relationship is an artifact of the relative brevity of the track record of the WLX.  Compare the relationship in the preceding graph with the analogous one for the S&P, which goes back an additional 100 years:

The equations are almost identical — and they predict almost the same real returns for the next 15 years: about 6 percent a year. But the graph immediately above should temper one’s feeling of certainty about the long-run rate of return on a broad market index fund or a well-diversified portfolio of stocks.
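
Here is a minimal sketch of the successive-15-year-return exercise described above, run on a simulated random-walk series rather than the actual S&P or WLX data; the index, drift, and volatility below are all made up, and the statistics.correlation and statistics.linear_regression helpers require Python 3.10 or later.

    # Relate each 15-year annualized return to the return over the preceding
    # 15 years, using a simulated monthly index. Standard library only.
    import random
    from statistics import correlation, linear_regression

    random.seed(1871)
    MONTHS_15Y = 15 * 12

    # Hypothetical monthly real total-return index: a random walk with drift.
    index = [100.0]
    for _ in range(150 * 12):                  # 150 years of months
        index.append(index[-1] * (1 + random.gauss(0.005, 0.04)))

    def annualized_15y(series, start):
        """Annualized return over the 15 years beginning at month `start`."""
        return (series[start + MONTHS_15Y] / series[start]) ** (1 / 15) - 1

    starts = range(len(index) - 2 * MONTHS_15Y)
    prior = [annualized_15y(index, s) for s in starts]                    # first 15 years
    subsequent = [annualized_15y(index, s + MONTHS_15Y) for s in starts]  # next 15 years

    slope, intercept = linear_regression(prior, subsequent)
    print(correlation(prior, subsequent), slope, intercept)
    # For a pure random walk the true relationship between successive periods is nil;
    # an apparently strong fit in a short sample is mostly noise, which echoes the
    # author's point about the brevity of the WLX record.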


Related posts:
Stocks for the Long Run?
Stocks for the Long Run? (Part II)
Bonds for the Long Run?
Much Ado about the Price-Earnings Ratio
Whither the Stock Market?
Shiller’s Folly

Shiller’s Folly

Robert Shiller’s most famous (or infamous) book is Irrational Exuberance (2000). According to the Wikipedia article about the book,

the text put forth several arguments demonstrating how the stock markets were overvalued at the time. The stock market collapse of 2000 happened the exact month of the book’s publication.

The second edition of Irrational Exuberance was published in 2005 and was updated to cover the housing bubble. Shiller wrote that the real estate bubble might soon burst, and he supported his claim by showing that median home prices were six to nine times greater than median income in some areas of the country. He also showed that home prices, when adjusted for inflation, have produced very modest returns of less than 1% per year. Housing prices peaked in 2006 and the housing bubble burst in 2007 and 2008, an event partially responsible for the Worldwide recession of 2008-2009.

However, as the Wikipedia article notes,

some economists … challenge the predictive power of Shiller’s publication. Eugene Fama, the Robert R. McCormick Distinguished Service Professor of Finance at The University of Chicago and co-recipient with Shiller of the 2013 Nobel Prize in Economics, has written that Shiller “has been consistently pessimistic about prices,” so given a long enough horizon, Shiller is bound to be able to claim that he has foreseen any given crisis.

(A stopped watch is right twice a day, but wrong 99.9 percent of the time if read to the nearest minute. I also predicted the collapse of 2000, but four years too soon.)

One of the tools used by Shiller is a cyclically-adjusted price-to-earnings ratio known as CAPE-10. It is

a valuation measure usually applied to the US S&P 500 equity market. It is defined as price divided by the average of ten [previous] years of earnings … , adjusted for inflation. As such, it is principally used to assess likely future returns from equities over timescales of 10 to 20 years, with higher than average CAPE values implying lower than average long-term annual average returns.
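
The arithmetic behind that definition is simple. The sketch below uses hypothetical numbers, not Shiller’s dataset, purely to illustrate the calculation; the function name and the trailing-earnings series are my own inventions.

    # CAPE-10: current real price divided by the average of the previous
    # ten years (120 months) of real earnings.
    from typing import List

    def cape_10(real_price: float, trailing_real_earnings: List[float]) -> float:
        """CAPE-10 for a given real price and the prior 120 months of real earnings
        (earnings expressed at annual rates)."""
        if len(trailing_real_earnings) < 120:
            raise ValueError("need at least 120 months of trailing real earnings")
        last_10_years = trailing_real_earnings[-120:]
        return real_price / (sum(last_10_years) / len(last_10_years))

    # Hypothetical example: real price of 1050 and real earnings drifting upward
    # from 35 to about 47 per year over the prior decade.
    trailing = [35 + 0.1 * m for m in range(120)]
    print(cape_10(1050.0, trailing))   # ~25.6 (in the neighborhood of the October 2003 value cited below)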

CAPE-10, like other economic indicators of which I know, is a crude tool:

For example, the annualized real rate of price growth for the S&P Composite Index from October 2003 to October 2018 was 4.6 percent. The value of CAPE-10 in October 2003 was 25.68. According to the equation in the graph (which includes the period from October 2003 through October 2018), the real rate of price growth should have been -0.6 percent. The actual rate is at the upper end of the wide range of uncertainty around the estimate.

Even a seemingly more robust relationship yields poor results. Consider this one:

The equation in this graph produces a slightly better but still terrible estimate: price growth of -0.2 percent over the 15 years ending in October 2018.

If you put stock (pun intended) in the kinds of relationships depicted above, you should expect real growth in the S&P Composite Index to be zero for the next 15 years — plus or minus about 6 percentage points. It’s the plus or minus that matters — a lot — and the equations don’t help you one bit.

As the Danish proverb says, it is difficult to make predictions, especially about the future.