Charles Rotblut recently spoke at the 2017 AAII Investor Conference. For information on how to subscribe to recordings of the conference presentations, go to www.aaii.com/conferenceaudio.
Philip “Phil” Tetlock is the Annenberg University Professor at the University of Pennsylvania and co-author, with Dan Gardner, of “Superforecasting: The Art and Science of Prediction” (Crown Publishers, 2015). We spoke about forecasting, forecasters and the traits he’s found lead to more accurate predictions.
—Charles Rotblut, CFA
Charles Rotblut (CR): In your book, “Superforecasting,” you reference earlier research you conducted that found experts to be no more accurate than a dart-throwing chimpanzee. Could you comment on that research?
Philip Tetlock (PT): Sure. The earlier work that you’re referring to was published in “Expert Political Judgment: How Good Is It? How Can We Know?” (Princeton University Press) back in 2005. That book reported the results of a series of forecasting tournaments that extended back to the mid-1980s, with roughly 300 experts making roughly 30,000 predictions on a wide array of geopolitical/geo-economic events. Now, that study differed in a number of ways from the later work that was the focus of the “Superforecasting” book, but one takeaway people drew from the “Expert Political Judgment” book was that the average expert is no better than a dart-throwing chimpanzee.
There’s an element of truth to that, but it is a pretty serious oversimplification of the results of the book. Some experts could do significantly better than chance rather consistently and other experts could not. It’s true that when you average it out, the average expert wasn’t massively better than chance. The average expert was slightly better than chance.
CR: On the flip side, in your new book, “Superforecasting,” which is based on another tournament, you talk about a group of ordinary people who would not be considered experts but are comparatively very good at forecasting. In fact, some were so good, you call them “superforecasters.”
PT: Well, that was one of the quite surprising results from the IARPA (Intelligence Advanced Research Projects Activity) forecasting tournaments. The forecasting tournaments were sponsored by the U.S. Intelligence Community and ran from 2011 to 2015. They compared how well the very best amateurs managed to do relative to seasoned professionals inside the U.S. government. The best amateurs beat the seasoned professionals.
CR: Was it the decision process that separated the superforecasters from those who were simply lucky or those who tended to be wrong?
PT: I think there are some deep strands of continuity between the results in the early work on expert political judgment and the later work on superforecasting sponsored by the U.S. Intelligence Community. One of the most important strands of continuity is the importance of a degree of open-mindedness. The very best forecasters don’t travel with a lot of ideological luggage. They’re really quite open-minded and willing to change their minds. “They treat their beliefs as testable hypotheses not sacred possessions,” was one of the lines, I think, from the “Superforecasting” book. I think that was also true in the earlier work on expert political judgment.
CR: Is it fair to say we humans are just too quick to jump to judgment based on our heuristics (mental shortcuts) instead of taking a step back to try to consider the broader picture? Would that be an accurate statement?
PT: I think that is a fair statement. I think that there is a tendency for most forecasters to be too quick to jump to conclusions and then to be too slow to modify those conclusions in response to new evidence. So it’s kind of a double whammy there.
CR: Is this a case also where there’s some cognitive dissonance occurring, with some forecasters downplaying evidence that might cause them to change their opinion? Is that a common behavior among some of the forecasters who are less accurate?
PT: Very common.
CR: In terms of the types of behavior the public gravitates to, you discussed hedgehogs and foxes. Is it the willingness of experts who can be categorized as hedgehogs to tell simple stories and come across as confident that allows them to attract a following, particularly in contrast to foxes, who are more likely to discuss various factors or potential outcomes?
PT: Well, I think that’s fair. We found in the earlier work on expert political judgment that the experts we characterized as hedgehogs were considerably more popular than the experts we characterized as foxes. According to the classical Greek aphorism from the poet Archilochus, the fox knows many things, but the hedgehog knows one big thing.
Now, it doesn’t take a lot of imagination: suppose you’re a harried producer of a television show, under continuous pressure to keep your ratings up, deciding which expert you would rather have on. Would you prefer an expert who has a very compelling sound bite that’s tightly wrapped around a single organizing idea or an expert who offers you a lot of on-the-one-hand-on-the-other-hand talk?
CR: What about our understanding of luck? You used an example of a fund manager who outperforms the markets for six or seven years. While the performance may seem impressive, it could also be attributable to sheer random sampling. If enough people are trying to beat the market, someone, by sheer randomness, will beat it. Is attributing this manager’s returns to skill an example of people misunderstanding the role of randomness and luck?
PT: We were very sensitive to the possibility that the people we were anointing as superforecasters were just super lucky, and that’s why we conducted a lot of tests on how much regression toward the mean there was over time and across topics and different forums. So that’s always a possibility. If you toss thousands of coins thousands of times, some of them are going to wind up heads many, many times in a row. What people are inclined to do is to call the person who predicted the coin landing on heads 25 times in a row the next Warren Buffett.
I want to make clear that I’m not saying Warren Buffett was just lucky. Rather, if you believe in a really strong form of the efficient market hypothesis—which holds that the markets price in all available information—it follows that there should be no systematic individual differences in the ability to predict the market. Whether there are some people who are genuinely good at stock-picking and market trend forecasting is an empirical question. There are some strands of evidence to suggest there are people who are pretty darn good at it, even though pretty darn good means they’re still going to make a lot of mistakes.
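[Editor’s note: To make the arithmetic of luck concrete, here is a minimal Python sketch of our own; the 10,000 managers, 50/50 odds and seven-year horizon are illustrative assumptions, not figures from the book.]

```python
import random

random.seed(42)

N_MANAGERS = 10_000   # hypothetical managers picking stocks with no skill
N_YEARS = 7           # consecutive years of "outperformance" we look for
P_BEAT = 0.5          # assume a 50/50 chance of beating the market each year

# Count managers whose coin-flip record beats the market every single year.
lucky = sum(
    all(random.random() < P_BEAT for _ in range(N_YEARS))
    for _ in range(N_MANAGERS)
)

# Expected count is N_MANAGERS * P_BEAT**N_YEARS = 10,000 / 128, about 78:
# dozens of "seven-year streaks" emerge from pure chance.
print(f"{lucky} of {N_MANAGERS} skill-free managers beat the market "
      f"{N_YEARS} years running (expected about "
      f"{N_MANAGERS * P_BEAT ** N_YEARS:.0f})")
```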
CR: Let’s say somebody directly called the 2008 financial crisis. You believe people are too often willing to view the person as having good forecasting skills instead of taking into consideration how many people were trying to predict the next bear market and/or bad recession?
PT: Indeed. It often is the case that the people who are trotted out in front of the media as having predicted major changes had been predicting major changes year after year and finally a major change occurred. There is an element of the broken clock theory there, with a broken clock being right twice a day.
CR: Let’s get into probability. There is much about it in your book, and I found the discussion interesting. One of the things you covered was granularity, contrasted with psychologist Amos Tversky’s quip that most people operate with a three-setting dial when dealing with probabilities: will happen, won’t happen, or maybe. Could you elaborate on that a little bit?
PT: The three-setting dial: I think that was intended as a joke by Professor Tversky. He was saying that people have a hard time being granular assessors of uncertainty. For many people, the ‘maybe zone’ is a very wide zone.
If you look at the development of probabilistic thinking in children and then through adolescence into adulthood, I think it’s true that initially people have a very hard time distinguishing; they may even start with a two-setting model of uncertainty. Things happen or things don’t happen in very early childhood. As you age, you develop a zone of uncertainty, and that zone becomes progressively more differentiated.
It’s also true, though, that most of the time in adulthood we’re operating under what psychologists call cognitive load, and it’s hard to be very granular. I think what Tversky was joking about was that often we slip into that default mode of processing—yes, maybe, no—or, the three degrees of probability: 0%, 50% and 100%.
The best forecasters in the forecasting tournaments tend to be much more granular. They distinguish not just three degrees of uncertainty, but often as many as 15 to 20. We have a quotation in the book from the chief risk officer of hedge fund AQR, Aaron Brown, who is also a world-class poker player. Aaron Brown said he could tell the difference between a great poker player and a talented amateur, because the great poker player knew the difference between the 60/40 bet and the 40/60 bet. And then he paused and said, “More like 55/45, 45/55. And sometimes more like 52/48, 48/52.”
Poker is an interesting case, because in poker people who have taken college statistics immediately recognize that the laws of statistics must apply. You’re randomly sampling from a well-defined universe, a deck of cards. You’re getting quick clear feedback on the correctness of your judgments. So poker is what we would call a learning-friendly environment, in which it’s easier to become more granular in your assessments of uncertainty.
The big question confronting people in business and in the investment world is: How granular is it possible to be in assessing real-world events? Almost certainly not as granular as great poker players can be at the table. But you can be a lot more granular than yes, no, maybe. Between those two extremes lies the question of how good it is possible to get at judging real-world events of business interest.
Degree of Granularity Inside the Intelligence Community
The National Intelligence Council—which produces the National Intelligence Estimates that inform ultrasensitive decisions such as whether to invade Iraq or negotiate with Iran—asks its analysts to make judgments on a five- or seven-point scale. Though a big improvement over a three-setting dial of “yes,” “no” or “maybe,” it falls short of the granularity the most committed superforecasters can achieve on many questions.
University of Pennsylvania professor Barbara Mellers has shown that granularity predicts accuracy. The average forecaster who rounds probability estimates to the nearest 10 points (e.g., 20%, 30%, 40%) is less accurate than the finer-grained forecaster who rounds to the nearest five points (e.g., 20%, 25%, 30%), who is in turn less accurate than the finest-grained forecaster who rounds to the nearest point (e.g., 20%, 21%, 22%). The sketch following this box illustrates the effect.
Source: “Superforecasting: The Art and Science of Prediction,” by Philip Tetlock and Dan Gardner (Crown Publishers, 2015).
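[Editor’s note: Mellers’ finding can be illustrated with a simulation of our own devising, not taken from her study: start with a well-calibrated forecaster and round her probabilities to coarser grids. The Brier score (the mean squared difference between the stated probability and the 0-or-1 outcome; lower is better) worsens as granularity is lost.]

```python
import random

random.seed(0)

def brier(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def round_to(p, step):
    """Round a probability to the nearest multiple of `step` (e.g., 0.10, 0.05, 0.01)."""
    return min(1.0, max(0.0, round(p / step) * step))

# Simulate 100,000 questions answered by a perfectly calibrated forecaster
# whose stated probability equals the true chance of the event occurring.
true_probs = [random.random() for _ in range(100_000)]
outcomes = [1 if random.random() < p else 0 for p in true_probs]

# Coarser rounding throws away information, so the Brier score worsens
# as the step size grows: granularity predicts accuracy.
for step in (0.10, 0.05, 0.01):
    rounded = [round_to(p, step) for p in true_probs]
    print(f"granularity {step:.2f}: Brier = {brier(rounded, outcomes):.5f}")
```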
CR: It did seem, though, that the superforecasters tend to be a lot more granular than other people. If I read the book correctly, it seems like the accuracy goes up with the granularity.
PT: Both of those propositions are correct. The key to finding out how granular you can get is to keep forecasting and receive regular feedback, as the superforecasters did at the Good Judgment Project and continue to do at the commercial spinoff, Good Judgment Inc.
CR: What is it about the granularity that increases the accuracy of the forecast?
PT: Well, I think it’s important to be clear, there are different meanings of accuracy here. How do you know that a probabilistic forecast is accurate in the first place? Technically, when you make a probability judgment of an individual event, there’s no way to tell if you’re right or wrong, is there? I mean, if the surgeon tells you there’s a 90% chance of the operation working and it doesn’t work, was he wrong or did that 10% possibility materialize?
The only way you could determine whether a decision-maker who makes a forecast is right or wrong is if the forecaster is rash enough to say 100% probability and the event does not occur or 0% probability and the event does occur—in which case you can be confident the forecaster was wrong. But anything between zero and one, anything between a definitive yes or definitive no, there’s always wiggle room. You can always say, “Well, the low-probability scenario materialized.”
That means that assessing the accuracy of probabilistic judgment is something that can only be done in the aggregate. And that’s something that people have a very hard time wrapping their heads around. You have to get in the habit of making many probability judgments of many events across many topics to get a reasonable assessment of how accurate you are.
Accuracy takes on two different meanings in that aggregate world. One meaning is how well-calibrated you are. You’re a well-calibrated forecaster if every time you say there’s an 80% chance of rain, you look at all those events, you list all those days you said 80% chance of rain, and it rains on 80% of those days. Or when you say there’s a 60% chance of rain, it rained on 60% of those days. So there’s a close correspondence between your subjective probability judgments and the objective frequency with which events occur when you make those subjective probability judgments. Make sense?
CR: Yes, absolutely.
PT: That’s one meaning of accuracy, and we call that calibration. Calibration is really a kind of justified humility: you’re appropriately circumspect about what you do and don’t know.
The other meaning of accuracy is something that’s called resolution. Resolution refers to the degree to which the forecaster assigns much higher probabilities to events that occur than to events that don’t occur. We call that justified decisiveness. It may sound like those are the same things, but they’re not always the same thing.
Say it rains 50% of the time in Seattle. If I simply always said, “50% probability of rain” every day, I would be scored as well-calibrated but I would have a terrible resolution score, because I would have no ability to assign higher levels of probability to things that occurred than to things that didn’t occur.
So they’re not the same thing, and you want both of those attributes in a great forecaster. You want them to be appropriately circumspect or appropriately modest in calibration and you also want them to be justifiably decisive, which is resolution.
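[Editor’s note: A minimal sketch of our own, using the Seattle example: the forecaster who always says 50% is perfectly calibrated but has zero resolution, while a forecaster who distinguishes rainy-looking days from clear ones is both calibrated and resolved. The 80%/20% split of day types is an illustrative assumption.]

```python
from collections import defaultdict
import random

random.seed(1)

def calibration_table(forecasts, outcomes):
    """Group days by stated probability; report the observed rain frequency."""
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)
    return {f: round(sum(os) / len(os), 2) for f, os in sorted(bins.items())}

def resolution_gap(forecasts, outcomes):
    """Average probability assigned to rainy days minus dry days.
    A rough stand-in for resolution: higher means more discriminating."""
    wet = [f for f, o in zip(forecasts, outcomes) if o == 1]
    dry = [f for f, o in zip(forecasts, outcomes) if o == 0]
    return sum(wet) / len(wet) - sum(dry) / len(dry)

# Suppose it truly rains 80% of the time on stormy-looking days and 20% on
# clear-looking days, each type occurring half the time (50% rain overall).
days = [0.8 if random.random() < 0.5 else 0.2 for _ in range(10_000)]
outcomes = [1 if random.random() < p else 0 for p in days]

always_50 = [0.5] * len(days)   # the "always say 50%" forecaster
discriminating = days            # says 80% or 20%, matching the true rates

# Both forecasters come out well calibrated, but only the discriminating
# one assigns higher probabilities to rainy days than dry ones (resolution).
for name, fc in (("always-50%", always_50), ("discriminating", discriminating)):
    print(f"{name}: calibration {calibration_table(fc, outcomes)}, "
          f"resolution gap = {resolution_gap(fc, outcomes):.2f}")
```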
CR: That makes sense, though I can see how it can also be easily misunderstood or at least not considered by someone trying to assess a forecast.
PT: It’s a tricky thing for people to understand. This is really a key idea, because misunderstanding it can cause people to throw out very good forecasting systems. When the Supreme Court narrowly upheld the Affordable Care Act in 2012, the prediction markets had been predicting with a 75% probability that the Supreme Court would overturn it. One of the smartest journalists around, David Leonhardt, remarked in an article that the prediction markets got it wrong. Now that’s not quite right.
Prediction markets make hundreds of predictions on hundreds of things, maybe thousands, and they’re known to be pretty well calibrated. This means that when they say 75% probability, those things happen about 75% of the time and they don’t happen 25% of the time. So if you concluded that your forecasting system is wrong every time it says 75% and the event does not occur, you would be poised to throw out a well-calibrated system one time in four. In other words, it would be virtually impossible to develop and maintain a well-calibrated forecasting system if you judged it along those lines.
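[Editor’s note: A quick simulation of our own makes the point: give a perfectly calibrated source a run of 75% forecasts and apply a rule that fires it on its first miss. The 20-question horizon is an illustrative assumption.]

```python
import random

random.seed(7)

# A perfectly calibrated source: whenever it says 75%, the event truly
# happens 75% of the time.
def forecast_resolves():
    return random.random() < 0.75

# Rule under test: discard the system the first time a 75% call fails.
TRIALS = 10_000
survived = 0
tenure = []
for _ in range(TRIALS):
    for q in range(1, 21):          # give the system 20 questions
        if not forecast_resolves():
            tenure.append(q)        # fired at question q
            break
    else:
        survived += 1               # made it through all 20 questions

# Each 75% call fails 25% of the time, so surviving 20 questions has
# probability 0.75**20, about 0.3%: the rule rejects a good system almost
# surely, usually within about four questions.
print(f"fired within 20 questions in {100 * (1 - survived / TRIALS):.1f}% "
      f"of runs; average tenure before firing = "
      f"{sum(tenure) / len(tenure):.1f} questions")
```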
CR: To go back to a poker analogy: if you’re playing Texas Hold’em, are dealt two aces and lose, concluding that a pair of aces is a bad hand to play would be a calibration error, since even the best starting hand will not win every time. [Editor’s note: A pair of aces is statistically the best starting hand in Texas Hold’em.]
PT: Exactly. A good process is a good thing to follow, even though it’s not an infallible guide. If you throw out a good process every time it leads you astray, you will throw out your good process very quickly and you’ll be much worse off in the aggregate.
CR: In this case, what’s making some of these forecasters really good is understanding calibration, but also having sound logic behind their decision process and their assessment of the odds. Correct?
CR: I also want to talk about uncertainty. You described part of the forecasting process as being able to separate what’s unknown but knowable from what’s unknown and not knowable. Specifically, you addressed epistemic and aleatory uncertainty.
PT: This is a distinction that makes a lot of sense in the philosophical and statistical literature, but it is a little difficult to apply in everyday life. There’s often a lot of uncertainty about what type of uncertainty we’re dealing with. Epistemic uncertainty is uncertainty that could be reduced, in principle, if you had all of the relevant information. Aleatory uncertainty is otherwise known as irreducible uncertainty: even if you know all the relevant scientific laws and all of the antecedent conditions on which those laws are operating, you still would not be able to predict with 100% accuracy.
It comes back to the classic debate in the 20th century between two giants of physics, Albert Einstein and Niels Bohr, with Einstein representing relativity theory and Bohr representing early quantum mechanics. Einstein famously said that God doesn’t play dice with the cosmos, which I suppose is a way of saying that there is no such thing as aleatory uncertainty. It’s all epistemic, if we knew enough to be able to predict. Whereas, Bohr had the rejoinder to Einstein that he should stop telling God what to do.
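[Editor’s note: One way to see the distinction, in a sketch of our own rather than Tetlock’s: estimate a biased coin. Uncertainty about the coin’s bias is epistemic and shrinks as flips accumulate; uncertainty about the next flip’s outcome is aleatory and never goes away. The 0.6 bias is an illustrative assumption.]

```python
import random

random.seed(3)

TRUE_BIAS = 0.6   # the coin's real heads probability, unknown to the forecaster

flips = 0
heads = 0
for total in (10, 100, 1_000, 10_000):
    while flips < total:
        heads += random.random() < TRUE_BIAS
        flips += 1
    # Beta(1 + heads, 1 + tails) posterior for the bias; its standard
    # deviation is the *epistemic* uncertainty, and it shrinks with data.
    a, b = 1 + heads, 1 + flips - heads
    post_sd = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
    # The next flip is still a Bernoulli draw: its outcome spread p(1-p)
    # is *aleatory* and does not shrink, no matter how much we know.
    p = a / (a + b)
    print(f"n={flips:>6}: epistemic sd of bias = {post_sd:.4f}, "
          f"aleatory sd of next flip = {(p * (1 - p)) ** 0.5:.3f}")
```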
CR: Before we finish, I’d like to discuss revisions. You noted that superforecasters are constantly revising their forecasts. Is that something that you see among people who tend to do well in forecasting tournaments? And is it something people should pay attention to when they’re looking at forecasts?
PT: This is another one of these balancing acts that goes into superforecasting. It is the case that the more common error in forecasting is the failure to revise one’s beliefs enough in response to new evidence. That’s sometimes called cognitive conservatism or belief perseverance. You’re too slow of a belief updater. It’s also possible to make the opposite mistake and to be too jumpy and too volatile.
Sometimes markets are described as excessively volatile. They’re overreacting to events. That’s also an error that can be made in human judgment.
I think what you need to be aware of as someone who’s starting out in the hope of becoming a superforecaster is that, yes, excessive conservatism is the more common error in human judgment and it’s really important to be willing to update your beliefs, but it’s also possible to make the opposite error of being excessively jumpy. An interesting attribute of how superforecasters at Good Judgment Inc. go about belief updating is that most of their belief updates are fairly small.
You see a poll result that shows that Donald Trump is narrowing the gap with Hillary Clinton between the Republican National Convention and the Democratic National Convention. You change your probability of a Clinton victory from 73% down to 70%. I’m not saying that’s the correct adjustment by any means, but I’m saying that is roughly the style with which the very best forecasters tend to approach belief updating. It tends to be incremental, with the exception that sometimes big things happen and you have to update your beliefs massively fast.
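[Editor’s note: The incremental style can be phrased in Bayesian terms. In this sketch of our own, a probability is converted to odds, multiplied by a likelihood ratio for the new evidence, and converted back; the 73%-to-70% move echoes the example above, and the likelihood ratios are illustrative assumptions.]

```python
def update(prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prob / (1 - prob)
    odds *= likelihood_ratio
    return odds / (1 + odds)

p = 0.73   # prior probability of a Clinton victory, from the example above

# A mildly unfavorable poll: suppose it is slightly more likely under
# "Trump narrowing for real" than under "status quo" (a small LR of 0.87).
p = update(p, 0.87)
print(f"after the poll: {p:.2f}")   # about 0.70: a small, incremental move

# A genuinely seismic event carries an extreme likelihood ratio and
# justifies the occasional big, fast update Tetlock mentions.
p = update(p, 0.1)
print(f"after a shock: {p:.2f}")
```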
10 Guidelines for Aspiring Superforecasters
Tetlock offers these guidelines, which distill key themes in “Superforecasting” and in training systems that have been experimentally demonstrated to boost accuracy in real-world forecasting contests.
1. Triage: focus on questions where your work is likely to pay off.
2. Break seemingly intractable problems into tractable sub-problems.
3. Strike the right balance between inside and outside views.
4. Strike the right balance between under- and overreacting to evidence.
5. Look for the clashing causal forces at work in each problem.
6. Strive to distinguish as many degrees of doubt as the problem permits, but no more.
7. Strike the right balance between under- and overconfidence, between prudence and decisiveness.
8. Look for the errors behind your mistakes, but beware of rearview-mirror hindsight biases.
9. Bring out the best in others and let others bring out the best in you.
10. Master the error-balancing bicycle: learning requires doing, with good feedback.
Source: “Superforecasting: The Art and Science of Prediction,” by Philip Tetlock and Dan Gardner (Crown Publishers, 2015).