In this post I’ll take aim at the combination of two ideas that are often thought to be natural companions. The first is the idea that we should think of our beliefs in probabilistic terms, and in particular that Bayes’s rule—more on this in a sec if you don’t know what it is—provides a good model for how we should update our beliefs when we get new evidence. The second is the idea that we shouldn’t be 100% certain of anything. In a very small nutshell:
Bayesianism: Beliefs should come with numbers, and the numbers should be updated with Bayes’s rule.
Fallibilism: Those numbers should never be 1 or 0 (certain truth or certain falsity, respectively).
Sean Carroll—who I’m picking on only because I have immense respect for him, and so linking to him makes clear I’m not targeting a straw man—explicitly endorses both ideas right next to each other. And he’s not alone. Rather, the ideas form an important part of the Bay Area “rationalist” worldview. However, as I’ll explain, they’re inconsistent with each other. Applying Bayes’s rule requires that you be 100% certain in some things. Moreover, the inconsistency is not easy to avoid. If you like both ideas—as I do—you probably have some serious thinking to do.
Let’s start with Bayesianism. Bayesianism’s most basic commitment is that mathematical probabilities are a good model for rational beliefs. Instead of thinking of the attitudes you can take towards a claim as yes/no/maybe—you believe it, you disbelieve it, or neither—we can think of them as coming in degrees, where the more confident you are of something, the higher the probability you assign it, with 1 as the maximum and 0 as the minimum.1 Another kind of attitude Bayesianism recognizes is a conditional degree of belief. For example, if somebody is tossing a die, my degree of belief that it will land showing 6 is 1/6: there are six possible outcomes, each of which is equally likely. But my conditional degree of belief that it will land showing 6, given that it lands showing an even number, is different. Now, there are just three even faces (2, 4, 6) and each is equally likely, so the chance it lands on six, given that it lands on an even number, is 1/3. We express these beliefs in language with words like “if”. When I say “if my toddler has a nap this afternoon, she probably won’t get a good night’s sleep” I’m expressing a high degree of belief that she won’t get a good night’s sleep, on the condition that she naps in the afternoon. None of this is uncontroversial, and you can find plenty of philosophers and psychologists who already want to get off the bus.2 But I don’t; I like Bayesianism, and will take it for granted in what follows.
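(In case you want the arithmetic behind the die example: that 1/3 is just the standard ratio formula for a conditional probability.)

$$\Pr(\text{Six} \mid \text{Even}) = \frac{\Pr(\text{Six and Even})}{\Pr(\text{Even})} = \frac{1/6}{1/2} = \frac{1}{3}$$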
The second component of Bayesianism is a rule—Bayes’s rule—for how your degrees of belief should change as you gain new evidence. If this is old hat to you, feel free to skip to “Bayes’s Rule Demands Certainty” below. But if it’s not familiar—perhaps you’ve heard of Bayes’s rule or Bayesian this or that, but have never actually engaged with the equation itself—read on. Ultimately, the explanation for why Bayes’s rule requires certainty will require getting a little bit granular, so we need to know what the rule is and why it works, before we can see why it’s incompatible with fallibilism.
Bayes’s rule says that your new probability in a hypothesis H, after you’ve gotten some new evidence E, should be determined by three things: your old probability in H, your old probability in E, and your old conditional probability in E, given H. This latter value is often called the “likelihood”. When the likelihood is high, the evidence is just what the hypothesis predicted, and that counts in favor of the hypothesis. When it’s low, the evidence is not what the hypothesis predicted, and that counts against it.
More specifically, Bayes’s rule says your new probability in H after you’ve learned E should equal your old probability in H, multiplied by the likelihood, divided by your old probability in E. That’s all displayed in the following equation. If equations scare you, be not afraid! I’m about to explain why the equation makes sense:
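$$\Pr_{\text{new}}(H) = \frac{\Pr(H) \times \Pr(E \mid H)}{\Pr(E)}$$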
I think the best way to understand this equation is to think about how values on the right could increase or decrease, and to see that they have the impact they intuitively should have on the value on the left. Why does Pr(H) appear on top on the right? Because our old opinions shouldn’t be completely washed away by new evidence; the more likely a hypothesis was before the evidence came in, the more likely it is afterwards. Why does the likelihood, Pr(E|H), appear on top on the right? Because it counts in favor of a hypothesis when it said we should expect to see some evidence—when Pr(E|H) is high—and then we do in fact get that evidence. And it should count against a hypothesis when it said we should expect not to see some evidence—when Pr(E|H) is low—but we get that evidence after all; as the likelihood increases or decreases, so does the new probability for H. Dividing by the old probability of E is a bit less intuitive, but I think the best way to see it is to think of it as reflecting the power of unexpected evidence. When your old probability for E was high—when E was the evidence you expected to see—then you’re dividing by something close to 1. (Remember that all these values are between 0 and 1.) And just as anything divided by 1 is itself, anything divided by something close to 1 is close to itself. So when Pr(E) is high, we can basically ignore it—the new probability for H will be determined mainly by the values on the top. But when the old probability for E was low, then it can make an important difference. The interesting case is when the old probability for H was low, but so was the old probability for E, while the probability for E given H was high.
Imagine the crazy guy in Times Square—the one who spends all day talking about the Illuminati, lizard people, and the like—telling you that you’ll inherit a private island in the Caribbean. Initially, you think the chance that he’s a reliable source of information is very low.
And you also think it’s extremely unlikely that you’ll inherit an island in the Caribbean. But you think that the chance that you’ll inherit an island in the Caribbean, if he’s a reliable source of information, is a good deal higher. And then, lo and behold, the next day you get a letter from the executor of the will of Lord Percival Wadsworth III informing you that you are now the owner of the island of Saint Aurelia. Intuitively, you should now think this guy might in fact be a reliable source of information after all (you should also start worrying about lizard people…). Dividing by Pr(E) is what makes this possible. Let’s rewrite the equation above, subbing in “Reliable” for H, and “low”, “low”, and “high” for Pr(H), Pr(E), and Pr(E|H), respectively:
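$$\Pr_{\text{new}}(\text{Reliable}) = \frac{\text{low} \times \text{high}}{\text{low}}$$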
Now, the fact that “low” appears on the bottom on the right of the equation allows it to cancel out the “low” that appears in the numerator—which represents your initial skepticism—letting the probability that the apparently crazy guy is in fact reliable end up high.
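If you want to see the cancellation with actual numbers, here’s a toy version of the update in Python. The specific values are invented purely for illustration; the point is just that a small Pr(E) in the denominator can offset a small prior on top:

```python
# Toy Bayes update for the Times Square example (all numbers are made up).
prior_reliable = 0.01   # Pr(H): initially you think he's almost certainly not reliable
pr_E_given_H = 0.5      # Pr(E|H): if he IS reliable, the inheritance is a live possibility
pr_E = 0.011            # Pr(E): inheriting a Caribbean island is very unlikely overall

posterior = prior_reliable * pr_E_given_H / pr_E
print(round(posterior, 2))  # 0.45: the low Pr(E) downstairs cancels the low prior upstairs
```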
Bayes’s Rule Demands Certainty
That was our whirlwind tour of Bayesianism. What about fallibilism? I already mentioned Carroll, who puts it as follows:
The other big idea is that your degree of belief in an idea should never go all the way to either zero or one. It’s never absolutely impossible to gather a certain bit of data, no matter what the truth is—even the most rigorous scientific experiment is prone to errors, and most of our daily data-collecting is far from rigorous. That’s why science never “proves” anything; we just increase our credences in certain ideas until they are almost (but never exactly) 100%.
It’s compelling. And not new: a standard narrative in epistemology is that Descartes tried to rebuild knowledge on a foundation of absolute certainties, but failed. Modern epistemology has learned to live with our pervasive fallibility.
So what’s the problem?
The problem is that it’s built into Bayes’s rule that the new evidence is learned with certainty. Initially I described it as a rule for how to update our beliefs when we get new evidence. But what is it to get new evidence? As it turns out, the rule requires that getting new evidence amounts to becoming newly certain in the truth of that evidence. To see what the rule says about our belief in a piece of evidence once we’ve learned it, we can just treat the evidence as if it were the hypothesis in the rule. That is, we can just substitute “E” for “H” in the equation. After all, it’s an equation that’s supposed to tell us what our new probability should be for any old hypothesis H when we learn any new evidence E. Well, what should our new probability for E be, when we learn E? Here’s what the rule looks like when we just replace the H’s with E’s:
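$$\Pr_{\text{new}}(E) = \frac{\Pr(E) \times \Pr(E \mid E)}{\Pr(E)}$$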
You’ll notice that the old probability for E occurs on both the top and the bottom on the right side of the equation. They cancel out, so we’re left with:
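$$\Pr_{\text{new}}(E) = \Pr(E \mid E)$$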
The new probability for E is the old conditional probability of E, given E. But Pr(E|E) is 1—of course, if E is true, then E is true. This means that the new probability for E is 1. What this reveals is that built into Bayes’s rule is the presupposition that getting new evidence means becoming certain in the truth of that new evidence.
I’m being a bit cute in calling this a dirty secret. It’s not a secret. It’s been known for a long time, and philosophers who like both Bayesianism and fallibilism have worried about it. But there are no easy fixes. Richard Jeffrey famously proposed a generalization of Bayes’s rule that allows for evidence to be learned, but not with certainty. He imagined a case where you catch a glimpse of a piece of cloth in poor lighting, and can’t quite tell whether it’s blue or green. Perhaps initially you thought it was equally likely to be either, but after seeing it, you now think it’s 80/20 green versus blue. He developed a mathematical framework that would let you learn this new 80/20 probability—which would then have further ramifications for the rest of your beliefs—rather than learning any evidence proposition with certainty. Without getting into the details, I think it’s fair to say that while Jeffrey’s updating rule is of theoretical interest, it’s not really a practical, general alternative to Bayes’s rule.
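For the curious, here is roughly what Jeffrey’s rule says in the cloth case: rather than conditioning outright on the proposition that the cloth is green, you reweight your other beliefs by the new 80/20 probabilities, so that for any hypothesis H,

$$\Pr_{\text{new}}(H) = 0.8 \times \Pr(H \mid \text{Green}) + 0.2 \times \Pr(H \mid \text{Blue})$$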
Take one of the best-known contemporary uses of Bayesian probability: election forecasting. People like Nate Silver build elaborate Bayesian models of elections, public opinion, and polling. His models treat polls as data—when a new pollster releases a result, his model henceforth treats it as certain that the pollster did in fact obtain that result.3 That has further ramifications for what the model predicts about public opinion, which has still further ramifications for what it will predict about elections. It’s very hard to see how such models could run using Jeffrey’s rule; instead of learning that some poll had a particular outcome, what would they get to learn? Some new set of probabilities for all the different outcomes the poll might have had instead? This would face both problems of arbitrariness—which new set of probabilities should be used? There are infinitely many options—and, I strongly suspect, computational intractability: if for every poll, the model had to keep track not just of the result, but of probabilities assigned to all possible results, and for all further computations to be sensitive to all these probabilities, that’s a lot more information to process. It’s no accident that basically no practicing statisticians use Jeffrey’s rule in their models.4 Moreover, I suspect the same factors that make it cumbersome for explicit, intentional use also mean that it’s not what’s going on “under the hood” in automatic, unconscious cognition.
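To make “treating the poll result as certain” concrete, here is a minimal sketch in Python. It’s a textbook Beta-Binomial toy with invented numbers, nothing like Silver’s actual models; the relevant feature is just that the observed counts enter the update as fixed, certain data:

```python
# Minimal sketch: conditioning on a (hypothetical) poll result as certain evidence.
# A textbook Beta-Binomial toy model, not anything like a real forecasting model.
prior_a, prior_b = 50, 50        # Beta(50, 50) prior on candidate A's support, centered at 50%
poll_n, poll_yes = 800, 432      # invented poll: 432 of 800 respondents back candidate A

# Strict conditionalization: the counts are treated as fixed, certain data.
post_a = prior_a + poll_yes
post_b = prior_b + (poll_n - poll_yes)
print(post_a / (post_a + post_b))  # ~0.536: posterior mean for A's support
```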
Still, claims about what results pollsters obtained don’t seem like great candidates for infallible certainty. Data falsification, or even just typographical errors, are possible. So where does this leave us? What can be salvaged of fallibilism, if we also think that Bayes’s rule captures deep truths about how we do and should learn from evidence?
At this point I’ll be coy, and say merely that I don’t think there’s an easy answer, and that I won’t try to provide a hard answer in this post. I do have academic work—both previously published and in progress—that treats these questions at greater length, but I don’t get all that far there either. Suffice to say, I feel the pull of both Bayesianism and fallibilism, and I think it’s a worthwhile but non-trivial project to find attractive versions of them that can be jointly maintained. Are you a Bayesian fallibilist? How do you think we should reconcile the tension above? Let me know in the comments.
Another component of Bayesianism I’m leaving out is the idea that degrees of belief should conform to the probability calculus.
Here’s one on substack.
Of course, that doesn’t mean anything interesting follows with certainty from that fact. Certainty that the pollster obtained a particular result doesn’t entail certainty about the underlying feature of public opinion the pollster is trying to measure.
In response to a question about whether any applied Bayesian statisticians use Jeffrey conditionalization rather than traditional, strict conditionalization, ChatGPT (Pro) gives me the following reply:
“Short answer: very, very rarely, but there are some niche cases where something like Jeffrey conditionalization appears in applied Bayesian statistics, usually under different names or in modified forms.”
For both Chris Samp and Some Lawyer, yeah a natural way to avoid the tension is to constrain fallibilism. It's not that you can't be 100% certain about *anything*. Rather, it's that your 100% certainties should be restricted to some limited class of claims. Some Lawyer is suggesting they should just be claims about your experiences. Chris Samp seems happy for certainties to include claims about poll results. But maybe both will agree no certainty in abstract scientific theories.
I've got my reasons for not loving this style of answer, but I don't think they're knock down. For claims about experiences, I tend to think that for facts about what you've experienced to be the sorts of facts that you're comfortable treating as absolutely certain, indubitable, etc., you have to drain them of quite a lot of content. E.g., it can't be something like "I'm hungry", because you might be mistaking hunger for some other kind of stomach pain. And I'm not sure that experiences drained of that much content can play the roles they'd need to as evidence. Certainly it's very hard to imagine actually working with runnable models that treat as evidence stuff about what experiences I have, rather than just the poll results themselves. The state space you'd need for such models would be intractably large and unstructured. So once you make this move, you're necessarily, I think, moving to a version of Bayesianism that's got to be pretty detached from either models we explicitly build for forecasting, or even, I think, the ones we implicitly reason with in everyday life; I tend to think actual reasoning involves taking a lot more for granted--thinking about "small worlds", as Savage said--than this kind of picture suggests.
But my preferred approach leads to puzzles too, because you need non-Bayesian ways of ceasing to treat as certain evidence stuff that you previously were treating as certain evidence. E.g., if you treat as evidence that some poll obtained a certain result, and then later learn that there was a typo in the report, the process by which you "unlearn" the evidence about the poll result can't just be a straightforward Bayesian update in the model you started with.
This is a nice point but doesn’t seem too worrying for the Bayesian fallibilist. Because it seems like being certain in one’s evidence is just an artifact of the idealization required to cast learning episodes into this mathematical framework. In order to model these updates, you need certain elements that you treat as fixed for the sake of the model, and then update from there. But you can always go back to revise those elements… you don’t take those to be fixed forever. (Neurath’s boat and all that.) Like you can always treat the evidence as a further hypothesis and then ask what the evidence for that is; e.g., is the reporting agency of the polls usually reliable, etc.