Guesswork Made Precise

Look out the window. What temperature is it outside? Take a guess.

How did you do it? You must have made some observations, incorporated everything you know about the local climate and the current season, then decided on, say, 24 degrees Celsius. Let’s formalize this process and devise notation that is "common sense reduced to calculation", as Laplace put it.

Let \(X\) be the temperature outside in degrees Celsius. Let \(D\) be your data. Then we write:

\[ \newcommand{\e}[1]{\langle #1\rangle} \e{X|D} = 24 \]

to mean that you estimate \(X\) to be 24, based on \(D\). One could imagine the angle brackets to indicate a thought bubble.

Rescaling

What if you also want to guess the temperature in Fahrenheit? The choice of temperature scale should have no effect on your guess, so if your estimate in Celsius is 24, then your estimate in Fahrenheit must be:

\[ \begin{aligned} \e{1.8 X + 32|D} &= 1.8 \e{X|D} + 32 \\ &= 1.8 \times 24 + 32 \\ & = 75.2 \end{aligned} \]

More generally, we have the axiom of strong rescaling:

\[ \e{rX + s|D} = r \e{X|D} + s \]

for all \(\newcommand{\RR}{\mathbb{R}} r, s \in \RR\).

Setting \(r = 0\) implies \(\e{s|D} = s\) for all \(s\in\RR\), which is unimpressive, but essential. An estimate of a known number ought to be exact.
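To make this concrete, here is a minimal Python sketch of one toy model of estimates: an unknown is a list of possible values, the data \(D\) is a list of non-negative weights, and the estimate is the weighted average. This model is an assumption chosen purely for illustration; the axioms do not single it out.

  # Toy model (illustrative assumption): estimates as weighted averages.
  def estimate(values, weights):
      total = sum(weights)
      return sum(v * w for v, w in zip(values, weights)) / total

  celsius = [20, 24, 28]               # possible temperatures
  weights = [0.25, 0.5, 0.25]          # how strongly D favours each

  print(estimate(celsius, weights))                           # 24.0
  print(estimate([1.8 * c + 32 for c in celsius], weights))   # 75.2 = 1.8 * 24 + 32
  print(estimate([7, 7, 7], weights))                         # 7.0: a known number is estimated exactly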

Think Indifferent

Suppose \(X\) and \(Y\) are unknowns that satisfy \(X + Y = s\) for some real \(s\). Then by strong rescaling:

\[ \begin{aligned} \e{X|D} + \e{Y|D} &= \e{X|D} + \e{s - X|D} \\ &= \e{X|D} + s - \e{X|D} \\ &= s \end{aligned} \]

In other words, our estimates for \(X\) and \(Y\) must sum to \(s\).

Furthermore, suppose the only fact we know about \(X\) and \(Y\) is that they sum to \(s\). Since addition is symmetric, this means we could switch the labels \(X\) and \(Y\) and draw the same conclusions, an argument known as the principle of indifference (Keynes, A Treatise on Probability, Chapter IV). Then:

\[ \e{X|D} = \e{Y|D} = s/2 \]

As they say, your guess is as good as mine.

For example, suppose we flip a coin, and we represent heads with 0 and tails with 1. Let \(X\) be the side that is showing and \(Y\) be the side that is hidden. Then our knowledge \(D\) includes \(X + Y = 1\). If \(D\) contains no reason to guess one outcome over the other, then:

\[ \e{X|D} = \e{Y|D} = 0.5 \]

More generally, if \(X_1, \ldots, X_n\) are unknowns that sum to \(s\), and \(D\) is such that we may arbitrarily relabel them without changing our findings, then for all \(i\) we have:

\[ \e{X_i|D} = s/n \]
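As a quick illustration, here is the coin flip in the same toy weighted-average model, with two made-up worlds of equal weight since \(D\) gives no reason to favour either:

  # Two worlds: heads showing (X=0, Y=1) or tails showing (X=1, Y=0).
  # Indifference: relabelling X and Y changes nothing, so the weights are equal.
  worlds = [{"X": 0, "Y": 1}, {"X": 1, "Y": 0}]
  weights = [0.5, 0.5]

  def estimate(name):
      return sum(w[name] * p for w, p in zip(worlds, weights))

  print(estimate("X"), estimate("Y"))    # 0.5 0.5
  print(estimate("X") + estimate("Y"))   # 1.0, since X + Y = 1 in every world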

I prefer the term estimate to Dupré and Tipler’s plausible value, because to me, it sounds strange to say that 0.5 is a "plausible" value for a coin toss; surely only 0 and 1 are plausible? Estimating a coin toss to be 0.5 may also sound strange, but at least the word "estimate" gives us license to pick any real number.

Also, the word "estimate" resembles "estimator", a term bandied around by sampling theorists to vaguely mean a number near a given unknown. We’re doing the same, except with less hand-waving. As MacKay writes, "there is no clear principle for deciding which criterion to use to measure the performance of an estimator". In contrast, we shall clearly state properties satisfied by estimates.

Order

If you live on the equator at sea level, then it’s not freezing outside, so your estimate of the temperature ought to be at least zero degrees Celsius.

More generally, if \(D\) implies \(X \le Y\), then we require:

\[\e{X|D} \le \e{Y|D}\]

This order consistency axiom injects a large dose of sanity. If we know \(X\) lies in a certain range, then any estimate of \(X\) better lie within that range. In the extreme, if \(D\) implies \(X = Y\) then \(\e{X|D} = \e{Y|D}\).

A tinge of insanity lingers in that strict inequalities have no extra power. We’re free to give identical estimates to \(X\) and \(Y\) even if we know \(X \gt Y\). On the other hand, perhaps it’s reasonable to permit the same estimate for unknowns that are arbitrarily close.
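In the toy weighted-average model, order consistency holds automatically: a weighted average of values lying in a range stays in that range, and pointwise inequalities survive averaging. A small sketch with made-up numbers:

  def estimate(values, weights):
      total = sum(weights)
      return sum(v * w for v, w in zip(values, weights)) / total

  temps   = [0, 5, 12, 30]          # possible temperatures, none below freezing
  weights = [0.1, 0.3, 0.4, 0.2]

  e = estimate(temps, weights)
  print(min(temps) <= e <= max(temps))   # True: the estimate respects the known bounds

  hotter = [t + 3 for t in temps]        # an unknown that exceeds temps in every case
  print(e <= estimate(hotter, weights))  # True: order is preserved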

Sums

Ideally, we’d like to compute estimates from other estimates.

Take \(\e{X+Y|D}\). If it is at all possible to compute this from other estimates, it must be a computation of the form:

\[ \e{X+Y|D} = f(\e{X|D},\e{Y|D}) \]

for some function \(f\). By considering the special cases when \(X\) or \(Y\) is a real constant (for instance, \(Y = s\) gives \(f(\e{X|D}, s) = \e{X+s|D} = \e{X|D} + s\) by strong rescaling, and both arguments can take any real value), we conclude \(f\) must be addition:

\[ \e{X+Y|D} = \e{X|D} + \e{Y|D} \]

It turns out we can derive this from a weaker assumption, the axiom of restricted additivity:

\[ \e{X_1|D} = \e{X_2|D} \implies \e{X_1 + Y|D} = \e{X_2 + Y|D} \]

That is, as a function of \(X\), the value \(\e{X+Y|D}\) only depends on \(\e{X|D}\).

We’ll have no use for this property, but we’ll soon meet another like it that we do want.
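In the toy weighted-average model, estimates are additive even for thoroughly dependent unknowns, which is exactly what the argument above demands. A quick check with made-up numbers:

  def estimate(values, weights):
      total = sum(weights)
      return sum(v * w for v, w in zip(values, weights)) / total

  weights = [0.2, 0.5, 0.3]
  xs = [1.0, 2.0, 4.0]
  ys = [1 - x for x in xs]          # Y = 1 - X: about as dependent as it gets

  lhs = estimate([x + y for x, y in zip(xs, ys)], weights)
  rhs = estimate(xs, weights) + estimate(ys, weights)
  print(lhs, rhs)                   # 1.0 1.0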

Interval Arithmetic?

What about \(\e{XY|D}\)? If it is possible to compute the estimate of a product from other estimates, then arguing along the same lines leads to:

\[ \e{XY|D} \overset{?}{=} \e{X|D} \e{Y|D} \]

John Napier pioneered interval arithmetic, which shares similarities with our methods: pin each unknown between two bounds, and compute on unknowns by performing arithmetic on the corresponding bounds, confident that the true answer lies somewhere between the output bounds. Interval arithmetic uses something like this equation.

However, interval arithmetic is plagued by the "dependency" problem. For example, if we know \(X\) lies in \([0..1]\), then so does \(1 - X\). Multiplying corresponding bounds tells us that \(X(1 - X)\) also lies in \([0..1]\), but this upper bound is too large by a factor of four; in fact \( X(1 - X) \) lies in \([0..0.25]\), attaining the maximum when \(X = 0.5\).

It’s even worse for our coin flip example where \(X\) and \(Y = 1 - X\) are two sides of the same coin. We know \(XY = 0\) either way, but interval arithmetic will never discover this.

Our questionable equation above suffers from the same affliction and fails to show \(\e{XY|D} = 0\). Like interval arithmetic, we get the wrong answer because the equation treats \(X\) and \(1 - X\) independently.
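Here is a short sketch of both failures; the interval endpoints and the coin estimates come from the examples above:

  # Interval product: try every pairing of endpoints and keep the extremes.
  def interval_mul(a, b):
      products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
      return (min(products), max(products))

  x = (0.0, 1.0)                    # bounds on X
  one_minus_x = (0.0, 1.0)          # bounds on 1 - X, computed separately
  print(interval_mul(x, one_minus_x))   # (0.0, 1.0): four times too wide,
                                        # since X * (1 - X) really lies in [0, 0.25]

  # The coin: XY = 0 in every world, yet multiplying the estimates gives:
  print(0.5 * 0.5)                  # 0.25, not 0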

Violating bounds can be tolerable in practice, but not for us. We’re not merely finding approximations that suffice for a problem at hand; we’re trying to precisely model reasoning in the face of uncertainty, so our estimates must always respect the tightest known bounds. Hence we must accept an uncomfortable truth: in general, there is no way to estimate products from estimates of factors.

Why are sums easy but products hard? The best answer I have is the following. The unknown \(X\) trivially has a dependency on the unknown \(X\), namely itself, and indeed this is about as dependent as it can get. But adding these together, even repeatedly, merely produces a constant multiple of \(X\), a special case of strong rescaling. Meanwhile, squaring \(X\) is a more intricate computation. Take order consistency: does \(X \le 0\) mean \(X^2 \le 0^2\)? One does not simply square both sides.

The Cox Axiom

Incredibly, we can salvage something valuable from the failed product above. We know trouble arises when a dependency exists between \(X\) and \(Y\). But it turns out we can naturally quantify such dependencies in some cases, leading to an axiom of fundamental importance.

Our task may seem impossible at first glance. The \(X\) and \(D\) of \(\e{X|D}\) inhabit different worlds, as \(X\) is a quantity we wish to estimate, while \(D\) represents all our knowledge and experience. How can we quantify a logical dependency between two numerical unknowns buried deep in our background knowledge?

The trick is to transport \(D\) to the world of \(X\). We express \(D\) as a logical proposition. We forbid \(D\) to be equivalent to FALSE to avoid something like division by zero: if our assumptions are inconsistent, then anything goes, so our estimates don’t matter and can take any value!

Next, we map FALSE and TRUE to 0 and 1 respectively, a time-honoured tradition, so:

  • conjunction becomes multiplication

  • \(A \implies B\) becomes \(A \le B\)

  • \(\bar{A}\) becomes \(1 - A\)

We saw a disguised example of this earlier when we mapped heads to 0 and tails to 1 for a coin flip. The unknown we constructed is equivalent to the proposition "the coin shows tails".
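A tiny Python check of these encodings over all truth assignments:

  for a in (0, 1):
      for b in (0, 1):
          assert (a and b) == a * b           # conjunction is multiplication
          assert ((not a) or b) == (a <= b)   # A implies B becomes A <= B
      assert 1 - a == (0 if a else 1)         # negation becomes 1 - A
  print("encodings check out")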

Then given a proposition \(A\), we may not know whether it’s 1 or 0 (true or false), but we can estimate it just like any other unknown. We have gained the ability to place a proposition on either side of the vertical bar. Thus, if it is at all possible to compute the estimate of the product \(\e{XA|B}\) from other estimates, we now have:

\[ \e{XA|B} = f(\e{X|AB}, \e{A|B}, \e{X|B}) \]

for some function \(f\). Comparing against our previous attempt, we notice a new term: \(\e{X|AB}\). Will its presence help?

Set \(X = r \bar{A} + s\). Then:

\[ \e{X A | B} = \e{r \bar{A} A + s A | B} = s \e{A | B} \]

is independent of \(r\).

Setting \(k = \e{\bar{A}|B}\):

\[ \e{X | B} = r \e{\bar{A}|B} + s = r k + s \]

Also, \(\e{X|AB} = r\e{\bar{A}|AB} + s = s\), since \(AB\) forces \(\bar{A} = 0\). So if \(k \ne 0\), we can choose \(r\) so that \(\e{X|B}\) takes any desired value while \(\e{XA|B}\), \(\e{X|AB}\) and \(\e{A|B}\) all stay fixed, hence the function \(f\) must be independent of \(\e{X|B}\):

\[ \e{XA|B} = f(\e{X|AB}, \e{A|B}) \]

If \(k = 0\), then \(B\) implies \(A\) hence \(\e{X|B} = \e{XA|B}\), so again \(f\) can be computed from \(\e{X|AB}\) and \(\e{A|B}\).

This equation seems plausible, and looks useful. Nothing seems to break, so we propose the Cox axiom:

\[ \e{X_1|AB} = \e{X_2|AB} \implies \e{X_1 A|B} = \e{X_2 A|B} \]

That is, the value of \(\e{XA|B}\) depends only on \(\e{X|AB}\) when viewed as a function of \(X\). The Cox axiom is weaker than assuming the above function \(f\) because we allow a different function for each choice of \(A\) and \(B\).

Let’s perform a quick sanity check. This axiom makes sense when \(A\) is independent of \(X_1\) and \(X_2\). How about the other extreme? Suppose \(X_1 = A\) and \(X_2\) is some unknown, independent of \(A\), that happens to equal 1 whenever \(B\) holds. Then both sides of the first equation in the axiom are equal to 1, and both sides of the second equation are equal to \(\e{A|B}\) (since Boolean algebra decrees that \(A^2 = A\)).
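Here is that sanity check run numerically in a toy worlds-and-weights model; the worlds, values, and weights are invented for illustration:

  worlds = [  # (A, B, X2, weight)
      (1, 1, 1, 0.3),
      (0, 1, 1, 0.2),
      (1, 0, 5, 0.1),
      (0, 0, 7, 0.4),
  ]

  def estimate(value, condition):
      # <value | condition>: weighted average over the worlds where condition holds
      num = sum(value(w) * condition(w) * w[3] for w in worlds)
      den = sum(condition(w) * w[3] for w in worlds)
      return num / den

  A  = lambda w: w[0]
  B  = lambda w: w[1]
  X1 = A                            # X1 = A
  X2 = lambda w: w[2]               # equals 1 wherever B holds
  AB = lambda w: w[0] * w[1]

  print(estimate(X1, AB), estimate(X2, AB))   # 1.0 1.0: first equation checks out
  print(estimate(lambda w: X1(w) * A(w), B),
        estimate(lambda w: X2(w) * A(w), B),
        estimate(A, B))                       # all 0.6: both sides equal <A|B>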

Probability

From the Cox axiom, for fixed propositions \(A\) and \(B\), we have:

\[ \e{XA|B} = f(\e{X|AB}) \]

for some function \(f : \mathbb{R} \rightarrow \mathbb{R} \).

When \(X = r\) for some real number \(r\), we have

\[ r \e{A|B} = \e{rA|B} = f(\e{r|AB}) = f(r) \]

Since this holds for all \(r\), we see \(f\) is multiplication by \(\e{A|B}\). Thus:

\[ \e{XA|B} = \e{X|AB} \e{A|B} \]
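Here is a numerical check of this identity in the same style of toy model; the worlds, values, and weights are again invented for illustration:

  worlds = [  # (X, A, B, weight)
      (3.0, 1, 1, 0.25),
      (7.0, 1, 1, 0.15),
      (2.0, 0, 1, 0.30),
      (9.0, 1, 0, 0.10),
      (4.0, 0, 0, 0.20),
  ]

  def estimate(value, condition):
      num = sum(value(w) * condition(w) * w[3] for w in worlds)
      den = sum(condition(w) * w[3] for w in worlds)
      return num / den

  X  = lambda w: w[0]
  A  = lambda w: w[1]
  B  = lambda w: w[2]
  AB = lambda w: w[1] * w[2]
  XA = lambda w: w[0] * w[1]

  print(estimate(XA, B))                      # 2.5714...
  print(estimate(X, AB) * estimate(A, B))     # 2.5714..., matching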

For any proposition \(A\), define the probability of \(A\) under the condition \(B\) by:

\[ P(A|B) = \e{A|B} \]

Thus a probability is an estimate of the truth of a proposition. This fits the Bayesian view that a probability is a degree of belief.

If \(A, B, C\) are propositions, we have:

\[ P(AB|C) = P(A|BC) P(B|C) \]

This is the product rule of probability.

Recall that \(\bar{A} = 1 - A\), so the strong rescaling axiom implies:

\[ P(A|C) + P(\bar{A}|C) = 1 \]

This is the sum rule of probability.

Hence we can derive the remaining usual rules of probability. For example, following Jaynes, via the product and sum rules:

\[ \begin{aligned} P(A\vee B|C) &= 1 - P(\bar{A}\bar{B}|C) = 1 - P(\bar{A}|C) P(\bar{B}|\bar{A}C) \\ &= 1 - P(\bar{A}|C) (1 - P(B|\bar{A}C)) = P(A|C) + P(\bar{A}B|C) \\ &= P(A|C) + P(B|C) P(\bar{A}|BC) = P(A|C) + P(B|C)(1 - P(A|BC)) \\ &= P(A|C) + P(B|C) - P(AB|C) \end{aligned} \]

We can show this more directly via the additivity axiom since \(A \vee B = A + B - AB\) but it’s nice we can do without it.
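A numerical check of the result, in the same toy model with made-up worlds and weights:

  worlds = [  # (A, B, C, weight)
      (1, 1, 1, 0.10),
      (1, 0, 1, 0.25),
      (0, 1, 1, 0.15),
      (0, 0, 1, 0.20),
      (1, 1, 0, 0.30),
  ]

  def P(event, condition):
      num = sum(event(w) * condition(w) * w[3] for w in worlds)
      den = sum(condition(w) * w[3] for w in worlds)
      return num / den

  A = lambda w: w[0]
  B = lambda w: w[1]
  C = lambda w: w[2]
  A_or_B = lambda w: w[0] + w[1] - w[0] * w[1]
  AB = lambda w: w[0] * w[1]

  print(P(A_or_B, C))                     # 0.7142...
  print(P(A, C) + P(B, C) - P(AB, C))     # 0.7142..., matching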

Recap

What an adventure! We started by formalizing some common-sense rules for estimating an unknown number, and a few axioms later, we discover the laws of probability!

Our five axioms:

  1. The set of unknowns \(T\) is a partially-ordered commutative algebra over \(\RR\) with identity. The set of propositions \(E\) is a sub-Boolean algebra of the idempotents of \(T\) inheriting its partial order, and contains 0 and 1. Write \(E_0 = E \setminus\{0\}\). We assume a function:

    \[T\times E_0 \rightarrow \RR\]

    whose value on \((X,D)\) we denote by \(\e{X|D}\).

  2. STRONG RESCALING: For all \( r, s\in\RR, X\in T, D\in E_0 \)

    \[\e{rX+s | D} = r\e{X|D} + s\]
  3. ORDER CONSISTENCY: For all \( X, Y\in T, D\in E_0 \)

    \[(D \implies X \le Y) \implies \e{X|D} \le \e{Y|D}\]
  4. THE COX AXIOM: For all \( X_1, X_2\in T, A,B\in E, AB \ne 0 \)

    \[\e{X_1|AB} = \e{X_2|AB} \implies \e{X_1A|B} = \e{X_2A|B}\]
  5. RESTRICTED ADDITIVITY: For all \( X_1, X_2, Y\in T, D\in E_0 \)

    \[\e{X_1|D} = \e{X_2|D} \implies \e{X_1 + Y|D} = \e{X_2 + Y|D}\]

The last axiom is nice to have, but unneeded for probability theory.

For \(A \in E, B \in E_0\), we define the conditional probability of \(A\) given \(B\) by:

\[ P(A|B) = \e{A|B} \]

The axioms imply the sum rule: for all \(A \in E, C \in E_0\)

\[ P(A|C) + P(\bar{A}|C) = 1 \]

as well as the product rule: for all \(A, B \in E, C \in E_0, BC \ne 0\)

\[ P(AB|C) = P(A|BC) P(B|C) \]

References

Most of the above comes from Maurice J. Dupré and Frank J. Tipler, New Axioms for Rigorous Bayesian Probability, which opens with a brief history of the foundations of Bayesian probability theory.

Dupré and Tipler combine ideas from Cox and de Finetti to get a simpler and cleaner justification of probability theory. See also their earlier paper: Maurice J. Dupré and Frank J. Tipler, The Cox Theorem: Unknowns and Plausible Value.

E. T. Jaynes, Probability Theory: The Logic of Science, champions Cox’s approach. He admires de Finetti, but has reservations about deriving probability from betting preferences. He also explains and develops the principle of indifference.

Jaynes also sketches a proof of Cox’s theorem, which requires assuming continuity to allow us to apply a theorem on "the associativity equation" (whose proof takes eleven pages of Aczél 1966), or alternatively, assuming differentiability for a shorter proof, though it remains hairy, unlike the elementary proofs of Dupré and Tipler.


Ben Lynn blynn@cs.stanford.edu