# 9.1 Probability Theory

The branch of mathematics known as probability theory provides one way of making
inferences regarding uncertain propositions. But it is not a priori clear that it is the only
reasonable way to go about making such inferences. This is important for psychology because it
would be nice to assume, as a working hypothesis, that the mind uses the rules of probability
theory to process its perceptions. But if the rules of probability theory were just an arbitrary
selection from among a disparate set of possible schemes for uncertain inference, then there
would be little reason to place faith in this hypothesis.
Historically, most attempts to derive general laws of probability have been "frequentist" in
nature. According to this approach, in order to say what the statement "the probability of X
occurring in situation E is 1/3" means, one must invoke a whole "ensemble" of situations. One
must ask: if I selected a situation from among an ensemble of n situations "identical" to E, what
proportion of the time would X be true? If, as n tended toward infinity, this proportion tended
toward 1/3, then it would be valid to say that the probability of X occurring in situation E is 1/3.
In some cases this approach is impressively direct. For instance, consider the proposition:
"The face showing on the fair six-sided die I am about to toss will be either a two or a three".
Common sense indicates that this proposition has probability 1/3. And if one looked at a large
number of similar situations — i.e. a large number of tosses of the same die or "identical" dice —
then one would indeed find that, in the long run, a two or a three came up 1/3 of the time.
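The long-run convergence in this example is easy to simulate. A minimal Python sketch (the seed and sample sizes are arbitrary choices for illustration):

```python
import random

random.seed(0)

# Simulate n tosses of a fair six-sided die and return the relative
# frequency of the event "a two or a three comes up".
def relative_frequency(n):
    hits = sum(1 for _ in range(n) if random.randint(1, 6) in (2, 3))
    return hits / n

for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(n))  # proportion approaches 1/3 as n grows
```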
But often it is necessary to assign probabilities to unique events. In such cases, the frequency
interpretation has no meaning. This occurs particularly often in geology and ecology: one wishes
to know the relative probabilities of various outcomes in a situation which is unlikely ever to
recur. When the problem has to do with a bounded region of space, say a forest, it is possible to
justify this sort of probabilistic reasoning using complicated manipulations of integral calculus.
But what is really required, in order to justify the general application of probability theory, is
some sort of proof that the rules of probability theory are uniquely well-suited for probable
inference.
Richard Cox (1961) has provided such a proof. First of all, he assumes that any possible rule
for assigning a "probability" to a proposition must obey the following two rules:
The probability of an inference on given evidence determines the probability of its
contradictory on the same evidence (p.3)
The probability on given evidence that both of two inferences are true is determined by their
separate probabilities, one on the given evidence, the other on this evidence with the additional
assumption that the first inference is true (p.4)
The probability of a proposition on certain evidence is the probability that logically should be
assigned to that proposition by someone who is aware only of this evidence and no other
evidence. In Boolean notation, the first of Cox’s rules says simply that if one knows the
probability of X on certain evidence, then one can deduce the probability of -X on that same
evidence without using knowledge about anything else. The second rule says that if one knows
the probability of X given certain evidence E, and the probability that Y is true given EX, then
one can deduce the probability that Y is true without using knowledge about anything else.
These requirements are hard to dispute; in fact, they don’t seem to say very much. But their
simplicity is misleading. In mathematical notation, the first requirement says that P(-X|E) =
f[P(X|E)], and the second requirement says that P(XY|E) = F[P(X|E), P(Y|XE)], where f and F
are unspecified functions. What is remarkable is that these functions need not remain
unspecified. Cox has shown that the laws of Boolean algebra dictate specific forms for these
functions.
For instance, they imply that C·G[P(XY|E)] = G[P(X|E)]·G[P(Y|XE)], where C is some
constant and G is some function. This is almost a proof that for any measure of probability P,
P(XY|E) = P(X|E)P(Y|XE). For if one sets G(x)=x and C=1, this rule is immediate. And, as Cox points
out, if P(X|E) measures probability, then so does G[P(X|E)] — at least, according to the two
axioms given above. The constant C may be understood by setting X=Y and recalling that
XX=X according to the axioms of Boolean algebra. It follows by simple algebra that C =
G[P(X|XE)] — i.e., C is the probability of X on the evidence X, the numerical value of
certainty. Typically, in probability theory, C=1. But this is a convention, not a logical
requirement.
As for negation, Cox has shown that if P(X) = f[P(-X)], Boolean algebra leads to the formula
x^r + [f(x)]^r = 1, for some constant r. Given this, we could leave r unspecified and use P(X)^r as the
measure of probability; but, following Cox, let us take r=1.
Cox’s analysis tells us in exactly what sense the laws of probability theory are arbitrary. All
the laws of probability theory can be derived from the rules P(X|E) = 1 - P(-X|E) and
P(XY|E) = P(X|E)P(Y|XE). And these rules are essentially the only ways of dealing with
negation and conjunction that Boolean algebra allows. So, if we accept Boolean algebra and
Cox’s two axioms, we accept probability theory.
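With C = 1, G(x) = x and r = 1, the two rules can be checked numerically against any discrete joint distribution. A minimal Python sketch, using a made-up distribution over two binary propositions (all the numbers are purely illustrative):

```python
from fractions import Fraction as F

# Made-up joint distribution over two binary propositions X and Y,
# conditioned on fixed background evidence E: joint[(x, y)] = P(X=x, Y=y | E).
joint = {
    (True, True):   F(1, 8),
    (True, False):  F(3, 8),
    (False, True):  F(1, 4),
    (False, False): F(1, 4),
}

p_X = joint[(True, True)] + joint[(True, False)]       # P(X|E)
p_XY = joint[(True, True)]                             # P(XY|E)
p_Y_given_XE = p_XY / p_X                              # P(Y|XE)
p_notX = joint[(False, True)] + joint[(False, False)]  # P(-X|E)

# Conjunction rule: P(XY|E) = P(X|E) * P(Y|XE)
assert p_XY == p_X * p_Y_given_XE
# Negation rule: P(X|E) = 1 - P(-X|E)
assert p_X == 1 - p_notX
print("Both rules hold for this distribution.")
```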
Finally, for a more concrete perspective on these issues, let us turn to the work of Krebs,
Kacelnik and Taylor (1978). These biologists studied the behavior of birds (great tits) placed in
an aviary containing two machines, each consisting of a perch and a food dispenser. One of the
machines dispenses food p% of the times that its perch is landed on, and the other one dispenses
food q% of the times that its perch is landed on. They observed that the birds generally visit the
two machines according to the optimal strategy dictated by Bayes’ rule and Laplace’s Principle of
Indifference — a strategy which is not particularly obvious. This is a strong rebuttal to those who
raise philosophical objections against the psychological use of probability theory. After all, if a
bird’s brain can use Bayesian statistics, why not a human brain?
BAYES’ RULE
Assume that one knows that one of the propositions Y1, Y2, …, Yn is true, and that only one of
these propositions can possibly be true. In mathematical language, this means that the collection
{Y1,…,Yn} is exhaustive and mutually exclusive. Then, Bayes’ rule says that

P(Yi|X) = P(Yi)P(X|Yi) / [P(Y1)P(X|Y1) + … + P(Yn)P(X|Yn)]
In itself this rule is unproblematic; it is a simple consequence of the two rules of probable
inference derived in the previous section. But it lends itself to controversial applications.
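As a quick numerical illustration of the rule, here is a small Python sketch; the two causes and their likelihoods are made up for the example:

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(Yi|X) from priors P(Yi) and likelihoods P(X|Yi),
    for an exhaustive, mutually exclusive set {Y1,...,Yn}."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Two exhaustive, mutually exclusive causes, with made-up numbers:
# equal priors, but the observed effect X is far more likely under Y1.
print(bayes_posterior([0.5, 0.5], [0.9, 0.1]))
```

With equal priors the posterior simply reweights the causes in proportion to their likelihoods, giving 0.9 and 0.1 here.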
For instance, suppose Y1 is the event that a certain star system harbors intelligent life which is
fundamentally dissimilar from us, Y2 is the event that it harbors intelligent life which is
fundamentally similar to us, and Y3 is the event that it harbors no intelligent life at all. Assume
these events have somehow been precisely defined. Suppose that X is a certain sequence of radio
waves which we have received from that star system, and that one wants to compute P(Y2|X):
the probability, based on the message X, that the system has intelligent life which is
fundamentally similar to us. Then Bayes’ rule applies: {Y1,Y2,Y3} is exhaustive and mutually
exclusive. Suppose that we have a good estimate of P(X|Y1), P(X|Y2), and P(X|Y3): the
probability that an intelligence dissimilar to us would send out message X, the probability that an
intelligence similar to us would send out message X, and the probability that an unintelligent star
system would somehow emit message X. But how do we know P(Y1), P(Y2) and P(Y3)?
We cannot deduce these probabilities directly from the nature of messages received from star
systems. So where do the P(Yi) come from? This problem, at least in theory, makes the
business of identifying extraterrestrial life extremely tricky. One might argue that it makes it
impossible, because the only things we know about stars are derived from electromagnetic
"messages" of one kind or another — light waves, radio waves, etc. But it seems reasonable to
assume that spectroscopic information, thermodynamic knowledge and so forth are separate from
the kind of message-interpretation we are talking about. In this case there might be some kind of
a priori physicochemical estimate of the probability of intelligent life, similar intelligent life, and
so forth. Carl Sagan, among others, has attempted to estimate such probabilities. The point is that
we need some kind of prior estimate for the P(Yi), or Bayes’ rule is useless here.
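To see concretely how much the answer depends on the priors, here is a sketch in which the likelihoods P(X|Yi) are held fixed while two different prior assignments are tried. Every number below is made up purely for illustration:

```python
def bayes(priors, likelihoods):
    """Posterior P(Yi|X) for exhaustive, mutually exclusive {Yi}."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Hypothetical likelihoods of receiving message X under each cause:
# Y1 = dissimilar intelligence, Y2 = similar intelligence, Y3 = no intelligence.
likelihoods = [0.05, 0.40, 0.001]

# Two different (equally made-up) prior estimates for P(Y1), P(Y2), P(Y3):
optimistic = bayes([0.05, 0.05, 0.90], likelihoods)
pessimistic = bayes([0.005, 0.005, 0.99], likelihoods)

# The same message X yields quite different values of P(Y2|X):
print(round(optimistic[1], 3), round(pessimistic[1], 3))
```

The likelihoods are identical in both runs; only the priors differ, yet the posterior probability of similar intelligent life changes substantially. This is exactly why some prior estimate of the P(Yi) is indispensable.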
This example is not atypical. In general, suppose that X is an effect, and {Yi} is the set of
possible causes. Then to estimate P(Y1|X) is to estimate the probability that Y1, and none of the
other Yi, is the true cause of X. But in order to estimate this using Bayes’ rule, it is not enough to
know how likely X is to follow from Yi, for each i. One needs to know the probabilities P(Yi) —
one needs to know how likely each possible cause is, in general.
One might suppose these problems to be a shortcoming of Bayes’ rule, or of probability theory
in general. But this is where Cox’s demonstration proves invaluable. Any set of rules for uncertain
reasoning which satisfies his two simple, self-evident axioms must necessarily lead to Bayes’ rule,
or something essentially equivalent with a few G’s and r’s floating around. Any reasonable set of
rules for uncertain reasoning must be essentially identical to probability theory, and must
therefore have no other method of deducing causes from effects than Bayes’ rule.
The perceptive reader might, at this point, accuse me of inconsistency. After all, it was
observed above that quantum events may be interpreted to obey a different sort of logic. And in
Chapter 8 I raised the possibility that the mind employs a weaker "paraconsistent" logic rather
than Boolean logic. How then can I simply assume that Boolean algebra is applicable?
However, the inconsistency is only apparent. Quantum logic and paraconsistent logic are both
weaker than Boolean logic, and they therefore cannot lead to any formulas which are not also
formulas of Boolean logic: they cannot improve on Bayes’ rule.
So how do we assign prior probabilities, in practice? It is not enough to say that this comes
down to instinct, to biological programming; it is possible to say something about how this
programming works.
THE PRINCIPLE OF INDIFFERENCE
Laplace’s "Principle of Indifference" states that if a question is known to have exactly n
possible answers, and these answers are mutually exclusive, then in the absence of any other
knowledge one should assume each of these answers to have probability 1/n of being correct.
For instance, suppose you were told that on the planet Uxmylarqg, the predominant intelligent
life form is either blue, green, or orange. Then, according to the Principle of Indifference, if this
were the only thing you knew about Uxmylarqg, you would assign a probability of 1/3 to the
statement that it is blue, a probability of 1/3 to the statement that it is green, and a probability of
1/3 to the statement that it is orange. In general, according to the Principle of Indifference, if one
had no specific knowledge about the n causes {Y1,…,Yn} which appear in the above formulation
of Bayes’ rule, one would assign a probability P(Yi)=1/n to each of them.
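A small sketch of the Principle in use: with indifferent priors P(Yi) = 1/n, Bayes’ rule reduces to normalizing the likelihoods, so each candidate cause is weighted only by how well it predicts the observed effect. The likelihood values here are made up for illustration:

```python
def indifference_prior(n):
    # Laplace's Principle of Indifference: each of n mutually exclusive,
    # exhaustive answers gets probability 1/n.
    return [1.0 / n] * n

def bayes(priors, likelihoods):
    """Posterior P(Yi|X) for exhaustive, mutually exclusive {Yi}."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# With indifferent priors, the posterior equals the normalized likelihoods:
likelihoods = [0.2, 0.3, 0.5]  # made-up values of P(X|Yi)
posterior = bayes(indifference_prior(3), likelihoods)
print(posterior)
```

Since the made-up likelihoods already sum to one, the posterior reproduces them exactly: the uniform prior contributes no information of its own, which is precisely the "knowledge from ignorance" at issue.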
Cox himself appears to oppose the Principle of Indifference, arguing that "the knowledge of a
probability, though it is knowledge of a particular and limited kind, is still knowledge, and it
would be surprising if it could be derived from… complete ignorance, asserting nothing". And in
general, that is exactly what the Principle of Indifference does: supplies knowledge from
ignorance. In certain specific cases, it may be proved to be mathematically correct. But, as a
general rule of uncertain inference, it is nothing more or less than a way of getting something out
of nothing. Unlike Cox, however, I do not find this surprising or undesirable, but rather exactly
what the situation calls for.
Source: A New Mathematical Model of Mind