# 9.2 The Maximum Entropy Principle

If the Principle of Indifference tells us what probabilities to assign given no background
knowledge, what is the corresponding principle for the case when one does have some
background knowledge? Seeking to answer this question, E.T. Jaynes studied the writings of J.
Willard Gibbs and drew therefrom a rule called the maximum entropy principle. Like the
Principle of Indifference, the maximum entropy principle is provably correct in certain special
cases, but in the general case, justifying it or applying it requires ad hoc, something-out-of-
nothing assumptions.
The starting point of the maximum entropy principle is the entropy function
H(p1,…,pn) = – [p1logp1 + p2logp2 + … + pnlogpn],
where {Yi} is an exhaustive, mutually exclusive collection of events and pi=P(Yi). This function
first emerged in the work of Boltzmann, Gibbs and other founders of thermodynamics, but its
true significance was not comprehended until Claude Shannon published The Mathematical Theory of
Communication (1949). It is a measure of the uncertainty involved in the distribution {pi}.
The entropy is never negative. If, say, (p1,…,pn)=(0,0,1,…,0,0,0), then the entropy H(p1,…,pn) is
zero — because this sort of distribution has the minimum possible uncertainty. It is known which
of the Yi is the case, with absolute certainty. On the other hand, if (p1,…,pn)=(1/n,1/n,…,1/n), then
H(p1,…,pn)=logn, which is the maximum possible value. This represents the maximum possible
uncertainty: each possibility is equally likely.
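The two extreme cases just described can be checked numerically. The following is a minimal sketch: the function name `entropy` and the use of natural logarithms are choices made here for illustration (the text does not fix the base of the logarithm), and terms with pi=0 are taken to contribute zero, by the usual convention.

```python
import math

def entropy(p):
    """Entropy H(p) = sum_i p_i * log(1 / p_i), skipping terms with p_i = 0."""
    return sum(x * math.log(1 / x) for x in p if x > 0)

certain = [0, 0, 1, 0, 0, 0]   # one outcome is known with certainty
uniform = [1 / 6] * 6          # every outcome is equally likely

print(entropy(certain))        # zero: no uncertainty
print(entropy(uniform))        # agrees with log(6) up to rounding
print(math.log(6))
```

Any other distribution over six outcomes gives a value strictly between these two.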
The maximum entropy principle states that, for any exhaustive, mutually exclusive set of
events (Y1,…,Yn), the most likely probability distribution (p1,…,pn) with respect to a given set of
constraints on the Yi is that distribution which, among all those that satisfy the constraints, has
maximum entropy. The "constraints" represent particular knowledge about the situation in
question; they are what distinguishes one problem from another.
For instance, what if one has absolutely no knowledge about the various possibilities Yi?
Then, where pi=P(Yi), can we determine the "most likely" distribution (p1,…,pn) by finding the
distribution that maximizes H(p1,…,pn)? It is easy to see that, given no additional constraints, the
maximum of H(p1,…,pn) occurs for the distribution (p1,…,pn)=(1/n,1/n,…,1/n). In other words,
when there is no knowledge whatsoever about the Yi, the maximum entropy principle reduces to
the Principle of Indifference.
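The claim that the unconstrained maximum is the uniform distribution can be spelled out with a Lagrange multiplier. What follows is a sketch of the standard argument; the multiplier symbol \lambda is introduced here and is not part of the notation used elsewhere in this section. The only constraint is that the probabilities sum to one.

```latex
\begin{align*}
L(p_1,\dots,p_n,\lambda)
  &= -\sum_{i=1}^{n} p_i \log p_i + \lambda\Big(\sum_{i=1}^{n} p_i - 1\Big), \\
\frac{\partial L}{\partial p_i}
  &= -\log p_i - 1 + \lambda = 0
  \quad\Longrightarrow\quad p_i = e^{\lambda - 1}\ \text{for every } i, \\
\sum_{i=1}^{n} p_i = 1
  &\quad\Longrightarrow\quad p_i = \frac{1}{n},
  \qquad H\Big(\tfrac{1}{n},\dots,\tfrac{1}{n}\Big) = \log n .
\end{align*}
```

Because H is strictly concave, this stationary point is the unique maximum.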
MAXIMUM ENTROPY WITH LINEAR CONSTRAINTS
In thermodynamics the Yi represent, roughly speaking, the different possible regions of space
in which a molecule can be; pi is the probability that a randomly chosen molecule is in region Yi.
Each vector of probabilities (p1,…,pn) is a certain distribution of molecules amongst regions. The
question is, what is the most likely way for the molecules to be distributed? One assumes that
one knows the energy of the distribution, which is of the form E(p1,…,pn)=c1p1+…+cnpn, where
the {ci} are constants obtained from basic physical theory. That is, one assumes that one knows
an equation E(p1,…,pn)=K. Under this assumption, the answer to the question is: the most likely
(p1,…,pn) is the one which, among all those possibilities that satisfy the equation E(p1,…,pn)=K,
maximizes the entropy H(p1,…,pn). There are several other methods of obtaining the most likely
distribution, but this is by far the easiest.
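For a linear constraint of this kind the maximization can be carried out in closed form. The derivation below is a sketch of the standard Lagrange multiplier argument; the multiplier symbols \mu and \beta are assumptions introduced here, not notation from the text.

```latex
\begin{align*}
L &= -\sum_{i=1}^{n} p_i \log p_i
     + \mu\Big(\sum_{i=1}^{n} p_i - 1\Big)
     - \beta\Big(\sum_{i=1}^{n} c_i p_i - K\Big), \\
\frac{\partial L}{\partial p_i}
  &= -\log p_i - 1 + \mu - \beta c_i = 0
  \quad\Longrightarrow\quad
  p_i = \frac{e^{-\beta c_i}}{\sum_{j=1}^{n} e^{-\beta c_j}}, \\
&\text{with } \beta \text{ fixed by the requirement } \sum_{i=1}^{n} c_i p_i = K .
\end{align*}
```

The maximizing distribution is therefore exponential in the coefficients ci, which is the familiar form of the distributions obtained in statistical thermodynamics.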
What is remarkable is that this is not just an elegant mathematical feature of classical
thermodynamics. In order to do the maximum entropy principle justice, we should now consider
its application to quantum density matrices, or radio astronomy, or numerical linear algebra. But
this would take us too far afield. Instead, let us consider Jaynes’s "Brandeis dice problem", a
puzzle both simple and profound.
Consider a six-sided die, each side of which may have any number of spots between 1 and 6.
The problem is (Jaynes, 1978):
suppose [this] die has been tossed N times, and we are told only that the average number of
spots up was not 3.5, as we might expect from an ‘honest’ die but 4.5. Given this information,
and nothing else, what probability should we assign to i spots on the next toss? (p.49)
Let Yi denote the event that the next toss yields i spots; let pi=P(Yi). The information we have
may be expressed as an equation of the form A(p1,…,pn)=4.5, where n=6 and A(p1,…,pn)=1·p1+2·p2+…+n·pn is
the average number of spots implied by the pi. This equation says: whatever the most likely distribution of probabilities is,
it must yield an average of 4.5, which is what we know the average to be.
The maximum entropy principle says: given that the average number of spots up is 4.5, the
most likely distribution (p1,…,pn) is the one that, among all those satisfying the constraint
A(p1,…,pn)=4.5, maximizes the entropy H(p1,…,pn). This optimization problem is easily solved
using Lagrange multipliers, and it has the approximate solution (p1,…,pn) = (.05435, .07877,
.11416, .16545, .23977, .34749). If one had A(p1,…,pn)=3.5, the maximum entropy principle
would yield the solution (p1,…,pn)=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6); but, as one would expect,
knowing that the average is 4.5 makes the higher numbers more likely and the lower numbers
less likely.
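The Lagrange multiplier calculation can be reproduced in a few lines. The sketch below takes the constraint to be that the expected number of spots, 1·p1+2·p2+…+6·p6, equals 4.5. Under that constraint the Lagrange conditions give pi proportional to exp(lam·i) for a single unknown multiplier lam. The function name `dice_maxent`, the bisection search and the bracketing interval are choices made here for illustration; they are not taken from Jaynes.

```python
import math

def dice_maxent(target_mean, faces=6, tol=1e-12):
    """Maximum entropy distribution on 1..faces with a prescribed mean.

    The Lagrange conditions give p_i proportional to exp(lam * i);
    lam is found by bisection, since the mean increases with lam.
    """
    def distribution(lam):
        weights = [math.exp(lam * i) for i in range(1, faces + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(i * p for i, p in enumerate(distribution(lam), start=1))

    lo, hi = -20.0, 20.0  # assumed wide enough for means well inside (1, faces)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return distribution((lo + hi) / 2)

p = dice_maxent(4.5)
print([round(x, 5) for x in p])
```

Calling `dice_maxent(3.5)` drives the multiplier to zero and returns the uniform distribution, in agreement with the remark above about an honest die.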
For the Brandeis dice problem, as in the case of classical thermodynamics, it is possible to
prove mathematically that the maximum entropy solution is far more likely than any other
solution. And in both these instances the maximization of entropy appears to be the most
efficacious method of locating the optimal solution. The two situations are extremely similar:
both involve essentially random processes (dice tossing, molecular motion), and both involve
linear constraints (energy, average). Here the maximum entropy principle is at its best.
MAXIMUM ENTROPY AS A GENERAL RULE OF INFERENCE
The maximum entropy principle is most appealing when one is dealing with linear constraints.
There is a simple, straightforward proof of its correctness. But when talking about the general
task of intelligence, we are not necessarily restricted to linear constraints. Evans (1978) has
attempted to surmount this obstacle by showing that, given any constraint F(p1,…,pn)=K, the
overwhelmingly most likely values pi=P(Yi) may be found by maximizing the entropy of (p1,…,pn) relative to (k1,…,kn),
H(p1,…,pn; k1,…,kn) = – [p1log(p1/k1) + … + pnlog(pn/kn)],
where k=(k1,k2,…,kn) is some "background distribution". The trouble with this approach is that
the only known way of determining k is through a complicated sequence of calculations
involving various ensembles of events.
Shore and Johnson (1980) have provided an alternate approach, which has been refined
considerably by Skilling (1989). Extending Cox’s proof that probability theory is the only
reasonable method for uncertain reasoning, Shore and Johnson have proved that if there is any
reasonably general method for assigning prior probabilities in Bayes’ Theorem, it has to depend
in a certain way upon the entropy. Here we will not require all the mathematical details; the
general idea will suffice.
Where D is a subset of {Yi}, and C is a set of constraints, let f[D|C] denote the probability
distribution assigned to the domain D on the basis of the constraints C. Let m={m1,m2,…,mn}
denote some set of "background information" probabilities. For instance, if one actually has no
background information, one might want to implement the Principle of Indifference and assume
mi=1/n, for all i.
Assume f[D|C] is intended to give the most likely probability distribution for D, given the
constraints C. Then one can derive the maximum entropy principle from the following axioms:
Axiom I: Subset Independence
If constraint C1 applies in domain D1 and constraint C2 applies in domain D2, then
f[D1|C1] ∪ f[D2|C2] = f[D1 ∪ D2 | C1 ∪ C2]. (Basically, this means that if the constraints involved
do not interrelate D1 and D2, neither should the answer). This implies that f[D%C] can be
obtained by maximizing over a sum of the form S(p,m)=m1Q(p1)+…+mnQ(pn), where Q is some
function.
Axiom II: Coordinate Invariance
This is a technical requirement regarding the way that f[(p1,…,pn)|C] relates to
f[(p1/q1,…,pn/qn)|C]: it states that if one expresses the regions in a different coordinate system,
the probabilities do not change. It implies that S(p,m)=m1Q(p1/m1)+…+mnQ(pn/mn).
Axiom III: System Independence
Philosophically, this is the crucial requirement. "If a proportion q of a population has a certain
property, then the proportion of any sub-population having that property should properly be
assigned as q…. For example, if 1/3 of kangaroos have blue eyes… then [in the absence of
knowledge to the contrary] the proportion of left-handed kangaroos having blue eyes should be
1/3"
It can be shown that these axioms imply that f[Y|C] is proportional to the maximum of the
entropy H(p1,…,pn) subject to the constraints C, whatever the constraints C may be (linear or
not). And since it must be proportional to the entropy, one may as well take it to be equal to the
entropy.
These axioms are reasonable, though nowhere near as compelling as Cox’s stunningly simple
axioms for probable inference. They are not simply mathematical requirements; they have a great
deal of philosophical substance. What they do not tell you, however, is by what amount the most
likely solution f[Y|C] is superior to all other solutions. This requires more work.
More precisely, one way of summarizing what these axioms show is as follows. Let
m=(m1,…,mn) be some vector of "background" probabilities. Then f[D|C] must be assigned by
maximizing the function
S(p,m)=[p1-m1-p1log(p1/m1)]+…+[pn-mn-pnlog(pn/mn)].
Evans has shown that, for any constraint C, there is some choice of m for which the maximum
entropy principle gives a distribution which is not only correct but dramatically more likely
than any other distribution. It is implicit, though not actually stated, in his work that given the
correct vector (m1,…,mn), the prior probabilities {pi} in Bayes’ formula must be given by
pi = exp[aS/Z],
where S=S(p,m) as given above, Z=exp(aS)/[n(p1p2…pn)^(1/2)], and a is a parameter to be discussed
below. Skilling has pointed out that, in every case for which the results have been calculated for
any (m1,…,mn), with linear or nonlinear constraints, this same formula has been the result. He
has given a particularly convincing example involving the Poisson distribution.
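To get a feeling for the function S(p,m) it helps to evaluate it directly. The sketch below assumes all pi and mi are strictly positive; the function name `relative_entropy` and the example vectors are illustrative choices. Each bracketed term is at most zero and vanishes only when pi=mi, so S(p,m) is never positive and is zero exactly when p coincides with the background model m.

```python
import math

def relative_entropy(p, m):
    """S(p, m) = sum_i [p_i - m_i - p_i * log(p_i / m_i)], for positive p_i and m_i."""
    return sum(pi - mi - pi * math.log(pi / mi) for pi, mi in zip(p, m))

m = [0.25, 0.25, 0.25, 0.25]       # background model from the Principle of Indifference
p_same = [0.25, 0.25, 0.25, 0.25]  # identical to the background model
p_far = [0.70, 0.10, 0.10, 0.10]   # far from the background model

print(relative_entropy(p_same, m))  # zero, the largest possible value
print(relative_entropy(p_far, m))   # negative
```

In the absence of constraints, then, maximizing S(p,m) simply returns the background model; constraints pull the answer away from m only as far as they must.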
In sum: the maximum entropy principle appears to be a very reasonable general method for
estimating the best prior probabilities; and it often seems to be the case that the best prior
probabilities are considerably better than any other choice. Actually, none of the details of the
maximum entropy method are essential for our general theory of mentality. What is important is
that, in the maximum entropy principle, we have a widely valid, practically applicable method
for estimating the prior probabilities required by Bayes’ Theorem, given a certain degree of
background knowledge. The existence of such a method implies the possibility of a unified
treatment of Bayesian reasoning.
DEDUCTION, INDUCTION
In order to use Bayes’ rule to determine the P(Yi|X), one must know the P(X|Yi), and one
must know the P(Yi). Determining the P(X|Yi) is, I will propose, a fundamentally deductive
problem; it is essentially a matter of determining a property of the known quantity Yi. But the
P(Yi) are a different matter. The maximum entropy principle is remarkable but not magical: it
cannot manufacture knowledge about the P(Yi) where there isn’t any. All it can do is work with
given constraints C and given background knowledge m, and work these into a coherent overall
guess at the P(Yi). In general, the background information about these probabilities must be
determined by induction. In this manner, Bayes’ rule employs both inductive and deductive
reasoning.
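The division of labor can be made concrete with a small sketch of Bayes' rule itself. The numerical values below are hypothetical and serve only as an illustration: `prior` stands for the P(Yi), which might be supplied by the maximum entropy principle, and `likelihood` stands for the P(X|Yi), which are assumed to have been worked out separately.

```python
def posterior(prior, likelihood):
    """Bayes' rule: P(Y_i | X) = P(X | Y_i) P(Y_i) / sum_j P(X | Y_j) P(Y_j)."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)  # P(X), the normalizing constant
    return [j / evidence for j in joint]

prior = [0.5, 0.3, 0.2]        # P(Y_i): hypothetical values
likelihood = [0.1, 0.4, 0.8]   # P(X | Y_i): hypothetical values
post = posterior(prior, likelihood)
print(post)
```

Here the hypothesis with the smallest prior ends up with the largest posterior, because it explains the observation X best.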
THE REGULARIZATION PARAMETER
It is essential to note that the maximum entropy method is not entirely specified. Assuming the
formulas given above are accurate, there is still the problem of determining the parameter a. It
appears that there is no way to assign it a universal value once and for all — its value must be set
in a context-specific way. So if the maximum entropy principle is used for perception, the value
of a must be set differently for different perceptual acts. And, furthermore, it seems to me that
even if the maximum entropy principle is not as central as I am assuming, the problem of the
parameter a is still relevant: any other general theory of prior probability estimation would have
to give rise to a similar dilemma.
Gull (1989) has demonstrated that the parameter a may be interpreted as a "regularizing
parameter". If a is large, prior probabilities are computed in such a way that distributions which
are far from the background model m are deemed relatively unlikely. But if a is very small, the
background model is virtually ignored.
So, for instance, if there is no real background knowledge and the background model m is
obtained by the Principle of Indifference, the size of a determines the tendency of the maximum
entropy method to assign a high probability to distributions in which all the probabilities are
about the same. Setting a too high would be "under-fitting": the data are given too little weight. But, on the other hand, if m is derived
from real background knowledge and the signal of which the Yi are possible explanations is very
"noisy," then a low a will cause the maximum entropy principle to yield an optimal distribution
with a great deal of random oscillation. This is "over-fitting". In general, one has to keep the
parameter a small enough that the data can move the result away from the background model m, but one has to make it
large enough to prevent the maximum entropy principle from paying too much attention to chance
fluctuations of the data.
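The role of a can be illustrated with a deliberately small numerical sketch. Everything specific in it is an assumption introduced here for illustration rather than taken from Gull: a single positive quantity p, one noisy measurement d with Gaussian noise of standard deviation sigma, and an estimate chosen to maximize a·S(p,m) − (d−p)²/(2·sigma²), where S(p,m)=p−m−p·log(p/m).

```python
import math

def regularized_estimate(d, m, a, sigma=1.0):
    """Maximize a*S(p, m) - (d - p)**2 / (2*sigma**2) over p > 0.

    With S(p, m) = p - m - p*log(p/m), the derivative of the objective is
    g(p) = (d - p)/sigma**2 - a*log(p/m), which decreases in p, so its
    root can be found by bisection.
    """
    def g(p):
        return (d - p) / sigma**2 - a * math.log(p / m)

    lo, hi = 1e-12, max(d, m) + 1.0  # g(lo) > 0 > g(hi) for d > 0, m > 1e-12, a > 0
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

m, d = 1.0, 3.0  # hypothetical background value and measurement
for a in (0.01, 1.0, 100.0):
    print(a, regularized_estimate(d, m, a))
```

With a small the estimate stays close to the measurement d, noise included; with a large it stays close to the background value m, whatever the data say.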
BAYESIAN PARAMETER ESTIMATION
As an alternative to setting the parameter a by intuition or ad hoc mathematical techniques,
Gull has given a method of using Bayesian statistics to estimate the most likely value of a for
particular p and m. Often, as in radioastronomical interferometry, this tactic or simpler versions
of it appear to work well. But, as Gull has demonstrated, vision processing presents greater
difficulties. He tried to use the maximum entropy principle to turn blurry pictures of a woman
into accurate photograph-like images, but he found that the Bayesian derivation of a yielded
fairly unimpressive results.
He devised an ingenious solution. He ran a maximum entropy computation using the value of a
arrived at by the Bayesian method, and used the results to obtain a new background
distribution m'=(m1',…,mn'). Then he applied the maximum entropy
principle using this new background knowledge, m'. This yielded beautiful results, and if it
hadn’t, he could have applied the same method again. This is yet another example of the power
of hierarchical structures to solve perceptual problems.
Of course, one could do this over and over again — but one has to stop somewhere. At some
level, one simply has to set the value of a based on intuition, based on what value a usually has
for the type of problem one is considering. This is plainly a matter of induction.
In general, when designing programs or machines to execute the maximum entropy principle,
we can set a by trial and error or common sense. But this, of course, means that we are using
deduction, analogy and induction to set a. I suggest that similar processes are used when the
mind determines a internally, unconsciously. This hypothesis has some interesting consequences,
as we shall see.
As cautioned above, if the maximum entropy method were proved completely incorrect, it
would have no effect on the overall model of mind presented here — so long as it were replaced
by a reasonably simple formula, or collection of formulas, for helping to compute the priors in
Bayes’ formula; and so long as this formula or collection of formulas was reasonably amenable
to inductive adjustment. However, I do not foresee the maximum entropy principle being
"disproved" in any significant sense. There may indeed be psychological systems which have
nothing to do with it. But the general idea of filling in the gaps in incomplete data with the "most
likely" values seems so obvious as to be inevitable. And the idea of using the maximum entropy
values — the values which "assume the least", the most unbiased values — seems almost as
natural. Furthermore, not only is it conceptually attractive and intuitively attractive — it has been
shown repeatedly to work, under various theoretical assumptions and in various practical
situations.
Source: A New Mathematical Model of Mind