If the Principle of Indifference tells us what probabilities to assign given no background

knowledge, what is the corresponding principle for the case when one does have some

background knowledge? Seeking to answer this question, E.T. Jaynes studied the writings of J.

Willard Gibbs and drew therefrom a rule called the maximum entropy principle. Like the

Principle of Indifference, the maximum entropy principle is provably correct in certain special

cases, but in the general case, justifying it or applying it requires ad hoc, something-out-of-

nothing assumptions.

The starting point of the maximum entropy principle is the entropy function

H(p1,…,pn) = – [p1logp1 + p2logp2 + … + pnlogpn],

where {Yi} is an exhaustive, mutually exclusive collection of events and pi=P(Yi). This function

first emerged in the work of Boltzmann, Gibbs and other founders of thermodynamics, but its

true significance was not comprehended until Claude Shannon published The Mathematical Theory of

Communication (1949). It is a measure of the uncertainty involved in the distribution {pi}.

The entropy is always nonnegative. If, say, (p1,…,pn)=(0,0,1,0,…,0), then the entropy H(p1,…,pn) is

zero — because this sort of distribution has the minimum possible uncertainty. It is known which

of the Yi is the case, with absolute certainty. On the other hand, if (p1,…,pn)=(1/n,1/n,…,1/n), then

H(p1,…,pn)=log n, which is the maximum possible value. This represents the maximum possible

uncertainty: each possibility is equally likely.
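These two extremes are easy to check numerically. Here is a minimal sketch (mine, not from the original text) that computes H for a distribution in which one outcome is certain, and for a uniform distribution:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -[p1*log(p1) + ... + pn*log(pn)], with 0*log(0) taken as 0."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

n = 6
certain = [0, 0, 1, 0, 0, 0]   # one outcome known with certainty
uniform = [1 / n] * n          # maximum uncertainty

print(entropy(certain))        # 0.0, the minimum possible value
print(entropy(uniform))        # log 6 ≈ 1.7918, the maximum possible value
```

The natural logarithm is used here; any fixed base would do, changing only the units in which uncertainty is measured.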

The maximum entropy principle states that, for any exhaustive, mutually exclusive set of

events (Y1,…,Yn), the most likely probability distribution (p1,…,pn) with respect to a given set of

constraints on the Yi is that distribution which, among all those that satisfy the constraints, has

maximum entropy. The "constraints" represent particular knowledge about the situation in

question; they are what distinguishes one problem from another.

For instance, what if one has absolutely no knowledge about the various possibilities Yi?

Then, where pi=P(Yi), can we determine the "most likely" distribution (p1,…,pn) by finding the

distribution that maximizes H(p1,…,pn)? It is easy to see that, given no additional constraints, the

maximum of H(p1,…,pn) occurs for the distribution (p1,…,pn) = (1/n,1/n,…,1/n). In other words,

when there is no knowledge whatsoever about the Yi, the maximum entropy principle reduces to

the Principle of Indifference.

MAXIMUM ENTROPY WITH LINEAR CONSTRAINTS

In thermodynamics the Yi represent, roughly speaking, the different possible regions of space

in which a molecule can be; pi is the probability that a randomly chosen molecule is in region Yi.

Each vector of probabilities (p1,…,pn) is a certain distribution of molecules amongst regions. The

question is, what is the most likely way for the molecules to be distributed? One assumes that

one knows the energy of the distribution, which is of the form E(p1,…,pn)=c1p1+…+cnpn, where

the {ci} are constants obtained from basic physical theory. That is, one assumes that one knows

an equation E(p1,…,pn)=K. Under this assumption, the answer to the question is: the most likely

(p1,…,pn) is the one which, among all those possibilities that satisfy the equation E(p1,…,pn)=K,

maximizes the entropy H(p1,…,pn). There are several other methods of obtaining the most likely

distribution, but this is by far the easiest.

What is remarkable is that this is not just an elegant mathematical feature of classical

thermodynamics. In order to do the maximum entropy principle justice, we should now consider

its application to quantum density matrices, or radio astronomy, or numerical linear algebra. But

this would take us too far afield. Instead, let us consider Jaynes’s "Brandeis dice problem", a

puzzle both simple and profound.

Consider a six-sided die, each side of which may have any number of spots between 1 and 6.

The problem is (Jaynes, 1978):

suppose [this] die has been tossed N times, and we are told only that the average number of

spots up was not 3.5, as we might expect from an ‘honest’ die but 4.5. Given this information,

and nothing else, what probability should we assign to i spots on the next toss? (p.49)

Let Yi denote the event that the next toss yields i spots; let pi=P(Yi). The information we have

may be expressed as an equation of the form A(p1,…,pn)=4.5, where A(p1,…,pn)=1p1+2p2+…+6p6 is

the expected number of spots up. This equation says: whatever the most likely distribution of probabilities is,

it must yield an average of 4.5, which is what we know the average to be.

The maximum entropy principle says: given that the average number of spots up is 4.5, the

most likely distribution (p1,…,pn) is the one that, among all those satisfying the constraint

A(p1,…,pn)=4.5, maximizes the entropy H(p1,…,pn). This optimization problem is easily solved

using Lagrange multipliers, and it has the approximate solution (p1,…,pn) = (.05435, .07877,

.11416, .16545, .23977, .34749). If one had A(p1,…,pn)=3.5, the maximum entropy principle

would yield the solution (p1,…,pn)=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6); but, as one would expect,

knowing that the average is 4.5 makes the higher numbers more likely and the lower numbers

less likely.
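The Lagrange-multiplier calculation can be checked numerically. Under a mean constraint of this kind, the maximum entropy solution takes the form pi proportional to x^i for some x > 0, and x can be found by simple bisection. The following sketch (my reconstruction, not Jaynes's) reproduces the figures quoted above:

```python
import math

def maxent_dice(mean, lo=1e-6, hi=50.0, iters=200):
    """Maximum entropy distribution over faces 1..6 subject to E[spots] = mean.
    The solution has the form p_i proportional to x**i; bisect on x to hit the mean."""
    def avg(x):
        w = [x ** i for i in range(1, 7)]
        return sum(i * wi for i, wi in zip(range(1, 7), w)) / sum(w)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if avg(mid) < mean:   # avg(x) increases monotonically with x
            lo = mid
        else:
            hi = mid
    x = (lo + hi) / 2
    w = [x ** i for i in range(1, 7)]
    total = sum(w)
    return [wi / total for wi in w]

p = maxent_dice(4.5)
print([round(pi, 5) for pi in p])
# approximately [0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749]
```

With mean 3.5 the same routine returns the uniform distribution (1/6,…,1/6), as the text asserts. Note that the mean constraint here has exactly the same linear form as the energy constraint in the thermodynamic case.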

For the Brandeis dice problem, as in the case of classical thermodynamics, it is possible to

prove mathematically that the maximum entropy solution is far more likely than any other

solution. And in both these instances the maximization of entropy appears to be the most

efficacious method of locating the optimal solution. The two situations are extremely similar:

both involve essentially random processes (dice tossing, molecular motion), and both involve

linear constraints (energy, average). Here the maximum entropy principle is at its best.

MAXIMUM ENTROPY AS A GENERAL RULE OF INFERENCE

The maximum entropy principle is most appealing when one is dealing with linear constraints.

There is a simple, straightforward proof of its correctness. But when talking about the general

task of intelligence, we are not necessarily restricted to linear constraints. Evans (1978) has

attempted to surmount this obstacle by showing that, given any constraint F(p1,…,pn)=K, the

overwhelmingly most likely values pi=P(Yi) may be found by maximizing the relative entropy

– [p1log(p1/k1) + … + pnlog(pn/kn)],

where k=(k1,k2,…,kn) is some "background distribution". The trouble with this approach is that

the only known way of determining k is through a complicated sequence of calculations

involving various ensembles of events.

Shore and Johnson (1980) have provided an alternate approach, which has been refined

considerably by Skilling (1989). Extending Cox’s proof that probability theory is the only

reasonable method for uncertain reasoning, Shore and Johnson have proved that if there is any

reasonably general method for assigning prior probabilities in Bayes’ Theorem, it has to depend

in a certain way upon the entropy. Here we will not require all the mathematical details; the

general idea will suffice.

Where D is a subset of {Yi}, and C is a set of constraints, let f[D|C] denote the probability

distribution assigned to the domain D on the basis of the constraints C. Let m={m1,m2,…,mn}

denote some set of "background information" probabilities. For instance, if one actually has no

background information, one might want to implement the Principle of Indifference and assume

mi=1/n, for all i.

Assume f[D|C] is intended to give the most likely probability distribution for D, given the

constraints C. Then one can derive the maximum entropy principle from the following axioms:

Axiom I: Subset Independence

If constraint C1 applies in domain D1 and constraint C2 applies in domain D2, then

f[D1|C1]×f[D2|C2] = f[D1×D2|C1×C2]. (Basically, this means that if the constraints involved

do not interrelate D1 and D2, neither should the answer.) This implies that f[D|C] can be

obtained by maximizing over a sum of the form S(p,m)=m1Q(p1)+…+mnQ(pn), where Q is some

function.

Axiom II: Coordinate Invariance

This is a technical requirement regarding the way that f[(p1,…,pn)|C] relates to

f[(p1/q1,…,pn/qn)|C]: it states that if one expresses the regions in a different coordinate system,

the probabilities do not change. It implies that S(p,m)=m1Q(p1/m1)+…+mnQ(pn/mn).

Axiom III: System Independence

Philosophically, this is the crucial requirement. "If a proportion q of a population has a certain

property, then the proportion of any sub-population having that property should properly be

assigned as q…. For example, if 1/3 of kangaroos have blue eyes… then [in the absence of

knowledge to the contrary] the proportion of left-handed kangaroos having blue eyes should be

1/3"

It can be shown that these axioms imply that f[Y|C] is proportional to the maximum of the

entropy H(p1,…,pn) subject to the constraints C, whatever the constraints C may be (linear or

not). And since it must be proportional to the entropy, one may as well take it to be equal to the

entropy.

These axioms are reasonable, though nowhere near as compelling as Cox’s stunningly simple

axioms for probable inference. They are not simply mathematical requirements; they have a great

deal of philosophical substance. What they do not tell you, however, is by what amount the most

likely solution f[Y|C] is superior to all other solutions. This requires more work.

More precisely, one way of summarizing what these axioms show is as follows. Let

m=(m1,…,mn) be some vector of "background" probabilities. Then f[D|C] must be assigned by

maximizing the function

S(p,m)=[p1-m1-p1log(p1/m1)]+…+[pn-mn-pnlog(pn/mn)].
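A quick numerical check (my sketch, not part of the axiomatic derivation) confirms that, with no constraints beyond normalization, this S is zero at p = m and negative elsewhere, so the background distribution itself is the unconstrained maximum:

```python
import math

def S(p, m):
    """The entropy functional S(p,m) = sum over i of [pi - mi - pi*log(pi/mi)]."""
    return sum(pi - mi - pi * math.log(pi / mi) for pi, mi in zip(p, m))

m = [0.2, 0.3, 0.5]
print(S(m, m))                  # 0.0: the maximum is attained at p = m
print(S([0.4, 0.3, 0.3], m))   # negative: any other distribution scores lower
```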

Evans has shown that, for any constraint C, there is some choice of m for which the maximum

entropy principle gives a distribution which is not only correct but dramatically more likely

than any other distribution. It is implicit, though not actually stated, in his work that given the

correct vector (m1,…,mn), the prior probabilities {pi} in Bayes’ formula must be given by

pi = exp(aS)/Z,

where S = S(p,m) as given above, Z = exp(aS)/[n(p1p2…pn)^(1/2)], and a is a parameter to be discussed

below. Skilling has pointed out that, in every case for which the results have been calculated for

any (m1,…,mn), with linear or nonlinear constraints, this same formula has been the result. He

has given a particularly convincing example involving the Poisson distribution.

In sum: the maximum entropy principle appears to be a very reasonable general method for

estimating the best prior probabilities; and it often seems to be the case that the best prior

probabilities are considerably better than any other choice. Actually, none of the details of the

maximum entropy method are essential for our general theory of mentality. What is important is

that, in the maximum entropy principle, we have a widely valid, practically applicable method

for estimating the prior probabilities required by Bayes’ Theorem, given a certain degree of

background knowledge. The existence of such a method implies the possibility of a unified

treatment of Bayesian reasoning.

DEDUCTION, INDUCTION

In order to use Bayes’ rule to determine the P(Yi|X), one must know the P(X|Yi), and one

must know the P(Yi). Determining the P(X|Yi) is, I will propose, a fundamentally deductive

problem; it is essentially a matter of determining a property of the known quantity Yi. But the

P(Yi) are a different matter. The maximum entropy principle is remarkable but not magical: it

cannot manufacture knowledge about the P(Yi) where there isn’t any. All it can do is work with

given constraints C and given background knowledge m, and work these into a coherent overall

guess at the P(Yi). In general, the background information about these probabilities must be

determined by induction. In this manner, Bayes’ rule employs both inductive and deductive

reasoning.

THE REGULARIZATION PARAMETER

It is essential to note that the maximum entropy method is not entirely specified. Assuming the

formulas given above are accurate, there is still the problem of determining the parameter a. It

appears that there is no way to assign it a universal value once and for all — its value must be set

in a context-specific way. So if the maximum entropy principle is used for perception, the value

of a must be set differently for different perceptual acts. And, furthermore, it seems to me that

even if the maximum entropy principle is not as central as I am assuming, the problem of the

parameter a is still relevant: any other general theory of prior probability estimation would have

to give rise to a similar dilemma.

Gull (1989) has demonstrated that the parameter a may be interpreted as a "regularizing

parameter". If a is large, prior probabilities are computed in such a way that distributions which

are far from the background model m are deemed relatively unlikely. But if a is very small, the

background model is virtually ignored.

So, for instance, if there is no real background knowledge and the background model m is

obtained by the Principle of Indifference, the size of a determines the tendency of the maximum

entropy method to assign a high probability to distributions in which all the probabilities are

about the same. Setting a too high would amount to "under-fitting": the data would be largely

ignored in favor of the background. But, on the other hand, if m is derived

from real background knowledge and the signal of which the Yi are possible explanations is very

"noisy," then a low a will cause the maximum entropy principle to yield an optimal distribution

with a great deal of random oscillation, tracking the noise. This is "over-fitting". In general, one

has to keep the parameter a small enough that the data are not drowned out by the background

information m, but one has to make it large enough to prevent the maximum entropy principle

from paying too much attention to chance fluctuations of the data.
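The trade-off can be illustrated with a toy two-outcome problem (my construction, not Gull's). The prior weight exp(aS) is combined with a likelihood from observed counts, and the posterior maximum is located by brute-force grid search: a large a pins the answer to the background model, while a small a lets the data dominate.

```python
import math

def S(p, m):
    # entropy of p relative to the background model m
    return sum(pi - mi - pi * math.log(pi / mi) for pi, mi in zip(p, m))

def map_estimate(a, m, counts):
    """Maximize a*S(p,m) + log-likelihood of the counts over a binary p, by grid search."""
    best, best_val = None, -float("inf")
    for k in range(1, 1000):
        p = [k / 1000, 1 - k / 1000]
        val = a * S(p, m) + sum(c * math.log(pi) for c, pi in zip(counts, p))
        if val > best_val:
            best, best_val = p, val
    return best

m = [0.5, 0.5]     # background model: a fair coin
counts = [9, 1]    # data: 9 heads, 1 tail

print(map_estimate(1000, m, counts)[0])   # near 0.5: the background dominates
print(map_estimate(0.01, m, counts)[0])   # near 0.9: the data dominate
```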

BAYESIAN PARAMETER ESTIMATION

As an alternative to setting the parameter a by intuition or ad hoc mathematical techniques,

Gull has given a method of using Bayesian statistics to estimate the most likely value of a for

particular p and m. Often, as in radioastronomical interferometry, this tactic or simpler versions

of it appear to work well. But, as Gull has demonstrated, vision processing presents greater

difficulties. He tried to use the maximum entropy principle to turn blurry pictures of a woman

into accurate photograph-like images, but he found that the Bayesian derivation of a yielded

fairly unimpressive results.

He devised an ingenious solution. He used the maximum entropy principle to take the results

of a maximum entropy computation using the value of a arrived at by the Bayesian method —

and get a new background distribution m’=(m1′,…,mn’). Then he applied the maximum entropy

principle using this new background knowledge, m’. This yielded beautiful results — and if it

hadn’t, he could have applied the same method again. This is yet another example of the power

of hierarchical structures to solve perceptual problems.
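A crude sketch of this re-estimation loop, in a toy two-outcome setting of my own construction (not Gull's image-processing code): compute the maximum entropy solution for some observed counts, feed it back in as the new background model, and repeat. The background model migrates toward the distribution supported by the data.

```python
import math

def S(p, m):
    # entropy of p relative to the background model m
    return sum(pi - mi - pi * math.log(pi / mi) for pi, mi in zip(p, m))

def map_estimate(a, m, counts):
    """Maximize a*S(p,m) + log-likelihood of the counts over a binary p, by grid search."""
    best, best_val = None, -float("inf")
    for k in range(1, 1000):
        p = [k / 1000, 1 - k / 1000]
        val = a * S(p, m) + sum(c * math.log(pi) for c, pi in zip(counts, p))
        if val > best_val:
            best, best_val = p, val
    return best

m = [0.5, 0.5]    # initial background model: no real knowledge
counts = [9, 1]   # observed data
for step in range(20):
    m = map_estimate(5.0, m, counts)   # feed each solution back in as the new m

print(round(m[0], 2))   # converges toward 0.9, the frequency in the data
```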

Of course, one could do this over and over again — but one has to stop somewhere. At some

level, one simply has to set the value of a based on intuition, based on what value a usually has

for the type of problem one is considering. This is plainly a matter of induction.

In general, when designing programs or machines to execute the maximum entropy principle,

we can set a by trial and error or common sense. But this, of course, means that we are using

deduction, analogy and induction to set a. I suggest that similar processes are used when the

mind determines a internally, unconsciously. This hypothesis has some interesting consequences,

as we shall see.

As cautioned above, if the maximum entropy method were proved completely incorrect, it

would have no effect on the overall model of mind presented here — so long as it were replaced

by a reasonably simple formula, or collection of formulas, for helping to compute the priors in

Bayes’ formula; and so long as this formula or collection of formulas was reasonably amenable

to inductive adjustment. However, I do not foresee the maximum entropy principle being

"disproved" in any significant sense. There may be indeed be psychological systems which have

nothing to do with it. But the general idea of filling in the gaps in incomplete data with the "most

likely" values seems so obvious as to be inevitable. And the idea of using the maximum entropy

values — the values which "assume the least", the most unbiased values — seems almost as

natural. Furthermore, not only is it conceptually attractive and intuitively attractive — it has been

shown repeatedly to work, under various theoretical assumptions and in various practical

situations.

Source: A New Mathematical Model of Mind
