I’ll let Caticha summarize the principle of maximum entropy:
Among all possible probability distributions that agree with whatever we know select that particular distribution that reflects maximum ignorance about everything else. Since ignorance is measured by entropy, the method is mathematically implemented by selecting the distribution that maximizes entropy subject to the constraints imposed by the available information.
It appears to have been introduced by Jaynes in 1957, and has seen a resurgence in the past decade as people take Bayesian inference more seriously. (As an aside, Jaynes’s posthumously published book is well worth a read, in spite of some cringeworthy rants peppered throughout.) I won’t dwell too much on the philosophy, as the two previously mentioned sources have already gone into great detail to justify the method.
Usually we consider constraints which are linear in the probabilities, namely we constrain the probability distribution to have specific expectation values. Suppose we know the expectation values $F^k$ of a certain set of functions $f^k$. Then, $p(x)$ should be such that

$$\sum_x p(x)\, f^k(x) = F^k$$

for all $k$. Let’s omit the argument $(x)$ for simplicity. Then, we can use variational calculus to find the $p$ that maximizes the functional

$$S[p] - \alpha\left(\sum_x p - 1\right) - \sum_k \lambda_k\left(\sum_x p\, f^k - F^k\right).$$

The term with $\alpha$ enforces the normalization condition and $S$ is the Shannon entropy.
The solution to this is

$$p = \frac{1}{Z}\exp\left(-\sum_k \lambda_k f^k\right),$$

with

$$Z = \sum_x \exp\left(-\sum_k \lambda_k f^k\right)$$

the partition function (which is just the normalization constant). Now, we can find the remaining multipliers by solving the system of equations

$$-\frac{\partial \log Z}{\partial \lambda_k} = F^k.$$
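In case that last step looks opaque, note that differentiating $\log Z$ just brings the constraint functions back down:

$$-\frac{\partial \log Z}{\partial \lambda_k} = \frac{1}{Z}\sum_x f^k \exp\left(-\sum_j \lambda_j f^j\right) = \langle f^k \rangle,$$

so each equation simply enforces one of the constraints $\langle f^k \rangle = F^k$.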
I’ll let you confirm that if we fix the mean and variance we get a Gaussian distribution. Go on, I’ll wait.
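If you’d rather let the computer do the confirming, here is a quick numerical sketch (the grid, the target moments and the starting guess are arbitrary choices of mine): it discretizes $x$, solves the multiplier equations with scipy, and compares the result to a standard Gaussian.

```python
import numpy as np
from scipy.optimize import root

x = np.linspace(-10, 10, 2001)      # discretized support
dx = x[1] - x[0]
f = np.vstack([x, x**2])            # constraint functions f^k(x): first and second moment
F = np.array([0.0, 1.0])            # target values: <x> = 0, <x^2> = 1

def p_of(lam):
    """Maxent form p(x) = exp(-sum_k lam_k f^k(x)) / Z on the grid."""
    w = np.exp(-lam @ f)
    return w / (w.sum() * dx)

def residual(lam):
    """<f^k> under p(lam) minus the target values F^k."""
    return (f * p_of(lam)).sum(axis=1) * dx - F

lam = root(residual, x0=[0.3, 0.7]).x
p = p_of(lam)
gaussian = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print("multipliers:", lam)                        # expect roughly [0, 0.5]
print("max |p - gaussian|:", np.abs(p - gaussian).max())
```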
Power law distributions with maxent
I read a recent paper in PNAS which shows how to use this formalism to build an intuition for why distributions with power law tails are so ubiquitous. It is not very well written, but I think I successfully deduced what the authors meant. Suppose you have a system where particles join a cluster, and you ‘pay’ a fixed price $\mu$ for each particle that joins. If you consider the first one to already be there, the total paid cost is $f(k)=(k-1)\mu$, with $\mu$ the chemical potential (we are now talking about discrete states, with $k$ indexing each state, or rather the number of particles in the cluster).
By bundling every constant into a $\mu^{\circ}$ factor, it’s not necessary to specify the actual value of the expectation and determine the Lagrange multiplier: whatever it is, the distribution will look like $p_k=\frac{\exp(-\mu^{\circ} k)}{Z}$ with this arbitrary $\mu^{\circ}$ factor (remember $Z$ is just a normalization constant). This is an exponential distribution, which means events far away from the mean are comparatively very rare. The authors now propose to look at economies of scale: what if it gets cheaper to add a new particle as the cluster grows? Then, the cost of adding a particle to a cluster that already has $k$ particles is $k_0\mu/(k+k_0)$, which is a Hill-type function where $k_0$ describes how much the cost is spread out with each additional particle. So the function $f$ becomes $f(k)=\sum_{j=1}^{k-1} k_0\mu/(j+k_0)$. Then you repeat the calculation and you get

$$p_k = \frac{1}{Z}\exp\left(-\mu^{\circ} k_0\, \Psi(k+k_0)\right).$$
Here, $\Psi$ is the digamma function and it’s just a fancy way of hiding the sum you saw in the function $f$ (when you do that, you also pick up a constant term, which gets absorbed into $Z$). You can expand it for large $k$ and get $\Psi(k+k_0) \sim \log(k+k_0-1/2)$. Then the probability distribution is $p_k \sim \left(k+k_0-1/2\right)^{-\mu^{\circ} k_0}$. Cool. The authors tested this by fitting $\mu^{\circ}$ and $k_0$ to various datasets with power law distributions, which doesn’t really show much by itself, since we already know those datasets follow power laws and the expression being fit is a power law. The main message here is that you can get a power law from the maximum entropy principle, which suggests a sort of universality among systems brought about by this kind of cost amortization.
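If you’d rather not take the asymptotics on faith, here’s a small numerical sketch (my own, with $\mu^{\circ}$ and $k_0$ picked arbitrarily): it builds $p_k$ directly from the cost sum, checks that the digamma expression reproduces it, and compares the tail with the power-law form above.

```python
import numpy as np
from scipy.special import digamma

mu_circ, k0 = 0.3, 10.0                 # arbitrary illustration values
k = np.arange(1, 10001)

# Amortized cost divided by mu: sum_{j=1}^{k-1} k0/(j + k0); the sum is empty for k = 1.
cost = np.concatenate([[0.0], np.cumsum(k0 / (k[:-1] + k0))])
p = np.exp(-mu_circ * cost)
p /= p.sum()

# Same distribution written with the digamma function; the Psi(1 + k0) constant
# only rescales p and disappears after normalization.
p_psi = np.exp(-mu_circ * k0 * (digamma(k + k0) - digamma(1 + k0)))
p_psi /= p_psi.sum()
print("digamma form matches the explicit sum:", np.allclose(p, p_psi))

# Large-k behaviour: p_k should track (k + k0 - 1/2)^(-mu_circ * k0).
tail = (k + k0 - 0.5) ** (-mu_circ * k0)
tail /= tail.sum()
for kk in (10, 100, 1000, 10000):
    print(kk, p[kk - 1] / tail[kk - 1])   # ratios stay roughly constant
```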
In the next post, I’ll talk about a more complicated application of the maximum entropy principle to neural codes.