Tiago Ramalho AI research in Tokyo

Simple pattern formation with cellular automata

A cellular automaton is a dynamical system where space, time and the dynamic variable are all discrete. The system is thus composed of a lattice of cells (discrete space), each described by a state (discrete dynamic variable) which evolves to the next time step (discrete time) according to a dynamic rule.

$$ x_i^{t+1} = f(x_i^t, \Omega_i^t, \xi) $$

This rule generally depends on the state of the target cell $x_i^t$, the state of its neighbors $\Omega_i^t$, and a number of auxiliary external variables $\xi$. Since all these inputs are discrete, we can enumerate them and then define the dynamic rule by a transition table. The transition table maps each possible input to the next state for the cell. As an example consider the elementary 1D cellular automaton. In this case the neighborhood consists of only the 2 nearest neighbors $\Omega_i^t = \{x_{i-1}^t, x_{i+1}^t\}$ and no external variables.
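To make the transition-table idea concrete, here is a minimal sketch (my own, in Python with NumPy, not code from the post) of an elementary 1D automaton, with the table for the well-known Rule 110 written out explicitly:

```python
import numpy as np

# Transition table for the elementary CA known as Rule 110: each
# (left, center, right) input triple maps to the cell's next state.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

def step(state, table):
    """Apply the transition table once, with periodic boundaries."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    return np.array([table[(l, c, r)] for l, c, r in zip(left, state, right)])

state = np.zeros(101, dtype=int)
state[50] = 1  # a single seed cell
history = [state]
for _ in range(50):
    state = step(state, RULE_110)
    history.append(state)
```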

In general, there are two types of neighborhoods, commonly classified as Moore or Von Neumann. A Moore neighborhood of radius $r$ corresponds to all cells within a hypercube of side $2r+1$ centered at the current cell. In 2D we can write it as $\Omega_{ij}^t = \{x^t_{kl}:|i-k|\leq r \wedge |j-l|\leq r\}\setminus x^t_{ij}$. The Von Neumann neighborhood is more restrictive: only cells within a Manhattan distance of $r$ belong to the neighborhood. In 2D we write $\Omega_{ij}^t = \{x^t_{kl}:|i-k|+|j-l| \leq r\}\setminus x^t_{ij}$.
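For concreteness, a small sketch of gathering either neighborhood on a 2D periodic lattice (again my own illustration):

```python
import numpy as np

def neighborhood(lattice, i, j, r, kind="moore"):
    """Return the states in the Moore or Von Neumann neighborhood of
    cell (i, j) with radius r, excluding the cell itself (periodic wrap)."""
    n_rows, n_cols = lattice.shape
    cells = []
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            if (di, dj) == (0, 0):
                continue
            if kind == "von_neumann" and abs(di) + abs(dj) > r:
                continue  # enforce the Manhattan distance cutoff
            cells.append(lattice[(i + di) % n_rows, (j + dj) % n_cols])
    return cells
```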

Finally, it is worth elucidating the concept of totalistic automata. In high dimensional spaces, the number of possible configurations of the neighborhood $\Omega$ can be quite large. As a simplification, we may instead use as an input to the transition table the number of neighbors in a specific state, $N_k = \sum_{x \in \Omega}\delta(x = k)$. If there are only 2 states, we need only consider $N_1$, since $N_0 = n - N_1$, where $n$ is the number of cells in the neighborhood. For an arbitrary number $m$ of states, we will obviously need $m-1$ such inputs to fully characterize the neighborhood. Even then, each input $N_k$ can take $n+1$ different values, which might still be too many. In such cases we may consider only whether $N_k$ is above some threshold. Then we can define as an input the boolean variable

$$ P_{k,T}=\begin{cases} 1& \text{if}\; N_k \geq T,\\ 0& \text{if}\; N_k < T. \end{cases} $$

In the simulation you can find here, I considered a cellular automaton with the following properties (a code sketch follows the list): number of states $m=2$; Moore neighborhood with radius $r=1$; lattice size $L_x \times L_y$; and 3 inputs for the transition table:

  • Current state $x_{ij}^t$
  • Neighborhood state $P_{1,T}$ with $T$ unspecified
  • One external input $\xi$
$$ \xi_{ij}=\begin{cases} 1& \text{if}\; i \geq L_x/2,\\ 0& \text{if}\; i < L_x/2. \end{cases} $$
  • Initial condition $x_{ij} = 0 \; \forall_{ij}$
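A minimal sketch of this setup follows. The threshold $T$ and the transition table entries below are arbitrary illustrative choices, since they are left unspecified here:

```python
import numpy as np

Lx, Ly, T = 64, 64, 3  # lattice size and an example threshold

# Example transition table mapping (x_ij, P_{1,T}, xi_ij) -> next state.
# These 8 entries are arbitrary choices, not the table from the simulation.
TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 1,
}

x = np.zeros((Lx, Ly), dtype=int)            # initial condition: all zeros
xi = (np.arange(Lx) >= Lx // 2).astype(int)  # external input: 1 on half the lattice

def step(x):
    # Number of neighbors in state 1 over the radius-1 Moore neighborhood.
    n1 = sum(np.roll(np.roll(x, di, axis=0), dj, axis=1)
             for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0))
    p = (n1 >= T).astype(int)                # thresholded neighborhood input P_{1,T}
    out = np.empty_like(x)
    for i in range(Lx):
        for j in range(Ly):
            out[i, j] = TABLE[(x[i, j], p[i, j], xi[i])]
    return out

for _ in range(100):
    x = step(x)
```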

A deterministic simulation under these conditions yields only a few steady states: homogeneous 1 or 0, half the lattice 1 and the other half 0, and oscillations between combinations of the previous.

One possibility would be to add noise to the cellular automaton in order to provide more interesting dynamics. There are two ways to add noise to a cellular automaton:

The most straightforward way is to perform the following procedure at each time step:

  • Apply the deterministic dynamics to the whole lattice
  • For each lattice site $ij$, invert the state $x_{ij}$ with probability $p$

Of course, this procedure only works for $m=2$. With more states there is no obvious way to generalize the inversion, and we need to use a proper Monte Carlo method to get the dynamics.
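For $m=2$ the inversion step is a one-liner; a sketch assuming a NumPy lattice of 0s and 1s:

```python
import numpy as np

def add_noise(x, p, rng):
    """Invert each binary cell independently with probability p."""
    flips = rng.random(x.shape) < p
    return np.where(flips, 1 - x, x)

# usage: x = add_noise(x, 0.01, np.random.default_rng(0))
```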

A second way is to implement a probabilistic cellular automaton. In this case the transition table is generalized to a Markov matrix: each input is now mapped not to a specific state but rather to a set of probabilities for the transition to each state ($m$ probabilities). Naturally, for each input these sum to one. In this case we have $m$ times more parameters than before.
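A sketch of what sampling from such a Markov transition table could look like for $m=2$ (the inputs and probabilities below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2  # number of states

# Hypothetical Markov transition table: each input (here the pair
# (current state, thresholded neighborhood input)) maps to a length-m
# probability vector summing to one. The numbers are made up.
PROB_TABLE = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.4, 0.6],
    (1, 0): [0.6, 0.4],
    (1, 1): [0.1, 0.9],
}

def sample_next_state(current, neighborhood_input):
    """Draw the next state from the transition probabilities for this input."""
    return rng.choice(m, p=PROB_TABLE[(current, neighborhood_input)])
```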

Pilot waves in fluid dynamics

This is so cool.

The link between thermodynamics and inference

In recent blog posts I talked a bit about how many aspects of maximum entropy are analogous to methods in statistical physics. In this short post, I’ll summarize the most interesting similarities. In Bayesian inference, we are usually interested in the posterior distribution of some parameters $\theta$ given the data $d$. This posterior can be written as a Boltzmann distribution:

$$P(\theta|d)=\frac{P(\theta,d)}{P(d)}=\left.\frac{e^{-\beta H(\theta,d)}}{Z}\right|_{\beta=1}$$

with $H(\theta,d) = -\log P(\theta,d)/\beta$ and $Z=\int d\theta\;e^{-\beta H(\theta,d)}$. I’ll note that we are working with units such that $k_B=1$ and thus $\beta=1/T$.

The energy is just the expectation value of the Hamiltonian $H$ (note that the expectation is taken with respect to $P(\theta|d)$):

$$E = \langle H \rangle = -\frac{\partial \log Z}{\partial \beta}$$

And the entropy is equal to

$$S=-\int d\theta\;P(\theta|d)\log P(\theta|d)=\beta\langle H \rangle + \log Z$$

We can also define the free energy, which is

$$F=E\, - \frac{S}{\beta}=-\frac{\log Z}{\beta}$$

A cool way to approximate $Z$ if we can’t calculate it analytically (and for high dimensional problems we usually can’t calculate it numerically either, because the integrals take far too long) is to use Laplace’s approximation

$$Z=\int d\theta\;e^{-\beta H(\theta,d)}\simeq\sqrt{\frac{2\pi}{\beta|H'(\theta^*)|}}e^{-\beta H(\theta^*)}$$

where $|H'(\theta^*)|$ is the determinant of the Hessian of the Hamiltonian (say that 3 times real fast) and $\theta^*$ is such that $H(\theta^*)=\min H(\theta)$ (a minimum because of the minus sign). Needless to say, this approximation works best at low temperature ($\beta\rightarrow\infty$), so it might not be very accurate at the value we actually care about, $\beta=1$. $\theta^*$ is known as the maximum a posteriori (MAP) estimate. Expectation values can also be approximated in a similar way:

$$\langle f(\theta) \rangle = \int d\theta \; f(\theta) P(\theta|d) \simeq\sqrt{\frac{2\pi}{\beta|H'(\theta^*)|}} f(\theta^*)P(\theta^*|d)$$
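To make this concrete, here is a small numerical sketch (a toy one-dimensional example of my own, where the Hessian is just a second derivative) comparing the Laplace estimate of $Z$ with the exact integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

beta = 1.0

def H(theta):
    """Toy 'Hamiltonian' H = -log P(theta, d) for a made-up model."""
    return 0.5 * (theta - 2.0) ** 2 + 0.1 * theta ** 4

theta_star = minimize_scalar(H).x  # the MAP estimate minimizes H

# Second derivative of H at theta* (the 1D 'Hessian'), by finite differences.
eps = 1e-4
h2 = (H(theta_star + eps) - 2 * H(theta_star) + H(theta_star - eps)) / eps ** 2

Z_laplace = np.sqrt(2 * np.pi / (beta * h2)) * np.exp(-beta * H(theta_star))
Z_exact, _ = quad(lambda t: np.exp(-beta * H(t)), -np.inf, np.inf)
print(Z_laplace, Z_exact)  # close, since this example is sharply peaked
```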

So the MAP estimate is defined as $\text{argmax}_{\theta}\, P(\theta|d)$. The result won’t change if we take the log of the posterior, and since $\log Z$ does not depend on $\theta$ this is just $\text{argmax}_{\theta}\,(-\beta H)$. The entropy enters when we look at the whole distribution rather than its mode: among all distributions $Q(\theta)$, the posterior is the one that maximizes the functional

$$S[Q] - \beta\langle H \rangle_Q,$$

which is just $-\beta F$ written as a functional of $Q$ (its maximum value is $\log Z$, attained at $Q=P(\theta|d)$).

Funny, huh? For infinite temperature ($\beta=0$) the parameters reflect total lack of knowledge: the entropy is maximized. As we lower the temperature, the energy term contributes more, reflecting the information provided by the data, until at temperature zero we would only care about the data contribution and ignore the entropy term.

(This is also the basic idea for the simulated annealing optimization algorithm, where in that case the objective function plays the role of the energy and the algorithm walks around phase space randomly, with jump size proportional to the temperature. The annealing schedule progressively lowers the temperature, restricting the random walk to regions of low energy, i.e. good objective values, until it freezes at some point.)
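A bare-bones sketch of simulated annealing on a toy objective (the energy landscape, jump distribution and schedule below are all arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    """Toy objective playing the role of the energy (to be minimized)."""
    return x ** 2 + 3.0 * np.sin(5.0 * x)

x, temp = 5.0, 10.0
while temp > 1e-3:
    x_new = x + rng.normal(scale=temp)     # jump size proportional to temperature
    dE = energy(x_new) - energy(x)
    if dE < 0 or rng.random() < np.exp(-dE / temp):
        x = x_new                          # Metropolis acceptance rule
    temp *= 0.99                           # annealing schedule
print(x)  # should end up near a low-energy region
```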

Another cool connection is the fact that the heat capacity is given by

$$C(\beta)=\beta^2\langle (\Delta H)^2 \rangle=\beta^2\langle (H-\langle H \rangle)^2 \rangle=\beta^2\frac{\partial^2 \log Z}{\partial \beta^2}$$

In the paper I looked at last time, the authors used this fact to estimate the entropy: they calculated $\langle (\Delta H)^2 \rangle$ by MCMC for a range of values of $\beta$, which gives the heat capacity via the relation above; since $C = T\,\partial S/\partial T$, integrating $C(T)/T$ over temperature then recovers the entropy.

Review of Searching for Collective Behavior in a Large Network of Sensory Neurons

Last time I reviewed the principle of maximum entropy. Today I am looking at a paper which uses it to create a simplified probabilistic representation of neural dynamics. The idea is to measure the spike trains of all neurons (in this case, around 100 neurons from a salamander retina) individually and simultaneously. In this way, all correlations in the network are preserved, which allows the construction of a probability distribution describing some features of the network.

Naturally, a probability distribution describing the full network dynamics would need a model of the whole network dynamics, which is not what the authors are aiming at here. Instead, they wish to just capture the correct statistics of the network states. What are the network states? Imagine you bin time into small windows. In each window, each neuron will either spike or not. Then, for each time point you will have a binary word with 100 bits, where a 1 corresponds to a spike and a -1 to silence. This is a network state, which we will represent by $\boldsymbol{\sigma}$.
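Building those binary words from spike data might look like the following sketch; `spike_times` (one array of spike times per neuron), the default window width and the interface are my own made-up choices, only the ±1 convention comes from the description above:

```python
import numpy as np

def binary_words(spike_times, t_max, dt=0.02):
    """Bin each neuron's spike times into windows of width dt; return a
    (n_windows, n_neurons) array with +1 for spike and -1 for silence."""
    n_bins = int(np.ceil(t_max / dt))
    words = -np.ones((n_bins, len(spike_times)), dtype=int)
    for n, times in enumerate(spike_times):
        idx = np.minimum((np.asarray(times) / dt).astype(int), n_bins - 1)
        words[idx, n] = 1
    return words
```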

So, the goal is to get $P(\boldsymbol{\sigma})$. It would be more interesting to have something like $P(\boldsymbol{\sigma}_{t+1}|\boldsymbol{\sigma}_t)$ (subscript denoting time) but we don’t always get what we want, now do we? It is a much harder problem to get this conditional probability, so we’ll have to settle for the overall probability of each state. According to maximum entropy, this distribution will be given by

$$P(\boldsymbol{\sigma})=\frac{1}{Z}\exp\left(-\sum_i \lambda_i f_i(\boldsymbol{\sigma})\right)$$

So now it is necessary to define which expectation values will be constrained. The authors have decided to constrain the following:

1. The mean. The function is $f_i=\sigma_i$, with one $i$ for each dimension, and thus there will be a term $-\sum_i h_i \sigma_i$ in the exponential (the $\lambda$ was renamed to $h$ for thermodynamical reasons, as the first two terms in this model are equivalent to the Ising model).

2. Pairwise correlations. This is equivalent to the second order moment, with $f_{ij}=\sigma_i\sigma_j$. The term added to the exponential will be $-\frac{1}{2}\sum_{i,j}J_{ij}\sigma_i\sigma_j$, again with the Lagrange multipliers renamed.

3. The proportion of spiking neurons vs. silent ones. To define this, the authors propose a distribution $P(K)=\sum_{\boldsymbol{\sigma}}\, P(\boldsymbol{\sigma})\,\delta(\sum_i \sigma_i, 2K-N)$ (with $K$ neurons spiking and $N-K$ silent, the spins sum to $K-(N-K)=2K-N$). This distribution is fixed by its $N$ moments, which reduce to

$$\langle K^k\rangle=\sum_{\boldsymbol{\sigma}} \left(\sum_i \sigma_i\right)^k P(\boldsymbol{\sigma})$$

once you kill the delta. It is now clear that the max entropy function will be $f_k=\left(\sum_i \sigma_i\right)^k$ corresponding to a term

$$-\sum_{k=1}^n \lambda_k \left(\sum_i \sigma_i\right)^k$$

in the exponential (no renaming convention here). I am a bit suspicious of this term here since the different powers of $\sum_i \sigma_i$ are not really independent quantities, which might mean that the model is being overfit. I would be interested in seeing a comparison between this term and a simpler one, with just a single term ($k=1$) considered.

So their probability distribution looks like 

$$P(\boldsymbol{\sigma})\propto\exp\left(-\sum_i h_i \sigma_i-\frac{1}{2}\sum_{i,j}J_{ij}\sigma_i\sigma_j-\sum_{k=1}^n \lambda_k \left(\sum_i \sigma_i\right)^k\right)$$

which is kind of the Ising model with the addition of that last term. The crucial step is now to find $\{\mathbf{h}, \mathbf{J},\boldsymbol{\lambda}\}$ such that the expectations of the distribution match the ones calculated from the data. So one can run an optimization algorithm which will change the values of those parameters until the expectations for the distribution match (or are close enough to) the experimental ones.

But $P(\boldsymbol{\sigma})$ as defined has no analytic expression for the desired expectation values, so they must be calculated numerically. The authors use MCMC to calculate them, because of the high dimensionality of the problem. Of course, this is super computationally expensive, so they use a little trick to make it go faster. The idea is that if you only change the parameters slightly, the distribution will also only change slightly, and therefore the MCMC samples can be reused. Consider a max ent distribution with a vector of parameters $\lambda$. You can rewrite the expectation value of a function $g$ as

\begin{aligned}\langle g\rangle_{\lambda'}&=\sum_{\boldsymbol{\sigma}} g(\boldsymbol{\sigma}) P_{\lambda'}(\boldsymbol{\sigma})\\&=\sum_{\boldsymbol{\sigma}} g(\boldsymbol{\sigma}) \frac{P_{\lambda'}(\boldsymbol{\sigma})}{P_{\lambda}(\boldsymbol{\sigma})}P_{\lambda}(\boldsymbol{\sigma})\\&=\sum_{\boldsymbol{\sigma}}  g(\boldsymbol{\sigma}) \frac{Z_{\lambda}}{Z_{\lambda'}} \exp\left(-(\lambda'-\lambda)\cdot f(\boldsymbol{\sigma})\right) P_{\lambda}(\boldsymbol{\sigma})\\&=\frac{Z_{\lambda}}{Z_{\lambda'}}  \left\langle g \exp\left(-(\lambda'-\lambda)\cdot f\right) \right\rangle_{\lambda}\\&=\frac{\left\langle g \exp\left(-(\lambda'-\lambda)\cdot f\right) \right\rangle_{\lambda}}{\left\langle \exp\left(-(\lambda'-\lambda)\cdot f\right) \right\rangle_{\lambda}} \end{aligned}

So if you have the samples for the parameter set $\lambda$ you can estimate the value for the parameter set $\lambda'$ with little error (indeed, the above formula is exact; any error comes from the fact that we have finite samples from MC). Once you have moved too far away from the original distribution, your samples will no longer do a very good job of approximating the new one, so the authors propose resampling after a given number of evaluations (though they don’t mention how they chose the maximum number of evaluations, which probably deserves a bit more discussion). Okay, so now they just plug this into their favorite (derivative-free) optimizer and they’re golden.
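In code, the trick boils down to a weighted average over the stored samples; in this sketch `g`, `f` and `samples` are placeholders for the observable, the vector of constrained feature functions, and the MCMC draws taken under $\lambda$:

```python
import numpy as np

def reweighted_expectation(g, f, samples, lam, lam_new):
    """Estimate <g> under parameters lam_new from samples drawn under lam,
    using the identity above (the Z ratio cancels in the final ratio)."""
    logw = np.array([-(lam_new - lam) @ f(s) for s in samples])
    logw -= logw.max()                        # stabilize the exponentials
    w = np.exp(logw)
    g_vals = np.array([g(s) for s in samples])
    return np.sum(g_vals * w) / np.sum(w)
```

Note that the ratio $Z_{\lambda}/Z_{\lambda'}$ never needs to be computed explicitly: it cancels when dividing by the reweighted normalization, exactly as in the last line of the derivation.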

They then go on to discuss the results, where they show that this procedure produces a pretty good fit to the data with apparently low overfitting. I’d invite you to read the full paper, which has a lot of discussion on the actual neuroscience and information theory.

Maximum entropy: a primer and some recent applications

I’ll let Caticha summarize the principle of maximum entropy:

Among all possible probability distributions that agree with whatever we know select that particular distribution that reflects maximum ignorance about everything else. Since ignorance is measured by entropy, the method is mathematically implemented by selecting the distribution that maximizes entropy subject to the constraints imposed by the available information.

It appears to have been introduced by Jaynes in 1957, and has seen a resurgence in the past decade with people taking Bayesian inference more seriously. (As an aside, Jaynes’s posthumously published book is well worth a read, in spite of some cringeworthy rants peppered throughout.) I won’t dwell too much on the philosophy as the two previously mentioned sources have already gone into great detail to justify the method.

Usually we consider constraints which are linear in the probabilities, namely we constrain the probability distribution to have specific expectation values. Consider that we know the expectation values of a certain set of functions $f^k$. Then, $p(x)$ should be such that

$$\langle f^k \rangle = \int dx \; p(x) f^k(x)$$

for all $k$. Let’s omit the notation $(x)$ for simplicity. Then, we can use variational calculus to find the $p$ that extremizes the functional

$$S[p]\; - \alpha \int dx\; p\; - \sum_k \lambda_k \langle f^k \rangle$$

The term with $\alpha$ enforces the normalization condition and $S$ is the Shannon entropy.

The solution to this is

$$p = \frac{1}{Z}\exp\left(-\sum_k\lambda_k f^k \right) $$

with

$$Z=\int dx \; \exp \left(-\sum_k\lambda_k f^k \right)$$

the partition function (which is just the normalization constant). Now, we can find the remaining multipliers by solving the system of equations

$$-\frac{\partial \log Z}{\partial \lambda_k} = \langle f^k \rangle$$

I’ll let you confirm that if we fix the mean and variance we get a Gaussian distribution. Go on, I’ll wait.
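For the impatient: constraining $f^1=x$ and $f^2=x^2$ gives

$$p(x)=\frac{1}{Z}\exp\left(-\lambda_1 x-\lambda_2 x^2\right)$$

and completing the square in the exponent shows this is a Gaussian with mean $-\lambda_1/(2\lambda_2)$ and variance $1/(2\lambda_2)$.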

Power law distributions with maxent

I read a recent paper in PNAS which shows how to use this formalism to derive an intuition for why distributions with power law tails are so ubiquitous. It is not very well written, but I think I successfully deduced what the authors meant. Suppose you have a system with a given number of particles, where you ‘pay’ a fixed price for each particle that joins a cluster. If you consider the first one to already be there, the total cost paid is $f(k)=(k-1)\mu$, with $\mu$ the chemical potential (we are now dealing with discrete states, with $k$ indexing the number of particles in the cluster).

By bundling every constant into a $\mu^{\circ}$ factor, it’s not necessary to specify the actual value of the expectation and determine the Lagrange multiplier: whatever it is, the distribution will look like $p_k=\frac{\exp(-\mu^{\circ} k)}{Z}$ with this arbitrary $\mu^{\circ}$ factor (remember $Z$ is just a normalization constant). This is an exponential distribution, which means events far away from the mean are comparatively very rare. The authors now propose to look at economies of scale: what if it gets cheaper to add a new particle as the cluster grows? Then the cost of adding a particle to a cluster of size $k$ is $k_0\mu/(k+k_0)$, a Hill-type function where $k_0$ describes how much the cost is spread out with each additional particle. So the function $f$ becomes $f(k)=\sum_{j=1}^{k-1} k_0\mu/(j+k_0)$. Then you repeat the calculation and you get

$$p_k=\frac{\exp(-\mu^{\circ} k_0 \Psi(k+k_0))}{Z}$$

Here, $\Psi$ is the digamma function, which is just a fancy way of hiding the sum you saw in the function $f$ (when you do that, you also get a constant term which then cancels out). You can expand it for large $k$ and get $\Psi \sim \log(k+k_0-1/2)$. Then the probability distribution is $p_k \sim \left(k+k_0-1/2\right)^{-\mu^{\circ} k_0}$. Cool. The authors tested this by fitting $\mu^{\circ}$ and $k_0$ to various datasets with power law distributions, which doesn’t really show much more, since we already know these datasets follow power laws and the expression they fit is a power law. The main message here is that you can get a power law from the maximum entropy principle, which suggests a sort of universality among systems brought about by this kind of cost amortization.
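A quick numerical sanity check of that expansion (a sketch with arbitrary values for $\mu^{\circ}$ and $k_0$, using SciPy’s digamma):

```python
import numpy as np
from scipy.special import digamma

mu0, k0 = 1.0, 10.0  # arbitrary example values for mu° and k_0
k = np.arange(1, 1001)

p_exact = np.exp(-mu0 * k0 * digamma(k + k0))  # unnormalized maxent result
p_power = (k + k0 - 0.5) ** (-mu0 * k0)        # large-k power law form

print(p_exact[-5:] / p_power[-5:])  # ratio tends to a constant for large k
```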

In the next post, I’ll talk about a more complicated application of the maximum entropy principle to neural codes.