MCMC and the Ising Model: How to Sample Efficiently¶

Last blog post, I talked about how we can write algorithms so that a computer can draw samples from an arbitrary probability distribution $\pi$. The techniques we developed were very practical, and are used in machine learning, computer graphics, and many other fields. However, one assumption we made was that we could efficiently compute the value $\pi(x)$ for every $x$ in our state space $\mathcal{X}$. What do we do if this does not hold anymore? For example, What if $\pi(x)$ involves a sum that is computationally intractable (ex. a sum with $2^{100}$ terms)? Do we give up? Of course not! We can use Markov Chain Monte Carlo algorithm to save the day.

The Ising Model¶

The Ising Model is an example of a situation where computing $\pi$ is computationally intractable. First developed by the German physicist Ernst Ising, this is a model for ferromagnetism. Note that in this post, I will focus on the statistical physics discussion of the Ising model. However, this model is used in various other fields, from machine learning and computer vision to quantum computing.

The Ising model considers a lattice of points (particles), each with a spin of $\pm 1$ representing the alignment of the magnetic field of the particles. More formally, let $\Lambda \subseteq \mathbb{R}^d$ bet a finite set representing all the lattice points, and $\sigma: \Lambda \to \{-1, 1\}$ be a function which assigns spins to each particle in the lattice (called a configuration).

Let's visualize what this looks like for a specific case. Say $d = 2$, and $\Lambda = \{(i,j) \in \mathbb{Z}^2∣ 1 \leq i \leq 5, 1 \leq j \leq 5\}$, and $$ \sigma(i, j) = \begin{cases} +1 & \text{if } i \leq j \\ -1 & \text{otherwise} \\ \end{cases} $$

In [62]:
Lambda = [(i, j) for i in range(1, 5 + 1) for j in range(1, 5 + 1)]
sigma = {(i, j): 1 if i <= j else -1 for (i, j) in Lambda}
plot_lattice(Lambda, sigma)
No description has been provided for this image

Of course, in the real world, the spin's might not be so symmetrical like this. So, let's see what a configuration that assigns a spin uniformly randomly for each lattice point looks like on a bigger lattice.

In [64]:
size = 15
Lambda = [(i, j) for i in range(1, size + 1) for j in range(1, size + 1)]
sigma = {(i, j): np.random.choice([-1, 1]) for (i, j) in Lambda}
plot_lattice(Lambda, sigma)
No description has been provided for this image

Hopefully this gives you an intuition on what we will be working with. Now, given a lattice $\Lambda$, for some $i, j \in \Lambda$ let's say that $i \sim j$ iff $i$ is $j$'s neighbor in the lattice (i.e, $i$ is the up/down/left/right neighbor of $j$). Now, given some spin configuration $\sigma$, we can define the "energy" of the configuration $$ E(\sigma) := J \sum_{\substack{i, j \in \Lambda \\ i \sim j}}\sigma(i) \sigma(j) - B \sum_{i \in \Lambda} \sigma(i) $$ Here, $J \neq 0$ is the interaction strength. For ferromagnetic materials, particles favor alignment and we have $J > 0$ , and for anti-ferromagnetic materials particles do not favor alignment and we have $J < 0$. The quantity $B$ represents the strength of the external magnetic field. Having spins align with the external magnetic field (i.e. sign $\sigma(i) = $ sign $\sigma(j)$) reduces the total energy.

Now, let's consider the set of all spin configurations $\mathcal{X} := \{\sigma: \Lambda \to \{+1, -1\}\}$. The Ising model assumes that spin configurations for particles in a lattice behave like a stochastic process with sample space $\mathcal{X}$. More specifically, for a material with interaction strength $J$ and external magnetic strength $B$, it assumes that we expect to see a spin configuration $\sigma$ with probability proportional to

$$ \pi_u (\sigma) := e^{- \beta H(\sigma)}, \text{ where } \beta = \frac{1}{k_B T} $$ and $k_B$ is the Boltzmann constant, and $T$ is the temperature. Note that $\pi$ is not a probability distribution since it does not sum to $1$. Therefore, we need to compute the normalizing constant $$ Z := \sum_{\sigma \in \mathcal{X}} \pi_u (\sigma) $$ and then we have that $$ \pi (\sigma) = \frac{1}{Z} \pi_u(\sigma) $$

At this point, alarm bells should be going off in your head. Remember what $\mathcal{X}$ was. It is the set of all possible spin configurations for a lattice $\Lambda$. Having a sum over all possible configurations is bad! We can view this as choosing a spin from $\{-1, 1\}$ to assign to each point in the lattice. Therefore, the total number of configurations is $$ \mathcal{X} = 2^{|\Lambda|} $$ Even if $|\Lambda| = 100$ (which is a fairly small lattice), then computing the normalizing constant $Z$ is hopeless. If you don't believe me, try running the following python code in your terminal and see how long it takes to terminate (hint: it will keep running until the end of the universe):

size = 100
Z = 0

for i in range(2 ** size):
    Z += math.exp(-i)
print(Z)

Do we really need $Z$?¶

So we're stuck. We want to sample from $\pi(\sigma)$, but computing $Z$ requires summing over $2^{|\Lambda|}$ terms, which is completely hopeless.

Here's where Markov Chain Monte Carlo (MCMC) algorithm comes in to save the day. The main idea behind MCMC is that instead of sampling from $\pi$ directly, we can build Markov chain whose states are spin configurations. We can then carefully define transition probabilities so that after running the chain for a while, the stationary distribution of states converges to $\pi$. Note, if you don't know what a Markov Chain is or have not heard of the terms "stationary distribution", don't worry I'll cover everything you need below. I might make another post going more into depth about these topics.

Markov Chains and the stationary distribution¶

A Markov chain is a sequence of random states $X_0, X_1, X_2, \ldots$ where each state only depends on the previous one. More formally, for any states $x_0, x_1, \ldots, x_t, x_{t+1}$, we have $$ P(X_{t+1} = x_{t+1} | X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_0 = x_0) = P(X_{t+1} = x_{t+1} | X_t = x_t) $$ In plain English: the future only depends on the present, not the past. This is called the Markov property.

We can describe a Markov chain by its transition probabilities $P(x \to y)$, which tell us the probability of moving from state $x$ to state $y$ in one step. Under reasonable conditions (finite state space, irreducible, and aperiodic), a Markov chain has a stationary distribution $\pi$. This is a probability distribution with the special property that if $X_t \sim \pi$, then $X_{t+1} \sim \pi$ as well (or equivalently, $(\pi P)(x) = \pi(x)$). Even better, no matter what state we start in, after running the chain for long enough, the distribution of $X_t$ converges to $\pi$.

Let's see this in action with a simple example. Consider a Markov chain with three weather states: Sunny, Cloudy, and Rainy. Each day, the weather transitions according to some fixed probabilities.

In [ ]:
# P[i,j] = P(tomorrow is j | today is i)
P = np.array([[0.7, 0.2, 0.1],   # From Sunny
              [0.3, 0.4, 0.3],   # From Cloudy
              [0.2, 0.3, 0.5]])  # From Rainy

dist1 = np.array([1.0, 0.0, 0.0])  # Start with 100% Sunny
dist2 = np.array([0.0, 1.0, 0.0])  # Start with 100% Cloudy
dist3 = np.array([0.0, 0.0, 1.0])  # Start with 100% Rainy


history1, history2, history3 = [dist1], [dist2], [dist3]

for _ in range(20):
    dist1 = dist1 @ P
    dist2 = dist2 @ P
    dist3 = dist3 @ P
    history1.append(dist1.copy())
    history2.append(dist2.copy())
    history3.append(dist3.copy())

history1, history2, history3 = np.array(history1), np.array(history2), np.array(history3)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

ax1.plot(history1[:, 0], 'o-', linewidth=2, markersize=4, label='Start Sunny')
ax1.plot(history2[:, 0], 'o-', linewidth=2, markersize=4, label='Start Cloudy')
ax1.plot(history3[:, 0], 's--', linewidth=2, markersize=4, label='Start Rainy')
ax1.set_xlabel('Day'); ax1.set_ylabel('Probability')
ax1.set_title('Sunny'); ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.plot(history1[:, 1], 'o-', linewidth=2, markersize=4, label='Start Sunny')
ax2.plot(history2[:, 1], 'o-', linewidth=2, markersize=4, label='Start Cloudy')
ax2.plot(history3[:, 1], 's--', linewidth=2, markersize=4, label='Start Rainy')
ax2.set_xlabel('Day'); ax2.set_ylabel('Probability')
ax2.set_title('Cloudy'); ax2.legend(); ax2.grid(True, alpha=0.3)

ax3.plot(history1[:, 2], 'o-', linewidth=2, markersize=4, label='Start Sunny')
ax3.plot(history2[:, 2], 'o-', linewidth=2, markersize=4, label='Start Cloudy')
ax3.plot(history3[:, 2], 's--', linewidth=2, markersize=4, label='Start Rainy')
ax3.set_xlabel('Day'); ax3.set_ylabel('Probability')
ax3.set_title('Rainy'); ax3.legend(); ax3.grid(True, alpha=0.3)

plt.suptitle('Convergence to Stationary Distribution from Different Starting Points')
plt.tight_layout(); plt.show()
No description has been provided for this image

Notice that all three starting distributions converge to the same stationary distribution! This is exactly the property we want to exploit. If we can design a Markov chain on spin configurations whose stationary distribution is $\pi$, then we can just run the chain for a while and sample configurations that are (approximately) distributed according to $\pi$. Problem solved! But how do we actually design such a Markov chain? We need to specify transition probabilities $P(\sigma \to \tau)$ that make $\pi$ the stationary distribution. This seems tricky, especially since we can't even compute $\pi$ itself (remember, we don't know $Z$). But it turns out that we don't need to compute $\pi$ to design these transition probabilities. We just need to make sure that the transition probabilities satisfy a condition called detailed balance: $$ \pi(\sigma) P(\sigma \to \tau) = \pi(\tau) P(\tau \to \sigma) $$ If this equation holds for all pairs of states $\sigma, \tau$, then $\pi$ is guaranteed to be the stationary distribution of the chain (proof of this is omitted, but try to convince yourself why). And look what happens when we rearrange this equation: $$ \frac{P(\sigma \to \tau)}{P(\tau \to \sigma)} = \frac{\pi(\tau)}{\pi(\sigma)} = \frac{1/Z \cdot \pi_u(\tau)}{1/Z \cdot \pi_u(\sigma)} = \frac{ \pi_u(\tau)}{\pi_u(\sigma)} $$

The $Z$ cancels out! So we can design transition probabilities that satisfy detailed balance using only ratios of $\pi_u$, which we can compute.

The Metropolis-Hastings Algorithm (MCMC)¶

The Metropolis-Hastings algorithm is a clever way to construct transition probabilities that satisfy detailed balance. Again, the proof of this is omitted. Here's how it works:

  1. Start with some initial configuration $\sigma^{(0)}$ (you can even pick it randomly)
  2. For $t = 0, 1, 2, \ldots$:
    • Propose a new configuration $\tau$ by randomly picking a site and flipping its spin
    • Compute the acceptance probability: $$ \alpha = \min\left(1, \frac{\pi_u(\tau)}{\pi_u(\sigma^{(t)})}\right) = \min\left(1, e^{-\beta(E(\tau) - E(\sigma^{(t)}))}\right) $$
    • Accept or reject: With probability $\alpha$, set $\sigma^{(t+1)} = \tau$. Otherwise, set $\sigma^{(t+1)} = \sigma^{(t)}$ (i.e., stay where you are)

Notice we only need to compute energies $E(\tau)$ and $E(\sigma^{(t)})$. Even better, since we're only flipping one spin at a time, we only need to look at the neighbors of that spin to compute the change in energy. If we run this chain long enough, the configurations $\sigma^{(t)}$ will eventually be distributed (approximately) according to $\pi$.

Below is an implementation of Metropolis-Hastings for the Ising model. We'll start with a completely random spin configuration and watch as the algorithm explores the state space. The animation shows how the spins evolve over time. Notice that even though we start from chaos, domains of aligned spins begin to form. This is the algorithm finding configurations with lower energy (remember, $J > 0$ means aligned spins are favored). The fact that we're seeing this emergent structure without ever computing $Z$ is pretty cool!

In [56]:
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

def E(sigma, Lambda, J, B):
    energy = 0

    for (i, j) in Lambda:
        if (i, j+1) in Lambda:
            energy += J * sigma[(i, j)] * sigma[(i, j+1)]
        if (i+1, j) in Lambda:
            energy += J * sigma[(i, j)] * sigma[(i+1, j)]

    for site in Lambda:
        energy -= B * sigma[site]

    return energy

def delta_E(sigma, Lambda, site, J, B):
    i, j = site
    current_spin = sigma[site]

    neighbor_sum = 0
    for neighbor in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]:
        if neighbor in Lambda:
            neighbor_sum += sigma[neighbor]

    delta_E = 2 * current_spin * (J * neighbor_sum - B)
    return delta_E

def metropolis_hastings(sigma, Lambda, J, B, beta):
    site = Lambda[np.random.randint(len(Lambda))]
    dE = delta_E(sigma, Lambda, site, J, B)
    alpha = min(1, np.exp(-beta * dE))

    if np.random.rand() < alpha:
        sigma[site] *= -1

    return sigma


size = 20
Lambda = [(i, j) for i in range(1, size + 1) for j in range(1, size + 1)]
sigma = {site: np.random.choice([-1, 1]) for site in Lambda}


J = 1.0
B = 0.0
T = 2.27
k_B = 1.0
beta = 1 / (k_B * T)

n_steps = 5000
save_every = 50
configurations = [sigma.copy()]

for step in range(n_steps):
    sigma = metropolis_hastings(sigma, Lambda, J, B, beta)
    if step % save_every == 0:
        configurations.append(sigma.copy())

xs = sorted(set(i for i, _ in Lambda))
ys = sorted(set(j for _, j in Lambda))
size_x, size_y = len(xs), len(ys)

fig, ax = plt.subplots(figsize=(8, 8))
for i in range(size_x):
    for j in range(size_y):
        if i < size_x - 1:
            ax.plot([j, j], [i, i + 1], color='black', lw=0.8)
        if j < size_y - 1:
            ax.plot([j, j + 1], [i, i], color='black', lw=0.8)

scatters = {}
for (i, j) in Lambda:
    color = 'red' if configurations[0][(i, j)] == 1 else 'blue'
    scatter = ax.scatter(j - 1, i - 1, color=color, s=180, zorder=3,
                        edgecolors='k', linewidth=0.3)
    scatters[(i, j)] = scatter


spin_up = plt.Line2D([], [], color='red', marker='o', linestyle='None',
                     markersize=10, label='Spin +1')
spin_down = plt.Line2D([], [], color='blue', marker='o', linestyle='None',
                       markersize=10, label='Spin -1')
ax.legend(handles=[spin_up, spin_down], loc='center left',
          bbox_to_anchor=(1.02, 0.5), frameon=True)

ax.set_aspect('equal')
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlim(-0.5, size_y - 0.5)
ax.set_ylim(-0.5, size_x - 0.5)
title = ax.set_title(f'Metropolis-Hastings: Step 0 (T={T:.2f})', fontsize=14)
ax.invert_yaxis()

def animate(frame):
    config = configurations[frame]
    for (i, j) in Lambda:
        color = 'red' if config[(i, j)] == 1 else 'blue'
        scatters[(i, j)].set_color(color)
    title.set_text(f'Metropolis-Hastings: Step {frame * save_every} (T={T:.2f})')
    return list(scatters.values()) + [title]

anim = FuncAnimation(fig, animate, frames=len(configurations),
                     interval=100, blit=True, repeat=True)

plt.tight_layout()
HTML(anim.to_jshtml())
Out[56]:
No description has been provided for this image
No description has been provided for this image