A Hopfield Network is a recurrent neural network that functions as an associative memory — given a partial or corrupted pattern, it retrieves the closest stored memory. John Hopfield introduced it in 1982, and it was immediately recognized as a landmark: for the first time, memory storage and retrieval were understood as a single physical process — energy minimization.
Every configuration of the network's N binary neurons corresponds to one of 2^N points in state space. The network defines an energy function over this space, and stored memories are local energy minima — valleys in the landscape. Given a starting state (a noisy or incomplete query), the network's update rule always moves downhill. The system converges to a nearby minimum, retrieving the stored pattern.
The biological analogy is direct: this is how the brain might store and retrieve episodic memories. A smell triggers a whole scene. A fragment of a song surfaces the whole song. The Hopfield model gave the first mathematically rigorous account of how this could work in a neural substrate — using nothing but physics.
The energy landscape has valleys at stored memories. Click anywhere on the landscape to set a starting state — watch it roll downhill to the nearest memory. The update rule always decreases energy until a fixed point is reached.
N binary neurons, each sᵢ ∈ {-1, +1}. The energy function (from physics — identical to the Ising spin glass model):

E = -½ Σᵢⱼ wᵢⱼ sᵢ sⱼ

where the sum runs over all pairs i ≠ j and the weights are symmetric: wᵢⱼ = wⱼᵢ.
To store M patterns ξ¹,...,ξᴹ, use Hebbian learning — neurons that fire together wire together:

wᵢⱼ = (1/N) Σₘ ξᵢᵐ ξⱼᵐ

This is the outer-product rule: one shot, computed directly from the patterns, no backprop needed.
The update rule: set neuron i to sign(Σⱼ wᵢⱼ sⱼ). Each update either decreases E or leaves it unchanged, so the system is guaranteed to converge to a local minimum. Below capacity, the stored patterns sit at (or very near) those minima.
Capacity limit: stores at most ~0.14N patterns reliably. Beyond this, "spurious memories" (local minima that aren't stored patterns) proliferate and retrieval fails.
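The whole classical model — Hebbian storage, asynchronous updates, energy descent — fits in a few lines. A minimal NumPy sketch (pattern count, size, and noise level are illustrative, not from the original):

```python
import numpy as np

def train_hebbian(patterns):
    """Hebbian outer-product rule: w_ij = (1/N) sum_m xi_i^m xi_j^m, w_ii = 0."""
    N = patterns.shape[1]
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

def recall(W, state, max_sweeps=100):
    """Asynchronous updates: set s_i to the sign of its local field until a fixed point."""
    s = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in np.random.permutation(len(s)):
            h = W[i] @ s                       # local field on neuron i
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:                        # fixed point: a local energy minimum
            break
    return s

def energy(W, s):
    return -0.5 * s @ W @ s

# store two random 25-bit patterns, corrupt one, retrieve
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(2, 25))
W = train_hebbian(patterns)
noisy = patterns[0].copy()
noisy[:5] *= -1                                # flip 5 of 25 bits
recovered = recall(W, noisy)
```

With 2 patterns in 25 neurons the network is far below the ~0.14N limit, so the corrupted probe rolls downhill to a fixed point of the dynamics.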
Store binary patterns (5x5 pixels). Corrupt one and watch the network retrieve the original through iterated updates.
Ramsauer et al. (2020) showed that replacing the quadratic energy function with an exponential (log-sum-exp) one shatters the 0.14N capacity limit:

E(ξ) = -lse(β, Xᵀξ) + ½ ξᵀξ + const

where lse(β, z) = β⁻¹ log Σᵢ exp(β zᵢ), X is the matrix of stored patterns, ξ is the state, and β is the inverse temperature.
The new update rule is the softmax over dot products between the query and all stored patterns — which is exactly the attention mechanism in transformers.
one update step = one attention operation
Capacity scales exponentially with N: up to ~2^(N/2) patterns can be stored without interference. The price: continuous-valued patterns instead of binary, and an energy function with interactions of higher order than quadratic (sharpness controlled by β).
This is one of the most surprising theoretical results in recent deep learning: the transformer attention mechanism is mathematically equivalent to one update step of a modern Hopfield network trying to retrieve a stored pattern.
Attn(Q, K, V) = softmax(QKᵀ/√d) V, which for a single query q in column form reads V · softmax(Kᵀq/√d)
Modern Hopfield update:
ξ_new = X · softmax(β · Xᵀξ)
Q = query ξ, K = stored patterns X, V = stored patterns X
β = 1/√d (inverse temperature = scale factor)
The correspondence: keys and values are both the stored pattern matrix X. The query ξ is the retrieval probe. The scale factor β controls retrieval sharpness — high β focuses on the closest pattern (winner-take-all), low β averages across many.
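The correspondence can be checked numerically: with K = V = X and β = 1/√d, one Hopfield update and one attention operation compute the same thing. A sketch (dimensions and the single-query setup are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

d, M = 16, 8                  # pattern dimension, number of stored patterns
rng = np.random.default_rng(1)
X = rng.standard_normal((d, M))   # stored patterns as columns (keys = values)
xi = rng.standard_normal(d)       # query / retrieval probe

beta = 1.0 / np.sqrt(d)           # inverse temperature = attention scale factor

# modern Hopfield update: xi_new = X softmax(beta X^T xi)
hopfield_out = X @ softmax(beta * (X.T @ xi))

# single-query attention with K = V = X: V softmax(K^T q / sqrt(d))
attn_out = X @ softmax((X.T @ xi) / np.sqrt(d))
```

The two outputs agree to floating-point precision, since the scale factor 1/√d plays exactly the role of β.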
In the modern Hopfield model, β (inverse temperature) controls retrieval sharpness. Drag the slider to see how it transitions from averaging across all memories (low β, soft attention) to sharp winner-take-all retrieval (high β, hard attention). This is exactly what the attention scale factor 1/√d does in transformers.
High β: concentrates on nearest pattern. Hard retrieval. Winner-take-all.
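The sharpness transition can be seen directly in the softmax weights over stored patterns. A small sketch (the probe construction and β values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
X = rng.standard_normal((16, 8))               # 8 stored patterns as columns
xi = X[:, 3] + 0.1 * rng.standard_normal(16)   # noisy probe near one pattern

scores = X.T @ xi                              # similarity of probe to each memory
low = softmax(0.01 * scores)   # low beta: near-uniform mixture of memories
high = softmax(10.0 * scores)  # high beta: weight piles onto the best match
```

Raising β never changes which memory wins; it only concentrates the retrieval weights onto it.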
The Hopfield network energy function is identical to the Ising spin glass model — a statistical mechanics model of disordered magnetic systems, studied since the 1970s. Hopfield saw that the same mathematics that described frustrated magnets could describe memory.
Hopfield energy: E = -½ Σᵢⱼ wᵢⱼ sᵢ sⱼ
Ising energy: E = -½ Σᵢⱼ Jᵢⱼ σᵢ σⱼ
Formally identical: J = w, σ = s.
Amit, Gutfreund, and Sompolinsky (1985) did the full statistical mechanics analysis of Hopfield networks using replica theory — borrowed from spin glass theory. They derived the 0.14N capacity limit rigorously. The tools of physics became tools of neural network theory.
The 2020 reinterpretation of transformers as Hopfield networks isn't just theoretical — it's changed how people think about what attention is doing:
This framing also explains why attention heads behave as "lookup operations" — each head is an associative memory retrieval with different stored patterns. It also motivated new architectures: LSTM-inspired recurrent models using the Hopfield retrieval update as an explicit memory module.
Add temperature T to the Hopfield update: instead of the deterministic sign(·), set neuron i to +1 with probability P(sᵢ = +1) = σ((2/T) Σⱼ wᵢⱼ sⱼ), where σ is the logistic sigmoid. As T → 0: deterministic Hopfield. As T → ∞: random coin flips. At intermediate T, the network samples from a Boltzmann distribution over states. This is the Boltzmann Machine (Hinton & Sejnowski, 1986) — the first model of learned probabilistic inference in neural networks.
thermal equilibrium = Boltzmann distribution
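One sweep of this stochastic (Glauber) update can be sketched as follows (network size, weights, and temperatures are illustrative; weights are random rather than Hebbian just to show the dynamics):

```python
import numpy as np

def sigmoid(x):
    x = np.clip(x, -500, 500)   # avoid overflow at extreme temperatures
    return 1.0 / (1.0 + np.exp(-x))

def stochastic_step(W, s, T, rng):
    """One sweep of Glauber dynamics: P(s_i = +1) = sigmoid((2/T) * local field)."""
    s = s.copy()
    for i in range(len(s)):
        h = W[i] @ s                          # local field on neuron i
        p_plus = sigmoid(2.0 * h / T)
        s[i] = 1 if rng.random() < p_plus else -1
    return s

rng = np.random.default_rng(3)
N = 10
W = rng.standard_normal((N, N))
W = (W + W.T) / 2                             # symmetric weights
np.fill_diagonal(W, 0.0)                      # no self-connections
s = rng.choice([-1, 1], size=N)

cold = stochastic_step(W, s, T=1e-6, rng=rng)  # T -> 0: deterministic Hopfield
hot = stochastic_step(W, s, T=1e6, rng=rng)    # T -> inf: coin flips
```

At very low T the sigmoid saturates and the sweep reproduces the deterministic sign(·) update exactly.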
The exponential capacity comes at a cost: the energy function requires polynomial terms of degree 2n (where n controls the capacity-error tradeoff), making it more computationally expensive. The β → ∞ limit achieves exponential capacity but requires exact nearest-neighbor lookup — equivalent to hard attention.
John Hopfield and Geoffrey Hinton were awarded the 2024 Nobel Prize in Physics for "foundational discoveries and inventions that enable machine learning with artificial neural networks." Hopfield specifically for the associative memory network; Hinton for the Boltzmann Machine.
The Nobel committee explicitly highlighted the physics connection: the Hopfield network imported the Ising model and spin glass theory directly into neuroscience and AI. Statistical mechanics became a design tool for computing systems.