Transformers work by re-reading the entire sequence at every step. Every token attends to every other token. This is powerful — nothing gets lost, all context is available — but the cost is brutal. Attention is O(n²) in both time and memory. Double the context length, quadruple the compute.
At 4096 tokens that's manageable. At 100k tokens it becomes a serious engineering problem. At 1M tokens — the kind of context you'd need to reason over an entire codebase or a long book — it's essentially intractable without approximation tricks that tend to hurt quality.
The deeper issue is architectural: the attention matrix grows with the sequence. There's no compression. Every past token is stored at full resolution forever. This is unlike how biological memory or most real-world signal processing works — both operate on compressed state that evolves over time and naturally forgets irrelevant detail.
A State Space Model describes how a hidden state h(t) evolves over time in response to an input signal x(t), and how that state produces an output y(t). In continuous time:
h'(t) = A·h(t) + B·x(t)
y(t) = C·h(t) + D·x(t)
A: state transition matrix — how state evolves
B: input projection — how input enters state
C: output projection — how state becomes output
D: skip connection (often 0)
This is a linear dynamical system — the same equations used in control theory since the 1960s to model dynamical systems such as aircraft autopilots, economies, and signal filters. The insight of S4 and Mamba is that these equations can serve as neural network layers if A, B, C are learned parameters.
The state h is a fixed-size vector — maybe 64 or 256 dimensions — that compresses the entire history of the input seen so far. No matter how long the sequence, the state stays the same size.
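A minimal numpy sketch of this property (toy matrices, not trained parameters): the state after 10,000 tokens occupies exactly as much memory as the state after 100.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, xs):
    """Run the discrete recurrence h_t = A_bar·h_{t-1} + B_bar·x_t, y_t = C·h_t."""
    h = np.zeros(A_bar.shape[0])   # fixed-size state, regardless of len(xs)
    ys = []
    for x in xs:
        h = A_bar @ h + B_bar * x
        ys.append(C @ h)
    return np.array(ys), h

rng = np.random.default_rng(0)
N = 16                             # state dimension -- stays 16 at any length
A_bar = 0.9 * np.eye(N)            # toy stable transition (illustrative)
B_bar = rng.normal(size=N)
C = rng.normal(size=N)

_, h_short = ssm_scan(A_bar, B_bar, C, rng.normal(size=100))
_, h_long = ssm_scan(A_bar, B_bar, C, rng.normal(size=10_000))
assert h_short.shape == h_long.shape == (N,)   # same state size either way
```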
The same SSM can be computed three different ways depending on what you need. This is the architectural sleight-of-hand that makes it practical:
Recurrent mode: Process one token at a time, updating state h. O(1) per step, O(n) total. Fast autoregressive generation; constant memory regardless of sequence length.
Convolutional mode: Unroll the recurrence into a convolution kernel and parallelize over the entire sequence. Fast on GPU. The kernel is derived analytically from A, B, C.
Continuous mode: The underlying math is a continuous-time ODE. Discretization (zero-order hold or bilinear) converts it to the recurrent form with timestep Δ.
Train as convolution (fast, parallelizable), deploy as RNN (fast, constant memory). You get the best of both worlds.
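The equivalence of the two views can be checked numerically. A sketch, assuming a toy discretized system (diagonal Ā for simplicity): the recurrent output matches a causal convolution with kernel K_k = C·Ā^k·B̄.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 8, 32
A_bar = np.diag(rng.uniform(0.5, 0.95, size=N))  # toy discretized transition
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

# Recurrent view: one state update per token
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = C @ h

# Convolutional view: precompute the kernel K_k = C · A_bar^k · B_bar
K = np.array([C @ np.linalg.matrix_power(A_bar, k) @ B_bar for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

assert np.allclose(y_rec, y_conv)   # same outputs, two compute paths
```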
The central problem with vanilla RNNs is that they can't reliably store information from many steps ago — gradients vanish and state gets corrupted. LSTMs hack around this with gates. SSMs solve it differently: initialize A to a specific structured matrix with provably good memory properties.
The HiPPO (High-order Polynomial Projection Operators) framework derives this matrix mathematically. The idea: at each timestep, the optimal state is the one that best reconstructs the entire history of the input as a Legendre polynomial expansion. The A matrix that achieves this has a closed-form solution.
A_nk = √((2n+1)(2k+1))   if n > k
     = n + 1             if n = k
     = 0                 if n < k

HiPPO-LegS matrix ("memorize" measure = Legendre); the state dynamics use its negation, h'(t) = −A·h(t) + B·x(t). Each row n holds the coefficients for the nth Legendre polynomial.
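The closed form translates directly to code. A sketch of the 0-indexed construction (following the convention that the SSM uses −A):

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS matrix A (0-indexed):
    A[n,k] = sqrt((2n+1)(2k+1)) if n > k, n+1 if n == k, 0 if n < k.
    The SSM dynamics use the negation: h' = -A·h + B·x."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = n + 1
    return A

A = hippo_legs(4)
assert np.allclose(np.triu(A, k=1), 0)   # zero above the diagonal
assert A[0, 0] == 1 and A[1, 1] == 2     # diagonal entries are n + 1
```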
The result: the hidden state continuously encodes a compressed polynomial approximation of the input history. Longer history = higher-degree polynomial = more coefficients needed. But crucially, recent events are always represented accurately while older events gracefully degrade — exactly like human working memory.
Neural networks process discrete tokens, not continuous signals. The continuous SSM (h' = Ah + Bx) must be converted to a discrete recurrence. This is done via the Zero-Order Hold (ZOH) method with a learned timestep parameter Δ:
Ā = exp(Δ·A)
B̄ = (Δ·A)⁻¹(exp(Δ·A) − I)·Δ·B
Ā, B̄ = discretized matrices used in recurrence
Δ = timestep — how much time each token represents
The discrete recurrence is then just:
h_t = Ā·h_{t-1} + B̄·x_t
y_t = C·h_t
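A sketch of ZOH discretization for the diagonal-A case common in S4D/Mamba-style models, where the matrix exponential reduces to an elementwise exp (toy values; diagonal A is an assumption here):

```python
import numpy as np

def zoh_discretize_diag(a, b, delta):
    """Zero-order hold for diagonal A with entries `a` (per-dimension):
    A_bar = exp(Δ·a),  B_bar = (exp(Δ·a) - 1) / a · b."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

a = np.array([-1.0, -0.5])   # toy (stable) diagonal entries
b = np.array([1.0, 2.0])
a_bar, b_bar = zoh_discretize_diag(a, b, delta=0.01)

# For small Δ this approaches the Euler step: A_bar ≈ 1 + Δ·a, B_bar ≈ Δ·b
assert np.allclose(a_bar, 1 + 0.01 * a, atol=1e-4)
assert np.allclose(b_bar, 0.01 * b, atol=1e-4)
```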
The timestep Δ is a learned parameter. Large Δ = the model "zooms out," each token represents a long time span, old state decays faster. Small Δ = zoom in, state is preserved more carefully. This is one knob Mamba later makes input-dependent.
The visualization below shows a simplified SSM processing a token sequence. Each token updates the hidden state (visualized as a heatmap of state dimensions). Watch how information from early tokens persists — and how the state compresses rather than grows.
S4 (Structured State Space Sequence model, 2021) was the first SSM to match transformers on long-range benchmarks. The key innovation: parameterize A as a diagonal-plus-low-rank (DPLR) matrix. This makes the computation tractable — instead of materializing powers of a dense matrix, the convolution kernel can be evaluated efficiently via the Woodbury identity and Cauchy kernel computations.
A = Λ − P·Q*   (Λ = diagonal, P/Q = low-rank correction)
Enables O(N log N) convolution via Cauchy kernel
S4 crushed the Long Range Arena benchmark — tasks requiring dependencies over thousands of tokens where transformers struggle. It proved that linear recurrence with the right initialization could be competitive with attention.
The limitation: A, B, C are fixed per channel — they don't change based on the input. The model processes every input the same way regardless of content. This makes it a great filter but a poor selector.
Gu and Dao's key observation: language is selective. When you read a sentence, some words matter a lot for understanding the next word, and others barely matter. A fixed filter can't know which is which — it has to treat all inputs the same.
The Mamba fix: make B, C, and Δ functions of the current input x_t. They're computed by small linear projections of the input at each step:
B_t = Linear_B(x_t)
C_t = Linear_C(x_t)
Δ_t = softplus(Linear_Δ(x_t))
Now the state transition changes per token
Model decides what to remember and what to forget
This breaks the convolutional view — you can no longer precompute a fixed kernel, because the kernel now changes with each input. Instead, Mamba uses a parallel scan (prefix-sum algorithm) that computes all steps in O(log n) parallel depth with O(n) total work. The tradeoff: slightly more compute than S4's pure convolution, but now competitive with transformers on tasks requiring content-based recall.
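A scalar sketch of why a scan still works when parameters vary per token: the per-step update defines an associative combine operator, which is exactly what a parallel prefix scan exploits. (The projections w_delta and w_b below are illustrative toys, not Mamba's actual parameterization.)

```python
import numpy as np

rng = np.random.default_rng(2)
L = 64
x = rng.normal(size=L)

# Input-dependent ("selective") scalar parameters -- toy projections
w_delta, w_b = rng.normal(), rng.normal()
delta = np.log1p(np.exp(w_delta * x))   # softplus: per-token timestep Δ_t
a_bar = np.exp(-delta)                  # per-token decay (continuous A = -1)
b_bar = (w_b * x) * x                   # input-dependent B_t applied to x_t

# Reference: direct sequential recurrence h_t = a_bar_t·h_{t-1} + b_bar_t
h, hs = 0.0, []
for t in range(L):
    h = a_bar[t] * h + b_bar[t]
    hs.append(h)

# Same result via an associative combine -- the property a parallel scan needs
def combine(e1, e2):
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

acc = (a_bar[0], b_bar[0])
scanned = [acc]
for t in range(1, L):
    acc = combine(acc, (a_bar[t], b_bar[t]))
    scanned.append(acc)

assert np.allclose([s[1] for s in scanned], hs)
```

Because `combine` is associative, a GPU can evaluate the prefix in a balanced tree of depth O(log n) instead of the sequential loop shown here.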
In Mamba, the "forgetting rate" of the hidden state depends on the current token. The visualization shows how Δ_t (timestep / forget gate) varies per token. High Δ = fast decay, the model is choosing to forget old state and focus on the current token. Low Δ = slow decay, the model preserves the accumulated context.
Compare: S4 would use a fixed Δ for all tokens (flat line). Mamba learns to spike Δ at "reset" boundaries — sentence ends, topic changes, paragraph breaks — and keep it low within coherent context windows. This is the mechanism that lets it do in-context learning.
The selective scan naively requires materializing intermediate states — expensive in high-bandwidth memory (HBM). Mamba's implementation uses a kernel fusion trick analogous to Flash Attention: keep intermediate states in fast SRAM, never write them to HBM.
The implication for inference is significant: a Mamba model generating a 100k-token sequence uses the same memory as generating a 100-token sequence. A transformer's KV cache grows linearly — at 100k tokens it becomes the dominant memory cost.
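Back-of-envelope arithmetic for a hypothetical 7B-class transformer (layer and head counts below are illustrative, not any specific model):

```python
# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, fp16 cache
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

def kv_cache_bytes(seq_len):
    # 2 tensors (K and V) per layer, per head, per cached position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16

print(kv_cache_bytes(100) / 2**20)      # ~12.5 MiB at 100 tokens
print(kv_cache_bytes(100_000) / 2**30)  # ~12.2 GiB at 100k tokens
```

The cache grows by ~128 KiB per token under these assumptions, while a Mamba layer's state stays constant.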
Mamba-2 (2024) revealed something unexpected: attention and SSMs are mathematically dual. Under certain constraints, a selective SSM computes exactly the same thing as a linear attention mechanism. The state space is the "dual" representation of the key-value cache.
Attn: o_t = Σ_{s≤t} (q_t·k_s)·v_s
SSM:  o_t = Σ_{s≤t} C_t·(Ā_t⋯Ā_{s+1})·B̄_s·x_s

State Space Duality: both compute the same function when Ā_t is structured appropriately.
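The duality is easy to verify numerically in the no-decay special case Ā_t = I, where the selective SSM reduces exactly to causal linear attention with q_t = C_t, k_s = B_s, v_s = x_s:

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 16, 4
C = rng.normal(size=(L, N))   # per-token C_t (plays the role of q_t)
B = rng.normal(size=(L, N))   # per-token B_s (plays the role of k_s)
x = rng.normal(size=L)        # inputs        (play the role of v_s)

# Selective SSM with A_bar = I (no decay): h_t = h_{t-1} + B_t·x_t, o_t = C_t·h_t
h = np.zeros(N)
o_ssm = np.empty(L)
for t in range(L):
    h = h + B[t] * x[t]
    o_ssm[t] = C[t] @ h

# Causal linear attention: o_t = Σ_{s≤t} (q_t·k_s)·v_s
o_attn = np.array([sum((C[t] @ B[s]) * x[s] for s in range(t + 1))
                   for t in range(L)])

assert np.allclose(o_ssm, o_attn)   # identical outputs
```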
This led to the SSD (Structured State Space Duality) layer — a unified formulation that can be computed either as an SSM recurrence or as a structured matrix multiplication, whichever is faster on the given hardware.
Mamba-2 is also easier to make work with tensor parallelism (splitting across multiple GPUs) — a practical advantage for training at scale. The architecture is now being integrated into hybrid models (Jamba, Zamba) that interleave attention and SSM layers.
The lossy state is the fundamental weakness. Transformers can do exact key-value lookup (retrieve a specific token from position 3). Mamba compresses history — if that token got overwritten by later inputs, it's gone. For tasks requiring precise retrieval, transformers still win. For tasks requiring long coherent context, Mamba scales better.