Transformers work by re-reading the entire sequence at every step. Every token attends to every other token. This is powerful — nothing gets lost, all context is available — but the cost is brutal. Attention is O(n²) in both time and memory. Double the context length, quadruple the compute.
At 4096 tokens that's manageable. At 100k tokens it becomes a serious engineering problem. At 1M tokens — the kind of context you'd need to reason over an entire codebase or a long book — it's essentially intractable without approximation tricks that tend to hurt quality.
The deeper issue is architectural: the attention matrix grows with the sequence. There's no compression. Every past token is stored at full resolution forever. This is unlike how biological memory or most real-world signal processing works — both operate on compressed state that evolves over time and naturally forgets irrelevant detail.
A State Space Model describes how a hidden state h(t) evolves over time in response to an input signal x(t), and how that state produces an output y(t). In continuous time:
h'(t) = A·h(t) + B·x(t)
y(t) = C·h(t) + D·x(t)
A: state transition matrix — how state evolves
B: input projection — how input enters state
C: output projection — how state becomes output
D: skip connection (often 0)
This is a linear dynamical system — the same equations used in control theory since the 1960s to model dynamical systems such as aircraft autopilots, economies, and signal filters. The insight of S4 and Mamba is that these equations can serve as neural network layers if A, B, C are learned parameters.
The state h is a fixed-size vector — maybe 64 or 256 dimensions — that compresses the entire history of the input seen so far. No matter how long the sequence, the state stays the same size.
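A minimal numpy sketch of this property (toy matrices, not trained parameters): the state after 10,000 tokens occupies exactly as much memory as the state after 100.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, xs):
    """Run the discrete recurrence h_t = A_bar·h_{t-1} + B_bar·x_t, y_t = C·h_t."""
    h = np.zeros(A_bar.shape[0])   # fixed-size state, regardless of len(xs)
    ys = []
    for x in xs:
        h = A_bar @ h + B_bar * x
        ys.append(C @ h)
    return np.array(ys), h

rng = np.random.default_rng(0)
N = 16                             # state dimension -- stays 16 at any length
A_bar = 0.9 * np.eye(N)            # toy stable transition (illustrative)
B_bar = rng.normal(size=N)
C = rng.normal(size=N)

_, h_short = ssm_scan(A_bar, B_bar, C, rng.normal(size=100))
_, h_long = ssm_scan(A_bar, B_bar, C, rng.normal(size=10_000))
assert h_short.shape == h_long.shape == (N,)   # same state size either way
```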
The same SSM can be computed three different ways depending on what you need. This is the architectural sleight-of-hand that makes it practical:
Recurrent mode: Process one token at a time, updating state h. O(1) per step, O(n) total. Fast autoregressive generation; constant memory regardless of sequence length.
Convolutional mode: Unroll the recurrence into a convolution kernel and parallelize over the entire sequence. Fast on GPU. The kernel is derived analytically from A, B, C.
Continuous mode: The underlying math is a continuous-time ODE. Discretization (zero-order hold or bilinear) converts it to the recurrent form with timestep Δ.
Train as convolution (fast, parallelizable), deploy as RNN (fast, constant memory). You get the best of both worlds.
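The equivalence of the two views can be checked numerically. A sketch, assuming a toy discretized system (diagonal Ā for simplicity): the recurrent output matches a causal convolution with kernel K_k = C·Ā^k·B̄.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 8, 32
A_bar = np.diag(rng.uniform(0.5, 0.95, size=N))  # toy discretized transition
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

# Recurrent view: one state update per token
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = C @ h

# Convolutional view: precompute the kernel K_k = C · A_bar^k · B_bar
K = np.array([C @ np.linalg.matrix_power(A_bar, k) @ B_bar for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

assert np.allclose(y_rec, y_conv)   # same outputs, two compute paths
```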
The central problem with vanilla RNNs is that they can't reliably store information from many steps ago — gradients vanish and state gets corrupted. LSTMs hack around this with gates. SSMs solve it differently: initialize A to a specific structured matrix with provably good memory properties.
The HiPPO (High-order Polynomial Projection Operators) framework derives this matrix mathematically. The idea: at each timestep, the optimal state is the one that best reconstructs the entire history of the input as a Legendre polynomial expansion. The A matrix that achieves this has a closed-form solution.
A_nk = √((2n+1)(2k+1))   if n > k
     = n + 1             if n = k
     = 0                 if n < k

HiPPO-LegS matrix ("memorize" measure = Legendre); the state dynamics use its negation, h'(t) = −A·h(t) + B·x(t). Each row n holds the coefficients for the nth Legendre polynomial.
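The closed form translates directly to code. A sketch of the 0-indexed construction (following the convention that the SSM uses −A):

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS matrix A (0-indexed):
    A[n,k] = sqrt((2n+1)(2k+1)) if n > k, n+1 if n == k, 0 if n < k.
    The SSM dynamics use the negation: h' = -A·h + B·x."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = n + 1
    return A

A = hippo_legs(4)
assert np.allclose(np.triu(A, k=1), 0)   # zero above the diagonal
assert A[0, 0] == 1 and A[1, 1] == 2     # diagonal entries are n + 1
```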
The result: the hidden state continuously encodes a compressed polynomial approximation of the input history. Longer history = higher-degree polynomial = more coefficients needed. But crucially, recent events are always represented accurately while older events gracefully degrade — exactly like human working memory.
Neural networks process discrete tokens, not continuous signals. The continuous SSM (h' = Ah + Bx) must be converted to a discrete recurrence. This is done via the Zero-Order Hold (ZOH) method with a learned timestep parameter Δ:
Ā = exp(Δ·A)
B̄ = (Δ·A)⁻¹(exp(Δ·A) − I)·Δ·B
Ā, B̄ = discretized matrices used in recurrence
Δ = timestep — how much time each token represents
The discrete recurrence is then just:
h_t = Ā·h_{t-1} + B̄·x_t
y_t = C·h_t
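A sketch of ZOH discretization for the diagonal-A case common in S4D/Mamba-style models, where the matrix exponential reduces to an elementwise exp (toy values; diagonal A is an assumption here):

```python
import numpy as np

def zoh_discretize_diag(a, b, delta):
    """Zero-order hold for diagonal A with entries `a` (per-dimension):
    A_bar = exp(Δ·a),  B_bar = (exp(Δ·a) - 1) / a · b."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

a = np.array([-1.0, -0.5])   # toy (stable) diagonal entries
b = np.array([1.0, 2.0])
a_bar, b_bar = zoh_discretize_diag(a, b, delta=0.01)

# For small Δ this approaches the Euler step: A_bar ≈ 1 + Δ·a, B_bar ≈ Δ·b
assert np.allclose(a_bar, 1 + 0.01 * a, atol=1e-4)
assert np.allclose(b_bar, 0.01 * b, atol=1e-4)
```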
The timestep Δ is a learned parameter. Large Δ = the model "zooms out," each token represents a long time span, old state decays faster. Small Δ = zoom in, state is preserved more carefully. This is one knob Mamba later makes input-dependent.
The visualization below shows a simplified SSM processing a token sequence. Each token updates the hidden state (visualized as a heatmap of state dimensions). Watch how information from early tokens persists — and how the state compresses rather than grows.
S4 (Structured State Space Sequence model, 2021) was the first SSM to match transformers on long-range benchmarks. The key innovation: parameterize A as a diagonal-plus-low-rank (DPLR) matrix. This makes the computation tractable — instead of materializing powers of a dense matrix, the convolution kernel can be evaluated efficiently via the Woodbury identity and Cauchy kernel computations.
A = Λ − P·Q*   (Λ = diagonal, P/Q = low-rank correction)
Enables O(N log N) convolution via Cauchy kernel
S4 crushed the Long Range Arena benchmark — tasks requiring dependencies over thousands of tokens where transformers struggle. It proved that linear recurrence with the right initialization could be competitive with attention.
The limitation: A, B, C are fixed per channel — they don't change based on the input. The model processes every input the same way regardless of content. This makes it a great filter but a poor selector.
Gu and Dao's key observation: language is selective. When you read a sentence, some words matter a lot for understanding the next word, and others barely matter. A fixed filter can't know which is which — it has to treat all inputs the same.
The Mamba fix: make B, C, and Δ functions of the current input x_t. They're computed by small linear projections of the input at each step:
B_t = Linear_B(x_t)
C_t = Linear_C(x_t)
Δ_t = softplus(Linear_Δ(x_t))
Now the state transition changes per token
Model decides what to remember and what to forget
This breaks the convolutional view — you can no longer precompute a fixed kernel, because the kernel now changes with each input. Instead, Mamba uses a parallel scan (prefix-sum algorithm) that computes all steps in O(log n) parallel depth with O(n) total work. The tradeoff: slightly more compute than S4's pure convolution, but now competitive with transformers on tasks requiring content-based recall.
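A scalar sketch of why a scan still works when parameters vary per token: the per-step update defines an associative combine operator, which is exactly what a parallel prefix scan exploits. (The projections w_delta and w_b below are illustrative toys, not Mamba's actual parameterization.)

```python
import numpy as np

rng = np.random.default_rng(2)
L = 64
x = rng.normal(size=L)

# Input-dependent ("selective") scalar parameters -- toy projections
w_delta, w_b = rng.normal(), rng.normal()
delta = np.log1p(np.exp(w_delta * x))   # softplus: per-token timestep Δ_t
a_bar = np.exp(-delta)                  # per-token decay (continuous A = -1)
b_bar = (w_b * x) * x                   # input-dependent B_t applied to x_t

# Reference: direct sequential recurrence h_t = a_bar_t·h_{t-1} + b_bar_t
h, hs = 0.0, []
for t in range(L):
    h = a_bar[t] * h + b_bar[t]
    hs.append(h)

# Same result via an associative combine -- the property a parallel scan needs
def combine(e1, e2):
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

acc = (a_bar[0], b_bar[0])
scanned = [acc]
for t in range(1, L):
    acc = combine(acc, (a_bar[t], b_bar[t]))
    scanned.append(acc)

assert np.allclose([s[1] for s in scanned], hs)
```

Because `combine` is associative, a GPU can evaluate the prefix in a balanced tree of depth O(log n) instead of the sequential loop shown here.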
In Mamba, the "forgetting rate" of the hidden state depends on the current token. The visualization shows how Δ_t (timestep / forget gate) varies per token. High Δ = fast decay, the model is choosing to forget old state and focus on the current token. Low Δ = slow decay, the model preserves the accumulated context.
Compare: S4 would use a fixed Δ for all tokens (flat line). Mamba learns to spike Δ at "reset" boundaries — sentence ends, topic changes, paragraph breaks — and keep it low within coherent context windows. This is the mechanism that lets it do in-context learning.
The selective scan naively requires materializing intermediate states — expensive in high-bandwidth memory (HBM). Mamba's implementation uses a kernel fusion trick analogous to Flash Attention: keep intermediate states in fast SRAM, never write them to HBM.
The implication for inference is significant: a Mamba model generating a 100k-token sequence uses the same memory as generating a 100-token sequence. A transformer's KV cache grows linearly — at 100k tokens it becomes the dominant memory cost.
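Back-of-envelope arithmetic for a hypothetical 7B-class transformer (layer and head counts below are illustrative, not any specific model):

```python
# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, fp16 cache
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

def kv_cache_bytes(seq_len):
    # 2 tensors (K and V) per layer, per head, per cached position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16

print(kv_cache_bytes(100) / 2**20)      # ~12.5 MiB at 100 tokens
print(kv_cache_bytes(100_000) / 2**30)  # ~12.2 GiB at 100k tokens
```

The cache grows by ~128 KiB per token under these assumptions, while a Mamba layer's state stays constant.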
Mamba-2 (2024) revealed something unexpected: attention and SSMs are mathematically dual. Under certain constraints, a selective SSM computes exactly the same thing as a linear attention mechanism. The state space is the "dual" representation of the key-value cache.
Attn: o_t = Σ_{s≤t} (q_t·k_s)·v_s
SSM:  o_t = Σ_{s≤t} C_t·(Ā_t⋯Ā_{s+1})·B̄_s·x_s

State Space Duality: both compute the same function when Ā_t is structured appropriately.
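The duality is easy to verify numerically in the no-decay special case Ā_t = I, where the selective SSM reduces exactly to causal linear attention with q_t = C_t, k_s = B_s, v_s = x_s:

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 16, 4
C = rng.normal(size=(L, N))   # per-token C_t (plays the role of q_t)
B = rng.normal(size=(L, N))   # per-token B_s (plays the role of k_s)
x = rng.normal(size=L)        # inputs        (play the role of v_s)

# Selective SSM with A_bar = I (no decay): h_t = h_{t-1} + B_t·x_t, o_t = C_t·h_t
h = np.zeros(N)
o_ssm = np.empty(L)
for t in range(L):
    h = h + B[t] * x[t]
    o_ssm[t] = C[t] @ h

# Causal linear attention: o_t = Σ_{s≤t} (q_t·k_s)·v_s
o_attn = np.array([sum((C[t] @ B[s]) * x[s] for s in range(t + 1))
                   for t in range(L)])

assert np.allclose(o_ssm, o_attn)   # identical outputs
```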
This led to the SSD (Structured State Space Duality) layer — a unified formulation that can be computed either as an SSM recurrence or as a structured matrix multiplication, whichever is faster on the given hardware.
Mamba-2 is also easier to make work with tensor parallelism (splitting across multiple GPUs) — a practical advantage for training at scale. The architecture is now being integrated into hybrid models (Jamba, Zamba) that interleave attention and SSM layers.
The lossy state is the fundamental weakness. Transformers can do exact key-value lookup (retrieve a specific token from position 3). Mamba compresses history — if that token got overwritten by later inputs, it's gone. For tasks requiring precise retrieval, transformers still win. For tasks requiring long coherent context, Mamba scales better.