Mamba-3 just dropped yesterday. It’s a big milestone towards breaking the stranglehold that transformers have on the modern AI industry.
Mamba-3 is a state space model, and it’s fascinating because it uses an entirely different architecture from transformers (the tech that the big LLMs like Opus 4.6, GPT 5.4, Gemini 3, etc. are based on).
Transformers keep a growing memory structure called the KV cache: it stores the key and value vectors for every token seen so far, so that when computing the next token the model can look back over everything previously said in the conversation. It needs this because that ability to attend over the full history is core to how it reasons well over large volumes of input (this mechanism is called self-attention).
The downside of a transformer is that as you increase the number of inputs (the prefill phase, where it’s reading your prompt) and outputs (the decoding phase, where it’s generating text), the KV cache grows with every new token. The cache’s memory grows linearly with context length, and because each new token attends to all previous ones, total compute grows quadratically, so large inputs slow these models down dramatically. Of course the big labs have figured out clever ways to improve performance here, but the math of the base transformer still slows down as context grows.
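To get a feel for how the KV cache grows, here’s a back-of-the-envelope sketch. The dimensions are made up for illustration and don’t correspond to any particular production model:

```python
# Back-of-the-envelope KV-cache sizing, with hypothetical model dimensions.
n_layers, n_heads, d_head = 32, 32, 128
bytes_per_elem = 2  # fp16

def kv_cache_bytes(n_tokens: int) -> int:
    # 2x for keys + values, stored per layer, per head, per token.
    return 2 * n_layers * n_heads * d_head * n_tokens * bytes_per_elem

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 1e9:.1f} GB")
```

The key point is the linear growth: every token you read or generate permanently adds another slab of keys and values to keep around.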
Modern state space models (like Mamba) use a very different approach: they keep a single fixed-size hidden state $h$ that adjusts over time: $h_t = A_t \, h_{t-1} + B_t \, x_t$ (where $A_t$ and $B_t$ are data-dependent matrices generated on the fly based on the current input vector $x_t$). This allows the model to selectively choose what to remember and what to forget.
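In code, that recurrence looks something like this minimal sketch. The dimensions and the way $A_t$ and $B_t$ are generated from $x_t$ are toy assumptions for illustration, not Mamba’s actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 4, 3

# Hypothetical projections that make A_t depend on the current input x_t.
W_a = rng.normal(size=(d_state, d_in)) * 0.1
W_b = rng.normal(size=(d_state, d_in)) * 0.1

def step(h, x):
    # Data-dependent decay: a diagonal A_t with entries in (0, 1),
    # generated on the fly from x_t (a sigmoid gate per state dimension).
    A_t = np.diag(1.0 / (1.0 + np.exp(-(W_a @ x))))
    B_t = W_b
    return A_t @ h + B_t @ x  # h_t = A_t h_{t-1} + B_t x_t

h = np.zeros(d_state)
for x in rng.normal(size=(5, d_in)):  # five input tokens
    h = step(h, x)
print(h.shape)  # the state stays fixed-size no matter how long the sequence is
```

Because $A_t$ is computed from the input, the model can gate each state dimension toward “keep” or “forget” on a per-token basis.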
There are a few magical things about state space models:
They’re much more efficient over long context because computation grows linearly with sequence length (instead of quadratically). This is perfect for audio, because an audio file contains a huge amount of data, far more than text. This is one major reason why Cartesia is a leader in the audio space (their lab pioneered the modern state space models).
State space models can use linear algebra tricks to compute the prefill phase incredibly quickly. Notice that $h_1 = A_1 \, h_0 + B_1 \, x_1$ and $h_2 = A_2 \, h_1 + B_2 \, x_2$. This means that you can actually entirely skip the computation of the hidden state $h_1$ if you just use a bit of algebra:
$$h_2 = A_2 \, A_1 \, h_0 + A_2 \, B_1 \, x_1 + B_2 \, x_2$$

Previously, you would need to compute each hidden state and feed it into the next step, but with state space models you can skip the intermediate states and compute the final hidden state directly. Then, when you reach the decoding phase where you’re actually generating new tokens, the model switches back to computing hidden states one at a time.
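It’s easy to verify that the algebra above matches the step-by-step recurrence. This sketch uses random matrices purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
B1, B2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x1, x2 = rng.normal(size=d), rng.normal(size=d)
h0 = rng.normal(size=d)

# Sequential: compute h1, then feed it into the h2 update.
h1 = A1 @ h0 + B1 @ x1
h2_seq = A2 @ h1 + B2 @ x2

# Direct: h2 = A2 A1 h0 + A2 B1 x1 + B2 x2, never materializing h1.
h2_direct = A2 @ A1 @ h0 + A2 @ (B1 @ x1) + B2 @ x2

print(np.allclose(h2_seq, h2_direct))  # the two paths agree
```

Because this composition is associative, a GPU can combine long runs of steps in parallel (a parallel scan) instead of walking through them one by one — which is what makes prefill so fast.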
Mamba-3 in particular does some really interesting stuff to make inference more efficient. I think the team has correctly recognized that there’s a big shift happening in the world of AI: as coding models and LLMs more generally start to run larger and larger workloads, inference has started to become a bigger percentage of GPU usage. It used to be that labs would spend the majority of their GPU fleet on research and training, but now that AI is out in the wild and being used quite extensively, inference is much more important.
Mamba-3 has a few optimizations for this:
Multi-input, multi-output. Previous generations of Mamba models would calculate output tokens one at a time, similar to what most transformer-based architectures do. But the researchers noticed that GPUs are mostly bottlenecked on moving data from VRAM to the compute cores, not on the arithmetic itself. So they restructured the math to group multiple state updates together into one big matrix multiplication, letting the GPU do more useful math per byte of memory it moves.
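Here’s a sketch of the underlying idea, using a simplified per-step scalar decay rather than the paper’s actual MIMO formulation: weight each input by the product of the decays that come after it, then collapse the whole chunk into one matrix product.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 8, 4
a = rng.uniform(0.8, 1.0, size=T)   # per-step scalar decays a_1..a_T (toy stand-in for A_t)
u = rng.normal(size=(T, d))         # per-step inputs, i.e. B_t x_t
h0 = rng.normal(size=d)

# Sequential reference: h_t = a_t * h_{t-1} + u_t, one step at a time.
h = h0.copy()
for t in range(T):
    h = a[t] * h + u[t]

# Chunked: suffix[t] = product of all decays after step t; then the whole
# chunk reduces to a single matrix-vector product.
suffix = np.append(np.cumprod(a[::-1])[::-1][1:], 1.0)
h_chunk = np.prod(a) * h0 + suffix @ u

print(np.allclose(h, h_chunk))  # same state, one matmul instead of T steps
```

The matmul form does the same arithmetic, but as one dense operation the GPU’s tensor cores can chew through while the next chunk of data is still in flight.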
Complex numbers for memory. If you apply a real number multiple times, the result can only shrink or grow. For example, if you multiply something by $0.9$ many times, it will tend to zero; if you multiply by $1.1$ many times, it will tend towards infinity. One problem with previous Mamba models was that with a memory made only of real numbers, you’ll either definitely forget something or definitely remember it, given sufficient time.
Mamba-3 adds complex numbers to its memory, which can rotate in space. For example if you multiply $1$ by $i$ multiple times, you get back to $1$ after 4 multiplications: $1 \cdot i = i$, $\; i \cdot i = -1$, $\; {-1} \cdot i = -i$, $\; {-i} \cdot i = 1$.
This means that Mamba-3 has the ability to track cycles, oscillatory patterns, etc.
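A two-line experiment makes the contrast concrete (plain Python complex arithmetic, nothing Mamba-specific):

```python
import cmath

# Multiplying by i four times rotates you all the way around the unit circle.
z = 1 + 0j
for _ in range(4):
    z *= 1j
print(z)  # back to 1: the state cycles instead of decaying

# Any unit-magnitude complex multiplier rotates without shrinking or exploding,
# unlike a real decay such as 0.9 (forgets) or 1.1 (blows up).
rot = cmath.exp(1j * cmath.pi / 2)  # same rotation as multiplying by i
print(abs(rot))  # magnitude 1: no forgetting, no blow-up
```

A state that rotates rather than decays can hold on to periodic structure indefinitely, which is exactly the cycle-tracking ability described above.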
It seems like the big labs are still mostly optimizing transformers, but hybrid models like AI21’s Jamba and Google’s Griffin already exist, and I bet that the next wave of models combining Mamba blocks and transformer blocks will be just around the corner.