Mamba-3 SSM Drops With Inference-First Design Beating Transformers at Decode
Together.ai has released Mamba-3, a state space model architecture designed from the ground up for inference workloads rather than training efficiency. The open-source release marks a philosophical shift in how linear architectures are built, arriving as agentic AI workflows have pushed inference demand to unprecedented levels.
At 16,384 sequence length, Mamba-3's SISO variant clocks prefill+decode at 140.61 seconds versus 149.02 seconds for Mamba-2 and a staggering 976.50 seconds for Llama-3.2-1B running on vLLM. That's nearly 7x faster than the Transformer baseline on the same H100 GPU hardware.
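The speedup figures follow directly from the reported wall-clock times:

```python
# Reported prefill+decode wall-clock times at 16,384 sequence length (seconds),
# all measured on the same H100 GPU
times = {
    "Mamba-3 SISO": 140.61,
    "Mamba-2": 149.02,
    "Llama-3.2-1B (vLLM)": 976.50,
}

baseline = times["Llama-3.2-1B (vLLM)"]
speedup_vs_transformer = {name: round(baseline / t, 2) for name, t in times.items()}
print(speedup_vs_transformer["Mamba-3 SISO"])  # 6.94 -- "nearly 7x"
```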
Why Inference Matters Now
The timing isn't accidental. While Mamba-2 bet big on training speed back in mid-2024—delivering 2-8x faster training than its predecessor—the landscape has shifted dramatically. Reinforcement learning with verifiable rewards for coding and math requires massive rollout generation. Tools like Codex, Claude Code, and OpenClaw have made inference the bottleneck, not pretraining.
Previous linear architectures simplified their underlying mechanisms to accelerate training, leaving the inference step "too simple" and memory-bound. GPUs weren't computing—they were mostly shuffling data around.
Three Core Improvements
Mamba-3 addresses this through changes rooted in classical control theory rather than trendy deep learning interpretations:
Exponential-trapezoidal discretization creates a more expressive recurrence. This eliminates the short causal convolution that plagued Mamba-1 and Mamba-2—a component that had become standard across linear models since H3 and RWKV-4 popularized it.
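The paper's exact formulation is on arXiv; a minimal numerical sketch of the idea, using a scalar SSM and illustrative function names, contrasts the two update rules:

```python
import numpy as np

def zoh_step(h, a, b, x_t, dt):
    """Exponential (ZOH-style) update in the spirit of Mamba-1/2:
    only the current input x_t enters the recurrence."""
    return np.exp(dt * a) * h + dt * b * x_t

def trapezoidal_step(h, a, b, x_prev, x_t, dt):
    """Trapezoidal-style update (illustrative): the input is averaged over
    both endpoints of the interval, so the recurrence itself already sees
    x_{t-1} -- absorbing the role of the short causal convolution."""
    decay = np.exp(dt * a)
    return decay * h + 0.5 * dt * b * (decay * x_prev + x_t)
```

The point of the sketch is structural: once the previous input appears inside the discretized recurrence, the separate one-step convolution that Mamba-1/2 bolted on becomes redundant.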
Complex-valued SSM systems expand state-tracking capabilities. The model can now handle synthetic tasks like parity and arithmetic reasoning that Mamba-2 couldn't reliably solve.
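Why complex states help with parity is easy to see in a toy example (illustrative code, not the paper's parameterization): a single unit-modulus complex eigenvalue can rotate the state by π per 1-bit, something a real-valued, decaying state cannot do without forgetting early inputs.

```python
import numpy as np

def parity_via_rotation(bits):
    # One complex state; each 1-bit rotates it by pi on the unit circle.
    # A real state with |a| < 1 would decay toward zero and lose the count.
    h = 1.0 + 0.0j
    for b in bits:
        h *= np.exp(1j * np.pi * b)  # rotate only when the bit is 1
    return int(h.real < 0)           # h near -1 <=> odd number of ones

parity_via_rotation([1, 0, 1, 1])  # -> 1 (three ones: odd parity)
```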
Multi-input, multi-output (MIMO) architecture runs multiple SSMs in parallel. The MIMO variant boosts downstream accuracy by over 1 percentage point at 1B scale compared to standard Mamba-3, with a crucial asymmetry: training takes longer, but decode latency stays flat.
That last point deserves emphasis. Training is compute-bound; inference is memory-bound. Adding FLOPs per timestep barely touches inference latency because idle GPU cores simply pick up the work.
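A rough sketch of that trade-off, with illustrative shapes (here n is a per-SSM state size and m a number of MIMO channels, both made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 4                      # state dim fixed; m channels (illustrative)
x = rng.normal(size=m)            # current-step input

# SISO: one input/output channel per SSM -> O(n) FLOPs per decode step
h = np.zeros(n)
a, b, c = 0.9, rng.normal(size=n), rng.normal(size=n)
h = a * h + b * x[0]
y_siso = c @ h                    # scalar output

# MIMO: matrix B/C mix m channels through the SAME n-dim state
H = np.zeros(n)
B = rng.normal(size=(n, m))
C = rng.normal(size=(m, n))
H = a * H + B @ x                 # O(n*m) FLOPs per step ...
y_mimo = C @ H                    # ... but state read/write is still O(n):
                                  # decode is memory-bound, so latency is flat
```

The FLOP count per step grows by a factor of m, but the bytes moved per step (the state H) do not, which is why the extra compute is nearly free at decode time.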
Benchmark Results
On downstream language modeling evaluations, Mamba-3 outperforms both Mamba-2 and Gated DeltaNet across pretrained model scales. The SISO variant matches Mamba-2's architecture shapes exactly while delivering better accuracy. MIMO pushes further ahead.
Retrieval tasks tell a more nuanced story. Pure linear models naturally underperform Transformers here—that fixed-size state can't match an ever-growing KV cache for exact recall. But Mamba-3 holds its own among sub-quadratic alternatives, and MIMO improves retrieval without increasing state size.
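The asymmetry on recall comes down to per-token memory. With hypothetical model shapes (the layer counts and dimensions below are illustrative, not taken from the paper), the contrast is stark:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Transformer: keys + values for every past token -> grows with seq_len
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

def ssm_state_bytes(layers, d_model, state_dim, dtype_bytes=2):
    # SSM: one fixed-size state per layer, independent of seq_len
    return layers * d_model * state_dim * dtype_bytes

# Hypothetical ~1B-scale shapes, fp16
kv_16k = kv_cache_bytes(layers=24, kv_heads=8, head_dim=64, seq_len=16_384)
state  = ssm_state_bytes(layers=24, d_model=2048, state_dim=128)
# kv_16k keeps growing as context lengthens; state does not
```

Doubling the context doubles the KV cache but leaves the SSM state untouched, which is exactly why the fixed state wins on decode bandwidth and loses on exact recall.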
The team predicts hybrid models combining linear layers with global self-attention will dominate language modeling going forward. Their experiments show this combination beats vanilla Transformers on retrieval while maintaining efficiency gains.
Open Source From Day One
Kernels are available at the mamba-ssm repository, built across Triton, TileLang, and CuTe DSL depending on the operation. The stack reflects pragmatic engineering: Triton for standard architecture development, TileLang for fine-grained memory control on MIMO prefill, and CuTe DSL for maximizing Hopper GPU performance during decode.
NVIDIA's recent Nemotron 3 Super release, which uses Mamba-2 layers in a hybrid configuration, suggests enterprise interest in SSM architectures is accelerating. Mamba-3's inference-first approach could drive adoption in production environments where token generation speed directly impacts costs and user experience.
The full paper is available on arXiv, with a second blog post covering the mathematical foundations of the three core improvements expected to follow.